PDFs are the format of choice in academia, but extracting the information they contain is annoyingly hard.
I’ve just started working on my degree’s final project. An academic project requires lots of research, which means reading lots of papers.
Papers are normally available in one form only, PDF.
While PDF is a format so ubiquitous nowadays that one can guarantee being able to display it as the writer(s) intended, its not a nice format, as I found out as soon as I needed to do something with it.
During the course of my research, I’ve been using PDF’s highlight annotations to highlight parts of a paper that’re particularly interesting.
I wanted to be able to retrieve the highlighted text at a later date so I didn’t have to open the paper again to find the parts I found interesting when I read it the first time.
You’d think that exporting annotations on text would be something that all PDF readers which support annotations (most of them do) would be capable of. I mean, surely its easy enough even if there arnt that many reasons why you’d want to do it.
Alas, none that I found running on Linux had this feature, so I delved into trying to write something to do what I needed.
I based my project on a tool I found in a StackOverflow answer to a question similar to mine.
The Python code in the answer utilises poppler-qt4 to export annotated text from a PDF. Unfortunately, the code is Python2 and the python poppler-qt4 package wouldn't install properly on my system anyway, even after installing the poppler-qt4 package.
Neither did Python’s poppler-qt5 bindings.
Convinced I could do a better job than a Python 2 script which depended on a package last updated in 2015, I translated the answer into the equivalent in C++.
I started with trying to use poppler-cpp, the C++ bindings for poppler where one has objects and namespaces, and none of the guff associated with GUI frameworks that I wouldn't need here. However, to my dismay, poppler-cpp doesn't support annotations at all. For whatever reason, annotation support only works with the bindings to a GUI framework, like glib or QT.
So instead I used poppler-glib (i.e glib from the GNOME project). Purely because I use GNOME, so wouldn't have to install anything extra.
Now, the PDF format is really odd. Annotations seem to be an after-thought to the format tacked on later.
Specifically highlighting is weird, because a highlight annotation has no connection to the document’s text.
As such, poppler’s poppler_annot_get_contents(PopplerAnnot *) which should return the annotation’s contents, returns nothing.
Instead, to get the text associated with a highlight annotation, one has to get the coordinates of the highlight annotation (A PopplerRectangle) and then utilise the function poppler_page_get_text_for_area(PopplerPage*, PopplerRectangle*) which returns the text in a defined area.
What an entirely baffling way to go about implementing highlighting. Attaching it as purely a visual element, rather than actually marking up the text.
Even more baffling is the fact that although my application works, it only mostly works.
Sometimes I get the full text highlighted, other times it chops off characters, and sometimes it adds things that’re nowhere near the highlighted text at all!
This is a problem I’m yet to solve, and I might never solve, because its ridiculous and the tool mostly does what I needed anyway.
In conclusion; The PDF format is weird, I wrote a thing.
If you use it, let me know how it goes!