Parsing thousands of PDFs with JavaScript

Back when I was working at Tages-Anzeiger, I was asked to find a way to condense the content of several hundred PDF files into one spreadsheet. These PDFs contained indicator variables about the performance of nursing and retirement homes, and for some strange reason, they were only available as individual PDFs. Here is an example.

While there are many good GUI-based tools to extract tables and the like from individual PDFs, such as Tabula and ScraperWiki (Tabula also has a command line interface and Ruby library called tabula-extractor), I had a hard time finding a generic tool or library that excels at processing large batches of similar or identical PDFs.

So I set out to program my own little solution, using good ol’ JavaScript. I know that most people would probably use Python for such a task, but I am in the process of learning the ins and outs of Node.js, which is, in my opinion, becoming an all-purpose language and a really good fit for such a task.

I was not really sure how to begin, so I posted this question on StackOverflow, where someone helpfully pointed out the Unix program less. To read the contents of a PDF on the CLI, you simply type less yourpdf.pdf. In my case, this worked well: each table row of the PDF came out on its own line of plain text, correctly recognized by less.

Spawning child processes, in other words, firing up other programs from your code, is as easy as pie with Node, which made it even more suitable for the task. The complete code, together with some example PDFs, is available on GitHub, so you can try it out yourself while I walk you through the most important steps in the following paragraphs.
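As a taste of how little code this takes, here is a minimal sketch of the core pattern, assuming a file named example.pdf and a system where less is configured (through its lesspipe input preprocessor) to render PDFs as text:

```js
var spawn = require('child_process').spawn;

// Run `less` on a single PDF and collect its plain-text output.
// When stdout is not a terminal, less simply writes the text through.
var child = spawn('less', ['example.pdf']);
var pdfData = '';

child.stdout.on('data', function (chunk) {
  pdfData += chunk; // buffer the text as it streams in
});

child.stdout.on('end', function () {
  console.log(pdfData); // the PDF rendered as plain text
});
```

The real script wraps this pattern in a function and runs it over hundreds of files, as shown in the sketches below.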

Besides built-in modules such as child_process and fs, my script makes use of the ingenious async library and json2csv. The former is especially helpful for making the most of JavaScript’s asynchronous nature, which, I find, is a perfect fit for spawning hundreds of less instances at the same (pseudo-)time and waiting for them to finish before further processing the data.

I removed the verbose comments from the following snippets, but you can find a full explanation on GitHub. The init function holds the main logic of the program: it first reads the directory of PDFs, filtering out those files which are not actually PDFs with async’s filter function, and then, with the help of eachLimit, the files are processed in parallel. I had to limit the concurrent instances of less (which are spawned in the process function) to 100, or I would get an Error: spawn EMFILE, which means too many files, or in this case child processes, are open at once.
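Condensed to its essentials, init looks roughly like this; pdfDir is a hypothetical path, and a plain extension check stands in for async’s filter to keep the sketch short:

```js
var fs = require('fs');
var path = require('path');
var async = require('async');

var pdfDir = './pdfs'; // hypothetical directory of source PDFs

function init() {
  fs.readdir(pdfDir, function (err, files) {
    if (err) throw err;
    // keep only the actual PDFs (the real script uses async.filter)
    var pdfs = files.filter(function (file) {
      return path.extname(file) === '.pdf';
    });
    // run `process` on at most 100 files at a time to avoid EMFILE;
    // onProcessComplete fires once every file has been handled
    async.eachLimit(pdfs, 100, process, onProcessComplete);
  });
}
```

(A module-level function named process shadows Node’s global process object; it is harmless here, and the sketch keeps the original script’s name.)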

The interesting parts are in the process function. Spawning a less process returns a stream which can be listened to for data and end events. Once all the data is buffered into pdfData, it is “parsed” with the eponymous function, and a callback is called, basically telling async.eachLimit that another less process can be spawned. When all those process functions have completed, i.e., their callbacks have fired with err === null, onProcessComplete will be fired (explained below).
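Continuing the sketch from above (spawn, path, and pdfDir as before), the process function might look like this; parse follows in the next step:

```js
function process(file, callback) {
  var child = spawn('less', [path.join(pdfDir, file)]);
  var pdfData = '';

  child.stdout.on('data', function (chunk) {
    pdfData += chunk; // buffer the plain-text output of less
  });

  child.stdout.on('end', function () {
    parse(pdfData);  // extract the table rows into jsonData
    callback(null);  // signal eachLimit that this file is done
  });
}
```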

A quick look at the parse function: here, the extracted data is iterated line by line, and each line is matched against a range of regular expressions of varying complexity, which won’t be explained in detail here. Finding the correct regular expressions was the most time-consuming part of writing the script, but RegExr was a great help. Once a line is matched and parsed, its contents are written to a JSON object, which is pushed to an array. While less worked well for most of the characters in the PDF, there was a nuisance with the letter m which had to be dealt with separately (the fixMProblem function). I think one will almost certainly run into these kinds of problems when parsing PDFs, so this time I was rather lucky that only one character gave me a headache.
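The real regular expressions are specific to the layout of the nursing-home PDFs, so the sketch below uses a single made-up pattern and invented field names (name, beds, score) purely to show the shape of parse; fixMProblem is reduced to a hypothetical stub:

```js
var jsonData = []; // collects one object per parsed table row

// Illustrative only: the real script matches each line against
// several far more involved regular expressions.
var ROW_PATTERN = /^(\S+)\s+(\d+)\s+([\d.]+)$/;

// Hypothetical stand-in for the script's fixMProblem helper,
// which repairs the mangled "m" character in the output of less.
function fixMProblem(str) {
  return str;
}

function parse(pdfData) {
  pdfData.split('\n').forEach(function (line) {
    var match = ROW_PATTERN.exec(line);
    if (match) {
      jsonData.push({
        name:  fixMProblem(match[1]),
        beds:  Number(match[2]),
        score: Number(match[3])
      });
    }
  });
}
```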

Once all files have been parsed, the onProcessComplete function fires, and the data stored in jsonData is converted to CSV text and then written to output.csv.
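Continuing the sketch (fs is required above), and assuming a json2csv version that exports a synchronous parse function (the exact API varies between versions of the library):

```js
var json2csv = require('json2csv');

function onProcessComplete(err) {
  if (err) throw err;
  // convert the collected row objects to CSV text ...
  var csv = json2csv.parse(jsonData);
  // ... and write the result to disk
  fs.writeFile('output.csv', csv, function (writeErr) {
    if (writeErr) throw writeErr;
    console.log('Wrote ' + jsonData.length + ' rows to output.csv');
  });
}
```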

To summarize: in my opinion, JavaScript’s asynchronous nature is a really good fit for such a task, and working with callbacks is actually fun when you can make use of libraries such as async. Parsing the roughly 1,550 PDFs took approximately a minute, which is quite OK from a performance point of view, I reckon. I don’t know whether this would be faster or slower with another scripting language or another approach, but what does it matter, anyway? It certainly scales to several thousand source files or even more. Finding regular expressions for different types of rows or columns is tedious, but there’s probably no way around that. Of course, getting machine-readable data in the first place would be desirable, but, yeah… Dream on!

The data parsed from the PDFs found its way into an interactive data journalism piece, published some months later by SonntagsZeitung and Tages-Anzeiger.
