
Parsing thousands of PDFs with Javascript

Back when I was working at Tages-Anzeiger, I was asked to find a way to condense the content of several hundred PDF files into one spreadsheet. These PDFs contained indicator variables about the performance of nursing and retirement homes, and for some strange reason, they were only available as individual PDFs. Here is an example.

While there are many good GUI-based tools to extract tables and the like from individual PDFs, such as Tabula and ScraperWiki (Tabula also has a command line interface and Ruby library called tabula-extractor), I had a hard time finding a generic tool or library that excels in processing large batches of similar or identical PDFs.

So I set out to program my own little solution, using good ol’ Javascript. I know that most people would probably use Python for such a task, but I am in the process of learning the ins and outs of Node.js, which, in my opinion, is becoming an all-purpose language and is a really good fit for such a task.

I was not really sure how to begin, so I posted this question on StackOverflow, and luckily a helpful user pointed me to the Unix program `less`. To read the contents of a PDF on the CLI, you simply have to type `less yourpdf.pdf`. In my case, this worked well, and the output looked something like below (note that the lines, i.e. the table rows, were correctly recognized by `less`):

ALTERS- & PFLEGEHEIM IM BRÜHL                                                                                       Kanton: AG

UNTERE DORFSTR. 10                                                                                          Rechtsform: Verein
8957 SPREITENBACH                                                                          Pflegeleistung: RAI-RUG KLV (Stufen)


Kennzahl / Jahr                                                                           2012       M ittelw ert
                                                                                                     Kanton           Schw eiz
1.        Aufenthalte und Klienten
1.01      Anzahl Plätze Langzeitaufenthalt                                                   82           62.2            58.7
1.02      Anzahl Plätze Kurzzeitaufenthalt                                                    2            1.3             0.9
1.03      Anzahl Plätze Akut- und Übergangspflege                                              -           0.2             0.2
1.04      Anzahl Tage Langzeitaufenthalt                                                 26'685       21'809.7        20'492.4

Spawning child processes, in other words, firing up other programs from your code, is as easy as pie with Node, which made it even more suitable for the task. The whole code, together with some example PDFs, is available on Github, so you can try it out yourself while I walk you through the most important steps in the following paragraphs.
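
Just to illustrate how little code this takes, here is a minimal, stand-alone sketch (not part of the actual script, and `example.pdf` is a made-up file name) that spawns `less` for a single PDF and pipes whatever it prints straight back to the terminal:

var spawn = require('child_process').spawn;

// spawn `less` for one (hypothetical) PDF and forward its output to the terminal
var less = spawn('less', ['example.pdf']);
less.stdout.pipe(process.stdout);
less.on('error', function(err) {
    console.error('could not spawn less:', err);
});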

Besides Node’s built-in modules `child_process` and `fs`, my script makes use of the ingenious `async` library and `json2csv`. The former is especially helpful for making the most of Javascript’s asynchronous nature, which, I find, is a perfect fit for spawning hundreds of `less` instances at the same (pseudo-)time and waiting for them to finish before further processing the data.
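
For reference, the snippets below assume a small setup block roughly like the following (the exact version is on Github); `jsonData` is the array that collects one object per parsed PDF:

var fs = require('fs');
var spawn = require('child_process').spawn;
var async = require('async');       // npm install async
var json2csv = require('json2csv'); // npm install json2csv

// collects one object per successfully parsed PDF
var jsonData = [];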

I removed the verbose comments from the following snippets, but you can find a full explanation on Github. The `init` function contains the main logic of the program: it first reads the directory of PDFs, filters out the files which are not actually PDFs with `async`’s `filter` function, and then processes the remaining files in parallel with the help of `eachLimit`. I had to limit the number of concurrent `less` instances (spawned in the `process` function) to 100, or I would get an `Error: spawn EMFILE`, which means that too many files, or rather file descriptors, are open at the same time.

var init = function() {
    fs.readdir('pdfs', function(err, files) {
        if (err)
            throw err;
        async.filter(files, isPdf, function(pdfFiles) {
            async.eachLimit(pdfFiles, 100, process, onProcessComplete);
        });
    });
};
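
The `isPdf` truth test is not shown here; a minimal version, matching the callback style of the `async` version used in the snippet above, could simply check the file extension (the version in the repository may differ):

// truth test for async.filter: keep only files ending in ".pdf"
var isPdf = function(file, callback) {
    callback(/\.pdf$/i.test(file));
};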

The interesting parts are in the `process` function displayed below. Spawning a `less` process in line 2 gives us a child process whose `stdout` is a Stream that can be listened to for `data` and `end` events. Once all the data is buffered into `pdfData`, it is “parsed” with the eponymous function in line 12 and a callback is called, basically telling `async.eachLimit` that another `less` process can be spawned. When all those `process` functions have completed, i.e. their callbacks have fired with `err === null`, `onProcessComplete` is fired (explained below).

var process = function(file, callback) {
    var less = spawn('less', [file], {
        cwd: 'pdfs'
    });
    less.stdout.setEncoding('utf8');

    var pdfData = '';
    less.stdout.on('data', function(data) {
        pdfData += data;
    });
    less.stdout.on('end', function() {
        var err = parse(pdfData, file);
        if (err === null) console.log(file + ' parsed.');
        callback(err);
    });
};

A quick look into the `parse` function: here, the extracted `data` is iterated over line by line, and each line is matched against a range of regular expressions of varying complexity, which I won’t explain in detail here. Finding the correct regular expressions was the most time-consuming part of writing the script, but RegExr was a great help. Once a line is matched and parsed, its contents are written to a JSON object, which is pushed to an array in the second-to-last line. While `less` worked well for most of the characters in the PDFs, there was a nuisance with the `m` which had to be dealt with separately (the `fixMProblem` function). I think one will almost certainly run into these kinds of problems when parsing PDFs, so this time I was rather lucky that only one character gave me a headache.

var parse = function(data, file) {
    var jsonObject = {};
    jsonObject.filename = file;
    // split the data into lines at newline characters
    var lines = data.split(/(\r?\n)/g);

    // `less` inserts a spurious blank after some "m" characters (as in "M ittelw ert" above);
    // the string is reversed so that the lookaheads can check what comes *before* the "m"
    // and leave legitimate spaces after words ending in "-heim" or "-trum" alone
    function fixMProblem(string) {
        var reversedString = string.split("").reverse().join("");
        reversedString = reversedString.replace(/\sm(?!ieh)(?!urt)/ig, 'm');
        return reversedString.split("").reverse().join("");
    }
    
    for (var i = 0; i < lines.length; i++) {
        var line = lines[i];
        var result;
        // line with "Kanton: KT" also contains name of the institution
        var re = /(.*)(Kanton: ([A-Z]{2}))/;
        if (result = re.exec(line)) {
            jsonObject.kanton = result[3].trim();
        }
        // line with "Rechtsform: " also contains street
        var re = /(.*)(Rechtsform: (.*))/;
        if (result = re.exec(line)) {
            jsonObject.strasse = fixMProblem(result[1].trim()).toProperCase();
            // special case: "Verw altung" needs to be fixed into "Verwaltung"
            jsonObject.rechtsform = result[3].trim().replace(/Verw altung/, 'Verwaltung');
        }
        // ...
    }
    jsonData.push(jsonObject);
    return null;
};
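
One caveat: `toProperCase` is not a built-in string method; the actual helper is defined elsewhere in the repository. A minimal stand-in, assuming it simply capitalises the first letter of each word, could look like this:

// naive title-casing for the street names,
// e.g. 'UNTERE DORFSTR. 10' -> 'Untere Dorfstr. 10'
String.prototype.toProperCase = function() {
    return this.toLowerCase().replace(/(^|\s)\S/g, function(match) {
        return match.toUpperCase();
    });
};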

Once all files have been parsed, the `onProcessComplete` function fires, and the data collected in `jsonData` is converted to CSV text and written to `output.csv`.

var onProcessComplete = function(err) {
    if (err) throw err;
    json2csv({
        data: jsonData,
        fields: Object.keys(jsonData[0])
    }, function(err, csv) {
        if (err) console.log(err);
        fs.writeFile('output.csv', csv, function(err) {
            if (err) throw err;
            console.log('Saved CSV.');
        });
    });
};
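
To make the end result concrete: for the example institution shown at the top, the fields extracted by the snippets above would end up as one row of `output.csv`, roughly like this (the file name is made up, and the full script of course adds one column per indicator):

"filename","kanton","strasse","rechtsform"
"beispiel-heim.pdf","AG","Untere Dorfstr. 10","Verein"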

To summarize: in my opinion, Javascript’s asynchronous nature is a really good fit for such a task, and working with callbacks is actually fun when you can make use of libraries such as `async`. Parsing the roughly 1550 PDFs took approximately a minute, which is quite OK from a performance point of view, I reckon. I don’t know whether this would be faster or slower with another scripting language or another approach, but what does it matter, anyway? It certainly scales to several thousand source files or even more. Finding regular expressions for the different types of rows and columns is tedious, but there’s probably no way around that. Of course, getting machine-readable data in the first place would be desirable, but, yeah… Dream on!

The data parsed from the PDFs found its way into an interactive data journalism piece, published some months later by SonntagsZeitung and Tages-Anzeiger.
