One-liner to extract OCR layer from PDF in Linux

PDF file png sticker, digital

I’m making this post because I always forget how I do certain things in my workflow, so besides keeping a local copy of the command, I’m posting it here for other people’s reference.

The situation

Say you have a PDF that contains both an image layer and a text layer (say, a scanned and OCR’d PDF). Many PDFs that one can download from the Internet Archive (the public domain ones at least) are not provided in black and white PDFs anymore. This may be a drawback for those who have, say, “paper-like” tablets with e-ink screens. Those screens render binary (black and white) pages a lot faster and cleaner than grayscale or color, because the software in the tablets essentially have to post-process the grayscale or color images to render them on the screen properly. In my case, I like to read at night on my Samsung Tablet using an e-reader app that inverts the page colors into “night mode”. If I take the average PDF from Internet Archive to do this and invert the colors, the page effectively looks like this:

Whereas if the page were rendered properly as a black and white image, it would look like this:

I think you can see which one is easier to read on a screen reader…

So, back to the point… You have an OCR’d PDF, probably compressed with some proprietary algorithm that magically takes hundreds of megabytes of data and compresses it down to less than 20 MB, making your processor do all of the work. Instead of all of that, why not just use a command-line tool like convert or mogrify (Imagemagick) to apply a threshold to the PDF images? Well, you can, but there are usually several steps involved.

Okay, how do I extract the B&W pages from a multilayered PDF, then? (Skip if you just want to get to the layering part)

I recommend taking the compressed and layered OCR’d PDF and putting it in a temporary working directory. Say, /tmp/layered_pdf. Next, extract the PDF’s layers as PNG images with pdfimages:

pdfimages -png foo.pdf bar

Where foo.pdf is the input PDF’s name and bar is the root of the output image name, which will be rendered as bar_001.png, etc. Next, we can use file to determine whether an image is 1-bit grayscale (B&W) or hi-res grayscale or whatnot. I’ve developed this one-liner to move all B&W images into a directory called ./bw.

mkdir bw && for layer in *.png; do file "$layer" | grep -q "1-bit grayscale" && mv "$layer" ./bw/ && echo "Moved $layer to ./bw/" || echo "$layer is not grayscale. No action taken."; done

Once all of the B&W images are in a separate directory, we can use mogrify to convert them back to PDF and invert their colors. Please note that if you do not specify the -format option, mogrify will overwrite the original file, so proceed slowly.

mogrify -format pdf -negate *.png

We now have each page with the colors as they should be (i.e. white BG and black text) and as a PDF ready to be combined. Again, pdftk comes to the rescue here:

pdftk *.pdf cat output BW_IMAGE_ONLY.pdf

This combines all of the B&W PDF pages we produced with the mogrify command into a single PDF called BW_IMAGE_ONLY.pdf which is now ready to be combined with the OCR layer, which is described in detail below.

The OCR Layering

So, we now have a combined PDF of the images, but the OCR layer (which is useful for searching and the like) has been lost! We have to first extract the invisible OCR text from the original foo.pdf. That’s where this one liner comes in:

gs -q -o TEXT_LAYER.pdf -dFILTERIMAGE -sDEVICE=pdfwrite -f foo.pdf

This is a ghostscript command that simply removes (filters out) all images in the original compressed OCR’d PDF (represented here by foo.pdf), leaving just the text and (if they exist) vector layers in a new PDF called TEXT_LAYER.pdf. Like many good things, this command was found on StackOverflow as part of a larger command that takes the invisible text layer and makes it visible. But, I don’t want that, so I didn’t put that code in.

You now have a PDF of the black and white images (with no OCR) and an OCR’d PDF with no images and no visible text. The only remaining step is to use pdftk to join the two, as shown here:

pdftk BW_IMAGE_ONLY.pdf multibackground TEXT_LAYER.pdf output FINAL.pdf

After invoking pdftk, you have to provide the PDF you created containing only the images and then after the multibackground option, provide the PDF containing only the text layer (with images stripped) and set the output file to FINAL.pdf. You now have your black and white PDF with the original OCR’d layer, without having to do any re-OCR or any garbage like that. Enjoy!

TODO: Make an all-in-one bash script of the process

Add a Comment