Cropping PDF files… don’t do it!

Every now and again you hear stories of organisations embarrassed by pdf files containing supposedly redacted material. You know the sort of thing: somebody put a solid black bar over somebody’s name in a confidential document and published it on the interweb, but some other clever person found the document and looked at the pdf code to see what was under the name.

The problem is that the pdf contains (i) code to draw the name on the page and then (ii) code to draw a thick black bar over the name. So obviously if you inspect the code, you can in theory see all the redacted material. Oops.
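
To make that concrete, here's a minimal sketch of what such "code" can look like. The content stream below is hand-written and uncompressed, and the name, font and coordinates are all made up for illustration. Real files usually compress their streams, so you'd have to inflate them first. The helper name shown_strings is my own, and it only handles simple strings with no escaped parentheses.

```python
import re

# A hand-written, uncompressed pdf content stream (illustrative only).
# Step (i): draw the name. Step (ii): fill a black rectangle over it.
CONTENT_STREAM = b"""
BT
/F1 12 Tf
72 700 Td
(Employee: Jane Example) Tj
ET
0 0 0 rg
68 695 160 16 re
f
"""

def shown_strings(stream):
    """Return the literal strings passed to the Tj (show text) operator.

    Simplification: ignores escaped parentheses, hex strings and TJ arrays.
    """
    found = re.findall(rb"\((.*?)\)\s*Tj", stream)
    return [s.decode("latin-1") for s in found]

# The black bar hides the name on screen, but not in the stream.
print(shown_strings(CONTENT_STREAM))
```

The rectangle is just another drawing command that comes after the text. Nothing removes the text itself.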

There's another interesting way this can happen. If you use a pdf tool to crop a pdf page, you could be leaking information.

Let’s say your payslip printing program prints three payslips per page. And you used to cut them up, or they’d print onto special perforated paper. Whatever.

Anyway, nowadays nobody wants paper payslips. Eugh. They want pdf. Easier to distribute, easier to manage, easier to store. So you print the payslips to pdf, then use a pdf cropping program to cut each page up into three.

If you do this, you’re probably going to be sorry some day. You see, pdf is a page description language. It laboriously emits all the commands for drawing a page, then it “emits” the page, then it starts the next page. The point is that you can fairly safely split a pdf document up at page boundaries … as far as I know, I could be wrong … but you can’t safely split a single page up into little bits.

All the cropping program typically does is keep the whole page, but then change the visible area (the crop box, or whatever) to some subset of the page. All the original page’s data is still there. If the recipients of the payslips only knew it, they could read the contents of two colleagues’ payslips.
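
Here's a rough sketch of that, using a plain Python dict as a stand-in for a pdf page. This is a toy model, not a real pdf library: the crop function, the names, and the page layout are all assumptions of mine. The /MediaBox and /CropBox keys are borrowed from the pdf format to show the idea that a naive crop only changes the box.

```python
import re

# Hypothetical page content: three payslips stacked on one A4-sized page.
PAGE_CONTENT = b"""
BT /F1 12 Tf 72 760 Td (Payslip for Alice) Tj ET
BT /F1 12 Tf 72 480 Td (Payslip for Bob) Tj ET
BT /F1 12 Tf 72 200 Td (Payslip for Carol) Tj ET
"""

def crop(page, box):
    """Mimic a naive crop: set /CropBox and leave the content untouched."""
    cropped = dict(page)
    cropped["/CropBox"] = box
    return cropped

page = {"/MediaBox": [0, 0, 595, 842], "/Contents": PAGE_CONTENT}

# "Cut out" the top third of the page, where Alice's payslip sits.
alice_only = crop(page, [0, 561, 595, 842])

# A viewer would show only Alice's payslip, but the stream still has all three.
leaked = re.findall(rb"\((.*?)\)\s*Tj", alice_only["/Contents"])
print(leaked)
```

If a tool works this way, the safe fix is to generate one payslip per page in the first place, or to use a tool that really removes the content outside the crop area.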

Isn’t pdf interesting? I love it, but like any technology, it has hidden gotchas.


