
Lessons Learned: That Time We Got Hit with 20GB PDFs (Yes, Really)

So picture this: it’s a regular workday, everything’s humming along, our systems are behaving (mostly), and we’re chugging through a backlog of scanned documents from one of our partner agencies. Then someone opens their email and lets out the kind of groan that signals something is very, very wrong.

“Hey… why is this PDF 19.7 gigabytes?”

Cue the panic.

The Root of the Problem: Communication, Assumptions, and DPI

As it turns out, somewhere along the chain of meetings, emails, and shoulder taps, nobody actually defined a maximum size for the scanned PDFs. The partner agency, bless their high-resolution hearts, had been scanning paper documents at ultra-high DPI—possibly high enough to detect the paper’s soul. And they did this for thousands of documents.

To be clear: these weren’t videos. These weren’t high-def image archives. These were PDFs. Some were bigger than Linux ISOs. We had a few breaking the 20GB barrier. One of our teammates tried opening one in Acrobat and their machine immediately considered early retirement.

And to be fair, the agency had done its due diligence by digitizing mountains of paper. What they produced wasn’t wrong, just... completely unworkable for actual end users.

Users Just Wanted to Read Stuff, Not Summon Cthulhu

Our users were trying to access these PDFs via web-based viewers, Acrobat Reader, or (gasp) Chrome’s built-in PDF renderer. Spoiler alert: none of them could handle a 20GB document. Systems would freeze, fans would spin like a jet engine, and RAM would vanish faster than your will to live in a Monday morning meeting.

So now we had a new problem: how to let users read these documents without setting their laptops on fire.

The Solution: Divide, Conquer, and Python

This led us to create a dedicated PDF processing service that we affectionately referred to internally as the "PDF Shredder" (it sounds cooler than “pdf-split-service”).

  • Library of choice: After testing several tools (and promptly rejecting ones that tried to load the entire PDF into memory), we landed on PyMuPDF (fitz). This library is fast, efficient, and didn’t panic when it saw a multi-gigabyte file. Unlike our team.

  • Process pipeline (a sketch of the splitting step follows this list):

    • A job picked up each giant PDF from Amazon S3.
    • We split the PDF into smaller chunks based on page count, estimated file size (e.g., keep it under 200MB), and the don’t-break-users’-computers principle.
    • Each chunk was uploaded back into S3 under a logical folder structure.
    • Users were notified via email and/or our internal notification service that their document was ready—in bite-sized, laptop-friendly pieces.
  • Progress monitoring UI: We built a dashboard that tracked all this so we could see which files were in queue, which had succeeded, and which had spontaneously decided to go rogue and throw exceptions.
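
For the curious, the heart of the Shredder boiled down to a few lines of PyMuPDF. The sketch below is a minimal illustration rather than our production code: the average-bytes-per-page heuristic, the 200MB budget, and the output path names are assumptions for the example, and the S3 download/upload and error handling are left out.

```python
import os
import fitz  # PyMuPDF

MAX_CHUNK_BYTES = 200 * 1024 * 1024  # keep each piece comfortably under ~200MB


def split_pdf(src_path: str, out_dir: str) -> list[str]:
    """Split one giant PDF into laptop-friendly chunks and return their paths."""
    os.makedirs(out_dir, exist_ok=True)
    src = fitz.open(src_path)

    # Crude size heuristic: assume pages are roughly uniform, so the byte
    # budget translates into a page budget. Good enough for scanned documents.
    avg_page_bytes = max(1, os.path.getsize(src_path) // max(1, src.page_count))
    pages_per_chunk = max(1, MAX_CHUNK_BYTES // avg_page_bytes)

    chunk_paths = []
    for start in range(0, src.page_count, pages_per_chunk):
        end = min(start + pages_per_chunk, src.page_count) - 1
        chunk = fitz.open()  # new, empty PDF
        chunk.insert_pdf(src, from_page=start, to_page=end)
        out_path = os.path.join(out_dir, f"pages_{start + 1:05d}-{end + 1:05d}.pdf")
        chunk.save(out_path, garbage=3, deflate=True)  # drop unused objects, compress streams
        chunk.close()
        chunk_paths.append(out_path)

    src.close()
    return chunk_paths
```

The real service wrapped this kind of logic in S3 downloads/uploads and status updates for the dashboard; sizing chunks by average bytes per page (instead of re-saving and measuring) keeps the splitter to a single pass over each file.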

DevOps to the Rescue: Enter Kubernetes

Processing one or two of these monsters? Sure. But we were facing thousands of them. Manually running scripts wasn’t going to cut it.

So we scaled the service using a Kubernetes deployment:

  • Multiple pods handled jobs concurrently.
  • Auto-scaling kicked in based on queue depth and system load.
  • Long-running jobs were tracked with timeouts and retry logic.
  • We included memory/CPU constraints to prevent nodes from being overwhelmed (because some PDFs tried very hard to do exactly that).

This architecture let us chew through terabytes of PDFs like a woodchipper running on caffeine and vengeance. Even so, the full job set took nearly a week of continuous processing.
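
If you’re wondering what each pod actually ran, it was conceptually a loop like the one below. This is a simplified sketch, not the production worker: the queue and S3 helpers (fetch_next_job, download_source, upload_chunks, mark_done, mark_failed) are hypothetical stand-ins for our internal plumbing, the memory/CPU caps mentioned above lived in the pod spec rather than in Python, and per-job timeout handling is omitted for brevity.

```python
import time
import logging

# split_pdf is the PyMuPDF helper sketched above. The remaining imports are
# hypothetical stand-ins for our internal job queue and S3 wrappers.
from pdf_shredder import (split_pdf, fetch_next_job, download_source,
                          upload_chunks, mark_done, mark_failed)

MAX_ATTEMPTS = 3
log = logging.getLogger("pdf-shredder-worker")


def run_worker() -> None:
    """One pod's worth of work. Kubernetes runs many replicas of this loop
    and scales them up or down based on how deep the queue is."""
    while True:
        job = fetch_next_job()                     # poll the work queue
        if job is None:
            time.sleep(30)                         # queue empty, back off
            continue

        for attempt in range(1, MAX_ATTEMPTS + 1):
            try:
                src_path = download_source(job)            # pull the giant PDF from S3
                chunks = split_pdf(src_path, job.out_dir)  # shred it
                upload_chunks(job, chunks)                 # push the pieces back to S3
                mark_done(job)                             # shows up green on the dashboard
                break
            except Exception as exc:
                log.warning("job %s attempt %d/%d failed: %s",
                            job.id, attempt, MAX_ATTEMPTS, exc)
                time.sleep(min(600, 30 * 2 ** attempt))    # simple exponential backoff
        else:
            mark_failed(job)                               # flag it for manual review


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    run_worker()
```

Keeping each replica this dumb is what makes horizontal scaling painless: the coordination lives in the queue and the dashboard, so adding pods just adds throughput.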

What We Learned (Besides Deep Existential Sadness)

  • Set file size expectations upfront. If you're working with a partner agency, make sure the document specs are clearly defined. Include things like: max size, preferred DPI, acceptable formats, and whether the documents should be large enough to double as a ransomware payload.

  • Don’t assume the tools will just work. Popular PDF viewers have hard limits—some around 2GB. Testing with a 5-page sample PDF won’t prepare you for a 7,000-page behemoth.

  • Modular pipelines + Kubernetes = peace of mind. Our design let us distribute the load across dozens of workers, retry failed jobs, and recover from errors automatically. This was critical given the volume and size of documents.

  • Use efficient libraries. PyMuPDF handled large files with far more grace than the alternatives. We avoided PyPDF2 and similar libraries after watching them choke on far more modestly sized files.

  • Humans still matter. Despite our automated infrastructure, some errors still required manual review—corrupt PDFs, missing metadata, etc. A UI to track and debug job status made our lives way easier.