Building PDF Tools with WebAssembly
When we started building OxygenPDF, the assumption was that serious PDF processing needs a server. Poppler, QPDF, Ghostscript — C/C++ libraries that have been doing this for decades. Browsers just display PDFs, right?
Turns out, the browser can do a lot more than people give it credit for.
The Browser PDF Stack
pdf-lib
pdf-lib is a pure JavaScript library for creating and modifying PDFs. It handles merging, splitting, adding text and images, modifying metadata, and embedding fonts with Unicode support. Everything runs in memory using ArrayBuffer and Uint8Array, which makes it a natural fit for the browser.
PDF.js
Mozilla's PDF.js is the same engine Firefox uses to display PDFs. We lean on it for rendering page previews, extracting text, parsing document structure, and handling encrypted files.
WebAssembly
OCR, image processing, and compression need more horsepower, so we bring in WASM modules. Tesseract.js compiles the Tesseract OCR engine for text recognition. PaddleOCR runs through ONNX Runtime Web as an alternative engine. Custom WASM modules handle image codec operations.
Architecture Decisions
Web Workers
PDF operations can peg the CPU. Running them on the main thread would freeze the entire UI, so everything goes through Web Workers.
The UI thread never blocks. Users can scroll, click around, or switch tabs while a 200-page PDF processes in the background.
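The worker protocol boils down to a job message in, a result message out. A sketch of that shape (the type names and ops here are illustrative, not our real protocol):

```typescript
// Illustrative job/result shapes for the worker message protocol.
type PdfJob = { id: number; op: string; pageCount: number };
type PdfResult = { id: number; ok: boolean; detail: string };

// Pure dispatcher: the worker calls this for every incoming job, so all
// the heavy lifting happens off the main thread.
function runJob(job: PdfJob): PdfResult {
  switch (job.op) {
    case 'merge':
      return { id: job.id, ok: true, detail: `merged ${job.pageCount} pages` };
    case 'split':
      return { id: job.id, ok: true, detail: `split into ${job.pageCount} files` };
    default:
      return { id: job.id, ok: false, detail: `unknown op: ${job.op}` };
  }
}

// Worker wiring; only meaningful inside an actual Web Worker context.
declare const self: any;
if (typeof self !== 'undefined' && typeof self.postMessage === 'function') {
  self.onmessage = (e: { data: PdfJob }) => self.postMessage(runJob(e.data));
}
```

One practical note: large buffers should be passed to `postMessage` as transferables, which moves ownership to the worker instead of structured-cloning megabytes of PDF data.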
Streaming for Large Files
Loading a 100+ page PDF into memory all at once is asking for trouble. We process pages one at a time, release memory after each page completes, and report progress back to the UI as we go.
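The loop itself is simple; the discipline is in never holding more than one page's buffers at a time. A generic sketch of the pattern (function names are ours for illustration):

```typescript
// Process pages one at a time, reporting progress after each page.
// Because each iteration awaits before the next begins, only one page's
// working buffers are live at any moment, and the previous page's
// allocations become garbage-collectable as soon as it completes.
async function processSequentially<T>(
  pageCount: number,
  processPage: (index: number) => Promise<T>,
  onProgress: (done: number, total: number) => void,
): Promise<T[]> {
  const results: T[] = [];
  for (let i = 0; i < pageCount; i++) {
    results.push(await processPage(i));
    onProgress(i + 1, pageCount);
  }
  return results;
}
```

The deliberate non-optimization here is the lack of `Promise.all`: parallelizing pages would be faster on paper but would keep every page's buffers alive simultaneously, which is exactly the failure mode we're avoiding.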
Progressive Enhancement
Not every browser supports every feature. Core operations (merge, split, reorder) are plain JavaScript. OCR and advanced compression use WebAssembly when available. For operations that genuinely exceed what a browser can do, we offer an optional cloud fallback.
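The tier selection starts with feature detection. One way to check for usable WebAssembly support (the tiers are from the text above; this detection code is a sketch, not our production implementation):

```typescript
// Detect WebAssembly support by validating the smallest legal module:
// the "\0asm" magic number followed by version 1.
function supportsWasm(): boolean {
  try {
    if (typeof WebAssembly !== 'object') return false;
    const minimalModule = new Uint8Array([
      0x00, 0x61, 0x73, 0x6d, // "\0asm"
      0x01, 0x00, 0x00, 0x00, // version 1
    ]);
    return WebAssembly.validate(minimalModule);
  } catch {
    return false;
  }
}

// Pick the highest tier the current browser can handle.
function pickTier(): 'core-js' | 'wasm' {
  return supportsWasm() ? 'wasm' : 'core-js';
}
```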
Problems We Ran Into
Memory
Browsers cap memory per tab at roughly 2-4GB. Large PDFs with lots of embedded images can bump against that ceiling. We process pages sequentially, explicitly null out ArrayBuffer references so the GC can reclaim them, and warn the user before we get close to the limit.
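The warning needs an estimate before any decoding happens, so it has to be a heuristic. A sketch of the kind of budget check we mean (every constant below is an illustrative assumption, not a measured number):

```typescript
// Stay well under the low end of the per-tab ceiling. (Assumed budget.)
const TAB_MEMORY_BUDGET = 1.5 * 1024 ** 3; // 1.5 GiB

// Rough heuristic: decoded pages cost far more than the raw file,
// because compressed streams and images expand in memory.
// Both factors here are illustrative guesses.
function estimatePdfMemory(fileBytes: number, pageCount: number): number {
  const DECODE_FACTOR = 3;                  // assumed expansion ratio
  const PER_PAGE_OVERHEAD = 2 * 1024 ** 2;  // assumed 2 MiB of page buffers
  return fileBytes * DECODE_FACTOR + pageCount * PER_PAGE_OVERHEAD;
}

// Warn once the estimate crosses 80% of the budget, leaving headroom
// for the allocations the estimate inevitably misses.
function shouldWarn(fileBytes: number, pageCount: number): boolean {
  return estimatePdfMemory(fileBytes, pageCount) > TAB_MEMORY_BUDGET * 0.8;
}
```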
Fonts
PDF font handling is genuinely awful. Embedded subsets, CID fonts, Type1, TrueType collections — the format has accumulated decades of font technology. PDF.js handles the rendering side. For embedding, we use pdf-lib's fontkit integration. When a font is too exotic to process, we fall back gracefully instead of crashing.
Encryption
A lot of PDF libraries offer an ignoreEncryption flag that sounds like it decrypts the file. It doesn't. It just skips the encryption marker and produces corrupted output. We use PDF.js for real decryption and re-export through pdf-lib to get a clean result.
Performance Numbers
Skipping the network round-trip matters more than the raw CPU numbers suggest:
| Operation | Time |
|---|---|
| Merge 10 PDFs | ~200ms |
| Split a 100-page PDF | ~500ms |
| Compress with image optimization | 2-5s (varies with image count) |
| OCR a single page | 3-8s (WASM Tesseract) |
CPU time per operation can be higher than a beefy server, but wall-clock time is often lower because there's no upload or download.
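A back-of-envelope model makes the trade-off concrete (the numbers in the usage note are illustrative assumptions, not measurements):

```typescript
// Total wall-clock time for a server round-trip: upload the file,
// process it, download the result. File size in MB, link speeds in Mbps.
function serverWallClockMs(
  fileMB: number,
  uploadMbps: number,
  downloadMbps: number,
  serverCpuMs: number,
): number {
  const uploadMs = (fileMB * 8 * 1000) / uploadMbps;
  const downloadMs = (fileMB * 8 * 1000) / downloadMbps;
  return uploadMs + serverCpuMs + downloadMs;
}
```

For a 10 MB file on an assumed 10 Mbps uplink and 50 Mbps downlink, even a server that merges in 100 ms spends 8 seconds on upload and 1.6 seconds on download, which is why the ~200 ms in-browser merge above wins on wall-clock time.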
What We're Watching
The browser platform keeps shipping useful primitives. WebGPU opens the door to GPU-accelerated image processing. The Origin Private File System gives us faster file I/O than the download-to-disk flow. SharedArrayBuffer, available once a page opts into cross-origin isolation, makes real multi-threaded processing possible.
The gap between what a server can do and what a browser can do gets smaller with every Chrome release. We're betting on that trend continuing.
Rohman

