Environment Canada, the federal department now operating as Environment and Climate Change Canada, needed to bring roughly 2,500 pages of legacy printed material online. Reports, scientific publications, and reference documents that had lived on shelves and in archives now had to live on ec.gc.ca, in proper HTML, formatted to the same web standard every other federal department’s site was held to. The work happened in 2002, before the Government of Canada’s web infrastructure had consolidated onto a shared content management system, and before the wider public web had settled the build-versus-buy question on CMS work.

2,500 pages, by hand, was not the answer

The honest accounting on a job that size is the first thing a senior practitioner runs. Scan a page. Run optical character recognition on it. Read the OCR output against the original to catch the recognition errors — the l versus I versus 1 collisions, the rn that the software wants to call m, the scientific notation and special characters that always come back wrong. Mark up the corrected text in HTML to the federal standard of the day. Wrap that markup in the page template every Government of Canada site was bound to. Quality-check the result. Publish.

Done by hand, that is somewhere between thirty minutes and an hour per page. At 2,500 pages, that is roughly 1,250 to 2,500 hours of editorial work: most of a year of someone’s full-time attention at best, and more than a year at worst. The job needed a different shape. So I built one.

The pipeline

The build was a batch processing tool that ran the scanned page images through OCR, applied a layer of correction heuristics tuned to the kinds of errors the OCR software was reliably making on this corpus, and emitted clean HTML formatted to the markup standard the federal web operated under at the time. The output dropped into a working directory ready for human review and final publish through Dreamweaver™, which was the production seat the department’s team was already using.
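To make the correction-then-markup step concrete, here is a minimal sketch in present-day Python. It is not the 2002 tool, whose code is not reproduced here. The rule set (letters that OCR confuses with digits inside numeric tokens) and the function names `correct_ocr_text` and `to_html_paragraphs` are illustrative assumptions; a real heuristic layer would be tuned against the errors seen in the actual corpus.

```python
import html
import re

# Hypothetical rule: letters that OCR tends to confuse with digits,
# rewritten only when they sit inside an otherwise numeric token.
DIGIT_LOOKALIKES = str.maketrans({"l": "1", "I": "1", "O": "0"})
NUMERIC_TOKEN = re.compile(r"\b[\dlIO][\dlIO.,]*\b")


def fix_numeric_token(match):
    token = match.group(0)
    # Leave the token alone unless it already contains a real digit,
    # so ordinary words such as "I" are never touched.
    if any(ch.isdigit() for ch in token):
        return token.translate(DIGIT_LOOKALIKES)
    return token


def correct_ocr_text(text):
    """Apply the digit-lookalike heuristic to raw OCR text."""
    return NUMERIC_TOKEN.sub(fix_numeric_token, text)


def to_html_paragraphs(text):
    """Turn blank-line-separated text into escaped <p> elements."""
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]
    return "\n".join(
        f"<p>{html.escape(' '.join(b.split()))}</p>" for b in blocks
    )
```

The rule is deliberately conservative: it only rewrites tokens that already look numeric, so a wrong guess cannot silently change ordinary prose. Anything subtler, such as rn read as m, would need a word list or a human reviewer.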

The constraint that shaped the build more than anything else was the Common Look and Feel Standards — the Treasury Board Secretariat’s binding web standard for federal departments. CLF specified the navigation, the page chrome, the accessibility requirements, and the markup conventions that every federal site had to conform to. There was no “we’ll style it to taste” path. The output had to land on disk already conforming to a published, audited standard. That was the immovable target the rest of the build was shaped against.
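Conforming on disk means every page body gets wrapped in the same fixed template before anyone looks at it. The sketch below shows that idea in present-day Python. The template is a placeholder, not the actual Common Look and Feel markup, and `PAGE_TEMPLATE` and `wrap_page` are hypothetical names.

```python
import html
from string import Template

# Placeholder page chrome only. The real CLF template is not reproduced here.
PAGE_TEMPLATE = Template("""<html lang="$lang">
<head><title>$title</title></head>
<body>
<div id="header">$department</div>
<div id="content">
$body
</div>
</body>
</html>
""")


def wrap_page(title, body_html, lang="en", department="Environment Canada"):
    """Wrap already-generated body markup in the fixed page template."""
    # Title and department are plain text and get escaped;
    # body_html is assumed to be valid markup already.
    return PAGE_TEMPLATE.substitute(
        lang=lang,
        title=html.escape(title),
        department=html.escape(department),
        body=body_html,
    )
```

Keeping the template in one place means a change to the standard is a one-line edit followed by a re-run, not 2,500 manual page edits.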

Dreamweaver was where humans still belonged

Even with the pipeline handling the volume, the work that needed human eyes still needed human eyes. OCR confidence on dense technical material drops sharply on tables, footnotes, equations, and figure captions. A scientific publication scanned from a printed report would land in the review queue with most paragraphs clean and a small number of passages flagged for review. Dreamweaver was the seat where the editorial team picked up those flagged passages, compared them against the source scan, and signed off on the page before publish.
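The routing into a review queue can be sketched as follows. This is an illustration in present-day Python under stated assumptions: that the OCR engine reports a 0.0 to 1.0 confidence per word, and that a passage is flagged when its mean confidence falls below a threshold. The `Passage` type, the 0.85 threshold, and the function names are hypothetical, not details of the original tool.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class Passage:
    text: str
    word_confidences: List[float]  # assumed 0.0-1.0 scores from the OCR engine


def needs_review(passage, threshold=0.85):
    """Return True when a passage should go to a human editor."""
    # A passage with no scores cannot be trusted, so send it to a person.
    if not passage.word_confidences:
        return True
    return mean(passage.word_confidences) < threshold


def split_review_queue(passages, threshold=0.85):
    """Partition passages into (clean, flagged) lists, preserving order."""
    clean, flagged = [], []
    for passage in passages:
        (flagged if needs_review(passage, threshold) else clean).append(passage)
    return clean, flagged
```

Tables, footnotes, and equations would typically score low under a scheme like this, which is how they end up in front of an editor instead of going straight to publish.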

The split was deliberate. Automate the parts that don’t need a person — the markup wrapping, the standard-conformance, the file naming, the directory structure, the navigation links. Leave a person on the parts that do — the judgement calls about whether an OCR pass captured a technical paragraph well enough to ship under a federal department’s banner. The 2,500-page volume became tractable because the tool absorbed the work that was the same on every page, and the editors absorbed only the work that genuinely varied.
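File naming and navigation links are the kind of work that is identical on every page. A minimal sketch in present-day Python follows. The naming scheme (`page-0007.html` inside a per-document directory) and the helper names `page_path` and `nav_links` are assumptions for illustration, not the conventions the department actually used.

```python
from pathlib import PurePosixPath


def page_path(doc_slug, page_number):
    """Build a predictable output path, e.g. some-report/page-0007.html."""
    return PurePosixPath(doc_slug) / f"page-{page_number:04d}.html"


def nav_links(doc_slug, page_number, page_count):
    """Return previous/next links for a page within its document."""
    links = []
    if page_number > 1:
        previous = page_path(doc_slug, page_number - 1).name
        links.append(f'<a href="{previous}">Previous</a>')
    if page_number < page_count:
        following = page_path(doc_slug, page_number + 1).name
        links.append(f'<a href="{following}">Next</a>')
    return " | ".join(links)
```

Because both functions are pure and deterministic, re-running the batch after a correction regenerates the same paths and links.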

The build: Custom OCR-to-HTML batch processing pipeline with correction heuristics, federal-standard markup output, and a human-review queue for low-confidence passages
Scope: Roughly 2,500 pages of legacy printed Environment Canada material — reports, scientific publications, reference documents — digitized and published to ec.gc.ca
Constraint: Common Look and Feel Standards (Treasury Board Secretariat) for all federal department web content
Stack: Period-current OCR software, custom processing tool, Dreamweaver as the production seat, standard HTML
Period: 2002
Client: Environment Canada (now Environment and Climate Change Canada)

Where this pattern transfers

Any organisation sitting on a paper archive that has to come online — under any compliance regime where the markup itself is part of what gets audited — has a version of this engagement on its desk. Provincial ministries with shelves of policy documents. Universities digitizing their special collections. Hospital networks moving decades of patient-education material online. Municipal governments publishing bylaw and council-record archives. Regulated industries that need to expose their archive of safety bulletins to a public audience.

The shape of the work is the same in every case. The volume forces the build. The compliance regime sets the immovable target. What determines whether a project like this succeeds is knowing which parts of the work the tool absorbs and which parts people still have to do. A fully automated pipeline that lets bad pages through under a federal department’s banner is a worse problem than the manual workflow it replaced.

Environment Canada — OCR-to-HTML Pipeline for 2,500 Pages of Legacy Publications, 2002

Christopher Ross
