How We Built an AI That Reads 200 Different Utility Bill Formats
Template-based OCR breaks every time a retailer updates their bill layout. We built automated carbon accounting AI that reads utility bills the way a human does — understanding context, not matching coordinates. Here's what went wrong, what we learned, and the engineering decisions behind handling AGL, Origin, council water bills, and fuel receipts in the same pipeline.
Last year we hit format number 147 and something broke. Not the AI — the AI was fine. Our assumption broke. We'd assumed that Australian utility bills, despite being visually different, followed roughly similar information hierarchies. Consumption near the top, charges in the middle, payment details at the bottom. Turns out a waste contractor in regional Queensland puts their tonnage figure inside a footnote on the second page, right below a paragraph about EPA licence conditions. No heading. No table. Just a number in a sentence.
That's the kind of thing that makes automated carbon accounting AI genuinely hard. Not the computer science. The chaos.
We've now processed bills from over 200 distinct formats — AGL, Origin, EnergyAustralia, Alinta, Red Energy, Simply Energy, council water authorities across every state, gas retailers you've never heard of, waste contractors who clearly designed their invoices in Word 97. Each one puts consumption data in a different place, uses different terminology, and structures the information in ways that would break any system relying on fixed coordinates.
This is the story of how we built Carbonly's document processing engine. Not the marketing version. The engineering version — including the parts that don't work yet.
The Template Problem Is a Scaling Problem
Before we wrote a line of code, we catalogued the utility bill formats we'd need to handle for a typical mid-market Australian company reporting under NGER. A construction company with 40 sites across NSW and Victoria might deal with 15-20 different energy retailers and utilities. A property manager with 80 tenancies could easily hit 25-30 unique formats. Add water authorities (which are state and council-specific), waste contractors, and fuel suppliers, and you're looking at 40-60 format variations for a single client.
Template-based OCR — the kind that's been around since the 1990s — handles this by having someone define, for each format, exactly where each field appears. "Total consumption is at coordinates (412, 318) on page 1." That works beautifully until it doesn't. And it stops working for three specific reasons.
First, retailers change their bill layouts. AGL has updated their residential bill format at least twice in the last three years. When they do, every template built for the old layout produces garbage or nothing. Someone has to notice the failure, build a new template, test it, and deploy it. The Institute of Finance & Management has documented that high exception rates from template breaks increase processing costs and cause significant payment delays across the invoicing industry. In carbon accounting, a template break during September means bad data in your October NGER submission.
Second, new providers appear. When a client switches from Origin to a smaller retailer like Powershop or Tango Energy, you've got a format you've never seen. That's a new template to build. For a carbon accounting platform serving hundreds of companies, this turns into a permanent staffing problem — you need template engineers on call, all the time, just to keep the lights on.
Third — and this is the one nobody talks about — template OCR can't handle the same retailer issuing different formats for different account types. AGL's residential electricity bill looks nothing like their large commercial bill, which looks nothing like their embedded network billing statement. Same company. Three templates. That pattern repeats across every major retailer.
We decided early that templates were a dead end for what we were building. Not because the technology is bad. It's fine for controlled environments with five or ten known formats. But carbon accounting at scale — across hundreds of clients, thousands of sites, and a format universe that shifts every quarter — needs something different.
Reading Documents the Way a Human Does
The shift we made was from coordinate extraction to contextual understanding. Instead of telling the system where to find the consumption figure, we built it to understand what a consumption figure looks like in context. That difference sounds subtle. Architecturally, it's everything.
Multimodal AI vision is what makes this possible. When the model processes a utility bill image, it doesn't parse pixels into characters and then try to match them to expected positions. It reads the entire page as a structured visual scene — headers, tables, body text, footnotes, logos — and builds a representation of how those elements relate to each other.
So when Origin puts "Total Usage: 4,230 kWh" in a blue box on the right side of page 1, and EnergyAustralia puts the same information in a table row labelled "Electricity Consumption" on page 2, the model understands that both represent the same semantic concept. Independent benchmarks covering thousands of documents show leading vision AI models achieving around 99% field-extraction accuracy when working from clean text input. Direct from images — which is our actual use case — the numbers are lower and depend heavily on document quality, but the contextual understanding holds up in ways that template systems simply can't match.
We paired this with something we call format-agnostic prompting. Rather than telling the model "extract the number at position X," we describe what we need semantically: the total energy consumption, in its original unit, for the billing period shown on this document. The model figures out where that information lives on its own. When a retailer redesigns their bill, nothing in our system needs to change. The prompts don't reference layout. They reference meaning.
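To make the idea concrete, here is a minimal sketch of what format-agnostic prompting looks like in code. The field names, descriptions, and function are illustrative, not Carbonly's actual prompts — the point is that nothing in the prompt references layout or coordinates.

```python
# Sketch of format-agnostic prompting: fields are described semantically,
# never by position on the page. Wording and field set are illustrative.

FIELD_DESCRIPTIONS = {
    "consumption": "the total energy consumption for the billing period, "
                   "in whatever unit the document uses (kWh, MJ, kL)",
    "unit": "the unit attached to the consumption figure",
    "billing_period": "the start and end dates the charges cover",
    "meter_id": "the meter or service identifier, if shown",
}

def build_extraction_prompt(fields: dict[str, str]) -> str:
    """Build a layout-independent extraction prompt from semantic field
    descriptions. Because nothing here mentions positions, a retailer
    redesigning their bill changes nothing on our side."""
    lines = ["You are reading a utility bill. Find the following:"]
    for name, description in fields.items():
        lines.append(f"- {name}: {description}")
    lines.append("Return a JSON object with exactly those keys.")
    return "\n".join(lines)
```

The prompt survives any redesign because it describes meaning, not geometry; the only thing that ever changes is the field list itself.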
That said — and this is where we need to be honest — contextual understanding introduces failure modes that template OCR doesn't have. A template either finds the number at the expected coordinates or it doesn't. Clear pass or fail. An LLM reading a document contextually might find a number that looks right but isn't. A demand charge that it interprets as consumption. A previous period's figure when you wanted the current one. These are harder failures to catch because the output looks plausible.
Which is exactly why we built validation into a separate pipeline phase rather than trusting any single extraction pass.
The Formats That Nearly Broke Us
Every engineering team has a wall of shame. Ours is a Slack channel called #cursed-documents. Here are some of the formats that forced us to rethink our approach.
Council water bills. These are the wild west of Australian utility billing. Each local government area issues its own water and sewerage bills with layouts that range from professional (Sydney Water, Melbourne Water) to what appears to be a mail merge from a 2004 council database. One regional NSW council we encountered puts water consumption in kilolitres on a quarterly bill where the billing period isn't explicitly stated — you have to infer it from the rate notice date and a statement that says "charges for the period." Another council combines water, sewerage, stormwater, and trade waste on a single page with a table that has no column headers. The consumption figure sits in the third row, second column, with nothing to identify it except its position relative to the dollar amounts.
For these, we had to teach the extraction agent to reason about implicit structure. If a number appears in a table with no headers, what else on the page provides context? Is there a unit anywhere nearby? Does the number's magnitude make sense for water consumption at a site of this type? This kind of inference is something a human does instantly and an LLM does reasonably well — but it's the opposite of deterministic, and we're not fully comfortable with the confidence levels yet.
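The magnitude check in that chain of reasoning is the part that translates most directly into code. Here is an illustrative sketch — the ranges below are invented for the example, not our calibrated thresholds.

```python
# Illustrative magnitude check: does an unlabelled number plausibly
# represent quarterly water consumption for this site type? The bands
# below are made up for the example, not production thresholds.

PLAUSIBLE_KL_PER_QUARTER = {
    "office": (5, 500),
    "warehouse": (2, 200),
    "manufacturing": (50, 20_000),
}

def plausible_water_reading(value: float, site_type: str) -> bool:
    """Return True if `value` (kilolitres) sits inside the expected band
    for this site type; anything outside gets routed to human review."""
    low, high = PLAUSIBLE_KL_PER_QUARTER.get(site_type, (0.0, float("inf")))
    return low <= value <= high
```

A check like this never proves an extraction is right; it only rules out readings that can't be right, which is enough to decide whether a human needs to look.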
Multi-page gas invoices. The big three retailers (AGL, Origin, EnergyAustralia) all issue multi-page gas bills for commercial accounts. The summary page typically shows the dollar amount. The consumption data — in MJ or cubic metres — might be on page 3, page 5, or page 9 depending on how many rate schedules and regulatory notices are included. Our extraction agent needs to scan the entire document, identify which page contains consumption data, and correlate it with the billing period from the summary page.
The tricky part is that some of these bills contain multiple consumption figures — current period, previous period, and same period last year — on the same page. The model needs to know which one to pull. We've seen it grab the year-on-year comparison figure instead of the current period figure, which produces an emission calculation that's silently wrong by whatever the year-on-year change was. Our cross-validation phase catches most of these by comparing against historical data for the same site, but it assumes we have historical data. For a brand new client, we're flying blind for the first quarter.
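The historical cross-check reduces to a simple comparison against a trailing baseline. This is a minimal sketch of the idea — the 50% tolerance is illustrative, not our tuned value — and it makes the new-client blind spot explicit: with no history, nothing can be flagged.

```python
import statistics

# Sketch of the historical cross-check: compare a new extraction against
# the trailing median for the same site and flag large deviations.
# The 50% tolerance is illustrative only.

def flag_against_history(value: float, history: list[float],
                         tolerance: float = 0.5) -> bool:
    """Return True if `value` should be flagged for review. With no
    history (a brand new client) nothing can be flagged, which is
    exactly the first-quarter blind spot."""
    if not history:
        return False
    baseline = statistics.median(history)
    return abs(value - baseline) > tolerance * baseline
```

The median rather than the mean keeps one past anomaly from poisoning the baseline, which matters when the history itself came through the same extraction pipeline.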
Fuel receipts. Construction companies generate hundreds of these. Diesel for excavators, petrol for site vehicles, sometimes LPG. The receipts range from proper tax invoices (structured, clear, typed) to thermal-print docket receipts that are already fading by the time someone photographs them. Transport companies have staff photographing fuel receipts on their phones — at the bowser, in bright sunlight, at an angle, sometimes with a thumb over the corner.
We had to build specific pre-processing for these: rotation correction, contrast enhancement, resolution upscaling. Even then, about 15% of fuel receipt photos produce extractions we'd flag as low confidence. The numbers might be right, but we can't be sure enough to send them into an emission calculation without a human checking. For a company processing 10,000 fuel receipts a year, that's 1,500 documents needing review. Better than 10,000, but not zero.
Handwritten annotations. This one surprised us. We expected maybe 2-3% of documents to have handwriting. The actual rate is closer to 8-10%. Site managers writing meter readings in pen over the printed estimate. Accountants scrawling "paid 14/3" across the top. Someone circling a number and writing "check this" in the margin. Each annotation adds noise that the vision model has to distinguish from the printed content it's actually trying to extract.
Our vision AI handles clean handwriting reasonably well. Research from the Emerald benchmarking study on LLMs for handwritten text recognition confirmed that performance drops noticeably with cursive or messy handwriting, and that LLMs "do not possess significant capability for self-correction" when they misread handwritten characters. Our approach is pragmatic: if handwritten content overlaps with a data field we need, we flag it for human review rather than guessing. The cost of a false extraction in carbon accounting — a wrong number flowing into an NGER submission — is too high to gamble on.
Why We Run Extraction Twice
This is the most expensive architectural decision we've made. And we'd make it again.
For every document, we run the data extraction phase twice with different prompting strategies. The first pass uses a directive approach: "Extract the following fields from this utility bill." The second pass uses a conversational approach: "You're looking at a utility bill. Walk me through what you see, then identify the consumption data and billing period."
We compare the results. If both passes agree on all fields — consumption, unit, billing period, meter ID — the document moves forward automatically. If they disagree on anything, it gets flagged.
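The comparison step itself is deliberately dumb. Here is a minimal sketch, assuming the four fields named above; the data shapes are illustrative, not our production schema.

```python
# Minimal sketch of the dual-pass consensus check: two extraction passes
# with different prompting strategies, compared field by field.

REQUIRED_FIELDS = ("consumption", "unit", "billing_period", "meter_id")

def compare_passes(directive: dict, conversational: dict) -> list[str]:
    """Return the fields on which the two passes disagree. An empty
    list means the document moves forward automatically; anything
    else gets flagged for human review."""
    return [field for field in REQUIRED_FIELDS
            if directive.get(field) != conversational.get(field)]
```

In practice you normalise before comparing — "4,230 kWh" and "4230 kWh" are the same answer — otherwise formatting noise masquerades as disagreement and inflates the review queue.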
This sounds like a waste of money. It roughly doubles our LLM inference cost per document. But here's the maths that justifies it.
A mid-market NGER reporter processing 800 utility bills per year at, say, $0.04 per extraction pass spends an extra $32 per year on double extraction. The cost of a single incorrect emission figure making it into an NGER submission — factoring in potential restatement, CER audit attention, and the Beach Energy precedent of forced reasonable assurance audits — is conservatively $20,000-$50,000. We're spending $32 to avoid a five-figure problem. That's not even a close call.
Research published in Information (MDPI) in 2025 on multi-agent hallucination mitigation frameworks found that requiring consensus between multiple extraction passes "greatly reduced hallucination risks" in structured data extraction tasks. Our own internal testing shows the double-pass approach catches about 3-4% of documents where the single pass would have produced a confident but incorrect result. On 800 documents, that's 24-32 bills per year where the AI would have quietly given you the wrong number.
We're still not sure this scales perfectly for every document type. Fuel receipts with poor image quality sometimes produce two different wrong answers rather than one right and one wrong. We're working on a third-pass tiebreaker approach for these cases, but we haven't deployed it yet because the latency hit is hard to justify for the incremental accuracy gain.
Handling Bills in Other Languages
This came up faster than we expected. Australian companies with international supply chains receive invoices in Mandarin, Japanese, Korean, Bahasa, and Thai — particularly for Scope 3 Category 1 (purchased goods and services) and Category 4 (upstream transport). Our vision AI handles multilingual documents, but the accuracy gap is real.
For English-language utility bills, our end-to-end extraction accuracy sits around 93-95% across all document quality levels. For non-Latin script documents, that drops to roughly 75-80%. The model reads the text correctly most of the time, but it struggles with two specific things.
First, unit conventions differ. Japanese electricity bills report consumption in kWh like Australia, but gas bills often use cubic metres with different heating value assumptions. Chinese industrial invoices sometimes use non-standard unit abbreviations. The normalisation agent needs to know these regional conventions, and we've only built out the common ones so far.
Second, the validation agent's anomaly detection relies on understanding what "reasonable" consumption looks like for a given site type. Our baselines are calibrated for Australian commercial properties. A manufacturing facility in Shenzhen has a completely different consumption profile, and we don't have enough data yet to set good thresholds.
We handle this honestly: non-English documents get processed with a lower confidence threshold, and more of them get routed to human review. It's not ideal. But claiming 95% accuracy on Mandarin invoices when we're actually hitting 78% would be the kind of misleading statement that gets companies in trouble — and not just with the ACCC. If your Scope 3 figures are built on unreliable extraction from supplier invoices, your assurance provider will find the cracks during the ASSA 5010 engagement.
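The routing policy is simple to express: non-Latin-script documents face a stricter auto-accept bar, so more of them land in front of a human. This sketch uses invented threshold values, not our production configuration.

```python
# Sketch of the honest-routing policy: stricter auto-accept threshold
# for non-Latin-script documents. Threshold values are illustrative.

AUTO_ACCEPT_THRESHOLD = {
    "latin": 0.90,
    "non_latin": 0.97,
}

def route(confidence: float, script: str) -> str:
    """Route a document to automatic processing or human review based
    on extraction confidence and the document's script family. Unknown
    scripts default to the strict threshold."""
    threshold = AUTO_ACCEPT_THRESHOLD.get(script, 0.97)
    return "auto" if confidence >= threshold else "human_review"
```

The same 0.93-confidence extraction passes for an English bill and gets reviewed for a Mandarin invoice — which is the whole point of calibrating the threshold to measured accuracy rather than claiming one number for everything.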
The Engineering Philosophy Behind It
We've been asked why we didn't just fine-tune a model on utility bills and call it done. Fair question. There's a reason.
Fine-tuning locks you to a specific model version. When the next generation of vision AI drops, a fine-tuned model doesn't benefit from the improvement. You'd have to re-fine-tune on the new base model, re-validate, and re-deploy. Our architecture keeps the model as a swappable component. We've already upgraded between model generations without changing our pipeline logic — extraction accuracy improved by roughly 8-12% on degraded documents, and we picked that up for free.
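The "swappable component" boundary is just an interface the pipeline depends on instead of any particular model generation. A minimal sketch, with names that are illustrative rather than Carbonly's actual code:

```python
from typing import Protocol

# Sketch of the swappable-model boundary: the pipeline depends on this
# interface, not on a specific model generation. Names are illustrative.

class VisionExtractor(Protocol):
    def extract(self, document: bytes, prompt: str) -> dict: ...

class ExtractionStage:
    """Pipeline stage that accepts any VisionExtractor. Upgrading to a
    new model generation means constructing this stage with a different
    extractor; nothing downstream changes."""
    def __init__(self, extractor: VisionExtractor):
        self.extractor = extractor

    def run(self, document: bytes, prompt: str) -> dict:
        return self.extractor.extract(document, prompt)
```

Because `VisionExtractor` is structural, a new model wrapper doesn't even need to inherit from anything — it just has to expose `extract`, and the stage accepts it.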
The deeper reason is that our pipeline isn't really about the AI model at all. It's about the architecture around it — the classification, validation, normalisation, calculation, and audit trail phases that turn raw extraction into regulatory-grade emission figures. The model is the eyes. The pipeline is the brain.
When our team built data platforms at BHP, Rio Tinto, and Schneider Electric, the lesson was always the same: the ingestion layer is the least important part of a good data system. What matters is validation, lineage, and auditability. You can swap the ingestion method — manual entry, SCADA feeds, OCR, LLM extraction — and the downstream quality controls should still work. That principle carried straight into how we built Carbonly.
We treat every LLM output as untrusted input. Same way you'd treat data from an external API or a user-submitted form. It goes through validation before it touches a calculation. That mindset — trust nothing, verify everything — is what separates a carbon accounting system from a demo.
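"Trust nothing, verify everything" reduces to a validation gate that every extracted record must pass before it can touch a calculation. This sketch is illustrative — the checks and the unit list are examples, not our full rule set.

```python
from datetime import date

# "Trust nothing, verify everything" as code: every LLM output passes a
# validation gate before it can reach a calculation. Checks and the
# allowed-unit list are illustrative.

ALLOWED_UNITS = {"kWh", "MJ", "kL", "L", "t"}

def validate(record: dict) -> list[str]:
    """Validate one extracted record exactly as you would an external
    API payload. Returns a list of problems; an empty list means the
    record may proceed to the calculation phase."""
    problems = []
    consumption = record.get("consumption")
    if not isinstance(consumption, (int, float)) or consumption < 0:
        problems.append("consumption missing or negative")
    if record.get("unit") not in ALLOWED_UNITS:
        problems.append("unrecognised unit")
    start, end = record.get("period_start"), record.get("period_end")
    if not (isinstance(start, date) and isinstance(end, date) and start < end):
        problems.append("implausible billing period")
    return problems
```

Returning a list of problems rather than a boolean matters for the audit trail: the record of why a document was held back is itself part of the evidence an assurance provider will want to see.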
What 200 Formats Taught Us About Document Processing
After a year of building this, a few things became clear that we didn't expect going in.
Format diversity is a long tail. The first 30 formats (the major retailers and metro water utilities) cover about 70% of the documents we see. The next 70 formats cover another 20%. The remaining 100+ formats cover the last 10% — but that 10% includes some of the most important documents. A regional gas distributor's invoice might represent the largest single Scope 1 source for a manufacturing client. You can't just skip the long tail.
Document quality matters more than format. A clean, high-resolution scan of the strangest bill layout we've ever seen is easier to process than a blurry phone photo of a standard AGL electricity bill. We've invested more engineering time in image pre-processing (deskewing, contrast, denoising, resolution upscaling) than in handling new formats. The format problem largely solves itself with good contextual understanding. The quality problem requires dedicated image processing work that has nothing to do with LLMs.
The hardest part isn't extraction. It's knowing when extraction failed. A template system tells you loudly when it can't find data — the expected coordinates are empty. An LLM always gives you an answer. Sometimes that answer is wrong. Building systems that reliably detect incorrect-but-plausible outputs is genuinely harder than building the extraction itself. Our cross-validation, historical comparison, and dual-pass approaches all exist to answer one question: did the AI actually get this right, or did it just produce something that looks right?
We don't think we've fully solved that question. Nobody has. But for the documents we see in Australian carbon accounting — electricity bills for Scope 2, gas invoices for Scope 1, fuel receipts, water and waste statements — we're confident enough in the pipeline to put our name on the numbers it produces. And where we're not confident, the system says so explicitly rather than guessing.
That distinction — between confidence and silence — is the whole difference between AI that's useful for compliance and AI that's a liability.
Related Reading:
- The 7-Phase AI Pipeline That Reads Your Utility Bills — technical walkthrough of each pipeline phase
- AI Document Processing for Carbon Accounting — the business case and cost maths
- How to Calculate Scope 2 Emissions from Australian Electricity Bills — the formula the pipeline applies after extraction
- 10,000 Fuel Receipts in One Quarter
- Why Carbonly Is the Best Carbon Accounting Software in Australia