The 7-Phase AI Pipeline That Reads Your Utility Bills

Most 'AI-powered' carbon tools are just OCR with a chatbot bolted on. We built a 7-phase multi-agent pipeline that classifies, extracts, validates, normalises, and calculates emissions from raw utility bills — with a full audit trail. Here's exactly how each phase works and where it breaks.

Carbonly.ai Team · March 10, 2026 · 12 min read
AI Technology · Document Processing · Carbon Accounting · Data Pipeline

We've processed tens of thousands of utility bills through our AI pipeline. And the thing that still surprises us isn't what the AI gets right — it's how creatively wrong documents can be.

A water bill from a regional council where someone scrawled the meter reading in pen over the printed estimate. A 14-page gas invoice where consumption data sits on page 9, sandwiched between regulatory notices nobody reads. An electricity bill photographed at an angle on a desk covered in coffee rings. These aren't edge cases. This is what real utility bill data extraction for ESG reporting actually looks like.

We built a 7-phase pipeline — instead of one model that does everything — because no single AI agent handles all of this well. An agent that's great at reading damaged scans is mediocre at unit conversion. One that nails emission factor lookups can't tell a water bill from a waste manifest. So we split the problem into seven discrete stages, each handled by a specialised agent, with explicit handoffs and validation gates between them.

This post is a technical walkthrough of that pipeline. What each phase does, what goes wrong, and the engineering decisions behind it. If you're evaluating AI document processing for carbon accounting, this is the detail most vendors won't show you.

Why One Model Isn't Enough

The temptation with large language models is to throw everything at a single prompt. Upload a bill, ask for structured JSON, get your answer. We tried this. It works about 80% of the time.

That other 20% will wreck your NGER submission.

The failure modes are specific and predictable. A single-model approach hallucinates numbers when it can't read them clearly — filling in a plausible-looking consumption figure instead of flagging uncertainty. It confuses billing period dates when they appear in multiple places on the document. It silently converts units wrong because it doesn't know that this particular gas retailer reports in megajoules while that one uses cubic metres.

Research published in 2025 found that even modest hallucination rates of 10-20% in financial reasoning tasks "can have outsized impacts when scaled across high-stakes decisions in portfolio management or regulatory reporting." Carbon reporting is exactly that kind of high-stakes structured extraction. When the CER audits your NGER submission, a plausible-looking number that didn't come from the source document is worse than a blank cell — because nobody catches it until it's a compliance problem.

Multi-agent architectures solve this by making each agent's job narrow enough that failure is detectable. If the classification agent is uncertain whether a document is an electricity bill or a gas invoice, it says so. It doesn't guess and pass garbage downstream. That explicit uncertainty is the whole point.

Phase 1: Document Classification

Every document enters the pipeline with no assumptions. We don't trust filenames (people name files "scan_003.pdf" and move on). We don't trust folder structures. The classification agent examines the document itself and determines what it is: electricity bill, gas invoice, water statement, fuel receipt, waste manifest, or something we don't handle.

This matters more than it sounds. An electricity bill feeds into Scope 2 calculations using grid emission factors. A natural gas invoice is Scope 1 — direct combustion. A diesel fuel receipt is also Scope 1 but uses completely different NGA emission factors. Classify wrong, and every downstream calculation is wrong even if the data extraction was perfect.

The classifier runs on document layout features, not just keywords. A document that says "Energy" in the header could be electricity or gas. The classifier looks at the presence of kWh vs MJ vs cubic metres, retailer-specific formatting patterns, and the relationship between charge types and units. We trained it on documents from over 40 Australian energy retailers, water utilities, and waste contractors.

Where it struggles: combined utility bills. Some retailers — particularly for small businesses — issue a single document covering electricity and gas. The classifier needs to flag these as multi-commodity and route them so both commodity types get extracted separately. We handle this, but it added about three weeks of engineering that we didn't originally plan for.
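To make the routing concrete, here is a minimal keyword-based sketch of the multi-commodity logic. The production classifier uses layout features and retailer formatting patterns as described above; every keyword and threshold below is a hypothetical stand-in:

```python
# Illustrative sketch only: a rule-based fallback layer for document
# classification. All keywords and thresholds are hypothetical, not the
# production feature set.

COMMODITY_SIGNALS = {
    "electricity": ["kwh", "nmi", "peak usage", "off-peak"],
    "gas":         ["mj", "gj", "mirn", "heating value"],
    "water":       ["kl", "kilolitre", "water usage charge"],
}

def classify(document_text: str) -> dict:
    """Score each commodity by signal hits; flag combined bills."""
    text = document_text.lower()
    scores = {
        commodity: sum(1 for kw in keywords if kw in text)
        for commodity, keywords in COMMODITY_SIGNALS.items()
    }
    matched = [c for c, s in scores.items() if s >= 2]
    if not matched:
        return {"type": "unknown", "needs_review": True}
    if len(matched) > 1:
        # Combined utility bill: route each commodity for separate extraction.
        return {"type": "multi-commodity", "commodities": matched,
                "needs_review": False}
    return {"type": matched[0], "needs_review": False}
```

The key behaviour is the last branch: a bill that scores on both electricity and gas signals is routed as multi-commodity rather than forced into a single class.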

Phase 2: Vision-to-Text via Multimodal AI

This is where the LLM earns its keep. Our multimodal AI vision reads the document image — scanned PDF, phone photo, born-digital invoice — and produces a structured text representation of everything on the page.

The distinction from traditional OCR matters. Template-based OCR systems need someone to define where on the page each field appears. "Total consumption is in row 14, column B." That works until AGL changes their bill layout, which they do. Or until you receive an invoice from a retailer you've never seen. Then someone has to build and test a new template.

The model reads documents the way a human does. It understands that a number appearing after the words "Total Usage" and before "kWh" is a consumption figure, regardless of where on the page those elements sit. Independent benchmarks across thousands of documents show leading multimodal AI models achieving roughly 99% accuracy on field extraction when given clean text input, though that drops when working directly from images of varying quality.

We feed the model high-resolution document images and get back a structured representation that preserves the spatial relationships between fields. Tables, headers, footnotes, regulatory notices — all of it comes through with positional context intact.

Where it struggles: heavily degraded scans. A bill that went through a fax machine in 2019 and got scanned at 72 DPI is genuinely hard. Handwritten annotations over printed text cause problems too — the model sometimes reads the handwritten figure and sometimes the printed one, and it doesn't always tell you which. We've added pre-processing steps (contrast enhancement, rotation correction, resolution upscaling) that help, but we won't pretend we've solved the bottom 5% of document quality. Some bills need a human to look at them. That's just honest.

Phase 3: Structured Data Extraction

The vision-to-text output is rich but messy. It contains everything on the document — marketing messages, payment terms, tariff structures, regulatory fine print. The extraction agent's job is to pull out exactly the fields that matter for emissions calculations and nothing else.

Those fields are specific:

  • Consumption quantity (the number)
  • Unit of measurement (kWh, MJ, kL, litres, cubic metres, tonnes)
  • Billing period start and end dates
  • Meter number or NMI (National Metering Identifier)
  • Site address
  • Supplier name
  • Whether the reading is actual or estimated

That last one — actual vs estimated — trips people up constantly. Estimated reads are common when a meter reader can't access a site. The bill still shows a consumption figure, but it's the retailer's best guess. If you're calculating Scope 2 emissions from electricity bills, using an estimated read without flagging it means your emission figure might get revised when the actual read comes through. We tag every extraction with a confidence indicator and an actual/estimated flag so downstream systems know what they're working with.

The extraction agent outputs structured JSON with each field, its value, the confidence score, and the location in the source document where it found that value. That source location reference is critical for Phase 7 — the audit trail.
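To give a feel for that output, here is a hypothetical example of its shape. The field names, values, and coordinate format are illustrative, not the production schema:

```python
# Hypothetical example of the extraction agent's output shape.
# Field names and values are illustrative, not the production schema.
extraction = {
    "document_id": "doc-2026-0001",
    "fields": {
        "consumption_quantity": {
            "value": 4120.0,
            "unit": "kWh",
            "confidence": 0.97,
            "source_location": {"page": 1, "bbox": [412, 318, 508, 334]},
        },
        "billing_period": {
            "start": "2025-10-15",
            "end": "2025-11-14",
            "confidence": 0.94,
            "source_location": {"page": 1, "bbox": [90, 120, 260, 136]},
        },
        "reading_type": {
            "value": "estimated",   # the actual-vs-estimated flag
            "confidence": 0.91,
            "source_location": {"page": 2, "bbox": [60, 410, 180, 426]},
        },
    },
}
```

Every field carries its own confidence score and a pointer back to where it appeared on the page, which is what the audit trail in Phase 7 consumes.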

Where it struggles: multi-page bills where consumption data and billing period dates appear on different pages. The agent needs to correlate information across pages, and sometimes the billing period on page 1 refers to a different period than the consumption breakdown on page 4. We handle this by having the agent process the full document and cross-reference internal consistency, but edge cases still exist — particularly with large commercial accounts that have multiple meters on a single bill.

Phase 4: Cross-Validation

Here's where most AI extraction tools stop. They get the data out and call it done. We don't, because extraction without validation is just faster data entry with the same error rate.

The validation agent checks the extracted data against multiple signals. Is the consumption figure within a reasonable range for this type of site? A 200-square-metre office in Parramatta using 2,400,000 kWh in a quarter is obviously wrong — that's a smelter, not an office. Does the billing period align with previously processed bills from the same site? Is there a gap or overlap? Does the unit of measurement match what this retailer typically uses?

We also run extraction twice with different prompting strategies and compare results. This is computationally expensive — it roughly doubles the LLM cost per document — but it catches the exact class of errors that matter most: cases where the model confidently extracts the wrong number. Research from multi-agent consensus systems shows that requiring agreement between multiple extraction passes "greatly reduced hallucination risks, with disagreements flagging potential errors for manual review."

When the validator finds a discrepancy, the document doesn't silently pass through. It gets flagged for human review with a specific explanation: "Consumption figure 45,230 kWh is 3.2x the 12-month average for this site" or "Billing period overlaps with previously processed bill by 14 days." The human reviewer sees the original document, the extracted data, and the reason for the flag.
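A minimal sketch of two of those gates, the range check and the two-pass consensus check, might look like this. The ratio limit and disagreement tolerance are hypothetical placeholders, not our calibrated thresholds:

```python
# Sketch of two validation gates with hypothetical thresholds.

def validate(extracted_kwh, history_kwh, second_pass_kwh,
             ratio_limit=3.0):
    """Return a list of review flags; an empty list means the bill passes."""
    flags = []
    # Gate 1: range check against this site's historical average.
    if history_kwh:
        avg = sum(history_kwh) / len(history_kwh)
        ratio = extracted_kwh / avg
        if ratio > ratio_limit:
            flags.append(
                f"Consumption figure {extracted_kwh:,.0f} kWh is "
                f"{ratio:.1f}x the 12-month average for this site")
    # Gate 2: consensus between two extraction passes with different
    # prompting strategies; disagreement flags a potential hallucination.
    if abs(extracted_kwh - second_pass_kwh) > 0.005 * extracted_kwh:
        flags.append(
            f"Extraction passes disagree: {extracted_kwh:,.0f} kWh vs "
            f"{second_pass_kwh:,.0f} kWh")
    return flags
```

Note that each flag carries a specific, human-readable explanation rather than a bare failure code, so the reviewer knows why the document was pulled aside.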

We're still calibrating the sensitivity on this. Too loose and errors get through. Too tight and you're drowning reviewers in false positives. Right now, about 8-12% of documents get flagged, and roughly half of those flags turn out to be legitimate issues. We think we can get the false positive rate lower, but it's a moving target as we see new document formats.

Phase 5: Unit Normalisation

This phase is unglamorous and absolutely critical. Every consumption figure needs to be in consistent units before emission factors get applied. Sounds simple. It isn't.

Electricity is usually in kWh, but some retailers report in MWh for large commercial accounts. Gas comes in megajoules, gigajoules, or cubic metres depending on the retailer and the state. Water appears in kilolitres or litres. Waste shows up in tonnes or cubic metres (and those aren't interchangeable without density assumptions that change by waste stream).

The normalisation agent converts everything to the base units that NGA emission factors expect. For electricity, that's kWh. For natural gas, that's gigajoules. For liquid fuels, that's kilolitres. Every conversion uses explicit conversion factors with full precision — not rounded approximations.

The gotcha people miss: natural gas heating values. When a gas bill shows consumption in cubic metres, you can't just convert to gigajoules using a fixed multiplier. The heating value of natural gas varies by distribution zone. In Western Australia, it's different from Victoria, which is different from Queensland. The NGA Factors 2025 workbook specifies these regional heating values, and the normalisation agent looks up the correct one based on the site's state and the gas network operator.
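The lookup itself is simple once the right table is in hand. In this sketch the heating values are illustrative placeholders only; the real figures come from the NGA Factors workbook and vary by distribution zone and year:

```python
# Sketch of the zone-based heating-value conversion. The heating values
# below are ILLUSTRATIVE PLACEHOLDERS, not NGA workbook figures.
HEATING_VALUE_MJ_PER_M3 = {
    "VIC": 38.8,   # hypothetical
    "QLD": 39.4,   # hypothetical
    "WA":  38.0,   # hypothetical
}

def gas_m3_to_gj(volume_m3: float, state: str) -> float:
    """Convert metered gas volume to energy using the zone heating value."""
    mj = volume_m3 * HEATING_VALUE_MJ_PER_M3[state]
    return mj / 1000.0  # MJ -> GJ
```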

We also handle the billing period normalisation here. If your reporting period is the financial year (July to June, as required for NGER) but your electricity bill runs from the 15th of one month to the 14th of the next, those periods don't align. The normalisation agent pro-rates consumption to match reporting periods using daily consumption rates. It's not perfect — consumption isn't actually linear across a billing period — but it's the accepted method under the NGER Technical Guidelines, and it's a lot better than ignoring the mismatch.
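The pro-rating arithmetic is worth seeing worked through. This sketch apportions a bill's consumption to a reporting period using a flat daily rate; the dates and figures are illustrative:

```python
from datetime import date

# Sketch of billing-period pro-rating using a flat daily rate.
# Dates and consumption figures are illustrative.

def prorate(consumption, bill_start, bill_end, period_start, period_end):
    """Apportion a bill's consumption to its overlap with a reporting period."""
    bill_days = (bill_end - bill_start).days + 1
    daily_rate = consumption / bill_days
    overlap_start = max(bill_start, period_start)
    overlap_end = min(bill_end, period_end)
    overlap_days = max(0, (overlap_end - overlap_start).days + 1)
    return daily_rate * overlap_days

# A bill running 15 June to 14 July straddles the financial-year boundary;
# 14 of the 30 billed days fall inside FY2025-26:
fy_share = prorate(3100.0,
                   date(2025, 6, 15), date(2025, 7, 14),   # billing period
                   date(2025, 7, 1), date(2026, 6, 30))    # reporting FY
```

The flat daily rate is the simplification mentioned above: consumption isn't actually linear across a billing period, but this is the accepted treatment.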

Phase 6: Emission Factor Application and Calculation

Now we get to why all the previous phases exist. The calculation agent takes normalised consumption data and applies the correct emission factors from the NGA Factors 2025 workbook to produce tonnes CO2-e.

For Scope 2 electricity, this means state-based grid emission factors. And they vary significantly. Victoria's grid factor is 0.78 kg CO2-e/kWh. South Australia's is 0.22. Tasmania's is 0.20. If you use the national average of 0.62 for everything — something we've seen sustainability managers do — you'll overstate SA emissions by nearly threefold and understate Victorian emissions by about 20%.
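Here is that difference as a worked calculation, using the state factors quoted above (kg CO2-e per kWh):

```python
# Worked example using the state grid factors quoted in the text.
GRID_FACTOR_KG_PER_KWH = {"VIC": 0.78, "SA": 0.22, "TAS": 0.20}
NATIONAL_AVERAGE = 0.62

def scope2_tonnes(kwh: float, state: str) -> float:
    """Scope 2 emissions in tonnes CO2-e from grid electricity."""
    return kwh * GRID_FACTOR_KG_PER_KWH[state] / 1000.0

# The same 100,000 kWh of consumption at a South Australian site:
sa_correct  = scope2_tonnes(100_000, "SA")          # 22 t CO2-e
sa_averaged = 100_000 * NATIONAL_AVERAGE / 1000.0   # 62 t CO2-e, ~2.8x overstated
```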

The calculation agent selects the correct factor based on the site's location (determined during extraction) and the reporting year. It also stores which factor version was used, because NGA factors change annually. The 2025 factors show 2-3% reductions in grid intensity across most states compared to 2024-25 — NSW dropped from 0.66 to 0.64, Queensland from 0.71 to 0.67.

For NGER reporters, there's an additional wrinkle. NGER uses AR5 global warming potentials from the IPCC Fifth Assessment Report. AASB S2 requires AR6 values. The numerical difference is small for CO2 but material for methane (AR5: 28; AR6: 27.9) and some refrigerant gases. If you're reporting under both frameworks — and every NGER reporter pulled into ASRS Group 2 is — you potentially need both sets of calculations from the same underlying data.

We handle both. The calculation agent produces NGER-aligned and ASRS-aligned emission figures in parallel, flagged clearly so the right numbers go into the right submission. This is one of those details that sounds minor until you're explaining to an auditor why your ASRS Scope 1 figure doesn't exactly match your NGER figure.
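In code, the parallel calculation is just the same activity data weighted by two GWP tables. The methane values are the ones quoted above (AR5: 28, AR6: 27.9); CO2 has a GWP of 1 under both, and the inventory figures are illustrative:

```python
# Sketch of producing both framework-aligned figures from one inventory.
# CO2 has a GWP of 1 under both; CH4 values are as quoted in the text.
GWP = {
    "NGER (AR5)": {"CO2": 1.0, "CH4": 28.0},
    "ASRS (AR6)": {"CO2": 1.0, "CH4": 27.9},
}

def co2e_tonnes(gas_tonnes: dict, framework: str) -> float:
    """Sum gas masses weighted by the framework's global warming potentials."""
    return sum(t * GWP[framework][gas] for gas, t in gas_tonnes.items())

inventory = {"CO2": 800.0, "CH4": 1.5}   # illustrative activity data
nger = co2e_tonnes(inventory, "NGER (AR5)")   # 800 + 1.5 * 28.0 = 842.0
asrs = co2e_tonnes(inventory, "ASRS (AR6)")   # 800 + 1.5 * 27.9 = 841.85
```

The small gap between the two results is exactly the discrepancy an auditor will ask about, which is why both figures are stored with the framework they belong to.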

Where we're still working: Scope 3 calculations from supplier invoices. The emission factors for purchased goods and services depend on the supplier's own emissions intensity, which varies wildly and is often unavailable. We can apply spend-based emission factors from DCCEEW's input-output tables, but those are averages across entire industry sectors. They're a placeholder, not an answer. We're honest about that limitation because overstating the accuracy of Scope 3 numbers is exactly the kind of claim that gets companies in trouble with the ACCC.

Phase 7: Audit Trail Generation

Every number produced by the pipeline needs to be traceable back to its source. Not next week when an auditor asks. Right now. Automatically.

The audit trail agent creates a record for every emission figure that links:

  • The final calculated emission (e.g., 847 tonnes CO2-e)
  • The emission factor used and its source (e.g., NGA Factors 2025, Table 1, Victorian grid, 0.78 kg CO2-e/kWh)
  • The normalised consumption that was multiplied (e.g., 1,085,897 kWh)
  • Any unit conversions applied (e.g., 1,085.9 MWh converted to kWh)
  • The raw extracted value from the document (e.g., "Total Usage: 1,085.9 MWh")
  • The exact location on the source document where that value appears
  • The original source document itself, stored as an immutable reference
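
As a sketch, that chain can be held in a single immutable record per figure. Field names here are hypothetical, mirroring the list above rather than our production schema:

```python
from dataclasses import dataclass

# Hypothetical shape of an audit-trail record; field names are
# illustrative, mirroring the chain listed in the text.
@dataclass(frozen=True)   # frozen: records are immutable once written
class AuditRecord:
    emission_tco2e: float     # e.g. 847.0
    factor_source: str        # e.g. "NGA Factors 2025, Table 1, VIC grid"
    factor_value: float       # e.g. 0.78 kg CO2-e/kWh
    normalised_kwh: float     # e.g. 1_085_897.0
    conversions: tuple        # e.g. ("1,085.9 MWh -> kWh",)
    raw_extracted: str        # e.g. "Total Usage: 1,085.9 MWh"
    source_location: dict     # page and bounding box on the document
    document_ref: str         # immutable reference to the stored source file
```

Because the factor, the normalised consumption, and the final figure all live in one record, the multiplication can be re-verified at any time without reopening the source document.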

That chain — from reported number back to highlighted text on a scanned PDF — is what the Clean Energy Regulator expects when they audit your NGER submission. It's what ASRS assurance providers will need under ASSA 5010. And it's what Beach Energy apparently didn't have, given the CER required them to rebuild their data collection systems as part of their 2025 enforceable undertaking.

The ANAO found that 72% of 545 NGER reports they examined contained errors. Seventy-two percent. Most of those errors trace back to the same root cause: a gap somewhere between the source document and the reported number. The audit trail doesn't prevent errors in the AI pipeline (that's what Phases 4 and 5 are for). What it does is make every error findable and fixable, instead of buried in a spreadsheet that nobody can reconstruct.

What This Pipeline Can't Do

We'd rather be upfront about the limitations than have you discover them after you've committed to a tool.

Handwritten documents — extraction from entirely handwritten utility records (still common in some regional water authorities) is unreliable. The pipeline will attempt extraction but flags everything as low confidence. You'll need human review.

Non-English bills — if you've got supplier invoices in Mandarin or Japanese for Scope 3 supply chain reporting, the pipeline handles them with lower accuracy. We're improving this, but for now, non-Latin scripts have roughly 15-20% more extraction errors than English documents.

Consolidated corporate invoices — some energy retailers issue a single invoice covering 50+ sites with a summary table and individual site details across dozens of pages. We handle these, but processing time is longer and the validation phase generates more flags because there are more opportunities for cross-page correlation errors.

Historical documents — bills from before 2015 tend to have different formats, lower scan quality, and sometimes use emission factors that have since been updated. The pipeline extracts data but can't retrospectively apply the correct historical NGA factors without manual configuration.

We think radical honesty about what AI can and can't do is more useful than a marketing page claiming 99.9% accuracy. The 99.9% number that some vendors quote typically comes from controlled benchmark conditions with clean, high-resolution, well-formatted documents. The real world includes coffee-stained photocopies and 14-page gas invoices. Our effective accuracy across the full range of document quality sits around 93-95% — which still beats manual transcription by a wide margin, but it isn't magic.

Why the Architecture Matters More Than the Model

The current vision model will be replaced by a better one. Probably within a year. The specific LLM sitting behind our vision-to-text phase is a component, not the product.

What matters is the architecture: the separation of concerns between agents, the validation gates between phases, the explicit uncertainty handling, and the audit trail that connects reported numbers to source documents. That architecture works regardless of which foundation model powers Phase 2. When we upgrade the underlying model — and we will — the extraction accuracy improves but the pipeline structure stays the same.

This is why we built Carbonly as a multi-agent system rather than a wrapper around a single API call. A wrapper breaks when the API changes. A pipeline with discrete, testable phases breaks in predictable, fixable ways.

If you're evaluating AI-powered carbon accounting tools, ask the vendor what happens when their model hallucinates a number. If the answer is "it doesn't" or "we have 99.9% accuracy," keep looking. The right answer is: "we detect it here, flag it for review, and maintain a full audit trail so you can verify any figure back to its source document."

That's not a feature. It's the minimum viable requirement for regulatory-grade emissions reporting.

