Carbon Data Extraction: Why 72% of Emissions Reports Contain Errors

Carbon data extraction is where most emissions reports go wrong. The ANAO found 72% of NGER submissions contained errors, and manual data entry is the root cause. Here's how AI extraction actually works, where it fails, and what to look for in a tool.

Denis Kargl · February 23, 2026 · 9 min read
Carbon Data Extraction · AI Automation · Document Processing

Last quarter, a construction company sent us a box of problems. Not literally a box — a shared drive with 10,000 fuel receipts from a single quarter. Diesel dockets from 47 different service stations. Some were thermal-printed and already fading. A few were photographed at angles that made you question whether the person holding the phone was driving at the time.

Their sustainability analyst had been copying fuel volumes from these receipts into a spreadsheet. One by one. For three months. She'd made it through about 2,100 before the NGER deadline started getting uncomfortable.

This is what carbon data extraction actually looks like at most Australian companies. Not a clean API integration. Not a tidy CSV export from an ERP system. It's someone squinting at a faded docket trying to figure out whether that number says 142.6 litres or 1,426 litres. And the difference between those two figures — when multiplied by 2.7 kg CO2-e per litre of diesel — is the difference between 385 kg and 3,850 kg of Scope 1 emissions. Per receipt.
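To make the stakes concrete, here is the arithmetic behind that misread decimal, sketched in TypeScript. The 2.7 kg CO2-e per litre figure is the rounded diesel factor used above, not a current NGA lookup:

```typescript
// Illustrative only: how a single misread decimal point scales through
// the rounded diesel Scope 1 factor quoted in the text.
const DIESEL_KG_CO2E_PER_LITRE = 2.7;

function dieselEmissionsKg(litres: number): number {
  return litres * DIESEL_KG_CO2E_PER_LITRE;
}

const correct = dieselEmissionsKg(142.6); // ≈ 385 kg
const misread = dieselEmissionsKg(1426);  // ≈ 3,850 kg
console.log(`misread overstates by ${(misread - correct).toFixed(0)} kg CO2-e`);
```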

The ANAO audited NGER submissions and found that 72% of 545 reports contained errors, with 17% including significant errors. The most common problems? Gaps in electricity data, missing sources, and errors in facility aggregates. These aren't sophisticated calculation mistakes. They're data entry problems. Carbon data extraction problems, specifically — the bit where numbers move from source documents into whatever system you're using to calculate emissions.

The manual extraction tax

Here's a number that should bother anyone with budget responsibility: BCG's 2021 GAMMA survey found that companies estimate a 30-40% error rate in their emissions measurements. Their 2022 follow-up showed marginal improvement to 25-30%. And only 9% of organisations could measure their total greenhouse gas emissions with any real frequency or accuracy.

We've seen this pattern across every industry we've worked in. A sustainability team's time breaks down roughly like this: 60% data collection and entry, 25% reconciliation and error correction, 15% actual analysis and reporting. That ratio is backwards. The people who understand your emissions profile are spending most of their hours doing work that a machine should do.

For a mid-market Australian company with 30 sites, the manual extraction workload looks something like this: 4 utility types per site (electricity, gas, water, waste), monthly billing cycles, 12 months per reporting year. That's 1,440 documents. Each one needs consumption quantity, unit of measurement, billing period dates, meter identifier, and site address pulled out and entered correctly. At 8-10 minutes per document including verification — and that's optimistic — you're looking at roughly 200 hours of pure data entry per year.

At $95,000 fully loaded cost for a sustainability analyst in Sydney, that's about $9,100 in salary spent copying numbers from PDFs. Before anyone's calculated a single emission.
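The workload estimate above can be reproduced in a few lines. This is a back-of-envelope model, not a costing tool; the inputs (30 sites, 8.5 minutes per document as the midpoint of the 8-10 minute range, a $95,000 fully loaded salary over roughly 2,080 working hours) are the example figures from the text:

```typescript
// Back-of-envelope model of the manual extraction workload described above.
function annualDocuments(sites: number, utilityTypes = 4, monthsPerYear = 12): number {
  return sites * utilityTypes * monthsPerYear;
}

function annualEntryHours(documents: number, minutesPerDoc = 8.5): number {
  return (documents * minutesPerDoc) / 60;
}

function entryCostAud(hours: number, fullyLoadedSalary = 95_000, workHoursPerYear = 2_080): number {
  return hours * (fullyLoadedSalary / workHoursPerYear);
}

const docs = annualDocuments(30);     // 1,440 documents
const hours = annualEntryHours(docs); // 204 hours at the 8.5-minute midpoint
const cost = entryCostAud(hours);     // ≈ A$9,300
```

At the midpoint this lands slightly above the rounded figures quoted above (about 204 hours and A$9,300 rather than 200 hours and $9,100); the order of magnitude is the point.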

And the errors compound. A study of operational spreadsheets by Dundalk Institute of Technology found that 94% contained errors. Not carbon-specific spreadsheets. All spreadsheets. But carbon reporting adds extra failure modes: unit confusion (kWh vs MWh vs GJ), billing period misalignment, estimated reads recorded as actual, and the perennial favourite — applying the wrong state-based emission factor. Using the national average of 0.62 kg CO2-e/kWh instead of South Australia's 0.22 overstates your Scope 2 by 182%. We wrote a detailed walkthrough of how these calculation errors happen and how to avoid them.
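The state-factor mistake is easy to verify. The 0.62 and 0.22 figures below are the example values quoted above, not a current-year lookup:

```typescript
// Sketch of the wrong-state-factor error: applying the national average
// factor to South Australian electricity consumption.
function scope2Kg(kwh: number, factorKgPerKwh: number): number {
  return kwh * factorKgPerKwh;
}

function overstatementPct(wrongFactor: number, rightFactor: number): number {
  return ((wrongFactor - rightFactor) / rightFactor) * 100;
}

overstatementPct(0.62, 0.22); // ≈ 182%
```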

Why template-based OCR doesn't solve this

The first instinct most companies have is "better OCR." Just scan the documents and let optical character recognition pull the text out. This works in banking and insurance where every statement from the same institution looks identical. Build a template once, extract forever.

Utility bills aren't like that. AGL formats their electricity bills differently from Origin. EnergyAustralia's gas invoices look nothing like Alinta's. Regional water authorities each have their own layouts. And they all change their formats periodically — because some product team decided the bill needed a redesign.

We counted over 40 distinct bill formats from Australian energy retailers alone. Template-based OCR means building and maintaining 40+ templates. Every time a retailer updates their layout, someone needs to notice the extraction broke, build a new template, test it, and deploy it. That maintenance burden is why most template OCR projects for carbon reporting stall within 18 months.

The deeper problem is that template OCR extracts text but doesn't understand meaning. It can tell you there's a number "12,450" at coordinates (x, y) on the page. It can't tell you that number represents electricity consumption in kWh for the billing period ending 30 September. That semantic understanding — knowing which numbers matter for emissions and which are noise — requires something fundamentally different.

What AI-powered carbon data extraction actually looks like

When we say AI extraction, we don't mean OCR with a chatbot attached. We mean a multi-agent pipeline where specialised AI agents handle discrete tasks in sequence, with validation gates between each step.

At Carbonly, we built a 7-phase pipeline because no single model handles the full problem well. An agent that's excellent at reading damaged scans is mediocre at unit conversion. One that nails emission factor lookups can't reliably distinguish a water bill from a waste manifest. So we split it.

Phase 1 — Classification. Every document enters with zero assumptions. We don't trust filenames. The classification agent examines the document itself and determines the type: electricity, gas, water, fuel, waste. This matters because document type determines emission scope (Scope 1 vs Scope 2), which NGA emission factors apply, and which NGER category the emissions fall into.
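As a rough illustration of why classification has to come first, here is a hypothetical document-type-to-scope mapping. The mapping is a simplification (gas, water, and waste treatment depends on how a site actually operates), and it is not Carbonly's routing logic:

```typescript
// Illustrative: document type drives which emission scope applies.
type DocumentType = "electricity" | "gas" | "water" | "fuel" | "waste";
type Scope = "scope1" | "scope2" | "scope3";

function scopeFor(doc: DocumentType): Scope {
  switch (doc) {
    case "electricity":
      return "scope2"; // purchased electricity
    case "gas":
    case "fuel":
      return "scope1"; // fuel combusted directly
    case "water":
    case "waste":
      return "scope3"; // typically supply-chain categories
  }
}
```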

Phase 2 — Vision-to-Text. A multimodal AI model reads the document image — scanned PDF, phone photo, born-digital invoice — and produces a structured text representation. Unlike template OCR, it reads layout and context. It understands that a number appearing after "Total Usage" and before "kWh" is a consumption figure, regardless of where on the page those elements sit. We support 8 input formats: PDF, CSV, Excel, Word, PowerPoint, RTF, images, and ZIP archives.

Phase 3 — Extraction. The extraction agent pulls out exactly the fields needed for emissions: consumption quantity, unit, billing period, meter number, site address, supplier name, and whether the reading is actual or estimated. Each field gets a confidence score from 0 to 100.
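One plausible shape for an extraction result with per-field confidence, sketched as TypeScript types. The field names follow the list above; the structure itself is illustrative, not Carbonly's schema:

```typescript
// Every extracted field carries its own 0-100 confidence score.
interface ExtractedField<T> {
  value: T;
  confidence: number; // 0-100
}

interface ExtractionResult {
  quantity: ExtractedField<number>;
  unit: ExtractedField<string>;
  billingStart: ExtractedField<string>; // ISO date
  billingEnd: ExtractedField<string>;
  meterNumber: ExtractedField<string>;
  siteAddress: ExtractedField<string>;
  supplier: ExtractedField<string>;
  isEstimatedRead: ExtractedField<boolean>;
}

// The weakest field determines whether the document needs human review.
function lowestConfidence(r: ExtractionResult): number {
  return Math.min(...Object.values(r).map((f) => f.confidence));
}
```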

Phase 4 — Validation. This is where most tools stop. We don't. The validation agent checks whether extracted data makes sense. If an electricity bill shows 2,450,000 kWh for a small office, something's wrong. It flags anomalies, missing fields, and unit inconsistencies.
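A validation gate of this kind can start as simple as a plausibility ceiling per site type. The thresholds below are invented for illustration; a real system would derive them from historical consumption:

```typescript
// Illustrative plausibility ceilings (kWh per billing period).
const MAX_PLAUSIBLE_KWH: Record<string, number> = {
  smallOffice: 50_000,
  warehouse: 400_000,
  manufacturing: 5_000_000,
};

// Flag consumption that is wildly out of range for the site type.
function flagAnomaly(siteType: string, kwh: number): boolean {
  const ceiling = MAX_PLAUSIBLE_KWH[siteType];
  return ceiling !== undefined && kwh > ceiling;
}

flagAnomaly("smallOffice", 2_450_000); // true -> route to human review
```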

Phase 5 — Normalisation. Everything gets converted to consistent units. GJ to kWh. Litres to kilolitres. Billing periods aligned to reporting quarters. This step catches the unit conversion errors that cause a disproportionate number of NGER restatements.
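The normalisation step is mostly fixed unit arithmetic. A minimal sketch, using the standard conversions (1 GJ = 1000/3.6 kWh, 1 kL = 1,000 L):

```typescript
// Standard energy conversion: 1 GJ ≈ 277.78 kWh.
const KWH_PER_GJ = 1000 / 3.6;

function normaliseEnergy(value: number, unit: "kWh" | "MWh" | "GJ"): number {
  switch (unit) {
    case "kWh":
      return value;
    case "MWh":
      return value * 1000;
    case "GJ":
      return value * KWH_PER_GJ;
  }
}

function litresToKilolitres(litres: number): number {
  return litres / 1000;
}
```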

Phase 6 — Emission Calculation. The correct NGA emission factors get applied automatically — state-based Scope 2 for electricity, fuel-specific for Scope 1, appropriate GWPs. Victoria's 0.78 kg CO2-e/kWh is not Tasmania's 0.20. The system knows the difference.
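Sketched as a lookup, with the state factors quoted above as example values. A real system must load current-year NGA factors rather than hard-code them:

```typescript
// Example state-keyed Scope 2 factors (kg CO2-e per kWh) from the text,
// NOT a live NGA lookup.
const SCOPE2_FACTOR_KG_PER_KWH: Record<string, number> = {
  VIC: 0.78,
  TAS: 0.2,
  SA: 0.22,
  NATIONAL: 0.62,
};

function scope2EmissionsKg(kwh: number, state: string): number {
  const factor = SCOPE2_FACTOR_KG_PER_KWH[state];
  if (factor === undefined) throw new Error(`no factor for state: ${state}`);
  return kwh * factor;
}

scope2EmissionsKg(10_000, "VIC"); // ≈ 7,800 kg
scope2EmissionsKg(10_000, "TAS"); // ≈ 2,000 kg
```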

Phase 7 — Audit Trail. Every calculated emission links back through the chain: emission factor used, normalised consumption, raw extracted value, and the exact location in the source document. This is what the CER expects. It's what Beach Energy didn't have when they got hit with an enforceable undertaking in July 2025.
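One way to represent that chain is a record where every emission figure carries its provenance. Field names and the example factor ID are illustrative, not Carbonly's schema:

```typescript
// Every calculated emission links back to the exact spot in the source document.
interface AuditTrailEntry {
  emissionKgCo2e: number;
  factorId: string;          // e.g. "NGA-2025-diesel" (hypothetical identifier)
  normalisedValue: number;
  normalisedUnit: string;
  rawExtractedValue: string; // exactly as read from the document
  sourceDocumentId: string;
  sourcePage: number;
  sourceRegion: { x: number; y: number; width: number; height: number };
}

// An auditor-ready row must carry its full provenance.
function hasFullProvenance(e: AuditTrailEntry): boolean {
  return e.sourceDocumentId.length > 0 && e.factorId.length > 0 && e.sourcePage >= 1;
}
```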

The 5-tier material matching problem nobody talks about

Extracting the right number from a document is only half the problem. The other half is matching that extracted data to the correct emission factor. And this is where most carbon accounting tools quietly fall apart.

When you extract "diesel" from a fuel receipt, that seems straightforward. But receipts say all kinds of things. "Automotive diesel," "ULSD," "B5 biodiesel blend," "fuel — site delivery," or just "product 2" with no description at all. Each needs to map to the right NGA emission factor, and the differences aren't trivial. Standard diesel is 69.9 kg CO2-e per GJ. A B20 biodiesel blend has a different factor for the bio component.

We built a 5-tier matching system because a single lookup approach fails too often:

Tier 1 — Direct name match. The extracted material name matches something in our library of 139+ pre-loaded NGA emission factors exactly. This handles the clean cases.

Tier 2 — Alias matching. "Automotive diesel" maps to "diesel oil" in the NGA workbook. "Nat gas" maps to "natural gas." We maintain an alias table that grows with use.

Tier 3 — AI context matching. The AI agent reads the surrounding document context. A receipt from a service station listing "product" with a price of $1.89/litre in Queensland is almost certainly diesel. The context — retailer type, price range, unit of sale — disambiguates what the label alone can't.

Tier 4 — Fuzzy and vector matching. For near-misses and misspellings. "Diesl" still needs to match. So does "unleaded petrol 91" when your factor library lists "gasoline — regular unleaded."

Tier 5 — AI fallback. When all else fails, a large language model evaluates the full context and suggests the most likely match, flagged at lower confidence for human review.
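The non-AI tiers can be sketched compactly. The snippet below implements Tier 1, Tier 2, and a token-level version of Tier 4; Tiers 3 and 5 are omitted because they require a model call. Factor names, aliases, and confidence scores are illustrative:

```typescript
// Tiny illustrative factor library and alias table.
const FACTORS = ["diesel oil", "natural gas", "gasoline - regular unleaded"];
const ALIASES: Record<string, string> = {
  "automotive diesel": "diesel oil",
  "nat gas": "natural gas",
};

// Standard Levenshtein edit distance, for near-miss spellings.
function editDistance(a: string, b: string): number {
  const d = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)),
  );
  for (let i = 1; i <= a.length; i++)
    for (let j = 1; j <= b.length; j++)
      d[i][j] = Math.min(
        d[i - 1][j] + 1,
        d[i][j - 1] + 1,
        d[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1),
      );
  return d[a.length][b.length];
}

function matchMaterial(raw: string): { match: string; confidence: number } | null {
  const name = raw.trim().toLowerCase();
  if (FACTORS.includes(name)) return { match: name, confidence: 100 }; // Tier 1
  if (ALIASES[name]) return { match: ALIASES[name], confidence: 90 };  // Tier 2
  // Tier 4: fuzzy match against each token of each factor name.
  for (const f of FACTORS)
    for (const token of f.split(" "))
      if (editDistance(name, token) <= 2) return { match: f, confidence: 70 };
  return null; // would fall through to the AI tiers (3 and 5) in the real pipeline
}
```

A production matcher would also need vector similarity for Tier 4 and much larger tables, but the cascade structure (exact, alias, fuzzy, with confidence dropping at each step) is the core idea.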

Every match below Tier 1 gets a reduced confidence score. Items scoring below the threshold land in a "Needs Review" bucket. And here's where it gets clever: when a human corrects a match, that correction feeds back into the system. The materialLearning table captures what was wrong, what it should have been, and the context. Next time, Tier 2 handles it automatically. The system gets better the more you use it.
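The feedback loop itself is conceptually simple: a confirmed human correction becomes a lookup the next document hits directly. In this sketch, materialLearning is a plain in-memory map standing in for the table mentioned above:

```typescript
// In-memory stand-in for a persisted corrections table.
const materialLearning = new Map<string, string>();

// A human correction is stored keyed on the normalised raw label...
function recordCorrection(rawLabel: string, correctFactor: string): void {
  materialLearning.set(rawLabel.trim().toLowerCase(), correctFactor);
}

// ...so the same label resolves automatically next time.
function resolveLearned(rawLabel: string): string | undefined {
  return materialLearning.get(rawLabel.trim().toLowerCase());
}

recordCorrection("product 2", "diesel oil"); // analyst fixes one mismatch
resolveLearned("Product 2");                 // → "diesel oil" next quarter
```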

We're not going to pretend this is perfect. Some fuel receipts genuinely don't contain enough information to determine the product with certainty. Thermal-printed dockets from independent service stations are the worst — faded ink, abbreviated product codes that mean nothing outside that specific station. For those, you need a human in the loop. But you need a human reviewing 5% of documents, not manually entering 100%.

How to evaluate a carbon data extraction tool

If you're looking at automating this part of your reporting — and with ASRS Group 2 mandatory reporting starting July 2026, you probably should be — here's what actually matters in a tool. Not what's on the feature comparison page. What matters when 1,440 documents hit the system in September and your NGER deadline is 31 October.

Does it handle your actual documents? Not demo PDFs. Your documents. The faded scan from the regional water authority. The multi-page gas invoice where consumption is on page 9. The phone photo of a fuel docket. Ask for a trial with your real data. If a vendor won't do that, they know their system can't handle it.

Does it show you confidence scores? A system that returns data without telling you how confident it is in the extraction is just a faster way to introduce errors. You need to know which extractions are solid and which need a human to verify. Anything that quietly makes up numbers when it's uncertain — and LLMs absolutely will do this if you don't build against it — is worse than manual entry. At least with manual entry, a human saw the source document.

Does it match to Australian emission factors? Plenty of international tools extract data beautifully and then apply US EPA or DEFRA emission factors. If you're reporting under NGER, you need NGA Factors. State-level, not national average. Current year, not whatever was loaded when the software shipped. We explained why this matters so much and how the factors differ by state.

Does it produce an audit trail? Under NGER, you must keep records for five years from the end of the reporting year. Under ASRS, your auditor will want to trace any emissions figure back to the source document. If the extraction tool gives you a number but can't show you where in the original document that number came from, you've traded one audit trail problem for another.

Does it learn from corrections? Every correction your team makes should improve the next extraction. If you're fixing the same misclassification every quarter because the system treats each document as if it's never seen anything like it before, that's template OCR in disguise.

What we still haven't solved

Honesty matters here. Carbon data extraction from Scope 1 and 2 source documents — utility bills, fuel receipts, gas invoices — is a problem we've largely cracked. Not perfectly, but well enough that the error rate from AI extraction plus human review of flagged items is materially lower than fully manual data entry.

Scope 3 is different. When you're trying to extract emissions-relevant data from supplier invoices — and those suppliers range from multinational logistics companies to a bloke with a ute who delivers gravel — the document quality and format variation is an order of magnitude worse. We wrote about the practical challenges of collecting Scope 3 supplier data separately because it deserves its own honest treatment.

We're also still working on cross-referencing extracted data against utility company portals. Some retailers offer CSV exports that could validate or replace what we extract from bills. Building those integrations is straightforward technically but slow commercially — getting API access from energy retailers in Australia is not a fast process.

And scanned documents below about 150 DPI remain genuinely difficult. Our OCR pre-processing (Tesseract.js with contrast enhancement and resolution upscaling) helps, but a 72 DPI fax of a fax isn't going to yield reliable data from any system. Some documents just need a human to look at them. That's not a failure of the technology. That's the reality of working with the documents Australian businesses actually have.

The construction company with 10,000 fuel receipts? Their analyst now reviews about 400 flagged items per quarter instead of manually entering all 10,000. That's a 96% reduction in manual handling. Not because the AI is perfect. Because 96% of those receipts are clean enough for the system to extract, match, and calculate with high confidence. The remaining 4% get human attention where it actually matters — on the ambiguous cases.

If your team is still copying numbers from PDFs into spreadsheets, and you've got an NGER deadline in October or an ASRS obligation starting in 2026, the maths on automating carbon data extraction isn't even close. Start with your messiest document type. For most companies, that's fuel receipts. Run them through a real trial — not a demo — and see what comes back. The confidence scores will tell you everything you need to know about whether the tool is ready for your data.
