Emission Factor Matching: The Part of Carbon Accounting AI Actually Needs to Solve
The carbon math is just multiplication. The hard part is figuring out WHICH emission factor applies to WHICH line item. 'Diesel' on an invoice could match five different NGA factors. Here's how AI-driven emission factor matching actually works — our 5-tier system, where it fails, and why this matters more than extraction.
Last month we ran an experiment. We took a single fuel docket — "DIESEL 87.3L" printed on a fading thermal receipt — and asked five different carbon accounting tools to calculate the emissions. Four of them returned a number. All four were different. And only one disclosed which emission factor it actually used.
The math itself is trivially simple. Activity data times emission factor equals emissions. Everyone who's done this for more than a week knows the formula. But the formula isn't the problem. Emission factor matching is the problem — figuring out which of the 139+ NGA factors (and thousands more across global databases) applies to this specific line item, in this specific context.
That diesel docket? It could be transport fuel for a fleet vehicle (2.7 kg CO2-e per litre, NGA Factors 2025 Table 9). Or it could be stationary energy for an on-site generator (69.9 kg CO2-e per GJ, Table 8 — which you'd need to convert from litres using the energy content of 38.6 GJ per kilolitre). Same word on the receipt. Different table. Different factor. Different answer. And if you're reporting under NGER, applying Table 9 when you should've used Table 8 is an audit finding. Not a rounding error — an audit finding.
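The stationary path needs a unit conversion before the multiplication even starts. A minimal sketch of both calculation paths, using the figures quoted above (treat the constants as illustrative, not authoritative):

```python
LITRES = 87.3

# Transport path (NGA Table 9): the factor is already per litre.
TRANSPORT_EF = 2.7            # kg CO2-e per litre
transport_kg = LITRES * TRANSPORT_EF

# Stationary path (NGA Table 8): the factor is per GJ, so convert
# litres -> kilolitres -> GJ using diesel's energy content first.
ENERGY_CONTENT = 38.6         # GJ per kilolitre
STATIONARY_EF = 69.9          # kg CO2-e per GJ
energy_gj = (LITRES / 1000) * ENERGY_CONTENT
stationary_kg = energy_gj * STATIONARY_EF

print(f"Transport path:  {transport_kg:.1f} kg CO2-e")
print(f"Stationary path: {stationary_kg:.1f} kg CO2-e")
```

The point is the conversion chain, not the final digits: pick the wrong table and you've used the wrong method under the wrong NGER division, whatever the number says.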
The BCG GAMMA survey found companies estimate a 30-40% error rate in their emissions calculations. We'd bet most of that gap comes from factor selection, not from reading the wrong number off a document. You can extract "87.3 litres" perfectly and still get the emissions wrong by a factor of two if the system picks the wrong factor.
The matching problem is harder than the extraction problem
We've written about how our 7-phase AI pipeline extracts data from utility bills. Extraction gets a lot of attention because it's visible — document in, numbers out, satisfying before-and-after. But extraction is the easier half.
Once you've pulled "87.3 litres diesel" from a receipt, you need to answer a chain of questions that a human analyst does instinctively but software struggles with. Is this transport fuel or stationary fuel? What vehicle or equipment consumed it? Which NGA table applies? Was it standard diesel or a biodiesel blend? What's the reporting period and which year's emission factors should I use?
Electricity is its own minefield. You extract 25,000 kWh from a bill. Fine. But the Scope 2 emission factor depends on which state the site is in. Victoria's factor is 0.78 kg CO2-e per kWh. Tasmania's is 0.20. That's not a small difference — it's a 75% gap from the same consumption figure. If you don't know the site location (or the system assigns the wrong state), you can be off by almost four times.
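Two lines of arithmetic make the spread concrete (factors as quoted above):

```python
# Scope 2 emissions from the same 25,000 kWh under two state factors.
KWH = 25_000
FACTORS = {"VIC": 0.78, "TAS": 0.20}   # kg CO2-e per kWh

emissions = {state: KWH * ef for state, ef in FACTORS.items()}
print(emissions)                             # {'VIC': 19500.0, 'TAS': 5000.0}
print(emissions["VIC"] / emissions["TAS"])   # 3.9 -- same bill, 3.9x apart
```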
And then there's the really hard category. A line item that reads "$15,000 — IT consulting services." That's not a physical activity. It's a spend figure. It needs a spend-based emission factor in kg CO2-e per AUD, not an activity-based one in kg CO2-e per kWh or per litre. The factor depends on the industry sector code of the service provider. Spend-based factors are already 30-90% less precise than activity-based ones — but picking the wrong sector makes it even worse.
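The mechanics are the same multiplication, just in different units. A sketch using a hypothetical sector-average factor (the 0.05 kg CO2-e per AUD value is a placeholder for illustration, not a real database figure):

```python
# Spend-based estimate for the "$15,000 - IT consulting" line item.
# SECTOR_EF is a hypothetical placeholder, not a real sector factor.
SPEND_AUD = 15_000
SECTOR_EF = 0.05   # kg CO2-e per AUD, sector average (illustrative)

emissions_kg = SPEND_AUD * SECTOR_EF
print(f"{emissions_kg:.0f} kg CO2-e (order-of-magnitude estimate only)")
```

Note the unit on the factor: per dollar, not per kWh or per litre. Getting that unit wrong is exactly the mismatch Tier 5 is built to avoid.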
This is the problem that AI actually needs to solve. Not reading numbers off documents. Matching those numbers to the right factor.
How we built a 5-tier matching system (and why it needs all five tiers)
We didn't arrive at five tiers because it sounded good in a product spec. We arrived at five tiers because a single matching approach fails too often on real-world data.
Our material library holds 139+ pre-loaded NGA factors from DCCEEW, plus factors sourced from IPCC, Ecoinvent, EXIOBASE, IEA, Climatiq, BEIS, EPIC, and FootprintLab. Every extracted line item runs through the tiers in order. The first tier that returns a high-confidence match wins.
Tier 1 is a direct match. The extracted material name or code maps exactly to an entry in the library. "Grid electricity — NSW" hits the NGA Table 1 factor for NSW at 0.64 kg CO2-e/kWh. This is fast and dead accurate when it works. It works maybe 40% of the time on real documents, because real documents don't say "Grid electricity — NSW." They say "Supply charges — Ausgrid network" or just "Electricity" with no state identifier.
Tier 2 is an alias match. This is where things get interesting. Every time a user confirms or corrects a factor match, the system stores that mapping as an alias. So if three different users have confirmed that "Origin Energy supply charges" maps to grid electricity in NSW, the next time that phrase appears, Tier 2 catches it instantly. The library gets smarter with every correction. We've seen organisations go from 40% Tier 1 hits to 70%+ combined Tier 1+2 hits within a few months, as their alias library fills in the gaps specific to their retailers and invoice formats.
Tier 3 is an AI context match. The system maintains structured context for every material in the library — keywords, common aliases, what to look for on a document, related terms. When an extracted line item doesn't hit Tier 1 or 2, the system searches this context. "Unleaded petrol 91" doesn't appear as a material name, but the AI-generated context for the petrol emission factor includes "ULP," "91 octane," "unleaded," "petrol pump," and "service station fuel." Tier 3 makes the connection.
Tier 4 is a fuzzy match. Similarity scoring on material names. "Diesl" (a typo from OCR on a damaged receipt) gets matched to "Diesel" with a confidence penalty. "Natural Gas — Reticulated" gets matched to "Natural gas" in the library. This tier catches OCR errors and minor naming variations that the exact-match tiers miss. But it also generates more false positives, which is why it sits behind the more precise tiers.
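A minimal version of this kind of scoring can be built with Python's standard library. difflib here stands in for whatever similarity algorithm Tier 4 actually uses; the penalty value is illustrative:

```python
import difflib

# Toy material library; real entries come from the NGA workbook.
LIBRARY = ["Diesel", "Petrol", "Natural gas", "LPG"]

def fuzzy_match(name, penalty=0.1):
    """Return (best_material, confidence) with a fuzzy penalty applied."""
    scores = {m: difflib.SequenceMatcher(None, name.lower(), m.lower()).ratio()
              for m in LIBRARY}
    best = max(scores, key=scores.get)
    # Penalise inexact tiers so fuzzy hits rank below exact/alias hits.
    return best, round(scores[best] - penalty, 2)

print(fuzzy_match("Diesl"))   # -> ('Diesel', 0.81)
```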
Tier 5 is the LLM fallback. When nothing else works, the full line item — with all its surrounding context from the document — goes to a language model. The LLM receives the extracted text, the available factors in the library, and instructions about Australian emission factor conventions. It reasons about what the line item is and which factor applies.
This tier has a critical behaviour we deliberately built in: when the extracted item is a monetary amount rather than a physical quantity, the LLM prefers spend-based emission factors. "$45,000 catering services" gets routed to a kg CO2-e per AUD factor for food services, not a kg CO2-e per meal factor that would require a unit the document doesn't contain. This spend-versus-activity awareness prevents a whole class of unit mismatch errors.
Every match at every tier gets a confidence score. Below a configurable threshold, the system flags it for human review instead of silently applying a factor that might be wrong. That threshold matters more than most people think.
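Condensed into code, the cascade looks something like this. The data structures, confidence scores, and threshold are illustrative stand-ins, not the production implementation:

```python
import difflib

LIBRARY = {"Diesel": 2.7, "Petrol": 2.3, "Natural gas": 51.53}   # simplified
ALIASES = {"ULP": "Petrol"}             # learned from user confirmations
CONTEXT = {"Petrol": {"ulp", "91", "unleaded", "service station fuel"}}
REVIEW_THRESHOLD = 0.80                 # configurable per organisation

def match(item):
    """Run the tiers in order; the first confident hit wins."""
    # Tier 1: direct name match against the library
    if item in LIBRARY:
        return item, 1.0, "tier1-direct"
    # Tier 2: alias stored from a past user correction
    if item in ALIASES:
        return ALIASES[item], 0.95, "tier2-alias"
    # Tier 3: AI-generated context keywords
    for material, keywords in CONTEXT.items():
        if item.lower() in keywords:
            return material, 0.88, "tier3-context"
    # Tier 4: fuzzy similarity, with a confidence penalty
    scored = {m: difflib.SequenceMatcher(None, item.lower(), m.lower()).ratio()
              for m in LIBRARY}
    best = max(scored, key=scored.get)
    if scored[best] > 0.8:
        return best, scored[best] - 0.1, "tier4-fuzzy"
    # Tier 5: LLM fallback (stubbed out in this sketch)
    return None, 0.0, "tier5-llm"

material, confidence, tier = match("ULP")
needs_review = confidence < REVIEW_THRESHOLD
print(material, confidence, tier, needs_review)   # Petrol 0.95 tier2-alias False
```

Delete the alias and the same item falls through to Tier 3 via the context keywords; mangle it to "Diesl" and Tier 4 catches it at a lower confidence. That graceful degradation is the whole design.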
A worked example: what actually happens when you upload a diesel receipt
Here's a real scenario (anonymised). A construction company's fuel docket reads:
FUEL 142.6L ULP — $248.92 BP EASTLINK 18/01/2026 14:37
Phase 3 of the pipeline extracts: quantity = 142.6, unit = litres, material = "ULP", cost = $248.92, site context = BP Eastlink (Victoria).
The matching engine starts working.
Tier 1: Looks for "ULP" as a direct material name. No exact match — the library has "Petrol — gasoline for use as a transport fuel" from the NGA Factors workbook, not "ULP."
Tier 2: Checks aliases. A previous user confirmed "ULP" maps to the petrol transport factor. Match found. Confidence: 0.95. Factor applied: approximately 2.3 kg CO2-e per litre (NGA Factors 2025, transport petrol, Scope 1).
The system applies the factor: 142.6 L × 2.3 kg CO2-e/L = approximately 328 kg CO2-e.
If there had been no alias? Tier 3 would have caught it — "ULP" is in the AI-generated context for the petrol factor, listed alongside "unleaded," "91," "95," "98," and "petrol pump." Confidence would have been slightly lower (maybe 0.88), but still above the review threshold.
Now change the scenario. Same 142.6 litres, but the docket says "DIESEL" and the site is a mine site with both haul trucks (transport) and generators (stationary). Tier 2 aliases might map it to transport diesel by default. But if this organisation has confirmed that fuel purchased at this specific site goes into generators, the alias would route to stationary diesel instead. Same word, different factor, different emissions. The alias learning is site-aware, which matters enormously for companies where the same fuel serves different purposes at different locations.
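One way to sketch site-aware aliasing is a lookup keyed by (site, material) with a material-only fallback. The key scheme here is an assumption for illustration; the article doesn't specify how site context is actually stored:

```python
# Site-specific aliases take priority; (None, material) is the default.
ALIASES = {
    ("Mine Site A", "DIESEL"): "Diesel - stationary (generators)",
    (None, "DIESEL"): "Diesel - transport",
}

def resolve(site, material):
    """Prefer a site-specific mapping, else fall back to the default."""
    return ALIASES.get((site, material), ALIASES.get((None, material)))

print(resolve("Mine Site A", "DIESEL"))   # stationary factor for this site
print(resolve("Depot B", "DIESEL"))       # falls back to the transport default
```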
Where the system still breaks
We're not going to pretend this is solved. It isn't.
Novel materials with no close match. A client uploaded invoices for "hydrotreated vegetable oil" (HVO100 renewable diesel). When we first encountered it, the NGA Factors workbook didn't have a specific factor. Tier 5 made a reasonable guess, suggesting a biodiesel blend factor. It wasn't wrong, exactly, but it wasn't right either — HVO has different lifecycle emissions characteristics. A human had to research it, select the appropriate factor, and the system learned from the correction. But that first instance? It needed a person.
Confidence scores aren't certainty scores. A match returned at 0.92 confidence feels reassuring. But we've seen 0.92 matches that were wrong — typically where a line item is ambiguous between two plausible factors. "Gas" could be natural gas (stationary energy) or LPG (bottled gas). Context usually resolves it, but not always. High confidence means the system is fairly sure. It doesn't mean the system is right.
The learning loop depends on humans actually reviewing. The alias system is powerful, but only if people correct wrong matches when they see them. If a user accepts every suggestion without checking, bad aliases get stored and propagated. We've built nudges into the review interface — flagging first-time matches, highlighting low-confidence items — but we can't force people to pay attention. The system is as good as the care its users put into the first few months of training it.
Spend-based matching is inherently imprecise. When an invoice line is "$12,500 — professional services" and Tier 5 selects a spend-based factor for "professional, scientific and technical services," that factor is a sector average. It doesn't know whether those services came from a one-person home office or a firm running data centres. Spend-based factors can differ from activity-based results by a factor of two, and that's before considering which sector code was chosen. We're honest about this: spend-based matching is a starting point. It gets you into the right order of magnitude. It doesn't get you audit-grade precision.
LLM hallucination in Tier 5 is a real risk. Language models are confident by nature. They don't say "I don't know" easily. When Tier 5 encounters something genuinely ambiguous, it sometimes invents a plausible-sounding rationale for a factor match that's actually wrong. That's why Tier 5 is the last resort, not the default — and why its confidence threshold is set higher than the other tiers. Every Tier 5 match gets extra scrutiny in the review queue.
Why this matters for NGER and ASRS compliance
The Clean Energy Regulator doesn't just check whether you reported emissions. It checks whether you applied the right methodology. Under NGER, diesel used for transport is reported under Division 2.3 of the NGER Measurement Determination. Diesel used for stationary energy falls under Division 2.2. Different divisions, different methods, different emission factors from different NGA tables. Apply the wrong one and your NGER submission has a methodological error — the kind the ANAO found in 72% of the 545 reports it audited.
Under ASRS, the stakes rise again. ASSA 5010 phases in assurance requirements progressively — limited assurance on Scope 1 and 2 from Year 1, expanding to reasonable assurance by Year 4. When an auditor reviews your Scope 1 emissions, they won't just check that you multiplied correctly. They'll trace the emission factor back to the NGA Factors workbook and verify that you selected the right factor for the right application. An audit trail that shows which factor was applied and why is no longer optional — it's what separates an assurable disclosure from a guess.
This is where the 5-tier system earns its complexity. Every match gets logged with the tier that produced it, the confidence score, the factor source (NGA table number, database reference), and the reasoning chain. If an auditor asks "why did you apply transport diesel instead of stationary diesel to this receipt?", the system can answer: "Tier 2 alias match, confirmed by user X on date Y, based on site Z being classified as fleet refuelling."
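A plausible shape for that per-match audit record, sketched as a dataclass (the field names are illustrative, not the production schema):

```python
from dataclasses import dataclass, asdict

@dataclass
class MatchRecord:
    line_item: str          # raw extracted text
    material: str           # library entry the match resolved to
    tier: str               # which tier produced the match
    confidence: float       # score at match time
    factor_source: str      # e.g. "NGA Factors 2025, Table 9"
    reasoning: str          # why this factor, in plain language

record = MatchRecord(
    line_item="DIESEL 87.3L",
    material="Diesel - transport",
    tier="tier2-alias",
    confidence=0.95,
    factor_source="NGA Factors 2025, Table 9",
    reasoning="Alias confirmed by user; site classified as fleet refuelling",
)
print(asdict(record))
```

Serialise that alongside every calculated figure and the auditor's "why this factor?" question has a one-line answer.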
That level of traceability is what turns emission factor matching from a hidden assumption into a documented, auditable decision.
The difference between matching and guessing
Most carbon accounting tools treat factor selection as a configuration step. You set up your emission factors once — diesel is 2.7 kg CO2-e per litre, electricity is the national average, natural gas is 51.53 kg CO2-e per GJ — and then apply them to everything. That works until it doesn't. Until you have a Victorian site and a Tasmanian site on the same electricity factor. Until diesel is going into both trucks and generators. Until an invoice shows up with a line item nobody anticipated.
The shift from configuration to matching is fundamental. Configuration assumes you know all the answers upfront. Matching assumes you don't — and builds a system that figures out the right answer for each line item, learns from corrections, and documents its reasoning.
We built Carbonly's material library with 139+ NGA factors because the Australian context demands specificity. A tool built for the US market with EPA factors bolted on doesn't know that natural gas reported in GJ uses a different NGA factor than natural gas reported in cubic metres. It doesn't know that WA has two separate electricity grids (SWIS and NWIS) with different emission factors. It doesn't know about the AR5-to-AR6 GWP discrepancy between NGER and AASB S2.
We're not sure we've got every edge case covered yet. We're still finding new material names on invoices that surprise us — "R-454B refrigerant blend" showed up for the first time last month, and our library didn't have a factor for it. Someone on the team had to look up the GWP, create the material entry, and let the learning loop take it from there.
That's the honest reality. Emission factor matching isn't a solved problem. It's a problem that gets less wrong over time, if the system is built to learn. The gap between "less wrong over time" and "perfect from day one" is where most of the real work in carbon accounting happens.
Pick any invoice from your last NGER submission and trace the emission factor back to the NGA workbook. If you can't explain which table it came from and why — not the number, the why — that's your starting point.