Confidence Scoring — Why Penny Tells You When She's Unsure

The Accounted Editorial Team · 6 March 2026 · 8 min read

There's a quiet revolution happening in how AI communicates with humans, and it has nothing to do with generating images or writing essays. It's about honesty. Specifically, it's about AI systems being upfront when they're not sure about something.

In the world of bookkeeping, this matters enormously. A mistake in how a transaction is categorised might only be a few pounds here and there, but across a full tax year, those small errors compound. Miscategorised expenses can mean you're paying too much tax, or too little — and HMRC takes a dim view of both.

That's why we built confidence scoring into Penny from the very beginning. It's not a bolt-on feature or a nice-to-have. It's fundamental to how Accounted works, and we think it's one of the most important things that separates genuinely useful AI from the sort that causes more problems than it solves.

The Problem With Overconfident AI

Most AI tools present their outputs as if they're always right. Ask an AI to categorise a transaction, and it'll categorise it — no caveats, no uncertainty, no indication of whether it's 99% sure or 50% sure. It just puts the transaction in a box and moves on.

This creates a dangerous illusion of accuracy. When everything looks definitive, you naturally assume it's all correct. You stop checking. You trust the system. And most of the time, that trust is justified — AI categorisation is genuinely good at the routine stuff.

But "most of the time" isn't good enough when your tax return is at stake. That 5% of transactions that the AI isn't sure about? Those are the ones that matter most. They're the edge cases — the unusual purchases, the new suppliers, the transactions that could be business or personal depending on context that only you know.

If the AI silently miscategorises those transactions and you never review them, you end up with inaccurate records. And inaccurate records lead to inaccurate tax returns, which lead to either overpaying (costing you money) or underpaying (risking penalties from HMRC).

How Penny's Confidence Scoring Works

When Penny categorises a transaction, she doesn't just assign a category. She also generates a confidence score — a numerical assessment of how certain she is about her categorisation.

Think of it like a weather forecast. When the Met Office says there's a 95% chance of rain, you take an umbrella. When they say 50%, you might take one just in case. And when they say 20%, you probably leave it at home but accept you might get wet. The percentage tells you how much to trust the prediction.

Penny's confidence scores work the same way:

High confidence (85%+): Penny is very sure about this categorisation. It's a transaction she's seen before, from a merchant she recognises, fitting a clear pattern. These transactions are categorised automatically, and you don't need to do anything unless you want to review them.

Medium confidence (60% to just under 85%): Penny thinks she knows the right category, but there's some ambiguity. Maybe it's a merchant she hasn't seen often, or the transaction could reasonably fall into more than one category. She'll categorise it but flag it for your attention.

Low confidence (below 60%): Penny isn't sure. Rather than guessing and potentially getting it wrong, she'll present you with the transaction and ask you to tell her the right category. This might be a new merchant, an unusual amount, or a transaction that's genuinely ambiguous without additional context.
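
For the technically curious, the three bands can be sketched in a few lines of Python. This is an illustration only, not Penny's actual code: the names `route`, `Action`, `HIGH_THRESHOLD` and `MEDIUM_THRESHOLD` are made up for the example, and we assume a score of exactly 85% counts as high confidence.

```python
from enum import Enum

# Band boundaries taken from the descriptions above.
HIGH_THRESHOLD = 0.85    # 85% and above: handled automatically
MEDIUM_THRESHOLD = 0.60  # below 60%: Penny asks rather than guesses


class Action(Enum):
    """What happens to a transaction in each band (hypothetical names)."""
    AUTO_CATEGORISE = "categorise automatically"
    CATEGORISE_AND_FLAG = "categorise, then flag for review"
    ASK_USER = "ask the user for the category"


def route(confidence: float) -> Action:
    """Map a confidence score between 0 and 1 to one of the three bands."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence must be between 0 and 1, got {confidence}")
    if confidence >= HIGH_THRESHOLD:
        return Action.AUTO_CATEGORISE
    if confidence >= MEDIUM_THRESHOLD:
        return Action.CATEGORISE_AND_FLAG
    return Action.ASK_USER


# Three illustrative scores, one per band.
for score in (0.97, 0.72, 0.35):
    print(f"{score:.0%}: {route(score).value}")
```

The point of the sketch is that the score alone decides how much of your attention a transaction gets. Nothing else about the transaction needs to be inspected at this step.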

The thresholds aren't arbitrary — they're calibrated based on actual accuracy data across thousands of transactions. When Penny says she's 90% confident, she's right about 90% of the time. That calibration is important because it means the confidence scores are genuinely meaningful, not just feel-good numbers.
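
To show what "calibrated" means in practice, here is a rough sketch of how such a check could be run. It is an assumption about the general technique, not a description of Accounted's internal tooling: the function `calibration_table` and the sample data are invented for illustration. The idea is to group past categorisations by their stated confidence and compare that with how often they turned out to be right.

```python
from collections import defaultdict


def calibration_table(predictions):
    """Compare stated confidence with observed accuracy.

    `predictions` is an iterable of (confidence, was_correct) pairs, with
    confidence between 0 and 1. Results are grouped into buckets that are
    ten percentage points wide.
    """
    buckets = defaultdict(list)
    for confidence, was_correct in predictions:
        percent = round(confidence * 100)
        index = min(percent // 10, 9)  # a score of 100% joins the 90-100% bucket
        buckets[index].append((confidence, was_correct))

    rows = []
    for index in sorted(buckets):
        items = buckets[index]
        mean_confidence = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        label = f"{index * 10}-{index * 10 + 10}%"
        rows.append((label, mean_confidence, accuracy, len(items)))
    return rows


# Made-up history: ten categorisations scored at 90% and ten scored at 70%.
sample = (
    [(0.9, True)] * 9 + [(0.9, False)]
    + [(0.7, True)] * 7 + [(0.7, False)] * 3
)

for label, mean_confidence, accuracy, count in calibration_table(sample):
    print(f"{label}: stated {mean_confidence:.0%}, observed {accuracy:.0%} (n={count})")
```

In a well-calibrated system the stated and observed columns stay close to each other in every bucket. A large gap in either direction would suggest the scores are over- or under-confident.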

Why This Approach Is Better

The temptation when building an AI system is to make it look as autonomous as possible. Asking the user for input feels like an admission of weakness. Wouldn't it be better if Penny just handled everything silently?

No. And here's why.

Accuracy trumps convenience. Yes, it would be slightly more convenient if Penny never asked you anything. But the occasional question — taking perhaps 10 seconds to answer — is a tiny price to pay for significantly more accurate records. When your tax return depends on those records being right, a few seconds of human input is incredibly valuable.

Context requires a human. Some transactions are genuinely ambiguous without context that Penny can't access. A payment to "Amazon" could be business stationery or a personal purchase. A meal receipt could be client entertainment or a personal dinner. Only you know the context, and Penny is smart enough to recognise when she needs to ask.

Trust is built through honesty. When an AI system admits uncertainty, it paradoxically increases your trust in its confident categorisations. If Penny flags the things she's unsure about, you can trust that the things she doesn't flag are genuinely accurate. A system that never admits doubt gives you no way to calibrate your trust.

Mistakes are caught early. An error caught at the point of categorisation takes seconds to fix. An error caught during your year-end tax return takes much longer and may have knock-on effects. By flagging uncertain categorisations in real time, Penny ensures that corrections happen immediately, when they're easiest to make.

The Exception-First Experience

Confidence scoring is closely linked to what we call Accounted's "exception-first" approach. Rather than requiring you to review every single transaction — which is tedious and time-consuming — Penny only brings things to your attention when they need your input.

In practice, this means that after Penny has been working with your accounts for a few weeks and has learned your patterns, the vast majority of transactions are handled automatically with high confidence. You might have three or four transactions a week that Penny flags for review — and those reviews take seconds, not minutes.

This is a fundamentally different experience from traditional bookkeeping software, where you're expected to process every transaction yourself, or from other AI tools that present everything with equal importance regardless of whether it needs your attention or not.

Your time is valuable. You should spend it on the things that actually need your judgement, not on rubber-stamping decisions that Penny has already made correctly. That's the philosophy behind confidence scoring, and it's the philosophy behind Accounted as a whole.

The Learning Loop

There's a beautiful feedback mechanism built into the confidence scoring system. Every time you confirm a categorisation or correct one, Penny learns.

If she categorises a transaction with medium confidence and you confirm it, her confidence in similar future transactions increases. If you correct a categorisation, she adjusts her model to avoid making the same mistake again.

Over time, this creates a virtuous cycle:

Penny categorises transactions and flags uncertain ones
You confirm or correct
Penny learns from your responses
Her accuracy and confidence improve
Fewer transactions need your input
Your bookkeeping becomes increasingly effortless
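
The cycle above can be illustrated with a deliberately simple toy model. To be clear, this is not how Penny learns internally. It is a sketch under our own assumptions, with a hypothetical `MerchantMemory` class that counts confirmations and corrections per merchant and category, and turns those counts into a smoothed confidence estimate.

```python
from collections import defaultdict


class MerchantMemory:
    """Toy feedback store (hypothetical, for illustration only)."""

    def __init__(self):
        self._confirmed = defaultdict(int)
        self._rejected = defaultdict(int)

    def confidence(self, merchant, category):
        # Laplace-smoothed share of confirmations: 0.5 with no history,
        # rising towards 1.0 as confirmations accumulate.
        key = (merchant, category)
        yes, no = self._confirmed[key], self._rejected[key]
        return (yes + 1) / (yes + no + 2)

    def confirm(self, merchant, category):
        """The user agreed with the suggested category."""
        self._confirmed[(merchant, category)] += 1

    def correct(self, merchant, wrong_category, right_category):
        """The user replaced the suggested category with another one."""
        self._rejected[(merchant, wrong_category)] += 1
        self._confirmed[(merchant, right_category)] += 1


memory = MerchantMemory()
before = memory.confidence("AMAZON.CO.UK", "Office supplies")
memory.confirm("AMAZON.CO.UK", "Office supplies")
after = memory.confidence("AMAZON.CO.UK", "Office supplies")
print(f"before: {before:.2f}, after one confirmation: {after:.2f}")
```

Even in this toy version the behaviour described above is visible: a confirmation nudges confidence up for similar transactions, and a correction pulls it down for the wrong category while raising it for the right one.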

Most Accounted users find that after two to three months, Penny's high-confidence rate on their transactions exceeds 90%. That means fewer than one in ten transactions needs any input from them at all. The rest is handled automatically, accurately, and silently.

Real-World Examples

Let's make this concrete with some examples of confidence scoring in action.

Example 1: Clear-cut categorisation
Transaction: "SCREWFIX DIRECT - £47.82"
Your business: Plumber
Penny's category: Materials
Confidence: 97%
What happens: Categorised automatically. You see it in your records but don't need to do anything.

Example 2: Likely but not certain
Transaction: "AMAZON.CO.UK - £23.99"
Your business: Freelance writer
Penny's category: Office supplies
Confidence: 72%
What happens: Penny sends you a message: "I've categorised this Amazon purchase as office supplies. Does that look right, or was this a personal purchase?" You reply, and she learns.

Example 3: Genuinely uncertain
Transaction: "TRANSFER TO J SMITH - £500"
Your business: Graphic designer
Penny's category: Unknown
Confidence: 35%
What happens: Penny asks: "I'm not sure about this £500 transfer to J Smith. Was this a business expense (if so, what category?), or a personal transaction?" You explain it was a payment to a subcontractor, and Penny categorises it accordingly and remembers J Smith for next time.

What This Means for Your Tax Return

The practical upshot of confidence scoring is that when your tax return comes around — or when you need to submit your quarterly MTD updates — your records are clean and accurate.

Every transaction has been categorised, either automatically by Penny with high confidence or with your explicit confirmation. There are no silent errors lurking in your accounts waiting to cause problems. Your accountant receives well-organised records that they can work with immediately, rather than spending billable hours reviewing and correcting categorisations.

This doesn't just save you money on accountancy fees (though it does that too). It means you can be confident that your tax position is accurate. You're claiming the expenses you're entitled to, you're not claiming things you shouldn't be, and if HMRC ever asks questions, you have a clear audit trail showing how every categorisation was determined.

The Bigger Principle

Confidence scoring is really about a broader principle: AI should augment human judgement, not replace it entirely. The best AI systems are the ones that handle routine tasks brilliantly and know when to escalate to a human for decisions that require context, nuance, or judgement.

Penny is exceptional at the things AI is good at — pattern recognition, data processing, consistency, speed. And she's honest about the things that require a human touch — context-dependent decisions, ambiguous transactions, and edge cases.

That combination — AI competence plus AI honesty — is what makes Accounted's bookkeeping genuinely reliable. It's not about having the cleverest technology. It's about having technology that knows its own limitations and works alongside you, transparently, to produce the best possible outcome.

That's why Penny tells you when she's unsure. And that's why it matters.

Accounted helps UK sole traders stay on top of their bookkeeping and tax. Start your free 30-day trial at getaccounted.co.uk.