How I evaluate an LLM that reads invoices in production

If you ship an LLM into production and you cannot say how good it is with a number, you are not running a product, you are running a demo. Evaluating DokladBot, my invoicing SaaS that extracts data from receipts, is not a one-off benchmark. It is a standing discipline. Here is how I think about it and what I actually measure.

Accuracy is not one number

"Is the extraction accurate?" is the wrong question because a receipt has many fields, and they do not matter equally. Getting the supplier name slightly wrong is a shrug. Getting the amount or the VAT wrong is a trust-destroying failure. So I measure quality per field, not as a single blended score, and I weight the fields by how much a mistake costs the user.
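To make the weighting concrete, here is a minimal sketch of a cost-weighted error score. The weights, the field names as dictionary keys, and the function name are illustrative assumptions, not DokladBot's real values.

```python
# Hypothetical weights: how much a mistake in each field costs the user.
FIELD_WEIGHTS = {
    "amount": 5.0,
    "vat": 5.0,
    "company_id": 3.0,
    "date": 2.0,
    "supplier": 1.0,
}


def weighted_error_cost(error_rates: dict[str, float],
                        weights: dict[str, float] = FIELD_WEIGHTS) -> float:
    """Weighted average of per-field error rates.

    0.0 means no errors; 1.0 means every scored field is always wrong.
    Only the fields present in error_rates are scored.
    """
    if not error_rates:
        raise ValueError("need at least one field to score")
    total_weight = sum(weights[field] for field in error_rates)
    weighted = sum(weights[field] * rate for field, rate in error_rates.items())
    return weighted / total_weight
```

With these assumed weights, a 60% error rate on supplier alone scores far better than the same error rate on amount alone, which is the asymmetry a single blended accuracy number hides.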

The labeled set is the asset

The thing that makes evaluation possible is a fixed, labeled test set: real documents with the correct amount, VAT, supplier, date, and company ID written down by hand. This set is the asset, more valuable than any single prompt, because every change gets scored against the same documents. Without it, "the new prompt feels better" is a vibe, not a measurement.
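One possible shape for a labeled record, as a sketch: the five fields come from the list above, but the class name, the types, and the exact-match comparison are assumptions chosen for illustration.

```python
from dataclasses import dataclass, fields
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class Receipt:
    """One document's fields; None means the field is blank."""
    amount: Optional[Decimal]
    vat: Optional[Decimal]
    supplier: Optional[str]
    date: Optional[str]        # assumed ISO 8601, e.g. "2024-03-15"
    company_id: Optional[str]


def compare(label: Receipt, predicted: Receipt) -> dict[str, bool]:
    """Return, per field, whether the prediction exactly matches the hand label."""
    return {
        f.name: getattr(label, f.name) == getattr(predicted, f.name)
        for f in fields(Receipt)
    }
```

Using Decimal rather than float for money means 121.00 and 121.0 compare equal without rounding surprises. Exact string match on supplier is deliberately strict; a real pipeline might normalize names first.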

The metrics that matter

For each field I track precision and recall: when the model fills a field, how often is the value right (precision), and when a document actually has a value for that field, how often does the model return it correctly (recall). For amount and VAT I care about precision above almost everything, because a confidently wrong number is worse than a blank one. For an end-to-end view I track the share of documents that need zero corrections, which is the product metric the model evaluation rolls up into.
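A minimal sketch of these three metrics, assuming each field's results are available as (expected, predicted) pairs with None for a blank. The function names are hypothetical, and recall here counts only correctly returned values, which is one reasonable reading of the definition.

```python
from typing import Optional

# (expected, predicted) for one field on one document; None means blank.
Pair = tuple[Optional[str], Optional[str]]


def precision_recall(pairs: list[Pair]) -> tuple[Optional[float], Optional[float]]:
    """Per-field precision and recall; None when the denominator is zero."""
    filled = sum(1 for _, predicted in pairs if predicted is not None)
    should_fill = sum(1 for expected, _ in pairs if expected is not None)
    correct = sum(
        1 for expected, predicted in pairs
        if predicted is not None and predicted == expected
    )
    precision = correct / filled if filled else None
    recall = correct / should_fill if should_fill else None
    return precision, recall


def zero_correction_share(docs: list[tuple[dict, dict]]) -> float:
    """Share of (label, prediction) pairs where every labeled field matches."""
    if not docs:
        raise ValueError("need at least one document")
    clean = sum(
        1 for label, prediction in docs
        if all(prediction.get(field) == value for field, value in label.items())
    )
    return clean / len(docs)
```

Leaving a field blank lowers recall but not precision, which is why a model tuned to abstain when unsure can hold high precision on amount and VAT.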

The confidence threshold is a product decision

The model does not just return a value, it returns a value with a confidence score I can route on. Below a confidence threshold, the document goes to a human-in-the-loop review instead of being posted automatically. Setting that threshold is the central tradeoff: too low and people lose trust because wrong values slip through, too high and the product nags about documents it could have handled. Tuning it is not a model task, it is a product judgement informed by the eval numbers.
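The tradeoff can be read off the labeled set by sweeping candidate thresholds. This is a sketch under the assumption that each evaluated document yields a confidence in [0, 1] and a correct/incorrect flag; the function names and route labels are hypothetical.

```python
from typing import Optional


def route(confidence: float, threshold: float) -> str:
    """Decide where a document goes for a given threshold."""
    return "auto_post" if confidence >= threshold else "human_review"


def sweep(results: list[tuple[float, bool]],
          thresholds: list[float]) -> list[tuple[float, float, Optional[float]]]:
    """For each threshold, return (threshold, auto_post_share, auto_post_precision).

    results holds (confidence, is_correct) for every document in the labeled set.
    """
    if not results:
        raise ValueError("need at least one result")
    rows = []
    for threshold in thresholds:
        auto = [ok for confidence, ok in results if confidence >= threshold]
        share = len(auto) / len(results)
        precision = sum(auto) / len(auto) if auto else None
        rows.append((threshold, share, precision))
    return rows
```

Each row shows both sides of the decision: the share of documents posted without a human, and how often those auto-posted values were right. Picking the row is the product call.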

Multi-model fallback, measured not assumed

DokladBot can fall back across models. The point of the eval set is that I do not guess which model is better, I score them on the same documents and let the numbers decide, including cost. A cheaper model that holds precision on amount and VAT is a better product even if it loses a point on supplier names. That decision is only possible because the evaluation exists.
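One way to encode that decision rule, as a sketch: require a precision floor on the fields that matter, then take the cheapest model that clears it. The model names, floors, and per-document costs in the usage note are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelScore:
    name: str
    precision: dict[str, float]   # per-field precision on the labeled set
    cost_per_doc: float           # in any consistent currency unit


def pick_model(candidates: list[ModelScore],
               floors: dict[str, float]) -> Optional[ModelScore]:
    """Cheapest candidate whose precision meets every floor, or None."""
    eligible = [
        model for model in candidates
        if all(model.precision.get(field, 0.0) >= floor
               for field, floor in floors.items())
    ]
    return min(eligible, key=lambda model: model.cost_per_doc) if eligible else None
```

For example, with floors of 0.98 on amount and VAT, a hypothetical model that scores 0.99 and 0.98 at a quarter of the cost wins over a pricier one, even if its supplier precision is a point lower, because supplier has no floor.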

Regression testing across versions

The risk with LLM products is silent regression: a provider updates a model, or I change a prompt, and quality quietly drops on a field nobody was watching. So every change re-runs the labeled set and I compare field-level precision and recall to the previous version. If amount precision drops, the change does not ship, no matter how good it looks on a few examples.
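A regression gate along these lines can be a few lines of code. This sketch assumes the metrics for each version are stored as nested dictionaries keyed by field and metric name; the function name and the tolerance parameter are hypothetical.

```python
Metrics = dict[str, dict[str, float]]  # field -> {"precision": ..., "recall": ...}


def regressions(previous: Metrics, current: Metrics,
                guarded_fields: list[str], tolerance: float = 0.0) -> list[str]:
    """List every guarded metric that dropped by more than the tolerance."""
    failures = []
    for field in guarded_fields:
        for metric in ("precision", "recall"):
            before = previous[field][metric]
            after = current[field][metric]
            if after < before - tolerance:
                failures.append(f"{field} {metric}: {before:.3f} -> {after:.3f}")
    return failures
```

An empty list means the change may ship; anything else blocks it. Run in CI after re-scoring the labeled set, the check turns "quality quietly drops" into a failing build.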

Why this is the AI PM skill

Anyone can wire an LLM to an API. The harder, rarer skill is knowing whether it is good enough to trust, where to put the human, which model to pick, and when a change made things worse. That is evaluation, and it is the part of AI product work that separates a demo from a product. It is also the work I do every week on DokladBot.

DokladBot is live in beta. If you are shipping an LLM product and want this kind of evaluation discipline applied to it, let's talk.