Code Generation Architecture: How We Made AI Calculate Correctly
By March 2025 we had successfully hooked AskDiana into our business backend (SQL Server). And the numbers? A crapshoot. Some days were great; some days produced what I can only describe as Friday-afternoon responses: you know the kind, delivered twenty minutes before the weekend by a drone offering minimum-effort answers with maximum-confidence energy. Not wrong enough to be obviously wrong. Just wrong enough to quietly ruin your Tuesday when someone notices.
The errors were small: a few hundred off on queries spanning a few million in monthly sales. They wouldn't tank the monthly graphs, but they'd matter. And if you scale that across thousands of calculations in a busy business analytics platform, you start compounding small errors into big decisions made on the basis of numbers that are, diplomatically speaking, not entirely true.
The Fundamental Problem (Again)
I've said it before, but it bears repeating: LLMs are language models, not calculators. I wrote about this first on LinkedIn, with the full detail following in The Arithmetic Blindspot — both worth a read if you want the full horror show.
When you ask an LLM to calculate, it's predicting digit sequences based on patterns, not computing results. For simple maths, it's often right. For complex multi-step calculations? Coin flip. Actually, coin flip is generous — at least a coin has the decency not to present its result in a beautifully formatted table with three decimal places.
The Obvious Solution Everyone Ignores
Here's the insight that should be obvious but somehow isn't, perhaps because the industry collectively got very excited about AI and briefly forgot that computers exist:
Stop asking AI to do arithmetic.
Instead:
- Use AI to understand the question
- Use AI to generate code that performs the calculation
- Execute the code (deterministic computation)
- Use AI to explain the result
AI does what it's good at (language). Computers do what they're good at (maths). The ENIAC figured this out in 1945. We briefly forgot. We've remembered now.
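The four steps above can be sketched end to end in a few lines. This is a minimal illustration, not AskDiana's code: the two AI stages are stubbed placeholders (a real system would call an LLM there), and the table and figures are invented. The point is the data flow: the only arithmetic happens inside the database.

```python
import sqlite3

def understand_question(question: str) -> dict:
    # AI stage (stubbed): extract structured intent from natural language.
    return {"metric": "average", "column": "sale_value", "quarter": 3}

def generate_code(intent: dict) -> str:
    # AI stage (stubbed): turn intent into SQL.
    return (f"SELECT AVG({intent['column']}) FROM sales "
            f"WHERE quarter = {intent['quarter']}")

def execute(sql: str, conn: sqlite3.Connection) -> float:
    # Deterministic stage: the database does the arithmetic.
    return conn.execute(sql).fetchone()[0]

def explain(value: float, intent: dict) -> str:
    # AI stage (stubbed): wrap the exact number in natural language.
    return (f"Your {intent['metric']} {intent['column']} for "
            f"Q{intent['quarter']} was {value:.2f}.")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_value REAL, quarter INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(100.0, 3), (200.0, 3), (300.0, 3), (999.0, 2)])

intent = understand_question("What's our average sale value for Q3?")
answer = explain(execute(generate_code(intent), conn), intent)
print(answer)  # Your average sale_value for Q3 was 200.00.
```

Notice that the language model never sees the number until after the database has computed it.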
The AskDiana Architecture
When you ask: "What's our average sale value for Q3?"
Traditional AI approach:
- Retrieves data
- Attempts calculation
- Returns probably-wrong number
- Presents it with the unearned confidence of a man who has never once been right but has also never once doubted himself
AskDiana approach:
Step 1: Intent Recognition (AI)
Parse the question, understand you need numerical data from a specific time period.
Step 2: Code Generation (AI)
Generate SQL: SELECT AVG(sale_value) FROM sales WHERE quarter = 3
Step 3: Code Validation (Rules Engine)
Verify the SQL is safe, properly formatted, and accesses appropriate tables. Not glamorous. Absolutely necessary.
Step 4: Execution (Database)
Run the query. Get a deterministic result. The database has been doing this since before most of our junior developers were born, and it has never once been distracted by having read too many financial reports.
Step 5: Natural Language Response (AI)
"Your average sale value for Q3 was £1,247.89 based on 3,429 transactions."
Step 6: Transparency
Show the SQL that was executed. Users can verify the logic. Trust is built through transparency, not blind faith in a system whose inner workings are roughly as legible as my handwriting after a conference dinner.
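Step 3, the rules engine, is the least glamorous part of the flow above, so here it is in miniature: reject anything that is not a single read-only SELECT against an allow-listed table. This is a toy sketch of the shape, not AskDiana's actual validator; the table allow-list and keyword list are illustrative.

```python
import re

ALLOWED_TABLES = {"sales", "products"}
FORBIDDEN = re.compile(r"\b(DROP|DELETE|UPDATE|INSERT|ALTER|TRUNCATE)\b",
                       re.IGNORECASE)

def validate_sql(sql: str) -> tuple[bool, str]:
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:
        return False, "multiple statements are not allowed"
    if not stripped.upper().startswith("SELECT"):
        return False, "only SELECT statements are allowed"
    if FORBIDDEN.search(stripped):
        return False, "destructive keyword detected"
    # Every table mentioned after FROM or JOIN must be allow-listed.
    tables = set(re.findall(r"\b(?:FROM|JOIN)\s+(\w+)", stripped,
                            re.IGNORECASE))
    unknown = {t.lower() for t in tables} - ALLOWED_TABLES
    if unknown:
        return False, f"table(s) not allow-listed: {sorted(unknown)}"
    return True, "ok"

print(validate_sql("SELECT AVG(sale_value) FROM sales WHERE quarter = 3"))
# (True, 'ok')
print(validate_sql("DROP TABLE sales"))
# (False, 'only SELECT statements are allowed')
```

A production validator adds permission checks and complexity limits on top, but the principle is the same: generated code earns execution, it isn't granted it.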
Why This Works
Separation of concerns — a principle so old it predates AI entirely:
- Natural language understanding → AI
- Code generation → AI
- Computation → Computer
- Explanation → AI
Verifiability: Users see the generated code. They can verify the logic. No black boxes. No "trust me, I'm an algorithm."
Accuracy: Computers don't make arithmetic errors. If the code is right, the answer is right. This is not aspirational. This is just how computers work.
The Prometheus Connection
Here's where it gets interesting.
We built Prometheus as a prompt management system. It lets users — or their analysts — manage and optimise the prompts that generate code. Because it turns out that the quality of code generation depends entirely on prompt quality, which is both obvious in retrospect and something most vendors quietly pretend isn't their problem.
Consider:
Bad prompt:
"Generate SQL for the user's question"
Good prompt:
"Generate SQL that: uses appropriate JOINs for multi-table queries, includes date range filters when time periods are mentioned, aggregates data correctly, handles NULL values, and returns results in the expected format"
The difference between those two is the difference between a junior developer who technically answered the question and a senior one who understood what you actually needed. Turns out you can encode that experience into a prompt. Turns out you can also let your analysts refine it over time without phoning us at 11pm.
With Prometheus, users can:
- Refine prompts based on real-world results
- Test different prompt variations against actual data
- Track which prompts produce the best code
- Deploy optimised prompts to production without a development cycle
The result: code generation quality improves continuously. And nobody has to call us.
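What "test prompt variations against actual data" can look like in practice: run each variant through the generator, execute the SQL it produces against a known dataset, and score the result against the expected answer. The names here (`VARIANTS`, `generate_sql`) are illustrative, not Prometheus's API, and the LLM call is stubbed with a deliberately buggy output for the terse prompt so the scoring has something to catch.

```python
import sqlite3

VARIANTS = {
    "v1-terse": "Generate SQL for the user's question.",
    "v2-detailed": ("Generate SQL that uses date filters when time periods "
                    "are mentioned, handles NULL values, and aggregates "
                    "correctly."),
}

def generate_sql(prompt: str, question: str) -> str:
    # Stub for the LLM call: the detailed prompt yields correct SQL,
    # the terse one forgets the quarter filter.
    if "NULL" in prompt:
        return ("SELECT AVG(sale_value) FROM sales "
                "WHERE quarter = 3 AND sale_value IS NOT NULL")
    return "SELECT AVG(sale_value) FROM sales"

def score_variants(conn, question: str, expected: float) -> dict:
    results = {}
    for name, prompt in VARIANTS.items():
        value = conn.execute(generate_sql(prompt, question)).fetchone()[0]
        results[name] = (value == expected)
    return results

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_value REAL, quarter INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(100.0, 3), (300.0, 3), (500.0, 1), (None, 3)])

print(score_variants(conn, "Average sale value for Q3?", expected=200.0))
# {'v1-terse': False, 'v2-detailed': True}
```

Scale the test set up from one question to a few hundred and you have a regression suite for prompts: analysts can refine wording and see immediately whether the generated code got better or worse.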
Security Considerations
"Wait," I hear you say. "You're executing dynamically generated code?"
Yes. But carefully. There is a meaningful difference between "dynamically generated code that runs in a sandbox with explicit permissions and full audit logging" and "dynamically generated code that has root access and no oversight." We are emphatically the former.
Sandboxing: Code executes in isolated environments with limited permissions. It cannot reach outside its lane.
Validation: Before execution, we validate SQL syntax, table access permissions, query complexity limits, and the absence of destructive operations (DROP, DELETE, UPDATE unless explicitly authorised). The system is more paranoid than a CISO on their first day, and in this case that's a feature.
Rate Limiting: Prevent abuse through request throttling. Someone running 10,000 queries a minute is either very enthusiastic or very malicious. We treat both cases the same.
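One common way to implement that throttling is a token bucket: each user gets a budget of queries that refills over time. A generic sketch, not AskDiana's implementation:

```python
import time

class TokenBucket:
    def __init__(self, capacity: int, refill_per_sec: float):
        self.capacity = capacity
        self.tokens = float(capacity)        # start with a full budget
        self.refill_per_sec = refill_per_sec
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Top the bucket up in proportion to elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Five queries allowed up front, then one more per second.
bucket = TokenBucket(capacity=5, refill_per_sec=1.0)
results = [bucket.allow() for _ in range(7)]
print(results)  # first five True, the rest throttled
```

The very enthusiastic user and the very malicious one hit the same wall, which is exactly the point.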
Audit Logging: Every generated query is logged with user, timestamp, and result. Not because we're watching. Because when something goes wrong — and something always eventually goes wrong — you want a trail.
All of this happens through our guardrails implementation as part of Genius2. But that's another story for another post, because I try not to put everything into one article or people stop reading and go make tea. You know who you are.
Real Deployment: Manufacturing
Manufacturing client, April 2025. They needed to analyse production efficiency by shift, waste percentages by product line, comparative performance across facilities, and cost analysis with some genuinely hairy formulas. The kind of analysis where being 5% wrong doesn't just look embarrassing — it actively misleads capital allocation decisions.
The challenge: traditional AI was generating numbers that looked right. Small errors, consistently. The AI was, in effect, very good at producing plausible fiction. This is a known talent. It is not always a useful one.
The solution: AskDiana generating SQL for database queries and Python for complex calculations, with full code inspection available to anyone who wanted to check the working.
The result:
- 100% calculation accuracy — because deterministic computation is deterministic
- Transparent code inspection — users could verify logic, which they did, enthusiastically, because engineers are like that
- Faster iteration — no manual SQL writing, no waiting for the analyst to be free
- Accessible to non-technical staff — ask in English, get an answer in English, with the SQL available if you want it
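The "Python for complex calculations" half of that deployment, in miniature: the kind of generated script a user could inspect line by line. The production figures below are invented for illustration; the point is that the arithmetic is explicit and checkable rather than predicted.

```python
# Waste percentage by product line from shift-level production records.
production = [
    # (product_line, units_produced, units_scrapped)
    ("widgets", 12_000, 240),
    ("widgets", 8_000, 200),
    ("gaskets", 5_000, 400),
]

# Sum produced and scrapped units per product line.
totals: dict[str, list[int]] = {}
for line, produced, scrapped in production:
    t = totals.setdefault(line, [0, 0])
    t[0] += produced
    t[1] += scrapped

waste_pct = {line: round(100 * scrapped / produced, 2)
             for line, (produced, scrapped) in totals.items()}
print(waste_pct)  # {'widgets': 2.2, 'gaskets': 8.0}
```

Being 5% wrong here would misrank the product lines; running the actual code makes that failure mode impossible, and showing the code makes it auditable.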
The Cognitive Load Reduction (Or: What We Actually Did To The Analysts)
Before: Analysts write SQL → generate reports → explain results to stakeholders → repeat for every variation of every question anyone ever has, forever, until retirement or madness, whichever comes first.
After: Stakeholders ask questions in English. Get immediate answers. Ask follow-ups. Go home at a reasonable hour.
The analysts are not eliminated. They are elevated. Instead of being query-writing drones, they're:
- Refining prompts to improve output quality
- Validating complex analyses
- Handling edge cases and exceptions
- Providing strategic decision support
We have made them smarter query monkeys. They now handle the interesting exceptions rather than the boring repetition, they get to actually think, and most importantly they've stopped sending me passive-aggressive emails about Tuesday's report. It's a win for everyone, including my inbox.
Where This Works and Where It Doesn't
This approach is genuinely excellent for:
- Structured data queries
- Numerical calculations
- Aggregations and analytics
- Report generation
It does not work for:
- Unstructured data analysis
- Sentiment analysis
- Text summarisation
- Creative generation
Different tools for different jobs. I know. Radical.
The Meta-Insight
We are all guilty of this: when we get a powerful new tool, we try to use it for everything.
"I have AI! It can do anything!"
No. AI is powerful. AI is useful. AI is transformative. Follow the yellow brick road — it's paved with appropriate use cases, not with hammers looking for nails.
AI is also a specific tool with specific strengths and specific weaknesses. Use it for what it's good at. Use other tools for what they're good at. The magic is in the orchestration, not in forcing one tool to do everything and wondering why it keeps catching fire.
Before you sign off, ask yourself:
- Do you have a clear separation between understanding questions and computing answers?
- Are you using the right tool for each step?
Because asking a language model to be a calculator is like asking a poet to be an accountant. Both will give you something beautifully formatted. Only one of them will give you numbers you can actually use.
Next: Real deployments across industries and what we learned.
Want to know what actually excites a query monkey? Give them a tool that handles the boring SQL and hands them the interesting problems. Try AskDiana for free and witness the transformation:
Try it out for free: askdiana.ai