Code Generation Architecture: How We Made AI Calculate Correctly
By March 2025 we had successfully hooked AskDiana into our business backend (SQL Server). And the numbers? A crapshoot. Some days were great; some days produced what I can only describe as Friday-afternoon responses: you know the kind, delivered twenty minutes before the weekend by a drone offering minimum-effort answers with maximum-confidence energy. Not wrong enough to be obviously wrong. Just wrong enough to quietly ruin your Tuesday when someone notices.
The errors were small: a few hundred off on queries spanning a few million in monthly sales. They wouldn't tank the monthly graphs, but they'd matter. And if you scale that across thousands of calculations in a busy business analytics platform, you start compounding small errors into big decisions made on the basis of numbers that are, diplomatically speaking, not entirely true.
The Fundamental Problem (Again)
I've said it before, but it bears repeating: LLMs are language models, not calculators. I wrote about this first on LinkedIn, with the full detail following in The Arithmetic Blindspot — both worth a read if you want the full horror show.
When you ask an LLM to calculate, it's predicting digit sequences based on patterns, not computing results. For simple maths, it's often right. For complex multi-step calculations? Coin flip. Actually, coin flip is generous — at least a coin has the decency not to present its result in a beautifully formatted table with three decimal places.
The Obvious Solution Everyone Ignores
Here's the insight that should be obvious but somehow isn't, perhaps because the industry collectively got very excited about AI and briefly forgot that computers exist:
Stop asking AI to do arithmetic.
Instead:
- Use AI to understand the question
- Use AI to generate code that performs the calculation
- Execute the code (deterministic computation)
- Use AI to explain the result
AI does what it's good at (language). Computers do what they're good at (maths). The ENIAC figured this out in 1945. We briefly forgot. We've remembered now.
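The four steps above can be sketched end to end in a few lines. This is a minimal illustration, not AskDiana's code: the two AI stages are stubbed placeholders (a real system would call an LLM there), and the table and figures are invented. The point is the data flow: the only arithmetic happens inside the database.

```python
import sqlite3

def understand_question(question: str) -> dict:
    # AI stage (stubbed): extract structured intent from natural language.
    return {"metric": "average", "column": "sale_value", "quarter": 3}

def generate_code(intent: dict) -> str:
    # AI stage (stubbed): turn intent into SQL.
    return (f"SELECT AVG({intent['column']}) FROM sales "
            f"WHERE quarter = {intent['quarter']}")

def execute(sql: str, conn: sqlite3.Connection) -> float:
    # Deterministic stage: the database does the arithmetic.
    return conn.execute(sql).fetchone()[0]

def explain(value: float, intent: dict) -> str:
    # AI stage (stubbed): wrap the exact number in natural language.
    return (f"Your {intent['metric']} {intent['column']} for "
            f"Q{intent['quarter']} was {value:.2f}.")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_value REAL, quarter INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(100.0, 3), (200.0, 3), (300.0, 3), (999.0, 2)])

intent = understand_question("What's our average sale value for Q3?")
answer = explain(execute(generate_code(intent), conn), intent)
print(answer)  # Your average sale_value for Q3 was 200.00.
```

Notice that the language model never sees the number until after the database has computed it.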
The AskDiana Architecture
When you ask: "What's our average sale value for Q3?"
Traditional AI approach:
- Retrieves data
- Attempts calculation
- Returns probably-wrong number
- Presents it with the unearned confidence of a man who has never once been right but has also never once doubted himself
AskDiana approach:
Step 1: Intent Recognition (AI)
Parse the question, understand you need numerical data from a specific time period.
Step 2: Code Generation (AI)
Generate SQL: SELECT AVG(sale_value) FROM sales WHERE quarter = 3
Step 3: Code Validation (Rules Engine)
Verify the SQL is safe, properly formatted, and accesses appropriate tables. Not glamorous. Absolutely necessary.
Step 4: Execution (Database)
Run the query. Get a deterministic result. The database has been doing this since before most of our junior developers were born, and it has never once been distracted by having read too many financial reports.
Step 5: Natural Language Response (AI)
"Your average sale value for Q3 was £1,247.89 based on 3,429 transactions."
Step 6: Transparency
Show the SQL that was executed. Users can verify the logic. Trust is built through transparency, not blind faith in a system whose inner workings are roughly as legible as my handwriting after a conference dinner.
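Step 3, the rules engine, is the least glamorous part of the flow above, so here it is in miniature: reject anything that is not a single read-only SELECT against an allow-listed table. This is a toy sketch of the shape, not AskDiana's actual validator; the table allow-list and keyword list are illustrative.

```python
import re

ALLOWED_TABLES = {"sales", "products"}
FORBIDDEN = re.compile(r"\b(DROP|DELETE|UPDATE|INSERT|ALTER|TRUNCATE)\b",
                       re.IGNORECASE)

def validate_sql(sql: str) -> tuple[bool, str]:
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:
        return False, "multiple statements are not allowed"
    if not stripped.upper().startswith("SELECT"):
        return False, "only SELECT statements are allowed"
    if FORBIDDEN.search(stripped):
        return False, "destructive keyword detected"
    # Every table mentioned after FROM or JOIN must be allow-listed.
    tables = set(re.findall(r"\b(?:FROM|JOIN)\s+(\w+)", stripped,
                            re.IGNORECASE))
    unknown = {t.lower() for t in tables} - ALLOWED_TABLES
    if unknown:
        return False, f"table(s) not allow-listed: {sorted(unknown)}"
    return True, "ok"

print(validate_sql("SELECT AVG(sale_value) FROM sales WHERE quarter = 3"))
# (True, 'ok')
print(validate_sql("DROP TABLE sales"))
# (False, 'only SELECT statements are allowed')
```

A production validator adds permission checks and complexity limits on top, but the principle is the same: generated code earns execution, it isn't granted it.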
Why This Works
Separation of concerns — a principle so old it predates AI entirely:
- Natural language understanding → AI
- Code generation → AI
- Computation → Computer
- Explanation → AI
Verifiability: Users see the generated code. They can verify the logic. No black boxes. No "trust me, I'm an algorithm."
Accuracy: Computers don't make arithmetic errors. If the code is right, the answer is right. This is not aspirational. This is just how computers work.
The Prometheus Connection
Here's where it gets interesting.
We built Prometheus as a prompt management system. It lets users — or their analysts — manage and optimise the prompts that generate code. Because it turns out that the quality of code generation depends entirely on prompt quality, which is both obvious in retrospect and something most vendors quietly pretend isn't their problem.
Consider:
Bad prompt:
"Generate SQL for the user's question"
Good prompt:
"Generate SQL that: uses appropriate JOINs for multi-table queries, includes date range filters when time periods are mentioned, aggregates data correctly, handles NULL values, and returns results in the expected format"
The difference between those two is the difference between a junior developer who technically answered the question and a senior one who understood what you actually needed. Turns out you can encode that experience into a prompt. Turns out you can also let your analysts refine it over time without phoning us at 11pm.
With Prometheus, users can:
- Refine prompts based on real-world results
- Test different prompt variations against actual data
- Track which prompts produce the best code
- Deploy optimised prompts to production without a development cycle
The result: code generation quality improves continuously. And nobody has to call us.
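What "test prompt variations against actual data" can look like in practice: run each variant through the generator, execute the SQL it produces against a known dataset, and score the result against the expected answer. The names here (`VARIANTS`, `generate_sql`) are illustrative, not Prometheus's API, and the LLM call is stubbed with a deliberately buggy output for the terse prompt so the scoring has something to catch.

```python
import sqlite3

VARIANTS = {
    "v1-terse": "Generate SQL for the user's question.",
    "v2-detailed": ("Generate SQL that uses date filters when time periods "
                    "are mentioned, handles NULL values, and aggregates "
                    "correctly."),
}

def generate_sql(prompt: str, question: str) -> str:
    # Stub for the LLM call: the detailed prompt yields correct SQL,
    # the terse one forgets the quarter filter.
    if "NULL" in prompt:
        return ("SELECT AVG(sale_value) FROM sales "
                "WHERE quarter = 3 AND sale_value IS NOT NULL")
    return "SELECT AVG(sale_value) FROM sales"

def score_variants(conn, question: str, expected: float) -> dict:
    results = {}
    for name, prompt in VARIANTS.items():
        value = conn.execute(generate_sql(prompt, question)).fetchone()[0]
        results[name] = (value == expected)
    return results

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_value REAL, quarter INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(100.0, 3), (300.0, 3), (500.0, 1), (None, 3)])

print(score_variants(conn, "Average sale value for Q3?", expected=200.0))
# {'v1-terse': False, 'v2-detailed': True}
```

Scale the test set up from one question to a few hundred and you have a regression suite for prompts: analysts can refine wording and see immediately whether the generated code got better or worse.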
Security Considerations
"Wait," I hear you say. "You're executing dynamically generated code?"
Yes. But carefully. There is a meaningful difference between "dynamically generated code that runs in a sandbox with explicit permissions and full audit logging" and "dynamically generated code that has root access and no oversight." We are emphatically the former.
Sandboxing: Code executes in isolated environments with limited permissions. It cannot reach outside its lane.
Validation: Before execution, we validate SQL syntax, table access permissions, query complexity limits, and the absence of destructive operations (DROP, DELETE, UPDATE unless explicitly authorised). The system is more paranoid than a CISO on their first day, and in this case that's a feature.
Rate Limiting: Prevent abuse through request throttling. Someone running 10,000 queries a minute is either very enthusiastic or very malicious. We treat both cases the same.
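One common way to implement that throttling is a token bucket: each user gets a budget of queries that refills over time. A generic sketch, not AskDiana's implementation:

```python
import time

class TokenBucket:
    def __init__(self, capacity: int, refill_per_sec: float):
        self.capacity = capacity
        self.tokens = float(capacity)        # start with a full budget
        self.refill_per_sec = refill_per_sec
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Top the bucket up in proportion to elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Five queries allowed up front, then one more per second.
bucket = TokenBucket(capacity=5, refill_per_sec=1.0)
results = [bucket.allow() for _ in range(7)]
print(results)  # first five True, the rest throttled
```

The very enthusiastic user and the very malicious one hit the same wall, which is exactly the point.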
Audit Logging: Every generated query is logged with user, timestamp, and result. Not because we're watching. Because when something goes wrong — and something always eventually goes wrong — you want a trail.
All of this happens through our guardrails implementation as part of Genius2. But that's another story for another post, because I try not to put everything into one article or people stop reading and go make tea. You know who you are.
Real Deployment: Manufacturing
Manufacturing client, April 2025. They needed to analyse production efficiency by shift, waste percentages by product line, comparative performance across facilities, and cost analysis with some genuinely hairy formulas. The kind of analysis where being 5% wrong doesn't just look embarrassing — it actively misleads capital allocation decisions.
The challenge: traditional AI was generating numbers that looked right. Small errors, consistently. The AI was, in effect, very good at producing plausible fiction. This is a known talent. It is not always a useful one.
The solution: AskDiana generating SQL for database queries and Python for complex calculations, with full code inspection available to anyone who wanted to check the working.
The result:
- 100% calculation accuracy — because deterministic computation is deterministic
- Transparent code inspection — users could verify logic, which they did, enthusiastically, because engineers are like that
- Faster iteration — no manual SQL writing, no waiting for the analyst to be free
- Accessible to non-technical staff — ask in English, get an answer in English, with the SQL available if you want it
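The "Python for complex calculations" half of that deployment, in miniature: the kind of generated script a user could inspect line by line. The production figures below are invented for illustration; the point is that the arithmetic is explicit and checkable rather than predicted.

```python
# Waste percentage by product line from shift-level production records.
production = [
    # (product_line, units_produced, units_scrapped)
    ("widgets", 12_000, 240),
    ("widgets", 8_000, 200),
    ("gaskets", 5_000, 400),
]

# Sum produced and scrapped units per product line.
totals: dict[str, list[int]] = {}
for line, produced, scrapped in production:
    t = totals.setdefault(line, [0, 0])
    t[0] += produced
    t[1] += scrapped

waste_pct = {line: round(100 * scrapped / produced, 2)
             for line, (produced, scrapped) in totals.items()}
print(waste_pct)  # {'widgets': 2.2, 'gaskets': 8.0}
```

Being 5% wrong here would misrank the product lines; running the actual code makes that failure mode impossible, and showing the code makes it auditable.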
The Cognitive Load Reduction (Or: What We Actually Did To The Analysts)
Before: Analysts write SQL → generate reports → explain results to stakeholders → repeat for every variation of every question anyone ever has, forever, until retirement or madness, whichever comes first.
After: Stakeholders ask questions in English. Get immediate answers. Ask follow-ups. Go home at a reasonable hour.
The analysts are not eliminated. They are elevated. Instead of being query-writing drones, they're:
- Refining prompts to improve output quality
- Validating complex analyses
- Handling edge cases and exceptions
- Providing strategic decision support
We have made them smarter query monkeys. They now handle the interesting exceptions rather than the boring repetition, they get to actually think, and most importantly they've stopped sending me passive-aggressive emails about Tuesday's report. It's a win for everyone, including my inbox.
Where This Works and Where It Doesn't
This approach is genuinely excellent for:
- Structured data queries
- Numerical calculations
- Aggregations and analytics
- Report generation
It does not work for:
- Unstructured data analysis
- Sentiment analysis
- Text summarisation
- Creative generation
Different tools for different jobs. I know. Radical.
The Meta-Insight
We are all guilty of this: when we get a powerful new tool, we try to use it for everything.
"I have AI! It can do anything!"
No. AI is powerful. AI is useful. AI is transformative. Follow the yellow brick road — it's paved with appropriate use cases, not with hammers looking for nails.
AI is also a specific tool with specific strengths and specific weaknesses. Use it for what it's good at. Use other tools for what they're good at. The magic is in the orchestration, not in forcing one tool to do everything and wondering why it keeps catching fire.
Before you sign off, ask yourself:
- Do you have a clear separation between understanding questions and computing answers?
- Are you using the right tool for each step?
Because asking a language model to be a calculator is like asking a poet to be an accountant. Both will give you something beautifully formatted. Only one of them will give you numbers you can actually use.
Next: Real deployments across industries and what we learned.
Want to know what actually excites a query monkey? Give them a tool that handles the boring SQL and hands them the interesting problems. Try AskDiana for free and witness the transformation:
Try it out for free: askdiana.ai