Production Deployments: What We Learned From Five Industries
April 2025. We went from prototype to production in two months.
Not because we were reckless. Because we had clients who couldn't wait. (These two things are not entirely unrelated, but we maintain that the distinction is meaningful and will continue to do so until someone proves otherwise.)
The gap between prototype and production, it turns out, is precisely the gap between a map and actual terrain. The map is very neat. The terrain has a pharmaceutical compliance department.
Pharmaceutical companies dealing with compliance audits. Security organisations protecting national interests. Manufacturers optimising complex operations. Logistics firms managing real-time fleet decisions. Travel companies handling customer experience at scale.
These aren't startups experimenting with AI for the novelty of it. These are enterprises where mistakes have consequences that go well beyond a slightly awkward all-hands meeting.
Here's what deploying Genius2 and AskDiana across five industries actually taught us.
Pharmaceuticals: Where the Word "Hallucination" Loses Its Charm
There is a certain grim irony in the AI industry's choice of the word "hallucination" for when models confidently invent things. In most contexts, it's a slightly whimsical term for a software quirk. In pharmaceutical contexts, it is a description of a potential catastrophe wearing a very convincing suit.
Our pharmaceutical client needed AI for compound property lookups, regulatory compliance queries, clinical trial data analysis, and internal knowledge management. Standard AI wasn't cutting it. A single hallucinated dosage recommendation or falsified compound property wasn't going to generate a funny blog post. It was going to generate lawyers.
The Deployment: On-premise Genius2 installation, completely air-gapped, local model execution with no external API calls whatsoever, custom compliance prompts, and full integration with internal documentation.
Lesson 1: Confidence thresholds matter more than you think
We initially set Genius2's consensus threshold at 70%. Seemed reasonable. In everyday life, 70% confident means you'll probably order the fish and deal with whatever happens. In pharmaceuticals, 70% confident means someone's regulatory submission is about to have a very educational morning.
The client pushed back: "In pharma, 70% isn't confident enough to act on."
We adjusted to 85%+ for critical queries. Anything below that threshold is flagged for human review rather than presented as an answer. This felt like a limitation when we built it. The client considers it the entire point.
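The gating logic is simple enough to sketch. This is a minimal illustration of the threshold-and-escalate pattern described above, not our actual implementation; the names (`ConsensusResult`, `route`) and the shape of the result are ours for the example.

```python
from dataclasses import dataclass

CRITICAL_THRESHOLD = 0.85  # raised from the original 0.70 after client feedback


@dataclass
class ConsensusResult:
    answer: str
    agreement: float  # fraction of models that agreed, 0.0-1.0


def route(result: ConsensusResult) -> dict:
    """Gate on consensus confidence: below threshold, escalate to a human."""
    if result.agreement >= CRITICAL_THRESHOLD:
        return {"status": "answered", "answer": result.answer}
    # Below threshold: flag for human review rather than presenting an answer.
    return {
        "status": "needs_review",
        "draft": result.answer,
        "agreement": result.agreement,
    }
```

The point the client made is visible in the code: the interesting branch is the second one.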
Lesson 2: "I don't know" is a feature, not a bug
When Genius2 can't reach consensus, it says so clearly. Humanity took several thousand years to apply this principle to oracles, priests, and management consultants with varying degrees of success. We compressed it into six months of deployment.
The client now treats it as a core feature: "We'd rather AI admit uncertainty than confidently give us wrong information." Which is, when you think about it, a perfectly reasonable thing to want from any advisor, artificial or otherwise.
Lesson 3: Audit trails are non-negotiable
Every query, every response, every model consulted -- logged. For regulatory compliance, this transparency isn't a nice-to-have. It's the entire basis on which anyone is willing to use the system at all.
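In practice this is a one-record-per-interaction, append-only log. A minimal sketch of what each record carries, using an illustrative function name (`audit_record`) rather than anything from our codebase:

```python
import json
import time


def audit_record(query, response, models, now=time.time):
    """Build one audit-log line: the query, the response, every model consulted."""
    return json.dumps({
        "ts": now(),
        "query": query,
        "response": response,
        "models_consulted": models,
    })


# Appending each line to a write-once log gives full audit-trail coverage:
# with open("audit.jsonl", "a") as log:
#     log.write(audit_record(query, response, models) + "\n")
```

One JSON line per interaction is dull, grep-friendly, and exactly what a regulator wants to see.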
Metrics After 6 Months
<1% hallucination rate on compound queries
100% audit trail coverage
73% reduction in time spent on data lookups
Zero regulatory compliance issues from AI responses
Security: Where "Zero" Means Zero
Can't name the organisation. Obviously. If I could name them, they would probably need to reconsider their entire security posture, which would make this a significantly less cheerful case study.
What I can tell you is that they work on national security issues and their requirement was simple: AI capabilities with absolutely zero external data transmission. Not "encrypted." Not "anonymised." Not "we pinky-promise it's safe." Zero. This is not a requirement you can satisfy with a strongly-worded privacy policy and a Terms of Service that nobody reads.
The Deployment: Completely air-gapped AskDiana installation on physical hardware, no cloud, local model execution (Llama, Mistral, Qwen), and custom security hardening that I am not going to describe in a public blog post for what I trust are obvious reasons.
Lesson 1: Air-gapped doesn't mean ineffective
We were nervous. Local models, at the time, weren't as capable as their cloud-hosted counterparts. The concern was that we'd deliver a system that was very private and also largely useless, which is a category of achievement that nobody puts in their case studies.
But Genius2's consensus approach compensated beautifully. Eight decent models reaching agreement is often more reliable than one excellent model delivering its singular opinion with great confidence. There is a pub quiz team somewhere in South London that proves this every Thursday, and they are correct approximately 73% of the time, which in a pub quiz is excellent and in pharmaceuticals is, as noted, not quite enough.
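The core of the consensus idea can be shown in a few lines. This is a simple majority vote, which is a deliberately reduced stand-in for Genius2's actual consensus mechanism; it does show why the agreement fraction falls out naturally for the thresholding described in the pharma section.

```python
from collections import Counter


def consensus(answers):
    """Majority vote across independent model answers.

    Returns the most common answer and the fraction of models that agreed,
    so a caller can apply a confidence threshold downstream.
    """
    if not answers:
        raise ValueError("no model answers to vote on")
    top, count = Counter(answers).most_common(1)[0]
    return top, count / len(answers)
```

Eight decent models voting gives you both an answer and a measure of how much to trust it; one excellent model gives you only the answer.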
Lesson 2: Speed beats theoretical perfection
They valued response time over marginal capability improvements. A good answer in 3 seconds beats a perfect answer in 30 seconds. We optimised accordingly: parallel processing, efficient vectorisation, smart caching via Mnemonic.
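The parallel-processing half of that is the easy part to illustrate. A sketch of fanning the same query out to every model at once (using Python threads as a stand-in for whatever execution layer you prefer):

```python
from concurrent.futures import ThreadPoolExecutor


def fan_out(query, models):
    """Send the same query to every model concurrently.

    Total latency is roughly the slowest single model rather than the sum of
    all of them, which is what makes multi-model consensus viable at
    interactive speeds.
    """
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        return list(pool.map(lambda model: model(query), models))
```

Sequential consultation of eight models is how you turn a 3-second answer into a 24-second one.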
Lesson 3: Trust requires verifiability, not promises
They didn't take our word that data wasn't leaking. They monitored network traffic, reviewed code, and tested with sensitive data markers. We designed the system specifically to be verifiable, not just trustworthy. There is an important difference. Trustworthy is what you claim. Verifiable is what you can prove. Only one of those stands up in a security review.
Metrics After 6 Months
Zero external data transmission (independently verified)
2.8 second average response time
94% user satisfaction score
Adopted across 7 departments
Manufacturing: Where Precision Is Everything and Patience Is Finite
Global manufacturer with complex operations across multiple facilities: production efficiency analysis, supply chain optimisation, quality control metrics, real-time decision support. They needed accurate calculations.
They had tried traditional AI. The arithmetic errors were not acceptable. When you are running a manufacturing operation, "close enough" is a phrase that belongs in horseshoes, hand grenades, and nowhere else.
The Deployment: AskDiana with code generation architecture, ERP integration, Prometheus for prompt management, and a hybrid deployment spanning on-premise and cloud infrastructure.
Lesson 1: Business users do not want to learn SQL
Initially, we displayed the generated SQL prominently. Transparency! Verifiability! Here is the code that produced your answer -- isn't that marvellous?
Users were not impressed. They wanted answers, not code reviews. The SQL is vitally important to have available, in the same way that nuclear safety documentation is vitally important to have available: absolutely essential, and not what most people want to read over breakfast.
We made the SQL viewable on demand but hidden by default. Technical staff could verify. Business users could, with equal dignity, ignore it entirely. Everyone was happier.
Lesson 2: Code generation quality compounds over time
With Prometheus, users refined prompts. Better prompts produced better code, which produced better results, which encouraged more refinement. It is, as productivity gains go, remarkably self-sustaining.
Month 1: 82% code quality score
Month 3: 94% code quality score
Month 6: 97% code quality score
Lesson 3: Calculations you can trust change how you work
Before: "Run these numbers, then verify manually before acting."
After: "What's the number?" -- followed immediately by a decision.
The speed of business decisions is often constrained not by the availability of data but by the confidence people have in it. Remove the doubt, and the decision cycle accelerates in ways that compound aggressively across thousands of decisions per month.
Metrics After 6 Months
100% calculation accuracy (deterministic computation)
89% reduction in manual report generation
34% faster decision cycles
£2.4M annualised productivity savings
Logistics: Where Seconds Are Not Metaphorical
Fleet management, route optimisation, real-time logistics decisions. In logistics, three seconds can be the difference between a truck taking the correct motorway junction and committing to a 40-minute diversion that arrives at the depot at a time that satisfies nobody and improves the driver's vocabulary considerably.
Speed was the constraint. Everything else was negotiable. Speed was not.
The Deployment: Cloud deployment on AWS VPC, Mnemonic caching layer, integration with fleet management systems, and real-time data processing pipelines.
Lesson 1: Caching is a multiplier
First deployment: 4-5 second average response time. Functional. Acceptable. In logistics, "acceptable" is often the enemy of the good.
After Mnemonic implementation: 0.8 seconds average for cached queries. Repeated questions (40% of total traffic) went from slow to essentially instant. The logistics industry, it turns out, asks a lot of the same questions repeatedly. This is not unique to logistics.
Lesson 2: Real-time data needs intelligent staleness management
Fleet locations change constantly. A cached answer about where a truck is becomes incorrect at the moment the truck moves, which in a functioning logistics operation is roughly every few minutes. We implemented smart cache invalidation based on data currency, with Knowledge Decay (KDecay) integration deciding when fresh data was necessary and when the cached answer was still perfectly good. The system knows the difference between "this information might be slightly dated" and "this information is now fictional."
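The shape of the idea is a cache whose entries expire at different rates depending on what they describe. A minimal sketch, with made-up freshness budgets; in the real system those budgets come from KDecay rather than a hard-coded table, and the class name here is purely illustrative.

```python
import time

# Illustrative freshness budgets (seconds) per data class -- the point at
# which "slightly dated" becomes "fictional".
MAX_AGE = {"fleet_position": 60.0, "route_policy": 3600.0}


class StalenessAwareCache:
    """Cache whose entries expire at rates set by what they describe."""

    def __init__(self, clock=time.time):
        self._store = {}  # key -> (value, stored_at)
        self._clock = clock

    def put(self, key, value):
        self._store[key] = (value, self._clock())

    def get(self, key, data_class):
        """Return a cached value only while it is inside its freshness budget."""
        entry = self._store.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if self._clock() - stored_at > MAX_AGE[data_class]:
            del self._store[key]  # stale: force a fresh lookup upstream
            return None
        return value
```

A truck's position gets a minute; a routing policy gets an hour. Same cache, different definitions of "still true".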
Lesson 3: Decision velocity is self-reinforcing
Faster AI leads to faster decisions, which leads to better outcomes, which builds trust, which leads to more questions, which leads to more value. MBA students call this a virtuous cycle. Everyone else calls it "things working properly for a change." Either way, it compounds.
Metrics After 6 Months
0.8s average response (cached queries)
3.2s average response (non-cached)
67% cache hit rate
23% improvement in route efficiency
Travel: Where a Hallucinated Booking Reference Is Basically a Work of Fiction
Customer service at scale. Thousands of queries daily. Fast, accurate responses required across bookings, policies, and availability. And -- this is the critical part -- the customer is already slightly stressed before they've asked the question.
A hallucinated booking confirmation in travel has a very specific energy: it's exactly like a hotel concierge who is absolutely certain they have your reservation, will find it any moment, you're probably in the system under a slightly different name, and would you like to wait in the bar while they sort it out. Only there is no bar. And your flight is in four hours.
The Deployment: Multi-region cloud deployment, integration with booking systems, AI customer service assistant, and multi-language support across the full service territory.
Lesson 1: Customers have an extremely low tolerance for invented information
One false booking confirmation destroys trust with a customer. One. They do not give you a second chance to invent their holiday arrangements. Genius2's consensus approach eliminated invented booking references, false policy statements, and incorrect availability. When the system isn't sure, it says so and routes to a human agent rather than fabricating something plausible.
Lesson 2: The same question arrives in approximately 47 different phrasings
"What's your cancellation policy?" and "Can I get a refund?" and "I need to cancel my trip" all mean variations of the same thing. Mnemonic's semantic matching handled this variance gracefully: one canonical, accurate answer, served across many phrasings, without requiring the system to be explicitly trained on every possible way a human being might express mild anxiety about their holiday plans.
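To show the shape of the idea only: the sketch below matches phrasings with a bag-of-words cosine similarity, which is a toy stand-in for the embedding-based semantic matching Mnemonic actually performs (it would not connect "Can I get a refund?" to "cancellation policy"; real embeddings do). The function names and threshold are ours for the example.

```python
import math
from collections import Counter


def _vec(text):
    return Counter(text.lower().split())


def similarity(a, b):
    """Cosine similarity over word counts -- a toy stand-in for embeddings."""
    va, vb = _vec(a), _vec(b)
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0


def lookup(query, canonical_answers, threshold=0.5):
    """Serve one canonical answer across many phrasings of the same question."""
    best_q, best_score = None, 0.0
    for known_q in canonical_answers:
        score = similarity(query, known_q)
        if score > best_score:
            best_q, best_score = known_q, score
    if best_score >= threshold:
        return canonical_answers[best_q]
    return None  # no confident match: fall through to the full pipeline
```

The important property is the fallback: an unmatched phrasing goes to the full pipeline, never to a guessed answer.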
Lesson 3: Peak load is not evenly distributed
Travel queries spike around booking periods, which are themselves clustered around pay days, school holiday announcements, and the precise moment someone sees a particularly tempting Instagram post. The system needed to scale elastically. Cloud deployment handled this. Cache hit rates reduced load during peak times. Nobody's question about Malaga in August went unanswered because everyone else was also asking about Malaga in August.
Metrics After 6 Months
96% customer satisfaction score
78% reduction in human escalations
0.04% error rate on booking information
£1.8M annualised support cost savings
Common Patterns Across Five Industries
After deploying across five sectors, patterns emerged with the kind of reliable consistency that suggests we might actually know what we're doing. We're cautiously optimistic on this point.
Pattern 1: Trust Through Transparency
Every industry valued seeing how AI reached its conclusions. Black boxes don't work for mission-critical operations. They don't really work anywhere, but most people only discover this when something goes wrong at an inconvenient moment.
Pattern 2: Admitting Uncertainty Builds Credibility
AI that says "I don't know" is more trustworthy than AI that invents something plausible. Every client, in every industry, arrived at this conclusion independently, which suggests it is simply true rather than a preference.
Pattern 3: Speed Matters More Than Marginal Quality
A good answer in 3 seconds beats a perfect answer in 30 seconds. Decision velocity is a real competitive advantage. The perfect is the enemy of the timely, and the untimely is the enemy of the practical.
Pattern 4: Privacy Is a Requirement, Not a Premium Feature
Once organisations understood they could keep their data on their own infrastructure, they stopped considering it optional. The question shifted from "is this possible?" to "why isn't everyone doing this?" We are still working on a satisfying answer to the second question.
Pattern 5: Production Means 95%+ or It Means Nothing
In production environments, 85% accuracy is not good enough. The industries that know this tend to be the ones where the 15% failure rate has a name and a legal definition. We deliver 98%+. This is not a marketing claim. It is a requirement for being in any of these sectors at all.
The Team Behind It
None of this happens without the right people. Our team includes engineers who question assumptions, think outside conventional architectures, prioritise real-world impact over theoretical elegance, and iterate quickly based on client feedback rather than defending their original designs on the grounds that they spent a long time on them.
We are not precious about our initial designs. We are not afraid to throw out code when a better approach emerges. Every deployment has taught us something. Every client has made the product better. This is occasionally humbling and invariably useful.
(We are also, it must be said, quite clever. But in a quietly confident, self-aware way rather than in the way of someone who needs to mention it in every meeting. We are mentioning it now in a blog post, which is different and completely fine.)
What's Next
These five deployments are the beginning, not the summary. We are working on industry-specific model pools, privacy-preserving federated learning across clients, advanced agentic integration, and real-time collaborative AI. Whether any of this will arrive in the order listed is a question the universe prefers not to answer in advance.
More importantly: we are listening. Every client teaches us something about how AI should work in production environments. Every deployment reveals optimisation opportunities that no whiteboard session would have identified, because whiteboards do not have compliance departments or fleet management systems or customers asking about Malaga in August.
The product in April 2025 was good. The product in March 2026 is significantly better. The product in April 2026 will be better still. Predicting beyond that is the kind of thing you do at the beginning of a project, before you've met the clients.
What This Means If You're Considering AI Deployment
Don't settle for "good enough." Production AI needs production quality. Pilots are encouraging. Mission-critical operations are the real test, and they are not interested in your pilot's success metrics.
Privacy is achievable. You do not have to send your data to external providers. On-premise AI is real, capable, and increasingly cost-effective. The fact that most vendors prefer you not to know this is, you will note, not a coincidence.
Speed and accuracy are not a trade-off. With the right architecture, you get both. The architecture is the work. Nobody said it was simple.
Start with real problems. Deploy AI because it solves problems you actually have, not because someone at a conference made it sound inevitable. It may well be inevitable. That's still not a good reason to skip the "does this solve a problem?" question.
Next: The team and methodology that made this possible -- and the decisions that almost didn't.
Can't wait for the next exciting instalment?
Why not try the thing we've been talking about for five industry case studies? It's free. Your query monkeys will thank you.
Click here to try AskDiana for FREE

Coming Soon: The Team and Methodology Behind It All