Genius2: The Engineering Journey from Concept to Production in 90 Days


Dubai, January 2025. My team and I are in a conference room, whiteboarding solutions to the hallucination problem.

Every approach we sketched felt like putting a band-aid on a bullet wound. Fine-tuning? Helps, but limited. Confidence scores? Unreliable. Human verification? Defeats the purpose.

That night, while phoning home to Athens, I realised:

"What if we don't try to make one AI smarter? What if we make multiple AIs reach consensus? What if we implement an AI democracy?"

The Core Insight

The genius of Genius2 (pun intended) is built on a simple statistical principle:

All AI systems will hallucinate, but independent AIs won't hallucinate the same information.

Think about it: the Common Crawl notwithstanding, GPT-4 is trained by OpenAI on their data. Claude is trained by Anthropic on different data. Gemini uses Google's training approach. Grok has X's datasets. Open-source models like Llama have entirely different foundations.

If you ask them the same question and they all give you the same answer, statistically, that answer is almost certainly correct.

If they disagree? You know you have a problem. Which is valuable information in itself.

The Prototype: February 2025

Within a month, we had a working prototype. The architecture was elegant:

Stage 1: Parallel Query Execution

One question in → Eight simultaneous API calls to different LLMs → Eight independent responses.

[Image: The local Genius2 model library showing eleven LLMs running on-premises via Ollama]

The local model library: eleven LLMs running on-premises. No API costs, no data leaving your infrastructure. Just a very busy server and a mild electricity bill.
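The fan-out can be sketched in a few lines of Python. This is a minimal illustration, not the production code: `query_model` is a stub standing in for a real client call (an Ollama request or a vendor SDK), and the model names are the ones mentioned later in the article.

```python
from concurrent.futures import ThreadPoolExecutor

MODELS = ["qwen", "llama3", "gemma", "deepseek"]

def query_model(model: str, question: str) -> str:
    """Stub for a real API/Ollama call so the sketch is runnable."""
    return f"[{model}] answer to: {question}"

def fan_out(question: str) -> dict[str, str]:
    """One question in -> N simultaneous calls -> N independent responses."""
    with ThreadPoolExecutor(max_workers=len(MODELS)) as pool:
        futures = {model: pool.submit(query_model, model, question)
                   for model in MODELS}
        return {model: future.result() for model, future in futures.items()}

responses = fan_out("What is the boiling point of water at sea level?")
```

In a real deployment each stub becomes a network call, which is exactly why the calls run in a thread pool: the total latency is that of the slowest model, not the sum of all of them.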

Stage 2: Similarity Analysis

  • Convert responses to comparable vectors using TF-IDF
  • Calculate cosine similarity between each pair of responses
  • Build a similarity matrix showing agreement
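The three bullets above can be sketched in plain Python. A production system would more likely use a library such as scikit-learn; the toy documents here are purely illustrative.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Convert texts into TF-IDF vectors over a shared vocabulary."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({w for doc in tokenized for w in doc})
    n = len(docs)
    # Smoothed IDF so terms appearing in every document don't zero out.
    df = {w: sum(w in doc for doc in tokenized) for w in vocab}
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vectors.append([tf[w] / len(doc) * idf[w] for w in vocab])
    return vectors

def cosine(a, b):
    """Cosine similarity: 0 (completely different) to 1 (identical)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

docs = ["water boils at 100 C",
        "water boils at 100 C at sea level",
        "the capital of France is Paris"]
vecs = tfidf_vectors(docs)
matrix = [[cosine(a, b) for b in vecs] for a in vecs]
```

The resulting `matrix` is the similarity matrix: the first two answers score high against each other, while the off-topic third answer scores near zero against both.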

Stage 3: Consensus Detection

  • Use graph-based centrality analysis
  • Identify which response has the strongest agreement with others
  • Apply confidence thresholds (85%+)
  • Return the consensus answer or flag disagreement
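The detection step reduces to a few lines over the similarity matrix from the previous stage. In this sketch "centrality" is simply mean agreement with the other responses, a deliberate simplification of the graph analysis; the 85% threshold is the figure from the list above.

```python
def consensus(matrix, threshold=0.85):
    """Return (winner_index, confidence), or (None, confidence)
    when the responses disagree too much to call a consensus."""
    n = len(matrix)
    # Agreement with everyone else, excluding self-similarity.
    centrality = [sum(row) - row[i] for i, row in enumerate(matrix)]
    winner = max(range(n), key=lambda i: centrality[i])
    confidence = centrality[winner] / (n - 1)  # mean pairwise agreement
    if confidence >= threshold:
        return winner, confidence
    return None, confidence  # flag disagreement instead of guessing

matrix = [
    [1.00, 0.92, 0.90],
    [0.92, 1.00, 0.88],
    [0.90, 0.88, 1.00],
]
winner, conf = consensus(matrix)
```

With these numbers the first response wins with confidence 0.91; had the mean agreement fallen below 0.85, the function would have returned `None` and the disagreement would be surfaced to the user instead.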

The Maths

Here's the technical bit:

We use TF-IDF vectorization (Term Frequency-Inverse Document Frequency) to convert each text response into a numerical vector. This captures not just which words appear, but how heavily each one is weighted relative to the other responses. Then cosine similarity measures how closely aligned each response is with every other response. This gives us a similarity score from 0 (completely different) to 1 (identical).

Finally, graph-based centrality treats responses as nodes in a network. Responses with high similarity to multiple other responses become central nodes. The most central node wins -- it's the consensus answer.
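One way to realise the centrality step is power iteration over the similarity matrix: a pure-Python stand-in for eigenvector centrality (a graph library such as networkx would do the same job). Responses that agree strongly with other well-connected responses float to the top; an outlier sinks.

```python
def eigenvector_centrality(matrix, iters=50):
    """Power iteration: repeatedly weight each node by the scores
    of the nodes it is similar to, then renormalise."""
    n = len(matrix)
    scores = [1.0] * n
    for _ in range(iters):
        new = [sum(matrix[i][j] * scores[j] for j in range(n) if j != i)
               for i in range(n)]
        norm = max(new) or 1.0
        scores = [s / norm for s in new]
    return scores

# Three agreeing responses and one outlier (values are illustrative).
sims = [
    [1.0, 0.90, 0.80, 0.10],
    [0.9, 1.00, 0.85, 0.15],
    [0.8, 0.85, 1.00, 0.20],
    [0.1, 0.15, 0.20, 1.00],
]
scores = eigenvector_centrality(sims)
winner = scores.index(max(scores))
```

The outlier (index 3) ends up with a far lower score than any member of the agreeing cluster, so it can never be chosen as the consensus answer.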

Why This Works

The statistical probability of multiple independent AI systems generating the same hallucination is exponentially lower than one system hallucinating.

If one LLM invents properties for compound XR-4471, that's a hallucination with a 10-15% probability.

If eight LLMs independently invent the same properties for a compound that doesn't exist? That's (0.15)^8 ≈ 0.0000256% -- roughly a one-in-3.9-million chance.

Essentially impossible.
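The arithmetic, for the sceptical. The crucial assumption is that the eight errors are independent; to the extent the models share training data, the true probability is somewhat higher.

```python
p_single = 0.15              # upper end of the single-model hallucination rate
p_all_eight = p_single ** 8  # all eight make the same error, independently
print(f"{p_all_eight:.2e}")  # -> 2.56e-07, about 1 in 3.9 million
```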

The Tiebreaker Hierarchy

What if models disagree? We have a multi-level tiebreaker:

  1. Sum of similarities: Which response is closest to the most other responses?
  2. Response length: Longer, more comprehensive responses preferred
  3. Keyword matching: Which response contains more words from the original question?

This ensures we always return something, with transparency about confidence levels.
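The three levels map naturally onto lexicographic tuple comparison in Python: later keys only matter when the earlier ones tie exactly. This is a sketch under that assumption; the names and toy data are illustrative.

```python
def break_tie(candidates, matrix, question):
    """Rank candidate indices by (1) sum of similarities,
    (2) response length, (3) overlap with the question's words.
    `candidates` maps index -> response text; `matrix` is the
    similarity matrix from Stage 2."""
    q_words = set(question.lower().split())

    def key(i):
        sim_sum = sum(matrix[i]) - matrix[i][i]      # level 1
        length = len(candidates[i])                  # level 2
        overlap = len(q_words & set(candidates[i].lower().split()))  # level 3
        return (sim_sum, length, overlap)

    return max(candidates, key=key)

responses = {0: "Water boils at 100 C at sea level.",
             1: "It boils at 100 C.",
             2: "Paris."}
matrix = [[1.0, 0.9, 0.1],
          [0.9, 1.0, 0.1],
          [0.1, 0.1, 1.0]]
best = break_tie(responses, matrix, "At what temperature does water boil?")
```

Here responses 0 and 1 tie on summed similarity, so the second level kicks in and the longer, more comprehensive answer wins.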

But surely you can have an LLM pick the best answer from your results? Excellent point -- we could. And that would be as useful as putting a dictator in charge of a country's voting process. I digress...

April 2025: Production

Prototype to production in two months. Here's what we learned:

Challenge 1: Latency

Querying eight LLMs for every question sounds slow. Solution: run them in parallel with a 35-second timeout and dynamic grace periods. Most queries complete in 3-5 seconds. Do I care if an LLM is taking too long? Nope. Sorry, the voting window is closed.
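The "voting window" idea can be sketched with `concurrent.futures.wait`. The 35-second figure from the text is the default; this demo shrinks it to half a second so the deliberately slow straggler visibly misses the window. Model names and delays are made up.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

def query_model(model: str, question: str, delay: float) -> str:
    """Stub with an artificial delay standing in for a real call."""
    time.sleep(delay)
    return f"{model}: answer"

def collect_votes(question: str, models: dict[str, float],
                  timeout: float = 35.0) -> dict[str, str]:
    """Close the voting window after `timeout` seconds; late models
    simply don't get a vote."""
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        futures = {pool.submit(query_model, m, question, d): m
                   for m, d in models.items()}
        done, late = wait(futures, timeout=timeout)
        for f in late:
            f.cancel()  # running calls can't be interrupted, but either
                        # way they are excluded from the result below
        return {futures[f]: f.result() for f in done}

# Two fast "models" vote; the slow one misses the 0.5 s demo window.
votes = collect_votes("q", {"fast_a": 0.01, "fast_b": 0.01, "slow": 2.0},
                      timeout=0.5)
```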

Challenge 2: Cost

Eight API calls per query is expensive. Solution: own the infrastructure -- don't pay API call costs.

Challenge 3: Model Selection

Which models? Solution: configurable. Start with qwen, llama3, gemma, deepseek -- add more as you need. Dynamically adjust based on performance and cost.

[Image: The Genius2 Research Lab interface showing model selection with gpt-3.5-turbo, gemini, grok, gemma and qwen specimens]

The Genius2 Research Lab -- select your specimens, define your hypothesis, generate responses. The "LAB NOTES" section politely reminds you this is for testing purposes only. We are nothing if not responsible.

The Transparency Advantage

Unlike black-box AI, Genius2 shows its work:

  • Which models were consulted
  • What each model said
  • The similarity scores
  • The confidence level
  • The winning response

Users can see the consensus process. Build trust. Understand when to question results.

[Image: Genius2 evaluation-in-progress screen showing six models processing in parallel with progress bars]

Six models, one question, zero secrets. All of them simultaneously working through "Analyse the implications of quantum computing on modern cryptography standards." Heavy stuff for a Wednesday afternoon.

Real-World Performance

Pharmaceutical client, April 2025:

Before Genius2: 10-15% hallucination rate on compound queries

After Genius2: no measurable hallucination rate

[Image: Genius2 results view showing similarity matrix, vote counts, and evaluation metrics per model]

The results view: similarity matrix, vote counts, evaluation metrics. The maths of consensus made visible. If you've ever wanted to see democracy rendered as a bar chart, this is your moment.

Travel client, same month:

Before: AI invented booking references, policy details, availability. After: when the system doesn't know, it says so clearly. When it answers, it's verifiable.

The Team That Built It

Here's where I need to be honest: I didn't do this alone.

This level of engineering requires outside-the-box thinking from people who don't accept "that's just how it works" as an answer.

My team includes brilliant engineers who'd rather solve hard problems than take the easy path. People who see architectural challenges as puzzles, not roadblocks.

We're not special. We're just stubborn enough to keep asking "why not?" when everyone else says "can't be done."

(And okay, maybe we're a little bit special. But modestly special. Well, 8% modest, 92% special.)

The Architecture Today

Current Genius2 production stats:

8+ LLMs queried per request (architecture supports thousands)

85%+ confidence threshold for consensus

95%+ accuracy on validated responses

<2% hallucination rate vs industry standard 10-15%

Integration Points

Genius2 isn't a standalone tool. It's the intelligence layer for:

  • AskDiana: Natural language business intelligence
  • Prometheus: Prompt management and optimisation
  • Custom integrations: Via RESTful API

What's Next

We're working on:

  • Expanding to 40+ models in rotation
  • Industry-specific model pools (medical models for healthcare, legal models for law, etc.)
  • Agentic integration for autonomous AI systems
  • Real-time model performance tracking and optimisation

The Lesson

Sometimes the best solution isn't making one thing better. It's making multiple imperfect things work together.

One AI will hallucinate. Eight won't hallucinate the same fiction.

That's not just engineering. That's philosophy applied to machine learning.

Next up: The privacy architecture that keeps your data on your infrastructure.

Here is the bloody sacrifice to the gods of marketing, please click to save me from eternal damnation:

Try it out for free: askdiana.ai