How Multi-Model Consensus Prevents Hallucinations in Business-Critical Applications
This document is a starting point for discussion, not a product specification. The ideas and architectures described here are exploratory and subject to change as DAITK™ development progresses.
Executive Summary
Large Language Models (LLMs) have revolutionized business automation, but their tendency to "hallucinate" false information poses serious risks in mission-critical applications. This paper presents a voting-based approach inspired by NASA's spacecraft redundancy systems, where multiple LLMs process the same query and vote on results. This methodology dramatically improves accuracy while automatically flagging potential errors for human review.
1. Introduction: The Cost of Being Wrong
In government procurement, marine hardware specification, and defence contracting, accuracy isn't just important — it's legally and financially critical. A single error can result in:
- Financial Loss: Wrong quantities or specifications in DND contracts can cost thousands in returns, restocking, or contract penalties
- Safety Issues: Incorrect material grades for anchor chains or marine hardware can lead to equipment failure
- Legal Violations: Export control misclassification (ITAR, EAR) can result in severe penalties
- Reputation Damage: Delivering incorrect specifications to government clients damages long-term relationships
- Operational Delays: Wrong parts halt production lines and delay mission-critical operations
A single LLM misreading an NSN (National Stock Number) or confusing a part number digit can cascade into a procurement disaster. When dealing with items like corrosion preventive anodes (NSN 5340-15-017-5426) or specialized marine hardware, there's zero margin for AI hallucination.
2. The NASA Lesson: Redundancy Over Perfection
From Single Points of Failure to Voting Systems
In the early days of spaceflight, NASA faced an impossible challenge: computers weren't reliable enough to trust with human lives, yet missions were too complex to fly without them. The solution wasn't to build one perfect computer — it was to build multiple computers that checked each other's work.
Apollo Era (1960s–1970s): Each spacecraft module had one computer that simply had to work. Weight constraints made redundancy impractical. The IBM Launch Vehicle Digital Computer for the Saturn V rocket used Triple Modular Redundancy at the circuit level — three circuits solving the same equation, with the majority answer winning.
Space Shuttle Era (1981–2011): Five IBM AP-101 computers flew on every mission. Four ran identical software and voted on every decision. If one disagreed, it was outvoted and disabled. The fifth ran independently developed software as a backup in case all four failed together.
The Core Insight
NASA's breakthrough wasn't building perfect systems — it was building systems that could detect their own failures through consensus. When computers agreed, confidence was high. When they disagreed, automatic alerts triggered human intervention.
This same principle applies perfectly to AI/LLM deployment in business-critical applications.
3. The LLM Hallucination Problem
What Are Hallucinations?
LLM "hallucinations" occur when AI systems generate plausible-sounding but factually incorrect information. Unlike random errors, hallucinations are particularly dangerous because they appear confident and well-formatted, making them difficult to spot without domain expertise.
Common Hallucination Types in Business Applications
- Number Transposition: Reading "M2700512" as "M2750012"
- Specification Confusion: Confusing Grade 3 with Grade 4 steel specifications
- Context Misinterpretation: Misunderstanding technical abbreviations or industry jargon
- Fabricated Details: Adding plausible but nonexistent information to fill gaps
- Unit Conversion Errors: Incorrectly converting between metric and imperial measurements
Input: "Supply 100 ea. Anode, Corrosion Preventive, NSN 5340-15-017-5426,
P/N M2700512, NCAGE A5900, Manufacturer: ONDA SP (Italy)"
Single LLM Output:
- NSN: 5340-15-017-5426 ✓
- Part Number: M2750012 ✗ (transposed digits)
- Quantity: 100 ✓
- Cage Code: A5900 ✓
Result: Wrong part ordered, $8,500 loss + procurement delay

4. Voting LLM Systems: The Modern Application
The Architecture
User Query: "Extract NSN, P/N, and quantity from solicitation"
                        ↓
    ┌───────────┬───────┴───────┬───────────┐
    │           │               │           │
┌───▼───┐   ┌───▼───┐       ┌───▼───┐   ┌───▼───┐
│ LLM 1 │   │ LLM 2 │       │ LLM 3 │   │ LLM 4 │ (optional)
│Claude │   │ GPT-4 │       │Gemini │   │Claude │
└───┬───┘   └───┬───┘       └───┬───┘   └───┬───┘
    │           │               │           │
    └───────────┴───────┬───────┴───────────┘
                        ↓
                 Voting Engine
              (Compare Responses)
                        ↓
              ┌─────────┴─────────┐
              │    All Agree?     │
              └───┬───────────┬───┘
              YES │           │ NO
                  ↓           ↓
            Automatic      Flag for
            Approval     Human Review

Implementation Strategies
Same Model, Different Parameters
Use the same LLM three times with different temperature settings. Low cost, catches random variations and edge-case hallucinations.
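This strategy can be sketched as a loop over sampling temperatures. The sketch below is illustrative, not a production implementation: `callLLM` is a stub standing in for a real provider call (a real version would mirror the generic helper used in the Section 8 pseudocode).

```javascript
// Strategy sketch: query one model at several temperatures and treat each
// completion as an independent vote. callLLM is a stub so the sketch runs
// standalone; a real implementation would call the provider's API.
async function callLLM(model, prompt, options = {}) {
  return JSON.stringify({ partNumber: 'M2700512' }); // stubbed response
}

async function sameModelVotes(prompt, model = 'claude-3-5-sonnet') {
  const temperatures = [0.0, 0.4, 0.8]; // vary sampling to surface instability
  return Promise.all(
    temperatures.map(temperature => callLLM(model, prompt, { temperature }))
  );
}
```

If all three completions agree, the extraction is stable under sampling; divergence at higher temperatures is a cheap early signal of an unreliable field.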
Best for: High-volume, lower-risk tasks

Different Models
Use three different LLM providers: Claude, GPT-4, Gemini. Each model has different training data and failure modes. Disagreement indicates genuine ambiguity.
Best for: Critical business processes

Hybrid Approach
Different models for different field types. Simple fields get single model (3 runs), critical fields get different models (3–5 runs).
Best for: Cost optimization

5. Practical Implementation at Dibblee Industries
Use Case 1: DND Contract Processing
Processing Results (47-page DND solicitation for compressor units):
LLM 1 (Claude): NSN 4310-01-234-5678, P/N CP-8800-12, Qty: 24, Date: 2026-06-15
LLM 2 (GPT-4): NSN 4310-01-234-5678, P/N CP-8800-12, Qty: 24, Date: 2026-06-15
LLM 3 (Gemini): NSN 4310-01-234-5678, P/N CP-8800-12, Qty: 24, Date: 2026-06-15
VOTING RESULT: ✓ CONSENSUS — Automatic approval
Confidence: 100% | Human review: NOT REQUIRED

Material field extraction:
LLM 1 (Claude): Material: "316 Stainless Steel"
LLM 2 (GPT-4): Material: "316 Stainless Steel"
LLM 3 (Gemini): Material: "304 Stainless Steel"
VOTING RESULT: ⚠ DISAGREEMENT DETECTED
Confidence: 67% | Human review: REQUIRED
Flag: Material specification ambiguous in source document

Use Case 2: Technical Drawing Interpretation
When processing technical drawings for supplier distribution:
- Each LLM analyzes the drawing and extracts dimensions, tolerances, and material callouts
- Voting consensus validates measurements and specifications
- Disagreements flag ambiguous or poorly-scanned drawings for human review
- Result: Reduced supplier queries and improved first-time-right manufacturing
Use Case 3: Export Control Classification
Product: Naval anchor chain, Grade 4, 2-inch links
LLM 1: "ITAR controlled — military application"
LLM 2: "EAR99 — commercial item"
LLM 3: "ITAR controlled — military application"
VOTING RESULT: ⚠ NO CONSENSUS (2-1 split)
Action: MANDATORY human review by compliance officer
Risk Level: CRITICAL — legal penalties possible

Outcome: Human expert determines item is ITAR-controlled when sold to DND for military vessels, but EAR99 for commercial marine use. The disagreement correctly identified the complexity requiring expert review.
6. Cost-Benefit Analysis
The Math Behind Voting Systems
| Metric | Single LLM | 3-LLM Voting | 5-LLM Voting |
|---|---|---|---|
| Accuracy (when confident) | 85–95% | 98–99% | 99.5%+ |
| API Cost per Query | $0.10 | $0.30 | $0.50 |
| Hallucination Detection | 0% | ~90% | ~95% |
| Human Review Rate | 100% | 10–20% | 5–10% |
| Processing Time | 2–5s | 5–10s | 8–15s |
Real Cost Comparison: Monthly DND Processing
| Approach | API Costs | Human Review | Error Risk | Total / Month |
|---|---|---|---|---|
| Single LLM + Full Review | $50 | 200 hrs @ $75 = $15,000 | $2,000 | $17,050 |
| 3-LLM Voting | $150 | 75 hrs @ $75 = $5,625 | $200 | $5,975 |
| 5-LLM Voting (Critical) | $250 | 40 hrs @ $75 = $3,000 | $50 | $3,300 |
You're not paying 3× the cost for 3× the processing. You're paying 3× the API cost to reduce human review time by 80% while simultaneously improving accuracy from 90% to 99%. The math is overwhelmingly favorable for any business-critical application.
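The table's totals can be sanity-checked with the simple cost model behind them (API cost + review hours × hourly rate + expected error cost):

```javascript
// Monthly cost model behind the comparison table above.
function monthlyTotal({ apiCost, reviewHours, hourlyRate, errorRisk }) {
  return apiCost + reviewHours * hourlyRate + errorRisk;
}

// 3-LLM voting row: $150 + 75 h × $75 + $200 = $5,975
```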
7. Implementation Roadmap
Proof of Concept
- Select one high-value, high-volume process (e.g., DND contract extraction)
- Implement 3-LLM voting with Claude, GPT-4, and Gemini
- Process 50 historical contracts and compare results to known-correct data
- Measure accuracy, disagreement rate, and processing time
Production Deployment
- Build voting engine middleware to orchestrate LLM calls
- Create dashboard for reviewing flagged disagreements
- Establish confidence thresholds (100% = auto-approve, 67% = review, 33% = reject)
- Integrate with existing MCP server architecture and Xero workflow
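The threshold rule in the steps above can be sketched as a small routing function. The cut-offs here are one possible reading: with three voters, 2-of-3 agreement computes to roughly 67%, so "majority but not unanimous" routes to review.

```javascript
// Route a field by its consensus confidence (percentage of agreeing votes).
function routeByConfidence(confidencePct) {
  if (confidencePct === 100) return 'auto-approve'; // unanimous
  if (confidencePct > 50) return 'human-review';    // majority, e.g. 2 of 3 (~67%)
  return 'reject';                                  // no majority, e.g. a 1-1-1 split
}
```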
Scaling and Optimization
- Expand to additional use cases (technical drawings, export classification, invoice validation)
- Implement hybrid strategy: 3 models for critical fields, single model for simple fields
- Build feedback loop: track which disagreements were actual errors vs. source ambiguity
Advanced Features
- Specialized models for specific domains
- Chain-of-thought voting where LLMs explain reasoning before voting
- Audit trail for compliance documentation
- Client-facing API for customers to leverage the voting system
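As one hypothetical shape for the chain-of-thought voting feature above, the extraction prompt could be wrapped so each model must justify its answer before emitting the final value; everything in this sketch (the wrapper name and the last-line JSON convention) is an assumption, not a settled design.

```javascript
// Hypothetical prompt wrapper for chain-of-thought voting: each model
// explains its reasoning first, then emits only the final answer as JSON,
// which the voting engine would parse from the last line of the response.
function chainOfThoughtPrompt(basePrompt) {
  return [
    basePrompt,
    'First, explain your reasoning step by step.',
    'Then output ONLY the final answer as a single JSON object on the last line.'
  ].join('\n');
}
```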
8. Technical Architecture Details
Voting Engine Pseudocode
async function votingLLMQuery(prompt, field_criticality) {
  // Critical fields get 5 votes (duplicating the two strongest models);
  // routine fields get one vote from each of three providers.
  const models = field_criticality === 'critical'
    ? ['claude-3-5-sonnet', 'gpt-4', 'gemini-pro', 'claude-3-5-sonnet', 'gpt-4']
    : ['claude-3-5-sonnet', 'gpt-4', 'gemini-pro'];

  // Query all models in parallel, then parse each response into named fields.
  // callLLM and parseStructuredOutput are placeholders for the provider call
  // and the structured-output parser.
  const responses = await Promise.all(
    models.map(model => callLLM(model, prompt))
  );
  const parsed = responses.map(r => parseStructuredOutput(r));

  // Tally votes field by field.
  const results = {};
  for (const field of Object.keys(parsed[0])) {
    const votes = parsed.map(p => p[field]);
    const consensus = findConsensus(votes);
    results[field] = {
      value: consensus.value,
      confidence: consensus.percentage,
      requires_review: consensus.percentage < 100, // any dissent flags review
      all_votes: votes
    };
  }
  return results;
}

function findConsensus(votes) {
  // Count how many models produced each distinct value.
  const frequency = {};
  votes.forEach(v => {
    frequency[v] = (frequency[v] || 0) + 1;
  });
  // The most common value wins; its share of the votes is the confidence.
  const max = Math.max(...Object.values(frequency));
  const winner = Object.keys(frequency).find(k => frequency[k] === max);
  return {
    value: winner,
    percentage: (max / votes.length) * 100,
    vote_breakdown: frequency
  };
}

// Example: the 2-1 material split from Use Case 1 yields
// findConsensus(['316 Stainless Steel', '316 Stainless Steel', '304 Stainless Steel'])
//   → value '316 Stainless Steel', percentage ≈67, requires_review true

Integration with Existing Systems
- MCP Server Integration: Voting engine exposed as MCP tool for Claude to call
- Database Layer: Store all votes, disagreements, and resolutions in PostgreSQL for audit trail
- Xero Integration: Validated contract data flows directly into invoice generation
- Human Review Dashboard: Web interface showing flagged items with side-by-side LLM responses
- Monitoring: Track consensus rates, processing costs, and accuracy metrics over time
9. Measuring Success
Key Performance Indicators
| KPI | Target | Measurement Method |
|---|---|---|
| Accuracy (Consensus Items) | >99% | Random sampling + human verification |
| Hallucination Detection | >90% | Inject known-error test cases |
| Human Review Reduction | >75% | Compare review hours pre/post |
| Cost per Document | <$5.00 | API costs + (human hours × rate) |
| Time to Process | <2 min | Timestamp tracking |
| Error Cost Reduction | >95% | Track procurement errors |
10. Risk Mitigation and Fallback Strategies
Risk: If source document is misleading, all LLMs might confidently agree on wrong answer.
Mitigation: Random sampling of 5% of "perfect consensus" items. Outlier detection for unusually formatted responses. Historical comparison against similar past contracts.
Risk: One LLM provider experiences downtime during critical processing.
Mitigation: Fallback model list. Queue system with exponential backoff. Manual override for urgent "2 out of 3" approval.
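A minimal sketch of the fallback-plus-backoff mitigation, assuming the same generic `callLLM` helper as the Section 8 pseudocode; `callLLM` is stubbed here (first provider down, fallback healthy) so the sketch runs standalone.

```javascript
// Stubbed provider call: gpt-4 is 'down', the fallback model answers.
async function callLLM(model, prompt) {
  if (model === 'gpt-4') throw new Error('provider down');
  return `${model}: ok`;
}

// Try each model in order; retry with exponential backoff before falling
// back to the next provider in the list.
async function callWithFallback(models, prompt, maxAttempts = 2, baseDelayMs = 1000) {
  for (const model of models) {
    for (let attempt = 0; attempt < maxAttempts; attempt++) {
      try {
        return await callLLM(model, prompt);
      } catch (err) {
        // Wait baseDelayMs, 2×baseDelayMs, 4×… before the next retry.
        await new Promise(res => setTimeout(res, baseDelayMs * 2 ** attempt));
      }
    }
    // This provider exhausted its retries: fall through to the next one.
  }
  throw new Error('all fallback models exhausted');
}
```

Queue-level concerns (persisting unprocessed documents, the urgent "2 out of 3" manual override) would sit above this helper rather than inside it.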
Risk: Some documents require many re-reads, multiplying costs.
Mitigation: Pre-processing to extract relevant sections. Tiered approach (simple fields use single LLM). Budget alerts on monthly API spend.
11. Competitive Advantage and Market Positioning
Implementing voting LLM systems positions Dibblee Industries as a technology leader in a traditionally conservative industry:
- Client Confidence: Government clients value reliability above all. 99%+ accuracy with automated hallucination detection builds trust.
- Faster Response Times: Same-day bid responses instead of multi-day turnarounds.
- Scalability: Handle 10× contract volume without proportional staff increases.
- Audit Trail: Complete record of all AI decisions and human reviews satisfies government compliance.
- Knowledge Transfer: System captures expert decisions, reducing dependency on individual staff.
- Service Offering: Potential to offer AI-validated procurement services to other distributors.
12. Conclusion: The Path Forward
The voting LLM approach represents a paradigm shift in how we think about AI reliability. Rather than pursuing the impossible goal of a single perfect AI, we embrace the NASA lesson: consensus reveals truth, and disagreement reveals uncertainty.
For Dibblee Industries, this isn't just about automation — it's about building a fundamentally more reliable procurement process than was ever possible with purely human operations.
- 99% accuracy on consensus items (vs. 90% with single LLM)
- 90% hallucination detection rate (vs. 0% with single LLM)
- 80% reduction in human review needs
- 65% reduction in total processing costs
- Near-elimination of costly procurement errors
Investment Required: 2–3× API costs
Return: 80% time savings + 95%+ error cost reduction
Payback Period: Immediate (first contract processed)
Next Steps
- Week 1: Select pilot project (recommend: DND contract extraction)
- Week 2: Implement 3-LLM voting proof-of-concept
- Week 3–4: Process 50 test cases and measure results
- Week 5–6: Build production voting engine and review dashboard
- Week 7+: Scale to additional use cases and optimize
NASA couldn't afford a single point of failure at 250,000 miles from Earth. Dibblee Industries can't afford hallucinations in government procurement. The voting LLM approach doesn't eliminate AI errors — it detects them automatically, ensuring human expertise is applied exactly where it's needed most.
That's not just automation. That's intelligent automation.