Voting LLM Systems: Lessons from NASA for AI Reliability

How Multi-Model Consensus Prevents Hallucinations in Mission-Critical Business Applications

Dibblee Industries Marine Hardware, Defence Procurement, Industrial Distribution February 2026 Bob Dibblee

This paper is a starting point for discussion — not a product specification. The ideas and architectures described here are exploratory and subject to change as we develop DAITK™.

99% Accuracy when 3 LLMs agree

90% Hallucination detection rate

80% Reduction in human review

3–5× Cost vs. single error cost

Executive Summary

Large Language Models (LLMs) have revolutionized business automation, but their tendency to "hallucinate" false information poses serious risks in mission-critical applications. This paper presents a voting-based approach inspired by NASA's spacecraft redundancy systems, where multiple LLMs process the same query and vote on results. This methodology dramatically improves accuracy while automatically flagging potential errors for human review.

1. Introduction: The Cost of Being Wrong

In government procurement, marine hardware specification, and defence contracting, accuracy isn't just important — it's legally and financially critical. A single error can result in:

Financial Loss: Wrong quantities or specifications in DND contracts can cost thousands in returns, restocking, or contract penalties
Safety Issues: Incorrect material grades for anchor chains or marine hardware can lead to equipment failure
Legal Violations: Export control misclassification (ITAR, EAR) can result in severe penalties
Reputation Damage: Delivering incorrect specifications to government clients damages long-term relationships
Operational Delays: Wrong parts halt production lines and delay mission-critical operations

Real-World Impact

A single LLM misreading an NSN (National Stock Number) or confusing a part number digit can cascade into a procurement disaster. When dealing with items like corrosion preventive anodes (NSN 5340-15-017-5426) or specialized marine hardware, there's zero margin for AI hallucination.

2. The NASA Lesson: Redundancy Over Perfection

From Single Points of Failure to Voting Systems

In the early days of spaceflight, NASA faced an impossible challenge: computers weren't reliable enough to trust with human lives, yet missions were too complex to fly without them. The solution wasn't to build one perfect computer — it was to build multiple computers that checked each other's work.

The Evolution of Spacecraft Computing

Apollo Era (1960s–1970s): Each spacecraft module had one computer that simply had to work. Weight constraints made redundancy impractical. The IBM Launch Vehicle Digital Computer for the Saturn V rocket used Triple Modular Redundancy at the circuit level — three circuits solving the same equation, with the majority answer winning.

Space Shuttle Era (1981–2011): Five IBM AP-101 computers flew on every mission. Four ran identical software and voted on every decision. If one disagreed, it was outvoted and disabled. The fifth ran independently developed software as a backup in case all four failed together.

The Core Insight

NASA's breakthrough wasn't building perfect systems — it was building systems that could detect their own failures through consensus. When computers agreed, confidence was high. When they disagreed, automatic alerts triggered human intervention.

This same principle applies perfectly to AI/LLM deployment in business-critical applications.

3. The LLM Hallucination Problem

What Are Hallucinations?

LLM "hallucinations" occur when AI systems generate plausible-sounding but factually incorrect information. Unlike random errors, hallucinations are particularly dangerous because they appear confident and well-formatted, making them difficult to spot without domain expertise.

Common Hallucination Types in Business Applications

Number Transposition: Reading "M2700512" as "M2750012"
Specification Confusion: Confusing Grade 3 with Grade 4 steel specifications
Context Misinterpretation: Misunderstanding technical abbreviations or industry jargon
Fabricated Details: Adding plausible but nonexistent information to fill gaps
Unit Conversion Errors: Incorrectly converting between metric and imperial measurements

Real Example: DND Contract Extraction

Input: "Supply 100 ea. Anode, Corrosion Preventive, NSN 5340-15-017-5426,
        P/N M2700512, NCAGE A5900, Manufacturer: ONDA SP (Italy)"

Single LLM Output:
- NSN: 5340-15-017-5426 ✓
- Part Number: M2750012 ✗ (transposed digits)
- Quantity: 100 ✓
- Cage Code: A5900 ✓

Result: Wrong part ordered, $8,500 loss + procurement delay

4. Voting LLM Systems: The Modern Application

The Architecture

System Flow

User Query: "Extract NSN, P/N, and quantity from solicitation"
                    ↓
    ┌───────────┼───────────┬───────────┐
    │           │           │           │
┌───▼───┐   ┌───▼───┐   ┌───▼───┐   ┌───▼───┐
│ LLM 1 │   │ LLM 2 │   │ LLM 3 │   │ LLM 4 │ (optional)
│Claude │   │ GPT-4 │   │Gemini │   │Claude │
└───┬───┘   └───┬───┘   └───┬───┘   └───┬───┘
    │           │           │           │
    └───────────┼───────────┴───────────┘
                ↓
        Voting Engine
     (Compare Responses)
                ↓
        ┌───────┴────────┐
        │  All Agree?    │
        └───┬────────┬───┘
        YES │        │ NO
            ↓        ↓
       Automatic  Flag for
       Approval   Human Review

Implementation Strategies

Strategy 1

Same Model, Different Parameters

Use the same LLM three times with different temperature settings. Low cost, catches random variations and edge-case hallucinations.

Best for: High-volume, lower-risk tasks

Recommended

Different Models

Use three different LLM providers: Claude, GPT-4, Gemini. Each model has different training data and failure modes. Disagreement indicates genuine ambiguity.

Best for: Critical business processes

Strategy 3

Hybrid Approach

Different models for different field types. Simple fields get single model (3 runs), critical fields get different models (3–5 runs).

Best for: Cost optimization

5. Practical Implementation at Dibblee Industries

Use Case 1: DND Contract Processing

Scenario: Extracting Data from Government Solicitation E18007

Processing Results (47-page DND solicitation for compressor units):

LLM 1 (Claude):  NSN 4310-01-234-5678, P/N CP-8800-12, Qty: 24, Date: 2026-06-15
LLM 2 (GPT-4):   NSN 4310-01-234-5678, P/N CP-8800-12, Qty: 24, Date: 2026-06-15
LLM 3 (Gemini):  NSN 4310-01-234-5678, P/N CP-8800-12, Qty: 24, Date: 2026-06-15

VOTING RESULT: ✓ CONSENSUS — Automatic approval
Confidence: 100% | Human review: NOT REQUIRED

Disagreement Example

LLM 1 (Claude):  Material: "316 Stainless Steel"
LLM 2 (GPT-4):   Material: "316 Stainless Steel"
LLM 3 (Gemini):  Material: "304 Stainless Steel"

VOTING RESULT: ⚠ DISAGREEMENT DETECTED
Confidence: 67% | Human review: REQUIRED
Flag: Material specification ambiguous in source document

Use Case 2: Technical Drawing Interpretation

When processing technical drawings for supplier distribution:

Each LLM analyzes the drawing and extracts dimensions, tolerances, and material callouts
Voting consensus validates measurements and specifications
Disagreements flag ambiguous or poorly-scanned drawings for human review
Result: Reduced supplier queries and improved first-time-right manufacturing

Use Case 3: Export Control Classification

Critical Decision: Is this item export-controlled?

Product: Naval anchor chain, Grade 4, 2-inch links

LLM 1: "ITAR controlled — military application"
LLM 2: "EAR99 — commercial item"
LLM 3: "ITAR controlled — military application"

VOTING RESULT: ⚠ NO CONSENSUS (2-1 split)
Action: MANDATORY human review by compliance officer
Risk Level: CRITICAL — legal penalties possible

Outcome: Human expert determines item is ITAR-controlled when sold to DND for military vessels, but EAR99 for commercial marine use. The disagreement correctly identified the complexity requiring expert review.

6. Cost-Benefit Analysis

The Math Behind Voting Systems

Metric	Single LLM	3-LLM Voting	5-LLM Voting
Accuracy (when confident)	85–95%	98–99%	99.5%+
API Cost per Query	$0.10	$0.30	$0.50
Hallucination Detection	0%	~90%	~95%
Human Review Rate	100%	10–20%	5–10%
Processing Time	2–5s	5–10s	8–15s

Real Cost Comparison: Monthly DND Processing

Approach	API Costs	Human Review	Error Risk	Total / Month
Single LLM + Full Review	$50	200 hrs @ $75 = $15,000	$2,000	$17,050
3-LLM Voting	$150	75 hrs @ $75 = $5,625	$200	$5,975
5-LLM Voting (Critical)	$250	40 hrs @ $75 = $3,000	$50	$3,300

The Key Insight

You're not paying 3× the cost for 3× the processing. You're paying 3× the API cost to reduce human review time by 80% while simultaneously improving accuracy from 90% to 99%. The math is overwhelmingly favorable for any business-critical application.

7. Implementation Roadmap

Phase 1 · Weeks 1–2

Proof of Concept

Select one high-value, high-volume process (e.g., DND contract extraction)
Implement 3-LLM voting with Claude, GPT-4, and Gemini
Process 50 historical contracts and compare results to known-correct data
Measure accuracy, disagreement rate, and processing time

Phase 2 · Weeks 3–6

Production Deployment

Build voting engine middleware to orchestrate LLM calls
Create dashboard for reviewing flagged disagreements
Establish confidence thresholds (100% = auto-approve, 67% = review, 33% = reject)
Integrate with existing MCP server architecture and Xero workflow

Phase 3 · Weeks 7–12

Scaling and Optimization

Expand to additional use cases (technical drawings, export classification, invoice validation)
Implement hybrid strategy: 3 models for critical fields, single model for simple fields
Build feedback loop: track which disagreements were actual errors vs. source ambiguity

Phase 4 · Month 4+

Advanced Features

Specialized models for specific domains
Chain-of-thought voting where LLMs explain reasoning before voting
Audit trail for compliance documentation
Client-facing API for customers to leverage the voting system

8. Technical Architecture Details

Voting Engine Pseudocode

voting-engine.ts

async function votingLLMQuery(prompt, field_criticality) {
    const models = field_criticality === 'critical'
        ? ['claude-3-5-sonnet', 'gpt-4', 'gemini-pro', 'claude-3-5-sonnet', 'gpt-4']
        : ['claude-3-5-sonnet', 'gpt-4', 'gemini-pro'];

    const responses = await Promise.all(
        models.map(model => callLLM(model, prompt))
    );

    const parsed = responses.map(r => parseStructuredOutput(r));

    const results = {};
    for (const field of Object.keys(parsed[0])) {
        const votes = parsed.map(p => p[field]);
        const consensus = findConsensus(votes);

        results[field] = {
            value: consensus.value,
            confidence: consensus.percentage,
            requires_review: consensus.percentage < 100,
            all_votes: votes
        };
    }

    return results;
}

function findConsensus(votes) {
    const frequency = {};
    votes.forEach(v => {
        frequency[v] = (frequency[v] || 0) + 1;
    });

    const max = Math.max(...Object.values(frequency));
    const winner = Object.keys(frequency).find(k => frequency[k] === max);

    return {
        value: winner,
        percentage: (max / votes.length) * 100,
        vote_breakdown: frequency
    };
}

Integration with Existing Systems

MCP Server Integration: Voting engine exposed as MCP tool for Claude to call
Database Layer: Store all votes, disagreements, and resolutions in PostgreSQL for audit trail
Xero Integration: Validated contract data flows directly into invoice generation
Human Review Dashboard: Web interface showing flagged items with side-by-side LLM responses
Monitoring: Track consensus rates, processing costs, and accuracy metrics over time

9. Measuring Success

Key Performance Indicators

KPI	Target	Measurement Method
Accuracy (Consensus Items)	>99%	Random sampling + human verification
Hallucination Detection	>90%	Inject known-error test cases
Human Review Reduction	>75%	Compare review hours pre/post
Cost per Document	<$5.00	API costs + (human hours × rate)
Time to Process	<2 min	Timestamp tracking
Error Cost Reduction	>95%	Track procurement errors

10. Risk Mitigation and Fallback Strategies

All LLMs Wrong Together

Risk: If source document is misleading, all LLMs might confidently agree on wrong answer.

Mitigation: Random sampling of 5% of "perfect consensus" items. Outlier detection for unusually formatted responses. Historical comparison against similar past contracts.

API Rate Limits or Outages

Risk: One LLM provider experiences downtime during critical processing.

Mitigation: Fallback model list. Queue system with exponential backoff. Manual override for urgent "2 out of 3" approval.

Cost Overruns on Complex Documents

Risk: Some documents require many re-reads, multiplying costs.

Mitigation: Pre-processing to extract relevant sections. Tiered approach (simple fields use single LLM). Budget alerts on monthly API spend.

11. Competitive Advantage and Market Positioning

Implementing voting LLM systems positions Dibblee Industries as a technology leader in a traditionally conservative industry:

Client Confidence: Government clients value reliability above all. 99%+ accuracy with automated hallucination detection builds trust.
Faster Response Times: Same-day bid responses instead of multi-day turnarounds.
Scalability: Handle 10× contract volume without proportional staff increases.
Audit Trail: Complete record of all AI decisions and human reviews satisfies government compliance.
Knowledge Transfer: System captures expert decisions, reducing dependency on individual staff.
Service Offering: Potential to offer AI-validated procurement services to other distributors.

12. Conclusion: The Path Forward

The voting LLM approach represents a paradigm shift in how we think about AI reliability. Rather than pursuing the impossible goal of a single perfect AI, we embrace the NASA lesson: consensus reveals truth, and disagreement reveals uncertainty.

For Dibblee Industries, this isn't just about automation — it's about building a fundamentally more reliable procurement process than was ever possible with purely human operations.

The Bottom Line

99% accuracy on consensus items (vs. 90% with single LLM)
90% hallucination detection rate (vs. 0% with single LLM)
80% reduction in human review needs
65% reduction in total processing costs
Near-elimination of costly procurement errors

Investment Required: 2–3× API costs

Return: 80% time savings + 95%+ error cost reduction

Payback Period: Immediate (first contract processed)

Next Steps

Week 1: Select pilot project (recommend: DND contract extraction)
Week 2: Implement 3-LLM voting proof-of-concept
Week 3–4: Process 50 test cases and measure results
Week 5–6: Build production voting engine and review dashboard
Week 7+: Scale to additional use cases and optimize

From Space to Procurement

NASA couldn't afford a single point of failure at 250,000 miles from Earth. Dibblee Industries can't afford hallucinations in government procurement. The voting LLM approach doesn't eliminate AI errors — it detects them automatically, ensuring human expertise is applied exactly where it's needed most.

That's not just automation. That's intelligent automation.

← Whitepapers