What hallucinations are and why they appear
An LLM hallucination is generated output that sounds credible but is factually untrue or unsupported. This isn't an “error” in the sense of a system failure; it is a consequence of how language models work. An LLM doesn't “know” things the way a database does: it predicts the most probable next token based on statistics learned in training. When a prompt asks about something the training data covers poorly, the model still generates the most plausible-sounding answer. Often that answer is correct. Sometimes it is not.
Typical hallucination scenarios in business applications:
- Citing non-existent court rulings or statutory paragraphs in legal advisory
- Inventing function, class, or library names when generating code
- Providing incorrect statistics or dates in reports
- Making up contact details, addresses, or phone numbers
- Mixing facts about different companies or people with similar names
Layer 1 — Grounding (RAG)
The single most effective technique for reducing hallucinations is grounding: providing the model with specific documents or data as context and instructing it to draw its answers from them. Classic RAG (Retrieval-Augmented Generation) works in three steps:
- User question → search for the most relevant document fragments (vector search in pgvector / Qdrant / Milvus)
- Fragments + question → prompt with instruction “answer only based on the documents below”
- Model answer → verification that it contains citations/references to sources
RAG typically reduces hallucinations by 60-80% in “answer questions about our knowledge base” applications. It doesn't eliminate them completely: the model may still interpret the documents in ways their text does not support. Hence the need for further layers.
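The second and third steps of the flow above can be sketched in a few lines. This is a minimal sketch rather than a reference implementation: the function names, the `[n]` citation format, and the exact instruction wording are assumptions made for illustration, and the vector search is assumed to have already returned the fragments.

```python
import re


def build_grounded_prompt(question: str, fragments: list[str]) -> str:
    """Number each retrieved fragment and prepend the grounding instruction."""
    numbered = "\n".join(
        f"[{i}] {text}" for i, text in enumerate(fragments, start=1)
    )
    return (
        "Answer only based on the documents below. "
        "Cite the supporting document as [n] after each claim. "
        "If the documents do not contain the answer, say so.\n\n"
        f"Documents:\n{numbered}\n\nQuestion: {question}"
    )


def cited_ids(answer: str) -> set[int]:
    """Extract every [n] citation marker from the model's answer."""
    return {int(m) for m in re.findall(r"\[(\d+)\]", answer)}


def has_valid_citations(answer: str, fragment_count: int) -> bool:
    """True if the answer cites at least one fragment and only existing ones."""
    ids = cited_ids(answer)
    return bool(ids) and all(1 <= i <= fragment_count for i in ids)
```

A check like `has_valid_citations` only confirms that citations are present and point at real fragments. Whether the cited fragment actually supports the claim is a separate question, which is what the later layers are for.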
Layer 2 — Self-consistency and ensemble
Self-consistency is a technique of asking the same question multiple times and comparing the answers; the ensemble variant asks several different models instead. When the answers agree, confidence is high. When they differ, that is a signal the topic is uncertain.
Practical variant: ask Claude Sonnet, Llama 70B, and Bielik the same question. If all three return the same number, date, or fact, it is probably correct. If they differ, escalate to a human or to a more expensive model (Opus). This pattern, implemented in 8-tier LLM routing, combines cost reduction with improved credibility.
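The comparison step could look roughly like the sketch below. It assumes the answers have already been collected from the models as short strings (a number, a date, a name); the `normalize` helper and the `min_agreement` parameter are hypothetical, and real answers usually need more careful normalization than lowercasing and trimming.

```python
from collections import Counter
from typing import Optional


def normalize(answer: str) -> str:
    """Collapse whitespace, lowercase, and drop a trailing period."""
    return " ".join(answer.lower().split()).rstrip(".")


def consensus(
    answers: list[str], min_agreement: float = 1.0
) -> tuple[Optional[str], bool]:
    """Return (answer, True) if enough answers agree, else (None, False).

    min_agreement=1.0 requires all answers to match, as in the
    three-model variant described above.
    """
    if not answers:
        return None, False
    counts = Counter(normalize(a) for a in answers)
    top, votes = counts.most_common(1)[0]
    if votes / len(answers) >= min_agreement:
        return top, True
    return None, False
```

When `consensus` returns `(None, False)`, the caller would escalate to a human or to a more expensive model.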
Layer 3 — Evaluation pipelines
A production LLM deployment without an evaluation pipeline is like shipping code without tests. Specific metrics:
- Faithfulness: whether the answer follows from the provided documents. Measured by a second AI model (LLM-as-judge) or by libraries such as RAGAS or deepeval.
- Answer relevance: whether the answer addresses the user's question.
- Context precision: whether the best fragments were returned by retrieval (vector search quality).
- Groundedness score: percentage of claims in the answer for which a source in context can be identified.
Each new build of an LLM-based application should pass a battery of 50-500 evaluation questions with known ground truth. If faithfulness drops below 90%, the deployment is blocked.
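As an illustration, the deployment gate can be a very small piece of code once the per-question verdicts exist. The sketch below assumes each evaluation question has already been judged faithful or not (by an LLM-as-judge or a human); the `EvalResult` type and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    question_id: str
    faithful: bool  # verdict supplied by a judge; producing it is out of scope here


def faithfulness_rate(results: list[EvalResult]) -> float:
    """Share of evaluation questions whose answer was judged faithful."""
    if not results:
        raise ValueError("evaluation set is empty")
    return sum(r.faithful for r in results) / len(results)


def deployment_allowed(results: list[EvalResult], threshold: float = 0.90) -> bool:
    """Block the build when faithfulness falls below the threshold."""
    return faithfulness_rate(results) >= threshold
```

In a CI pipeline, a `False` result would typically be turned into a non-zero exit code so the build fails.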
Layer 4 — Guardrails and output validation
Guardrails are rules that validate LLM output before it is delivered to the user. Examples:
- Schema validation: output must conform to a specific schema (JSON Schema, Pydantic), so hallucinated “invented fields” are detected mechanically.
- Forbidden patterns: detection and blocking of unacceptable patterns (unmasked PII, financial data out of context, potentially harmful content).
- Citation enforcement: each factual claim must carry a source citation. If the model doesn't cite, the answer is rejected.
- Numeric range validation: numbers in the output are checked for plausibility (e.g., price > 0, date ≤ today, percentage in the 0-100 range).
- Cross-reference check: comparing output against a fact database (e.g., business registry, statutory citation dictionary).
Libraries: Guardrails AI, NeMo Guardrails, instructor (for schema enforcement). A custom implementation is often simpler and cheaper to maintain.
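To give a sense of how small a custom guardrail can be, here is a standard-library sketch combining schema validation with numeric range checks. The field names (`price`, `invoice_date`, `discount_percent`) are a made-up schema for illustration; a real system would use its own schema, possibly via one of the libraries above.

```python
import json
from datetime import date

# Hypothetical schema: the only fields the model is allowed to return.
ALLOWED_FIELDS = {"price", "invoice_date", "discount_percent"}


def is_number(value) -> bool:
    """True for int/float, excluding bool (which is an int subclass)."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_output(raw: str, today: date) -> list[str]:
    """Return a list of violations; an empty list means the output passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]

    errors = []
    extra = set(data) - ALLOWED_FIELDS      # invented fields
    missing = ALLOWED_FIELDS - set(data)    # omitted fields
    if extra:
        errors.append(f"unexpected fields: {sorted(extra)}")
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    if errors:
        return errors

    if not is_number(data["price"]) or data["price"] <= 0:
        errors.append("price must be a number > 0")
    try:
        if date.fromisoformat(data["invoice_date"]) > today:
            errors.append("invoice_date is in the future")
    except (TypeError, ValueError):
        errors.append("invoice_date must be an ISO date string")
    pct = data["discount_percent"]
    if not is_number(pct) or not 0 <= pct <= 100:
        errors.append("discount_percent must be in the 0-100 range")
    return errors
```

Passing `today` in as a parameter, rather than calling `date.today()` inside the function, keeps the check deterministic and easy to test.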
Layer 5 — Human-in-the-loop
For high-risk applications (legal, medical, financial, HR decisions), the human-in-the-loop layer is essential. AI models do not make the final decision — they support a human. Specific patterns:
- Draft + review: AI generates the first version of a document or answer; a human verifies and accepts it before sending.
- Confidence threshold: low-confidence answers (as measured by self-consistency or by explicitly asking the model for its confidence) are automatically escalated to a human.
- Random sampling QA: 5-10% of all LLM answers are manually audited regardless of confidence, giving a baseline quality metric over time.
- Feedback loop: users can mark an answer as wrong, and those reports are used to improve retrieval, prompts, and parameters.
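The confidence threshold and random sampling patterns can be combined in one routing decision, sketched below under some assumptions: the confidence score is taken as given (for example from a self-consistency check), the 0.8 threshold is an arbitrary illustrative value, and the function name and return shape are hypothetical. The audit flag is drawn independently of confidence, matching the "regardless of confidence" requirement.

```python
import random


def route(
    confidence: float,
    threshold: float = 0.8,   # illustrative value, not from the article
    audit_rate: float = 0.05,  # lower end of the 5-10% sampling range
    rng=random,
) -> dict:
    """Decide whether an answer is escalated and whether it is sampled for audit."""
    return {
        # Low-confidence answers go to a human before being sent.
        "escalate": confidence < threshold,
        # A random share of all answers is audited, whatever the confidence.
        "audit": rng.random() < audit_rate,
    }
```

Accepting `rng` as a parameter allows a seeded `random.Random` instance to be passed in, which makes the sampling reproducible in tests.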
Measurement — knowing reduction works
Specific production metrics worth monitoring:
- Hallucination rate: percentage of answers classified as hallucinations in manual evaluation (sampling). Target: below 2% for business-critical applications.
- User feedback rate: percentage of users marking an answer as wrong.
- Escalation rate: percentage of queries escalated to a human. A rate that is too low (below 5%) suggests the system is missing uncertain cases; one that is too high (above 30%) means the system is not delivering automation value.
- Faithfulness score in regression tests: tracked as a monthly trend.
- Time-to-correction: time from detecting a hallucination to deploying a fix (better retrieval, a new guardrail, fine-tuning).
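Two of these metrics come with explicit thresholds, so they are easy to turn into automated checks. The following is a minimal sketch using the figures quoted above; the function names and status labels are assumptions, and the counts are expected to come from whatever logging the system already has.

```python
def rate(count: int, total: int) -> float:
    """Fraction of `total` represented by `count`."""
    if total <= 0:
        raise ValueError("total must be positive")
    return count / total


def hallucination_rate_ok(hallucinated: int, sampled: int, target: float = 0.02) -> bool:
    """True if the manually audited sample is below the 2% target."""
    return rate(hallucinated, sampled) < target


def escalation_status(
    escalated: int, total: int, low: float = 0.05, high: float = 0.30
) -> str:
    """Classify the escalation rate against the 5% and 30% bounds."""
    r = rate(escalated, total)
    if r < low:
        return "too_low"   # uncertain cases are probably being missed
    if r > high:
        return "too_high"  # little automation value is delivered
    return "ok"
```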
Implications for decision-makers
Hallucinations are manageable, but managing them requires investment in a defensive architecture across multiple layers. Companies deploying LLMs without this architecture will sooner or later encounter a serious incident (wrong information published to a customer, a wrong decision based on hallucinated data, reputational damage). The cost of building a full defensive stack (RAG + evaluation + guardrails + human-in-the-loop) is typically 15-30% of the LLM deployment cost itself, and it is a necessary investment for production applications. The consequences of skipping it are asymmetric: inaction costs little in 95% of cases and is catastrophic in the remaining 5%.