LLM Hallucinations — How to Detect, Reduce, and Manage Risk in Production

What hallucinations are and why they appear

An LLM hallucination is generated information that sounds credible but is factually untrue or unsupported. This isn't an “error” in the sense of a system failure; it's a consequence of how language models work. An LLM doesn't “know” things the way a database does: it predicts the most probable next token based on statistics learned in training. When a prompt asks about something poorly covered by the training data, the model still generates the most plausible-sounding answer. Often that answer is correct. Sometimes it is not.

Typical hallucination scenarios in business applications:

Citing non-existent court rulings or statutory paragraphs in legal advisory
Inventing function, class, or library names when generating code
Providing incorrect statistics or dates in reports
Making up contacts, addresses, phone numbers
Mixing facts about different companies or people with similar names

Layer 1 — Grounding (RAG)

The single most effective technique for reducing hallucinations is grounding — providing the model with specific documents or data as context, from which it should draw answers. Classic RAG (Retrieval-Augmented Generation):

User question → search for the most relevant document fragments (vector search in pgvector / Qdrant / Milvus)
Fragments + question → prompt with instruction “answer only based on the documents below”
Model answer → verification that it contains citations/references to sources

RAG typically reduces hallucinations by 60-80% in “answer questions about our knowledge base” applications. It doesn't eliminate them completely: the model may still “interpret” documents in ways the sources don't support. Hence the need for further layers.
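The three steps above can be sketched in a few lines. This is a minimal illustration, not a production design: the word-overlap `retrieve` function is a toy stand-in for vector search, and the `Fragment` type, the `doc_id` values, and the bracketed citation format are all assumptions made for the example. The model call itself is left out.

```python
import re
from dataclasses import dataclass


@dataclass
class Fragment:
    doc_id: str  # hypothetical identifier the model is asked to cite
    text: str


def tokenize(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, fragments: list[Fragment], k: int = 2) -> list[Fragment]:
    """Step 1 (toy version): rank fragments by word overlap with the question."""
    q = tokenize(question)
    ranked = sorted(fragments, key=lambda f: len(q & tokenize(f.text)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context: list[Fragment]) -> str:
    """Step 2: fragments + question, with an instruction to stay within the documents."""
    sources = "\n".join(f"[{f.doc_id}] {f.text}" for f in context)
    return (
        "Answer only based on the documents below. "
        "Cite the document id in square brackets after each claim. "
        "If the documents do not contain the answer, say so.\n\n"
        f"Documents:\n{sources}\n\nQuestion: {question}"
    )


def has_valid_citations(answer: str, context: list[Fragment]) -> bool:
    """Step 3: at least one citation, and every cited id was actually retrieved."""
    cited = set(re.findall(r"\[([^\]]+)\]", answer))
    allowed = {f.doc_id for f in context}
    return bool(cited) and cited <= allowed


kb = [
    Fragment("policy-1", "Refunds are available within 30 days of purchase."),
    Fragment("policy-2", "Shipping takes 3 to 5 business days."),
    Fragment("policy-3", "Support is available on weekdays."),
]
question = "Are refunds available within 30 days?"
context = retrieve(question, kb, k=2)
prompt = build_prompt(question, context)
```

The citation check only confirms that the answer points at retrieved fragments. It does not confirm that the cited fragment supports the claim, which is what the later layers are for.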

Layer 2 — Self-consistency and ensemble

Self-consistency is a technique of asking the same question multiple times and comparing the answers; the ensemble variant asks several different models instead. When the answers agree, confidence is high. When they differ, that is a signal the answer is uncertain.

Practical variant: ask Claude Sonnet, Llama 70B, and Bielik the same question. If all three return the same number, date, or fact, it is probably correct. If they differ, escalate to a human or to a more expensive model (Opus). This pattern, implemented in 8-tier LLM routing, combines cost reduction with credibility improvement.
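A rough sketch of the comparison step follows. It assumes the answers have already been collected as short strings (a number, a date, a name); the `normalize` rules and the `min_agreement` parameter are illustrative choices, and the model labels are placeholders rather than real API calls.

```python
from collections import Counter
from typing import Optional


def normalize(answer: str) -> str:
    """Make trivially different spellings of the same short answer compare equal."""
    return " ".join(answer.lower().strip().rstrip(".").split())


def consensus(answers: list[str], min_agreement: float = 1.0) -> tuple[Optional[str], float]:
    """Return (agreed answer or None, share of answers that match the most common one)."""
    if not answers:
        return None, 0.0
    counts = Counter(normalize(a) for a in answers)
    top, votes = counts.most_common(1)[0]
    ratio = votes / len(answers)
    return (top if ratio >= min_agreement else None), ratio


# Hypothetical answers from three models to the same factual question.
answers = {"model_a": "1997", "model_b": "1997.", "model_c": "1998"}
agreed, ratio = consensus(list(answers.values()))
decision = agreed if agreed is not None else "ESCALATE"
```

With the default `min_agreement=1.0` any disagreement escalates, which matches the "all three agree" rule in the text. Free-form answers would need a semantic comparison instead of string normalization.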

Layer 3 — Evaluation pipelines

A production LLM deployment without an evaluation pipeline is like writing code without tests. Specific metrics:

Faithfulness — whether the answer follows from the provided documents. Measured by a second AI model (LLM-as-judge) or libraries like RAGAS, deepeval.
Answer relevance — whether the answer addresses the user's question.
Context precision — whether the best fragments were returned by retrieval (vector search quality).
Groundedness score — percentage of claims in the answer for which a source in context can be identified.
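To make the groundedness definition concrete, here is a deliberately crude sketch. It treats each sentence as one claim and counts a claim as supported when most of its words appear in a single context fragment. That lexical-overlap rule and the 0.8 threshold are assumptions for illustration only; real pipelines would typically use an LLM-as-judge or a library such as RAGAS or deepeval for the support decision.

```python
import re


def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def split_claims(answer: str) -> list[str]:
    """Treat each sentence as one claim (a simplification)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]


def is_supported(claim: str, context: list[str], threshold: float = 0.8) -> bool:
    """Crude proxy: enough of the claim's words occur in one context fragment."""
    claim_words = words(claim)
    if not claim_words:
        return False
    return any(
        len(claim_words & words(fragment)) / len(claim_words) >= threshold
        for fragment in context
    )


def groundedness(answer: str, context: list[str]) -> float:
    """Share of claims in the answer for which a supporting fragment was found."""
    claims = split_claims(answer)
    if not claims:
        return 0.0
    return sum(is_supported(c, context) for c in claims) / len(claims)


context = [
    "The warranty period is 24 months.",
    "Returns are accepted within 14 days.",
]
answer = "The warranty period is 24 months. Returns are free of charge."
score = groundedness(answer, context)  # first claim supported, second not
```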

Each new build of an LLM-based application should pass a battery of 50-500 evaluation questions with known ground truth. If faithfulness drops below 90%, the deployment is blocked.
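The gate itself is simple arithmetic once per-question verdicts exist. The sketch below assumes each evaluation question has already been judged faithful or not (by an LLM-as-judge, a library, or a human); the `EvalResult` type and function names are hypothetical, while the 90% threshold and the 50-question minimum come from the text.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    question_id: str
    faithful: bool  # verdict produced elsewhere (judge model, library, or human)


def faithfulness_rate(results: list[EvalResult]) -> float:
    if not results:
        raise ValueError("evaluation set is empty")
    return sum(r.faithful for r in results) / len(results)


def deployment_allowed(
    results: list[EvalResult], threshold: float = 0.90, min_questions: int = 50
) -> bool:
    """Block the build if the set is too small or faithfulness is below the threshold."""
    if len(results) < min_questions:
        return False
    return faithfulness_rate(results) >= threshold


# Synthetic verdicts for illustration: 93 of 100 faithful, then 89 of 100.
passing_run = [EvalResult(f"q{i}", i < 93) for i in range(100)]
failing_run = [EvalResult(f"q{i}", i < 89) for i in range(100)]
```

In a CI job, the result of `deployment_allowed` would typically be turned into the process exit code so that a failing run stops the pipeline.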

Layer 4 — Guardrails and output validation

Guardrails are rules validating LLM output before delivering it to the user. Examples:

Schema validation — output must conform to a specific schema (JSON Schema, Pydantic). “Invented fields” hallucinations are detected mechanically.
Forbidden patterns — detection and blocking of unacceptable patterns (PII without masking, financial data out of context, potentially harmful content).
Citation enforcement — each factual claim must have a source citation. If the model doesn't cite — the answer is rejected.
Numeric range validation — numbers in output checked for sense (e.g., price > 0, date ≤ today, percentage in 0-100 range).
Cross-reference check — comparing output against a fact database (e.g., business registry, statutory citation dictionary).

Libraries: Guardrails AI, NeMo Guardrails, instructor (for schema enforcement). A custom implementation is often simpler and cheaper to maintain.
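As an illustration of the custom route, the following standard-library sketch combines schema validation, numeric range validation, and citation enforcement for one hypothetical output contract. The field names (`price`, `discount_percent`, `issued_on`, `sources`) are invented for the example; a real project might express the same contract with JSON Schema or Pydantic instead.

```python
import json
from datetime import date

# Hypothetical output contract for an extraction task.
EXPECTED_FIELDS = {
    "price": (int, float),
    "discount_percent": (int, float),
    "issued_on": str,
    "sources": list,
}


def validate_output(raw: str, today: date) -> list[str]:
    """Return a list of violations; an empty list means the output passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]

    # Schema validation: missing fields, invented fields, wrong types.
    errors = [f"missing field: {n}" for n in EXPECTED_FIELDS.keys() - data.keys()]
    errors += [f"unexpected field: {n}" for n in data.keys() - EXPECTED_FIELDS.keys()]
    for name, expected in EXPECTED_FIELDS.items():
        value = data.get(name)
        # bool is a subclass of int in Python, so reject it explicitly.
        if name in data and (isinstance(value, bool) or not isinstance(value, expected)):
            errors.append(f"wrong type for field: {name}")
    if errors:
        return sorted(errors)

    # Numeric range validation.
    if data["price"] <= 0:
        errors.append("price must be > 0")
    if not 0 <= data["discount_percent"] <= 100:
        errors.append("discount_percent must be in 0-100")
    try:
        if date.fromisoformat(data["issued_on"]) > today:
            errors.append("issued_on is in the future")
    except ValueError:
        errors.append("issued_on is not an ISO date")

    # Citation enforcement: reject answers that cite nothing.
    if not data["sources"]:
        errors.append("no source cited")
    return errors


today = date(2025, 1, 1)  # fixed date so the example is reproducible
good = '{"price": 120.5, "discount_percent": 10, "issued_on": "2024-06-01", "sources": ["doc-7"]}'
bad = '{"price": -5, "discount_percent": 140, "issued_on": "2999-01-01", "sources": []}'
```

Returning a list of violations rather than raising on the first one makes it easier to log which guardrail fired and to feed that back into prompts or retrieval.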

Layer 5 — Human-in-the-loop

For high-risk applications (legal, medical, financial, HR decisions), the human-in-the-loop layer is essential. AI models do not make the final decision — they support a human. Specific patterns:

Draft + review — AI generates the first version of a document/answer, human verifies and accepts before sending.
Confidence threshold — low-confidence answers (from self-consistency or from explicitly asking the model for its confidence) are automatically escalated to a human.
Random sampling QA — 5-10% of all LLM answers are manually audited regardless of confidence, which gives a baseline quality metric over time.
Feedback loop — users can mark a wrong answer; those reports feed improvements to retrieval, prompts, and parameters.
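The confidence-threshold and random-sampling patterns can be combined in one routing function. This is a sketch under assumptions: the confidence value is taken as a number between 0 and 1 produced elsewhere, the 0.8 threshold is an arbitrary placeholder, and the route labels are hypothetical. The 5% audit rate is the lower end of the range given above.

```python
import random


def route(
    confidence: float,
    rng: random.Random,
    threshold: float = 0.8,
    audit_rate: float = 0.05,
) -> str:
    """Decide what happens to one answer.

    - below the confidence threshold: a human reviews it before it is sent
    - otherwise: it is sent, and a random share is also queued for manual audit
    """
    if confidence < threshold:
        return "human_review"
    if rng.random() < audit_rate:
        return "send_and_audit"
    return "send"


rng = random.Random(42)  # seeded only to make the example repeatable
decisions = [route(c, rng) for c in (0.95, 0.40, 0.88, 0.79)]
```

Sampling for audit only after the threshold check keeps the audit pool focused on answers that would otherwise reach users unreviewed, which is where a baseline quality metric is most informative.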

Measurement — knowing that the reduction works

Specific production metrics worth monitoring:

Hallucination rate — percentage of answers classified as hallucinations in manual evaluation (sampling). Target: below 2% for business-critical applications.
User feedback rate — percentage of users marking an answer as wrong.
Escalation rate — percentage of queries escalated to a human. A rate below 5% suggests the system is missing uncertain cases; a rate above 30% suggests it is not delivering automation value.
Faithfulness score in regression tests — monthly trend.
Time-to-correction — from hallucination detection to fix deployment (better retrieval, new guardrail, fine-tuning).
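Several of these metrics can be computed directly from an interaction log. The sketch below assumes a hypothetical `Interaction` record with four flags; the thresholds (2% hallucination target, 5-30% escalation band) are the ones stated above, and the log itself is synthetic.

```python
from dataclasses import dataclass


@dataclass
class Interaction:
    escalated: bool        # query was handed to a human
    flagged_by_user: bool  # user marked the answer as wrong
    audited: bool          # answer was part of the manual audit sample
    hallucination: bool    # audit verdict; only meaningful when audited is True


def production_metrics(log: list[Interaction]) -> dict[str, float]:
    audited = [i for i in log if i.audited]
    if not log or not audited:
        raise ValueError("need a non-empty log with at least one audited answer")
    return {
        "hallucination_rate": sum(i.hallucination for i in audited) / len(audited),
        "user_feedback_rate": sum(i.flagged_by_user for i in log) / len(log),
        "escalation_rate": sum(i.escalated for i in log) / len(log),
    }


def alerts(metrics: dict[str, float]) -> list[str]:
    found = []
    if metrics["hallucination_rate"] >= 0.02:
        found.append("hallucination rate at or above 2% target")
    if metrics["escalation_rate"] < 0.05:
        found.append("escalation rate below 5%: uncertain cases may be missed")
    if metrics["escalation_rate"] > 0.30:
        found.append("escalation rate above 30%: little automation value")
    return found


# Synthetic log: 100 interactions, 10 escalated, 3 flagged, 20 audited (1 hallucination).
log = (
    [Interaction(True, False, False, False)] * 10
    + [Interaction(False, True, False, False)] * 3
    + [Interaction(False, False, True, True)] * 1
    + [Interaction(False, False, True, False)] * 19
    + [Interaction(False, False, False, False)] * 67
)
metrics = production_metrics(log)
```

Note that the hallucination rate is computed over the audited sample only, since unaudited answers have no verdict.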

Implications for decision-makers

Hallucinations are manageable, but managing them requires investment in a multi-layer defensive architecture. Companies deploying LLMs without this architecture will sooner or later encounter a serious incident (publishing wrong information to a customer, a wrong decision based on hallucinated data, reputational damage). The cost of building a full defensive stack (RAG + evaluation + guardrails + human-in-the-loop) is typically 15-30% of the LLM deployment cost itself, and for production applications it is a necessary investment. The consequences of skipping it are asymmetric: in 95% of cases inaction costs little, and in the remaining 5% the cost is catastrophic.