
Document Chunking

Splitting long documents into smaller segments for efficient indexing and retrieval in RAG systems.

Why Chunking Matters

Document chunking is the process of dividing documents into smaller, semantically meaningful segments for storage in vector databases and retrieval by AI systems. It is a critical yet often underappreciated step in building retrieval-augmented generation (RAG) pipelines. Chunking quality directly impacts retrieval accuracy: chunks that are too large dilute relevance, because the passage that answers a query is averaged together with unrelated text, while chunks that are too small lose the context needed to interpret them. In some pipelines, getting chunking right can improve RAG performance more than upgrading the language model itself.

Chunking Strategies

The fundamental challenge is preserving meaning at the segment level while keeping chunks small enough for precise retrieval and within model context window limits.

Optimization Techniques

Fixed-size chunking splits text at regular character or token intervals. It is simple, but it often cuts text mid-sentence or mid-concept. Recursive character splitting divides text at natural boundaries (paragraphs first, then sentences) and falls back to finer boundaries only when a piece still exceeds the size limit. Semantic chunking uses embedding similarity to group related content, creating chunks that represent coherent ideas. Document-structure-aware chunking respects headings, sections, and formatting to maintain the author's organizational logic.
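
The first two strategies above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names (fixed_size_chunks, recursive_chunks), the character-based size limit, the overlap parameter, and the separator order are all hypothetical choices made for the example. Production pipelines usually measure size in tokens rather than characters.

```python
from typing import List, Sequence


def fixed_size_chunks(text: str, size: int, overlap: int = 0) -> List[str]:
    """Cut text every `size` characters; neighbours share `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):  # the tail is already covered
            break
    return chunks


def recursive_chunks(
    text: str,
    max_chars: int,
    separators: Sequence[str] = ("\n\n", ". ", " "),  # assumed order: coarse to fine
) -> List[str]:
    """Split at the coarsest separator first; recurse on pieces that are still too long."""
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators:
        # No natural boundary left: fall back to a hard cut.
        return fixed_size_chunks(text, max_chars)

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_chars:
            current = candidate  # keep merging small neighbours
            continue
        if current:
            chunks.append(current)
        if len(piece) <= max_chars:
            current = piece
        else:
            # Note: the separator at a chunk boundary is dropped in this sketch.
            chunks.extend(recursive_chunks(piece, max_chars, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    print(fixed_size_chunks("abcdefghij", 4, overlap=2))
    # ['abcd', 'cdef', 'efgh', 'ghij']
    print(recursive_chunks("One two.\n\nThree four five six.", 12))
    # ['One two.', 'Three four', 'five six.']
```

In the second call the paragraph break is tried first, so "One two." stays intact, and only the oversized second paragraph is split further at word boundaries. The fixed-size splitter would instead cut wherever the character count lands, which is the mid-sentence breakage described above.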