TYPE·CHUNK·FILTER·MEASURE
Retrievable, by design.
Chunking strategy, typed DITA topics, metadata filters, and section-ID stability make 85% RAG retrieval precision a measurable property of the corpus — not a model parameter, not a hope. The translation layer between your documentation and the LLMs that consume it.
- Technical Docs & Publishing
- Content Migration
- XML Data Interoperability
- AI-Ready Content
Two pipelines. One measurement.
Same query, same LLM, same vector store. What the model retrieves is determined entirely by whether the content was engineered for retrieval.
UNSTRUCTURED
- Content sources
  Word · PDF · Wiki · Markdown
- Flat ingestion
  no typing · no metadata · no IDs
- Arbitrary chunking
  667-word splits · mid-sentence cuts · broken tables
- Vector store
  no metadata filters · all content ranked equally
- LLM context window
  outdated + irrelevant chunks mixed in · model averages conflicting signals
- Outcome: 25–35% precision
  wrong answers · hallucinations · escalated tickets
STRUCTURED
- Content sources
  Word · PDF · FrameMaker · Wiki · Markdown
- Typed DITA topics
  concept · task · reference · troubleshooting
- Metadata + section IDs
  audience · difficulty · domain · intent · stable IDs everywhere
- Semantic chunking
  section-ID boundaries · self-contained · chunks.jsonl + glossary-index.json
- Vector store + metadata filters
  audience/type/domain pre-filter · top-k reranked before the LLM sees anything
- Outcome: 85%+ precision
  correct answers · cited sources · reduced support load
The pipeline
Precision is a content engineering problem.
AI retrieval precision is a function of how the content was structured, not how the model was tuned. The 7-dimension AI-readiness scorecard quantifies the gap. Word / PDF / FrameMaker content gets converted to typed DITA with stable section IDs. A subject scheme and SKOS-based controlled vocabulary tag every topic — audience, difficulty, domain, intent. The DITA-OT build emits chunks.jsonl and glossary-index.json alongside the HTML/PDF outputs. The corpus lands in Pinecone, Weaviate, or pgvector with metadata filters that pre-scope the search before similarity scoring runs. Retrieval precision is measured against real user queries before the engagement closes.
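The filter-then-rank step in this pipeline can be sketched in plain Python. This is a toy in-memory stand-in for Pinecone, Weaviate, or pgvector; the chunk records, section IDs, and two-dimensional vectors are illustrative assumptions, not a real client API:

```python
import math

# Hypothetical chunk records, shaped like entries a chunks.jsonl build might
# emit: an embedding plus the controlled-vocabulary metadata.
CHUNKS = [
    {"id": "install-db/prereqs", "audience": "admin",    "domain": "install",  "vec": [0.9, 0.1]},
    {"id": "install-db/steps",   "audience": "admin",    "domain": "install",  "vec": [0.8, 0.3]},
    {"id": "intro/overview",     "audience": "beginner", "domain": "concepts", "vec": [0.85, 0.2]},
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def retrieve(query_vec, audience, domain, k=2):
    # 1. Metadata pre-filter: scope the corpus before any similarity math runs.
    scoped = [c for c in CHUNKS if c["audience"] == audience and c["domain"] == domain]
    # 2. Similarity scoring only ranks what survived the filter.
    scoped.sort(key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return [c["id"] for c in scoped[:k]]

print(retrieve([1.0, 0.2], audience="admin", domain="install"))
# → ['install-db/prereqs', 'install-db/steps']
```

The beginner overview never competes with the admin install steps, no matter how similar its vector is: the filter decides relevance, similarity only breaks ties within scope.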
- Audit & Score
  7-dimension AI-readiness scorecard
  Content typing, metadata coverage, addressability, vocabulary control, semantic markup, reuse architecture, pipeline maturity. The corpus is scored on what is recoverable for retrieval, not on raw topic count.
- Structure & Type
  DITA 1.3 specialization, controlled topic types, stable section IDs
  Word / PDF / FrameMaker / wiki content converted to typed DITA. Every section gets an addressable, version-stable ID so retrieval can cite — not approximate.
- Enrich Metadata
  Subject scheme, SKOS taxonomy, audience / difficulty / domain / intent
  Controlled vocabulary applied at scale, with a 100% coverage gate on every topic. Validation rules prevent vocabulary drift on every commit.
- Build Pipeline
  DITA-OT, chunks.jsonl, glossary-index.json, content-manifest.json
  A single build emits semantic chunks alongside the HTML/PDF outputs. Every chunk is metadata-tagged, citation-ready, and tied back to its source section ID.
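A minimal sketch of consuming such a build output, assuming a hypothetical chunks.jsonl schema (the field names here are illustrative, not the actual manifest format), with the metadata-coverage gate enforced at load time:

```python
import io
import json

# Hypothetical two-record chunks.jsonl payload; field names are assumptions.
SAMPLE = io.StringIO(
    '{"section_id": "t-backup/steps", "type": "task", "audience": "admin", "text": "Run the backup job."}\n'
    '{"section_id": "c-backup/overview", "type": "concept", "audience": "beginner", "text": "Backups protect data."}\n'
)

def load_chunks(stream, required=("section_id", "type", "audience", "text")):
    """Parse JSON Lines and enforce the metadata-coverage gate on every chunk."""
    chunks = []
    for line in stream:
        record = json.loads(line)
        missing = [field for field in required if field not in record]
        if missing:
            raise ValueError(f"{record.get('section_id', '?')}: missing {missing}")
        chunks.append(record)
    return chunks

chunks = load_chunks(SAMPLE)
print([c["section_id"] for c in chunks])
# → ['t-backup/steps', 'c-backup/overview']
```

A chunk missing any required field fails the load rather than entering the vector store untagged, which is what keeps the coverage claim checkable.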
- Integrate & Validate
  Pinecone / Weaviate / pgvector, metadata-filtered retrieval, replay tests
  Audience and domain filters are applied before similarity scoring runs. Precision is measured against real user queries before the engagement closes — not promised.
Five AI use cases, one foundation.
Same engineering, different surface. The corpus you build for RAG is the same corpus that powers semantic search, AI translation, fine-tuning, and compliance auditing — one substrate, five consumers.
- Chatbot & RAG retrieval.
  Citation-ready answers, not averaged context.
  Typed chunks with audience and domain filters. Top-k reranked before reaching the LLM context window. Each retrieved chunk carries its section ID, source URL, and provenance — so every answer is citable, traceable, and replayable against the exact paragraph that produced it.
  Why is your RAG hallucinating on a topic you’ve already documented?
- Semantic search.
  Filter before you rank.
  Metadata-filtered semantic search returns the right level for the user before similarity scoring even runs. A beginner concept and an advanced reference don’t get blended into the same top-k. The filter is the relevance — similarity scoring is just the tiebreaker.
  Why is search “improving” without actually answering different questions?
- AI-assisted translation.
  Element-aware, not paragraph-blind.
  Semantic DITA elements (<warning>, <step>, <shortdesc>) carry their own translation rules. Safety-critical phrasing fires the right NMT pipeline; reference content takes a different path; conref-driven reuse is translated once. The element type is the routing key.
  Why is the same warning translated five different ways across five LMS modules?
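The routing-key idea can be sketched as a small job planner. The pipeline names and element records below are hypothetical placeholders; the point is that element type selects the path, and conref-reused text produces one translation job, not five:

```python
# Illustrative routing table: DITA element type -> translation pipeline name.
# The pipeline names are placeholders, not real services.
ROUTES = {
    "warning":   "regulated-nmt",   # safety-critical: constrained NMT + review
    "step":      "procedural-nmt",  # imperative phrasing preserved
    "shortdesc": "general-nmt",
}

def plan_jobs(elements):
    """Group source strings by (pipeline, text); reused text is sent once."""
    jobs = {}
    for el in elements:
        key = (ROUTES.get(el["type"], "general-nmt"), el["text"])
        jobs.setdefault(key, []).append(el["id"])
    return jobs

elements = [
    {"id": "w1", "type": "warning", "text": "Disconnect power before servicing."},
    {"id": "w2", "type": "warning", "text": "Disconnect power before servicing."},  # conref reuse
    {"id": "s1", "type": "step",    "text": "Open the access panel."},
]
jobs = plan_jobs(elements)
print(len(jobs))  # → 2 distinct translation jobs, not 3
```

The two identical warnings collapse into one job routed to the safety-critical path, so the phrasing is translated exactly once and propagates identically everywhere it is conref'd.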
- LLM fine-tuning.
  High-signal corpus, not noisy concatenation.
  Typed, deduplicated, voice-consistent training data. No copy-paste duplicates injecting contradictions. No format drift teaching the model that your tone is six different tones. The fine-tune learns your voice because the corpus has a voice.
  What did your model learn when 30% of the training examples disagreed with each other?
- Personalization & compliance auditing.
  Audience and intent metadata, not slide-deck theory.
  Role, level, product-line, and locale metadata on every topic make personalization a delivery property, not a marketing slide. The same metadata enables compliance auditing — the system names which warnings are missing, which prerequisites are untagged, which sections lack the required audit attributes.
  When the auditor asks what your AI told a regulated user, can you reproduce the answer?
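A compliance gap check of this kind is a short function. This sketch assumes a hypothetical set of required attributes; the attribute names are illustrative, not a regulatory schema:

```python
# Assumed required audit attributes per topic (illustrative, not normative).
REQUIRED = {"audience", "level", "product_line", "locale"}

def audit(topics):
    """Name each topic's missing audit attributes; empty dict means compliant."""
    gaps = {}
    for topic in topics:
        missing = sorted(REQUIRED - set(topic["meta"]))
        if missing:
            gaps[topic["id"]] = missing
    return gaps

topics = [
    {"id": "t-wire/install", "meta": {"audience": "technician", "level": "advanced",
                                      "product_line": "x200", "locale": "en-US"}},
    {"id": "c-wire/safety",  "meta": {"audience": "technician", "locale": "en-US"}},
]
print(audit(topics))
# → {'c-wire/safety': ['level', 'product_line']}
```

The output names the exact topic and the exact missing attributes, which is the difference between "we tag our content" and an answer an auditor can verify.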
The payoff
The measurement is the deliverable.
Precision isn’t a number that comes out of the model. It comes out of a test harness. A replay corpus of real user queries — the same ones that produced the baseline — runs again after the structured content is in place, scored against ground-truth answers, with the same vector store and the same retrieval prompt.
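That harness reduces to a small scoring loop. The queries, retrieved IDs, and ground-truth sets below are toy stand-ins for a real replay corpus:

```python
def precision_at_k(retrieved, relevant, k=5):
    """Fraction of the top-k retrieved chunk IDs that are ground-truth relevant."""
    top = retrieved[:k]
    return sum(1 for cid in top if cid in relevant) / len(top) if top else 0.0

def replay(corpus_results, ground_truth, k=5):
    """Average precision@k across the replay corpus of real user queries."""
    scores = [precision_at_k(corpus_results[q], ground_truth[q], k) for q in ground_truth]
    return sum(scores) / len(scores)

# Toy replay: retrieved chunk IDs per query vs ground-truth relevant IDs.
results = {"q1": ["a", "b", "c"], "q2": ["d", "x", "y"]}
truth   = {"q1": {"a", "b"},      "q2": {"d"}}
print(round(replay(results, truth, k=3), 2))  # → 0.5
```

Run once against the baseline corpus and once against the structured one, same queries and same retrieval prompt, and the before/after numbers are directly comparable.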
The same engineering pays compounding dividends. The corpus you build for chatbot retrieval is the same corpus that feeds semantic search, AI translation, fine-tuning, and compliance auditing. Five consumers, one substrate — every correction propagates everywhere.
On typical Extense engagements, RAG retrieval precision rises from a 25–35% baseline to 85%+ — measured against real user queries before the engagement closes. The number isn’t promised. It’s the deliverable.
Sample Content Assessment
Send us 20 topics from your current documentation. We’ll score them on 7 AI-readiness dimensions and return a gap analysis with the next concrete steps — within two business days.
Submit a sample →