TYPE·CHUNK·FILTER·MEASURE

Retrievable, by design.


Chunking strategy, typed DITA topics, metadata filters, and section-ID stability make 85%+ RAG retrieval precision a measurable property of the corpus, not a model parameter and not a hope. The translation layer between your documentation and the LLMs that consume it.

Two pipelines. One measurement.

Same query, same LLM, same vector store. What the model retrieves is determined entirely by whether the content was engineered for retrieval.

UNSTRUCTURED

  1. Content sources

    Word · PDF · Wiki · Markdown

  2. Flat ingestion

    no typing · no metadata · no IDs

  3. Arbitrary chunking

    667-word splits · mid-sentence cuts · broken tables

  4. Vector store

    no metadata filters · all content ranked equally

  5. LLM context window

    outdated + irrelevant chunks mixed in · model averages conflicting signals

  6. Outcome: 25–35% precision

    wrong answers · hallucinations · escalated tickets

STRUCTURED

  1. Content sources

    Word · PDF · FrameMaker · Wiki · Markdown

  2. Typed DITA topics

    concept · task · reference · troubleshooting

  3. Metadata + section IDs

    audience · difficulty · domain · intent · stable IDs everywhere

  4. Semantic chunking

    section-ID boundaries · self-contained · chunks.jsonl + glossary-index.json · sample record below

  5. Vector store + metadata filters

    audience/type/domain pre-filter · top-k reranked before the LLM sees anything

  6. Outcome: 85%+ precision

    correct answers · cited sources · reduced support load
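
Concretely, here is a minimal Python sketch of what one chunks.jsonl record from the structured pipeline might carry. The field names and values are illustrative rather than a fixed schema; the point is that every chunk travels with its stable section ID and the metadata the vector store will filter on.

  import json

  # Hypothetical chunks.jsonl record: one self-contained section, bounded by its
  # stable section ID and carrying the metadata the vector store filters on.
  chunk = {
      "id": "install-gateway__prereqs",
      "topic_type": "task",
      "audience": "administrator",
      "difficulty": "intermediate",
      "domain": "networking",
      "intent": "install",
      "source_url": "https://docs.example.com/install-gateway#prereqs",
      "text": "Before installing the gateway, confirm that ports 443 and 8443 are open.",
  }

  with open("chunks.jsonl", "a", encoding="utf-8") as f:
      f.write(json.dumps(chunk, ensure_ascii=False) + "\n")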

The pipeline

Precision is a content engineering problem.


AI retrieval precision is a function of how the content was structured, not how the model was tuned. The 7-dimension AI-readiness scorecard quantifies the gap. Word / PDF / FrameMaker content gets converted to typed DITA with stable section IDs. A subject scheme and SKOS-based controlled vocabulary tag every topic — audience, difficulty, domain, intent. The DITA-OT build emits chunks.jsonl and glossary-index.json alongside the HTML/PDF outputs. The corpus lands in Pinecone, Weaviate, or pgvector with metadata filters that pre-scope the search before similarity scoring runs. Retrieval precision is measured against real user queries before the engagement closes.
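
One way to make "every topic is tagged" enforceable rather than aspirational is a validation pass that fails the commit whenever a record is missing a controlled value. A minimal sketch, assuming records shaped like the chunks.jsonl example above; the vocabulary here is a placeholder for whatever the SKOS subject scheme actually defines.

  import json
  import sys

  # Illustrative controlled vocabulary; in practice these values come from the
  # SKOS-based subject scheme, not a hand-typed dict.
  VOCAB = {
      "topic_type": {"concept", "task", "reference", "troubleshooting"},
      "audience":   {"operator", "administrator", "developer"},
      "difficulty": {"beginner", "intermediate", "advanced"},
      "domain":     {"networking", "security", "billing"},
      "intent":     {"install", "configure", "troubleshoot", "learn"},
  }

  def validate(path="chunks.jsonl"):
      """Return one error per record whose metadata is missing or outside the vocabulary."""
      errors = []
      with open(path, encoding="utf-8") as f:
          for n, line in enumerate(f, start=1):
              record = json.loads(line)
              for field, allowed in VOCAB.items():
                  if record.get(field) not in allowed:
                      errors.append(f"record {n} ({record.get('id', '?')}): bad {field}={record.get(field)!r}")
      return errors

  if __name__ == "__main__":
      problems = validate()
      print("\n".join(problems) or "coverage gate passed: every record carries valid metadata")
      sys.exit(1 if problems else 0)  # non-zero exit fails the commit hook / CI job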

  1. Audit & Score

    7-dimension AI-readiness scorecard

    Content typing, metadata coverage, addressability, vocabulary control, semantic markup, reuse architecture, pipeline maturity. The engagement is quoted on what is recoverable for retrieval, not on raw topic count.

  2. Structure & Type

    DITA 1.3 specialization, controlled topic types, stable section IDs

    Word / PDF / FrameMaker / wiki content converted to typed DITA. Every section gets an addressable, version-stable ID so retrieval can cite — not approximate.

  3. Enrich Metadata

    Subject scheme, SKOS taxonomy, audience / difficulty / domain / intent

    Controlled vocabulary applied at scale. 100% coverage gate on every topic. Validation rules prevent vocabulary drift on every commit.

  4. Build Pipeline

    DITA-OT, chunks.jsonl, glossary-index.json, content-manifest.json

    A single build emits semantic chunks alongside the HTML/PDF outputs. Every chunk is metadata-tagged, citation-ready, and tied back to its source section ID.

  5. Integrate & Validate

    Pinecone / Weaviate / pgvector, metadata-filtered retrieval, replay tests

    Audience and domain filters applied before similarity scoring runs. Precision measured against real user queries before the engagement closes, not promised. A pre-filter query sketch follows.
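
In the pgvector case, the pre-filter from step 5 is literally a WHERE clause that narrows the candidate set before the distance operator ranks anything. A hypothetical query sketch; the table and column names are illustrative, and computing the query embedding is left out.

  # Hypothetical pgvector query: the metadata filters scope the candidate set
  # before the cosine-distance operator (<=>) ranks anything.
  PREFILTERED_TOP_K = """
      SELECT id, source_url, text,
             embedding <=> %(query_embedding)s AS distance
      FROM   chunks
      WHERE  audience   = %(audience)s          -- right reader
        AND  domain     = %(domain)s            -- right product area
        AND  topic_type = ANY(%(topic_types)s)  -- right kind of content
      ORDER  BY distance
      LIMIT  %(k)s;
  """

  params = {
      "query_embedding": "[0.12, -0.03, 0.41]",   # vector literal from the embedding model (elided here)
      "audience": "administrator",
      "domain": "networking",
      "topic_types": ["task", "troubleshooting"],
      "k": 8,
  }
  # Run with any Postgres client; the reranker then reorders these k rows
  # before anything reaches the LLM context window.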

Five AI use cases, one foundation.

Same engineering, different surface. The corpus you build for RAG is the same corpus that powers semantic search, AI translation, fine-tuning, and compliance auditing — one substrate, five consumers.

  1. Chatbot & RAG retrieval.

    Citation-ready answers, not averaged context.

    Typed chunks with audience and domain filters. Top-k reranked before reaching the LLM context window. Each retrieved chunk carries its section ID, source URL, and provenance — so every answer is citable, traceable, and replayable against the exact paragraph that produced it. The citation assembly is sketched in code after this list.

    Why is your RAG hallucinating on a topic you’ve already documented?

  2. Semantic search.

    Filter before you rank.

    Metadata-filtered semantic search returns the right level for the user before similarity scoring even runs. A beginner concept and an advanced reference don’t get blended into the same top-k. The filter is the relevance — similarity scoring is just the tiebreaker.

    Why is search “improving” without actually answering different questions?

  3. AI-assisted translation.

    Element-aware, not paragraph-blind.

    Semantic DITA elements (<warning>, <step>, <shortdesc>) carry their own translation rules. Safety-critical phrasing routes to the right NMT pipeline; reference content takes a different path; conref-driven reuse is translated once. The element type is the routing key, and the routing table is sketched below the list.

    Why is the same warning translated five different ways across five LMS modules?

  4. LLM fine-tuning.

    High-signal corpus, not noisy concatenation.

    Typed, deduplicated, voice-consistent training data. No copy-paste duplicates injecting contradictions. No format drift teaching the model that your tone is six different tones. The fine-tune learns your voice because the corpus has a voice. A minimal dedup pass is sketched after the list.

    What did your model learn when 30% of the training examples disagreed with each other?

  5. Personalization & compliance auditing.

    Audience and intent metadata, not slide-deck theory.

    Role, level, product-line, and locale metadata on every topic make personalization a delivery property, not a marketing slide. The same metadata enables compliance auditing — the system names which warnings are missing, which prerequisites are untagged, which sections lack the required audit attributes.

    When the auditor asks what your AI told a regulated user, can you reproduce the answer?
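
Use case 1 depends on every retrieved chunk carrying its own provenance. A minimal sketch of the citation assembly, assuming records shaped like the chunks.jsonl example earlier; the helper and field names are illustrative.

  def build_context(retrieved_chunks):
      """Turn retrieved records into a context block where every passage cites its section ID."""
      passages, citations = [], []
      for chunk in retrieved_chunks:
          ref = f"[{chunk['id']}]"
          passages.append(f"{ref} {chunk['text']}")
          citations.append({"ref": ref, "section_id": chunk["id"], "url": chunk["source_url"]})
      return "\n\n".join(passages), citations

  # Two hypothetical retrieved records, shaped like the chunks.jsonl sketch above:
  context, citations = build_context([
      {"id": "install-gateway__prereqs",
       "source_url": "https://docs.example.com/install-gateway#prereqs",
       "text": "Confirm that ports 443 and 8443 are open before installing the gateway."},
      {"id": "install-gateway__verify",
       "source_url": "https://docs.example.com/install-gateway#verify",
       "text": "Run the status command to verify the installation."},
  ])
  print(context)     # what the model sees: each passage prefixed with its section ID
  print(citations)   # what the answer cites: traceable to the exact paragraph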
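
Use case 3 treats the element type as the routing key. A simplified routing sketch: the pipeline names and routing table are invented for illustration, and the snippet is schema-less stand-in XML rather than valid DITA.

  import xml.etree.ElementTree as ET

  # Illustrative routing table: element type -> translation pipeline.
  ROUTES = {
      "warning":   "regulated-nmt",    # safety-critical: constrained NMT plus mandatory review
      "step":      "procedural-nmt",   # imperative phrasing preserved
      "shortdesc": "general-nmt",
  }

  def route_segments(topic_xml):
      """Yield (element, text, pipeline) for every translatable element in a topic."""
      root = ET.fromstring(topic_xml)
      for element in root.iter():
          text = (element.text or "").strip()
          if text:
              yield element.tag, text, ROUTES.get(element.tag, "general-nmt")

  # A simplified snippet standing in for a typed DITA task:
  topic = """<task id="replace-fuse">
    <shortdesc>Replace the main fuse.</shortdesc>
    <warning>Disconnect power before opening the panel.</warning>
    <step>Remove the four retaining screws.</step>
  </task>"""

  for tag, text, pipeline in route_segments(topic):
      print(f"{pipeline:15} <- <{tag}> {text}")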
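
Use case 4 starts with deduplication. The minimal pass sketched here drops exact copy-paste duplicates by hashing normalized text; near-duplicate detection would sit on top of it, and the record shape again follows the chunks.jsonl example.

  import hashlib
  import json
  import re

  def normalize(text):
      # Collapse whitespace and case so copy-paste variants hash identically.
      return re.sub(r"\s+", " ", text).strip().lower()

  def deduplicate(path="chunks.jsonl"):
      """Keep the first occurrence of each distinct passage; drop exact copy-paste duplicates."""
      seen, kept = set(), []
      with open(path, encoding="utf-8") as f:
          for line in f:
              record = json.loads(line)
              digest = hashlib.sha256(normalize(record["text"]).encode("utf-8")).hexdigest()
              if digest not in seen:
                  seen.add(digest)
                  kept.append(record)
      return kept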

The payoff

The measurement is the deliverable.

Precision isn’t a number that comes out of the model. It comes out of a test harness. A replay corpus of real user queries — the same ones that produced the baseline — runs again after the structured content is in place, scored against ground-truth answers, with the same vector store and the same retrieval prompt.
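
A minimal sketch of that harness, assuming ground truth is expressed as the section IDs a correct answer must draw on and that retrieve() is whatever retrieval path production uses:

  def precision_at_k(replay_set, retrieve, k=5):
      """Average fraction of retrieved chunks whose section ID appears in the ground truth for that query."""
      scores = []
      for query, relevant_ids in replay_set:
          retrieved = retrieve(query, k=k)   # same vector store, same filters, same prompt path
          hits = sum(1 for chunk in retrieved if chunk["id"] in relevant_ids)
          scores.append(hits / k)
      return sum(scores) / len(scores)

  # replay_set pairs each real user query from the baseline run with the section
  # IDs a correct answer must draw on, e.g.:
  #   [("how do I open ports for the gateway?", {"install-gateway__prereqs"}), ...]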

The same engineering pays compounding dividends. The corpus you build for chatbot retrieval is the same corpus that feeds semantic search, AI translation, fine-tuning, and compliance auditing. Five consumers, one substrate — every correction propagates everywhere.

On typical Extense engagements, RAG retrieval precision rises from a 25–35% baseline to 85%+ — measured against real user queries before the engagement closes. The number isn’t promised. It’s the deliverable.

Sample Content Assessment

Send us 20 topics from your current documentation. We’ll score them on 7 AI-readiness dimensions and return a gap analysis with concrete next steps — within two business days.

Submit a sample →