TYPE·CHUNK·FILTER·MEASURE

Retrievable, by design.


Chunking strategy, typed DITA topics, metadata filters, and section-ID stability make 85%+ RAG retrieval precision a measurable property of the corpus, not a model parameter and not a hope. The translation layer between your documentation and the LLMs that consume it.

Two pipelines. One measurement.

Same query, same LLM, same vector store. What the model retrieves is determined entirely by whether the content was engineered for retrieval.

UNSTRUCTURED

  1. Content sources

    Word · PDF · Wiki · Markdown

  2. Flat ingestion

    no typing · no metadata · no IDs

  3. Arbitrary chunking

    667-word splits · mid-sentence cuts · broken tables

  4. Vector store

    no metadata filters · all content ranked equally

  5. LLM context window

    outdated + irrelevant chunks mixed in · model averages conflicting signals

  6. Outcome: 25–35% precision

    wrong answers · hallucinations · escalated tickets

STRUCTURED

  1. Content sources

    Word · PDF · FrameMaker · Wiki · Markdown

  2. Typed DITA topics

    concept · task · reference · troubleshooting

  3. Metadata + section IDs

    audience · difficulty · domain · intent · stable IDs everywhere

  4. Semantic chunking

    section-ID boundaries · self-contained · chunks.jsonl + glossary-index.json · sample record below

  5. Vector store + metadata filters

    audience/type/domain pre-filter · top-k reranked before the LLM sees anything

  6. Outcome: 85%+ precision

    correct answers · cited sources · reduced support load
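
Concretely, here is a minimal Python sketch of what one chunks.jsonl record from the structured pipeline might carry. The field names and values are illustrative rather than a fixed schema; the point is that every chunk travels with its stable section ID and the metadata the vector store will filter on.

  import json

  # Hypothetical chunks.jsonl record: one self-contained section, bounded by its
  # stable section ID and carrying the metadata the vector store filters on.
  chunk = {
      "id": "install-gateway__prereqs",
      "topic_type": "task",
      "audience": "administrator",
      "difficulty": "intermediate",
      "domain": "networking",
      "intent": "install",
      "source_url": "https://docs.example.com/install-gateway#prereqs",
      "text": "Before installing the gateway, confirm that ports 443 and 8443 are open.",
  }

  with open("chunks.jsonl", "a", encoding="utf-8") as f:
      f.write(json.dumps(chunk, ensure_ascii=False) + "\n")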

The pipeline

Precision is a content engineering problem.


AI retrieval precision is a function of how the content was structured, not how the model was tuned. The 7-dimension AI-readiness scorecard quantifies the gap. Word / PDF / FrameMaker content gets converted to typed DITA with stable section IDs. A subject scheme and SKOS-based controlled vocabulary tag every topic — audience, difficulty, domain, intent. The DITA-OT build emits chunks.jsonl and glossary-index.json alongside the HTML/PDF outputs. The corpus lands in Pinecone, Weaviate, or pgvector with metadata filters that pre-scope the search before similarity scoring runs. Retrieval precision is measured against real user queries before the engagement closes.
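
One way to make "every topic is tagged" enforceable rather than aspirational is a validation pass that fails the commit whenever a record is missing a controlled value. A minimal sketch, assuming records shaped like the chunks.jsonl example above; the vocabulary here is a placeholder for whatever the SKOS subject scheme actually defines.

  import json
  import sys

  # Illustrative controlled vocabulary; in practice these values come from the
  # SKOS-based subject scheme, not a hand-typed dict.
  VOCAB = {
      "topic_type": {"concept", "task", "reference", "troubleshooting"},
      "audience":   {"operator", "administrator", "developer"},
      "difficulty": {"beginner", "intermediate", "advanced"},
      "domain":     {"networking", "security", "billing"},
      "intent":     {"install", "configure", "troubleshoot", "learn"},
  }

  def validate(path="chunks.jsonl"):
      """Return one error per record whose metadata is missing or outside the vocabulary."""
      errors = []
      with open(path, encoding="utf-8") as f:
          for n, line in enumerate(f, start=1):
              record = json.loads(line)
              for field, allowed in VOCAB.items():
                  if record.get(field) not in allowed:
                      errors.append(f"record {n} ({record.get('id', '?')}): bad {field}={record.get(field)!r}")
      return errors

  if __name__ == "__main__":
      problems = validate()
      print("\n".join(problems) or "coverage gate passed: every record carries valid metadata")
      sys.exit(1 if problems else 0)  # non-zero exit fails the commit hook / CI job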

  1. Audit & Score

    7-dimension AI-readiness scorecard

    Content typing, metadata coverage, addressability, vocabulary control, semantic markup, reuse architecture, pipeline maturity. The engagement is quoted on what is recoverable for retrieval, not on raw topic count.

  2. Structure & Type

    DITA 1.3 specialization, controlled topic types, stable section IDs

    Word / PDF / FrameMaker / wiki content converted to typed DITA. Every section gets an addressable, version-stable ID so retrieval can cite — not approximate.

  3. Enrich Metadata

    Subject scheme, SKOS taxonomy, audience / difficulty / domain / intent

    Controlled vocabulary applied at scale. 100% coverage gate on every topic. Validation rules prevent vocabulary drift on every commit.

  4. Build Pipeline

    DITA-OT, chunks.jsonl, glossary-index.json, content-manifest.json

    A single build emits semantic chunks alongside the HTML/PDF outputs. Every chunk is metadata-tagged, citation-ready, and tied back to its source section ID.

  5. Integrate & Validate

    Pinecone / Weaviate / pgvector, metadata-filtered retrieval, replay tests

    Audience and domain filters applied before similarity scoring runs. Precision measured against real user queries before the engagement closes, not promised. A pre-filter query sketch follows.
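
In the pgvector case, the pre-filter from step 5 is literally a WHERE clause that narrows the candidate set before the distance operator ranks anything. A hypothetical query sketch; the table and column names are illustrative, and computing the query embedding is left out.

  # Hypothetical pgvector query: the metadata filters scope the candidate set
  # before the cosine-distance operator (<=>) ranks anything.
  PREFILTERED_TOP_K = """
      SELECT id, source_url, text,
             embedding <=> %(query_embedding)s AS distance
      FROM   chunks
      WHERE  audience   = %(audience)s          -- right reader
        AND  domain     = %(domain)s            -- right product area
        AND  topic_type = ANY(%(topic_types)s)  -- right kind of content
      ORDER  BY distance
      LIMIT  %(k)s;
  """

  params = {
      "query_embedding": "[0.12, -0.03, 0.41]",   # vector literal from the embedding model (elided here)
      "audience": "administrator",
      "domain": "networking",
      "topic_types": ["task", "troubleshooting"],
      "k": 8,
  }
  # Run with any Postgres client; the reranker then reorders these k rows
  # before anything reaches the LLM context window.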

Five AI use cases, one foundation.

Same engineering, different surface. The corpus you build for RAG is the same corpus that powers semantic search, AI translation, fine-tuning, and compliance auditing — one substrate, five consumers.

  1. Chatbot & RAG retrieval.

    Citation-ready answers, not averaged context.

    Typed chunks with audience and domain filters. Top-k reranked before reaching the LLM context window. Each retrieved chunk carries its section ID, source URL, and provenance — so every answer is citable, traceable, and replayable against the exact paragraph that produced it. The citation assembly is sketched in code after this list.

    Why is your RAG hallucinating on a topic you’ve already documented?

  2. Semantic search.

    Filter before you rank.

    Metadata-filtered semantic search returns the right level for the user before similarity scoring even runs. A beginner concept and an advanced reference don’t get blended into the same top-k. The filter is the relevance — similarity scoring is just the tiebreaker.

    Why is search “improving” without actually answering different questions?

  3. AI-assisted translation.

    Element-aware, not paragraph-blind.

    Semantic DITA elements (<warning>, <step>, <shortdesc>) carry their own translation rules. Safety-critical phrasing routes to the right NMT pipeline; reference content takes a different path; conref-driven reuse is translated once. The element type is the routing key, and the routing table is sketched below the list.

    Why is the same warning translated five different ways across five LMS modules?

  4. LLM fine-tuning.

    High-signal corpus, not noisy concatenation.

    Typed, deduplicated, voice-consistent training data. No copy-paste duplicates injecting contradictions. No format drift teaching the model that your tone is six different tones. The fine-tune learns your voice because the corpus has a voice. A minimal dedup pass is sketched after the list.

    What did your model learn when 30% of the training examples disagreed with each other?

  5. Personalization & compliance auditing.

    Audience and intent metadata, not slide-deck theory.

    Role, level, product-line, and locale metadata on every topic make personalization a delivery property, not a marketing slide. The same metadata enables compliance auditing — the system names which warnings are missing, which prerequisites are untagged, which sections lack the required audit attributes.

    When the auditor asks what your AI told a regulated user, can you reproduce the answer?
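
Use case 1 depends on every retrieved chunk carrying its own provenance. A minimal sketch of the citation assembly, assuming records shaped like the chunks.jsonl example earlier; the helper and field names are illustrative.

  def build_context(retrieved_chunks):
      """Turn retrieved records into a context block where every passage cites its section ID."""
      passages, citations = [], []
      for chunk in retrieved_chunks:
          ref = f"[{chunk['id']}]"
          passages.append(f"{ref} {chunk['text']}")
          citations.append({"ref": ref, "section_id": chunk["id"], "url": chunk["source_url"]})
      return "\n\n".join(passages), citations

  # Two hypothetical retrieved records, shaped like the chunks.jsonl sketch above:
  context, citations = build_context([
      {"id": "install-gateway__prereqs",
       "source_url": "https://docs.example.com/install-gateway#prereqs",
       "text": "Confirm that ports 443 and 8443 are open before installing the gateway."},
      {"id": "install-gateway__verify",
       "source_url": "https://docs.example.com/install-gateway#verify",
       "text": "Run the status command to verify the installation."},
  ])
  print(context)     # what the model sees: each passage prefixed with its section ID
  print(citations)   # what the answer cites: traceable to the exact paragraph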
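
Use case 3 treats the element type as the routing key. A simplified routing sketch: the pipeline names and routing table are invented for illustration, and the snippet is schema-less stand-in XML rather than valid DITA.

  import xml.etree.ElementTree as ET

  # Illustrative routing table: element type -> translation pipeline.
  ROUTES = {
      "warning":   "regulated-nmt",    # safety-critical: constrained NMT plus mandatory review
      "step":      "procedural-nmt",   # imperative phrasing preserved
      "shortdesc": "general-nmt",
  }

  def route_segments(topic_xml):
      """Yield (element, text, pipeline) for every translatable element in a topic."""
      root = ET.fromstring(topic_xml)
      for element in root.iter():
          text = (element.text or "").strip()
          if text:
              yield element.tag, text, ROUTES.get(element.tag, "general-nmt")

  # A simplified snippet standing in for a typed DITA task:
  topic = """<task id="replace-fuse">
    <shortdesc>Replace the main fuse.</shortdesc>
    <warning>Disconnect power before opening the panel.</warning>
    <step>Remove the four retaining screws.</step>
  </task>"""

  for tag, text, pipeline in route_segments(topic):
      print(f"{pipeline:15} <- <{tag}> {text}")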
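
Use case 4 starts with deduplication. The minimal pass sketched here drops exact copy-paste duplicates by hashing normalized text; near-duplicate detection would sit on top of it, and the record shape again follows the chunks.jsonl example.

  import hashlib
  import json
  import re

  def normalize(text):
      # Collapse whitespace and case so copy-paste variants hash identically.
      return re.sub(r"\s+", " ", text).strip().lower()

  def deduplicate(path="chunks.jsonl"):
      """Keep the first occurrence of each distinct passage; drop exact copy-paste duplicates."""
      seen, kept = set(), []
      with open(path, encoding="utf-8") as f:
          for line in f:
              record = json.loads(line)
              digest = hashlib.sha256(normalize(record["text"]).encode("utf-8")).hexdigest()
              if digest not in seen:
                  seen.add(digest)
                  kept.append(record)
      return kept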

The payoff

The measurement is the deliverable.

Precision isn’t a number that comes out of the model. It comes out of a test harness. A replay corpus of real user queries — the same ones that produced the baseline — runs again after the structured content is in place, scored against ground-truth answers, with the same vector store and the same retrieval prompt.
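
A minimal sketch of that harness, assuming ground truth is expressed as the section IDs a correct answer must draw on and that retrieve() is whatever retrieval path production uses:

  def precision_at_k(replay_set, retrieve, k=5):
      """Average fraction of retrieved chunks whose section ID appears in the ground truth for that query."""
      scores = []
      for query, relevant_ids in replay_set:
          retrieved = retrieve(query, k=k)   # same vector store, same filters, same prompt path
          hits = sum(1 for chunk in retrieved if chunk["id"] in relevant_ids)
          scores.append(hits / k)
      return sum(scores) / len(scores)

  # replay_set pairs each real user query from the baseline run with the section
  # IDs a correct answer must draw on, e.g.:
  #   [("how do I open ports for the gateway?", {"install-gateway__prereqs"}), ...]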

The same engineering pays compounding dividends. The corpus you build for chatbot retrieval is the same corpus that feeds semantic search, AI translation, fine-tuning, and compliance auditing. Five consumers, one substrate — every correction propagates everywhere.

On typical Extense engagements, RAG retrieval precision rises from a 25–35% baseline to 85%+ — measured against real user queries before the engagement closes. The number isn’t promised. It’s the deliverable.

Sample Content Assessment

Send us 20 topics from your current documentation. We’ll score them on 7 AI-readiness dimensions and return a gap analysis with concrete next steps — within two business days.

Submit a sample →