TYPE·CHUNK·FILTER·MEASURE

Your AI Is Only As Good As Your Content.


You’ve invested in LLMs, chatbots, and RAG pipelines. But if your content is unstructured — or structured but unenriched — your AI returns wrong answers, your users lose trust, and your investment delivers a fraction of its potential. Both problems have the same fix.

25–35%
Chatbot precision on unstructured or unenriched content
85%+
Precision after structured content engineering
Average improvement in AI answer accuracy
15 yrs
Enterprise content engineering experience

What Unstructured and Unenriched Content Is Costing You Right Now

These aren’t hypothetical risks. They’re the measurable consequences of deploying AI on content that is either unstructured (Word, PDF, wiki) or structured but unenriched — DITA or XML content that lacks the metadata, controlled vocabulary, and section addressability that AI systems actually need.

  • AI Hallucinations Fill Your Content Gaps

    When your AI can’t find a precise answer, it fabricates one. Without typed content, metadata filters, and controlled vocabulary, the model has no boundary between “what the docs say” and “what it thinks sounds right.” Your users get confident wrong answers.

  • Support Tickets Don’t Decrease

    Your chatbot was supposed to deflect support volume. Instead, users escalate because the bot gave them the wrong version of a procedure, or a developer answer when they needed a manager overview. Wrong answers cost more than no answer — they create frustrated users and follow-up work.

  • Localization Costs Stay High

    AI-assisted translation quality degrades on both unstructured content and unenriched DITA. Neural MT tools produce better output when they can read semantic element types — a <step> is translated as an imperative instruction; a <note type="warning"> triggers safety phrasing rules. Without typed elements or proper metadata, everything is a paragraph, and everything costs the same to translate badly.

  • Fine-Tuning Produces Inconsistent Models

    If you’re training or fine-tuning a model on your proprietary content, data quality determines output quality. Copy-pasted Word documents inject contradictions, tone shifts, and duplicates into your training set. The model learns to blend genres and confuse voice — exactly the failure mode you paid to avoid.

  • Personalization Stays a Slide Deck Concept

    Delivering the right content to the right user — by role, difficulty, product line — requires machine-readable audience and domain metadata on every topic. This is an unenriched content problem as much as a structural one: you may already have DITA XML, but if that DITA lacks audience, product, and difficulty attributes, your personalization engine falls back to serving the same content to everyone and calling it targeted.

  • Search Relevance Doesn’t Improve

    Semantic search still needs metadata to filter. The best embedding model in the world can’t distinguish a beginner concept from an advanced reference if those distinctions aren’t encoded in the content. This is a pure enrichment problem — your content can be perfectly valid DITA and still be invisible to a metadata filter if it was never enriched with audience, difficulty, or domain values.

Two Content Pipelines. Two Different Outcomes.

The same user question. The same AI model. Two completely different results, determined entirely by whether the content feeding the model is structured and enriched.

Without Structured, Enriched Content

  1. Content Sources

    Word · PDF · Wiki · Markdown

    no transformation
  2. Flat / Unenriched Ingestion

    No typing · No metadata · No IDs

  3. Arbitrary Chunking

    667-word splits · mid-sentence cuts · broken tables · lost context

  4. Vector Store

    No metadata filters available · All content ranked equally

  5. LLM Context Window

    Outdated + irrelevant chunks mixed in · Model averages conflicting signals

  6. 25–35% Precision

    Wrong answers · Hallucinations · Broken trust · Escalated tickets

With Extense Structured Content

  1. Content Sources

    Word · PDF · Wiki · Markdown

    Extense migration
  2. Typed DITA Topics

    concept · task · reference · troubleshooting

  3. Metadata & Section IDs

    audience · difficulty · domain · intent · Stable IDs on every section

  4. Semantic Chunking

    Boundaries at section IDs · self-contained · chunks.jsonl + glossary-index.json

  5. Vector Store + Metadata Filters

    Filter by audience, type, domain first · Top-k reranked before LLM sees anything

  6. 85%+ Precision

    Correct answers · Far fewer hallucinations · User trust · Reduced support load
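
To make steps 2 through 4 of the structured pipeline concrete, here is a minimal sketch of chunking at section IDs. It is an illustration under stated assumptions, not the Extense implementation: the sample topic, the `chunk_topic` helper, the `othermeta` names, and the `topicid/sectionid` chunk ID format are all hypothetical.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical DITA concept topic. The metadata names and values are
# illustrative assumptions, not a prescribed schema.
TOPIC_XML = """\
<concept id="token-rotation">
  <title>Token rotation</title>
  <prolog>
    <metadata>
      <audience type="administrator"/>
      <othermeta name="difficulty" content="intermediate"/>
      <othermeta name="domain" content="security"/>
    </metadata>
  </prolog>
  <conbody>
    <section id="why-rotate">
      <title>Why rotate tokens</title>
      <p>Short-lived tokens limit the impact of a leak.</p>
    </section>
    <section id="rotation-schedule">
      <title>Rotation schedule</title>
      <p>Rotate tokens every 90 days.</p>
    </section>
  </conbody>
</concept>
"""


def chunk_topic(xml_text):
    """Return one chunk per section, each carrying the topic-level metadata."""
    root = ET.fromstring(xml_text)
    meta = {"topic_type": root.tag}
    audience = root.find("./prolog/metadata/audience")
    if audience is not None:
        meta["audience"] = audience.get("type")
    for other in root.iterfind("./prolog/metadata/othermeta"):
        meta[other.get("name")] = other.get("content")

    chunks = []
    for section in root.iter("section"):
        section_id = section.get("id")
        if section_id is None:
            continue  # sections without an ID cannot be cited, so this sketch skips them
        # Collapse the section's text nodes into one whitespace-normalized string.
        text = " ".join(" ".join(section.itertext()).split())
        chunks.append({"id": f"{root.get('id')}/{section_id}", "text": text, **meta})
    return chunks


chunks = chunk_topic(TOPIC_XML)
jsonl = "\n".join(json.dumps(chunk) for chunk in chunks)
print(jsonl)
```

Each output line is one self-contained chunk whose boundary is a section, not a word count, and whose ID points back to the source section for citation.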

Every AI Use Case Needs the Same Foundation

It’s not just chatbots. Every AI initiative your organization runs draws from the same content well. Structure it once and every system benefits.

  • Chatbot / RAG

    Needs: Typed chunks, metadata filters, section IDs for citation
    Without structure: 25–35% precision. Hallucinations fill gaps.
    With Extense: 85%+ precision. Cited, traceable answers.
  • Semantic Search

    Needs: audience, difficulty, domain metadata per topic
    Without structure: Relevance improves slightly. Filtering impossible.
    With Extense: Role-scoped results. Right level, right domain.
  • AI-Assisted Translation

    Needs: Semantic elements (<step>, <note type="warning">, <shortdesc>)
    Without structure: NMT treats everything as prose. Quality inconsistent.
    With Extense: Element-aware NMT. Safety phrasing rules fire on <note type="warning">.
  • LLM Fine-Tuning

    Needs: Clean, typed, deduplicated training examples
    Without structure: Noisy training data. Model learns contradictions.
    With Extense: High-signal corpus. Model learns your voice, not an average.
  • Personalization

    Needs: audience + difficulty metadata on every topic
    Without structure: Same content for everyone. Personalization is theoretical.
    With Extense: Dynamic delivery by role, level, product line.
  • Compliance Auditing

    Needs: Schema-enforced structure to audit against
    Without structure: AI can flag keywords but not structural gaps.
    With Extense: AI detects missing <prereq>, untagged warnings, coverage gaps.
  • AI-Assisted Authoring

    Needs: Typed templates + controlled vocabulary constraints
    Without structure: AI generates off-topic, inconsistent drafts.
    With Extense: AI fills typed templates. Output is consistently on-structure.
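
The compliance auditing case above depends on structure that can be checked mechanically. As a rough sketch of the kind of structural gap typed content exposes, the snippet below lists task topics that have no <prereq>. The sample topics and the `tasks_missing_prereq` helper are hypothetical, and a real audit would likely run against a full schema, not a single rule.

```python
import xml.etree.ElementTree as ET

# Hypothetical task topics keyed by file name.
TASKS = {
    "install-agent.dita": (
        "<task id='install-agent'><title>Install the agent</title>"
        "<taskbody><prereq>Admin rights are required.</prereq>"
        "<steps><step><cmd>Run the installer.</cmd></step></steps>"
        "</taskbody></task>"
    ),
    "reset-password.dita": (
        "<task id='reset-password'><title>Reset a password</title>"
        "<taskbody><steps><step><cmd>Open Settings.</cmd></step></steps>"
        "</taskbody></task>"
    ),
}


def tasks_missing_prereq(topics):
    """Return the names of task topics whose body has no <prereq> element."""
    missing = []
    for name, xml_text in topics.items():
        root = ET.fromstring(xml_text)
        # Only task topics are expected to carry prerequisites in this sketch.
        if root.tag == "task" and root.find("./taskbody/prereq") is None:
            missing.append(name)
    return missing


print(tasks_missing_prereq(TASKS))
```

The same check is not possible on a Word or PDF source, where a prerequisite is just another paragraph.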

What “Good Enough” Content Is Actually Costing

These are conservative estimates for a 500-person organization with 1,000 documentation topics and an active AI program.

  • $180K

    Annual support escalation

    Chatbot wrong-answer rate driving ~40% escalation instead of the target 15%. At $90/ticket, that’s 2,000 unnecessary tickets per year.

  • $240K

    Wasted localization spend

    Unstructured or unenriched content translated at full cost per word, including duplicated fragments, outdated copy-paste blocks, and content that should have been reused. Proper reuse metadata eliminates most of this waste.

  • $120K

    AI retraining cycles

    Fine-tuning on noisy unstructured or unenriched data produces inconsistent models. Even valid DITA without controlled vocabulary injects tone inconsistency and contradictions into your training corpus.

Total annual drag: ~$540K — against a one-time structured content engineering engagement that eliminates the root cause.
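
The arithmetic behind these figures can be checked directly. The sketch below uses only the numbers stated above. The one derived value, the implied annual chatbot conversation volume, is an inference from the 40% and 15% escalation rates and is not a figure given in the estimates.

```python
# Figures stated in the estimates above.
COST_PER_TICKET = 90           # dollars per escalated ticket
UNNECESSARY_TICKETS = 2_000    # tickets per year
ACTUAL_ESCALATION_PCT = 40     # observed escalation rate
TARGET_ESCALATION_PCT = 15     # target escalation rate
LOCALIZATION_WASTE = 240_000   # dollars per year
RETRAINING_COST = 120_000      # dollars per year

support_escalation = COST_PER_TICKET * UNNECESSARY_TICKETS

# Derived, not stated in the source: the conversation volume at which a
# 25-point excess escalation rate yields 2,000 unnecessary tickets.
excess_pct = ACTUAL_ESCALATION_PCT - TARGET_ESCALATION_PCT
implied_conversations = UNNECESSARY_TICKETS * 100 // excess_pct

total_drag = support_escalation + LOCALIZATION_WASTE + RETRAINING_COST

print(f"Support escalation: ${support_escalation:,}")
print(f"Implied chatbot conversations per year: {implied_conversations:,}")
print(f"Total annual drag: ${total_drag:,}")
```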

From Unstructured or Unenriched to AI-Ready in 5 Phases

A deterministic engineering process — not a consulting engagement with open-ended deliverables. Whether you’re starting from Word documents or from DITA that was never properly enriched, each phase has a concrete output and a measurable improvement.

  1. Audit & Score

    7-dimension AI-readiness scorecard

    Content typing, metadata coverage, addressability, vocabulary control, semantic markup, reuse architecture, pipeline maturity. The engagement is quoted on what is recoverable for retrieval, not on raw topic count.

  2. Structure & Type

    DITA 1.3 specialization, controlled topic types, stable section IDs

    Word / PDF / FrameMaker / wiki content converted to typed DITA. Every section gets an addressable, version-stable ID so retrieval can cite — not approximate.

  3. Enrich Metadata

    Subject scheme, SKOS taxonomy, audience / difficulty / domain / intent

    Controlled vocabulary applied at scale. 100% coverage gate on every topic. Validation rules prevent vocabulary drift on every commit.

  4. Build Pipeline

    DITA-OT, chunks.jsonl, glossary-index.json, content-manifest.json

    A single build emits semantic chunks alongside the HTML/PDF outputs. Every chunk is metadata-tagged, citation-ready, and tied back to its source section ID.

  5. Integrate & Validate

    Pinecone / Weaviate / pgvector, metadata-filtered retrieval, replay tests

    Audience and domain filters are applied before similarity scoring runs. Precision is measured against real user queries before the engagement closes, not promised.
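
Phase 5 can be sketched in a few lines: filter on metadata first, rank the survivors by similarity, then replay a known query and score the result. This is a self-contained illustration under stated assumptions, not a vector-store integration. The in-memory `CHUNKS` list, the toy three-dimensional vectors, the `retrieve` and `precision_at_k` helpers, and the choice of precision at k as the metric are all hypothetical.

```python
import math

# Toy corpus. Real vectors would come from an embedding model.
CHUNKS = [
    {"id": "token-rotation/why-rotate", "audience": "administrator",
     "domain": "security", "vector": [0.9, 0.1, 0.0]},
    {"id": "token-rotation/api-reference", "audience": "developer",
     "domain": "security", "vector": [0.8, 0.2, 0.1]},
    {"id": "billing/invoices", "audience": "administrator",
     "domain": "billing", "vector": [0.1, 0.9, 0.2]},
]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vector, filters, k=3):
    """Apply metadata filters first, then rank the remaining chunks by similarity."""
    candidates = [
        chunk for chunk in CHUNKS
        if all(chunk.get(field) == value for field, value in filters.items())
    ]
    ranked = sorted(
        candidates,
        key=lambda chunk: cosine(query_vector, chunk["vector"]),
        reverse=True,
    )
    return [chunk["id"] for chunk in ranked[:k]]


def precision_at_k(retrieved, relevant):
    """Fraction of retrieved chunk IDs that are in the relevant set."""
    if not retrieved:
        return 0.0
    return sum(1 for chunk_id in retrieved if chunk_id in relevant) / len(retrieved)


# Replay test: one recorded query with a known correct chunk.
query = [1.0, 0.0, 0.0]
relevant = {"token-rotation/why-rotate"}

filtered = retrieve(query, {"audience": "administrator", "domain": "security"}, k=2)
unfiltered = retrieve(query, {}, k=2)

print("filtered:", filtered, precision_at_k(filtered, relevant))
print("unfiltered:", unfiltered, precision_at_k(unfiltered, relevant))
```

In this toy example the unfiltered query also returns the developer-facing reference chunk, which is similar in vector space but wrong for the audience. The metadata filter removes it before ranking.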

We Engineer — We Don’t Just Advise

Every engagement ends with working, measurable infrastructure, not a report with recommendations you have to implement yourself.

  1. Free AI-Readiness Assessment

    Send us 20 topics. We score them on 7 dimensions — content typing, metadata coverage, section addressability, vocabulary control, semantic markup, reuse architecture, and pipeline maturity — and return a gap analysis with specific next steps.

    Deliverable Scored gap analysis · No obligation

  2. Structured Migration

    We convert your Word, PDF, FrameMaker, or wiki content to typed DITA XML with section IDs. Automation tooling handles scale — 1,000 topics is not 1,000 hours of manual work.

    Deliverable Typed DITA corpus with full section addressability

  3. Taxonomy & Metadata Engineering

    We design a controlled vocabulary — audience, difficulty, domain, intent — and apply it at scale. Validation rules prevent vocabulary drift from day one.

    Deliverable Subject scheme + 100% metadata coverage

  4. AI-Native Publishing Pipeline

    We configure your DITA-OT toolchain to produce chunks.jsonl, glossary-index.json, and content-manifest.json alongside HTML and PDF — all from a single source, on every build.

    Deliverable Automated multi-format pipeline including JSONL

  5. Integration & Precision Validation

    We connect your output to your vector store, configure metadata filters for intent routing and audience scoping, run retrieval tests against real user queries, and deliver a measured precision baseline.

    Deliverable Live AI retrieval with precision measured against the 85%+ target
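
The validation rules and 100% metadata coverage named in step 3 can be pictured as a simple gate. The sketch below is one possible shape for such a check, assuming a small controlled vocabulary held in a dictionary. The vocabulary values, the sample topics, and the `validate` and `coverage` helpers are hypothetical. In practice the vocabulary would come from the subject scheme.

```python
# Hypothetical controlled vocabulary for the four metadata fields.
VOCABULARY = {
    "audience": {"administrator", "developer", "end-user"},
    "difficulty": {"beginner", "intermediate", "advanced"},
    "domain": {"security", "billing"},
    "intent": {"learn", "do", "look-up", "fix"},
}

# Hypothetical topic metadata as it might be read from topic prologs.
TOPICS = {
    "install-agent": {"audience": "administrator", "difficulty": "beginner",
                      "domain": "security", "intent": "do"},
    "token-rotation": {"audience": "administrator", "difficulty": "intermediate",
                       "domain": "security"},  # intent is missing
    "invoice-fields": {"audience": "admin", "difficulty": "beginner",
                       "domain": "billing", "intent": "look-up"},  # drifted value
}


def validate(topic_meta):
    """Return a list of problems: missing fields or values outside the vocabulary."""
    problems = []
    for field, allowed in VOCABULARY.items():
        value = topic_meta.get(field)
        if value is None:
            problems.append(f"missing {field}")
        elif value not in allowed:
            problems.append(f"{field}={value!r} not in controlled vocabulary")
    return problems


def coverage(topics):
    """Fraction of topics whose metadata is complete and valid."""
    valid = sum(1 for meta in topics.values() if not validate(meta))
    return valid / len(topics)


for name, meta in TOPICS.items():
    for problem in validate(meta):
        print(f"{name}: {problem}")

passes_gate = coverage(TOPICS) == 1.0
print("coverage gate passed:", passes_gate)
```

Run on every commit, a check like this rejects a drifted value such as "admin" before it reaches the vector store as an unfilterable value.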

Sample Content Assessment

Send us 20 topics from your current documentation. We’ll score them on 7 AI-readiness dimensions and return a gap analysis with the next concrete steps — within two business days.

Submit a sample →