TYPE·CHUNK·FILTER·MEASURE
Your AI Is Only As Good As Your Content.
You’ve invested in LLMs, chatbots, and RAG pipelines. But if your content is unstructured — or structured but unenriched — your AI returns wrong answers, your users lose trust, and your investment delivers a fraction of its potential. Both problems have the same fix.
- 25–35%
- Chatbot precision on unstructured or unenriched content
- 85%+
- Precision after structured content engineering
- 3×
- Average improvement in AI answer accuracy
- 15 yrs
- Enterprise content engineering experience
What Unstructured and Unenriched Content Is Costing You Right Now
These aren’t hypothetical risks. They’re the measurable consequences of deploying AI on content that is either unstructured (Word, PDF, wiki) or structured but unenriched — DITA or XML content that lacks the metadata, controlled vocabulary, and section addressability that AI systems actually need.
-
AI Hallucinations Fill Your Content Gaps
When your AI can’t find a precise answer, it fabricates one. Without typed content, metadata filters, and controlled vocabulary, the model has no boundary between “what the docs say” and “what it thinks sounds right.” Your users get confident wrong answers.
-
Support Tickets Don’t Decrease
Your chatbot was supposed to deflect support volume. Instead, users escalate because the bot gave them the wrong version of a procedure, or a developer answer when they needed a manager overview. Wrong answers cost more than no answer — they create frustrated users and follow-up work.
-
Localization Costs Stay High
AI-assisted translation quality degrades on both unstructured content and unenriched DITA. Neural MT tools produce better output when they can read semantic element types — a <step> is translated as an imperative instruction; a <note type="warning"> triggers safety phrasing rules. Without typed elements or proper metadata, everything is a paragraph, and everything costs the same to translate badly.
-
Fine-Tuning Produces Inconsistent Models
If you’re training or fine-tuning a model on your proprietary content, data quality determines output quality. Copy-pasted Word documents inject contradictions, tone shifts, and duplicates into your training set. The model learns to blend genres and confuse voice — exactly the failure mode you paid to avoid.
-
Personalization Stays a Slide Deck Concept
Delivering the right content to the right user — by role, difficulty, product line — requires machine-readable audience and domain metadata on every topic. This is an unenriched content problem as much as a structural one: you may already have DITA XML, but if that DITA lacks audience, product, and difficulty attributes, your personalization engine falls back to serving the same content to everyone and calling it targeted.
-
Search Relevance Doesn’t Improve
Semantic search still needs metadata to filter. The best embedding model in the world can’t distinguish a beginner concept from an advanced reference if those distinctions aren’t encoded in the content. This is a pure enrichment problem — your content can be perfectly valid DITA and still be invisible to a metadata filter if it was never enriched with audience, difficulty, or domain values.
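To make the filtering point concrete, here is a minimal sketch of metadata-first retrieval. It is an illustration under stated assumptions, not a description of any particular vector store: the `Chunk` fields, the toy two-dimensional embeddings, and the `search` helper are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    # Hypothetical record shape; field names are assumptions for illustration.
    section_id: str
    text: str
    audience: str
    difficulty: str
    embedding: list[float]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(chunks, query_embedding, audience, difficulty, k=3):
    # Filter on metadata first, then rank only the survivors by similarity.
    candidates = [
        c for c in chunks
        if c.audience == audience and c.difficulty == difficulty
    ]
    ranked = sorted(
        candidates,
        key=lambda c: cosine(c.embedding, query_embedding),
        reverse=True,
    )
    return ranked[:k]


chunks = [
    Chunk("install-overview", "What the installer does.", "manager", "beginner", [0.9, 0.1]),
    Chunk("install-cli", "Run the installer from the CLI.", "developer", "advanced", [0.95, 0.05]),
    Chunk("install-gui", "Click through the setup wizard.", "manager", "beginner", [0.7, 0.3]),
]

results = search(chunks, [1.0, 0.0], audience="manager", difficulty="beginner")
print([c.section_id for c in results])
```

In this toy data the developer chunk is the closest match by similarity alone, yet it never reaches the manager because the filter removes it before ranking. Without the `audience` and `difficulty` values there would be nothing to filter on.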
Two Content Pipelines. One Outcome Difference.
The same user question. The same AI model. Two completely different results, determined entirely by whether the content feeding the model is structured and enriched.
Without Structured, Enriched Content
-
Content Sources
Word · PDF · Wiki · Markdown
no transformation
-
Flat / Unenriched Ingestion
No typing · No metadata · No IDs
-
Arbitrary Chunking
667-word splits · mid-sentence cuts · broken tables · lost context
-
Vector Store
No metadata filters available · All content ranked equally
-
LLM Context Window
Outdated + irrelevant chunks mixed in · Model averages conflicting signals
-
25–35% Precision
Wrong answers · Hallucinations · Broken trust · Escalated tickets
With Extense Structured Content
-
Content Sources
Word · PDF · Wiki · Markdown
Extense migration
-
Typed DITA Topics
concept · task · reference · troubleshooting
-
Metadata & Section IDs
audience · difficulty · domain · intent · Stable IDs on every section
-
Semantic Chunking
Boundaries at section IDs · self-contained · chunks.jsonl + glossary-index.json
-
Vector Store + Metadata Filters
Filter by audience, type, domain first · Top-k reranked before LLM sees anything
-
85%+ Precision
Correct answers · Far fewer hallucinations · User trust · Reduced support load
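As a rough illustration of chunking at section boundaries, the sketch below splits one DITA concept topic into one record per `<section>` and serializes the records as JSON Lines. The sample topic, the record field names, and the `topic-id/section-id` key format are assumptions made for this example; the actual chunks.jsonl schema is not specified here.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sample topic, used only to exercise the sketch.
TOPIC = """<concept id="sso-overview">
  <title>Single sign-on</title>
  <prolog><metadata>
    <audience type="administrator"/>
    <othermeta name="difficulty" content="beginner"/>
  </metadata></prolog>
  <conbody>
    <section id="how-it-works"><title>How it works</title>
      <p>Sign-in is delegated to your identity provider.</p></section>
    <section id="limits"><title>Limits</title>
      <p>Only one identity provider can be active at a time.</p></section>
  </conbody>
</concept>"""


def chunk_topic(xml_text):
    """Return one metadata-tagged chunk per <section>, keyed by its ID."""
    root = ET.fromstring(xml_text)
    audience = root.find("./prolog/metadata/audience").get("type")
    difficulty = root.find(
        "./prolog/metadata/othermeta[@name='difficulty']"
    ).get("content")
    chunks = []
    for section in root.iter("section"):
        # Collapse whitespace so each chunk is a clean, self-contained string.
        text = " ".join(" ".join(section.itertext()).split())
        chunks.append({
            "id": f"{root.get('id')}/{section.get('id')}",
            "type": root.tag,
            "audience": audience,
            "difficulty": difficulty,
            "text": text,
        })
    return chunks


chunks = chunk_topic(TOPIC)
jsonl = "\n".join(json.dumps(c) for c in chunks)
print(jsonl)
```

Because every chunk starts and ends at a section boundary and carries its source ID, a retrieved chunk can be cited back to the exact section it came from.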
Every AI Use Case Needs the Same Foundation
It’s not just chatbots. Every AI initiative your organization runs draws from the same content well.
Structure it once — every system benefits.
| AI Use Case | What It Needs from Content | Without Structure | With Extense Structure |
|---|---|---|---|
| Chatbot / RAG | Typed chunks, metadata filters, section IDs for citation | 25–35% precision. Hallucinations fill gaps. | 85%+ precision. Cited, traceable answers. |
| Semantic Search | audience, difficulty, domain metadata per topic | Relevance improves slightly. Filtering impossible. | Role-scoped results. Right level, right domain. |
| AI-Assisted Translation | Semantic elements (<step>, <note type="warning">, <shortdesc>) | NMT treats everything as prose. Quality inconsistent. | Element-aware NMT. Safety phrasing rules fire on <note type="warning">. |
| LLM Fine-Tuning | Clean, typed, deduplicated training examples | Noisy training data. Model learns contradictions. | High-signal corpus. Model learns your voice, not an average. |
| Personalization | audience + difficulty metadata on every topic | Same content for everyone. Personalization is theoretical. | Dynamic delivery by role, level, product line. |
| Compliance Auditing | Schema-enforced structure to audit against | AI can flag keywords but not structural gaps. | AI detects missing <prereq>, untagged warnings, coverage gaps. |
| AI-Assisted Authoring | Typed templates + controlled vocabulary constraints | AI generates off-topic, inconsistent drafts. | AI fills typed templates. Output is consistently on-structure. |
What “Good Enough” Content Is Actually Costing
These are conservative estimates for a 500-person organization with 1,000 documentation topics and an active AI program.
-
$180K
Annual support escalation
Chatbot wrong-answer rate driving ~40% escalation instead of the target 15%. At $90/ticket, that’s 2,000 unnecessary tickets per year.
-
$240K
Wasted localization spend
Unstructured or unenriched content translated at full cost per word, including duplicated fragments, outdated copy-paste blocks, and content that should have been reused. Proper reuse metadata eliminates most of this waste.
-
$120K
AI retraining cycles
Fine-tuning on noisy unstructured or unenriched data produces inconsistent models. Even valid DITA without controlled vocabulary injects tone inconsistency and contradictions into your training corpus.
Total annual drag: ~$540K — against a one-time structured content engineering engagement that eliminates the root cause.
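The arithmetic behind these figures can be checked in a few lines. The dollar amounts, ticket cost, and escalation rates are the ones quoted above; the annual conversation volume is not stated in the estimate and is derived here only as the volume those figures imply.

```python
# Figures quoted in the estimate above.
TICKET_COST = 90                 # dollars per escalated ticket
ACTUAL_ESCALATION_PCT = 40       # observed escalation rate, percent
TARGET_ESCALATION_PCT = 15       # target escalation rate, percent
UNNECESSARY_TICKETS = 2_000      # tickets per year above target

support_cost = UNNECESSARY_TICKETS * TICKET_COST

# Derived, not quoted: the yearly chatbot conversation volume at which a
# 25-point escalation gap produces 2,000 extra tickets.
implied_conversations = UNNECESSARY_TICKETS * 100 // (
    ACTUAL_ESCALATION_PCT - TARGET_ESCALATION_PCT
)

localization_cost = 240_000
retraining_cost = 120_000
total = support_cost + localization_cost + retraining_cost

print(support_cost, implied_conversations, total)
```

Swapping in your own ticket cost and escalation rates gives a first-order estimate for your organization.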
From Unstructured or Unenriched to AI-Ready in 5 Phases
A deterministic engineering process — not a consulting engagement with open-ended deliverables. Whether you’re starting from Word documents or from DITA that was never properly enriched, each phase has a concrete output and a measurable improvement.
-
Audit & Score
7-dimension AI-readiness scorecard
Content typing, metadata coverage, addressability, vocabulary control, semantic markup, reuse architecture, pipeline maturity. The corpus is quoted on what is recoverable for retrieval, not on raw topic count.
-
Structure & Type
DITA 1.3 specialization, controlled topic types, stable section IDs
Word / PDF / FrameMaker / wiki content converted to typed DITA. Every section gets an addressable, version-stable ID so retrieval can cite — not approximate.
-
Enrich Metadata
Subject scheme, SKOS taxonomy, audience / difficulty / domain / intent
Controlled vocabulary applied at scale. 100% coverage gate on every topic. Validation rules prevent vocabulary drift on every commit.
-
Build Pipeline
DITA-OT, chunks.jsonl, glossary-index.json, content-manifest.json
A single build emits semantic chunks alongside the HTML/PDF outputs. Every chunk is metadata-tagged, citation-ready, and tied back to its source section ID.
-
Integrate & Validate
Pinecone / Weaviate / pgvector, metadata-filtered retrieval, replay tests
Audience and domain filters are applied before similarity scoring runs. Precision is measured against real user queries before the engagement closes, not promised.
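The coverage gate in the Enrich Metadata phase could be sketched roughly as follows. This is a minimal illustration, assuming metadata has already been extracted into plain dictionaries; the vocabulary values, topic IDs, and function names are hypothetical, and a real gate would read its allowed values from the subject scheme.

```python
REQUIRED_FIELDS = ("audience", "difficulty", "domain", "intent")

# Hypothetical controlled vocabulary; real values would come from the
# subject scheme rather than being hard-coded.
VOCABULARY = {
    "audience": {"administrator", "developer", "manager"},
    "difficulty": {"beginner", "intermediate", "advanced"},
    "domain": {"billing", "security"},
    "intent": {"learn", "do", "look-up", "fix"},
}


def validate_topic(topic_id, metadata):
    """Return a list of problems for one topic (empty means it passes)."""
    errors = []
    for field in REQUIRED_FIELDS:
        value = metadata.get(field)
        if value is None:
            errors.append(f"{topic_id}: missing {field}")
        elif value not in VOCABULARY[field]:
            errors.append(
                f"{topic_id}: {field}={value!r} is not in the controlled vocabulary"
            )
    return errors


def coverage_gate(topics):
    """Return (share of fully valid topics, all errors found)."""
    errors = []
    passing = 0
    for topic_id, metadata in topics.items():
        topic_errors = validate_topic(topic_id, metadata)
        errors.extend(topic_errors)
        passing += not topic_errors
    coverage = passing / len(topics) if topics else 0.0
    return coverage, errors


topics = {
    "reset-password": {"audience": "administrator", "difficulty": "beginner",
                       "domain": "security", "intent": "do"},
    "invoice-fields": {"audience": "manager", "difficulty": "easy",
                       "domain": "billing"},
}

coverage, errors = coverage_gate(topics)
print(coverage, errors)
```

Run on every commit, a check like this fails the build when coverage drops below 100% or when an uncontrolled value such as "easy" appears, which is how vocabulary drift gets caught early.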
We Engineer — We Don’t Just Advise
Every engagement ends with working, measurable infrastructure, not a report with recommendations you have to implement yourself.
-
Free AI-Readiness Assessment
Send us 20 topics. We score them on 7 dimensions — content typing, metadata coverage, section addressability, vocabulary control, semantic markup, reuse architecture, and pipeline maturity — and return a gap analysis with specific next steps.
Deliverable: Scored gap analysis · No obligation
-
Structured Migration
We convert your Word, PDF, FrameMaker, or wiki content to typed DITA XML with section IDs. Automation tooling handles scale — 1,000 topics is not 1,000 hours of manual work.
Deliverable: Typed DITA corpus with full section addressability
-
Taxonomy & Metadata Engineering
We design a controlled vocabulary — audience, difficulty, domain, intent — and apply it at scale. Validation rules prevent vocabulary drift from day one.
Deliverable: Subject scheme + 100% metadata coverage
-
AI-Native Publishing Pipeline
We configure your DITA-OT toolchain to produce chunks.jsonl, glossary-index.json, and content-manifest.json alongside HTML and PDF — all from a single source, on every build.
Deliverable: Automated multi-format pipeline including JSONL
-
Integration & Precision Validation
We connect your output to your vector store, configure metadata filters for intent routing and audience scoping, run retrieval tests against real user queries, and deliver a measured precision baseline.
Deliverable: Live AI retrieval with measured 85%+ precision target
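A measured precision baseline of the kind described above could be computed with a small replay harness like the one below. This is a sketch under assumptions: "precision" is taken here to mean the share of the top-k retrieved chunks that a reviewer judged relevant, averaged over queries, and the queries, chunk IDs, and canned `retrieve` function are hypothetical stand-ins for a real retrieval call.

```python
def precision_at_k(retrieved, relevant, k):
    """Share of the top-k retrieved chunk IDs that are judged relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for chunk_id in top if chunk_id in relevant) / len(top)


def replay(test_cases, retrieve, k):
    """Average precision@k over (query, relevant IDs) pairs."""
    scores = [
        precision_at_k(retrieve(query), relevant, k)
        for query, relevant in test_cases
    ]
    return sum(scores) / len(scores)


# Hypothetical canned results standing in for a live retrieval call.
CANNED = {
    "how do I reset a password": ["reset-password/steps", "reset-password/prereq"],
    "what is sso": ["sso-overview/how-it-works", "billing/invoice-fields"],
}

test_cases = [
    ("how do I reset a password", {"reset-password/steps", "reset-password/prereq"}),
    ("what is sso", {"sso-overview/how-it-works", "sso-overview/limits"}),
]

baseline = replay(test_cases, CANNED.get, k=2)
print(baseline)
```

Replaying the same labeled queries after each content or pipeline change shows whether precision actually moved, rather than relying on spot checks.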
Sample Content Assessment
Send us 20 topics from your current documentation. We’ll score them on 7 AI-readiness dimensions and return a gap analysis with the next concrete steps — within two business days.
Submit a sample →