
CASE 02 / 03 · 2025 · Enterprise documentation

AI-Ready Content Pipeline

Unstructured legacy XML transformed into chunk-safe, metadata-enriched DITA — with measurable AI-readiness scoring.

Duration: 12 weeks
Team: 1 architect · 1 conversion engineer · 1 retrieval engineer
Engagement: Project-based, fixed scope
Status: Shipped · powering production RAG retrieval

Challenge and approach

The challenge

Legacy documentation existed as generic XML and Markdown with no semantic typing, no metadata, and no section identifiers. AI systems treated all content as undifferentiated plain text — chatbot retrieval precision was roughly 25%, and there was no way to filter by audience, difficulty, or intent. Content chunking broke at arbitrary points because there were no semantic boundaries.

The approach

We built a metadata enrichment pipeline using Python tooling that batch-applies a 5-facet controlled vocabulary (audience, difficulty, content domain, content intent, and service line) to every topic. A DITA subject scheme enforces normalized values — preventing the inconsistency that breaks AI filtering. Every section received a stable ID and question-phrased title to match natural-language queries.
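To make the enforcement concrete, here is a minimal sketch of how normalized facet values can be locked down with Pydantic. The model and field names are illustrative, not the shipped validation code; the values mirror the controlled vocabulary listed in full later on this page.

```python
from typing import Literal
from pydantic import BaseModel, ValidationError

# Facet values mirror the controlled vocabulary; names here are illustrative.
Audience = Literal["technical-writer", "developer", "build-engineer", "manager", "all"]
Difficulty = Literal["beginner", "intermediate", "advanced", "expert"]
Intent = Literal["explain", "how-to", "reference", "troubleshoot", "best-practice", "tutorial"]

class TopicMetadata(BaseModel):
    """One topic's facets; an out-of-vocabulary value fails validation immediately."""
    audience: Audience
    difficulty: Difficulty
    intent: Intent

try:
    TopicMetadata(audience="developer", difficulty="simple", intent="how-to")
except ValidationError as err:
    print(err)  # 'simple' is rejected: not a normalized difficulty value
```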

Artifact ledger

  • 100% metadata coverage
  • 150+ section IDs
  • 5 classification facets
  • 2 conversion tools

Stack

Schema: DITA 1.3 · custom 5-facet subject scheme
Tooling: Python 3.12 · lxml · Pydantic validation models
Vector store: Vendor-neutral integration (JSONL export contract · sample record sketched below)
Validation: Schematron · subject scheme validator · pre-commit hooks
Export: JSONL chunks · content manifest · synonym glossary index
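A hedged sketch of one export record under that contract. Field names here are illustrative assumptions, not the shipped schema:

```python
import json

# Hypothetical chunk record; the shipped contract is defined in the export tooling.
record = {
    "id": "configuring-output/how-do-i-configure-x",  # stable section anchor
    "title": "How do I configure X?",                 # question-phrased title
    "text": "Section body text goes here.",
    "metadata": {
        "audience": "developer",
        "difficulty": "intermediate",
        "intent": "how-to",
        "domain": "publishing",
        "chatbot_priority": "high",
    },
}
print(json.dumps(record, ensure_ascii=False))  # one object per line is the JSONL format
```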

Before & after

Before — unstructured XML

  • Generic <article> and <section> elements
  • No topic typing — AI cannot distinguish explanations from procedures
  • Zero metadata fields per topic
  • No section IDs — chunking breaks at arbitrary points
  • Keyword-only search; no intent or audience filtering
  • AI-readiness score: ~2/10

After — AI-ready DITA

  • Formal DITA types (concept, task, reference) with semantic elements
  • Intent routing: concept → 'What is X?', task → 'How do I X?' (see the sketch after this list)
  • 10+ metadata fields per topic (audience, difficulty, duration, intent, domain, chatbot-priority)
  • 150+ addressable section IDs with stable anchors
  • Metadata-filtered vector search with intent classification
  • AI-readiness score: ~9/10
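A minimal sketch of that intent routing. The startswith rules and the Mongo-style filter syntax are simplifying assumptions; the production classifier and vector-store API differ.

```python
def classify_intent(query: str) -> str:
    """Toy router: map a natural-language query to a content-intent facet."""
    q = query.lower()
    if q.startswith(("what is", "what are", "why ")):
        return "explain"   # concept topics answer 'What is X?'
    if q.startswith(("how do i", "how to", "how can i")):
        return "how-to"    # task topics answer 'How do I X?'
    return "reference"

def build_filter(query: str, audience: str) -> dict:
    """Metadata filter passed to the vector store alongside the embedded query."""
    return {"intent": classify_intent(query), "audience": {"$in": [audience, "all"]}}

print(build_filter("How do I configure X?", "developer"))
# {'intent': 'how-to', 'audience': {'$in': ['developer', 'all']}}
```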

Tools built

enrich-metadata.py

Dictionary-driven Python script that batch-applies prolog metadata to every topic file. Maps each topic to its audience, difficulty, duration, content domain, content intent, chatbot priority, prerequisites, and service line — using normalized values from the controlled vocabulary.
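The script itself isn't reproduced here, but the core move looks roughly like this: parse a topic, ensure a <prolog>, write name/value pairs with lxml. The facet map and file path are illustrative; the real tool covers ten-plus fields per topic.

```python
from lxml import etree

# Illustrative facet map; the shipped script reads a dictionary covering every topic.
FACETS = {"audience": "developer", "difficulty": "intermediate", "intent": "how-to"}

def enrich(topic_path: str) -> None:
    tree = etree.parse(topic_path)
    root = tree.getroot()
    prolog = root.find("prolog")
    if prolog is None:
        prolog = etree.Element("prolog")
        root.insert(1, prolog)  # directly after <title> in a typical topic
    metadata = etree.SubElement(prolog, "metadata")
    for name, value in FACETS.items():
        # <othermeta> holds name/value metadata pairs in a DITA prolog
        etree.SubElement(metadata, "othermeta", name=name, content=value)
    tree.write(topic_path, xml_declaration=True, encoding="utf-8")

enrich("topics/configure-output.dita")  # hypothetical path
```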

dita-to-chatbot-json.py

Exports DITA topics to JSONL chunks for vector store ingestion. Section-aware chunking respects semantic boundaries. Generates a content manifest, glossary index with synonym lookup, and an ingestion report with quality metrics.
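A sketch of the section-aware part, using the same illustrative record shape as above. Element names follow DITA; the manifest, glossary, and quality-metric logic is omitted.

```python
import json
from lxml import etree

def export_chunks(topic_path: str):
    """Yield one JSONL line per <section> that carries a stable @id."""
    root = etree.parse(topic_path).getroot()
    topic_id = root.get("id", "")
    for section in root.iter("section"):
        section_id = section.get("id")
        if not section_id:
            continue  # no stable anchor, no chunk; the real tool reports these
        title = section.findtext("title", default="")
        body = " ".join(" ".join(section.itertext()).split())  # normalize whitespace
        yield json.dumps(
            {"id": f"{topic_id}/{section_id}", "title": title, "text": body},
            ensure_ascii=False,
        )

for line in export_chunks("topics/configure-output.dita"):  # hypothetical path
    print(line)
```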

Controlled vocabulary (subject scheme)

Audience: technical-writer · developer · build-engineer · manager · all
Difficulty: beginner · intermediate · advanced · expert
Content domain: dita-fundamentals · authoring · content-reuse · publishing · customization · xslt · devops · architecture
Content intent: explain · how-to · reference · troubleshoot · best-practice · tutorial
Service line: dita-engineering · publishing-engineering · ccms-services · content-migration · xml-engineering

Decisions and trade-offs

The choices that shaped the engagement, recorded with the option taken and what was rejected. The reasoning matters more than the outcome.

  1. Schema for AI corpus

     Chosen: DITA 1.3 + custom subject scheme
     Rejected: Markdown + YAML frontmatter
     Why: DITA schema validation rejects invalid metadata at author-time. Frontmatter drifts within weeks; the corpus stays trustworthy only if invalidity is impossible by construction.

  2. Chunking unit

     Chosen: Section element with stable @id
     Rejected: Sliding-window character chunks
     Why: Semantic boundaries beat arbitrary character spans on retrieval precision. The retrieval engine returns sections, the LLM cites them, and the citation maps back to a stable anchor in the source.

  3. Section title phrasing

     Chosen: Question-phrased ('How do I configure X?')
     Rejected: Statement-phrased ('X configuration')
     Why: Natural-language user queries match question titles. Statement titles fail title-match heuristics and degrade rank in mixed retrieval pipelines.

  4. Metadata enforcement

     Chosen: Subject scheme validation in a pre-commit hook
     Rejected: Run-time metadata cleanup pass
     Why: Cleanup at run time hides drift and lets bad metadata propagate downstream. Rejecting at author-time keeps the corpus honest, and the failure surfaces where the cost to fix is lowest.
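A minimal sketch of that hook, assuming the <othermeta> name/value convention from the enrichment sketch above. The vocabulary is trimmed; the shipped validator derives it from the subject scheme rather than hard-coding it.

```python
import sys
from lxml import etree

# Trimmed vocabulary; the shipped validator reads the DITA subject scheme instead.
VOCAB = {
    "audience": {"technical-writer", "developer", "build-engineer", "manager", "all"},
    "difficulty": {"beginner", "intermediate", "advanced", "expert"},
}

def check(paths: list[str]) -> int:
    errors = 0
    for path in paths:  # pre-commit passes the staged file names as arguments
        root = etree.parse(path).getroot()
        for meta in root.iter("othermeta"):
            name, value = meta.get("name"), meta.get("content")
            if name in VOCAB and value not in VOCAB[name]:
                print(f"{path}: {name}={value!r} is not in the controlled vocabulary")
                errors += 1
    return 1 if errors else 0  # a non-zero exit blocks the commit

if __name__ == "__main__":
    sys.exit(check(sys.argv[1:]))
```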

A note on these numbers

The figures in the artifact ledger are direct counts from the deliverables shipped on this engagement — not ROI projections or aggregated averages. Outcome percentages referenced anywhere on this site reflect industry benchmarks published by OASIS, Gartner, and CIDM for organizations that achieve 40%+ content reuse with structured metadata. Your actual results depend on content volume, language count, update frequency, and current toolchain maturity. Every engagement begins by measuring your baseline so projections are defensible.

Sample Content Assessment

Submit a 20-page sample. We'll return a conversion feasibility assessment, a content recovery rate, and an engineering effort estimate within two business days. The analysis is the basis for any further engagement, with no obligation to proceed.

Submit a sample →