
CASE 02 / 03 · 2025 · Enterprise documentation

AI-Ready Content Pipeline

Unstructured legacy XML transformed into chunk-safe, metadata-enriched DITA — with measurable AI-readiness scoring.

Duration: 12 weeks
Team: 1 architect · 1 conversion engineer · 1 retrieval engineer
Engagement: Project-based, fixed scope
Status: Shipped · powering production RAG retrieval

Challenge and approach

The challenge

Legacy documentation existed as generic XML and Markdown with no semantic typing, no metadata, and no section identifiers. AI systems treated all content as undifferentiated plain text — chatbot retrieval precision was roughly 25%, and there was no way to filter by audience, difficulty, or intent. Content chunking broke at arbitrary points because there were no semantic boundaries.

The approach

We built a metadata enrichment pipeline using Python tooling that batch-applies a 5-facet controlled vocabulary (audience, difficulty, content domain, content intent, and service line) to every topic. A DITA subject scheme enforces normalized values — preventing the inconsistency that breaks AI filtering. Every section received a stable ID and question-phrased title to match natural-language queries.
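To make the enforcement concrete, here is a minimal sketch of how normalized facet values can be locked down with Pydantic. The model and field names are illustrative, not the shipped validation code; the values mirror the controlled vocabulary listed in full later on this page.

```python
from typing import Literal
from pydantic import BaseModel, ValidationError

# Facet values mirror the controlled vocabulary; names here are illustrative.
Audience = Literal["technical-writer", "developer", "build-engineer", "manager", "all"]
Difficulty = Literal["beginner", "intermediate", "advanced", "expert"]
Intent = Literal["explain", "how-to", "reference", "troubleshoot", "best-practice", "tutorial"]

class TopicMetadata(BaseModel):
    """One topic's facets; an out-of-vocabulary value fails validation immediately."""
    audience: Audience
    difficulty: Difficulty
    intent: Intent

try:
    TopicMetadata(audience="developer", difficulty="simple", intent="how-to")
except ValidationError as err:
    print(err)  # 'simple' is rejected: not a normalized difficulty value
```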

Artifact ledger

  • 100% metadata coverage
  • 150+ section IDs
  • 5 classification facets
  • 2 conversion tools

Stack

Schema: DITA 1.3 · custom 5-facet subject scheme
Tooling: Python 3.12 · lxml · Pydantic validation models
Vector store: Vendor-neutral integration (JSONL export contract · sample record sketched below)
Validation: Schematron · subject scheme validator · pre-commit hooks
Export: JSONL chunks · content manifest · synonym glossary index
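A hedged sketch of one export record under that contract. Field names here are illustrative assumptions, not the shipped schema:

```python
import json

# Hypothetical chunk record; the shipped contract is defined in the export tooling.
record = {
    "id": "configuring-output/how-do-i-configure-x",  # stable section anchor
    "title": "How do I configure X?",                 # question-phrased title
    "text": "Section body text goes here.",
    "metadata": {
        "audience": "developer",
        "difficulty": "intermediate",
        "intent": "how-to",
        "domain": "publishing",
        "chatbot_priority": "high",
    },
}
print(json.dumps(record, ensure_ascii=False))  # one object per line is the JSONL format
```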

Before & after

Before — unstructured XML

  • Generic <article> and <section> elements
  • No topic typing — AI cannot distinguish explanations from procedures
  • Zero metadata fields per topic
  • No section IDs — chunking breaks at arbitrary points
  • Keyword-only search; no intent or audience filtering
  • AI-readiness score: ~2/10

After — AI-ready DITA

  • Formal DITA types (concept, task, reference) with semantic elements
  • Intent routing: concept → 'What is X?', task → 'How do I X?' (see the sketch after this list)
  • 10+ metadata fields per topic (audience, difficulty, duration, intent, domain, chatbot-priority)
  • 150+ addressable section IDs with stable anchors
  • Metadata-filtered vector search with intent classification
  • AI-readiness score: ~9/10
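A minimal sketch of that intent routing. The startswith rules and the Mongo-style filter syntax are simplifying assumptions; the production classifier and vector-store API differ.

```python
def classify_intent(query: str) -> str:
    """Toy router: map a natural-language query to a content-intent facet."""
    q = query.lower()
    if q.startswith(("what is", "what are", "why ")):
        return "explain"   # concept topics answer 'What is X?'
    if q.startswith(("how do i", "how to", "how can i")):
        return "how-to"    # task topics answer 'How do I X?'
    return "reference"

def build_filter(query: str, audience: str) -> dict:
    """Metadata filter passed to the vector store alongside the embedded query."""
    return {"intent": classify_intent(query), "audience": {"$in": [audience, "all"]}}

print(build_filter("How do I configure X?", "developer"))
# {'intent': 'how-to', 'audience': {'$in': ['developer', 'all']}}
```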

Tools built

enrich-metadata.py

Dictionary-driven Python script that batch-applies prolog metadata to every topic file. Maps each topic to its audience, difficulty, duration, content domain, content intent, chatbot priority, prerequisites, and service line — using normalized values from the controlled vocabulary.
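The script itself isn't reproduced here, but the core move looks roughly like this: parse a topic, ensure a <prolog>, write name/value pairs with lxml. The facet map and file path are illustrative; the real tool covers ten-plus fields per topic.

```python
from lxml import etree

# Illustrative facet map; the shipped script reads a dictionary covering every topic.
FACETS = {"audience": "developer", "difficulty": "intermediate", "intent": "how-to"}

def enrich(topic_path: str) -> None:
    tree = etree.parse(topic_path)
    root = tree.getroot()
    prolog = root.find("prolog")
    if prolog is None:
        prolog = etree.Element("prolog")
        root.insert(1, prolog)  # directly after <title> in a typical topic
    metadata = etree.SubElement(prolog, "metadata")
    for name, value in FACETS.items():
        # <othermeta> holds name/value metadata pairs in a DITA prolog
        etree.SubElement(metadata, "othermeta", name=name, content=value)
    tree.write(topic_path, xml_declaration=True, encoding="utf-8")

enrich("topics/configure-output.dita")  # hypothetical path
```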

dita-to-chatbot-json.py

Exports DITA topics to JSONL chunks for vector store ingestion. Section-aware chunking respects semantic boundaries. Generates a content manifest, glossary index with synonym lookup, and an ingestion report with quality metrics.
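A sketch of the section-aware part, using the same illustrative record shape as above. Element names follow DITA; the manifest, glossary, and quality-metric logic is omitted.

```python
import json
from lxml import etree

def export_chunks(topic_path: str):
    """Yield one JSONL line per <section> that carries a stable @id."""
    root = etree.parse(topic_path).getroot()
    topic_id = root.get("id", "")
    for section in root.iter("section"):
        section_id = section.get("id")
        if not section_id:
            continue  # no stable anchor, no chunk; the real tool reports these
        title = section.findtext("title", default="")
        body = " ".join(" ".join(section.itertext()).split())  # normalize whitespace
        yield json.dumps(
            {"id": f"{topic_id}/{section_id}", "title": title, "text": body},
            ensure_ascii=False,
        )

for line in export_chunks("topics/configure-output.dita"):  # hypothetical path
    print(line)
```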

Controlled vocabulary (subject scheme)

Audience: technical-writer · developer · build-engineer · manager · all
Difficulty: beginner · intermediate · advanced · expert
Content domain: dita-fundamentals · authoring · content-reuse · publishing · customization · xslt · devops · architecture
Content intent: explain · how-to · reference · troubleshoot · best-practice · tutorial
Service line: dita-engineering · publishing-engineering · ccms-services · content-migration · xml-engineering

Decisions and trade-offs

The choices that shaped the engagement, recorded with the option taken and what was rejected. The reasoning matters more than the outcome.

  1. Schema for AI corpus

     Chosen: DITA 1.3 + custom subject scheme
     Rejected: Markdown + YAML frontmatter
     Why: DITA schema validation rejects invalid metadata at author-time. Frontmatter drifts within weeks; the corpus stays trustworthy only if invalidity is impossible by construction.

  2. Chunking unit

     Chosen: Section element with stable @id
     Rejected: Sliding-window character chunks
     Why: Semantic boundaries beat arbitrary character spans on retrieval precision. The retrieval engine returns sections, the LLM cites them, and the citation maps back to a stable anchor in the source.

  3. Section title phrasing

     Chosen: Question-phrased ('How do I configure X?')
     Rejected: Statement-phrased ('X configuration')
     Why: Natural-language user queries match question titles. Statement titles fail title-match heuristics and degrade rank in mixed retrieval pipelines.

  4. Metadata enforcement

     Chosen: Subject scheme validation in a pre-commit hook
     Rejected: Run-time metadata cleanup pass
     Why: Cleanup at run time hides drift and lets bad metadata propagate downstream. Rejecting at author-time keeps the corpus honest, and the failure surfaces where the cost to fix is lowest.
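A minimal sketch of that hook, assuming the <othermeta> name/value convention from the enrichment sketch above. The vocabulary is trimmed; the shipped validator derives it from the subject scheme rather than hard-coding it.

```python
import sys
from lxml import etree

# Trimmed vocabulary; the shipped validator reads the DITA subject scheme instead.
VOCAB = {
    "audience": {"technical-writer", "developer", "build-engineer", "manager", "all"},
    "difficulty": {"beginner", "intermediate", "advanced", "expert"},
}

def check(paths: list[str]) -> int:
    errors = 0
    for path in paths:  # pre-commit passes the staged file names as arguments
        root = etree.parse(path).getroot()
        for meta in root.iter("othermeta"):
            name, value = meta.get("name"), meta.get("content")
            if name in VOCAB and value not in VOCAB[name]:
                print(f"{path}: {name}={value!r} is not in the controlled vocabulary")
                errors += 1
    return 1 if errors else 0  # a non-zero exit blocks the commit

if __name__ == "__main__":
    sys.exit(check(sys.argv[1:]))
```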

A note on these numbers

The figures in the artifact ledger are direct counts from the deliverables shipped on this engagement — not ROI projections or aggregated averages. Outcome percentages referenced anywhere on this site reflect industry benchmarks published by OASIS, Gartner, and CIDM for organizations that achieve 40%+ content reuse with structured metadata. Your actual results depend on content volume, language count, update frequency, and current toolchain maturity. Every engagement begins by measuring your baseline so projections are defensible.

Sample Content Assessment

Submit a 20-page sample. We'll return a conversion feasibility assessment, a content recovery rate, and an engineering effort estimate within two business days. The analysis is the basis for any further engagement, with no obligation to proceed.

Submit a sample →