AI-readiness is a content engineering problem.

Retrieval precision is upstream of model selection. A practitioner's guide to the seven dimensions of content engineering that determine whether your AI pipeline returns the right answer or the closest keyword match.

Why AI-readiness matters.

Most enterprise content sits in unstructured formats — Word, PDF, wiki pages, Markdown. When that content gets ingested into a vector store, the AI treats every paragraph as undifferentiated text. It can't distinguish a worked example from a marketing blurb, a step-by-step procedure from a concept explanation, or a beginner topic from an expert reference.

The failure shows up downstream as retrieval precision in the 25–35% range. Users get wrong answers, lose trust, and escalate to human support. Worse, AI hallucinations fill gaps where metadata should have guided the system. The asymmetry that matters in regulated workloads: a hallucinated citation is materially worse than a missed one. The missed result gets escalated; the hallucinated citation gets followed.

Content that is semantically typed, metadata-enriched, and section-addressable delivers production-grade retrieval precision because the AI knows what each piece of content is, who it's for, and when to surface it. That's the difference between a chatbot that guesses and one that answers — and it's an engineering problem, not a model selection problem.

The seven-dimension scorecard.

Rate your content on each dimension below: 0, 1, or 2 points. Total range is 0–14. Sum your score and consult the interpretation rubric at the end. Honest self-scoring is the prerequisite for an honest baseline.

  1. Content typing

    Is every document formally typed — concept, task, reference, troubleshooting — or is everything a generic 'page'?

    Why it matters

    AI intent routing depends on type. 'What is X?' → concept. 'How do I X?' → task. 'List of X' → reference. Without types, the AI cannot match question intent to answer format.

    Scoring

    • 0 Everything is a generic document
    • 1 Some documents have informal categories
    • 2 Every topic is formally typed with semantic structure
  2. Metadata coverage

    Does every topic have machine-readable metadata — audience, difficulty, domain, intent, duration?

    Why it matters

    Metadata enables filtering. Without it, the AI retrieves beginner content for experts, developer docs for writers, and overview content when the user needs a step-by-step procedure.

    Scoring

    • 0 No metadata (or only title and date)
    • 1 Some topics have partial metadata
    • 2 Every topic has 5+ controlled metadata fields
  3. Section addressability

    Does every major section have a stable, unique ID?

    Why it matters

    Ingestion pipelines chunk content for the vector store, ideally at section boundaries. Without stable IDs, chunks break at arbitrary points — mid-sentence, mid-procedure, mid-table. The AI retrieves fragments instead of complete answers.

    Scoring

    • 0 No section IDs
    • 1 Some sections have IDs
    • 2 Every section has a stable, meaningful ID
  4. Vocabulary control

    Are metadata values drawn from a controlled vocabulary, or is tagging free-text?

    Why it matters

    Free-text tags produce inconsistency: 'dev', 'developer', 'software engineer', 'eng' all mean the same thing. AI filters break when vocabulary is uncontrolled. A subject scheme or taxonomy enforces consistency.

    Scoring

    • 0 Free-text or no tagging
    • 1 Informal guidelines for tagging
    • 2 Formal taxonomy enforced by schema
  5. Semantic markup richness

    Do topics use semantic elements — shortdesc, prereqs, steps, result, tables, notes — or is everything paragraphs?

    Why it matters

    Semantic elements tell the AI the role of each block. A <shortdesc> becomes the chatbot's one-line answer. <steps> become numbered instructions. <prereq> tells the AI what the user needs before starting. A marked-up example follows the scorecard.

    Scoring

    • 0 Paragraphs only
    • 1 Some semantic elements used
    • 2 Full semantic markup throughout
  6. Reuse architecture

    Is content reused via conref / keyref / transclusion, or duplicated by copy-paste?

    Why it matters

    Copy-paste duplication means the AI retrieves multiple slightly different versions of the same content. Users see conflicting answers. Proper reuse means one authoritative source — one answer.

    Scoring

    • 0 Copy-paste duplication
    • 1 Some shared components
    • 2 Systematic reuse via conref / keyref
  7. Output pipeline maturity

    Can the pipeline produce HTML, PDF, JSON, and chatbot-ready JSONL from a single source automatically?

    Why it matters

    AI pipelines consume JSON / JSONL — not PDF. If your publishing pipeline only produces PDF and HTML, you need a separate export step for AI ingestion. A mature pipeline includes chatbot-ready output as a first-class format.

    Scoring

    • 0 Manual export to 1–2 formats
    • 1 Automated build to HTML + PDF
    • 2 Automated multi-format including JSON / JSONL for AI
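
What a score-2 topic looks like in practice: the sketch below is a minimal DITA task topic that combines formal typing (dimension 1), prolog metadata (dimension 2), and semantic markup (dimension 5). The element names are standard DITA; the metadata values are illustrative, not a prescribed schema.

  <task id="create-custom-plugin">
    <title>Create a custom plugin</title>
    <!-- The root element carries the type: this is a task, not a concept. -->
    <shortdesc>Build and register a custom plugin from a template.</shortdesc>
    <prolog>
      <metadata>
        <!-- Machine-readable metadata for filtering and routing. -->
        <audience type="programmer" experiencelevel="novice"/>
        <othermeta name="intent" content="how-to"/>
        <othermeta name="duration" content="10min"/>
      </metadata>
    </prolog>
    <taskbody>
      <prereq>A working toolkit installation and write access to the plugins directory.</prereq>
      <steps>
        <step><cmd>Create the plugin directory and its descriptor file.</cmd></step>
        <step><cmd>Register the plugin and rebuild the output.</cmd></step>
      </steps>
      <result>The plugin appears in the list of registered plugins.</result>
    </taskbody>
  </task>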

Interpret your score.

0–4: AI will treat your content as plain text. Chatbot answers will be imprecise, often wrong. This is where most organizations start.
5–7: Partial readiness. Some AI features work, but gaps in metadata or structure cause retrieval failures for edge cases.
8–10: Production-ready. AI can filter, route, chunk, and cite your content precisely. This is where our clients end up.
11–14: Best-in-class. Full semantic markup, enforced taxonomy, multi-format output including JSONL — your content is an organizational asset, not just documentation.

The preparation roadmap.

Five phases to take content from unstructured to AI-ready. Each phase builds on the previous one; skipping the foundation phases is the most common way these programs fail.

  1. Audit & baseline

    Score your current content on the seven dimensions above. Identify the biggest gaps. Inventory content by type, volume, language, and update frequency. This baseline is what every improvement will be measured against.

    Deliverable: Content audit report with AI-readiness score and gap analysis.

  2. Structure & type

    Convert content to a typed format (DITA concept / task / reference, or equivalent). Split monolithic documents into focused single-purpose topics. Add section IDs to every major block. This is the foundation — without it, nothing downstream works.

    Deliverable: Typed, modular topics with section-level addressability.

  3. Enrich metadata

    Define a controlled vocabulary (subject scheme or taxonomy) with facets for audience, difficulty, content domain, content intent, and any organization-specific dimensions. Apply metadata to every topic via batch tooling — not manual entry.

    Deliverable: 100% metadata coverage with enforced vocabulary.

  4. Build the pipeline

    Configure the publishing toolchain to produce chatbot-ready output alongside traditional formats: JSONL export with section-aware chunking, metadata inheritance, and glossary extraction. Test the chunks against sample queries.

    Deliverable: Automated build producing HTML5, PDF, and JSONL from a single source.

  5. Integrate & validate

    Connect the JSONL output to your vector store and chatbot. Configure metadata filters for intent routing and audience matching. Run retrieval tests with real user queries. Measure precision and iterate on vocabulary and chunking strategy.

    Deliverable: Working AI retrieval with measured precision rate.
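
Phase 5's precision measurement can start very small. The sketch below assumes a retrieve(query) function that returns ranked section IDs from your vector store, plus a hand-labeled test set; both names are stand-ins, not a real API.

  # Minimal precision@k harness for phase 5. `retrieve` and the labeled
  # test set are stand-ins for your vector store client and gold data.
  def precision_at_k(retrieve, test_queries, k=5):
      """test_queries maps query text -> set of relevant section IDs."""
      per_query = []
      for query, relevant in test_queries.items():
          top_k = retrieve(query)[:k]                 # ranked section IDs
          hits = sum(1 for sid in top_k if sid in relevant)
          per_query.append(hits / k)
      return sum(per_query) / len(per_query)          # mean precision@k

Run it after every vocabulary or chunking change; the point of the phase-1 baseline is that this number moves visibly.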

Typical timeline


Small corpus (50–200 topics): 4–8 weeks through all 5 phases.
Medium corpus (200–1,000 topics): 8–16 weeks, often with phases 2 and 3 running in parallel batches.
Large corpus (1,000+ topics): 3–6 months, phased rollout by product line or content domain. Automation tooling (enrich-metadata.py, dita-to-chatbot-json.py) is critical at this scale.

What AI-ready content looks like.

Concrete examples of the transformations the scorecard measures. Each pair shows the failure mode at score 0 and the working pattern at score 2.

Metadata that AI systems can filter


  • Audience filtering

    A developer asks 'How do I create a custom plugin?' — the AI filters for audience=developer and intent=how-to, skipping the manager-level overview of the same feature. Without metadata, both would be returned with equal rank.

  • Difficulty matching

    A beginner asks 'What is DITA?' — the AI returns the difficulty=beginner concept topic, not the advanced architecture reference. The right answer at the right level prevents overwhelm and helpdesk escalation.

  • Intent routing

    'What is X?' routes to concepts. 'How do I X?' routes to tasks. 'Show me the parameters' routes to reference tables. Topic typing plus intent metadata makes this automatic — no prompt engineering required.
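
A minimal sketch of that routing-and-filtering step over chunks.jsonl records. The regex patterns and the field names (type, audience) are illustrative assumptions about the record schema, not a fixed contract.

  import re

  # Question shape -> topic type (dimension 1 in action).
  INTENT_ROUTES = [
      (re.compile(r"^what is\b", re.I), "concept"),
      (re.compile(r"^how do i\b", re.I), "task"),
      (re.compile(r"^(list|show me)\b", re.I), "reference"),
  ]

  def route_and_filter(question, chunks, audience):
      """chunks: dicts loaded from chunks.jsonl (illustrative schema)."""
      topic_type = next(
          (t for pattern, t in INTENT_ROUTES if pattern.search(question)),
          None,  # no pattern match: fall back to unfiltered retrieval
      )
      return [
          c for c in chunks
          if (topic_type is None or c["type"] == topic_type)
          and c["audience"] in (audience, "all")
      ]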


Section-aware chunking

Without section IDs

A 2,000-word topic gets split into three chunks of ~667 words each. Chunk 2 starts mid-paragraph, references "the step above" (which is now in Chunk 1), and includes half a table. The AI retrieves Chunk 2 and the user sees a fragment that doesn't make sense.

With section IDs

The same topic has five sections, each with a stable ID and self-contained heading. Chunking happens at section boundaries. Each chunk is a complete, citable unit. The AI retrieves the exact section, and the user sees a coherent answer with a deep-link back to source.
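
A sketch of that section-boundary chunker, assuming each section opens with a heading line carrying its stable ID. The {#id} heading convention is an illustrative stand-in for whatever your pipeline actually emits.

  import re

  # Matches a heading line such as:  ## Register the plugin {#register-plugin}
  HEADING = re.compile(r"^##\s+(?P<title>.+?)\s+\{#(?P<id>[\w-]+)\}\s*$")

  def chunk_by_section(text):
      """Split at section boundaries so every chunk is a citable unit."""
      chunks, current = [], None
      for line in text.splitlines():
          match = HEADING.match(line)
          if match:
              if current:
                  chunks.append(current)
              current = {"id": match["id"], "title": match["title"], "text": ""}
          elif current is not None:
              current["text"] += line + "\n"
      if current:
          chunks.append(current)
      return chunks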

Controlled vocabulary vs. free-text tags

Free-text chaos

Topic A is tagged "developer". Topic B is tagged "dev". Topic C is tagged "software engineer". Topic D has no audience tag. A filter for audience=developer returns only Topic A — missing 75% of relevant content. The chatbot appears to have knowledge gaps that don't actually exist.

Controlled vocabulary

A DITA subject scheme defines exactly seven valid audience values: technical-writer, developer, build-engineer, manager, localization-specialist, editor, all. Validation rejects any other value. The same filter returns 100% of relevant content — zero false negatives.
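
In DITA terms, that enforcement is a subject scheme map. A sketch of the pattern, binding the @audience attribute to the seven values listed above:

  <subjectScheme>
    <!-- Define the seven valid audience values once. -->
    <subjectdef keys="audience-values">
      <subjectdef keys="technical-writer"/>
      <subjectdef keys="developer"/>
      <subjectdef keys="build-engineer"/>
      <subjectdef keys="manager"/>
      <subjectdef keys="localization-specialist"/>
      <subjectdef keys="editor"/>
      <subjectdef keys="all"/>
    </subjectdef>
    <!-- Bind @audience to that list; any other value fails validation. -->
    <enumerationdef>
      <attributedef name="audience"/>
      <subjectdef keyref="audience-values"/>
    </enumerationdef>
  </subjectScheme>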

Output: chatbot-ready JSONL.

Each section becomes a JSONL record with the text content, topic title, section heading, topic type (concept / task / reference), all metadata fields (audience, difficulty, domain, intent, duration, chatbot-priority), the section ID for deep-linking, and the parent map path for breadcrumb context. A glossary index maps terms to definitions with synonyms. A content manifest provides a searchable index of all topics.
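
One illustrative record follows. A JSONL file holds one object per line; it is wrapped here for readability, and every value below is invented for the example.

  {"id": "create-custom-plugin/register-plugin",
   "topic_title": "Create a custom plugin",
   "section_heading": "Register the plugin",
   "type": "task",
   "audience": "developer", "difficulty": "beginner",
   "domain": "plugins", "intent": "how-to",
   "duration": "10min", "chatbot_priority": "high",
   "map_path": "user-guide/extending/create-custom-plugin",
   "text": "Register the plugin by adding its directory to the configuration, then rebuild."}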

Three exported artifacts


chunks.jsonl: One JSON object per section. Ready for vector store embedding. Metadata fields enable filtered retrieval.
glossary-index.json: Term definitions with synonyms. Enables the chatbot to resolve terminology and expand queries.
content-manifest.json: Topic inventory with metadata. Used for corpus validation, coverage analysis, and pipeline health checks.
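
And one illustrative glossary-index.json entry, with the shape assumed from the description above:

  {
    "term": "conref",
    "definition": "A DITA content reference that pulls a block from a single authoritative source instead of duplicating it.",
    "synonyms": ["content reference", "transclusion"]
  }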

Sample Content Assessment

Submit a 20-topic sample. We'll score it on all seven dimensions, return a gap analysis with recommended next steps, and indicate the engineering effort required to reach production-grade retrieval. No obligation to proceed.

Submit a sample →