The Ultimate DITA Migration Playbook

Don't just copy-paste: transform. A practitioner playbook for moving content from unstructured legacy formats (Word, FrameMaker, MadCap Flare, RoboHelp, AuthorIT) into typed, validated DITA, covering the disciplines, the lifecycle, and the field rules drawn from the practice.

Why migration is recovery.

Migration is recovery, not transcription. How a migration is approached decides whether the new system inherits the old corpus's problems or sheds them — and that decision is made long before the first file converts. Treated as transcription, the work looks finished the moment the schema is full; treated as recovery, it is the one chance to set the content right.

Transcription

lift-and-shift

Copy-paste source content into a new schema.

  • The same redundancies
  • The same broken navigation
  • The same ungoverned metadata

Two years on, the CCMS investment hasn't paid off.

Recovery

lift, retire, rewrite

Lift reusable assets; retire ROT; rewrite the rest with intent.

  • Typed, validated topics
  • Governed metadata
  • Architecture before conversion rules

The architecture pays off.

Migration practice by the numbers.

  • 2M+ pages migrated to date
  • ~40% typical ROT (dead content in a legacy corpus)
  • 13 source formats handled
  • 4 sectors served: federal · defense · life sciences · commercial

Phase 1

Analysis & strategy.

Three disciplines that take you from corpus to architecture. Skip them and you migrate dead weight into a new system. Done in order, they turn an unfamiliar corpus into a plan the conversion can follow.

  1. Lifecycle stage · Audit

    Content audit

    Catalog everything before converting anything. Count topics, count formats, identify duplicates, and surface ROT: redundant, obsolete, or trivial content that should not survive into the target system. Migration teams routinely carry roughly 40% dead content into a new CCMS, where it costs money to convert, costs money to maintain, and dilutes retrieval results downstream. The audit is the one point in the program where that content can be deleted with the team's explicit authorization.

  2. Lifecycle stage · Map

    Information architecture

    Define topic types (concept, task, reference, troubleshooting) before scoping conversion rules. Define the metadata strategy and the controlled vocabulary before mapping a single style. The conversion engine cannot guess the right type from source markup; it follows rules that someone defined deliberately. Skipping IA means the conversion turns unstructured garbage into structured garbage: the topics validate, but the types carry no meaning.

  3. Lifecycle stage · Map → Convert

    Pilot migration

    Take fifty representative pages — chosen to span the difficult patterns in the source corpus, not the easy ones. Run them through the conversion manually or via script. Inspect every output for type fidelity, structural validity, and metadata coverage. Refine the conversion rules. Re-run. This pilot cycle prevents the most expensive migration failure: discovering a structural defect after running 50,000 pages through the wrong rule set.
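The audit discipline above (count formats, identify duplicates) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the tooling the practice uses: it assumes the corpus has already been exported to `(path, text)` pairs, and the helper name `audit_corpus` and the whitespace-and-case normalization are hypothetical choices made for the example. It finds only exact duplicates; deciding what is obsolete or trivial still needs a human reviewer.

```python
import hashlib
from collections import Counter, defaultdict
from pathlib import PurePosixPath


def audit_corpus(docs):
    """Summarize a corpus given as (path, text) pairs.

    Hypothetical helper for illustration. Returns the number of files per
    extension and the groups of paths whose normalized text is identical.
    """
    formats = Counter()
    by_hash = defaultdict(list)
    for path, text in docs:
        formats[PurePosixPath(path).suffix.lower() or "(none)"] += 1
        # Assumption: collapsing whitespace and case is enough to call two
        # files "the same"; a real audit may need stricter or looser rules.
        normalized = " ".join(text.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        by_hash[digest].append(path)
    duplicates = [sorted(paths) for paths in by_hash.values() if len(paths) > 1]
    return formats, duplicates


# Invented sample data, for illustration only.
docs = [
    ("guides/install.docx", "Install the  agent."),
    ("archive/install_old.docx", "install the agent."),
    ("ref/ports.fm", "Port 443 is required."),
]
formats, duplicates = audit_corpus(docs)
print(dict(formats))
print(duplicates)
```

The duplicate groups are candidates for consolidation or deletion, and the per-format counts feed the scoping of conversion rules.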

The migration lifecycle.

Five stages that sit underneath the disciplines. The disciplines are what a migration team does; the lifecycle is where in the program each one applies.

  1. Audit

    Catalog the corpus. Delete ROT before anything converts.

  2. Map

    Define topic types, taxonomy, and metadata before mapping styles.

  3. Convert

    Run the refined engine across the full corpus in one pass.

  4. Refine

    Deduplicate, enrich, and resolve the structural exceptions.

  5. Publish

    Validate outputs against the use cases that defined the architecture.

Source formats covered.

The conversion engine handles thirteen named source formats. Conversion rules differ by format — Word's style-driven model is not the same problem as FrameMaker's structural conditions or RoboHelp's compiled output — and each has named patterns the engine recognizes.

  • Microsoft Word
  • FrameMaker
  • InDesign
  • RoboHelp
  • MadCap Flare
  • DocBook
  • HTML
  • Markdown
  • AuthorIT
  • Confluence
  • SGML
  • Excel
  • PDF (OCR)

Phase 2

Conversion & enrichment.

Two disciplines that take you from converted content to enriched, validated DITA — the part where the architecture pays off. It is where reuse and metadata stop being intentions and become properties of the corpus.

  1. Lifecycle stage · Convert

    Batch conversion

    Feed the full corpus through the refined engine. Context-based pattern matching distinguishes 'numbered steps in a task' from 'numbered lists in a reference topic' — a distinction that style-mapping tools cannot reach. The engine handles row-and-column-spanning tables, FrameMaker equations to MathML, embedded drawings to SVG, and the structurally awkward constructs that defeat lift-and-shift conversion.

  2. Lifecycle stage · Refine

    Deduplication & enrichment

    After conversion, run exact-topic deduplication across the entire collection. Identify redundant topics and consolidate them into reusable warehouse topics that downstream content can conref. Inject metadata, keys, conref targets, and taxonomy attributes during the conversion pass — not after. Enriching post-conversion means inspecting every topic by hand; enriching during conversion lets the rule engine carry the load.
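The exact-topic deduplication described above can be sketched as follows. This is a hedged illustration, not the engine's implementation: the function name `plan_conrefs`, the `warehouse/<id>.dita` layout, and the rule that two topics match when their canonical XML is identical apart from `id` attributes and surrounding whitespace are all assumptions made for the example. It uses only the Python standard library (`xml.etree.ElementTree.canonicalize`, Python 3.8+).

```python
import hashlib
import xml.etree.ElementTree as ET


def plan_conrefs(topics):
    """Map each redundant topic id to the conref address of its surviving copy.

    `topics` maps a topic id to its DITA source as a string. Topics count as
    duplicates when their canonical XML matches after ignoring `id`
    attributes and surrounding whitespace (an assumption of this sketch).
    """
    survivors = {}  # content digest -> id of the first topic seen
    plan = {}       # redundant topic id -> conref address
    for topic_id, xml_text in topics.items():
        canonical = ET.canonicalize(
            xml_text, strip_text=True, exclude_attrs={"id"}
        )
        digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        if digest in survivors:
            keep = survivors[digest]
            # Hypothetical layout: surviving topics live in warehouse/<id>.dita.
            plan[topic_id] = f"warehouse/{keep}.dita#{keep}"
        else:
            survivors[digest] = topic_id
    return plan


# Invented sample topics, for illustration only.
topics = {
    "esd_warning": '<concept id="esd_warning"><title>ESD</title>'
                   "<conbody><p>Wear a strap.</p></conbody></concept>",
    "esd_warning_copy": '<concept id="esd_warning_copy">\n  <title>ESD</title>\n'
                        "  <conbody><p>Wear a strap.</p></conbody>\n</concept>",
    "torque": '<reference id="torque"><title>Torque</title><refbody/></reference>',
}
plan = plan_conrefs(topics)
print(plan)
```

The addresses follow the DITA `file.dita#topicid` form; an element inside the topic would be addressed as `file.dita#topicid/elementid`. The plan only records which copy survives. Rewriting the referencing content is a separate step.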

Field rules from the practice.

Four positions, each backed by a named failure mode and a measured cost. None is a stylistic preference; each is reasoning a migration team has to apply, not just remember, and each marks a place where migrations lose time or money when the call goes the other way.

  1. Delete ROT first.

    Why: Redundant, obsolete, and trivial content should never reach the conversion engine. Teams routinely migrate roughly 40% dead content, which costs money to convert, costs money to maintain, and dilutes retrieval results downstream. ROT is identified during the audit and removed before anything else moves.

  2. Pilot with fifty pages. Not five. Not five hundred.

    Why: Five pages are too few to cover the variation in a corpus; the rare, difficult patterns sit in the long tail and never appear in a sample that small. Five hundred pages mean weeks of conversion work before the first structural defect is caught. Fifty pages is the empirically validated sweet spot: large enough to expose the difficult patterns, small enough that re-running the pilot is cheap.

  3. Don't fix formatting in source.

    Why: Source cleanup is wasted effort — the formatting won't survive conversion, and the conversion engine doesn't depend on it. Fix typography and styling in DITA stylesheets after conversion, where the fix lives permanently and applies across the corpus, not just to the file someone happened to clean up.

  4. Convert all languages simultaneously.

    Why: Migrating English first and 'doing translations later' breaks TM (translation memory) alignment between source and target. The localization team ends up reconciling drift that didn't exist before migration. Run all locales through the same conversion engine in the same pass — the TM stays coherent.
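Rule 2 depends on the fifty pages spanning the difficult patterns rather than being picked at random. One possible way to draw such a sample is sketched below, assuming the audit has already tagged each page with the patterns it contains. The function name `pick_pilot` and the pattern labels are hypothetical, and the approach (cover every pattern once, then fill the remainder at random) is a simple illustration, not a method taken from the practice.

```python
import random


def pick_pilot(pages, size=50, seed=0):
    """Choose `size` page ids, covering every tagged pattern before filling up.

    `pages` maps a page id to the set of difficult patterns found on it.
    The tags are assumed to come from the audit; they are hypothetical
    labels, not the output of any particular tool.
    """
    rng = random.Random(seed)  # seeded so the pilot set is reproducible
    chosen = []
    patterns = sorted(set().union(*pages.values()))
    for pattern in patterns:
        if any(pattern in pages[p] for p in chosen):
            continue  # an already chosen page covers this pattern
        candidates = sorted(p for p in pages if pattern in pages[p])
        chosen.append(rng.choice(candidates))
    # Fill the remaining slots with randomly ordered leftover pages.
    remaining = sorted(p for p in pages if p not in chosen)
    rng.shuffle(remaining)
    chosen.extend(remaining[: max(0, size - len(chosen))])
    return chosen[:size]


# Invented corpus: 200 pages, two of which carry difficult patterns.
pages = {f"p{i:03d}": set() for i in range(200)}
pages["p007"] = {"spanning_table"}
pages["p150"] = {"equation", "nested_list"}
pilot = pick_pilot(pages)
print(len(pilot), "p007" in pilot, "p150" in pilot)
```

If the corpus has more distinct patterns than pilot slots, the final truncation drops some of them, so the pattern list should be prioritized before sampling.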

Sample Content Assessment

Submit a 20-page sample document. We'll return conversion feasibility, expected content recovery rate, ROT estimate, and engineering effort within two business days. The analysis is the basis for any further engagement, with no obligation to proceed.
