Migration is recovery, not transcription.
A practitioner playbook for moving content from unstructured legacy formats (Word, FrameMaker, MadCap Flare, RoboHelp, AuthorIT) into typed, validated DITA: the disciplines, the lifecycle, and the field rules drawn from the practice.
Why migration is recovery.
A migration that copy-pastes source content into a new schema is not a migration — it's a lift-and-shift. The schema gets populated; the structural problems in the source travel intact into the target. The new system inherits the same redundancies, the same broken navigation, the same ungoverned metadata. Two years later, the team is asking why the CCMS investment didn't pay off.
Migration is recovery, not transcription. The conversion engine identifies reusable assets in legacy content and lifts them into the target schema; what isn't recoverable gets retired or rewritten with intent. The QA harness is designed before the first batch runs — failing migrations fail because the validation strategy was an afterthought. The architecture decisions get made before the conversion rules, not after.
The playbook below codifies the disciplines, the lifecycle, and the field rules — patterns drawn from two million pages of migration work across federal, defense, life sciences, and enterprise commercial engagements. None of it is theory; all of it is what a migration team has to do, not just what they have to know.
The five disciplines.
Each discipline maps to one or more stages of the migration lifecycle. They are sequential — skipping the foundation disciplines is the most common way migration programs fail.
-
Lifecycle stage · Audit
Content audit
Catalog everything before converting anything. Count topics, count formats, identify duplicates, and surface ROT: redundant, obsolete, and trivial content that should not survive into the target system. Migration teams routinely carry 40% dead content into a new CCMS, where it costs money to convert, costs money to maintain, and dilutes retrieval signal downstream. The audit is the only point at which that content can be deleted with the team's authorization already in hand.
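The counting half of the audit can be sketched in a few lines. This is a minimal illustration, not the engine's own tooling: the function name `inventory` is hypothetical, formats are inferred from file extensions, and "duplicate" here means byte-identical files only. Obsolete or trivial content still needs a human decision.

```python
import hashlib
from collections import Counter, defaultdict
from pathlib import Path


def inventory(root: Path) -> tuple[Counter, dict[str, list[Path]]]:
    """Count files per extension and group byte-identical files.

    Hypothetical helper: a real audit would also look inside each
    file (topics, styles, conditions), not only at the file level.
    """
    formats: Counter = Counter()
    by_digest: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        # Files without an extension are counted under "(none)".
        formats[path.suffix.lower() or "(none)"] += 1
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        by_digest[digest].append(path)
    # Keep only digests shared by more than one file.
    duplicates = {d: ps for d, ps in by_digest.items() if len(ps) > 1}
    return formats, duplicates
```

The two return values answer the first audit questions directly: how much of each format exists, and which files are candidates for removal as redundant copies.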
-
Lifecycle stage · Map
Information architecture
Define topic types (concept, task, reference, troubleshooting) before scoping conversion rules. Define the metadata strategy and the controlled vocabulary before mapping a single style. The conversion engine cannot guess the right type from source markup; it follows rules that someone defined deliberately. Skipping IA means the conversion turns unstructured garbage into structured garbage: the topics are typed, but the types carry no meaning.
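One way to make the vocabulary enforceable is to write it down as data and check every topic against it. The sketch below assumes a tiny hypothetical vocabulary (the `audience` and `product` values are invented for illustration) and a hypothetical `validate_metadata` helper; only the four topic types come from the text above.

```python
# Topic types named in the playbook.
TOPIC_TYPES = {"concept", "task", "reference", "troubleshooting"}

# Hypothetical controlled vocabulary: attribute name -> allowed values.
VOCABULARY = {
    "audience": {"administrator", "operator"},
    "product": {"alpha", "beta"},
}


def validate_metadata(topic_type: str, metadata: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the topic passes."""
    problems = []
    if topic_type not in TOPIC_TYPES:
        problems.append(f"unknown topic type: {topic_type}")
    for attribute, value in metadata.items():
        allowed = VOCABULARY.get(attribute)
        if allowed is None:
            problems.append(f"unknown attribute: {attribute}")
        elif value not in allowed:
            problems.append(f"{attribute}={value} is not in the controlled vocabulary")
    return problems
```

A check of this shape can run in the QA harness on every converted topic, so an ungoverned value is rejected at conversion time rather than discovered later in the CCMS.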
-
Lifecycle stage · Map → Convert
Pilot migration
Take fifty representative pages — chosen to span the difficult patterns in the source corpus, not the easy ones. Run them through the conversion manually or via script. Inspect every output for type fidelity, structural validity, and metadata coverage. Refine the conversion rules. Re-run. This pilot cycle prevents the most expensive migration failure: discovering a structural defect after running 50,000 pages through the wrong rule set.
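Choosing pages that span the difficult patterns can be sketched as a round-robin draw across pattern labels. This assumes the audit has already tagged each page with one pattern; the labels, the `pick_pilot` name, and the fixed seed are illustrative assumptions, not part of the playbook.

```python
import random
from collections import defaultdict


def pick_pilot(pages: list[tuple[str, str]], size: int = 50, seed: int = 0) -> list[str]:
    """Pick up to `size` page ids, cycling through patterns in turn.

    `pages` is a list of (page_id, pattern) pairs. Rare patterns are
    exhausted before common ones can fill the sample.
    """
    rng = random.Random(seed)  # fixed seed keeps the pilot reproducible
    groups: dict[str, list[str]] = defaultdict(list)
    for page_id, pattern in pages:
        groups[pattern].append(page_id)
    for ids in groups.values():
        rng.shuffle(ids)
    picked: list[str] = []
    while len(picked) < size and any(groups.values()):
        for pattern in sorted(groups):
            if groups[pattern] and len(picked) < size:
                picked.append(groups[pattern].pop())
    return picked
```

Because each round takes at most one page per pattern, a corpus with a hundred plain pages and two equation pages still puts both equation pages in the pilot.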
-
Lifecycle stage · Convert
Batch conversion
Feed the full corpus through the refined engine. Context-based pattern matching distinguishes 'numbered steps in a task' from 'numbered lists in a reference topic', a distinction that style-mapping tools cannot reach. The engine handles tables with row and column spans, converts FrameMaker equations to MathML and embedded drawings to SVG, and copes with the structurally awkward constructs that defeat lift-and-shift conversion.
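The task-versus-reference distinction can be illustrated with a deliberately small rule. This sketch assumes the topic type has already been decided upstream and uses it as the only context signal; the `convert_numbered_list` function is hypothetical and far simpler than a production rule set. The output element names (`steps`, `step`, `cmd`, `ol`, `li`) are standard DITA.

```python
import xml.etree.ElementTree as ET


def convert_numbered_list(items: list[str], topic_type: str) -> ET.Element:
    """Map the same source construct to different DITA markup by context."""
    if topic_type == "task":
        # In a task topic, a numbered list is a procedure.
        steps = ET.Element("steps")
        for text in items:
            step = ET.SubElement(steps, "step")
            ET.SubElement(step, "cmd").text = text
        return steps
    # Anywhere else it stays an ordinary ordered list.
    ordered = ET.Element("ol")
    for text in items:
        ET.SubElement(ordered, "li").text = text
    return ordered
```

A pure style mapping sees one paragraph style and has to pick one output; a rule that receives the context can emit either.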
-
Lifecycle stage · Refine
Deduplication & enrichment
Once the corpus is converted, run exact-topic deduplication across the entire collection. Identify redundant topics and consolidate them into reusable warehouse topics that downstream content can pull in by conref. Enrichment follows a different rule: inject metadata, keys, conref targets, and taxonomy attributes during the conversion pass, not after it. Enriching after conversion means inspecting every topic by hand; enriching during conversion lets the rule engine carry the load.
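Exact-topic deduplication reduces to fingerprinting each topic and grouping equal fingerprints. The sketch below is an assumption-laden illustration: it treats topics as plain text keyed by id, collapses whitespace before hashing (a choice made here, not stated in the playbook), and picks the alphabetically first id as the surviving warehouse topic. The names `fingerprint` and `find_duplicates` are hypothetical.

```python
import hashlib
import re
from collections import defaultdict


def fingerprint(text: str) -> str:
    """Hash the topic text after collapsing runs of whitespace."""
    normalized = re.sub(r"\s+", " ", text).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(topics: dict[str, str]) -> dict[str, list[str]]:
    """Map each surviving topic id to the ids that duplicate it."""
    groups: dict[str, list[str]] = defaultdict(list)
    for topic_id in sorted(topics):
        groups[fingerprint(topics[topic_id])].append(topic_id)
    # The first id in each group is kept; the rest become conref candidates.
    return {ids[0]: ids[1:] for ids in groups.values() if len(ids) > 1}
```

Each entry in the result names one topic to keep and the redundant copies that could be replaced by a conref to it.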
Source formats covered.
The conversion engine handles thirteen named source formats. Conversion rules differ by format — Word's style-driven model is not the same problem as FrameMaker's structural conditions or RoboHelp's compiled output — and each has named patterns the engine recognizes.
Formats
- Microsoft Word
- FrameMaker
- InDesign
- RoboHelp
- MadCap Flare
- DocBook
- HTML
- Markdown
- AuthorIT
- Confluence
- SGML
- Excel
- PDF (OCR)
Field rules from the practice.
Four positions taken — backed by named failure modes and measured costs. None of these are stylistic; all of them are reasoning a migration team has to apply, not just remember.
-
Delete ROT first.
Why: Redundant, obsolete, and trivial content should never reach the conversion engine. Teams routinely migrate 40% dead content, and that content costs money to convert, costs money to maintain, and dilutes retrieval signal downstream. ROT is identified during the audit and removed before anything else moves.
-
Pilot with fifty pages. Not five. Not five hundred.
Why: Five pages are too small a sample: the defects that live in the long tail of rare constructs never surface. Five hundred pages mean weeks of conversion work before the first structural defect is caught. Fifty pages is the sweet spot found in practice: large enough to expose the difficult patterns, small enough that re-running the pilot is cheap.
-
Don't fix formatting in source.
Why: Source cleanup is wasted effort — the formatting won't survive conversion, and the conversion engine doesn't depend on it. Fix typography and styling in DITA stylesheets after conversion, where the fix lives permanently and applies across the corpus, not just to the file someone happened to clean up.
-
Convert all languages simultaneously.
Why: Migrating English first and 'doing translations later' breaks TM (translation memory) alignment between source and target. The localization team ends up reconciling drift that didn't exist before migration. Run all locales through the same conversion engine in the same pass — the TM stays coherent.
Sample Content Assessment
Submit a 20-page sample document. We'll return conversion feasibility, expected content recovery rate, ROT estimate, and engineering effort within two business days. The analysis is the basis for any further engagement, with no obligation to proceed.