AUDIT·CONVERT·DEDUPE·VALIDATE

The conversion that finishes.


Twenty years of FrameMaker, Word, RoboHelp, InDesign, SGML, and unstructured XML — audited for what’s recoverable, parsed into DITA topics, deduplicated against itself, and validated against your new content model. Done by engineering, not by a typing pool.

Why manual conversion never finishes.

Three failure modes turn a migration into an open-ended engagement. Each one compounds the others, and all three are visible at the audit — if anyone looks.

  1. The estate is bigger than the team thinks.

    Every audit reveals more legacy formats than the migration plan accounted for. A “FrameMaker to DITA” project quietly absorbs a parallel Word estate, three RoboHelp microsites, and procedure PDFs nobody indexed. The work expands faster than the team converts it.

  2. Hand-rekeying loses structure.

    When a human re-authors a 200-page FrameMaker book into DITA, the only structure that survives is the structure that was already legible. Implicit hierarchies — page breaks, running headers, indentation levels — get flattened into paragraphs of body text. Months later, the topics still read like the flat documents they came from.

  3. Nobody finds the duplicates.

    Across a 50,000-topic estate, 15–30% of paragraphs are exact or near-duplicates of paragraphs elsewhere. Hand-rekeying catches none of them. The new DITA repository inherits all the redundancy of the legacy estate, and you paid a conversion fee for every duplicate.

The pipeline

The conversion is engineered, not transcribed.


Every migration starts with a wall-to-wall audit. Format inventory, structure-density scoring, recoverability assessment per asset family — the work is quoted on what’s recoverable. Format-specific parsers do the conversion: FrameMaker MIF via custom XSL, Word OOXML via OpenXML, RoboHelp HHP unpacked and topic-split, InDesign IDML mapped to DITA sections, SGML normalized through OmniMark, scanned PDF run through ABBYY FineReader. Paragraph- and span-level hash matching catches the 15–30% of topics that are exact or near-duplicates of topics elsewhere in the estate. Schematron rule sets validate the result. Failed topics route to remediation, with provenance metadata back to the legacy source.
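What “format-specific parser” means in practice: walk the legacy markup, map its style names to DITA elements, and never re-key the text. A minimal, standard-library sketch for the Word OOXML case, where the element names come from WordprocessingML but the style-to-element mapping table below is invented for illustration:

```python
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# Illustrative style-to-DITA mapping; a real engagement derives this
# table per estate from the audit, not from a boilerplate template.
STYLE_MAP = {"Heading1": "title", "Heading2": "title", "Code": "codeblock"}

def paragraphs(document_xml):
    """Yield (dita_element, text) for each w:p in a WordprocessingML body."""
    root = ET.fromstring(document_xml)
    for p in root.iter(W + "p"):
        style_el = p.find(W + "pPr/" + W + "pStyle")
        style = style_el.get(W + "val") if style_el is not None else "Normal"
        text = "".join(t.text or "" for t in p.iter(W + "t"))
        yield STYLE_MAP.get(style, "p"), text

sample = """<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Install the unit</w:t></w:r></w:p>
    <w:p><w:r><w:t>Remove the cover plate.</w:t></w:r></w:p>
  </w:body>
</w:document>"""

for element, text in paragraphs(sample):
    print(element, repr(text))
```

A production parser also carries tables, images, numbering, and revision marks; the point of the sketch is only that style metadata, not a human reader, decides what each paragraph becomes.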

  1. Audit

    Custom inventory scripts, structure-density scoring

    Wall-to-wall format inventory and recoverability score per asset family. The estate is quoted on what is recoverable, not on page count.

  2. Model

    DITA 1.3 + subject scheme, conref strategy

    Information architecture for the target project: topic types, metadata vocabulary, reuse strategy. Modeled against the actual content, not boilerplate templates.

  3. Convert

    FrameMaker MIF, Word OOXML, RoboHelp HHP/CHM, InDesign IDML, XSLT 3.0, ABBYY FineReader

    Format-specific parsers extract maximum structure. Output is DITA 1.3, validated on emission. No human re-keys the body text.

  4. Deduplicate

    Paragraph + span hash matching, near-duplicate review queue

    Exact matches consolidated automatically. Near-duplicates routed to a reviewer queue with diff. Originals become conref pointers into a topic warehouse.

  5. Validate

    Schematron rule sets, DITA-OT validation, provenance metadata

    Failed topics route to remediation with diagnostic reports. Final deliverable: a clean, deduplicated DITA project with traceability back to every legacy source.
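The Validate step can be pictured with a simplified stand-in for a Schematron rule set: each rule is one assertion over a parsed topic, and a failure produces the diagnostic that routes the topic to remediation. The three rules below are invented examples, not the engagement’s actual rule set, and real runs use full Schematron via the DITA-OT rather than hand-rolled predicates:

```python
import xml.etree.ElementTree as ET

# Each entry mimics one Schematron assert: (diagnostic, predicate on topic root).
RULES = [
    ("topic has a non-empty <title>",
     lambda t: (t.findtext("title") or "").strip() != ""),
    ("task topics contain <steps>",
     lambda t: t.tag != "task" or t.find("taskbody/steps") is not None),
    ("every <xref> carries an href",
     lambda t: all(x.get("href") for x in t.iter("xref"))),
]

def validate(topic_xml):
    """Return the diagnostics of every failed rule for one converted topic."""
    root = ET.fromstring(topic_xml)
    return [diag for diag, ok in RULES if not ok(root)]

bad_task = "<task id='t1'><title></title><taskbody><context/></taskbody></task>"
print(validate(bad_task))
```

Topics that return an empty list pass into the deliverable; anything else lands in the remediation queue with its diagnostics attached.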

Five fronts, one target.

Four legacy format families, plus the engineering capability that compounds the savings. Each is a separate parser, transform, or hash-matching pass in the same engineered conversion.

  1. FrameMaker & InDesign.

    Adobe page-layout estates.

    FrameMaker MIF parsed via custom XSL. InDesign IDML extracted topic-by-topic, with frame hierarchy mapped to DITA section structure. Cross-references, conditional text, and variable definitions migrated as conref keys and DITAVAL filters.

    What does a 4,000-page FrameMaker book turn into, when the page-layout heuristics finally agree?

  2. Word, RoboHelp & Confluence.

    Office and help-authoring estates.

    Word OOXML parsed via OpenXML with paragraph-style mappings to DITA element vocabularies. RoboHelp HHP and CHM unpacked and topic-split. Confluence space exports converted via the REST API with macro mappings to conref keys. Heading-style consistency rebuilt where the source forgot it.

    Where do all the inconsistent heading styles go?

  3. Unstructured XML, SGML & HTML.

    XML estates that lost their DTD.

    DocBook, IBMIDDoc, MIL-STD-2361, and bespoke XML schemas converted via element-by-element XSLT 3.0 transforms. SGML normalized through OmniMark or SGML→XML round-tripping. HTML estates — intranet exports, knowledge bases — parsed into DITA topics with link-graph preservation.

    When the original DTD is lost, what tells you what the elements meant?

  4. PDF & scanned print.

    The estate that never had structure.

    Born-digital PDFs parsed via PDF.js or pdfminer, with tagged-PDF structure extraction where present. Scanned PDFs OCR’d through ABBYY FineReader with layout reconstruction. Output is a structured DITA shell — imperfect, but a measurable starting point that hand-keying cannot match for cost.

    Is the parts catalog locked in a 1998 PDF still authoritative? Then why is it unsearchable?

  5. Topic deduplication.

    The compounding payoff.

    Paragraph and span hashes computed across the entire converted estate. Exact matches consolidated automatically; near-duplicates (typically 92%+ similarity) routed to a reviewer queue with diff. Consolidated content becomes a topic warehouse; originals carry conref pointers to it. Typical result: 15–30% of topics retired in the first pass, full provenance retained.

    Why discover the 30% redundancy after migration, when you can collapse it during?
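The deduplication pass above reduces to two mechanisms: a hash over normalized paragraphs for exact matches, and a similarity score against the kept originals for the near-duplicate review queue. A standard-library sketch, where the normalization rule and the 0.92 cutoff are illustrative stand-ins tuned per estate:

```python
import hashlib
from difflib import SequenceMatcher

THRESHOLD = 0.92  # near-duplicate cutoff; the 92%+ figure quoted above

def fingerprint(paragraph):
    """Hash a whitespace- and case-normalized paragraph for exact matching."""
    normalized = " ".join(paragraph.lower().split())
    return hashlib.sha256(normalized.encode()).hexdigest()

def dedupe(paragraphs):
    """Split paragraphs into kept originals, exact dupes, and a review queue."""
    originals, exact, review = [], [], []   # indices into `paragraphs`
    seen = {}                               # fingerprint -> original index
    for i, para in enumerate(paragraphs):
        h = fingerprint(para)
        if h in seen:                        # exact: becomes a conref pointer
            exact.append((i, seen[h]))
            continue
        near = next((j for j in originals
                     if SequenceMatcher(None, para, paragraphs[j]).ratio() >= THRESHOLD),
                    None)
        if near is not None:                 # near-duplicate: human review with diff
            review.append((i, near))
        else:                                # new content: goes to the warehouse
            originals.append(i)
            seen[h] = i
    return originals, exact, review

paras = [
    "Tighten the retaining bolt to 12 Nm.",
    "Tighten   the retaining bolt to 12 NM.",   # exact after normalization
    "Tighten the retaining bolt to 14 Nm.",     # near-duplicate: torque differs
    "Disconnect the battery before servicing.",
]
print(dedupe(paras))
```

At estate scale the pairwise comparison is replaced by locality-sensitive hashing or shingling so the pass stays tractable, but the triage logic is the same: exact matches consolidate automatically, near-misses go to a human with a diff.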

Recovery rate

Recovery rate is the metric, not page count.

A migration project quoted in pages is a project that ignores its own findings. The first thing the audit produces is a recoverability score per asset family — the fraction of each estate that parses cleanly, the fraction that needs human triage, and the fraction that was always too unstructured to recover.

That number is the basis for every downstream commitment. Schedule estimates, translation budget, search-index coverage, and RAG retrieval precision all depend on it. A page count tells you nothing about which of those numbers will survive the conversion.
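The score itself is plain arithmetic once the audit has tallied each asset family: the fraction that parses cleanly, the fraction needing human triage, and the unrecoverable remainder. A sketch with invented tallies, for illustration only:

```python
# Hypothetical audit tallies per asset family, counted in topics:
# (clean-parse, needs-triage, unrecoverable). Real audits derive these
# from parser dry runs during the inventory, not from estimates.
audit = {
    "FrameMaker books": (9200, 1400, 400),
    "Word estate":      (6100, 2300, 600),
    "Scanned PDF":      (800, 1900, 1300),
}

def recoverability(counts):
    """Turn raw tallies into the per-family fractions the quote is based on."""
    clean, triage, lost = counts
    total = clean + triage + lost
    return {
        "clean":  round(clean / total, 3),
        "triage": round(triage / total, 3),
        "lost":   round(lost / total, 3),
    }

for family, counts in audit.items():
    print(family, recoverability(counts))
```

The same three fractions then size the schedule: clean topics flow straight through the parsers, triage topics get per-topic remediation budget, and the lost fraction is excluded from the quote before conversion begins.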

On typical Extense engagements, deduplication and conref-driven reuse push the final reuse rate to 45% — forty-five out of every hundred paragraphs in the new DITA repository point at content authored once. That is the baseline every translation, search, and AI pipeline that comes after gets to measure itself against.

Sample Content Assessment

Send us a 20-page sample — FrameMaker, Word, RoboHelp, InDesign, scanned PDF, anything. We’ll return a conversion-feasibility assessment, a content-recovery rate, and a migration-effort estimate within two business days.

Submit a sample →