AUDIT·CONVERT·DEDUPE·VALIDATE

The conversion that finishes.


Hand a twenty-year estate of FrameMaker, Word, RoboHelp, InDesign, and scanned PDF to a pipeline that actually finishes. The migration that used to stall in a typing pool ships on a schedule, and every topic stays traceable to the source it came from. Because duplicates collapse during the conversion, not months after go-live, the new repository starts smaller and is cheaper to translate, search, and feed into an AI pipeline. What lands is a clean DITA project your team can build on, not a re-keyed copy of the estate you started with.

Source Formats We Convert From

One engineered pipeline, every legacy estate — not a different tool and a different team for each format.

  • Microsoft Word
  • HTML
  • Custom XML / DTD
  • Excel
  • FrameMaker
  • SGML
  • PowerPoint / PPTX
  • RoboHelp
  • Confluence
  • AsciiDoc
  • Legacy DITA (1.0–1.2)
  • DocBook
  • MadCap Flare
  • Markdown
  • QuarkXPress
  • InDesign
  • Google Docs
  • AuthorIT
  • PDF (OCR)

How Conversion Works

Our conversion methodology pairs proven migration tools with deep DITA expertise to handle the heavy lifting, so your team can focus on content quality.

  1. Source Document

    • Inconsistent heading styles
    • Manual numbered lists
    • Embedded images at random DPI
    • No metadata or taxonomy
    • Duplicate content across files
  2. DITA Output

    • Clean topic types (concept, task, reference)
    • Semantic <ol> and <steps> markup
    • Normalized images, consistent DPI
    • Rich metadata and SubjectScheme keys
    • Deduplicated with conref / conkeyref reuse
  3. AI-Ready Content

    • Chunked topics optimized for RAG retrieval
    • Schema.org and JSON-LD metadata mapping
    • Semantic labels for LLM grounding and citation
    • Embedding-friendly short descriptions per topic
    • Knowledge-graph nodes from the subject-scheme taxonomy
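
To make the Schema.org and JSON-LD mapping above concrete, here is a minimal Python sketch. It is an illustration, not the production pipeline: the sample topic, the function name `topic_to_jsonld`, and the choice of the Schema.org `TechArticle` type and its properties are all assumptions made for this example.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical converted topic, used only for illustration.
TOPIC_XML = """<concept id="battery_care">
  <title>Battery care</title>
  <shortdesc>How to store and charge the battery safely.</shortdesc>
  <prolog>
    <metadata>
      <keywords><keyword>battery</keyword><keyword>safety</keyword></keywords>
    </metadata>
  </prolog>
  <conbody><p>Store the battery at room temperature.</p></conbody>
</concept>"""


def topic_to_jsonld(xml_text: str) -> dict:
    """Map a DITA topic's id, title, shortdesc, and keywords to a JSON-LD record.

    The property choices (headline, description, keywords) are one plausible
    mapping, assumed for this sketch.
    """
    root = ET.fromstring(xml_text)
    return {
        "@context": "https://schema.org",
        "@type": "TechArticle",
        "identifier": root.get("id"),
        "headline": root.findtext("title"),
        "description": root.findtext("shortdesc"),
        "keywords": [k.text for k in root.iter("keyword")],
    }


record = topic_to_jsonld(TOPIC_XML)
print(json.dumps(record, indent=2))
```

The short description doubles as the JSON-LD `description`, which is one reason a per-topic short description is useful for retrieval.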

Automatic Topic Deduplication

During conversion, we analyze every paragraph and span across your entire document collection using industry-proven migration tools and structured content analysis. Exact-match duplicates are identified and consolidated into reusable warehouse topics that are referenced by conref pointers, so you are not left discovering redundancy months later.

Typical result: 15–30% of topics identified as duplicates and consolidated in the first pass.
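
As a rough sketch of how exact-match detection can work, the Python below hashes each paragraph after collapsing whitespace and groups the locations that share a hash. Everything in it is hypothetical: the function names, the sample records, the warehouse file and ids, and the decision to ignore whitespace differences are assumptions for illustration, not a description of the tools used in an actual engagement.

```python
import hashlib
import re
from collections import defaultdict


def normalize(text: str) -> str:
    """Collapse runs of whitespace so layout differences do not hide a match."""
    return re.sub(r"\s+", " ", text).strip()


def find_exact_duplicates(paragraphs):
    """Group (topic_id, para_id, text) records by the hash of their text.

    Returns a dict of hash -> list of (topic_id, para_id), keeping only
    hashes that occur in more than one location.
    """
    groups = defaultdict(list)
    for topic_id, para_id, text in paragraphs:
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        groups[digest].append((topic_id, para_id))
    return {h: locs for h, locs in groups.items() if len(locs) > 1}


def conref_for(warehouse_file: str, topic_id: str, element_id: str) -> str:
    """Build a conref pointer in the DITA form file.dita#topicid/elementid."""
    return f'<p conref="{warehouse_file}#{topic_id}/{element_id}"/>'


# Hypothetical sample data: two copies of one warning, plus a unique paragraph.
paragraphs = [
    ("install_guide", "p1", "Disconnect power before servicing."),
    ("service_manual", "p7", "Disconnect  power before servicing. "),
    ("service_manual", "p8", "Remove the rear panel."),
]

duplicates = find_exact_duplicates(paragraphs)
pointer = conref_for("warehouse.dita", "safety", "disconnect_power")
```

Each duplicate location would then be replaced by the pointer, leaving a single copy of the text in the warehouse topic.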

The Modernization Journey

We don’t just move files; we re-engineer information architecture. From unstructured chaos to intelligent, reusable XML.

  1. Analyze

    Deduplicate and audit the legacy source for recoverability

  2. Model

    Define the target information architecture and reuse model

  3. Transform

    Automated, validated XML conversion, format by format

  4. Enrich

    Add semantic metadata, taxonomy, and reuse keys

  5. Go Live

    Validate, baseline, and hand off with full provenance

  • Legacy Conversion

    We build custom parsers for FrameMaker, InDesign, Word, and RoboHelp to extract maximum structure.

  • Metadata Strategy

    We apply taxonomy tags during migration to support faceted search and dynamic filtering later.

  • Quality Assurance

    Automated Schematron validation ensures every migrated topic adheres to your new content model.
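
To give a feel for what a content-model rule checks, here is a small Python stand-in. It is not Schematron itself (real rules would be written as Schematron assertions over XPath and run by a Schematron processor); the three rules, the function name `check_topic`, and the sample topics are assumptions chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def check_topic(xml_text: str) -> list:
    """Return a list of rule violations for one topic (empty list = passes).

    Hypothetical rules for this sketch:
      1. the topic has an id attribute
      2. the topic has a non-empty <shortdesc>
      3. a <task> contains at least one <step> inside <steps>
    """
    root = ET.fromstring(xml_text)
    errors = []
    if not root.get("id"):
        errors.append("topic is missing an id attribute")
    shortdesc = root.find("shortdesc")
    if shortdesc is None or not "".join(shortdesc.itertext()).strip():
        errors.append("topic is missing a non-empty <shortdesc>")
    if root.tag == "task" and root.find("taskbody/steps/step") is None:
        errors.append("task has no <step> inside <steps>")
    return errors


GOOD = (
    '<task id="replace_filter"><title>Replace the filter</title>'
    "<shortdesc>Swap the filter every six months.</shortdesc>"
    "<taskbody><steps><step><cmd>Open the cover.</cmd></step></steps></taskbody>"
    "</task>"
)
BAD = '<task id="replace_filter"><title>Replace the filter</title><taskbody/></task>'

good_errors = check_topic(GOOD)
bad_errors = check_topic(BAD)
```

Running the same rules over every migrated topic turns the content model into a pass/fail gate rather than a style guide.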

Sample Content Assessment

Send us a 20-page sample in any format: FrameMaker, Word, RoboHelp, InDesign, scanned PDF. We’ll return a conversion feasibility assessment, a content recovery rate, and a migration-effort estimate within two business days.

Submit a sample →