AUDIT·CONVERT·DEDUPE·VALIDATE
The conversion that finishes.
Hand a twenty-year estate of FrameMaker, Word, RoboHelp, InDesign, and scanned PDF to a pipeline that actually finishes: clean DITA, the duplicates already collapsed, every topic traceable to the source it came from. The migration that used to stall in a typing pool ships on a schedule. Because the duplicates collapse during the conversion, not months after go-live, the new repository starts smaller and is cheaper to translate, search, and feed into an AI pipeline. What lands is a clean DITA project your team can build on, not a re-keyed copy of the estate you started with.
Source Formats We Convert From
One engineered pipeline, every legacy estate — not a different tool and a different team for each format.
- Microsoft Word
- HTML
- Custom XML / DTD
- Excel
- FrameMaker
- SGML
- PowerPoint / PPTX
- RoboHelp
- Confluence
- AsciiDoc
- Legacy DITA (1.0–1.2)
- DocBook
- MadCap Flare
- Markdown
- QuarkXPress
- InDesign
- Google Docs
- AuthorIT
- PDF (OCR)
How Conversion Works
Our conversion methodology combines proven migration tools with deep DITA expertise to handle the heavy lifting. You focus on quality.
Source Document
- Inconsistent heading styles
- Manual numbered lists
- Embedded images at random DPI
- No metadata or taxonomy
- Duplicate content across files
DITA Output
- Clean topic types (concept, task, reference)
- Semantic <ol> and <steps> markup
- Normalized images, consistent DPI
- Rich metadata and subject-scheme keys
- Deduplicated with conref / conkeyref reuse
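As a rough illustration of the list conversion above, the sketch below turns manually numbered lines into DITA <steps> markup. It is a simplified example rather than our production converter: the function name lines_to_steps, the numbering pattern, and the sample lines are all assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Matches hand-typed step numbers such as "1. ", "2) " or "3 - ".
# The pattern is an illustrative assumption, not an exhaustive rule set.
STEP_PREFIX = re.compile(r"^\s*\d+\s*[.)\-]\s+")


def lines_to_steps(lines):
    """Build a DITA <steps> element from manually numbered text lines."""
    steps = ET.Element("steps")
    for line in lines:
        if not STEP_PREFIX.match(line):
            continue  # unnumbered lines are skipped in this sketch
        step = ET.SubElement(steps, "step")
        # Strip the typed number so the numbering comes from the markup.
        ET.SubElement(step, "cmd").text = STEP_PREFIX.sub("", line).strip()
    return steps


# Hypothetical legacy lines with inconsistent manual numbering.
legacy = ["1. Power off the unit.", "2) Remove the cover.", "3 - Replace the filter."]
steps_xml = ET.tostring(lines_to_steps(legacy), encoding="unicode")
print(steps_xml)
```

Once the typed numbers are gone, step order lives in the structure itself, so publishing tools can renumber automatically when steps are added or reused.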
AI-Ready Content
- Chunked topics optimized for RAG retrieval
- Schema.org and JSON-LD metadata mapping
- Semantic labels for LLM grounding and citation
- Embedding-friendly short descriptions per topic
- Knowledge-graph nodes from the subject-scheme taxonomy
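To make the metadata mapping more concrete, here is a minimal sketch of how one topic's metadata might be expressed as JSON-LD using Schema.org types. The choice of TechArticle and DefinedTerm, the function name topic_to_jsonld, and the sample values are assumptions for illustration; the actual mapping depends on your content model.

```python
import json


def topic_to_jsonld(topic_id, title, shortdesc, subjects):
    """Map basic topic metadata to a Schema.org JSON-LD dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "TechArticle",
        "identifier": topic_id,
        "headline": title,
        # The short description doubles as a retrieval-friendly summary.
        "description": shortdesc,
        # Each taxonomy subject becomes a term a knowledge graph can link to.
        "about": [{"@type": "DefinedTerm", "name": name} for name in subjects],
    }


# Hypothetical topic values, for illustration only.
doc = topic_to_jsonld(
    topic_id="replace-filter",
    title="Replace the filter",
    shortdesc="Swap the intake filter during scheduled maintenance.",
    subjects=["maintenance", "filters"],
)
print(json.dumps(doc, indent=2))
```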
Automatic Topic Deduplication
During conversion, we analyze every paragraph and span across your entire document collection using industry-proven migration tools and structured content analysis. Exact-match duplicates are identified and consolidated into reusable warehouse topics with conref pointers — saving you from discovering redundancy months later.
Typical result: 15–30% of topics identified as duplicates and consolidated in the first pass.
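One common way to find exact-match duplicates is to fingerprint each paragraph and group identical fingerprints. The sketch below shows that general idea under stated assumptions: whitespace is normalized before hashing, and the names fingerprint and find_duplicates and the sample locations are hypothetical. It is not a description of the specific tools used in a given project.

```python
import hashlib
from collections import defaultdict


def fingerprint(text):
    """Hash a paragraph after collapsing runs of whitespace."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(paragraphs):
    """Return groups of locations whose paragraph text is identical.

    `paragraphs` maps a location string to the paragraph text found there.
    """
    groups = defaultdict(list)
    for location, text in paragraphs.items():
        groups[fingerprint(text)].append(location)
    # Only fingerprints seen in more than one place are reuse candidates.
    return [locations for locations in groups.values() if len(locations) > 1]


# Hypothetical corpus: two files carry the same warning with different spacing.
corpus = {
    "install.fm#p4": "Disconnect power before servicing.",
    "service.docx#p12": "Disconnect  power before servicing.",
    "intro.fm#p1": "This guide covers the base model.",
}
duplicates = find_duplicates(corpus)
print(duplicates)
```

Each group returned here would map to one warehouse topic, with the original locations replaced by conref pointers to it.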
The Modernization Journey
We don’t just move files; we re-engineer information architecture. From unstructured chaos to intelligent, reusable XML.
- Analyze: Deduplicate and audit the legacy source for recoverability
- Model: Define the target information architecture and reuse model
- Transform: Automated, validated XML conversion, format by format
- Enrich: Add semantic metadata, taxonomy, and reuse keys
- Go Live: Validate, baseline, and hand off with full provenance
- Legacy Conversion: We build custom parsers for FrameMaker, InDesign, Word, and RoboHelp to extract maximum structure.
- Metadata Strategy: We apply taxonomy tags during migration to support future faceted search and dynamic filtering.
- Quality Assurance: Automated Schematron validation ensures every migrated topic adheres to your new content model.
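Schematron rules are assertions about XML structure. As a plain-Python stand-in for that idea (not Schematron itself), the sketch below checks a topic against three content-model rules. The rules, the function name check_topic, and the sample topics are assumptions chosen for illustration; a real rule set is defined with you during the Model phase.

```python
import xml.etree.ElementTree as ET


def check_topic(xml_text):
    """Return a list of rule violations for one topic (empty if it passes)."""
    root = ET.fromstring(xml_text)
    problems = []
    # Illustrative rule 1: every topic needs an id so it can be reused.
    if not root.get("id"):
        problems.append("topic has no id attribute")
    # Illustrative rule 2: every topic needs a non-empty short description.
    shortdesc = root.find("shortdesc")
    if shortdesc is None or not "".join(shortdesc.itertext()).strip():
        problems.append("shortdesc is missing or empty")
    # Illustrative rule 3: every step must state a command.
    for step in root.iter("step"):
        if step.find("cmd") is None:
            problems.append("step without a cmd element")
    return problems


# Hypothetical topics: one that satisfies the rules and one that does not.
good = (
    '<task id="t1"><title>Replace filter</title>'
    "<shortdesc>Swap the filter.</shortdesc>"
    "<taskbody><steps><step><cmd>Open the cover.</cmd></step></steps></taskbody></task>"
)
bad = (
    "<task><title>Replace filter</title>"
    "<taskbody><steps><step><info>Open the cover.</info></step></steps></taskbody></task>"
)
print(check_topic(good))
print(check_topic(bad))
```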
Sample Content Assessment
Send us a 20-page sample in any format: FrameMaker, Word, RoboHelp, InDesign, scanned PDF, anything. We’ll return a conversion-feasibility assessment, a content recovery rate, and a migration-effort estimate within two business days.
Submit a sample →