Manual publishing is a risk surface.
Documentation that ships from someone's laptop is a quality problem, a reproducibility problem, and an audit problem at once. This is a practitioner guide to the CI/CD pipeline that solves it: containerized builds, validation gates, parallel multi-format output, and chatbot-ready JSONL as a first-class artifact.
Why manual publishing fails.
The manual publishing process looks deceptively functional. A writer runs DITA-OT locally on their laptop. PDF generation takes 15 to 40 minutes. Broken links and invalid XML are discovered after the build, on the rendered output. Someone emails the PDF to a reviewer. The branded template hasn't been updated in two years. There is no build for chatbot-ready JSONL because "that's an AI team problem." Every one of those steps is a quality control that isn't being run.
Publishing automation isn't about speed — it's about risk reduction. The pipeline is the place where validation runs deterministically, where the same toolchain produces the same output every time, where defects fail loudly at the cheapest moment to fix them. The headline numbers (build time, format count) are downstream effects of the actual change, which is moving every quality control upstream of the writer's discretion.
The guide below codifies the pipeline: eight stages from commit to production, four parallel output formats including JSONL for AI retrieval, four CI/CD platform options, and a containerization model that gives every build identical inputs and identical outputs. None of it is theory; it is what a production documentation team actually runs.
- Build time: 40 minutes → 3 minutes
- Output formats: 1 → 4 (in parallel)
- Validation: 0% → 100% coverage
- Rework: ~70% fewer cycles
The eight-stage pipeline.
Eight stages from commit to production. Each stage names the action that runs and the result it produces. The validation gates that surface defects are built into the stages rather than bolted on as a separate phase, so defects fail at the moment they're cheapest to fix.
1. Git push
What runs
Writer commits to a feature branch and opens a pull request. The PR automatically triggers the pipeline — there is no 'did you remember to run the build' question.
Result
Every commit produces a build attempt. Stale local builds and inconsistent author environments stop being a source of drift.
2. Schema & metadata validation
What runs
DTD or RelaxNG validates every topic file. Schematron rules enforce business-level constraints: required metadata fields, prohibited elements, naming conventions, topic length limits, image resolution thresholds.
Result
Invalid XML or incomplete metadata blocks the build immediately, before downstream stages run. The fix surfaces at the cheapest possible moment — at author time, not at deploy time.
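A Schematron rule of the kind this stage runs might look like the sketch below. The required audience field is an illustrative example of a business-level constraint, not a rule taken from the source:

```xml
<!-- metadata.sch — illustrative Schematron rule: every topic must declare an audience -->
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <pattern id="required-metadata">
    <rule context="*[contains(@class, ' topic/topic ')]">
      <assert test="prolog/metadata/audience/@type">
        Topic "<value-of select="@id"/>" is missing a required audience type.
      </assert>
    </rule>
  </pattern>
</schema>
```

Rules like this run against every topic in the corpus, so a missing field fails the pull request rather than surfacing weeks later in a published PDF.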
3. Link & reference integrity
What runs
Every xref, href, conref, and keyref is resolved and verified at build time. Missing targets, circular references, and dead external URLs fail the build.
Result
Catches the most common source of broken documentation: refactoring topics without updating cross-references. The link integrity gate runs before format generation, so broken refs cost a second to detect, not an hour to chase down post-publish.
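The core of the link-integrity gate can be sketched in a few lines. This is an illustrative Python check, not the DITA-OT implementation; it resolves local href and conref targets and skips external URLs and keyref indirection:

```python
# link_check.py — illustrative sketch of the link-integrity gate.
# Resolves local href/conref targets; keyref resolution (which needs the
# key space from the map) is out of scope for this sketch.
import xml.etree.ElementTree as ET
from pathlib import Path

def broken_links(src_dir: str) -> list[tuple[str, str]]:
    """Return (file, reference) pairs whose local targets don't resolve."""
    broken = []
    for topic in Path(src_dir).rglob("*.dita"):
        for el in ET.parse(topic).iter():
            for attr in ("href", "conref"):
                ref = el.get(attr)
                # Skip absent attributes, external URLs, and same-file fragments.
                if not ref or ref.startswith(("http://", "https://", "#")):
                    continue
                target = topic.parent / ref.split("#")[0]
                if not target.exists():
                    broken.append((str(topic), ref))
    return broken
```

Wired into CI, a non-empty result fails the build, so a refactored topic with stale cross-references never reaches format generation.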
4. Terminology & style
What runs
Vale (open source) or Acrolinx (enterprise) runs against the corpus. Custom rules enforce the style guide — banned terms, passive voice limits, abbreviation consistency, product-name capitalization.
Result
Style violations fail the build or warn, depending on rule severity. Editorial consistency stops being a manual-review burden and becomes a CI artifact.
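A Vale rule for one of these checks is a small YAML file. The banned terms below are illustrative, not taken from any real style guide:

```yaml
# styles/Corporate/BannedTerms.yml — illustrative Vale substitution rule
extends: substitution
message: "Use '%s' instead of '%s'."
level: error
ignorecase: true
swap:
  whitelist: allowlist
  e-mail: email
```

Setting `level: error` makes the rule build-breaking; `warning` or `suggestion` lets it surface in the PR without blocking the merge, which matches the severity split described above.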
5. Build all formats
What runs
DITA-OT generates HTML5, branded PDF, and JSONL in parallel. Custom plugins handle company branding, navigation, and metadata enrichment for each output.
Result
Three production-ready output formats in a single build pass — typically under three minutes for a 200-topic corpus. No 'PDF will be ready Friday' email.
6. Image & asset audit
What runs
Missing alt text, oversized files (over 500 KB), orphaned graphics (referenced but never committed), and unlicensed stock images are flagged as build warnings or failures depending on severity.
Result
Accessibility compliance and repo health are continuously verified. Uncompressed screenshots and stock-license violations stop reaching production.
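Two of these checks — oversized files and missing alt text — can be sketched directly. This is an illustrative Python gate using the 500 KB threshold from the stage description, not a real DITA-OT plugin:

```python
# asset_audit.py — illustrative sketch of the image/asset gate: flag files
# over the size threshold and <image> elements without alt text.
import xml.etree.ElementTree as ET
from pathlib import Path

MAX_BYTES = 500 * 1024  # 500 KB threshold from the pipeline rules

def audit(src_dir: str) -> list[str]:
    """Return human-readable warnings for oversized images and missing alt text."""
    warnings = []
    src = Path(src_dir)
    for img in src.rglob("*.png"):
        if img.stat().st_size > MAX_BYTES:
            warnings.append(f"OVERSIZED: {img} ({img.stat().st_size // 1024} KB)")
    for topic in src.rglob("*.dita"):
        for el in ET.parse(topic).iter("image"):
            # DITA alt text lives in a child <alt> element (or a legacy alt attribute).
            if el.find("alt") is None and not el.get("alt"):
                warnings.append(f"NO ALT TEXT: {topic} -> {el.get('href')}")
    return warnings
```

Whether a warning fails the build or just annotates the PR is a severity decision per check, mirroring the warn/fail split in the stage above.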
7. Deploy to staging
What runs
HTML5 output is deployed to a staging URL accessible to reviewers and stakeholders. Pull-request comments link directly to the rendered artifact, not the raw XML.
Result
Review feedback happens against what users will actually see. Comments are written on rendered prose, screenshots, and navigation — not on XML markup.
8. Promote to production
What runs
After PR approval, output is deployed: HTML5 to S3 / Netlify / CDN, PDF uploaded to the doc portal, JSONL pushed to the vector store. Build dashboards record pass/fail and timing trends; Slack, Teams, or email notify on failure.
Result
Three artifacts shipped in lockstep from one commit. Documentation, branded PDF, and AI corpus are all promoted together — never out of sync.
Containerization is the single largest change.
DITA-OT depends on Java, Saxon, Apache FOP, and a stack of plugins — each with version-specific behavior. When Writer A runs DITA-OT 4.1 with Java 17 and Writer B runs DITA-OT 3.7 with Java 11, the same source produces different output. In regulated industries — defense, life sciences, financial services — that is a compliance failure, not a stylistic one.
A Docker image locks the entire toolchain into a single, versioned artifact: DITA-OT version, Java version, custom plugins, fonts, and configuration. The pipeline pulls the image and runs the build. Every build. Every environment. Identical output. The "it works on my machine" failure mode disappears because the machine is now the image.
What's in the image
- Base: Eclipse Temurin JDK 17 (slim)
- DITA-OT: Pinned version (e.g., 4.2.3) installed via official distribution
- Custom plugins: Branded PDF plugin, HTML5 theme, JSONL exporter — all baked in
- Fonts: Corporate fonts for PDF generation (no more 'font not found' on the build server)
- Validation tools: Vale CLI, xmllint, custom Schematron rules
- Build script: Gradle wrapper with tasks for each output format
Dockerfile
# Dockerfile — production DITA-OT build image
FROM eclipse-temurin:17-jdk-jammy AS base

# Tooling needed to fetch and unpack DITA-OT and register fonts
RUN apt-get update && apt-get install -y --no-install-recommends \
    curl unzip fontconfig && rm -rf /var/lib/apt/lists/*

# Install DITA-OT
ARG DITA_OT_VERSION=4.2.3
RUN curl -sL https://github.com/dita-ot/dita-ot/releases/download/${DITA_OT_VERSION}/dita-ot-${DITA_OT_VERSION}.zip \
    -o /tmp/dita-ot.zip && unzip /tmp/dita-ot.zip -d /opt && rm /tmp/dita-ot.zip
ENV DITA_HOME=/opt/dita-ot-${DITA_OT_VERSION}
ENV PATH="${DITA_HOME}/bin:${PATH}"

# Install custom plugins and reload the plugin registry
COPY plugins/ ${DITA_HOME}/plugins/
RUN dita install

# Install corporate fonts for PDF
COPY fonts/ /usr/share/fonts/custom/
RUN fc-cache -fv

WORKDIR /workspace
ENTRYPOINT ["dita"]

Once the image is built and pushed to a container registry, every CI job uses the same image. GitHub Actions, Azure DevOps, Jenkins, and GitLab CI all support container-based jobs natively — the CI platform pulls the image, mounts the source, and runs the build.
Pipeline usage
docker run --rm -v "$(pwd)":/workspace \
  your-registry/dita-ot:4.2.3 \
  -i main.ditamap -f html5 -o /workspace/out/html5

Output formats from a single source.
Four production-grade output formats generated in parallel from one DITA bookmap. JSONL is treated as a first-class output, not a separate "AI pipeline" — it ships from the same commit, validated by the same gates.
- Branded HTML5
Responsive, searchable web output with company visual identity. Custom DITA-OT HTML5 plugin handles navigation, breadcrumbs, version selector, and search integration. Deployed to S3, Netlify, or CDN with cache-busting asset fingerprints.
Typical build: 200 topics → 45s
- Branded PDF
Print-ready PDF using Apache FOP or Antenna House with brand fonts, headers/footers, and cover page. Generated from a custom PDF plugin — not a post-processing step. Includes auto-generated TOC, index, and cross-reference page numbers.
Typical build: 200 topics → 90s
- Chatbot-ready JSONL
Section-aware chunks with full metadata inheritance: topic type, audience, difficulty, domain, intent, section ID. Each chunk is a self-contained unit ready for vector store embedding. Includes a glossary index and content manifest for corpus validation.
Typical build: 200 topics → 30s
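The shape of one chunk record can be sketched as follows. The field names are assumptions based on the metadata listed above, not the exporter's actual schema:

```python
# jsonl_chunk.py — illustrative shape of one chunk record in chunks.jsonl.
import json

def make_chunk(section_id, text, topic_type, audience, difficulty, domain, intent):
    """Build one self-contained, vector-store-ready chunk record."""
    return {
        "id": section_id,
        "text": text,
        "metadata": {
            "topic_type": topic_type,
            "audience": audience,
            "difficulty": difficulty,
            "domain": domain,
            "intent": intent,
        },
    }

chunk = make_chunk(
    "install-gpu-01", "Install the GPU driver before enabling acceleration.",
    "task", "administrator", "intermediate", "installation", "how-to",
)
line = json.dumps(chunk)  # one JSON object per line in chunks.jsonl
```

Because each record carries its inherited metadata, a retrieval layer can filter by audience or intent before embedding similarity ever enters the picture.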
- Additional formats
EPUB (for offline readers), Markdown (for developer portals), normalized DITA (for CCMS ingestion), and custom XML transforms (for data interchange). Each format is a Gradle task — adding a new format is a one-line configuration change.
Typical build: All 4 in parallel → under 3 minutes
Build configuration
// build.gradle — parallel multi-format build
task buildAll {
    dependsOn 'buildHtml5', 'buildPdf', 'buildJsonl'
}

task buildHtml5(type: Exec) {
    commandLine 'dita', '-i', 'main.ditamap', '-f', 'com.extense.html5',
                '-o', 'out/html5', '-Dargs.css=brand.css'
}

task buildPdf(type: Exec) {
    commandLine 'dita', '-i', 'main.ditamap', '-f', 'com.extense.pdf',
                '-o', 'out/pdf'
}

task buildJsonl(type: Exec) {
    commandLine 'dita', '-i', 'main.ditamap', '-f', 'com.extense.jsonl',
                '-o', 'out/jsonl', '-Dchunk.strategy=section-aware'
}

CI/CD platforms.
The pipeline is platform-portable because containerization decouples the build from the CI runner. Below: four supported platforms and the criteria for choosing each. The GitHub Actions configuration is included as a worked example — the Azure DevOps, Jenkins, and GitLab equivalents are structurally similar.
- GitHub Actions
Container-first CI with a marketplace of pre-built actions. Recommended starting point for teams already using GitHub. Container jobs pull the same Docker image used in local development.
- Azure DevOps
Multi-stage YAML pipelines with artifact management and approval gates. Container jobs pull the same Docker image. Ideal for enterprises already on Azure — integrates with Azure Blob Storage for hosting and Azure AI Search for vector retrieval.
- Jenkins
Declarative or scripted Groovy pipelines running on self-hosted agents. Maximum control over infrastructure. The recommended pick for air-gapped defense environments where SaaS CI platforms are prohibited. Supports Docker agent blocks for containerized builds.
- GitLab CI
Integrated CI/CD with built-in container registry and merge-request pipelines. Great for teams already using GitLab for source control — pipeline, container registry, and Pages hosting are all in one platform.
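As a sketch of that structural similarity, a minimal GitLab CI job running the same containerized build might look like this (the image name and paths carry over from the examples in this guide and are assumptions, not a shipped configuration):

```yaml
# .gitlab-ci.yml — structurally equivalent sketch of the container-based build
build-docs:
  image: your-registry/dita-ot:4.2.3
  script:
    - dita -i main.ditamap -f com.extense.html5 -o out/html5
    - dita -i main.ditamap -f com.extense.pdf -o out/pdf
    - dita -i main.ditamap -f com.extense.jsonl -o out/jsonl
  artifacts:
    paths:
      - out/
```

The job body is nearly identical across platforms because the toolchain lives in the image; only the surrounding YAML or Groovy scaffolding changes.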
GitHub Actions — worked example
# .github/workflows/docs.yml
name: Build & Deploy Docs

on:
  push: { branches: [main] }
  pull_request: { branches: [main] }

jobs:
  build:
    runs-on: ubuntu-latest
    container: your-registry/dita-ot:4.2.3
    steps:
      - uses: actions/checkout@v4
      - name: Validate XML
        run: find src -name '*.dita' -print0 | xargs -0 xmllint --noout --dtdvalid "$DITA_HOME/dtd/topic.dtd"
      - name: Check links
        run: dita -i main.ditamap -f lint --check.links=true
      - name: Build HTML5
        run: dita -i main.ditamap -f com.extense.html5 -o out/html5
      - name: Build PDF
        run: dita -i main.ditamap -f com.extense.pdf -o out/pdf
      - name: Build JSONL
        run: dita -i main.ditamap -f com.extense.jsonl -o out/jsonl
      - name: Deploy to S3
        if: github.ref == 'refs/heads/main'
        run: aws s3 sync out/html5 s3://docs.example.com/ --delete

Custom DITA-OT plugins.
Off-the-shelf DITA-OT output is a starting point. Production documentation needs custom branding, navigation, and export formats — and those customizations belong in versioned plugins, not in someone's local install.
- Branded PDF plugin
Custom XSL-FO transforms for Apache FOP or Antenna House. Cover pages, headers and footers with logos, corporate fonts, color schemes, auto-generated TOC and index, and conditional content filtering by audience or product variant.
- Custom HTML5 theme
Responsive web output with the company design system. Navigation sidebar, breadcrumbs, version selector, search integration (Algolia, Elasticsearch, or Lunr.js), and analytics tracking. Deployed as a static site — no server infrastructure required.
- JSONL exporter
Section-aware chunking with full metadata inheritance for vector store ingestion. Produces chunks.jsonl, glossary-index.json, and content-manifest.json. Configurable chunking strategy: by section, by topic, or by semantic boundary.
Every custom plugin is a versioned, testable artifact in the Git repository — not a one-off customization buried in someone's local DITA-OT installation. We build plugins with automated tests, document the configuration parameters, and train the client team to maintain them. When DITA-OT releases a new version, we validate plugin compatibility and update where needed.
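The descriptor that registers such a plugin with DITA-OT is small. This is a hedged sketch of what the JSONL exporter's plugin.xml might contain, reusing the transtype name from the build examples above; the actual plugin's descriptor may declare additional extension points:

```xml
<!-- plugin.xml — minimal descriptor registering a custom transtype (illustrative) -->
<plugin id="com.extense.jsonl">
  <transtype name="com.extense.jsonl" desc="Chatbot-ready JSONL export"/>
  <feature extension="dita.conductor.target.relative" file="build.xml"/>
</plugin>
```

Because the descriptor, XSLT, and Ant targets live together in one versioned directory, `dita install` in the image build is the only integration step a new CI environment needs.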
Sample Content Assessment
Send your current build process — Jenkinsfile, GitHub Actions workflow, or a description of the manual steps — and we'll return a gap analysis with a recommended automation roadmap within two business days. No obligation to proceed.
Submit a sample →