Manual publishing is a risk surface.
Documentation that ships from someone's laptop is a quality problem, a reproducibility problem, and an audit problem at once. This is a practitioner guide to the CI/CD pipeline that solves it: containerized builds, validation gates, parallel multi-format output, and chatbot-ready JSONL as a first-class artifact.
Why manual publishing fails.
The manual publishing process looks deceptively functional. A writer runs DITA-OT locally on their laptop. PDF generation takes 15 to 40 minutes. Broken links and invalid XML are discovered after the build, on the rendered output. Someone emails the PDF to a reviewer. The branded template hasn't been updated in two years. There is no build for chatbot-ready JSONL because "that's an AI team problem." Every one of those steps is a quality control that isn't being run.
Publishing automation isn't about speed — it's about risk reduction. The pipeline is the place where validation runs deterministically, where the same toolchain produces the same output every time, where defects fail loudly at the cheapest moment to fix them. The headline numbers (build time, format count) are downstream effects of the actual change, which is moving every quality control upstream of the writer's discretion.
The guide below codifies the pipeline: eight stages from commit to production, four parallel output formats including JSONL for AI retrieval, four CI/CD platform options, and a containerization model that gives every build identical inputs and identical outputs. None of it is theory; it is what a production documentation team actually runs.
- Build time: 40 minutes → 3 minutes
- Output formats: 1 → 4 (in parallel)
- Validation: 0% → 100% coverage
- Rework: ~70% fewer cycles
The eight-stage pipeline.
Eight stages from commit to production. Each stage names the action that runs and the result it produces. The validation gates that surface defects are built into the stages rather than bolted on as a separate phase, so defects fail at the moment they're cheapest to fix.
1. Git push
What runs
Writer commits to a feature branch and opens a pull request. The PR automatically triggers the pipeline — there is no 'did you remember to run the build' question.
Result
Every commit produces a build attempt. Stale local builds and inconsistent author environments stop being a source of drift.
2. Schema & metadata validation
What runs
DTD or RelaxNG validates every topic file. Schematron rules enforce business-level constraints: required metadata fields, prohibited elements, naming conventions, topic length limits, image resolution thresholds.
Result
Invalid XML or incomplete metadata blocks the build immediately, before downstream stages run. The fix surfaces at the cheapest possible moment — at author time, not at deploy time.
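A Schematron rule of the kind this stage runs might look like the sketch below. The required audience field is an illustrative example of a business-level constraint, not a rule taken from the source:

```xml
<!-- metadata.sch — illustrative Schematron rule: every topic must declare an audience -->
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <pattern id="required-metadata">
    <rule context="*[contains(@class, ' topic/topic ')]">
      <assert test="prolog/metadata/audience/@type">
        Topic "<value-of select="@id"/>" is missing a required audience type.
      </assert>
    </rule>
  </pattern>
</schema>
```

Rules like this run against every topic in the corpus, so a missing field fails the pull request rather than surfacing weeks later in a published PDF.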
3. Link & reference integrity
What runs
Every xref, href, conref, and keyref is resolved and verified at build time. Missing targets, circular references, and dead external URLs fail the build.
Result
Catches the most common source of broken documentation: refactoring topics without updating cross-references. The link integrity gate runs before format generation, so broken refs cost a second to detect, not an hour to chase down post-publish.
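The core of the link-integrity gate can be sketched in a few lines. This is an illustrative Python check, not the DITA-OT implementation; it resolves local href and conref targets and skips external URLs and keyref indirection:

```python
# link_check.py — illustrative sketch of the link-integrity gate.
# Resolves local href/conref targets; keyref resolution (which needs the
# key space from the map) is out of scope for this sketch.
import xml.etree.ElementTree as ET
from pathlib import Path

def broken_links(src_dir: str) -> list[tuple[str, str]]:
    """Return (file, reference) pairs whose local targets don't resolve."""
    broken = []
    for topic in Path(src_dir).rglob("*.dita"):
        for el in ET.parse(topic).iter():
            for attr in ("href", "conref"):
                ref = el.get(attr)
                # Skip absent attributes, external URLs, and same-file fragments.
                if not ref or ref.startswith(("http://", "https://", "#")):
                    continue
                target = topic.parent / ref.split("#")[0]
                if not target.exists():
                    broken.append((str(topic), ref))
    return broken
```

Wired into CI, a non-empty result fails the build, so a refactored topic with stale cross-references never reaches format generation.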
4. Terminology & style
What runs
Vale (open source) or Acrolinx (enterprise) runs against the corpus. Custom rules enforce the style guide — banned terms, passive voice limits, abbreviation consistency, product-name capitalization.
Result
Style violations fail the build or warn, depending on rule severity. Editorial consistency stops being a manual-review burden and becomes a CI artifact.
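A Vale rule for one of these checks is a small YAML file. The banned terms below are illustrative, not taken from any real style guide:

```yaml
# styles/Corporate/BannedTerms.yml — illustrative Vale substitution rule
extends: substitution
message: "Use '%s' instead of '%s'."
level: error
ignorecase: true
swap:
  whitelist: allowlist
  e-mail: email
```

Setting `level: error` makes the rule build-breaking; `warning` or `suggestion` lets it surface in the PR without blocking the merge, which matches the severity split described above.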
5. Build all formats
What runs
DITA-OT generates HTML5, branded PDF, and JSONL in parallel. Custom plugins handle company branding, navigation, and metadata enrichment for each output.
Result
Three production-ready output formats in a single build pass — typically under three minutes for a 200-topic corpus. No 'PDF will be ready Friday' email.
6. Image & asset audit
What runs
Missing alt text, oversized files (over 500 KB), orphaned graphics (referenced but never committed), and unlicensed stock images are flagged as build warnings or failures depending on severity.
Result
Accessibility compliance and repo health are continuously verified. Uncompressed screenshots and stock-license violations stop reaching production.
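Two of these checks — oversized files and missing alt text — can be sketched directly. This is an illustrative Python gate using the 500 KB threshold from the stage description, not a real DITA-OT plugin:

```python
# asset_audit.py — illustrative sketch of the image/asset gate: flag files
# over the size threshold and <image> elements without alt text.
import xml.etree.ElementTree as ET
from pathlib import Path

MAX_BYTES = 500 * 1024  # 500 KB threshold from the pipeline rules

def audit(src_dir: str) -> list[str]:
    """Return human-readable warnings for oversized images and missing alt text."""
    warnings = []
    src = Path(src_dir)
    for img in src.rglob("*.png"):
        if img.stat().st_size > MAX_BYTES:
            warnings.append(f"OVERSIZED: {img} ({img.stat().st_size // 1024} KB)")
    for topic in src.rglob("*.dita"):
        for el in ET.parse(topic).iter("image"):
            # DITA alt text lives in a child <alt> element (or a legacy alt attribute).
            if el.find("alt") is None and not el.get("alt"):
                warnings.append(f"NO ALT TEXT: {topic} -> {el.get('href')}")
    return warnings
```

Whether a warning fails the build or just annotates the PR is a severity decision per check, mirroring the warn/fail split in the stage above.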
7. Deploy to staging
What runs
HTML5 output is deployed to a staging URL accessible to reviewers and stakeholders. Pull-request comments link directly to the rendered artifact, not the raw XML.
Result
Review feedback happens against what users will actually see. Comments are written on rendered prose, screenshots, and navigation — not on XML markup.
8. Promote to production
What runs
After PR approval, output is deployed: HTML5 to S3 / Netlify / CDN, PDF uploaded to the doc portal, JSONL pushed to the vector store. Build dashboards record pass/fail and timing trends; Slack, Teams, or email notify on failure.
Result
Three artifacts shipped in lockstep from one commit. Documentation, branded PDF, and AI corpus are all promoted together — never out of sync.
Containerization is the single largest change.
DITA-OT depends on Java, Saxon, Apache FOP, and a stack of plugins — each with version-specific behavior. When Writer A runs DITA-OT 4.1 with Java 17 and Writer B runs DITA-OT 3.7 with Java 11, the same source produces different output. In regulated industries — defense, life sciences, financial services — that is a compliance failure, not a stylistic one.
A Docker image locks the entire toolchain into a single, versioned artifact: DITA-OT version, Java version, custom plugins, fonts, and configuration. The pipeline pulls the image and runs the build. Every build. Every environment. Identical output. The "it works on my machine" failure mode disappears because the machine is now the image.
What's in the image
- Base: Eclipse Temurin JDK 17 (slim)
- DITA-OT: Pinned version (e.g., 4.2.3) installed via official distribution
- Custom plugins: Branded PDF plugin, HTML5 theme, JSONL exporter — all baked in
- Fonts: Corporate fonts for PDF generation (no more 'font not found' on the build server)
- Validation tools: Vale CLI, xmllint, custom Schematron rules
- Build script: Gradle wrapper with tasks for each output format
Dockerfile
# Dockerfile — production DITA-OT build image
FROM eclipse-temurin:17-jdk-jammy AS base

# Tooling needed to fetch and unpack DITA-OT and register fonts
RUN apt-get update && apt-get install -y --no-install-recommends \
    curl unzip fontconfig && rm -rf /var/lib/apt/lists/*

# Install DITA-OT
ARG DITA_OT_VERSION=4.2.3
RUN curl -sL https://github.com/dita-ot/dita-ot/releases/download/${DITA_OT_VERSION}/dita-ot-${DITA_OT_VERSION}.zip \
    -o /tmp/dita-ot.zip && unzip /tmp/dita-ot.zip -d /opt && rm /tmp/dita-ot.zip
ENV DITA_HOME=/opt/dita-ot-${DITA_OT_VERSION}
ENV PATH="${DITA_HOME}/bin:${PATH}"

# Install custom plugins and reload the plugin registry
COPY plugins/ ${DITA_HOME}/plugins/
RUN dita install

# Install corporate fonts for PDF
COPY fonts/ /usr/share/fonts/custom/
RUN fc-cache -fv

WORKDIR /workspace
ENTRYPOINT ["dita"]

Once the image is built and pushed to a container registry, every CI job uses the same image. GitHub Actions, Azure DevOps, Jenkins, and GitLab CI all support container-based jobs natively — the CI platform pulls the image, mounts the source, and runs the build.
Pipeline usage
docker run --rm -v "$(pwd)":/workspace \
  your-registry/dita-ot:4.2.3 \
  -i main.ditamap -f html5 -o /workspace/out/html5

Output formats from a single source.
Four production-grade output formats generated in parallel from one DITA bookmap. JSONL is treated as a first-class output, not a separate "AI pipeline" — it ships from the same commit, validated by the same gates.
- Branded HTML5
Responsive, searchable web output with company visual identity. Custom DITA-OT HTML5 plugin handles navigation, breadcrumbs, version selector, and search integration. Deployed to S3, Netlify, or CDN with cache-busting asset fingerprints.
Typical build: 200 topics → 45s
- Branded PDF
Print-ready PDF using Apache FOP or Antenna House with brand fonts, headers/footers, and cover page. Generated from a custom PDF plugin — not a post-processing step. Includes auto-generated TOC, index, and cross-reference page numbers.
Typical build: 200 topics → 90s
- Chatbot-ready JSONL
Section-aware chunks with full metadata inheritance: topic type, audience, difficulty, domain, intent, section ID. Each chunk is a self-contained unit ready for vector store embedding. Includes a glossary index and content manifest for corpus validation.
Typical build: 200 topics → 30s
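The shape of one chunk record can be sketched as follows. The field names are assumptions based on the metadata listed above, not the exporter's actual schema:

```python
# jsonl_chunk.py — illustrative shape of one chunk record in chunks.jsonl.
import json

def make_chunk(section_id, text, topic_type, audience, difficulty, domain, intent):
    """Build one self-contained, vector-store-ready chunk record."""
    return {
        "id": section_id,
        "text": text,
        "metadata": {
            "topic_type": topic_type,
            "audience": audience,
            "difficulty": difficulty,
            "domain": domain,
            "intent": intent,
        },
    }

chunk = make_chunk(
    "install-gpu-01", "Install the GPU driver before enabling acceleration.",
    "task", "administrator", "intermediate", "installation", "how-to",
)
line = json.dumps(chunk)  # one JSON object per line in chunks.jsonl
```

Because each record carries its inherited metadata, a retrieval layer can filter by audience or intent before embedding similarity ever enters the picture.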
- Additional formats
EPUB (for offline readers), Markdown (for developer portals), normalized DITA (for CCMS ingestion), and custom XML transforms (for data interchange). Each format is a Gradle task — adding a new format is a one-line configuration change.
Typical build: All 4 in parallel → under 3 minutes
Build configuration
// build.gradle — parallel multi-format build
task buildAll {
    dependsOn 'buildHtml5', 'buildPdf', 'buildJsonl'
}

task buildHtml5(type: Exec) {
    commandLine 'dita', '-i', 'main.ditamap', '-f', 'com.extense.html5',
                '-o', 'out/html5', '-Dargs.css=brand.css'
}

task buildPdf(type: Exec) {
    commandLine 'dita', '-i', 'main.ditamap', '-f', 'com.extense.pdf',
                '-o', 'out/pdf'
}

task buildJsonl(type: Exec) {
    commandLine 'dita', '-i', 'main.ditamap', '-f', 'com.extense.jsonl',
                '-o', 'out/jsonl', '-Dchunk.strategy=section-aware'
}

CI/CD platforms.
The pipeline is platform-portable because containerization decouples the build from the CI runner. Below: four supported platforms and the criteria for choosing each. The GitHub Actions configuration is included as a worked example — the Azure DevOps, Jenkins, and GitLab equivalents are structurally similar.
- GitHub Actions
Container-first CI with a marketplace of pre-built actions. Recommended starting point for teams already using GitHub. Container jobs pull the same Docker image used in local development.
- Azure DevOps
Multi-stage YAML pipelines with artifact management and approval gates. Container jobs pull the same Docker image. Ideal for enterprises already on Azure — integrates with Azure Blob Storage for hosting and Azure AI Search for vector retrieval.
- Jenkins
Declarative or scripted Groovy pipelines running on self-hosted agents. Maximum control over infrastructure. The recommended pick for air-gapped defense environments where SaaS CI platforms are prohibited. Supports Docker agent blocks for containerized builds.
- GitLab CI
Integrated CI/CD with built-in container registry and merge-request pipelines. Great for teams already using GitLab for source control — pipeline, container registry, and Pages hosting are all in one platform.
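As a sketch of that structural similarity, a minimal GitLab CI job running the same containerized build might look like this (the image name and paths carry over from the examples in this guide and are assumptions, not a shipped configuration):

```yaml
# .gitlab-ci.yml — structurally equivalent sketch of the container-based build
build-docs:
  image: your-registry/dita-ot:4.2.3
  script:
    - dita -i main.ditamap -f com.extense.html5 -o out/html5
    - dita -i main.ditamap -f com.extense.pdf -o out/pdf
    - dita -i main.ditamap -f com.extense.jsonl -o out/jsonl
  artifacts:
    paths:
      - out/
```

The job body is nearly identical across platforms because the toolchain lives in the image; only the surrounding YAML or Groovy scaffolding changes.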
GitHub Actions — worked example
# .github/workflows/docs.yml
name: Build & Deploy Docs

on:
  push: { branches: [main] }
  pull_request: { branches: [main] }

jobs:
  build:
    runs-on: ubuntu-latest
    container: your-registry/dita-ot:4.2.3
    steps:
      - uses: actions/checkout@v4
      - name: Validate XML
        run: find src -name '*.dita' -print0 | xargs -0 xmllint --noout --dtdvalid "$DITA_HOME/dtd/topic.dtd"
      - name: Check links
        run: dita -i main.ditamap -f lint --check.links=true
      - name: Build HTML5
        run: dita -i main.ditamap -f com.extense.html5 -o out/html5
      - name: Build PDF
        run: dita -i main.ditamap -f com.extense.pdf -o out/pdf
      - name: Build JSONL
        run: dita -i main.ditamap -f com.extense.jsonl -o out/jsonl
      - name: Deploy to S3
        if: github.ref == 'refs/heads/main'
        run: aws s3 sync out/html5 s3://docs.example.com/ --delete

Custom DITA-OT plugins.
Off-the-shelf DITA-OT output is a starting point. Production documentation needs custom branding, navigation, and export formats — and those customizations belong in versioned plugins, not in someone's local install.
- Branded PDF plugin
Custom XSL-FO transforms for Apache FOP or Antenna House. Cover pages, headers and footers with logos, corporate fonts, color schemes, auto-generated TOC and index, and conditional content filtering by audience or product variant.
- Custom HTML5 theme
Responsive web output with the company design system. Navigation sidebar, breadcrumbs, version selector, search integration (Algolia, Elasticsearch, or Lunr.js), and analytics tracking. Deployed as a static site — no server infrastructure required.
- JSONL exporter
Section-aware chunking with full metadata inheritance for vector store ingestion. Produces chunks.jsonl, glossary-index.json, and content-manifest.json. Configurable chunking strategy: by section, by topic, or by semantic boundary.
Every custom plugin is a versioned, testable artifact in the Git repository — not a one-off customization buried in someone's local DITA-OT installation. We build plugins with automated tests, document the configuration parameters, and train the client team to maintain them. When DITA-OT releases a new version, we validate plugin compatibility and update where needed.
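The descriptor that registers such a plugin with DITA-OT is small. This is a hedged sketch of what the JSONL exporter's plugin.xml might contain, reusing the transtype name from the build examples above; the actual plugin's descriptor may declare additional extension points:

```xml
<!-- plugin.xml — minimal descriptor registering a custom transtype (illustrative) -->
<plugin id="com.extense.jsonl">
  <transtype name="com.extense.jsonl" desc="Chatbot-ready JSONL export"/>
  <feature extension="dita.conductor.target.relative" file="build.xml"/>
</plugin>
```

Because the descriptor, XSLT, and Ant targets live together in one versioned directory, `dita install` in the image build is the only integration step a new CI environment needs.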
Sample Content Assessment
Send your current build process — Jenkinsfile, GitHub Actions workflow, or a description of the manual steps — and we'll return a gap analysis with a recommended automation roadmap within two business days. No obligation to proceed.
Submit a sample →