
Dumpling documentation

Dumpling is a streaming anonymizer for plain SQL dumps. It supports PostgreSQL (pg_dump plain format), SQLite (.dump), and SQL Server / MSSQL (SSMS / mssql-scripter plain SQL output). For PostgreSQL custom-format or directory-format archives (e.g. Heroku pg:backups:download), Dumpling auto-detects them when --format postgres (default) and invokes pg_restore -f -; see PostgreSQL archives and compressed inputs in the configuration guide.

New here? Start with Getting started: generate a draft policy with scaffold-config, review and add secrets, run Dumpling, then tighten with lint-policy and optional CI flags.

This documentation covers the operating model for day-to-day use:

  • how to build and run Dumpling locally,
  • how to configure transformation behavior safely,
  • how CI validates quality before changes merge,
  • and how maintainers produce tagged releases.

Documentation quality gate

The mdBook site is built in CI as follows:

  • Pull requests: the Docs (PR) workflow runs mdbook build when docs-related paths change (no deploy).
  • main: the Docs workflow builds and deploys to GitHub Pages when docs-related paths change.

This keeps the docs in a continuously deployable state instead of drifting from the codebase.

Getting started

This page is the shortest path from zero to a first successful run. For strategy details, row filters, dump seals, and CI patterns, continue with the configuration guide and the repository README.md.

Prerequisites

  • Rust stable toolchain (rustup recommended). The repo includes rust-toolchain.toml (stable + rustfmt + clippy) so CI and local cargo stay aligned.
  • cargo on your PATH

Optional: run ./scripts/setup-dev.sh once from the repo root — it installs toolchain components, runs cargo fetch, and installs a pinned mdBook under .tools/ for the same docs build CI uses.

Build

cargo build --release
./target/release/dumpling --help

Python / pip (dumpling-cli)

pip install dumpling-cli
dumpling --help

First anonymization

  1. Generate a draft policy (recommended) — From your project root (or anywhere you keep config):

    dumpling scaffold-config -i dump.sql -o .dumplingconf
    

    This beta subcommand streams the dump once and writes inferred [rules] from SQL column names (CREATE TABLE, INSERT, and PostgreSQL COPY column lists). Heuristics are English-oriented; output is draft only—review and edit every rule, add a top-level salt (for hashing) and any ${…} secret placeholders before production use.

    Useful flags:

    • --infer-json-paths — Keep up to five sampled rows per table (reservoir) and suggest nested JSON rules as column.path.leaf.
    • --max-json-depth — Cap JSON walking depth when using --infer-json-paths (default 24).
    • --format — postgres (default), sqlite, or mssql.
    • --pg-restore-path / --pg-restore-arg — Optional pg_restore binary and extra arguments when --input is a PostgreSQL custom-format or directory-format archive (auto-detected with --format postgres); see PostgreSQL archives and compressed inputs.

    Run dumpling scaffold-config --help for the full flag list.

  2. Or start from the example policy — Copy .dumplingconf.example to .dumplingconf (or merge under [tool.dumpling] in pyproject.toml) and author [rules] by hand. Set environment variables for salt and any ${…} references.

  3. Align rules with your dump (manual path only) — If you skipped scaffold-config, use CREATE TABLE, COPY … (…), and INSERT INTO … (…) lines to name [rules."table"] or [rules."schema.table"] keys. Trim to the tables you care about first.

  4. Run Dumpling — dumpling -i dump.sql -o sanitized.sql (add -c path if the config is not in the default search path). Use dumpling --check -i dump.sql when you only want to know whether anything would change.

  5. Tighten the policy — Run dumpling lint-policy on your config. When you are ready for stricter gates, add [sensitive_columns] and use --strict-coverage, --report, and --scan-output as described in the configuration guide and the repository README.md.
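
Step 1's --infer-json-paths keeps "up to five sampled rows per table (reservoir)". Dumpling itself is written in Rust; as a minimal sketch of the reservoir idea (classic Algorithm R, with a hypothetical reservoir_sample helper) in Python:

```python
import random

def reservoir_sample(rows, k=5, rng=None):
    """Keep a uniform random sample of up to k items from a stream of unknown length."""
    rng = rng or random.Random()
    sample = []
    for i, row in enumerate(rows):
        if len(sample) < k:
            sample.append(row)  # fill the reservoir first
        else:
            j = rng.randrange(i + 1)  # item i survives with probability k/(i+1)
            if j < k:
                sample[j] = row
    return sample
```

A single pass suffices, which matches the streaming model: the dump is never loaded whole.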

PostgreSQL custom-format archives

If your input is a PostgreSQL custom-format file or directory-format folder (not plain SQL), use --format postgres (default): Dumpling auto-detects the archive and runs pg_restore -f - (needs pg_restore from PostgreSQL client tools). Gzip-wrapped plain SQL is streamed without a temp file; ZIP (or gzip wrapping PGDMP) uses a temp extract that is cleaned up afterward. See PostgreSQL archives and compressed inputs in the configuration guide.

Test locally (contributors)

cargo fmt --all -- --check
cargo clippy --all-targets --all-features
cargo test --all-targets --all-features

Configuration guide

Dump format

Use --format to declare the SQL dialect of your input file:

  • postgres (default) — PostgreSQL pg_dump plain-text format. Supports COPY … FROM stdin blocks, "double-quoted" identifiers, and ''-escaped strings. Custom-format (PGDMP) and directory-format (toc.dat) dumps are auto-detected and decoded with pg_restore -f - (requires client tools). Gzip-wrapped plain SQL is decompressed in-process; ZIP (or gzip wrapping PGDMP/nested ZIP) uses a temp file that is removed after the run. By default the archive is deleted after success; use --keep-original or keep_original in config to retain it.
  • sqlite — SQLite .dump format. Adds INSERT OR REPLACE INTO / INSERT OR IGNORE INTO support. No COPY blocks.
  • mssql — SQL Server / MSSQL plain SQL. Adds [bracket] identifier quoting, N'…' Unicode string literals, and nvarchar(n) / nchar(n) length extraction. No COPY blocks.

Example:

dumpling --format sqlite -i data.db.sql -o anonymized.sql
dumpling --format mssql  -i backup.sql  -o anonymized.sql

PostgreSQL archives and compressed inputs

Heroku PGBackups and many pipelines ship pg_dump custom format (-Fc), directory-format dumps, or gzip/ZIP-wrapped files. Dumpling’s SQL engine still expects plain text at the parser; anything else is normalized first.

Custom-format and directory dumps (auto-detected)

With --format postgres (default), Dumpling detects:

  • Custom-format files (magic PGDMP at the start of the file), and
  • Directory-format folders (a toc.dat beside table blobs),

then runs pg_restore -f - (script to stdout inside the process — no database) and pipes the result through the same anonymizer as a normal plain-SQL file. Detection from --input is automatic.

Requirements: PostgreSQL client tools on PATH (pg_restore), or --pg-restore-path.

Extra pg_restore arguments:

  • CLI: --pg-restore-arg (repeatable), e.g. --pg-restore-arg=--no-owner --pg-restore-arg=--no-acl
  • Config (optional): [pg_restore] — CLI overrides these when you pass path or args:
[pg_restore]
path = "/usr/bin/pg_restore"
args = ["--no-owner", "--no-acl"]
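
The override rule above ("CLI overrides these when you pass path or args") can be pictured as a small resolver. This is an illustration only, in Python rather than Dumpling's Rust, and the all-or-nothing override of the args list is an assumption, not confirmed behavior:

```python
def resolve_pg_restore(cli_path, cli_args, config):
    """Prefer CLI-provided values; fall back to the [pg_restore] config table."""
    cfg = config.get("pg_restore", {})
    path = cli_path or cfg.get("path", "pg_restore")  # default: binary on PATH
    args = cli_args if cli_args else cfg.get("args", [])
    return path, args
```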

Gzip and ZIP wrappers

  • Gzip (.gz) whose decompressed payload is plain SQL: decompressed in-process (streamed); no temporary dump file.
  • ZIP containing a single dump file (or a single .sql when multiple files exist), gzip wrapping PGDMP, or gzip wrapping an inner ZIP: Dumpling writes under the system temp directory and removes those paths when the run completes (including after errors — cleanup runs on drop).

--in-place is rejected when Dumpling had to materialize a temp file for compression or when the resolved input is a PostgreSQL archive decoded via pg_restore (use --output or stdout).

Keeping inputs and --check

After a fully successful run, Dumpling removes the --input archive path (single file or directory-format folder) by default. To keep it:

  • --keep-original, or
  • keep_original = true at the top level of .dumplingconf / [tool.dumpling] (merged with CLI; --keep-original cannot be used with --in-place).

--check with a PostgreSQL archive requires an effective keep-original (CLI or config); otherwise the default deletion would remove the dump before you iterate on policy.

Examples (e.g. after heroku pg:backups:download):

dumpling -i latest.dump -c .dumplingconf -o anonymized.sql

Dry run while keeping the downloaded file:

dumpling --keep-original --check -i latest.dump -c .dumplingconf

Configuration sources

Configuration can be loaded from:

  1. --config <path> (highest precedence)
  2. .dumplingconf in the current working directory
  3. [tool.dumpling] in pyproject.toml

If no configuration is found, Dumpling fails closed by default and exits non-zero. Error output includes every checked location. If you intentionally want a no-op run, pass --allow-noop.


Dump seal (always on)

Every successful run that writes output prefixes the stream with a single-line SQL comment:

-- dumpling-seal: v=3 version=<semver> profile=<standard|hardened> sha256=<64 hex chars>

The sha256 is over canonical JSON that includes the Dumpling version, the active security profile, a stable encoding of the resolved policy (rules, row filters, column cases, sensitive columns, output scan, global salt), and runtime options that affect transforms: --format and the effective --seed / DUMPLING_SEED value in standard profile (null in hardened, where seeds are ignored).

If the input already begins with a seal line and it matches the current run, Dumpling copies the rest of the file through unchanged. If the line looks like a seal but does not match (stale policy, different flags, or older v=), that line is dropped and the dump is re-processed so you do not end up with two seal lines. --strict-coverage cannot be combined with a matching seal (table definitions are not scanned in passthrough mode). --check writes no output and therefore emits no seal line.
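
The seal mechanics can be pictured with a short sketch. Dumpling's actual canonical JSON encoding is internal, so treat the field names and layout below as placeholders; only the shape (stable serialization, SHA-256, one comment line) mirrors the description above:

```python
import hashlib
import json

def seal_line(version, profile, policy, runtime_opts):
    """Build a seal comment from a stable, sorted-key JSON encoding (illustrative)."""
    canonical = json.dumps(
        {"version": version, "profile": profile, "policy": policy, "runtime": runtime_opts},
        sort_keys=True, separators=(",", ":"),
    )
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"-- dumpling-seal: v=3 version={version} profile={profile} sha256={digest}"
```

Because the encoding is deterministic, the same policy, version, profile, and runtime options always yield the same line, which is what makes the passthrough comparison possible.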

Hardened security profile

For adversarial risk environments — where an internal or external actor may have partial auxiliary data — use --security-profile hardened:

dumpling --security-profile hardened -i dump.sql -o sanitized.sql

What changes in hardened mode

  • Random generation — standard: xorshift64* seeded from system time; hardened: OS CSPRNG (getrandom), non-predictable.
  • hash strategy — standard: SHA-256(salt || input); hardened: HMAC-SHA-256(key=salt, data=input).
  • Deterministic domain byte stream — standard: SHA-256 in CTR mode; hardened: HMAC-SHA-256 in CTR mode.
  • Report security_profile field — standard: "standard"; hardened: "hardened".
  • --seed / DUMPLING_SEED — standard: seeds the PRNG; hardened: ignored (a warning is emitted).
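
For intuition about the standard-profile generator: xorshift64* is a tiny deterministic PRNG whose entire state is one 64-bit word, which is why a time-derived seed makes its output guessable. A minimal Python port of the published algorithm, for illustration only:

```python
MASK = (1 << 64) - 1  # keep state in 64 bits

def xorshift64star(seed):
    """Generate the xorshift64* sequence: fast and deterministic, not cryptographic."""
    x = (seed & MASK) or 1  # state must be non-zero
    while True:
        x ^= x >> 12
        x = (x ^ (x << 25)) & MASK
        x ^= x >> 27
        yield (x * 0x2545F4914F6CDD1D) & MASK
```

Anyone who can guess the seed (for example, a coarse timestamp) can replay the entire stream; the hardened profile's OS CSPRNG removes that property.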

Why this matters

  • Non-predictable output: xorshift64* is seeded from system time, which is guessable. The OS CSPRNG cannot be predicted from timing alone.
  • Proper keyed hashing: SHA-256(key || data) is vulnerable to length-extension attacks and weak as a MAC. HMAC-SHA-256 uses the salt as a genuine cryptographic key, providing provable PRF security.
  • Domain separation: HMAC construction ensures outputs from one salt/key cannot be confused with another.
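
The difference between the two constructions is easy to see with Python's standard library; the key and message below are arbitrary examples:

```python
import hashlib
import hmac

key = b"per-environment-secret"
data = b"alice@example.com"

# Standard-profile shape: SHA-256(salt || input). A keyed hash built by plain
# concatenation is subject to length extension and is not a proper MAC/PRF.
concat_digest = hashlib.sha256(key + data).hexdigest()

# Hardened-profile shape: HMAC-SHA-256(key=salt, data=input), a genuine keyed PRF.
hmac_digest = hmac.new(key, data, hashlib.sha256).hexdigest()
```

Both constructions are deterministic for a fixed key, so referential consistency across tables is preserved in either profile; only the security margin changes.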

Key management guidance

Configure a per-environment secret via an env-backed reference to prevent key leakage:

# .dumplingconf
salt = "${DUMPLING_HMAC_KEY}"

[rules."public.users"]
ssn = { strategy = "hash", as_string = true }
email = { strategy = "email", domain = "users" }

# shell
export DUMPLING_HMAC_KEY="$(openssl rand -base64 32)"
dumpling --security-profile hardened -i dump.sql -o sanitized.sql

Key rotation: Changing DUMPLING_HMAC_KEY will produce entirely different pseudonyms for all salted/domain-mapped columns. If you rely on referential consistency across separately-processed dumps (e.g., snapshots over time), keep the same key or re-anonymize all related dumps together. Rotate keys when:

  • A key may have been compromised.
  • You intentionally want to break prior referential linkability.

Report metadata

The JSON report always includes the active security profile:

{
  "security_profile": "hardened",
  "total_rows_processed": 1000,
  ...
}

Faker strategy and the fake crate

When you use strategy = "faker" with faker = "module::Type", those names align with the Rust fake crate’s faker modules (for example, name::FirstName maps to fake::faker::name::raw::FirstName). Use the upstream fake crate documentation to discover available generators and options.

Dumpling only exposes a subset wired in src/faker_dispatch.rs; unsupported module::Type pairs fail at config load.

Anonymization strategies

Strategy names and per-strategy options (min, scale, as_string, faker, …) are documented in the repository README under Anonymization strategies; each strategy lists only the keys it accepts. The README also covers Choosing a strategy (when to prefer cheap vs realistic transforms), Cross-cutting options (domain, unique_within_domain, and as_string), and row filters, JSON path rules, and conditional column_cases before the full TOML example.

The sections below expand on JSON path rules (same semantics as the README) and secret references in more depth.

Baseline config template

salt = "${DUMPLING_GLOBAL_SALT}"

[rules."public.users"]
email = { strategy = "hash", salt = "${env:DUMPLING_USERS_EMAIL_SALT}", as_string = true }
full_name = { strategy = "faker", faker = "name::Name" }

[sensitive_columns]
"public.users" = ["employee_number", "tax_id"]

[output_scan]
enabled_categories = ["email", "ssn", "pan", "token"]
default_threshold = 0
default_severity = "high"
fail_on_severity = "low"
sample_limit_per_category = 5

[output_scan.thresholds]
email = 0
ssn = 0
pan = 0
token = 0

[output_scan.severities]
email = "medium"
ssn = "high"
pan = "critical"
token = "high"

[row_filters."public.users"]
retain = [
  { column = "country", op = "eq", value = "US" },
  { column = "profile.plan", op = "eq", value = "gold" }
]
delete = [
  { column = "is_admin", op = "eq", value = "true" },
  { column = "devices__platform", op = "eq", value = "android" }
]

Secret references

Dumpling supports secret substitution in string config fields using two providers:

  • ${ENV_VAR} — env provider (implicit) — read from environment variable ENV_VAR.
  • ${env:ENV_VAR} — env provider (explicit) — read from environment variable ENV_VAR.
  • ${file:/path/to/secret} — file provider — read from a file (trailing newlines are stripped).

Example using both providers:

salt = "${DUMPLING_GLOBAL_SALT}"

[rules."public.users"]
ssn   = { strategy = "hash", salt = "${env:DUMPLING_USERS_SSN_SALT}" }
email = { strategy = "hash", salt = "${file:/run/secrets/dumpling_email_salt}" }

Behavior:

  • Missing env references fail fast at startup with a non-zero exit and an actionable error including the config path.
  • Missing or empty file references fail fast with a non-zero exit and an actionable error.
  • Plaintext salt values are accepted for backward compatibility, but Dumpling prints a warning because plaintext secrets are insecure.
  • Unknown providers fail startup with a list of supported providers.
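
The resolution rules above can be sketched as a small resolver. This is an illustration only (Dumpling's real implementation is Rust); resolve_secret and its injectable read_file are hypothetical names:

```python
import os
import re

def resolve_secret(value, env=os.environ, read_file=None):
    """Resolve ${VAR}, ${env:VAR} and ${file:/path} references, failing fast."""
    read_file = read_file or (lambda p: open(p, encoding="utf-8").read())
    m = re.fullmatch(r"\$\{(?:([a-z]+):)?([^}]+)\}", value)
    if m is None:
        return value  # plaintext passes through (Dumpling warns in this case)
    provider, ref = m.group(1) or "env", m.group(2)
    if provider == "env":
        if ref not in env:
            raise ValueError(f"missing environment variable {ref!r}")
        return env[ref]
    if provider == "file":
        secret = read_file(ref).rstrip("\n")  # trailing newlines stripped
        if not secret:
            raise ValueError(f"missing or empty secret file {ref!r}")
        return secret
    raise ValueError(f"unknown provider {provider!r}; supported providers: env, file")
```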

Environment-variable secrets (CI / local dev)

# local development
export DUMPLING_GLOBAL_SALT='local-dev-salt'
export DUMPLING_USERS_EMAIL_SALT='users-email-salt'
dumpling --input dump.sql --check

# CI environment (values injected from your secret manager)
export DUMPLING_GLOBAL_SALT="$CI_DUMPLING_GLOBAL_SALT"
export DUMPLING_USERS_EMAIL_SALT="$CI_DUMPLING_USERS_EMAIL_SALT"
dumpling --input dump.sql --check --strict-coverage --report coverage.json

File-mounted secrets (Docker / Kubernetes)

The file: provider reads the secret value from a file on disk and trims trailing newlines. This is the natural format for Docker Swarm secrets (/run/secrets/<name>), Kubernetes mounted secrets, and HashiCorp Vault Agent injected files.

Docker Swarm — declare a secret and mount it into the service:

# docker-compose.yml
secrets:
  dumpling_hmac_key:
    external: true

services:
  anonymizer:
    image: your-image
    secrets:
      - dumpling_hmac_key
    environment:
      - DUMPLING_CONFIG=/app/.dumplingconf

# .dumplingconf
salt = "${file:/run/secrets/dumpling_hmac_key}"

Kubernetes — mount a Secret as a volume:

# deployment.yaml (excerpt)
volumes:
  - name: dumpling-secrets
    secret:
      secretName: dumpling-keys
volumeMounts:
  - name: dumpling-secrets
    mountPath: /run/secrets
    readOnly: true

# .dumplingconf
salt = "${file:/run/secrets/hmac_key}"

HashiCorp Vault Agent — inject secrets as files using the template stanza:

# vault-agent.hcl (excerpt)
template {
  contents     = "{{ with secret \"secret/dumpling\" }}{{ .Data.data.hmac_key }}{{ end }}"
  destination  = "/run/secrets/dumpling_hmac_key"
}

# .dumplingconf
salt = "${file:/run/secrets/dumpling_hmac_key}"

Row-filter predicates support nested JSON targeting in their column values via either:

  • dot notation (payload.profile.tier)
  • Django-style separators (payload__profile__tier)

When a JSON path traverses an array, Dumpling checks each element (useful for list-of-dicts JSON structures).

JSON path rules (json / jsonb columns)

You can anonymise values inside a text column that holds JSON using the same path syntax as row-filter predicates, but on [rules] keys:

  • Dot notation: "payload.profile.email" = { strategy = "email", domain = "orders_email", as_string = true }
  • Django-style: "payload__profile__email" = { strategy = "hash", salt = "${env:ORDER_SECRET_SALT}", as_string = true }

The part before the first dot or __ is the SQL column name; the rest is the path inside the parsed JSON document. Use quoted keys in TOML when the name contains dots.

  • For a given table, you can use either path-level rules for a column or one whole-column rule for that column’s base name, not both (Dumpling rejects the conflict at startup).
  • If a path is missing in a given row, that rule is skipped for that row.
  • When only path rules apply (no whole-column rule), the rest of the JSON is left unchanged.
  • Path rules are applied in longest-path-first order.
  • column_cases still match the SQL column name only; use when predicates with nested column paths to branch on JSON content.
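
As an illustration of those semantics (Dumpling is Rust; this Python sketch covers only object keys, longest-path-first ordering, and skip-on-missing):

```python
import json

def apply_json_path_rules(cell, rules):
    """Apply {path: transform} rules inside a JSON text cell; untouched parts pass through."""
    doc = json.loads(cell)
    # Longest paths first, so deeper rules run before shallower ones.
    for path, transform in sorted(rules.items(), key=lambda kv: -len(kv[0].split("."))):
        keys = path.split(".")
        node = doc
        for key in keys[:-1]:
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break  # path missing in this row: skip the rule
        if isinstance(node, dict) and keys[-1] in node:
            node[keys[-1]] = transform(node[keys[-1]])
    return json.dumps(doc)
```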

Safety recommendations

  • Prefer deterministic runs in CI by passing --seed (or DUMPLING_SEED).
  • Keep fail-closed behavior enabled in CI/CD; avoid --allow-noop unless a no-op run is explicitly intended.
  • Treat new or changed anonymization rules as code changes and require review.
  • Keep table/column names lowercase in config to avoid case-mismatch surprises.
  • table_options are no longer supported; define explicit rules and optional conditional column_cases instead.
  • Use --strict-coverage --report <file> --check in CI so uncovered sensitive columns fail the build.

Strict sensitive coverage

--strict-coverage enforces explicit policy coverage for sensitive columns.

  • Sensitive columns are detected by:
    1. built-in column-name patterns, and
    2. explicit per-table lists under [sensitive_columns].
  • A sensitive column is considered covered only if it has an explicit rules or column_cases entry (including JSON path rules whose base name is that column, e.g. payload.x.y covers payload).
  • If uncovered sensitive columns are found, Dumpling exits non-zero.
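
The covered/uncovered decision reduces to base-name matching. A sketch of the idea (Python for illustration; uncovered_sensitive is a hypothetical helper):

```python
def uncovered_sensitive(sensitive, rule_keys, case_keys):
    """Sensitive columns with no rule, case, or JSON path rule sharing their base name."""
    def base(name):
        # 'payload.x.y' and 'payload__x__y' both cover the SQL column 'payload'
        return name.split("__", 1)[0].split(".", 1)[0]
    covered = {base(k) for k in rule_keys} | {base(k) for k in case_keys}
    return sorted(c for c in sensitive if c not in covered)
```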

When --report is enabled, coverage fields are added to JSON output:

  • sensitive_columns_detected
  • sensitive_columns_covered
  • sensitive_columns_uncovered

Example CI gate:

dumpling --input dump.sql --check --strict-coverage --report coverage.json

Residual output scanning

Enable output scanning with:

dumpling --input dump.sql --scan-output --report scan_report.json

Add fail gates with:

dumpling --input dump.sql --check --scan-output --fail-on-findings --report scan_report.json

Output scanning inspects transformed output for common sensitive categories:

  • email
  • ssn
  • pan (Luhn-validated card-like numbers)
  • token (common secret/token formats)
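
The pan category relies on the standard Luhn checksum to separate card-like numbers from arbitrary digit runs. A compact reference implementation (the 12-digit minimum below is an illustrative filter, not Dumpling's exact rule):

```python
def luhn_valid(candidate):
    """True when the digit string passes the Luhn (mod 10) checksum."""
    digits = [int(c) for c in candidate if c.isdigit()]
    if len(digits) < 12:  # too short to be card-like (illustrative cutoff)
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            d = d * 2 - 9 if d > 4 else d * 2
        total += d
    return total % 10 == 0
```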

When --report is set, report JSON includes an output_scan object with per-category:

  • category
  • count
  • threshold
  • severity
  • sample_locations (line + column + snippet where available)

CI guardrails and policy linting

Dumpling ships with a built-in policy linter (lint-policy) and a reference GitHub Actions workflow so you can catch anonymization policy regressions in pull requests before they reach production.


dumpling lint-policy

dumpling lint-policy                          # auto-discover config
dumpling lint-policy --config .dumplingconf   # explicit config path
dumpling lint-policy --allow-noop             # treat missing config as empty (no violations)

The command loads your configuration, runs a set of policy checks, prints any violations to stderr, and exits:

  • 0 — no violations found.
  • 1 — one or more violations found.

Checks performed

  • empty-rules-table (warning) — a [rules] entry has no column rules; likely a stale or incomplete config section.
  • empty-column-cases-table (warning) — a [column_cases] entry has no column cases.
  • unsalted-hash (warning) — a hash strategy is used with no salt (neither a per-column salt nor the global salt). Unsalted hashes are reversible via precomputed lookup tables for low-entropy inputs (names, emails, common IDs).
  • inconsistent-domain-strategy (error) — the same domain name is used with two or more different strategies. This breaks referential integrity: a domain shared between incompatible generators (for example faker with different faker targets, or faker vs hash) cannot maintain a single stable mapping.
  • uncovered-sensitive-column (error) — a column listed in [sensitive_columns] has no matching anonymization rule or case; the column will pass through unmodified, making the sensitive declaration misleading.
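
The inconsistent-domain-strategy check boils down to "one domain, one strategy". A sketch of the idea (Python for illustration; the rule dictionaries are a hypothetical shape for the resolved policy):

```python
def inconsistent_domains(rules):
    """Return domains bound to more than one strategy.

    rules: iterable of dicts with an optional 'domain' and a 'strategy' key.
    """
    first_strategy = {}
    conflicting = set()
    for rule in rules:
        domain = rule.get("domain")
        if domain is None:
            continue  # no shared mapping, nothing to conflict with
        if first_strategy.setdefault(domain, rule["strategy"]) != rule["strategy"]:
            conflicting.add(domain)
    return sorted(conflicting)
```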

Minimal (policy lint only)

# .github/workflows/policy-lint.yml
name: Policy Lint

on:
  pull_request:
  push:
    branches: [main]

jobs:
  policy-lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: dtolnay/rust-toolchain@stable
      - uses: Swatinem/rust-cache@v2
      - run: cargo build --release --locked
      - run: ./target/release/dumpling lint-policy

This single job will block merges whenever a policy violation is introduced.

Production-ready: lint + strict coverage + PII scan

Combine lint-policy with Dumpling's other CI gates for defence in depth:

name: Anonymization CI

on:
  pull_request:
  push:
    branches: [main]

jobs:
  policy-lint:
    name: Policy lint
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: dtolnay/rust-toolchain@stable
      - uses: Swatinem/rust-cache@v2
      - run: cargo build --release --locked
      - name: Lint anonymization policy
        run: ./target/release/dumpling lint-policy

  anonymize-and-scan:
    name: Anonymize + residual PII scan
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: dtolnay/rust-toolchain@stable
      - uses: Swatinem/rust-cache@v2
      - run: cargo build --release --locked

      - name: Anonymize with strict coverage enforcement
        run: |
          ./target/release/dumpling \
            --strict-coverage \
            --scan-output \
            --fail-on-findings \
            --report report.json \
            -i dump.sql \
            -o sanitized.sql

      - name: Upload anonymization report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: anonymization-report
          path: report.json

Gating on report diff against a baseline

To detect increases in risk findings across PRs, store a baseline report as a CI artifact on your main branch and compare against it in PRs:

- name: Download baseline report
  uses: dawidd6/action-download-artifact@v6
  with:
    workflow: ci.yml
    branch: main
    name: anonymization-report
    path: baseline/
  continue-on-error: true   # first run has no baseline yet

- name: Fail if findings increased
  run: |
    BASELINE=$(jq '.output_scan.total_findings // 0' baseline/report.json 2>/dev/null || echo 0)
    CURRENT=$(jq '.output_scan.total_findings' report.json)
    echo "Baseline findings: $BASELINE  Current findings: $CURRENT"
    if [ "$CURRENT" -gt "$BASELINE" ]; then
      echo "ERROR: residual PII findings increased from $BASELINE to $CURRENT"
      exit 1
    fi

Tips

  • Run dumpling lint-policy locally before opening a PR to catch violations early: cargo run -- lint-policy.
  • Treat error-severity violations as mandatory fixes; warning-severity violations are advisory but should be reviewed.
  • If you intentionally use hash without a per-column salt (e.g. for non-sensitive low-cardinality fields), set salt = "${ENV_VAR}" at the global level to suppress the unsalted-hash warning across the config.
  • In hardened security profile environments, a global salt is required anyway (--security-profile hardened will error without it), so unsalted-hash warnings become informational.

Release process

This project uses tag-driven releases.

Release policy

  • Versioning follows Semantic Versioning (MAJOR.MINOR.PATCH).
  • Every release is tied to an immutable git tag (vX.Y.Z).
  • The release workflow runs quality checks before publishing artifacts.

Maintainer checklist

  1. Ensure main is green in CI.

  2. Update Cargo.toml and pyproject.toml versions and CHANGELOG.md.

  3. Open and merge a release preparation PR.

  4. Create and push a tag from main:

    git tag -a vX.Y.Z -m "Release vX.Y.Z"
    git push origin vX.Y.Z
    
  5. Verify the Release GitHub Actions workflow passes.

  6. Validate uploaded artifacts and checksums from the GitHub Release page.

  7. Announce the release with upgrade notes and rollback guidance.

Python package publishing (PyPI/TestPyPI)

Dumpling is configured for Python packaging via maturin (pyproject.toml). The Python distribution name is dumpling-cli, while the installed CLI command is still dumpling.

The Publish GitHub Actions workflow (.github/workflows/publish.yml) builds cross-platform wheels and an sdist, then publishes:

  • Automatically to PyPI when a v*.*.* tag is pushed.
  • Manually to PyPI or TestPyPI via workflow_dispatch.

Environment setup

  • Configure a GitHub environment named pypi for trusted publishing.
  • Optionally configure testpypi for manual pre-release validation.

Local dry-run commands

maturin build --release
maturin sdist

Rollback guidance

  • If a release is faulty, create a new patch release (for example v1.2.4) that reverts or fixes the issue.
  • Avoid deleting published tags; treat tags as immutable history.