Dumpling documentation
Dumpling is a streaming anonymizer for plain SQL dumps. It supports PostgreSQL (pg_dump plain format), SQLite (.dump), and SQL Server / MSSQL (SSMS / mssql-scripter plain SQL output). For PostgreSQL custom-format or directory-format archives (e.g. Heroku pg:backups:download), Dumpling auto-detects them when --format is postgres (the default) and invokes pg_restore -f -; see PostgreSQL archives and compressed inputs in the configuration guide.
New here? Start with Getting started: generate a draft policy with scaffold-config, review and add secrets, run Dumpling, then tighten with lint-policy and optional CI flags.
This documentation covers the operating model for day-to-day use:
- how to build and run Dumpling locally,
- how to configure transformation behavior safely,
- how CI validates quality before changes merge,
- and how maintainers produce tagged releases.
Documentation quality gate
The mdBook site is built in CI as follows:
- Pull requests: the Docs (PR) workflow runs mdbook build when docs-related paths change (no deploy).
- main: the Docs workflow builds and deploys to GitHub Pages when docs-related paths change.
This keeps the docs in a continuously deployable state instead of drifting from the codebase.
Getting started
This page is the shortest path from zero to a first successful run. For strategy details, row filters, dump seals, and CI patterns, continue with the configuration guide and the repository README.md.
Prerequisites
- Rust stable toolchain (rustup recommended). The repo includes rust-toolchain.toml (stable + rustfmt + clippy) so CI and local cargo stay aligned.
- cargo on your PATH
Optional: run ./scripts/setup-dev.sh once from the repo root. It installs toolchain components, runs cargo fetch, and installs a pinned mdBook under .tools/ for the same docs build CI uses.
Build
cargo build --release
./target/release/dumpling --help
Python / pip (dumpling-cli)
pip install dumpling-cli
dumpling --help
First anonymization
- Generate a draft policy (recommended). From your project root (or anywhere you keep config):
  dumpling scaffold-config -i dump.sql -o .dumplingconf
  This beta subcommand streams the dump once and writes inferred [rules] from SQL column names (CREATE TABLE, INSERT, and PostgreSQL COPY column lists). Heuristics are English-oriented; the output is a draft only: review and edit every rule, and add a top-level salt (for hashing) and any ${…} secret placeholders before production use. Useful flags:
  - --infer-json-paths: keep up to five sampled rows per table (reservoir) and suggest nested JSON rules as column.path.leaf.
  - --max-json-depth: cap JSON walking depth when using --infer-json-paths (default 24).
  - --format: postgres (default), sqlite, or mssql.
  - --pg-restore-path / --pg-restore-arg: optional pg_restore binary and extra arguments when --input is a PostgreSQL custom-format or directory-format archive (auto-detected with --format postgres); see PostgreSQL archives and compressed inputs.
  Run dumpling scaffold-config --help for the full flag list.
- Or start from the example policy. Copy .dumplingconf.example to .dumplingconf (or merge under [tool.dumpling] in pyproject.toml) and author [rules] by hand. Set environment variables for salt and any ${…} references.
- Align rules with your dump (manual path only). If you skipped scaffold-config, use CREATE TABLE, COPY … (…), and INSERT INTO … (…) lines to name [rules."table"] or [rules."schema.table"] keys. Trim to the tables you care about first.
- Run Dumpling. dumpling -i dump.sql -o sanitized.sql (add -c path if the config is not in the default search path). Use dumpling --check -i dump.sql when you only want to know whether anything would change.
- Tighten the policy. Run dumpling lint-policy on your config. When you are ready for stricter gates, add [sensitive_columns] and use --strict-coverage, --report, and --scan-output as described in the configuration guide and the repository README.md.
PostgreSQL custom-format archives
If your input is a PostgreSQL custom-format file or directory-format folder (not plain SQL), use --format postgres (default): Dumpling auto-detects the archive and runs pg_restore -f - (needs pg_restore from PostgreSQL client tools). Gzip-wrapped plain SQL is streamed without a temp file; ZIP (or gzip wrapping PGDMP) uses a temp extract that is cleaned up afterward. See PostgreSQL archives and compressed inputs in the configuration guide.
Test locally (contributors)
cargo fmt --all -- --check
cargo clippy --all-targets --all-features
cargo test --all-targets --all-features
Configuration guide
Dump format
Use --format to declare the SQL dialect of your input file:
| Value | Description |
|---|---|
| postgres (default) | PostgreSQL pg_dump plain-text format. Supports COPY … FROM stdin blocks, "double-quoted" identifiers, ''-escaped strings. Custom-format (PGDMP) and directory-format (toc.dat) dumps are auto-detected and decoded with pg_restore -f - (requires client tools). Gzip-wrapped plain SQL is decompressed in-process; ZIP (or gzip wrapping PGDMP/nested ZIP) uses a temp file that is removed after the run. By default the input archive is deleted after success; use --keep-original or keep_original in config to retain it. |
| sqlite | SQLite .dump format. Adds INSERT OR REPLACE INTO / INSERT OR IGNORE INTO support. No COPY blocks. |
| mssql | SQL Server / MSSQL plain SQL. Adds [bracket] identifier quoting, N'…' Unicode string literals, and nvarchar(n) / nchar(n) length extraction. No COPY blocks. |
Example:
dumpling --format sqlite -i data.db.sql -o anonymized.sql
dumpling --format mssql -i backup.sql -o anonymized.sql
PostgreSQL archives and compressed inputs
Heroku PGBackups and many pipelines ship pg_dump custom format (-Fc), directory-format dumps, or gzip/ZIP-wrapped files. Dumpling’s SQL engine still expects plain text at the parser; anything else is normalized first.
Custom-format and directory dumps (auto-detected)
With --format postgres (default), Dumpling detects:
- Custom-format files (magic PGDMP at the start of the file), and
- Directory-format folders (a toc.dat beside table blobs),

then runs pg_restore -f - (script to stdout inside the process; no database needed) and pipes the result through the same anonymizer as a normal plain-SQL file. Detection from --input is automatic.
Requirements: PostgreSQL client tools on PATH (pg_restore), or --pg-restore-path.
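As a mental model, the detection described above can be sketched in a few lines of Python. The PGDMP magic and the toc.dat marker come from this guide; the function itself is illustrative, not Dumpling's code:

```python
import os

PGDMP_MAGIC = b"PGDMP"

def looks_like_pg_archive(path: str) -> bool:
    """Heuristic mirroring the documented auto-detection (illustrative)."""
    if os.path.isdir(path):
        # Directory-format dump: a toc.dat file beside the table blobs.
        return os.path.exists(os.path.join(path, "toc.dat"))
    with open(path, "rb") as f:
        # Custom-format dump: the file starts with the PGDMP magic bytes.
        return f.read(5) == PGDMP_MAGIC
```

A plain-SQL file fails both checks and is streamed straight into the parser.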
Extra pg_restore arguments:
- CLI: --pg-restore-arg (repeatable), e.g. --pg-restore-arg=--no-owner --pg-restore-arg=--no-acl
- Config (optional): [pg_restore]; the CLI overrides these when you pass a path or args:
[pg_restore]
path = "/usr/bin/pg_restore"
args = ["--no-owner", "--no-acl"]
Gzip and ZIP wrappers
- Gzip (.gz) whose decompressed payload is plain SQL: decompressed in-process (streamed); no temporary dump file.
- ZIP containing a single dump file (or a single .sql when multiple files exist), gzip wrapping PGDMP, or gzip wrapping an inner ZIP: Dumpling writes under the system temp directory and removes those paths when the run completes (including after errors; cleanup runs on drop).
--in-place is rejected when Dumpling had to materialize a temp file for compression or when the resolved input is a PostgreSQL archive decoded via pg_restore (use --output or stdout).
Keeping inputs and --check
After a fully successful run, Dumpling removes the --input archive path (single file or directory-format folder) by default. To keep it:
- --keep-original, or
- keep_original = true at the top level of .dumplingconf / [tool.dumpling] (merged with CLI; --keep-original cannot be used with --in-place).
--check with a PostgreSQL archive requires an effective keep-original (CLI or config); otherwise the default deletion would remove the dump before you iterate on policy.
Examples (e.g. after heroku pg:backups:download):
dumpling -i latest.dump -c .dumplingconf -o anonymized.sql
Dry run while keeping the downloaded file:
dumpling --keep-original --check -i latest.dump -c .dumplingconf
Configuration sources
Configuration can be loaded from:
- --config <path> (highest precedence)
- .dumplingconf in the current working directory
- [tool.dumpling] in pyproject.toml
If no configuration is found, Dumpling fails closed by default and exits non-zero.
Error output includes every checked location. If you intentionally want a no-op
run, pass --allow-noop.
Dump seal (always on)
Every successful run that writes output prefixes the stream with a single-line SQL comment:
-- dumpling-seal: v=3 version=<semver> profile=<standard|hardened> sha256=<64 hex chars>
The sha256 is over canonical JSON that includes the Dumpling version, the active security profile, a stable encoding of the resolved policy (rules, row filters, column cases, sensitive columns, output scan, global salt), and runtime options that affect transforms: --format and the effective --seed / DUMPLING_SEED value in standard profile (null in hardened, where seeds are ignored).
If the input already begins with a seal line and it matches the current run, Dumpling copies the rest of the file through unchanged. If the line looks like a seal but does not match (stale policy, different flags, or older v=), that line is dropped and the dump is re-processed so you do not end up with two seal lines. --strict-coverage cannot be combined with a matching seal (table definitions are not scanned in passthrough mode). --check writes no output and therefore emits no seal line.
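The fixed shape of the seal line makes it easy to inspect in your own tooling. A hypothetical parser for the documented format (the regex and function names are illustrative, not part of Dumpling):

```python
import re

# Matches the documented single-line seal format.
SEAL_RE = re.compile(
    r"^-- dumpling-seal: v=(?P<v>\d+) version=(?P<version>\S+) "
    r"profile=(?P<profile>standard|hardened) sha256=(?P<sha256>[0-9a-f]{64})$"
)

def parse_seal(line: str):
    """Return the seal fields as a dict, or None if the line is not a seal."""
    m = SEAL_RE.match(line.strip())
    return m.groupdict() if m else None
```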
Hardened security profile
For adversarial risk environments — where an internal or external actor may have partial auxiliary data — use --security-profile hardened:
dumpling --security-profile hardened -i dump.sql -o sanitized.sql
What changes in hardened mode
| Aspect | Standard | Hardened |
|---|---|---|
| Random generation | xorshift64* seeded from system time | OS CSPRNG (getrandom), non-predictable |
| hash strategy | SHA-256(salt || input) | HMAC-SHA-256(key=salt, data=input) |
| Deterministic domain byte stream | SHA-256 CTR-mode | HMAC-SHA-256 CTR-mode |
| Report security_profile field | "standard" | "hardened" |
| --seed / DUMPLING_SEED | Seeds the PRNG | Ignored (warning emitted) |
Why this matters
- Non-predictable output: xorshift64* is seeded from system time, which is guessable. The OS CSPRNG cannot be predicted from timing alone.
- Proper keyed hashing: SHA-256(key || data) is vulnerable to length-extension attacks and weak as a MAC. HMAC-SHA-256 uses the salt as a genuine cryptographic key, providing provable PRF security.
- Domain separation: the HMAC construction ensures outputs from one salt/key cannot be confused with those from another.
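The difference between the two constructions can be seen directly with Python's standard library (the salt and value here are purely illustrative):

```python
import hashlib
import hmac

salt = b"per-environment-secret"
value = b"alice@example.com"

# Standard profile: a plain hash over salt || input.
standard = hashlib.sha256(salt + value).hexdigest()

# Hardened profile: HMAC-SHA-256 with the salt as a real cryptographic key.
hardened = hmac.new(salt, value, hashlib.sha256).hexdigest()

# Same input, same salt, but the two constructions disagree:
# only HMAC treats the salt as a key with PRF guarantees.
```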
Key management guidance
Configure a per-environment secret via an env-backed reference to prevent key leakage:
# .dumplingconf
salt = "${DUMPLING_HMAC_KEY}"
[rules."public.users"]
ssn = { strategy = "hash", as_string = true }
email = { strategy = "email", domain = "users" }
export DUMPLING_HMAC_KEY="$(openssl rand -base64 32)"
dumpling --security-profile hardened -i dump.sql -o sanitized.sql
Key rotation: Changing DUMPLING_HMAC_KEY will produce entirely different pseudonyms for all salted/domain-mapped columns. If you rely on referential consistency across separately-processed dumps (e.g., snapshots over time), keep the same key or re-anonymize all related dumps together. Rotate keys when:
- A key may have been compromised.
- You intentionally want to break prior referential linkability.
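The rotation trade-off above can be shown with a toy HMAC-based pseudonym function (pseudonym() is an illustration, not Dumpling's implementation):

```python
import hashlib
import hmac

def pseudonym(key: bytes, value: bytes) -> str:
    """Toy stand-in for a keyed, deterministic pseudonym."""
    return hmac.new(key, value, hashlib.sha256).hexdigest()[:16]

old_key, new_key = b"key-2023", b"key-2024"

# Same key: stable pseudonym, so referential consistency holds across dumps.
assert pseudonym(old_key, b"user-42") == pseudonym(old_key, b"user-42")

# Rotated key: an entirely different pseudonym; prior linkability is broken.
assert pseudonym(old_key, b"user-42") != pseudonym(new_key, b"user-42")
```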
Report metadata
The JSON report always includes the active security profile:
{
"security_profile": "hardened",
"total_rows_processed": 1000,
...
}
Faker strategy and the fake crate
When you use strategy = "faker" with faker = "module::Type", those names align with the Rust fake crate’s faker modules (for example name::FirstName ↔ fake::faker::name::raw::FirstName). Use the upstream docs to discover available generators and options:
- docs.rs: fake (crate overview)
- docs.rs: fake::faker (all faker submodules)
- GitHub: cksac/fake-rs (source + README)
Dumpling only exposes a subset wired in src/faker_dispatch.rs; unsupported module::Type pairs fail at config load.
Anonymization strategies
Strategy names and per-strategy options (min, scale, as_string, faker, …) are documented in the repository README under Anonymization strategies; each strategy lists only the keys it accepts. The README also covers Choosing a strategy (when to prefer cheap vs realistic transforms), Cross-cutting options (domain, unique_within_domain, and as_string), and row filters, JSON path rules, and conditional column_cases, all before the full TOML example.
The sections below expand on JSON path rules (same semantics as the README) and secret references in more depth.
Baseline config template
salt = "${DUMPLING_GLOBAL_SALT}"
[rules."public.users"]
email = { strategy = "hash", salt = "${env:DUMPLING_USERS_EMAIL_SALT}", as_string = true }
full_name = { strategy = "faker", faker = "name::Name" }
[sensitive_columns]
"public.users" = ["employee_number", "tax_id"]
[output_scan]
enabled_categories = ["email", "ssn", "pan", "token"]
default_threshold = 0
default_severity = "high"
fail_on_severity = "low"
sample_limit_per_category = 5
[output_scan.thresholds]
email = 0
ssn = 0
pan = 0
token = 0
[output_scan.severities]
email = "medium"
ssn = "high"
pan = "critical"
token = "high"
[row_filters."public.users"]
retain = [
{ column = "country", op = "eq", value = "US" },
{ column = "profile.plan", op = "eq", value = "gold" }
]
delete = [
{ column = "is_admin", op = "eq", value = "true" },
{ column = "devices__platform", op = "eq", value = "android" }
]
Secret references
Dumpling supports secret substitution in string config fields using two providers:
| Syntax | Provider | Description |
|---|---|---|
| ${ENV_VAR} | env (implicit) | Read from environment variable ENV_VAR |
| ${env:ENV_VAR} | env (explicit) | Read from environment variable ENV_VAR |
| ${file:/path/to/secret} | file | Read from a file (trailing newlines are stripped) |
Example using both providers:
salt = "${DUMPLING_GLOBAL_SALT}"
[rules."public.users"]
ssn = { strategy = "hash", salt = "${env:DUMPLING_USERS_SSN_SALT}" }
email = { strategy = "hash", salt = "${file:/run/secrets/dumpling_email_salt}" }
Behavior:
- Missing env references fail fast at startup with a non-zero exit and an actionable error including the config path.
- Missing or empty file references fail fast with a non-zero exit and an actionable error.
- Plaintext salt values are accepted for backward compatibility, but Dumpling prints a warning because plaintext secrets are insecure.
- Unknown providers fail startup with a list of supported providers.
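To make the resolution rules concrete, here is a simplified Python sketch of the provider syntax. Dumpling's real parser is in Rust and also rejects unknown providers, which this sketch glosses over:

```python
import os
import re

# Bare ${NAME}, ${env:NAME}, or ${file:/path}; anything else is plaintext.
REF = re.compile(r"^\$\{(?:(env|file):)?([^}]+)\}$")

def resolve_secret(raw: str) -> str:
    """Illustrative resolver for the documented ${...} secret syntax."""
    m = REF.match(raw)
    if m is None:
        return raw  # plaintext value: accepted, but Dumpling warns
    provider = m.group(1) or "env"  # bare ${NAME} means the env provider
    name = m.group(2)
    if provider == "env":
        value = os.environ.get(name)
        if value is None:
            raise ValueError(f"missing environment variable: {name}")
        return value
    with open(name, "r") as fh:
        value = fh.read().rstrip("\n")  # trailing newlines are stripped
    if not value:
        raise ValueError(f"empty secret file: {name}")
    return value
```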
Environment-variable secrets (CI / local dev)
# local development
export DUMPLING_GLOBAL_SALT='local-dev-salt'
export DUMPLING_USERS_EMAIL_SALT='users-email-salt'
dumpling --input dump.sql --check
# CI environment (values injected from your secret manager)
export DUMPLING_GLOBAL_SALT="$CI_DUMPLING_GLOBAL_SALT"
export DUMPLING_USERS_EMAIL_SALT="$CI_DUMPLING_USERS_EMAIL_SALT"
dumpling --input dump.sql --check --strict-coverage --report coverage.json
File-mounted secrets (Docker / Kubernetes)
The file: provider reads the secret value from a file on disk and trims trailing
newlines. This is the natural format for Docker Swarm secrets
(/run/secrets/<name>), Kubernetes mounted secrets, and HashiCorp Vault Agent
injected files.
Docker Swarm — declare a secret and mount it into the service:
# docker-compose.yml
secrets:
dumpling_hmac_key:
external: true
services:
anonymizer:
image: your-image
secrets:
- dumpling_hmac_key
environment:
- DUMPLING_CONFIG=/app/.dumplingconf
# .dumplingconf
salt = "${file:/run/secrets/dumpling_hmac_key}"
Kubernetes — mount a Secret as a volume:
# deployment.yaml (excerpt)
volumes:
- name: dumpling-secrets
secret:
secretName: dumpling-keys
volumeMounts:
- name: dumpling-secrets
mountPath: /run/secrets
readOnly: true
# .dumplingconf
salt = "${file:/run/secrets/hmac_key}"
HashiCorp Vault Agent — inject secrets as files using the template stanza:
# vault-agent.hcl (excerpt)
template {
contents = "{{ with secret \"secret/dumpling\" }}{{ .Data.data.hmac_key }}{{ end }}"
destination = "/run/secrets/dumpling_hmac_key"
}
# .dumplingconf
salt = "${file:/run/secrets/dumpling_hmac_key}"
Nested JSON targeting is supported in predicate column values via either:
- dot notation (payload.profile.tier)
- Django-style separators (payload__profile__tier)
When a JSON path traverses an array, Dumpling checks each element (useful for list-of-dicts JSON structures).
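A minimal sketch of these path semantics, assuming the behavior described above (the function names are illustrative):

```python
def split_path(path: str):
    # "payload.profile.tier" and "payload__profile__tier" are equivalent.
    return path.split("__") if "__" in path else path.split(".")

def path_matches(doc, keys, expected) -> bool:
    """True if any value reachable via `keys` equals `expected`.

    When the path crosses an array, every element is checked (fan-out),
    which is what makes list-of-dicts JSON structures work.
    """
    if not keys:
        return doc == expected
    if isinstance(doc, list):
        return any(path_matches(el, keys, expected) for el in doc)
    if isinstance(doc, dict) and keys[0] in doc:
        return path_matches(doc[keys[0]], keys[1:], expected)
    return False  # missing key: the predicate simply does not match
```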
JSON path rules (json / jsonb columns)
You can anonymize values inside a text column that holds JSON using the same path syntax as row-filter predicates, but on [rules] keys:
- Dot notation: "payload.profile.email" = { strategy = "email", domain = "orders_email", as_string = true }
- Django-style: "payload__profile__email" = { strategy = "hash", salt = "${env:ORDER_SECRET_SALT}", as_string = true }
The part before the first dot or __ is the SQL column name; the rest is the path inside the parsed JSON document. Use quoted keys in TOML when the name contains dots. For a given table, you can use either path-level rules for a column or one whole-column rule for that column’s base name, not both (Dumpling rejects the conflict at startup). If a path is missing in a given row, that rule is skipped for that row. When only path rules apply (no whole-column rule), the rest of the JSON is left unchanged. Path rules are applied in longest-path-first order. column_cases still match the SQL column name only; use when predicates with nested column paths to branch on JSON content.
Safety recommendations
- Prefer deterministic runs in CI by passing --seed (or DUMPLING_SEED).
- Keep fail-closed behavior enabled in CI/CD; avoid --allow-noop unless a no-op run is explicitly intended.
- Treat new or changed anonymization rules as code changes and require review.
- Keep table/column names lowercase in config to avoid case-mismatch surprises.
- table_options are no longer supported; define explicit rules and optional conditional column_cases instead.
- Use --strict-coverage --report <file> --check in CI so uncovered sensitive columns fail the build.
Strict sensitive coverage
--strict-coverage enforces explicit policy coverage for sensitive columns.
- Sensitive columns are detected by:
  - built-in column-name patterns, and
  - explicit per-table lists under [sensitive_columns].
- A sensitive column is considered covered only if it has an explicit rules or column_cases entry (including JSON path rules whose base name is that column, e.g. payload.x.y covers payload).
- If uncovered sensitive columns are found, Dumpling exits non-zero.
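The base-name coverage rule can be sketched as follows (the helper names are illustrative, not Dumpling's internals):

```python
def base_column(rule_key: str) -> str:
    # "payload.x.y" and "payload__x__y" both cover the SQL column "payload".
    return rule_key.split("__")[0].split(".")[0]

def uncovered(sensitive_columns, rule_keys):
    """Sensitive columns with no rule whose base name matches them."""
    covered = {base_column(k) for k in rule_keys}
    return [c for c in sensitive_columns if c not in covered]
```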
When --report is enabled, coverage fields are added to JSON output:
- sensitive_columns_detected
- sensitive_columns_covered
- sensitive_columns_uncovered
Example CI gate:
dumpling --input dump.sql --check --strict-coverage --report coverage.json
Residual output scanning
Enable output scanning with:
dumpling --input dump.sql --scan-output --report scan_report.json
Add fail gates with:
dumpling --input dump.sql --check --scan-output --fail-on-findings --report scan_report.json
Output scanning inspects transformed output for common sensitive categories:
- email
- ssn
- pan (Luhn-validated card-like numbers)
- token (common secret/token formats)
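For reference, the Luhn checksum that gates the pan category works like this. This is a minimal sketch; Dumpling's scanner also applies card-number shape checks not shown here, and the 12-digit minimum is an assumption of this example:

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 12:  # assumed minimum length for card-like numbers
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:      # double every second digit from the right
            d *= 2
            if d > 9:       # and fold back into a single digit
                d -= 9
        total += d
    return total % 10 == 0
```

This is why random digit strings rarely trigger the pan category: only about one in ten passes the checksum.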
When --report is set, report JSON includes an output_scan object with per-category:
- category
- count
- threshold
- severity
- sample_locations (line + column + snippet where available)
CI guardrails and policy linting
Dumpling ships with a built-in policy linter (lint-policy) and a reference
GitHub Actions workflow so you can catch anonymization policy regressions in
pull requests before they reach production.
dumpling lint-policy
dumpling lint-policy # auto-discover config
dumpling lint-policy --config .dumplingconf # explicit config path
dumpling lint-policy --allow-noop # treat missing config as empty (no violations)
The command loads your configuration, runs a set of policy checks, prints any violations to stderr, and exits:
| Exit code | Meaning |
|---|---|
| 0 | No violations found |
| 1 | One or more violations found |
Checks performed
| Code | Severity | Description |
|---|---|---|
| empty-rules-table | warning | A [rules] entry has no column rules. Likely a stale or incomplete config section. |
| empty-column-cases-table | warning | A [column_cases] entry has no column cases. |
| unsalted-hash | warning | A hash strategy is used with no salt (neither per-column salt nor global salt). Unsalted hashes are reversible via precomputed lookup tables for low-entropy inputs (names, emails, common IDs). |
| inconsistent-domain-strategy | error | The same domain name is used with two or more different strategies. This breaks referential integrity: a domain shared between incompatible generators (for example faker with different faker targets, or faker vs hash) cannot maintain a single stable mapping. |
| uncovered-sensitive-column | error | A column listed in [sensitive_columns] has no matching anonymization rule or case. The column will pass through unmodified, making the sensitive declaration misleading. |
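The inconsistent-domain-strategy check boils down to grouping strategies by domain, roughly like this (an illustrative sketch, not the linter's actual code):

```python
from collections import defaultdict

def inconsistent_domains(rules):
    """rules: iterable of (domain, strategy) pairs gathered from the policy.

    Returns the set of domains bound to more than one strategy, which is
    exactly the condition the linter flags as an error.
    """
    strategies = defaultdict(set)
    for domain, strategy in rules:
        strategies[domain].add(strategy)
    return {d for d, s in strategies.items() if len(s) > 1}
```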
Recommended CI setup
Minimal (policy lint only)
# .github/workflows/policy-lint.yml
name: Policy Lint
on:
pull_request:
push:
branches: [main]
jobs:
policy-lint:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: dtolnay/rust-toolchain@stable
- uses: Swatinem/rust-cache@v2
- run: cargo build --release --locked
- run: ./target/release/dumpling lint-policy
This single job will block merges whenever a policy violation is introduced.
Production-ready: lint + strict coverage + PII scan
Combine lint-policy with Dumpling's other CI gates for defence in depth:
name: Anonymization CI
on:
pull_request:
push:
branches: [main]
jobs:
policy-lint:
name: Policy lint
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: dtolnay/rust-toolchain@stable
- uses: Swatinem/rust-cache@v2
- run: cargo build --release --locked
- name: Lint anonymization policy
run: ./target/release/dumpling lint-policy
anonymize-and-scan:
name: Anonymize + residual PII scan
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: dtolnay/rust-toolchain@stable
- uses: Swatinem/rust-cache@v2
- run: cargo build --release --locked
- name: Anonymize with strict coverage enforcement
run: |
./target/release/dumpling \
--strict-coverage \
--scan-output \
--fail-on-findings \
--report report.json \
-i dump.sql \
-o sanitized.sql
- name: Upload anonymization report
if: always()
uses: actions/upload-artifact@v4
with:
name: anonymization-report
path: report.json
Gating on report diff against a baseline
To detect increases in risk findings across PRs, store a baseline report as a CI artifact on your main branch and compare against it in PRs:
- name: Download baseline report
uses: dawidd6/action-download-artifact@v6
with:
workflow: ci.yml
branch: main
name: anonymization-report
path: baseline/
continue-on-error: true # first run has no baseline yet
- name: Fail if findings increased
run: |
BASELINE=$(jq '.output_scan.total_findings // 0' baseline/report.json 2>/dev/null || echo 0)
CURRENT=$(jq '.output_scan.total_findings' report.json)
echo "Baseline findings: $BASELINE Current findings: $CURRENT"
if [ "$CURRENT" -gt "$BASELINE" ]; then
echo "ERROR: residual PII findings increased from $BASELINE to $CURRENT"
exit 1
fi
Tips
- Run dumpling lint-policy locally before opening a PR to catch violations early: cargo run -- lint-policy.
- Treat error-severity violations as mandatory fixes; warning-severity violations are advisory but should be reviewed.
- If you intentionally use hash without a salt (e.g. for non-sensitive low-cardinality fields), add a salt = "${ENV_VAR}" at the global level to suppress the unsalted-hash warning globally.
- In hardened security profile environments, a global salt is required anyway (--security-profile hardened errors without it), so unsalted-hash warnings become informational.
Release process
This project uses tag-driven releases.
Release policy
- Versioning follows Semantic Versioning (MAJOR.MINOR.PATCH).
- Every release is tied to an immutable git tag (vX.Y.Z).
- The release workflow runs quality checks before publishing artifacts.
Maintainer checklist
1. Ensure main is green in CI.
2. Update Cargo.toml and pyproject.toml versions and CHANGELOG.md.
3. Open and merge a release preparation PR.
4. Create and push a tag from main:
   git tag -a vX.Y.Z -m "Release vX.Y.Z"
   git push origin vX.Y.Z
5. Verify the Release GitHub Actions workflow passes.
6. Validate uploaded artifacts and checksums from the GitHub Release page.
7. Announce the release with upgrade notes and rollback guidance.
Python package publishing (PyPI/TestPyPI)
Dumpling is configured for Python packaging via maturin (pyproject.toml).
The Python distribution name is dumpling-cli, while the installed CLI command
is still dumpling.
The Publish GitHub Actions workflow (.github/workflows/publish.yml) builds
cross-platform wheels and an sdist, then publishes:
- Automatically to PyPI when a v*.*.* tag is pushed.
- Manually to PyPI or TestPyPI via workflow_dispatch.
Environment setup
- Configure a GitHub environment named pypi for trusted publishing.
- Optionally configure testpypi for manual pre-release validation.
Local dry-run commands
maturin build --release
maturin sdist
Rollback guidance
- If a release is faulty, create a new patch release (for example v1.2.4) that reverts or fixes the issue.
- Avoid deleting published tags; treat tags as immutable history.