Dumpling documentation
Dumpling is a streaming anonymizer for plain SQL dumps. It supports PostgreSQL (pg_dump plain format), SQLite (.dump), and SQL Server / MSSQL (SSMS / mssql-scripter plain SQL output). For PostgreSQL custom-format or directory-format archives (e.g. Heroku pg:backups:download), Dumpling auto-detects them when --format is postgres (the default) and invokes pg_restore -f -; see PostgreSQL archives and compressed inputs in the configuration guide.
New here? Start with Getting started: generate a draft policy with scaffold-config, review and add secrets, run Dumpling, then tighten with lint-policy and optional CI flags.
This documentation covers the operating model for day-to-day use:
- how to build and run Dumpling locally,
- how to configure transformation behavior safely,
- how CI validates quality before changes merge,
- and how maintainers produce tagged releases.
Documentation quality gate
The mdBook site is built in CI as follows:
- Pull requests: the Docs (PR) workflow runs mdbook build when docs-related paths change (no deploy).
- main: the Docs workflow builds and deploys to GitHub Pages when docs-related paths change.
This keeps the docs in a continuously deployable state instead of drifting from the codebase.
Getting started
This page is the shortest path from zero to a first successful run. For strategy details, row filters, dump seals, and CI patterns, continue with the configuration guide and the repository README.md.
Prerequisites
- Rust stable toolchain (rustup recommended). The repo includes rust-toolchain.toml (stable + rustfmt + clippy) so CI and local cargo stay aligned.
- cargo on your PATH
Optional: run ./scripts/setup-dev.sh once from the repo root. It installs toolchain components, runs cargo fetch, and installs a pinned mdBook under .tools/ for the same docs build CI uses.
Build
cargo build --release
./target/release/dumpling --help
Python / pip (dumpling-cli)
pip install dumpling-cli
dumpling --help
First anonymization
- Generate a draft policy (recommended). From your project root (or anywhere you keep config):
  dumpling scaffold-config -i dump.sql -o .dumplingconf
  This beta subcommand streams the dump once and writes inferred [rules] from SQL column names (CREATE TABLE, INSERT, and PostgreSQL COPY column lists). Heuristics are English-oriented; the output is a draft only: review and edit every rule, and add a top-level salt (for hashing) and any ${…} secret placeholders before production use. Useful flags:
  - --infer-json-paths: keep up to five sampled rows per table (reservoir) and suggest nested JSON rules as column.path.leaf.
  - --max-json-depth: cap JSON walking depth when using --infer-json-paths (default 24).
  - --format: postgres (default), sqlite, or mssql.
  - --pg-restore-path / --pg-restore-arg: optional pg_restore binary and extra arguments when --input is a PostgreSQL custom-format or directory-format archive (auto-detected with --format postgres); see PostgreSQL archives and compressed inputs.
  Run dumpling scaffold-config --help for the full flag list.
- Or start from the example policy. Copy .dumplingconf.example to .dumplingconf (or merge under [tool.dumpling] in pyproject.toml) and author [rules] by hand. Set environment variables for salt and any ${…} references.
- Align rules with your dump (manual path only). If you skipped scaffold-config, use CREATE TABLE, COPY … (…), and INSERT INTO … (…) lines to name [rules."table"] or [rules."schema.table"] keys. Trim to the tables you care about first.
- Run Dumpling. dumpling -i dump.sql -o sanitized.sql (add -c path if the config is not in the default search path). Use dumpling --check -i dump.sql when you only want to know whether anything would change.
- Tighten the policy. Run dumpling lint-policy on your config. When you are ready for stricter gates, add [sensitive_columns] and use --strict-coverage, --report, and --scan-output as described in the configuration guide and the repository README.md.
PostgreSQL custom-format archives
If your input is a PostgreSQL custom-format file or directory-format folder (not plain SQL), use --format postgres (default): Dumpling auto-detects the archive and runs pg_restore -f - (needs pg_restore from PostgreSQL client tools). Gzip-wrapped plain SQL is streamed without a temp file; ZIP (or gzip wrapping PGDMP) uses a temp extract that is cleaned up afterward. See PostgreSQL archives and compressed inputs in the configuration guide.
Test locally (contributors)
cargo fmt --all -- --check
cargo clippy --all-targets --all-features
cargo test --all-targets --all-features
Configuration guide
Dump format
Use --format to declare the SQL dialect of your input file:
| Value | Description |
|---|---|
| postgres (default) | PostgreSQL pg_dump plain-text format. Supports COPY … FROM stdin blocks, "double-quoted" identifiers, ''-escaped strings. Custom-format (PGDMP) and directory-format (toc.dat) dumps are auto-detected and decoded with pg_restore -f - (requires client tools). Gzip-wrapped plain SQL is decompressed in-process; ZIP (or gzip wrapping PGDMP/nested ZIP) uses a temp file that is removed after the run. By default the input archive is deleted after success; use --keep-original or keep_original in config to retain it. |
| sqlite | SQLite .dump format. Adds INSERT OR REPLACE INTO / INSERT OR IGNORE INTO support. No COPY blocks. |
| mssql | SQL Server / MSSQL plain SQL. Adds [bracket] identifier quoting, N'…' Unicode string literals, and nvarchar(n) / nchar(n) length extraction. No COPY blocks. |
Example:
dumpling --format sqlite -i data.db.sql -o anonymized.sql
dumpling --format mssql -i backup.sql -o anonymized.sql
PostgreSQL archives and compressed inputs
Heroku PGBackups and many pipelines ship pg_dump custom format (-Fc), directory-format dumps, or gzip/ZIP-wrapped files. Dumpling’s SQL engine still expects plain text at the parser; anything else is normalized first.
Custom-format and directory dumps (auto-detected)
With --format postgres (default), Dumpling detects:
- Custom-format files (magic PGDMP at the start of the file), and
- Directory-format folders (a toc.dat beside table blobs),

then runs pg_restore -f - (script to stdout inside the process; no database needed) and pipes the result through the same anonymizer as a normal plain-SQL file. Detection from --input is automatic.
Requirements: PostgreSQL client tools on PATH (pg_restore), or --pg-restore-path.
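As a mental model, the detection described above can be sketched in a few lines of Python. The PGDMP magic and the toc.dat marker come from this guide; the function itself is illustrative, not Dumpling's code:

```python
import os

PGDMP_MAGIC = b"PGDMP"

def looks_like_pg_archive(path: str) -> bool:
    """Heuristic mirroring the documented auto-detection (illustrative)."""
    if os.path.isdir(path):
        # Directory-format dump: a toc.dat file beside the table blobs.
        return os.path.exists(os.path.join(path, "toc.dat"))
    with open(path, "rb") as f:
        # Custom-format dump: the file starts with the PGDMP magic bytes.
        return f.read(5) == PGDMP_MAGIC
```

A plain-SQL file fails both checks and is streamed straight into the parser.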
Extra pg_restore arguments:
- CLI: --pg-restore-arg (repeatable), e.g. --pg-restore-arg=--no-owner --pg-restore-arg=--no-acl
- Config (optional): [pg_restore]; the CLI overrides these when you pass a path or args:
[pg_restore]
path = "/usr/bin/pg_restore"
args = ["--no-owner", "--no-acl"]
Gzip and ZIP wrappers
- Gzip (.gz) whose decompressed payload is plain SQL: decompressed in-process (streamed); no temporary dump file.
- ZIP containing a single dump file (or a single .sql when multiple files exist), gzip wrapping PGDMP, or gzip wrapping an inner ZIP: Dumpling writes under the system temp directory and removes those paths when the run completes (including after errors; cleanup runs on drop).
--in-place is rejected when Dumpling had to materialize a temp file for compression or when the resolved input is a PostgreSQL archive decoded via pg_restore (use --output or stdout).
Keeping inputs and --check
After a fully successful run, Dumpling removes the --input archive path (single file or directory-format folder) by default. To keep it:
- --keep-original, or
- keep_original = true at the top level of .dumplingconf / [tool.dumpling] (merged with CLI; --keep-original cannot be used with --in-place).
--check with a PostgreSQL archive requires an effective keep-original (CLI or config); otherwise the default deletion would remove the dump before you iterate on policy.
Examples (e.g. after heroku pg:backups:download):
dumpling -i latest.dump -c .dumplingconf -o anonymized.sql
Dry run while keeping the downloaded file:
dumpling --keep-original --check -i latest.dump -c .dumplingconf
Configuration sources
Configuration can be loaded from:
- --config <path> (highest precedence)
- .dumplingconf in the current working directory
- [tool.dumpling] in pyproject.toml
If no configuration is found, Dumpling fails closed by default and exits non-zero.
Error output includes every checked location. If you intentionally want a no-op
run, pass --allow-noop.
Dump seal (always on)
Every successful run that writes output prefixes the stream with a single-line SQL comment:
-- dumpling-seal: v=3 version=<semver> profile=<standard|hardened> sha256=<64 hex chars>
The sha256 is over canonical JSON that includes the Dumpling version, the active security profile, a stable encoding of the resolved policy (rules, row filters, column cases, sensitive columns, output scan, global salt), and runtime options that affect transforms: --format and the effective --seed / DUMPLING_SEED value in standard profile (null in hardened, where seeds are ignored).
If the input already begins with a seal line and it matches the current run, Dumpling copies the rest of the file through unchanged. If the line looks like a seal but does not match (stale policy, different flags, or older v=), that line is dropped and the dump is re-processed so you do not end up with two seal lines. --strict-coverage cannot be combined with a matching seal (table definitions are not scanned in passthrough mode). --check writes no output and therefore emits no seal line.
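The fixed shape of the seal line makes it easy to inspect in your own tooling. A hypothetical parser for the documented format (the regex and function names are illustrative, not part of Dumpling):

```python
import re

# Matches the documented single-line seal format.
SEAL_RE = re.compile(
    r"^-- dumpling-seal: v=(?P<v>\d+) version=(?P<version>\S+) "
    r"profile=(?P<profile>standard|hardened) sha256=(?P<sha256>[0-9a-f]{64})$"
)

def parse_seal(line: str):
    """Return the seal fields as a dict, or None if the line is not a seal."""
    m = SEAL_RE.match(line.strip())
    return m.groupdict() if m else None
```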
Hardened security profile
For adversarial risk environments — where an internal or external actor may have partial auxiliary data — use --security-profile hardened:
dumpling --security-profile hardened -i dump.sql -o sanitized.sql
What changes in hardened mode
| Aspect | Standard | Hardened |
|---|---|---|
| Random generation | xorshift64* seeded from system time | OS CSPRNG (getrandom), non-predictable |
| hash strategy | SHA-256(salt || input) | HMAC-SHA-256(key=salt, data=input) |
| Deterministic domain byte stream | SHA-256 CTR-mode | HMAC-SHA-256 CTR-mode |
| Report security_profile field | "standard" | "hardened" |
| --seed / DUMPLING_SEED | Seeds the PRNG | Ignored (warning emitted) |
Why this matters
- Non-predictable output: xorshift64* is seeded from system time, which is guessable. The OS CSPRNG cannot be predicted from timing alone.
- Proper keyed hashing: SHA-256(key || data) is vulnerable to length-extension attacks and weak as a MAC. HMAC-SHA-256 uses the salt as a genuine cryptographic key, providing provable PRF security.
- Domain separation: the HMAC construction ensures outputs from one salt/key cannot be confused with those from another.
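The difference between the two constructions can be seen directly with Python's standard library (the salt and value here are purely illustrative):

```python
import hashlib
import hmac

salt = b"per-environment-secret"
value = b"alice@example.com"

# Standard profile: a plain hash over salt || input.
standard = hashlib.sha256(salt + value).hexdigest()

# Hardened profile: HMAC-SHA-256 with the salt as a real cryptographic key.
hardened = hmac.new(salt, value, hashlib.sha256).hexdigest()

# Same input, same salt, but the two constructions disagree:
# only HMAC treats the salt as a key with PRF guarantees.
```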
Key management guidance
Configure a per-environment secret via an env-backed reference to prevent key leakage:
# .dumplingconf
salt = "${DUMPLING_HMAC_KEY}"
[rules."public.users"]
ssn = { strategy = "hash", as_string = true }
email = { strategy = "email", domain = "users" }
export DUMPLING_HMAC_KEY="$(openssl rand -base64 32)"
dumpling --security-profile hardened -i dump.sql -o sanitized.sql
Key rotation: Changing DUMPLING_HMAC_KEY will produce entirely different pseudonyms for all salted/domain-mapped columns. If you rely on referential consistency across separately-processed dumps (e.g., snapshots over time), keep the same key or re-anonymize all related dumps together. Rotate keys when:
- A key may have been compromised.
- You intentionally want to break prior referential linkability.
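The rotation trade-off above can be shown with a toy HMAC-based pseudonym function (pseudonym() is an illustration, not Dumpling's implementation):

```python
import hashlib
import hmac

def pseudonym(key: bytes, value: bytes) -> str:
    """Toy stand-in for a keyed, deterministic pseudonym."""
    return hmac.new(key, value, hashlib.sha256).hexdigest()[:16]

old_key, new_key = b"key-2023", b"key-2024"

# Same key: stable pseudonym, so referential consistency holds across dumps.
assert pseudonym(old_key, b"user-42") == pseudonym(old_key, b"user-42")

# Rotated key: an entirely different pseudonym; prior linkability is broken.
assert pseudonym(old_key, b"user-42") != pseudonym(new_key, b"user-42")
```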
Report metadata
The JSON report always includes the active security profile:
{
"security_profile": "hardened",
"total_rows_processed": 1000,
...
}
Faker strategy and the fake crate
When you use strategy = "faker" with faker = "module::Type", those names align with the Rust fake crate’s faker modules (for example name::FirstName ↔ fake::faker::name::raw::FirstName). Use the upstream docs to discover available generators and options:
- docs.rs: fake (crate overview)
- docs.rs: fake::faker (all faker submodules)
- GitHub: cksac/fake-rs (source + README)
Dumpling only exposes a subset wired in src/faker_dispatch.rs; unsupported module::Type pairs fail at config load.
Anonymization strategies
Strategy names and per-strategy options (min, scale, as_string, faker, …) are documented in the repository README under Anonymization strategies; each strategy lists only the keys it accepts. The README also covers Choosing a strategy (when to prefer cheap vs realistic transforms), Cross-cutting options (domain, unique_within_domain, and as_string), and row filters, JSON path rules, and conditional column_cases, all before the full TOML example.
The sections below expand on JSON path rules (same semantics as the README) and secret references in more depth.
Baseline config template
salt = "${DUMPLING_GLOBAL_SALT}"
[rules."public.users"]
email = { strategy = "hash", salt = "${env:DUMPLING_USERS_EMAIL_SALT}", as_string = true }
full_name = { strategy = "faker", faker = "name::Name" }
[sensitive_columns]
"public.users" = ["employee_number", "tax_id"]
[output_scan]
enabled_categories = ["email", "ssn", "pan", "token"]
default_threshold = 0
default_severity = "high"
fail_on_severity = "low"
sample_limit_per_category = 5
[output_scan.thresholds]
email = 0
ssn = 0
pan = 0
token = 0
[output_scan.severities]
email = "medium"
ssn = "high"
pan = "critical"
token = "high"
[row_filters."public.users"]
retain = [
{ column = "country", op = "eq", value = "US" },
{ column = "profile.plan", op = "eq", value = "gold" }
]
delete = [
{ column = "is_admin", op = "eq", value = "true" },
{ column = "devices__platform", op = "eq", value = "android" }
]
Secret references
Dumpling supports secret substitution in string config fields using two providers:
| Syntax | Provider | Description |
|---|---|---|
| ${ENV_VAR} | env (implicit) | Read from environment variable ENV_VAR |
| ${env:ENV_VAR} | env (explicit) | Read from environment variable ENV_VAR |
| ${file:/path/to/secret} | file | Read from a file (trailing newlines are stripped) |
Example using both providers:
salt = "${DUMPLING_GLOBAL_SALT}"
[rules."public.users"]
ssn = { strategy = "hash", salt = "${env:DUMPLING_USERS_SSN_SALT}" }
email = { strategy = "hash", salt = "${file:/run/secrets/dumpling_email_salt}" }
Behavior:
- Missing env references fail fast at startup with a non-zero exit and an actionable error including the config path.
- Missing or empty file references fail fast with a non-zero exit and an actionable error.
- Plaintext salt values are accepted for backward compatibility, but Dumpling prints a warning because plaintext secrets are insecure.
- Unknown providers fail startup with a list of supported providers.
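To make the resolution rules concrete, here is a simplified Python sketch of the provider syntax. Dumpling's real parser is in Rust and also rejects unknown providers, which this sketch glosses over:

```python
import os
import re

# Bare ${NAME}, ${env:NAME}, or ${file:/path}; anything else is plaintext.
REF = re.compile(r"^\$\{(?:(env|file):)?([^}]+)\}$")

def resolve_secret(raw: str) -> str:
    """Illustrative resolver for the documented ${...} secret syntax."""
    m = REF.match(raw)
    if m is None:
        return raw  # plaintext value: accepted, but Dumpling warns
    provider = m.group(1) or "env"  # bare ${NAME} means the env provider
    name = m.group(2)
    if provider == "env":
        value = os.environ.get(name)
        if value is None:
            raise ValueError(f"missing environment variable: {name}")
        return value
    with open(name, "r") as fh:
        value = fh.read().rstrip("\n")  # trailing newlines are stripped
    if not value:
        raise ValueError(f"empty secret file: {name}")
    return value
```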
Environment-variable secrets (CI / local dev)
# local development
export DUMPLING_GLOBAL_SALT='local-dev-salt'
export DUMPLING_USERS_EMAIL_SALT='users-email-salt'
dumpling --input dump.sql --check
# CI environment (values injected from your secret manager)
export DUMPLING_GLOBAL_SALT="$CI_DUMPLING_GLOBAL_SALT"
export DUMPLING_USERS_EMAIL_SALT="$CI_DUMPLING_USERS_EMAIL_SALT"
dumpling --input dump.sql --check --strict-coverage --report coverage.json
File-mounted secrets (Docker / Kubernetes)
The file: provider reads the secret value from a file on disk and trims trailing
newlines. This is the natural format for Docker Swarm secrets
(/run/secrets/<name>), Kubernetes mounted secrets, and HashiCorp Vault Agent
injected files.
Docker Swarm — declare a secret and mount it into the service:
# docker-compose.yml
secrets:
dumpling_hmac_key:
external: true
services:
anonymizer:
image: your-image
secrets:
- dumpling_hmac_key
environment:
- DUMPLING_CONFIG=/app/.dumplingconf
# .dumplingconf
salt = "${file:/run/secrets/dumpling_hmac_key}"
Kubernetes — mount a Secret as a volume:
# deployment.yaml (excerpt)
volumes:
- name: dumpling-secrets
secret:
secretName: dumpling-keys
volumeMounts:
- name: dumpling-secrets
mountPath: /run/secrets
readOnly: true
# .dumplingconf
salt = "${file:/run/secrets/hmac_key}"
HashiCorp Vault Agent — inject secrets as files using the template stanza:
# vault-agent.hcl (excerpt)
template {
contents = "{{ with secret \"secret/dumpling\" }}{{ .Data.data.hmac_key }}{{ end }}"
destination = "/run/secrets/dumpling_hmac_key"
}
# .dumplingconf
salt = "${file:/run/secrets/dumpling_hmac_key}"
Nested JSON targeting is supported in predicate column values via either:
- dot notation (payload.profile.tier)
- Django-style separators (payload__profile__tier)
When a JSON path traverses an array, Dumpling checks each element (useful for list-of-dicts JSON structures).
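A minimal sketch of these path semantics, assuming the behavior described above (the function names are illustrative):

```python
def split_path(path: str):
    # "payload.profile.tier" and "payload__profile__tier" are equivalent.
    return path.split("__") if "__" in path else path.split(".")

def path_matches(doc, keys, expected) -> bool:
    """True if any value reachable via `keys` equals `expected`.

    When the path crosses an array, every element is checked (fan-out),
    which is what makes list-of-dicts JSON structures work.
    """
    if not keys:
        return doc == expected
    if isinstance(doc, list):
        return any(path_matches(el, keys, expected) for el in doc)
    if isinstance(doc, dict) and keys[0] in doc:
        return path_matches(doc[keys[0]], keys[1:], expected)
    return False  # missing key: the predicate simply does not match
```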
JSON path rules (json / jsonb columns)
You can anonymize values inside a text column that holds JSON using the same path syntax as row-filter predicates, but on [rules] keys:
- Dot notation: "payload.profile.email" = { strategy = "email", domain = "orders_email", as_string = true }
- Django-style: "payload__profile__email" = { strategy = "hash", salt = "${env:ORDER_SECRET_SALT}", as_string = true }
The part before the first dot or __ is the SQL column name; the rest is the path inside the parsed JSON document. Use quoted keys in TOML when the name contains dots. For a given table, you can use either path-level rules for a column or one whole-column rule for that column’s base name, not both (Dumpling rejects the conflict at startup). If a path is missing in a given row, that rule is skipped for that row. When only path rules apply (no whole-column rule), the rest of the JSON is left unchanged. Path rules are applied in longest-path-first order. column_cases still match the SQL column name only; use when predicates with nested column paths to branch on JSON content.
Safety recommendations
- Prefer deterministic runs in CI by passing --seed (or DUMPLING_SEED).
- Keep fail-closed behavior enabled in CI/CD; avoid --allow-noop unless a no-op run is explicitly intended.
- Treat new or changed anonymization rules as code changes and require review.
- Keep table/column names lowercase in config to avoid case-mismatch surprises.
- table_options are no longer supported; define explicit rules and optional conditional column_cases instead.
- Use --strict-coverage --report <file> --check in CI so uncovered sensitive columns fail the build.
Strict sensitive coverage
--strict-coverage enforces explicit policy coverage for sensitive columns.
- Sensitive columns are detected by:
  - built-in column-name patterns, and
  - explicit per-table lists under [sensitive_columns].
- A sensitive column is considered covered only if it has an explicit rules or column_cases entry (including JSON path rules whose base name is that column, e.g. payload.x.y covers payload).
- If uncovered sensitive columns are found, Dumpling exits non-zero.
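The base-name coverage rule can be sketched as follows (the helper names are illustrative, not Dumpling's internals):

```python
def base_column(rule_key: str) -> str:
    # "payload.x.y" and "payload__x__y" both cover the SQL column "payload".
    return rule_key.split("__")[0].split(".")[0]

def uncovered(sensitive_columns, rule_keys):
    """Sensitive columns with no rule whose base name matches them."""
    covered = {base_column(k) for k in rule_keys}
    return [c for c in sensitive_columns if c not in covered]
```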
When --report is enabled, coverage fields are added to JSON output:
- sensitive_columns_detected
- sensitive_columns_covered
- sensitive_columns_uncovered
Example CI gate:
dumpling --input dump.sql --check --strict-coverage --report coverage.json
Residual output scanning
Enable output scanning with:
dumpling --input dump.sql --scan-output --report scan_report.json
Add fail gates with:
dumpling --input dump.sql --check --scan-output --fail-on-findings --report scan_report.json
Output scanning inspects transformed output for common sensitive categories:
- email
- ssn
- pan (Luhn-validated card-like numbers)
- token (common secret/token formats)
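For reference, the Luhn checksum that gates the pan category works like this. This is a minimal sketch; Dumpling's scanner also applies card-number shape checks not shown here, and the 12-digit minimum is an assumption of this example:

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 12:  # assumed minimum length for card-like numbers
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:      # double every second digit from the right
            d *= 2
            if d > 9:       # and fold back into a single digit
                d -= 9
        total += d
    return total % 10 == 0
```

This is why random digit strings rarely trigger the pan category: only about one in ten passes the checksum.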
When --report is set, report JSON includes an output_scan object with per-category:
- category
- count
- threshold
- severity
- sample_locations (line + column + snippet where available)
CI guardrails and policy linting
Dumpling ships with a built-in policy linter (lint-policy) and a reference
GitHub Actions workflow so you can catch anonymization policy regressions in
pull requests before they reach production.
dumpling lint-policy
dumpling lint-policy # auto-discover config
dumpling lint-policy --config .dumplingconf # explicit config path
dumpling lint-policy --allow-noop # treat missing config as empty (no violations)
The command loads your configuration, runs a set of policy checks, prints any violations to stderr, and exits:
| Exit code | Meaning |
|---|---|
| 0 | No violations found |
| 1 | One or more violations found |
Checks performed
| Code | Severity | Description |
|---|---|---|
| empty-rules-table | warning | A [rules] entry has no column rules. Likely a stale or incomplete config section. |
| empty-column-cases-table | warning | A [column_cases] entry has no column cases. |
| unsalted-hash | warning | A hash strategy is used with no salt (neither per-column salt nor global salt). Unsalted hashes are reversible via precomputed lookup tables for low-entropy inputs (names, emails, common IDs). |
| inconsistent-domain-strategy | error | The same domain name is used with two or more different strategies. This breaks referential integrity: a domain shared between incompatible generators (for example faker with different faker targets, or faker vs hash) cannot maintain a single stable mapping. |
| uncovered-sensitive-column | error | A column listed in [sensitive_columns] has no matching anonymization rule or case. The column will pass through unmodified, making the sensitive declaration misleading. |
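The inconsistent-domain-strategy check boils down to grouping strategies by domain, roughly like this (an illustrative sketch, not the linter's actual code):

```python
from collections import defaultdict

def inconsistent_domains(rules):
    """rules: iterable of (domain, strategy) pairs gathered from the policy.

    Returns the set of domains bound to more than one strategy, which is
    exactly the condition the linter flags as an error.
    """
    strategies = defaultdict(set)
    for domain, strategy in rules:
        strategies[domain].add(strategy)
    return {d for d, s in strategies.items() if len(s) > 1}
```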
Recommended CI setup
Minimal (policy lint only)
# .github/workflows/policy-lint.yml
name: Policy Lint
on:
pull_request:
push:
branches: [main]
jobs:
policy-lint:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: dtolnay/rust-toolchain@stable
- uses: Swatinem/rust-cache@v2
- run: cargo build --release --locked
- run: ./target/release/dumpling lint-policy
This single job will block merges whenever a policy violation is introduced.
Production-ready: lint + strict coverage + PII scan
Combine lint-policy with Dumpling's other CI gates for defence in depth:
name: Anonymization CI
on:
pull_request:
push:
branches: [main]
jobs:
policy-lint:
name: Policy lint
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: dtolnay/rust-toolchain@stable
- uses: Swatinem/rust-cache@v2
- run: cargo build --release --locked
- name: Lint anonymization policy
run: ./target/release/dumpling lint-policy
anonymize-and-scan:
name: Anonymize + residual PII scan
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: dtolnay/rust-toolchain@stable
- uses: Swatinem/rust-cache@v2
- run: cargo build --release --locked
- name: Anonymize with strict coverage enforcement
run: |
./target/release/dumpling \
--strict-coverage \
--scan-output \
--fail-on-findings \
--report report.json \
-i dump.sql \
-o sanitized.sql
- name: Upload anonymization report
if: always()
uses: actions/upload-artifact@v4
with:
name: anonymization-report
path: report.json
Gating on report diff against a baseline
To detect increases in risk findings across PRs, store a baseline report as a CI artifact on your main branch and compare against it in PRs:
- name: Download baseline report
uses: dawidd6/action-download-artifact@v6
with:
workflow: ci.yml
branch: main
name: anonymization-report
path: baseline/
continue-on-error: true # first run has no baseline yet
- name: Fail if findings increased
run: |
BASELINE=$(jq '.output_scan.total_findings // 0' baseline/report.json 2>/dev/null || echo 0)
CURRENT=$(jq '.output_scan.total_findings' report.json)
echo "Baseline findings: $BASELINE Current findings: $CURRENT"
if [ "$CURRENT" -gt "$BASELINE" ]; then
echo "ERROR: residual PII findings increased from $BASELINE to $CURRENT"
exit 1
fi
Tips
- Run dumpling lint-policy locally before opening a PR to catch violations early: cargo run -- lint-policy.
- Treat error-severity violations as mandatory fixes; warning-severity violations are advisory but should be reviewed.
- If you intentionally use hash without a salt (e.g. for non-sensitive low-cardinality fields), add a salt = "${ENV_VAR}" at the global level to suppress the unsalted-hash warning globally.
- In hardened security profile environments, a global salt is required anyway (--security-profile hardened errors without it), so unsalted-hash warnings become informational.
Release process
This project uses tag-driven releases.
Release policy
- Versioning follows Semantic Versioning (MAJOR.MINOR.PATCH).
- Every release is tied to an immutable git tag (vX.Y.Z).
- The release workflow runs quality checks before publishing artifacts.
Maintainer checklist
1. Ensure main is green in CI.
2. Update Cargo.toml and pyproject.toml versions and CHANGELOG.md.
3. Open and merge a release preparation PR.
4. Create and push a tag from main:
   git tag -a vX.Y.Z -m "Release vX.Y.Z"
   git push origin vX.Y.Z
5. Verify the Release GitHub Actions workflow passes.
6. Validate uploaded artifacts and checksums from the GitHub Release page.
7. Announce the release with upgrade notes and rollback guidance.
Python package publishing (PyPI/TestPyPI)
Dumpling is configured for Python packaging via maturin (pyproject.toml).
The Python distribution name is dumpling-cli, while the installed CLI command
is still dumpling.
The Publish GitHub Actions workflow (.github/workflows/publish.yml) builds
cross-platform wheels and an sdist, then publishes:
- Automatically to PyPI when a v*.*.* tag is pushed.
- Manually to PyPI or TestPyPI via workflow_dispatch.
Environment setup
- Configure a GitHub environment named pypi for trusted publishing.
- Optionally configure testpypi for manual pre-release validation.
Local dry-run commands
maturin build --release
maturin sdist
Rollback guidance
- If a release is faulty, create a new patch release (for example v1.2.4) that reverts or fixes the issue.
- Avoid deleting published tags; treat tags as immutable history.