How to mask production data for staging (without the GDPR risk)

The pull to clone production into staging is understandable. Real volumes, real edge cases, real distributions, all for free. The problem is what comes attached: real names, real emails, real addresses, now sitting in an environment that nobody secures like production.

Why "just copy prod" is a liability

Everyone with staging access now has your customers' PII. Developers, contractors, the analytics tool you wired into staging "just to test it". That is processing personal data for a purpose the customer never agreed to.
Staging is a softer target. Weaker auth, debug endpoints, verbose logs, backups in someone's S3 bucket. A breach in staging is still a breach of production-grade personal data.
It does not actually pass an audit. Under the GDPR, personal data in a non-production environment still needs a lawful basis under Article 6 and security measures appropriate to the risk under Article 32. "It is only staging" is not a category the regulation recognises.

The fix is to make sure the data in staging is not personal data in the first place. Three ways to get there, from "reuse prod safely" to "never touch prod".

Option A: mask the PII in place

You have a copy of the database and you want the same rows, minus the personal bits. Point SeedBase at the connection and replace the sensitive columns with realistic, invented values:

# preview first, write nothing
seedbase mask <connection> --table customers --columns email,name,phone --dry-run

# apply it
seedbase mask <connection> --table customers --columns email,name,phone

The values change, the keys do not. customers.email becomes a believable fake, but the orders that reference that customer still resolve to the same row, so the database stays join-able and loadable. What is gone is the link back to a real person.
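To make the "values change, keys do not" idea concrete, here is a minimal Python sketch using an in-memory SQLite database. It is not how SeedBase is implemented; the table layout, the helper names (build_demo_db, mask_customers, orders_per_customer) and the example.test addresses are all assumptions made for illustration.

```python
import secrets
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory database with one parent and one child table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customers(id),
            total REAL
        );
        INSERT INTO customers VALUES
            (1, 'Ada Lovelace', 'ada@real.example'),
            (2, 'Alan Turing', 'alan@real.example');
        INSERT INTO orders VALUES (10, 1, 19.99), (11, 1, 5.00), (12, 2, 42.00);
        """
    )
    return conn


def mask_customers(conn: sqlite3.Connection) -> None:
    """Overwrite PII columns with invented values; key columns are never written."""
    ids = [row[0] for row in conn.execute("SELECT id FROM customers")]
    for customer_id in ids:
        token = secrets.token_hex(4)  # random, not derived from the original value
        conn.execute(
            "UPDATE customers SET name = ?, email = ? WHERE id = ?",
            (f"Customer {token}", f"user-{token}@example.test", customer_id),
        )
    conn.commit()


def orders_per_customer(conn: sqlite3.Connection) -> list[tuple[int, int]]:
    """Join orders to customers and count orders per customer id."""
    return conn.execute(
        "SELECT c.id, COUNT(o.id) FROM customers c "
        "JOIN orders o ON o.customer_id = c.id GROUP BY c.id ORDER BY c.id"
    ).fetchall()


conn = build_demo_db()
before = orders_per_customer(conn)
mask_customers(conn)
after = orders_per_customer(conn)
print(before == after)  # the join resolves to the same rows after masking
```

Because only the name and email columns are written, every orders.customer_id still resolves, and no lookup table from old to new values is kept anywhere.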

Option B: carve a safe subset first

Production is 400 GB and staging does not need all of it. Take a foreign-key-consistent slice, so every referenced parent comes along with its children, then mask that:

seedbase pull subset --from-database <connection> --rows 5000 --out staging.sql

You get a small dataset that still behaves like the real one: no dangling references, no orphaned rows, and small enough for a laptop to load in seconds.
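What "foreign-key-consistent" means can be sketched in a few lines of Python. This is an illustration of the idea only, not SeedBase's algorithm: the Customer and Order types, the fk_consistent_subset helper and the sample rows are all hypothetical, and a real schema would need the same walk repeated across every relationship.

```python
from typing import TypedDict


class Customer(TypedDict):
    id: int
    email: str


class Order(TypedDict):
    id: int
    customer_id: int  # foreign key to Customer.id


def fk_consistent_subset(
    customers: list[Customer], orders: list[Order], max_orders: int
) -> tuple[list[Customer], list[Order]]:
    """Pick child rows first, then keep exactly the parents they reference."""
    picked_orders = orders[:max_orders]
    needed_ids = {order["customer_id"] for order in picked_orders}
    picked_customers = [c for c in customers if c["id"] in needed_ids]
    return picked_customers, picked_orders


customers: list[Customer] = [
    {"id": i, "email": f"user{i}@example.test"} for i in range(1, 5)
]
orders: list[Order] = [
    {"id": 100, "customer_id": 3},
    {"id": 101, "customer_id": 1},
    {"id": 102, "customer_id": 3},
    {"id": 103, "customer_id": 2},
]

subset_customers, subset_orders = fk_consistent_subset(customers, orders, max_orders=3)
print([c["id"] for c in subset_customers])  # [1, 3]
```

Selecting children first and deriving the parent set from them is what rules out dangling references: a parent is included only because some selected child points at it.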

Option C: never let production leave production

The strongest posture is the one where no real row is ever copied at all. Generate a synthetic dataset straight from your schema, with realistic distributions and every foreign key resolved, and load that into staging instead. Nothing to mask, because nothing real was there to begin with. This is the default we reach for, and masking is the fallback for when you genuinely need the exact shape and volume of production.
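As a rough picture of schema-driven generation, the Python sketch below builds parent rows first and then child rows whose foreign keys can only point at parents that exist. The two-table schema, the generate_dataset helper and both distributions (skewed order counts, log-normal totals) are assumptions chosen for illustration; they are not taken from SeedBase.

```python
import random


def generate_dataset(n_customers: int, n_orders: int, seed: int = 0):
    """Generate parents first, then children whose foreign keys point at them."""
    rng = random.Random(seed)  # seeded so CI runs are reproducible
    customers = [
        {"id": i, "name": f"Customer {i}", "email": f"user{i}@example.test"}
        for i in range(1, n_customers + 1)
    ]
    customer_ids = [c["id"] for c in customers]
    # Skewed weights: a few customers place most orders (an assumed distribution).
    weights = [1 / rank for rank in range(1, n_customers + 1)]
    orders = [
        {
            "id": j,
            "customer_id": rng.choices(customer_ids, weights=weights, k=1)[0],
            # Log-normal totals: many small orders, a few large ones (assumed).
            "total": round(rng.lognormvariate(3.0, 0.8), 2),
        }
        for j in range(1, n_orders + 1)
    ]
    return customers, orders


customers, orders = generate_dataset(n_customers=50, n_orders=500)
known_ids = {c["id"] for c in customers}
print(all(o["customer_id"] in known_ids for o in orders))  # True
```

No production row is read at any point, so there is nothing to mask afterwards; the trade-off is that the distributions are only as realistic as the assumptions you encode.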

Anonymised, not just pseudonymised

This distinction is the whole game under the GDPR. Recital 26 says the regulation does not apply to truly anonymous data, meaning information that can no longer be tied to a person. Pseudonymised data, which can still be attributed to a person using additional information such as a key or mapping table, stays fully in scope.

Hashing an email is pseudonymisation: same input gives the same hash, so it still identifies and links. Replacing it with an invented address and keeping no mapping is anonymisation. SeedBase does the second, which is what actually takes the data out of scope, and the synthetic values keep the right format so your validation and tests still pass.
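The difference is easy to see in code. The Python sketch below contrasts a deterministic hash with a random replacement that keeps no mapping; the function names and the example.test domain are hypothetical. It illustrates linkability only and is not a legal test for anonymity, which also depends on what other columns remain in the row.

```python
import hashlib
import secrets


def pseudonymise(email: str) -> str:
    """Deterministic: the same email always yields the same token."""
    return hashlib.sha256(email.lower().encode("utf-8")).hexdigest()


def anonymise(email: str) -> str:
    """Ignore the input value entirely and return a random invented address.

    Nothing is stored, so the output cannot be traced back to the input.
    """
    return f"user-{secrets.token_hex(6)}@example.test"


real = "ada@real.example"

# Hashing: two independent runs agree, so records can still be linked,
# and anyone holding a candidate email can confirm a match.
print(pseudonymise(real) == pseudonymise(real))  # True

# Replacement: the result keeps a valid email format but carries no
# information about the original value.
print(anonymise(real).endswith("@example.test"))  # True
```

The replacement keeps a valid email shape, which is why format checks and tests continue to pass against masked data.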

SeedBase is EU-hosted, runs no third-party trackers, and lets you export everything. When the entire exercise is about keeping personal data inside a compliant boundary, shipping it through a US analytics pipeline to get masked rather defeats the point.

Where masking is the wrong tool. If you can get away with synthetic data, do: it sidesteps the question of personal data entirely. Reach for masking only when staging has to mirror production volume and quirks closely enough that generated data would not reproduce the bug you are chasing. Pick the lightest option that still gives you what you came for.

FAQ

Is masked production data still personal data under the GDPR?

It depends on whether the result is anonymous or only pseudonymous. Recital 26 puts truly anonymous data outside the scope of the regulation. Replace an email with a realistic but invented one and keep no mapping back, and the value is anonymous. Hash or encrypt it, so that the original can still be linked or recovered, and it is pseudonymisation, which stays in scope. SeedBase replaces values with synthetic ones rather than encoding the originals.

Does masking break foreign keys?

No. SeedBase masks column values in place and leaves keys and relationships intact, so a masked customers.email changes but the orders that reference that customer still point at the same row. Staging stays loadable and join-able; it just no longer contains real personal data.

Can I avoid touching production entirely?

Yes. Instead of masking a copy, generate a synthetic dataset from your schema. No production row ever leaves production, which is the cleanest posture for staging, demos and CI.

Where is the data processed?

SeedBase is EU-hosted with no third-party trackers, and you can export everything, which matters when the whole point is keeping personal data inside a compliant boundary.

Give staging realistic data, not real people

Mask a connected database in place, carve a foreign-key-consistent subset, or generate synthetic data from your schema. Free tier, no credit card, EU-hosted.

