The pull to clone production into staging is understandable. Real volumes, real edge cases, real distributions, all for free. The problem is what comes attached: real names, real emails, real addresses, now sitting in an environment that nobody secures like production.
Why "just copy prod" is a liability
- Everyone with staging access now has your customers' PII. Developers, contractors, the analytics tool you wired into staging "just to test it". That is processing personal data for a purpose the customer never agreed to.
- Staging is a softer target. Weaker auth, debug endpoints, verbose logs, backups in someone's S3 bucket. A breach in staging is still a breach of production-grade personal data.
- It does not actually pass an audit. Under the GDPR, personal data in a non-production environment still needs a lawful basis and the same Article 32 safeguards. "It is only staging" is not a category the regulation recognises.
The fix is to make sure the data in staging is not personal data in the first place. Three ways to get there, from "reuse prod safely" to "never touch prod".
Option A: mask the PII in place
You have a copy of the database and you want the same rows, minus the personal bits. Point SeedBase at the connection and replace the sensitive columns with realistic, invented values:
```shell
# preview first, write nothing
seedbase mask <connection> --table customers --columns email,name,phone --dry-run
# apply it
seedbase mask <connection> --table customers --columns email,name,phone
```
The values change, the keys do not. customers.email becomes a believable fake, but the orders that reference that customer still resolve to the same row, so the database stays join-able and loadable. What is gone is the link back to a real person.
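To make "the values change, the keys do not" concrete, here is a minimal Python sketch of the idea using an in-memory SQLite database. It is an illustration only, not SeedBase's implementation: the two-table schema, the sample rows and the `mask_customers` helper are all assumptions made up for this example.

```python
import secrets
import sqlite3

# Hypothetical two-table schema standing in for a copied production database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        total REAL
    );
    INSERT INTO customers VALUES
        (1, 'Ada Example', 'ada@corp.example'),
        (2, 'Alan Example', 'alan@corp.example');
    INSERT INTO orders VALUES (10, 1, 19.99), (11, 1, 5.00), (12, 2, 42.00);
""")


def mask_customers(conn):
    """Overwrite the PII columns with invented values; never touch the keys."""
    # Materialise the ids first so we are not updating while iterating a cursor.
    ids = [row[0] for row in conn.execute("SELECT id FROM customers ORDER BY id")]
    for customer_id in ids:
        token = secrets.token_hex(4)  # random, so no mapping back to the original
        conn.execute(
            "UPDATE customers SET name = ?, email = ? WHERE id = ?",
            (f"Customer {token}", f"user-{token}@example.test", customer_id),
        )
    conn.commit()


mask_customers(conn)

# The join still resolves: every order finds the same customer row as before.
joined = conn.execute("""
    SELECT o.id, o.customer_id, c.email
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    ORDER BY o.id
""").fetchall()
```

After the update, `joined` still has three rows with the original `customer_id` values, and the two orders of customer 1 share one masked email. Only the personal columns were rewritten.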
Option B: carve a safe subset first
Say production is 400 GB; staging does not need all of it. Take a foreign-key-consistent slice, so every referenced parent comes along with its children, then mask that slice as in Option A:
```shell
seedbase pull subset --from-database <connection> --rows 5000 --out staging.sql
```
You get a small dataset that still behaves like the real one: no dangling references, no orphaned rows, and a size a laptop can load in seconds.
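What "foreign-key-consistent" means in practice can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how SeedBase selects rows: the schema, the generated rows and the `fk_consistent_subset` helper are hypothetical. It takes a handful of child rows and then pulls exactly the parents they reference.

```python
import sqlite3

# Hypothetical source database: 20 customers, 60 orders spread across them.
src = sqlite3.connect(":memory:")
src.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        total REAL
    );
""")
src.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(i, f"Customer {i}") for i in range(1, 21)],
)
src.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(i, (i * 7) % 20 + 1, float(i)) for i in range(1, 61)],
)


def fk_consistent_subset(conn, max_orders):
    """Pick some child rows, then fetch every parent row they reference."""
    orders = conn.execute(
        "SELECT id, customer_id, total FROM orders ORDER BY id LIMIT ?",
        (max_orders,),
    ).fetchall()
    if not orders:
        return [], []
    parent_ids = sorted({customer_id for _, customer_id, _ in orders})
    placeholders = ", ".join("?" for _ in parent_ids)
    customers = conn.execute(
        f"SELECT id, name FROM customers WHERE id IN ({placeholders}) ORDER BY id",
        parent_ids,
    ).fetchall()
    return customers, orders


customers, orders = fk_consistent_subset(src, max_orders=5)

# Sanity check: no order in the slice points at a customer outside the slice.
subset_ids = {customer_id for customer_id, _ in customers}
dangling = [order for order in orders if order[1] not in subset_ids]
```

Walking from children to parents is the design choice that matters here: a plain `LIMIT` on each table independently would leave orders pointing at customers that never made it into the slice. Real schemas have deeper chains, so the same step repeats for each level of reference.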
Option C: never let production leave production
The strongest posture is the one where no real row is ever copied at all. Generate a synthetic dataset straight from your schema, with realistic distributions and every foreign key resolved, and load that into staging instead. Nothing to mask, because nothing real was there to begin with. This is the default we reach for, and masking is the fallback for when you genuinely need the exact shape and volume of production.
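As a rough picture of what generating from a schema involves, the Python sketch below builds parents first and then has children draw their foreign keys only from the ids that were generated. Everything in it is an assumption for illustration: the two tables, the `generate_dataset` name, and the choice of a skewed buyer distribution and log-normal order totals. It is not SeedBase's generator.

```python
import random


def generate_dataset(n_customers, n_orders, seed=0):
    """Generate parents first, then children whose FKs point at real parents."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible

    customers = [
        {"id": i, "name": f"Customer {i}", "email": f"user{i}@example.test"}
        for i in range(1, n_customers + 1)
    ]

    # Skewed weights: a few customers place most of the orders, which is
    # closer to real traffic than a uniform pick.
    weights = [1 / customer["id"] for customer in customers]

    orders = []
    for order_id in range(1, n_orders + 1):
        buyer = rng.choices(customers, weights=weights, k=1)[0]
        orders.append({
            "id": order_id,
            "customer_id": buyer["id"],  # always an id that exists
            "total": round(rng.lognormvariate(3.0, 0.8), 2),
        })
    return customers, orders


customers, orders = generate_dataset(n_customers=50, n_orders=500)
customer_ids = {customer["id"] for customer in customers}
```

Because every `customer_id` is drawn from the generated customers, the foreign keys resolve by construction, and no value in the output was ever derived from a production row.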
Anonymised, not just pseudonymised
This distinction is the whole game under the GDPR. Recital 26 says the regulation does not apply to truly anonymous data: information that can no longer be tied to a person. Pseudonymised data, which can be linked back to a person using additional information such as a key or mapping table, stays fully in scope.
Hashing an email is pseudonymisation: same input gives the same hash, so it still identifies and links. Replacing it with an invented address and keeping no mapping is anonymisation. SeedBase does the second, which is what actually takes the data out of scope, and the synthetic values keep the right format so your validation and tests still pass.
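The difference is easy to see in code. The Python sketch below is illustrative only; the function names `pseudonymise` and `anonymise` and the sample addresses are assumptions made up for the example. It shows why a hash still links to a person (anyone holding a list of candidate emails can re-derive it), while a random replacement with no stored mapping does not.

```python
import hashlib
import secrets


def pseudonymise(email):
    """Deterministic: the same email always yields the same hash."""
    return hashlib.sha256(email.lower().encode("utf-8")).hexdigest()


def anonymise(_email):
    """Ignores the input entirely and keeps no mapping back to it."""
    return f"user-{secrets.token_hex(6)}@example.test"


# A hashed column can be reversed by anyone with a list of candidate emails.
hashed = pseudonymise("ada@corp.example")
candidates = ["alan@corp.example", "ada@corp.example", "grace@corp.example"]
lookup = {pseudonymise(candidate): candidate for candidate in candidates}
recovered = lookup.get(hashed)  # finds the original address again

# A random replacement has no such link: two calls on the same input differ,
# and nothing about the output depends on the original value.
first = anonymise("ada@corp.example")
second = anonymise("ada@corp.example")
```

The replacement still looks like an email address, so format validation keeps passing. Whether a whole row counts as anonymous also depends on what the remaining columns reveal, so this covers the value, not the full assessment.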
FAQ
Is masked production data still personal data under the GDPR?
It depends on whether the result is anonymous or only pseudonymous. Recital 26 puts truly anonymous data outside the scope of the regulation. Replace an email with a realistic but invented one and keep no mapping back, and the value is anonymous. Hash or encrypt it, so the original can still be recovered or re-derived, and it is pseudonymisation, which stays in scope. SeedBase replaces values with synthetic ones rather than encoding the originals.
Does masking break foreign keys?
No. SeedBase masks column values in place and leaves keys and relationships intact, so a masked customers.email changes but the orders that reference that customer still point at the same row. Staging stays loadable and join-able; it just no longer contains real personal data.
Can I avoid touching production entirely?
Yes. Instead of masking a copy, generate a synthetic dataset from your schema. No production row ever leaves production, which is the cleanest posture for staging, demos and CI.
Where is the data processed?
SeedBase is EU-hosted with no third-party trackers, and you can export everything, which matters when the whole point is keeping personal data inside a compliant boundary.
Give staging realistic data, not real people
Mask a connected database in place, carve a foreign-key-consistent subset, or generate synthetic data from your schema. Free tier, no credit card, EU-hosted.
- PII replaced, keys intact
- FK-consistent subsets
- Synthetic from schema
- EU-hosted