How to generate test data from your database schema (the complete guide)

Q: What does it mean to generate test data from a schema?

It means deriving the data from your tables, columns and relationships instead of inventing values one at a time. A schema-aware generator reads the foreign keys, types and constraints, then produces a full dataset where children reference parents that exist, inserts come out in dependency order, and the result loads into a real database with constraints enabled.

Q: Which schema formats can I generate from?

A SQL dump or DDL (Postgres, MySQL, others), Django models.py, or a Prisma schema.prisma. SeedBase parses any of these into the same internal model, so the generated data is foreign-key-consistent regardless of which one you start from.

Q: Why not just use Faker or random values?

Faker is excellent for a single value, but it generates each value independently and has no concept of foreign keys, insert order or cross-row consistency. At scale that produces orphaned references, uniform distributions that hide your pagination and N+1 bugs, and a customers.email that does not match customers.name. A loadable relational dataset is a property of the whole schema, not of one cell.

Q: Is the generated data reproducible for CI?

Yes. Generation is seeded, so the same seed produces the same dataset on every run. A CI pipeline gets a stable fixture instead of a fresh random one, which keeps failing tests reproducible.

Q: Can I avoid using real production data?

Yes, and it is the default for staging and demos. Generate synthetic data from your schema so no real row ever leaves production. When you specifically need the shape of real data, mask a copy in place or carve a foreign-key-consistent subset instead.

There are two ways to fill a database for development or tests. You can invent values column by column, or you can derive the whole dataset from the schema you already wrote. The first is quick for three rows and falls apart the moment your data has relationships. This guide is about the second.

What "generate from a schema" actually means

A schema is a contract. It says orders.customer_id must point at a real customers.id, that email is unique and not null, that an order has many line items. Generating from the schema means a tool reads that contract and produces data that honours it: children reference parents that exist, inserts come out parent-first, and the result loads into Postgres or MySQL with the constraints turned on.

The contrast is a generator that fills each cell on its own. It produces plausible-looking values and a database that will not load, because a referential database is the opposite of a bag of independent cells.

The four things schema-driven data has to get right

1. Foreign keys resolve

Every order.customer_id is the id of a customer that was actually generated, including when keys are UUIDs or have gaps. Get this wrong and the load fails on the first child insert:

ERROR:  insert or update on "order_items" violates foreign key
        constraint "order_items_product_id_fkey"
DETAIL:  Key (product_id)=(173) is not present in table "products".

We dug into exactly why ad-hoc fakers hit this in why Faker breaks when your test data has foreign keys.

2. Inserts come out in dependency order

A real load inserts customers before orders and orders before order_items. That ordering is a topological sort of the foreign-key graph, and cycles or self-references need special handling. A schema-aware generator does it for you; otherwise you disable constraints on load and ship data that could never exist in production.

3. Production-like distributions

Uniform random gives every customer roughly the same number of orders. Production has a long tail: most customers have one or two, a handful have hundreds. That tail is where pagination, N+1 queries and slow joins fall over, so a dataset that hides it is testing the wrong thing.

4. Values are coherent and derived values reconcile

The email should match the name, the city should match the postcode and country, and order.total should equal the sum of its line items. Independent per-cell generation breaks all three, which quietly invalidates any test that checks a derived field.

Start from your schema, whatever shape it is in

You should not have to redescribe a schema you already maintain. Point the generator at the source of truth you have:

SQL dump or DDL

CREATE TABLE customers (
  id    integer PRIMARY KEY,
  email varchar(160) UNIQUE NOT NULL,
  name  varchar(120) NOT NULL
);
CREATE TABLE orders (
  id          integer PRIMARY KEY,
  customer_id integer NOT NULL REFERENCES customers(id),
  total       decimal(10,2),
  ordered_at  timestamp NOT NULL
);

Full walk-through: generate test data from a SQL schema.

Django models.py

Import your models.py and the relations (ForeignKey, ManyToMany) come along, so the generated rows respect them. See Django test data and the step-by-step seed a Django database.

Prisma schema.prisma

Paste your schema.prisma and the @relation directives drive the foreign keys. More in Prisma test data and Prisma seed data without a seed script.

Whichever you start from, it is parsed into one internal model, so the four properties above hold the same way. That is the point of working from the schema: one source, consistent output.

Three ways to fill the database

Synthetic from scratch. Generate a fresh dataset from the schema. No real row is involved, which is the cleanest posture for staging, demos and CI.
Mask a copy of production. When you need the exact shape and volume of real data, replace the PII in place while keeping the keys intact. Walk-through: mask production data for staging and GDPR-compliant anonymisation.
Carve a subset. Take a foreign-key-consistent slice of a large database, so every referenced parent comes along with its children, then mask it.

Get it into your database

Export the result as SQL, CSV or JSON and load it, or push it straight into a connected database. From the VS Code or JetBrains plugin the connection is auto-detected from your .env or schema.prisma, so it is one step:

psql "$DATABASE_URL" -f seed.sql
# or, in the editor: Load into Database / Reset & Reseed

Need an API instead of a database? Any generation can also be served as a read-only mock REST API with the same foreign-key-consistent data.

For CI, generation is seeded: the same seed reproduces the same dataset on every run, so a failing test is reproducible rather than a fresh roll of the dice each time.

Where this fits in

SQL test dataFrom a CREATE TABLE dump Django test dataFrom models.py Prisma test dataFrom schema.prisma Anonymise productionPrivacy-friendly staging data vs MockarooField-by-field vs a real database vs FakerValues vs a loadable dataset

One honest caveat. If your test genuinely needs three fixed rows, a tiny hand-written fixture is fine. Generating from the schema pays off when the database is relational and you want it to behave like production. SeedBase was tested against a real 20-app project with 226 tables, which is where the foreign-key ordering and distribution handling came from. EU-hosted, no third-party trackers, export everything.

FAQ

What does it mean to generate test data from a schema?

You derive the data from your tables, columns and relationships instead of inventing values one at a time. The generator reads the foreign keys, types and constraints and produces a dataset where children reference parents that exist, inserts come out in dependency order, and the result loads with constraints enabled.

Which schema formats can I generate from?

A SQL dump or DDL, a Django models.py, or a Prisma schema.prisma. All three are parsed into the same internal model, so the output is foreign-key-consistent regardless of the source.

Why not just use Faker or random values?

Faker is great for a single value but generates each one independently, with no concept of foreign keys, insert order or cross-row consistency. At scale that means orphaned references and uniform distributions that hide real bugs. More here.

Is the generated data reproducible for CI?

Yes. Generation is seeded, so the same seed reproduces the same dataset on every run.

Can I avoid using real production data?

Yes. Generate synthetic data from the schema so no real row leaves production, or mask a copy when you need the exact shape of real data. More on masking.

Generate a real database from your schema

Point SeedBase at your SQL, Django or Prisma schema and get a populated, foreign-key-consistent database with realistic distributions. Free tier, no credit card.

Every FK resolves
Long-tail distributions
SQL / CSV / JSON
EU-hosted

Generate test data, free

Or try it live: open the editor, no signup