Deep dive

Why Faker breaks when your test data has foreign keys

Faker is excellent at one job: making up a single realistic value. The trouble starts the moment your data has relationships. Here is exactly where it falls apart at scale, with code, and what referentially consistent data looks like instead.

SeedBase · ~6 min read

To be clear up front: Faker is a great library. If you need a plausible name, email, address or timestamp, reach for it. The issue is not the values. The issue is that Faker generates every value independently, and a real database is the opposite of independent.

The failure, in one snippet

Say you fake users, orders and order items in Python:

from faker import Faker
fake = Faker()

# 1,000 users with sequential integer IDs
users  = [{"id": i, "name": fake.name(), "email": fake.email()} for i in range(1, 1001)]
# 5,000 orders; user_id is a random guess inside the known user ID range
orders = [{"id": i,
           "user_id": fake.random_int(min=1, max=1000),
           "total":   fake.pydecimal(left_digits=4, right_digits=2, positive=True)}
          for i in range(1, 5001)]
# 20,000 line items; product_id assumes products 1..200 exist, but none are generated
items  = [{"id": i,
           "order_id":   fake.random_int(min=1, max=5000),
           "product_id": fake.random_int(min=1, max=200),
           "quantity":   fake.random_int(min=1, max=5)}
          for i in range(1, 20001)]

It runs, and the rows look fine. Then you load them into a real database with constraints on:

ERROR:  insert or update on table "order_items" violates
        foreign key constraint "order_items_product_id_fkey"
DETAIL:  Key (product_id)=(173) is not present in table "products".

The snippet never generates any products, so every product_id points at a row that does not exist. And that is the good case, where the error is obvious.
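To see the same class of failure without a database server, here is a minimal sketch using Python's built-in sqlite3 module. It assumes a cut-down two-table schema (products and order_items) that is not from the original snippet, and SQLite words the error differently from the PostgreSQL message above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite leaves foreign key enforcement off unless you ask for it.
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical, cut-down schema: just enough to show the violation.
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE order_items ("
    " id INTEGER PRIMARY KEY,"
    " product_id INTEGER NOT NULL REFERENCES products(id))"
)

# The products table is empty, so any product_id is an orphan.
try:
    conn.execute("INSERT INTO order_items (id, product_id) VALUES (1, 173)")
    fk_error = None
except sqlite3.IntegrityError as exc:
    fk_error = str(exc)

print(fk_error)
```

With enforcement switched off, the same insert would succeed silently, which is the less obvious and more dangerous case.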

The four things that break at scale

1. Foreign keys point at rows that do not exist

The random_int(min=1, max=1000) trick only works if the parent IDs are exactly 1..1000, with no gaps, generated before the children, and you remember to bound every range by hand for every relationship. The moment your keys are UUIDs, or there are gaps, or a parent set is filtered, the trick is gone. Faker has no concept of "an ID that already exists".
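One way around guessing ranges is to draw each foreign key from the parent IDs that were actually generated. The sketch below uses only the standard library and assumes UUID keys plus a filtered parent set; the names active_user_ids and orphans are illustrative, not part of Faker or SeedBase.

```python
import random
import uuid

rng = random.Random(42)  # seeded so the choice of parents is reproducible

# Parents first. UUID keys, so there is no 1..N range to guess at.
users = [{"id": str(uuid.uuid4()), "name": f"user-{n}"} for n in range(1000)]

# Simulate a filtered parent set: drop every third user.
active_user_ids = [u["id"] for i, u in enumerate(users) if i % 3 != 0]

# Children sample from IDs that exist, never from a numeric range.
orders = [{"id": n, "user_id": rng.choice(active_user_ids)}
          for n in range(1, 5001)]

# Sanity check: collect any order whose parent is missing.
valid_ids = set(active_user_ids)
orphans = [o for o in orders if o["user_id"] not in valid_ids]
print(len(orphans))
```

The point of the design is that the child generator takes the parent list as input, so gaps, UUIDs and filtering cannot produce a dangling reference.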

2. Nothing is in insert order

A real load has to insert users before orders, and orders before order_items. Faker hands you three independent lists. You either topologically sort the dependency graph yourself, or you disable constraints on load, which means you are shipping data that could never exist in production.
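If you do sort the dependency graph yourself, the standard library can do the ordering. This is a minimal sketch with graphlib (Python 3.9+); the depends_on mapping is written by hand here as an assumption, whereas a real tool would derive it from the schema's foreign keys.

```python
from graphlib import TopologicalSorter

# Each table maps to the set of tables it references (its parents).
depends_on = {
    "users": set(),
    "products": set(),
    "orders": {"users"},
    "order_items": {"orders", "products"},
}

# static_order() yields every table after all of its parents.
insert_order = list(TopologicalSorter(depends_on).static_order())
print(insert_order)
```

Self-referencing tables and circular foreign keys make static_order() raise graphlib.CycleError, so those cases need separate handling, for example inserting with a NULL key and updating afterwards.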

3. The distribution is uniform, and production never is

Because every user_id is uniform random, order counts cluster around the mean: with 5,000 orders spread over 1,000 users, almost everyone ends up with a handful and nobody has dozens. Real data has a long tail: most users have one or two, a few have dozens. That tail is exactly where pagination, N+1 queries and slow joins fall over. Uniform fake data hides the bugs you are trying to find.
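A rough way to get a tail is to weight the parent choice instead of sampling uniformly. The sketch below uses Zipf-like weights of 1/rank, which is an assumption chosen for illustration rather than a measured production distribution; tune the exponent to match your own data.

```python
import random
from collections import Counter

rng = random.Random(7)  # seeded for reproducibility
user_ids = list(range(1, 1001))

# Zipf-like weights: the user at rank r is chosen with weight 1/r.
weights = [1 / rank for rank in range(1, len(user_ids) + 1)]

uniform_fks = [rng.choice(user_ids) for _ in range(5000)]
skewed_fks = rng.choices(user_ids, weights=weights, k=5000)

# Orders per user under each scheme.
uniform_counts = Counter(uniform_fks)
skewed_counts = Counter(skewed_fks)

uniform_max = max(uniform_counts.values())
skewed_max = max(skewed_counts.values())
print("busiest user, uniform:", uniform_max)
print("busiest user, skewed: ", skewed_max)
```

Under these weights some users receive no orders at all, which is also closer to production than a table where everyone has bought something.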

4. Values are incoherent across columns

Faker fills each column on its own, so the email does not match the name (fake.name() gives "Alex Miller", fake.email() gives "wsmith@example.org"), and order.total has no relationship to the sum of its line items. Any test that checks a derived or denormalized value is now meaningless.
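Coherence across columns means deriving dependent values rather than generating them separately. A minimal sketch, assuming a hypothetical unit_price column on line items (the original snippet has none) and a made-up email_for helper:

```python
from decimal import Decimal

def email_for(name: str, domain: str = "example.org") -> str:
    """Derive the email from the name so the two columns agree."""
    local = ".".join(name.lower().split())
    return f"{local}@{domain}"

user = {"id": 1, "name": "Alex Miller"}
user["email"] = email_for(user["name"])

# unit_price is an illustrative column, not part of the original snippet.
items = [
    {"order_id": 1, "unit_price": Decimal("19.99"), "quantity": 2},
    {"order_id": 1, "unit_price": Decimal("5.00"), "quantity": 3},
]

# The order total is computed from its line items, never generated on its own.
order = {
    "id": 1,
    "user_id": user["id"],
    "total": sum((i["unit_price"] * i["quantity"] for i in items), Decimal("0")),
}
print(user["email"], order["total"])
```

Generating the line items first and the order total last is what keeps a test on the denormalized value meaningful.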

Bonus failure: uniqueness. At scale, fake.email() repeats and trips a UNIQUE constraint. Switch to fake.unique.email() and Faker eventually exhausts its pool and raises UniquenessException. Either way you are writing retry logic around a data generator.
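An alternative to retrying is to make uniqueness a property of how the value is built. The sketch below derives emails from names and appends a counter on collision; unique_emails is a hypothetical helper, not a Faker API, and the simple loop is meant for illustration rather than for very large numbers of identical names.

```python
def unique_emails(names, domain="example.org"):
    """Return one email per name, guaranteed distinct within the batch."""
    issued = set()
    result = []
    for name in names:
        local = ".".join(name.lower().split())
        candidate, n = f"{local}@{domain}", 1
        # On collision, append the next free number to the local part.
        while candidate in issued:
            n += 1
            candidate = f"{local}{n}@{domain}"
        issued.add(candidate)
        result.append(candidate)
    return result

emails = unique_emails(["Alex Miller", "Alex Miller", "Will Smith", "Alex Miller"])
print(emails)
```

Because the suffix is unbounded, this cannot run out of values the way a fixed pool can, and the same input always yields the same output.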

What you actually want

You want data that is referentially consistent by construction: children only reference parents that exist, inserts come out parent-first, distributions are skewed like real life, and derived values are reconciled. That is not a Faker call, it is a property of the whole dataset, derived from your schema.

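That is the line SeedBase draws. You give it your schema (SQL, a Django models.py, or a Prisma schema) and it generates a populated database in which every foreign key resolves and distributions follow a long tail, exportable as SQL, CSV or JSON.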

Not a Faker replacement, a layer above it. Faker is a fine value library, the kind of thing that lives inside a generator. What it does not give you is a loadable relational database. SeedBase was tested against a real 20-app Django project with 226 tables, which is where the foreign-key and distribution handling came from. EU-hosted, no third-party trackers, export everything.

The short version

Faker makes great values and zero guarantees about how they fit together. For a single column, that is all you need. For a database with relationships, "valid value per cell" and "valid dataset" are different problems, and only the second one loads.

Get data that actually loads

Point SeedBase at your schema and generate a populated, FK-consistent database with realistic distributions. Free tier, no credit card.

  • Every FK resolves
  • Long-tail distributions
  • SQL / CSV / JSON
  • EU-hosted
