Build in public

I found lorem ipsum in my own test data generator

Before launch I ran a deliberate quality pass over the rule-based engine, the free-tier path that generates without the AI. Here is every value heuristic I tightened, and why each one mattered.

SeedBase · ~5 min read

SeedBase has one job: produce fake data that looks real. The paid path uses an LLM for that; the free tier runs a rule-based engine, and that engine was the one I wanted airtight before launch. So I did the thing I would tell anyone to do with their own generator: feed it a wide schema, then read the output column by column and hunt for anything a sharp user would catch before they did.

A pass like this is designed to surface edge cases. That is the whole point of running it. Here is what it turned up, in order of how much each one would have bothered me, and the fix that went in for each.

1. Lorem ipsum everywhere

Every description, bio and notes field was filled with lorem ipsum dolor sit amet. Slugs read et-labore-sed-dolor. For a product whose entire pitch is "realistic data", this is the cardinal sin.

Why was it there at all? Nobody chose lorem ipsum. It is the stock default. The rule-based engine fell back to the same placeholder words that every faker library ships with, and because the primary path is the LLM, that fallback text had never been given a proper pass. Defaults stay defaults until someone goes looking. That is exactly what this audit was for.

The fix was twofold. The obvious one: replace the placeholder word list with real, neutral English vocabulary and a pool of actual sentences. The sneaky one: the anonymization path called a faker library's text() helper, which also returns lorem ipsum by default. That one was hiding in plain sight.
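A rough sketch of the first half of that fix, assuming a small hand-written pool. The sentences, words, and function names below are illustrative placeholders, not SeedBase's actual lists or code.

```python
import random

# Hypothetical pools: neutral English, no Latin placeholder text.
SENTENCES = [
    "Customer asked for delivery before the weekend.",
    "Replaced the original part after a warranty claim.",
    "Follow up next quarter about the renewal.",
    "Prefers to be contacted by email in the morning.",
]
WORDS = ["summer", "garden", "report", "update", "guide", "budget", "travel"]


def generate_text(rng: random.Random, sentences: int = 2) -> str:
    """Build a short free-text value from distinct real sentences."""
    return " ".join(rng.sample(SENTENCES, k=sentences))


def generate_slug(rng: random.Random, words: int = 3) -> str:
    """Build a slug from distinct real words, e.g. for a URL column."""
    return "-".join(rng.sample(WORDS, k=words))
```

The second half of the fix is a matter of routing the anonymization path through the same helpers instead of the library default.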

2. Quantities up to 9,000

An order_items.quantity of 8,924. Nobody orders nine thousand of anything. The cause was an ordering bug: a generic "this is an integer column, pick a number up to 9999" rule ran before the rule that knew "quantity" means a small number. The column-name heuristics were effectively dead code for typed integer columns. I moved the name-based rules in front of the generic one, and now quantity is 1-10, stock is bounded, prices are realistic, ratings are 1-5.
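The ordering idea can be sketched as follows. This is a simplified illustration, not the engine's real rule table: the keyword matching, the function names, and the stock bound of 0-500 are assumptions. Only the quantity range (1-10), the rating range (1-5), and the 9999 generic cap come from the text above.

```python
import random

# Hypothetical name-based rules: (keyword in the column name, (low, high)).
NAME_RULES = [
    ("quantity", (1, 10)),
    ("rating", (1, 5)),
    ("stock", (0, 500)),  # assumed bound, for illustration only
]

# Generic rule for any integer column with no recognized name.
GENERIC_INT_RANGE = (0, 9999)


def int_range_for(column_name: str) -> tuple[int, int]:
    """Return the value range for an integer column, most specific rule first."""
    lowered = column_name.lower()
    for keyword, bounds in NAME_RULES:
        if keyword in lowered:
            return bounds
    # The generic fallback runs last, so it can no longer shadow the rules above.
    return GENERIC_INT_RANGE


def generate_int(column_name: str, rng: random.Random) -> int:
    """Generate one integer value that respects the column's range."""
    low, high = int_range_for(column_name)
    return rng.randint(low, high)
```

The point of the structure is that the fallback is reachable only after every specific rule has declined the column, which is the opposite of the original order.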

3. updated_at before created_at

created_at: 2026-05-01 23:46
updated_at: 2025-11-27 22:46   ← updated about five months before it existed

Each timestamp column was generated independently, so roughly a third of rows had an updated_at that predated created_at, which is physically impossible. Fixed by giving the two columns non-overlapping time windows: created_at draws from an older range, updated_at from a newer one, so the order is guaranteed.
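A minimal sketch of the non-overlapping windows, assuming a two-year history split at a boundary six months back. The window lengths and the function name are illustrative choices, not values taken from the engine.

```python
import random
from datetime import datetime, timedelta


def generate_timestamps(rng: random.Random, now: datetime):
    """Return (created_at, updated_at) drawn from two adjacent time windows."""
    oldest = now - timedelta(days=730)    # assumed start of the created_at window
    boundary = now - timedelta(days=180)  # assumed split between the two windows

    # created_at falls somewhere in [oldest, boundary].
    created_span = (boundary - oldest).total_seconds()
    created_at = oldest + timedelta(seconds=rng.uniform(0, created_span))

    # updated_at falls somewhere in [boundary, now], so it cannot precede created_at.
    updated_span = (now - boundary).total_seconds()
    updated_at = boundary + timedelta(seconds=rng.uniform(0, updated_span))

    return created_at, updated_at
```

Because the windows only touch at the boundary, the ordering holds for every row without comparing the two values after the fact.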

4. Names that didn't match their emails

A user named Mia Koch with the email olivia.bauer65@…. Name and email were rolled separately. Now the email is derived from the name generated for the same row, so it is Alex Miller at alex.miller@example.com.
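In sketch form, the change is to pick the name once and build the email from it. The name pools and the function name here are hypothetical, and real names with accents or spaces would need extra normalization that this example leaves out.

```python
import random

# Hypothetical name pools, for illustration only.
FIRST_NAMES = ["Alex", "Mia", "Olivia"]
LAST_NAMES = ["Miller", "Koch", "Bauer"]


def generate_user(rng: random.Random) -> dict:
    """Generate a name and an email that are derived from the same picks."""
    first = rng.choice(FIRST_NAMES)
    last = rng.choice(LAST_NAMES)
    # The email reuses the picks above instead of rolling a second name.
    email = f"{first}.{last}@example.com".lower()
    return {"name": f"{first} {last}", "email": email}
```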

5. Order totals that didn't add up

This was the subtle one, and the one a careful person would actually catch. The order.total was a random number with no relationship to its line items, and each order_item.unit_price had no relationship to the product's catalog price. Eyeball one table and it looks fine. Do the arithmetic across tables and it falls apart:

order.total           = 470.60
sum(quantity × price) = 1,883.20   ← these should be equal

The fix is a reconciliation pass that runs after generation: copy each product's catalog price into the line item, then set every parent total to the sum of its children. Now unit_price matches the catalog and order.total equals the sum of its line items, every time.
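The reconciliation pass might look roughly like this, assuming rows are plain dictionaries. The key names id, price, product_id and order_id are assumptions for the sketch; quantity, unit_price and total follow the column names used above.

```python
from decimal import Decimal


def reconcile(products: list, orders: list, order_items: list) -> None:
    """Make line items match catalog prices and order totals match line items."""
    price_by_product = {p["id"]: p["price"] for p in products}
    totals = {o["id"]: Decimal("0") for o in orders}

    for item in order_items:
        # Step 1: copy the catalog price into the line item.
        item["unit_price"] = price_by_product[item["product_id"]]
        # Step 2: accumulate the parent order's total from its children.
        totals[item["order_id"]] += item["quantity"] * item["unit_price"]

    for order in orders:
        order["total"] = totals[order["id"]]
```

Using Decimal rather than float keeps the sums exact to the cent, so the equality check between order.total and its line items does not fail on rounding noise.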

Why this is the actual work

None of these are glamorous. Nobody posts "I made my fake data slightly less fake today." But this is exactly the difference between a demo that makes someone think "this is obviously generated" and one that makes them think "wait, is this real?", and that second reaction is the entire product.

There is a deeper lesson here about synthetic data in general: looking valid is not the same as being coherent. Valid means the types are right and the constraints pass. Coherent means a name matches its email, a total matches its line items, a timestamp respects causality. Most "fake data" tools stop at valid. Coherence is where the realism lives.

Still on the list. Coherent geography (a city, its postal code and its country actually matching) is next, along with broader domain vocabularies. Honest is better than perfect. These are written down, not papered over.

If you want the technical version of how the relationships are kept consistent in the first place, I wrote that up separately: generating foreign-key-consistent test data from your schema →

See the data for yourself

Generate a complete, coherent sample database in about fifteen seconds, then point it at your own schema. Free tier, no credit card.

  • No lorem ipsum
  • Coherent values
  • FK-consistent
  • EU-hosted