Product Data Generation: Simulating a Real Marketplace
This is Part 2 of the "Building a Scalable, Faceted Online Marketplace" series. Read the Introduction here.
Why Simulate Product Data?
Before you can build a search engine or a beautiful UI, you need data—lots of it. Realistic, large-scale product data is the foundation for testing scalability, search, and filtering. In this article, we'll build tools to generate hundreds of thousands of products, each with dynamic, category-specific attributes (just like Amazon or Meesho).
The Approach
The generator is a Node.js script that:
- Supports multiple categories (Computers, Clothing, Books, etc.)
- Generates products with both common and category-specific attributes
- Outputs data in NDJSON, CSV, or gzipped formats for easy import
- Is memory-safe and can generate millions of records in batches
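To make the shape of the output concrete, here is what a single generated record (one NDJSON line, pretty-printed for readability) might look like. The field names and values are illustrative assumptions; the actual schema is driven by the category definitions described later in this post:

```json
{
  "sku": "CMP-000123",
  "name": "ProBook 14 Laptop",
  "category": "Computers",
  "brand": "Lenovo",
  "price": 58999,
  "rating": 4.3,
  "attributes": {
    "RAM": "16 GB",
    "Storage": "512 GB SSD",
    "Processor": "Intel Core i5"
  }
}
```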
Key Features
- Dynamic Attributes: Each category has its own set of attributes (e.g., RAM for Computers, Size for Clothing)
- Randomized, Realistic Values: Brands, prices, ratings, and more
- Configurable Output: Choose count, category, format, and batch size
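These features are driven by two JSON definition files (facets.json and facets_values.json, introduced in the walkthrough below). Their exact schema is specific to this project, so the excerpts here are only an assumed sketch of how a category could map to attributes and value pools:

```jsonc
// facets.json (assumed shape): which attributes belong to each category
{
  "Computers": ["Brand", "RAM", "Storage", "Processor"],
  "Clothing": ["Brand", "Size", "Color", "Material"]
}
```

```jsonc
// facets_values.json (assumed shape): value pools to draw from
{
  "Brand": ["Lenovo", "HP", "Dell", "Asus"],
  "RAM": ["8 GB", "16 GB", "32 GB"],
  "Size": ["S", "M", "L", "XL"],
  "Color": ["Black", "Blue", "Red"]
}
```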
Example: Generating 200,000 Computer Products
```
npm run generate -- --category=Computers --count=200000 --out=./out --shard-size=50000 --format=ndjson --gzip --csv
```
This command creates four shards of 50,000 products each, written as NDJSON along with gzipped and CSV variants, ready for import.
Code Walkthrough
The main script is product-backend/generate_product_stream_per_category.cjs. It uses category definitions and attribute pools from facets.json and facets_values.json.
- Category & Attribute Definitions: facets.json defines which attributes belong to each category; facets_values.json provides value pools (brands, colors, etc.)
- Streaming Output: Uses Node.js streams for memory safety
- Randomization: Ensures realistic, non-repetitive data
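To make these points concrete, here is a minimal, self-contained sketch of the core idea: stream batches of randomized products to a gzipped NDJSON file without holding everything in memory. It assumes the simplified facets.json/facets_values.json shapes shown earlier and is not the actual generate_product_stream_per_category.cjs; treat it as an illustration of the streaming and randomization approach.

```js
// generate_sketch.cjs -- illustrative only, not the real generator script
const fs = require('fs');
const zlib = require('zlib');

const facets = JSON.parse(fs.readFileSync('./facets.json', 'utf8'));
const pools = JSON.parse(fs.readFileSync('./facets_values.json', 'utf8'));

const pick = (arr) => arr[Math.floor(Math.random() * arr.length)];

// Build one product with common fields plus category-specific attributes.
function makeProduct(category, i) {
  const attributes = {};
  for (const attr of facets[category] || []) {
    if (pools[attr]) attributes[attr] = pick(pools[attr]);
  }
  return {
    sku: `${category.slice(0, 3).toUpperCase()}-${String(i).padStart(6, '0')}`,
    category,
    brand: attributes.Brand,
    price: Math.round(500 + Math.random() * 99500), // illustrative price range
    rating: Math.round((1 + Math.random() * 4) * 10) / 10,
    attributes,
  };
}

// Stream `count` products as gzipped NDJSON, respecting backpressure.
async function generate(category, count, outFile) {
  const gzip = zlib.createGzip();
  gzip.pipe(fs.createWriteStream(outFile));
  for (let i = 0; i < count; i++) {
    const line = JSON.stringify(makeProduct(category, i)) + '\n';
    // If the internal buffer is full, wait for 'drain' so memory stays bounded.
    if (!gzip.write(line)) {
      await new Promise((resolve) => gzip.once('drain', resolve));
    }
  }
  gzip.end();
}

generate('Computers', 50000, './out/computers-shard-1.ndjson.gz');
```

The real script adds sharding, per-category iteration, and CSV output on top of this pattern, but the backpressure-aware write loop is what keeps memory flat even at millions of records.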
Real-World Tips
- Always generate more data than you think you'll need—scalability issues only show up at scale!
- Use NDJSON for easy streaming into MongoDB or Elasticsearch
- Keep category/attribute definitions in JSON for easy extension
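As a quick illustration of the NDJSON tip (the full import pipeline is the subject of the next article), a gzipped shard can be piped straight into mongoimport, which reads one JSON document per line from stdin; the database, collection, and file names below are placeholders:

```
gunzip -c ./out/computers-shard-1.ndjson.gz | \
  mongoimport --uri "mongodb://localhost:27017/marketplace" --collection products
```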
What's Next?
In the next article, we'll cover how to efficiently bulk import this data into MongoDB, handling millions of records with batching and error handling. (Note: We'll use MongoDB for storage, not for faceted search.)
Next up: Bulk Importing to MongoDB: Handling Millions of Products
Continue the series to see how we bring this data to life in a real database, ready for further processing and search indexing!