Product Data Generation: Simulating a Real Marketplace
This is Part 2 of the "Building a Scalable, Faceted Online Marketplace" series. Read the Introduction here.
Why Simulate Product Data?
Before you can build a search engine or a beautiful UI, you need data—lots of it. Realistic, large-scale product data is the foundation for testing scalability, search, and filtering. In this article, we'll build tools to generate hundreds of thousands of products, each with dynamic, category-specific attributes (just like Amazon or Meesho).
The Approach
The generator is a Node.js script that:
- Supports multiple categories (Computers, Clothing, Books, etc.)
- Generates products with both common and category-specific attributes
- Outputs data in NDJSON, CSV, or gzipped formats for easy import
- Is memory-safe and can generate millions of records in batches
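To make the shape of the output concrete, here is what a single generated record (one NDJSON line, pretty-printed for readability) might look like. The field names and values are illustrative assumptions; the actual schema is driven by the category definitions described later in this post:

```json
{
  "sku": "CMP-000123",
  "name": "ProBook 14 Laptop",
  "category": "Computers",
  "brand": "Lenovo",
  "price": 58999,
  "rating": 4.3,
  "attributes": {
    "RAM": "16 GB",
    "Storage": "512 GB SSD",
    "Processor": "Intel Core i5"
  }
}
```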
Key Features
- Dynamic Attributes: Each category has its own set of attributes (e.g., RAM for Computers, Size for Clothing)
- Randomized, Realistic Values: Brands, prices, ratings, and more
- Configurable Output: Choose count, category, format, and batch size
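These features are driven by two JSON definition files (facets.json and facets_values.json, introduced in the walkthrough below). Their exact schema is specific to this project, so the excerpts here are only an assumed sketch of how a category could map to attributes and value pools:

```jsonc
// facets.json (assumed shape): which attributes belong to each category
{
  "Computers": ["Brand", "RAM", "Storage", "Processor"],
  "Clothing": ["Brand", "Size", "Color", "Material"]
}
```

```jsonc
// facets_values.json (assumed shape): value pools to draw from
{
  "Brand": ["Lenovo", "HP", "Dell", "Asus"],
  "RAM": ["8 GB", "16 GB", "32 GB"],
  "Size": ["S", "M", "L", "XL"],
  "Color": ["Black", "Blue", "Red"]
}
```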
Example: Generating 200,000 Computer Products
```
npm run generate -- --category=Computers --count=200000 --out=./out --shard-size=50000 --format=ndjson --gzip --csv
```
This command creates four shards of 50,000 products each, written as NDJSON along with gzipped and CSV variants, ready for import.
Code Walkthrough
The main script is product-backend/generate_product_stream_per_category.cjs. It uses category definitions and attribute pools from facets.json and facets_values.json.
- Category & Attribute Definitions: facets.json defines which attributes belong to each category; facets_values.json provides value pools (brands, colors, etc.)
- Streaming Output: Uses Node.js streams for memory safety
- Randomization: Ensures realistic, non-repetitive data
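To make these points concrete, here is a minimal, self-contained sketch of the core idea: stream batches of randomized products to a gzipped NDJSON file without holding everything in memory. It assumes the simplified facets.json/facets_values.json shapes shown earlier and is not the actual generate_product_stream_per_category.cjs; treat it as an illustration of the streaming and randomization approach.

```js
// generate_sketch.cjs -- illustrative only, not the real generator script
const fs = require('fs');
const zlib = require('zlib');

const facets = JSON.parse(fs.readFileSync('./facets.json', 'utf8'));
const pools = JSON.parse(fs.readFileSync('./facets_values.json', 'utf8'));

const pick = (arr) => arr[Math.floor(Math.random() * arr.length)];

// Build one product with common fields plus category-specific attributes.
function makeProduct(category, i) {
  const attributes = {};
  for (const attr of facets[category] || []) {
    if (pools[attr]) attributes[attr] = pick(pools[attr]);
  }
  return {
    sku: `${category.slice(0, 3).toUpperCase()}-${String(i).padStart(6, '0')}`,
    category,
    brand: attributes.Brand,
    price: Math.round(500 + Math.random() * 99500), // illustrative price range
    rating: Math.round((1 + Math.random() * 4) * 10) / 10,
    attributes,
  };
}

// Stream `count` products as gzipped NDJSON, respecting backpressure.
async function generate(category, count, outFile) {
  const gzip = zlib.createGzip();
  gzip.pipe(fs.createWriteStream(outFile));
  for (let i = 0; i < count; i++) {
    const line = JSON.stringify(makeProduct(category, i)) + '\n';
    // If the internal buffer is full, wait for 'drain' so memory stays bounded.
    if (!gzip.write(line)) {
      await new Promise((resolve) => gzip.once('drain', resolve));
    }
  }
  gzip.end();
}

generate('Computers', 50000, './out/computers-shard-1.ndjson.gz');
```

The real script adds sharding, per-category iteration, and CSV output on top of this pattern, but the backpressure-aware write loop is what keeps memory flat even at millions of records.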
Real-World Tips
- Always generate more data than you think you'll need—scalability issues only show up at scale!
- Use NDJSON for easy streaming into MongoDB or Elasticsearch
- Keep category/attribute definitions in JSON for easy extension
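As a quick illustration of the NDJSON tip (the full import pipeline is the subject of the next article), a gzipped shard can be piped straight into mongoimport, which reads one JSON document per line from stdin; the database, collection, and file names below are placeholders:

```
gunzip -c ./out/computers-shard-1.ndjson.gz | \
  mongoimport --uri "mongodb://localhost:27017/marketplace" --collection products
```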
What's Next?
In the next article, we'll cover how to efficiently bulk import this data into MongoDB, handling millions of records with batching and error handling. (Note: We'll use MongoDB for storage, not for faceted search.)
Next up: Bulk Importing to MongoDB: Handling Millions of Products
Continue the series to see how we bring this data to life in a real database, ready for further processing and search indexing!