Data quantities
- 10k stocks
- 10 metrics per stock
- 15 years per stock
- 220 trading days per year
- 330M values
- A query
- Input: 3 years, 1000 stocks, 10 metrics
- Output: 30k values? (3 × 1000 × 10, if it is one value per stock/metric/year; arithmetic checked below)
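A quick sanity check of the quantities above (a sketch; it assumes the 220 figure is trading days per year, which is what makes the totals come out to 330M):

```python
# Sanity check of the data-quantity arithmetic above.
stocks = 10_000
metrics_per_stock = 10
years = 15
days_per_year = 220  # assumption: 220 means trading days per year

total_values = stocks * metrics_per_stock * years * days_per_year
print(f"total: {total_values:,}")  # 330,000,000

# The example query: 3 years x 1000 stocks x 10 metrics.
# 30k holds only if the query returns one value per stock/metric/year;
# with daily values it would be 3 * 220 * 1000 * 10 = 6,600,000.
print(f"query: {3 * 1000 * 10:,}")  # 30,000
```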
Potential data layouts to try
- Normalised format. Each row holds a single value keyed by three fields: stock, metric and date. (Example document shapes for all three layouts are sketched after this list.)
- Group into simple buckets. Each row has an array of all the values grouped by one of stock, metric or date, e.g. one row would have the key AAPL and an array of values for all days. The index would be on the chosen key.
- Group into compound buckets. Each row has an array of values, and the index would be on the compound grouping key. Potential groupings:
- stock_metric, e.g. each row would have the key AAPL_CLOSE and an array of values for all days
- stock_date
- date_metric
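To make the layouts concrete, here is a sketch of one document per layout, assuming a MongoDB-style document store (which the $in note below suggests); the field names and the yyyymmdd integer dates are illustrative, not an actual schema:

```python
# Normalised: one value per document; index on (stock, metric, date).
normalised_row = {"stock": "AAPL", "metric": "CLOSE", "date": 20240102, "value": 185.64}

# Simple bucket keyed by stock: every value for one stock in one document;
# the index is on _id, the chosen grouping key.
simple_bucket = {
    "_id": "AAPL",
    "values": [
        {"metric": "CLOSE", "date": 20240102, "value": 185.64},
        # ... one entry per (metric, day)
    ],
}

# Compound bucket keyed by stock_metric: one full time series per document.
compound_bucket = {
    "_id": "AAPL_CLOSE",
    "values": [
        {"date": 20240102, "value": 185.64},
        # ... one entry per trading day
    ],
}
```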
Things to try
- Launch parallel queries rather than using $in (see the sketch after this list)
- Use a plain number instead of a date object
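A minimal sketch of both ideas together, assuming pymongo and the normalised layout: fan out one query per stock on a thread pool instead of issuing a single $in, and store dates as plain yyyymmdd integers (the collection and field names are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

from pymongo import MongoClient

client = MongoClient()
coll = client.marketdata.values  # hypothetical database/collection names

def fetch_stock(stock: str) -> list:
    # One narrow query per stock; dates are plain yyyymmdd ints, not date objects.
    return list(coll.find({
        "stock": stock,
        "metric": "CLOSE",
        "date": {"$gte": 20210101, "$lt": 20240101},  # a 3-year window
    }))

stocks = ["AAPL", "MSFT", "GOOG"]  # the real query would pass ~1000 stocks

# Fan out in parallel rather than one big {"stock": {"$in": stocks}} query.
with ThreadPoolExecutor(max_workers=32) as pool:
    results = list(pool.map(fetch_stock, stocks))
```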
Normalised test results
- I inserted 3.3M rows, which amounts to only about 3 years of data. An example query returning about 100k rows took 1 second locally. (A sketch of such a query is below.)
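For context, the baseline query against the normalised layout presumably looks something like this (a sketch under the same assumed names as above, not the exact test setup):

```python
from pymongo import ASCENDING, MongoClient

client = MongoClient()
coll = client.marketdata.values  # hypothetical names, as above

# Compound index over the three key fields of the normalised layout.
coll.create_index([("stock", ASCENDING), ("metric", ASCENDING), ("date", ASCENDING)])

# A range query of roughly the shape tested above; ~100k matching rows
# came back in about a second locally.
rows = list(coll.find({
    "stock": {"$in": ["AAPL", "MSFT"]},  # the $in baseline the notes mention
    "metric": "CLOSE",
    "date": {"$gte": 20210101, "$lt": 20240101},
}))
```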