Big Data Storage

This is a page about some of the mechanics of ‘big data’, specifically how to store, transfer and process perhaps 100s of millions or billions of rows/events.

Although some people think big data starts at 65k rows, at which point Excel used to lose it, I’m talking about gigabytes of data here.

Infrastructure for this kind of thing abounds. Data can however come from lots of sources, and go to lots of different places. In between we need a robust storage or transfer format that is universally available and understood.

And it turns out that this isn’t a particularly well-solved problem. As Daniel Bachler notes in a tweet: “It’s a weak point of modern data infrastructure - at @OurWorldInData we joke that CSV files are the technological frontier.”

Let’s start out with the humble CSV file:

  • It is exceptionally easy to generate. If you can write to a file, you can create a CSV file (see the sketch after this list)
    • Does not put a noticeable burden on the data generating process
  • It is also trivial to modify what you are writing out - just add a column
  • Everything can parse CSV files
    • Often at tremendous speeds
  • Storage format is reasonably efficient
    • Even better with compression
  • Easy to fix corrupted data
  • No vendor lock-in; people a century from now will be able to parse your file
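
To make the first point concrete: this is roughly all it takes to produce a CSV file from C++. A minimal sketch - the file name and columns here are made up purely for illustration:

  #include <cstdio>

  int main()
  {
    // hypothetical file and column names, just to show how little is needed
    FILE* fp = fopen("events.csv", "w");
    fprintf(fp, "timestamp,temperature,sensor\n");            // the header line
    for(int n = 0; n < 1000; ++n)
      fprintf(fp, "%d,%.2f,%s\n", n, 20.0 + 0.01*n, "attic"); // one row per line
    fclose(fp);
  }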

Real-life big data practitioners know CSV as a reliable workhorse. It is universally understood, so you’ll never face any kind of barrier. And this is very important - most environments are not set up to generate big data. Your need for data comes on top of the actual work being done. You don’t want any unnecessary hurdles.

When writing CSV, it is however very easy to mess things up:

  • Escaping - CSV has different dialects and different ways of escaping newlines, commas, tabs or semi-colons. And is a string quoted or not?
  • You can find out the hard way that your favorite way of printing a floating point number discards a lot of precision (see the sketch after this list)
  • Depending on locale, you might see numbers expressed as 1.527 or 1,527
  • It is wonderful you can just print out lines, but if you have a lot of fields, you have to manually check you don’t mess up the ordering with respect to the header
  • No one stops you from putting a string in a column that previously contained only floating point data
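
As a sketch of the precision pitfall - the value is made up, the point is the default formatting:

  #include <cstdio>

  int main()
  {
    double d = 1.1234567890123;
    printf("%f\n", d);    // prints 1.123457 - the default %f keeps only 6 digits after the decimal point
    printf("%.17g\n", d); // 17 significant digits, enough to round-trip a double exactly
  }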

CSV is good enough to be the best universally supported format, but it does have its issues.

Suggested alternatives

I asked Twitter for inspiration on CSV alternatives, and the Internet really delivered (thanks!):

  • Sqlite3, an embeddable SQL engine. I had not initially thought of sqlite as a storage format, but if you don’t create any indexes, it turns out that sqlite is effectively that. There is virtually no overhead, but there is type safety and near universal accessibility. More about sqlite3 below.
  • Parquet, a free and open-source column-oriented data storage format in the Apache Hadoop ecosystem. Supports compression and other neat features. In theory this is a great format, and I tried to make it work for me. I gave up after a few hours because it turns out that if you are not already in their ecosystem, Parquet is not very accessible. There are no Parquet packages in Linux distributions for example. The Parquet C++ support was hard to find, and the documentation did not appear to match up to the library I found. Although I really wanted to love Parquet, it fails the accessibility test. Chances are that your data source can’t generate Parquet, or that your favorite tools can’t read it. As noted though “Parquet is great if you use pyarrow, pandas or pyspark”. It should also be noted that Parquet files support only one active user at a time. You can’t read a file while it is still open by another process.
  • HDF5 similarly offers a lot of great features, especially for higher-dimensional data. It is used a lot in science and space projects. HDF5 is however not much loved. The specification is both long and incomplete. Libraries have had unfortunate issues. And the last thing you want in data storage is ‘issues’. HDF5 however has likely passed the threshold that people will be able to read your data 100 years from now. Perhaps over time the format and libraries will prove to be robust.
  • JSON. People dealing with smaller amounts of data tend to be attached to JSON, which in a sense is also nice and trivial to generate, but it tends to be slower by at least an order of magnitude, and the files generated are a lot larger, since field names are almost always repeated for every row.
  • CBOR - a binary JSON workalike. I really wanted to love this one, and even looked at it in earnest back in my PowerDNS days. But although it is binary, it does repeat column names for every row.
  • Various “just print ASCII” variants. For example TSV, which offers some advantages over CSV, such as tooling and actually defined escaping rules.
  • Serialization libraries like Protobuf, Apache Avro, Cap’n Proto. These libraries are very good at what they do. To use them for data transfer however requires a bit more work - you need a file format that delineates one data object from another. While this is pretty trivial to do (“length-field object length-field object ..”, see the sketch after this list), this has typically not been standardised. This means your favorite data processing tool can trivially source CSV, but you can’t pandas.read_protobufs. With a bit of work however, you can make this approach fly.
  • LMDB - is actually a transactional key/value store. I know LMDB well and even implemented an LMDB serialization library for PowerDNS. LMDB is seeing heavy use in machine learning applications. LMDB can be almost magical in its performance, but its semantics are complicated. It is easy to mess it up. But if you know what you are doing, it may well be the tool for you. Lots of projects rely on LMDB and it is packaged well, so it is relatively simple to incorporate in a project. It is not as accessible (from, say, pandas) as CSV or other more mainstream formats though. LMDB however does support reading from a file that is also open for writing.
  • DuckDB, an in-process SQL OLAP database management system. I love DuckDB! Once you have your data in one place and are ready to process it, DuckDB is truly magical. I was even one of its earliest fans. What DuckDB is not, however, is a read/write environment with multiple users. This makes it somewhat hard to use as a data logging environment - while you are logging, there is no way to access the data. But, once you have your data in place, DuckDB is the king of the hill in terms of processing speed and accessibility (as we’ll see below).
  • People also suggested various databases like Elasticsearch, InfluxDB, ClickHouse and MongoDB. I could say a lot about these projects, but the key thing is that these are not things to transport data with. They are not well defined files I can send someone. Unlike DuckDB and sqlite, for example.
  • XML, ASN.1: No
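
To illustrate the “length-field object” framing mentioned for the serialization libraries above, here is a minimal sketch. The serialize() function and the file name are stand-ins - in reality you would use whatever Protobuf, Avro or Cap’n Proto gives you:

  #include <cstdint>
  #include <cstdio>
  #include <string>

  // stand-in for e.g. Protobuf's SerializeToString() - any serializer will do
  std::string serialize(int n) { return std::to_string(n); }

  int main()
  {
    FILE* fp = fopen("objects.bin", "wb");
    for(int n = 0; n < 1000; ++n) {
      std::string blob = serialize(n);
      uint32_t len = blob.size();
      fwrite(&len, sizeof(len), 1, fp);        // the length field..
      fwrite(blob.data(), 1, blob.size(), fp); // ..followed by the object itself
    }
    fclose(fp);                                // (a real format would also pin down endianness)
  }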

The clear winner

Sqlite3 turns out to have been the most suggested solution, so I gave it a spin. We want to retain the CSV goals, like low overhead, universal accessibility and robustness. Sqlite delivers on all these fronts. It for example has built-in tooling for dealing with corrupted files.

Modern Sqlite can be type safe, so it will not allow you to insert a string in a floating point field. Its support for prepared statements means we don’t have to worry about escaping.
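
A minimal sketch of what that looks like with the raw sqlite3 C API. The type safety comes from STRICT tables, which need sqlite 3.37 or newer; the table layout here is just for illustration, and error checking is omitted:

  #include <sqlite3.h>

  int main()
  {
    sqlite3* db;
    sqlite3_open("example.sqlite3", &db);

    // STRICT makes sqlite enforce the declared column types
    sqlite3_exec(db, "CREATE TABLE IF NOT EXISTS data (pief INT, poef REAL, paf TEXT) STRICT",
                 nullptr, nullptr, nullptr);

    // a prepared statement: values are bound, never pasted into the SQL, so no escaping worries
    sqlite3_stmt* stmt;
    sqlite3_prepare_v2(db, "INSERT INTO data (pief, poef, paf) VALUES (?, ?, ?)", -1, &stmt, nullptr);
    sqlite3_bind_int(stmt, 1, 123);
    sqlite3_bind_double(stmt, 2, 1.1234567890123);
    sqlite3_bind_text(stmt, 3, "bert", -1, SQLITE_STATIC);
    sqlite3_step(stmt);
    sqlite3_finalize(stmt);

    sqlite3_close(db);
  }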

But is it compact? Does it scale? Is it fast enough? I built a small open source library that allows you to store data like this:

  SQLiteWriter sqw("example.sqlite3");

  for(int n = 0; n < 1000000000; ++n) {
    sqw.addValue({{"pief", n}, {"poef", 1.1234567890123*n}, {"paf", "bert"}});
  }

  sqw.addValue({{"timestamp", 1234567890}});

This provisions a table for you automatically, and also adds fields/columns as needed. This example program creates one billion rows, most of which contain an integer, a floating point and a small string. Data is committed once every second. Journaling mode is set to WAL, which means that while the file is growing, data can be accessed without interrupting the writing process.
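
To illustrate that last point, a reader can simply open the file read-only while the demo is still running. A minimal sketch with the sqlite3 C API, assuming the table is called ‘data’ as in the queries further down (the sqlite3 command line tool works just as well):

  #include <cstdio>
  #include <sqlite3.h>

  int main()
  {
    // open the database read-only while the writer keeps appending (WAL journaling)
    sqlite3* db;
    sqlite3_open_v2("example.sqlite3", &db, SQLITE_OPEN_READONLY, nullptr);

    sqlite3_stmt* stmt;
    sqlite3_prepare_v2(db, "SELECT count(1) FROM data", -1, &stmt, nullptr);
    if(sqlite3_step(stmt) == SQLITE_ROW)
      printf("rows so far: %lld\n", (long long)sqlite3_column_int64(stmt, 0));

    sqlite3_finalize(stmt);
    sqlite3_close(db);
  }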

Here is the performance:

$ time ./demo 
real	32m52,764s
user	28m35,767s
sys	0m43,501s

$ du -h example.sqlite3
26G	example.sqlite3

Or in other words, 507099 rows/second, 27 bytes/row. In terms of efficiency, you’d be hard pressed to do any better even if you rolled this by hand.

In fact, if you export this database as a .CSV file, it becomes bigger (33.7GB).

To round this off, sqlite is now truly world infrastructure. Every phone carries around a ton of sqlite instances. The intellectual property is also completely public domain and there is no chance that anyone will eventually try to ‘monetize’ your use of sqlite.

Honourable mention: DuckDB

Now, many people raved about DuckDB. As a test, I imported the CSV file into DuckDB, with the following somewhat astounding result:

-rw-r--r-- 1 ahu ahu 27878182912 oct 21 17:02 example.sqlite3
-rw-rw-r-- 1 ahu ahu 33777780033 oct 21 21:11 example.csv
-rw-rw-r-- 1 ahu ahu  4751372288 oct 21 22:07 example.duckdb
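
For reference, the import itself is a one-liner. A sketch of how it could be done with DuckDB’s embedded C++ API - the duckdb command line client works just as well:

  #include "duckdb.hpp"

  int main()
  {
    duckdb::DuckDB db("example.duckdb");
    duckdb::Connection con(db);

    // let DuckDB sniff the CSV dialect and the column types by itself
    con.Query("CREATE TABLE data AS SELECT * FROM read_csv_auto('example.csv')");
  }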

The file is now only 4.7 GB? Can this be real? Let’s check some answers:

D select count(1) from data;
1000000001
Run Time: real 0.347 user 2.372468 sys 0.010937

D select avg(pief),avg(poef) from data;
499999999.5│561728393.9444377
Run Time: real 3.037 user 21.029809 sys 0.303720

Now how does this compare to sqlite?

sqlite> select count(1) from data;
1000000001
Run Time: real 64.407 user 19.019721 sys 12.396489

sqlite> select avg(pief),avg(poef) from data;
499999999.067109|561728393.944446
Run Time: real 76.989 user 61.249541 sys 12.303918

Now don’t get me wrong. The sqlite numbers are very good. It is the DuckDB numbers that are simply unbelievably astounding!

A hint at how this is possible can be found in the ‘real’ versus ‘user’ times. DuckDB is able to recruit multiple CPU cores to do the work, allowing it to burn 21 CPU seconds in only 3 seconds of wall-clock time - roughly seven cores’ worth. In addition, DuckDB internally vectorises its calculations, enabling every CPU core to do more work per cycle.

Finally, the clever use of column storage makes it possible to compress the 27 GB of data into 4.7 GB. This saves a lot of memory, which in itself can also be a huge speedup (because you don’t need to swap, and the hit rate of various caches goes up).

Summarising

For now, sqlite appears to be a very good option. Everyone can write it, everyone can read it, it is truly public domain, it is type safe, compact and fast enough. But it is not the final word - the DuckDB example above shows what clever data storage could do. It is great news that we have DuckDB to do analysis once the data is in, though!

It would be wonderful if we could get the storage smarts from DuckDB or Parquet, but in a way that also allows us to look at the data without shutting down the generating process first.

I hope we can make it happen!