Using that technique [1], I've personally had great success storing data in hashes by JSON-encoding it beforehand. We divide data into "buckets" of 1024 entries each; given an ID of 159784623, the resulting bucket ID is 156039 and the remainder is 687:
    id          = 159784623
    bucket_size = 1024
    remainder   = id % bucket_size
    bucket_id   = (id - remainder) / bucket_size
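As a rough sketch of what that key layout could look like with the redis-rb client (the key names and payload below are my own illustration, not necessarily the exact scheme described above):

    require "json"
    require "redis"   # redis-rb client

    redis = Redis.new
    id    = 159784623
    data  = { "name" => "example", "score" => 42 }.to_json   # hypothetical payload

    # Plain approach: one top-level key per record
    redis.set("data:#{id}", data)

    # Bucketed approach: one hash per 1024 IDs, with the remainder as the field
    bucket_size = 1024
    remainder   = id % bucket_size
    bucket_id   = (id - remainder) / bucket_size
    redis.hset("data:#{bucket_id}", remainder, data)
    redis.hget("data:#{bucket_id}", remainder)   # => the JSON string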
Using this, I've been able to reduce memory usage anywhere from 40% to 80% (yes, 80%), depending on the compressibility of the data (the length and randomness of each hash item).
I've also been replacing dictionary keys with integers, which reduces the size of the stored data by an additional ~30%.
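A minimal sketch of that integer-key substitution, assuming a made-up field mapping:

    require "json"

    # Hypothetical mapping from verbose key names to small integers
    KEY_MAP = { "user_id" => 0, "name" => 1, "email" => 2 }.freeze

    record   = { "user_id" => 159784623, "name" => "Ann", "email" => "ann@example.com" }
    shrunken = record.map { |key, value| [KEY_MAP.fetch(key), value] }.to_h

    shrunken.to_json   # => {"0":159784623,"1":"Ann","2":"ann@example.com"} -- shorter keys, same values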
These simple techniques shouldn't be underestimated; the gains can be considerable. JSON-encoded data is quite verbose, so switching to CSV may add considerable savings too, but JSON can accommodate missing keys.
Lists and sets can also accommodate "bucketing" of data; however, that comes with the added complexity of supporting the variety of Redis commands those data structures bring (BLPOP, SADD, SDIFF, etc.).
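To illustrate that trade-off for sets (key names are hypothetical, and the KEYS-based scan is only for brevity; in practice you'd track the bucket IDs yourself):

    require "redis"

    redis       = Redis.new
    bucket_size = 1024
    member_id   = 159784623
    bucket_id   = member_id / bucket_size   # integer division in Ruby

    # Adding stays simple: pick the bucket, then SADD into it
    redis.sadd("myset:#{bucket_id}", member_id)

    # But set-wide operations now fan out across every bucket;
    # a full SMEMBERS becomes a union over all bucket keys
    all_members = redis.keys("myset:*").flat_map { |key| redis.smembers(key) }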
The blog post you referenced was, I think, done by Instagram, not Tumblr. I remember reading it and doing something similar. They also found that doubling the `hash-max-ziplist-entries` option from 512 to 1024 didn't cause much CPU load.
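For context, that setting is the threshold below which Redis keeps a hash in its compact ziplist encoding; it can be changed in redis.conf or at runtime (the entry count below is just the value discussed here, and 64 is the usual default for the per-value cap):

    # redis.conf
    hash-max-ziplist-entries 1024
    hash-max-ziplist-value   64

    # or at runtime
    redis-cli config set hash-max-ziplist-entries 1024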
Something else you can try is encoding with msgpack rather than JSON. You can save around 30% relative to the original JSON string without losing the key names.
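A small sketch of that swap using the msgpack-ruby gem (the sample record is made up, and the actual savings depend on the data):

    require "json"
    require "msgpack"   # msgpack-ruby gem

    record = { "user_id" => 159784623, "name" => "Ann", "email" => "ann@example.com" }

    json   = record.to_json
    packed = record.to_msgpack     # compact binary encoding, key names preserved

    json.bytesize                  # baseline size of the JSON string
    packed.bytesize                # noticeably smaller; msgpack drops the JSON punctuation overhead

    MessagePack.unpack(packed)     # => the original hash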
I'm aware of msgpack, and I was under the impression that it doesn't handle UTF data very well; however, that was some time ago and I may be misremembering. Msgpack libraries are also supposed to be quite a bit faster at encoding/decoding, right?
Thanks, this is a good way to partition hashes; added a link to the post. When applying compression to real-life Redis data, I also usually see ballpark savings of around 50%.
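For what it's worth, a minimal sketch of compressing values before they reach Redis, using Zlib as one possible compressor (the thread doesn't say which compressor was actually used, and the file path is hypothetical):

    require "redis"
    require "zlib"

    redis = Redis.new
    json  = File.read("big_record.json")   # hypothetical JSON payload

    redis.set("data:159784623", Zlib::Deflate.deflate(json))

    Zlib::Inflate.inflate(redis.get("data:159784623"))   # => the original JSON string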
It's not clear whether the wall-clock savings were consistent over multiple runs.
And moving one lookup out of Redis into Ruby doesn't seem like the right thing to do; it increases complexity by requiring an application layer to process the data source.
I'd like to see how this compares with simply increasing list-max-ziplist-entries.
The times were consistent; clone the repo and run `rake bm` to try it yourself. max-ziplist can't be set on the order of millions, since Redis has to unzip the entry on every access and you wouldn't get any range-lookup boost. You can implement this in Redis Lua if app-layer separation is a concern.
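A sketch of what that server-side variant might look like, with the bucket arithmetic moved into a Lua script (the script and the "data" key prefix are my own illustration, not taken from the linked repo):

    require "redis"

    redis = Redis.new

    # Compute the bucket and remainder inside Redis so the app never sees them
    BUCKETED_HGET = <<~LUA
      local id          = tonumber(ARGV[1])
      local bucket_size = tonumber(ARGV[2])
      local remainder   = id % bucket_size
      local bucket_id   = (id - remainder) / bucket_size
      return redis.call("HGET", KEYS[1] .. ":" .. bucket_id, remainder)
    LUA

    redis.eval(BUCKETED_HGET, keys: ["data"], argv: [159784623, 1024])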
[1] http://instagram-engineering.tumblr.com/post/12202313862/sto...