Using that technique [1], I've personally had great success storing data in hashes by JSON-encoding it beforehand. We divide data into "buckets" of 1024 entries each; given an ID of 159784623, the resulting bucket ID is 156039 and the remainder is 687:
    id          = 159784623
    bucket_size = 1024
    remainder   = id % bucket_size
    bucket_id   = (id - remainder) / bucket_size
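As a rough sketch of what that key layout could look like with the redis-rb client (the key names and payload below are my own illustration, not necessarily the exact scheme described above):

    require "json"
    require "redis"   # redis-rb client

    redis = Redis.new
    id    = 159784623
    data  = { "name" => "example", "score" => 42 }.to_json   # hypothetical payload

    # Plain approach: one top-level key per record
    redis.set("data:#{id}", data)

    # Bucketed approach: one hash per 1024 IDs, with the remainder as the field
    bucket_size = 1024
    remainder   = id % bucket_size
    bucket_id   = (id - remainder) / bucket_size
    redis.hset("data:#{bucket_id}", remainder, data)
    redis.hget("data:#{bucket_id}", remainder)   # => the JSON string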
Using this, I've been able to reduce memory usage anywhere from 40% to 80% (yes, 80%), depending on the compressibility of the data (the length and randomness of each hash item).
I've also been replacing dictionary keys with integers, which reduces the size of the stored data by an additional ~30%.
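A minimal sketch of that integer-key substitution, assuming a made-up field mapping:

    require "json"

    # Hypothetical mapping from verbose key names to small integers
    KEY_MAP = { "user_id" => 0, "name" => 1, "email" => 2 }.freeze

    record   = { "user_id" => 159784623, "name" => "Ann", "email" => "ann@example.com" }
    shrunken = record.map { |key, value| [KEY_MAP.fetch(key), value] }.to_h

    shrunken.to_json   # => {"0":159784623,"1":"Ann","2":"ann@example.com"} -- shorter keys, same values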
These simple techniques shouldn't be underestimated; the gains can be considerable. JSON-encoded data is quite verbose, so switching to CSV may add considerable savings too, but JSON can accommodate missing keys.
Lists and sets can also accommodate "bucketing" of data; however, that comes with the added complexity of supporting the variety of Redis commands those data structures bring (BLPOP, SADD, SDIFF, etc.).
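To illustrate that trade-off for sets (key names are hypothetical, and the KEYS-based scan is only for brevity; in practice you'd track the bucket IDs yourself):

    require "redis"

    redis       = Redis.new
    bucket_size = 1024
    member_id   = 159784623
    bucket_id   = member_id / bucket_size   # integer division in Ruby

    # Adding stays simple: pick the bucket, then SADD into it
    redis.sadd("myset:#{bucket_id}", member_id)

    # But set-wide operations now fan out across every bucket;
    # a full SMEMBERS becomes a union over all bucket keys
    all_members = redis.keys("myset:*").flat_map { |key| redis.smembers(key) }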
The blog post you referenced was, I think, done by Instagram, not Tumblr. I remember reading it and doing something similar. They also found that doubling the `hash-max-ziplist-entries` option from 512 to 1024 didn't cause much CPU load.
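For context, that setting is the threshold below which Redis keeps a hash in its compact ziplist encoding; it can be changed in redis.conf or at runtime (the entry count below is just the value discussed here, and 64 is the usual default for the per-value cap):

    # redis.conf
    hash-max-ziplist-entries 1024
    hash-max-ziplist-value   64

    # or at runtime
    redis-cli config set hash-max-ziplist-entries 1024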
Something else you can try is encoding with msgpack rather than JSON. You can save around 30% relative to the original JSON string without losing the key names.
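A small sketch of that swap using the msgpack-ruby gem (the sample record is made up, and the actual savings depend on the data):

    require "json"
    require "msgpack"   # msgpack-ruby gem

    record = { "user_id" => 159784623, "name" => "Ann", "email" => "ann@example.com" }

    json   = record.to_json
    packed = record.to_msgpack     # compact binary encoding, key names preserved

    json.bytesize                  # baseline size of the JSON string
    packed.bytesize                # noticeably smaller; msgpack drops the JSON punctuation overhead

    MessagePack.unpack(packed)     # => the original hash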
I'm aware of msgpack, and I was under the impression that it doesn't handle UTF data very well; however, that was some time ago and I may be misremembering. Msgpack libraries are also supposed to be quite a bit faster at encoding/decoding, right?
Thanks, this is a good way to partition hashes; added a link to the post. When applying compression to real-life Redis data, I also usually see ballpark savings of around 50%.
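For what it's worth, a minimal sketch of compressing values before they reach Redis, using Zlib as one possible compressor (the thread doesn't say which compressor was actually used, and the file path is hypothetical):

    require "redis"
    require "zlib"

    redis = Redis.new
    json  = File.read("big_record.json")   # hypothetical JSON payload

    redis.set("data:159784623", Zlib::Deflate.deflate(json))

    Zlib::Inflate.inflate(redis.get("data:159784623"))   # => the original JSON string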
It's not clear whether the wall-clock savings were consistent over multiple runs.
And moving one lookup out of Redis into Ruby doesn't seem like the right thing to do; it increases complexity by requiring an application layer to process the data source.
I'd like to see how this compares with simply increasing list-max-ziplist-entries.
The times were consistent; clone the repo and run `rake bm` to try it yourself. max-ziplist can't be set on the order of millions, since Redis has to unzip the entry on every access and you wouldn't get any range-lookup boost. You can implement this in Redis Lua if app-layer separation is a concern.
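A sketch of what that server-side variant might look like, with the bucket arithmetic moved into a Lua script (the script and the "data" key prefix are my own illustration, not taken from the linked repo):

    require "redis"

    redis = Redis.new

    # Compute the bucket and remainder inside Redis so the app never sees them
    BUCKETED_HGET = <<~LUA
      local id          = tonumber(ARGV[1])
      local bucket_size = tonumber(ARGV[2])
      local remainder   = id % bucket_size
      local bucket_id   = (id - remainder) / bucket_size
      return redis.call("HGET", KEYS[1] .. ":" .. bucket_id, remainder)
    LUA

    redis.eval(BUCKETED_HGET, keys: ["data"], argv: [159784623, 1024])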
[1] http://instagram-engineering.tumblr.com/post/12202313862/sto...