> The investigation around the exception revealed that at-least one of the SSTable (Sorted String Table) rows was unordered, which caused the compaction operation to fail. SSTables are immutable files that are always sorted by the primary key
This was just... a bug in Cassandra? Is there anyone that can shed light on this? There seem to be plenty of people using Cassandra at scale -- is constantly repairing it normal practice?
Running regular incremental repairs is the norm, as nodes will from time to time have trouble talking to each other for real-world network reasons, or will go down for things like OS patching. We had a (daily) cron job for it. I come from the software side rather than the DBA side of things, but my main advice from running Cassandra at scale in production (it was part of an Apigee stack) is: don't, basically! It was not very reliable, consumed huge volumes of memory (especially during repairs), bandwidth (a repair is very chatty as it has to sync lots of data) and disk space (tombstoning meant deleted records took up space until compaction ran), was generally not much fun to manage, and it was difficult to hire people who knew much about it to do so. I would not build a solution using it myself going forward. We also had to periodically (weekly) do "full" repairs to work around Cassandra bugs, silent data corruption, etc.
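For context, that cron job was nothing fancy. A rough sketch of such a nightly wrapper in Python (hypothetical, not the actual script; assumes nodetool is on the PATH and that the script runs locally on each node from its own crontab entry):

    #!/usr/bin/env python3
    # Sketch of a nightly repair wrapper, invoked from cron on each node.
    # `-pr` repairs only the token ranges this node is primary for, so running
    # it on every node covers the whole cluster exactly once per cycle.
    import subprocess
    import sys

    result = subprocess.run(
        ["nodetool", "repair", "-pr"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        # surface failures so cron mails them / your alerting picks them up
        sys.exit(f"nodetool repair failed:\n{result.stderr}")

Whether incremental or full repair is the default differs between Cassandra versions, so check `nodetool help repair` on the version you run.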
We (as in, my company, not me myself) run large Cassandra clusters in the critical path of bank transaction processing (in the order of 2-25 million payments per day, each requiring a lot of database queries) and it's going pretty well...
Transaction in this case is not a database transaction, but a financial transaction (payment). Per payment, probably somewhere on the order of 50-100 database transactions (although Cassandra does not really have transactions, of course; interpret this as read/write actions) will be performed in the course of its processing. So that is roughly 1,875,000,000 database actions on busy days. Not a DBA, but for our purposes the scalability and availability of Cassandra work very well.
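(Back of the envelope, taking the top of those ranges:

    25,000,000 payments/day × ~75 read/write actions per payment ≈ 1.9 billion database actions/day

which is where that busy-day figure comes from.)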
Yes, other distributed databases like MongoDB, CockroachDB, probably a few others. Or even multi-master DB setups. As with Cassandra, you don't want to use them unless you really need that availability and can suffer the downsides. It seems pretty rare to actually need those availability guarantees, rather than, say, a robust fast-failover setup which might cancel some in-flight transactions. It is probably when you start looking at two-phase commit that you look for alternatives with better availability stories.
25 million per day might be a little high for SQLite, especially if they don't spread out evenly over the day. You also get no redundancy or replication, which you might want if you don't want your database to grind to a halt during backups.
Arguably Cassandra does sound like a weird choice, but we don't know the specifics of their setup. There are a lot of solutions presented on HN where SQLite and a Java application would have been the better choice, and you can say so for sure without knowing all the details; I feel like this one is past that point.
No need to add the extra bit - at the top end, 289 transactions per second is not something you'd probably want to choose SQLite for, but PG/MySQL/SQL Server would do that fine and require a lot less feeding (though any database with traffic or size needs some care).
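(That 289/s is just the 25 million averaged over a 24-hour day:

    25,000,000 / 86,400 s ≈ 289 transactions/second

real peaks would be higher if traffic bunches up during the day.)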
We run a Cassandra cluster in production and it's a pretty small cluster, yet all that you mentioned seems to resonate. We do use Cassandra Reaper to automate some of the tasks, but no one on the team wants to touch Cassandra in general.
Thanks for sharing your experience -- I know I've spent a lot of time in the past worrying about FS corruption, but generally expecting that the database sitting on top of it should never get corrupted, mostly because I use postgres so much.
I don't have the experience you do in this situation, but my first reaction to this was definitely "don't use Cassandra". But I also never really understood the use case where Cassandra shines as a solution either (it seems like only companies with a lot of data really get wins from it?)
Can recommend https://cassandra-reaper.io/ for most of the management stuff you're mentioning. Still not free though: running Cassandra requires (some) effort in my experience.
One wonders if Yelp could just run on a basic Postgres setup. It's not too much data, the data is relatively unimportant, and the traffic is modest and mostly from the US. How do the setups get so complicated?
About 6 years ago I was brought in as an Azure/operations consultant to help bid on a "project". Basically a sister company had built a webapp for a client, and we were to bid on hosting and migration away from the current fly-by-night hosting partner.
The specs called for a Kubernetes cluster and a Cassandra database cluster for production, and the same for testing. Everything about this project was presented as being highly complex. Digging into it with the client and the developers, we reduced it to: one Docker container running in Azure websites and CosmosDB on the backend, which could basically run on the free tier.
The whole thing held less than 3 GB of data and had maybe a few hundred requests per day. The client just loved the idea that they were special, dealing with massive amounts of data and requiring scalability to keep costs under control. The developers more or less just ran with it; they wanted to do Kubernetes, Cassandra sounded interesting, and now they had a client that would pay for it. Technically I suppose that both Kubernetes and Cassandra were reasonable choices, had they had 1000x the load, but given their market they were never going to grow beyond 10x on this particular solution.
We didn't get the contract. Our bid was insanely low (the savings apparently weren't worth the cost of bringing in a new contractor) and it delivered something different than what was asked for (fair enough).
True, but it's a complete waste if all you need is a single webapp and you're running it on something like Azure anyway which can just run that single Docker image.
In this case you had a potential zero-management environment, truly Cloud as it was meant to be, vs. managing Kubernetes, plus a database. You could go with managed Kubernetes (AKS) and a managed SQL Server, but why take on that cost?
Edit: Even AKS isn't truly zero management, you need to do at least some of the work for the upgrades, so instantly more management, something you need to do, something that adds to the operational cost.
So that is why I said to use GCE not Azure, of course Azure is extra work ;P use GKE.
I have personally never built an app that ran as a single Docker image. Usually I am at a minimum of 2. One server going down shouldn't take down prod once you have even 1 paying customer.
Yelp's main database was a couple of MySQL clusters (several TB of total data) for almost all of the company's existence, but they've always had a penchant for architecture astronomy. The ad system had the most justification for novel data architectures historically (when I was there, it was generating and processing tens of terabytes of data per day via Hadoop/EMR, which was new and cool, and I'm sure it's at least an order of magnitude more than that now).
Yelp was founded in 2004; 2004 Postgres was different from 2024 Postgres, and storage and server options weren't as powerful as today either.
There were half a million restaurants back then or thereabout, and they wanted to store ratings and comments.
That is not something you'd be able to put on one single database in 2004.
Why they didn't simplify afterward I can't imagine, but I can see how a business directory at that age looked at the numbers and went "yeah, not in a single database".
> That is not something you'd be able to put on one single database in 2004.
Sorry, but it was, plenty of companies had much larger databases back then running plain old master slave replication.
But even if it wasn't: just don't run it all on one db? Nothing says you need to be able to join on the restaurant table and the comments table. Put them on different servers. That's all you're doing with Cassandra anyway.
If you're not just running it all on one db, why do you have to have all of your different DBs be on the same DB software?
I used to work at Yelp, and at least at that time, a big use case of Cassandra was basically for what were essentially materialized views created from log data. At that point in time, MySQL or logs were the "sources of truth", but there were enough transactions going on that it made sense to have things like Cassandra around too for some of the other use cases.
Postgres didn't ship built-in replication until 2010, and prior replication solutions like Slony were not options I would have enjoyed building a business around.
MySQL master slave replication has failover, so it is technically highly available, especially if you have multiple slaves. It was also best practice back then so this argument is kind of weird.
Also, whether you need a highly available database for a company like yelp is an implementation detail.
Oh no, the database is down, how will Jonny decide where to go for lunch now?!
I mean, for a start Cassandra wasn't available for years after that date, but either way, whatever Postgres was like back then Cassandra was much more of a piece of shit. Smells like implementing what the cool kids were doing to me.
Agree, unless you were Facebook you at best heard it was cool. Everyone I knew who tried it out immediately went "blecchhh" and switched off or left after they got their company to adopt the technology, heh.
Yeah, sounds like they liked it because of the infinitely scalable storage and high write throughput, but they don't really get into whether they were having issues with all that on MySQL. Also, they use the term “analytics” a lot, but it seems to be a different use case than typical OLAP workloads, since they were serving this data to customers ad hoc.
Scaling writes on MySQL/Postgres is a huge PITA once you hit the point where a single master on a chonky server can't handle it anymore and you lose most of the benefits of being on a relational db.
In my experience, 100% of the time, the use of NoSQL (excluding Redis) has nothing to do with the RDBMS not being good enough; rather, it "solves" some other non-technical issue. Examples:
1. In one case, the developers hated the DBAs, and MongoDB was outside the DBAs' scope.
2. A developer started using CouchDB so they could include it on their curriculum (vitae).
I have >5 years of experience using Cassandra in production, involving thousands of clusters storing petabytes of data. My conclusion from that time is that Cassandra is simply not robust enough to be a general purpose database (the team are working on it but they're coming from a really rough starting place) - there are lots of ways to cause data corruption, and Cassandra does enough dynamic repairing that it can be hard to catch this before your backups are dropped due to time windowing. Unfortunately, the juice may still be worth the squeeze - Cassandra's storage model lends itself very nicely to disaster recovery workflows in a way which something like Oracle or FoundationDB does not (and it's Cassandra so you'll need it!), while the ability to horizontally scale gets you out of so many operations issues. If you've got a schema which works well in Cassandra, you've probably solved a lot of the issues you might have.
Example of fairly standard Cassandra bug (don't know if present on latest release, certainly was a year or two ago): When you add a new node to the cluster, it 'bootstraps', where it copies ~1/n the data from other nodes. When you are done bootstrapping, it's copied a bunch of data from other nodes, but the other nodes still contain that data. You then run 'cleanups' on the other nodes to remove the (now stale and unusable) data so as to get your disk space back.
If you accidentally run a cleanup on the new node as it is being bootstrapped, it will succeed, you will delete all the data that's been copied over so far, and Cassandra will _not_ terminate the bootstrap. Everything will be green, but your new node will suddenly be using 0 disk space. When the bootstrap finishes, possibly days later, your cluster will be immediately corrupted due to violated replication guarantees - but only on data that hasn't been read or written over that period, because if it was written it'll be re-replicated, and if it was read Cassandra will silently repair at this time. Repairs resolve the issue, but if you've made this mistake due to scripting, if you get unlucky it's possible to just delete all replicas of some data between repairs.
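A cheap guard against that class of mistake, sketched in Python (hypothetical script, not something from the article; it simply refuses to run cleanup unless nodetool status reports the local node as UN, i.e. Up/Normal rather than Up/Joining, and its parsing of the status output is simplified):

    #!/usr/bin/env python3
    # Sketch: only run `nodetool cleanup` once this node has finished joining.
    import socket
    import subprocess
    import sys

    status = subprocess.run(["nodetool", "status"],
                            capture_output=True, text=True, check=True).stdout
    my_ip = socket.gethostbyname(socket.gethostname())  # crude; adjust for your setup

    state = None
    for line in status.splitlines():
        parts = line.split()
        # Data rows look like: "UN  10.0.0.12  105.3 GiB  256  ..." (state, address, ...)
        if len(parts) > 1 and parts[1] == my_ip:
            state = parts[0]

    if state != "UN":  # e.g. UJ = up and still joining/bootstrapping
        sys.exit(f"refusing to run cleanup: node state is {state!r}, not UN")

    subprocess.run(["nodetool", "cleanup"], check=True)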
Example of other Cassandra bug (again, might be outdated): Cassandra nodes identify themselves on startup by IP, and the owned token ranges are not persisted; they're streamed from other nodes in the cluster. If you've deployed your Cassandra in K8s and you reboot multiple nodes in one go and they swap IPs upon reboot, you may now find yourself in a split-brain situation in which nodes magically forget they own certain data ranges and think they own each other's data (or maybe it's that the nodes still think they own the right ranges but other nodes think they own the wrong ranges). Wasn't close enough to fully debug that one.
It's a mess. I would seek to avoid problem spaces where I might need to use it again, though if I by chance ended up in a space where it made sense, I probably wouldn't avoid the tech.
Thanks for this insight -- this is one of the first times I've heard of someone contrasting FoundationDB and Cassandra, which is nice.
> Example of fairly standard Cassandra bug (don't know if present on latest release, certainly was a year or two ago): When you add a new node to the cluster, it 'bootstraps', where it copies ~1/n the data from other nodes. When you are done bootstrapping, it's copied a bunch of data from other nodes, but the other nodes still contain that data. You then run 'cleanups' on the other nodes to remove the (now stale and unusable) data so as to get your disk space back.
Interesting, seems like there are a bunch of little things like this you need to know to run the service properly... Managed Cassandra has more added value to provide, I guess.
> If you accidentally run a cleanup on the new node as it is being bootstrapped, it will succeed, you will delete all the data that's been copied over so far, and Cassandra will _not_ terminate the bootstrap. Everything will be green, but your new node will suddenly be using 0 disk space. When the bootstrap finishes, possibly days later, your cluster will be immediately corrupted due to violated replication guarantees - but only on data that hasn't been read or written over that period, because if it was written it'll be re-replicated, and if it was read Cassandra will silently repair at this time. Repairs resolve the issue, but if you've made this mistake due to scripting, if you get unlucky it's possible to just delete all replicas of some data between repairs.
This seems... really bad -- I don't think I have the skill to run a Cassandra cluster (and not enough use cases to run it as a hobby to find these edges)...
This sounds like the space for a consultancy to make a tidy killing though.
Love reading this, it's a story about how a back of house team got to have an impact on the business in a more immediate way than they usually get to. Data teams have a longer time horizon for when they succeed or fail than the glamorous lives of backend - let alone frontend! - engineering teams, which makes it harder to recognize the wins in the moment.
To any Yelp data engineers who might happen to read - good work, and it's a good testament to the platform you provide.
Despite best efforts, the shard and all of its data was lost. However due to their forward-thinking strategy it was limited to only 3-star and lower reviews for Yelp Premium clients!
Can second that this is a very solid and reliable way to do any kind of migration, not just recovering from corruption. When possible, being able to insert your own code in the middle of the transfer and shunt bad data to a dead-letter queue is a lifesaver, because the last thing you want is for the transfer to get stuck when it hits a pothole.
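The shape of that kind of copy loop, as a rough sketch (all of the names here, read_rows, write_row, dead_letter and transform, are placeholders for whatever your pipeline actually provides):

    # Sketch of a transfer loop that parks bad rows in a dead-letter queue
    # instead of aborting the whole migration. Everything here is a placeholder.
    def transform(row):
        # validation / schema mapping for the new cluster; raise on bad data
        return row

    def migrate(read_rows, write_row, dead_letter):
        copied = failed = 0
        for row in read_rows():
            try:
                write_row(transform(row))
                copied += 1
            except Exception as exc:
                # corrupt or unparseable row: record it and keep the transfer moving
                dead_letter(row, reason=str(exc))
                failed += 1
        return copied, failed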
Yelp: the mafia of restaurant protection rackets. I don't care if they released a free quantum computer or solved world hunger because they're still making a living off harming small businesses.
Not just restaurants. They attempted to extort my farm after I deleted my account. I kinda wish that the cluster had been lost; that would have been the most beneficial outcome for society.
What is a bit unclear, and isn't addressed in the original blog post, is how they actually were able to successfully scan over all the data and copy it to the new cluster in spite of the corrupt SSTables. CDC would not handle backfilling historical data. I would expect that to fail for certain token ranges. Perhaps they ended up deciding to discard them?
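(For what it's worth, a full copy like that is usually done by walking the token ring rather than via CDC, which is what makes it possible to retry, or consciously discard, the ranges that fail. A rough sketch with the Python driver, assuming a hypothetical table with primary key id and placeholder hosts/keyspace:)

    from cassandra.cluster import Cluster  # pip install cassandra-driver

    # Sketch: scan a table in token-range chunks so failing ranges can be
    # retried or skipped individually instead of killing the whole copy job.
    session = Cluster(["old-cluster-host"]).connect("my_keyspace")

    MIN_TOKEN, MAX_TOKEN = -2**63, 2**63 - 1  # Murmur3Partitioner token space
    STEP = 2**56                              # arbitrary chunk size for illustration

    failed_ranges = []
    start = MIN_TOKEN
    while start < MAX_TOKEN:
        end = min(start + STEP, MAX_TOKEN)
        try:
            rows = session.execute(
                "SELECT * FROM my_table WHERE token(id) > %s AND token(id) <= %s",
                (start, end),
            )
            for row in rows:
                pass  # write the row to the new cluster here
        except Exception:
            failed_ranges.append((start, end))  # revisit, repair or discard later
        start = end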
https://engineeringblog.yelp.com/2023/01/rebuilding-a-cassan...