Every once in a while I find an article so well written that it walks me through a complex technical situation as if I were a layman, while also meaningfully expanding my understanding of a tool I use every day. This is definitely one of those, and I just want to thank the author(s) for taking the time to write it up. It was interesting and enlightening.
> There was a long-running transaction, usually relating to PostgreSQL's autovacuuming, during the time. The stalls stopped quickly after the transaction ended.
What about the other end?
Why does vacuum need to be a long-running transaction, and why can't it be cut into shorter transactions?
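For what it's worth, here is a rough sketch of how I'd look for those long-running (auto)vacuum transactions in pg_stat_activity; the five-minute threshold is arbitrary, and I'm assuming autovacuum workers report xact_start for the table they're currently working on:

```sql
-- Long-running backends (including autovacuum workers), oldest transaction first.
SELECT pid,
       backend_type,
       now() - xact_start AS xact_age,
       state,
       left(query, 60) AS query_snippet
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND now() - xact_start > interval '5 minutes'
ORDER BY xact_age DESC;
```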
My takeaway is that they were testing the limits of PostgreSQL's capabilities, and then reverted the change in a mad dash.
This would have been an awesome opportunity for GitLab to show how OSS they are and fund a PostgreSQL developer so that GitLab could pursue these boundary-pushing designs.
> This would have been an awesome opportunity for GitLab to show how OSS they are and fund a PostgreSQL developer so that GitLab could pursue these boundary-pushing designs.
As you can read in the article, they also hired Nikolay Samokhvalov (a PostgreSQL contributor) to take a look at the problem. But as you can also read, the solution wasn't with PostgreSQL. The solution was modifying their application.
> But as you can also read, the solution wasn't with PostgreSQL. The solution was modifying their application.
They're also submitting patches to improve subtransaction support in future versions of PostgreSQL (likely v15). This is them doing the right thing for the project.
That Nikolay was hired for this does not come across in the article.
If this is true, my criticism is even more pertinent, as it could have been addressed simply in the writing of the article. It sounds like GitLab already did the thing I recommended; they just didn't highlight it properly.
This isn't directed at you, but HN does not take criticism or alternate viewpoints well. The article is great on many levels. I gave feedback on how they could improve; the real damage is done when someone doesn't care enough to comment.
The conclusion section is lacking in meta-analysis and in asking why GitLab hit this at all. Given the LoE that went into this, GitLab missed an opportunity here. If I were the senior manager on this project, I would want a retrospective on why this was hit in the first place, not to find someone to blame, but to understand what about the design, the technology selection, the monitoring, the testing, etc. contributed to the outcome we saw.
SubtransControlLock indicates that the query is waiting for PostgreSQL to load subtransaction data from disk into shared memory.
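For anyone wanting to check their own systems, here is a sketch of spotting these waits in pg_stat_activity; as far as I know the wait event was renamed to SubtransSLRU in newer PostgreSQL releases, so the query matches both names:

```sql
-- Sessions currently waiting on the subtransaction SLRU lock.
-- The wait event is 'SubtransControlLock' on older releases and
-- 'SubtransSLRU' on newer ones, so match both.
SELECT wait_event, count(*) AS waiters
FROM pg_stat_activity
WHERE wait_event IN ('SubtransControlLock', 'SubtransSLRU')
GROUP BY wait_event;
```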
I felt the article fell down for two reasons:
(1) It didn't really articulate the need for transactions in the first place (database integrity), nor did it discuss this change's implications for integrity.
(2) It didn't articulate the possibilities of other architectures (pushing to a read cache outside PostgreSQL, such as Cassandra).
I got the feeling they were really pushing PostgreSQL to its limits in their cluster with their load, and it was time to consider another design.
1) It's an article talking about optimizing transactions; I think it's fair to assume prior knowledge of what a transaction is. The author is not responsible for rebuilding the world.
2) They have a very specific, narrow problem, which they solved in two ways. What you're suggesting is several orders of magnitude more work and would likely take years.
I don't think there was sufficient room in the article to cover the value of sub-transactions, let alone transactions as a whole. In this case I think it's fair for it to be assumed knowledge given the contents of the article.
They are definitely not pushing the limits of PostgreSQL as a whole; they just hit an edge case in a lightly used feature. Sub-transactions aren't overly common in most PostgreSQL workloads I have encountered, especially not coupled with long-running transactions.
Great write-up though! I do hope the patches to make the sub-transaction cache configurable land; while it's not something I have run into, it would be a great lever to have if I do.
I don't know what gave you the impression they should consider another design. They hit a buggy corner case that had straightforward workarounds. Suggesting that this implies a need to rearchitect their system seems like a drastic overreaction.
It says they had sub-transactions, not what caused them. Nothing in the article gives concrete details on the specific code or logic that was contributing to the issue. The entire "Why would you use a SAVEPOINT?" section should have been replaced with "Why we were using SAVEPOINTs".
Personally, I was hoping the post would dig into a specific example and explain why transactions were used in the first place (justification), or why removing transactions or sub-transactions had no negative impact on data integrity.
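For anyone unfamiliar with the mechanism, this is roughly the pattern that creates subtransactions: each SAVEPOINT inside a transaction opens one, and ORM nested transactions typically emit exactly this under the hood (the events table here is made up for illustration):

```sql
BEGIN;
INSERT INTO events (payload) VALUES ('first row');

-- Each SAVEPOINT starts a subtransaction (it gets its own XID once it writes).
SAVEPOINT attempt_1;
INSERT INTO events (payload) VALUES ('row that might violate a constraint');
-- On failure, only the subtransaction is rolled back; the outer work survives.
ROLLBACK TO SAVEPOINT attempt_1;

INSERT INTO events (payload) VALUES ('second row');
COMMIT;
```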