It’s Friday night. You’re wrapping up the workweek, packing your backpack (or quitting Slack) and closing your laptop when you suddenly get paged: “user created without welcome notification.” You sigh, unpack, crack your knuckles, and start looking through logs.
Constant data integrity violations from writes to the database plagued us. Having just completed a long but highly successful, zero-downtime migration from NoSQL to PostgreSQL, we had all of the read capabilities of a relational database without any of the write safety. In this case, we have a code-path where a user is created and then a welcome notification and email are sent to them. After digging through the logs, we found that the user had been created, but the notification write had failed.
At Merit, data-invariant violations were too common for comfort while we were on NoSQL, and once we moved to a relational database we had the opportunity to address these issues at the database level. Our data-invariant debugging playbooks could be retired, and we could spend our Friday evenings knowing we wouldn’t be writing high-risk data integrity migrations.
Enter Transactional Boundaries
After that long, arduous, but ultimately successful (zero downtime, WOOT) migration from Google Cloud Datastore (NoSQL) to Postgres, we were ready to begin tackling our data integrity violations with transaction boundaries. A transaction boundary is where a transaction begins or ends; within the transaction, all writes to the database are atomic: they either all complete, or are all reverted if any single write in the transaction fails.
So in the example above, we could now write code like the following:
- Begin Transaction
- Write user
- Write notification
- End Transaction (Commit to the database OR Rollback)
This guarantees that if the notification fails to write, the user write is reverted as well, allowing us to ensure that interdependent models can be written without data integrity violations.
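The steps above can be sketched as follows. This is a minimal illustration using Python’s built-in sqlite3 module as a stand-in for a Postgres client; the schema and the `create_user_with_notification` helper are illustrative, not our actual code.

```python
import sqlite3

def create_user_with_notification(conn, email, fail_notification=False):
    with conn:  # begins a transaction; commits on success, rolls back on error
        conn.execute("INSERT INTO users (email) VALUES (?)", (email,))
        if fail_notification:
            raise RuntimeError("notification write failed")
        conn.execute(
            "INSERT INTO notifications (email, kind) VALUES (?, 'welcome')",
            (email,),
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT)")
conn.execute("CREATE TABLE notifications (email TEXT, kind TEXT)")

# Happy path: both writes commit together.
create_user_with_notification(conn, "a@example.com")

# Failure path: the notification write fails, so the user write rolls back too.
try:
    create_user_with_notification(conn, "b@example.com", fail_notification=True)
except RuntimeError:
    pass

print(conn.execute("SELECT COUNT(*) FROM users").fetchone()[0])  # 1
```

With the transaction boundary in place, the half-written state from the opening anecdote (a user with no welcome notification) can no longer be committed.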
We knew we were prone to these violations in code-paths all over our platform, and we wanted to unlock this functionality without bringing velocity in our mono-repo to a screeching halt. To fully utilize transaction boundaries, every code-path needed to be changed to support them, so we set out to split the work across the engineering organization in a way that allowed us to maintain velocity.
The Shared Connection
First, we needed to refactor our database layer. All of our services share a single database interface, and we needed a way to expose read and write interfaces with transaction boundaries. So we added a version of our database interface that uses transaction boundaries, allowing any code-path to implement either the old or the new interface.
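One way to picture the two coexisting interfaces, sketched here with sqlite3 and hypothetical names (`Store`, `TxStore`) that are not our actual API:

```python
import sqlite3
from contextlib import contextmanager

class Store:
    """Legacy interface: each write commits on its own."""
    def __init__(self, conn):
        self.conn = conn

    def write_user(self, email):
        with self.conn:  # per-write commit
            self.conn.execute("INSERT INTO users (email) VALUES (?)", (email,))

class TxStore(Store):
    """Transactional variant: the caller owns the boundary."""
    @contextmanager
    def transaction(self):
        with self.conn:  # commit on success, rollback on any exception
            yield self

    def write_user(self, email):
        # No per-write commit; the surrounding transaction() decides.
        self.conn.execute("INSERT INTO users (email) VALUES (?)", (email,))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT)")

store = TxStore(conn)
with store.transaction() as s:
    s.write_user("a@example.com")
```

Because both interfaces share the same connection, a code-path could adopt the transactional variant while its neighbors stayed on the legacy one.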
Pushing the Boundaries
Next, we needed to define the boundaries for each of our code-paths. While we had many code-paths hit by various clients, we narrowed them down to four primary resources, each aligning with a specific deployment: our GraphQL server, our subscriber server(s), our REST API server, and lastly a cron-like server that executes tasks on a schedule.
Since we are in a mono-repo, many code-paths are shared across services (for example, the database layer is used by many other services), so we needed to ensure that refactoring one code-path end-to-end (including all downstream paths) wouldn’t necessitate refactoring every upstream service. To achieve this, we built an “escape hatch” that lets developers add a transaction boundary around only the code-paths they are refactoring.
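The escape hatch might look something like the decorator below, again sketched with sqlite3; the `transactional` name and `refactored_signup` code-path are hypothetical, not our actual implementation.

```python
import functools
import sqlite3

def transactional(conn):
    """Wrap a single refactored code-path in one transaction boundary.
    Unwrapped (legacy) code-paths continue committing per write."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            with conn:  # commit on success, rollback on any exception
                return fn(*args, **kwargs)
        return inner
    return wrap

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT)")

@transactional(conn)
def refactored_signup(email):
    # Both writes now live inside one boundary; callers need no changes.
    conn.execute("INSERT INTO users (email) VALUES (?)", (email,))
    conn.execute("INSERT INTO users (email) VALUES (?)", (email + ".audit",))

refactored_signup("a@example.com")
```

The key property is that upstream callers invoke `refactored_signup` exactly as before; the boundary is opt-in at the code-path being refactored, not imposed on the whole call graph.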
After implementing transaction boundaries in some of our high-traffic code-paths, we noticed immediate improvements.

First, significant quality-of-life wins. We were able to delete a large amount of code built around our lack of transactions: various lock services, and even tables dedicated to establishing strongly consistent writes, were removed, and our database layer was significantly simplified. We got rid of an entire custom-built lock service for this use-case alone, removing over 400 LOC and a service dependency. It’s hard to describe how good it feels to delete code you wished you didn’t have to write in the first place.

As part of the effort, Pub/Sub messages were collected and published only after our transactions were committed, without the developer overhead of tracking transaction boundaries. Because messages were batched into a single publish after each transaction, our data processing pipeline’s throughput tripled, going from 1 million rows per hour to 1 million rows every 20 minutes. CPU pressure on our database halved because we made significantly fewer round-trips per request, and for the same reason we were able to scale horizontally further and utilize more database connections simultaneously. Our code quality improved in our most high-impact business logic, as we no longer had to hack around the lack of transactions to ensure data integrity.
It’s Friday night. You quit Slack and are about to close your laptop when the familiar instinct kicks in. You haven’t received a PagerDuty alert in a while, but the habit is hard to break. You open error reporting just in case… and nothing. No database invariant violations, no DB CPU spikes, no customer service requests describing “another weird data issue” you could’ve sworn was impossible. You sigh in relief. Transaction boundaries are here to help.