Today, I read a paper titled Lessons from Giant-Scale Services. This paper is written by Eric Brewer, the guy behind CAP theorem. It is an old paper published in 2001. The paper helps the reader build mental model on how to think about availability of large scale distributed systems.
Welcome to the fourth post of lessons I learnt (LIL) series. I had a busy last week where I was trying to manage multiple things at the same time. I am not good at multitasking so at times during the last week it became stressful and difficult to keep check on all the items on my plate. But, with patience and better planning I manage to get things done. There are two lessons that I want to share this week. They help me scale better and get things done.
> When you give the best you have to someone in need, it translates into something much deeper to the receiver. It means they are worthy.
> If it’s not good enough for you, it’s not good enough for those in need either. Giving the best you have does more than feed an empty belly—it feeds the soul.
Calendar Versioning: 10 mins read. CalVer is a versioning convention based on your project’s release calendar, instead of arbitrary numbers.
Doing a database join with CSV files: 10 mins read. xsv is a tool that you can use to join two CSV files. The author shows examples of inner join, left join, and right join. Very useful indeed.
Ordered data structures perform much better when n is large. With hash based collections, one collision can cause O(n) performance. Range queries becomes O(n) if implemented using hash tables
Ordering helps in indexes and we can reuse one index in multiple ways. With hash tables, we have to implement separate indexes
Ordered collection achieve locality of reference.
Xor Filters: Faster and Smaller Than Bloom Filters: 15 mins read. In this post, author talks about Xor filters to solve problems where you need to check whether an item exist in cache or not. Usually we solve such problems using a hash based collection but this can be solve using Xor filters as well. Xor filters take a bit longer to build, but once built, it uses less memory and is about 25% faster. Bloom filters and cuckoo filters are two other common approaches to solve these kind of problem as well.
Like GitHub, Shopify, and Airbnb, DigitalOcean began as a Rails application in 2011. The Rails application, internally known as Cloud, managed all user interactions in both the UI and public API. Aiding the Rails service were two Perl services: Scheduler and DOBE (DigitalOcean BackEnd). Scheduler scheduled and assigned Droplets to hypervisors, while DOBE was in charge of creating the actual Droplet virtual machines. While the Cloud and Scheduler ran as stand-alone services, DOBE ran on every server in the fleet.
For four years, the database message queue formed the backbone of DigitalOcean’s technology stack. During this period, we adopted a microservice architecture, replaced HTTPS with gRPC for internal traffic, and ousted Perl in favor of Golang for the backend services. However, all roads still led to that MySQL database.
By the start of 2016, the database had over 15,000 direct connections, each one querying for new events every one to five seconds. If that was not bad enough, the SQL query that each hypervisor used to fetch new Droplet events had also grown in complexity. It had become a colossus over 150 lines long and JOINed across 18 tables.
When Event Router went live, it slashed the number of database connections from over 15,000 to less than 100.
Unfortunately, removing the database’s message queue was not an easy feat. The first step was preventing services from having direct access to it. The database needed an abstraction layer.
Now the real work began. Having complete control of the event system meant that Harpoon had the freedom to reinvent the Droplet workflow.
Harpoon’s first task was to extract the message queue responsibilities from the database into itself. To do this, Harpoon created an internal messaging queue of its own that was made up of RabbitMQ and asynchronous workers. As of this writing in 2019, this is where the Droplet event architecture stands.
The first thing to consider is agility—cloud services offer significant advantages on how quickly you can spin infrastructure up and down, allowing you to concentrate on creating value on the software and data side.
But the flip side of this agility is our second factor, which is cost. The agility and convenience of cloud infrastructure comes with a price premium that you pay over time, particularly for “higher level” services than raw compute and storage.
The third factor is control. If you want full control over the hardware or network or security environment that your data lives in, then you will probably want to manage that on-premises.
Tools I discovered this week
Broot: It is a CLI tool that you can use to get an overview of a directory, even a big one. It is written in Rust programming language. I use it as an alternative to tree command.
xsv: It is a CLI tool for working with CSV files. It can concatenate, count, join, flatten, and many other things. It is Swiss army tool for CSV. It is written in Rust programming language.