Friday, September 19, 2008

Digg Database Architecture

by timeless


Welcome back to the Digg technology blog, where you get to read about what the tech people at Digg are thinking about. Let’s get right into it, shall we?

How Many Databases Does Digg Have?

As Joe hinted in his earlier blog entry, the particulars of how many machines Digg has are among the most frequently asked questions, and yet among the least relevant to Digg employees. None of us directly responsible for putting machines into our production cluster bothers to know the answer to this question, myself included.

The simple answer, which seems flippant, is that we have enough to do what Digg does. But I suppose you deserve a more detailed answer. So let’s go into it a little, shall we?

We have about 1.8x to 2.5x the theoretical minimum number of machines required to run Digg. Operations mandates that we want 2x the server capacity required to run Digg, so this makes sense. But why the spread? Let’s go into some of the reasons as they pertain to the databases.

Load deltas can be caused by mistakes or brilliance on the part of Digg employees: bugs or fixes from the developers, for example, or improved indexing or a bad I/O subsystem layout from the DB team. Although it’s fair to say the improvements far outnumber the mistakes, both exist and both must be dealt with (feel free to ponder the various ways performance improvements must be “dealt with”).

Sometimes Digg performance is changed when we enable or disable a particular costly feature of Digg. The DB team must be prepared for the human element of performance. Sometimes a feature is considered valuable enough to keep even though it causes what may appear to be an undue strain on the databases. At some time in the future, the feature may become so system-intensive that the hard decision is made to cut it. Until that point, the systems can be “overloaded,” and more importantly, once the feature is disabled, the systems become underutilised.

Pool utilisation can also change when Digg members, as a social group, stop using one feature or start using another, even though Digg itself made no change.

All such factors mean that the number of machines deployed in our production cluster at any given moment may be more or fewer than we would like, but if we’re careful, it is always more than the theoretical minimum required to run Digg, even during spikes in load.

Database Pools

At the highest level, you can think of the Digg databases as a four-master set of clusters. We shall call them A, B, C, and D. Two of the masters (masters A and B) are masters only, and two (masters C and D) are slaves of one of the other masters (A).
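
To make that topology concrete, here is a minimal sketch of the replication relationships just described; the array layout and key names are my own invention for illustration, not Digg’s actual configuration.

    <?php
    // Illustrative only: the four masters and who replicates from whom.
    // Key names and structure are assumptions, not Digg's real config.
    $clusters = array(
        'A' => array('replicates_from' => null),  // the original master, master only
        'B' => array('replicates_from' => null),  // master only
        'C' => array('replicates_from' => 'A'),   // master that is also a slave of A
        'D' => array('replicates_from' => 'A'),   // master that is also a slave of A
    );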

Each cluster has several slaves in it. I shall call a grouping of slaves a “pool.” At the highest level, you can treat all the slaves in any one of the pools under masters B through D as equivalent, but the slaves in the pool underneath master A are special. This is for historical reasons. So let’s dig into history, shall we?

The original Digg database was designed as a single monolithic DB, and additional capacity was created by adding slaves. As Digg grew, we added more database clusters (B, C, D) into the mix. This is your classic scaling via distribution of writes, but the original database cluster (A) retained most of the read/write throughput. Scaling it has required a bit more ingenuity, as its slaves have historically had the most disk contention.

At first, to get more cache hits on the slaves in A’s DB pool while still keeping all tables on each slave, the slaves were split into subpools, call them A_alpha and A_beta. All queries sent to the slaves of A were given descriptors, and the database access layer was given a mapping of descriptors to subpools. Thus queries that mostly hit tables M and N could be sent to A_alpha, and those that mostly hit I and J could be sent to A_beta. Hence the index and data pages for M, N, I, and J would most likely be in RAM on their respective database slaves.
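
As a rough illustration of the descriptor idea, a sketch like the following captures the mapping; the descriptor names and hostnames are invented, and Digg’s real access layer is of course more involved.

    <?php
    // Hypothetical mapping of query descriptors to subpools of master A.
    // Descriptor names and hostnames are made up for illustration.
    $descriptorToSubpool = array(
        'read_mostly_M_N' => 'A_alpha',   // queries hitting mostly tables M and N
        'read_mostly_I_J' => 'A_beta',    // queries hitting mostly tables I and J
    );

    $subpools = array(
        'A_alpha' => array('dbs01', 'dbs02', 'dbs03'),
        'A_beta'  => array('dbs04', 'dbs05'),
    );

    function pickSlave($descriptor, $descriptorToSubpool, $subpools) {
        $pool   = $descriptorToSubpool[$descriptor];
        $slaves = $subpools[$pool];
        // Always sending a descriptor to the same subpool keeps that subpool's
        // index and data pages hot in RAM on its slaves.
        return $slaves[array_rand($slaves)];
    }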

This worked rather well for some time until the write load on A_[alpha|beta] became too intense, and further optimisations were required. These include dropping indexes and tables that aren’t needed in their respective subpools.

If you’re a MySQL DBA running MySQL 5.0 or lower, you know that there isn’t a simple report MySQL will generate to show you a list of the indexes or tables actually used in a database. The assumption is that if your company created a table or index, it will always be in use.

The radical changes to Digg’s front-end architecture over the past several years mean that isn’t true. We know there are tables and indexes we don’t use in our A_alpha and A_beta subpools. So to determine what could be dropped from these two subpools, we analysed the dbmon.pl output (dbmon.pl is covered in some detail in the section on Database Overload, below) using getServerIndexes.pl. Basically it does an EXPLAIN on every query on the slave to generate a list of tables and indexes used (special note: I didn’t use the word “exhaustive” in the preceding sentence). If you use this tool, be sure you read the caveats in the script comments! As with everything high volume, nothing is ever simple.
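
The real getServerIndexes.pl is a Perl script with its own caveats; purely to illustrate the approach, a stripped-down PHP version of the same idea might look like this (hostnames, credentials, and the query-log file are placeholders).

    <?php
    // Sketch of the idea only, not the real getServerIndexes.pl: run EXPLAIN
    // on a sample of captured SELECTs and record which tables and indexes the
    // optimiser says it would use. The resulting list is NOT exhaustive; it is
    // only as good as the query sample.
    $db = new mysqli('db-slave', 'user', 'pass', 'digg');   // placeholder credentials

    $usedTables  = array();
    $usedIndexes = array();

    foreach (file('captured_queries.sql') as $query) {       // placeholder query log
        $query = trim($query);
        if (stripos($query, 'SELECT') !== 0) {
            continue;                                         // only SELECTs can be EXPLAINed here
        }
        if (!($result = $db->query('EXPLAIN ' . $query))) {
            continue;
        }
        while ($row = $result->fetch_assoc()) {
            if (!empty($row['table'])) {
                $usedTables[$row['table']] = true;
            }
            if (!empty($row['key'])) {
                $usedIndexes[$row['table'] . '.' . $row['key']] = true;
            }
        }
    }

    // Tables and indexes absent from these lists are candidates for dropping.
    print_r(array_keys($usedTables));
    print_r(array_keys($usedIndexes));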

Database Access Layer

The Digg database access layer is written in PHP and lives at the level of the application server (Apache). Basically, when the application decides it needs to do a read query, it passes off the query with a descriptor to a method that grabs a list of servers for the database pool that can satisfy the query, then picks one at random, submits the query, and returns the results to the calling method.

If the server picked doesn’t respond within a very small amount of time, the code moves on to the next server in the list. So if MySQL is down on a database machine in one of the pools, the end-user of Digg doesn’t notice. This code is extremely robust and well-tested. We don’t worry that shutting down MySQL on a read slave in the Digg cluster, or a missed alert on a DB slave that dies, will cause site degradation.
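
A minimal sketch of that read path, assuming a hypothetical getPoolForDescriptor() lookup and placeholder credentials (Digg’s actual code is more thorough), looks roughly like this:

    <?php
    // Rough sketch of the read path: pick a random slave from the pool for
    // this descriptor, and fail over quickly if it doesn't respond.
    function readQuery($descriptor, $sql) {
        $servers = getPoolForDescriptor($descriptor);    // hypothetical lookup
        shuffle($servers);                               // random pick, with a natural failover order

        foreach ($servers as $host) {
            $db = mysqli_init();
            $db->options(MYSQLI_OPT_CONNECT_TIMEOUT, 1); // give up fast on a dead machine
            if (!@$db->real_connect($host, 'user', 'pass', 'digg')) {
                continue;                                // slave is down; the user never notices
            }
            if (($result = $db->query($sql)) !== false) {
                $rows = array();
                while ($row = $result->fetch_assoc()) {
                    $rows[] = $row;
                }
                return $rows;
            }
        }
        return false;   // every slave in the pool failed; callers must handle this
    }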

Every few months we consider using a SQL proxy to do this database pooling, failover, and balancing, but it’s a tough sell since the code is simple and works extremely well. Furthermore, the load balancing function itself is spread across all our Apaches. Hence there is no “single point of failure” as there would be if we had a single SQL proxy.

Monitoring and Alerting

Though it is an integral part of our database architecture, I will keep this section a bit short, since it isn’t my specialty. We use Nagios to alert us on predicted failure modes of databases. The most common alerts are slave lag, low disk space, and complete machine death. Slave lag is caused by a number of things, including spikes in system usage, long-running update queries, and intermittent disk failures.
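
For a flavour of what a slave-lag check involves, here is a generic Nagios-style sketch; it is not Digg’s actual plugin, and the thresholds and credentials are invented.

    <?php
    // Generic Nagios-style slave-lag check; exit codes follow the Nagios
    // convention (0 OK, 1 WARNING, 2 CRITICAL). Thresholds are placeholders.
    $warn = 60;    // seconds behind master
    $crit = 300;

    $db  = new mysqli('db-slave', 'monitor', 'pass');   // placeholder credentials
    $row = $db->query('SHOW SLAVE STATUS')->fetch_assoc();
    $lag = isset($row['Seconds_Behind_Master']) ? $row['Seconds_Behind_Master'] : null;

    if ($lag === null) {
        echo "CRITICAL: replication is not running\n";
        exit(2);
    } elseif ($lag >= $crit) {
        echo "CRITICAL: slave is {$lag}s behind master\n";
        exit(2);
    } elseif ($lag >= $warn) {
        echo "WARNING: slave is {$lag}s behind master\n";
        exit(1);
    }
    echo "OK: slave is {$lag}s behind master\n";
    exit(0);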

For monitoring, we use a Cacti-like Digg-written tool called MotiRTG. Suffice it to say it resembles Cacti in several ways, but is specialised to Digg’s cluster layout and is more suited to a cluster that has machines entering and leaving every day. It is written and maintained by our Networks and Metrics manager, Mike Newton. It is a strong candidate for open sourcing in the future.

Our alerting and monitoring subsystems are used in the traditional fashion. Alerts are for predicted failure modes, and monitoring is for post-failure analysis and future trending.

Database Overload

One of the most common problems on Digg systems is a spike in load, often caused by large news events like Apple announcements or hurricanes or… well, anything newsworthy. Assuming the spike isn’t taken care of by one of the myriad other spike-limiting features of Digg’s infrastructure, and the spike makes it to Digg’s databases, there are two simple mechanisms to limit the effect.

The first is the aforementioned over-deployment of machines in our DB cluster. I estimate that this takes care of more than 99% of database spikes in load. It may be no exaggeration to say 99.99%; we get several such spikes in database load per minute.

It is possible that a combination of adverse conditions will contribute to a spike that causes a particular segment of our DB resources to become 100% utilised. Under such conditions, it is not acceptable for Digg to go down entirely. Hence the second mechanism: a tool called “dbmon.pl” (for which you can download the source code). It’s a daemon that watches the MySQL instance for queries that have been running longer than a limit set on the command line, and kills them.
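
dbmon.pl itself is a Perl daemon; just to show the shape of the idea in the same language as the sketches above (this is not its actual code), the loop amounts to something like this:

    <?php
    // Sketch of the dbmon.pl idea only: poll the processlist and kill queries
    // that have run longer than a limit given on the command line.
    $maxSeconds = isset($argv[1]) ? (int) $argv[1] : 30;   // placeholder default

    $db = new mysqli('localhost', 'monitor', 'pass');      // placeholder credentials

    while (true) {
        $result = $db->query('SHOW FULL PROCESSLIST');
        while ($row = $result->fetch_assoc()) {
            // Only reap running queries, never idle connections or replication threads.
            if ($row['Command'] === 'Query' && $row['Time'] > $maxSeconds) {
                $db->query('KILL QUERY ' . (int) $row['Id']);  // kill the query, keep the connection
            }
        }
        sleep(1);
    }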

Combined with separate DB subpools for different sorts of functionality on Digg, an overload on the DBs affects only the portion of Digg serviced by that subpool, and even then only a subset of the total requests coming in.

Note that the front-end application must deal with killed queries gracefully, since some of the killed requests will originate from legitimate users. If you use this tool, be sure you test your code in some environment where the kill time is set to a much lower value than you’ll actually use in production, and under stress, to be sure your application doesn’t barf out some nasty stack trace on the user when a connection gets killed.
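
For instance, a caller might degrade gracefully along these lines (the function name and the specific error numbers mentioned in the comments are illustrative):

    <?php
    // Hypothetical example of degrading gracefully when a query is killed:
    // render the page without the optional component instead of showing the
    // user a stack trace.
    function fetchOptionalBlock($db, $sql) {
        $result = @$db->query($sql);
        if ($result === false) {
            // e.g. MySQL error 1317 (query interrupted) after dbmon.pl kills
            // the query, or 2013 (lost connection). Log it and move on.
            error_log('optional block query failed: ' . $db->error);
            return array();
        }
        $rows = array();
        while ($row = $result->fetch_assoc()) {
            $rows[] = $row;
        }
        return $rows;
    }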

Be very careful about adding logic that has the front-end resubmit the request without an explicit action from the user. I recommend adding no such logic: it can easily negate the advantage of using dbmon.pl. Don’t worry. The user will hit reload. There is no need to DoS your own databases.
