Black Friday Post-Incident Report
Over the last few years, Host1Plus experienced rapid customer-base growth. It became clear that we needed to redesign our base infrastructure to accommodate this growth and beyond.
Introduction to storage backends
At the end of 2014, we sat down with our engineering team and started the storage backend redesign.
Our options at hand were:
- Keep local storage (i.e., tied to a specific compute node)
- Buy or build a big SAN
- Explore distributed storage systems
What our engineering team wanted:
- Ability to live-migrate customer environments between the physical compute nodes
- Have a high(-er) data protection level
- Self-healing system
- High IOPS that scales with the cluster
- Large usable storage pool
- Snapshots, etc.
- Cost effective hardware, no fancy interfaces
What our Management team wanted:
- Stable and scalable storage system
- Cost effective
Meanwhile, our customers wanted:
- Plenty of storage space
- High IOPS and low latency
- Fast backups
- Data protection
The problem is that if you make a matrix out of all these “wants” and “needs”, you end up with conflicting areas.
We knew we would have to trade something off, as there are no miracles (yet) in this particular area of computing.
We immediately dropped the <insert a well-known hardware SAN vendor name here> option, as these types of systems are designed for (large) enterprises, but not necessarily for large data sets. The cost per gigabyte and the entrance fee were prohibitive.
Why did we drop local storage?
We dropped local storage due to several factors:
- Speed vs space tradeoff. Considering a typical 1U server with 4 hot-swap drive slots, you either fill all of them with large HDDs – plenty of space, but low IOPS – go SSD-only – not much space, but plenty of IOPS – or use a mix of the two with a low level of protection.
- Data protection – no matter what the books say, RAID can lead to data corruption. It will hit you when you least expect it, and with the number of drives per physical node that was reasonable at the time, this scenario was likely to happen more often than not.
- Waste of space – assuming you have enough disk space on a node (with reserve), you will always hit other resource limits (e.g. RAM or CPU) before using up the available storage. That, multiplied by the number of nodes you run, leads to hundreds of terabytes sitting unused – which is also a waste of money.
- High node recovery time – in case of a physical node failure (motherboard, CPU, RAM, etc.), the downtime is considerably longer, since you need to physically replace the failed component(s) or take out the hard disks and move them to a spare machine.
- Lastly, live migration becomes considerably more complicated, due to the need to move a lot of data between machines.
Why did we choose Ceph?
That left us with distributed storage, and luckily there was a mature project which delivered on most of our “wants”: Ceph. It started as a Ph.D. thesis by Sage Weil back in 2007 and grew into an independent company, Inktank, which was acquired by Red Hat in 2014. Ceph is a free-software storage platform that implements object storage on a distributed cluster and provides interfaces for object, file, and block (our case) storage.
At the time of our initial testing in early 2015, the Ceph development team was about to release the eighth major stable version of Ceph. This is the version we later based our deployment on, and it served us well until last week, when we hit an unexpected bug.
In a nutshell, a Ceph cluster consists of two types of service daemons: Ceph Monitors and Ceph OSDs (Object Storage Devices).
A Ceph Monitor maintains a master copy of the whole cluster map, monitors cluster health, and provides the cluster map to clients on initial connection.
A Ceph OSD handles the read/write operations on the storage disks, monitors itself and other OSDs, and reports back to the Monitors. OSD daemons also create object replicas on other Ceph nodes to ensure data safety and high availability.
The whole system is decentralized as much as possible so that each client and OSD can compute the object location by itself without a need for a “master” server.
The next few paragraphs describe the core pieces of the Ceph implementation; they are essential for understanding the bug that we hit.
Ceph Clients and Ceph OSD Daemons both use the so-called “CRUSH” algorithm to efficiently compute information about object location, instead of having to depend on a central lookup table. CRUSH provides a better data management mechanism compared to older approaches and enables massive scale by cleanly distributing the work to all the clients and OSD daemons in the cluster.
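To make the idea concrete, here is a toy sketch (our own Python, not Ceph’s actual CRUSH code) of deterministic, hash-based placement: every participant that holds the same map computes the same answer from a hash function alone, so no central lookup service is needed. The function names are ours, for illustration.

```python
import hashlib

def straw_draw(pg_id: int, osd_id: int) -> int:
    # Deterministic pseudo-random "straw length" for a (PG, OSD) pair.
    digest = hashlib.sha256(f"{pg_id}:{osd_id}".encode()).hexdigest()
    return int(digest, 16)

def place_pg(pg_id: int, osds: list, replicas: int) -> list:
    # Every client and OSD sorts the same straws the same way, so all
    # of them agree on placement without consulting a lookup table.
    return sorted(osds, key=lambda o: straw_draw(pg_id, o), reverse=True)[:replicas]
```

Calling `place_pg(7, [0, 1, 2, 3, 4], 2)` returns the same two OSDs on every machine that runs it; that determinism is what removes the need for a “master” server.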
The Ceph storage system “Pools” are logical partitions for storing objects.
Ceph clients retrieve a cluster map from a Ceph Monitor and write objects to pools. The pool’s size (i.e. the number of replicas), the CRUSH ruleset, and the number of placement groups determine how Ceph will place the data.
Pools set at least the following parameters:
- Ownership/Access to Objects
- The Number of Placement Groups, and
- The CRUSH ruleset to use.
Each pool has a number of placement groups. CRUSH maps PGs to OSDs dynamically. When a Ceph client stores an object, CRUSH maps it to a placement group.
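Conceptually, the object-to-PG step is just hashing. The sketch below is illustrative only: Ceph itself uses the rjenkins hash and a “stable modulo”, but the principle – arithmetic instead of a lookup table – is the same.

```python
import zlib

def object_to_pg(pool_id: int, object_name: str, pg_num: int) -> str:
    # Hash the object name and fold the hash into one of pg_num groups.
    # (CRC32 here stands in for Ceph's rjenkins hash.)
    return f"{pool_id}.{zlib.crc32(object_name.encode()) % pg_num:x}"

print(object_to_pg(1, "rbd_data.abc123.0000000000000042", 128))
```

Note that changing `pg_num` remaps objects to different PGs, which is why increasing the PG count triggers data movement across the cluster.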
How placement groups are used
A placement group (PG) aggregates objects within a pool because tracking object placement and object metadata on a per-object basis is computationally expensive–i.e., a system with millions of objects cannot realistically track placement on a per-object basis.
The object’s contents within a placement group are stored in a set of OSDs. For instance, in a replicated pool of size two, each placement group will store objects on two OSDs, as shown below.
In the above scenario, should OSD #2 fail, another OSD will be assigned to Placement Group #1 and will be filled with copies of all objects from OSD #1. The following diagram depicts how the CRUSH algorithm maps objects to placement groups and placement groups to OSDs.
With a copy of the cluster map and the CRUSH algorithm, the client can compute exactly which OSD to use when reading or writing a particular object. Given the latest copy of the cluster map, clients compute object locations from only the object ID and the pool name, which is much faster than performing an object-location query every time you need to access one.
The number of placement groups per pool, however, is not an automatic setting, but rather needs to be chosen when creating a pool.
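The widely cited rule of thumb from the Ceph documentation is to aim for roughly 100 PGs per OSD, divided by the pool’s replica count and rounded up to a power of two. A small sketch:

```python
def suggested_pg_num(osd_count: int, pool_size: int, target_per_osd: int = 100) -> int:
    # Rule of thumb: ~100 PGs per OSD, divided by the replica count,
    # rounded up to the next power of two.
    raw = osd_count * target_per_osd / pool_size
    pg_num = 1
    while pg_num < raw:
        pg_num *= 2
    return pg_num

print(suggested_pg_num(40, 3))  # 40 OSDs, 3 replicas -> 2048
```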
Ceph allows for tiered storage, which is transparent to the end user. It’s called cache tiering and provides Ceph clients with better I/O performance for the data stored in a backing storage tier.
We’ve deployed a number of physical nodes with large enterprise-class HDDs aka “cold storage” and added a number of physical nodes with the enterprise-class SSDs/NVMes aka “hot storage” to be used as a cache tier.
Ceph then handles where to place the objects and the tiering agent determines when to flush objects from the cache to make sure only the “hot” data resides there. This process is completely transparent to the end user.
Hitting the bug
The ratio between the “hot” and the “cold” data pools is dynamic. Initially, when cluster utilization is low, you can fit almost all of your data into the hot storage (cache) pool. As the cluster grows, the number of objects (in Ceph, data is split into 4 MB objects) in cold storage grows, while the maximum number of objects in hot storage stays constant. Naturally, one wants to keep this ratio as low as (economically) possible for faster I/O, and to expand this pool in a timely manner. This is what we had done numerous times with success; however, this time it was slightly different.
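Because data is split into 4 MB objects, the object counts involved grow quickly. A back-of-the-envelope sketch (the pool sizes below are hypothetical, for illustration only):

```python
OBJECT_SIZE = 4 * 1024 ** 2          # Ceph splits data into 4 MB objects

def object_count(data_bytes: int) -> int:
    # Ceiling division: every started 4 MB chunk is one object.
    return -(-data_bytes // OBJECT_SIZE)

# Hypothetical pool sizes, for illustration only:
cold = object_count(200 * 1024 ** 4)   # 200 TiB of cold data
hot = object_count(10 * 1024 ** 4)     # 10 TiB cache tier
print(cold, hot)   # 52428800 2621440
```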
The onset of the incident
On November 21st we sent an engineer to our datacenter in Frankfurt to perform a number of tasks, one of which was to move some of our inactive caching-layer SSDs to different physical nodes. This was a slightly overdue task that required physical intervention, due to the original machines’ HBAs randomly losing SAS connectivity. In addition, these machines were running an aging Ubuntu version, while the rest of the cluster had been moved to CentOS.
Before this move, we had cleanly flushed all the data from those SSDs and left them inactive.
Around mid-day we were ready to re-add them to the cluster; however, with the recent cache pool changes and this one pending, we decided to also implement two optimizations:
- Modify the CRUSH map availability zones for our caching layer. Rather than having each participating cluster node as a separate failure domain, we created two virtual availability zones based on the physical chassis layout
- Increase the number of placement groups to even out object distribution in the cluster and decrease the risk of data loss should some OSDs fail
We prepared the new CRUSH map, triple-verified it, and loaded it. As expected, Ceph initiated a rebalance of all objects in accordance with the new layout.
How did it happen?
About 30 minutes later, we decided to slightly increase the PG number on one of our smaller storage pools, to measure the impact a larger-scale increase would have on our customer pools. This decision was not taken without concern; however, I still green-lighted it.
About an hour and a half later, we started seeing cache OSDs dropping out of the cluster and rejoining in random fashion. This kept halting all write operations for the entire cluster, due to the following factors:
- Two or more OSDs kept dying at once, rendering PGs unavailable
- Rebalancing of degraded/missing objects started as soon as an OSD rejoined the cluster
We knew we were in trouble, but not quite how large the trouble would turn out to be.
We assembled a strike team around 9 p.m. that night. Immediately, we saw our cache pool OSDs asserting in the following function:
osd/ReplicatedPG.cc: In function 'void ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned int)'
This was surprising, as we had never had any issues with hit sets before. Think of hit sets as databases holding information on how many times clients accessed a certain object over a period of time. This allows Ceph to compare the “age” versus the “temperature” of an object and decide whether it should stay in the hot cache layer.
Every now and then, these databases need to be trimmed based on their period setting, and that is where we seemed to be hitting the bug.
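For intuition, here is a highly simplified model of that bookkeeping (our own illustrative code, not Ceph’s): each PG keeps an archive of recent hit sets, trims the oldest ones beyond the configured count – the hit_set_trim() step that was asserting for us – and measures an object’s “temperature” as the number of recent hit sets that contain it.

```python
from collections import deque

class HitSetArchive:
    # Illustrative model of a PG's hit-set archive: recent access records
    # are kept so "hot" objects can be identified; older sets are trimmed
    # once the archive exceeds hit_set_count.
    def __init__(self, hit_set_count: int):
        self.hit_set_count = hit_set_count
        self.archive = deque()        # oldest hit set on the left

    def persist(self, hit_set: set) -> None:
        self.archive.append(frozenset(hit_set))
        self.trim()

    def trim(self) -> None:
        # Drop the oldest archives beyond the configured count -- the
        # operation that kept asserting in our cluster.
        while len(self.archive) > self.hit_set_count:
            self.archive.popleft()

    def temperature(self, obj: str) -> int:
        # How many recent hit sets contain the object.
        return sum(obj in hs for hs in self.archive)

arc = HitSetArchive(hit_set_count=4)
for period in range(6):
    arc.persist({"obj_a", f"obj_{period}"})
print(len(arc.archive), arc.temperature("obj_a"))  # 4 4
```

In this model, increasing `hit_set_count` (the forum suggestion that bought us time) simply delays trimming at the cost of more memory per PG.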
That night we worked until the early hours of the morning, trying everything we had in our bag of tricks. Nothing helped. OSDs kept flapping, and we had a bunch of PGs marked as unfound, even though the disks and the physical data were there. Bewildered and out of ideas, we went for a few hours’ rest.
On the morning of November 23rd, the day after the incident, we got a suggestion from a forum to try increasing the number of hit sets. We did, and the system became somewhat stable, allowing the cluster to finish the rebalance that had never finished before. However, after the rebalance a lot of PGs were still marked as not clean or unfound, so we contacted 42on.com engineers to assist in tracking down all the placement groups, verifying their data, and marking them clean. In parallel, we contacted Red Hat and asked for their assistance.
The joint cleaning effort ran into the late evening, and we continued on our own until the early hours of the next day.
On November 24th at around 8 a.m., we finally had the cluster reporting HEALTH_OK. This was a major psychological milestone, so we started bringing up the systems and client environments. However, after roughly an hour of seemingly OK operation, we started hitting the same bug as before. This left us with the following options:
- Disable cache mode completely
- Keep on increasing hit sets until we exhaust memory
We tried both; however, we were getting nowhere near the performance we needed. Disabling the cache meant there was no way tens or hundreds of thousands of IOPS could be served from cold storage, and customer virtual environments were either going into a read-only file-system state or shutting down. Increasing hit sets seemed like it could buy us some time, but the cluster with a high number of hit sets just didn’t perform as it used to. At that moment we were sure the data was safe; we just didn’t have enough IOPS to serve everybody.
One long weekend
Over the next few days (Friday and Saturday) we built a test lab and managed to replicate the bug by changing the (normally not changeable) namespace name that Ceph uses to store the hit set archive files on a running cluster. This brought some excitement, as we could now try all the crazy ideas we could think of without touching our production cluster. Late Saturday night we finally had our Red Hat Ceph Storage subscription activated and a support ticket created.
Come Sunday, we were still in the same position as on Thursday: a somewhat operational cluster, but not really up. The incident was taking a huge psychological toll on our support, marketing, and technical teams.
Later in the evening, we had another call with Red Hat, who essentially told us that they do not support the caching layer and that we should get rid of it. The estimate was grim: up to 7 days to model the scenario and work out the procedure. It was clear that come Monday the 28th, we would need to take drastic action, programmatic or otherwise.
At the crossroads
On Monday the 28th we came to the office, held a meeting, and split into two efforts:
- Develop a tool that can dump and edit LevelDB records. We knew by then that each OSD has its own LevelDB database, which keeps information about hit sets (among other things, like the PG map). The idea was to remove all erroneous entries from the 22nd and, by doing so, hopefully get the cluster out of the bug loop it had gotten itself into
- Shut down all of our infrastructure except the Ceph cluster itself and, hopefully, flush the hot cache pools. Then destroy all cache pools, remove the cache tier from the cold storage, rebuild the cache pools, and re-add them as a tier for the cold storage.
Around 11 a.m. we had a LevelDB dump/edit tool. We tried it in our test lab, and it seemed to work as expected. However, over the several days we had been running the test lab, we had hacked on it so much that something else was now broken beyond (reasonably quick) repair, and there was no time to work on that.
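In spirit, the tool did something like the following sketch (illustrative only: a Python dict stands in for the OSD’s LevelDB store, and the key format and dates are hypothetical):

```python
from datetime import date, datetime

def purge_hitset_records(db: dict, bad_day: date) -> list:
    # Remove hit-set archive records written on `bad_day`.
    # `db` stands in for an OSD's LevelDB key-value store; keys are
    # assumed to look like "hit_set_<ISO timestamp>" (illustrative).
    removed = []
    for key in list(db):
        if not key.startswith("hit_set_"):
            continue
        stamp = datetime.fromisoformat(key[len("hit_set_"):])
        if stamp.date() == bad_day:
            del db[key]
            removed.append(key)
    return removed

store = {
    "hit_set_2016-11-21T09:00:00": b"...",   # dates illustrative
    "hit_set_2016-11-22T14:30:00": b"...",
    "pg_map": b"...",                        # non-hit-set data untouched
}
gone = purge_hitset_records(store, date(2016, 11, 22))
print(gone)  # ['hit_set_2016-11-22T14:30:00']
```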
Meanwhile, we had shut the infrastructure down and initiated the cache flush. Somewhat surprisingly (after the previously failed attempts), it went very well, and no objects were found to be locked. After triple-checking that no data objects were left in the cache pools, we removed them one by one and rebuilt them.
Resolution at last
Around 2 p.m. we started gradually bringing up our core systems and our own test environments to run stress tests on the cluster. After a few hours of checking every piece of performance data, we were confident enough to start bringing up customer environments. Over the next 11 hours, we brought up all customer environments.
Around 1 a.m. on November 29th, we had everything online, except for a few of our own non-essential systems.
On November 29th and 30th, we assisted our customers in bringing up or configuring their environments, while closely monitoring the systems.
Our plan going forward remains unchanged from what it was in the first place:
- Keep our customer data safe
- Deliver increasingly better performance
- Avoid slowdowns or downtimes
To make it happen
Currently we are in the process of:
- Researching alternatives to Ceph caching mechanism
- Working on a local (SSD-only) storage tier offering for our Cloud Server customers
- Investing in our systems infrastructure to make it highly available. This incident has shown that we have to add an additional layer of redundancy to be able to serve the rest of the global infrastructure.
- Investing in our engineering staff’s technical excellence
Being true to our craft, as you have probably understood by now, we ate our own dog food and used the same distributed storage for our own systems. Of course, highly critical data had its own disks, but almost all of our systems still had some ties to the storage. We will work on decoupling this as much as possible, but will keep one leg in, so that we understand the performance our customers are getting. This will allow us to keep improving as the cluster grows.
We are also working with Red Hat to unify the storage node OS versions and Ceph versions, and to get rid of the features that are deemed unstable in a production environment.
Our company was built to create high-quality hosting services, to serve our customers with passion, and to be a reliable partner in their projects. We sincerely apologize to our customers for everything they had to go through during this incident. We truly appreciate our clients, even if recent events may have suggested otherwise. We will continue to improve and put precautions in place to prevent incidents of this scale in the future.