? Uber Goes Unconventional: Using Driver Phones as a Backup Datacenter (archive)

Source originale du contenu

In How Uber Scales Their Real-Time Market Platform one of the most intriguing hints was how Uber handles datacenter failovers using driver phones as an external distributed storage system for recovery.

Now we know a lot more about how that system works from Uber's Nikunj Aggarwal and Joshua Corbin, who gave a very interesting talk at the @Scale conference: How Uber Uses your Phone as a Backup Datacenter.

Rather than use a traditional backend replication scheme where databases sync state between datacenters to achieve a measure of k-safety, Uber did something different, what they do is store enough state on driver phones so that if a datacenter failover occurs trip information can not be lost on the failover.

Why choose this approach? The traditional approach would be much simpler. I think it is to make sure the customer always has a good customer experience and losing trip information for an active trip would make for a horrible customer experience. 

By building their syncing strategy around the phone, even thought it's complicated and takes a lot work, Uber is able to preserve trip data and make for a seamless customer experience even on datacenter failures. And making the customer happy is what counts, especially in a market with near zero switching costs.

So the goal is not to lose trip information, even on a datacenter failover. Using a traditional database replication strategy it would not be possible to make this guarantee for reasons that have parallels to how network management systems have always had to work. Let me explain.

In a network devices are the authoritative source for state information like packet errors, alarms, packets sent and received, and so on. The network management system is authoritative for configuration data like alarm thresholds and customer information. The complication is devices and the network management system are not always in contact, so they get out of sync because they work independently of each other. Which means on bootup, failover, and communication reconnection all this information has to be merged in both directions using a complicated dance that ensures correctness and consistency. 

Uber has the same problem, only the devices are smart phones and the authoritative state the phone contains is trip information. So on bootup, failover, and communication reconnection the trip information must be preserved because the phone is the authoritative source for trip information.

Even when connectivity is lost the phone has an accurate record all trip data. So you wouldn't want to sync trip data from the datacenter down to the phone because that would wipe out the correct data on the phone. The correct information must come from the phone.

Uber also takes another trick from network management systems. They periodically query phones to test the integrity of information in the datacenter. 

Let's see how they do it...

Motivation for Using Phones as Storage for Datacenter Failure

How They Got the Reliability of the System to Work at Scale

Goals

The Flow

Moving Between Datacenters

Ensuring 99.99% Reliability