This repository was archived by the owner on May 19, 2024. It is now read-only.
Mark Sayson edited this page Feb 11, 2017 · 11 revisions

kvservice

kvservice is a simple key-value service developed to demonstrate various failure recovery strategies for distributed systems, and to compare each strategy's relative advantages and disadvantages.

A client-server API abstracts away implementation details, so the service behind it can be built on any number of distributed designs. Which design is appropriate depends on the system's requirements for availability, consistency, and fault tolerance.
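As a rough illustration of that abstraction, the sketch below defines a client-facing interface and one trivial implementation. The names `KVStore`, `InMemoryStore`, and `record_visit` are hypothetical and chosen for this example; they are not taken from kvservice's actual API.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class KVStore(ABC):
    """Hypothetical client-facing interface (an assumption, not kvservice's API)."""

    @abstractmethod
    def get(self, key: str) -> Optional[str]:
        """Return the value stored under key, or None if absent."""

    @abstractmethod
    def put(self, key: str, value: str) -> None:
        """Store value under key, replacing any previous value."""


class InMemoryStore(KVStore):
    """Simplest possible implementation: one dict in one process."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def put(self, key: str, value: str) -> None:
        self._data[key] = value


def record_visit(store: KVStore, user: str) -> int:
    """Client code depends only on KVStore, not on how it is implemented."""
    count = int(store.get(user) or "0") + 1
    store.put(user, str(count))
    return count
```

Because `record_visit` only sees `KVStore`, a replicated or otherwise distributed implementation could later replace `InMemoryStore` without changing client code.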

For example, Google Docs may prioritize availability over consistency and allow multiple clients to write to the same document simultaneously, simply storing each person's changes so that the users can resolve merge conflicts themselves. A financial records system, however, may be more concerned with strong data consistency, and allow simultaneous read operations but only one write operation at a time per record. These differences in needs will inform the design of our distributed system.
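The "many readers, one writer per record" policy can be sketched within a single process using a reader-writer lock per record. This is only an illustration under that assumption: `ReadWriteLock` and `RecordStore` are hypothetical names, and a distributed system would need a coordination mechanism that works across machines instead.

```python
import threading


class ReadWriteLock:
    """Allows many concurrent readers, or exactly one writer."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writing = False

    def acquire_read(self):
        with self._cond:
            while self._writing:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            while self._writing or self._readers > 0:
                self._cond.wait()
            self._writing = True

    def release_write(self):
        with self._cond:
            self._writing = False
            self._cond.notify_all()


class RecordStore:
    """Hypothetical store with one ReadWriteLock per record key."""

    def __init__(self):
        self._data = {}
        self._locks = {}
        self._locks_guard = threading.Lock()  # protects the lock table itself

    def _lock_for(self, key):
        with self._locks_guard:
            return self._locks.setdefault(key, ReadWriteLock())

    def read(self, key):
        lock = self._lock_for(key)
        lock.acquire_read()
        try:
            return self._data.get(key)
        finally:
            lock.release_read()

    def update(self, key, fn):
        """Apply fn to the current value while holding the record's write lock."""
        lock = self._lock_for(key)
        lock.acquire_write()
        try:
            self._data[key] = fn(self._data.get(key))
            return self._data[key]
        finally:
            lock.release_write()
```

Writes to different records do not block each other, since each key has its own lock. This simple lock can starve writers under a steady stream of readers; a production design would need to address that.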

For systems with business-critical data, some form of distributed data storage makes sense, as this allows us to replicate our data across multiple machines and survive individual failures. To illustrate why we might find this important, consider a simple abstract system in which clients talk to a single server that also stores all of the data.

What happens if our server crashes, experiences a hardware failure, or otherwise becomes compromised? Clients will no longer be able to access our services, and we may lose all of our data.

Is there any way to avoid this? Well, why do we have to store all our data on the server the client interacts with?

Consider storing data on back-end servers that the front-end server interacts with. This way we can maintain data replicas across back-end machines so that a given back-end failure doesn't cripple our system, and we don't lose any data if our front-end server fails.
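A minimal in-process sketch of this arrangement follows. It assumes the simplest replication scheme, in which the front-end writes every value to all reachable back-ends and reads from the first one that answers. `Backend`, `FrontEnd`, and `BackendDown` are hypothetical names, and failures are simulated with a flag rather than real network errors.

```python
from typing import Dict, List, Optional


class BackendDown(Exception):
    """Raised when a (simulated) back-end has failed."""


class Backend:
    """Stand-in for a back-end storage server; a real one would be remote."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.alive = True
        self._data: Dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        if not self.alive:
            raise BackendDown(self.name)
        self._data[key] = value

    def get(self, key: str) -> Optional[str]:
        if not self.alive:
            raise BackendDown(self.name)
        return self._data.get(key)


class FrontEnd:
    """Holds no data itself; forwards every request to the back-ends."""

    def __init__(self, backends: List[Backend]) -> None:
        self.backends = backends

    def put(self, key: str, value: str) -> int:
        """Write to every reachable replica; return how many accepted it."""
        acked = 0
        for backend in self.backends:
            try:
                backend.put(key, value)
                acked += 1
            except BackendDown:
                continue
        if acked == 0:
            raise BackendDown("no replica accepted the write")
        return acked

    def get(self, key: str) -> Optional[str]:
        """Read from the first reachable replica."""
        for backend in self.backends:
            try:
                return backend.get(key)
            except BackendDown:
                continue
        raise BackendDown("no replica reachable")
```

Since the front-end is stateless here, a replacement `FrontEnd` built over the same back-ends sees the same data. The sketch deliberately ignores a hard part: a back-end that was down during a write comes back stale and would need to be resynchronized.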

In this scenario, a front-end failure results in a temporary loss of user access to the system, but doesn't affect the integrity of our data. How can we recover from a front-end server failure?

We could try to minimize downtime by monitoring the front-end server process and swapping in a replacement if it becomes unresponsive. Alternatively, we could run multiple front-end processes concurrently, and allow clients to access any given one. Note that if the goal is to survive any given failure, at least one front-end should be hosted on a separate machine.
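The second option, several front-ends with clients free to use any of them, might look like this from the client's side. This is a sketch under assumptions: each front-end is modeled as a plain callable, and `FrontEndUnavailable`, `call_with_failover`, `crashed`, and `healthy` are hypothetical names standing in for real network calls.

```python
from typing import Callable, Optional, Sequence


class FrontEndUnavailable(Exception):
    """Raised when a front-end cannot be reached (simulated here)."""


def call_with_failover(
    front_ends: Sequence[Callable[[str], str]], request: str
) -> str:
    """Try each front-end in order; return the first successful response."""
    last_error: Optional[Exception] = None
    for front_end in front_ends:
        try:
            return front_end(request)
        except FrontEndUnavailable as err:
            last_error = err  # remember the failure and try the next one
    raise FrontEndUnavailable("all front-ends failed") from last_error


def crashed(request: str) -> str:
    """Simulated front-end that is down."""
    raise FrontEndUnavailable("fe-1 is down")


def healthy(request: str) -> str:
    """Simulated front-end that is up."""
    return f"ok: {request}"
```

Calling `call_with_failover([crashed, healthy], "get x")` skips the failed front-end and returns the healthy one's response. The client only sees an error when every front-end in its list is down, which is why hosting at least one of them on a separate machine matters.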
