Random Replication Leads to Definitive Data Loss

PublishedJanuary 9, 2021

Lazy non-determinism

Disciplined determinism

In the world of distributed systems, there is a common pattern of replication which is often used to prevent data loss (in reality, we mean data unavailability since usually data gets backed up to disk and disks almost never completely fail) in storage systems.

The pattern goes like this:

1.) Store a data chunk A on node 1

2.) Replicate chunk A to nodes 2 and 3 so that you can lose 2 of those nodes and still not lose data

3.) Repeat for all your data chunks, randomly choosing the node for each replica among all N nodes in your system

If you do this, you can lose any 2 nodes and be sure not to lose any data.

But as systems grow in size, node failure becomes more and more frequent in absolute terms.

If you have a cluster of 10000 nodes, there is a good chance that at any moment more than 2 nodes are failing.

But you can expect to have a small percentage of failures - says 1% - and that should protect you, since it's very unlikely that the 1% of nodes that fail at the same time contain all 3 replicas for any chunk of data.....right?

WRONG!

Yes, it's true that this is how replication is implemented in many popular storage systems such as HDFS and GFS - but the truth is that for large clusters, the probability of data loss becomes 100% over the course of 1 year.

#distributed-systems-data-replication-data-loss-storage-systems

Comments

Join the discussion

No comments yet. Be the first to comment.

More from this blog

Learn AI Independently: How to Use ChatGPT and Other Tools to Understand a Technical Book

I've spent the past 13 years in software engineering—and, call me a masochist, but my favorite part is still running face-first into unfamiliar tech and coming out the other side. If you’re a jack-of-all-trades like me, you know the drill: you can da...

Sep 24, 20254 min read

Learn AI Independently: How to Use ChatGPT and Other Tools to Understand a Technical Book

Uber's Michelangelo vs. Netflix's Metaflow

Originally published on Blogger. This post compares two major ML platform approaches from industry leaders. Michelangelo Pain point Without michelangelo, each team at uber that uses ML (that's all of them - every interaction with the ride or eats app...

May 16, 20256 min read

Uber's Michelangelo vs. Netflix's Metaflow

ChatGPT - How Long Till They Realize I'm a Robot?

I tried it first on December 2nd... ...and slowly the meaning of it started to sink in. It's January 1st and as the new year begins, my future has never felt so hazy. It helps me write code. At my new company I'm writing golang, which is new for me, ...

Jan 1, 20233 min read

ChatGPT - How Long Till They Realize I'm a Robot?

Architectural Characteristics - Transcending Requirements

Building a system means meeting a set of requirements dictated by the customer. But the customer isn't always going to translate what they want into engineering terms. Even if you say please, they might not even know how. If they ask for an online or...

Oct 19, 20203 min read

Architectural Characteristics - Transcending Requirements

masudio - tech

16 posts

A blog about distributed systems and software infrastructure

Command Palette

Comments

More from this blog