
> HDDs typically have a BER (Bit Error Rate) of 1 in 10^15, meaning some incorrect data can be expected around every 100 TiB read. That used to be a lot, but now that is only 3 or 4 full drive reads on modern large-scale drives. Silent corruption is one of those problems you only notice after it has already done damage.

While the advice is sound, this number isn't the right number for this argument.

That 10^15 number is for UREs, which aren't going to cause silent data corruption -- simple naive RAID-style mirroring/parity will easily recover from a known error of this sort without any filesystem-layer checksumming. The rates for silent errors, where the disk returns wrong data without reporting a problem (the case checksumming actually protects against), are a couple of orders of magnitude lower.





RAID would only be able to recover if it KNEW the data was wrong.

Without a checksum, hardware RAID has no way to KNOW it needs to use the parity to correct the block.
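
To make that concrete, here is a minimal sketch (plain Python, a hypothetical 3-data + 1-parity XOR stripe, not any real RAID implementation) of why parity can rebuild a block only when you already know which block is bad:

    from functools import reduce

    def parity(blocks):
        # XOR parity across equal-length byte blocks
        return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

    data = [b"AAAA", b"BBBB", b"CCCC"]
    p = parity(data)

    # URE case: the drive flags block 1 as unreadable, so we know what to
    # rebuild, and XOR of the survivors recovers it exactly.
    assert parity([data[0], data[2], p]) == data[1]

    # Silent corruption: block 1 comes back wrong with no error reported.
    corrupted = [data[0], b"BxBB", data[2]]
    # Parity only says "this stripe is inconsistent"; without a per-block
    # checksum there is no way to tell which of the four blocks is lying.
    print(parity(corrupted) != p)   # True: mismatch detected, culprit unknown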


This is pure theory. Shouldn't BER be counted per sector, etc.? We shouldn't treat all disk space as a single entity, IMO.

Why would that make a difference unless some sectors have higher/lower error rates than others?

For a fixed bit error rate, making your typical error 100x bigger means it will happen 100x less often.

If the typical error is an entire sector, that's over 30 thousand bits. 1:1e15 BER could mean 1 corrupted bit every 100 terabytes or it could mean 1 corrupted sector every 4 exabytes. Or anything in between. If there's any more detailed spec for what that number means, I'd love to see it.
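
Putting rough numbers on that (back-of-the-envelope Python, assuming 4 KiB physical sectors, i.e. 32,768 bits):

    BER = 1e-15              # spec figure: 1 bit error per 1e15 bits read
    SECTOR_BITS = 4096 * 8   # 4 KiB physical sector = 32,768 bits

    # Errors arriving one bit at a time: one error per 1e15 bits read
    print(1 / BER / 8 / 1e12)              # ~125 TB between single-bit errors

    # Same BER, but errors arriving a whole sector at a time
    print(SECTOR_BITS / BER / 8 / 1e18)    # ~4.1 EB between whole-sector errors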


This stat is also complete bullshit. If it were true, your scrubs of any 20+TB pool would at least turn up corrected errors quite frequently. But this is not the case.

Consumer-grade drives are often given an even worse spec of 1 in 1e14. For a 20TB drive, that's more than one error every scrub, which does not happen. I don't know about you, but I would not consider a drive functional at all if reading it out in full produced more than one error on average. Pretty much nothing on that datasheet reflects reality.
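
For what it's worth, the arithmetic behind "more than one error every scrub" at the 1-in-1e14 consumer spec looks like this (rough Python, treating bit errors as independent):

    import math

    BER = 1e-14              # consumer spec: 1 bit error per 1e14 bits read
    drive_bits = 20e12 * 8   # one full read of a 20 TB drive = 1.6e14 bits

    expected = drive_bits * BER
    print(expected)                    # ~1.6 expected bit errors per full read
    print(1 - math.exp(-expected))     # ~0.8 chance of at least one (Poisson model)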


> This stat is also complete bullshit. If it were true, your scrubs of any 20+TB pool would at least turn up corrected errors quite frequently. But this is not the case.

I would expect the ZFS code is written with the expected BER in mind. If it reads something, computes the checksum and goes "uh oh", then it will probably first re-read the block/sector, see that the result is different, possibly re-read it a third time, and if everything checks out continue on without even bothering to log an obvious BER-related error. I would expect it only bothers to log or warn about something when it repeatedly reads the same data that breaks the checksum.
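
In code form, the retry behaviour being imagined there might look something like this (purely illustrative Python sketch with hypothetical read_raw/stored_checksum helpers, not actual ZFS internals):

    import hashlib

    MAX_READS = 3

    def read_verified(dev, block, stored_checksum, log):
        # Retry transient mismatches quietly; only escalate when the same
        # bad data keeps coming back.
        last = None
        for _ in range(MAX_READS):
            data = dev.read_raw(block)          # hypothetical device API
            if hashlib.sha256(data).digest() == stored_checksum:
                return data                     # transient flip, nothing logged
            if data == last:
                break                           # same wrong bytes twice: not a fluke
            last = data
        log.warning("persistent checksum mismatch on block %s", block)
        return None                             # caller falls back to redundancy/repair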

Caveat Reddit, but https://www.reddit.com/r/zfs/comments/3gpkm9/statistics_on_r... has some useful info in it. The OP starts off with a similar premise, that a BER of 10^-14 is rubbish, but then people in charge of very large pools of drives wade in with real-world experience to give more context.


That's some very old data. I'm curious how things have changed with all the new advancements like helium drives, HAMR, etc. From the stats Backblaze helpfully publish, I feel like the huge amount of variance between models far outweighs the importance of this specific stat when considering failure risks.

I also thought that it refers to UREs, i.e. errors that are unrecoverable even with all the drive's internal correction mechanisms. I'm aware that drives use various ways to protect against bitrot internally.



