BTRFS is a filesystem that was first introduced in 2009, back in the days of the Linux 2.6.x kernel. It came with the promise of being a modern filesystem for the modern age, packed with all sorts of features that were traditionally considered part of "software RAID" or a "volume manager".

To this day, many consider it "the ZFS of the Linux world".

My first encounter with this filesystem came much later, when it was already considered pretty much stable.

My first impressions

At that time, I didn’t know much about filesystems, to be honest. I was just a kid learning my way through the Linux world. The height of my sophistication back then was having separate partitions and mount points for various things. With that in mind, in retrospect, it isn’t strange that I had the experience I had.

It is really hard to say from memory what I liked about it, apart from the learning and tinkering.

I just remember that one day the system crashed on me and my data was corrupt. It is unlikely that the hardware was the culprit, because the drive I was using back then is still in service, with no bad blocks or anything. But trying to prove it was the filesystem’s fault would be silly, because the only thing I have is an inexperienced kid’s fuzzy memory of the series of events.

Nevertheless, the catastrophic data loss I experienced left a pretty bad taste in my mouth, and I was hesitant to try BTRFS again for a long time after that. I was even more convinced not to each time I went through forums and bug reports of folks losing their data.

Another try

So, a few months ago, after a regular poke-and-explore session, I worked up the courage to try BTRFS “in production” again. Everything I read about it assured me that all is good now, at least for the setup I planned to use.

My current setup consists of 2x NVMe drives (128GB each) and 4 spinning-rust drives (3x2TB, 1x1.5TB).

Up until trying BTRFS, I had used those two NVMe drives in an MD RAID 1 setup with LVM on top for easier volume management, while the HDDs were running none other than ZFS, with a topology of two mirror vdevs in a stripe, or RAID10 if you wish.

ZFS

I was really happy with ZFS. It was always rock-solid and packed with all sorts of features I need, and even some I don’t :-) Send/receive, snapshots, and many others; it’s hard to enumerate all the nice features here. I definitely encourage you to try it out for yourself, and I have no doubt you’ll be amazed.
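
Just to illustrate the kind of workflow I mean, a snapshot-based backup with send/receive boils down to something like this (pool and dataset names are invented for the example):

# take a read-only snapshot of the dataset
zfs snapshot tank/home@2021-01-01

# replicate that snapshot into another pool (or pipe it over ssh to another box)
zfs send tank/home@2021-01-01 | zfs receive backup/home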

The thing I didn’t like about it was its inflexibility once things had already been configured. For example, you can turn on compression on a dataset at any time, but there’s no way to “convert” the existing data in the dataset unless you copy it over again. The topology of the pool is pretty much fixed as well. There are quite a few things that make it inflexible, but going through every one of them, just like with the good things, would make this post far too long.
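
For example (pool and dataset names are made up), enabling compression is a one-liner, but it only affects data written from that point on:

# compression applies only to newly written blocks;
# existing data stays uncompressed until it is rewritten
zfs set compression=lz4 tank/media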

BTRFS

Anyhow, back to BTRFS. Like any responsible person, I first kicked one of the NVMe drives out of my MD RAID setup and nuked it. It was experimentation time!
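
For reference, kicking a member out of an MD array generally looks something like this (the array and partition names here are just placeholders):

# mark the member as failed, remove it from the array, then wipe its signatures
mdadm /dev/md0 --fail /dev/nvme1n1p3
mdadm /dev/md0 --remove /dev/nvme1n1p3
wipefs --all /dev/nvme1n1p3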

Instead of copying everything over and converting the current setup to BTRFS without a reinstall (which is quite possible to do, BTW), I opted for a fresh Fedora install (32, I believe, was current back then). On my desktop I had never encrypted the root partition, but that has changed now. Enabling LUKS was just a tick-box in the installer, so everything was quite straightforward.

So, my disk layout ended up looking like this:

NAME                                          MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
nvme0n1                                            259:1    0 111.8G  0 disk
├─nvme0n1p1                                        259:5    0     1G  0 part  /boot/efi
├─nvme0n1p2                                        259:6    0     1G  0 part  /boot
└─nvme0n1p3                                        259:7    0 109.8G  0 part
  └─luks-5e4beb41-67e4-4254-88c2-543329e889f5      253:0    0 109.8G  0 crypt /home

After playing around and learning a bit more, I decided to format the other NVMe drive and join it into this BTRFS filesystem.
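
Roughly, that operation boils down to something like this (assuming the freshly wiped drive is LUKS-formatted and opened as /dev/mapper/luks-5d7921cb-4515-4667-8793-603ae61cff29, and the filesystem in question is the one mounted at /home):

# add the second (opened LUKS) device to the existing filesystem
btrfs device add /dev/mapper/luks-5d7921cb-4515-4667-8793-603ae61cff29 /home

# convert data and metadata to the RAID1 profile across both devices
btrfs balance start -dconvert=raid1 -mconvert=raid1 /home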

RAID 1

Naturally, when playing with BTRFS, I was testing its capabilities. To be certain my data would not go “bye-bye” on me, I had to simulate some drive-failure scenarios and be prepared for recovery.

What I learned is that when you’re using BTRFS RAID1 on LUKS, you have to specify both of your root devices in /etc/default/grub so the bootloader knows where to boot from. :-) This is done by simply defining rd.luks.uuid twice.

GRUB_CMDLINE_LINUX="--cut-- rd.luks.uuid=luks-5e4beb41-67e4-4254-88c2-543329e889f5 rd.luks.uuid=luks-5d7921cb-4515-4667-8793-603ae61cff29 --cut--"

This way, if one device is not available, the second one will be used. I also updated my /etc/crypttab and then regenerated the GRUB config for EFI (as I boot via UEFI):

grub2-mkconfig -o /boot/efi/EFI/fedora/grub.cfg
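
For completeness, the two /etc/crypttab entries would look something along these lines (reusing the same LUKS UUIDs as on the kernel command line; the options are a matter of taste):

# <name>                                      <device>                                        <key file> <options>
luks-5e4beb41-67e4-4254-88c2-543329e889f5  UUID=5e4beb41-67e4-4254-88c2-543329e889f5  none  discard
luks-5d7921cb-4515-4667-8793-603ae61cff29  UUID=5d7921cb-4515-4667-8793-603ae61cff29  none  discard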

Subvolumes

What ZFS calls “datasets”, BTRFS calls “subvolumes” – kind of. Although it’s not a complete 1-to-1 mapping, for the sake of illustration let’s just consider them that way.

In the usual Linux fashion, you have the freedom to choose how those subvolumes will be mounted. ZFS keeps the mountpoint as a property of the dataset itself, and by default a new dataset inherits the parent dataset’s mountpoint with its own name appended, just like creating a directory. With BTRFS, there’s no such default. A subvolume will be visible when you’re exploring the filesystem itself, but you can also mount it via /etc/fstab directly, just as you would a regular filesystem.

So for mounting the filesystem, I can use the BTRFS filesystem UUID:

UUID=b6c24e1b-5e00-44cd-8ff1-85c5f887bd51  /storage              btrfs  compress=zstd,x-systemd.device-timeout=0     0  0

And for a subvolume, I use that very same (filesystem) UUID, but also specify the subvol option:

UUID=b6c24e1b-5e00-44cd-8ff1-85c5f887bd51  /home/ivan/Music      btrfs  subvol=Music,x-systemd.device-timeout=0      0  0

This way, I can see what’s inside the Music subvolume both by exploring /storage/Music and /home/ivan/Music. The downside, compared to ZFS dataset management, is that in order to manipulate subvolumes (e.g. create a new one), the main filesystem has to be mounted.

So, in the example shown above, I have to do something like:

cd /storage
btrfs subvolume create mysubvolume

Well, that is not completely true. If you already have a subvolume mounted, you can create a subvolume within that subvolume; they’re all part of the subvolume “hierarchy” within the main filesystem. Unfortunately, I haven’t found a way to keep the main filesystem unmounted and still create a “first-level” subvolume in it.
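
For example, with the Music subvolume mounted at /home/ivan/Music as above, a nested subvolume can be created right inside it (the name is just for illustration):

# create a nested subvolume and list the hierarchy as seen from the mounted subvolume
btrfs subvolume create /home/ivan/Music/Live
btrfs subvolume list /home/ivan/Music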

Data redundancy

BTRFS, as opposed to classic RAID, has a notion of redundancy level within the filesystem that doesn’t necessarily map to “this drive is identical to that one”. Instead, redundancy is handled at the chunk (block group) level. This way, you can have RAID1 with 3 drives but still only 2 copies of the data. So this is not a 3-way mirror that can endure the loss of 2 drives; only 1 drive can be lost while the data on the filesystem is still guaranteed to be intact. To get a 3-way mirror, one can use the raid1c3 profile, which basically means “store data in RAID1 fashion, but instead of 2 copies, keep 3”. You can read more about profiles on the BTRFS Wiki. BTW, they have excellent documentation available!

This also means you can have devices of different sizes chugging along happily in a single “pool”.
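
As a sketch of what picking a profile looks like at creation time (the device names are made up):

# two copies of data and metadata spread across three unequal drives
mkfs.btrfs -d raid1 -m raid1 /dev/sdx /dev/sdy /dev/sdz

# or three copies of everything with raid1c3 (needs a reasonably recent kernel)
mkfs.btrfs -d raid1c3 -m raid1c3 /dev/sdx /dev/sdy /dev/sdz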

For my /storage use case, where I store all my personal data, I have chosen the RAID10 profile, which is just what it sounds like. But instead of fixed pairs of drives forming the “mirrors”, writes are striped across the drives and every block is kept in two copies: roughly half of the data lands on drive A and half on drive B, with copies of those halves going to drives C and D respectively.
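
And if the filesystem wasn’t created with that profile in the first place, switching to it later is just a balance with conversion filters (this assumes it’s mounted at /storage, like mine):

btrfs balance start -dconvert=raid10 -mconvert=raid10 /storage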

It is also worth noting that BTRFS allocates space in large, fixed-size chunks, so don’t get confused when you see the following in your usage output:

[root@main ~]# btrfs fi usage /storage
Overall:
    Device size:		   6.82TiB
    Device allocated:		   3.04TiB
    Device unallocated:		   3.78TiB
    Device missing:		     0.00B
    Used:			   3.00TiB
    Free (estimated):		   1.91TiB	(min: 1.91TiB)
    Free (statfs, df):		   1.23TiB
    Data ratio:			      2.00
    Metadata ratio:		      2.00
    Global reserve:		 512.00MiB	(used: 0.00B)
    Multiple profiles:		        no

Data,RAID10: Size:1.52TiB, Used:1.50TiB (98.74%)
   /dev/mapper/storage-1-b8e36464-cd3d-42b8-bfeb-585dc8e0e854	 778.00GiB
   /dev/mapper/storage-2-0816ebad-c1ec-4832-afc4-fe54a0f75ea4	 778.00GiB
   /dev/mapper/storage-3-e31396f1-7a06-4adb-923a-4a954f684989	 778.00GiB
   /dev/mapper/storage-4-4fcf3bcb-5a3c-48f1-8ccb-0aca2d479dc6	 778.00GiB

Metadata,RAID10: Size:3.00GiB, Used:2.15GiB (71.84%)
   /dev/mapper/storage-1-b8e36464-cd3d-42b8-bfeb-585dc8e0e854	   1.50GiB
   /dev/mapper/storage-2-0816ebad-c1ec-4832-afc4-fe54a0f75ea4	   1.50GiB
   /dev/mapper/storage-3-e31396f1-7a06-4adb-923a-4a954f684989	   1.50GiB
   /dev/mapper/storage-4-4fcf3bcb-5a3c-48f1-8ccb-0aca2d479dc6	   1.50GiB

System,RAID10: Size:64.00MiB, Used:192.00KiB (0.29%)
   /dev/mapper/storage-1-b8e36464-cd3d-42b8-bfeb-585dc8e0e854	  32.00MiB
   /dev/mapper/storage-2-0816ebad-c1ec-4832-afc4-fe54a0f75ea4	  32.00MiB
   /dev/mapper/storage-3-e31396f1-7a06-4adb-923a-4a954f684989	  32.00MiB
   /dev/mapper/storage-4-4fcf3bcb-5a3c-48f1-8ccb-0aca2d479dc6	  32.00MiB

Unallocated:
   /dev/mapper/storage-1-b8e36464-cd3d-42b8-bfeb-585dc8e0e854	   1.06TiB
   /dev/mapper/storage-2-0816ebad-c1ec-4832-afc4-fe54a0f75ea4	   1.06TiB
   /dev/mapper/storage-3-e31396f1-7a06-4adb-923a-4a954f684989	   1.06TiB
   /dev/mapper/storage-4-4fcf3bcb-5a3c-48f1-8ccb-0aca2d479dc6	 617.72GiB

Here you can see the redundancy of System (BTRFS internal data), Metadata, and regular Data. You’ll notice that data and metadata are rounded to roughly 1GB chunks, while system data uses 64MB chunks. The “Used” percentage reflects the used portion of the allocated chunks. At the end of the output, you also get a view of how much unallocated storage remains in the pool, or “space still available for allocating new chunks” if you wish. :-)
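
And if allocated-but-mostly-empty chunks ever pile up, a filtered balance can compact them and hand the space back to “unallocated”. For example, this only touches data chunks that are less than half full:

btrfs balance start -dusage=50 /storage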

Conclusion

BTRFS is a nice, feature-rich, and very reliable filesystem. I have had zero issues with it and am happily continuing to use it as my daily driver. It gives me the great advantage of moving things around as I need, without much hassle. I can just shuffle things, rebalance the filesystem, and always keep enough redundancy in the process. With ZFS, for example, if I wanted to reorganize my pool for whatever reason, I had to kick drives out of the mirrors first and operate with no redundancy until everything was shuffled around.

With BTRFS, I can change the redundancy structure on the fly, as well as compression and much more. I just configure it and, depending on what I’ve changed, run a rebalance or a scrub and be done with it.
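
As a concrete sketch (mount options and paths are illustrative): after switching the compress= mount option, existing files can be re-compressed in place with a recursive defragment – just keep in mind that defragmenting breaks reflinks with existing snapshots:

# new writes pick up the mount option (e.g. compress=zstd in fstab);
# this re-compresses data that is already on disk
btrfs filesystem defragment -r -czstd /storage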

So, in conclusion, don’t be afraid of BTRFS like I was, and don’t let the naysayers scare you away. The BTRFS project keeps a really nice status page, and as long as you avoid the red entries (experimental or untested features), you should be just fine.

Do you have any questions regarding the BTRFS filesystem?