XFS problems [Update]

Yesterday I ran into strange trouble: I turned on the machine, ran an update (Amarok 1.4.1 -> 1.4.2), and fetched my e-mails. While the spam and antivirus tools checked the e-mails (which takes a bit of time for each one), I restarted Amarok, which took a while because it tried to rebuild its database.

However, when I opened the e-mail client (kontact/kmail) and wanted to copy some incoming mails into other folders, it suddenly claimed that the target folder did not exist. I clicked on the folder (in kmail) to see what would happen, and it crashed. I tried to start konsole from its icon, but only got a message that kdeinit was not available. I assumed this was due to the Amarok update and restarted the computer. Everything went fine and I was able to log in, but as soon as I opened kmail, the same error occurred: kdeinit was not available anymore, and I had to log out again.

Then I checked /var/log/messages, and there was the first hint that it was not KDE’s fault:

Sep 1 15:46:18 localhost kernel: xfs_da_do_buf: bno 16777216
Sep 1 15:46:18 localhost kernel: dir: inode 8439210
Sep 1 15:46:18 localhost kernel: Filesystem "hda2": XFS internal error xfs_da_do_buf(1) at line 2119 of file fs/xfs/xfs_da_btree.c. Caller 0xf8b8614e
Sep 1 15:46:18 localhost kernel: xfs_da_do_buf+0x45b/0x829 [xfs] xfs_da_read_buf+0x30/0x35 [xfs]
Sep 1 15:46:18 localhost kernel: xfs_da_read_buf+0x30/0x35 [xfs] xfs_dir2_leafn_lookup_int+0x2fd/0x45d [xfs]
Sep 1 15:46:18 localhost kernel: xfs_dir2_data_log_unused+0x49/0x4f [xfs] xfs_da_read_buf+0x30/0x35 [xfs]
Sep 1 15:46:18 localhost kernel: xfs_dir2_node_removename+0x288/0x483 [xfs] xfs_dir2_node_removename+0x288/0x483 [xfs]
Sep 1 15:46:18 localhost kernel: xfs_dir2_removename+0xce/0xd5 [xfs] kmem_zone_alloc+0x4d/0x98 [xfs]
Sep 1 15:46:18 localhost kernel: xfs_ilock+0x8a/0xd0 [xfs] xfs_remove+0x2b4/0x458 [xfs]
Sep 1 15:46:18 localhost kernel: xfs_vn_unlink+0x17/0x3b [xfs] debug_mutex_add_waiter+0x1c/0x2c
Sep 1 15:46:18 localhost kernel: __mutex_lock_slowpath+0x339/0x439 vfs_unlink+0x70/0xdf
Sep 1 15:46:18 localhost kernel: xfs_vn_permission+0x0/0x13 [xfs] xfs_vn_permission+0xf/0x13 [xfs]
Sep 1 15:46:18 localhost kernel: vfs_unlink+0x70/0xdf vfs_unlink+0xa5/0xdf
Sep 1 15:46:18 localhost kernel: do_unlinkat+0x90/0x124 do_page_fault+0x22d/0x5ad
Sep 1 15:46:18 localhost kernel: syscall_call+0x7/0xb
Sep 1 15:46:18 localhost kernel: Filesystem "hda2": XFS internal error xfs_trans_cancel at line 1150 of file fs/xfs/xfs_trans.c. Caller 0xf8bba749
Sep 1 15:46:18 localhost kernel: xfs_trans_cancel+0x59/0xe5 [xfs] xfs_remove+0x42f/0x458 [xfs]
Sep 1 15:46:18 localhost kernel: xfs_remove+0x42f/0x458 [xfs] xfs_vn_unlink+0x17/0x3b [xfs]
Sep 1 15:46:18 localhost kernel: debug_mutex_add_waiter+0x1c/0x2c __mutex_lock_slowpath+0x339/0x439
Sep 1 15:46:18 localhost kernel: vfs_unlink+0x70/0xdf xfs_vn_permission+0x0/0x13 [xfs]
Sep 1 15:46:18 localhost kernel: xfs_vn_permission+0xf/0x13 [xfs] vfs_unlink+0x70/0xdf
Sep 1 15:46:18 localhost kernel: vfs_unlink+0xa5/0xdf do_unlinkat+0x90/0x124
Sep 1 15:46:18 localhost kernel: do_page_fault+0x22d/0x5ad syscall_call+0x7/0xb
Sep 1 15:46:19 localhost kernel: xfs_force_shutdown(hda2,0x8) called from line 1151 of file fs/xfs/xfs_trans.c. Return address = 0xf8bc59b4
Sep 1 15:46:19 localhost kernel: Filesystem "hda2": Corruption of in-memory data detected. Shutting down filesystem: hda2
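Lines like these are easy to fish out of the syslog once you know what to look for. A minimal sketch, working on an inline sample so it is self-contained (on a live system you would point the grep at /var/log/messages; the pattern is my own choice, not an official list):

```shell
# Filter a syslog excerpt for the XFS messages that signal trouble.
# Inline sample; on a real system, grep /var/log/messages instead.
cat > /tmp/messages.sample <<'EOF'
Sep 1 15:46:18 localhost kernel: Filesystem "hda2": XFS internal error xfs_da_do_buf(1) at line 2119 of file fs/xfs/xfs_da_btree.c. Caller 0xf8b8614e
Sep 1 15:46:19 localhost kernel: Filesystem "hda2": Corruption of in-memory data detected. Shutting down filesystem: hda2
Sep 1 15:46:19 localhost sshd[4711]: Accepted password for someuser
EOF
grep -cE 'XFS internal error|xfs_force_shutdown|Shutting down filesystem' /tmp/messages.sample
# prints 2 (the sshd line does not match)
```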

The last line is the crucial one: due to a massive error, the filesystem was shut down. At that moment I was glad that I strictly separate my partitions: the XFS filesystem holds my home directory, while the root filesystem lives on another partition, so I was able to boot and log in as root without any problem.
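The split I am talking about looks roughly like this in /etc/fstab (device names, the root filesystem type, and the options here are illustrative, not copied from my machine):

```
/dev/hda1   /       ext3   defaults   1 1
/dev/hda2   /home   xfs    defaults   1 2
```

With that layout, a dead /home filesystem leaves the root filesystem (and thus the repair tools) fully usable.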

So at first I thought it was a file system error and ran xfs_check, which told me something was not right. After that I ran xfs_repair – and that failed:

Phase 1 - find and verify superblock...
Phase 2 - using internal log
- zero log...
- scan filesystem freespace and inode maps...
- found root inode chunk
Phase 3 - for each AG...
- scan and clear agi unlinked lists...
- process known inodes and perform inode discovery...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
- agno = 4
- agno = 5
- agno = 6
- agno = 7
- process newly discovered inodes...
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- clear lost+found (if it exists) ...
- clearing existing "lost+found" inode
- marking entry "lost+found" to be deleted
- check for inodes claiming duplicate blocks...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
- agno = 4
- agno = 5
- agno = 6
- agno = 7
Phase 5 - rebuild AG headers and trees...
- reset superblock...
Phase 6 - check inode connectivity...
- resetting contents of realtime bitmap and summary inodes
- ensuring existence of lost+found directory
- traversing filesystem starting at / ...
rebuilding directory inode 128

fatal error -- can't read block 16777216 for directory inode 8439210

So I thought it was a disk failure and ran some smartctl commands, but these did not detect any error. So I searched around a bit (remember: you are never the first one to have a specific problem!) and found this post (sorry, German only).

So I ran the commands mentioned there:

$ xfs_db -x /dev/hda2
xfs_db> inode 8439210
xfs_db> write core.mode 0
xfs_db> quit
$ xfs_repair /dev/hda2

This time xfs_repair was able to finish the job, and I got my system back.

And the file loss? Well, I was lucky in a way. First of all, everything that was lost (at least everything I missed) could be found in lost+found in the root directory of the partition. There were some 750 files, but with file * I got a first impression of them, and it turned out that they were all e-mails (remember: the problem showed up every time I used kmail). And I was lucky twice over: not only could I have recovered all the files, I did not even need to, because the error had eaten only one specific folder: the spam folder 😀
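Getting that “first impression” of 750 nameless files is quicker if the file(1) output is summarized instead of read line by line. A sketch on two stand-in files (the directory and the mail snippets are invented for the demo; in the real case this would run inside lost+found):

```shell
# Create two stand-in "recovered" files and summarize their types,
# mimicking a quick triage of lost+found after xfs_repair.
demo=$(mktemp -d)
printf 'From: someone@example.org\nSubject: test\n\nhello\n' > "$demo/8439211"
printf 'From: spam@example.org\nSubject: buy now\n\noffer\n'  > "$demo/8439212"
# One line per file, then count how often each detected type occurs.
file "$demo"/* | awk -F': ' '{print $2}' | sort | uniq -c
```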

So the question remains why this happened. I am not sure. What I do know is that at least the failure of xfs_repair to finish its job was a bug which is now fixed in the newest version of the XFS tools (which will be included in FC6). I also found some mailing list discussions about the problem, and judging from those it looks like something was messed up in the XFS driver in one of the recent kernels.

Anyway, now everything works again, and I learned a bit more about how to solve such problems. And I have to say: apart from this bug, XFS behaves almost perfectly. Even with this crash I was able to recover my data – something that did not work with an ext3 crash I once had.

I wonder if we will see something like ZFS in Linux soon – that would be nice to play with…

Update
Following the discussions about the Debian/Ubuntu relationship I came across maddock’s blog; he had some trouble with XFS recently as well. He points to the XFS FAQ, which explains that a bug was introduced with the last kernel version, 2.6.17. Well, guess which kernel version Fedora ships at the moment…
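Whether a given kernel falls into the affected range can be checked by hand against the running version. A sketch using sort -V, assuming (per one of the comments below) that 2.6.17.8 is the first upstream release with the fix; the sample version string is made up, on a live system it would come from uname -r:

```shell
# Rough check whether a kernel version predates the XFS fix.
# Assumption: 2.6.17.8 is the first fixed upstream release.
ver='2.6.17-1.2157_FC5'   # illustrative; normally: ver=$(uname -r)
fixed='2.6.17.8'
base=${ver%%[-_]*}        # strip the Fedora build suffix -> 2.6.17
oldest=$(printf '%s\n%s\n' "$base" "$fixed" | sort -V | head -n1)
if [ "$oldest" = "$fixed" ]; then
    echo "at or past the fix"
else
    echo "possibly affected"
fi
# prints "possibly affected" for this sample version
```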

6 thoughts on “XFS problems [Update]”

  1. hi,

    afaik, reiser4 might solve this problem as well, as it should be atomic. about XFS, it’s particularly vulnerable to unclean shutdowns (power failure), more so than most other filesystems. it’s designed more for high-availability hardware with UPSes and stuff like that. now i’m not sure what to recommend of course, i’ve had bad experiences with ext3 as well, also with xfs and reiserfs, and i think we both know our own experiences don’t really matter, statistically speaking. but technically, ext3 shouldn’t be as much at risk as XFS when you have power problems or other crashes – i’d go with that.

    but of course – you can do some more reading yourself as well. i guess you know filesystems are another of these ‘something vs somethingelse’ type problems that stir quite a few flamewars😉

  2. about XFS, it’s particularly vulnerable to unclean shutdowns (power failure)

    That’s interesting, I never read about that. Do you have further information?

    The reason I originally chose XFS was that it came with extended attributes and ACL support by default – I was experimenting with that stuff at the time. Later on, almost every file system comparison I saw was positive about XFS, so I kept it.
    Ah, and not to forget: I once had a power failure and lost a significant amount of data – on ext3 (but as you said, that is *not* statistically meaningful).

    So, I would love to get some links, especially about the power failure stuff. Thanks for the comment!

    (And yes, there is (too) much room for flamewars, but I do not want to join in, I prefer reading some well-written reports 🙂 ).

  3. The post and the comments now shed some light on why I wasn’t able to complete the boot when I had XFS and FC5

    (and yes, I had a power failure, since my father had to shut down the power in the house to do some maintenance and forgot to shut down the Linux box)

  4. Hm, I have to admit that I sometimes have to shut down my computer cold (holding the power button for 5-10 seconds) when X locks up, due to the Ati drivers I suspect – and I never had any problems like these.

    In fact, this problem appeared without any power problems, after a regular shutdown. Strange, strange…

  5. The newest 2.6.17-1.2174_FC5 kernel already contains the fix (as it has been rebased onto 2.6.17.8). However, you will still need to clean up the filesystem with the xfs_repair from the newest xfsprogs (2.8.10 or 2.8.11).

Comments are closed.