Linux 6.16: bcachefs merges

This post summarizes bcachefs merges that landed in Linux 6.16.

RC1

  • 2025-05-26: Merge tag ‘bcachefs-2025-05-24’ of git://evilpiepirate.org/bcachefs (commit)

    Pull bcachefs updates from Kent Overstreet:
    
     - Poisoned extents can now be moved: this lets us handle bitrotted data
       without deleting it. For now, reading from poisoned extents only
       returns -EIO: in the future we'll have an API for specifying "read
       this data even if there were bitflips".
    
     - Incompatible features may now be enabled at runtime, via
       "opts/version_upgrade" in sysfs. Toggle it to incompatible, and then
       toggle it back - option changes via the sysfs interface are
       persistent.
    
     - Various changes to support deployable disk images:
    
         - RO mounts now use less memory
    
         - Images may be stripped of alloc info, particularly useful for
           slimming them down if they will primarily be mounted RO. Alloc
           info will be automatically regenerated on first RW mount, and
           this is quite fast
    
         - Filesystem images generated with 'bcachefs image' will be
           automatically resized the first time they're mounted on a larger
           device
    
       The images 'bcachefs image' generates with compression enabled have
       been comparable in size to those generated by squashfs and erofs -
       but you get a full RW capable filesystem
    
     - Major error message improvements for btree node reads, data reads,
       and elsewhere. We now build up a single error message that lists all
       the errors encountered, actions taken to repair, and success/failure
       of the IO. This extends to other error paths that may kick off other
       actions, e.g. scheduling recovery passes: actions we took because of
       an error are included in that error message, with
       grouping/indentation so we can see what caused what.
    
     - New option, 'rebalance_on_ac_only'. Does exactly what the name
       suggests, quite handy with background compression.
    
     - Repair/self healing:
    
         - We can now kick off recovery passes and run them in the
           background if we detect errors. Currently, this is just used by
           code that walks backpointers. We now also check for missing
           backpointers at runtime and run check_extents_to_backpointers if
           required. The messy 6.14 upgrade left missing backpointers for
           some users, and this will correct that automatically instead of
           requiring a manual fsck - some users noticed this as copygc
           spinning and not making progress.
    
           In the future, as more recovery passes come online, we'll be able
           to repair and recover from nearly anything - except for
           unreadable btree nodes, and that's why you're using replication,
           of course - without shutting down the filesystem.
    
         - There's a new recovery pass, for checking the rebalance_work
           btree, which tracks extents that rebalance will process later.
    
     - Hardening:
    
         - Close the last known hole in btree iterator/btree locking
           assertions: path->should_be_locked paths must stay locked until
           the end of the transaction. This shook out a few bugs, including
           a performance issue that was causing unnecessary path_upgrade
           transaction restarts.
    
     - Performance:
    
         - Faster snapshot deletion: this is an incompatible feature, as it
           requires new sentinel values, for safety. Snapshot deletion no
           longer has to do a full metadata scan, it now just scans the
           inodes btree: if an extent/dirent/xattr is present for a given
           snapshot ID, we already require that an inode be present with
           that same snapshot ID.
    
           If/when users hit scalability limits again (ridiculously huge
           filesystems with lots of inodes, and many sparse snapshots), let
           me know - the next step will be to add an index from snapshot ID
           -> inode number, which won't be too hard.
    
         - Faster device removal: the "scan for pointers to this device" no
           longer does a full metadata scan, instead it walks backpointers.
           Like fast snapshot deletion this is another incompat feature: it
           also requires a new sentinel value, because we don't want to
           reuse these device IDs until after a fsck.
    
         - We're now coalescing redundant accounting updates prior to
           transaction commit, taking some pressure off the journal. Shortly
           we'll also be doing multiple extent updates in a transaction in
           the main write path, which combined with the previous should
           drastically cut down on the amount of metadata updates we have to
           journal.
    
     - Stack usage improvements: All allocator state has been moved off the
       stack
    
     - Debug improvements:
    
         - enumerated refcounts: The debug code previously used for
           filesystem write refs is now a small library, and used for other
           heavily used refcounts. Different users of a refcount are
           enumerated, making it much easier to debug refcount issues.
    
         - Async object debugging: There's a new kconfig option that makes
           various async objects (different types of bios, data updates,
           write ops, etc.) visible in debugfs, and it should be fast enough
           to leave on in production.
    
         - Various sets of assertions no longer require
           CONFIG_BCACHEFS_DEBUG, instead they're controlled by module
           parameters and static keys, meaning users won't need to compile
           custom kernels as often to help debug issues.
    
         - bch2_trans_kmalloc() calls can be tracked (there's a new kconfig
           option). With it on you can check the btree_transaction_stats in
           debugfs to see the bch2_trans_kmalloc() calls a transaction did
           when it used the most memory.
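
Editorial illustration of the "enumerated refcounts" item above: a minimal Python sketch of the idea, not the kernel implementation. The class name, method names, and the three refcount users are hypothetical, chosen only to show how counting references per user makes a leaked reference easy to attribute.

```python
from collections import Counter
from enum import Enum


class RefUser(Enum):
    # Hypothetical users of the refcount; not bcachefs's real names.
    READ = "read"
    WRITE = "write"
    COPYGC = "copygc"


class EnumeratedRef:
    """Refcount that also tracks how many references each user holds."""

    def __init__(self):
        self._by_user = Counter()

    def get(self, user: RefUser) -> None:
        self._by_user[user] += 1

    def put(self, user: RefUser) -> None:
        # An unbalanced put is reported against the user that made it.
        if self._by_user[user] == 0:
            raise RuntimeError(f"put without matching get by {user.name}")
        self._by_user[user] -= 1

    def total(self) -> int:
        return sum(self._by_user.values())

    def holders(self) -> dict:
        """Users that still hold references, for debugging a leak."""
        return {u.name: n for u, n in self._by_user.items() if n > 0}


ref = EnumeratedRef()
ref.get(RefUser.READ)
ref.get(RefUser.WRITE)
ref.get(RefUser.WRITE)
ref.put(RefUser.READ)
print(ref.total(), ref.holders())  # 2 {'WRITE': 2}
```

With a plain counter the only visible fact would be "2 references outstanding"; the per-user breakdown shows which user still holds them.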
    
    * tag 'bcachefs-2025-05-24' of git://evilpiepirate.org/bcachefs: (218 commits)
      bcachefs: Don't mount bs > ps without TRANSPARENT_HUGEPAGE
      bcachefs: Fix btree_iter_next_node() for new locking asserts
      bcachefs: Ensure we don't use a blacklisted journal seq
      bcachefs: Small check_fix_ptr fixes
      bcachefs: Fix opts.recovery_pass_last
      bcachefs: Fix allocate -> self healing path
      bcachefs: Fix endianness in casefold check/repair
      bcachefs: Path must be locked if trans->locked && should_be_locked
      bcachefs: Simplify bch2_path_put()
      bcachefs: Plumb btree_trans for more locking asserts
      bcachefs: Clear trans->locked before unlock
      bcachefs: Clear should_be_locked before unlock in key_cache_drop()
      bcachefs: bch2_path_get() reuses paths if upgrade_fails & !should_be_locked
      bcachefs: Give out new path if upgrade fails
      bcachefs: Fix btree_path_get_locks when not doing trans restart
      bcachefs: btree_node_locked_type_nowrite()
      bcachefs: Kill bch2_path_put_nokeep()
      bcachefs: bch2_journal_write_checksum()
      bcachefs: Reduce stack usage in data_update_index_update()
      bcachefs: bch2_trans_log_str()
      ...
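
Editorial illustration of the "coalescing redundant accounting updates" item in the pull request above: a toy Python model, assuming an accounting update can be represented as a (counter key, delta) pair. The function name and the key layout are hypothetical and do not reflect the on-disk format; the point is only that several updates to the same counter collapse into one entry before they reach the journal.

```python
def coalesce_accounting(updates):
    """Merge (counter_key, delta) updates that target the same counter.

    Returns one update per counter, in first-seen order.
    """
    merged = {}
    for key, delta in updates:
        merged[key] = merged.get(key, 0) + delta
    return list(merged.items())


# Hypothetical updates queued by one transaction.
pending = [
    (("dev", 0, "sectors"), 8),
    (("compression", "lz4"), 8),
    (("dev", 0, "sectors"), 16),
]

# Three queued updates become two journalled ones.
print(coalesce_accounting(pending))
# [(('dev', 0, 'sectors'), 24), (('compression', 'lz4'), 8)]
```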
    
  • 2025-06-04: Merge tag ‘bcachefs-2025-06-04’ of git://evilpiepirate.org/bcachefs (commit)

    Pull more bcachefs updates from Kent Overstreet:
     "More bcachefs updates:
    
       - More stack usage improvements (~600 bytes)
    
       - Define CLASS()es for some commonly used types, and convert most
         rcu_read_lock() uses to the new lock guards
    
       - New introspection:
           - Superblock error counters are now available in sysfs:
             previously, they were only visible with 'show-super', which
             doesn't provide a live view
           - New tracepoint, error_throw(), which is called any time we
             return an error and start to unwind
    
       - Repair
           - check_fix_ptrs() can now repair btree node roots
           - We can now repair when we've somehow ended up with the journal
             using a superblock bucket
    
       - Revert some leftovers from the aborted directory i_size feature,
         and add repair code: some userspace programs (e.g. sshfs) were
         getting confused
    
      It seems in 6.15 there's a bug where i_nlink on the vfs inode has been
      getting incorrectly set to 0, with some unfortunate results;
      list_journal analysis showed bch2_inode_rm() being called (by
      bch2_evict_inode()) when it clearly should not have been.
    
       - bch2_inode_rm() now runs "should we be deleting this inode?" checks
         that were previously only run when deleting unlinked inodes in
         recovery
    
       - check_subvol() was treating a dangling subvol (pointing to a
         missing root inode) like a dangling dirent, and deleting it. This
         was the really unfortunate one: check_subvol() will now recreate
         the root inode if necessary
    
      This took longer to debug than it should have, and we lost several
      filesystems unnecessarily, because users have been ignoring the
      release notes and blindly running 'fsck -y'. Debugging required
      reconstructing what happened through analyzing the journal, when
      ideally someone would have noticed 'hey, fsck is asking me if I want
      to repair this: it usually doesn't, maybe I should run this in dry run
      mode and check what's going on?'
    
      As a reminder, fsck errors are being marked as autofix once we've
      verified, in real world usage, that they're working correctly; blindly
      running 'fsck -y' on an experimental filesystem is playing with fire
    
      Up to this incident we've had an excellent track record of not losing
      data, so let's try to learn from this one
    
      This is a community effort, I wouldn't be able to get this done
      without the help of all the people QAing and providing excellent bug
      reports and feedback based on real world usage. But please don't
      ignore advice and expect me to pick up the pieces
    
      If an error isn't marked as autofix, and it /is/ happening in the
      wild, that's also something I need to know about so we can check it
      out and add it to the autofix list if repair looks good. I haven't
      been getting those reports, and I should be; since we don't have any
      sort of telemetry yet I am absolutely dependent on user reports
    
      Now I'll be spending the weekend working on new repair code to see if
      I can get a filesystem back for a user who didn't have backups"
    
    * tag 'bcachefs-2025-06-04' of git://evilpiepirate.org/bcachefs: (69 commits)
      bcachefs: add cond_resched() to handle_overwrites()
      bcachefs: Make journal read log message a bit quieter
      bcachefs: Fix subvol to missing root repair
      bcachefs: Run may_delete_deleted_inode() checks in bch2_inode_rm()
      bcachefs: delete dead code from may_delete_deleted_inode()
      bcachefs: Add flags to subvolume_to_text()
      bcachefs: Fix oops in btree_node_seq_matches()
      bcachefs: Fix dirent_casefold_mismatch repair
      bcachefs: Fix bch2_fsck_rename_dirent() for casefold
      bcachefs: Redo bch2_dirent_init_name()
      bcachefs: Fix -Wc23-extensions in bch2_check_dirents()
      bcachefs: Run check_dirents second time if required
      bcachefs: Run snapshot deletion out of system_long_wq
      bcachefs: Make check_key_has_snapshot safer
      bcachefs: BCH_RECOVERY_PASS_NO_RATELIMIT
      bcachefs: bch2_require_recovery_pass()
      bcachefs: bch_err_throw()
      bcachefs: Repair code for directory i_size
      bcachefs: Kill un-reverted directory i_size code
      bcachefs: Delete redundant fsck_err()
      ...
    

RC2

  • 2025-06-13: Merge tag ‘bcachefs-2025-06-12’ of git://evilpiepirate.org/bcachefs (commit)

    Pull bcachefs fixes from Kent Overstreet:
     "As usual, highlighting the ones users have been noticing:
    
       - Fix a small issue with has_case_insensitive not being propagated on
         snapshot creation; this led to fsck errors, which were harmless
         because we're not using this flag yet (it's for overlayfs +
         casefolding).
    
       - Log the error being corrected in the journal when we're doing fsck
         repair: this was one of the "lessons learned" from the i_nlink 0 ->
         subvolume deletion bug, where reconstructing what had happened by
         analyzing the journal was a bit more difficult than it needed to
         be.
    
       - Don't schedule btree node scan to run in the superblock: this fixes
         a regression from the 6.16 recovery passes rework, and led to it
         running unnecessarily.
    
         The real issue here is that we don't have online, "self healing"
         style topology repair yet: topology repair currently has to run
         before we go RW, which means that we may schedule it unnecessarily
         after a transient error. This will be fixed in the future.
    
       - We now track, in btree node flags, the reason it was scheduled to
         be rewritten. We discovered a deadlock in recovery when many btree
         nodes need to be rewritten because they're degraded: fully fixing
         this will take some work but it's now easier to see what's going
         on.
    
         For the bug report where this came up, a device had been kicked RO
         due to transient errors: manually setting it back to RW was
         sufficient to allow recovery to succeed.
    
       - Mark a few more fsck errors as autofix: as a reminder to users,
         please do keep reporting cases where something needs to be repaired
         and is not repaired automatically (i.e. cases where -o fix_errors
         or fsck -y is required).
    
       - rcu_pending.c now works with PREEMPT_RT
    
       - 'bcachefs device add', then umount, then remount wasn't working -
         we now emit a uevent so that the new device's new superblock is
         correctly picked up
    
       - Assorted repair fixes: btree node scan will no longer incorrectly
         update sb->version_min.
    
       - Assorted syzbot fixes"
    
    * tag 'bcachefs-2025-06-12' of git://evilpiepirate.org/bcachefs: (23 commits)
      bcachefs: Don't trace should_be_locked unless changing
      bcachefs: Ensure that snapshot creation propagates has_case_insensitive
      bcachefs: Print devices we're mounting on multi device filesystems
      bcachefs: Don't trust sb->nr_devices in members_to_text()
      bcachefs: Fix version checks in validate_bset()
      bcachefs: ioctl: avoid stack overflow warning
      bcachefs: Don't pass trans to fsck_err() in gc_accounting_done
      bcachefs: Fix leak in bch2_fs_recovery() error path
      bcachefs: Fix rcu_pending for PREEMPT_RT
      bcachefs: Fix downgrade_table_extra()
      bcachefs: Don't put rhashtable on stack
      bcachefs: Make sure opts.read_only gets propagated back to VFS
      bcachefs: Fix possible console lock involved deadlock
      bcachefs: mark more errors autofix
      bcachefs: Don't persistently run scan_for_btree_nodes
      bcachefs: Read error message now prints if self healing
      bcachefs: Only run 'increase_depth' for keys from btree node csan
      bcachefs: Mark need_discard_freespace_key_bad autofix
      bcachefs: Update /dev/disk/by-uuid on device add
      bcachefs: Add more flags to btree nodes for rewrite reason
      ...
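
Editorial illustration of the "track, in btree node flags, the reason it was scheduled to be rewritten" item in the pull request above: a small Python sketch using bit flags. Only "degraded" is mentioned in the pull request; the other reason names, the class, and the helper are hypothetical and are not the kernel's identifiers.

```python
from enum import IntFlag, auto


class RewriteReason(IntFlag):
    # DEGRADED comes from the pull request text; the rest are made up.
    DEGRADED = auto()
    ERROR = auto()
    OLD_FORMAT = auto()


class BtreeNode:
    def __init__(self, name):
        self.name = name
        self.rewrite_reasons = RewriteReason(0)

    def schedule_rewrite(self, reason):
        # Record why the rewrite was requested, not just that it was.
        self.rewrite_reasons |= reason

    def needs_rewrite(self):
        return bool(self.rewrite_reasons)


def describe(node):
    """Render the pending rewrite reasons for a log or debug dump."""
    reasons = [r.name for r in RewriteReason if r in node.rewrite_reasons]
    return f"{node.name}: " + (", ".join(reasons) if reasons else "no rewrite pending")


node = BtreeNode("node-a")
node.schedule_rewrite(RewriteReason.DEGRADED)
node.schedule_rewrite(RewriteReason.ERROR)
print(describe(node))  # node-a: DEGRADED, ERROR
```

A single "needs rewrite" bit would answer only the question asked by needs_rewrite(); keeping one bit per reason is what makes a stuck recovery easier to diagnose.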
    

RC4

  • 2025-06-26: Merge tag ‘bcachefs-2025-06-26’ of git://evilpiepirate.org/bcachefs (commit)

    Pull bcachefs fixes from Kent Overstreet:
    
     - Lots of small check/repair fixes, primarily in subvol loop and
       directory structure loop (when involving snapshots).
    
     - Fix a few 6.16 regressions: rare UAF in the foreground allocator path
       when taking a transaction restart from the transaction bump
       allocator, and some small fallout from the change to log the error
       being corrected in the journal when repairing errors, also some
       fallout from the btree node read error logging improvements.
    
       (Alan, Bharadwaj)
    
     - New option: journal_rewind
    
       This lets the entire filesystem be reset to an earlier point in time.
    
       Note that this is only a disaster recovery tool, and right now there
       are major caveats to using it (discards should be disabled, in
       particular), but it successfully restored the filesystem of one of
       the users who was bit by the subvolume deletion bug and didn't have
       backups. I'll likely be making some changes to the discard path in
       the future to make this a reliable recovery tool.
    
     - Some new btree iterator tracepoints, for tracking down some
       livelock-ish behaviour we've been seeing in the main data write path.
    
    * tag 'bcachefs-2025-06-26' of git://evilpiepirate.org/bcachefs: (51 commits)
      bcachefs: Plumb correct ip to trans_relock_fail tracepoint
      bcachefs: Ensure we rewind to run recovery passes
      bcachefs: Ensure btree node scan runs before checking for scanned nodes
      bcachefs: btree_root_unreadable_and_scan_found_nothing should not be autofix
      bcachefs: fix bch2_journal_keys_peek_prev_min() underflow
      bcachefs: Use wait_on_allocator() when allocating journal
      bcachefs: Check for bad write buffer key when moving from journal
      bcachefs: Don't unlock the trans if ret doesn't match BCH_ERR_operation_blocked
      bcachefs: Fix range in bch2_lookup_indirect_extent() error path
      bcachefs: fix spurious error_throw
      bcachefs: Add missing bch2_err_class() to fileattr_set()
      bcachefs: Add missing key type checks to check_snapshot_exists()
      bcachefs: Don't log fsck err in the journal if doing repair elsewhere
      bcachefs: Fix *__bch2_trans_subbuf_alloc() error path
      bcachefs: Fix missing newlines before ero
      bcachefs: fix spurious error in read_btree_roots()
      bcachefs: fsck: Fix oops in key_visible_in_snapshot()
      bcachefs: fsck: fix unhandled restart in topology repair
      bcachefs: fsck: Fix check_directory_structure when no check_dirents
      bcachefs: Fix restart handling in btree_node_scrub_work()
      ...
    

RC5

  • 2025-07-04: Merge tag ‘bcachefs-2025-07-03’ of git://evilpiepirate.org/bcachefs (commit)

    Pull bcachefs fixes from Kent Overstreet:
     "The 'opts.casefold_disabled' patch is non-critical, but would be a
      6.15 backport; it's to address the casefolding + overlayfs
      incompatibility that was discovered late.
    
      It's late because I was hoping that this would be addressed on the
      overlayfs side (and will be in 6.17), but user reports keep coming in
      on this one (lots of people are using docker these days)"
    
    * tag 'bcachefs-2025-07-03' of git://evilpiepirate.org/bcachefs:
      bcachefs: opts.casefold_disabled
      bcachefs: Work around deadlock to btree node rewrites in journal replay
      bcachefs: Fix incorrect transaction restart handling
      bcachefs: fix btree_trans_peek_prev_journal()
      bcachefs: mark invalid_btree_id autofix
    

RC6

  • 2025-07-12: Merge tag ‘bcachefs-2025-07-11’ of git://evilpiepirate.org/bcachefs (commit)

    Pull bcachefs fixes from Kent Overstreet.
    
    * tag 'bcachefs-2025-07-11' of git://evilpiepirate.org/bcachefs:
      bcachefs: Don't set BCH_FS_error on transaction restart
      bcachefs: Fix additional misalignment in journal space calculations
      bcachefs: Don't schedule non persistent passes persistently
      bcachefs: Fix bch2_btree_transactions_read() synchronization
      bcachefs: btree read retry fixes
      bcachefs: btree node scan no longer uses btree cache
      bcachefs: Tweak btree cache helpers for use by btree node scan
      bcachefs: Fix btree for nonexistent tree depth
      bcachefs: Fix bch2_io_failures_to_text()
      bcachefs: bch2_fpunch_snapshot()
    

RC7

  • 2025-07-18: Merge tag ‘bcachefs-2025-07-17’ of git://evilpiepirate.org/bcachefs (commit)

    Pull bcachefs fixes from Kent Overstreet:
    
     - two small syzbot fixes
    
     - fix discard behaviour regression; we no longer wait until the number
       of buckets needing discard is greater than the number of buckets
       available before kicking off discards
    
     - fix a fast_list leak when async object debugging is enabled
    
     - fixes for casefolding when CONFIG_UTF8 != y
    
    * tag 'bcachefs-2025-07-17' of git://evilpiepirate.org/bcachefs:
      bcachefs: Fix bch2_maybe_casefold() when CONFIG_UTF8=n
      bcachefs: Fix build when CONFIG_UNICODE=n
      bcachefs: Fix reference to invalid bucket in copygc
      bcachefs: Don't build aux search tree when still repairing node
      bcachefs: Tweak threshold for allocator triggering discards
      bcachefs: Fix triggering of discard by the journal path
      bcachefs: io_read: remove from async obj list in rbio_done()
    

Final

  • 2025-07-25: Merge tag ‘bcachefs-2025-07-24’ of git://evilpiepirate.org/bcachefs (commit)

    Pull bcachefs fixes from Kent Overstreet:
     "User reported fixes:
    
       - Fix btree node scan on encrypted filesystems by not using btree
         node header fields that are encrypted
    
       - Fix a race in btree write buffer flush; this caused EROs primarily
         during fsck for some people"
    
    * tag 'bcachefs-2025-07-24' of git://evilpiepirate.org/bcachefs:
      bcachefs: Add missing snapshots_seen_add_inorder()
      bcachefs: Fix write buffer flushing from open journal entry
      bcachefs: btree_node_scan: don't re-read before initializing found_btree_node