btrfs: use delayed items when logging a directory

When logging a directory we start by flushing all its delayed items.
That results in adding dir index items to the subvolume btree, for new
dentries, and removing dir index items from the subvolume btree for any
dentries that were deleted.

This makes it straightforward to log a directory simply by iterating over
all the modified subvolume btree leaves, especially when we used to log
both dir index keys and dir item keys (before commit 339d035424849c
("btrfs: only copy dir index keys when logging a directory")) and when we
used to copy old dir index entries for leaves modified in the current
transaction (before commit 732d591a5d6c12 ("btrfs: stop copying old dir
items when logging a directory")).

From an efficiency point of view this has a couple of drawbacks:

1) Adds extra latency, due to copying delayed items to the subvolume btree
   and deleting dir index items from the btree.

   Further, if there are other tasks accessing the btree, which is common
   (syscalls like creat, mkdir, rename, link, unlink, truncate, reflinks,
   etc, finishing an ordered extent, etc), lock contention can cause
   further delays, both to the task logging a directory and to the other
   tasks accessing the btree;

2) More time spent overall flushing delayed items, if after logging the
   directory further changes are done to the directory in the same
   transaction.

   For example, if we add 10 dentries to a directory, fsync it, add 10
   more dentries, fsync it again, then add 10 more dentries and fsync it
   again, then we end up inserting 3 batches of 10 items to the subvolume
   btree. With the changes from this patch, we flush all the delayed items
   to the btree only once - a single batch of 30 items, and outside the
   logging code (transaction commit or when delayed items are flushed
   asynchronously).

This change simply skips the flushing of delayed items every time we log a
directory. Instead we copy the delayed insertion items directly to the log
tree and delete delayed deletion items directly from the log tree.
This avoids first changing the subvolume btree and then scanning it for
new items to copy from it to the log tree, and detecting deletions by
observing gaps in consecutive dir index keys in subvolume btree leaves.

Running the following tests on a non-debug kernel (Debian's default kernel
config), on a box with an NVMe device, a 12-core Intel CPU and 64G of RAM,
produced the results below.

The results compare a branch without this patch and all the other patches
it depends on versus the same branch with the patchset applied.

The patchset consists of the following patches:

  btrfs: don't drop dir index range items when logging a directory
  btrfs: remove the root argument from log_new_dir_dentries()
  btrfs: update stale comment for log_new_dir_dentries()
  btrfs: free list element sooner at log_new_dir_dentries()
  btrfs: avoid memory allocation at log_new_dir_dentries() for common case
  btrfs: remove root argument from btrfs_delayed_item_reserve_metadata()
  btrfs: store index number instead of key in struct btrfs_delayed_item
  btrfs: remove unused logic when looking up delayed items
  btrfs: shrink the size of struct btrfs_delayed_item
  btrfs: search for last logged dir index if it's not cached in the inode
  btrfs: move need_log_inode() to above log_conflicting_inodes()
  btrfs: move log_new_dir_dentries() above btrfs_log_inode()
  btrfs: log conflicting inodes without holding log mutex of the initial inode
  btrfs: skip logging parent dir when conflicting inode is not a dir
  btrfs: use delayed items when logging a directory

Custom test script for testing time spent at btrfs_log_inode():

   #!/bin/bash

   DEV=/dev/nvme0n1
   MNT=/mnt/nvme0n1

   # Total number of files to create in the test directory.
   NUM_FILES=10000
   # Fsync after creating or renaming N files.
   FSYNC_AFTER=100

   umount $DEV &> /dev/null
   mkfs.btrfs -f $DEV
   mount -o ssd $DEV $MNT

   TEST_DIR=$MNT/testdir
   mkdir $TEST_DIR

   echo "Creating files..."
   for ((i = 1; i <= $NUM_FILES; i++)); do
           echo -n > $TEST_DIR/file_$i
           if (( ($i % $FSYNC_AFTER) == 0 )); then
                   xfs_io -c "fsync" $TEST_DIR
           fi
   done

   sync

   echo "Renaming files..."
   for ((i = 1; i <= $NUM_FILES; i++)); do
           mv $TEST_DIR/file_$i $TEST_DIR/file_$i.renamed
           if (( ($i % $FSYNC_AFTER) == 0 )); then
                   xfs_io -c "fsync" $TEST_DIR
           fi
   done

   umount $MNT

And using the following bpftrace script to capture the total time, in
microseconds, that is spent at btrfs_log_inode():

   #!/usr/bin/bpftrace

   k:btrfs_log_inode
   {
       @start_log_inode[tid] = nsecs;
   }

   kr:btrfs_log_inode
   /@start_log_inode[tid]/
   {
       $dur = (nsecs - @start_log_inode[tid]) / 1000;
       @btrfs_log_inode_total_time = sum($dur);
       delete(@start_log_inode[tid]);
   }

   END
   {
       clear(@start_log_inode);
   }

Result before applying patchset:

   @btrfs_log_inode_total_time: 622642

Result after applying patchset:

   @btrfs_log_inode_total_time: 354134    (-43.1% time spent)

The following dbench script was also used for testing:

   #!/bin/bash

   NUM_JOBS=$(nproc --all)

   DEV=/dev/nvme0n1
   MNT=/mnt/nvme0n1
   MOUNT_OPTIONS="-o ssd"
   MKFS_OPTIONS="-O no-holes -R free-space-tree"

   echo "performance" | \
       tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

   umount $DEV &> /dev/null
   mkfs.btrfs -f $MKFS_OPTIONS $DEV
   mount $MOUNT_OPTIONS $DEV $MNT

   dbench -D $MNT --skip-cleanup -t 120 -S $NUM_JOBS

   umount $MNT

Before patchset:

   Operation      Count    AvgLat    MaxLat
   ----------------------------------------
   NTCreateX    3322265     0.034    21.032
   Close        2440562     0.002     0.994
   Rename        140664     1.150   269.633
   Unlink        670796     1.093   269.678
   Deltree           96     5.481    15.510
   Mkdir             48     0.004     0.052
   Qpathinfo    3010924     0.014     8.127
   Qfileinfo     528055     0.001     0.518
   Qfsinfo       552113     0.003     0.372
   Sfileinfo     270575     0.005     0.688
   Find         1164176     0.052    13.931
   WriteX       1658537     0.019     5.918
   ReadX        5207412     0.003     1.034
   LockX          10818     0.003     0.079
   UnlockX        10818     0.002     0.313
   Flush         232811     1.027   269.735

  Throughput 869.867 MB/sec (sync dirs)  12 clients  12 procs  max_latency=269.741 ms

After patchset:

   Operation      Count    AvgLat    MaxLat
   ----------------------------------------
   NTCreateX    4152738     0.029    20.863
   Close        3050770     0.002     1.119
   Rename        175829     0.871   211.741
   Unlink        838447     0.845   211.724
   Deltree          120     4.798    14.162
   Mkdir             60     0.003     0.005
   Qpathinfo    3763807     0.011     4.673
   Qfileinfo     660111     0.001     0.400
   Qfsinfo       690141     0.003     0.429
   Sfileinfo     338260     0.005     0.725
   Find         1455273     0.046     6.787
   WriteX       2073307     0.017     5.690
   ReadX        6509193     0.003     1.171
   LockX          13522     0.003     0.077
   UnlockX        13522     0.002     0.125
   Flush         291044     0.811   211.631

  Throughput 1089.27 MB/sec (sync dirs)  12 clients  12 procs  max_latency=211.750 ms

  (+25.2% throughput, -21.5% max latency)

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
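
The batching argument from drawback 2) in the message above can be checked with a small userspace model. This is only an illustrative sketch, not kernel code: struct flush_stats, record_batch() and simulate_flushes() are hypothetical names, and the model assumes that the only events are "add N dentries, then fsync" followed by one transaction commit that flushes whatever is still pending.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical counters describing insertions into the subvolume btree. */
struct flush_stats {
	unsigned int batches;       /* number of separate insertion batches */
	unsigned int items;         /* total items inserted */
	unsigned int largest_batch; /* size of the biggest single batch */
};

static void record_batch(struct flush_stats *st, unsigned int nr_items)
{
	st->batches++;
	st->items += nr_items;
	if (nr_items > st->largest_batch)
		st->largest_batch = nr_items;
}

/*
 * adds[i] is the number of dentries created right before the i-th fsync.
 * With flush_on_fsync set, every fsync flushes the pending delayed items to
 * the subvolume btree (behaviour before the patch). Without it, the items
 * stay delayed until the final transaction commit (behaviour after it).
 */
struct flush_stats simulate_flushes(const unsigned int *adds, size_t nr_fsyncs,
				    bool flush_on_fsync)
{
	struct flush_stats st = { 0, 0, 0 };
	unsigned int pending = 0;

	for (size_t i = 0; i < nr_fsyncs; i++) {
		pending += adds[i];
		if (flush_on_fsync && pending > 0) {
			record_batch(&st, pending);
			pending = 0;
		}
	}

	/* Assumed transaction commit: flush anything still delayed. */
	if (pending > 0)
		record_batch(&st, pending);

	return st;
}
```

Calling simulate_flushes() with adds = {10, 10, 10} and flush_on_fsync set models the old behaviour, and calling it with flush_on_fsync cleared models the new one. The model deliberately ignores asynchronous flushing of delayed items, which the commit message mentions as another place where the single flush can happen.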
author: Filipe Manana <fdmanana@suse.com> 2022-08-22 13:51:44 +0300
committer: David Sterba <dsterba@suse.com> 2022-09-26 13:27:57 +0300
commit: 30b80f3ce0f9d58ab8a2094922f3d54d2fd4f92e (patch)
tree: 5fcfb85b8dcd86f7c588c7dc79f3d43b27286725 /fs/btrfs/delayed-inode.c
parent: 5557a069f3d7bfe5c9af5f04594133dad0fcacc7 (diff)
download: linux-30b80f3ce0f9d58ab8a2094922f3d54d2fd4f92e.tar.xz
1 files changed, 112 insertions, 0 deletions
diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index b35eddcac846..cac5169eaf8d 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -315,6 +315,8 @@ static struct btrfs_delayed_item *btrfs_alloc_delayed_item(u16 data_len,
 		item->bytes_reserved = 0;
 		item->delayed_node = node;
 		RB_CLEAR_NODE(&item->rb_node);
+		INIT_LIST_HEAD(&item->log_list);
+		item->logged = false;
 		refcount_set(&item->refs, 1);
 	}
 	return item;
@@ -2045,3 +2047,113 @@ void btrfs_destroy_delayed_inodes(struct btrfs_fs_info *fs_info)
 	}
 }
 
+void btrfs_log_get_delayed_items(struct btrfs_inode *inode,
+				 struct list_head *ins_list,
+				 struct list_head *del_list)
+{
+	struct btrfs_delayed_node *node;
+	struct btrfs_delayed_item *item;
+
+	node = btrfs_get_delayed_node(inode);
+	if (!node)
+		return;
+
+	mutex_lock(&node->mutex);
+	item = __btrfs_first_delayed_insertion_item(node);
+	while (item) {
+		/*
+		 * It's possible that the item is already in a log list. This
+		 * can happen in case two tasks are trying to log the same
+		 * directory. For example if we have tasks A and B:
+		 *
+		 * Task A collected the delayed items into a log list while
+		 * under the inode's log_mutex (at btrfs_log_inode()), but it
+		 * only releases the items after logging the inodes they point
+		 * to (if they are new inodes), which happens after unlocking
+		 * the log mutex;
+		 *
+		 * Task B enters btrfs_log_inode() and acquires the log_mutex
+		 * of the same directory inode, before task A releases the
+		 * delayed items. This can happen for example when logging some
+		 * inode we need to trigger logging of its parent directory, so
+		 * logging two files that have the same parent directory can
+		 * lead to this.
+		 *
+		 * If this happens, just ignore delayed items already in a log
+		 * list. All the tasks logging the directory are under a log
+		 * transaction and whichever finishes first can not sync the log
+		 * before the other completes and leaves the log transaction.
+		 */
+		if (!item->logged && list_empty(&item->log_list)) {
+			refcount_inc(&item->refs);
+			list_add_tail(&item->log_list, ins_list);
+		}
+		item = __btrfs_next_delayed_item(item);
+	}
+
+	item = __btrfs_first_delayed_deletion_item(node);
+	while (item) {
+		/* It may be non-empty, for the same reason mentioned above. */
+		if (!item->logged && list_empty(&item->log_list)) {
+			refcount_inc(&item->refs);
+			list_add_tail(&item->log_list, del_list);
+		}
+		item = __btrfs_next_delayed_item(item);
+	}
+	mutex_unlock(&node->mutex);
+
+	/*
+	 * We are called during inode logging, which means the inode is in use
+	 * and can not be evicted before we finish logging the inode. So we never
+	 * have the last reference on the delayed inode.
+	 * Also, we don't use btrfs_release_delayed_node() because that would
+	 * requeue the delayed inode (change its order in the list of prepared
+	 * nodes) and we don't want to do such change because we don't create or
+	 * delete delayed items.
+	 */
+	ASSERT(refcount_read(&node->refs) > 1);
+	refcount_dec(&node->refs);
+}
+
+void btrfs_log_put_delayed_items(struct btrfs_inode *inode,
+				 struct list_head *ins_list,
+				 struct list_head *del_list)
+{
+	struct btrfs_delayed_node *node;
+	struct btrfs_delayed_item *item;
+	struct btrfs_delayed_item *next;
+
+	node = btrfs_get_delayed_node(inode);
+	if (!node)
+		return;
+
+	mutex_lock(&node->mutex);
+
+	list_for_each_entry_safe(item, next, ins_list, log_list) {
+		item->logged = true;
+		list_del_init(&item->log_list);
+		if (refcount_dec_and_test(&item->refs))
+			kfree(item);
+	}
+
+	list_for_each_entry_safe(item, next, del_list, log_list) {
+		item->logged = true;
+		list_del_init(&item->log_list);
+		if (refcount_dec_and_test(&item->refs))
+			kfree(item);
+	}
+
+	mutex_unlock(&node->mutex);
+
+	/*
+	 * We are called during inode logging, which means the inode is in use
+	 * and can not be evicted before we finish logging the inode. So we never
+	 * have the last reference on the delayed inode.
+	 * Also, we don't use btrfs_release_delayed_node() because that would
+	 * requeue the delayed inode (change its order in the list of prepared
+	 * nodes) and we don't want to do such change because we don't create or
+	 * delete delayed items.
+	 */
+	ASSERT(refcount_read(&node->refs) > 1);
+	refcount_dec(&node->refs);
+}
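
The collect/release protocol implemented by btrfs_log_get_delayed_items() and btrfs_log_put_delayed_items() can be modelled in userspace to see why a second logger skips items that a first logger already holds. The following is a simplified sketch under stated assumptions, not the kernel implementation: struct item, collect_items() and release_items() are hypothetical names, a plain int stands in for refcount_t, a bool stands in for "log_list is not empty", and the node mutex is omitted because the model is single threaded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct btrfs_delayed_item. */
struct item {
	int index;        /* dir index number */
	int refs;         /* plain counter standing in for refcount_t */
	bool logged;      /* set once the item was handled by a logger */
	bool on_log_list; /* stands in for !list_empty(&item->log_list) */
};

/*
 * Collect every item that is neither logged nor already held by another
 * logger, taking one reference on each. Returns how many were collected.
 */
size_t collect_items(struct item *items, size_t nr_items, struct item **out)
{
	size_t count = 0;

	for (size_t i = 0; i < nr_items; i++) {
		struct item *it = &items[i];

		if (!it->logged && !it->on_log_list) {
			it->refs++;
			it->on_log_list = true;
			out[count++] = it;
		}
	}
	return count;
}

/*
 * Release previously collected items: mark them as logged, take them off
 * the log list and drop the reference taken by collect_items(). The kernel
 * code frees an item when its reference count reaches zero; this model
 * only decrements the counter.
 */
void release_items(struct item **list, size_t count)
{
	for (size_t i = 0; i < count; i++) {
		struct item *it = list[i];

		assert(it->refs > 0);
		it->logged = true;
		it->on_log_list = false;
		it->refs--;
	}
}
```

In this model, if task A calls collect_items() and task B calls it again before A calls release_items(), B collects nothing, which mirrors the "just ignore delayed items already in a log list" rule in the comment above. After A releases the items, the logged flag keeps later callers from collecting them again.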