Age | Commit message (Collapse) | Author | Files | Lines |
|
Cosmetic, but imho it makes the usage look more clear and simple, the new
helper doesn't require to initialize "t".
After this change while_each_thread() has only 3 users, and it is only
used in the do/while loops.
Link: https://lkml.kernel.org/r/20231030155710.GA9095@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Hugetlb vmemmap default option (HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON)
is a sub-option to hugetlbfs, but it shows in the same level as hugetlbfs
itself, under "Pesudo filesystems".
Make the vmemmap option a sub-option to hugetlbfs, by changing hugetlbfs
into a menuconfig. When moving it, fix a typo 'v' spot by Randy.
Link: https://lkml.kernel.org/r/20231124151902.1075697-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
There were already assertions that we were not passing a tail page to
error_remove_page(), so make the compiler enforce that by converting
everything to pass and use a folio.
Link: https://lkml.kernel.org/r/20231117161447.2461643-7-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In preparation for removing the return value entirely, stop testing it
in smb.
Link: https://lkml.kernel.org/r/20231108204605.745109-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Steve French <sfrench@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In preparation for removing the return value entirely, stop testing it
in afs.
Link: https://lkml.kernel.org/r/20231108204605.745109-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Steve French <sfrench@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Use folio_fill_tail() to implement the unstuffing and folio_end_read() to
simultaneously mark the folio uptodate and unlock it. Unifies a couple of
code paths.
Link: https://lkml.kernel.org/r/20231107212643.3490372-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The iomap code was limited to PAGE_SIZE bytes; generalise it to cover
an arbitrary-sized folio, and move it to be a common helper.
[akpm@linux-foundation.org: fix folio_fill_tail(), per Andreas Gruenbacher]
Link: https://lkml.kernel.org/r/20231107212643.3490372-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "Add folio_zero_tail() and folio_fill_tail()".
I'm trying to make it easier for filesystems with tailpacking / stuffing /
inline data to use folios. The primary function here is
folio_fill_tail(). You give it a pointer to memory where the data
currently is, and it takes care of copying it into the folio at that
offset. That works for gfs2 & iomap. Then There's Ext4. Rather than gin
up some kind of specialist "Here's a two pointers to two blocks of memory"
routine, just let it do its current thing, and let it call
folio_zero_tail(), which is also called by folio_fill_tail().
Other filesystems can be converted later; these ones seemed like good
examples as they're already partly or completely converted to folios.
This patch (of 3):
Instead of unmapping the folio after copying the data to it, then mapping
it again to zero the tail, provide folio_zero_tail() to zero the tail of
an already-mapped folio.
[akpm@linux-foundation.org: fix kerneldoc argument ordering]
Link: https://lkml.kernel.org/r/20231107212643.3490372-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20231107212643.3490372-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The PAGEMAP_SCAN ioctl returns information regarding page table entries.
It is more efficient compared to reading pagemap files. CRIU can start to
utilize this ioctl, but it needs info about soft-dirty bits to track
memory changes.
We are aware of a new method for tracking memory changes implemented in
the PAGEMAP_SCAN ioctl. For CRIU, the primary advantage of this method is
its usability by unprivileged users. However, it is not feasible to
transparently replace the soft-dirty tracker with the new one. The main
problem here is userfault descriptors that have to be preserved between
pre-dump iterations. It means criu continues supporting the soft-dirty
method to avoid breakage for current users. The new method will be
implemented as a separate feature.
[avagin@google.com: update tools/include/uapi/linux/fs.h]
Link: https://lkml.kernel.org/r/20231107164139.576046-1-avagin@google.com
Link: https://lkml.kernel.org/r/20231106220959.296568-1-avagin@google.com
Signed-off-by: Andrei Vagin <avagin@google.com>
Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Cc: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Since outstanding journal buffers hold a journal pin, when flushing all
pins we need to close the current journal entry if necessary so its pin
can be released.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char / misc driver fixes from Greg KH:
"Here are some small fixes for 6.7-rc5 for a variety of small driver
subsystems. Included in here are:
- debugfs revert for reported issue
- greybus revert for reported issue
- greybus fixup for endian build warning
- coresight driver fixes
- nvmem driver fixes
- devcoredump fix
- parport new device id
- ndtest build fix
All of these have ben in linux-next with no reported issues"
* tag 'char-misc-6.7-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
nvmem: Do not expect fixed layouts to grab a layout driver
parport: Add support for Brainboxes IX/UC/PX parallel cards
Revert "greybus: gb-beagleplay: Ensure le for values in transport"
greybus: gb-beagleplay: Ensure le for values in transport
greybus: BeaglePlay driver needs CRC_CCITT
Revert "debugfs: annotate debugfs handlers vs. removal with lockdep"
devcoredump: Send uevent once devcd is ready
ndtest: fix typo class_regster -> class_register
misc: mei: client.c: fix problem of return '-EOVERFLOW' in mei_cl_write
misc: mei: client.c: return negative error code in mei_cl_write
mei: pxp: fix mei_pxp_send_message return value
coresight: ultrasoc-smb: Fix uninitialized before use buf_hw_base
coresight: ultrasoc-smb: Config SMB buffer before register sink
coresight: ultrasoc-smb: Fix sleep while close preempt in enable_smb
Documentation: coresight: fix `make refcheckdocs` warning
hwtracing: hisi_ptt: Don't try to attach a task
hwtracing: hisi_ptt: Handle the interrupt in hardirq context
hwtracing: hisi_ptt: Add dummy callback pmu::read()
coresight: Fix crash when Perf and sysfs modes are used concurrently
coresight: etm4x: Remove bogous __exit annotation for some functions
|
|
Adjust a comment that was missed during commit 15baf55481de
("fscrypt: track master key presence separately from secret").
Link: https://lore.kernel.org/r/20231206002127.14790-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
Pull smb client fixes from Steve French:
"Six smb3 client fixes:
- Fixes for copy_file_range and clone (cache invalidation and file
size), also addresses an xfstest failure
- Fix to return proper error if REMAP_FILE_DEDUP set (also fixes
xfstest generic/304)
- Fix potential null pointer reference with DFS
- Multichannel fix addressing (reverting an earlier patch) some of
the problems with enabling/disabling channels dynamically
Still working on a followon multichannel fix to address another issue
found in reconnect testing that will send next week"
* tag '6.7-rc4-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: reconnect worker should take reference on server struct unconditionally
Revert "cifs: reconnect work should have reference on server struct"
cifs: Fix non-availability of dedup breaking generic/304
smb: client: fix potential NULL deref in parse_dfs_referrals()
cifs: Fix flushing, invalidation and file size with FICLONE
cifs: Fix flushing, invalidation and file size with copy_file_range()
|
|
Fix function parameters to prevent kernel-doc warnings.
vboxsf_wrappers.c:132: warning: Function parameter or member 'create_parms' not described in 'vboxsf_create'
vboxsf_wrappers.c:132: warning: Excess function parameter 'param' description in 'vboxsf_create'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Hans de Goede <hdegoede@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Link: https://lore.kernel.org/r/20231206025355.31814-1-rdunlap@infradead.org
|
|
Simplify error path when "private" needs to be freed.
Cc: Guilherme G. Piccoli <gpiccoli@igalia.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: <linux-hardening@vger.kernel.org>
Link: https://lore.kernel.org/r/20231205182622.1329923-4-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
|
|
Simplify error path for failures where "inode" needs to be freed.
Cc: Guilherme G. Piccoli <gpiccoli@igalia.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: <linux-hardening@vger.kernel.org>
Link: https://lore.kernel.org/r/20231205182622.1329923-3-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
|
|
Replace open-coded mutex handling with cleanup.h guard(mutex) and
scoped_guard(mutex, ...).
Cc: Guilherme G. Piccoli <gpiccoli@igalia.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: <linux-hardening@vger.kernel.org>
Link: https://lore.kernel.org/r/20231205182622.1329923-2-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
|
|
Mostly as an example to myself, replace a simple allocation pattern with
the automatic kfree cleanup features now exposed by cleanup.h.
Cc: Guilherme G. Piccoli <gpiccoli@igalia.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: <linux-hardening@vger.kernel.org>
Link: https://lore.kernel.org/r/20231205182622.1329923-1-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
|
|
In persistent_ram_init_ecc(), on 64-bit arches DIV_ROUND_UP() will return
64-bit value since persistent_ram_zone::buffer_size has type size_t which
is derived from the 64-bit *unsigned long*, while the ecc_blocks variable
this value gets assigned to has (always 32-bit) *int* type. Even if that
value fits into *int* type, an overflow is still possible when calculating
the size_t typed ecc_total variable further below since there's no cast to
any 64-bit type before multiplication. Declaring the ecc_blocks variable
as *size_t* should fix this mess...
Found by Linux Verification Center (linuxtesting.org) with the SVACE static
analysis tool.
Fixes: 9cc05ad97c57 ("staging: android: persistent_ram: refactor ecc support")
Signed-off-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Link: https://lore.kernel.org/r/20231105202936.25694-1-s.shtylyov@omp.ru
Signed-off-by: Kees Cook <keescook@chromium.org>
|
|
When the number of cpu cores is adjusted to 7 or other odd numbers,
the zone size will become an odd number.
The address of the zone will become:
addr of zone0 = BASE
addr of zone1 = BASE + zone_size
addr of zone2 = BASE + zone_size*2
...
The address of zone1/3/5/7 will be mapped to non-alignment va.
Eventually crashes will occur when accessing these va.
So, use ALIGN_DOWN() to make sure the zone size is even
to avoid this bug.
Signed-off-by: Weichen Chen <weichen.chen@mediatek.com>
Reviewed-by: Matthias Brugger <matthias.bgg@gmail.com>
Tested-by: "Guilherme G. Piccoli" <gpiccoli@igalia.com>
Link: https://lore.kernel.org/r/20230224023632.6840-1-weichen.chen@mediatek.com
Signed-off-by: Kees Cook <keescook@chromium.org>
|
|
Blocks are tracked by u32, so the max permitted filesize is
(U32_MAX + 1) * BLOCK_SIZE. Additionally, in order to support crypto
data unit sizes of 4K with a 16K block with IV_INO_LBLK_{32,64}, we must
further restrict max filesize to (U32_MAX + 1) * 4096. This does not
affect 4K blocksize f2fs as the natural limit for files are well below
that.
Fixes: d7e9a9037de2 ("f2fs: Support Block Size == Page Size")
Signed-off-by: Daniel Rosenberg <drosen@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"31 hotfixes. Ten of these address pre-6.6 issues and are marked
cc:stable. The remainder address post-6.6 issues or aren't considered
serious enough to justify backporting"
* tag 'mm-hotfixes-stable-2023-12-07-18-47' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (31 commits)
mm/madvise: add cond_resched() in madvise_cold_or_pageout_pte_range()
nilfs2: prevent WARNING in nilfs_sufile_set_segment_usage()
mm/hugetlb: have CONFIG_HUGETLB_PAGE select CONFIG_XARRAY_MULTI
scripts/gdb: fix lx-device-list-bus and lx-device-list-class
MAINTAINERS: drop Antti Palosaari
highmem: fix a memory copy problem in memcpy_from_folio
nilfs2: fix missing error check for sb_set_blocksize call
kernel/Kconfig.kexec: drop select of KEXEC for CRASH_DUMP
units: add missing header
drivers/base/cpu: crash data showing should depends on KEXEC_CORE
mm/damon/sysfs-schemes: add timeout for update_schemes_tried_regions
scripts/gdb/tasks: fix lx-ps command error
mm/Kconfig: make userfaultfd a menuconfig
selftests/mm: prevent duplicate runs caused by TEST_GEN_PROGS
mm/damon/core: copy nr_accesses when splitting region
lib/group_cpus.c: avoid acquiring cpu hotplug lock in group_cpus_evenly
checkstack: fix printed address
mm/memory_hotplug: fix error handling in add_memory_resource()
mm/memory_hotplug: add missing mem_hotplug_lock
.mailmap: add a new address mapping for Chester Lin
...
|
|
MS confirm that "AISi" name of SMB2_CREATE_ALLOCATION_SIZE in MS-SMB2
specification is a typo. cifs/ksmbd have been using this wrong name from
MS-SMB2. It should be "AlSi". Also It will cause problem when running
smb2.create.open test in smbtorture against ksmbd.
Cc: stable@vger.kernel.org
Fixes: 12197a7fdda9 ("Clarify SMB2/SMB3 create context and add missing ones")
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Reviewed-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
When client send SMB2_CREATE_ALLOCATION_SIZE create context, ksmbd update
old size to ->AllocationSize in smb2 create response. ksmbd_vfs_getattr()
should be called after it to get updated stat result.
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
opinfo_put() could be called twice on error of smb21_lease_break_ack().
It will cause UAF issue if opinfo is referenced on other places.
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Don't immediately send directory lease break notification on smb2_write().
Instead, It postpones it until smb2_close().
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
If client send different parent key, different client guid, or there is
no parent lease key flags in create context v2 lease, ksmbd send lease
break to client.
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
We could delete directories transactionally on rmdir()/unlink(), but we
don't; instead, like with regular files we wait for the VFS to call
evict().
That means that our check for directories in the deleted inodes btree is
wrong - the check should be for non-empty directories.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Move xfs_ondisk.h to libxfs so that we can do the struct sanity checks
in userspace libxfs as well. This should allow us to retire the
somewhat fragile xfs/122 test on xfstests.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
|
|
Use the compiler-provided static_assert built-in from C11 instead of
the kernel-specific BUILD_BUG_ON_MSG for the structure size and offset
checks in xfs_ondisk. This not only gives slightly nicer error messages
in case things go south, but can also be trivially used as-is in
userspace.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
|
|
This patch does not modify logic.
xfs_da_buf_copy() will copy one block from src xfs_buf to
dst xfs_buf, and update the block metadata in dst directly.
Signed-off-by: Zhang Tianci <zhangtianci.1997@bytedance.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
|
|
xfs_da3_swap_lastblock() copy the last block content to the dead block,
but do not update the metadata in it. We need update some metadata
for some kinds of type block, such as dir3 leafn block records its
blkno, we shall update it to the dead block blkno. Otherwise,
before write the xfs_buf to disk, the verify_write() will fail in
blk_hdr->blkno != xfs_buf->b_bn, then xfs will be shutdown.
We will get this warning:
XFS (dm-0): Metadata corruption detected at xfs_dir3_leaf_verify+0xa8/0xe0 [xfs], xfs_dir3_leafn block 0x178
XFS (dm-0): Unmount and run xfs_repair
XFS (dm-0): First 128 bytes of corrupted metadata buffer:
00000000e80f1917: 00 80 00 0b 00 80 00 07 3d ff 00 00 00 00 00 00 ........=.......
000000009604c005: 00 00 00 00 00 00 01 a0 00 00 00 00 00 00 00 00 ................
000000006b6fb2bf: e4 44 e3 97 b5 64 44 41 8b 84 60 0e 50 43 d9 bf .D...dDA..`.PC..
00000000678978a2: 00 00 00 00 00 00 00 83 01 73 00 93 00 00 00 00 .........s......
00000000b28b247c: 99 29 1d 38 00 00 00 00 99 29 1d 40 00 00 00 00 .).8.....).@....
000000002b2a662c: 99 29 1d 48 00 00 00 00 99 49 11 00 00 00 00 00 .).H.....I......
00000000ea2ffbb8: 99 49 11 08 00 00 45 25 99 49 11 10 00 00 48 fe .I....E%.I....H.
0000000069e86440: 99 49 11 18 00 00 4c 6b 99 49 11 20 00 00 4d 97 .I....Lk.I. ..M.
XFS (dm-0): xfs_do_force_shutdown(0x8) called from line 1423 of file fs/xfs/xfs_buf.c. Return address = 00000000c0ff63c1
XFS (dm-0): Corruption of in-memory data detected. Shutting down filesystem
XFS (dm-0): Please umount the filesystem and rectify the problem(s)
>From the log above, we know xfs_buf->b_no is 0x178, but the block's hdr record
its blkno is 0x1a0.
Fixes: 24df33b45ecf ("xfs: add CRC checking to dir2 leaf blocks")
Signed-off-by: Zhang Tianci <zhangtianci.1997@bytedance.com>
Suggested-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
|
|
In the case of returning -ENOSPC, ensure logflagsp is initialized by 0.
Otherwise the caller __xfs_bunmapi will set uninitialized illegal
tmp_logflags value into xfs log, which might cause unpredictable error
in the log recovery procedure.
Also, remove the flags variable and set the *logflagsp directly, so that
the code should be more robust in the long run.
Fixes: 1b24b633aafe ("xfs: move some more code into xfs_bmap_del_extent_real")
Signed-off-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
|
|
Use struct types instead of typedefs so that the header can be included
with pulling in the headers that define the typedefs, and remove the
pointless externs.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
|
|
xfs_reserve_blocks has a very odd interface that can only be explained
by it directly deriving from the IRIX fcntl handler back in the day.
Split reporting out the reserved blocks out of xfs_reserve_blocks into
the only caller that cares. This means that the value reported from
XFS_IOC_SET_RESBLKS isn't atomically sampled in the same critical
section as when it was set anymore, but as the values could change
right after setting them anyway that does not matter. It does
provide atomic sampling of both values for XFS_IOC_GET_RESBLKS now,
though.
Also pass a normal scalar integer value for the requested value instead
of the pointless pointer.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
|
|
Split XFS_IOC_FSCOUNTS out of the main xfs_file_ioctl function, and
merge the xfs_fs_counts helper into the ioctl handler.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
|
|
The XFS_IOC_GET_RESBLKS and XFS_IOC_SET_RESBLKS already share a fair
amount of code, and will share even more soon. Move the logic for both
of them out of the main xfs_file_ioctl function into a
xfs_ioctl_getset_resblocks helper to share the code and prepare for
additional changes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
|
|
Now, if we suddenly remove a PMEM device(by calling unbind) which
contains FSDAX while programs are still accessing data in this device,
e.g.:
```
$FSSTRESS_PROG -d $SCRATCH_MNT -n 99999 -p 4 &
# $FSX_PROG -N 1000000 -o 8192 -l 500000 $SCRATCH_MNT/t001 &
echo "pfn1.1" > /sys/bus/nd/drivers/nd_pmem/unbind
```
it could come into an unacceptable state:
1. device has gone but mount point still exists, and umount will fail
with "target is busy"
2. programs will hang and cannot be killed
3. may crash with NULL pointer dereference
To fix this, we introduce a MF_MEM_PRE_REMOVE flag to let it know that we
are going to remove the whole device, and make sure all related processes
could be notified so that they could end up gracefully.
This patch is inspired by Dan's "mm, dax, pmem: Introduce
dev_pagemap_failure()"[1]. With the help of dax_holder and
->notify_failure() mechanism, the pmem driver is able to ask filesystem
on it to unmap all files in use, and notify processes who are using
those files.
Call trace:
trigger unbind
-> unbind_store()
-> ... (skip)
-> devres_release_all()
-> kill_dax()
-> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
-> xfs_dax_notify_failure()
`-> freeze_super() // freeze (kernel call)
`-> do xfs rmap
` -> mf_dax_kill_procs()
` -> collect_procs_fsdax() // all associated processes
` -> unmap_and_kill()
` -> invalidate_inode_pages2_range() // drop file's cache
`-> thaw_super() // thaw (both kernel & user call)
Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
event. Use the exclusive freeze/thaw[2] to lock the filesystem to prevent
new dax mapping from being created. Do not shutdown filesystem directly
if configuration is not supported, or if failure range includes metadata
area. Make sure all files and processes(not only the current progress)
are handled correctly. Also drop the cache of associated files before
pmem is removed.
[1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
[2]: https://lore.kernel.org/linux-xfs/169116275623.3187159.16862410128731457358.stg-ugh@frogsfrogsfrogs/
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
|
|
Introduce the concept of a defer ops barrier to separate consecutively
queued pending work items of the same type. With a barrier in place,
the two work items will be tracked separately, and receive separate log
intent items. The goal here is to prevent reaping of old metadata
blocks from creating unnecessarily huge EFIs that could then run the
risk of overflowing the scrub transaction.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
We need to log EFIs for every extent that we allocate for the purpose of
staging a new btree so that if we fail then the blocks will be freed
during log recovery. Use the autoreaping mechanism provided by the
previous patch to attach paused freeing work to the scrub transaction.
We can then mark the EFIs stale if we decide to commit the new btree, or
we can unpause the EFIs if we decide to abort the repair.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Create a new xrep_newbt structure to encapsulate a fake root for
creating a staged btree cursor as well as to track all the blocks that
we need to reserve in order to build that btree.
As for the particular choice of lowspace thresholds and btree block
slack factors -- at this point one could say that the thresholds in
online repair come from bulkload_estimate_ag_slack in xfs_repair[1].
But that's not the entire story, since the offline btree rebuilding
code in xfs_repair was merged as a retroport of the online btree code
in this patchset!
Before xfs_btree_staging.[ch] came along, xfs_repair determined the
slack factor (aka the number of slots to leave unfilled in each new
btree block) via open-coded logic in repair/phase5.c[2]. At that point
the slack factors were arbitrary quantities per btree. The rmapbt
automatically left 10 slots free; everything else left zero.
That had a noticeable effect on performance straight after mounting
because adding records to /any/ btree would result in splits. A few
years ago when this patch was first written, Dave and I decided that
repair should generate btree blocks that were 75% full unless space was
tight, in which case it should try to fill the blocks to nearly full.
We defined tight as ~10% free to avoid repair failures but settled on
3/32 (~9%) to avoid div64.
IOWs, we mostly pulled the thresholds out of thin air. We've been
QAing with those geometry numbers ever since. ;)
Link: https://git.kernel.org/pub/scm/fs/xfs/xfsprogs-dev.git/tree/repair/bulkload.c?h=v6.5.0#n114
Link: https://git.kernel.org/pub/scm/fs/xfs/xfsprogs-dev.git/tree/repair/phase5.c?h=v4.19.0#n1349
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
|
Remove these unused fields since nobody uses them. They should have
been removed years ago in a different cleanup series from Christoph
Hellwig.
Fixes: daf83964a3681 ("xfs: move the per-fork nextents fields into struct xfs_ifork")
Fixes: f7e67b20ecbbc ("xfs: move the fork format fields into struct xfs_ifork")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
|
As mentioned in the previous commit, online repair wants to allocate
space to write out a new metadata structure, and it also wants to hedge
against system crashes during repairs by logging (and later cancelling)
EFIs to free the space if we crash before committing the new data
structure.
Therefore, create a trio of functions to schedule automatic reaping of
freshly allocated unwritten space. xfs_alloc_schedule_autoreap creates
a paused EFI representing the space we just allocated. Once the
allocations are made and the autoreaps scheduled, we can start writing
to disk.
If the writes succeed, xfs_alloc_cancel_autoreap marks the EFI work
items as stale and unpauses the pending deferred work item. Assuming
that's done in the same transaction that commits the new structure into
the filesystem, we guarantee that either the new object is fully
visible, or that all the space gets reclaimed.
If the writes succeed but only part of an extent was used, repair must
call the same _cancel_autoreap function to kill the first EFI and then
log a new EFI to free the unused space. The first EFI is already
committed, so it cannot be changed.
For full extents that aren't used, xfs_alloc_commit_autoreap will
unpause the EFI, which results in the space being freed during the next
_defer_finish cycle.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
xfs_free_extent_later is a trivial helper, so remove it to reduce the
amount of thinking required to understand the deferred freeing
interface. This will make it easier to introduce automatic reaping of
speculative allocations in the next patch.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Traditionally, all pending deferred work attached to a transaction is
finished when one of the xfs_defer_finish* functions is called.
However, online repair wants to be able to allocate space for a new data
structure, format a new metadata structure into the allocated space, and
commit that into the filesystem.
As a hedge against system crashes during repairs, we also want to log
some EFI items for the allocated space speculatively, and cancel them if
we elect to commit the new data structure.
Therefore, introduce the idea of pausing a pending deferred work item.
Log intent items are still created for paused items and relogged as
necessary. However, paused items are pushed onto a side list before we
start calling ->finish_item, and the whole list is reattach to the
transaction afterwards. New work items are never attached to paused
pending items.
Modify xfs_defer_cancel to clean up pending deferred work items holding
a log intent item but not a log intent done item, since that is now
possible.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
When someone tries to add a deferred work item to xfs_defer_add, it will
try to attach the work item to the most recently added xfs_defer_pending
object attached to the transaction. However, it doesn't check if the
pending object has a log intent item attached to it. This is incorrect
behavior because we cannot add more work to an object that has already
been committed to the ondisk log.
Therefore, change the behavior not to append to pending items with a non
null dfp_intent. In practice this has not been an issue because the
only way xfs_defer_add gets called after log intent items have been
committed is from the defer ops ->finish_item functions themselves, and
the @dop_pending isolation in xfs_defer_finish_noroll protects the
pending items that have already been logged.
However, the next patch will add the ability to pause a deferred extent
free object during online btree rebuilding, and any new extfree work
items need to have their own pending event.
While we're at it, hoist the predicate to its own static inline function
for readability.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
When scrub is trying to iget an inode, ensure that it won't end up
deadlocked on a cycle in the inode btree by using an empty transaction
to store all the buffers.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Extended attribute updates use the deferred work machinery to manage
state across a chain of smaller transactions. All previous deferred
work users have employed log intent items and log done items to manage
restarting of interrupted operations, which means that ->create_intent
sets dfp_intent to a log intent item and ->create_done uses that item to
create a log intent done item.
However, xattrs have used the INCOMPLETE flag to deal with the lack of
recovery support for an interrupted transaction chain. Log items are
optional if the xattr update caller didn't set XFS_DA_OP_LOGGED to
require a restartable sequence.
In other words, ->create_intent can return NULL to say that there's no
log intent item. If that's the case, no log intent done item should be
created. Clean up xfs_defer_create_done not to do this, so that the
->create_done functions don't have to check for non-null dfp_intent
themselves.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Christoph requested a blurb somewhere explaining exactly what LARP
means. I don't know of a good place other than the source code (debug
knobs aren't covered in Documentation/), so here it is.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|