| Age | Commit message (Collapse) | Author | Files | Lines |
|
commit c0fe2994f9a9d0a2ec9e42441ea5ba74b6a16176 upstream.
Currently calc_target() clears t->paused if the request shouldn't be
paused anymore, but doesn't ever set t->paused even though it's able to
determine when the request should be paused. Setting t->paused is left
to __submit_request() which is fine for regular requests but doesn't
work for linger requests -- since __submit_request() doesn't operate
on linger requests, there is nowhere for lreq->t.paused to be set.
One consequence of this is that watches don't get reestablished on
paused -> unpaused transitions in cases where requests have been paused
long enough for the (paused) unwatch request to time out and for the
subsequent (re)watch request to enter the paused state. On top of the
watch not getting reestablished, rbd_reregister_watch() gets stuck with
rbd_dev->watch_mutex held:
rbd_register_watch
__rbd_register_watch
ceph_osdc_watch
linger_reg_commit_wait
It's waiting for lreq->reg_commit_wait to be completed, but for that to
happen the respective request needs to end up on need_resend_linger list
and be kicked when requests are unpaused. There is no chance for that
if the request in question is never marked paused in the first place.
The fact that rbd_dev->watch_mutex remains taken out forever then
prevents the image from getting unmapped -- "rbd unmap" would inevitably
hang in D state on an attempt to grab the mutex.
Cc: stable@vger.kernel.org
Reported-by: Raphael Zimmer <raphael.zimmer@tu-ilmenau.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 11194b416ef95012c2cfe5f546d71af07b639e93 upstream.
When a fault occurs, the connection is abandoned, reestablished, and any
pending operations are retried. The OSD client tracks the progress of a
sparse-read reply using a separate state machine, largely independent of
the messenger's state.
If a connection is lost mid-payload or the sparse-read state machine
returns an error, the sparse-read state is not reset. The OSD client
will then interpret the beginning of a new reply as the continuation of
the old one. If this makes the sparse-read machinery enter a failure
state, it may never recover, producing loops like:
libceph: [0] got 0 extents
libceph: data len 142248331 != extent len 0
libceph: osd0 (1)...:6801 socket error on read
libceph: data len 142248331 != extent len 0
libceph: osd0 (1)...:6801 socket error on read
Therefore, reset the sparse-read state in osd_fault(), ensuring retries
start from a clean state.
Cc: stable@vger.kernel.org
Fixes: f628d7999727 ("libceph: add sparse read support to OSD client")
Signed-off-by: Sam Edwards <CFSworks@gmail.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e84b48d31b5008932c0a0902982809fbaa1d3b70 upstream.
Currently any error from ceph_auth_handle_reply_done() is propagated
via finish_auth() but isn't returned from mon_handle_auth_done(). This
results in higher layers learning that (despite the monitor considering
us to be successfully authenticated) something went wrong in the
authentication phase and reacting accordingly, but msgr2 still trying
to proceed with establishing the session in the background. In the
case of secure mode this can trigger a WARN in setup_crypto() and later
lead to a NULL pointer dereference inside of prepare_auth_signature().
Cc: stable@vger.kernel.org
Fixes: cd1a677cad99 ("libceph, ceph: implement msgr2.1 protocol (crc and secure modes)")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e3fe30e57649c551757a02e1cad073c47e1e075e upstream.
free_choose_arg_map() may dereference a NULL pointer if its caller fails
after a partial allocation.
For example, in decode_choose_args(), if allocation of arg_map->args
fails, execution jumps to the fail label and free_choose_arg_map() is
called. Since arg_map->size is updated to a non-zero value before memory
allocation, free_choose_arg_map() will iterate over arg_map->args and
dereference a NULL pointer.
To prevent this potential NULL pointer dereference and make
free_choose_arg_map() more resilient, add checks for pointers before
iterating.
Cc: stable@vger.kernel.org
Co-authored-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Tuo Li <islituo@gmail.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e00c3f71b5cf75681dbd74ee3f982a99cb690c2b upstream.
If the osdmap is (maliciously) corrupted such that the incremental
osdmap epoch is different from what is expected, there is no need to
BUG. Instead, just declare the incremental osdmap to be invalid.
Cc: stable@vger.kernel.org
Reported-by: ziming zhang <ezrakiez@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 818156caffbf55cb4d368f9c3cac64e458fb49c9 upstream.
Perform an explicit bounds check on payload_len to avoid a possible
out-of-bounds access in the callout.
[ idryomov: changelog ]
Cc: stable@vger.kernel.org
Signed-off-by: ziming zhang <ezrakiez@gmail.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 8c738512714e8c0aa18f8a10c072d5b01c83db39 upstream.
If the osdmap is (maliciously) corrupted such that the encoded length
of ceph_pg_pool envelope is less than what is expected for a particular
encoding version, out-of-bounds reads may ensue because the only bounds
check that is there is based on that length value.
This patch adds explicit bounds checks for each field that is decoded
or skipped.
Cc: stable@vger.kernel.org
Reported-by: ziming zhang <ezrakiez@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Tested-by: ziming zhang <ezrakiez@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
The len field originates from untrusted network packets. Boundary
checks have been added to prevent potential out-of-bounds writes when
decrypting the connection secret or processing service tickets.
[ idryomov: changelog ]
Cc: stable@vger.kernel.org
Signed-off-by: ziming zhang <ezrakiez@gmail.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
OSD indexes come from untrusted network packets. Boundary checks are
added to validate these against map->max_osd.
[ idryomov: drop BUG_ON in ceph_get_primary_affinity(), minor cosmetic
edits ]
Cc: stable@vger.kernel.org
Signed-off-by: ziming zhang <ezrakiez@gmail.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
The crash in process_v2_sparse_read() for fscrypt-encrypted directories
has been reported. Issue takes place for Ceph msgr2 protocol in secure
mode. It can be reproduced by the steps:
sudo mount -t ceph :/ /mnt/cephfs/ -o name=admin,fs=cephfs,ms_mode=secure
(1) mkdir /mnt/cephfs/fscrypt-test-3
(2) cp area_decrypted.tar /mnt/cephfs/fscrypt-test-3
(3) fscrypt encrypt --source=raw_key --key=./my.key /mnt/cephfs/fscrypt-test-3
(4) fscrypt lock /mnt/cephfs/fscrypt-test-3
(5) fscrypt unlock --key=my.key /mnt/cephfs/fscrypt-test-3
(6) cat /mnt/cephfs/fscrypt-test-3/area_decrypted.tar
(7) Issue has been triggered
[ 408.072247] ------------[ cut here ]------------
[ 408.072251] WARNING: CPU: 1 PID: 392 at net/ceph/messenger_v2.c:865
ceph_con_v2_try_read+0x4b39/0x72f0
[ 408.072267] Modules linked in: intel_rapl_msr intel_rapl_common
intel_uncore_frequency_common intel_pmc_core pmt_telemetry pmt_discovery
pmt_class intel_pmc_ssram_telemetry intel_vsec kvm_intel joydev kvm irqbypass
polyval_clmulni ghash_clmulni_intel aesni_intel rapl input_leds psmouse
serio_raw i2c_piix4 vga16fb bochs vgastate i2c_smbus floppy mac_hid qemu_fw_cfg
pata_acpi sch_fq_codel rbd msr parport_pc ppdev lp parport efi_pstore
[ 408.072304] CPU: 1 UID: 0 PID: 392 Comm: kworker/1:3 Not tainted 6.17.0-rc7+
[ 408.072307] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.17.0-5.fc42 04/01/2014
[ 408.072310] Workqueue: ceph-msgr ceph_con_workfn
[ 408.072314] RIP: 0010:ceph_con_v2_try_read+0x4b39/0x72f0
[ 408.072317] Code: c7 c1 20 f0 d4 ae 50 31 d2 48 c7 c6 60 27 d5 ae 48 c7 c7 f8
8e 6f b0 68 60 38 d5 ae e8 00 47 61 fe 48 83 c4 18 e9 ac fc ff ff <0f> 0b e9 06
fe ff ff 4c 8b 9d 98 fd ff ff 0f 84 64 e7 ff ff 89 85
[ 408.072319] RSP: 0018:ffff88811c3e7a30 EFLAGS: 00010246
[ 408.072322] RAX: ffffed1024874c6f RBX: ffffea00042c2b40 RCX: 0000000000000f38
[ 408.072324] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 408.072325] RBP: ffff88811c3e7ca8 R08: 0000000000000000 R09: 00000000000000c8
[ 408.072326] R10: 00000000000000c8 R11: 0000000000000000 R12: 00000000000000c8
[ 408.072327] R13: dffffc0000000000 R14: ffff8881243a6030 R15: 0000000000003000
[ 408.072329] FS: 0000000000000000(0000) GS:ffff88823eadf000(0000)
knlGS:0000000000000000
[ 408.072331] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 408.072332] CR2: 000000c0003c6000 CR3: 000000010c106005 CR4: 0000000000772ef0
[ 408.072336] PKRU: 55555554
[ 408.072337] Call Trace:
[ 408.072338] <TASK>
[ 408.072340] ? sched_clock_noinstr+0x9/0x10
[ 408.072344] ? __pfx_ceph_con_v2_try_read+0x10/0x10
[ 408.072347] ? _raw_spin_unlock+0xe/0x40
[ 408.072349] ? finish_task_switch.isra.0+0x15d/0x830
[ 408.072353] ? __kasan_check_write+0x14/0x30
[ 408.072357] ? mutex_lock+0x84/0xe0
[ 408.072359] ? __pfx_mutex_lock+0x10/0x10
[ 408.072361] ceph_con_workfn+0x27e/0x10e0
[ 408.072364] ? metric_delayed_work+0x311/0x2c50
[ 408.072367] process_one_work+0x611/0xe20
[ 408.072371] ? __kasan_check_write+0x14/0x30
[ 408.072373] worker_thread+0x7e3/0x1580
[ 408.072375] ? __pfx__raw_spin_lock_irqsave+0x10/0x10
[ 408.072378] ? __pfx_worker_thread+0x10/0x10
[ 408.072381] kthread+0x381/0x7a0
[ 408.072383] ? __pfx__raw_spin_lock_irq+0x10/0x10
[ 408.072385] ? __pfx_kthread+0x10/0x10
[ 408.072387] ? __kasan_check_write+0x14/0x30
[ 408.072389] ? recalc_sigpending+0x160/0x220
[ 408.072392] ? _raw_spin_unlock_irq+0xe/0x50
[ 408.072394] ? calculate_sigpending+0x78/0xb0
[ 408.072395] ? __pfx_kthread+0x10/0x10
[ 408.072397] ret_from_fork+0x2b6/0x380
[ 408.072400] ? __pfx_kthread+0x10/0x10
[ 408.072402] ret_from_fork_asm+0x1a/0x30
[ 408.072406] </TASK>
[ 408.072407] ---[ end trace 0000000000000000 ]---
[ 408.072418] Oops: general protection fault, probably for non-canonical
address 0xdffffc0000000000: 0000 [#1] SMP KASAN NOPTI
[ 408.072984] KASAN: null-ptr-deref in range [0x0000000000000000-
0x0000000000000007]
[ 408.073350] CPU: 1 UID: 0 PID: 392 Comm: kworker/1:3 Tainted: G W
6.17.0-rc7+ #1 PREEMPT(voluntary)
[ 408.073886] Tainted: [W]=WARN
[ 408.074042] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.17.0-5.fc42 04/01/2014
[ 408.074468] Workqueue: ceph-msgr ceph_con_workfn
[ 408.074694] RIP: 0010:ceph_msg_data_advance+0x79/0x1a80
[ 408.074976] Code: fc ff df 49 8d 77 08 48 c1 ee 03 80 3c 16 00 0f 85 07 11 00
00 48 ba 00 00 00 00 00 fc ff df 49 8b 5f 08 48 89 de 48 c1 ee 03 <0f> b6 14 16
84 d2 74 09 80 fa 03 0f 8e 0f 0e 00 00 8b 13 83 fa 03
[ 408.075884] RSP: 0018:ffff88811c3e7990 EFLAGS: 00010246
[ 408.076305] RAX: ffff8881243a6388 RBX: 0000000000000000 RCX: 0000000000000000
[ 408.076909] RDX: dffffc0000000000 RSI: 0000000000000000 RDI: ffff8881243a6378
[ 408.077466] RBP: ffff88811c3e7a20 R08: 0000000000000000 R09: 00000000000000c8
[ 408.078034] R10: ffff8881243a6388 R11: 0000000000000000 R12: ffffed1024874c71
[ 408.078575] R13: dffffc0000000000 R14: ffff8881243a6030 R15: ffff8881243a6378
[ 408.079159] FS: 0000000000000000(0000) GS:ffff88823eadf000(0000)
knlGS:0000000000000000
[ 408.079736] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 408.080039] CR2: 000000c0003c6000 CR3: 000000010c106005 CR4: 0000000000772ef0
[ 408.080376] PKRU: 55555554
[ 408.080513] Call Trace:
[ 408.080630] <TASK>
[ 408.080729] ceph_con_v2_try_read+0x49b9/0x72f0
[ 408.081115] ? __pfx_ceph_con_v2_try_read+0x10/0x10
[ 408.081348] ? _raw_spin_unlock+0xe/0x40
[ 408.081538] ? finish_task_switch.isra.0+0x15d/0x830
[ 408.081768] ? __kasan_check_write+0x14/0x30
[ 408.081986] ? mutex_lock+0x84/0xe0
[ 408.082160] ? __pfx_mutex_lock+0x10/0x10
[ 408.082343] ceph_con_workfn+0x27e/0x10e0
[ 408.082529] ? metric_delayed_work+0x311/0x2c50
[ 408.082737] process_one_work+0x611/0xe20
[ 408.082948] ? __kasan_check_write+0x14/0x30
[ 408.083156] worker_thread+0x7e3/0x1580
[ 408.083331] ? __pfx__raw_spin_lock_irqsave+0x10/0x10
[ 408.083557] ? __pfx_worker_thread+0x10/0x10
[ 408.083751] kthread+0x381/0x7a0
[ 408.083922] ? __pfx__raw_spin_lock_irq+0x10/0x10
[ 408.084139] ? __pfx_kthread+0x10/0x10
[ 408.084310] ? __kasan_check_write+0x14/0x30
[ 408.084510] ? recalc_sigpending+0x160/0x220
[ 408.084708] ? _raw_spin_unlock_irq+0xe/0x50
[ 408.084917] ? calculate_sigpending+0x78/0xb0
[ 408.085138] ? __pfx_kthread+0x10/0x10
[ 408.085335] ret_from_fork+0x2b6/0x380
[ 408.085525] ? __pfx_kthread+0x10/0x10
[ 408.085720] ret_from_fork_asm+0x1a/0x30
[ 408.085922] </TASK>
[ 408.086036] Modules linked in: intel_rapl_msr intel_rapl_common
intel_uncore_frequency_common intel_pmc_core pmt_telemetry pmt_discovery
pmt_class intel_pmc_ssram_telemetry intel_vsec kvm_intel joydev kvm irqbypass
polyval_clmulni ghash_clmulni_intel aesni_intel rapl input_leds psmouse
serio_raw i2c_piix4 vga16fb bochs vgastate i2c_smbus floppy mac_hid qemu_fw_cfg
pata_acpi sch_fq_codel rbd msr parport_pc ppdev lp parport efi_pstore
[ 408.087778] ---[ end trace 0000000000000000 ]---
[ 408.088007] RIP: 0010:ceph_msg_data_advance+0x79/0x1a80
[ 408.088260] Code: fc ff df 49 8d 77 08 48 c1 ee 03 80 3c 16 00 0f 85 07 11 00
00 48 ba 00 00 00 00 00 fc ff df 49 8b 5f 08 48 89 de 48 c1 ee 03 <0f> b6 14 16
84 d2 74 09 80 fa 03 0f 8e 0f 0e 00 00 8b 13 83 fa 03
[ 408.089118] RSP: 0018:ffff88811c3e7990 EFLAGS: 00010246
[ 408.089357] RAX: ffff8881243a6388 RBX: 0000000000000000 RCX: 0000000000000000
[ 408.089678] RDX: dffffc0000000000 RSI: 0000000000000000 RDI: ffff8881243a6378
[ 408.090020] RBP: ffff88811c3e7a20 R08: 0000000000000000 R09: 00000000000000c8
[ 408.090360] R10: ffff8881243a6388 R11: 0000000000000000 R12: ffffed1024874c71
[ 408.090687] R13: dffffc0000000000 R14: ffff8881243a6030 R15: ffff8881243a6378
[ 408.091035] FS: 0000000000000000(0000) GS:ffff88823eadf000(0000)
knlGS:0000000000000000
[ 408.091452] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 408.092015] CR2: 000000c0003c6000 CR3: 000000010c106005 CR4: 0000000000772ef0
[ 408.092530] PKRU: 55555554
[ 417.112915]
==================================================================
[ 417.113491] BUG: KASAN: slab-use-after-free in
__mutex_lock.constprop.0+0x1522/0x1610
[ 417.114014] Read of size 4 at addr ffff888124870034 by task kworker/2:0/4951
[ 417.114587] CPU: 2 UID: 0 PID: 4951 Comm: kworker/2:0 Tainted: G D W
6.17.0-rc7+ #1 PREEMPT(voluntary)
[ 417.114592] Tainted: [D]=DIE, [W]=WARN
[ 417.114593] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.17.0-5.fc42 04/01/2014
[ 417.114596] Workqueue: events handle_timeout
[ 417.114601] Call Trace:
[ 417.114602] <TASK>
[ 417.114604] dump_stack_lvl+0x5c/0x90
[ 417.114610] print_report+0x171/0x4dc
[ 417.114613] ? __pfx__raw_spin_lock_irqsave+0x10/0x10
[ 417.114617] ? kasan_complete_mode_report_info+0x80/0x220
[ 417.114621] kasan_report+0xbd/0x100
[ 417.114625] ? __mutex_lock.constprop.0+0x1522/0x1610
[ 417.114628] ? __mutex_lock.constprop.0+0x1522/0x1610
[ 417.114630] __asan_report_load4_noabort+0x14/0x30
[ 417.114633] __mutex_lock.constprop.0+0x1522/0x1610
[ 417.114635] ? queue_con_delay+0x8d/0x200
[ 417.114638] ? __pfx___mutex_lock.constprop.0+0x10/0x10
[ 417.114641] ? __send_subscribe+0x529/0xb20
[ 417.114644] __mutex_lock_slowpath+0x13/0x20
[ 417.114646] mutex_lock+0xd4/0xe0
[ 417.114649] ? __pfx_mutex_lock+0x10/0x10
[ 417.114652] ? ceph_monc_renew_subs+0x2a/0x40
[ 417.114654] ceph_con_keepalive+0x22/0x110
[ 417.114656] handle_timeout+0x6b3/0x11d0
[ 417.114659] ? _raw_spin_unlock_irq+0xe/0x50
[ 417.114662] ? __pfx_handle_timeout+0x10/0x10
[ 417.114664] ? queue_delayed_work_on+0x8e/0xa0
[ 417.114669] process_one_work+0x611/0xe20
[ 417.114672] ? __kasan_check_write+0x14/0x30
[ 417.114676] worker_thread+0x7e3/0x1580
[ 417.114678] ? __pfx__raw_spin_lock_irqsave+0x10/0x10
[ 417.114682] ? __pfx_sched_setscheduler_nocheck+0x10/0x10
[ 417.114687] ? __pfx_worker_thread+0x10/0x10
[ 417.114689] kthread+0x381/0x7a0
[ 417.114692] ? __pfx__raw_spin_lock_irq+0x10/0x10
[ 417.114694] ? __pfx_kthread+0x10/0x10
[ 417.114697] ? __kasan_check_write+0x14/0x30
[ 417.114699] ? recalc_sigpending+0x160/0x220
[ 417.114703] ? _raw_spin_unlock_irq+0xe/0x50
[ 417.114705] ? calculate_sigpending+0x78/0xb0
[ 417.114707] ? __pfx_kthread+0x10/0x10
[ 417.114710] ret_from_fork+0x2b6/0x380
[ 417.114713] ? __pfx_kthread+0x10/0x10
[ 417.114715] ret_from_fork_asm+0x1a/0x30
[ 417.114720] </TASK>
[ 417.125171] Allocated by task 2:
[ 417.125333] kasan_save_stack+0x26/0x60
[ 417.125522] kasan_save_track+0x14/0x40
[ 417.125742] kasan_save_alloc_info+0x39/0x60
[ 417.125945] __kasan_slab_alloc+0x8b/0xb0
[ 417.126133] kmem_cache_alloc_node_noprof+0x13b/0x460
[ 417.126381] copy_process+0x320/0x6250
[ 417.126595] kernel_clone+0xb7/0x840
[ 417.126792] kernel_thread+0xd6/0x120
[ 417.126995] kthreadd+0x85c/0xbe0
[ 417.127176] ret_from_fork+0x2b6/0x380
[ 417.127378] ret_from_fork_asm+0x1a/0x30
[ 417.127692] Freed by task 0:
[ 417.127851] kasan_save_stack+0x26/0x60
[ 417.128057] kasan_save_track+0x14/0x40
[ 417.128267] kasan_save_free_info+0x3b/0x60
[ 417.128491] __kasan_slab_free+0x6c/0xa0
[ 417.128708] kmem_cache_free+0x182/0x550
[ 417.128906] free_task+0xeb/0x140
[ 417.129070] __put_task_struct+0x1d2/0x4f0
[ 417.129259] __put_task_struct_rcu_cb+0x15/0x20
[ 417.129480] rcu_do_batch+0x3d3/0xe70
[ 417.129681] rcu_core+0x549/0xb30
[ 417.129839] rcu_core_si+0xe/0x20
[ 417.130005] handle_softirqs+0x160/0x570
[ 417.130190] __irq_exit_rcu+0x189/0x1e0
[ 417.130369] irq_exit_rcu+0xe/0x20
[ 417.130531] sysvec_apic_timer_interrupt+0x9f/0xd0
[ 417.130768] asm_sysvec_apic_timer_interrupt+0x1b/0x20
[ 417.131082] Last potentially related work creation:
[ 417.131305] kasan_save_stack+0x26/0x60
[ 417.131484] kasan_record_aux_stack+0xae/0xd0
[ 417.131695] __call_rcu_common+0xcd/0x14b0
[ 417.131909] call_rcu+0x31/0x50
[ 417.132071] delayed_put_task_struct+0x128/0x190
[ 417.132295] rcu_do_batch+0x3d3/0xe70
[ 417.132478] rcu_core+0x549/0xb30
[ 417.132658] rcu_core_si+0xe/0x20
[ 417.132808] handle_softirqs+0x160/0x570
[ 417.132993] __irq_exit_rcu+0x189/0x1e0
[ 417.133181] irq_exit_rcu+0xe/0x20
[ 417.133353] sysvec_apic_timer_interrupt+0x9f/0xd0
[ 417.133584] asm_sysvec_apic_timer_interrupt+0x1b/0x20
[ 417.133921] Second to last potentially related work creation:
[ 417.134183] kasan_save_stack+0x26/0x60
[ 417.134362] kasan_record_aux_stack+0xae/0xd0
[ 417.134566] __call_rcu_common+0xcd/0x14b0
[ 417.134782] call_rcu+0x31/0x50
[ 417.134929] put_task_struct_rcu_user+0x58/0xb0
[ 417.135143] finish_task_switch.isra.0+0x5d3/0x830
[ 417.135366] __schedule+0xd30/0x5100
[ 417.135534] schedule_idle+0x5a/0x90
[ 417.135712] do_idle+0x25f/0x410
[ 417.135871] cpu_startup_entry+0x53/0x70
[ 417.136053] start_secondary+0x216/0x2c0
[ 417.136233] common_startup_64+0x13e/0x141
[ 417.136894] The buggy address belongs to the object at ffff888124870000
which belongs to the cache task_struct of size 10504
[ 417.138122] The buggy address is located 52 bytes inside of
freed 10504-byte region [ffff888124870000, ffff888124872908)
[ 417.139465] The buggy address belongs to the physical page:
[ 417.140016] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0
pfn:0x124870
[ 417.140789] head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0
pincount:0
[ 417.141519] memcg:ffff88811aa20e01
[ 417.141874] anon flags:
0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff)
[ 417.142600] page_type: f5(slab)
[ 417.142922] raw: 0017ffffc0000040 ffff88810094f040 0000000000000000
dead000000000001
[ 417.143554] raw: 0000000000000000 0000000000030003 00000000f5000000
ffff88811aa20e01
[ 417.143954] head: 0017ffffc0000040 ffff88810094f040 0000000000000000
dead000000000001
[ 417.144329] head: 0000000000000000 0000000000030003 00000000f5000000
ffff88811aa20e01
[ 417.144710] head: 0017ffffc0000003 ffffea0004921c01 00000000ffffffff
00000000ffffffff
[ 417.145106] head: ffffffffffffffff 0000000000000000 00000000ffffffff
0000000000000008
[ 417.145485] page dumped because: kasan: bad access detected
[ 417.145859] Memory state around the buggy address:
[ 417.146094] ffff88812486ff00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
fc
[ 417.146439] ffff88812486ff80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
fc
[ 417.146791] >ffff888124870000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb
fb
[ 417.147145] ^
[ 417.147387] ffff888124870080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
fb
[ 417.147751] ffff888124870100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
fb
[ 417.148123]
==================================================================
First of all, we have warning in get_bvec_at() because
cursor->total_resid contains zero value. And, finally,
we have crash in ceph_msg_data_advance() because
cursor->data is NULL. It means that get_bvec_at()
receives not initialized ceph_msg_data_cursor structure
because data is NULL and total_resid contains zero.
Moreover, we don't have likewise issue for the case of
Ceph msgr1 protocol because ceph_msg_data_cursor_init()
has been called before reading sparse data.
This patch adds calling of ceph_msg_data_cursor_init()
in the beginning of process_v2_sparse_read() with
the goal to guarantee that logic of reading sparse data
works correctly for the case of Ceph msgr2 protocol.
Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/73152
Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
With the previous commit revamping the timeout handling, started isn't
used anymore. It could be taken into account by adjusting the initial
value of the timeout, but there is little point as both callers capture
the timestamp shortly before calling __ceph_open_session() -- the only
thing of note that happens in the interim is taking client->mount_mutex
and that isn't expected to take multiple seconds.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
|
|
The wait loop in __ceph_open_session() can race with the client
receiving a new monmap or osdmap shortly after the initial map is
received. Both ceph_monc_handle_map() and handle_one_map() install
a new map immediately after freeing the old one
kfree(monc->monmap);
monc->monmap = monmap;
ceph_osdmap_destroy(osdc->osdmap);
osdc->osdmap = newmap;
under client->monc.mutex and client->osdc.lock respectively, but
because neither is taken in have_mon_and_osd_map() it's possible for
client->monc.monmap->epoch and client->osdc.osdmap->epoch arms in
client->monc.monmap && client->monc.monmap->epoch &&
client->osdc.osdmap && client->osdc.osdmap->epoch;
condition to dereference an already freed map. This happens to be
reproducible with generic/395 and generic/397 with KASAN enabled:
BUG: KASAN: slab-use-after-free in have_mon_and_osd_map+0x56/0x70
Read of size 4 at addr ffff88811012d810 by task mount.ceph/13305
CPU: 2 UID: 0 PID: 13305 Comm: mount.ceph Not tainted 6.14.0-rc2-build2+ #1266
...
Call Trace:
<TASK>
have_mon_and_osd_map+0x56/0x70
ceph_open_session+0x182/0x290
ceph_get_tree+0x333/0x680
vfs_get_tree+0x49/0x180
do_new_mount+0x1a3/0x2d0
path_mount+0x6dd/0x730
do_mount+0x99/0xe0
__do_sys_mount+0x141/0x180
do_syscall_64+0x9f/0x100
entry_SYSCALL_64_after_hwframe+0x76/0x7e
</TASK>
Allocated by task 13305:
ceph_osdmap_alloc+0x16/0x130
ceph_osdc_init+0x27a/0x4c0
ceph_create_client+0x153/0x190
create_fs_client+0x50/0x2a0
ceph_get_tree+0xff/0x680
vfs_get_tree+0x49/0x180
do_new_mount+0x1a3/0x2d0
path_mount+0x6dd/0x730
do_mount+0x99/0xe0
__do_sys_mount+0x141/0x180
do_syscall_64+0x9f/0x100
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Freed by task 9475:
kfree+0x212/0x290
handle_one_map+0x23c/0x3b0
ceph_osdc_handle_map+0x3c9/0x590
mon_dispatch+0x655/0x6f0
ceph_con_process_message+0xc3/0xe0
ceph_con_v1_try_read+0x614/0x760
ceph_con_workfn+0x2de/0x650
process_one_work+0x486/0x7c0
process_scheduled_works+0x73/0x90
worker_thread+0x1c8/0x2a0
kthread+0x2ec/0x300
ret_from_fork+0x24/0x40
ret_from_fork_asm+0x1a/0x30
Rewrite the wait loop to check the above condition directly with
client->monc.mutex and client->osdc.lock taken as appropriate. While
at it, improve the timeout handling (previously mount_timeout could be
exceeded in case wait_event_interruptible_timeout() slept more than
once) and access client->auth_err under client->monc.mutex to match
how it's set in finish_auth().
monmap_show() and osdmap_show() now take the respective lock before
accessing the map as well.
Cc: stable@vger.kernel.org
Reported-by: David Howells <dhowells@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
|
|
Pull ceph updates from Ilya Dryomov:
- some messenger improvements (Eric and Max)
- address an issue (also affected userspace) of incorrect permissions
being granted to users who have access to multiple different CephFS
instances within the same cluster (Kotresh)
- a bunch of assorted CephFS fixes (Slava)
* tag 'ceph-for-6.18-rc1' of https://github.com/ceph/ceph-client:
ceph: add bug tracking system info to MAINTAINERS
ceph: fix multifs mds auth caps issue
ceph: cleanup in ceph_alloc_readdir_reply_buffer()
ceph: fix potential NULL dereference issue in ceph_fill_trace()
libceph: add empty check to ceph_con_get_out_msg()
libceph: pass the message pointer instead of loading con->out_msg
libceph: make ceph_con_get_out_msg() return the message pointer
ceph: fix potential race condition on operations with CEPH_I_ODIRECT flag
ceph: refactor wake_up_bit() pattern of calling
ceph: fix potential race condition in ceph_ioctl_lazyio()
ceph: fix overflowed constant issue in ceph_do_objects_copy()
ceph: fix wrong sizeof argument issue in register_session()
ceph: add checking of wait_for_completion_killable() return value
ceph: make ceph_start_io_*() killable
libceph: Use HMAC-SHA256 library instead of crypto_shash
|
|
This moves the list_empty() checks from the two callers (v1 and v2)
into the base messenger.c library. Now the v1/v2 specializations do
not need to know about con->out_queue; that implementation detail is
now hidden behind the ceph_con_get_out_msg() function.
[ idryomov: instead of changing prepare_write_message() to return
a bool, move ceph_con_get_out_msg() call out to arrive to the same
pattern as in messenger_v2.c ]
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
This pointer is in a register anyway, so let's use that instead of
reloading from memory everywhere.
[ idryomov: formatting ]
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
The caller in messenger_v1.c loads it anyway, so let's keep the
pointer in the register instead of reloading it from memory. This
eliminates a tiny bit of unnecessary overhead.
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
Use the HMAC-SHA256 library functions instead of crypto_shash. This is
simpler and faster.
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
Currently if a user enqueue a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.
This lack of consistentcy cannot be addressed without refactoring the API.
alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.
This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.
This change adds a new WQ_PERCPU flag at the network subsystem, to explicitly
request the use of the per-CPU behavior. Both flags coexist for one release
cycle to allow callers to transition their calls.
Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.
With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.
All existing users have been updated accordingly.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20250918142427.309519-4-marco.crivellari@suse.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Currently if a user enqueue a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.
This lack of consistentcy cannot be addressed without refactoring the API.
system_unbound_wq should be the default workqueue so as not to enforce
locality constraints for random work whenever it's not required.
Adding system_dfl_wq to encourage its use when unbound work should be used.
The old system_unbound_wq will be kept for a few release cycles.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20250918142427.309519-3-marco.crivellari@suse.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
There is a place where generic code in messenger.c is reading and
another place where it is writing to con->v1 union member without
checking that the union member is active (i.e. msgr1 is in use).
On 64-bit systems, con->v1.auth_retry overlaps with con->v2.out_iter,
so such a read is almost guaranteed to return a bogus value instead of
0 when msgr2 is in use. This ends up being fairly benign because the
side effect is just the invalidation of the authorizer and successive
fetching of new tickets.
con->v1.connect_seq overlaps with con->v2.conn_bufs and the fact that
it's being written to can cause more serious consequences, but luckily
it's not something that happens often.
Cc: stable@vger.kernel.org
Fixes: cd1a677cad99 ("libceph, ceph: implement msgr2.1 protocol (crc and secure modes)")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
|
|
Rename hmac_sha256() to ceph_hmac_sha256(), to avoid a naming conflict
with the upcoming hmac_sha256() library function.
This code will be able to use the HMAC-SHA256 library, but that's left
for a later commit.
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250630160645.3198-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
|
|
Pull ceph fixes from Ilya Dryomov:
"A small CephFS encryption-related fix and a dead code cleanup"
* tag 'ceph-for-6.15-rc4' of https://github.com/ceph/ceph-client:
ceph: Fix incorrect flush end position calculation
ceph: Remove osd_client deadcode
|
|
Now that LIBCRC32C does nothing besides select CRC32, make every option
that selects LIBCRC32C instead select CRC32 directly. Then remove
LIBCRC32C.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250401221600.24878-8-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
osd_req_op_extent_osd_data_pagelist() was added in 2013 as part of
commit a4ce40a9a7c1 ("libceph: combine initializing and setting osd data")
but never used.
The last use of osd_req_op_cls_request_data_pagelist() was removed in
2017's commit ecd4a68a26a2 ("rbd: switch rbd_obj_method_sync() to
ceph_osdc_call()")
Remove them.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
If mounted with sparseread option, ceph_direct_read_write() ends up
making an unnecessarily allocation for O_DIRECT writes.
Fixes: 03bc06c7b0bd ("ceph: add new mount option to enable sparse reads")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Markuze <amarkuze@redhat.com>
|
|
ceph_crypto_key_encode() was added in 2010's commit
8b6e4f2d8b21 ("ceph: aes crypto and base64 encode/decode helpers")
but has remained unused (the decode is used).
Remove it.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
ceph_osdc_watch_check() has been unused since it was added in commit
b07d3c4bd727 ("libceph: support for checking on status of watch")
Remove it.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
ceph_copy_user_to_page_vector() has been unused since 2013's commit
e8344e668915 ("ceph: Implement writev/pwritev for sync operation.")
ceph_copy_to_page_vector() has been unused since 2012's commit
913d2fdcf605 ("rbd: always pass ops array to rbd_req_sync_op()")
Remove them.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
ceph_pagelist_truncate() and ceph_pagelist_set_cursor() have been unused
since commit
39be95e9c8c0 ("ceph: ceph_pagelist_append might sleep while atomic")
Remove them.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
When resolving name in ceph_dns_resolve_name(), the end address of name
is determined by the minimum value of delim_p and colon_p. So using min()
here is more in line with the context.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
Currently, when built with "make W=1", the following warnings are
generated:
net/ceph/crush/mapper.c:466: warning: Function parameter or struct member 'work' not described in 'crush_choose_firstn'
net/ceph/crush/mapper.c:466: warning: Function parameter or struct member 'weight' not described in 'crush_choose_firstn'
net/ceph/crush/mapper.c:466: warning: Function parameter or struct member 'weight_max' not described in 'crush_choose_firstn'
net/ceph/crush/mapper.c:466: warning: Function parameter or struct member 'choose_args' not described in 'crush_choose_firstn'
Update the crush_choose_firstn() kernel-doc to document these
parameters.
Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
Currently, when built with "make W=1", the following warnings are
generated:
net/ceph/crush/mapper.c:655: warning: Function parameter or struct member 'map' not described in 'crush_choose_indep'
net/ceph/crush/mapper.c:655: warning: Function parameter or struct member 'work' not described in 'crush_choose_indep'
net/ceph/crush/mapper.c:655: warning: Function parameter or struct member 'bucket' not described in 'crush_choose_indep'
net/ceph/crush/mapper.c:655: warning: Function parameter or struct member 'weight' not described in 'crush_choose_indep'
net/ceph/crush/mapper.c:655: warning: Function parameter or struct member 'weight_max' not described in 'crush_choose_indep'
net/ceph/crush/mapper.c:655: warning: Function parameter or struct member 'x' not described in 'crush_choose_indep'
net/ceph/crush/mapper.c:655: warning: Function parameter or struct member 'left' not described in 'crush_choose_indep'
net/ceph/crush/mapper.c:655: warning: Function parameter or struct member 'numrep' not described in 'crush_choose_indep'
net/ceph/crush/mapper.c:655: warning: Function parameter or struct member 'type' not described in 'crush_choose_indep'
net/ceph/crush/mapper.c:655: warning: Function parameter or struct member 'out' not described in 'crush_choose_indep'
net/ceph/crush/mapper.c:655: warning: Function parameter or struct member 'outpos' not described in 'crush_choose_indep'
net/ceph/crush/mapper.c:655: warning: Function parameter or struct member 'tries' not described in 'crush_choose_indep'
net/ceph/crush/mapper.c:655: warning: Function parameter or struct member 'recurse_tries' not described in 'crush_choose_indep'
net/ceph/crush/mapper.c:655: warning: Function parameter or struct member 'recurse_to_leaf' not described in 'crush_choose_indep'
net/ceph/crush/mapper.c:655: warning: Function parameter or struct member 'out2' not described in 'crush_choose_indep'
net/ceph/crush/mapper.c:655: warning: Function parameter or struct member 'parent_r' not described in 'crush_choose_indep'
net/ceph/crush/mapper.c:655: warning: Function parameter or struct member 'choose_args' not described in 'crush_choose_indep'
These warnings are generated because the prologue comment for
crush_choose_indep() uses the kernel-doc prefix, but the actual
comment is a very brief description that is not in kernel-doc
format. Since this is a static function there is no need to fully
document the function, so replace the kernel-doc comment prefix with a
standard comment prefix to remove these warnings.
Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
The way the delayed work is handled in ceph_monc_stop() is prone to
races with mon_fault() and possibly also finish_hunting(). Both of
these can requeue the delayed work which wouldn't be canceled by any of
the following code in case that happens after cancel_delayed_work_sync()
runs -- __close_session() doesn't mess with the delayed work in order
to avoid interfering with the hunting interval logic. This part was
missed in commit b5d91704f53e ("libceph: behave in mon_fault() if
cur_mon < 0") and use-after-free can still ensue on monc and objects
that hang off of it, with monc->auth and monc->monmap being
particularly susceptible to quickly being reused.
To fix this:
- clear monc->cur_mon and monc->hunting as part of closing the session
in ceph_monc_stop()
- bail from delayed_work() if monc->cur_mon is cleared, similar to how
it's done in mon_fault() and finish_hunting() (based on monc->hunting)
- call cancel_delayed_work_sync() after the session is closed
Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/66857
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
|
|
The cursor is no longer initialized in the OSD client, causing the
sparse read state machine to fall into an infinite loop. The cursor
should be initialized in IN_S_PREPARE_SPARSE_DATA state.
[ idryomov: use msg instead of con->in_msg, changelog ]
Link: https://tracker.ceph.com/issues/64607
Fixes: 8e46a2d068c9 ("libceph: just wait for more data to be available on the socket")
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Tested-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
A short read may occur while reading the message footer from the
socket. Later, when the socket is ready for another read, the
messenger invokes all read_partial_*() handlers, including
read_partial_sparse_msg_data(). The expectation is that
read_partial_sparse_msg_data() would bail, allowing the messenger to
invoke read_partial() for the footer and pick up where it left off.
However read_partial_sparse_msg_data() violates that and ends up
calling into the state machine in the OSD client. The sparse-read
state machine assumes that it's a new op and interprets some piece of
the footer as the sparse-read header and returns bogus extents/data
length, etc.
To determine whether read_partial_sparse_msg_data() should bail, let's
reuse cursor->total_resid. Because once it reaches to zero that means
all the extents and data have been successfully received in last read,
else it could break out when partially reading any of the extents and
data. And then osd_sparse_read() could continue where it left off.
[ idryomov: changelog ]
Link: https://tracker.ceph.com/issues/63586
Fixes: d396f89db39a ("libceph: add sparse read support to msgr1")
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
These functions are supposed to behave like other read_partial_*()
handlers: the contract with messenger v1 is that the handler bails if
the area of the message it's responsible for is already processed.
This comes up when handling short reads from the socket.
[ idryomov: changelog ]
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
Once this happens that means there have bugs.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
There is no any limit for the extent array size and it's possible
that when reading with a large size contents the total number of
extents will exceed 4096. Then the messager will fail by reseting
the connection and keeps resending the inflight IOs infinitely.
[ idryomov: adjust error message ]
Link: https://tracker.ceph.com/issues/62081
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
"API:
- Add virtual-address based lskcipher interface
- Optimise ahash/shash performance in light of costly indirect calls
- Remove ahash alignmask attribute
Algorithms:
- Improve AES/XTS performance of 6-way unrolling for ppc
- Remove some uses of obsolete algorithms (md4, md5, sha1)
- Add FIPS 202 SHA-3 support in pkcs1pad
- Add fast path for single-page messages in adiantum
- Remove zlib-deflate
Drivers:
- Add support for S4 in meson RNG driver
- Add STM32MP13x support in stm32
- Add hwrng interface support in qcom-rng
- Add support for deflate algorithm in hisilicon/zip"
* tag 'v6.7-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (283 commits)
crypto: adiantum - flush destination page before unmapping
crypto: testmgr - move pkcs1pad(rsa,sha3-*) to correct place
Documentation/module-signing.txt: bring up to date
module: enable automatic module signing with FIPS 202 SHA-3
crypto: asymmetric_keys - allow FIPS 202 SHA-3 signatures
crypto: rsa-pkcs1pad - Add FIPS 202 SHA-3 support
crypto: FIPS 202 SHA-3 register in hash info for IMA
x509: Add OIDs for FIPS 202 SHA-3 hash and signatures
crypto: ahash - optimize performance when wrapping shash
crypto: ahash - check for shash type instead of not ahash type
crypto: hash - move "ahash wrapping shash" functions to ahash.c
crypto: talitos - stop using crypto_ahash::init
crypto: chelsio - stop using crypto_ahash::init
crypto: ahash - improve file comment
crypto: ahash - remove struct ahash_request_priv
crypto: ahash - remove crypto_ahash_alignmask
crypto: gcm - stop using alignmask of ahash
crypto: chacha20poly1305 - stop using alignmask of ahash
crypto: ccm - stop using alignmask of ahash
net: ipv6: stop checking crypto_ahash_alignmask
...
|
|
Now that the shash algorithm type does not support nonzero alignmasks,
crypto_shash_alignmask() always returns 0 and will be removed. In
preparation for this, stop checking crypto_shash_alignmask() in
net/ceph/messenger_v2.c.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
Cross-merge networking fixes after downstream PR.
net/mac80211/key.c
02e0e426a2fb ("wifi: mac80211: fix error path key leak")
2a8b665e6bcc ("wifi: mac80211: remove key_mtx")
7d6904bf26b9 ("Merge wireless into wireless-next")
https://lore.kernel.org/all/20231012113648.46eea5ec@canb.auug.org.au/
Adjacent changes:
drivers/net/ethernet/ti/Kconfig
a602ee3176a8 ("net: ethernet: ti: Fix mixed module-builtin object")
98bdeae9502b ("net: cpmac: remove driver to prepare for platform removal")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Direct calls to ops->connect() can overwrite the address parameter when
used in conjunction with BPF SOCK_ADDR hooks. Recent changes to
kernel_connect() ensure that callers are insulated from such side
effects. This patch wraps the direct call to ops->connect() with
kernel_connect() to prevent unexpected changes to the address passed to
ceph_tcp_connect().
This change was originally part of a larger patch targeting the net tree
addressing all instances of unprotected calls to ops->connect()
throughout the kernel, but this change was split up into several patches
targeting various trees.
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/netdev/20230821100007.559638-1-jrife@google.com/
Link: https://lore.kernel.org/netdev/9944248dba1bce861375fcce9de663934d933ba9.camel@redhat.com/
Fixes: d74bad4e74ee ("bpf: Hooks for sys_connect")
Signed-off-by: Jordan Rife <jrife@google.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS
(for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).
As found with Coccinelle[1], add __counted_by for struct ceph_monmap.
Additionally, since the element count member must be set before accessing
the annotated flexible array member, move its initialization earlier.
[1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Xiubo Li <xiubli@redhat.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: ceph-devel@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The header file crypto/algapi.h is for internal use only. Use the
header file crypto/utils.h instead.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
Currently we have some special-casing for multi-op writes, but in the
case of a read, we can't really handle it. All of the current multi-op
callers call it with CEPH_OSD_FLAG_WRITE set.
Have ceph_osdc_new_request check for CEPH_OSD_FLAG_READ and if it's set,
allocate multiple reply ops instead of multiple request ops. If neither
flag is set, return -EINVAL.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
...and record the user_version in the reply in a new field in
ceph_osd_request, so we can populate the assert_ver appropriately.
Shuffle the fields a bit too so that the new field fits in an
existing hole on x86_64.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
Add an iov_iter to the unions in ceph_msg_data and ceph_msg_data_cursor.
Instead of requiring a list of pages or bvecs, we can just use an
iov_iter directly, and avoid extra allocations.
We assume that the pages represented by the iter are pinned such that
they shouldn't incur page faults, which is the case for the iov_iters
created by netfs.
While working on this, Al Viro informed me that he was going to change
iov_iter_get_pages to auto-advance the iterator as that pattern is more
or less required for ITER_PIPE anyway. We emulate that here for now by
advancing in the _next op and tracking that amount in the "lastlen"
field.
In the event that _next is called twice without an intervening
_advance, we revert the iov_iter by the remaining lastlen before
calling iov_iter_get_pages.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
Have get_reply check for the presence of sparse read ops in the
request and set the sparse_read boolean in the msg. That will queue the
messenger layer to use the sparse read codepath instead of the normal
data receive.
Add a new sparse_read operation for the OSD client, driven by its own
state machine. The messenger will repeatedly call the sparse_read
operation, and it will pass back the necessary info to set up to read
the next extent of data, while zero-filling the sparse regions.
The state machine will stop at the end of the last extent, and will
attach the extent map buffer to the ceph_osd_req_op so that the caller
can use it.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
Add 2 new fields to ceph_connection_v1_info to track the necessary info
in sparse reads. Skip initializing the cursor for a sparse read.
Break out read_partial_message_section into a wrapper around a new
read_partial_message_chunk function that doesn't zero out the crc first.
Add new helper functions to drive receiving into the destinations
provided by the sparse_read state machine.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
Add a new init_sgs_pages helper that populates the scatterlist from
an arbitrary point in an array of pages.
Change setup_message_sgs to take an optional pointer to an array of
pages. If that's set, then the scatterlist will be set using that
array instead of the cursor.
When given a sparse read on a secure connection, decrypt the data
in-place rather than into the final destination, by passing it the
in_enc_pages array.
After decrypting, run the sparse_read state machine in a loop, copying
data from the decrypted pages until it's complete.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|