summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2022-04-26Merge tag 'for-5.18/fbdev-2' of ↵Linus Torvalds29-45/+60
git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev Pull fbdev fixes and updates from Helge Deller: "A bunch of outstanding fbdev patches - all trivial and small" * tag 'for-5.18/fbdev-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev: video: fbdev: clps711x-fb: Use syscon_regmap_lookup_by_phandle video: fbdev: mmp: replace usage of found with dedicated list iterator variable video: fbdev: sh_mobile_lcdcfb: Remove sh_mobile_lcdc_check_var() declaration video: fbdev: i740fb: Error out if 'pixclock' equals zero video: fbdev: i740fb: use memset_io() to clear screen video: fbdev: s3fb: Error out if 'pixclock' equals zero video: fbdev: arkfb: Error out if 'pixclock' equals zero video: fbdev: tridentfb: Error out if 'pixclock' equals zero video: fbdev: vt8623fb: Error out if 'pixclock' equals zero video: fbdev: kyro: Error out if 'lineclock' equals zero video: fbdev: neofb: Fix the check of 'var->pixclock' video: fbdev: imxfb: Fix missing of_node_put in imxfb_probe video: fbdev: omap: Make it CCF clk API compatible video: fbdev: aty/matrox/...: Prepare cleanup of powerpc's asm/prom.h video: fbdev: pm2fb: Fix a kernel-doc formatting issue linux/fb.h: Spelling s/palette/palette/ video: fbdev: sis: fix potential NULL dereference in sisfb_post_sis300() video: fbdev: pxafb: use if else instead video: fbdev: udlfb: properly check endpoint type video: fbdev: of: display_timing: Remove a redundant zeroing of memory
2022-04-26Merge tag 'gfs2-v5.18-rc4-fix' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull gfs2 fix from Andreas Gruenbacher: - Only re-check for direct I/O writes past the end of the file after re-acquiring the inode glock. * tag 'gfs2-v5.18-rc4-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: gfs2: Don't re-check for write past EOF unnecessarily
2022-04-26Bluetooth: hci_sync: Cleanup hci_conn if it cannot be abortedLuiz Augusto von Dentz4-19/+39
This attempts to cleanup the hci_conn if it cannot be aborted as otherwise it would likely result in having the controller and host stack out of sync with respect to connection handle. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2022-04-26Merge tag 'for-5.18-rc4-tag' of ↵Linus Torvalds9-29/+76
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - direct IO fixes: - restore passing file offset to correctly calculate checksums when repairing on read and bio split happens - use correct bio when sumitting IO on zoned filesystem - zoned mode fixes: - fix selection of device to correctly calculate device capabilities when allocating a new bio - use a dedicated lock for exclusion during relocation - fix leaked plug after failure syncing log - fix assertion during scrub and relocation * tag 'for-5.18-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: zoned: use dedicated lock for data relocation btrfs: fix assertion failure during scrub due to block group reallocation btrfs: fix direct I/O writes for split bios on zoned devices btrfs: fix direct I/O read repair for split bios btrfs: fix and document the zoned device choice in alloc_new_bio btrfs: fix leaked plug after failure syncing log on zoned filesystems
2022-04-26Bluetooth: hci_event: Fix creating hci_conn object on error statusLuiz Augusto von Dentz1-0/+12
It is useless to create a hci_conn object if on error status as the result would be it being freed in the process and anyway it is likely the result of controller and host stack being out of sync. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2022-04-26Bluetooth: hci_event: Fix checking for invalid handle on error statusLuiz Augusto von Dentz2-29/+37
Commit d5ebaa7c5f6f6 introduces checks for handle range (e.g HCI_CONN_HANDLE_MAX) but controllers like Intel AX200 don't seem to respect the valid range int case of error status: > HCI Event: Connect Complete (0x03) plen 11 Status: Page Timeout (0x04) Handle: 65535 Address: 94:DB:56:XX:XX:XX (Sony Home Entertainment& Sound Products Inc) Link type: ACL (0x01) Encryption: Disabled (0x00) [1644965.827560] Bluetooth: hci0: Ignoring HCI_Connection_Complete for invalid handle Because of it is impossible to cleanup the connections properly since the stack would attempt to cancel the connection which is no longer in progress causing the following trace: < HCI Command: Create Connection Cancel (0x01|0x0008) plen 6 Address: 94:DB:56:XX:XX:XX (Sony Home Entertainment& Sound Products Inc) = bluetoothd: src/profile.c:record_cb() Unable to get Hands-Free Voice gateway SDP record: Connection timed out > HCI Event: Command Complete (0x0e) plen 10 Create Connection Cancel (0x01|0x0008) ncmd 1 Status: Unknown Connection Identifier (0x02) Address: 94:DB:56:XX:XX:XX (Sony Home Entertainment& Sound Products Inc) < HCI Command: Create Connection Cancel (0x01|0x0008) plen 6 Address: 94:DB:56:XX:XX:XX (Sony Home Entertainment& Sound Products Inc) Fixes: d5ebaa7c5f6f6 ("Bluetooth: hci_event: Ignore multiple conn complete events") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2022-04-26ice: fix use-after-free when deinitializing mailbox snapshotJacob Keller1-1/+1
During ice_sriov_configure, if num_vfs is 0, we are being asked by the kernel to remove all VFs. The driver first de-initializes the snapshot before freeing all the VFs. This results in a use-after-free BUG detected by KASAN. The bug occurs because the snapshot can still be accessed until all VFs are removed. Fix this by freeing all the VFs first before calling ice_mbx_deinit_snapshot. [ +0.032591] ================================================================== [ +0.000021] BUG: KASAN: use-after-free in ice_mbx_vf_state_handler+0x1c3/0x410 [ice] [ +0.000315] Write of size 28 at addr ffff889908eb6f28 by task kworker/55:2/1530996 [ +0.000029] CPU: 55 PID: 1530996 Comm: kworker/55:2 Kdump: loaded Tainted: G S I 5.17.0-dirty #1 [ +0.000022] Hardware name: Dell Inc. PowerEdge R740/0923K0, BIOS 1.6.13 12/17/2018 [ +0.000013] Workqueue: ice ice_service_task [ice] [ +0.000279] Call Trace: [ +0.000012] <TASK> [ +0.000011] dump_stack_lvl+0x33/0x42 [ +0.000030] print_report.cold.13+0xb2/0x6b3 [ +0.000028] ? ice_mbx_vf_state_handler+0x1c3/0x410 [ice] [ +0.000295] kasan_report+0xa5/0x120 [ +0.000026] ? __switch_to_asm+0x21/0x70 [ +0.000024] ? ice_mbx_vf_state_handler+0x1c3/0x410 [ice] [ +0.000298] kasan_check_range+0x183/0x1e0 [ +0.000019] memset+0x1f/0x40 [ +0.000018] ice_mbx_vf_state_handler+0x1c3/0x410 [ice] [ +0.000304] ? ice_conv_link_speed_to_virtchnl+0x160/0x160 [ice] [ +0.000297] ? ice_vsi_dis_spoofchk+0x40/0x40 [ice] [ +0.000305] ice_is_malicious_vf+0x1aa/0x250 [ice] [ +0.000303] ? ice_restore_all_vfs_msi_state+0x160/0x160 [ice] [ +0.000297] ? __mutex_unlock_slowpath.isra.15+0x410/0x410 [ +0.000022] ? ice_debug_cq+0xb7/0x230 [ice] [ +0.000273] ? __kasan_slab_alloc+0x2f/0x90 [ +0.000022] ? memset+0x1f/0x40 [ +0.000017] ? do_raw_spin_lock+0x119/0x1d0 [ +0.000022] ? rwlock_bug.part.2+0x60/0x60 [ +0.000024] __ice_clean_ctrlq+0x3a6/0xd60 [ice] [ +0.000273] ? newidle_balance+0x5b1/0x700 [ +0.000026] ? ice_print_link_msg+0x2f0/0x2f0 [ice] [ +0.000271] ? update_cfs_group+0x1b/0x140 [ +0.000018] ? load_balance+0x1260/0x1260 [ +0.000022] ? ice_process_vflr_event+0x27/0x130 [ice] [ +0.000301] ice_service_task+0x136e/0x1470 [ice] [ +0.000281] process_one_work+0x3b4/0x6c0 [ +0.000030] worker_thread+0x65/0x660 [ +0.000023] ? __kthread_parkme+0xe4/0x100 [ +0.000021] ? process_one_work+0x6c0/0x6c0 [ +0.000020] kthread+0x179/0x1b0 [ +0.000018] ? kthread_complete_and_exit+0x20/0x20 [ +0.000022] ret_from_fork+0x22/0x30 [ +0.000026] </TASK> [ +0.000018] Allocated by task 10742: [ +0.000013] kasan_save_stack+0x1c/0x40 [ +0.000018] __kasan_kmalloc+0x84/0xa0 [ +0.000016] kmem_cache_alloc_trace+0x16c/0x2e0 [ +0.000015] intel_iommu_probe_device+0xeb/0x860 [ +0.000015] __iommu_probe_device+0x9a/0x2f0 [ +0.000016] iommu_probe_device+0x43/0x270 [ +0.000015] iommu_bus_notifier+0xa7/0xd0 [ +0.000015] blocking_notifier_call_chain+0x90/0xc0 [ +0.000017] device_add+0x5f3/0xd70 [ +0.000014] pci_device_add+0x404/0xa40 [ +0.000015] pci_iov_add_virtfn+0x3b0/0x550 [ +0.000016] sriov_enable+0x3bb/0x600 [ +0.000013] ice_ena_vfs+0x113/0xa79 [ice] [ +0.000293] ice_sriov_configure.cold.17+0x21/0xe0 [ice] [ +0.000291] sriov_numvfs_store+0x160/0x200 [ +0.000015] kernfs_fop_write_iter+0x1db/0x270 [ +0.000018] new_sync_write+0x21d/0x330 [ +0.000013] vfs_write+0x376/0x410 [ +0.000013] ksys_write+0xba/0x150 [ +0.000012] do_syscall_64+0x3a/0x80 [ +0.000012] entry_SYSCALL_64_after_hwframe+0x44/0xae [ +0.000028] Freed by task 10742: [ +0.000011] kasan_save_stack+0x1c/0x40 [ +0.000015] kasan_set_track+0x21/0x30 [ +0.000016] kasan_set_free_info+0x20/0x30 [ +0.000012] __kasan_slab_free+0x104/0x170 [ +0.000016] kfree+0x9b/0x470 [ +0.000013] devres_destroy+0x1c/0x20 [ +0.000015] devm_kfree+0x33/0x40 [ +0.000012] ice_mbx_deinit_snapshot+0x39/0x70 [ice] [ +0.000295] ice_sriov_configure+0xb0/0x260 [ice] [ +0.000295] sriov_numvfs_store+0x1bc/0x200 [ +0.000015] kernfs_fop_write_iter+0x1db/0x270 [ +0.000016] new_sync_write+0x21d/0x330 [ +0.000012] vfs_write+0x376/0x410 [ +0.000012] ksys_write+0xba/0x150 [ +0.000012] do_syscall_64+0x3a/0x80 [ +0.000012] entry_SYSCALL_64_after_hwframe+0x44/0xae [ +0.000024] Last potentially related work creation: [ +0.000010] kasan_save_stack+0x1c/0x40 [ +0.000016] __kasan_record_aux_stack+0x98/0xa0 [ +0.000013] insert_work+0x34/0x160 [ +0.000015] __queue_work+0x20e/0x650 [ +0.000016] queue_work_on+0x4c/0x60 [ +0.000015] nf_nat_masq_schedule+0x297/0x2e0 [nf_nat] [ +0.000034] masq_device_event+0x5a/0x60 [nf_nat] [ +0.000031] raw_notifier_call_chain+0x5f/0x80 [ +0.000017] dev_close_many+0x1d6/0x2c0 [ +0.000015] unregister_netdevice_many+0x4e3/0xa30 [ +0.000015] unregister_netdevice_queue+0x192/0x1d0 [ +0.000014] iavf_remove+0x8f9/0x930 [iavf] [ +0.000058] pci_device_remove+0x65/0x110 [ +0.000015] device_release_driver_internal+0xf8/0x190 [ +0.000017] pci_stop_bus_device+0xb5/0xf0 [ +0.000014] pci_stop_and_remove_bus_device+0xe/0x20 [ +0.000016] pci_iov_remove_virtfn+0x19c/0x230 [ +0.000015] sriov_disable+0x4f/0x170 [ +0.000014] ice_free_vfs+0x9a/0x490 [ice] [ +0.000306] ice_sriov_configure+0xb8/0x260 [ice] [ +0.000294] sriov_numvfs_store+0x1bc/0x200 [ +0.000015] kernfs_fop_write_iter+0x1db/0x270 [ +0.000016] new_sync_write+0x21d/0x330 [ +0.000012] vfs_write+0x376/0x410 [ +0.000012] ksys_write+0xba/0x150 [ +0.000012] do_syscall_64+0x3a/0x80 [ +0.000012] entry_SYSCALL_64_after_hwframe+0x44/0xae [ +0.000025] The buggy address belongs to the object at ffff889908eb6f00 which belongs to the cache kmalloc-96 of size 96 [ +0.000016] The buggy address is located 40 bytes inside of 96-byte region [ffff889908eb6f00, ffff889908eb6f60) [ +0.000026] The buggy address belongs to the physical page: [ +0.000010] page:00000000b7e99a2e refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1908eb6 [ +0.000016] flags: 0x57ffffc0000200(slab|node=1|zone=2|lastcpupid=0x1fffff) [ +0.000024] raw: 0057ffffc0000200 ffffea0069d9fd80 dead000000000002 ffff88810004c780 [ +0.000015] raw: 0000000000000000 0000000000200020 00000001ffffffff 0000000000000000 [ +0.000009] page dumped because: kasan: bad access detected [ +0.000016] Memory state around the buggy address: [ +0.000012] ffff889908eb6e00: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc [ +0.000014] ffff889908eb6e80: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc [ +0.000014] >ffff889908eb6f00: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc [ +0.000011] ^ [ +0.000013] ffff889908eb6f80: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc [ +0.000013] ffff889908eb7000: fa fb fb fb fb fb fb fb fc fc fc fc fa fb fb fb [ +0.000012] ================================================================== Fixes: 0891c89674e8 ("ice: warn about potentially malicious VFs") Reported-by: Slawomir Laba <slawomirx.laba@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-04-26ice: wait 5 s for EMP reset after firmware flashPetr Oros1-0/+3
We need to wait 5 s for EMP reset after firmware flash. Code was extracted from OOT driver (ice v1.8.3 downloaded from sourceforge). Without this wait, fw_activate let card in inconsistent state and recoverable only by second flash/activate. Flash was tested on these fw's: From -> To 3.00 -> 3.10/3.20 3.10 -> 3.00/3.20 3.20 -> 3.00/3.10 Reproducer: [root@host ~]# devlink dev flash pci/0000:ca:00.0 file E810_XXVDA4_FH_O_SEC_FW_1p6p1p9_NVM_3p10_PLDMoMCTP_0.11_8000AD7B.bin Preparing to flash [fw.mgmt] Erasing [fw.mgmt] Erasing done [fw.mgmt] Flashing 100% [fw.mgmt] Flashing done 100% [fw.undi] Erasing [fw.undi] Erasing done [fw.undi] Flashing 100% [fw.undi] Flashing done 100% [fw.netlist] Erasing [fw.netlist] Erasing done [fw.netlist] Flashing 100% [fw.netlist] Flashing done 100% Activate new firmware by devlink reload [root@host ~]# devlink dev reload pci/0000:ca:00.0 action fw_activate reload_actions_performed: fw_activate [root@host ~]# ip link show ens7f0 71: ens7f0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000 link/ether b4:96:91:dc:72:e0 brd ff:ff:ff:ff:ff:ff altname enp202s0f0 dmesg after flash: [ 55.120788] ice: Copyright (c) 2018, Intel Corporation. [ 55.274734] ice 0000:ca:00.0: Get PHY capabilities failed status = -5, continuing anyway [ 55.569797] ice 0000:ca:00.0: The DDP package was successfully loaded: ICE OS Default Package version 1.3.28.0 [ 55.603629] ice 0000:ca:00.0: Get PHY capability failed. [ 55.608951] ice 0000:ca:00.0: ice_init_nvm_phy_type failed: -5 [ 55.647348] ice 0000:ca:00.0: PTP init successful [ 55.675536] ice 0000:ca:00.0: DCB is enabled in the hardware, max number of TCs supported on this port are 8 [ 55.685365] ice 0000:ca:00.0: FW LLDP is disabled, DCBx/LLDP in SW mode. [ 55.692179] ice 0000:ca:00.0: Commit DCB Configuration to the hardware [ 55.701382] ice 0000:ca:00.0: 126.024 Gb/s available PCIe bandwidth, limited by 16.0 GT/s PCIe x8 link at 0000:c9:02.0 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link) Reboot doesn’t help, only second flash/activate with OOT or patched driver put card back in consistent state. After patch: [root@host ~]# devlink dev flash pci/0000:ca:00.0 file E810_XXVDA4_FH_O_SEC_FW_1p6p1p9_NVM_3p10_PLDMoMCTP_0.11_8000AD7B.bin Preparing to flash [fw.mgmt] Erasing [fw.mgmt] Erasing done [fw.mgmt] Flashing 100% [fw.mgmt] Flashing done 100% [fw.undi] Erasing [fw.undi] Erasing done [fw.undi] Flashing 100% [fw.undi] Flashing done 100% [fw.netlist] Erasing [fw.netlist] Erasing done [fw.netlist] Flashing 100% [fw.netlist] Flashing done 100% Activate new firmware by devlink reload [root@host ~]# devlink dev reload pci/0000:ca:00.0 action fw_activate reload_actions_performed: fw_activate [root@host ~]# ip link show ens7f0 19: ens7f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000 link/ether b4:96:91:dc:72:e0 brd ff:ff:ff:ff:ff:ff altname enp202s0f0 Fixes: 399e27dbbd9e94 ("ice: support immediate firmware activation via devlink reload") Signed-off-by: Petr Oros <poros@redhat.com> Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-04-26ice: Protect vf_state check by cfg_lock in ice_vc_process_vf_msg()Ivan Vecera1-7/+5
Previous patch labelled "ice: Fix incorrect locking in ice_vc_process_vf_msg()" fixed an issue with ignored messages sent by VF driver but a small race window still left. Recently caught trace during 'ip link set ... vf 0 vlan ...' operation: [ 7332.995625] ice 0000:3b:00.0: Clearing port VLAN on VF 0 [ 7333.001023] iavf 0000:3b:01.0: Reset indication received from the PF [ 7333.007391] iavf 0000:3b:01.0: Scheduling reset task [ 7333.059575] iavf 0000:3b:01.0: PF returned error -5 (IAVF_ERR_PARAM) to our request 3 [ 7333.059626] ice 0000:3b:00.0: Invalid message from VF 0, opcode 3, len 4, error -1 Setting of VLAN for VF causes a reset of the affected VF using ice_reset_vf() function that runs with cfg_lock taken: 1. ice_notify_vf_reset() informs IAVF driver that reset is needed and IAVF schedules its own reset procedure 2. Bit ICE_VF_STATE_DIS is set in vf->vf_state 3. Misc initialization steps 4. ice_sriov_post_vsi_rebuild() -> ice_vf_set_initialized() and that clears ICE_VF_STATE_DIS in vf->vf_state Step 3 is mentioned race window because IAVF reset procedure runs in parallel and one of its step is sending of VIRTCHNL_OP_GET_VF_RESOURCES message (opcode==3). This message is handled in ice_vc_process_vf_msg() and if it is received during the mentioned race window then it's marked as invalid and error is returned to VF driver. Protect vf_state check in ice_vc_process_vf_msg() by cfg_lock to avoid this race condition. Fixes: e6ba5273d4ed ("ice: Fix race conditions between virtchnl handling and VF ndo ops") Tested-by: Fei Liu <feliu@redhat.com> Signed-off-by: Ivan Vecera <ivecera@redhat.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-04-26ice: Fix incorrect locking in ice_vc_process_vf_msg()Ivan Vecera1-14/+7
Usage of mutex_trylock() in ice_vc_process_vf_msg() is incorrect because message sent from VF is ignored and never processed. Use mutex_lock() instead to fix the issue. It is safe because this mutex is used to prevent races between VF related NDOs and handlers processing request messages from VF and these handlers are running in ice_service_task() context. Additionally move this mutex lock prior ice_vc_is_opcode_allowed() call to avoid potential races during allowlist access. Fixes: e6ba5273d4ed ("ice: Fix race conditions between virtchnl handling and VF ndo ops") Signed-off-by: Ivan Vecera <ivecera@redhat.com> Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2022-04-26xsk: Fix possible crash when multiple sockets are createdMaciej Fijalkowski3-4/+26
Fix a crash that happens if an Rx only socket is created first, then a second socket is created that is Tx only and bound to the same umem as the first socket and also the same netdev and queue_id together with the XDP_SHARED_UMEM flag. In this specific case, the tx_descs array page pool was not created by the first socket as it was an Rx only socket. When the second socket is bound it needs this tx_descs array of this shared page pool as it has a Tx component, but unfortunately it was never allocated, leading to a crash. Note that this array is only used for zero-copy drivers using the batched Tx APIs, currently only ice and i40e. [ 5511.150360] BUG: kernel NULL pointer dereference, address: 0000000000000008 [ 5511.158419] #PF: supervisor write access in kernel mode [ 5511.164472] #PF: error_code(0x0002) - not-present page [ 5511.170416] PGD 0 P4D 0 [ 5511.173347] Oops: 0002 [#1] PREEMPT SMP PTI [ 5511.178186] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G E 5.18.0-rc1+ #97 [ 5511.187245] Hardware name: Intel Corp. GRANTLEY/GRANTLEY, BIOS GRRFCRB1.86B.0276.D07.1605190235 05/19/2016 [ 5511.198418] RIP: 0010:xsk_tx_peek_release_desc_batch+0x198/0x310 [ 5511.205375] Code: c0 83 c6 01 84 c2 74 6d 8d 46 ff 23 07 44 89 e1 48 83 c0 14 48 c1 e1 04 48 c1 e0 04 48 03 47 10 4c 01 c1 48 8b 50 08 48 8b 00 <48> 89 51 08 48 89 01 41 80 bd d7 00 00 00 00 75 82 48 8b 19 49 8b [ 5511.227091] RSP: 0018:ffffc90000003dd0 EFLAGS: 00010246 [ 5511.233135] RAX: 0000000000000000 RBX: ffff88810c8da600 RCX: 0000000000000000 [ 5511.241384] RDX: 000000000000003c RSI: 0000000000000001 RDI: ffff888115f555c0 [ 5511.249634] RBP: ffffc90000003e08 R08: 0000000000000000 R09: ffff889092296b48 [ 5511.257886] R10: 0000ffffffffffff R11: ffff889092296800 R12: 0000000000000000 [ 5511.266138] R13: ffff88810c8db500 R14: 0000000000000040 R15: 0000000000000100 [ 5511.274387] FS: 0000000000000000(0000) GS:ffff88903f800000(0000) knlGS:0000000000000000 [ 5511.283746] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 5511.290389] CR2: 0000000000000008 CR3: 00000001046e2001 CR4: 00000000003706f0 [ 5511.298640] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 5511.306892] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 5511.315142] Call Trace: [ 5511.317972] <IRQ> [ 5511.320301] ice_xmit_zc+0x68/0x2f0 [ice] [ 5511.324977] ? ktime_get+0x38/0xa0 [ 5511.328913] ice_napi_poll+0x7a/0x6a0 [ice] [ 5511.333784] __napi_poll+0x2c/0x160 [ 5511.337821] net_rx_action+0xdd/0x200 [ 5511.342058] __do_softirq+0xe6/0x2dd [ 5511.346198] irq_exit_rcu+0xb5/0x100 [ 5511.350339] common_interrupt+0xa4/0xc0 [ 5511.354777] </IRQ> [ 5511.357201] <TASK> [ 5511.359625] asm_common_interrupt+0x1e/0x40 [ 5511.364466] RIP: 0010:cpuidle_enter_state+0xd2/0x360 [ 5511.370211] Code: 49 89 c5 0f 1f 44 00 00 31 ff e8 e9 00 7b ff 45 84 ff 74 12 9c 58 f6 c4 02 0f 85 72 02 00 00 31 ff e8 02 0c 80 ff fb 45 85 f6 <0f> 88 11 01 00 00 49 63 c6 4c 2b 2c 24 48 8d 14 40 48 8d 14 90 49 [ 5511.391921] RSP: 0018:ffffffff82a03e60 EFLAGS: 00000202 [ 5511.397962] RAX: ffff88903f800000 RBX: 0000000000000001 RCX: 000000000000001f [ 5511.406214] RDX: 0000000000000000 RSI: ffffffff823400b9 RDI: ffffffff8234c046 [ 5511.424646] RBP: ffff88810a384800 R08: 000005032a28c046 R09: 0000000000000008 [ 5511.443233] R10: 000000000000000b R11: 0000000000000006 R12: ffffffff82bcf700 [ 5511.461922] R13: 000005032a28c046 R14: 0000000000000001 R15: 0000000000000000 [ 5511.480300] cpuidle_enter+0x29/0x40 [ 5511.494329] do_idle+0x1c7/0x250 [ 5511.507610] cpu_startup_entry+0x19/0x20 [ 5511.521394] start_kernel+0x649/0x66e [ 5511.534626] secondary_startup_64_no_verify+0xc3/0xcb [ 5511.549230] </TASK> Detect such case during bind() and allocate this memory region via newly introduced xp_alloc_tx_descs(). Also, use kvcalloc instead of kcalloc as for other buffer pool allocations, so that it matches the kvfree() from xp_destroy(). Fixes: d1bc532e99be ("i40e: xsk: Move tmp desc array from driver to pool") Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/bpf/20220425153745.481322-1-maciej.fijalkowski@intel.com
2022-04-26kprobes: Fix KRETPROBES when CONFIG_KRETPROBE_ON_RETHOOK is setAdam Zabrocki1-1/+1
The recent kernel change in 73f9b911faa7 ("kprobes: Use rethook for kretprobe if possible"), introduced a potential NULL pointer dereference bug in the KRETPROBE mechanism. The official Kprobes documentation defines that "Any or all handlers can be NULL". Unfortunately, there is a missing return handler verification to fulfill these requirements and can result in a NULL pointer dereference bug. This patch adds such verification in kretprobe_rethook_handler() function. Fixes: 73f9b911faa7 ("kprobes: Use rethook for kretprobe if possible") Signed-off-by: Adam Zabrocki <pi3@pi3.com.pl> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com> Cc: Anil S. Keshavamurthy <anil.s.keshavamurthy@intel.com> Link: https://lore.kernel.org/bpf/20220422164027.GA7862@pi3.com.pl
2022-04-26gfs2: Don't re-check for write past EOF unnecessarilyAndreas Gruenbacher1-1/+1
Only re-check for direct I/O writes past the end of the file after re-acquiring the inode glock. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2022-04-26virtio_net: fix wrong buf address calculation when using xdpNikolay Aleksandrov1-1/+19
We received a report[1] of kernel crashes when Cilium is used in XDP mode with virtio_net after updating to newer kernels. After investigating the reason it turned out that when using mergeable bufs with an XDP program which adjusts xdp.data or xdp.data_meta page_to_buf() calculates the build_skb address wrong because the offset can become less than the headroom so it gets the address of the previous page (-X bytes depending on how lower offset is): page_to_skb: page addr ffff9eb2923e2000 buf ffff9eb2923e1ffc offset 252 headroom 256 This is a pr_err() I added in the beginning of page_to_skb which clearly shows offset that is less than headroom by adding 4 bytes of metadata via an xdp prog. The calculations done are: receive_mergeable(): headroom = VIRTIO_XDP_HEADROOM; // VIRTIO_XDP_HEADROOM == 256 bytes offset = xdp.data - page_address(xdp_page) - vi->hdr_len - metasize; page_to_skb(): p = page_address(page) + offset; ... buf = p - headroom; Now buf goes -4 bytes from the page's starting address as can be seen above which is set as skb->head and skb->data by build_skb later. Depending on what's done with the skb (when it's freed most often) we get all kinds of corruptions and BUG_ON() triggers in mm[2]. We have to recalculate the new headroom after the xdp program has run, similar to how offset and len are recalculated. Headroom is directly related to data_hard_start, data and data_meta, so we use them to get the new size. The result is correct (similar pr_err() in page_to_skb, one case of xdp_page and one case of virtnet buf): a) Case with 4 bytes of metadata [ 115.949641] page_to_skb: page addr ffff8b4dcfad2000 offset 252 headroom 252 [ 121.084105] page_to_skb: page addr ffff8b4dcf018000 offset 20732 headroom 252 b) Case of pushing data +32 bytes [ 153.181401] page_to_skb: page addr ffff8b4dd0c4d000 offset 288 headroom 288 [ 158.480421] page_to_skb: page addr ffff8b4dd00b0000 offset 24864 headroom 288 c) Case of pushing data -33 bytes [ 835.906830] page_to_skb: page addr ffff8b4dd3270000 offset 223 headroom 223 [ 840.839910] page_to_skb: page addr ffff8b4dcdd68000 offset 12511 headroom 223 Offset and headroom are equal because offset points to the start of reserved bytes for the virtio_net header which are at buf start + headroom, while data points at buf start + vnet hdr size + headroom so when data or data_meta are adjusted by the xdp prog both the headroom size and the offset change equally. We can use data_hard_start to compute the new headroom after the xdp prog (linearized / page start case, the virtnet buf case is similar just with bigger base offset): xdp.data_hard_start = page_address + vnet_hdr xdp.data = page_address + vnet_hdr + headroom new headroom after xdp prog = xdp.data - xdp.data_hard_start - metasize An example reproducer xdp prog[3] is below. [1] https://github.com/cilium/cilium/issues/19453 [2] Two of the many traces: [ 40.437400] BUG: Bad page state in process swapper/0 pfn:14940 [ 40.916726] BUG: Bad page state in process systemd-resolve pfn:053b7 [ 41.300891] kernel BUG at include/linux/mm.h:720! [ 41.301801] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI [ 41.302784] CPU: 1 PID: 1181 Comm: kubelet Kdump: loaded Tainted: G B W 5.18.0-rc1+ #37 [ 41.304458] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014 [ 41.306018] RIP: 0010:page_frag_free+0x79/0xe0 [ 41.306836] Code: 00 00 75 ea 48 8b 07 a9 00 00 01 00 74 e0 48 8b 47 48 48 8d 50 ff a8 01 48 0f 45 fa eb d0 48 c7 c6 18 b8 30 a6 e8 d7 f8 fc ff <0f> 0b 48 8d 78 ff eb bc 48 8b 07 a9 00 00 01 00 74 3a 66 90 0f b6 [ 41.310235] RSP: 0018:ffffac05c2a6bc78 EFLAGS: 00010292 [ 41.311201] RAX: 000000000000003e RBX: 0000000000000000 RCX: 0000000000000000 [ 41.312502] RDX: 0000000000000001 RSI: ffffffffa6423004 RDI: 00000000ffffffff [ 41.313794] RBP: ffff993c98823600 R08: 0000000000000000 R09: 00000000ffffdfff [ 41.315089] R10: ffffac05c2a6ba68 R11: ffffffffa698ca28 R12: ffff993c98823600 [ 41.316398] R13: ffff993c86311ebc R14: 0000000000000000 R15: 000000000000005c [ 41.317700] FS: 00007fe13fc56740(0000) GS:ffff993cdd900000(0000) knlGS:0000000000000000 [ 41.319150] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 41.320152] CR2: 000000c00008a000 CR3: 0000000014908000 CR4: 0000000000350ee0 [ 41.321387] Call Trace: [ 41.321819] <TASK> [ 41.322193] skb_release_data+0x13f/0x1c0 [ 41.322902] __kfree_skb+0x20/0x30 [ 41.343870] tcp_recvmsg_locked+0x671/0x880 [ 41.363764] tcp_recvmsg+0x5e/0x1c0 [ 41.384102] inet_recvmsg+0x42/0x100 [ 41.406783] ? sock_recvmsg+0x1d/0x70 [ 41.428201] sock_read_iter+0x84/0xd0 [ 41.445592] ? 0xffffffffa3000000 [ 41.462442] new_sync_read+0x148/0x160 [ 41.479314] ? 0xffffffffa3000000 [ 41.496937] vfs_read+0x138/0x190 [ 41.517198] ksys_read+0x87/0xc0 [ 41.535336] do_syscall_64+0x3b/0x90 [ 41.551637] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 41.568050] RIP: 0033:0x48765b [ 41.583955] Code: e8 4a 35 fe ff eb 88 cc cc cc cc cc cc cc cc e8 fb 7a fe ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30 [ 41.632818] RSP: 002b:000000c000a2f5b8 EFLAGS: 00000212 ORIG_RAX: 0000000000000000 [ 41.664588] RAX: ffffffffffffffda RBX: 000000c000062000 RCX: 000000000048765b [ 41.681205] RDX: 0000000000005e54 RSI: 000000c000e66000 RDI: 0000000000000016 [ 41.697164] RBP: 000000c000a2f608 R08: 0000000000000001 R09: 00000000000001b4 [ 41.713034] R10: 00000000000000b6 R11: 0000000000000212 R12: 00000000000000e9 [ 41.728755] R13: 0000000000000001 R14: 000000c000a92000 R15: ffffffffffffffff [ 41.744254] </TASK> [ 41.758585] Modules linked in: br_netfilter bridge veth netconsole virtio_net and [ 33.524802] BUG: Bad page state in process systemd-network pfn:11e60 [ 33.528617] page ffffe05dc0147b00 ffffe05dc04e7a00 ffff8ae9851ec000 (1) len 82 offset 252 metasize 4 hroom 0 hdr_len 12 data ffff8ae9851ec10c data_meta ffff8ae9851ec108 data_end ffff8ae9851ec14e [ 33.529764] page:000000003792b5ba refcount:0 mapcount:-512 mapping:0000000000000000 index:0x0 pfn:0x11e60 [ 33.532463] flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff) [ 33.532468] raw: 000fffffc0000000 0000000000000000 dead000000000122 0000000000000000 [ 33.532470] raw: 0000000000000000 0000000000000000 00000000fffffdff 0000000000000000 [ 33.532471] page dumped because: nonzero mapcount [ 33.532472] Modules linked in: br_netfilter bridge veth netconsole virtio_net [ 33.532479] CPU: 0 PID: 791 Comm: systemd-network Kdump: loaded Not tainted 5.18.0-rc1+ #37 [ 33.532482] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1.fc35 04/01/2014 [ 33.532484] Call Trace: [ 33.532496] <TASK> [ 33.532500] dump_stack_lvl+0x45/0x5a [ 33.532506] bad_page.cold+0x63/0x94 [ 33.532510] free_pcp_prepare+0x290/0x420 [ 33.532515] free_unref_page+0x1b/0x100 [ 33.532518] skb_release_data+0x13f/0x1c0 [ 33.532524] kfree_skb_reason+0x3e/0xc0 [ 33.532527] ip6_mc_input+0x23c/0x2b0 [ 33.532531] ip6_sublist_rcv_finish+0x83/0x90 [ 33.532534] ip6_sublist_rcv+0x22b/0x2b0 [3] XDP program to reproduce(xdp_pass.c): #include <linux/bpf.h> #include <bpf/bpf_helpers.h> SEC("xdp_pass") int xdp_pkt_pass(struct xdp_md *ctx) { bpf_xdp_adjust_head(ctx, -(int)32); return XDP_PASS; } char _license[] SEC("license") = "GPL"; compile: clang -O2 -g -Wall -target bpf -c xdp_pass.c -o xdp_pass.o load on virtio_net: ip link set enp1s0 xdpdrv obj xdp_pass.o sec xdp_pass CC: stable@vger.kernel.org CC: Jason Wang <jasowang@redhat.com> CC: Xuan Zhuo <xuanzhuo@linux.alibaba.com> CC: Daniel Borkmann <daniel@iogearbox.net> CC: "Michael S. Tsirkin" <mst@redhat.com> CC: virtualization@lists.linux-foundation.org Fixes: 8fb7da9e9907 ("virtio_net: get build_skb() buf by data ptr") Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Link: https://lore.kernel.org/r/20220425103703.3067292-1-razor@blackwall.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-04-26net: dsa: mv88e6xxx: Fix port_hidden_wait to account for port_base_addrNathan Rossi1-2/+3
The other port_hidden functions rely on the port_read/port_write functions to access the hidden control port. These functions apply the offset for port_base_addr where applicable. Update port_hidden_wait to use the port_wait_bit so that port_base_addr offsets are accounted for when waiting for the busy bit to change. Without the offset the port_hidden_wait function would timeout on devices that have a non-zero port_base_addr (e.g. MV88E6141), however devices that have a zero port_base_addr would operate correctly (e.g. MV88E6390). Fixes: 609070133aff ("net: dsa: mv88e6xxx: update code operating on hidden registers") Signed-off-by: Nathan Rossi <nathan@nathanrossi.com> Reviewed-by: Marek Behún <kabel@kernel.org> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20220425070454.348584-1-nathan@nathanrossi.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-04-26net: phy: marvell10g: fix return value on errorBaruch Siach1-1/+1
Return back the error value that we get from phy_read_mmd(). Fixes: c84786fa8f91 ("net: phy: marvell10g: read copper results from CSSR1") Signed-off-by: Baruch Siach <baruch.siach@siklu.com> Reviewed-by: Marek Behún <kabel@kernel.org> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://lore.kernel.org/r/f47cb031aeae873bb008ba35001607304a171a20.1650868058.git.baruch@tkos.co.il Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-04-26net: usb: qmi_wwan: add support for Sierra Wireless EM7590Ethan Yang1-0/+1
add support for Sierra Wireless EM7590 0xc081 composition. Signed-off-by: Ethan Yang <etyang@sierrawireless.com> Acked-by: Bjørn Mork <bjorn@mork.no> Link: https://lore.kernel.org/r/20220425054028.5444-1-etyang@sierrawireless.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-04-26net/af_packet: add VLAN support for AF_PACKET SOCK_RAW GSOHangbin Liu1-5/+13
Currently, the kernel drops GSO VLAN tagged packet if it's created with socket(AF_PACKET, SOCK_RAW, 0) plus virtio_net_hdr. The reason is AF_PACKET doesn't adjust the skb network header if there is a VLAN tag. Then after virtio_net_hdr_set_proto() called, the skb->protocol will be set to ETH_P_IP/IPv6. And in later inet/ipv6_gso_segment() the skb is dropped as network header position is invalid. Let's handle VLAN packets by adjusting network header position in packet_parse_headers(). The adjustment is safe and does not affect the later xmit as tap device also did that. In packet_snd(), packet_parse_headers() need to be moved before calling virtio_net_hdr_set_proto(), so we can set correct skb->protocol and network header first. There is no need to update tpacket_snd() as it calls packet_parse_headers() in tpacket_fill_skb(), which is already before calling virtio_net_hdr_* functions. skb->no_fcs setting is also moved upper to make all skb settings together and keep consistency with function packet_sendmsg_spkt(). Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Acked-by: Willem de Bruijn <willemb@google.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Link: https://lore.kernel.org/r/20220425014502.985464-1-liuhangbin@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-04-26net: bcmgenet: hide status block before TX timestampingJonathan Lemon1-0/+7
The hardware checksum offloading requires use of a transmit status block inserted before the outgoing frame data, this was updated in '9a9ba2a4aaaa ("net: bcmgenet: always enable status blocks")' However, skb_tx_timestamp() assumes that it is passed a raw frame and PTP parsing chokes on this status block. Fix this by calling __skb_pull(), which hides the TSB before calling skb_tx_timestamp(), so an outgoing PTP packet is parsed correctly. As the data in the skb has already been set up for DMA, and the dma_unmap_* calls use a separately stored address, there is no no effective change in the data transmission. Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Link: https://lore.kernel.org/r/20220424165307.591145-1-jonathan.lemon@gmail.com Fixes: d03825fba459 ("net: bcmgenet: add skb_tx_timestamp call") Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-04-26net: dsa: ksz: added the generic port_stp_state_set functionArun Ramadoss6-73/+49
The ksz8795 and ksz9477 uses the same algorithm for the port_stp_state_set function except the register address is different. So moved the algorithm to the ksz_common.c and used the dev_ops for register read and write. This function can also used for the lan937x part. Hence making it generic for all the parts. Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Link: https://lore.kernel.org/r/20220424112831.11504-1-arun.ramadoss@microchip.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-04-26net: phy: LAN937x: add interrupt support for link detectionArun Ramadoss1-0/+2
Added the config_intr and handle_interrupt for the LAN937x phy which is same as the LAN87xx phy. Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20220423154727.29052-1-arun.ramadoss@microchip.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-04-26mctp: defer the kfree of object mdev->addrsLin Ma1-1/+1
The function mctp_unregister() reclaims the device's relevant resource when a netcard detaches. However, a running routine may be unaware of this and cause the use-after-free of the mdev->addrs object. The race condition can be demonstrated below cleanup thread another thread | unregister_netdev() | mctp_sendmsg() ... | ... mctp_unregister() | rt = mctp_route_lookup() ... | mctl_local_output() kfree(mdev->addrs) | ... | saddr = rt->dev->addrs[0]; | An attacker can adopt the (recent provided) mtcpserial driver with pty to fake the device detaching and use the userfaultfd to increase the race success chance (in mctp_sendmsg). The KASan report for such a POC is shown below: [ 86.051955] ================================================================== [ 86.051955] BUG: KASAN: use-after-free in mctp_local_output+0x4e9/0xb7d [ 86.051955] Read of size 1 at addr ffff888005f298c0 by task poc/295 [ 86.051955] [ 86.051955] Call Trace: [ 86.051955] <TASK> [ 86.051955] dump_stack_lvl+0x33/0x42 [ 86.051955] print_report.cold.13+0xb2/0x6b3 [ 86.051955] ? preempt_schedule_irq+0x57/0x80 [ 86.051955] ? mctp_local_output+0x4e9/0xb7d [ 86.051955] kasan_report+0xa5/0x120 [ 86.051955] ? mctp_local_output+0x4e9/0xb7d [ 86.051955] mctp_local_output+0x4e9/0xb7d [ 86.051955] ? mctp_dev_set_key+0x79/0x79 [ 86.051955] ? copyin+0x38/0x50 [ 86.051955] ? _copy_from_iter+0x1b6/0xf20 [ 86.051955] ? sysvec_apic_timer_interrupt+0x97/0xb0 [ 86.051955] ? asm_sysvec_apic_timer_interrupt+0x12/0x20 [ 86.051955] ? mctp_local_output+0x1/0xb7d [ 86.051955] mctp_sendmsg+0x64d/0xdb0 [ 86.051955] ? mctp_sk_close+0x20/0x20 [ 86.051955] ? __fget_light+0x2fd/0x4f0 [ 86.051955] ? mctp_sk_close+0x20/0x20 [ 86.051955] sock_sendmsg+0xdd/0x110 [ 86.051955] __sys_sendto+0x1cc/0x2a0 [ 86.051955] ? __ia32_sys_getpeername+0xa0/0xa0 [ 86.051955] ? new_sync_write+0x335/0x550 [ 86.051955] ? alloc_file+0x22f/0x500 [ 86.051955] ? __ip_do_redirect+0x820/0x1820 [ 86.051955] ? vfs_write+0x44d/0x7b0 [ 86.051955] ? vfs_write+0x44d/0x7b0 [ 86.051955] ? fput_many+0x15/0x120 [ 86.051955] ? ksys_write+0x155/0x1b0 [ 86.051955] ? __ia32_sys_read+0xa0/0xa0 [ 86.051955] __x64_sys_sendto+0xd8/0x1b0 [ 86.051955] ? exit_to_user_mode_prepare+0x2f/0x120 [ 86.051955] ? syscall_exit_to_user_mode+0x12/0x20 [ 86.051955] do_syscall_64+0x3a/0x80 [ 86.051955] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 86.051955] RIP: 0033:0x7f82118a56b3 [ 86.051955] RSP: 002b:00007ffdb154b110 EFLAGS: 00000293 ORIG_RAX: 000000000000002c [ 86.051955] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f82118a56b3 [ 86.051955] RDX: 0000000000000010 RSI: 00007f8211cd4000 RDI: 0000000000000007 [ 86.051955] RBP: 00007ffdb154c1d0 R08: 00007ffdb154b164 R09: 000000000000000c [ 86.051955] R10: 0000000000000000 R11: 0000000000000293 R12: 000055d779800db0 [ 86.051955] R13: 00007ffdb154c2b0 R14: 0000000000000000 R15: 0000000000000000 [ 86.051955] </TASK> [ 86.051955] [ 86.051955] Allocated by task 295: [ 86.051955] kasan_save_stack+0x1c/0x40 [ 86.051955] __kasan_kmalloc+0x84/0xa0 [ 86.051955] mctp_rtm_newaddr+0x242/0x610 [ 86.051955] rtnetlink_rcv_msg+0x2fd/0x8b0 [ 86.051955] netlink_rcv_skb+0x11c/0x340 [ 86.051955] netlink_unicast+0x439/0x630 [ 86.051955] netlink_sendmsg+0x752/0xc00 [ 86.051955] sock_sendmsg+0xdd/0x110 [ 86.051955] __sys_sendto+0x1cc/0x2a0 [ 86.051955] __x64_sys_sendto+0xd8/0x1b0 [ 86.051955] do_syscall_64+0x3a/0x80 [ 86.051955] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 86.051955] [ 86.051955] Freed by task 301: [ 86.051955] kasan_save_stack+0x1c/0x40 [ 86.051955] kasan_set_track+0x21/0x30 [ 86.051955] kasan_set_free_info+0x20/0x30 [ 86.051955] __kasan_slab_free+0x104/0x170 [ 86.051955] kfree+0x8c/0x290 [ 86.051955] mctp_dev_notify+0x161/0x2c0 [ 86.051955] raw_notifier_call_chain+0x8b/0xc0 [ 86.051955] unregister_netdevice_many+0x299/0x1180 [ 86.051955] unregister_netdevice_queue+0x210/0x2f0 [ 86.051955] unregister_netdev+0x13/0x20 [ 86.051955] mctp_serial_close+0x6d/0xa0 [ 86.051955] tty_ldisc_kill+0x31/0xa0 [ 86.051955] tty_ldisc_hangup+0x24f/0x560 [ 86.051955] __tty_hangup.part.28+0x2ce/0x6b0 [ 86.051955] tty_release+0x327/0xc70 [ 86.051955] __fput+0x1df/0x8b0 [ 86.051955] task_work_run+0xca/0x150 [ 86.051955] exit_to_user_mode_prepare+0x114/0x120 [ 86.051955] syscall_exit_to_user_mode+0x12/0x20 [ 86.051955] do_syscall_64+0x46/0x80 [ 86.051955] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 86.051955] [ 86.051955] The buggy address belongs to the object at ffff888005f298c0 [ 86.051955] which belongs to the cache kmalloc-8 of size 8 [ 86.051955] The buggy address is located 0 bytes inside of [ 86.051955] 8-byte region [ffff888005f298c0, ffff888005f298c8) [ 86.051955] [ 86.051955] The buggy address belongs to the physical page: [ 86.051955] flags: 0x100000000000200(slab|node=0|zone=1) [ 86.051955] raw: 0100000000000200 dead000000000100 dead000000000122 ffff888005c42280 [ 86.051955] raw: 0000000000000000 0000000080660066 00000001ffffffff 0000000000000000 [ 86.051955] page dumped because: kasan: bad access detected [ 86.051955] [ 86.051955] Memory state around the buggy address: [ 86.051955] ffff888005f29780: 00 fc fc fc fc 00 fc fc fc fc 00 fc fc fc fc 00 [ 86.051955] ffff888005f29800: fc fc fc fc 00 fc fc fc fc 00 fc fc fc fc 00 fc [ 86.051955] >ffff888005f29880: fc fc fc fb fc fc fc fc fa fc fc fc fc fa fc fc [ 86.051955] ^ [ 86.051955] ffff888005f29900: fc fc 00 fc fc fc fc 00 fc fc fc fc 00 fc fc fc [ 86.051955] ffff888005f29980: fc 00 fc fc fc fc 00 fc fc fc fc 00 fc fc fc fc [ 86.051955] ================================================================== To this end, just like the commit e04480920d1e ("Bluetooth: defer cleanup of resources in hci_unregister_dev()") this patch defers the destructive kfree(mdev->addrs) in mctp_unregister to the mctp_dev_put, where the refcount of mdev is zero and the entire device is reclaimed. This prevents the use-after-free because the sendmsg thread holds the reference of mdev in the mctp_route object. Fixes: 583be982d934 (mctp: Add device handling and netlink interface) Signed-off-by: Lin Ma <linma@zju.edu.cn> Acked-by: Jeremy Kerr <jk@codeconstruct.com.au> Link: https://lore.kernel.org/r/20220422114340.32346-1-linma@zju.edu.cn Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-04-26cpufreq: qcom-cpufreq-hw: Clear dcvs interruptsVladimir Zapolskiy1-0/+8
It's noted that dcvs interrupts are not self-clearing, thus an interrupt handler runs constantly, which leads to a severe regression in runtime. To fix the problem an explicit write to clear interrupt register is required, note that on OSM platforms the register may not be present. Fixes: 275157b367f4 ("cpufreq: qcom-cpufreq-hw: Add dcvs interrupt support") Signed-off-by: Vladimir Zapolskiy <vladimir.zapolskiy@linaro.org> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
2022-04-26Merge branch 'Introduce typed pointer support in BPF maps'Alexei Starovoitov23-162/+2035
Kumar Kartikeya Dwivedi says: ==================== This set enables storing pointers of a certain type in BPF map, and extends the verifier to enforce type safety and lifetime correctness properties. The infrastructure being added is generic enough for allowing storing any kind of pointers whose type is available using BTF (user or kernel) in the future (e.g. strongly typed memory allocation in BPF program), which are internally tracked in the verifier as PTR_TO_BTF_ID, but for now the series limits them to two kinds of pointers obtained from the kernel. Obviously, use of this feature depends on map BTF. 1. Unreferenced kernel pointer In this case, there are very few restrictions. The pointer type being stored must match the type declared in the map value. However, such a pointer when loaded from the map can only be dereferenced, but not passed to any in-kernel helpers or kernel functions available to the program. This is because while the verifier's exception handling mechanism coverts BPF_LDX to PROBE_MEM loads, which are then handled specially by the JIT implementation, the same liberty is not available to accesses inside the kernel. The pointer by the time it is passed into a helper has no lifetime related guarantees about the object it is pointing to, and may well be referencing invalid memory. 2. Referenced kernel pointer This case imposes a lot of restrictions on the programmer, to ensure safety. To transfer the ownership of a reference in the BPF program to the map, the user must use the bpf_kptr_xchg helper, which returns the old pointer contained in the map, as an acquired reference, and releases verifier state for the referenced pointer being exchanged, as it moves into the map. This a normal PTR_TO_BTF_ID that can be used with in-kernel helpers and kernel functions callable by the program. However, if BPF_LDX is used to load a referenced pointer from the map, it is still not permitted to pass it to in-kernel helpers or kernel functions. To obtain a reference usable with helpers, the user must invoke a kfunc helper which returns a usable reference (which also must be eventually released before BPF_EXIT, or moved into a map). Since the load of the pointer (preserving data dependency ordering) must happen inside the RCU read section, the kfunc helper will take a pointer to the map value, which must point to the actual pointer of the object whose reference is to be raised. The type will be verified from the BTF information of the kfunc, as the prototype must be: T *func(T **, ... /* other arguments */); Then, the verifier checks whether pointer at offset of the map value points to the type T, and permits the call. This convention is followed so that such helpers may also be called from sleepable BPF programs, where RCU read lock is not necessarily held in the BPF program context, hence necessiating the need to pass in a pointer to the actual pointer to perform the load inside the RCU read section. Notes ----- * C selftests require https://reviews.llvm.org/D119799 to pass. * Unlike BPF timers, kptr is not reset or freed on map_release_uref. * Referenced kptr storage is always treated as unsigned long * on kernel side, as BPF side cannot mutate it. The storage (8 bytes) is sufficient for both 32-bit and 64-bit platforms. * Use of WRITE_ONCE to reset unreferenced kptr on 32-bit systems is fine, as the actual pointer is always word sized, so the store tearing into two 32-bit stores won't be a problem as the other half is always zeroed out. Changelog: ---------- v5 -> v6 v5: https://lore.kernel.org/bpf/20220415160354.1050687-1-memxor@gmail.com * Address comments from Alexei * Drop 'Revisit stack usage' comment * Rename off_btf to kernel_btf * Add comment about searching using type from map BTF * Do kmemdup + btf_get instead of get + kmemdup + put * Add comment for btf_struct_ids_match * Add comment for assigning non-zero id for mark_ptr_or_null_reg * Rename PTR_RELEASE to OBJ_RELEASE * Rename BPF_MAP_OFF_DESC_TYPE_XXX_KPTR to BPF_KPTR_XXX * Remove unneeded likely/unlikely in cold functions * Fix other misc nits * Keep release_regno instead of replacing with bool + regno * Add a patch to prevent type match for first member when off == 0 for release functions (kfunc + BPF helpers) * Guard kptr/kptr_ref definition in libbpf header with __has_attribute to prevent selftests compilation error with old clang not support type tags v4 -> v5 v4: https://lore.kernel.org/bpf/20220409093303.499196-1-memxor@gmail.com * Address comments from Joanne * Move __btf_member_bit_offset before strcmp * Move strcmp conditional on name to unref kptr patch * Directly return from btf_find_struct in patch 1 * Use enum btf_field_type vs int field_type * Put btf and btf_id in off_desc in named struct 'kptr' * Switch order for BTF_FIELD_IGNORE check * Drop dead tab->nr_off = 0 store * Use i instead of tab->nr_off to btf_put on failure * Replace kzalloc + memcpy with kmemdup (kernel test robot) * Reject both BPF_F_RDONLY_PROG and BPF_F_WRONLY_PROG * Add logging statement for reject BPF_MODE(insn->code) != BPF_MEM * Rename off_desc -> kptr_off_desc in check_mem_access * Drop check for err, fallthrough to end of function * Remove is_release_function, use meta.release_regno to detect release function, release reference state, and remove check_release_regno * Drop off_desc->flags, use off_desc->type * Update comment for ARG_PTR_TO_KPTR * Distinguish between direct/indirect access to kptr * Drop check_helper_mem_access from process_kptr_func, check_mem_reg in kptr_get * Add verifier test for helper accessing kptr indirectly * Fix other misc nits, add Acked-by for patch 2 v3 -> v4 v3: https://lore.kernel.org/bpf/20220320155510.671497-1-memxor@gmail.com * Use btf_parse_kptrs, plural kptrs naming (Joanne, Andrii) * Remove unused parameters in check_map_kptr_access (Joanne) * Handle idx < info_cnt kludge using tmp variable (Andrii) * Validate tags always precede modifiers in BTF (Andrii) * Split out into https://lore.kernel.org/bpf/20220406004121.282699-1-memxor@gmail.com * Store u32 type_id in btf_field_info (Andrii) * Use base_type in map_kptr_match_type (Andrii) * Free kptr_off_tab when not bpf_capable (Martin) * Use PTR_RELEASE flag instead of bools in bpf_func_proto (Joanne) * Drop extra reg->off and reg->ref_obj_id checks in map_kptr_match_type (Martin) * Use separate u32 and u8 arrays for offs and sizes in off_arr (Andrii) * Simplify and remove map->value_size sentinel in copy_map_value (Andrii) * Use sort_r to keep both arrays in sync while sorting (Andrii) * Rename check_and_free_timers_and_kptr to check_and_free_fields (Andrii) * Move dtor prototype checks to registration phase (Alexei) * Use ret variable for checking ASSERT_XXX, use shorter strings (Andrii) * Fix missing checks for other maps (Jiri) * Fix various other nits, and bugs noticed during self review v2 -> v3 v2: https://lore.kernel.org/bpf/20220317115957.3193097-1-memxor@gmail.com * Address comments from Alexei * Set name, sz, align in btf_find_field * Do idx >= info_cnt check in caller of btf_find_field_* * Use extra element in the info_arr to make this safe * Remove while loop, reject extra tags * Remove cases of defensive programming * Move bpf_capable() check to map_check_btf * Put check_ptr_off_reg reordering hunk into separate patch * Warn for ref_ptr once * Make the meta.ref_obj_id == 0 case simpler to read * Remove kptr_percpu and kptr_user support, remove their tests * Store size of field at offset in off_arr * Fix BPF_F_NO_PREALLOC set wrongly for hash map in C selftest * Add missing check_mem_reg call for kptr_get kfunc arg#0 check v1 -> v2 v1: https://lore.kernel.org/bpf/20220220134813.3411982-1-memxor@gmail.com * Address comments from Alexei * Rename bpf_btf_find_by_name_kind_all to bpf_find_btf_id * Reduce indentation level in that function * Always take reference regardless of module or vmlinux BTF * Also made it the same for btf_get_module_btf * Use kptr, kptr_ref, kptr_percpu, kptr_user type tags * Don't reserve tag namespace * Refactor btf_find_field to be side effect free, allocate and populate kptr_off_tab in caller * Move module reference to dtor patch * Remove support for BPF_XCHG, BPF_CMPXCHG insn * Introduce bpf_kptr_xchg helper * Embed offset array in struct bpf_map, populate and sort it once * Adjust copy_map_value to memcpy directly using this offset array * Removed size member from offset array to save space * Fix some problems pointed out by kernel test robot * Tidy selftests * Lots of other minor fixes ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-04-26selftests/bpf: Add test for strict BTF type checkKumar Kartikeya Dwivedi2-1/+41
Ensure that the edge case where first member type was matched successfully even if it didn't match BTF type of register is caught and rejected by the verifier. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220424214901.2743946-14-memxor@gmail.com
2022-04-26selftests/bpf: Add verifier tests for kptrKumar Kartikeya Dwivedi3-7/+562
Reuse bpf_prog_test functions to test the support for PTR_TO_BTF_ID in BPF map case, including some tests that verify implementation sanity and corner cases. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220424214901.2743946-13-memxor@gmail.com
2022-04-26selftests/bpf: Add C tests for kptrKumar Kartikeya Dwivedi2-0/+227
This uses the __kptr and __kptr_ref macros as well, and tries to test the stuff that is supposed to work, since we have negative tests in test_verifier suite. Also include some code to test map-in-map support, such that the inner_map_meta matches the kptr_off_tab of map added as element. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220424214901.2743946-12-memxor@gmail.com
2022-04-26libbpf: Add kptr type tag macros to bpf_helpers.hKumar Kartikeya Dwivedi1-0/+7
Include convenience definitions: __kptr: Unreferenced kptr __kptr_ref: Referenced kptr Users can use them to tag the pointer type meant to be used with the new support directly in the map value definition. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220424214901.2743946-11-memxor@gmail.com
2022-04-26bpf: Make BTF type match stricter for release argumentsKumar Kartikeya Dwivedi3-8/+27
The current of behavior of btf_struct_ids_match for release arguments is that when type match fails, it retries with first member type again (recursively). Since the offset is already 0, this is akin to just casting the pointer in normal C, since if type matches it was just embedded inside parent sturct as an object. However, we want to reject cases for release function type matching, be it kfunc or BPF helpers. An example is the following: struct foo { struct bar b; }; struct foo *v = acq_foo(); rel_bar(&v->b); // btf_struct_ids_match fails btf_types_are_same, then // retries with first member type and succeeds, while // it should fail. Hence, don't walk the struct and only rely on btf_types_are_same for strict mode. All users of strict mode must be dealing with zero offset anyway, since otherwise they would want the struct to be walked. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220424214901.2743946-10-memxor@gmail.com
2022-04-26bpf: Teach verifier about kptr_get kfunc helpersKumar Kartikeya Dwivedi2-5/+55
We introduce a new style of kfunc helpers, namely *_kptr_get, where they take pointer to the map value which points to a referenced kernel pointer contained in the map. Since this is referenced, only bpf_kptr_xchg from BPF side and xchg from kernel side is allowed to change the current value, and each pointer that resides in that location would be referenced, and RCU protected (this must be kept in mind while adding kernel types embeddable as reference kptr in BPF maps). This means that if do the load of the pointer value in an RCU read section, and find a live pointer, then as long as we hold RCU read lock, it won't be freed by a parallel xchg + release operation. This allows us to implement a safe refcount increment scheme. Hence, enforce that first argument of all such kfunc is a proper PTR_TO_MAP_VALUE pointing at the right offset to referenced pointer. For the rest of the arguments, they are subjected to typical kfunc argument checks, hence allowing some flexibility in passing more intent into how the reference should be taken. For instance, in case of struct nf_conn, it is not freed until RCU grace period ends, but can still be reused for another tuple once refcount has dropped to zero. Hence, a bpf_ct_kptr_get helper not only needs to call refcount_inc_not_zero, but also do a tuple match after incrementing the reference, and when it fails to match it, put the reference again and return NULL. This can be implemented easily if we allow passing additional parameters to the bpf_ct_kptr_get kfunc, like a struct bpf_sock_tuple * and a tuple__sz pair. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220424214901.2743946-9-memxor@gmail.com
2022-04-26bpf: Wire up freeing of referenced kptrKumar Kartikeya Dwivedi6-25/+210
A destructor kfunc can be defined as void func(type *), where type may be void or any other pointer type as per convenience. In this patch, we ensure that the type is sane and capture the function pointer into off_desc of ptr_off_tab for the specific pointer offset, with the invariant that the dtor pointer is always set when 'kptr_ref' tag is applied to the pointer's pointee type, which is indicated by the flag BPF_MAP_VALUE_OFF_F_REF. Note that only BTF IDs whose destructor kfunc is registered, thus become the allowed BTF IDs for embedding as referenced kptr. Hence it serves the purpose of finding dtor kfunc BTF ID, as well acting as a check against the whitelist of allowed BTF IDs for this purpose. Finally, wire up the actual freeing of the referenced pointer if any at all available offsets, so that no references are leaked after the BPF map goes away and the BPF program previously moved the ownership a referenced pointer into it. The behavior is similar to BPF timers, where bpf_map_{update,delete}_elem will free any existing referenced kptr. The same case is with LRU map's bpf_lru_push_free/htab_lru_push_free functions, which are extended to reset unreferenced and free referenced kptr. Note that unlike BPF timers, kptr is not reset or freed when map uref drops to zero. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220424214901.2743946-8-memxor@gmail.com
2022-04-26bpf: Populate pairs of btf_id and destructor kfunc in btfKumar Kartikeya Dwivedi2-0/+125
To support storing referenced PTR_TO_BTF_ID in maps, we require associating a specific BTF ID with a 'destructor' kfunc. This is because we need to release a live referenced pointer at a certain offset in map value from the map destruction path, otherwise we end up leaking resources. Hence, introduce support for passing an array of btf_id, kfunc_btf_id pairs that denote a BTF ID and its associated release function. Then, add an accessor 'btf_find_dtor_kfunc' which can be used to look up the destructor kfunc of a certain BTF ID. If found, we can use it to free the object from the map free path. The registration of these pairs also serve as a whitelist of structures which are allowed as referenced PTR_TO_BTF_ID in a BPF map, because without finding the destructor kfunc, we will bail and return an error. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220424214901.2743946-7-memxor@gmail.com
2022-04-26bpf: Adapt copy_map_value for multiple offset caseKumar Kartikeya Dwivedi2-27/+117
Since now there might be at most 10 offsets that need handling in copy_map_value, the manual shuffling and special case is no longer going to work. Hence, let's generalise the copy_map_value function by using a sorted array of offsets to skip regions that must be avoided while copying into and out of a map value. When the map is created, we populate the offset array in struct map, Then, copy_map_value uses this sorted offset array is used to memcpy while skipping timer, spin lock, and kptr. The array is allocated as in most cases none of these special fields would be present in map value, hence we can save on space for the common case by not embedding the entire object inside bpf_map struct. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220424214901.2743946-6-memxor@gmail.com
2022-04-26bpf: Prevent escaping of kptr loaded from mapsKumar Kartikeya Dwivedi2-8/+37
While we can guarantee that even for unreferenced kptr, the object pointer points to being freed etc. can be handled by the verifier's exception handling (normal load patching to PROBE_MEM loads), we still cannot allow the user to pass these pointers to BPF helpers and kfunc, because the same exception handling won't be done for accesses inside the kernel. The same is true if a referenced pointer is loaded using normal load instruction. Since the reference is not guaranteed to be held while the pointer is used, it must be marked as untrusted. Hence introduce a new type flag, PTR_UNTRUSTED, which is used to mark all registers loading unreferenced and referenced kptr from BPF maps, and ensure they can never escape the BPF program and into the kernel by way of calling stable/unstable helpers. In check_ptr_to_btf_access, the !type_may_be_null check to reject type flags is still correct, as apart from PTR_MAYBE_NULL, only MEM_USER, MEM_PERCPU, and PTR_UNTRUSTED may be set for PTR_TO_BTF_ID. The first two are checked inside the function and rejected using a proper error message, but we still want to allow dereference of untrusted case. Also, we make sure to inherit PTR_UNTRUSTED when chain of pointers are walked, so that this flag is never dropped once it has been set on a PTR_TO_BTF_ID (i.e. trusted to untrusted transition can only be in one direction). In convert_ctx_accesses, extend the switch case to consider untrusted PTR_TO_BTF_ID in addition to normal PTR_TO_BTF_ID for PROBE_MEM conversion for BPF_LDX. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220424214901.2743946-5-memxor@gmail.com
2022-04-26bpf: Allow storing referenced kptr in mapKumar Kartikeya Dwivedi6-13/+151
Extending the code in previous commits, introduce referenced kptr support, which needs to be tagged using 'kptr_ref' tag instead. Unlike unreferenced kptr, referenced kptr have a lot more restrictions. In addition to the type matching, only a newly introduced bpf_kptr_xchg helper is allowed to modify the map value at that offset. This transfers the referenced pointer being stored into the map, releasing the references state for the program, and returning the old value and creating new reference state for the returned pointer. Similar to unreferenced pointer case, return value for this case will also be PTR_TO_BTF_ID_OR_NULL. The reference for the returned pointer must either be eventually released by calling the corresponding release function, otherwise it must be transferred into another map. It is also allowed to call bpf_kptr_xchg with a NULL pointer, to clear the value, and obtain the old value if any. BPF_LDX, BPF_STX, and BPF_ST cannot access referenced kptr. A future commit will permit using BPF_LDX for such pointers, but attempt at making it safe, since the lifetime of object won't be guaranteed. There are valid reasons to enforce the restriction of permitting only bpf_kptr_xchg to operate on referenced kptr. The pointer value must be consistent in face of concurrent modification, and any prior values contained in the map must also be released before a new one is moved into the map. To ensure proper transfer of this ownership, bpf_kptr_xchg returns the old value, which the verifier would require the user to either free or move into another map, and releases the reference held for the pointer being moved in. In the future, direct BPF_XCHG instruction may also be permitted to work like bpf_kptr_xchg helper. Note that process_kptr_func doesn't have to call check_helper_mem_access, since we already disallow rdonly/wronly flags for map, which is what check_map_access_type checks, and we already ensure the PTR_TO_MAP_VALUE refers to kptr by obtaining its off_desc, so check_map_access is also not required. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220424214901.2743946-4-memxor@gmail.com
2022-04-26bpf: Tag argument to be released in bpf_func_protoKumar Kartikeya Dwivedi8-49/+60
Add a new type flag for bpf_arg_type that when set tells verifier that for a release function, that argument's register will be the one for which meta.ref_obj_id will be set, and which will then be released using release_reference. To capture the regno, introduce a new field release_regno in bpf_call_arg_meta. This would be required in the next patch, where we may either pass NULL or a refcounted pointer as an argument to the release function bpf_kptr_xchg. Just releasing only when meta.ref_obj_id is set is not enough, as there is a case where the type of argument needed matches, but the ref_obj_id is set to 0. Hence, we must enforce that whenever meta.ref_obj_id is zero, the register that is to be released can only be NULL for a release function. Since we now indicate whether an argument is to be released in bpf_func_proto itself, is_release_function helper has lost its utitlity, hence refactor code to work without it, and just rely on meta.release_regno to know when to release state for a ref_obj_id. Still, the restriction of one release argument and only one ref_obj_id passed to BPF helper or kfunc remains. This may be lifted in the future. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220424214901.2743946-3-memxor@gmail.com
2022-04-26bpf: Allow storing unreferenced kptr in mapKumar Kartikeya Dwivedi6-36/+433
This commit introduces a new pointer type 'kptr' which can be embedded in a map value to hold a PTR_TO_BTF_ID stored by a BPF program during its invocation. When storing such a kptr, BPF program's PTR_TO_BTF_ID register must have the same type as in the map value's BTF, and loading a kptr marks the destination register as PTR_TO_BTF_ID with the correct kernel BTF and BTF ID. Such kptr are unreferenced, i.e. by the time another invocation of the BPF program loads this pointer, the object which the pointer points to may not longer exist. Since PTR_TO_BTF_ID loads (using BPF_LDX) are patched to PROBE_MEM loads by the verifier, it would safe to allow user to still access such invalid pointer, but passing such pointers into BPF helpers and kfuncs should not be permitted. A future patch in this series will close this gap. The flexibility offered by allowing programs to dereference such invalid pointers while being safe at runtime frees the verifier from doing complex lifetime tracking. As long as the user may ensure that the object remains valid, it can ensure data read by it from the kernel object is valid. The user indicates that a certain pointer must be treated as kptr capable of accepting stores of PTR_TO_BTF_ID of a certain type, by using a BTF type tag 'kptr' on the pointed to type of the pointer. Then, this information is recorded in the object BTF which will be passed into the kernel by way of map's BTF information. The name and kind from the map value BTF is used to look up the in-kernel type, and the actual BTF and BTF ID is recorded in the map struct in a new kptr_off_tab member. For now, only storing pointers to structs is permitted. An example of this specification is shown below: #define __kptr __attribute__((btf_type_tag("kptr"))) struct map_value { ... struct task_struct __kptr *task; ... }; Then, in a BPF program, user may store PTR_TO_BTF_ID with the type task_struct into the map, and then load it later. Note that the destination register is marked PTR_TO_BTF_ID_OR_NULL, as the verifier cannot know whether the value is NULL or not statically, it must treat all potential loads at that map value offset as loading a possibly NULL pointer. Only BPF_LDX, BPF_STX, and BPF_ST (with insn->imm = 0 to denote NULL) are allowed instructions that can access such a pointer. On BPF_LDX, the destination register is updated to be a PTR_TO_BTF_ID, and on BPF_STX, it is checked whether the source register type is a PTR_TO_BTF_ID with same BTF type as specified in the map BTF. The access size must always be BPF_DW. For the map in map support, the kptr_off_tab for outer map is copied from the inner map's kptr_off_tab. It was chosen to do a deep copy instead of introducing a refcount to kptr_off_tab, because the copy only needs to be done when paramterizing using inner_map_fd in the map in map case, hence would be unnecessary for all other users. It is not permitted to use MAP_FREEZE command and mmap for BPF map having kptrs, similar to the bpf_timer case. A kptr also requires that BPF program has both read and write access to the map (hence both BPF_F_RDONLY_PROG and BPF_F_WRONLY_PROG are disallowed). Note that check_map_access must be called from both check_helper_mem_access and for the BPF instructions, hence the kptr check must distinguish between ACCESS_DIRECT and ACCESS_HELPER, and reject ACCESS_HELPER cases. We rename stack_access_src to bpf_access_src and reuse it for this purpose. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220424214901.2743946-2-memxor@gmail.com
2022-04-26bpf: Use bpf_prog_run_array_cg_flags everywhereStanislav Fomichev2-54/+26
Rename bpf_prog_run_array_cg_flags to bpf_prog_run_array_cg and use it everywhere. check_return_code already enforces sane return ranges for all cgroup types. (only egress and bind hooks have uncanonical return ranges, the rest is using [0, 1]) No functional changes. v2: - 'func_ret & 1' under explicit test (Andrii & Martin) Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220425220448.3669032-1-sdf@google.com
2022-04-26bpftool, musl compat: Replace sys/fcntl.h by fcntl.hDominique Martinet1-1/+1
musl does not like including sys/fcntl.h directly: [...] 1 | #warning redirecting incorrect #include <sys/fcntl.h> to <fcntl.h> [...] Signed-off-by: Dominique Martinet <asmadeus@codewreck.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Quentin Monnet <quentin@isovalent.com> Link: https://lore.kernel.org/bpf/20220424051022.2619648-5-asmadeus@codewreck.org
2022-04-26bpftool, musl compat: Replace nftw with FTW_ACTIONRETVALDominique Martinet1-55/+57
musl nftw implementation does not support FTW_ACTIONRETVAL. There have been multiple attempts at pushing the feature in musl upstream, but it has been refused or ignored all the times: https://www.openwall.com/lists/musl/2021/03/26/1 https://www.openwall.com/lists/musl/2022/01/22/1 In this case we only care about /proc/<pid>/fd/<fd>, so it's not too difficult to reimplement directly instead, and the new implementation makes 'bpftool perf' slightly faster because it doesn't needlessly stat/readdir unneeded directories (54ms -> 13ms on my machine). Signed-off-by: Dominique Martinet <asmadeus@codewreck.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Quentin Monnet <quentin@isovalent.com> Link: https://lore.kernel.org/bpf/20220424051022.2619648-4-asmadeus@codewreck.org
2022-04-25wwan_hwsim: Avoid flush_scheduled_work() usageTetsuo Handa1-5/+17
Flushing system-wide workqueues is dangerous and will be forbidden. Replace system_wq with local wwan_wq. While we are at it, make err_clean_devs: label of wwan_hwsim_init() behave like wwan_hwsim_exit(), for it is theoretically possible to call wwan_hwsim_debugfs_devcreate_write()/wwan_hwsim_debugfs_devdestroy_write() by the moment wwan_hwsim_init_devs() returns. Link: https://lkml.kernel.org/r/49925af7-78a8-a3dd-bce6-cfc02e1a9236@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Sergey Ryazanov <ryazanov.s.a@gmail.com> Reviewed-by: Loic Poulain <loic.poulain@linaro.org> Link: https://lore.kernel.org/r/7390d51f-60e2-3cee-5277-b819a55ceabe@I-love.SAKURA.ne.jp Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-25net: ieee802154: ca8210: Call _xmit_error() when a transmission failsMiquel Raynal1-2/+1
ieee802154_xmit_error() is the right helper to call when a transmission has failed. Let's use it instead of open-coding it. Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Acked-by: Alexander Aring <aahringo@redhat.com> Link: https://lore.kernel.org/r/20220407100903.1695973-11-miquel.raynal@bootlin.com Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
2022-04-25net: ieee802154: ca8210: Use core return codes instead of hardcoding themMiquel Raynal1-110/+68
All the error codes defined in this driver are generic and already defined in the ieee802154 main header. Let's just get rid of these extra definition and switch to the core's values. There is no functional change. Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Acked-by: Alexander Aring <aahringo@redhat.com> Link: https://lore.kernel.org/r/20220407100903.1695973-10-miquel.raynal@bootlin.com Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
2022-04-25net: ieee802154: atusb: Call _xmit_hw_error() upon transmission errorMiquel Raynal1-3/+1
ieee802154_xmit_hw_error() is the right helper to call when a transmission has failed for a non-determined (and probably not IEEE802.15.4 specific) reason. Let's use this helper instead of open-coding it. Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Acked-by: Alexander Aring <aahringo@redhat.com> Link: https://lore.kernel.org/r/20220407100903.1695973-9-miquel.raynal@bootlin.com Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
2022-04-25net: ieee802154: at86rf230: Forward Tx trac errorsMiquel Raynal2-113/+21
Commit 493bc90a9683 ("at86rf230: add debugfs support") brought trac support as part of a debugfs feature, in order to add some testing capabilities involving ack handling. As we want to collect trac errors but do not need the debugfs feature anymore, let's partially revert this commit, keeping the Tx trac handling part which still makes sense. This allows to always return the trac error directly to the core with the recently introduced ieee802154_xmit_error() helper. Suggested-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Acked-by: Alexander Aring <aahringo@redhat.com> Link: https://lore.kernel.org/r/20220407100903.1695973-8-miquel.raynal@bootlin.com Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
2022-04-25net: ieee802154: at86rf230: Call _xmit_hw_error() when failing to offload framesMiquel Raynal1-2/+1
If we end up at this location, it means that there was likely a hardware issue (either a bus error when asynchronously offloading the packet to the transceiver, or the transceiver took too long for some state change). In this case it was decided to return IEEE802154_SYSTEM_ERROR through the ieee802154_xmit_hw_error() helper dedicated to non IEEE802.15.4 specific errors. Let's use this helper instead of (almost) open-coding it. Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Acked-by: Alexander Aring <aahringo@redhat.com> Link: https://lore.kernel.org/r/20220407100903.1695973-7-miquel.raynal@bootlin.com Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
2022-04-25net: mac802154: Create an error helper for asynchronous offloading errorsMiquel Raynal2-0/+15
A few drivers do the full transmit operation asynchronously, which means that a bus error that happens when forwarding the packet to the transmitter or a timeout happening when offloading the request to the transmitter will not be reported immediately. The solution in this case is to call this new helper to free the necessary resources, restart the queue and always return the same generic TRAC error code: IEEE802154_SYSTEM_ERROR. Suggested-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Acked-by: Alexander Aring <aahringo@redhat.com> Link: https://lore.kernel.org/r/20220407100903.1695973-6-miquel.raynal@bootlin.com Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
2022-04-25net: mac802154: Create an offloaded transmission error helperMiquel Raynal2-0/+21
So far there is only a helper for successful transmissions, which led device drivers to implement their own handling in case of error. Unfortunately, we really need all the drivers to give the hand back to the core once they are done in order to be able to build a proper synchronous API. So let's create a _xmit_error() helper and take this opportunity to fill the new device-global field storing Tx statuses. Suggested-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Acked-by: Alexander Aring <aahringo@redhat.com> Link: https://lore.kernel.org/r/20220407100903.1695973-5-miquel.raynal@bootlin.com Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
2022-04-25net: mac802154: Save a global error code on transmissionsMiquel Raynal2-1/+6
So far no error is returned from a failing transmission. However it might sometimes be useful, and particularly easy to use during sync transfers (for certain MLME commands). Let's create an internal variable for that, global to the device. Right now only success are registered, which is rather useless, but soon we will have more situations filling this field. Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Acked-by: Alexander Aring <aahringo@redhat.com> Link: https://lore.kernel.org/r/20220407100903.1695973-4-miquel.raynal@bootlin.com Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
2022-04-25net: ieee802154: Fill the list of MLME return codesMiquel Raynal1-1/+66
There are more codes than already listed, let's be a bit more exhaustive. This will allow to drop device drivers local definitions of these codes. Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Acked-by: Alexander Aring <aahringo@redhat.com> Link: https://lore.kernel.org/r/20220407100903.1695973-3-miquel.raynal@bootlin.com Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>