summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)AuthorFilesLines
2020-06-04Merge branch 'for-linus' of ↵Linus Torvalds2-11/+14
git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching Pull livepatching updates from Jiri Kosina: - simplifications and improvements for issues Peter Ziljstra found during his previous work on W^X cleanups. This allows us to remove livepatch arch-specific .klp.arch sections and add proper support for jump labels in patched code. Also, this patchset removes the last module_disable_ro() usage in the tree. Patches from Josh Poimboeuf and Peter Zijlstra - a few other minor cleanups * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching: MAINTAINERS: add lib/livepatch to LIVE PATCHING livepatch: add arch-specific headers to MAINTAINERS livepatch: Make klp_apply_object_relocs static MAINTAINERS: adjust to livepatch .klp.arch removal module: Make module_enable_ro() static again x86/module: Use text_mutex in apply_relocate_add() module: Remove module_disable_ro() livepatch: Remove module_disable_ro() usage x86/module: Use text_poke() for late relocations s390/module: Use s390_kernel_write() for late relocations s390: Change s390_kernel_write() return type to match memcpy() livepatch: Prevent module-specific KLP rela sections from referencing vmlinux symbols livepatch: Remove .klp.arch livepatch: Apply vmlinux-specific KLP relocations early livepatch: Disallow vmlinux.ko
2020-06-04Merge tag 'sound-5.8-rc1' of ↵Linus Torvalds28-120/+423
git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound Pull sound updates from Takashi Iwai: "It was another busy development cycle, and the majority of changes are found in ASoC side. Below are Some highlights. ASoC core: - Lots of core cleanups and refactorings, still on-going work by Morimoto-san ASoC drivers: - Continued work on cleaning up and improving the Intel SOF stuff, along with new platform support including SoundWire - Fixes to make the Marvell SSPA driver work upstream - Support for AMD Renoir ACP, Dialog DA7212, Freescale EASRC and i.MX8M, Intel Elkhard Lake, Maxim MAX98390, Nuvoton NAU8812 and NAU8814 and Realtek RT1016. USB-audio: - Improvement for sync and implicit feedback streams with the more accurate frame size calculation and full-duplex support - Support for RME Babyface Pro and Prioneer DJ DJM HD-audio: - Fixes for Mic mute LED on HP machines - Re-enable support of Intel SST driver for SKL/KBL platforms FireWire: - Lots of refactoring, add support for RME FireFace and MOTU UltraLite-mk3" * tag 'sound-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (428 commits) ALSA: es1688: Add the missed snd_card_free() ALSA: hda: add sienna_cichlid audio asic id for sienna_cichlid up ALSA: usb-audio: Add Pioneer DJ DJM-900NXS2 support ASoC: qcom: q6asm-dai: kCFI fix ASoC: soc-card: add snd_soc_card_remove_dai_link() ASoC: soc-card: add snd_soc_card_add_dai_link() ASoC: soc-card: add snd_soc_card_set_bias_level_post() ASoC: soc-card: add snd_soc_card_set_bias_level() ASoC: soc-card: add snd_soc_card_remove() ASoC: soc-card: add snd_soc_card_late_probe() ASoC: soc-card: add snd_soc_card_probe() ASoC: soc-card: add probed bit field to snd_soc_card ASoC: soc-card: add snd_soc_card_resume_post() ASoC: soc-card: add snd_soc_card_resume_pre() ASoC: soc-card: add snd_soc_card_suspend_post() ASoC: soc-card: add snd_soc_card_suspend_pre() ASoC: soc-card: move snd_soc_card_subclass to soc-card ASoC: soc-card: move snd_soc_card_get_codec_dai() to soc-card ASoC: soc-card: move snd_soc_card_set/get_drvdata() to soc-card ASoC: soc-card: move snd_soc_card_jack_new() to soc-card ...
2020-06-04Merge branch 'pcmcia-next' of ↵Linus Torvalds1-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/brodo/linux Pull pcmcia updates from Dominik Brodowski: "Two minor PCMCIA odd fixes: one replacing zero-length arrays with a flexible-array member, and one making a local function static" * 'pcmcia-next' of git://git.kernel.org/pub/scm/linux/kernel/git/brodo/linux: pcmcia: make pccard_loop_tuple() static pcmcia: Replace zero-length array with flexible-array
2020-06-04Merge branch 'remotes/lorenzo/pci/rcar'Bjorn Helgaas1-12/+26
- Fix rcar OB window programming (Andrew Murray) - Add rcar suspend/resume support (Kazufumi Ikeda) - Add r8a77961 to DT binding (Yoshihiro Shimoda) - Rename pcie-rcar.c to pcie-rcar-host.c to make room for endpoint mode (Lad Prabhakar) - Move shareable code to pcie-rcar.c (Lad Prabhakar) - Correct PCIEPAMR mask calculation for "size < 128" (Lad Prabhakar) - Add endpoint support for multiple outbound memory windows (Lad Prabhakar) - Add R-Car PCIe endpoint driver and DT bindings (Lad Prabhakar) * remotes/lorenzo/pci/rcar: MAINTAINERS: Add file patterns for rcar PCI device tree bindings PCI: rcar: Add endpoint mode support dt-bindings: PCI: rcar: Add bindings for R-Car PCIe endpoint controller PCI: endpoint: Add support to handle multiple base for mapping outbound memory PCI: endpoint: Pass page size as argument to pci_epc_mem_init() PCI: rcar: Fix calculating mask for PCIEPAMR register PCI: rcar: Move shareable code to a common file PCI: rcar: Rename pcie-rcar.c to pcie-rcar-host.c dt-bindings: pci: rcar: add r8a77961 support PCI: rcar: Add suspend/resume PCI: rcar: Fix incorrect programming of OB windows
2020-06-04Merge branch 'remotes/lorenzo/pci/host-generic'Bjorn Helgaas2-14/+13
- Constify struct pci_ecam_ops (Rob Herring) - Support building as modules (Rob Herring) - Eliminate wrappers for pci_host_common_probe() by using DT match table data (Rob Herring) * remotes/lorenzo/pci/host-generic: PCI: host-generic: Eliminate pci_host_common_probe wrappers PCI: host-generic: Support building as modules PCI: Constify struct pci_ecam_ops # Conflicts: # drivers/pci/controller/dwc/pcie-hisi.c
2020-06-04Merge branch 'remotes/lorenzo/pci/brcmstb'Bjorn Helgaas1-1/+8
- Assert fundamental reset on initialization (Nicolas Saenz Julienne) - Remove unnecessary clk_put(); devm_clk_get() handles this automatically (Jim Quinlan) - Fix outbound memory window register stride offset (Jim Quinlan) - Add "aspm-no-l0s" property for brcmstb and disable ASPM L0s when present (Jim Quinlan) - Add property to notify Raspberry Pi firmware of xHCI reset (Nicolas Saenz Julienne) - Add Raspberry Pi VL805 xHCI init function to trigger VL805 firmware load (Nicolas Saenz Julienne) - Wait in brcmstb probe for Raspberry Pi VL805 firmware initialization (Nicolas Saenz Julienne) - Load Raspberry Pi VL805 firmware in USB early handoff quirk (Nicolas Saenz Julienne) * remotes/lorenzo/pci/brcmstb: USB: pci-quirks: Add Raspberry Pi 4 quirk PCI: brcmstb: Wait for Raspberry Pi's firmware when present firmware: raspberrypi: Introduce vl805 init routine soc: bcm2835: Add notify xHCI reset property PCI: brcmstb: Disable L0s component of ASPM if requested dt-bindings: PCI: brcmstb: New prop 'aspm-no-l0s' PCI: brcmstb: Fix window register offset from 4 to 8 PCI: brcmstb: Don't clk_put() a managed clock PCI: brcmstb: Assert fundamental reset on initialization
2020-06-04Merge branch 'pci/pm'Bjorn Helgaas1-0/+6
- Check .bridge_d3() hook for NULL before calling it (Bjorn Helgaas) - Disable PME# for Pericom OHCI/UHCI USB controllers because it's not reliably asserted on USB hotplug (Kai-Heng Feng) - Assume ports without DLL Link Active train links in 100 ms to work around Thunderbolt bridge defects (Mika Westerberg) * pci/pm: PCI/PM: Assume ports without DLL Link Active train links in 100 ms PCI/PM: Adjust pcie_wait_for_link_delay() for caller delay PCI: Avoid Pericom USB controller OHCI/EHCI PME# defect serial: 8250_pci: Move Pericom IDs to pci_ids.h PCI/PM: Call .bridge_d3() hook only if non-NULL
2020-06-04Merge branch 'pci/misc'Bjorn Helgaas2-15/+22
- Clarify that platform_get_irq() should never return 0 (Bjorn Helgaas) - Check for platform_get_irq() failure consistently (Bjorn Helgaas) - Replace zero-length array with flexible-array (Gustavo A. R. Silva) - Unify pcie_find_root_port() and pci_find_pcie_root_port() (Yicong Yang) - Quirk Intel C620 MROMs, which have non-BARs in BAR locations (Xiaochun Lee) - Fix pcie_pme_resume() and pcie_pme_remove() kernel-doc (Jay Fang) - Rename _DSM constants to align with spec (Krzysztof Wilczyński) * pci/misc: PCI: Rename _DSM constants to align with spec PCI/PME: Fix kernel-doc of pcie_pme_resume() and pcie_pme_remove() x86/PCI: Mark Intel C620 MROMs as having non-compliant BARs PCI: Unify pcie_find_root_port() and pci_find_pcie_root_port() PCI: Replace zero-length array with flexible-array PCI: Check for platform_get_irq() failure consistently driver core: platform: Clarify that IRQ 0 is invalid
2020-06-04Merge branch 'pci/error'Bjorn Helgaas2-8/+0
- Log only ACPI_NOTIFY_DISCONNECT_RECOVER events for EDR, not all ACPI SYSTEM-level events (Kuppuswamy Sathyanarayanan) - Rely only on _OSC (not _OSC + HEST FIRMWARE_FIRST) to negotiate AER Capability ownership (Alexandru Gagniuc) - Remove HEST/FIRMWARE_FIRST parsing that was previously used to help intuit AER Capability ownership (Kuppuswamy Sathyanarayanan) - Remove redundant pci_is_pcie() and dev->aer_cap checks (Kuppuswamy Sathyanarayanan) - Print IRQ number used by DPC (Yicong Yang) * pci/error: PCI/DPC: Print IRQ number used by port PCI/AER: Use "aer" variable for capability offset PCI/AER: Remove redundant dev->aer_cap checks PCI/AER: Remove redundant pci_is_pcie() checks PCI/AER: Remove HEST/FIRMWARE_FIRST parsing for AER ownership PCI/AER: Use only _OSC to determine AER ownership PCI/EDR: Log only ACPI_NOTIFY_DISCONNECT_RECOVER events
2020-06-04Merge tag 'backlight-next-5.8' of ↵Linus Torvalds2-17/+1
git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight Pull backlight updates from Lee Jones: "Core Framework: - Add backlight_device_get_by_name() to the API New Device Support: - Add support for WLED5 to Qualcomm WLED Fix-ups: - Convert to GPIO descriptors in l4f00242t03 - Device Tree fix-ups for qcom-wled Bug Fixes: - Properly disable regulators on .probe() failure" * tag 'backlight-next-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight: backlight: Add backlight_device_get_by_name() backlight: qcom-wled: Add support for WLED5 peripheral that is present on PM8150L PMICs dt-bindings: backlight: qcom-wled: Add WLED5 bindings backlight: qcom-wled: Add callback functions dt-bindings: backlight: qcom-wled: Convert the wled bindings to .yaml format backlight: l4f00242t03: Convert to GPIO descriptors backlight: lp855x: Ensure regulators are disabled on probe failure
2020-06-04Merge tag 'mfd-next-5.8' of ↵Linus Torvalds8-2/+721
git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd Pull MFD updates from Lee Jones: "Core Frameworks: - Constify 'properties' attribute in core header file New Drivers: - Add support for Gateworks System Controller - Add support for MediaTek MT6358 PMIC - Add support for Mediatek MT6360 PMIC - Add support for Monolithic Power Systems MP2629 ADC and Battery charger Fix-ups: - Use new I2C API in htc-i2cpld - Remove superfluous code in sprd-sc27xx-spi - Improve error handling in stm32-timers - Device Tree additions/fixes in mt6397 - Defer probe betterment in wm8994-core - Improve module handling in wm8994-core - Staticify in stpmic1 - Trivial (spelling, formatting) in tqmx86 Bug Fixes: - Fix incorrect register/PCI IDs in intel-lpss-pci - Fix unbalanced Regulator API calls in wm8994-core - Fix double free() in wcd934x - Remove IRQ domain on failure in stmfx - Reset chip on resume in stmfx - Disable/enable IRQs on suspend/resume in stmfx - Do not use bulk writes on H/W which does not support them in max77620" * tag 'mfd-next-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd: (29 commits) mfd: mt6360: Remove duplicate REGMAP_IRQ_REG_LINE() entry mfd: Add support for PMIC MT6360 mfd: max77620: Use single-byte writes on MAX77620 mfd: wcd934x: Drop kfree for memory allocated with devm_kzalloc mfd: stmfx: Disable IRQ in suspend to avoid spurious interrupt mfd: stmfx: Fix stmfx_irq_init error path mfd: stmfx: Reset chip on resume as supply was disabled mfd: wm8994: Silence warning about supplies during deferred probe mfd: wm8994: Fix unbalanced calls to regulator_bulk_disable() mfd: wm8994: Fix driver operation if loaded as modules dt-bindings: mfd: mediatek: Add MT6397 Pin Controller mfd: Constify properties in mfd_cell mfd: stm32-timers: Use dma_request_chan() instead dma_request_slave_channel() mfd: sprd: Remove unnecessary spi_bus_type setting mfd: intel-lpss: Update LPSS UART #2 PCI ID for Jasper Lake mfd: tqmx86: Fix a typo in MODULE_DESCRIPTION mfd: stpmic1: Make stpmic1_regmap_config static mfd: htc-i2cpld: Convert to use i2c_new_client_device() MAINTAINERS: Add entry for mp2629 Battery Charger driver power: supply: mp2629: Add impedance compensation config ...
2020-06-04Merge tag 'keys-next-20200602' of ↵Linus Torvalds2-1/+2
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs Pull keyring updates from David Howells: - Fix a documentation warning. - Replace a zero-length array with a flexible one - Make the big_key key type use ChaCha20Poly1305 and use the crypto algorithm directly rather than going through the crypto layer. - Implement the update op for the big_key type. * tag 'keys-next-20200602' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: keys: Implement update for the big_key type security/keys: rewrite big_key crypto to use library interface KEYS: Replace zero-length array with flexible-array Documentation: security: core.rst: add missing argument
2020-06-04afs: Add a tracepoint to track the lifetime of the afs_volume structDavid Howells1-0/+56
Add a tracepoint to track the lifetime of the afs_volume struct. Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-04afs: Implement client support for the YFSVL.GetCellName RPC opDavid Howells1-0/+4
Implement client support for the YFSVL.GetCellName RPC operation by which YFS permits the canonical cell name to be queried from a VL server. Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-04afs: Build an abstraction around an "operation" conceptDavid Howells1-11/+8
Turn the afs_operation struct into the main way that most fileserver operations are managed. Various things are added to the struct, including the following: (1) All the parameters and results of the relevant operations are moved into it, removing corresponding fields from the afs_call struct. afs_call gets a pointer to the op. (2) The target volume is made the main focus of the operation, rather than the target vnode(s), and a bunch of op->vnode->volume are made op->volume instead. (3) Two vnode records are defined (op->file[]) for the vnode(s) involved in most operations. The vnode record (struct afs_vnode_param) contains: - The vnode pointer. - The fid of the vnode to be included in the parameters or that was returned in the reply (eg. FS.MakeDir). - The status and callback information that may be returned in the reply about the vnode. - Callback break and data version tracking for detecting simultaneous third-parth changes. (4) Pointers to dentries to be updated with new inodes. (5) An operations table pointer. The table includes pointers to functions for issuing AFS and YFS-variant RPCs, handling the success and abort of an operation and handling post-I/O-lock local editing of a directory. To make this work, the following function restructuring is made: (A) The rotation loop that issues calls to fileservers that can be found in each function that wants to issue an RPC (such as afs_mkdir()) is extracted out into common code, in a new file called fs_operation.c. (B) The rotation loops, such as the one in afs_mkdir(), are replaced with a much smaller piece of code that allocates an operation, sets the parameters and then calls out to the common code to do the actual work. (C) The code for handling the success and failure of an operation are moved into operation functions (as (5) above) and these are called from the core code at appropriate times. (D) The pseudo inode getting stuff used by the dynamic root code is moved over into dynroot.c. (E) struct afs_iget_data is absorbed into the operation struct and afs_iget() expects to be given an op pointer and a vnode record. (F) Point (E) doesn't work for the root dir of a volume, but we know the FID in advance (it's always vnode 1, unique 1), so a separate inode getter, afs_root_iget(), is provided to special-case that. (G) The inode status init/update functions now also take an op and a vnode record. (H) The RPC marshalling functions now, for the most part, just take an afs_operation struct as their only argument. All the data they need is held there. The result delivery functions write their answers there as well. (I) The call is attached to the operation and then the operation core does the waiting. And then the new operation code is, for the moment, made to just initialise the operation, get the appropriate vnode I/O locks and do the same rotation loop as before. This lays the foundation for the following changes in the future: (*) Overhauling the rotation (again). (*) Support for asynchronous I/O, where the fileserver rotation must be done asynchronously also. Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-04Merge tag 'media/v5.8-1' of ↵Linus Torvalds18-109/+581
git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media Pull media updates from Mauro Carvalho Chehab: - Media documentation is now split into admin-guide, driver-api and userspace-api books (a longstanding request from Jon); - The media Kconfig was reorganized, in order to make easier to select drivers and their dependencies; - The testing drivers now has a separate directory; - added a new driver for Rockchip Video Decoder IP; - The atomisp staging driver was resurrected. It is meant to work with 4 generations of cameras on Atom-based laptops, tablets and cell phones. So, it seems worth investing time to cleanup this driver and making it in good shape. - Added some V4L2 core ancillary routines to help with h264 codecs; - Added an ov2740 image sensor driver; - The si2157 gained support for Analog TV, which, in turn, added support for some cx231xx and cx23885 boards to also support analog standards; - Added some V4L2 controls (V4L2_CID_CAMERA_ORIENTATION and V4L2_CID_CAMERA_SENSOR_ROTATION) to help identifying where the camera is located at the device; - VIDIOC_ENUM_FMT was extended to support MC-centric devices; - Lots of drivers improvements and cleanups. * tag 'media/v5.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (503 commits) media: Documentation: media: Refer to mbus format documentation from CSI-2 docs media: s5k5baf: Replace zero-length array with flexible-array media: i2c: imx219: Drop <linux/clk-provider.h> and <linux/clkdev.h> media: i2c: Add ov2740 image sensor driver media: ov8856: Implement sensor module revision identification media: ov8856: Add devicetree support media: dt-bindings: ov8856: Document YAML bindings media: dvb-usb: Add Cinergy S2 PCIe Dual Port support media: dvbdev: Fix tuner->demod media controller link media: dt-bindings: phy: phy-rockchip-dphy-rx0: move rockchip dphy rx0 bindings out of staging media: staging: dt-bindings: phy-rockchip-dphy-rx0: remove non-used reg property media: atomisp: unify the version for isp2401 a0 and b0 versions media: atomisp: update TODO with the current data media: atomisp: adjust some code at sh_css that could be broken media: atomisp: don't produce errs for ignored IRQs media: atomisp: print IRQ when debugging media: atomisp: isp_mmu: don't use kmem_cache media: atomisp: add a notice about possible leak resources media: atomisp: disable the dynamic and reserved pools media: atomisp: turn on camera before setting it ...
2020-06-04Merge branch 'akpm' (patches from Andrew)Linus Torvalds16-159/+209
Merge more updates from Andrew Morton: "More mm/ work, plenty more to come Subsystems affected by this patch series: slub, memcg, gup, kasan, pagealloc, hugetlb, vmscan, tools, mempolicy, memblock, hugetlbfs, thp, mmap, kconfig" * akpm: (131 commits) arm64: mm: use ARCH_HAS_DEBUG_WX instead of arch defined x86: mm: use ARCH_HAS_DEBUG_WX instead of arch defined riscv: support DEBUG_WX mm: add DEBUG_WX support drivers/base/memory.c: cache memory blocks in xarray to accelerate lookup mm/thp: rename pmd_mknotpresent() as pmd_mkinvalid() powerpc/mm: drop platform defined pmd_mknotpresent() mm: thp: don't need to drain lru cache when splitting and mlocking THP hugetlbfs: get unmapped area below TASK_UNMAPPED_BASE for hugetlbfs sparc32: register memory occupied by kernel as memblock.memory include/linux/memblock.h: fix minor typo and unclear comment mm, mempolicy: fix up gup usage in lookup_node tools/vm/page_owner_sort.c: filter out unneeded line mm: swap: memcg: fix memcg stats for huge pages mm: swap: fix vmstats for huge pages mm: vmscan: limit the range of LRU type balancing mm: vmscan: reclaim writepage is IO cost mm: vmscan: determine anon/file pressure balance at the reclaim root mm: balance LRU lists based on relative thrashing mm: only count actual rotations as LRU reclaim cost ...
2020-06-04fs: move fiemap range validation into the file systems instancesChristoph Hellwig1-1/+2
Replace fiemap_check_flags with a fiemap_prep helper that also takes the inode and mapped range, and performs the sanity check and truncation previously done in fiemap_check_range. This way the validation is inside the file system itself and thus properly works for the stacked overlayfs case as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Link: https://lore.kernel.org/r/20200523073016.2944131-7-hch@lst.de Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-04iomap: fix the iomap_fiemap prototypeChristoph Hellwig1-1/+1
iomap_fiemap should take u64 start and len arguments, just like the ->fiemap prototype. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Link: https://lore.kernel.org/r/20200523073016.2944131-6-hch@lst.de Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-04fs: move the fiemap definitions out of fs.hChristoph Hellwig3-21/+28
No need to pull the fiemap definitions into almost every file in the kernel build. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Link: https://lore.kernel.org/r/20200523073016.2944131-5-hch@lst.de Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-04fs: mark __generic_block_fiemap staticChristoph Hellwig1-4/+0
There is no caller left outside of ioctl.c. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Link: https://lore.kernel.org/r/20200523073016.2944131-4-hch@lst.de Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-04ext4: mballoc: use lock for checking free blocks while retryingRitesh Harjani1-1/+2
Currently while doing block allocation grp->bb_free may be getting modified if discard is happening in parallel. For e.g. consider a case where there are lot of threads who have preallocated lot of blocks and there is a thread which is trying to discard all of this group's PA. Now it could happen that we see all of those group's bb_free is zero and fail the allocation while there is sufficient space if we free up all the PA. So this patch adds another flag "EXT4_MB_STRICT_CHECK" which will be set if we are unable to allocate any blocks in the first try (since we may not have considered blocks about to be discarded from PA lists). So during retry attempt to allocate blocks we will use ext4_lock_group() for checking if the group is good or not. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/9cb740a117c958c36596f167b12af1beae9a68b7.1589955723.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-04writeback: Export inode_io_list_del()Jan Kara1-0/+1
Ext4 needs to remove inode from writeback lists after it is out of visibility of its journalling machinery (which can still dirty the inode). Export inode_io_list_del() for it. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200421085445.5731-3-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-04ext4: translate a few more map flags to strings in tracepointsEric Whitney1-1/+4
As new ext4_map_blocks() flags have been added, not all have gotten flag bit to string translations to make tracepoint output more readable. Fix that, and go one step further by adding a translation for the EXT4_EX_NOCACHE flag as well. The EXT4_EX_FORCE_CACHE flag can never be set in a tracepoint in the current code, so there's no need to bother with a translation for it right now. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200415203140.30349-3-enwlinux@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-04ext4: remove EXT4_GET_BLOCKS_KEEP_SIZE flagEric Whitney1-1/+0
The eofblocks code was removed in the 5.7 release by "ext4: remove EOFBLOCKS_FL and associated code" (4337ecd1fe99). The ext4_map_blocks() flag used to trigger it can now be removed as well. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200415203140.30349-2-enwlinux@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-04include/linux/memblock.h: fix minor typo and unclear commentchenqiwu1-2/+2
Fix a minor typo "usabe->usable" for the current discription of member variable "memory" in struct memblock. BTW, I think it's unclear the member variable "base" in struct memblock_type is currently described as the physical address of memory region, change it to base address of the region is clearer since the variable is decorated as phys_addr_t. Signed-off-by: chenqiwu <chenqiwu@xiaomi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Link: http://lkml.kernel.org/r/1588846952-32166-1-git-send-email-qiwuchen55@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04mm: vmscan: reclaim writepage is IO costJohannes Weiner2-1/+4
The VM tries to balance reclaim pressure between anon and file so as to reduce the amount of IO incurred due to the memory shortage. It already counts refaults and swapins, but in addition it should also count writepage calls during reclaim. For swap, this is obvious: it's IO that wouldn't have occurred if the anonymous memory hadn't been under memory pressure. From a relative balancing point of view this makes sense as well: even if anon is cold and reclaimable, a cache that isn't thrashing may have equally cold pages that don't require IO to reclaim. For file writeback, it's trickier: some of the reclaim writepage IO would have likely occurred anyway due to dirty expiration. But not all of it - premature writeback reduces batching and generates additional writes. Since the flushers are already woken up by the time the VM starts writing cache pages one by one, let's assume that we'e likely causing writes that wouldn't have happened without memory pressure. In addition, the per-page cost of IO would have probably been much cheaper if written in larger batches from the flusher thread rather than the single-page-writes from kswapd. For our purposes - getting the trend right to accelerate convergence on a stable state that doesn't require paging at all - this is sufficiently accurate. If we later wanted to optimize for sustained thrashing, we can still refine the measurements. Count all writepage calls from kswapd as IO cost toward the LRU that the page belongs to. Why do this dynamically? Don't we know in advance that anon pages require IO to reclaim, and so could build in a static bias? First, scanning is not the same as reclaiming. If all the anon pages are referenced, we may not swap for a while just because we're scanning the anon list. During this time, however, it's important that we age anonymous memory and the page cache at the same rate so that their hot-cold gradients are comparable. Everything else being equal, we still want to reclaim the coldest memory overall. Second, we keep copies in swap unless the page changes. If there is swap-backed data that's mostly read (tmpfs file) and has been swapped out before, we can reclaim it without incurring additional IO. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Link: http://lkml.kernel.org/r/20200520232525.798933-14-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04mm: vmscan: determine anon/file pressure balance at the reclaim rootJohannes Weiner1-0/+13
We split the LRU lists into anon and file, and we rebalance the scan pressure between them when one of them begins thrashing: if the file cache experiences workingset refaults, we increase the pressure on anonymous pages; if the workload is stalled on swapins, we increase the pressure on the file cache instead. With cgroups and their nested LRU lists, we currently don't do this correctly. While recursive cgroup reclaim establishes a relative LRU order among the pages of all involved cgroups, LRU pressure balancing is done on an individual cgroup LRU level. As a result, when one cgroup is thrashing on the filesystem cache while a sibling may have cold anonymous pages, pressure doesn't get equalized between them. This patch moves LRU balancing decision to the root of reclaim - the same level where the LRU order is established. It does this by tracking LRU cost recursively, so that every level of the cgroup tree knows the aggregate LRU cost of all memory within its domain. When the page scanner calculates the scan balance for any given individual cgroup's LRU list, it uses the values from the ancestor cgroup that initiated the reclaim cycle. If one sibling is then thrashing on the cache, it will tip the pressure balance inside its ancestors, and the next hierarchical reclaim iteration will go more after the anon pages in the tree. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Link: http://lkml.kernel.org/r/20200520232525.798933-13-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04mm: balance LRU lists based on relative thrashingJohannes Weiner1-2/+1
Since the LRUs were split into anon and file lists, the VM has been balancing between page cache and anonymous pages based on per-list ratios of scanned vs. rotated pages. In most cases that tips page reclaim towards the list that is easier to reclaim and has the fewest actively used pages, but there are a few problems with it: 1. Refaults and LRU rotations are weighted the same way, even though one costs IO and the other costs a bit of CPU. 2. The less we scan an LRU list based on already observed rotations, the more we increase the sampling interval for new references, and rotations become even more likely on that list. This can enter a death spiral in which we stop looking at one list completely until the other one is all but annihilated by page reclaim. Since commit a528910e12ec ("mm: thrash detection-based file cache sizing") we have refault detection for the page cache. Along with swapin events, they are good indicators of when the file or anon list, respectively, is too small for its workingset and needs to grow. For example, if the page cache is thrashing, the cache pages need more time in memory, while there may be colder pages on the anonymous list. Likewise, if swapped pages are faulting back in, it indicates that we reclaim anonymous pages too aggressively and should back off. Replace LRU rotations with refaults and swapins as the basis for relative reclaim cost of the two LRUs. This will have the VM target list balances that incur the least amount of IO on aggregate. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Link: http://lkml.kernel.org/r/20200520232525.798933-12-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04mm: base LRU balancing on an explicit cost modelJohannes Weiner2-14/+9
Currently, scan pressure between the anon and file LRU lists is balanced based on a mixture of reclaim efficiency and a somewhat vague notion of "value" of having certain pages in memory over others. That concept of value is problematic, because it has caused us to count any event that remotely makes one LRU list more or less preferrable for reclaim, even when these events are not directly comparable and impose very different costs on the system. One example is referenced file pages that we still deactivate and referenced anonymous pages that we actually rotate back to the head of the list. There is also conceptual overlap with the LRU algorithm itself. By rotating recently used pages instead of reclaiming them, the algorithm already biases the applied scan pressure based on page value. Thus, when rebalancing scan pressure due to rotations, we should think of reclaim cost, and leave assessing the page value to the LRU algorithm. Lastly, considering both value-increasing as well as value-decreasing events can sometimes cause the same type of event to be counted twice, i.e. how rotating a page increases the LRU value, while reclaiming it succesfully decreases the value. In itself this will balance out fine, but it quietly skews the impact of events that are only recorded once. The abstract metric of "value", the murky relationship with the LRU algorithm, and accounting both negative and positive events make the current pressure balancing model hard to reason about and modify. This patch switches to a balancing model of accounting the concrete, actually observed cost of reclaiming one LRU over another. For now, that cost includes pages that are scanned but rotated back to the list head. Subsequent patches will add consideration for IO caused by refaulting of recently evicted pages. Replace struct zone_reclaim_stat with two cost counters in the lruvec, and make everything that affects cost go through a new lru_note_cost() function. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Link: http://lkml.kernel.org/r/20200520232525.798933-9-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04mm: fold and remove lru_cache_add_anon() and lru_cache_add_file()Johannes Weiner1-2/+0
They're the same function, and for the purpose of all callers they are equivalent to lru_cache_add(). [akpm@linux-foundation.org: fix it for local_lock changes] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Rik van Riel <riel@surriel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Link: http://lkml.kernel.org/r/20200520232525.798933-5-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04mm: keep separate anon and file statistics on page reclaim activityJohannes Weiner1-0/+4
Having statistics on pages scanned and pages reclaimed for both anon and file pages makes it easier to evaluate changes to LRU balancing. While at it, clean up the stat-keeping mess for isolation, putback, reclaim stats etc. a bit: first the physical LRU operation (isolation and putback), followed by vmstats, reclaim_stats, and then vm events. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Link: http://lkml.kernel.org/r/20200520232525.798933-3-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04mm: memcontrol: delete unused lrucare handlingJohannes Weiner1-3/+2
Swapin faults were the last event to charge pages after they had already been put on the LRU list. Now that we charge directly on swapin, the lrucare portion of the charge code is unused. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Shakeel Butt <shakeelb@google.com> Link: http://lkml.kernel.org/r/20200508183105.225460-19-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04mm: memcontrol: prepare swap controller setup for integrationJohannes Weiner1-1/+1
A few cleanups to streamline the swap controller setup: - Replace the do_swap_account flag with cgroup_memory_noswap. This brings it in line with other functionality that is usually available unless explicitly opted out of - nosocket, nokmem. - Remove the really_do_swap_account flag that stores the boot option and is later used to switch the do_swap_account. It's not clear why this indirection is/was necessary. Use do_swap_account directly. - Minor coding style polishing Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Balbir Singh <bsingharora@gmail.com> Link: http://lkml.kernel.org/r/20200508183105.225460-15-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04mm: memcontrol: drop unused try/commit/cancel charge APIJohannes Weiner1-36/+0
There are no more users. RIP in peace. [arnd@arndb.de: fix an unused-function warning] Link: http://lkml.kernel.org/r/20200528095640.151454-1-arnd@arndb.de Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Balbir Singh <bsingharora@gmail.com> Link: http://lkml.kernel.org/r/20200508183105.225460-14-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04mm: memcontrol: convert anon and file-thp to new mem_cgroup_charge() APIJohannes Weiner1-3/+1
With the page->mapping requirement gone from memcg, we can charge anon and file-thp pages in one single step, right after they're allocated. This removes two out of three API calls - especially the tricky commit step that needed to happen at just the right time between when the page is "set up" and when it's "published" - somewhat vague and fluid concepts that varied by page type. All we need is a freshly allocated page and a memcg context to charge. v2: prevent double charges on pre-allocated hugepages in khugepaged [hannes@cmpxchg.org: Fix crash - *hpage could be ERR_PTR instead of NULL] Link: http://lkml.kernel.org/r/20200512215813.GA487759@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Qian Cai <cai@lca.pw> Link: http://lkml.kernel.org/r/20200508183105.225460-13-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04mm: memcontrol: switch to native NR_ANON_THPS counterJohannes Weiner1-2/+1
With rmap memcg locking already in place for NR_ANON_MAPPED, it's just a small step to remove the MEMCG_RSS_HUGE wart and switch memcg to the native NR_ANON_THPS accounting sites. [hannes@cmpxchg.org: fixes] Link: http://lkml.kernel.org/r/20200512121750.GA397968@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> [build-tested] Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Balbir Singh <bsingharora@gmail.com> Link: http://lkml.kernel.org/r/20200508183105.225460-12-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04mm: memcontrol: switch to native NR_ANON_MAPPED counterJohannes Weiner1-2/+1
Memcg maintains a private MEMCG_RSS counter. This divergence from the generic VM accounting means unnecessary code overhead, and creates a dependency for memcg that page->mapping is set up at the time of charging, so that page types can be told apart. Convert the generic accounting sites to mod_lruvec_page_state and friends to maintain the per-cgroup vmstat counter of NR_ANON_MAPPED. We use lock_page_memcg() to stabilize page->mem_cgroup during rmap changes, the same way we do for NR_FILE_MAPPED. With the previous patch removing MEMCG_CACHE and the private NR_SHMEM counter, this patch finally eliminates the need to have page->mapping set up at charge time. However, we need to have page->mem_cgroup set up by the time rmap runs and does the accounting, so switch the commit and the rmap callbacks around. v2: fix temporary accounting bug by switching rmap<->commit (Joonsoo) Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Balbir Singh <bsingharora@gmail.com> Link: http://lkml.kernel.org/r/20200508183105.225460-11-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04mm: memcontrol: switch to native NR_FILE_PAGES and NR_SHMEM countersJohannes Weiner1-2/+1
Memcg maintains private MEMCG_CACHE and NR_SHMEM counters. This divergence from the generic VM accounting means unnecessary code overhead, and creates a dependency for memcg that page->mapping is set up at the time of charging, so that page types can be told apart. Convert the generic accounting sites to mod_lruvec_page_state and friends to maintain the per-cgroup vmstat counters of NR_FILE_PAGES and NR_SHMEM. The page is already locked in these places, so page->mem_cgroup is stable; we only need minimal tweaks of two mem_cgroup_migrate() calls to ensure it's set up in time. Then replace MEMCG_CACHE with NR_FILE_PAGES and delete the private NR_SHMEM accounting sites. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Balbir Singh <bsingharora@gmail.com> Link: http://lkml.kernel.org/r/20200508183105.225460-10-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04mm: memcontrol: prepare cgroup vmstat infrastructure for native anon countersJohannes Weiner1-2/+3
Anonymous compound pages can be mapped by ptes, which means that if we want to track NR_MAPPED_ANON, NR_ANON_THPS on a per-cgroup basis, we have to be prepared to see tail pages in our accounting functions. Make mod_lruvec_page_state() and lock_page_memcg() deal with tail pages correctly, namely by redirecting to the head page which has the page->mem_cgroup set up. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Balbir Singh <bsingharora@gmail.com> Link: http://lkml.kernel.org/r/20200508183105.225460-9-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04mm: memcontrol: convert page cache to a new mem_cgroup_charge() APIJohannes Weiner1-0/+10
The try/commit/cancel protocol that memcg uses dates back to when pages used to be uncharged upon removal from the page cache, and thus couldn't be committed before the insertion had succeeded. Nowadays, pages are uncharged when they are physically freed; it doesn't matter whether the insertion was successful or not. For the page cache, the transaction dance has become unnecessary. Introduce a mem_cgroup_charge() function that simply charges a newly allocated page to a cgroup and sets up page->mem_cgroup in one single step. If the insertion fails, the caller doesn't have to do anything but free/put the page. Then switch the page cache over to this new API. Subsequent patches will also convert anon pages, but it needs a bit more prep work. Right now, memcg depends on page->mapping being already set up at the time of charging, so that it can maintain its own MEMCG_CACHE and MEMCG_RSS counters. For anon, page->mapping is set under the same pte lock under which the page is publishd, so a single charge point that can block doesn't work there just yet. The following prep patches will replace the private memcg counters with the generic vmstat counters, thus removing the page->mapping dependency, then complete the transition to the new single-point charge API and delete the old transactional scheme. v2: leave shmem swapcache when charging fails to avoid double IO (Joonsoo) v3: rebase on preceeding shmem simplification patch Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Balbir Singh <bsingharora@gmail.com> Link: http://lkml.kernel.org/r/20200508183105.225460-6-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04mm: memcontrol: move out cgroup swaprate throttlingJohannes Weiner1-4/+2
The cgroup swaprate throttling is about matching new anon allocations to the rate of available IO when that is being throttled. It's the io controller hooking into the VM, rather than a memory controller thing. Rename mem_cgroup_throttle_swaprate() to cgroup_throttle_swaprate(), and drop the @memcg argument which is only used to check whether the preceding page charge has succeeded and the fault is proceeding. We could decouple the call from mem_cgroup_try_charge() here as well, but that would cause unnecessary churn: the following patches convert all callsites to a new charge API and we'll decouple as we go along. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com> Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Balbir Singh <bsingharora@gmail.com> Link: http://lkml.kernel.org/r/20200508183105.225460-5-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04mm: memcontrol: drop @compound parameter from memcg charging APIJohannes Weiner1-14/+8
The memcg charging API carries a boolean @compound parameter that tells whether the page we're dealing with is a hugepage. mem_cgroup_commit_charge() has another boolean @lrucare that indicates whether the page needs LRU locking or not while charging. The majority of callsites know those parameters at compile time, which results in a lot of naked "false, false" argument lists. This makes for cryptic code and is a breeding ground for subtle mistakes. Thankfully, the huge page state can be inferred from the page itself and doesn't need to be passed along. This is safe because charging completes before the page is published and somebody may split it. Simplify the callsites by removing @compound, and let memcg infer the state by using hpage_nr_pages() unconditionally. That function does PageTransHuge() to identify huge pages, which also helpfully asserts that nobody passes in tail pages by accident. The following patches will introduce a new charging API, best not to carry over unnecessary weight. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com> Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Balbir Singh <bsingharora@gmail.com> Link: http://lkml.kernel.org/r/20200508183105.225460-4-hannes@cmpxchg.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04mm/vmscan: count layzfree pages and fix nr_isolated_* mismatchJaewon Kim1-0/+1
Fix an nr_isolate_* mismatch problem between cma and dirty lazyfree pages. If try_to_unmap_one is used for reclaim and it detects a dirty lazyfree page, then the lazyfree page is changed to a normal anon page having SwapBacked by commit 802a3a92ad7a ("mm: reclaim MADV_FREE pages"). Even with the change, reclaim context correctly counts isolated files because it uses is_file_lru to distinguish file. And the change to anon is not happened if try_to_unmap_one is used for migration. So migration context like compaction also correctly counts isolated files even though it uses page_is_file_lru insted of is_file_lru. Recently page_is_file_cache was renamed to page_is_file_lru by commit 9de4f22a60f7 ("mm: code cleanup for MADV_FREE"). But the nr_isolate_* mismatch problem happens on cma alloc. There is reclaim_clean_pages_from_list which is being used only by cma. It was introduced by commit 02c6de8d757c ("mm: cma: discard clean pages during contiguous allocation instead of migration") to reclaim clean file pages without migration. The cma alloc uses both reclaim_clean_pages_from_list and migrate_pages, and it uses page_is_file_lru to count isolated files. If there are dirty lazyfree pages allocated from cma memory region, the pages are counted as isolated file at the beginging but are counted as isolated anon after finished. Mem-Info: Node 0 active_anon:3045904kB inactive_anon:611448kB active_file:14892kB inactive_file:205636kB unevictable:10416kB isolated(anon):0kB isolated(file):37664kB mapped:630216kB dirty:384kB writeback:0kB shmem:42576kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no Like log above, there were too much isolated files, 37664kB, which triggers too_many_isolated in reclaim even when there is no actually isolated file in system wide. It could be reproducible by running two programs, writing on MADV_FREE page and doing cma alloc, respectively. Although isolated anon is 0, I found that the internal value of isolated anon was the negative value of isolated file. Fix this by compensating the isolated count for both LRU lists. Count non-discarded lazyfree pages in shrink_page_list, then compensate the counted number in reclaim_clean_pages_from_list. Reported-by: Yong-Taek Lee <ytk.lee@samsung.com> Suggested-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Shaohua Li <shli@fb.com> Link: http://lkml.kernel.org/r/20200426011718.30246-1-jaewon31.kim@samsung.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04mm: simplify calling a compound page destructorMatthew Wilcox (Oracle)1-2/+2
None of the three callers of get_compound_page_dtor() want to know the value; they just want to call the function. Replace it with destroy_compound_page() which calls the dtor for them. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Link: http://lkml.kernel.org/r/20200517105051.9352-1-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04mm/hugetlb: define a generic fallback for arch_clear_hugepage_flags()Anshuman Khandual1-0/+5
There are multiple similar definitions for arch_clear_hugepage_flags() on various platforms. Lets just add it's generic fallback definition for platforms that do not override. This help reduce code duplication. Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Helge Deller <deller@gmx.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Rich Felker <dalias@libc.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: http://lkml.kernel.org/r/1588907271-11920-4-git-send-email-anshuman.khandual@arm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04mm/hugetlb: define a generic fallback for is_hugepage_only_range()Anshuman Khandual1-0/+9
There are multiple similar definitions for is_hugepage_only_range() on various platforms. Lets just add it's generic fallback definition for platforms that do not override. This help reduce code duplication. Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Helge Deller <deller@gmx.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Rich Felker <dalias@libc.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: http://lkml.kernel.org/r/1588907271-11920-3-git-send-email-anshuman.khandual@arm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04arm64/mm: drop __HAVE_ARCH_HUGE_PTEP_GETAnshuman Khandual1-1/+1
Patch series "mm/hugetlb: Add some new generic fallbacks", v3. This series adds the following new generic fallbacks. Before that it drops __HAVE_ARCH_HUGE_PTEP_GET from arm64 platform. 1. is_hugepage_only_range() 2. arch_clear_hugepage_flags() After this arm (32 bit) remains the sole platform defining it's own huge_ptep_get() via __HAVE_ARCH_HUGE_PTEP_GET. This patch (of 3): Platform specific huge_ptep_get() is required only when fetching the huge PTE involves more than just dereferencing the page table pointer. This is not the case on arm64 platform. Hence huge_ptep_pte() can be dropped along with it's __HAVE_ARCH_HUGE_PTEP_GET subscription. Before that, it updates the generic huge_ptep_get() with READ_ONCE() which will prevent known page table issues with THP on arm64. Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Link: http://lkml.kernel.org/r/1588907271-11920-1-git-send-email-anshuman.khandual@arm.com Link: http://lkml.kernel.org/r//1506527369-19535-1-git-send-email-will.deacon@arm.com/ Link: http://lkml.kernel.org/r/1588907271-11920-2-git-send-email-anshuman.khandual@arm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04hugetlbfs: move hugepagesz= parsing to arch independent codeMike Kravetz1-1/+0
Now that architectures provide arch_hugetlb_valid_size(), parsing of "hugepagesz=" can be done in architecture independent code. Create a single routine to handle hugepagesz= parsing and remove all arch specific routines. We can also remove the interface hugetlb_bad_size() as this is no longer used outside arch independent code. This also provides consistent behavior of hugetlbfs command line options. The hugepagesz= option should only be specified once for a specific size, but some architectures allow multiple instances. This appears to be more of an oversight when code was added by some architectures to set up ALL huge pages sizes. Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Sandipan Das <sandipan@linux.ibm.com> Reviewed-by: Peter Xu <peterx@redhat.com> Acked-by: Mina Almasry <almasrymina@google.com> Acked-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> [s390] Acked-by: Will Deacon <will@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Longpeng <longpeng2@huawei.com> Cc: Nitesh Narayan Lal <nitesh@redhat.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Anders Roxell <anders.roxell@linaro.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Qian Cai <cai@lca.pw> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Link: http://lkml.kernel.org/r/20200417185049.275845-3-mike.kravetz@oracle.com Link: http://lkml.kernel.org/r/20200428205614.246260-3-mike.kravetz@oracle.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04hugetlbfs: add arch_hugetlb_valid_sizeMike Kravetz1-0/+1
Patch series "Clean up hugetlb boot command line processing", v4. Longpeng(Mike) reported a weird message from hugetlb command line processing and proposed a solution [1]. While the proposed patch does address the specific issue, there are other related issues in command line processing. As hugetlbfs evolved, updates to command line processing have been made to meet immediate needs and not necessarily in a coordinated manner. The result is that some processing is done in arch specific code, some is done in arch independent code and coordination is problematic. Semantics can vary between architectures. The patch series does the following: - Define arch specific arch_hugetlb_valid_size routine used to validate passed huge page sizes. - Move hugepagesz= command line parsing out of arch specific code and into an arch independent routine. - Clean up command line processing to follow desired semantics and document those semantics. [1] https://lore.kernel.org/linux-mm/20200305033014.1152-1-longpeng2@huawei.com This patch (of 3): The architecture independent routine hugetlb_default_setup sets up the default huge pages size. It has no way to verify if the passed value is valid, so it accepts it and attempts to validate at a later time. This requires undocumented cooperation between the arch specific and arch independent code. For architectures that support more than one huge page size, provide a routine arch_hugetlb_valid_size to validate a huge page size. hugetlb_default_setup can use this to validate passed values. arch_hugetlb_valid_size will also be used in a subsequent patch to move processing of the "hugepagesz=" in arch specific code to a common routine in arch independent code. Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> [s390] Acked-by: Will Deacon <will@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Longpeng <longpeng2@huawei.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Mina Almasry <almasrymina@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Nitesh Narayan Lal <nitesh@redhat.com> Cc: Anders Roxell <anders.roxell@linaro.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Qian Cai <cai@lca.pw> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Link: http://lkml.kernel.org/r/20200428205614.246260-1-mike.kravetz@oracle.com Link: http://lkml.kernel.org/r/20200428205614.246260-2-mike.kravetz@oracle.com Link: http://lkml.kernel.org/r/20200417185049.275845-1-mike.kravetz@oracle.com Link: http://lkml.kernel.org/r/20200417185049.275845-2-mike.kravetz@oracle.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>