From ba84ed9546e91348fdf3ff2bff859b0ee53b407a Mon Sep 17 00:00:00 2001 From: "Rafael J. Wysocki" Date: Sun, 26 Oct 2008 20:56:30 +0100 Subject: ACPI hibernate: Introduce new kernel parameter acpi_sleep=s4_nonvs On some machines it may be necessary to disable the saving/restoring of the ACPI NVS memory region during hibernation/resume. For this purpose, introduce new ACPI kernel command line option acpi_sleep=s4_nonvs. Based on a patch by Zhang Rui. Signed-off-by: Rafael J. Wysocki Acked-by: Nigel Cunningham Acked-by: Pavel Machek Signed-off-by: Len Brown --- Documentation/kernel-parameters.txt | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index e0f346d201ed..1d089eeff3cf 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -149,7 +149,8 @@ and is between 256 and 4096 characters. It is defined in the file default: 0 acpi_sleep= [HW,ACPI] Sleep options - Format: { s3_bios, s3_mode, s3_beep, s4_nohwsig, old_ordering } + Format: { s3_bios, s3_mode, s3_beep, s4_nohwsig, + old_ordering, s4_nonvs } See Documentation/power/video.txt for s3_bios and s3_mode. s3_beep is for debugging; it makes the PC's speaker beep as soon as the kernel's real-mode entry point is called. @@ -159,6 +160,8 @@ and is between 256 and 4096 characters. It is defined in the file control method, wrt putting devices into low power states, to be enforced (the ACPI 2.0 ordering of _PTS is used by default). + s4_nonvs prevents the kernel from saving/restoring the + ACPI NVS memory during hibernation. acpi_sci= [HW,ACPI] ACPI System Control Interrupt trigger mode Format: { level | edge | high | low } -- cgit v1.2.3 From ada9cfdd158abb8169873dc8e5ae39b1ec6ffa8c Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Fri, 19 Dec 2008 10:57:32 -0800 Subject: doc: fix kernel-parameters.txt formatting Spell out "wrt". I suspect plenty of people won't know what that means. Fix a '}' that should be a ']'. Reformat long lines into shorter lines. Signed-off-by: Randy Dunlap Signed-off-by: Len Brown --- Documentation/kernel-parameters.txt | 40 ++++++++++++++++++++----------------- 1 file changed, 22 insertions(+), 18 deletions(-) (limited to 'Documentation') diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 1d089eeff3cf..350e71960a96 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -151,15 +151,16 @@ and is between 256 and 4096 characters. It is defined in the file acpi_sleep= [HW,ACPI] Sleep options Format: { s3_bios, s3_mode, s3_beep, s4_nohwsig, old_ordering, s4_nonvs } - See Documentation/power/video.txt for s3_bios and s3_mode. + See Documentation/power/video.txt for information on + s3_bios and s3_mode. s3_beep is for debugging; it makes the PC's speaker beep as soon as the kernel's real-mode entry point is called. s4_nohwsig prevents ACPI hardware signature from being used during resume from hibernation. old_ordering causes the ACPI 1.0 ordering of the _PTS - control method, wrt putting devices into low power - states, to be enforced (the ACPI 2.0 ordering of _PTS is - used by default). + control method, with respect to putting devices into + low power states, to be enforced (the ACPI 2.0 ordering + of _PTS is used by default). s4_nonvs prevents the kernel from saving/restoring the ACPI NVS memory during hibernation. @@ -196,7 +197,7 @@ and is between 256 and 4096 characters. 
It is defined in the file acpi_skip_timer_override [HW,ACPI] Recognize and ignore IRQ0/pin2 Interrupt Override. For broken nForce2 BIOS resulting in XT-PIC timer. - acpi_use_timer_override [HW,ACPI} + acpi_use_timer_override [HW,ACPI] Use timer override. For some broken Nvidia NF5 boards that require a timer override, but don't have HPET @@ -860,17 +861,19 @@ and is between 256 and 4096 characters. It is defined in the file See Documentation/ide/ide.txt. idle= [X86] - Format: idle=poll or idle=mwait, idle=halt, idle=nomwait - Poll forces a polling idle loop that can slightly improves the performance - of waking up a idle CPU, but will use a lot of power and make the system - run hot. Not recommended. - idle=mwait. On systems which support MONITOR/MWAIT but the kernel chose - to not use it because it doesn't save as much power as a normal idle - loop use the MONITOR/MWAIT idle loop anyways. Performance should be the same - as idle=poll. - idle=halt. Halt is forced to be used for CPU idle. + Format: idle=poll, idle=mwait, idle=halt, idle=nomwait + Poll forces a polling idle loop that can slightly + improve the performance of waking up a idle CPU, but + will use a lot of power and make the system run hot. + Not recommended. + idle=mwait: On systems which support MONITOR/MWAIT but + the kernel chose to not use it because it doesn't save + as much power as a normal idle loop, use the + MONITOR/MWAIT idle loop anyways. Performance should be + the same as idle=poll. + idle=halt: Halt is forced to be used for CPU idle. In such case C2/C3 won't be used again. - idle=nomwait. Disable mwait for CPU C-states + idle=nomwait: Disable mwait for CPU C-states ide-pci-generic.all-generic-ide [HW] (E)IDE subsystem Claim all unknown PCI IDE storage controllers. @@ -1052,8 +1055,8 @@ and is between 256 and 4096 characters. It is defined in the file lapic [X86-32,APIC] Enable the local APIC even if BIOS disabled it. - lapic_timer_c2_ok [X86-32,x86-64,APIC] trust the local apic timer in - C2 power state. + lapic_timer_c2_ok [X86-32,x86-64,APIC] trust the local apic timer + in C2 power state. libata.dma= [LIBATA] DMA control libata.dma=0 Disable all PATA and SATA DMA @@ -2241,7 +2244,8 @@ and is between 256 and 4096 characters. It is defined in the file thermal.psv= [HW,ACPI] -1: disable all passive trip points - : override all passive trip points to this value + : override all passive trip points to this + value thermal.tzp= [HW,ACPI] Specify global default ACPI thermal zone polling rate -- cgit v1.2.3 From 2f6de3a199893ae3dd68e23bd79b55e1478c8268 Mon Sep 17 00:00:00 2001 From: Baodong Chen Date: Sat, 3 Jan 2009 12:37:06 +0800 Subject: Documentation/x86/boot.txt: payload length was changed to payload_length Signed-off-by: Baodong Chen <[email]chenbdchenbd@gmail.com[email]> Acked-by: Jiri Kosina Signed-off-by: Ingo Molnar --- Documentation/x86/boot.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/x86/boot.txt b/Documentation/x86/boot.txt index fcdc62b3c3d8..7b4596ac4120 100644 --- a/Documentation/x86/boot.txt +++ b/Documentation/x86/boot.txt @@ -44,7 +44,7 @@ Protocol 2.07: (Kernel 2.6.24) Added paravirtualised boot protocol. and KEEP_SEGMENTS flag in load_flags. Protocol 2.08: (Kernel 2.6.26) Added crc32 checksum and ELF format - payload. Introduced payload_offset and payload length + payload. Introduced payload_offset and payload_length fields to aid in locating the payload. 
Protocol 2.09: (Kernel 2.6.26) Added a field of 64-bit physical -- cgit v1.2.3 From 9eb425c046f4129f1dafce7c04e949652e69fb01 Mon Sep 17 00:00:00 2001 From: Phillip Lougher Date: Mon, 5 Jan 2009 08:46:29 +0000 Subject: Squashfs: documentation Signed-off-by: Phillip Lougher --- Documentation/filesystems/squashfs.txt | 225 +++++++++++++++++++++++++++++++++ 1 file changed, 225 insertions(+) create mode 100644 Documentation/filesystems/squashfs.txt (limited to 'Documentation') diff --git a/Documentation/filesystems/squashfs.txt b/Documentation/filesystems/squashfs.txt new file mode 100644 index 000000000000..3e79e4a7a392 --- /dev/null +++ b/Documentation/filesystems/squashfs.txt @@ -0,0 +1,225 @@ +SQUASHFS 4.0 FILESYSTEM +======================= + +Squashfs is a compressed read-only filesystem for Linux. +It uses zlib compression to compress files, inodes and directories. +Inodes in the system are very small and all blocks are packed to minimise +data overhead. Block sizes greater than 4K are supported up to a maximum +of 1Mbytes (default block size 128K). + +Squashfs is intended for general read-only filesystem use, for archival +use (i.e. in cases where a .tar.gz file may be used), and in constrained +block device/memory systems (e.g. embedded systems) where low overhead is +needed. + +Mailing list: squashfs-devel@lists.sourceforge.net +Web site: www.squashfs.org + +1. FILESYSTEM FEATURES +---------------------- + +Squashfs filesystem features versus Cramfs: + + Squashfs Cramfs + +Max filesystem size: 2^64 16 MiB +Max file size: ~ 2 TiB 16 MiB +Max files: unlimited unlimited +Max directories: unlimited unlimited +Max entries per directory: unlimited unlimited +Max block size: 1 MiB 4 KiB +Metadata compression: yes no +Directory indexes: yes no +Sparse file support: yes no +Tail-end packing (fragments): yes no +Exportable (NFS etc.): yes no +Hard link support: yes no +"." and ".." in readdir: yes no +Real inode numbers: yes no +32-bit uids/gids: yes no +File creation time: yes no +Xattr and ACL support: no no + +Squashfs compresses data, inodes and directories. In addition, inode and +directory data are highly compacted, and packed on byte boundaries. Each +compressed inode is on average 8 bytes in length (the exact length varies on +file type, i.e. regular file, directory, symbolic link, and block/char device +inodes have different sizes). + +2. USING SQUASHFS +----------------- + +As squashfs is a read-only filesystem, the mksquashfs program must be used to +create populated squashfs filesystems. This and other squashfs utilities +can be obtained from http://www.squashfs.org. Usage instructions can be +obtained from this site also. + + +3. SQUASHFS FILESYSTEM DESIGN +----------------------------- + +A squashfs filesystem consists of seven parts, packed together on a byte +alignment: + + --------------- + | superblock | + |---------------| + | datablocks | + | & fragments | + |---------------| + | inode table | + |---------------| + | directory | + | table | + |---------------| + | fragment | + | table | + |---------------| + | export | + | table | + |---------------| + | uid/gid | + | lookup table | + --------------- + +Compressed data blocks are written to the filesystem as files are read from +the source directory, and checked for duplicates. Once all file data has been +written the completed inode, directory, fragment, export and uid/gid lookup +tables are written. + +3.1 Inodes +---------- + +Metadata (inodes and directories) are compressed in 8Kbyte blocks. 
Each compressed block is prefixed by a two byte length; the top bit is set
if the block is uncompressed. A block will be uncompressed if the -noI
option is set, or if the compressed block was larger than the uncompressed
block.

Inodes are packed into the metadata blocks, and are not aligned to block
boundaries; therefore inodes overlap compressed blocks. Inodes are
identified by a 48-bit number which encodes the location of the compressed
metadata block containing the inode, and the byte offset into that block
where the inode is placed (<block, offset>).

To maximise compression there are different inodes for each file type
(regular file, directory, device, etc.), the inode contents and length
varying with the type.

To further maximise compression, two types of regular file inode and
directory inode are defined: inodes optimised for frequently occurring
regular files and directories, and extended types where extra information
has to be stored.

3.2 Directories
---------------

Like inodes, directories are packed into compressed metadata blocks, stored
in a directory table. Directories are accessed using the start address of
the metablock containing the directory and the offset into the decompressed
block (<block, offset>).

Directories are organised in a slightly complex way, and are not simply a
list of file names. The organisation takes advantage of the fact that (in
most cases) the inodes of the files will be in the same compressed metadata
block, and therefore can share the start block. Directories are therefore
organised as a two-level list: a directory header containing the shared
start block value, and a sequence of directory entries, each of which
shares that start block. A new directory header is written once/if the
inode start block changes. The directory header/directory entry list is
repeated as many times as necessary.

Directories are sorted, and can contain a directory index to speed up file
lookup. Directory indexes store one entry per metablock, each entry storing
the index/filename mapping to the first directory header in each metadata
block. Directories are sorted in alphabetical order, and at lookup the
index is scanned linearly looking for the first filename alphabetically
larger than the filename being looked up. At this point the location of
the metadata block the filename is in has been found. The general idea of
the index is to ensure that only one metadata block needs to be
decompressed to do a lookup, irrespective of the length of the directory.
This scheme has the advantage that it doesn't require extra memory overhead
and doesn't require much extra storage on disk.

3.3 File data
-------------

Regular files consist of a sequence of contiguous compressed blocks, and/or
a compressed fragment block (tail-end packed block). The compressed size of
each datablock is stored in a block list contained within the file inode.

To speed up access to datablocks when reading 'large' files (256 Mbytes or
larger), the code implements an index cache that caches the mapping from
block index to datablock location on disk.

The index cache allows Squashfs to handle large files (up to 1.75 TiB)
while retaining a simple and space-efficient block list on disk. The cache
is split into slots, caching up to eight 224 GiB files (128 KiB blocks).
Larger files use multiple slots, with 1.75 TiB files using all 8 slots.
The index cache is designed to be memory efficient, and by default uses
16 KiB.
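A quick consistency check of the figures above, as an editor's illustration
in plain C (not part of the original patch): one slot maps 224 GiB of a file
at the default 128 KiB block size, and eight slots together cover
8 x 224 GiB = 1792 GiB = 1.75 TiB.

#include <stdio.h>

int main(void)
{
	unsigned long long block  = 128ULL << 10; /* 128 KiB datablock */
	unsigned long long slot   = 224ULL << 30; /* file span of one cache slot */
	unsigned long long nslots = 8;

	/* block-list entries indexed by one slot: 224 GiB / 128 KiB = 1835008 */
	printf("blocks per slot: %llu\n", slot / block);

	/* largest file when a single file consumes every slot */
	printf("max file size: %llu GiB (%.2f TiB)\n",
	       (slot * nslots) >> 30,
	       (double)(slot * nslots) / (1ULL << 40));
	return 0;
}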
+ +3.4 Fragment lookup table +------------------------- + +Regular files can contain a fragment index which is mapped to a fragment +location on disk and compressed size using a fragment lookup table. This +fragment lookup table is itself stored compressed into metadata blocks. +A second index table is used to locate these. This second index table for +speed of access (and because it is small) is read at mount time and cached +in memory. + +3.5 Uid/gid lookup table +------------------------ + +For space efficiency regular files store uid and gid indexes, which are +converted to 32-bit uids/gids using an id look up table. This table is +stored compressed into metadata blocks. A second index table is used to +locate these. This second index table for speed of access (and because it +is small) is read at mount time and cached in memory. + +3.6 Export table +---------------- + +To enable Squashfs filesystems to be exportable (via NFS etc.) filesystems +can optionally (disabled with the -no-exports Mksquashfs option) contain +an inode number to inode disk location lookup table. This is required to +enable Squashfs to map inode numbers passed in filehandles to the inode +location on disk, which is necessary when the export code reinstantiates +expired/flushed inodes. + +This table is stored compressed into metadata blocks. A second index table is +used to locate these. This second index table for speed of access (and because +it is small) is read at mount time and cached in memory. + + +4. TODOS AND OUTSTANDING ISSUES +------------------------------- + +4.1 Todo list +------------- + +Implement Xattr and ACL support. The Squashfs 4.0 filesystem layout has hooks +for these but the code has not been written. Once the code has been written +the existing layout should not require modification. + +4.2 Squashfs internal cache +--------------------------- + +Blocks in Squashfs are compressed. To avoid repeatedly decompressing +recently accessed data Squashfs uses two small metadata and fragment caches. + +The cache is not used for file datablocks, these are decompressed and cached in +the page-cache in the normal way. The cache is used to temporarily cache +fragment and metadata blocks which have been read as a result of a metadata +(i.e. inode or directory) or fragment access. Because metadata and fragments +are packed together into blocks (to gain greater compression) the read of a +particular piece of metadata or fragment will retrieve other metadata/fragments +which have been packed with it, these because of locality-of-reference may be +read in the near future. Temporarily caching them ensures they are available +for near future access without requiring an additional read and decompress. + +In the future this internal cache may be replaced with an implementation which +uses the kernel page cache. Because the page cache operates on page sized +units this may introduce additional complexity in terms of locking and +associated race conditions. -- cgit v1.2.3 From a808ad3b0d28411e2838117c5b2ae680ae42483c Mon Sep 17 00:00:00 2001 From: Sean MacLennan Date: Wed, 10 Dec 2008 13:16:34 +0000 Subject: [MTD] [NAND] ndfc driver The current ndfc driver only compiles under arch/ppc. This arch was removed from the kernel. I notice the event entry for the ndfc in Kconfig has been removed in 2.6.28. This patch converts the ndfc to a proper OF (OpenFirmware) driver. I can give a working example of the DTS if needed. The patch has been in production use on the PIKA Warp Appliance and is in use by others. 
The Warp basically boots from NAND, so the ndfc driver is very important to us. Signed-off-by: Sean MacLennan Acked-By: Josh Boyer Signed-off-by: David Woodhouse --- Documentation/powerpc/dts-bindings/4xx/ndfc.txt | 39 ++++ drivers/mtd/nand/Kconfig | 7 + drivers/mtd/nand/ndfc.c | 269 ++++++++++++------------ 3 files changed, 179 insertions(+), 136 deletions(-) create mode 100644 Documentation/powerpc/dts-bindings/4xx/ndfc.txt (limited to 'Documentation') diff --git a/Documentation/powerpc/dts-bindings/4xx/ndfc.txt b/Documentation/powerpc/dts-bindings/4xx/ndfc.txt new file mode 100644 index 000000000000..869f0b5f16e8 --- /dev/null +++ b/Documentation/powerpc/dts-bindings/4xx/ndfc.txt @@ -0,0 +1,39 @@ +AMCC NDFC (NanD Flash Controller) + +Required properties: +- compatible : "ibm,ndfc". +- reg : should specify chip select and size used for the chip (0x2000). + +Optional properties: +- ccr : NDFC config and control register value (default 0). +- bank-settings : NDFC bank configuration register value (default 0). + +Notes: +- partition(s) - follows the OF MTD standard for partitions + +Example: + +ndfc@1,0 { + compatible = "ibm,ndfc"; + reg = <0x00000001 0x00000000 0x00002000>; + ccr = <0x00001000>; + bank-settings = <0x80002222>; + #address-cells = <1>; + #size-cells = <1>; + + nand { + #address-cells = <1>; + #size-cells = <1>; + + partition@0 { + label = "kernel"; + reg = <0x00000000 0x00200000>; + }; + partition@200000 { + label = "root"; + reg = <0x00200000 0x03E00000>; + }; + }; +}; + + diff --git a/drivers/mtd/nand/Kconfig b/drivers/mtd/nand/Kconfig index f8ae0400c49c..8b12e6e109d3 100644 --- a/drivers/mtd/nand/Kconfig +++ b/drivers/mtd/nand/Kconfig @@ -163,6 +163,13 @@ config MTD_NAND_S3C2410_HWECC incorrect ECC generation, and if using these, the default of software ECC is preferable. +config MTD_NAND_NDFC + tristate "NDFC NanD Flash Controller" + depends on 4xx + select MTD_NAND_ECC_SMC + help + NDFC Nand Flash Controllers are integrated in IBM/AMCC's 4xx SoCs + config MTD_NAND_S3C2410_CLKSTOP bool "S3C2410 NAND IDLE clock stop" depends on MTD_NAND_S3C2410 diff --git a/drivers/mtd/nand/ndfc.c b/drivers/mtd/nand/ndfc.c index 955959eb02d4..582cf80f555a 100644 --- a/drivers/mtd/nand/ndfc.c +++ b/drivers/mtd/nand/ndfc.c @@ -2,12 +2,20 @@ * drivers/mtd/ndfc.c * * Overview: - * Platform independend driver for NDFC (NanD Flash Controller) + * Platform independent driver for NDFC (NanD Flash Controller) * integrated into EP440 cores * + * Ported to an OF platform driver by Sean MacLennan + * + * The NDFC supports multiple chips, but this driver only supports a + * single chip since I do not have access to any boards with + * multiple chips. 
+ * * Author: Thomas Gleixner * * Copyright 2006 IBM + * Copyright 2008 PIKA Technologies + * Sean MacLennan * * This program is free software; you can redistribute it and/or modify it * under the terms of the GNU General Public License as published by the @@ -21,27 +29,20 @@ #include #include #include -#include - +#include #include -#ifdef CONFIG_40x -#include -#else -#include -#endif - -struct ndfc_nand_mtd { - struct mtd_info mtd; - struct nand_chip chip; - struct platform_nand_chip *pl_chip; -}; -static struct ndfc_nand_mtd ndfc_mtd[NDFC_MAX_BANKS]; struct ndfc_controller { - void __iomem *ndfcbase; - struct nand_hw_control ndfc_control; - atomic_t childs_active; + struct of_device *ofdev; + void __iomem *ndfcbase; + struct mtd_info mtd; + struct nand_chip chip; + int chip_select; + struct nand_hw_control ndfc_control; +#ifdef CONFIG_MTD_PARTITIONS + struct mtd_partition *parts; +#endif }; static struct ndfc_controller ndfc_ctrl; @@ -50,17 +51,14 @@ static void ndfc_select_chip(struct mtd_info *mtd, int chip) { uint32_t ccr; struct ndfc_controller *ndfc = &ndfc_ctrl; - struct nand_chip *nandchip = mtd->priv; - struct ndfc_nand_mtd *nandmtd = nandchip->priv; - struct platform_nand_chip *pchip = nandmtd->pl_chip; - ccr = __raw_readl(ndfc->ndfcbase + NDFC_CCR); + ccr = in_be32(ndfc->ndfcbase + NDFC_CCR); if (chip >= 0) { ccr &= ~NDFC_CCR_BS_MASK; - ccr |= NDFC_CCR_BS(chip + pchip->chip_offset); + ccr |= NDFC_CCR_BS(chip + ndfc->chip_select); } else ccr |= NDFC_CCR_RESET_CE; - __raw_writel(ccr, ndfc->ndfcbase + NDFC_CCR); + out_be32(ndfc->ndfcbase + NDFC_CCR, ccr); } static void ndfc_hwcontrol(struct mtd_info *mtd, int cmd, unsigned int ctrl) @@ -80,7 +78,7 @@ static int ndfc_ready(struct mtd_info *mtd) { struct ndfc_controller *ndfc = &ndfc_ctrl; - return __raw_readl(ndfc->ndfcbase + NDFC_STAT) & NDFC_STAT_IS_READY; + return in_be32(ndfc->ndfcbase + NDFC_STAT) & NDFC_STAT_IS_READY; } static void ndfc_enable_hwecc(struct mtd_info *mtd, int mode) @@ -88,9 +86,9 @@ static void ndfc_enable_hwecc(struct mtd_info *mtd, int mode) uint32_t ccr; struct ndfc_controller *ndfc = &ndfc_ctrl; - ccr = __raw_readl(ndfc->ndfcbase + NDFC_CCR); + ccr = in_be32(ndfc->ndfcbase + NDFC_CCR); ccr |= NDFC_CCR_RESET_ECC; - __raw_writel(ccr, ndfc->ndfcbase + NDFC_CCR); + out_be32(ndfc->ndfcbase + NDFC_CCR, ccr); wmb(); } @@ -102,9 +100,10 @@ static int ndfc_calculate_ecc(struct mtd_info *mtd, uint8_t *p = (uint8_t *)&ecc; wmb(); - ecc = __raw_readl(ndfc->ndfcbase + NDFC_ECC); - ecc_code[0] = p[1]; - ecc_code[1] = p[2]; + ecc = in_be32(ndfc->ndfcbase + NDFC_ECC); + /* The NDFC uses Smart Media (SMC) bytes order */ + ecc_code[0] = p[2]; + ecc_code[1] = p[1]; ecc_code[2] = p[3]; return 0; @@ -123,7 +122,7 @@ static void ndfc_read_buf(struct mtd_info *mtd, uint8_t *buf, int len) uint32_t *p = (uint32_t *) buf; for(;len > 0; len -= 4) - *p++ = __raw_readl(ndfc->ndfcbase + NDFC_DATA); + *p++ = in_be32(ndfc->ndfcbase + NDFC_DATA); } static void ndfc_write_buf(struct mtd_info *mtd, const uint8_t *buf, int len) @@ -132,7 +131,7 @@ static void ndfc_write_buf(struct mtd_info *mtd, const uint8_t *buf, int len) uint32_t *p = (uint32_t *) buf; for(;len > 0; len -= 4) - __raw_writel(*p++, ndfc->ndfcbase + NDFC_DATA); + out_be32(ndfc->ndfcbase + NDFC_DATA, *p++); } static int ndfc_verify_buf(struct mtd_info *mtd, const uint8_t *buf, int len) @@ -141,7 +140,7 @@ static int ndfc_verify_buf(struct mtd_info *mtd, const uint8_t *buf, int len) uint32_t *p = (uint32_t *) buf; for(;len > 0; len -= 4) - if (*p++ != 
__raw_readl(ndfc->ndfcbase + NDFC_DATA)) + if (*p++ != in_be32(ndfc->ndfcbase + NDFC_DATA)) return -EFAULT; return 0; } @@ -149,10 +148,19 @@ static int ndfc_verify_buf(struct mtd_info *mtd, const uint8_t *buf, int len) /* * Initialize chip structure */ -static void ndfc_chip_init(struct ndfc_nand_mtd *mtd) +static int ndfc_chip_init(struct ndfc_controller *ndfc, + struct device_node *node) { - struct ndfc_controller *ndfc = &ndfc_ctrl; - struct nand_chip *chip = &mtd->chip; +#ifdef CONFIG_MTD_PARTITIONS +#ifdef CONFIG_MTD_CMDLINE_PARTS + static const char *part_types[] = { "cmdlinepart", NULL }; +#else + static const char *part_types[] = { NULL }; +#endif +#endif + struct device_node *flash_np; + struct nand_chip *chip = &ndfc->chip; + int ret; chip->IO_ADDR_R = ndfc->ndfcbase + NDFC_DATA; chip->IO_ADDR_W = ndfc->ndfcbase + NDFC_DATA; @@ -160,8 +168,6 @@ static void ndfc_chip_init(struct ndfc_nand_mtd *mtd) chip->dev_ready = ndfc_ready; chip->select_chip = ndfc_select_chip; chip->chip_delay = 50; - chip->priv = mtd; - chip->options = mtd->pl_chip->options; chip->controller = &ndfc->ndfc_control; chip->read_buf = ndfc_read_buf; chip->write_buf = ndfc_write_buf; @@ -172,143 +178,136 @@ static void ndfc_chip_init(struct ndfc_nand_mtd *mtd) chip->ecc.mode = NAND_ECC_HW; chip->ecc.size = 256; chip->ecc.bytes = 3; - chip->ecclayout = chip->ecc.layout = mtd->pl_chip->ecclayout; - mtd->mtd.priv = chip; - mtd->mtd.owner = THIS_MODULE; -} - -static int ndfc_chip_probe(struct platform_device *pdev) -{ - struct platform_nand_chip *nc = pdev->dev.platform_data; - struct ndfc_chip_settings *settings = nc->priv; - struct ndfc_controller *ndfc = &ndfc_ctrl; - struct ndfc_nand_mtd *nandmtd; - - if (nc->chip_offset >= NDFC_MAX_BANKS || nc->nr_chips > NDFC_MAX_BANKS) - return -EINVAL; - - /* Set the bank settings */ - __raw_writel(settings->bank_settings, - ndfc->ndfcbase + NDFC_BCFG0 + (nc->chip_offset << 2)); - nandmtd = &ndfc_mtd[pdev->id]; - if (nandmtd->pl_chip) - return -EBUSY; + ndfc->mtd.priv = chip; + ndfc->mtd.owner = THIS_MODULE; - nandmtd->pl_chip = nc; - ndfc_chip_init(nandmtd); - - /* Scan for chips */ - if (nand_scan(&nandmtd->mtd, nc->nr_chips)) { - nandmtd->pl_chip = NULL; + flash_np = of_get_next_child(node, NULL); + if (!flash_np) return -ENODEV; + + ndfc->mtd.name = kasprintf(GFP_KERNEL, "%s.%s", + ndfc->ofdev->dev.bus_id, flash_np->name); + if (!ndfc->mtd.name) { + ret = -ENOMEM; + goto err; } -#ifdef CONFIG_MTD_PARTITIONS - printk("Number of partitions %d\n", nc->nr_partitions); - if (nc->nr_partitions) { - /* Add the full device, so complete dumps can be made */ - add_mtd_device(&nandmtd->mtd); - add_mtd_partitions(&nandmtd->mtd, nc->partitions, - nc->nr_partitions); + ret = nand_scan(&ndfc->mtd, 1); + if (ret) + goto err; - } else -#else - add_mtd_device(&nandmtd->mtd); +#ifdef CONFIG_MTD_PARTITIONS + ret = parse_mtd_partitions(&ndfc->mtd, part_types, &ndfc->parts, 0); + if (ret < 0) + goto err; + +#ifdef CONFIG_MTD_OF_PARTS + if (ret == 0) { + ret = of_mtd_parse_partitions(&ndfc->ofdev->dev, flash_np, + &ndfc->parts); + if (ret < 0) + goto err; + } #endif - atomic_inc(&ndfc->childs_active); - return 0; -} + if (ret > 0) + ret = add_mtd_partitions(&ndfc->mtd, ndfc->parts, ret); + else +#endif + ret = add_mtd_device(&ndfc->mtd); -static int ndfc_chip_remove(struct platform_device *pdev) -{ - return 0; +err: + of_node_put(flash_np); + if (ret) + kfree(ndfc->mtd.name); + return ret; } -static int ndfc_nand_probe(struct platform_device *pdev) +static int __devinit ndfc_probe(struct 
of_device *ofdev, + const struct of_device_id *match) { - struct platform_nand_ctrl *nc = pdev->dev.platform_data; - struct ndfc_controller_settings *settings = nc->priv; - struct resource *res = pdev->resource; struct ndfc_controller *ndfc = &ndfc_ctrl; - unsigned long long phys = settings->ndfc_erpn | res->start; + const u32 *reg; + u32 ccr; + int err, len; -#ifndef CONFIG_PHYS_64BIT - ndfc->ndfcbase = ioremap((phys_addr_t)phys, res->end - res->start + 1); -#else - ndfc->ndfcbase = ioremap64(phys, res->end - res->start + 1); -#endif + spin_lock_init(&ndfc->ndfc_control.lock); + init_waitqueue_head(&ndfc->ndfc_control.wq); + ndfc->ofdev = ofdev; + dev_set_drvdata(&ofdev->dev, ndfc); + + /* Read the reg property to get the chip select */ + reg = of_get_property(ofdev->node, "reg", &len); + if (reg == NULL || len != 12) { + dev_err(&ofdev->dev, "unable read reg property (%d)\n", len); + return -ENOENT; + } + ndfc->chip_select = reg[0]; + + ndfc->ndfcbase = of_iomap(ofdev->node, 0); if (!ndfc->ndfcbase) { - printk(KERN_ERR "NDFC: ioremap failed\n"); + dev_err(&ofdev->dev, "failed to get memory\n"); return -EIO; } - __raw_writel(settings->ccr_settings, ndfc->ndfcbase + NDFC_CCR); + ccr = NDFC_CCR_BS(ndfc->chip_select); - spin_lock_init(&ndfc->ndfc_control.lock); - init_waitqueue_head(&ndfc->ndfc_control.wq); + /* It is ok if ccr does not exist - just default to 0 */ + reg = of_get_property(ofdev->node, "ccr", NULL); + if (reg) + ccr |= *reg; - platform_set_drvdata(pdev, ndfc); + out_be32(ndfc->ndfcbase + NDFC_CCR, ccr); - printk("NDFC NAND Driver initialized. Chip-Rev: 0x%08x\n", - __raw_readl(ndfc->ndfcbase + NDFC_REVID)); + /* Set the bank settings if given */ + reg = of_get_property(ofdev->node, "bank-settings", NULL); + if (reg) { + int offset = NDFC_BCFG0 + (ndfc->chip_select << 2); + out_be32(ndfc->ndfcbase + offset, *reg); + } + + err = ndfc_chip_init(ndfc, ofdev->node); + if (err) { + iounmap(ndfc->ndfcbase); + return err; + } return 0; } -static int ndfc_nand_remove(struct platform_device *pdev) +static int __devexit ndfc_remove(struct of_device *ofdev) { - struct ndfc_controller *ndfc = platform_get_drvdata(pdev); + struct ndfc_controller *ndfc = dev_get_drvdata(&ofdev->dev); - if (atomic_read(&ndfc->childs_active)) - return -EBUSY; + nand_release(&ndfc->mtd); - if (ndfc) { - platform_set_drvdata(pdev, NULL); - iounmap(ndfc_ctrl.ndfcbase); - ndfc_ctrl.ndfcbase = NULL; - } return 0; } -/* driver device registration */ - -static struct platform_driver ndfc_chip_driver = { - .probe = ndfc_chip_probe, - .remove = ndfc_chip_remove, - .driver = { - .name = "ndfc-chip", - .owner = THIS_MODULE, - }, +static const struct of_device_id ndfc_match[] = { + { .compatible = "ibm,ndfc", }, + {} }; +MODULE_DEVICE_TABLE(of, ndfc_match); -static struct platform_driver ndfc_nand_driver = { - .probe = ndfc_nand_probe, - .remove = ndfc_nand_remove, - .driver = { - .name = "ndfc-nand", - .owner = THIS_MODULE, +static struct of_platform_driver ndfc_driver = { + .driver = { + .name = "ndfc", }, + .match_table = ndfc_match, + .probe = ndfc_probe, + .remove = __devexit_p(ndfc_remove), }; static int __init ndfc_nand_init(void) { - int ret; - - spin_lock_init(&ndfc_ctrl.ndfc_control.lock); - init_waitqueue_head(&ndfc_ctrl.ndfc_control.wq); - - ret = platform_driver_register(&ndfc_nand_driver); - if (!ret) - ret = platform_driver_register(&ndfc_chip_driver); - return ret; + return of_register_platform_driver(&ndfc_driver); } static void __exit ndfc_nand_exit(void) { - 
platform_driver_unregister(&ndfc_chip_driver); - platform_driver_unregister(&ndfc_nand_driver); + of_unregister_platform_driver(&ndfc_driver); } module_init(ndfc_nand_init); @@ -316,6 +315,4 @@ module_exit(ndfc_nand_exit); MODULE_LICENSE("GPL"); MODULE_AUTHOR("Thomas Gleixner "); -MODULE_DESCRIPTION("Platform driver for NDFC"); -MODULE_ALIAS("platform:ndfc-chip"); -MODULE_ALIAS("platform:ndfc-nand"); +MODULE_DESCRIPTION("OF Platform driver for NDFC"); -- cgit v1.2.3 From 28405d8d9ce05f5bd869ef8b48da5086f9527d73 Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Mon, 5 Jan 2009 17:14:31 -0700 Subject: async_tx, dmaengine: document channel allocation and api rework "Wouldn't it be better if the dmaengine layer made sure it didn't pass the same channel several times to a client? I mean, you seem concerned that the memcpy() API should be transparent and easy to use, but the whole registration interface is just ridiculously complicated..." - Haavard The dmaengine and async_tx registration/allocation interface is indeed needlessly complicated. This redesign has the following goals: 1/ Simplify reference counting: dma channels are not something one would expect to be hotplugged, it should be an exceptional event handled by drivers not something clients should be mandated to handle in a callback. The common case channel removal event is 'rmmod ', which for simplicity should be disallowed if the channel is in use. 2/ Add an interface for requesting exclusive access to a channel suitable to device-to-memory users. 3/ Convert all memory-to-memory users over to a common allocator, the goal here is to not have competing channel allocation schemes. The only competition should be between device-to-memory exclusive allocations and the memory-to-memory usage case where channels are shared between multiple "clients". Cc: Haavard Skinnemoen Cc: Neil Brown Cc: Jeff Garzik Reviewed-by: Andrew Morton Signed-off-by: Dan Williams --- Documentation/crypto/async-tx-api.txt | 96 ++++++++++++++++------------------- Documentation/dmaengine.txt | 1 + 2 files changed, 45 insertions(+), 52 deletions(-) create mode 100644 Documentation/dmaengine.txt (limited to 'Documentation') diff --git a/Documentation/crypto/async-tx-api.txt b/Documentation/crypto/async-tx-api.txt index c1e9545c59bd..9f59fcbf5d82 100644 --- a/Documentation/crypto/async-tx-api.txt +++ b/Documentation/crypto/async-tx-api.txt @@ -13,9 +13,9 @@ 3.6 Constraints 3.7 Example -4 DRIVER DEVELOPER NOTES +4 DMAENGINE DRIVER DEVELOPER NOTES 4.1 Conformance points -4.2 "My application needs finer control of hardware channels" +4.2 "My application needs exclusive control of hardware channels" 5 SOURCE @@ -150,6 +150,7 @@ ops_run_* and ops_complete_* routines in drivers/md/raid5.c for more implementation examples. 4 DRIVER DEVELOPMENT NOTES + 4.1 Conformance points: There are a few conformance points required in dmaengine drivers to accommodate assumptions made by applications using the async_tx API: @@ -158,58 +159,49 @@ accommodate assumptions made by applications using the async_tx API: 3/ Use async_tx_run_dependencies() in the descriptor clean up path to handle submission of dependent operations -4.2 "My application needs finer control of hardware channels" -This requirement seems to arise from cases where a DMA engine driver is -trying to support device-to-memory DMA. 
The dmaengine and async_tx -implementations were designed for offloading memory-to-memory -operations; however, there are some capabilities of the dmaengine layer -that can be used for platform-specific channel management. -Platform-specific constraints can be handled by registering the -application as a 'dma_client' and implementing a 'dma_event_callback' to -apply a filter to the available channels in the system. Before showing -how to implement a custom dma_event callback some background of -dmaengine's client support is required. - -The following routines in dmaengine support multiple clients requesting -use of a channel: -- dma_async_client_register(struct dma_client *client) -- dma_async_client_chan_request(struct dma_client *client) - -dma_async_client_register takes a pointer to an initialized dma_client -structure. It expects that the 'event_callback' and 'cap_mask' fields -are already initialized. - -dma_async_client_chan_request triggers dmaengine to notify the client of -all channels that satisfy the capability mask. It is up to the client's -event_callback routine to track how many channels the client needs and -how many it is currently using. The dma_event_callback routine returns a -dma_state_client code to let dmaengine know the status of the -allocation. - -Below is the example of how to extend this functionality for -platform-specific filtering of the available channels beyond the -standard capability mask: - -static enum dma_state_client -my_dma_client_callback(struct dma_client *client, - struct dma_chan *chan, enum dma_state state) -{ - struct dma_device *dma_dev; - struct my_platform_specific_dma *plat_dma_dev; - - dma_dev = chan->device; - plat_dma_dev = container_of(dma_dev, - struct my_platform_specific_dma, - dma_dev); - - if (!plat_dma_dev->platform_specific_capability) - return DMA_DUP; - - . . . -} +4.2 "My application needs exclusive control of hardware channels" +Primarily this requirement arises from cases where a DMA engine driver +is being used to support device-to-memory operations. A channel that is +performing these operations cannot, for many platform specific reasons, +be shared. For these cases the dma_request_channel() interface is +provided. + +The interface is: +struct dma_chan *dma_request_channel(dma_cap_mask_t mask, + dma_filter_fn filter_fn, + void *filter_param); + +Where dma_filter_fn is defined as: +typedef bool (*dma_filter_fn)(struct dma_chan *chan, void *filter_param); + +When the optional 'filter_fn' parameter is set to NULL +dma_request_channel simply returns the first channel that satisfies the +capability mask. Otherwise, when the mask parameter is insufficient for +specifying the necessary channel, the filter_fn routine can be used to +disposition the available channels in the system. The filter_fn routine +is called once for each free channel in the system. Upon seeing a +suitable channel filter_fn returns DMA_ACK which flags that channel to +be the return value from dma_request_channel. A channel allocated via +this interface is exclusive to the caller, until dma_release_channel() +is called. + +The DMA_PRIVATE capability flag is used to tag dma devices that should +not be used by the general-purpose allocator. It can be set at +initialization time if it is known that a channel will always be +private. Alternatively, it is set when dma_request_channel() finds an +unused "public" channel. 
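As a usage illustration, a minimal consumer of this interface might look like
the sketch below (editor's example, not part of the patch; the my_filter and
my_grab_channel names are hypothetical). Note the filter returns bool, in
line with the dma_filter_fn typedef above:

#include <linux/dmaengine.h>

/* Hypothetical filter: only accept channels backed by our DMA controller. */
static bool my_filter(struct dma_chan *chan, void *filter_param)
{
	struct device *my_dma_dev = filter_param;

	return chan->device->dev == my_dma_dev; /* true claims this channel */
}

static struct dma_chan *my_grab_channel(struct device *my_dma_dev)
{
	dma_cap_mask_t mask;

	dma_cap_zero(mask);
	dma_cap_set(DMA_SLAVE, mask);

	/* Exclusive allocation: returns NULL if no free channel satisfies
	 * both the capability mask and the filter. */
	return dma_request_channel(mask, my_filter, my_dma_dev);
}

The channel stays private to the caller until dma_release_channel() is
called on it.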
+ +A couple caveats to note when implementing a driver and consumer: +1/ Once a channel has been privately allocated it will no longer be + considered by the general-purpose allocator even after a call to + dma_release_channel(). +2/ Since capabilities are specified at the device level a dma_device + with multiple channels will either have all channels public, or all + channels private. 5 SOURCE -include/linux/dmaengine.h: core header file for DMA drivers and clients + +include/linux/dmaengine.h: core header file for DMA drivers and api users drivers/dma/dmaengine.c: offload engine channel management routines drivers/dma/: location for offload engine drivers include/linux/async_tx.h: core header file for the async_tx api diff --git a/Documentation/dmaengine.txt b/Documentation/dmaengine.txt new file mode 100644 index 000000000000..0c1c2f63c0a9 --- /dev/null +++ b/Documentation/dmaengine.txt @@ -0,0 +1 @@ +See Documentation/crypto/async-tx-api.txt -- cgit v1.2.3 From 709ac06a148a33493d3e2f9391bb746b067d96d6 Mon Sep 17 00:00:00 2001 From: David Woodhouse Date: Wed, 7 Jan 2009 09:54:24 -0500 Subject: Btrfs: Add Documentation/filesystem/btrfs.txt, remove old COPYING Signed-off-by: Chris Mason --- Documentation/filesystems/btrfs.txt | 91 +++++++++ fs/btrfs/COPYING | 356 ------------------------------------ fs/btrfs/INSTALL | 48 ----- 3 files changed, 91 insertions(+), 404 deletions(-) create mode 100644 Documentation/filesystems/btrfs.txt delete mode 100644 fs/btrfs/COPYING delete mode 100644 fs/btrfs/INSTALL (limited to 'Documentation') diff --git a/Documentation/filesystems/btrfs.txt b/Documentation/filesystems/btrfs.txt new file mode 100644 index 000000000000..64087c34327f --- /dev/null +++ b/Documentation/filesystems/btrfs.txt @@ -0,0 +1,91 @@ + + BTRFS + ===== + +Btrfs is a new copy on write filesystem for Linux aimed at +implementing advanced features while focusing on fault tolerance, +repair and easy administration. Initially developed by Oracle, Btrfs +is licensed under the GPL and open for contribution from anyone. + +Linux has a wealth of filesystems to choose from, but we are facing a +number of challenges with scaling to the large storage subsystems that +are becoming common in today's data centers. Filesystems need to scale +in their ability to address and manage large storage, and also in +their ability to detect, repair and tolerate errors in the data stored +on disk. Btrfs is under heavy development, and is not suitable for +any uses other than benchmarking and review. The Btrfs disk format is +not yet finalized. + +The main Btrfs features include: + + * Extent based file storage (2^64 max file size) + * Space efficient packing of small files + * Space efficient indexed directories + * Dynamic inode allocation + * Writable snapshots + * Subvolumes (separate internal filesystem roots) + * Object level mirroring and striping + * Checksums on data and metadata (multiple algorithms available) + * Compression + * Integrated multiple device support, with several raid algorithms + * Online filesystem check (not yet implemented) + * Very fast offline filesystem check + * Efficient incremental backup and FS mirroring (not yet implemented) + * Online filesystem defragmentation + + + + MAILING LIST + ============ + +There is a Btrfs mailing list hosted on vger.kernel.org. 
You can +find details on how to subscribe here: + +http://vger.kernel.org/vger-lists.html#linux-btrfs + +Mailing list archives are available from gmane: + +http://dir.gmane.org/gmane.comp.file-systems.btrfs + + + + IRC + === + +Discussion of Btrfs also occurs on the #btrfs channel of the Freenode +IRC network. + + + + UTILITIES + ========= + +Userspace tools for creating and manipulating Btrfs file systems are +available from the git repository at the following location: + + http://git.kernel.org/?p=linux/kernel/git/mason/btrfs-progs-unstable.git + git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs-unstable.git + +These include the following tools: + +mkfs.btrfs: create a filesystem + +btrfsctl: control program to create snapshots and subvolumes: + + mount /dev/sda2 /mnt + btrfsctl -s new_subvol_name /mnt + btrfsctl -s snapshot_of_default /mnt/default + btrfsctl -s snapshot_of_new_subvol /mnt/new_subvol_name + btrfsctl -s snapshot_of_a_snapshot /mnt/snapshot_of_new_subvol + ls /mnt + default snapshot_of_a_snapshot snapshot_of_new_subvol + new_subvol_name snapshot_of_default + + Snapshots and subvolumes cannot be deleted right now, but you can + rm -rf all the files and directories inside them. + +btrfsck: do a limited check of the FS extent trees. + +btrfs-debug-tree: print all of the FS metadata in text form. Example: + + btrfs-debug-tree /dev/sda2 >& big_output_file diff --git a/fs/btrfs/COPYING b/fs/btrfs/COPYING deleted file mode 100644 index ca442d313d86..000000000000 --- a/fs/btrfs/COPYING +++ /dev/null @@ -1,356 +0,0 @@ - - NOTE! This copyright does *not* cover user programs that use kernel - services by normal system calls - this is merely considered normal use - of the kernel, and does *not* fall under the heading of "derived work". - Also note that the GPL below is copyrighted by the Free Software - Foundation, but the instance of code that it refers to (the Linux - kernel) is copyrighted by me and others who actually wrote it. - - Also note that the only valid version of the GPL as far as the kernel - is concerned is _this_ particular version of the license (ie v2, not - v2.2 or v3.x or whatever), unless explicitly otherwise stated. - - Linus Torvalds - ----------------------------------------- - - GNU GENERAL PUBLIC LICENSE - Version 2, June 1991 - - Copyright (C) 1989, 1991 Free Software Foundation, Inc. - 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA - Everyone is permitted to copy and distribute verbatim copies - of this license document, but changing it is not allowed. - - Preamble - - The licenses for most software are designed to take away your -freedom to share and change it. By contrast, the GNU General Public -License is intended to guarantee your freedom to share and change free -software--to make sure the software is free for all its users. This -General Public License applies to most of the Free Software -Foundation's software and to any other program whose authors commit to -using it. (Some other Free Software Foundation software is covered by -the GNU Library General Public License instead.) You can apply it to -your programs, too. - - When we speak of free software, we are referring to freedom, not -price. Our General Public Licenses are designed to make sure that you -have the freedom to distribute copies of free software (and charge for -this service if you wish), that you receive source code or can get it -if you want it, that you can change the software or use pieces of it -in new free programs; and that you know you can do these things. 
- - To protect your rights, we need to make restrictions that forbid -anyone to deny you these rights or to ask you to surrender the rights. -These restrictions translate to certain responsibilities for you if you -distribute copies of the software, or if you modify it. - - For example, if you distribute copies of such a program, whether -gratis or for a fee, you must give the recipients all the rights that -you have. You must make sure that they, too, receive or can get the -source code. And you must show them these terms so they know their -rights. - - We protect your rights with two steps: (1) copyright the software, and -(2) offer you this license which gives you legal permission to copy, -distribute and/or modify the software. - - Also, for each author's protection and ours, we want to make certain -that everyone understands that there is no warranty for this free -software. If the software is modified by someone else and passed on, we -want its recipients to know that what they have is not the original, so -that any problems introduced by others will not reflect on the original -authors' reputations. - - Finally, any free program is threatened constantly by software -patents. We wish to avoid the danger that redistributors of a free -program will individually obtain patent licenses, in effect making the -program proprietary. To prevent this, we have made it clear that any -patent must be licensed for everyone's free use or not licensed at all. - - The precise terms and conditions for copying, distribution and -modification follow. - - GNU GENERAL PUBLIC LICENSE - TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION - - 0. This License applies to any program or other work which contains -a notice placed by the copyright holder saying it may be distributed -under the terms of this General Public License. The "Program", below, -refers to any such program or work, and a "work based on the Program" -means either the Program or any derivative work under copyright law: -that is to say, a work containing the Program or a portion of it, -either verbatim or with modifications and/or translated into another -language. (Hereinafter, translation is included without limitation in -the term "modification".) Each licensee is addressed as "you". - -Activities other than copying, distribution and modification are not -covered by this License; they are outside its scope. The act of -running the Program is not restricted, and the output from the Program -is covered only if its contents constitute a work based on the -Program (independent of having been made by running the Program). -Whether that is true depends on what the Program does. - - 1. You may copy and distribute verbatim copies of the Program's -source code as you receive it, in any medium, provided that you -conspicuously and appropriately publish on each copy an appropriate -copyright notice and disclaimer of warranty; keep intact all the -notices that refer to this License and to the absence of any warranty; -and give any other recipients of the Program a copy of this License -along with the Program. - -You may charge a fee for the physical act of transferring a copy, and -you may at your option offer warranty protection in exchange for a fee. - - 2. 
You may modify your copy or copies of the Program or any portion -of it, thus forming a work based on the Program, and copy and -distribute such modifications or work under the terms of Section 1 -above, provided that you also meet all of these conditions: - - a) You must cause the modified files to carry prominent notices - stating that you changed the files and the date of any change. - - b) You must cause any work that you distribute or publish, that in - whole or in part contains or is derived from the Program or any - part thereof, to be licensed as a whole at no charge to all third - parties under the terms of this License. - - c) If the modified program normally reads commands interactively - when run, you must cause it, when started running for such - interactive use in the most ordinary way, to print or display an - announcement including an appropriate copyright notice and a - notice that there is no warranty (or else, saying that you provide - a warranty) and that users may redistribute the program under - these conditions, and telling the user how to view a copy of this - License. (Exception: if the Program itself is interactive but - does not normally print such an announcement, your work based on - the Program is not required to print an announcement.) - -These requirements apply to the modified work as a whole. If -identifiable sections of that work are not derived from the Program, -and can be reasonably considered independent and separate works in -themselves, then this License, and its terms, do not apply to those -sections when you distribute them as separate works. But when you -distribute the same sections as part of a whole which is a work based -on the Program, the distribution of the whole must be on the terms of -this License, whose permissions for other licensees extend to the -entire whole, and thus to each and every part regardless of who wrote it. - -Thus, it is not the intent of this section to claim rights or contest -your rights to work written entirely by you; rather, the intent is to -exercise the right to control the distribution of derivative or -collective works based on the Program. - -In addition, mere aggregation of another work not based on the Program -with the Program (or with a work based on the Program) on a volume of -a storage or distribution medium does not bring the other work under -the scope of this License. - - 3. You may copy and distribute the Program (or a work based on it, -under Section 2) in object code or executable form under the terms of -Sections 1 and 2 above provided that you also do one of the following: - - a) Accompany it with the complete corresponding machine-readable - source code, which must be distributed under the terms of Sections - 1 and 2 above on a medium customarily used for software interchange; or, - - b) Accompany it with a written offer, valid for at least three - years, to give any third party, for a charge no more than your - cost of physically performing source distribution, a complete - machine-readable copy of the corresponding source code, to be - distributed under the terms of Sections 1 and 2 above on a medium - customarily used for software interchange; or, - - c) Accompany it with the information you received as to the offer - to distribute corresponding source code. (This alternative is - allowed only for noncommercial distribution and only if you - received the program in object code or executable form with such - an offer, in accord with Subsection b above.) 
- -The source code for a work means the preferred form of the work for -making modifications to it. For an executable work, complete source -code means all the source code for all modules it contains, plus any -associated interface definition files, plus the scripts used to -control compilation and installation of the executable. However, as a -special exception, the source code distributed need not include -anything that is normally distributed (in either source or binary -form) with the major components (compiler, kernel, and so on) of the -operating system on which the executable runs, unless that component -itself accompanies the executable. - -If distribution of executable or object code is made by offering -access to copy from a designated place, then offering equivalent -access to copy the source code from the same place counts as -distribution of the source code, even though third parties are not -compelled to copy the source along with the object code. - - 4. You may not copy, modify, sublicense, or distribute the Program -except as expressly provided under this License. Any attempt -otherwise to copy, modify, sublicense or distribute the Program is -void, and will automatically terminate your rights under this License. -However, parties who have received copies, or rights, from you under -this License will not have their licenses terminated so long as such -parties remain in full compliance. - - 5. You are not required to accept this License, since you have not -signed it. However, nothing else grants you permission to modify or -distribute the Program or its derivative works. These actions are -prohibited by law if you do not accept this License. Therefore, by -modifying or distributing the Program (or any work based on the -Program), you indicate your acceptance of this License to do so, and -all its terms and conditions for copying, distributing or modifying -the Program or works based on it. - - 6. Each time you redistribute the Program (or any work based on the -Program), the recipient automatically receives a license from the -original licensor to copy, distribute or modify the Program subject to -these terms and conditions. You may not impose any further -restrictions on the recipients' exercise of the rights granted herein. -You are not responsible for enforcing compliance by third parties to -this License. - - 7. If, as a consequence of a court judgment or allegation of patent -infringement or for any other reason (not limited to patent issues), -conditions are imposed on you (whether by court order, agreement or -otherwise) that contradict the conditions of this License, they do not -excuse you from the conditions of this License. If you cannot -distribute so as to satisfy simultaneously your obligations under this -License and any other pertinent obligations, then as a consequence you -may not distribute the Program at all. For example, if a patent -license would not permit royalty-free redistribution of the Program by -all those who receive copies directly or indirectly through you, then -the only way you could satisfy both it and this License would be to -refrain entirely from distribution of the Program. - -If any portion of this section is held invalid or unenforceable under -any particular circumstance, the balance of the section is intended to -apply and the section as a whole is intended to apply in other -circumstances. 
- -It is not the purpose of this section to induce you to infringe any -patents or other property right claims or to contest validity of any -such claims; this section has the sole purpose of protecting the -integrity of the free software distribution system, which is -implemented by public license practices. Many people have made -generous contributions to the wide range of software distributed -through that system in reliance on consistent application of that -system; it is up to the author/donor to decide if he or she is willing -to distribute software through any other system and a licensee cannot -impose that choice. - -This section is intended to make thoroughly clear what is believed to -be a consequence of the rest of this License. - - 8. If the distribution and/or use of the Program is restricted in -certain countries either by patents or by copyrighted interfaces, the -original copyright holder who places the Program under this License -may add an explicit geographical distribution limitation excluding -those countries, so that distribution is permitted only in or among -countries not thus excluded. In such case, this License incorporates -the limitation as if written in the body of this License. - - 9. The Free Software Foundation may publish revised and/or new versions -of the General Public License from time to time. Such new versions will -be similar in spirit to the present version, but may differ in detail to -address new problems or concerns. - -Each version is given a distinguishing version number. If the Program -specifies a version number of this License which applies to it and "any -later version", you have the option of following the terms and conditions -either of that version or of any later version published by the Free -Software Foundation. If the Program does not specify a version number of -this License, you may choose any version ever published by the Free Software -Foundation. - - 10. If you wish to incorporate parts of the Program into other free -programs whose distribution conditions are different, write to the author -to ask for permission. For software which is copyrighted by the Free -Software Foundation, write to the Free Software Foundation; we sometimes -make exceptions for this. Our decision will be guided by the two goals -of preserving the free status of all derivatives of our free software and -of promoting the sharing and reuse of software generally. - - NO WARRANTY - - 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY -FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN -OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES -PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED -OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF -MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS -TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE -PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, -REPAIR OR CORRECTION. - - 12. 
IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING -WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR -REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, -INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING -OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED -TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY -YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER -PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE -POSSIBILITY OF SUCH DAMAGES. - - END OF TERMS AND CONDITIONS - - How to Apply These Terms to Your New Programs - - If you develop a new program, and you want it to be of the greatest -possible use to the public, the best way to achieve this is to make it -free software which everyone can redistribute and change under these terms. - - To do so, attach the following notices to the program. It is safest -to attach them to the start of each source file to most effectively -convey the exclusion of warranty; and each file should have at least -the "copyright" line and a pointer to where the full notice is found. - - <one line to give the program's name and a brief idea of what it does.> - Copyright (C) <year> <name of author> - - This program is free software; you can redistribute it and/or modify - it under the terms of the GNU General Public License as published by - the Free Software Foundation; either version 2 of the License, or - (at your option) any later version. - - This program is distributed in the hope that it will be useful, - but WITHOUT ANY WARRANTY; without even the implied warranty of - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - GNU General Public License for more details. - - You should have received a copy of the GNU General Public License - along with this program; if not, write to the Free Software - Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA - - -Also add information on how to contact you by electronic and paper mail. - -If the program is interactive, make it output a short notice like this -when it starts in an interactive mode: - - Gnomovision version 69, Copyright (C) year name of author - Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'. - This is free software, and you are welcome to redistribute it - under certain conditions; type `show c' for details. - -The hypothetical commands `show w' and `show c' should show the appropriate -parts of the General Public License. Of course, the commands you use may -be called something other than `show w' and `show c'; they could even be -mouse-clicks or menu items--whatever suits your program. - -You should also get your employer (if you work as a programmer) or your -school, if any, to sign a "copyright disclaimer" for the program, if -necessary. Here is a sample; alter the names: - - Yoyodyne, Inc., hereby disclaims all copyright interest in the program - `Gnomovision' (which makes passes at compilers) written by James Hacker. - - <signature of Ty Coon>, 1 April 1989 - Ty Coon, President of Vice - -This General Public License does not permit incorporating your program into -proprietary programs. If your program is a subroutine library, you may -consider it more useful to permit linking proprietary applications with the -library. If this is what you want to do, use the GNU Library General -Public License instead of this License.
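To make the "attach the following notices" instruction above concrete, here is a minimal sketch of a C source file carrying the recommended notice. The file name, program, year and author (frobnicate.c, 2009, Jane Hacker) are invented for illustration and appear nowhere in this series; only the comment block follows the wording the license prescribes.

    /*
     * frobnicate.c - a program that does nothing useful, included only to
     * show where the license notice goes.
     *
     * Copyright (C) 2009  Jane Hacker
     *
     * This program is free software; you can redistribute it and/or modify
     * it under the terms of the GNU General Public License as published by
     * the Free Software Foundation; either version 2 of the License, or
     * (at your option) any later version.
     *
     * This program is distributed in the hope that it will be useful,
     * but WITHOUT ANY WARRANTY; without even the implied warranty of
     * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
     * GNU General Public License for more details.
     *
     * You should have received a copy of the GNU General Public License
     * along with this program; if not, write to the Free Software
     * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
     */

    #include <stdio.h>

    int main(void)
    {
            printf("frobnicate: no warranty; see the GNU GPL for details\n");
            return 0;
    }

Everything below the comment block is scaffolding so that the example compiles; the license only cares that the notice itself is present and points at the full text.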
diff --git a/fs/btrfs/INSTALL b/fs/btrfs/INSTALL deleted file mode 100644 index 16b45a56878d..000000000000 --- a/fs/btrfs/INSTALL +++ /dev/null @@ -1,48 +0,0 @@ -Install Instructions - -Btrfs puts snapshots and subvolumes into the root directory of the FS. This -directory can only be changed by btrfsctl right now, and normal filesystem -operations do not work on it. The default subvolume is called 'default', -and you can create files and directories in mount_point/default - -Btrfs uses libcrc32c in the kernel for file and metadata checksums. You need -to compile the kernel with: - -CONFIG_LIBCRC32C=m - -libcrc32c can be static as well. Once your kernel is setup, typing make in the -btrfs module sources will build against the running kernel. When the build is -complete: - -modprobe libcrc32c -insmod btrfs.ko - -The Btrfs utility programs require libuuid to build. This can be found -in the e2fsprogs sources, and is usually available as libuuid or -e2fsprogs-devel from various distros. - -Building the utilities is just make ; make install. The programs go -into /usr/local/bin. The commands available are: - -mkfs.btrfs: create a filesystem - -btrfsctl: control program to create snapshots and subvolumes: - - mount /dev/sda2 /mnt - btrfsctl -s new_subvol_name /mnt - btrfsctl -s snapshot_of_default /mnt/default - btrfsctl -s snapshot_of_new_subvol /mnt/new_subvol_name - btrfsctl -s snapshot_of_a_snapshot /mnt/snapshot_of_new_subvol - ls /mnt - default snapshot_of_a_snapshot snapshot_of_new_subvol - new_subvol_name snapshot_of_default - - Snapshots and subvolumes cannot be deleted right now, but you can - rm -rf all the files and directories inside them. - -btrfsck: do a limited check of the FS extent trees. - -debug-tree: print all of the FS metadata in text form. Example: - - debug-tree /dev/sda2 >& big_output_file - -- cgit v1.2.3 From 8feae13110d60cc6287afabc2887366b0eb226c2 Mon Sep 17 00:00:00 2001 From: David Howells Date: Thu, 8 Jan 2009 12:04:47 +0000 Subject: NOMMU: Make VMAs per MM as for MMU-mode linux Make VMAs per mm_struct as for MMU-mode linux. This solves two problems: (1) In SYSV SHM where nattch for a segment does not reflect the number of shmat's (and forks) done. (2) In mmap() where the VMA's vm_mm is set to point to the parent mm by an exec'ing process when VM_EXECUTABLE is specified, regardless of the fact that a VMA might be shared and already have its vm_mm assigned to another process or a dead process. A new struct (vm_region) is introduced to track a mapped region and to remember the circumstances under which it may be shared and the vm_list_struct structure is discarded as it's no longer required. This patch makes the following additional changes: (1) Regions are now allocated with alloc_pages() rather than kmalloc() and with no recourse to __GFP_COMP, so the pages are not composite. Instead, each page has a reference on it held by the region. Anything else that is interested in such a page will have to get a reference on it to retain it. When the pages are released due to unmapping, each page is passed to put_page() and will be freed when the page usage count reaches zero. (2) Excess pages are trimmed after an allocation as the allocation must be made as a power-of-2 quantity of pages. (3) VMAs are added to the parent MM's R/B tree and mmap lists. As an MM may end up with overlapping VMAs within the tree, the VMA struct address is appended to the sort key. (4) Non-anonymous VMAs are now added to the backing inode's prio list. 
(5) Holes may be punched in anonymous VMAs with munmap(), releasing parts of the backing region. The VMA and region structs will be split if necessary. (6) sys_shmdt() only releases one attachment to a SYSV IPC shared memory segment instead of all the attachments at that address. Multiple shmat()'s return the same address under NOMMU-mode instead of different virtual addresses as under MMU-mode. (7) Core dumping for ELF-FDPIC requires fewer exceptions for NOMMU-mode. (8) /proc/maps is now the global list of mapped regions, and may list bits that aren't actually mapped anywhere. (9) /proc/meminfo gains a line (tagged "MmapCopy") that indicates the amount of RAM currently allocated by mmap to hold mappable regions that can't be mapped directly. These are copies of the backing device or file if not anonymous. These changes make NOMMU mode more similar to MMU mode. The downside is that NOMMU mode requires some extra memory to track things, compared to NOMMU without this patch (VMAs are no longer shared, and there are now region structs). Signed-off-by: David Howells Tested-by: Mike Frysinger Acked-by: Paul Mundt --- Documentation/nommu-mmap.txt | 18 +- arch/arm/include/asm/mmu.h | 1 - arch/blackfin/include/asm/mmu.h | 1 - arch/blackfin/kernel/ptrace.c | 6 +- arch/blackfin/kernel/traps.c | 11 +- arch/frv/kernel/ptrace.c | 11 +- arch/h8300/include/asm/mmu.h | 1 - arch/m68knommu/include/asm/mmu.h | 1 - arch/sh/include/asm/mmu.h | 1 - fs/binfmt_elf_fdpic.c | 27 +- fs/proc/internal.h | 2 - fs/proc/meminfo.c | 6 + fs/proc/nommu.c | 71 ++- fs/proc/task_nommu.c | 108 +++-- include/asm-frv/mmu.h | 1 - include/asm-m32r/mmu.h | 1 - include/linux/mm.h | 18 +- include/linux/mm_types.h | 18 +- ipc/shm.c | 12 + kernel/fork.c | 4 +- lib/Kconfig.debug | 7 + mm/mmap.c | 10 + mm/nommu.c | 960 +++++++++++++++++++++++++++------------ 23 files changed, 860 insertions(+), 436 deletions(-) (limited to 'Documentation') diff --git a/Documentation/nommu-mmap.txt b/Documentation/nommu-mmap.txt index 7714f57caad5..02b89dcf38ac 100644 --- a/Documentation/nommu-mmap.txt +++ b/Documentation/nommu-mmap.txt @@ -109,12 +109,18 @@ and it's also much more restricted in the latter case: FURTHER NOTES ON NO-MMU MMAP ============================ - (*) A request for a private mapping of less than a page in size may not return - a page-aligned buffer. This is because the kernel calls kmalloc() to - allocate the buffer, not get_free_page(). - - (*) A list of all the mappings on the system is visible through /proc/maps in - no-MMU mode. + (*) A request for a private mapping of a file may return a buffer that is not + page-aligned. This is because XIP may take place, and the data may not be + page aligned in the backing store. + + (*) A request for an anonymous mapping will always be page aligned. If + possible, the size of the request should be a power of two; otherwise some + of the space may be wasted, as the kernel must allocate a power-of-2 + granule but will only discard the excess if appropriately configured, as + this has an effect on fragmentation. + + (*) A list of all the private copy and anonymous mappings on the system is + visible through /proc/maps in no-MMU mode. (*) A list of all the mappings in use by a process is visible through /proc/<pid>/maps in no-MMU mode. diff --git a/arch/arm/include/asm/mmu.h b/arch/arm/include/asm/mmu.h index 53099d4ee421..b561584d04a1 100644 --- a/arch/arm/include/asm/mmu.h +++ b/arch/arm/include/asm/mmu.h @@ -24,7 +24,6 @@ typedef struct { * modified for 2.6 by Hyok S.
Choi */ typedef struct { - struct vm_list_struct *vmlist; unsigned long end_brk; } mm_context_t; diff --git a/arch/blackfin/include/asm/mmu.h b/arch/blackfin/include/asm/mmu.h index 757e43906ed4..dbfd686360e6 100644 --- a/arch/blackfin/include/asm/mmu.h +++ b/arch/blackfin/include/asm/mmu.h @@ -10,7 +10,6 @@ struct sram_list_struct { }; typedef struct { - struct vm_list_struct *vmlist; unsigned long end_brk; unsigned long stack_start; diff --git a/arch/blackfin/kernel/ptrace.c b/arch/blackfin/kernel/ptrace.c index d2d388536630..594e325b40e4 100644 --- a/arch/blackfin/kernel/ptrace.c +++ b/arch/blackfin/kernel/ptrace.c @@ -160,15 +160,15 @@ put_reg(struct task_struct *task, int regno, unsigned long data) static inline int is_user_addr_valid(struct task_struct *child, unsigned long start, unsigned long len) { - struct vm_list_struct *vml; + struct vm_area_struct *vma; struct sram_list_struct *sraml; /* overflow */ if (start + len < start) return -EIO; - for (vml = child->mm->context.vmlist; vml; vml = vml->next) - if (start >= vml->vma->vm_start && start + len < vml->vma->vm_end) + vma = find_vma(child->mm, start); + if (vma && start >= vma->vm_start && start + len <= vma->vm_end) return 0; for (sraml = child->mm->context.sram_list; sraml; sraml = sraml->next) diff --git a/arch/blackfin/kernel/traps.c b/arch/blackfin/kernel/traps.c index 17d8e4172896..5b0667da8d05 100644 --- a/arch/blackfin/kernel/traps.c +++ b/arch/blackfin/kernel/traps.c @@ -32,6 +32,7 @@ #include #include #include +#include #include #include #include @@ -83,6 +84,7 @@ static void decode_address(char *buf, unsigned long address) struct mm_struct *mm; unsigned long flags, offset; unsigned char in_atomic = (bfin_read_IPEND() & 0x10) || in_atomic(); + struct rb_node *n; #ifdef CONFIG_KALLSYMS unsigned long symsize; @@ -128,9 +130,10 @@ static void decode_address(char *buf, unsigned long address) if (!mm) continue; - vml = mm->context.vmlist; - while (vml) { - struct vm_area_struct *vma = vml->vma; + for (n = rb_first(&mm->mm_rb); n; n = rb_next(n)) { + struct vm_area_struct *vma; + + vma = rb_entry(n, struct vm_area_struct, vm_rb); if (address >= vma->vm_start && address < vma->vm_end) { char _tmpbuf[256]; @@ -176,8 +179,6 @@ static void decode_address(char *buf, unsigned long address) goto done; } - - vml = vml->next; } if (!in_atomic) mmput(mm); diff --git a/arch/frv/kernel/ptrace.c b/arch/frv/kernel/ptrace.c index 709e9bdc6126..5e7d401d21e7 100644 --- a/arch/frv/kernel/ptrace.c +++ b/arch/frv/kernel/ptrace.c @@ -69,7 +69,8 @@ static inline int put_reg(struct task_struct *task, int regno, } /* - * check that an address falls within the bounds of the target process's memory mappings + * check that an address falls within the bounds of the target process's memory + * mappings */ static inline int is_user_addr_valid(struct task_struct *child, unsigned long start, unsigned long len) @@ -79,11 +80,11 @@ static inline int is_user_addr_valid(struct task_struct *child, return -EIO; return 0; #else - struct vm_list_struct *vml; + struct vm_area_struct *vma; - for (vml = child->mm->context.vmlist; vml; vml = vml->next) - if (start >= vml->vma->vm_start && start + len <= vml->vma->vm_end) - return 0; + vma = find_vma(child->mm, start); + if (vma && start >= vma->vm_start && start + len <= vma->vm_end) + return 0; return -EIO; #endif diff --git a/arch/h8300/include/asm/mmu.h b/arch/h8300/include/asm/mmu.h index 2ce06ea46104..31309969df70 100644 --- a/arch/h8300/include/asm/mmu.h +++ b/arch/h8300/include/asm/mmu.h @@ -4,7 +4,6 @@ /* 
Copyright (C) 2002, David McCullough */ typedef struct { - struct vm_list_struct *vmlist; unsigned long end_brk; } mm_context_t; diff --git a/arch/m68knommu/include/asm/mmu.h b/arch/m68knommu/include/asm/mmu.h index 5fa6b68353ba..e2da1e6f09fe 100644 --- a/arch/m68knommu/include/asm/mmu.h +++ b/arch/m68knommu/include/asm/mmu.h @@ -4,7 +4,6 @@ /* Copyright (C) 2002, David McCullough */ typedef struct { - struct vm_list_struct *vmlist; unsigned long end_brk; } mm_context_t; diff --git a/arch/sh/include/asm/mmu.h b/arch/sh/include/asm/mmu.h index fdcb93bc6d11..6c43625bb1a5 100644 --- a/arch/sh/include/asm/mmu.h +++ b/arch/sh/include/asm/mmu.h @@ -9,7 +9,6 @@ typedef struct { mm_context_id_t id; void *vdso; #else - struct vm_list_struct *vmlist; unsigned long end_brk; #endif #ifdef CONFIG_BINFMT_ELF_FDPIC diff --git a/fs/binfmt_elf_fdpic.c b/fs/binfmt_elf_fdpic.c index aa5b43205e37..22baf1b13493 100644 --- a/fs/binfmt_elf_fdpic.c +++ b/fs/binfmt_elf_fdpic.c @@ -1567,11 +1567,9 @@ end_coredump: static int elf_fdpic_dump_segments(struct file *file, size_t *size, unsigned long *limit, unsigned long mm_flags) { - struct vm_list_struct *vml; - - for (vml = current->mm->context.vmlist; vml; vml = vml->next) { - struct vm_area_struct *vma = vml->vma; + struct vm_area_struct *vma; + for (vma = current->mm->mmap; vma; vma = vma->vm_next) { if (!maydump(vma, mm_flags)) continue; @@ -1617,9 +1615,6 @@ static int elf_fdpic_core_dump(long signr, struct pt_regs *regs, elf_fpxregset_t *xfpu = NULL; #endif int thread_status_size = 0; -#ifndef CONFIG_MMU - struct vm_list_struct *vml; -#endif elf_addr_t *auxv; unsigned long mm_flags; @@ -1685,13 +1680,7 @@ static int elf_fdpic_core_dump(long signr, struct pt_regs *regs, fill_prstatus(prstatus, current, signr); elf_core_copy_regs(&prstatus->pr_reg, regs); -#ifdef CONFIG_MMU segs = current->mm->map_count; -#else - segs = 0; - for (vml = current->mm->context.vmlist; vml; vml = vml->next) - segs++; -#endif #ifdef ELF_CORE_EXTRA_PHDRS segs += ELF_CORE_EXTRA_PHDRS; #endif @@ -1766,20 +1755,10 @@ static int elf_fdpic_core_dump(long signr, struct pt_regs *regs, mm_flags = current->mm->flags; /* write program headers for segments dump */ - for ( -#ifdef CONFIG_MMU - vma = current->mm->mmap; vma; vma = vma->vm_next -#else - vml = current->mm->context.vmlist; vml; vml = vml->next -#endif - ) { + for (vma = current->mm->mmap; vma; vma = vma->vm_next) { struct elf_phdr phdr; size_t sz; -#ifndef CONFIG_MMU - vma = vml->vma; -#endif - sz = vma->vm_end - vma->vm_start; phdr.p_type = PT_LOAD; diff --git a/fs/proc/internal.h b/fs/proc/internal.h index 3e8aeb8b61ce..cd53ff838498 100644 --- a/fs/proc/internal.h +++ b/fs/proc/internal.h @@ -41,8 +41,6 @@ do { \ (vmi)->used = 0; \ (vmi)->largest_chunk = 0; \ } while(0) - -extern int nommu_vma_show(struct seq_file *, struct vm_area_struct *); #endif extern int proc_tid_stat(struct seq_file *m, struct pid_namespace *ns, diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c index b1675c4e66da..43d23948384a 100644 --- a/fs/proc/meminfo.c +++ b/fs/proc/meminfo.c @@ -73,6 +73,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v) "HighFree: %8lu kB\n" "LowTotal: %8lu kB\n" "LowFree: %8lu kB\n" +#endif +#ifndef CONFIG_MMU + "MmapCopy: %8lu kB\n" #endif "SwapTotal: %8lu kB\n" "SwapFree: %8lu kB\n" @@ -115,6 +118,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v) K(i.freehigh), K(i.totalram-i.totalhigh), K(i.freeram-i.freehigh), +#endif +#ifndef CONFIG_MMU + K((unsigned long) atomic_read(&mmap_pages_allocated)), #endif 
K(i.totalswap), K(i.freeswap), diff --git a/fs/proc/nommu.c b/fs/proc/nommu.c index 3f87d2632947..b446d7ad0b0d 100644 --- a/fs/proc/nommu.c +++ b/fs/proc/nommu.c @@ -33,33 +33,33 @@ #include "internal.h" /* - * display a single VMA to a sequenced file + * display a single region to a sequenced file */ -int nommu_vma_show(struct seq_file *m, struct vm_area_struct *vma) +static int nommu_region_show(struct seq_file *m, struct vm_region *region) { unsigned long ino = 0; struct file *file; dev_t dev = 0; int flags, len; - flags = vma->vm_flags; - file = vma->vm_file; + flags = region->vm_flags; + file = region->vm_file; if (file) { - struct inode *inode = vma->vm_file->f_path.dentry->d_inode; + struct inode *inode = region->vm_file->f_path.dentry->d_inode; dev = inode->i_sb->s_dev; ino = inode->i_ino; } seq_printf(m, "%08lx-%08lx %c%c%c%c %08llx %02x:%02x %lu %n", - vma->vm_start, - vma->vm_end, + region->vm_start, + region->vm_end, flags & VM_READ ? 'r' : '-', flags & VM_WRITE ? 'w' : '-', flags & VM_EXEC ? 'x' : '-', flags & VM_MAYSHARE ? flags & VM_SHARED ? 'S' : 's' : 'p', - ((loff_t)vma->vm_pgoff) << PAGE_SHIFT, + ((loff_t)region->vm_pgoff) << PAGE_SHIFT, MAJOR(dev), MINOR(dev), ino, &len); if (file) { @@ -75,61 +75,54 @@ int nommu_vma_show(struct seq_file *m, struct vm_area_struct *vma) } /* - * display a list of all the VMAs the kernel knows about + * display a list of all the REGIONs the kernel knows about * - nommu kernals have a single flat list */ -static int nommu_vma_list_show(struct seq_file *m, void *v) +static int nommu_region_list_show(struct seq_file *m, void *_p) { - struct vm_area_struct *vma; + struct rb_node *p = _p; - vma = rb_entry((struct rb_node *) v, struct vm_area_struct, vm_rb); - return nommu_vma_show(m, vma); + return nommu_region_show(m, rb_entry(p, struct vm_region, vm_rb)); } -static void *nommu_vma_list_start(struct seq_file *m, loff_t *_pos) +static void *nommu_region_list_start(struct seq_file *m, loff_t *_pos) { - struct rb_node *_rb; + struct rb_node *p; loff_t pos = *_pos; - void *next = NULL; - down_read(&nommu_vma_sem); + down_read(&nommu_region_sem); - for (_rb = rb_first(&nommu_vma_tree); _rb; _rb = rb_next(_rb)) { - if (pos == 0) { - next = _rb; - break; - } - pos--; - } - - return next; + for (p = rb_first(&nommu_region_tree); p; p = rb_next(p)) + if (pos-- == 0) + return p; + return NULL; } -static void nommu_vma_list_stop(struct seq_file *m, void *v) +static void nommu_region_list_stop(struct seq_file *m, void *v) { - up_read(&nommu_vma_sem); + up_read(&nommu_region_sem); } -static void *nommu_vma_list_next(struct seq_file *m, void *v, loff_t *pos) +static void *nommu_region_list_next(struct seq_file *m, void *v, loff_t *pos) { (*pos)++; return rb_next((struct rb_node *) v); } -static const struct seq_operations proc_nommu_vma_list_seqop = { - .start = nommu_vma_list_start, - .next = nommu_vma_list_next, - .stop = nommu_vma_list_stop, - .show = nommu_vma_list_show +static struct seq_operations proc_nommu_region_list_seqop = { + .start = nommu_region_list_start, + .next = nommu_region_list_next, + .stop = nommu_region_list_stop, + .show = nommu_region_list_show }; -static int proc_nommu_vma_list_open(struct inode *inode, struct file *file) +static int proc_nommu_region_list_open(struct inode *inode, struct file *file) { - return seq_open(file, &proc_nommu_vma_list_seqop); + return seq_open(file, &proc_nommu_region_list_seqop); } -static const struct file_operations proc_nommu_vma_list_operations = { - .open = proc_nommu_vma_list_open, +static 
const struct file_operations proc_nommu_region_list_operations = { + .open = proc_nommu_region_list_open, .read = seq_read, .llseek = seq_lseek, .release = seq_release, @@ -137,7 +130,7 @@ static const struct file_operations proc_nommu_vma_list_operations = { static int __init proc_nommu_init(void) { - proc_create("maps", S_IRUGO, NULL, &proc_nommu_vma_list_operations); + proc_create("maps", S_IRUGO, NULL, &proc_nommu_region_list_operations); return 0; } diff --git a/fs/proc/task_nommu.c b/fs/proc/task_nommu.c index d4a8be32b902..ca4a48d0d311 100644 --- a/fs/proc/task_nommu.c +++ b/fs/proc/task_nommu.c @@ -15,25 +15,25 @@ */ void task_mem(struct seq_file *m, struct mm_struct *mm) { - struct vm_list_struct *vml; + struct vm_area_struct *vma; + struct rb_node *p; unsigned long bytes = 0, sbytes = 0, slack = 0; down_read(&mm->mmap_sem); - for (vml = mm->context.vmlist; vml; vml = vml->next) { - if (!vml->vma) - continue; + for (p = rb_first(&mm->mm_rb); p; p = rb_next(p)) { + vma = rb_entry(p, struct vm_area_struct, vm_rb); - bytes += kobjsize(vml); + bytes += kobjsize(vma); if (atomic_read(&mm->mm_count) > 1 || - atomic_read(&vml->vma->vm_usage) > 1 - ) { - sbytes += kobjsize((void *) vml->vma->vm_start); - sbytes += kobjsize(vml->vma); + vma->vm_region || + vma->vm_flags & VM_MAYSHARE) { + sbytes += kobjsize((void *) vma->vm_start); + if (vma->vm_region) + sbytes += kobjsize(vma->vm_region); } else { - bytes += kobjsize((void *) vml->vma->vm_start); - bytes += kobjsize(vml->vma); - slack += kobjsize((void *) vml->vma->vm_start) - - (vml->vma->vm_end - vml->vma->vm_start); + bytes += kobjsize((void *) vma->vm_start); + slack += kobjsize((void *) vma->vm_start) - + (vma->vm_end - vma->vm_start); } } @@ -70,13 +70,14 @@ void task_mem(struct seq_file *m, struct mm_struct *mm) unsigned long task_vsize(struct mm_struct *mm) { - struct vm_list_struct *tbp; + struct vm_area_struct *vma; + struct rb_node *p; unsigned long vsize = 0; down_read(&mm->mmap_sem); - for (tbp = mm->context.vmlist; tbp; tbp = tbp->next) { - if (tbp->vma) - vsize += kobjsize((void *) tbp->vma->vm_start); + for (p = rb_first(&mm->mm_rb); p; p = rb_next(p)) { + vma = rb_entry(p, struct vm_area_struct, vm_rb); + vsize += vma->vm_region->vm_end - vma->vm_region->vm_start; } up_read(&mm->mmap_sem); return vsize; @@ -85,16 +86,15 @@ unsigned long task_vsize(struct mm_struct *mm) int task_statm(struct mm_struct *mm, int *shared, int *text, int *data, int *resident) { - struct vm_list_struct *tbp; + struct vm_area_struct *vma; + struct rb_node *p; int size = kobjsize(mm); down_read(&mm->mmap_sem); - for (tbp = mm->context.vmlist; tbp; tbp = tbp->next) { - size += kobjsize(tbp); - if (tbp->vma) { - size += kobjsize(tbp->vma); - size += kobjsize((void *) tbp->vma->vm_start); - } + for (p = rb_first(&mm->mm_rb); p; p = rb_next(p)) { + vma = rb_entry(p, struct vm_area_struct, vm_rb); + size += kobjsize(vma); + size += kobjsize((void *) vma->vm_start); } size += (*text = mm->end_code - mm->start_code); @@ -104,21 +104,63 @@ int task_statm(struct mm_struct *mm, int *shared, int *text, return size; } +/* + * display a single VMA to a sequenced file + */ +static int nommu_vma_show(struct seq_file *m, struct vm_area_struct *vma) +{ + unsigned long ino = 0; + struct file *file; + dev_t dev = 0; + int flags, len; + + flags = vma->vm_flags; + file = vma->vm_file; + + if (file) { + struct inode *inode = vma->vm_file->f_path.dentry->d_inode; + dev = inode->i_sb->s_dev; + ino = inode->i_ino; + } + + seq_printf(m, + "%08lx-%08lx %c%c%c%c %08lx 
%02x:%02x %lu %n", + vma->vm_start, + vma->vm_end, + flags & VM_READ ? 'r' : '-', + flags & VM_WRITE ? 'w' : '-', + flags & VM_EXEC ? 'x' : '-', + flags & VM_MAYSHARE ? flags & VM_SHARED ? 'S' : 's' : 'p', + vma->vm_pgoff << PAGE_SHIFT, + MAJOR(dev), MINOR(dev), ino, &len); + + if (file) { + len = 25 + sizeof(void *) * 6 - len; + if (len < 1) + len = 1; + seq_printf(m, "%*c", len, ' '); + seq_path(m, &file->f_path, ""); + } + + seq_putc(m, '\n'); + return 0; +} + /* * display mapping lines for a particular process's /proc/pid/maps */ -static int show_map(struct seq_file *m, void *_vml) +static int show_map(struct seq_file *m, void *_p) { - struct vm_list_struct *vml = _vml; + struct rb_node *p = _p; - return nommu_vma_show(m, vml->vma); + return nommu_vma_show(m, rb_entry(p, struct vm_area_struct, vm_rb)); } static void *m_start(struct seq_file *m, loff_t *pos) { struct proc_maps_private *priv = m->private; - struct vm_list_struct *vml; struct mm_struct *mm; + struct rb_node *p; loff_t n = *pos; /* pin the task and mm whilst we play with them */ @@ -134,9 +176,9 @@ static void *m_start(struct seq_file *m, loff_t *pos) } /* start from the Nth VMA */ - for (vml = mm->context.vmlist; vml; vml = vml->next) + for (p = rb_first(&mm->mm_rb); p; p = rb_next(p)) if (n-- == 0) - return vml; + return p; return NULL; } @@ -152,12 +194,12 @@ static void m_stop(struct seq_file *m, void *_vml) } } -static void *m_next(struct seq_file *m, void *_vml, loff_t *pos) +static void *m_next(struct seq_file *m, void *_p, loff_t *pos) { - struct vm_list_struct *vml = _vml; + struct rb_node *p = _p; (*pos)++; - return vml ? vml->next : NULL; + return p ? rb_next(p) : NULL; } static const struct seq_operations proc_pid_maps_ops = { diff --git a/include/asm-frv/mmu.h b/include/asm-frv/mmu.h index 22c03714fb14..86ca0e86e7d2 100644 --- a/include/asm-frv/mmu.h +++ b/include/asm-frv/mmu.h @@ -22,7 +22,6 @@ typedef struct { unsigned long dtlb_ptd_mapping; /* [DAMR5] PTD mapping for dtlb cached PGE */ #else - struct vm_list_struct *vmlist; unsigned long end_brk; #endif diff --git a/include/asm-m32r/mmu.h b/include/asm-m32r/mmu.h index d9bd724479cf..150cb92bb666 100644 --- a/include/asm-m32r/mmu.h +++ b/include/asm-m32r/mmu.h @@ -4,7 +4,6 @@ #if !defined(CONFIG_MMU) typedef struct { - struct vm_list_struct *vmlist; unsigned long end_brk; } mm_context_t; diff --git a/include/linux/mm.h b/include/linux/mm.h index 4a3d28c86443..b91a73fd1bcc 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -56,19 +56,9 @@ extern unsigned long mmap_min_addr; extern struct kmem_cache *vm_area_cachep; -/* - * This struct defines the per-mm list of VMAs for uClinux. 
If CONFIG_MMU is - * disabled, then there's a single shared list of VMAs maintained by the - * system, and mm's subscribe to these individually - */ -struct vm_list_struct { - struct vm_list_struct *next; - struct vm_area_struct *vma; -}; - #ifndef CONFIG_MMU -extern struct rb_root nommu_vma_tree; -extern struct rw_semaphore nommu_vma_sem; +extern struct rb_root nommu_region_tree; +extern struct rw_semaphore nommu_region_sem; extern unsigned int kobjsize(const void *objp); #endif @@ -1061,6 +1051,7 @@ extern void memmap_init_zone(unsigned long, int, unsigned long, unsigned long, enum memmap_context); extern void setup_per_zone_pages_min(void); extern void mem_init(void); +extern void __init mmap_init(void); extern void show_mem(void); extern void si_meminfo(struct sysinfo * val); extern void si_meminfo_node(struct sysinfo *val, int nid); @@ -1072,6 +1063,9 @@ extern void setup_per_cpu_pageset(void); static inline void setup_per_cpu_pageset(void) {} #endif +/* nommu.c */ +extern atomic_t mmap_pages_allocated; + /* prio_tree.c */ void vma_prio_tree_add(struct vm_area_struct *, struct vm_area_struct *old); void vma_prio_tree_insert(struct vm_area_struct *, struct prio_tree_root *); diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 9cfc9b627fdd..1c1e0d3a1714 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -96,6 +96,22 @@ struct page { #endif /* WANT_PAGE_VIRTUAL */ }; +/* + * A region containing a mapping of a non-memory backed file under NOMMU + * conditions. These are held in a global tree and are pinned by the VMAs that + * map parts of them. + */ +struct vm_region { + struct rb_node vm_rb; /* link in global region tree */ + unsigned long vm_flags; /* VMA vm_flags */ + unsigned long vm_start; /* start address of region */ + unsigned long vm_end; /* region initialised to here */ + unsigned long vm_pgoff; /* the offset in vm_file corresponding to vm_start */ + struct file *vm_file; /* the backing file or NULL */ + + atomic_t vm_usage; /* region usage count */ +}; + /* * This struct defines a memory VMM memory area. There is one of these * per VM-area/task. 
A VM area is any part of the process virtual memory @@ -152,7 +168,7 @@ struct vm_area_struct { unsigned long vm_truncate_count;/* truncate_count or restart_addr */ #ifndef CONFIG_MMU - atomic_t vm_usage; /* refcount (VMAs shared if !MMU) */ + struct vm_region *vm_region; /* NOMMU mapping region */ #endif #ifdef CONFIG_NUMA struct mempolicy *vm_policy; /* NUMA policy for the VMA */ diff --git a/ipc/shm.c b/ipc/shm.c index b125b560240e..d0ab5527bf45 100644 --- a/ipc/shm.c +++ b/ipc/shm.c @@ -990,6 +990,7 @@ asmlinkage long sys_shmdt(char __user *shmaddr) */ vma = find_vma(mm, addr); +#ifdef CONFIG_MMU while (vma) { next = vma->vm_next; @@ -1034,6 +1035,17 @@ asmlinkage long sys_shmdt(char __user *shmaddr) vma = next; } +#else /* CONFIG_MMU */ + /* under NOMMU conditions, the exact address to be destroyed must be + * given */ + retval = -EINVAL; + if (vma->vm_start == addr && vma->vm_ops == &shm_vm_ops) { + do_munmap(mm, vma->vm_start, vma->vm_end - vma->vm_start); + retval = 0; + } + +#endif + up_write(&mm->mmap_sem); return retval; } diff --git a/kernel/fork.c b/kernel/fork.c index 7b8f2a78be3d..0bce4a43bb37 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1481,12 +1481,10 @@ void __init proc_caches_init(void) fs_cachep = kmem_cache_create("fs_cache", sizeof(struct fs_struct), 0, SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL); - vm_area_cachep = kmem_cache_create("vm_area_struct", - sizeof(struct vm_area_struct), 0, - SLAB_PANIC, NULL); mm_cachep = kmem_cache_create("mm_struct", sizeof(struct mm_struct), ARCH_MIN_MMSTRUCT_ALIGN, SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL); + mmap_init(); } /* diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 2e75478e9c69..d0a32aab03ff 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -512,6 +512,13 @@ config DEBUG_VIRTUAL If unsure, say N. +config DEBUG_NOMMU_REGIONS + bool "Debug the global anon/private NOMMU mapping region tree" + depends on DEBUG_KERNEL && !MMU + help + This option causes the global tree of anonymous and private mapping + regions to be regularly checked for invalid topology. + config DEBUG_WRITECOUNT bool "Debug filesystem writers count" depends on DEBUG_KERNEL diff --git a/mm/mmap.c b/mm/mmap.c index a910c045cfd4..749623196cb9 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2472,3 +2472,13 @@ void mm_drop_all_locks(struct mm_struct *mm) mutex_unlock(&mm_all_locks_mutex); } + +/* + * initialise the VMA slab + */ +void __init mmap_init(void) +{ + vm_area_cachep = kmem_cache_create("vm_area_struct", + sizeof(struct vm_area_struct), 0, + SLAB_PANIC, NULL); +} diff --git a/mm/nommu.c b/mm/nommu.c index 23f355bbe262..0d363dfcf10e 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -6,7 +6,7 @@ * * See Documentation/nommu-mmap.txt * - * Copyright (c) 2004-2005 David Howells + * Copyright (c) 2004-2008 David Howells * Copyright (c) 2000-2003 David McCullough * Copyright (c) 2000-2001 D Jeff Dionne * Copyright (c) 2002 Greg Ungerer @@ -33,6 +33,28 @@ #include #include #include +#include "internal.h" + +static inline __attribute__((format(printf, 1, 2))) +void no_printk(const char *fmt, ...) +{ +} + +#if 0 +#define kenter(FMT, ...) \ + printk(KERN_DEBUG "==> %s("FMT")\n", __func__, ##__VA_ARGS__) +#define kleave(FMT, ...) \ + printk(KERN_DEBUG "<== %s()"FMT"\n", __func__, ##__VA_ARGS__) +#define kdebug(FMT, ...) \ + printk(KERN_DEBUG "xxx" FMT"yyy\n", ##__VA_ARGS__) +#else +#define kenter(FMT, ...) \ + no_printk(KERN_DEBUG "==> %s("FMT")\n", __func__, ##__VA_ARGS__) +#define kleave(FMT, ...) 
\ + no_printk(KERN_DEBUG "<== %s()"FMT"\n", __func__, ##__VA_ARGS__) +#define kdebug(FMT, ...) \ + no_printk(KERN_DEBUG FMT"\n", ##__VA_ARGS__) +#endif #include "internal.h" @@ -46,12 +68,15 @@ int sysctl_overcommit_ratio = 50; /* default is 50% */ int sysctl_max_map_count = DEFAULT_MAX_MAP_COUNT; int heap_stack_gap = 0; +atomic_t mmap_pages_allocated; + EXPORT_SYMBOL(mem_map); EXPORT_SYMBOL(num_physpages); -/* list of shareable VMAs */ -struct rb_root nommu_vma_tree = RB_ROOT; -DECLARE_RWSEM(nommu_vma_sem); +/* list of mapped, potentially shareable regions */ +static struct kmem_cache *vm_region_jar; +struct rb_root nommu_region_tree = RB_ROOT; +DECLARE_RWSEM(nommu_region_sem); struct vm_operations_struct generic_file_vm_ops = { }; @@ -400,129 +425,174 @@ asmlinkage unsigned long sys_brk(unsigned long brk) return mm->brk = brk; } -#ifdef DEBUG -static void show_process_blocks(void) +/* + * initialise the VMA and region record slabs + */ +void __init mmap_init(void) { - struct vm_list_struct *vml; - - printk("Process blocks %d:", current->pid); - - for (vml = ¤t->mm->context.vmlist; vml; vml = vml->next) { - printk(" %p: %p", vml, vml->vma); - if (vml->vma) - printk(" (%d @%lx #%d)", - kobjsize((void *) vml->vma->vm_start), - vml->vma->vm_start, - atomic_read(&vml->vma->vm_usage)); - printk(vml->next ? " ->" : ".\n"); - } + vm_region_jar = kmem_cache_create("vm_region_jar", + sizeof(struct vm_region), 0, + SLAB_PANIC, NULL); + vm_area_cachep = kmem_cache_create("vm_area_struct", + sizeof(struct vm_area_struct), 0, + SLAB_PANIC, NULL); } -#endif /* DEBUG */ /* - * add a VMA into a process's mm_struct in the appropriate place in the list - * - should be called with mm->mmap_sem held writelocked + * validate the region tree + * - the caller must hold the region lock */ -static void add_vma_to_mm(struct mm_struct *mm, struct vm_list_struct *vml) +#ifdef CONFIG_DEBUG_NOMMU_REGIONS +static noinline void validate_nommu_regions(void) { - struct vm_list_struct **ppv; + struct vm_region *region, *last; + struct rb_node *p, *lastp; - for (ppv = ¤t->mm->context.vmlist; *ppv; ppv = &(*ppv)->next) - if ((*ppv)->vma->vm_start > vml->vma->vm_start) - break; + lastp = rb_first(&nommu_region_tree); + if (!lastp) + return; + + last = rb_entry(lastp, struct vm_region, vm_rb); + if (unlikely(last->vm_end <= last->vm_start)) + BUG(); + + while ((p = rb_next(lastp))) { + region = rb_entry(p, struct vm_region, vm_rb); + last = rb_entry(lastp, struct vm_region, vm_rb); + + if (unlikely(region->vm_end <= region->vm_start)) + BUG(); + if (unlikely(region->vm_start < last->vm_end)) + BUG(); - vml->next = *ppv; - *ppv = vml; + lastp = p; + } } +#else +#define validate_nommu_regions() do {} while(0) +#endif /* - * look up the first VMA in which addr resides, NULL if none - * - should be called with mm->mmap_sem at least held readlocked + * add a region into the global tree */ -struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr) +static void add_nommu_region(struct vm_region *region) { - struct vm_list_struct *loop, *vml; + struct vm_region *pregion; + struct rb_node **p, *parent; - /* search the vm_start ordered list */ - vml = NULL; - for (loop = mm->context.vmlist; loop; loop = loop->next) { - if (loop->vma->vm_start > addr) - break; - vml = loop; + validate_nommu_regions(); + + BUG_ON(region->vm_start & ~PAGE_MASK); + + parent = NULL; + p = &nommu_region_tree.rb_node; + while (*p) { + parent = *p; + pregion = rb_entry(parent, struct vm_region, vm_rb); + if (region->vm_start < pregion->vm_start) 
+ p = &(*p)->rb_left; + else if (region->vm_start > pregion->vm_start) + p = &(*p)->rb_right; + else if (pregion == region) + return; + else + BUG(); } - if (vml && vml->vma->vm_end > addr) - return vml->vma; + rb_link_node(®ion->vm_rb, parent, p); + rb_insert_color(®ion->vm_rb, &nommu_region_tree); - return NULL; + validate_nommu_regions(); } -EXPORT_SYMBOL(find_vma); /* - * find a VMA - * - we don't extend stack VMAs under NOMMU conditions + * delete a region from the global tree */ -struct vm_area_struct *find_extend_vma(struct mm_struct *mm, unsigned long addr) +static void delete_nommu_region(struct vm_region *region) { - return find_vma(mm, addr); -} + BUG_ON(!nommu_region_tree.rb_node); -int expand_stack(struct vm_area_struct *vma, unsigned long address) -{ - return -ENOMEM; + validate_nommu_regions(); + rb_erase(®ion->vm_rb, &nommu_region_tree); + validate_nommu_regions(); } /* - * look up the first VMA exactly that exactly matches addr - * - should be called with mm->mmap_sem at least held readlocked + * free a contiguous series of pages */ -static inline struct vm_area_struct *find_vma_exact(struct mm_struct *mm, - unsigned long addr) +static void free_page_series(unsigned long from, unsigned long to) { - struct vm_list_struct *vml; - - /* search the vm_start ordered list */ - for (vml = mm->context.vmlist; vml; vml = vml->next) { - if (vml->vma->vm_start == addr) - return vml->vma; - if (vml->vma->vm_start > addr) - break; + for (; from < to; from += PAGE_SIZE) { + struct page *page = virt_to_page(from); + + kdebug("- free %lx", from); + atomic_dec(&mmap_pages_allocated); + if (page_count(page) != 1) + kdebug("free page %p [%d]", page, page_count(page)); + put_page(page); } - - return NULL; } /* - * find a VMA in the global tree + * release a reference to a region + * - the caller must hold the region semaphore, which this releases + * - the region may not have been added to the tree yet, in which case vm_end + * will equal vm_start */ -static inline struct vm_area_struct *find_nommu_vma(unsigned long start) +static void __put_nommu_region(struct vm_region *region) + __releases(nommu_region_sem) { - struct vm_area_struct *vma; - struct rb_node *n = nommu_vma_tree.rb_node; + kenter("%p{%d}", region, atomic_read(®ion->vm_usage)); - while (n) { - vma = rb_entry(n, struct vm_area_struct, vm_rb); + BUG_ON(!nommu_region_tree.rb_node); - if (start < vma->vm_start) - n = n->rb_left; - else if (start > vma->vm_start) - n = n->rb_right; - else - return vma; + if (atomic_dec_and_test(®ion->vm_usage)) { + if (region->vm_end > region->vm_start) + delete_nommu_region(region); + up_write(&nommu_region_sem); + + if (region->vm_file) + fput(region->vm_file); + + /* IO memory and memory shared directly out of the pagecache + * from ramfs/tmpfs mustn't be released here */ + if (region->vm_flags & VM_MAPPED_COPY) { + kdebug("free series"); + free_page_series(region->vm_start, region->vm_end); + } + kmem_cache_free(vm_region_jar, region); + } else { + up_write(&nommu_region_sem); } +} - return NULL; +/* + * release a reference to a region + */ +static void put_nommu_region(struct vm_region *region) +{ + down_write(&nommu_region_sem); + __put_nommu_region(region); } /* - * add a VMA in the global tree + * add a VMA into a process's mm_struct in the appropriate place in the list + * and tree and add to the address space's page tree also if not an anonymous + * page + * - should be called with mm->mmap_sem held writelocked */ -static void add_nommu_vma(struct vm_area_struct *vma) +static void 
add_vma_to_mm(struct mm_struct *mm, struct vm_area_struct *vma) { - struct vm_area_struct *pvma; + struct vm_area_struct *pvma, **pp; struct address_space *mapping; - struct rb_node **p = &nommu_vma_tree.rb_node; - struct rb_node *parent = NULL; + struct rb_node **p, *parent; + + kenter(",%p", vma); + + BUG_ON(!vma->vm_region); + + mm->map_count++; + vma->vm_mm = mm; /* add the VMA to the mapping */ if (vma->vm_file) { @@ -533,42 +603,62 @@ static void add_nommu_vma(struct vm_area_struct *vma) flush_dcache_mmap_unlock(mapping); } - /* add the VMA to the master list */ + /* add the VMA to the tree */ + parent = NULL; + p = &mm->mm_rb.rb_node; while (*p) { parent = *p; pvma = rb_entry(parent, struct vm_area_struct, vm_rb); - if (vma->vm_start < pvma->vm_start) { + /* sort by: start addr, end addr, VMA struct addr in that order + * (the latter is necessary as we may get identical VMAs) */ + if (vma->vm_start < pvma->vm_start) p = &(*p)->rb_left; - } - else if (vma->vm_start > pvma->vm_start) { + else if (vma->vm_start > pvma->vm_start) p = &(*p)->rb_right; - } - else { - /* mappings are at the same address - this can only - * happen for shared-mem chardevs and shared file - * mappings backed by ramfs/tmpfs */ - BUG_ON(!(pvma->vm_flags & VM_SHARED)); - - if (vma < pvma) - p = &(*p)->rb_left; - else if (vma > pvma) - p = &(*p)->rb_right; - else - BUG(); - } + else if (vma->vm_end < pvma->vm_end) + p = &(*p)->rb_left; + else if (vma->vm_end > pvma->vm_end) + p = &(*p)->rb_right; + else if (vma < pvma) + p = &(*p)->rb_left; + else if (vma > pvma) + p = &(*p)->rb_right; + else + BUG(); } rb_link_node(&vma->vm_rb, parent, p); - rb_insert_color(&vma->vm_rb, &nommu_vma_tree); + rb_insert_color(&vma->vm_rb, &mm->mm_rb); + + /* add VMA to the VMA list also */ + for (pp = &mm->mmap; (pvma = *pp); pp = &(*pp)->vm_next) { + if (pvma->vm_start > vma->vm_start) + break; + if (pvma->vm_start < vma->vm_start) + continue; + if (pvma->vm_end < vma->vm_end) + break; + } + + vma->vm_next = *pp; + *pp = vma; } /* - * delete a VMA from the global list + * delete a VMA from its owning mm_struct and address space */ -static void delete_nommu_vma(struct vm_area_struct *vma) +static void delete_vma_from_mm(struct vm_area_struct *vma) { + struct vm_area_struct **pp; struct address_space *mapping; + struct mm_struct *mm = vma->vm_mm; + + kenter("%p", vma); + + mm->map_count--; + if (mm->mmap_cache == vma) + mm->mmap_cache = NULL; /* remove the VMA from the mapping */ if (vma->vm_file) { @@ -579,8 +669,115 @@ static void delete_nommu_vma(struct vm_area_struct *vma) flush_dcache_mmap_unlock(mapping); } - /* remove from the master list */ - rb_erase(&vma->vm_rb, &nommu_vma_tree); + /* remove from the MM's tree and list */ + rb_erase(&vma->vm_rb, &mm->mm_rb); + for (pp = &mm->mmap; *pp; pp = &(*pp)->vm_next) { + if (*pp == vma) { + *pp = vma->vm_next; + break; + } + } + + vma->vm_mm = NULL; +} + +/* + * destroy a VMA record + */ +static void delete_vma(struct mm_struct *mm, struct vm_area_struct *vma) +{ + kenter("%p", vma); + if (vma->vm_ops && vma->vm_ops->close) + vma->vm_ops->close(vma); + if (vma->vm_file) { + fput(vma->vm_file); + if (vma->vm_flags & VM_EXECUTABLE) + removed_exe_file_vma(mm); + } + put_nommu_region(vma->vm_region); + kmem_cache_free(vm_area_cachep, vma); +} + +/* + * look up the first VMA in which addr resides, NULL if none + * - should be called with mm->mmap_sem at least held readlocked + */ +struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr) +{ + struct vm_area_struct *vma; + 
struct rb_node *n = mm->mm_rb.rb_node; + + /* check the cache first */ + vma = mm->mmap_cache; + if (vma && vma->vm_start <= addr && vma->vm_end > addr) + return vma; + + /* trawl the tree (there may be multiple mappings in which addr + * resides) */ + for (n = rb_first(&mm->mm_rb); n; n = rb_next(n)) { + vma = rb_entry(n, struct vm_area_struct, vm_rb); + if (vma->vm_start > addr) + return NULL; + if (vma->vm_end > addr) { + mm->mmap_cache = vma; + return vma; + } + } + + return NULL; +} +EXPORT_SYMBOL(find_vma); + +/* + * find a VMA + * - we don't extend stack VMAs under NOMMU conditions + */ +struct vm_area_struct *find_extend_vma(struct mm_struct *mm, unsigned long addr) +{ + return find_vma(mm, addr); +} + +/* + * expand a stack to a given address + * - not supported under NOMMU conditions + */ +int expand_stack(struct vm_area_struct *vma, unsigned long address) +{ + return -ENOMEM; +} + +/* + * look up the first VMA exactly that exactly matches addr + * - should be called with mm->mmap_sem at least held readlocked + */ +static struct vm_area_struct *find_vma_exact(struct mm_struct *mm, + unsigned long addr, + unsigned long len) +{ + struct vm_area_struct *vma; + struct rb_node *n = mm->mm_rb.rb_node; + unsigned long end = addr + len; + + /* check the cache first */ + vma = mm->mmap_cache; + if (vma && vma->vm_start == addr && vma->vm_end == end) + return vma; + + /* trawl the tree (there may be multiple mappings in which addr + * resides) */ + for (n = rb_first(&mm->mm_rb); n; n = rb_next(n)) { + vma = rb_entry(n, struct vm_area_struct, vm_rb); + if (vma->vm_start < addr) + continue; + if (vma->vm_start > addr) + return NULL; + if (vma->vm_end == end) { + mm->mmap_cache = vma; + return vma; + } + } + + return NULL; } /* @@ -595,7 +792,7 @@ static int validate_mmap_request(struct file *file, unsigned long pgoff, unsigned long *_capabilities) { - unsigned long capabilities; + unsigned long capabilities, rlen; unsigned long reqprot = prot; int ret; @@ -615,12 +812,12 @@ static int validate_mmap_request(struct file *file, return -EINVAL; /* Careful about overflows.. */ - len = PAGE_ALIGN(len); - if (!len || len > TASK_SIZE) + rlen = PAGE_ALIGN(len); + if (!rlen || rlen > TASK_SIZE) return -ENOMEM; /* offset overflow? 
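 * (i.e. the page offset plus the mapping size in pages must not wrap)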
*/ - if ((pgoff + (len >> PAGE_SHIFT)) < pgoff) + if ((pgoff + (rlen >> PAGE_SHIFT)) < pgoff) return -EOVERFLOW; if (file) { @@ -794,9 +991,10 @@ static unsigned long determine_vm_flags(struct file *file, } /* - * set up a shared mapping on a file + * set up a shared mapping on a file (the driver or filesystem provides and + * pins the storage) */ -static int do_mmap_shared_file(struct vm_area_struct *vma, unsigned long len) +static int do_mmap_shared_file(struct vm_area_struct *vma) { int ret; @@ -814,10 +1012,14 @@ static int do_mmap_shared_file(struct vm_area_struct *vma, unsigned long len) /* * set up a private mapping or an anonymous shared mapping */ -static int do_mmap_private(struct vm_area_struct *vma, unsigned long len) +static int do_mmap_private(struct vm_area_struct *vma, + struct vm_region *region, + unsigned long len) { + struct page *pages; + unsigned long total, point, n, rlen; void *base; - int ret; + int ret, order; /* invoke the file's mapping function so that it can keep track of * shared mappings on devices or memory @@ -836,23 +1038,46 @@ static int do_mmap_private(struct vm_area_struct *vma, unsigned long len) * make a private copy of the data and map that instead */ } + rlen = PAGE_ALIGN(len); + /* allocate some memory to hold the mapping * - note that this may not return a page-aligned address if the object * we're allocating is smaller than a page */ - base = kmalloc(len, GFP_KERNEL|__GFP_COMP); - if (!base) + order = get_order(rlen); + kdebug("alloc order %d for %lx", order, len); + + pages = alloc_pages(GFP_KERNEL, order); + if (!pages) goto enomem; - vma->vm_start = (unsigned long) base; - vma->vm_end = vma->vm_start + len; - vma->vm_flags |= VM_MAPPED_COPY; + /* we allocated a power-of-2 sized page set, so we need to trim off the + * excess */ + total = 1 << order; + atomic_add(total, &mmap_pages_allocated); + + point = rlen >> PAGE_SHIFT; + while (total > point) { + order = ilog2(total - point); + n = 1 << order; + kdebug("shave %lu/%lu @%lu", n, total - point, total); + atomic_sub(n, &mmap_pages_allocated); + total -= n; + set_page_refcounted(pages + total); + __free_pages(pages + total, order); + } + + total = rlen >> PAGE_SHIFT; + for (point = 1; point < total; point++) + set_page_refcounted(&pages[point]); -#ifdef WARN_ON_SLACK - if (len + WARN_ON_SLACK <= kobjsize(result)) - printk("Allocation of %lu bytes from process %d has %lu bytes of slack\n", - len, current->pid, kobjsize(result) - len); -#endif + base = page_address(pages); + region->vm_flags = vma->vm_flags |= VM_MAPPED_COPY; + region->vm_start = (unsigned long) base; + region->vm_end = region->vm_start + rlen; + + vma->vm_start = region->vm_start; + vma->vm_end = region->vm_start + len; if (vma->vm_file) { /* read the contents of a file into the copy */ @@ -864,26 +1089,27 @@ static int do_mmap_private(struct vm_area_struct *vma, unsigned long len) old_fs = get_fs(); set_fs(KERNEL_DS); - ret = vma->vm_file->f_op->read(vma->vm_file, base, len, &fpos); + ret = vma->vm_file->f_op->read(vma->vm_file, base, rlen, &fpos); set_fs(old_fs); if (ret < 0) goto error_free; /* clear the last little bit */ - if (ret < len) - memset(base + ret, 0, len - ret); + if (ret < rlen) + memset(base + ret, 0, rlen - ret); } else { /* if it's an anonymous mapping, then just clear it */ - memset(base, 0, len); + memset(base, 0, rlen); } return 0; error_free: - kfree(base); - vma->vm_start = 0; + free_page_series(region->vm_start, region->vm_end); + region->vm_start = vma->vm_start = 0; + region->vm_end = vma->vm_end = 
0; return ret; enomem: @@ -903,13 +1129,14 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long flags, unsigned long pgoff) { - struct vm_list_struct *vml = NULL; - struct vm_area_struct *vma = NULL; + struct vm_area_struct *vma; + struct vm_region *region; struct rb_node *rb; - unsigned long capabilities, vm_flags; - void *result; + unsigned long capabilities, vm_flags, result; int ret; + kenter(",%lx,%lx,%lx,%lx,%lx", addr, len, prot, flags, pgoff); + if (!(flags & MAP_FIXED)) addr = round_hint_to_min(addr); @@ -917,73 +1144,120 @@ unsigned long do_mmap_pgoff(struct file *file, * mapping */ ret = validate_mmap_request(file, addr, len, prot, flags, pgoff, &capabilities); - if (ret < 0) + if (ret < 0) { + kleave(" = %d [val]", ret); return ret; + } /* we've determined that we can make the mapping, now translate what we * now know into VMA flags */ vm_flags = determine_vm_flags(file, prot, flags, capabilities); - /* we're going to need to record the mapping if it works */ - vml = kzalloc(sizeof(struct vm_list_struct), GFP_KERNEL); - if (!vml) - goto error_getting_vml; + /* we're going to need to record the mapping */ + region = kmem_cache_zalloc(vm_region_jar, GFP_KERNEL); + if (!region) + goto error_getting_region; + + vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL); + if (!vma) + goto error_getting_vma; + + atomic_set(®ion->vm_usage, 1); + region->vm_flags = vm_flags; + region->vm_pgoff = pgoff; - down_write(&nommu_vma_sem); + INIT_LIST_HEAD(&vma->anon_vma_node); + vma->vm_flags = vm_flags; + vma->vm_pgoff = pgoff; - /* if we want to share, we need to check for VMAs created by other + if (file) { + region->vm_file = file; + get_file(file); + vma->vm_file = file; + get_file(file); + if (vm_flags & VM_EXECUTABLE) { + added_exe_file_vma(current->mm); + vma->vm_mm = current->mm; + } + } + + down_write(&nommu_region_sem); + + /* if we want to share, we need to check for regions created by other * mmap() calls that overlap with our proposed mapping - * - we can only share with an exact match on most regular files + * - we can only share with a superset match on most regular files * - shared mappings on character devices and memory backed files are * permitted to overlap inexactly as far as we are concerned for in * these cases, sharing is handled in the driver or filesystem rather * than here */ if (vm_flags & VM_MAYSHARE) { - unsigned long pglen = (len + PAGE_SIZE - 1) >> PAGE_SHIFT; - unsigned long vmpglen; + struct vm_region *pregion; + unsigned long pglen, rpglen, pgend, rpgend, start; - /* suppress VMA sharing for shared regions */ - if (vm_flags & VM_SHARED && - capabilities & BDI_CAP_MAP_DIRECT) - goto dont_share_VMAs; + pglen = (len + PAGE_SIZE - 1) >> PAGE_SHIFT; + pgend = pgoff + pglen; - for (rb = rb_first(&nommu_vma_tree); rb; rb = rb_next(rb)) { - vma = rb_entry(rb, struct vm_area_struct, vm_rb); + for (rb = rb_first(&nommu_region_tree); rb; rb = rb_next(rb)) { + pregion = rb_entry(rb, struct vm_region, vm_rb); - if (!(vma->vm_flags & VM_MAYSHARE)) + if (!(pregion->vm_flags & VM_MAYSHARE)) continue; /* search for overlapping mappings on the same file */ - if (vma->vm_file->f_path.dentry->d_inode != file->f_path.dentry->d_inode) + if (pregion->vm_file->f_path.dentry->d_inode != + file->f_path.dentry->d_inode) continue; - if (vma->vm_pgoff >= pgoff + pglen) + if (pregion->vm_pgoff >= pgend) continue; - vmpglen = vma->vm_end - vma->vm_start + PAGE_SIZE - 1; - vmpglen >>= PAGE_SHIFT; - if (pgoff >= vma->vm_pgoff + vmpglen) + rpglen = pregion->vm_end - pregion->vm_start; 
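+ /* round the region's byte length up to a whole number of
+  * pages, so that a partial final page still counts as overlap */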
+ rpglen = (rpglen + PAGE_SIZE - 1) >> PAGE_SHIFT; + rpgend = pregion->vm_pgoff + rpglen; + if (pgoff >= rpgend) continue; - /* handle inexactly overlapping matches between mappings */ - if (vma->vm_pgoff != pgoff || vmpglen != pglen) { + /* handle inexactly overlapping matches between + * mappings */ + if ((pregion->vm_pgoff != pgoff || rpglen != pglen) && + !(pgoff >= pregion->vm_pgoff && pgend <= rpgend)) { + /* new mapping is not a subset of the region */ if (!(capabilities & BDI_CAP_MAP_DIRECT)) goto sharing_violation; continue; } - /* we've found a VMA we can share */ - atomic_inc(&vma->vm_usage); - - vml->vma = vma; - result = (void *) vma->vm_start; - goto shared; + /* we've found a region we can share */ + atomic_inc(&pregion->vm_usage); + vma->vm_region = pregion; + start = pregion->vm_start; + start += (pgoff - pregion->vm_pgoff) << PAGE_SHIFT; + vma->vm_start = start; + vma->vm_end = start + len; + + if (pregion->vm_flags & VM_MAPPED_COPY) { + kdebug("share copy"); + vma->vm_flags |= VM_MAPPED_COPY; + } else { + kdebug("share mmap"); + ret = do_mmap_shared_file(vma); + if (ret < 0) { + vma->vm_region = NULL; + vma->vm_start = 0; + vma->vm_end = 0; + atomic_dec(&pregion->vm_usage); + pregion = NULL; + goto error_just_free; + } + } + fput(region->vm_file); + kmem_cache_free(vm_region_jar, region); + region = pregion; + result = start; + goto share; } - dont_share_VMAs: - vma = NULL; - /* obtain the address at which to make a shared mapping * - this is the hook for quasi-memory character devices to * tell us the location of a shared mapping @@ -994,102 +1268,93 @@ unsigned long do_mmap_pgoff(struct file *file, if (IS_ERR((void *) addr)) { ret = addr; if (ret != (unsigned long) -ENOSYS) - goto error; + goto error_just_free; /* the driver refused to tell us where to site * the mapping so we'll have to attempt to copy * it */ ret = (unsigned long) -ENODEV; if (!(capabilities & BDI_CAP_MAP_COPY)) - goto error; + goto error_just_free; capabilities &= ~BDI_CAP_MAP_DIRECT; + } else { + vma->vm_start = region->vm_start = addr; + vma->vm_end = region->vm_end = addr + len; } } } - /* we're going to need a VMA struct as well */ - vma = kzalloc(sizeof(struct vm_area_struct), GFP_KERNEL); - if (!vma) - goto error_getting_vma; - - INIT_LIST_HEAD(&vma->anon_vma_node); - atomic_set(&vma->vm_usage, 1); - if (file) { - get_file(file); - if (vm_flags & VM_EXECUTABLE) { - added_exe_file_vma(current->mm); - vma->vm_mm = current->mm; - } - } - vma->vm_file = file; - vma->vm_flags = vm_flags; - vma->vm_start = addr; - vma->vm_end = addr + len; - vma->vm_pgoff = pgoff; - - vml->vma = vma; + vma->vm_region = region; /* set up the mapping */ if (file && vma->vm_flags & VM_SHARED) - ret = do_mmap_shared_file(vma, len); + ret = do_mmap_shared_file(vma); else - ret = do_mmap_private(vma, len); + ret = do_mmap_private(vma, region, len); if (ret < 0) - goto error; + goto error_put_region; + + add_nommu_region(region); /* okay... 
we have a mapping; now we have to register it */ - result = (void *) vma->vm_start; + result = vma->vm_start; current->mm->total_vm += len >> PAGE_SHIFT; - add_nommu_vma(vma); +share: + add_vma_to_mm(current->mm, vma); - shared: - add_vma_to_mm(current->mm, vml); - - up_write(&nommu_vma_sem); + up_write(&nommu_region_sem); if (prot & PROT_EXEC) - flush_icache_range((unsigned long) result, - (unsigned long) result + len); + flush_icache_range(result, result + len); -#ifdef DEBUG - printk("do_mmap:\n"); - show_process_blocks(); -#endif + kleave(" = %lx", result); + return result; - return (unsigned long) result; - - error: - up_write(&nommu_vma_sem); - kfree(vml); +error_put_region: + __put_nommu_region(region); if (vma) { if (vma->vm_file) { fput(vma->vm_file); if (vma->vm_flags & VM_EXECUTABLE) removed_exe_file_vma(vma->vm_mm); } - kfree(vma); + kmem_cache_free(vm_area_cachep, vma); } + kleave(" = %d [pr]", ret); return ret; - sharing_violation: - up_write(&nommu_vma_sem); - printk("Attempt to share mismatched mappings\n"); - kfree(vml); - return -EINVAL; +error_just_free: + up_write(&nommu_region_sem); +error: + fput(region->vm_file); + kmem_cache_free(vm_region_jar, region); + fput(vma->vm_file); + if (vma->vm_flags & VM_EXECUTABLE) + removed_exe_file_vma(vma->vm_mm); + kmem_cache_free(vm_area_cachep, vma); + kleave(" = %d", ret); + return ret; + +sharing_violation: + up_write(&nommu_region_sem); + printk(KERN_WARNING "Attempt to share mismatched mappings\n"); + ret = -EINVAL; + goto error; - error_getting_vma: - up_write(&nommu_vma_sem); - kfree(vml); - printk("Allocation of vma for %lu byte allocation from process %d failed\n", +error_getting_vma: + kmem_cache_free(vm_region_jar, region); + printk(KERN_WARNING "Allocation of vma for %lu byte allocation" + " from process %d failed\n", len, current->pid); show_free_areas(); return -ENOMEM; - error_getting_vml: - printk("Allocation of vml for %lu byte allocation from process %d failed\n", +error_getting_region: + printk(KERN_WARNING "Allocation of vm region for %lu byte allocation" + " from process %d failed\n", len, current->pid); show_free_areas(); return -ENOMEM; @@ -1097,77 +1362,180 @@ unsigned long do_mmap_pgoff(struct file *file, EXPORT_SYMBOL(do_mmap_pgoff); /* - * handle mapping disposal for uClinux + * split a vma into two pieces at address 'addr', a new vma is allocated either + * for the first part or the tail. 
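+ * Only anonymous regions with a single owner may be split; the backing
+ * region is split into two to match the resulting VMAs.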
*/ -static void put_vma(struct mm_struct *mm, struct vm_area_struct *vma) +int split_vma(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long addr, int new_below) { - if (vma) { - down_write(&nommu_vma_sem); + struct vm_area_struct *new; + struct vm_region *region; + unsigned long npages; - if (atomic_dec_and_test(&vma->vm_usage)) { - delete_nommu_vma(vma); + kenter(""); - if (vma->vm_ops && vma->vm_ops->close) - vma->vm_ops->close(vma); + /* we're only permitted to split anonymous regions that have a single + * owner */ + if (vma->vm_file || + atomic_read(&vma->vm_region->vm_usage) != 1) + return -ENOMEM; - /* IO memory and memory shared directly out of the pagecache from - * ramfs/tmpfs mustn't be released here */ - if (vma->vm_flags & VM_MAPPED_COPY) - kfree((void *) vma->vm_start); + if (mm->map_count >= sysctl_max_map_count) + return -ENOMEM; - if (vma->vm_file) { - fput(vma->vm_file); - if (vma->vm_flags & VM_EXECUTABLE) - removed_exe_file_vma(mm); - } - kfree(vma); - } + region = kmem_cache_alloc(vm_region_jar, GFP_KERNEL); + if (!region) + return -ENOMEM; + + new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL); + if (!new) { + kmem_cache_free(vm_region_jar, region); + return -ENOMEM; + } + + /* most fields are the same, copy all, and then fixup */ + *new = *vma; + *region = *vma->vm_region; + new->vm_region = region; + + npages = (addr - vma->vm_start) >> PAGE_SHIFT; + + if (new_below) { + region->vm_end = new->vm_end = addr; + } else { + region->vm_start = new->vm_start = addr; + region->vm_pgoff = new->vm_pgoff += npages; + } - up_write(&nommu_vma_sem); + if (new->vm_ops && new->vm_ops->open) + new->vm_ops->open(new); + + delete_vma_from_mm(vma); + down_write(&nommu_region_sem); + delete_nommu_region(vma->vm_region); + if (new_below) { + vma->vm_region->vm_start = vma->vm_start = addr; + vma->vm_region->vm_pgoff = vma->vm_pgoff += npages; + } else { + vma->vm_region->vm_end = vma->vm_end = addr; } + add_nommu_region(vma->vm_region); + add_nommu_region(new->vm_region); + up_write(&nommu_region_sem); + add_vma_to_mm(mm, vma); + add_vma_to_mm(mm, new); + return 0; } /* - * release a mapping - * - under NOMMU conditions the parameters must match exactly to the mapping to - * be removed + * shrink a VMA by removing the specified chunk from either the beginning or + * the end */ -int do_munmap(struct mm_struct *mm, unsigned long addr, size_t len) +static int shrink_vma(struct mm_struct *mm, + struct vm_area_struct *vma, + unsigned long from, unsigned long to) { - struct vm_list_struct *vml, **parent; - unsigned long end = addr + len; + struct vm_region *region; -#ifdef DEBUG - printk("do_munmap:\n"); -#endif + kenter(""); - for (parent = &mm->context.vmlist; *parent; parent = &(*parent)->next) { - if ((*parent)->vma->vm_start > addr) - break; - if ((*parent)->vma->vm_start == addr && - ((len == 0) || ((*parent)->vma->vm_end == end))) - goto found; - } + /* adjust the VMA's pointers, which may reposition it in the MM's tree + * and list */ + delete_vma_from_mm(vma); + if (from > vma->vm_start) + vma->vm_end = from; + else + vma->vm_start = to; + add_vma_to_mm(mm, vma); - printk("munmap of non-mmaped memory by process %d (%s): %p\n", - current->pid, current->comm, (void *) addr); - return -EINVAL; + /* cut the backing region down to size */ + region = vma->vm_region; + BUG_ON(atomic_read(®ion->vm_usage) != 1); - found: - vml = *parent; + down_write(&nommu_region_sem); + delete_nommu_region(region); + if (from > region->vm_start) + region->vm_end = from; + else + 
region->vm_start = to; + add_nommu_region(region); + up_write(&nommu_region_sem); - put_vma(mm, vml->vma); + free_page_series(from, to); + return 0; +} - *parent = vml->next; - kfree(vml); +/* + * release a mapping + * - under NOMMU conditions the chunk to be unmapped must be backed by a single + * VMA, though it need not cover the whole VMA + */ +int do_munmap(struct mm_struct *mm, unsigned long start, size_t len) +{ + struct vm_area_struct *vma; + struct rb_node *rb; + unsigned long end = start + len; + int ret; - update_hiwater_vm(mm); - mm->total_vm -= len >> PAGE_SHIFT; + kenter(",%lx,%zx", start, len); -#ifdef DEBUG - show_process_blocks(); -#endif + if (len == 0) + return -EINVAL; + + /* find the first potentially overlapping VMA */ + vma = find_vma(mm, start); + if (!vma) { + printk(KERN_WARNING + "munmap of memory not mmapped by process %d (%s):" + " 0x%lx-0x%lx\n", + current->pid, current->comm, start, start + len - 1); + return -EINVAL; + } + /* we're allowed to split an anonymous VMA but not a file-backed one */ + if (vma->vm_file) { + do { + if (start > vma->vm_start) { + kleave(" = -EINVAL [miss]"); + return -EINVAL; + } + if (end == vma->vm_end) + goto erase_whole_vma; + rb = rb_next(&vma->vm_rb); + vma = rb_entry(rb, struct vm_area_struct, vm_rb); + } while (rb); + kleave(" = -EINVAL [split file]"); + return -EINVAL; + } else { + /* the chunk must be a subset of the VMA found */ + if (start == vma->vm_start && end == vma->vm_end) + goto erase_whole_vma; + if (start < vma->vm_start || end > vma->vm_end) { + kleave(" = -EINVAL [superset]"); + return -EINVAL; + } + if (start & ~PAGE_MASK) { + kleave(" = -EINVAL [unaligned start]"); + return -EINVAL; + } + if (end != vma->vm_end && end & ~PAGE_MASK) { + kleave(" = -EINVAL [unaligned split]"); + return -EINVAL; + } + if (start != vma->vm_start && end != vma->vm_end) { + ret = split_vma(mm, vma, start, 1); + if (ret < 0) { + kleave(" = %d [split]", ret); + return ret; + } + } + return shrink_vma(mm, vma, start, end); + } + +erase_whole_vma: + delete_vma_from_mm(vma); + delete_vma(mm, vma); + kleave(" = 0"); return 0; } EXPORT_SYMBOL(do_munmap); @@ -1184,29 +1552,26 @@ asmlinkage long sys_munmap(unsigned long addr, size_t len) } /* - * Release all mappings + * release all the mappings made in a process's VM space */ -void exit_mmap(struct mm_struct * mm) +void exit_mmap(struct mm_struct *mm) { - struct vm_list_struct *tmp; + struct vm_area_struct *vma; - if (mm) { -#ifdef DEBUG - printk("Exit_mmap:\n"); -#endif + if (!mm) + return; - mm->total_vm = 0; + kenter(""); - while ((tmp = mm->context.vmlist)) { - mm->context.vmlist = tmp->next; - put_vma(mm, tmp->vma); - kfree(tmp); - } + mm->total_vm = 0; -#ifdef DEBUG - show_process_blocks(); -#endif + while ((vma = mm->mmap)) { + mm->mmap = vma->vm_next; + delete_vma_from_mm(vma); + delete_vma(mm, vma); } + + kleave(""); } unsigned long do_brk(unsigned long addr, unsigned long len) @@ -1219,8 +1584,8 @@ unsigned long do_brk(unsigned long addr, unsigned long len) * time (controlled by the MREMAP_MAYMOVE flag and available VM space) * * under NOMMU conditions, we only permit changing a mapping's size, and only - * as long as it stays within the hole allocated by the kmalloc() call in - * do_mmap_pgoff() and the block is not shareable + * as long as it stays within the region allocated by do_mmap_private() and the + * block is not shareable * * MREMAP_FIXED is not supported under NOMMU conditions */ @@ -1231,13 +1596,16 @@ unsigned long do_mremap(unsigned long addr, struct vm_area_struct 
*vma; /* insanity checks first */ - if (new_len == 0) + if (old_len == 0 || new_len == 0) return (unsigned long) -EINVAL; + if (addr & ~PAGE_MASK) + return -EINVAL; + if (flags & MREMAP_FIXED && new_addr != addr) return (unsigned long) -EINVAL; - vma = find_vma_exact(current->mm, addr); + vma = find_vma_exact(current->mm, addr, old_len); if (!vma) return (unsigned long) -EINVAL; @@ -1247,19 +1615,19 @@ unsigned long do_mremap(unsigned long addr, if (vma->vm_flags & VM_MAYSHARE) return (unsigned long) -EPERM; - if (new_len > kobjsize((void *) addr)) + if (new_len > vma->vm_region->vm_end - vma->vm_region->vm_start) return (unsigned long) -ENOMEM; /* all checks complete - do it */ vma->vm_end = vma->vm_start + new_len; - return vma->vm_start; } EXPORT_SYMBOL(do_mremap); -asmlinkage unsigned long sys_mremap(unsigned long addr, - unsigned long old_len, unsigned long new_len, - unsigned long flags, unsigned long new_addr) +asmlinkage +unsigned long sys_mremap(unsigned long addr, + unsigned long old_len, unsigned long new_len, + unsigned long flags, unsigned long new_addr) { unsigned long ret; -- cgit v1.2.3 From dd8632a12e500a684478fea0951f380478d56fed Mon Sep 17 00:00:00 2001 From: Paul Mundt Date: Thu, 8 Jan 2009 12:04:47 +0000 Subject: NOMMU: Make mmap allocation page trimming behaviour configurable. NOMMU mmap allocates a piece of memory for an mmap that's rounded up in size to the nearest power-of-2 number of pages. Currently it then discards the excess pages back to the page allocator, making that memory available for use by other things. This can, however, cause greater amount of fragmentation. To counter this, a sysctl is added in order to fine-tune the trimming behaviour. The default behaviour remains to trim pages aggressively, while this can either be disabled completely or set to a higher page-granular watermark in order to have finer-grained control. vm region vm_top bits taken from an earlier patch by David Howells. Signed-off-by: Paul Mundt Signed-off-by: David Howells Tested-by: Mike Frysinger --- Documentation/nommu-mmap.txt | 15 ++++++++++ Documentation/sysctl/vm.txt | 18 ++++++++++++ include/linux/mm_types.h | 1 + kernel/sysctl.c | 14 ++++++++++ mm/nommu.c | 65 ++++++++++++++++++++++++++++---------------- 5 files changed, 90 insertions(+), 23 deletions(-) (limited to 'Documentation') diff --git a/Documentation/nommu-mmap.txt b/Documentation/nommu-mmap.txt index 02b89dcf38ac..b565e8279d13 100644 --- a/Documentation/nommu-mmap.txt +++ b/Documentation/nommu-mmap.txt @@ -248,3 +248,18 @@ PROVIDING SHAREABLE BLOCK DEVICE SUPPORT Provision of shared mappings on block device files is exactly the same as for character devices. If there isn't a real device underneath, then the driver should allocate sufficient contiguous memory to honour any supported mapping. + + +================================= +ADJUSTING PAGE TRIMMING BEHAVIOUR +================================= + +NOMMU mmap automatically rounds up to the nearest power-of-2 number of pages +when performing an allocation. This can have adverse effects on memory +fragmentation, and as such, is left configurable. The default behaviour is to +aggressively trim allocations and discard any excess pages back in to the page +allocator. In order to retain finer-grained control over fragmentation, this +behaviour can either be disabled completely, or bumped up to a higher page +watermark where trimming begins. + +Page trimming behaviour is configurable via the sysctl `vm.nr_trim_pages'. 
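The trimming watermark described above can be probed and tuned from userspace through procfs. Below is a minimal C sketch, not part of the patch itself; it assumes the usual /proc/sys/vm/nr_trim_pages mapping of the sysctl, an illustrative watermark of 8 pages, and root privileges for the write:

/* query vm.nr_trim_pages, then raise the watermark so that the excess
 * pages of an allocation are only trimmed when there are at least 8 of
 * them (a value of 0 would disable trimming entirely) */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/vm/nr_trim_pages", "r+");
	int cur;

	if (!f) {
		perror("nr_trim_pages");	/* only present when CONFIG_MMU=n */
		return EXIT_FAILURE;
	}
	if (fscanf(f, "%d", &cur) == 1)
		printf("current trim watermark: %d page(s)\n", cur);
	rewind(f);			/* reposition between read and write */
	fprintf(f, "8\n");
	fclose(f);
	return EXIT_SUCCESS;
}

Leaving trimming disabled trades memory for less fragmentation, since each mapping then holds on to its full power-of-2 allocation.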
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index cd05994a49e6..a3415070bcac 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -38,6 +38,7 @@ Currently, these files are in /proc/sys/vm: - numa_zonelist_order - nr_hugepages - nr_overcommit_hugepages +- nr_trim_pages (only if CONFIG_MMU=n) ============================================================== @@ -348,3 +349,20 @@ Change the maximum size of the hugepage pool. The maximum is nr_hugepages + nr_overcommit_hugepages. See Documentation/vm/hugetlbpage.txt + +============================================================== + +nr_trim_pages + +This is available only on NOMMU kernels. + +This value adjusts the excess page trimming behaviour of power-of-2 aligned +NOMMU mmap allocations. + +A value of 0 disables trimming of allocations entirely, while a value of 1 +trims excess pages aggressively. Any value >= 1 acts as the watermark where +trimming of allocations is initiated. + +The default value is 1. + +See Documentation/nommu-mmap.txt for more information. diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 1c1e0d3a1714..92915e81443f 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -106,6 +106,7 @@ struct vm_region { unsigned long vm_flags; /* VMA vm_flags */ unsigned long vm_start; /* start address of region */ unsigned long vm_end; /* region initialised to here */ + unsigned long vm_top; /* region allocated to here */ unsigned long vm_pgoff; /* the offset in vm_file corresponding to vm_start */ struct file *vm_file; /* the backing file or NULL */ diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 92f6e5bc3c24..89d74436318c 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -82,6 +82,9 @@ extern int percpu_pagelist_fraction; extern int compat_log; extern int latencytop_enabled; extern int sysctl_nr_open_min, sysctl_nr_open_max; +#ifndef CONFIG_MMU +extern int sysctl_nr_trim_pages; +#endif #ifdef CONFIG_RCU_TORTURE_TEST extern int rcutorture_runnable; #endif /* #ifdef CONFIG_RCU_TORTURE_TEST */ @@ -1102,6 +1105,17 @@ static struct ctl_table vm_table[] = { .mode = 0644, .proc_handler = &proc_dointvec }, +#else + { + .ctl_name = CTL_UNNUMBERED, + .procname = "nr_trim_pages", + .data = &sysctl_nr_trim_pages, + .maxlen = sizeof(sysctl_nr_trim_pages), + .mode = 0644, + .proc_handler = &proc_dointvec_minmax, + .strategy = &sysctl_intvec, + .extra1 = &zero, + }, #endif { .ctl_name = VM_LAPTOP_MODE, diff --git a/mm/nommu.c b/mm/nommu.c index 0d363dfcf10e..a6e8ccfbd400 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -10,7 +10,7 @@ * Copyright (c) 2000-2003 David McCullough * Copyright (c) 2000-2001 D Jeff Dionne * Copyright (c) 2002 Greg Ungerer - * Copyright (c) 2007 Paul Mundt + * Copyright (c) 2007-2008 Paul Mundt */ #include @@ -66,6 +66,7 @@ atomic_long_t vm_committed_space = ATOMIC_LONG_INIT(0); int sysctl_overcommit_memory = OVERCOMMIT_GUESS; /* heuristic overcommit */ int sysctl_overcommit_ratio = 50; /* default is 50% */ int sysctl_max_map_count = DEFAULT_MAX_MAP_COUNT; +int sysctl_nr_trim_pages = 1; /* page trimming behaviour */ int heap_stack_gap = 0; atomic_t mmap_pages_allocated; @@ -455,6 +456,8 @@ static noinline void validate_nommu_regions(void) last = rb_entry(lastp, struct vm_region, vm_rb); if (unlikely(last->vm_end <= last->vm_start)) BUG(); + if (unlikely(last->vm_top < last->vm_end)) + BUG(); while ((p = rb_next(lastp))) { region = rb_entry(p, struct vm_region, vm_rb); @@ -462,7 +465,9 @@ static noinline void 
validate_nommu_regions(void) if (unlikely(region->vm_end <= region->vm_start)) BUG(); - if (unlikely(region->vm_start < last->vm_end)) + if (unlikely(region->vm_top < region->vm_end)) + BUG(); + if (unlikely(region->vm_start < last->vm_top)) BUG(); lastp = p; @@ -536,7 +541,7 @@ static void free_page_series(unsigned long from, unsigned long to) /* * release a reference to a region * - the caller must hold the region semaphore, which this releases - * - the region may not have been added to the tree yet, in which case vm_end + * - the region may not have been added to the tree yet, in which case vm_top * will equal vm_start */ static void __put_nommu_region(struct vm_region *region) @@ -547,7 +552,7 @@ static void __put_nommu_region(struct vm_region *region) BUG_ON(!nommu_region_tree.rb_node); if (atomic_dec_and_test(®ion->vm_usage)) { - if (region->vm_end > region->vm_start) + if (region->vm_top > region->vm_start) delete_nommu_region(region); up_write(&nommu_region_sem); @@ -558,7 +563,7 @@ static void __put_nommu_region(struct vm_region *region) * from ramfs/tmpfs mustn't be released here */ if (region->vm_flags & VM_MAPPED_COPY) { kdebug("free series"); - free_page_series(region->vm_start, region->vm_end); + free_page_series(region->vm_start, region->vm_top); } kmem_cache_free(vm_region_jar, region); } else { @@ -999,6 +1004,10 @@ static int do_mmap_shared_file(struct vm_area_struct *vma) int ret; ret = vma->vm_file->f_op->mmap(vma->vm_file, vma); + if (ret == 0) { + vma->vm_region->vm_top = vma->vm_region->vm_end; + return ret; + } if (ret != -ENOSYS) return ret; @@ -1027,11 +1036,14 @@ static int do_mmap_private(struct vm_area_struct *vma, */ if (vma->vm_file) { ret = vma->vm_file->f_op->mmap(vma->vm_file, vma); - if (ret != -ENOSYS) { + if (ret == 0) { /* shouldn't return success if we're not sharing */ - BUG_ON(ret == 0 && !(vma->vm_flags & VM_MAYSHARE)); - return ret; /* success or a real error */ + BUG_ON(!(vma->vm_flags & VM_MAYSHARE)); + vma->vm_region->vm_top = vma->vm_region->vm_end; + return ret; } + if (ret != -ENOSYS) + return ret; /* getting an ENOSYS error indicates that direct mmap isn't * possible (as opposed to tried but failed) so we'll try to @@ -1051,23 +1063,25 @@ static int do_mmap_private(struct vm_area_struct *vma, if (!pages) goto enomem; - /* we allocated a power-of-2 sized page set, so we need to trim off the - * excess */ total = 1 << order; atomic_add(total, &mmap_pages_allocated); point = rlen >> PAGE_SHIFT; - while (total > point) { - order = ilog2(total - point); - n = 1 << order; - kdebug("shave %lu/%lu @%lu", n, total - point, total); - atomic_sub(n, &mmap_pages_allocated); - total -= n; - set_page_refcounted(pages + total); - __free_pages(pages + total, order); + + /* we allocated a power-of-2 sized page set, so we may want to trim off + * the excess */ + if (sysctl_nr_trim_pages && total - point >= sysctl_nr_trim_pages) { + while (total > point) { + order = ilog2(total - point); + n = 1 << order; + kdebug("shave %lu/%lu @%lu", n, total - point, total); + atomic_sub(n, &mmap_pages_allocated); + total -= n; + set_page_refcounted(pages + total); + __free_pages(pages + total, order); + } } - total = rlen >> PAGE_SHIFT; for (point = 1; point < total; point++) set_page_refcounted(&pages[point]); @@ -1075,6 +1089,7 @@ static int do_mmap_private(struct vm_area_struct *vma, region->vm_flags = vma->vm_flags |= VM_MAPPED_COPY; region->vm_start = (unsigned long) base; region->vm_end = region->vm_start + rlen; + region->vm_top = region->vm_start + (total << 
PAGE_SHIFT); vma->vm_start = region->vm_start; vma->vm_end = region->vm_start + len; @@ -1110,6 +1125,7 @@ error_free: free_page_series(region->vm_start, region->vm_end); region->vm_start = vma->vm_start = 0; region->vm_end = vma->vm_end = 0; + region->vm_top = 0; return ret; enomem: @@ -1401,7 +1417,7 @@ int split_vma(struct mm_struct *mm, struct vm_area_struct *vma, npages = (addr - vma->vm_start) >> PAGE_SHIFT; if (new_below) { - region->vm_end = new->vm_end = addr; + region->vm_top = region->vm_end = new->vm_end = addr; } else { region->vm_start = new->vm_start = addr; region->vm_pgoff = new->vm_pgoff += npages; @@ -1418,6 +1434,7 @@ int split_vma(struct mm_struct *mm, struct vm_area_struct *vma, vma->vm_region->vm_pgoff = vma->vm_pgoff += npages; } else { vma->vm_region->vm_end = vma->vm_end = addr; + vma->vm_region->vm_top = addr; } add_nommu_region(vma->vm_region); add_nommu_region(new->vm_region); @@ -1454,10 +1471,12 @@ static int shrink_vma(struct mm_struct *mm, down_write(&nommu_region_sem); delete_nommu_region(region); - if (from > region->vm_start) - region->vm_end = from; - else + if (from > region->vm_start) { + to = region->vm_top; + region->vm_top = region->vm_end = from; + } else { region->vm_start = to; + } add_nommu_region(region); up_write(&nommu_region_sem); -- cgit v1.2.3 From 237889bf0a62f1399fb2ba0c2a259e6a96597131 Mon Sep 17 00:00:00 2001 From: Zhao Yakui Date: Wed, 17 Dec 2008 16:55:18 +0800 Subject: ACPI : Use RSDT instead of XSDT by adding boot option of "acpi=rsdt" On some boxes there exist both RSDT and XSDT table. But unfortunately sometimes there exists the following error when XSDT table is used: a. 32/64X address mismatch b. The 32/64X FACS address mismatch In such case the boot option of "acpi=rsdt" is provided so that RSDT is tried instead of XSDT table when the system can't work well. http://bugzilla.kernel.org/show_bug.cgi?id=8246 Signed-off-by: Zhao Yakui cc:Thomas Renninger Signed-off-by: Len Brown --- Documentation/kernel-parameters.txt | 1 + arch/ia64/kernel/acpi.c | 1 + arch/x86/kernel/acpi/boot.c | 6 +++++- drivers/acpi/tables/tbutils.c | 3 ++- include/acpi/acpixf.h | 1 + 5 files changed, 10 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index c9115c1b672c..136f02842de9 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -139,6 +139,7 @@ and is between 256 and 4096 characters. It is defined in the file ht -- run only enough ACPI to enable Hyper Threading strict -- Be less tolerant of platforms that are not strictly ACPI specification compliant. 
+ rsdt -- prefer RSDT over (default) XSDT See also Documentation/power/pm.txt, pci=noacpi diff --git a/arch/ia64/kernel/acpi.c b/arch/ia64/kernel/acpi.c index bd7acc71e8a9..c19b686db9b8 100644 --- a/arch/ia64/kernel/acpi.c +++ b/arch/ia64/kernel/acpi.c @@ -65,6 +65,7 @@ EXPORT_SYMBOL(pm_idle); void (*pm_power_off) (void); EXPORT_SYMBOL(pm_power_off); +u32 acpi_rsdt_forced; unsigned int acpi_cpei_override; unsigned int acpi_cpei_phys_cpuid; diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c index 4c51a2f8fd31..db1a90a76b3e 100644 --- a/arch/x86/kernel/acpi/boot.c +++ b/arch/x86/kernel/acpi/boot.c @@ -47,7 +47,7 @@ #endif static int __initdata acpi_force = 0; - +u32 acpi_rsdt_forced; #ifdef CONFIG_ACPI int acpi_disabled = 0; #else @@ -1783,6 +1783,10 @@ static int __init parse_acpi(char *arg) disable_acpi(); acpi_ht = 1; } + /* acpi=rsdt use RSDT instead of XSDT */ + else if (strcmp(arg, "rsdt") == 0) { + acpi_rsdt_forced = 1; + } /* "acpi=noirq" disables ACPI interrupt routing */ else if (strcmp(arg, "noirq") == 0) { acpi_noirq_set(); diff --git a/drivers/acpi/tables/tbutils.c b/drivers/acpi/tables/tbutils.c index 0cc92ef5236f..da9f240186e8 100644 --- a/drivers/acpi/tables/tbutils.c +++ b/drivers/acpi/tables/tbutils.c @@ -420,7 +420,8 @@ acpi_tb_parse_root_table(acpi_physical_address rsdp_address, u8 flags) /* Differentiate between RSDT and XSDT root tables */ - if (rsdp->revision > 1 && rsdp->xsdt_physical_address) { + if (rsdp->revision > 1 && rsdp->xsdt_physical_address + && !acpi_rsdt_forced) { /* * Root table is an XSDT (64-bit physical addresses). We must use the * XSDT if the revision is > 1 and the XSDT pointer is present, as per diff --git a/include/acpi/acpixf.h b/include/acpi/acpixf.h index 33bc0e3b1954..05d2614e0078 100644 --- a/include/acpi/acpixf.h +++ b/include/acpi/acpixf.h @@ -48,6 +48,7 @@ #include "actypes.h" #include "actbl.h" +extern u32 acpi_rsdt_forced; /* * Global interfaces */ -- cgit v1.2.3 From 555d61d6542d51563e50532ff604dcd31c96fb24 Mon Sep 17 00:00:00 2001 From: Hendrik Brueckner Date: Fri, 9 Jan 2009 12:15:02 +0100 Subject: [S390] update documentation for hvc_iucv kernel parameter. Signed-off-by: Hendrik Brueckner Signed-off-by: Martin Schwidefsky --- Documentation/kernel-parameters.txt | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index fb849020aea9..ed0a72442cf3 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -829,8 +829,8 @@ and is between 256 and 4096 characters. It is defined in the file hlt [BUGS=ARM,SH] - hvc_iucv= [S390] Number of z/VM IUCV Hypervisor console (HVC) - back-ends. Valid parameters: 0..8 + hvc_iucv= [S390] Number of z/VM IUCV hypervisor console (HVC) + terminal devices. Valid values: 0..8 i8042.debug [HW] Toggle i8042 debug mode i8042.direct [HW] Put keyboard port into non-translated mode -- cgit v1.2.3 From c4be0c1dc4cdc37b175579be1460f15ac6495e9a Mon Sep 17 00:00:00 2001 From: Takashi Sato Date: Fri, 9 Jan 2009 16:40:58 -0800 Subject: filesystem freeze: add error handling of write_super_lockfs/unlockfs Currently, ext3 in mainline Linux doesn't have the freeze feature which suspends write requests. So, we cannot take a backup which keeps the filesystem's consistency with the storage device's features (snapshot and replication) while it is mounted. In many case, a commercial filesystem (e.g. 
VxFS) has the freeze feature and it would be used to get a consistent backup. If Linux's standard filesystem ext3 has the freeze feature, we can do it without a commercial filesystem. So I have implemented the ioctls of the freeze feature. I think we can take a consistent backup with the following steps. 1. Freeze the filesystem with the freeze ioctl. 2. Separate the replication volume or create the snapshot with the storage device's feature. 3. Unfreeze the filesystem with the unfreeze ioctl. 4. Take the backup from the separated replication volume or the snapshot. This patch: VFS: Changed the type of write_super_lockfs and unlockfs from "void" to "int" so that they can return an error. Rename the write_super_lockfs and unlockfs super block operations to freeze_fs and unfreeze_fs to avoid confusion. ext3, ext4, xfs, gfs2, jfs: Changed the type of write_super_lockfs and unlockfs from "void" to "int" so that write_super_lockfs returns an error if needed, and unlockfs always returns 0. reiserfs: Changed the type of write_super_lockfs and unlockfs from "void" to "int" so that they always return 0 (success) to keep the current behavior. Signed-off-by: Takashi Sato Signed-off-by: Masayuki Hamaguchi Cc: Cc: Cc: Christoph Hellwig Cc: Dave Kleikamp Cc: Dave Chinner Cc: Alasdair G Kergon Cc: Al Viro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/filesystems/Locking | 8 +++---- Documentation/filesystems/vfs.txt | 8 +++---- fs/buffer.c | 8 +++---- fs/ext3/super.c | 45 +++++++++++++++++++++++++-------------- fs/ext4/super.c | 45 +++++++++++++++++++++++++++------------ fs/gfs2/ops_super.c | 16 ++++++++------ fs/jfs/super.c | 10 +++++---- fs/reiserfs/super.c | 10 +++++---- fs/xfs/linux-2.6/xfs_super.c | 8 +++---- fs/xfs/xfs_fsops.c | 11 ++++++---- fs/xfs/xfs_fsops.h | 2 +- include/linux/fs.h | 4 ++-- 12 files changed, 107 insertions(+), 68 deletions(-) (limited to 'Documentation') diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking index cfbfa15a46ba..ec6a9392a173 100644 --- a/Documentation/filesystems/Locking +++ b/Documentation/filesystems/Locking @@ -97,8 +97,8 @@ prototypes: void (*put_super) (struct super_block *); void (*write_super) (struct super_block *); int (*sync_fs)(struct super_block *sb, int wait); - void (*write_super_lockfs) (struct super_block *); - void (*unlockfs) (struct super_block *); + int (*freeze_fs) (struct super_block *); + int (*unfreeze_fs) (struct super_block *); int (*statfs) (struct dentry *, struct kstatfs *); int (*remount_fs) (struct super_block *, int *, char *); void (*clear_inode) (struct inode *); @@ -119,8 +119,8 @@ delete_inode: no put_super: yes yes no write_super: no yes read sync_fs: no no read -write_super_lockfs: ? -unlockfs: ? +freeze_fs: ? +unfreeze_fs: ? 
statfs: no no no remount_fs: yes yes maybe (see below) clear_inode: no diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index ef19afa186a9..deeeed0faa8f 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt @@ -210,8 +210,8 @@ struct super_operations { void (*put_super) (struct super_block *); void (*write_super) (struct super_block *); int (*sync_fs)(struct super_block *sb, int wait); - void (*write_super_lockfs) (struct super_block *); - void (*unlockfs) (struct super_block *); + int (*freeze_fs) (struct super_block *); + int (*unfreeze_fs) (struct super_block *); int (*statfs) (struct dentry *, struct kstatfs *); int (*remount_fs) (struct super_block *, int *, char *); void (*clear_inode) (struct inode *); @@ -270,11 +270,11 @@ or bottom half). a superblock. The second parameter indicates whether the method should wait until the write out has been completed. Optional. - write_super_lockfs: called when VFS is locking a filesystem and + freeze_fs: called when VFS is locking a filesystem and forcing it into a consistent state. This method is currently used by the Logical Volume Manager (LVM). - unlockfs: called when VFS is unlocking a filesystem and making it writable + unfreeze_fs: called when VFS is unlocking a filesystem and making it writable again. statfs: called when the VFS needs to get filesystem statistics. This diff --git a/fs/buffer.c b/fs/buffer.c index c26da785938a..87f9e537b8c3 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -221,8 +221,8 @@ struct super_block *freeze_bdev(struct block_device *bdev) sync_blockdev(sb->s_bdev); - if (sb->s_op->write_super_lockfs) - sb->s_op->write_super_lockfs(sb); + if (sb->s_op->freeze_fs) + sb->s_op->freeze_fs(sb); } sync_blockdev(bdev); @@ -242,8 +242,8 @@ void thaw_bdev(struct block_device *bdev, struct super_block *sb) if (sb) { BUG_ON(sb->s_bdev != bdev); - if (sb->s_op->unlockfs) - sb->s_op->unlockfs(sb); + if (sb->s_op->unfreeze_fs) + sb->s_op->unfreeze_fs(sb); sb->s_frozen = SB_UNFROZEN; smp_wmb(); wake_up(&sb->s_wait_unfrozen); diff --git a/fs/ext3/super.c b/fs/ext3/super.c index 5d047a030a73..b70d90e08a3c 100644 --- a/fs/ext3/super.c +++ b/fs/ext3/super.c @@ -48,8 +48,8 @@ static int ext3_load_journal(struct super_block *, struct ext3_super_block *, unsigned long journal_devnum); static int ext3_create_journal(struct super_block *, struct ext3_super_block *, unsigned int); -static void ext3_commit_super (struct super_block * sb, - struct ext3_super_block * es, +static int ext3_commit_super(struct super_block *sb, + struct ext3_super_block *es, int sync); static void ext3_mark_recovery_complete(struct super_block * sb, struct ext3_super_block * es); @@ -60,9 +60,9 @@ static const char *ext3_decode_error(struct super_block * sb, int errno, char nbuf[16]); static int ext3_remount (struct super_block * sb, int * flags, char * data); static int ext3_statfs (struct dentry * dentry, struct kstatfs * buf); -static void ext3_unlockfs(struct super_block *sb); +static int ext3_unfreeze(struct super_block *sb); static void ext3_write_super (struct super_block * sb); -static void ext3_write_super_lockfs(struct super_block *sb); +static int ext3_freeze(struct super_block *sb); /* * Wrappers for journal_start/end. 
@@ -759,8 +759,8 @@ static const struct super_operations ext3_sops = { .put_super = ext3_put_super, .write_super = ext3_write_super, .sync_fs = ext3_sync_fs, - .write_super_lockfs = ext3_write_super_lockfs, - .unlockfs = ext3_unlockfs, + .freeze_fs = ext3_freeze, + .unfreeze_fs = ext3_unfreeze, .statfs = ext3_statfs, .remount_fs = ext3_remount, .clear_inode = ext3_clear_inode, @@ -2311,21 +2311,23 @@ static int ext3_create_journal(struct super_block * sb, return 0; } -static void ext3_commit_super (struct super_block * sb, - struct ext3_super_block * es, +static int ext3_commit_super(struct super_block *sb, + struct ext3_super_block *es, int sync) { struct buffer_head *sbh = EXT3_SB(sb)->s_sbh; + int error = 0; if (!sbh) - return; + return error; es->s_wtime = cpu_to_le32(get_seconds()); es->s_free_blocks_count = cpu_to_le32(ext3_count_free_blocks(sb)); es->s_free_inodes_count = cpu_to_le32(ext3_count_free_inodes(sb)); BUFFER_TRACE(sbh, "marking dirty"); mark_buffer_dirty(sbh); if (sync) - sync_dirty_buffer(sbh); + error = sync_dirty_buffer(sbh); + return error; } @@ -2439,12 +2441,14 @@ static int ext3_sync_fs(struct super_block *sb, int wait) * LVM calls this function before a (read-only) snapshot is created. This * gives us a chance to flush the journal completely and mark the fs clean. */ -static void ext3_write_super_lockfs(struct super_block *sb) +static int ext3_freeze(struct super_block *sb) { + int error = 0; + journal_t *journal; sb->s_dirt = 0; if (!(sb->s_flags & MS_RDONLY)) { - journal_t *journal = EXT3_SB(sb)->s_journal; + journal = EXT3_SB(sb)->s_journal; /* Now we set up the journal barrier. */ journal_lock_updates(journal); @@ -2453,20 +2457,28 @@ static void ext3_write_super_lockfs(struct super_block *sb) * We don't want to clear needs_recovery flag when we failed * to flush the journal. */ - if (journal_flush(journal) < 0) - return; + error = journal_flush(journal); + if (error < 0) + goto out; /* Journal blocked and flushed, clear needs_recovery flag. */ EXT3_CLEAR_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER); - ext3_commit_super(sb, EXT3_SB(sb)->s_es, 1); + error = ext3_commit_super(sb, EXT3_SB(sb)->s_es, 1); + if (error) + goto out; } + return 0; + +out: + journal_unlock_updates(journal); + return error; } /* * Called by LVM after the snapshot is done. We need to reset the RECOVER * flag here, even though the filesystem is not technically dirty yet. 
*/ -static void ext3_unlockfs(struct super_block *sb) +static int ext3_unfreeze(struct super_block *sb) { if (!(sb->s_flags & MS_RDONLY)) { lock_super(sb); @@ -2476,6 +2488,7 @@ static void ext3_unlockfs(struct super_block *sb) unlock_super(sb); journal_unlock_updates(EXT3_SB(sb)->s_journal); } + return 0; } static int ext3_remount (struct super_block * sb, int * flags, char * data) diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 8f7e0be8ab1b..e5f06a5f045e 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -51,7 +51,7 @@ struct proc_dir_entry *ext4_proc_root; static int ext4_load_journal(struct super_block *, struct ext4_super_block *, unsigned long journal_devnum); -static void ext4_commit_super(struct super_block *sb, +static int ext4_commit_super(struct super_block *sb, struct ext4_super_block *es, int sync); static void ext4_mark_recovery_complete(struct super_block *sb, struct ext4_super_block *es); @@ -62,9 +62,9 @@ static const char *ext4_decode_error(struct super_block *sb, int errno, char nbuf[16]); static int ext4_remount(struct super_block *sb, int *flags, char *data); static int ext4_statfs(struct dentry *dentry, struct kstatfs *buf); -static void ext4_unlockfs(struct super_block *sb); +static int ext4_unfreeze(struct super_block *sb); static void ext4_write_super(struct super_block *sb); -static void ext4_write_super_lockfs(struct super_block *sb); +static int ext4_freeze(struct super_block *sb); ext4_fsblk_t ext4_block_bitmap(struct super_block *sb, @@ -978,8 +978,8 @@ static const struct super_operations ext4_sops = { .put_super = ext4_put_super, .write_super = ext4_write_super, .sync_fs = ext4_sync_fs, - .write_super_lockfs = ext4_write_super_lockfs, - .unlockfs = ext4_unlockfs, + .freeze_fs = ext4_freeze, + .unfreeze_fs = ext4_unfreeze, .statfs = ext4_statfs, .remount_fs = ext4_remount, .clear_inode = ext4_clear_inode, @@ -2888,13 +2888,14 @@ static int ext4_load_journal(struct super_block *sb, return 0; } -static void ext4_commit_super(struct super_block *sb, +static int ext4_commit_super(struct super_block *sb, struct ext4_super_block *es, int sync) { struct buffer_head *sbh = EXT4_SB(sb)->s_sbh; + int error = 0; if (!sbh) - return; + return error; if (buffer_write_io_error(sbh)) { /* * Oh, dear. A previous attempt to write the @@ -2918,14 +2919,19 @@ static void ext4_commit_super(struct super_block *sb, BUFFER_TRACE(sbh, "marking dirty"); mark_buffer_dirty(sbh); if (sync) { - sync_dirty_buffer(sbh); - if (buffer_write_io_error(sbh)) { + error = sync_dirty_buffer(sbh); + if (error) + return error; + + error = buffer_write_io_error(sbh); + if (error) { printk(KERN_ERR "EXT4-fs: I/O error while writing " "superblock for %s.\n", sb->s_id); clear_buffer_write_io_error(sbh); set_buffer_uptodate(sbh); } } + return error; } @@ -3058,12 +3064,14 @@ static int ext4_sync_fs(struct super_block *sb, int wait) * LVM calls this function before a (read-only) snapshot is created. This * gives us a chance to flush the journal completely and mark the fs clean. */ -static void ext4_write_super_lockfs(struct super_block *sb) +static int ext4_freeze(struct super_block *sb) { + int error = 0; + journal_t *journal; sb->s_dirt = 0; if (!(sb->s_flags & MS_RDONLY)) { - journal_t *journal = EXT4_SB(sb)->s_journal; + journal = EXT4_SB(sb)->s_journal; if (journal) { /* Now we set up the journal barrier. */ @@ -3073,21 +3081,29 @@ static void ext4_write_super_lockfs(struct super_block *sb) * We don't want to clear needs_recovery flag when we * failed to flush the journal. 
*/ - if (jbd2_journal_flush(journal) < 0) - return; + error = jbd2_journal_flush(journal); + if (error < 0) + goto out; } /* Journal blocked and flushed, clear needs_recovery flag. */ EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER); ext4_commit_super(sb, EXT4_SB(sb)->s_es, 1); + error = ext4_commit_super(sb, EXT4_SB(sb)->s_es, 1); + if (error) + goto out; } + return 0; +out: + jbd2_journal_unlock_updates(journal); + return error; } /* * Called by LVM after the snapshot is done. We need to reset the RECOVER * flag here, even though the filesystem is not technically dirty yet. */ -static void ext4_unlockfs(struct super_block *sb) +static int ext4_unfreeze(struct super_block *sb) { if (EXT4_SB(sb)->s_journal && !(sb->s_flags & MS_RDONLY)) { lock_super(sb); @@ -3097,6 +3113,7 @@ static void ext4_unlockfs(struct super_block *sb) unlock_super(sb); jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal); } + return 0; } static int ext4_remount(struct super_block *sb, int *flags, char *data) diff --git a/fs/gfs2/ops_super.c b/fs/gfs2/ops_super.c index 777783deddcb..320323d03479 100644 --- a/fs/gfs2/ops_super.c +++ b/fs/gfs2/ops_super.c @@ -211,18 +211,18 @@ static int gfs2_sync_fs(struct super_block *sb, int wait) } /** - * gfs2_write_super_lockfs - prevent further writes to the filesystem + * gfs2_freeze - prevent further writes to the filesystem * @sb: the VFS structure for the filesystem * */ -static void gfs2_write_super_lockfs(struct super_block *sb) +static int gfs2_freeze(struct super_block *sb) { struct gfs2_sbd *sdp = sb->s_fs_info; int error; if (test_bit(SDF_SHUTDOWN, &sdp->sd_flags)) - return; + return -EINVAL; for (;;) { error = gfs2_freeze_fs(sdp); @@ -242,17 +242,19 @@ static void gfs2_write_super_lockfs(struct super_block *sb) fs_err(sdp, "retrying...\n"); msleep(1000); } + return 0; } /** - * gfs2_unlockfs - reallow writes to the filesystem + * gfs2_unfreeze - reallow writes to the filesystem * @sb: the VFS structure for the filesystem * */ -static void gfs2_unlockfs(struct super_block *sb) +static int gfs2_unfreeze(struct super_block *sb) { gfs2_unfreeze_fs(sb->s_fs_info); + return 0; } /** @@ -688,8 +690,8 @@ const struct super_operations gfs2_super_ops = { .put_super = gfs2_put_super, .write_super = gfs2_write_super, .sync_fs = gfs2_sync_fs, - .write_super_lockfs = gfs2_write_super_lockfs, - .unlockfs = gfs2_unlockfs, + .freeze_fs = gfs2_freeze, + .unfreeze_fs = gfs2_unfreeze, .statfs = gfs2_statfs, .remount_fs = gfs2_remount_fs, .clear_inode = gfs2_clear_inode, diff --git a/fs/jfs/super.c b/fs/jfs/super.c index 0dae345e481b..b37d1f78b854 100644 --- a/fs/jfs/super.c +++ b/fs/jfs/super.c @@ -543,7 +543,7 @@ out_kfree: return ret; } -static void jfs_write_super_lockfs(struct super_block *sb) +static int jfs_freeze(struct super_block *sb) { struct jfs_sb_info *sbi = JFS_SBI(sb); struct jfs_log *log = sbi->log; @@ -553,9 +553,10 @@ static void jfs_write_super_lockfs(struct super_block *sb) lmLogShutdown(log); updateSuper(sb, FM_CLEAN); } + return 0; } -static void jfs_unlockfs(struct super_block *sb) +static int jfs_unfreeze(struct super_block *sb) { struct jfs_sb_info *sbi = JFS_SBI(sb); struct jfs_log *log = sbi->log; @@ -568,6 +569,7 @@ static void jfs_unlockfs(struct super_block *sb) else txResume(sb); } + return 0; } static int jfs_get_sb(struct file_system_type *fs_type, @@ -735,8 +737,8 @@ static const struct super_operations jfs_super_operations = { .delete_inode = jfs_delete_inode, .put_super = jfs_put_super, .sync_fs = jfs_sync_fs, - .write_super_lockfs = 
jfs_write_super_lockfs, - .unlockfs = jfs_unlockfs, + .freeze_fs = jfs_freeze, + .unfreeze_fs = jfs_unfreeze, .statfs = jfs_statfs, .remount_fs = jfs_remount, .show_options = jfs_show_options, diff --git a/fs/reiserfs/super.c b/fs/reiserfs/super.c index c55651f1407c..f3c820b75829 100644 --- a/fs/reiserfs/super.c +++ b/fs/reiserfs/super.c @@ -83,7 +83,7 @@ static void reiserfs_write_super(struct super_block *s) reiserfs_sync_fs(s, 1); } -static void reiserfs_write_super_lockfs(struct super_block *s) +static int reiserfs_freeze(struct super_block *s) { struct reiserfs_transaction_handle th; reiserfs_write_lock(s); @@ -101,11 +101,13 @@ static void reiserfs_write_super_lockfs(struct super_block *s) } s->s_dirt = 0; reiserfs_write_unlock(s); + return 0; } -static void reiserfs_unlockfs(struct super_block *s) +static int reiserfs_unfreeze(struct super_block *s) { reiserfs_allow_writes(s); + return 0; } extern const struct in_core_key MAX_IN_CORE_KEY; @@ -613,8 +615,8 @@ static const struct super_operations reiserfs_sops = { .put_super = reiserfs_put_super, .write_super = reiserfs_write_super, .sync_fs = reiserfs_sync_fs, - .write_super_lockfs = reiserfs_write_super_lockfs, - .unlockfs = reiserfs_unlockfs, + .freeze_fs = reiserfs_freeze, + .unfreeze_fs = reiserfs_unfreeze, .statfs = reiserfs_statfs, .remount_fs = reiserfs_remount, .show_options = generic_show_options, diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c index be846d606ae8..95a971080368 100644 --- a/fs/xfs/linux-2.6/xfs_super.c +++ b/fs/xfs/linux-2.6/xfs_super.c @@ -1269,14 +1269,14 @@ xfs_fs_remount( * need to take care of the metadata. Once that's done write a dummy * record to dirty the log in case of a crash while frozen. */ -STATIC void -xfs_fs_lockfs( +STATIC int +xfs_fs_freeze( struct super_block *sb) { struct xfs_mount *mp = XFS_M(sb); xfs_quiesce_attr(mp); - xfs_fs_log_dummy(mp); + return -xfs_fs_log_dummy(mp); } STATIC int @@ -1557,7 +1557,7 @@ static struct super_operations xfs_super_operations = { .put_super = xfs_fs_put_super, .write_super = xfs_fs_write_super, .sync_fs = xfs_fs_sync_super, - .write_super_lockfs = xfs_fs_lockfs, + .freeze_fs = xfs_fs_freeze, .statfs = xfs_fs_statfs, .remount_fs = xfs_fs_remount, .show_options = xfs_fs_show_options, diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c index 852b6d32e8d0..680d0e0ec932 100644 --- a/fs/xfs/xfs_fsops.c +++ b/fs/xfs/xfs_fsops.c @@ -595,17 +595,19 @@ out: return 0; } -void +int xfs_fs_log_dummy( xfs_mount_t *mp) { xfs_trans_t *tp; xfs_inode_t *ip; + int error; tp = _xfs_trans_alloc(mp, XFS_TRANS_DUMMY1); - if (xfs_trans_reserve(tp, 0, XFS_ICHANGE_LOG_RES(mp), 0, 0, 0)) { + error = xfs_trans_reserve(tp, 0, XFS_ICHANGE_LOG_RES(mp), 0, 0, 0); + if (error) { xfs_trans_cancel(tp, 0); - return; + return error; } ip = mp->m_rootip; @@ -615,9 +617,10 @@ xfs_fs_log_dummy( xfs_trans_ihold(tp, ip); xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE); xfs_trans_set_sync(tp); - xfs_trans_commit(tp, 0); + error = xfs_trans_commit(tp, 0); xfs_iunlock(ip, XFS_ILOCK_EXCL); + return error; } int diff --git a/fs/xfs/xfs_fsops.h b/fs/xfs/xfs_fsops.h index 300d0c9d61ad..88435e0a77c9 100644 --- a/fs/xfs/xfs_fsops.h +++ b/fs/xfs/xfs_fsops.h @@ -25,6 +25,6 @@ extern int xfs_fs_counts(xfs_mount_t *mp, xfs_fsop_counts_t *cnt); extern int xfs_reserve_blocks(xfs_mount_t *mp, __uint64_t *inval, xfs_fsop_resblks_t *outval); extern int xfs_fs_goingdown(xfs_mount_t *mp, __uint32_t inflags); -extern void xfs_fs_log_dummy(xfs_mount_t *mp); +extern int xfs_fs_log_dummy(xfs_mount_t 
*mp); #endif /* __XFS_FSOPS_H__ */ diff --git a/include/linux/fs.h b/include/linux/fs.h index 0b87b29f4797..3e59182de9df 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1377,8 +1377,8 @@ struct super_operations { void (*put_super) (struct super_block *); void (*write_super) (struct super_block *); int (*sync_fs)(struct super_block *sb, int wait); - void (*write_super_lockfs) (struct super_block *); - void (*unlockfs) (struct super_block *); + int (*freeze_fs) (struct super_block *); + int (*unfreeze_fs) (struct super_block *); int (*statfs) (struct dentry *, struct kstatfs *); int (*remount_fs) (struct super_block *, int *, char *); void (*clear_inode) (struct inode *); -- cgit v1.2.3 From 417bec5b0f25e000866f1be845d44a3ca0690697 Mon Sep 17 00:00:00 2001 From: Takashi Iwai Date: Tue, 13 Jan 2009 17:57:12 +0100 Subject: ALSA: hda - Update model descriptions in patch_sigmatel.c Update models in patch_sigmatel.c, mainly for the last Gateway updates. Signed-off-by: Takashi Iwai --- Documentation/sound/alsa/HD-Audio-Models.txt | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/sound/alsa/HD-Audio-Models.txt b/Documentation/sound/alsa/HD-Audio-Models.txt index 4b7ac21ea9eb..64eb1100eec1 100644 --- a/Documentation/sound/alsa/HD-Audio-Models.txt +++ b/Documentation/sound/alsa/HD-Audio-Models.txt @@ -275,7 +275,8 @@ STAC9200 dell-m25 Dell Inspiron E1505n dell-m26 Dell Inspiron 1501 dell-m27 Dell Inspiron E1705/9400 - gateway Gateway laptops with EAPD control + gateway-m4 Gateway laptops with EAPD control + gateway-m4-2 Gateway laptops with EAPD control panasonic Panasonic CF-74 STAC9205/9254 @@ -302,6 +303,7 @@ STAC9220/9221 macbook-pro Intel Mac Book Pro 2nd generation (eq. type 3) imac-intel Intel iMac (eq. type 2) imac-intel-20 Intel iMac (newer version) (eq. 
type 3) + ecs202 ECS/PC chips dell-d81 Dell (unknown) dell-d82 Dell (unknown) dell-m81 Dell (unknown) @@ -310,9 +312,13 @@ STAC9220/9221 STAC9202/9250/9251 ================== ref Reference board, base config + m1 Some Gateway MX series laptops (NX560XL) + m1-2 Some Gateway MX series laptops (MX6453) + m2 Some Gateway MX series laptops (M255) m2-2 Some Gateway MX series laptops + m3 Some Gateway MX series laptops + m5 Some Gateway MX series laptops (MP6954) m6 Some Gateway NX series laptops - pa6 Gateway NX860 series STAC9227/9228/9229/927x ======================= @@ -329,6 +335,7 @@ STAC92HD71B* dell-m4-1 Dell desktops dell-m4-2 Dell desktops dell-m4-3 Dell desktops + hp-m4 HP dv laptops STAC92HD73* =========== @@ -337,6 +344,7 @@ STAC92HD73* dell-m6-amic Dell desktops/laptops with analog mics dell-m6-dmic Dell desktops/laptops with digital mics dell-m6 Dell desktops/laptops with both type of mics + dell-eq Dell desktops/laptops STAC92HD83* =========== -- cgit v1.2.3 From e86c1451d3138b4cd0378282b30397d171fa4252 Mon Sep 17 00:00:00 2001 From: Bartlomiej Zolnierkiewicz Date: Wed, 14 Jan 2009 19:19:03 +0100 Subject: ide: remove unused CONFIG_BLK_DEV_IDE_AU1XXX_SEQTS_PER_RQ Acked-by: Sergei Shtylyov Signed-off-by: Bartlomiej Zolnierkiewicz --- Documentation/mips/AU1xxx_IDE.README | 6 +----- drivers/ide/Kconfig | 5 ----- 2 files changed, 1 insertion(+), 10 deletions(-) (limited to 'Documentation') diff --git a/Documentation/mips/AU1xxx_IDE.README b/Documentation/mips/AU1xxx_IDE.README index f54962aea84d..8ace35ebdcd5 100644 --- a/Documentation/mips/AU1xxx_IDE.README +++ b/Documentation/mips/AU1xxx_IDE.README @@ -52,14 +52,12 @@ Two files are introduced: b) 'drivers/ide/mips/au1xxx-ide.c' contains the functionality of the AU1XXX IDE driver -Four configs variables are introduced: +Following extra configs variables are introduced: CONFIG_BLK_DEV_IDE_AU1XXX_PIO_DBDMA - enable the PIO+DBDMA mode CONFIG_BLK_DEV_IDE_AU1XXX_MDMA2_DBDMA - enable the MWDMA mode CONFIG_BLK_DEV_IDE_AU1XXX_BURSTABLE_ON - set Burstable FIFO in DBDMA controller - CONFIG_BLK_DEV_IDE_AU1XXX_SEQTS_PER_RQ - maximum transfer size - per descriptor SUPPORTED IDE MODES @@ -87,7 +85,6 @@ CONFIG_BLK_DEV_IDEDMA_PCI=y CONFIG_IDEDMA_PCI_AUTO=y CONFIG_BLK_DEV_IDE_AU1XXX=y CONFIG_BLK_DEV_IDE_AU1XXX_MDMA2_DBDMA=y -CONFIG_BLK_DEV_IDE_AU1XXX_SEQTS_PER_RQ=128 CONFIG_BLK_DEV_IDEDMA=y CONFIG_IDEDMA_AUTO=y @@ -105,7 +102,6 @@ CONFIG_BLK_DEV_IDEDMA_PCI=y CONFIG_IDEDMA_PCI_AUTO=y CONFIG_BLK_DEV_IDE_AU1XXX=y CONFIG_BLK_DEV_IDE_AU1XXX_MDMA2_DBDMA=y -CONFIG_BLK_DEV_IDE_AU1XXX_SEQTS_PER_RQ=128 CONFIG_BLK_DEV_IDEDMA=y CONFIG_IDEDMA_AUTO=y diff --git a/drivers/ide/Kconfig b/drivers/ide/Kconfig index 3f9503867e6b..b1c6f68d98ce 100644 --- a/drivers/ide/Kconfig +++ b/drivers/ide/Kconfig @@ -701,11 +701,6 @@ config BLK_DEV_IDE_AU1XXX_MDMA2_DBDMA depends on SOC_AU1200 && BLK_DEV_IDE_AU1XXX endchoice -config BLK_DEV_IDE_AU1XXX_SEQTS_PER_RQ - int "Maximum transfer size (KB) per request (up to 128)" - default "128" - depends on BLK_DEV_IDE_AU1XXX - config BLK_DEV_IDE_TX4938 tristate "TX4938 internal IDE support" depends on SOC_TX4938 -- cgit v1.2.3 From df291fa993c506da89a89264ff8166bccd172a14 Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Thu, 8 Jan 2009 10:59:34 -0800 Subject: kbuild: fix kbuild.txt typos Fix typos in the new kbuild.txt file. 
Signed-off-by: Randy Dunlap Signed-off-by: Sam Ravnborg --- Documentation/kbuild/kbuild.txt | 29 +++++++++++++++-------------- 1 file changed, 15 insertions(+), 14 deletions(-) (limited to 'Documentation') diff --git a/Documentation/kbuild/kbuild.txt b/Documentation/kbuild/kbuild.txt index 923f9ddee8f6..f3355b6812df 100644 --- a/Documentation/kbuild/kbuild.txt +++ b/Documentation/kbuild/kbuild.txt @@ -3,7 +3,7 @@ Environment variables KCPPFLAGS -------------------------------------------------- Additional options to pass when preprocessing. The preprocessing options -will be used in all cases where kbuild do preprocessing including +will be used in all cases where kbuild does preprocessing including building C files and assembler files. KAFLAGS @@ -16,7 +16,7 @@ Additional options to the C compiler. KBUILD_VERBOSE -------------------------------------------------- -Set the kbuild verbosity. Can be assinged same values as "V=...". +Set the kbuild verbosity. Can be assigned same values as "V=...". See make help for the full list. Setting "V=..." takes precedence over KBUILD_VERBOSE. @@ -35,14 +35,14 @@ KBUILD_OUTPUT -------------------------------------------------- Specify the output directory when building the kernel. The output directory can also be specificed using "O=...". -Setting "O=..." takes precedence over KBUILD_OUTPUT +Setting "O=..." takes precedence over KBUILD_OUTPUT. ARCH -------------------------------------------------- Set ARCH to the architecture to be built. In most cases the name of the architecture is the same as the directory name found in the arch/ directory. -But some architectures suach as x86 and sparc has aliases. +But some architectures such as x86 and sparc have aliases. x86: i386 for 32 bit, x86_64 for 64 bit sparc: sparc for 32 bit, sparc64 for 64 bit @@ -63,7 +63,7 @@ CF is often used on the command-line like this: INSTALL_PATH -------------------------------------------------- INSTALL_PATH specifies where to place the updated kernel and system map -images. Default is /boot, but you can set it to other values +images. Default is /boot, but you can set it to other values. MODLIB @@ -90,7 +90,7 @@ INSTALL_MOD_STRIP will used as the options to the strip command. INSTALL_FW_PATH -------------------------------------------------- -INSTALL_FW_PATH specify where to install the firmware blobs. +INSTALL_FW_PATH specifies where to install the firmware blobs. The default value is: $(INSTALL_MOD_PATH)/lib/firmware @@ -99,7 +99,7 @@ The value can be overridden in which case the default value is ignored. INSTALL_HDR_PATH -------------------------------------------------- -INSTALL_HDR_PATH specify where to install user space headers when +INSTALL_HDR_PATH specifies where to install user space headers when executing "make headers_*". The default value is: @@ -112,22 +112,23 @@ The value can be overridden in which case the default value is ignored. KBUILD_MODPOST_WARN -------------------------------------------------- -KBUILD_MODPOST_WARN can be set to avoid error out in case of undefined -symbols in the final module linking stage. +KBUILD_MODPOST_WARN can be set to avoid errors in case of undefined +symbols in the final module linking stage. It changes such errors +into warnings. -KBUILD_MODPOST_FINAL +KBUILD_MODPOST_NOFINAL -------------------------------------------------- KBUILD_MODPOST_NOFINAL can be set to skip the final link of modules. -This is solely usefull to speed up test compiles. +This is solely useful to speed up test compiles. 
KBUILD_EXTRA_SYMBOLS -------------------------------------------------- -For modules use symbols from another modules. +For modules that use symbols from other modules. See more details in modules.txt. ALLSOURCE_ARCHS -------------------------------------------------- -For tags/TAGS/cscope targets, you can specify more than one archs -to be included in the databases, separated by blankspace. e.g. +For tags/TAGS/cscope targets, you can specify more than one arch +to be included in the databases, separated by blank space. E.g.: $ make ALLSOURCE_ARCHS="x86 mips arm" tags -- cgit v1.2.3 From 9abf0eea877d6107d3a8a5c6913450e961fb7050 Mon Sep 17 00:00:00 2001 From: Henrique de Moraes Holschuh Date: Sun, 11 Jan 2009 03:00:58 -0200 Subject: ACPI: thinkpad-acpi: update documents for the new location Update documentation to reflect the new location of the thinkpad-acpi driver. Signed-off-by: Henrique de Moraes Holschuh Signed-off-by: Len Brown --- Documentation/laptops/thinkpad-acpi.txt | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/laptops/thinkpad-acpi.txt b/Documentation/laptops/thinkpad-acpi.txt index 898b4987bb80..ddc371e0a1a6 100644 --- a/Documentation/laptops/thinkpad-acpi.txt +++ b/Documentation/laptops/thinkpad-acpi.txt @@ -16,7 +16,8 @@ supported by the generic Linux ACPI drivers. This driver used to be named ibm-acpi until kernel 2.6.21 and release 0.13-20070314. It used to be in the drivers/acpi tree, but it was moved to the drivers/misc tree and renamed to thinkpad-acpi for kernel -2.6.22, and release 0.14. +2.6.22, and release 0.14. It was moved to drivers/platform/x86 for +kernel 2.6.29. The driver is named "thinkpad-acpi". In some places, like module names, "thinkpad_acpi" is used because of userspace issues. -- cgit v1.2.3 From 0045c0aa7d5e787f78938e6a10927b8a516f0b83 Mon Sep 17 00:00:00 2001 From: Henrique de Moraes Holschuh Date: Sun, 11 Jan 2009 03:01:03 -0200 Subject: ACPI: thinkpad-acpi: add UWB radio support Add rfkill support for USB UWB radio devices on very recent ThinkPad laptop models. The new subdriver is mostly a trimmed down copy of the wwan subdriver. Signed-off-by: Henrique de Moraes Holschuh Cc: Ivo van Doorn Signed-off-by: Len Brown --- Documentation/laptops/thinkpad-acpi.txt | 18 +++ drivers/platform/x86/thinkpad_acpi.c | 206 +++++++++++++++++++++++++++++++- 2 files changed, 223 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/laptops/thinkpad-acpi.txt b/Documentation/laptops/thinkpad-acpi.txt index ddc371e0a1a6..91c00010b15c 100644 --- a/Documentation/laptops/thinkpad-acpi.txt +++ b/Documentation/laptops/thinkpad-acpi.txt @@ -1413,6 +1413,24 @@ Sysfs notes: rfkill controller switch "tpacpi_wwan_sw": refer to Documentation/rfkill.txt for details. +EXPERIMENTAL: UWB +----------------- + +This feature is marked EXPERIMENTAL because it has not been extensively +tested and validated in various ThinkPad models yet. The feature may not +work as expected. USE WITH CAUTION! To use this feature, you need to supply +the experimental=1 parameter when loading the module. + +sysfs rfkill class: switch "tpacpi_uwb_sw" + +This feature exports an rfkill controller for the UWB device, if one is +present and enabled in the BIOS. + +Sysfs notes: + + rfkill controller switch "tpacpi_uwb_sw": refer to + Documentation/rfkill.txt for details. 
+ Multiple Commands, Module Parameters ------------------------------------ diff --git a/drivers/platform/x86/thinkpad_acpi.c b/drivers/platform/x86/thinkpad_acpi.c index 27d709bac98f..c1d40410ad79 100644 --- a/drivers/platform/x86/thinkpad_acpi.c +++ b/drivers/platform/x86/thinkpad_acpi.c @@ -169,6 +169,7 @@ enum { enum { TPACPI_RFK_BLUETOOTH_SW_ID = 0, TPACPI_RFK_WWAN_SW_ID, + TPACPI_RFK_UWB_SW_ID, }; /* Debugging */ @@ -261,6 +262,7 @@ static struct { u32 bright_16levels:1; u32 bright_acpimode:1; u32 wan:1; + u32 uwb:1; u32 fan_ctrl_status_undef:1; u32 input_device_registered:1; u32 platform_drv_registered:1; @@ -317,6 +319,8 @@ static int dbg_bluetoothemul; static int tpacpi_bluetooth_emulstate; static int dbg_wwanemul; static int tpacpi_wwan_emulstate; +static int dbg_uwbemul; +static int tpacpi_uwb_emulstate; #endif @@ -967,6 +971,7 @@ static int __init tpacpi_new_rfkill(const unsigned int id, struct rfkill **rfk, const enum rfkill_type rfktype, const char *name, + const bool set_default, int (*toggle_radio)(void *, enum rfkill_state), int (*get_state)(void *, enum rfkill_state *)) { @@ -978,7 +983,7 @@ static int __init tpacpi_new_rfkill(const unsigned int id, printk(TPACPI_ERR "failed to read initial state for %s, error %d; " "will turn radio off\n", name, res); - } else { + } else if (set_default) { /* try to set the initial state as the default for the rfkill * type, since we ask the firmware to preserve it across S5 in * NVRAM */ @@ -1148,6 +1153,31 @@ static DRIVER_ATTR(wwan_emulstate, S_IWUSR | S_IRUGO, tpacpi_driver_wwan_emulstate_show, tpacpi_driver_wwan_emulstate_store); +/* uwb_emulstate ------------------------------------------------- */ +static ssize_t tpacpi_driver_uwb_emulstate_show( + struct device_driver *drv, + char *buf) +{ + return snprintf(buf, PAGE_SIZE, "%d\n", !!tpacpi_uwb_emulstate); +} + +static ssize_t tpacpi_driver_uwb_emulstate_store( + struct device_driver *drv, + const char *buf, size_t count) +{ + unsigned long t; + + if (parse_strtoul(buf, 1, &t)) + return -EINVAL; + + tpacpi_uwb_emulstate = !!t; + + return count; +} + +static DRIVER_ATTR(uwb_emulstate, S_IWUSR | S_IRUGO, + tpacpi_driver_uwb_emulstate_show, + tpacpi_driver_uwb_emulstate_store); #endif /* --------------------------------------------------------------------- */ @@ -1175,6 +1205,8 @@ static int __init tpacpi_create_driver_attributes(struct device_driver *drv) res = driver_create_file(drv, &driver_attr_bluetooth_emulstate); if (!res && dbg_wwanemul) res = driver_create_file(drv, &driver_attr_wwan_emulstate); + if (!res && dbg_uwbemul) + res = driver_create_file(drv, &driver_attr_uwb_emulstate); #endif return res; @@ -1191,6 +1223,7 @@ static void tpacpi_remove_driver_attributes(struct device_driver *drv) driver_remove_file(drv, &driver_attr_wlsw_emulstate); driver_remove_file(drv, &driver_attr_bluetooth_emulstate); driver_remove_file(drv, &driver_attr_wwan_emulstate); + driver_remove_file(drv, &driver_attr_uwb_emulstate); #endif } @@ -2125,6 +2158,7 @@ static struct attribute *hotkey_mask_attributes[] __initdata = { static void bluetooth_update_rfk(void); static void wan_update_rfk(void); +static void uwb_update_rfk(void); static void tpacpi_send_radiosw_update(void) { int wlsw; @@ -2134,6 +2168,8 @@ static void tpacpi_send_radiosw_update(void) bluetooth_update_rfk(); if (tp_features.wan) wan_update_rfk(); + if (tp_features.uwb) + uwb_update_rfk(); if (tp_features.hotkey_wlsw && !hotkey_get_wlsw(&wlsw)) { mutex_lock(&tpacpi_inputdev_send_mutex); @@ -3035,6 +3071,7 @@ static int __init 
bluetooth_init(struct ibm_init_struct *iibm) &tpacpi_bluetooth_rfkill, RFKILL_TYPE_BLUETOOTH, "tpacpi_bluetooth_sw", + true, tpacpi_bluetooth_rfk_set, tpacpi_bluetooth_rfk_get); if (res) { @@ -3309,6 +3346,7 @@ static int __init wan_init(struct ibm_init_struct *iibm) &tpacpi_wan_rfkill, RFKILL_TYPE_WWAN, "tpacpi_wwan_sw", + true, tpacpi_wan_rfk_set, tpacpi_wan_rfk_get); if (res) { @@ -3365,6 +3403,162 @@ static struct ibm_struct wan_driver_data = { .shutdown = wan_shutdown, }; +/************************************************************************* + * UWB subdriver + */ + +enum { + /* ACPI GUWB/SUWB bits */ + TP_ACPI_UWB_HWPRESENT = 0x01, /* UWB hw available */ + TP_ACPI_UWB_RADIOSSW = 0x02, /* UWB radio enabled */ +}; + +static struct rfkill *tpacpi_uwb_rfkill; + +static int uwb_get_radiosw(void) +{ + int status; + + if (!tp_features.uwb) + return -ENODEV; + + /* WLSW overrides UWB in firmware/hardware, reflect that */ + if (tp_features.hotkey_wlsw && !hotkey_get_wlsw(&status) && !status) + return RFKILL_STATE_HARD_BLOCKED; + +#ifdef CONFIG_THINKPAD_ACPI_DEBUGFACILITIES + if (dbg_uwbemul) + return (tpacpi_uwb_emulstate) ? + RFKILL_STATE_UNBLOCKED : RFKILL_STATE_SOFT_BLOCKED; +#endif + + if (!acpi_evalf(hkey_handle, &status, "GUWB", "d")) + return -EIO; + + return ((status & TP_ACPI_UWB_RADIOSSW) != 0) ? + RFKILL_STATE_UNBLOCKED : RFKILL_STATE_SOFT_BLOCKED; +} + +static void uwb_update_rfk(void) +{ + int status; + + if (!tpacpi_uwb_rfkill) + return; + + status = uwb_get_radiosw(); + if (status < 0) + return; + rfkill_force_state(tpacpi_uwb_rfkill, status); +} + +static int uwb_set_radiosw(int radio_on, int update_rfk) +{ + int status; + + if (!tp_features.uwb) + return -ENODEV; + + /* WLSW overrides UWB in firmware/hardware, but there is no + * reason to risk weird behaviour. */ + if (tp_features.hotkey_wlsw && !hotkey_get_wlsw(&status) && !status + && radio_on) + return -EPERM; + +#ifdef CONFIG_THINKPAD_ACPI_DEBUGFACILITIES + if (dbg_uwbemul) { + tpacpi_uwb_emulstate = !!radio_on; + if (update_rfk) + uwb_update_rfk(); + return 0; + } +#endif + + status = (radio_on) ? 
TP_ACPI_UWB_RADIOSSW : 0; + if (!acpi_evalf(hkey_handle, NULL, "SUWB", "vd", status)) + return -EIO; + + if (update_rfk) + uwb_update_rfk(); + + return 0; +} + +/* --------------------------------------------------------------------- */ + +static int tpacpi_uwb_rfk_get(void *data, enum rfkill_state *state) +{ + int uwbs = uwb_get_radiosw(); + + if (uwbs < 0) + return uwbs; + + *state = uwbs; + return 0; +} + +static int tpacpi_uwb_rfk_set(void *data, enum rfkill_state state) +{ + return uwb_set_radiosw((state == RFKILL_STATE_UNBLOCKED), 0); +} + +static void uwb_exit(void) +{ + if (tpacpi_uwb_rfkill) + rfkill_unregister(tpacpi_uwb_rfkill); +} + +static int __init uwb_init(struct ibm_init_struct *iibm) +{ + int res; + int status = 0; + + vdbg_printk(TPACPI_DBG_INIT, "initializing uwb subdriver\n"); + + TPACPI_ACPIHANDLE_INIT(hkey); + + tp_features.uwb = hkey_handle && + acpi_evalf(hkey_handle, &status, "GUWB", "qd"); + + vdbg_printk(TPACPI_DBG_INIT, "uwb is %s, status 0x%02x\n", + str_supported(tp_features.uwb), + status); + +#ifdef CONFIG_THINKPAD_ACPI_DEBUGFACILITIES + if (dbg_uwbemul) { + tp_features.uwb = 1; + printk(TPACPI_INFO + "uwb switch emulation enabled\n"); + } else +#endif + if (tp_features.uwb && + !(status & TP_ACPI_UWB_HWPRESENT)) { + /* no uwb hardware present in system */ + tp_features.uwb = 0; + dbg_printk(TPACPI_DBG_INIT, + "uwb hardware not installed\n"); + } + + if (!tp_features.uwb) + return 1; + + res = tpacpi_new_rfkill(TPACPI_RFK_UWB_SW_ID, + &tpacpi_uwb_rfkill, + RFKILL_TYPE_UWB, + "tpacpi_uwb_sw", + false, + tpacpi_uwb_rfk_set, + tpacpi_uwb_rfk_get); + + return res; +} + +static struct ibm_struct uwb_driver_data = { + .name = "uwb", + .exit = uwb_exit, + .flags.experimental = 1, +}; + /************************************************************************* * Video subdriver */ @@ -6830,6 +7024,10 @@ static struct ibm_init_struct ibms_init[] __initdata = { .init = wan_init, .data = &wan_driver_data, }, + { + .init = uwb_init, + .data = &uwb_driver_data, + }, #ifdef CONFIG_THINKPAD_ACPI_VIDEO { .init = video_init, @@ -6986,6 +7184,12 @@ MODULE_PARM_DESC(dbg_wwanemul, "Enables WWAN switch emulation"); module_param_named(wwan_state, tpacpi_wwan_emulstate, bool, 0); MODULE_PARM_DESC(wwan_state, "Initial state of the emulated WWAN switch"); + +module_param(dbg_uwbemul, uint, 0); +MODULE_PARM_DESC(dbg_uwbemul, "Enables UWB switch emulation"); +module_param_named(uwb_state, tpacpi_uwb_emulstate, bool, 0); +MODULE_PARM_DESC(uwb_state, + "Initial state of the emulated UWB switch"); #endif static void thinkpad_acpi_module_exit(void) -- cgit v1.2.3 From 175add1981e53d22caba8f42d5f924a4de507b6c Mon Sep 17 00:00:00 2001 From: John Keller Date: Mon, 24 Nov 2008 16:47:17 -0600 Subject: [IA64] SN specific version of dma_get_required_mask() Create a platform specific version of dma_get_required_mask() for ia64 SN Altix. All SN Altix platforms support 64 bit DMA addressing regardless of the size of system memory. Create an ia64 machvec for dma_get_required_mask, with the SN version unconditionally returning DMA_64BIT_MASK. 
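The intended driver-side usage, sketched against the DMA API of this era
(illustration only; example_setup_dma is a hypothetical name), is to consult
the required mask first and then set the cheapest mask that covers it:

/*
 * Hypothetical driver fragment: choose a DMA mask based on
 * dma_get_required_mask().  If 32 bits already cover all of memory,
 * the smaller descriptors are enough; otherwise prefer 64-bit
 * addressing and fall back to the 32-bit mask if that fails.
 */
#include <linux/device.h>
#include <linux/dma-mapping.h>

static int example_setup_dma(struct device *dev)
{
	u64 required = dma_get_required_mask(dev);

	if (required <= DMA_32BIT_MASK)
		return dma_set_mask(dev, DMA_32BIT_MASK);

	if (dma_set_mask(dev, DMA_64BIT_MASK) == 0)
		return 0;

	/* fall back to 32-bit addressing and bounce buffering */
	return dma_set_mask(dev, DMA_32BIT_MASK);
}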
Signed-off-by: John Keller Signed-off-by: Tony Luck --- Documentation/DMA-API.txt | 9 ++++----- arch/ia64/include/asm/dma-mapping.h | 2 ++ arch/ia64/include/asm/machvec.h | 7 +++++++ arch/ia64/include/asm/machvec_init.h | 1 + arch/ia64/include/asm/machvec_sn2.h | 2 ++ arch/ia64/pci/pci.c | 27 +++++++++++++++++++++++++++ arch/ia64/sn/pci/pci_dma.c | 6 ++++++ 7 files changed, 49 insertions(+), 5 deletions(-) (limited to 'Documentation') diff --git a/Documentation/DMA-API.txt b/Documentation/DMA-API.txt index b462bb149543..52441694fe03 100644 --- a/Documentation/DMA-API.txt +++ b/Documentation/DMA-API.txt @@ -170,16 +170,15 @@ Returns: 0 if successful and a negative error if not. u64 dma_get_required_mask(struct device *dev) -After setting the mask with dma_set_mask(), this API returns the -actual mask (within that already set) that the platform actually -requires to operate efficiently. Usually this means the returned mask +This API returns the mask that the platform requires to +operate efficiently. Usually this means the returned mask is the minimum required to cover all of memory. Examining the required mask gives drivers with variable descriptor sizes the opportunity to use smaller descriptors as necessary. Requesting the required mask does not alter the current mask. If you -wish to take advantage of it, you should issue another dma_set_mask() -call to lower the mask again. +wish to take advantage of it, you should issue a dma_set_mask() +call to set the mask to the value returned. Part Id - Streaming DMA mappings diff --git a/arch/ia64/include/asm/dma-mapping.h b/arch/ia64/include/asm/dma-mapping.h index bbab7e2b0fc9..1f912d927585 100644 --- a/arch/ia64/include/asm/dma-mapping.h +++ b/arch/ia64/include/asm/dma-mapping.h @@ -9,6 +9,8 @@ #include #include +#define ARCH_HAS_DMA_GET_REQUIRED_MASK + struct dma_mapping_ops { int (*mapping_error)(struct device *dev, dma_addr_t dma_addr); diff --git a/arch/ia64/include/asm/machvec.h b/arch/ia64/include/asm/machvec.h index 59c17e446683..fe87b2121707 100644 --- a/arch/ia64/include/asm/machvec.h +++ b/arch/ia64/include/asm/machvec.h @@ -62,6 +62,7 @@ typedef dma_addr_t ia64_mv_dma_map_single_attrs (struct device *, void *, size_t typedef void ia64_mv_dma_unmap_single_attrs (struct device *, dma_addr_t, size_t, int, struct dma_attrs *); typedef int ia64_mv_dma_map_sg_attrs (struct device *, struct scatterlist *, int, int, struct dma_attrs *); typedef void ia64_mv_dma_unmap_sg_attrs (struct device *, struct scatterlist *, int, int, struct dma_attrs *); +typedef u64 ia64_mv_dma_get_required_mask (struct device *); /* * WARNING: The legacy I/O space is _architected_. 
Platforms are @@ -159,6 +160,7 @@ extern void machvec_tlb_migrate_finish (struct mm_struct *); # define platform_dma_sync_sg_for_device ia64_mv.dma_sync_sg_for_device # define platform_dma_mapping_error ia64_mv.dma_mapping_error # define platform_dma_supported ia64_mv.dma_supported +# define platform_dma_get_required_mask ia64_mv.dma_get_required_mask # define platform_irq_to_vector ia64_mv.irq_to_vector # define platform_local_vector_to_irq ia64_mv.local_vector_to_irq # define platform_pci_get_legacy_mem ia64_mv.pci_get_legacy_mem @@ -213,6 +215,7 @@ struct ia64_machine_vector { ia64_mv_dma_sync_sg_for_device *dma_sync_sg_for_device; ia64_mv_dma_mapping_error *dma_mapping_error; ia64_mv_dma_supported *dma_supported; + ia64_mv_dma_get_required_mask *dma_get_required_mask; ia64_mv_irq_to_vector *irq_to_vector; ia64_mv_local_vector_to_irq *local_vector_to_irq; ia64_mv_pci_get_legacy_mem_t *pci_get_legacy_mem; @@ -263,6 +266,7 @@ struct ia64_machine_vector { platform_dma_sync_sg_for_device, \ platform_dma_mapping_error, \ platform_dma_supported, \ + platform_dma_get_required_mask, \ platform_irq_to_vector, \ platform_local_vector_to_irq, \ platform_pci_get_legacy_mem, \ @@ -366,6 +370,9 @@ extern void machvec_init_from_cmdline(const char *cmdline); #ifndef platform_dma_supported # define platform_dma_supported swiotlb_dma_supported #endif +#ifndef platform_dma_get_required_mask +# define platform_dma_get_required_mask ia64_dma_get_required_mask +#endif #ifndef platform_irq_to_vector # define platform_irq_to_vector __ia64_irq_to_vector #endif diff --git a/arch/ia64/include/asm/machvec_init.h b/arch/ia64/include/asm/machvec_init.h index ef964b286842..37a469849ab9 100644 --- a/arch/ia64/include/asm/machvec_init.h +++ b/arch/ia64/include/asm/machvec_init.h @@ -3,6 +3,7 @@ extern ia64_mv_send_ipi_t ia64_send_ipi; extern ia64_mv_global_tlb_purge_t ia64_global_tlb_purge; +extern ia64_mv_dma_get_required_mask ia64_dma_get_required_mask; extern ia64_mv_irq_to_vector __ia64_irq_to_vector; extern ia64_mv_local_vector_to_irq __ia64_local_vector_to_irq; extern ia64_mv_pci_get_legacy_mem_t ia64_pci_get_legacy_mem; diff --git a/arch/ia64/include/asm/machvec_sn2.h b/arch/ia64/include/asm/machvec_sn2.h index 781308ea7b88..f1a6e0d6dfa5 100644 --- a/arch/ia64/include/asm/machvec_sn2.h +++ b/arch/ia64/include/asm/machvec_sn2.h @@ -67,6 +67,7 @@ extern ia64_mv_dma_sync_single_for_device sn_dma_sync_single_for_device; extern ia64_mv_dma_sync_sg_for_device sn_dma_sync_sg_for_device; extern ia64_mv_dma_mapping_error sn_dma_mapping_error; extern ia64_mv_dma_supported sn_dma_supported; +extern ia64_mv_dma_get_required_mask sn_dma_get_required_mask; extern ia64_mv_migrate_t sn_migrate; extern ia64_mv_kernel_launch_event_t sn_kernel_launch_event; extern ia64_mv_setup_msi_irq_t sn_setup_msi_irq; @@ -123,6 +124,7 @@ extern ia64_mv_pci_fixup_bus_t sn_pci_fixup_bus; #define platform_dma_sync_sg_for_device sn_dma_sync_sg_for_device #define platform_dma_mapping_error sn_dma_mapping_error #define platform_dma_supported sn_dma_supported +#define platform_dma_get_required_mask sn_dma_get_required_mask #define platform_migrate sn_migrate #define platform_kernel_launch_event sn_kernel_launch_event #ifdef CONFIG_PCI_MSI diff --git a/arch/ia64/pci/pci.c b/arch/ia64/pci/pci.c index 211fcfd115f9..61f1af5c23c1 100644 --- a/arch/ia64/pci/pci.c +++ b/arch/ia64/pci/pci.c @@ -19,6 +19,7 @@ #include #include #include +#include #include #include @@ -748,6 +749,32 @@ static void __init set_pci_cacheline_size(void) pci_cache_line_size = (1 << 
cci.pcci_line_size) / 4; } +u64 ia64_dma_get_required_mask(struct device *dev) +{ + u32 low_totalram = ((max_pfn - 1) << PAGE_SHIFT); + u32 high_totalram = ((max_pfn - 1) >> (32 - PAGE_SHIFT)); + u64 mask; + + if (!high_totalram) { + /* convert to mask just covering totalram */ + low_totalram = (1 << (fls(low_totalram) - 1)); + low_totalram += low_totalram - 1; + mask = low_totalram; + } else { + high_totalram = (1 << (fls(high_totalram) - 1)); + high_totalram += high_totalram - 1; + mask = (((u64)high_totalram) << 32) + 0xffffffff; + } + return mask; +} +EXPORT_SYMBOL_GPL(ia64_dma_get_required_mask); + +u64 dma_get_required_mask(struct device *dev) +{ + return platform_dma_get_required_mask(dev); +} +EXPORT_SYMBOL_GPL(dma_get_required_mask); + static int __init pcibios_init(void) { set_pci_cacheline_size(); diff --git a/arch/ia64/sn/pci/pci_dma.c b/arch/ia64/sn/pci/pci_dma.c index 53ebb6484495..863f5017baae 100644 --- a/arch/ia64/sn/pci/pci_dma.c +++ b/arch/ia64/sn/pci/pci_dma.c @@ -356,6 +356,12 @@ int sn_dma_mapping_error(struct device *dev, dma_addr_t dma_addr) } EXPORT_SYMBOL(sn_dma_mapping_error); +u64 sn_dma_get_required_mask(struct device *dev) +{ + return DMA_64BIT_MASK; +} +EXPORT_SYMBOL_GPL(sn_dma_get_required_mask); + char *sn_pci_get_legacy_mem(struct pci_bus *bus) { if (!SN_PCIBUS_BUSSOFT(bus)) -- cgit v1.2.3 From aa2fbcec07b0d594808bc3058692395d24eba66e Mon Sep 17 00:00:00 2001 From: Henrique de Moraes Holschuh Date: Sun, 11 Jan 2009 03:01:10 -0200 Subject: ACPI: thinkpad-acpi: bump up version to 0.22 It is about time to bump up the version. Features added since 0.21: fan suspend/resume support, preserve radio state across power off (for some radio types), built-in UWB radio rfkill support and thermal alarm events support. Signed-off-by: Henrique de Moraes Holschuh Signed-off-by: Len Brown --- Documentation/laptops/thinkpad-acpi.txt | 6 +++--- drivers/platform/x86/thinkpad_acpi.c | 2 +- 2 files changed, 4 insertions(+), 4 deletions(-) (limited to 'Documentation') diff --git a/Documentation/laptops/thinkpad-acpi.txt b/Documentation/laptops/thinkpad-acpi.txt index 91c00010b15c..41bc99fa1884 100644 --- a/Documentation/laptops/thinkpad-acpi.txt +++ b/Documentation/laptops/thinkpad-acpi.txt @@ -1,7 +1,7 @@ ThinkPad ACPI Extras Driver - Version 0.21 - May 29th, 2008 + Version 0.22 + November 23rd, 2008 Borislav Deianov Henrique de Moraes Holschuh @@ -17,7 +17,7 @@ This driver used to be named ibm-acpi until kernel 2.6.21 and release 0.13-20070314. It used to be in the drivers/acpi tree, but it was moved to the drivers/misc tree and renamed to thinkpad-acpi for kernel 2.6.22, and release 0.14. It was moved to drivers/platform/x86 for -kernel 2.6.29. +kernel 2.6.29 and release 0.22. The driver is named "thinkpad-acpi". In some places, like module names, "thinkpad_acpi" is used because of userspace issues. diff --git a/drivers/platform/x86/thinkpad_acpi.c b/drivers/platform/x86/thinkpad_acpi.c index 886a4306e78c..bcbc05107ba8 100644 --- a/drivers/platform/x86/thinkpad_acpi.c +++ b/drivers/platform/x86/thinkpad_acpi.c @@ -21,7 +21,7 @@ * 02110-1301, USA. */ -#define TPACPI_VERSION "0.21" +#define TPACPI_VERSION "0.22" #define TPACPI_SYSFS_VERSION 0x020200 /* -- cgit v1.2.3 From 1c301fc5394f7e1aa4c201e6e03d55d9c08b3bdf Mon Sep 17 00:00:00 2001 From: Jordan Crouse Date: Thu, 15 Jan 2009 22:27:47 +0100 Subject: hwmon: Add a driver for the ADT7475 hardware monitoring chip Hwmon driver for the ADT7475 chip. 
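The sysfs attributes documented below are plain text files; a minimal
user-space sketch that reads a fan speed (the hwmon0/device path is an
assumption, since the instance number depends on probe order):

/*
 * User-space sketch: read fan1_input (RPM) from an adt7475 hwmon
 * instance.  In this kernel generation the attributes sit on the i2c
 * client device, reached through the hwmon class "device" link.
 */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/class/hwmon/hwmon0/device/fan1_input", "r");
	int rpm;

	if (!f) {
		perror("fan1_input");
		return 1;
	}
	if (fscanf(f, "%d", &rpm) != 1)
		rpm = -1;
	fclose(f);
	printf("fan1: %d RPM\n", rpm);
	return 0;
}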
Signed-off-by: Jordan Crouse Signed-off-by: Hans de Goede Signed-off-by: Jean Delvare --- Documentation/hwmon/adt7475 | 87 +++ drivers/hwmon/Kconfig | 10 + drivers/hwmon/Makefile | 2 + drivers/hwmon/adt7475.c | 1221 +++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 1320 insertions(+) create mode 100644 Documentation/hwmon/adt7475 create mode 100644 drivers/hwmon/adt7475.c (limited to 'Documentation') diff --git a/Documentation/hwmon/adt7475 b/Documentation/hwmon/adt7475 new file mode 100644 index 000000000000..a2b1abec850e --- /dev/null +++ b/Documentation/hwmon/adt7475 @@ -0,0 +1,87 @@ +This describes the interface for the ADT7475 driver: + +(there are 4 fans, numbered fan1 to fan4): + +fanX_input Read the current speed of the fan (in RPMs) +fanX_min Read/write the minimum speed of the fan. Dropping + below this sets an alarm. + +(there are three PWMs, numbered pwm1 to pwm3): + +pwmX Read/write the current duty cycle of the PWM. Writes + only have effect when auto mode is turned off (see + below). Range is 0 - 255. + +pwmX_enable Fan speed control method: + + 0 - No control (fan at full speed) + 1 - Manual fan speed control (using pwm[1-*]) + 2 - Automatic fan speed control + +pwmX_auto_channels_temp Select which channels affect this PWM + + 1 - TEMP1 controls PWM + 2 - TEMP2 controls PWM + 4 - TEMP3 controls PWM + 6 - TEMP2 and TEMP3 control PWM + 7 - All three inputs control PWM + +pwmX_freq Read/write the PWM frequency in Hz. The number + should be one of the following: + + 11 Hz + 14 Hz + 22 Hz + 29 Hz + 35 Hz + 44 Hz + 58 Hz + 88 Hz + +pwmX_auto_point1_pwm Read/write the minimum PWM duty cycle in automatic mode + +pwmX_auto_point2_pwm Read/write the maximum PWM duty cycle in automatic mode + +(there are three temperature settings numbered temp1 to temp3): + +tempX_input Read the current temperature. The value is in milli + degrees of Celsius. + +tempX_max Read/write the upper temperature limit - exceeding this + will cause an alarm. + +tempX_min Read/write the lower temperature limit - exceeding this + will cause an alarm. + +tempX_offset Read/write the temperature adjustment offset + +tempX_crit Read/write the THERM limit for remote1. + +tempX_crit_hyst Set the temperature value below crit where the + fans will stay on - this helps drive the temperature + low enough so it doesn't stay near the edge and + cause THERM to keep tripping. + +tempX_auto_point1_temp Read/write the minimum temperature where the fans will + turn on in automatic mode. + +tempX_auto_point2_temp Read/write the maximum temperature over which the fans + will run in automatic mode. tempX_auto_point1_temp + and tempX_auto_point2_temp together define the + range of automatic control. + +tempX_alarm Read a 1 if the max/min alarm is set +tempX_fault Read a 1 if either temp1 or temp3 diode has a fault + +(There are two voltage settings, in1 and in2): + +inX_input Read the current voltage on VCC. Value is in + millivolts. + +inX_min read/write the minimum voltage limit. + Dropping below this causes an alarm. + +inX_max read/write the maximum voltage limit. + Exceeding this causes an alarm. + +inX_alarm Read a 1 if the max/min alarm is set. diff --git a/drivers/hwmon/Kconfig b/drivers/hwmon/Kconfig index 4b33bc82cc24..5c349a19a3a3 100644 --- a/drivers/hwmon/Kconfig +++ b/drivers/hwmon/Kconfig @@ -189,6 +189,16 @@ config SENSORS_ADT7473 This driver can also be built as a module. If so, the module will be called adt7473. 
+config SENSORS_ADT7475 + tristate "Analog Devices ADT7475" + depends on I2C && EXPERIMENTAL + help + If you say yes here you get support for the Analog Devices + ADT7475 hardware monitoring chips. + + This driver can also be build as a module. If so, the module + will be called adt7475. + config SENSORS_K8TEMP tristate "AMD Athlon64/FX or Opteron temperature sensor" depends on X86 && PCI && EXPERIMENTAL diff --git a/drivers/hwmon/Makefile b/drivers/hwmon/Makefile index 19cb1ace3eb4..2e80f37f39eb 100644 --- a/drivers/hwmon/Makefile +++ b/drivers/hwmon/Makefile @@ -28,6 +28,8 @@ obj-$(CONFIG_SENSORS_ADS7828) += ads7828.o obj-$(CONFIG_SENSORS_ADT7462) += adt7462.o obj-$(CONFIG_SENSORS_ADT7470) += adt7470.o obj-$(CONFIG_SENSORS_ADT7473) += adt7473.o +obj-$(CONFIG_SENSORS_ADT7475) += adt7475.o + obj-$(CONFIG_SENSORS_APPLESMC) += applesmc.o obj-$(CONFIG_SENSORS_AMS) += ams/ obj-$(CONFIG_SENSORS_ATXP1) += atxp1.o diff --git a/drivers/hwmon/adt7475.c b/drivers/hwmon/adt7475.c new file mode 100644 index 000000000000..d39877a7da63 --- /dev/null +++ b/drivers/hwmon/adt7475.c @@ -0,0 +1,1221 @@ +/* + * adt7475 - Thermal sensor driver for the ADT7475 chip and derivatives + * Copyright (C) 2007-2008, Advanced Micro Devices, Inc. + * Copyright (C) 2008 Jordan Crouse + * Copyright (C) 2008 Hans de Goede + + * Derived from the lm83 driver by Jean Delvare + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + */ + +#include +#include +#include +#include +#include +#include +#include + +/* Indexes for the sysfs hooks */ + +#define INPUT 0 +#define MIN 1 +#define MAX 2 +#define CONTROL 3 +#define OFFSET 3 +#define AUTOMIN 4 +#define THERM 5 +#define HYSTERSIS 6 + +/* These are unique identifiers for the sysfs functions - unlike the + numbers above, these are not also indexes into an array +*/ + +#define ALARM 9 +#define FAULT 10 + +/* 7475 Common Registers */ + +#define REG_VOLTAGE_BASE 0x21 +#define REG_TEMP_BASE 0x25 +#define REG_TACH_BASE 0x28 +#define REG_PWM_BASE 0x30 +#define REG_PWM_MAX_BASE 0x38 + +#define REG_DEVID 0x3D +#define REG_VENDID 0x3E + +#define REG_STATUS1 0x41 +#define REG_STATUS2 0x42 + +#define REG_VOLTAGE_MIN_BASE 0x46 +#define REG_VOLTAGE_MAX_BASE 0x47 + +#define REG_TEMP_MIN_BASE 0x4E +#define REG_TEMP_MAX_BASE 0x4F + +#define REG_TACH_MIN_BASE 0x54 + +#define REG_PWM_CONFIG_BASE 0x5C + +#define REG_TEMP_TRANGE_BASE 0x5F + +#define REG_PWM_MIN_BASE 0x64 + +#define REG_TEMP_TMIN_BASE 0x67 +#define REG_TEMP_THERM_BASE 0x6A + +#define REG_REMOTE1_HYSTERSIS 0x6D +#define REG_REMOTE2_HYSTERSIS 0x6E + +#define REG_TEMP_OFFSET_BASE 0x70 + +#define REG_EXTEND1 0x76 +#define REG_EXTEND2 0x77 +#define REG_CONFIG5 0x7C + +#define CONFIG5_TWOSCOMP 0x01 +#define CONFIG5_TEMPOFFSET 0x02 + +/* ADT7475 Settings */ + +#define ADT7475_VOLTAGE_COUNT 2 +#define ADT7475_TEMP_COUNT 3 +#define ADT7475_TACH_COUNT 4 +#define ADT7475_PWM_COUNT 3 + +/* Macro to read the registers */ + +#define adt7475_read(reg) i2c_smbus_read_byte_data(client, (reg)) + +/* Macros to easily index the registers */ + +#define TACH_REG(idx) (REG_TACH_BASE + ((idx) * 2)) +#define TACH_MIN_REG(idx) (REG_TACH_MIN_BASE + ((idx) * 2)) + +#define PWM_REG(idx) (REG_PWM_BASE + (idx)) +#define PWM_MAX_REG(idx) (REG_PWM_MAX_BASE + (idx)) +#define PWM_MIN_REG(idx) (REG_PWM_MIN_BASE + (idx)) +#define PWM_CONFIG_REG(idx) (REG_PWM_CONFIG_BASE + (idx)) + +#define VOLTAGE_REG(idx) (REG_VOLTAGE_BASE + (idx)) 
+#define VOLTAGE_MIN_REG(idx) (REG_VOLTAGE_MIN_BASE + ((idx) * 2)) +#define VOLTAGE_MAX_REG(idx) (REG_VOLTAGE_MAX_BASE + ((idx) * 2)) + +#define TEMP_REG(idx) (REG_TEMP_BASE + (idx)) +#define TEMP_MIN_REG(idx) (REG_TEMP_MIN_BASE + ((idx) * 2)) +#define TEMP_MAX_REG(idx) (REG_TEMP_MAX_BASE + ((idx) * 2)) +#define TEMP_TMIN_REG(idx) (REG_TEMP_TMIN_BASE + (idx)) +#define TEMP_THERM_REG(idx) (REG_TEMP_THERM_BASE + (idx)) +#define TEMP_OFFSET_REG(idx) (REG_TEMP_OFFSET_BASE + (idx)) +#define TEMP_TRANGE_REG(idx) (REG_TEMP_TRANGE_BASE + (idx)) + +static unsigned short normal_i2c[] = { 0x2e, I2C_CLIENT_END }; + +I2C_CLIENT_INSMOD_1(adt7475); + +static const struct i2c_device_id adt7475_id[] = { + { "adt7475", adt7475 }, + { } +}; +MODULE_DEVICE_TABLE(i2c, adt7475_id); + +struct adt7475_data { + struct device *hwmon_dev; + struct mutex lock; + + unsigned long measure_updated; + unsigned long limits_updated; + char valid; + + u8 config5; + u16 alarms; + u16 voltage[3][3]; + u16 temp[7][3]; + u16 tach[2][4]; + u8 pwm[4][3]; + u8 range[3]; + u8 pwmctl[3]; + u8 pwmchan[3]; +}; + +static struct i2c_driver adt7475_driver; +static struct adt7475_data *adt7475_update_device(struct device *dev); +static void adt7475_read_hystersis(struct i2c_client *client); +static void adt7475_read_pwm(struct i2c_client *client, int index); + +/* Given a temp value, convert it to register value */ + +static inline u16 temp2reg(struct adt7475_data *data, long val) +{ + u16 ret; + + if (!(data->config5 & CONFIG5_TWOSCOMP)) { + val = SENSORS_LIMIT(val, -64000, 191000); + ret = (val + 64500) / 1000; + } else { + val = SENSORS_LIMIT(val, -128000, 127000); + if (val < -500) + ret = (256500 + val) / 1000; + else + ret = (val + 500) / 1000; + } + + return ret << 2; +} + +/* Given a register value, convert it to a real temp value */ + +static inline int reg2temp(struct adt7475_data *data, u16 reg) +{ + if (data->config5 & CONFIG5_TWOSCOMP) { + if (reg >= 512) + return (reg - 1024) * 250; + else + return reg * 250; + } else + return (reg - 256) * 250; +} + +static inline int tach2rpm(u16 tach) +{ + if (tach == 0 || tach == 0xFFFF) + return 0; + + return (90000 * 60) / tach; +} + +static inline u16 rpm2tach(unsigned long rpm) +{ + if (rpm == 0) + return 0; + + return SENSORS_LIMIT((90000 * 60) / rpm, 1, 0xFFFF); +} + +static inline int reg2vcc(u16 reg) +{ + return (4296 * reg) / 1000; +} + +static inline int reg2vccp(u16 reg) +{ + return (2929 * reg) / 1000; +} + +static inline u16 vcc2reg(long vcc) +{ + vcc = SENSORS_LIMIT(vcc, 0, 4396); + return (vcc * 1000) / 4296; +} + +static inline u16 vccp2reg(long vcc) +{ + vcc = SENSORS_LIMIT(vcc, 0, 2998); + return (vcc * 1000) / 2929; +} + +static u16 adt7475_read_word(struct i2c_client *client, int reg) +{ + u16 val; + + val = i2c_smbus_read_byte_data(client, reg); + val |= (i2c_smbus_read_byte_data(client, reg + 1) << 8); + + return val; +} + +static void adt7475_write_word(struct i2c_client *client, int reg, u16 val) +{ + i2c_smbus_write_byte_data(client, reg + 1, val >> 8); + i2c_smbus_write_byte_data(client, reg, val & 0xFF); +} + +/* Find the nearest value in a table - used for pwm frequency and + auto temp range */ +static int find_nearest(long val, const int *array, int size) +{ + int i; + + if (val < array[0]) + return 0; + + if (val > array[size - 1]) + return size - 1; + + for (i = 0; i < size - 1; i++) { + int a, b; + + if (val > array[i + 1]) + continue; + + a = val - array[i]; + b = array[i + 1] - val; + + return (a <= b) ? 
i : i + 1; + } + + return 0; +} + +static ssize_t show_voltage(struct device *dev, struct device_attribute *attr, + char *buf) +{ + struct adt7475_data *data = adt7475_update_device(dev); + struct sensor_device_attribute_2 *sattr = to_sensor_dev_attr_2(attr); + unsigned short val; + + switch (sattr->nr) { + case ALARM: + return sprintf(buf, "%d\n", + (data->alarms >> (sattr->index + 1)) & 1); + default: + val = data->voltage[sattr->nr][sattr->index]; + return sprintf(buf, "%d\n", + sattr->index == + 0 ? reg2vccp(val) : reg2vcc(val)); + } +} + +static ssize_t set_voltage(struct device *dev, struct device_attribute *attr, + const char *buf, size_t count) +{ + + struct sensor_device_attribute_2 *sattr = to_sensor_dev_attr_2(attr); + struct i2c_client *client = to_i2c_client(dev); + struct adt7475_data *data = i2c_get_clientdata(client); + unsigned char reg; + long val; + + if (strict_strtol(buf, 10, &val)) + return -EINVAL; + + mutex_lock(&data->lock); + + data->voltage[sattr->nr][sattr->index] = + sattr->index ? vcc2reg(val) : vccp2reg(val); + + if (sattr->nr == MIN) + reg = VOLTAGE_MIN_REG(sattr->index); + else + reg = VOLTAGE_MAX_REG(sattr->index); + + i2c_smbus_write_byte_data(client, reg, + data->voltage[sattr->nr][sattr->index] >> 2); + mutex_unlock(&data->lock); + + return count; +} + +static ssize_t show_temp(struct device *dev, struct device_attribute *attr, + char *buf) +{ + struct adt7475_data *data = adt7475_update_device(dev); + struct sensor_device_attribute_2 *sattr = to_sensor_dev_attr_2(attr); + int out; + + switch (sattr->nr) { + case HYSTERSIS: + mutex_lock(&data->lock); + out = data->temp[sattr->nr][sattr->index]; + if (sattr->index != 1) + out = (out >> 4) & 0xF; + else + out = (out & 0xF); + /* Show the value as an absolute number tied to + * THERM */ + out = reg2temp(data, data->temp[THERM][sattr->index]) - + out * 1000; + mutex_unlock(&data->lock); + break; + + case OFFSET: + /* Offset is always 2's complement, regardless of the + * setting in CONFIG5 */ + mutex_lock(&data->lock); + out = (s8)data->temp[sattr->nr][sattr->index]; + if (data->config5 & CONFIG5_TEMPOFFSET) + out *= 1000; + else + out *= 500; + mutex_unlock(&data->lock); + break; + + case ALARM: + out = (data->alarms >> (sattr->index + 4)) & 1; + break; + + case FAULT: + /* Note - only for remote1 and remote2 */ + out = data->alarms & (sattr->index ? 0x8000 : 0x4000); + out = out ? 0 : 1; + break; + + default: + /* All other temp values are in the configured format */ + out = reg2temp(data, data->temp[sattr->nr][sattr->index]); + } + + return sprintf(buf, "%d\n", out); +} + +static ssize_t set_temp(struct device *dev, struct device_attribute *attr, + const char *buf, size_t count) +{ + struct sensor_device_attribute_2 *sattr = to_sensor_dev_attr_2(attr); + struct i2c_client *client = to_i2c_client(dev); + struct adt7475_data *data = i2c_get_clientdata(client); + unsigned char reg = 0; + u8 out; + int temp; + long val; + + if (strict_strtol(buf, 10, &val)) + return -EINVAL; + + mutex_lock(&data->lock); + + /* We need the config register in all cases for temp <-> reg conv. 
*/ + data->config5 = adt7475_read(REG_CONFIG5); + + switch (sattr->nr) { + case OFFSET: + if (data->config5 & CONFIG5_TEMPOFFSET) { + val = SENSORS_LIMIT(val, -63000, 127000); + out = data->temp[OFFSET][sattr->index] = val / 1000; + } else { + val = SENSORS_LIMIT(val, -63000, 64000); + out = data->temp[OFFSET][sattr->index] = val / 500; + } + break; + + case HYSTERSIS: + /* The value will be given as an absolute value, turn it + into an offset based on THERM */ + + /* Read fresh THERM and HYSTERSIS values from the chip */ + data->temp[THERM][sattr->index] = + adt7475_read(TEMP_THERM_REG(sattr->index)) << 2; + adt7475_read_hystersis(client); + + temp = reg2temp(data, data->temp[THERM][sattr->index]); + val = SENSORS_LIMIT(val, temp - 15000, temp); + val = (temp - val) / 1000; + + if (sattr->index != 1) { + data->temp[HYSTERSIS][sattr->index] &= 0xF0; + data->temp[HYSTERSIS][sattr->index] |= (val & 0xF) << 4; + } else { + data->temp[HYSTERSIS][sattr->index] &= 0x0F; + data->temp[HYSTERSIS][sattr->index] |= (val & 0xF); + } + + out = data->temp[HYSTERSIS][sattr->index]; + break; + + default: + data->temp[sattr->nr][sattr->index] = temp2reg(data, val); + + /* We maintain an extra 2 digits of precision for simplicity + * - shift those back off before writing the value */ + out = (u8) (data->temp[sattr->nr][sattr->index] >> 2); + } + + switch (sattr->nr) { + case MIN: + reg = TEMP_MIN_REG(sattr->index); + break; + case MAX: + reg = TEMP_MAX_REG(sattr->index); + break; + case OFFSET: + reg = TEMP_OFFSET_REG(sattr->index); + break; + case AUTOMIN: + reg = TEMP_TMIN_REG(sattr->index); + break; + case THERM: + reg = TEMP_THERM_REG(sattr->index); + break; + case HYSTERSIS: + if (sattr->index != 2) + reg = REG_REMOTE1_HYSTERSIS; + else + reg = REG_REMOTE2_HYSTERSIS; + + break; + } + + i2c_smbus_write_byte_data(client, reg, out); + + mutex_unlock(&data->lock); + return count; +} + +/* Table of autorange values - the user will write the value in millidegrees, + and we'll convert it */ +static const int autorange_table[] = { + 2000, 2500, 3330, 4000, 5000, 6670, 8000, + 10000, 13330, 16000, 20000, 26670, 32000, 40000, + 53330, 80000 +}; + +static ssize_t show_point2(struct device *dev, struct device_attribute *attr, + char *buf) +{ + struct adt7475_data *data = adt7475_update_device(dev); + struct sensor_device_attribute_2 *sattr = to_sensor_dev_attr_2(attr); + int out, val; + + mutex_lock(&data->lock); + out = (data->range[sattr->index] >> 4) & 0x0F; + val = reg2temp(data, data->temp[AUTOMIN][sattr->index]); + mutex_unlock(&data->lock); + + return sprintf(buf, "%d\n", val + autorange_table[out]); +} + +static ssize_t set_point2(struct device *dev, struct device_attribute *attr, + const char *buf, size_t count) +{ + struct i2c_client *client = to_i2c_client(dev); + struct adt7475_data *data = i2c_get_clientdata(client); + struct sensor_device_attribute_2 *sattr = to_sensor_dev_attr_2(attr); + int temp; + long val; + + if (strict_strtol(buf, 10, &val)) + return -EINVAL; + + mutex_lock(&data->lock); + + /* Get a fresh copy of the needed registers */ + data->config5 = adt7475_read(REG_CONFIG5); + data->temp[AUTOMIN][sattr->index] = + adt7475_read(TEMP_TMIN_REG(sattr->index)) << 2; + data->range[sattr->index] = + adt7475_read(TEMP_TRANGE_REG(sattr->index)); + + /* The user will write an absolute value, so subtract the start point + to figure the range */ + temp = reg2temp(data, data->temp[AUTOMIN][sattr->index]); + val = SENSORS_LIMIT(val, temp + autorange_table[0], + temp + 
autorange_table[ARRAY_SIZE(autorange_table) - 1]); + val -= temp; + + /* Find the nearest table entry to what the user wrote */ + val = find_nearest(val, autorange_table, ARRAY_SIZE(autorange_table)); + + data->range[sattr->index] &= ~0xF0; + data->range[sattr->index] |= val << 4; + + i2c_smbus_write_byte_data(client, TEMP_TRANGE_REG(sattr->index), + data->range[sattr->index]); + + mutex_unlock(&data->lock); + return count; +} + +static ssize_t show_tach(struct device *dev, struct device_attribute *attr, + char *buf) +{ + struct adt7475_data *data = adt7475_update_device(dev); + struct sensor_device_attribute_2 *sattr = to_sensor_dev_attr_2(attr); + int out; + + if (sattr->nr == ALARM) + out = (data->alarms >> (sattr->index + 10)) & 1; + else + out = tach2rpm(data->tach[sattr->nr][sattr->index]); + + return sprintf(buf, "%d\n", out); +} + +static ssize_t set_tach(struct device *dev, struct device_attribute *attr, + const char *buf, size_t count) +{ + + struct sensor_device_attribute_2 *sattr = to_sensor_dev_attr_2(attr); + struct i2c_client *client = to_i2c_client(dev); + struct adt7475_data *data = i2c_get_clientdata(client); + unsigned long val; + + if (strict_strtoul(buf, 10, &val)) + return -EINVAL; + + mutex_lock(&data->lock); + + data->tach[MIN][sattr->index] = rpm2tach(val); + + adt7475_write_word(client, TACH_MIN_REG(sattr->index), + data->tach[MIN][sattr->index]); + + mutex_unlock(&data->lock); + return count; +} + +static ssize_t show_pwm(struct device *dev, struct device_attribute *attr, + char *buf) +{ + struct adt7475_data *data = adt7475_update_device(dev); + struct sensor_device_attribute_2 *sattr = to_sensor_dev_attr_2(attr); + + return sprintf(buf, "%d\n", data->pwm[sattr->nr][sattr->index]); +} + +static ssize_t show_pwmchan(struct device *dev, struct device_attribute *attr, + char *buf) +{ + struct adt7475_data *data = adt7475_update_device(dev); + struct sensor_device_attribute_2 *sattr = to_sensor_dev_attr_2(attr); + + return sprintf(buf, "%d\n", data->pwmchan[sattr->index]); +} + +static ssize_t show_pwmctrl(struct device *dev, struct device_attribute *attr, + char *buf) +{ + struct adt7475_data *data = adt7475_update_device(dev); + struct sensor_device_attribute_2 *sattr = to_sensor_dev_attr_2(attr); + + return sprintf(buf, "%d\n", data->pwmctl[sattr->index]); +} + +static ssize_t set_pwm(struct device *dev, struct device_attribute *attr, + const char *buf, size_t count) +{ + + struct sensor_device_attribute_2 *sattr = to_sensor_dev_attr_2(attr); + struct i2c_client *client = to_i2c_client(dev); + struct adt7475_data *data = i2c_get_clientdata(client); + unsigned char reg = 0; + long val; + + if (strict_strtol(buf, 10, &val)) + return -EINVAL; + + mutex_lock(&data->lock); + + switch (sattr->nr) { + case INPUT: + /* Get a fresh value for CONTROL */ + data->pwm[CONTROL][sattr->index] = + adt7475_read(PWM_CONFIG_REG(sattr->index)); + + /* If we are not in manual mode, then we shouldn't allow + * the user to set the pwm speed */ + if (((data->pwm[CONTROL][sattr->index] >> 5) & 7) != 7) { + mutex_unlock(&data->lock); + return count; + } + + reg = PWM_REG(sattr->index); + break; + + case MIN: + reg = PWM_MIN_REG(sattr->index); + break; + + case MAX: + reg = PWM_MAX_REG(sattr->index); + break; + } + + data->pwm[sattr->nr][sattr->index] = SENSORS_LIMIT(val, 0, 0xFF); + i2c_smbus_write_byte_data(client, reg, + data->pwm[sattr->nr][sattr->index]); + + mutex_unlock(&data->lock); + + return count; +} + +/* Called by set_pwmctrl and set_pwmchan */ + +static int hw_set_pwm(struct 
i2c_client *client, int index, + unsigned int pwmctl, unsigned int pwmchan) +{ + struct adt7475_data *data = i2c_get_clientdata(client); + long val = 0; + + switch (pwmctl) { + case 0: + val = 0x03; /* Run at full speed */ + break; + case 1: + val = 0x07; /* Manual mode */ + break; + case 2: + switch (pwmchan) { + case 1: + /* Remote1 controls PWM */ + val = 0x00; + break; + case 2: + /* local controls PWM */ + val = 0x01; + break; + case 4: + /* remote2 controls PWM */ + val = 0x02; + break; + case 6: + /* local/remote2 control PWM */ + val = 0x05; + break; + case 7: + /* All three control PWM */ + val = 0x06; + break; + default: + return -EINVAL; + } + break; + default: + return -EINVAL; + } + + data->pwmctl[index] = pwmctl; + data->pwmchan[index] = pwmchan; + + data->pwm[CONTROL][index] &= ~0xE0; + data->pwm[CONTROL][index] |= (val & 7) << 5; + + i2c_smbus_write_byte_data(client, PWM_CONFIG_REG(index), + data->pwm[CONTROL][index]); + + return 0; +} + +static ssize_t set_pwmchan(struct device *dev, struct device_attribute *attr, + const char *buf, size_t count) +{ + struct sensor_device_attribute_2 *sattr = to_sensor_dev_attr_2(attr); + struct i2c_client *client = to_i2c_client(dev); + struct adt7475_data *data = i2c_get_clientdata(client); + int r; + long val; + + if (strict_strtol(buf, 10, &val)) + return -EINVAL; + + mutex_lock(&data->lock); + /* Read Modify Write PWM values */ + adt7475_read_pwm(client, sattr->index); + r = hw_set_pwm(client, sattr->index, data->pwmctl[sattr->index], val); + if (r) + count = r; + mutex_unlock(&data->lock); + + return count; +} + +static ssize_t set_pwmctrl(struct device *dev, struct device_attribute *attr, + const char *buf, size_t count) +{ + struct sensor_device_attribute_2 *sattr = to_sensor_dev_attr_2(attr); + struct i2c_client *client = to_i2c_client(dev); + struct adt7475_data *data = i2c_get_clientdata(client); + int r; + long val; + + if (strict_strtol(buf, 10, &val)) + return -EINVAL; + + mutex_lock(&data->lock); + /* Read Modify Write PWM values */ + adt7475_read_pwm(client, sattr->index); + r = hw_set_pwm(client, sattr->index, val, data->pwmchan[sattr->index]); + if (r) + count = r; + mutex_unlock(&data->lock); + + return count; +} + +/* List of frequencies for the PWM */ +static const int pwmfreq_table[] = { + 11, 14, 22, 29, 35, 44, 58, 88 +}; + +static ssize_t show_pwmfreq(struct device *dev, struct device_attribute *attr, + char *buf) +{ + struct adt7475_data *data = adt7475_update_device(dev); + struct sensor_device_attribute_2 *sattr = to_sensor_dev_attr_2(attr); + + return sprintf(buf, "%d\n", + pwmfreq_table[data->range[sattr->index] & 7]); +} + +static ssize_t set_pwmfreq(struct device *dev, struct device_attribute *attr, + const char *buf, size_t count) +{ + struct sensor_device_attribute_2 *sattr = to_sensor_dev_attr_2(attr); + struct i2c_client *client = to_i2c_client(dev); + struct adt7475_data *data = i2c_get_clientdata(client); + int out; + long val; + + if (strict_strtol(buf, 10, &val)) + return -EINVAL; + + out = find_nearest(val, pwmfreq_table, ARRAY_SIZE(pwmfreq_table)); + + mutex_lock(&data->lock); + + data->range[sattr->index] = + adt7475_read(TEMP_TRANGE_REG(sattr->index)); + data->range[sattr->index] &= ~7; + data->range[sattr->index] |= out; + + i2c_smbus_write_byte_data(client, TEMP_TRANGE_REG(sattr->index), + data->range[sattr->index]); + + mutex_unlock(&data->lock); + return count; +} + +static SENSOR_DEVICE_ATTR_2(in1_input, S_IRUGO, show_voltage, NULL, INPUT, 0); +static SENSOR_DEVICE_ATTR_2(in1_max, S_IRUGO | 
S_IWUSR, show_voltage, + set_voltage, MAX, 0); +static SENSOR_DEVICE_ATTR_2(in1_min, S_IRUGO | S_IWUSR, show_voltage, + set_voltage, MIN, 0); +static SENSOR_DEVICE_ATTR_2(in1_alarm, S_IRUGO, show_voltage, NULL, ALARM, 0); +static SENSOR_DEVICE_ATTR_2(in2_input, S_IRUGO, show_voltage, NULL, INPUT, 1); +static SENSOR_DEVICE_ATTR_2(in2_max, S_IRUGO | S_IWUSR, show_voltage, + set_voltage, MAX, 1); +static SENSOR_DEVICE_ATTR_2(in2_min, S_IRUGO | S_IWUSR, show_voltage, + set_voltage, MIN, 1); +static SENSOR_DEVICE_ATTR_2(in2_alarm, S_IRUGO, show_voltage, NULL, ALARM, 1); +static SENSOR_DEVICE_ATTR_2(temp1_input, S_IRUGO, show_temp, NULL, INPUT, 0); +static SENSOR_DEVICE_ATTR_2(temp1_alarm, S_IRUGO, show_temp, NULL, ALARM, 0); +static SENSOR_DEVICE_ATTR_2(temp1_fault, S_IRUGO, show_temp, NULL, FAULT, 0); +static SENSOR_DEVICE_ATTR_2(temp1_max, S_IRUGO | S_IWUSR, show_temp, set_temp, + MAX, 0); +static SENSOR_DEVICE_ATTR_2(temp1_min, S_IRUGO | S_IWUSR, show_temp, set_temp, + MIN, 0); +static SENSOR_DEVICE_ATTR_2(temp1_offset, S_IRUGO | S_IWUSR, show_temp, + set_temp, OFFSET, 0); +static SENSOR_DEVICE_ATTR_2(temp1_auto_point1_temp, S_IRUGO | S_IWUSR, + show_temp, set_temp, AUTOMIN, 0); +static SENSOR_DEVICE_ATTR_2(temp1_auto_point2_temp, S_IRUGO | S_IWUSR, + show_point2, set_point2, 0, 0); +static SENSOR_DEVICE_ATTR_2(temp1_crit, S_IRUGO | S_IWUSR, show_temp, set_temp, + THERM, 0); +static SENSOR_DEVICE_ATTR_2(temp1_crit_hyst, S_IRUGO | S_IWUSR, show_temp, + set_temp, HYSTERSIS, 0); +static SENSOR_DEVICE_ATTR_2(temp2_input, S_IRUGO, show_temp, NULL, INPUT, 1); +static SENSOR_DEVICE_ATTR_2(temp2_alarm, S_IRUGO, show_temp, NULL, ALARM, 1); +static SENSOR_DEVICE_ATTR_2(temp2_max, S_IRUGO | S_IWUSR, show_temp, set_temp, + MAX, 1); +static SENSOR_DEVICE_ATTR_2(temp2_min, S_IRUGO | S_IWUSR, show_temp, set_temp, + MIN, 1); +static SENSOR_DEVICE_ATTR_2(temp2_offset, S_IRUGO | S_IWUSR, show_temp, + set_temp, OFFSET, 1); +static SENSOR_DEVICE_ATTR_2(temp2_auto_point1_temp, S_IRUGO | S_IWUSR, + show_temp, set_temp, AUTOMIN, 1); +static SENSOR_DEVICE_ATTR_2(temp2_auto_point2_temp, S_IRUGO | S_IWUSR, + show_point2, set_point2, 0, 1); +static SENSOR_DEVICE_ATTR_2(temp2_crit, S_IRUGO | S_IWUSR, show_temp, set_temp, + THERM, 1); +static SENSOR_DEVICE_ATTR_2(temp2_crit_hyst, S_IRUGO | S_IWUSR, show_temp, + set_temp, HYSTERSIS, 1); +static SENSOR_DEVICE_ATTR_2(temp3_input, S_IRUGO, show_temp, NULL, INPUT, 2); +static SENSOR_DEVICE_ATTR_2(temp3_alarm, S_IRUGO, show_temp, NULL, ALARM, 2); +static SENSOR_DEVICE_ATTR_2(temp3_fault, S_IRUGO, show_temp, NULL, FAULT, 2); +static SENSOR_DEVICE_ATTR_2(temp3_max, S_IRUGO | S_IWUSR, show_temp, set_temp, + MAX, 2); +static SENSOR_DEVICE_ATTR_2(temp3_min, S_IRUGO | S_IWUSR, show_temp, set_temp, + MIN, 2); +static SENSOR_DEVICE_ATTR_2(temp3_offset, S_IRUGO | S_IWUSR, show_temp, + set_temp, OFFSET, 2); +static SENSOR_DEVICE_ATTR_2(temp3_auto_point1_temp, S_IRUGO | S_IWUSR, + show_temp, set_temp, AUTOMIN, 2); +static SENSOR_DEVICE_ATTR_2(temp3_auto_point2_temp, S_IRUGO | S_IWUSR, + show_point2, set_point2, 0, 2); +static SENSOR_DEVICE_ATTR_2(temp3_crit, S_IRUGO | S_IWUSR, show_temp, set_temp, + THERM, 2); +static SENSOR_DEVICE_ATTR_2(temp3_crit_hyst, S_IRUGO | S_IWUSR, show_temp, + set_temp, HYSTERSIS, 2); +static SENSOR_DEVICE_ATTR_2(fan1_input, S_IRUGO, show_tach, NULL, INPUT, 0); +static SENSOR_DEVICE_ATTR_2(fan1_min, S_IRUGO | S_IWUSR, show_tach, set_tach, + MIN, 0); +static SENSOR_DEVICE_ATTR_2(fan1_alarm, S_IRUGO, show_tach, NULL, ALARM, 0); +static 
SENSOR_DEVICE_ATTR_2(fan2_input, S_IRUGO, show_tach, NULL, INPUT, 1); +static SENSOR_DEVICE_ATTR_2(fan2_min, S_IRUGO | S_IWUSR, show_tach, set_tach, + MIN, 1); +static SENSOR_DEVICE_ATTR_2(fan2_alarm, S_IRUGO, show_tach, NULL, ALARM, 1); +static SENSOR_DEVICE_ATTR_2(fan3_input, S_IRUGO, show_tach, NULL, INPUT, 2); +static SENSOR_DEVICE_ATTR_2(fan3_min, S_IRUGO | S_IWUSR, show_tach, set_tach, + MIN, 2); +static SENSOR_DEVICE_ATTR_2(fan3_alarm, S_IRUGO, show_tach, NULL, ALARM, 2); +static SENSOR_DEVICE_ATTR_2(fan4_input, S_IRUGO, show_tach, NULL, INPUT, 3); +static SENSOR_DEVICE_ATTR_2(fan4_min, S_IRUGO | S_IWUSR, show_tach, set_tach, + MIN, 3); +static SENSOR_DEVICE_ATTR_2(fan4_alarm, S_IRUGO, show_tach, NULL, ALARM, 3); +static SENSOR_DEVICE_ATTR_2(pwm1, S_IRUGO | S_IWUSR, show_pwm, set_pwm, INPUT, + 0); +static SENSOR_DEVICE_ATTR_2(pwm1_freq, S_IRUGO | S_IWUSR, show_pwmfreq, + set_pwmfreq, INPUT, 0); +static SENSOR_DEVICE_ATTR_2(pwm1_enable, S_IRUGO | S_IWUSR, show_pwmctrl, + set_pwmctrl, INPUT, 0); +static SENSOR_DEVICE_ATTR_2(pwm1_auto_channel_temp, S_IRUGO | S_IWUSR, + show_pwmchan, set_pwmchan, INPUT, 0); +static SENSOR_DEVICE_ATTR_2(pwm1_auto_point1_pwm, S_IRUGO | S_IWUSR, show_pwm, + set_pwm, MIN, 0); +static SENSOR_DEVICE_ATTR_2(pwm1_auto_point2_pwm, S_IRUGO | S_IWUSR, show_pwm, + set_pwm, MAX, 0); +static SENSOR_DEVICE_ATTR_2(pwm2, S_IRUGO | S_IWUSR, show_pwm, set_pwm, INPUT, + 1); +static SENSOR_DEVICE_ATTR_2(pwm2_freq, S_IRUGO | S_IWUSR, show_pwmfreq, + set_pwmfreq, INPUT, 1); +static SENSOR_DEVICE_ATTR_2(pwm2_enable, S_IRUGO | S_IWUSR, show_pwmctrl, + set_pwmctrl, INPUT, 1); +static SENSOR_DEVICE_ATTR_2(pwm2_auto_channel_temp, S_IRUGO | S_IWUSR, + show_pwmchan, set_pwmchan, INPUT, 1); +static SENSOR_DEVICE_ATTR_2(pwm2_auto_point1_pwm, S_IRUGO | S_IWUSR, show_pwm, + set_pwm, MIN, 1); +static SENSOR_DEVICE_ATTR_2(pwm2_auto_point2_pwm, S_IRUGO | S_IWUSR, show_pwm, + set_pwm, MAX, 1); +static SENSOR_DEVICE_ATTR_2(pwm3, S_IRUGO | S_IWUSR, show_pwm, set_pwm, INPUT, + 2); +static SENSOR_DEVICE_ATTR_2(pwm3_freq, S_IRUGO | S_IWUSR, show_pwmfreq, + set_pwmfreq, INPUT, 2); +static SENSOR_DEVICE_ATTR_2(pwm3_enable, S_IRUGO | S_IWUSR, show_pwmctrl, + set_pwmctrl, INPUT, 2); +static SENSOR_DEVICE_ATTR_2(pwm3_auto_channel_temp, S_IRUGO | S_IWUSR, + show_pwmchan, set_pwmchan, INPUT, 2); +static SENSOR_DEVICE_ATTR_2(pwm3_auto_point1_pwm, S_IRUGO | S_IWUSR, show_pwm, + set_pwm, MIN, 2); +static SENSOR_DEVICE_ATTR_2(pwm3_auto_point2_pwm, S_IRUGO | S_IWUSR, show_pwm, + set_pwm, MAX, 2); + +static struct attribute *adt7475_attrs[] = { + &sensor_dev_attr_in1_input.dev_attr.attr, + &sensor_dev_attr_in1_max.dev_attr.attr, + &sensor_dev_attr_in1_min.dev_attr.attr, + &sensor_dev_attr_in1_alarm.dev_attr.attr, + &sensor_dev_attr_in2_input.dev_attr.attr, + &sensor_dev_attr_in2_max.dev_attr.attr, + &sensor_dev_attr_in2_min.dev_attr.attr, + &sensor_dev_attr_in2_alarm.dev_attr.attr, + &sensor_dev_attr_temp1_input.dev_attr.attr, + &sensor_dev_attr_temp1_alarm.dev_attr.attr, + &sensor_dev_attr_temp1_fault.dev_attr.attr, + &sensor_dev_attr_temp1_max.dev_attr.attr, + &sensor_dev_attr_temp1_min.dev_attr.attr, + &sensor_dev_attr_temp1_offset.dev_attr.attr, + &sensor_dev_attr_temp1_auto_point1_temp.dev_attr.attr, + &sensor_dev_attr_temp1_auto_point2_temp.dev_attr.attr, + &sensor_dev_attr_temp1_crit.dev_attr.attr, + &sensor_dev_attr_temp1_crit_hyst.dev_attr.attr, + &sensor_dev_attr_temp2_input.dev_attr.attr, + &sensor_dev_attr_temp2_alarm.dev_attr.attr, + &sensor_dev_attr_temp2_max.dev_attr.attr, + 
&sensor_dev_attr_temp2_min.dev_attr.attr, + &sensor_dev_attr_temp2_offset.dev_attr.attr, + &sensor_dev_attr_temp2_auto_point1_temp.dev_attr.attr, + &sensor_dev_attr_temp2_auto_point2_temp.dev_attr.attr, + &sensor_dev_attr_temp2_crit.dev_attr.attr, + &sensor_dev_attr_temp2_crit_hyst.dev_attr.attr, + &sensor_dev_attr_temp3_input.dev_attr.attr, + &sensor_dev_attr_temp3_fault.dev_attr.attr, + &sensor_dev_attr_temp3_alarm.dev_attr.attr, + &sensor_dev_attr_temp3_max.dev_attr.attr, + &sensor_dev_attr_temp3_min.dev_attr.attr, + &sensor_dev_attr_temp3_offset.dev_attr.attr, + &sensor_dev_attr_temp3_auto_point1_temp.dev_attr.attr, + &sensor_dev_attr_temp3_auto_point2_temp.dev_attr.attr, + &sensor_dev_attr_temp3_crit.dev_attr.attr, + &sensor_dev_attr_temp3_crit_hyst.dev_attr.attr, + &sensor_dev_attr_fan1_input.dev_attr.attr, + &sensor_dev_attr_fan1_min.dev_attr.attr, + &sensor_dev_attr_fan1_alarm.dev_attr.attr, + &sensor_dev_attr_fan2_input.dev_attr.attr, + &sensor_dev_attr_fan2_min.dev_attr.attr, + &sensor_dev_attr_fan2_alarm.dev_attr.attr, + &sensor_dev_attr_fan3_input.dev_attr.attr, + &sensor_dev_attr_fan3_min.dev_attr.attr, + &sensor_dev_attr_fan3_alarm.dev_attr.attr, + &sensor_dev_attr_fan4_input.dev_attr.attr, + &sensor_dev_attr_fan4_min.dev_attr.attr, + &sensor_dev_attr_fan4_alarm.dev_attr.attr, + &sensor_dev_attr_pwm1.dev_attr.attr, + &sensor_dev_attr_pwm1_freq.dev_attr.attr, + &sensor_dev_attr_pwm1_enable.dev_attr.attr, + &sensor_dev_attr_pwm1_auto_channel_temp.dev_attr.attr, + &sensor_dev_attr_pwm1_auto_point1_pwm.dev_attr.attr, + &sensor_dev_attr_pwm1_auto_point2_pwm.dev_attr.attr, + &sensor_dev_attr_pwm2.dev_attr.attr, + &sensor_dev_attr_pwm2_freq.dev_attr.attr, + &sensor_dev_attr_pwm2_enable.dev_attr.attr, + &sensor_dev_attr_pwm2_auto_channel_temp.dev_attr.attr, + &sensor_dev_attr_pwm2_auto_point1_pwm.dev_attr.attr, + &sensor_dev_attr_pwm2_auto_point2_pwm.dev_attr.attr, + &sensor_dev_attr_pwm3.dev_attr.attr, + &sensor_dev_attr_pwm3_freq.dev_attr.attr, + &sensor_dev_attr_pwm3_enable.dev_attr.attr, + &sensor_dev_attr_pwm3_auto_channel_temp.dev_attr.attr, + &sensor_dev_attr_pwm3_auto_point1_pwm.dev_attr.attr, + &sensor_dev_attr_pwm3_auto_point2_pwm.dev_attr.attr, + NULL, +}; + +struct attribute_group adt7475_attr_group = { .attrs = adt7475_attrs }; + +static int adt7475_detect(struct i2c_client *client, int kind, + struct i2c_board_info *info) +{ + struct i2c_adapter *adapter = client->adapter; + + if (!i2c_check_functionality(adapter, I2C_FUNC_SMBUS_BYTE_DATA)) + return -ENODEV; + + if (kind <= 0) { + if (adt7475_read(REG_VENDID) != 0x41 || + adt7475_read(REG_DEVID) != 0x75) { + dev_err(&adapter->dev, + "Couldn't detect a adt7475 part at 0x%02x\n", + (unsigned int)client->addr); + return -ENODEV; + } + } + + strlcpy(info->type, adt7475_id[0].name, I2C_NAME_SIZE); + + return 0; +} + +static int adt7475_probe(struct i2c_client *client, + const struct i2c_device_id *id) +{ + struct adt7475_data *data; + int i, ret = 0; + + data = kzalloc(sizeof(*data), GFP_KERNEL); + if (data == NULL) + return -ENOMEM; + + mutex_init(&data->lock); + i2c_set_clientdata(client, data); + + /* Call adt7475_read_pwm for all pwm's as this will reprogram any + pwm's which are disabled to manual mode with 0% duty cycle */ + for (i = 0; i < ADT7475_PWM_COUNT; i++) + adt7475_read_pwm(client, i); + + ret = sysfs_create_group(&client->dev.kobj, &adt7475_attr_group); + if (ret) + goto efree; + + data->hwmon_dev = hwmon_device_register(&client->dev); + if (IS_ERR(data->hwmon_dev)) { + ret = PTR_ERR(data->hwmon_dev); + goto 
eremove; + } + + return 0; + +eremove: + sysfs_remove_group(&client->dev.kobj, &adt7475_attr_group); +efree: + kfree(data); + return ret; +} + +static int adt7475_remove(struct i2c_client *client) +{ + struct adt7475_data *data = i2c_get_clientdata(client); + + hwmon_device_unregister(data->hwmon_dev); + sysfs_remove_group(&client->dev.kobj, &adt7475_attr_group); + kfree(data); + + return 0; +} + +static struct i2c_driver adt7475_driver = { + .class = I2C_CLASS_HWMON, + .driver = { + .name = "adt7475", + }, + .probe = adt7475_probe, + .remove = adt7475_remove, + .id_table = adt7475_id, + .detect = adt7475_detect, + .address_data = &addr_data, +}; + +static void adt7475_read_hystersis(struct i2c_client *client) +{ + struct adt7475_data *data = i2c_get_clientdata(client); + + data->temp[HYSTERSIS][0] = (u16) adt7475_read(REG_REMOTE1_HYSTERSIS); + data->temp[HYSTERSIS][1] = data->temp[HYSTERSIS][0]; + data->temp[HYSTERSIS][2] = (u16) adt7475_read(REG_REMOTE2_HYSTERSIS); +} + +static void adt7475_read_pwm(struct i2c_client *client, int index) +{ + struct adt7475_data *data = i2c_get_clientdata(client); + unsigned int v; + + data->pwm[CONTROL][index] = adt7475_read(PWM_CONFIG_REG(index)); + + /* Figure out the internal value for pwmctrl and pwmchan + based on the current settings */ + v = (data->pwm[CONTROL][index] >> 5) & 7; + + if (v == 3) + data->pwmctl[index] = 0; + else if (v == 7) + data->pwmctl[index] = 1; + else if (v == 4) { + /* The fan is disabled - we don't want to + support that, so change to manual mode and + set the duty cycle to 0 instead + */ + data->pwm[INPUT][index] = 0; + data->pwm[CONTROL][index] &= ~0xE0; + data->pwm[CONTROL][index] |= (7 << 5); + + i2c_smbus_write_byte_data(client, PWM_CONFIG_REG(index), + data->pwm[INPUT][index]); + + i2c_smbus_write_byte_data(client, PWM_CONFIG_REG(index), + data->pwm[CONTROL][index]); + + data->pwmctl[index] = 1; + } else { + data->pwmctl[index] = 2; + + switch (v) { + case 0: + data->pwmchan[index] = 1; + break; + case 1: + data->pwmchan[index] = 2; + break; + case 2: + data->pwmchan[index] = 4; + break; + case 5: + data->pwmchan[index] = 6; + break; + case 6: + data->pwmchan[index] = 7; + break; + } + } +} + +static struct adt7475_data *adt7475_update_device(struct device *dev) +{ + struct i2c_client *client = to_i2c_client(dev); + struct adt7475_data *data = i2c_get_clientdata(client); + u8 ext; + int i; + + mutex_lock(&data->lock); + + /* Measurement values update every 2 seconds */ + if (time_after(jiffies, data->measure_updated + HZ * 2) || + !data->valid) { + data->alarms = adt7475_read(REG_STATUS2) << 8; + data->alarms |= adt7475_read(REG_STATUS1); + + ext = adt7475_read(REG_EXTEND1); + for (i = 0; i < ADT7475_VOLTAGE_COUNT; i++) + data->voltage[INPUT][i] = + (adt7475_read(VOLTAGE_REG(i)) << 2) | + ((ext >> ((i + 1) * 2)) & 3); + + ext = adt7475_read(REG_EXTEND2); + for (i = 0; i < ADT7475_TEMP_COUNT; i++) + data->temp[INPUT][i] = + (adt7475_read(TEMP_REG(i)) << 2) | + ((ext >> ((i + 1) * 2)) & 3); + + for (i = 0; i < ADT7475_TACH_COUNT; i++) + data->tach[INPUT][i] = + adt7475_read_word(client, TACH_REG(i)); + + /* Updated by hw when in auto mode */ + for (i = 0; i < ADT7475_PWM_COUNT; i++) + data->pwm[INPUT][i] = adt7475_read(PWM_REG(i)); + + data->measure_updated = jiffies; + } + + /* Limits and settings, should never change update every 60 seconds */ + if (time_after(jiffies, data->limits_updated + HZ * 2) || + !data->valid) { + data->config5 = adt7475_read(REG_CONFIG5); + + for (i = 0; i < ADT7475_VOLTAGE_COUNT; i++) { + /* 
Adjust values so they match the input precision */
+			data->voltage[MIN][i] =
+				adt7475_read(VOLTAGE_MIN_REG(i)) << 2;
+			data->voltage[MAX][i] =
+				adt7475_read(VOLTAGE_MAX_REG(i)) << 2;
+		}
+
+		for (i = 0; i < ADT7475_TEMP_COUNT; i++) {
+			/* Adjust values so they match the input precision */
+			data->temp[MIN][i] =
+				adt7475_read(TEMP_MIN_REG(i)) << 2;
+			data->temp[MAX][i] =
+				adt7475_read(TEMP_MAX_REG(i)) << 2;
+			data->temp[AUTOMIN][i] =
+				adt7475_read(TEMP_TMIN_REG(i)) << 2;
+			data->temp[THERM][i] =
+				adt7475_read(TEMP_THERM_REG(i)) << 2;
+			data->temp[OFFSET][i] =
+				adt7475_read(TEMP_OFFSET_REG(i));
+		}
+		adt7475_read_hystersis(client);
+
+		for (i = 0; i < ADT7475_TACH_COUNT; i++)
+			data->tach[MIN][i] =
+				adt7475_read_word(client, TACH_MIN_REG(i));
+
+		for (i = 0; i < ADT7475_PWM_COUNT; i++) {
+			data->pwm[MAX][i] = adt7475_read(PWM_MAX_REG(i));
+			data->pwm[MIN][i] = adt7475_read(PWM_MIN_REG(i));
+			/* Set the channel and control information */
+			adt7475_read_pwm(client, i);
+		}
+
+		data->range[0] = adt7475_read(TEMP_TRANGE_REG(0));
+		data->range[1] = adt7475_read(TEMP_TRANGE_REG(1));
+		data->range[2] = adt7475_read(TEMP_TRANGE_REG(2));
+
+		data->limits_updated = jiffies;
+		data->valid = 1;
+	}
+
+	mutex_unlock(&data->lock);
+
+	return data;
+}
+
+static int __init sensors_adt7475_init(void)
+{
+	return i2c_add_driver(&adt7475_driver);
+}
+
+static void __exit sensors_adt7475_exit(void)
+{
+	i2c_del_driver(&adt7475_driver);
+}
+
+MODULE_AUTHOR("Advanced Micro Devices, Inc");
+MODULE_DESCRIPTION("adt7475 driver");
+MODULE_LICENSE("GPL");
+
+module_init(sensors_adt7475_init);
+module_exit(sensors_adt7475_exit);
--
cgit v1.2.3


From db0fb1848a645b0b1b033765f3a5244e7afd2e3c Mon Sep 17 00:00:00 2001
From: Peter W Morreale
Date: Thu, 15 Jan 2009 13:50:42 -0800
Subject: Update of Documentation: vm.txt and proc.txt

Update Documentation/sysctl/vm.txt and Documentation/filesystems/proc.txt.
More specifically, the section on /proc/sys/vm in
Documentation/filesystems/proc.txt was removed and a link to
Documentation/sysctl/vm.txt added.

Most of the verbiage from proc.txt was simply moved into vm.txt, with
additional text for "swappiness" and "stat_interval".

Signed-off-by: Peter W Morreale
Acked-by: Randy Dunlap
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 Documentation/filesystems/proc.txt | 288 +----------------
 Documentation/sysctl/vm.txt        | 619 ++++++++++++++++++++++++++-----------
 2 files changed, 437 insertions(+), 470 deletions(-)

(limited to 'Documentation')

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index d105eb45282a..bbebc3a43ac0 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -1371,292 +1371,8 @@ auto_msgmni default value is 1.
 2.4 /proc/sys/vm - The virtual memory subsystem
 -----------------------------------------------
 
-The files in this directory can be used to tune the operation of the virtual
-memory (VM) subsystem of the Linux kernel.
-
-vfs_cache_pressure
-------------------
-
-Controls the tendency of the kernel to reclaim the memory which is used for
-caching of directory and inode objects.
-
-At the default value of vfs_cache_pressure=100 the kernel will attempt to
-reclaim dentries and inodes at a "fair" rate with respect to pagecache and
-swapcache reclaim.  Decreasing vfs_cache_pressure causes the kernel to prefer
-to retain dentry and inode caches.  Increasing vfs_cache_pressure beyond 100
-causes the kernel to prefer to reclaim dentries and inodes.
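Each of these tunables is an ordinary procfs file, so setting one takes a
single write.  A minimal sketch, assuming a value of 50 (chosen purely for
illustration) to bias reclaim toward keeping the dentry and inode caches:

/*
 * User-space sketch: lower vfs_cache_pressure so the kernel prefers
 * to retain dentry and inode caches.  Requires root.
 */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/vm/vfs_cache_pressure", "w");

	if (!f) {
		perror("vfs_cache_pressure");
		return 1;
	}
	fprintf(f, "50\n");
	return fclose(f) != 0;
}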
- -dirty_background_bytes ----------------------- - -Contains the amount of dirty memory at which the pdflush background writeback -daemon will start writeback. - -If dirty_background_bytes is written, dirty_background_ratio becomes a function -of its value (dirty_background_bytes / the amount of dirtyable system memory). - -dirty_background_ratio ----------------------- - -Contains, as a percentage of the dirtyable system memory (free pages + mapped -pages + file cache, not including locked pages and HugePages), the number of -pages at which the pdflush background writeback daemon will start writing out -dirty data. - -If dirty_background_ratio is written, dirty_background_bytes becomes a function -of its value (dirty_background_ratio * the amount of dirtyable system memory). - -dirty_bytes ------------ - -Contains the amount of dirty memory at which a process generating disk writes -will itself start writeback. - -If dirty_bytes is written, dirty_ratio becomes a function of its value -(dirty_bytes / the amount of dirtyable system memory). - -dirty_ratio ------------ - -Contains, as a percentage of the dirtyable system memory (free pages + mapped -pages + file cache, not including locked pages and HugePages), the number of -pages at which a process which is generating disk writes will itself start -writing out dirty data. - -If dirty_ratio is written, dirty_bytes becomes a function of its value -(dirty_ratio * the amount of dirtyable system memory). - -dirty_writeback_centisecs -------------------------- - -The pdflush writeback daemons will periodically wake up and write `old' data -out to disk. This tunable expresses the interval between those wakeups, in -100'ths of a second. - -Setting this to zero disables periodic writeback altogether. - -dirty_expire_centisecs ----------------------- - -This tunable is used to define when dirty data is old enough to be eligible -for writeout by the pdflush daemons. It is expressed in 100'ths of a second. -Data which has been dirty in-memory for longer than this interval will be -written out next time a pdflush daemon wakes up. - -highmem_is_dirtyable --------------------- - -Only present if CONFIG_HIGHMEM is set. - -This defaults to 0 (false), meaning that the ratios set above are calculated -as a percentage of lowmem only. This protects against excessive scanning -in page reclaim, swapping and general VM distress. - -Setting this to 1 can be useful on 32 bit machines where you want to make -random changes within an MMAPed file that is larger than your available -lowmem without causing large quantities of random IO. Is is safe if the -behavior of all programs running on the machine is known and memory will -not be otherwise stressed. - -legacy_va_layout ----------------- - -If non-zero, this sysctl disables the new 32-bit mmap mmap layout - the kernel -will use the legacy (2.4) layout for all processes. - -lowmem_reserve_ratio ---------------------- - -For some specialised workloads on highmem machines it is dangerous for -the kernel to allow process memory to be allocated from the "lowmem" -zone. This is because that memory could then be pinned via the mlock() -system call, or by unavailability of swapspace. - -And on large highmem machines this lack of reclaimable lowmem memory -can be fatal. - -So the Linux page allocator has a mechanism which prevents allocations -which _could_ use highmem from using too much lowmem. This means that -a certain amount of lowmem is defended from the possibility of being -captured into pinned user memory. 
- -(The same argument applies to the old 16 megabyte ISA DMA region. This -mechanism will also defend that region from allocations which could use -highmem or lowmem). - -The `lowmem_reserve_ratio' tunable determines how aggressive the kernel is -in defending these lower zones. - -If you have a machine which uses highmem or ISA DMA and your -applications are using mlock(), or if you are running with no swap then -you probably should change the lowmem_reserve_ratio setting. - -The lowmem_reserve_ratio is an array. You can see them by reading this file. -- -% cat /proc/sys/vm/lowmem_reserve_ratio -256 256 32 -- -Note: # of this elements is one fewer than number of zones. Because the highest - zone's value is not necessary for following calculation. - -But, these values are not used directly. The kernel calculates # of protection -pages for each zones from them. These are shown as array of protection pages -in /proc/zoneinfo like followings. (This is an example of x86-64 box). -Each zone has an array of protection pages like this. - -- -Node 0, zone DMA - pages free 1355 - min 3 - low 3 - high 4 - : - : - numa_other 0 - protection: (0, 2004, 2004, 2004) - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - pagesets - cpu: 0 pcp: 0 - : -- -These protections are added to score to judge whether this zone should be used -for page allocation or should be reclaimed. - -In this example, if normal pages (index=2) are required to this DMA zone and -pages_high is used for watermark, the kernel judges this zone should not be -used because pages_free(1355) is smaller than watermark + protection[2] -(4 + 2004 = 2008). If this protection value is 0, this zone would be used for -normal page requirement. If requirement is DMA zone(index=0), protection[0] -(=0) is used. - -zone[i]'s protection[j] is calculated by following expression. - -(i < j): - zone[i]->protection[j] - = (total sums of present_pages from zone[i+1] to zone[j] on the node) - / lowmem_reserve_ratio[i]; -(i = j): - (should not be protected. = 0; -(i > j): - (not necessary, but looks 0) - -The default values of lowmem_reserve_ratio[i] are - 256 (if zone[i] means DMA or DMA32 zone) - 32 (others). -As above expression, they are reciprocal number of ratio. -256 means 1/256. # of protection pages becomes about "0.39%" of total present -pages of higher zones on the node. - -If you would like to protect more pages, smaller values are effective. -The minimum value is 1 (1/1 -> 100%). - -page-cluster ------------- - -page-cluster controls the number of pages which are written to swap in -a single attempt. The swap I/O size. - -It is a logarithmic value - setting it to zero means "1 page", setting -it to 1 means "2 pages", setting it to 2 means "4 pages", etc. - -The default value is three (eight pages at a time). There may be some -small benefits in tuning this to a different value if your workload is -swap-intensive. - -overcommit_memory ------------------ - -Controls overcommit of system memory, possibly allowing processes -to allocate (but not use) more memory than is actually available. - - -0 - Heuristic overcommit handling. Obvious overcommits of - address space are refused. Used for a typical system. It - ensures a seriously wild allocation fails while allowing - overcommit to reduce swap usage. root is allowed to - allocate slightly more memory in this mode. This is the - default. - -1 - Always overcommit. Appropriate for some scientific - applications. - -2 - Don't overcommit. 
The total address space commit
-	for the system is not permitted to exceed swap plus a
-	configurable percentage (default is 50) of physical RAM.
-	Depending on the percentage you use, in most situations
-	this means a process will not be killed while attempting
-	to use already-allocated memory but will receive errors
-	on memory allocation as appropriate.
-
-overcommit_ratio
-----------------
-
-Percentage of physical memory size to include in overcommit calculations
-(see above.)
-
-Memory allocation limit = swapspace + physmem * (overcommit_ratio / 100)
-
-	swapspace = total size of all swap areas
-	physmem = size of physical memory in system
-
-nr_hugepages and hugetlb_shm_group
-----------------------------------
-
-nr_hugepages configures number of hugetlb page reserved for the system.
-
-hugetlb_shm_group contains group id that is allowed to create SysV shared
-memory segment using hugetlb page.
-
-hugepages_treat_as_movable
---------------------------
-
-This parameter is only useful when kernelcore= is specified at boot time to
-create ZONE_MOVABLE for pages that may be reclaimed or migrated. Huge pages
-are not movable so are not normally allocated from ZONE_MOVABLE. A non-zero
-value written to hugepages_treat_as_movable allows huge pages to be allocated
-from ZONE_MOVABLE.
-
-Once enabled, the ZONE_MOVABLE is treated as an area of memory the huge
-pages pool can easily grow or shrink within. Assuming that applications are
-not running that mlock() a lot of memory, it is likely the huge pages pool
-can grow to the size of ZONE_MOVABLE by repeatedly entering the desired value
-into nr_hugepages and triggering page reclaim.
-
-laptop_mode
------------
-
-laptop_mode is a knob that controls "laptop mode". All the things that are
-controlled by this knob are discussed in Documentation/laptops/laptop-mode.txt.
-
-block_dump
-----------
-
-block_dump enables block I/O debugging when set to a nonzero value. More
-information on block I/O debugging is in Documentation/laptops/laptop-mode.txt.
-
-swap_token_timeout
-------------------
-
-This file contains valid hold time of swap out protection token. The Linux
-VM has token based thrashing control mechanism and uses the token to prevent
-unnecessary page faults in thrashing situation. The unit of the value is
-second. The value would be useful to tune thrashing behavior.
-
-drop_caches
------------
-
-Writing to this will cause the kernel to drop clean caches, dentries and
-inodes from memory, causing that memory to become free.
-
-To free pagecache:
-	echo 1 > /proc/sys/vm/drop_caches
-To free dentries and inodes:
-	echo 2 > /proc/sys/vm/drop_caches
-To free pagecache, dentries and inodes:
-	echo 3 > /proc/sys/vm/drop_caches
-
-As this is a non-destructive operation and dirty objects are not freeable, the
-user should run `sync' first.
+Please see: Documentation/sysctl/vm.txt for a description of these
+entries.
 
 
 2.5 /proc/sys/dev - Device specific parameters
 
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index a3415070bcac..3197fc83bc51 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -1,12 +1,13 @@
-Documentation for /proc/sys/vm/*	kernel version 2.2.10
+Documentation for /proc/sys/vm/*	kernel version 2.6.29
 	(c) 1998, 1999,  Rik van Riel
+	(c) 2008         Peter W. Morreale
 
 For general info and legal blurb, please look in README.
============================================================== This file contains the documentation for the sysctl files in -/proc/sys/vm and is valid for Linux kernel version 2.2. +/proc/sys/vm and is valid for Linux kernel version 2.6.29. The files in this directory can be used to tune the operation of the virtual memory (VM) subsystem of the Linux kernel and @@ -16,180 +17,274 @@ Default values and initialization routines for most of these files can be found in mm/swap.c. Currently, these files are in /proc/sys/vm: -- overcommit_memory -- page-cluster -- dirty_ratio + +- block_dump +- dirty_background_bytes - dirty_background_ratio +- dirty_bytes - dirty_expire_centisecs +- dirty_ratio - dirty_writeback_centisecs -- highmem_is_dirtyable (only if CONFIG_HIGHMEM set) +- drop_caches +- hugepages_treat_as_movable +- hugetlb_shm_group +- laptop_mode +- legacy_va_layout +- lowmem_reserve_ratio - max_map_count - min_free_kbytes -- laptop_mode -- block_dump -- drop-caches -- zone_reclaim_mode -- min_unmapped_ratio - min_slab_ratio -- panic_on_oom -- oom_dump_tasks -- oom_kill_allocating_task -- mmap_min_address -- numa_zonelist_order +- min_unmapped_ratio +- mmap_min_addr - nr_hugepages - nr_overcommit_hugepages -- nr_trim_pages (only if CONFIG_MMU=n) +- nr_pdflush_threads +- nr_trim_pages (only if CONFIG_MMU=n) +- numa_zonelist_order +- oom_dump_tasks +- oom_kill_allocating_task +- overcommit_memory +- overcommit_ratio +- page-cluster +- panic_on_oom +- percpu_pagelist_fraction +- stat_interval +- swappiness +- vfs_cache_pressure +- zone_reclaim_mode + ============================================================== -dirty_bytes, dirty_ratio, dirty_background_bytes, -dirty_background_ratio, dirty_expire_centisecs, -dirty_writeback_centisecs, highmem_is_dirtyable, -vfs_cache_pressure, laptop_mode, block_dump, swap_token_timeout, -drop-caches, hugepages_treat_as_movable: +block_dump -See Documentation/filesystems/proc.txt +block_dump enables block I/O debugging when set to a nonzero value. More +information on block I/O debugging is in Documentation/laptops/laptop-mode.txt. ============================================================== -overcommit_memory: +dirty_background_bytes -This value contains a flag that enables memory overcommitment. +Contains the amount of dirty memory at which the pdflush background writeback +daemon will start writeback. -When this flag is 0, the kernel attempts to estimate the amount -of free memory left when userspace requests more memory. +If dirty_background_bytes is written, dirty_background_ratio becomes a function +of its value (dirty_background_bytes / the amount of dirtyable system memory). -When this flag is 1, the kernel pretends there is always enough -memory until it actually runs out. +============================================================== -When this flag is 2, the kernel uses a "never overcommit" -policy that attempts to prevent any overcommit of memory. +dirty_background_ratio -This feature can be very useful because there are a lot of -programs that malloc() huge amounts of memory "just-in-case" -and don't use much of it. +Contains, as a percentage of total system memory, the number of pages at which +the pdflush background writeback daemon will start writing out dirty data. -The default value is 0. +============================================================== -See Documentation/vm/overcommit-accounting and -security/commoncap.c::cap_vm_enough_memory() for more information. 
+dirty_bytes + +Contains the amount of dirty memory at which a process generating disk writes +will itself start writeback. + +If dirty_bytes is written, dirty_ratio becomes a function of its value +(dirty_bytes / the amount of dirtyable system memory). ============================================================== -overcommit_ratio: +dirty_expire_centisecs -When overcommit_memory is set to 2, the committed address -space is not permitted to exceed swap plus this percentage -of physical RAM. See above. +This tunable is used to define when dirty data is old enough to be eligible +for writeout by the pdflush daemons. It is expressed in 100'ths of a second. +Data which has been dirty in-memory for longer than this interval will be +written out next time a pdflush daemon wakes up. + +============================================================== + +dirty_ratio + +Contains, as a percentage of total system memory, the number of pages at which +a process which is generating disk writes will itself start writing out dirty +data. ============================================================== -page-cluster: +dirty_writeback_centisecs -The Linux VM subsystem avoids excessive disk seeks by reading -multiple pages on a page fault. The number of pages it reads -is dependent on the amount of memory in your machine. +The pdflush writeback daemons will periodically wake up and write `old' data +out to disk. This tunable expresses the interval between those wakeups, in +100'ths of a second. -The number of pages the kernel reads in at once is equal to -2 ^ page-cluster. Values above 2 ^ 5 don't make much sense -for swap because we only cluster swap data in 32-page groups. +Setting this to zero disables periodic writeback altogether. ============================================================== -max_map_count: +drop_caches -This file contains the maximum number of memory map areas a process -may have. Memory map areas are used as a side-effect of calling -malloc, directly by mmap and mprotect, and also when loading shared -libraries. +Writing to this will cause the kernel to drop clean caches, dentries and +inodes from memory, causing that memory to become free. -While most applications need less than a thousand maps, certain -programs, particularly malloc debuggers, may consume lots of them, -e.g., up to one or two maps per allocation. +To free pagecache: + echo 1 > /proc/sys/vm/drop_caches +To free dentries and inodes: + echo 2 > /proc/sys/vm/drop_caches +To free pagecache, dentries and inodes: + echo 3 > /proc/sys/vm/drop_caches -The default value is 65536. +As this is a non-destructive operation and dirty objects are not freeable, the +user should run `sync' first. ============================================================== -min_free_kbytes: +hugepages_treat_as_movable -This is used to force the Linux VM to keep a minimum number -of kilobytes free. The VM uses this number to compute a pages_min -value for each lowmem zone in the system. Each lowmem zone gets -a number of reserved free pages based proportionally on its size. +This parameter is only useful when kernelcore= is specified at boot time to +create ZONE_MOVABLE for pages that may be reclaimed or migrated. Huge pages +are not movable so are not normally allocated from ZONE_MOVABLE. A non-zero +value written to hugepages_treat_as_movable allows huge pages to be allocated +from ZONE_MOVABLE. 
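+
+As a sketch, assuming the sysctls are exposed in the usual place under
+/proc/sys/vm, the flag could be enabled with:
+
+	echo 1 > /proc/sys/vm/hugepages_treat_as_movable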
-Some minimal amount of memory is needed to satisfy PF_MEMALLOC
-allocations; if you set this to lower than 1024KB, your system will
-become subtly broken, and prone to deadlock under high loads.
-
-Setting this too high will OOM your machine instantly.
+Once enabled, the ZONE_MOVABLE is treated as an area of memory the huge
+pages pool can easily grow or shrink within. Assuming that applications are
+not running that mlock() a lot of memory, it is likely the huge pages pool
+can grow to the size of ZONE_MOVABLE by repeatedly entering the desired value
+into nr_hugepages and triggering page reclaim.
 
 ==============================================================
 
-percpu_pagelist_fraction
+hugetlb_shm_group
 
-This is the fraction of pages at most (high mark pcp->high) in each zone that
-are allocated for each per cpu page list.  The min value for this is 8.  It
-means that we don't allow more than 1/8th of pages in each zone to be
-allocated in any single per_cpu_pagelist.  This entry only changes the value
-of hot per cpu pagelists.  User can specify a number like 100 to allocate
-1/100th of each zone to each per cpu page list.
+hugetlb_shm_group contains the group id that is allowed to create SysV
+shared memory segments using hugetlb pages.
 
-The batch value of each per cpu pagelist is also updated as a result.  It is
-set to pcp->high/4.  The upper limit of batch is (PAGE_SHIFT * 8)
+==============================================================
 
-The initial value is zero.  Kernel does not use this value at boot time to set
-the high water marks for each per cpu page list.
+laptop_mode
 
-===============================================================
+laptop_mode is a knob that controls "laptop mode". All the things that are
+controlled by this knob are discussed in Documentation/laptops/laptop-mode.txt.
 
-zone_reclaim_mode:
+==============================================================
 
-Zone_reclaim_mode allows someone to set more or less aggressive approaches to
-reclaim memory when a zone runs out of memory. If it is set to zero then no
-zone reclaim occurs. Allocations will be satisfied from other zones / nodes
-in the system.
+legacy_va_layout
 
-This is value ORed together of
+If non-zero, this sysctl disables the new 32-bit mmap layout - the kernel
+will use the legacy (2.4) layout for all processes.
 
-1	= Zone reclaim on
-2	= Zone reclaim writes dirty pages out
-4	= Zone reclaim swaps pages
+==============================================================
 
-zone_reclaim_mode is set during bootup to 1 if it is determined that pages
-from remote zones will cause a measurable performance reduction. The
-page allocator will then reclaim easily reusable pages (those page
-cache pages that are currently not used) before allocating off node pages.
+lowmem_reserve_ratio
+
+For some specialised workloads on highmem machines it is dangerous for
+the kernel to allow process memory to be allocated from the "lowmem"
+zone.  This is because that memory could then be pinned via the mlock()
+system call, or by unavailability of swapspace.
+
+And on large highmem machines this lack of reclaimable lowmem memory
+can be fatal.
+
+So the Linux page allocator has a mechanism which prevents allocations
+which _could_ use highmem from using too much lowmem.  This means that
+a certain amount of lowmem is defended from the possibility of being
+captured into pinned user memory.
+
+(The same argument applies to the old 16 megabyte ISA DMA region.
This +mechanism will also defend that region from allocations which could use +highmem or lowmem). + +The `lowmem_reserve_ratio' tunable determines how aggressive the kernel is +in defending these lower zones. + +If you have a machine which uses highmem or ISA DMA and your +applications are using mlock(), or if you are running with no swap then +you probably should change the lowmem_reserve_ratio setting. + +The lowmem_reserve_ratio is an array. You can see them by reading this file. +- +% cat /proc/sys/vm/lowmem_reserve_ratio +256 256 32 +- +Note: # of this elements is one fewer than number of zones. Because the highest + zone's value is not necessary for following calculation. + +But, these values are not used directly. The kernel calculates # of protection +pages for each zones from them. These are shown as array of protection pages +in /proc/zoneinfo like followings. (This is an example of x86-64 box). +Each zone has an array of protection pages like this. + +- +Node 0, zone DMA + pages free 1355 + min 3 + low 3 + high 4 + : + : + numa_other 0 + protection: (0, 2004, 2004, 2004) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + pagesets + cpu: 0 pcp: 0 + : +- +These protections are added to score to judge whether this zone should be used +for page allocation or should be reclaimed. + +In this example, if normal pages (index=2) are required to this DMA zone and +pages_high is used for watermark, the kernel judges this zone should not be +used because pages_free(1355) is smaller than watermark + protection[2] +(4 + 2004 = 2008). If this protection value is 0, this zone would be used for +normal page requirement. If requirement is DMA zone(index=0), protection[0] +(=0) is used. + +zone[i]'s protection[j] is calculated by following expression. + +(i < j): + zone[i]->protection[j] + = (total sums of present_pages from zone[i+1] to zone[j] on the node) + / lowmem_reserve_ratio[i]; +(i = j): + (should not be protected. = 0; +(i > j): + (not necessary, but looks 0) + +The default values of lowmem_reserve_ratio[i] are + 256 (if zone[i] means DMA or DMA32 zone) + 32 (others). +As above expression, they are reciprocal number of ratio. +256 means 1/256. # of protection pages becomes about "0.39%" of total present +pages of higher zones on the node. + +If you would like to protect more pages, smaller values are effective. +The minimum value is 1 (1/1 -> 100%). -It may be beneficial to switch off zone reclaim if the system is -used for a file server and all of memory should be used for caching files -from disk. In that case the caching effect is more important than -data locality. +============================================================== -Allowing zone reclaim to write out pages stops processes that are -writing large amounts of data from dirtying pages on other nodes. Zone -reclaim will write out dirty pages if a zone fills up and so effectively -throttle the process. This may decrease the performance of a single process -since it cannot use all of system memory to buffer the outgoing writes -anymore but it preserve the memory on other nodes so that the performance -of other processes running on other nodes will not be affected. +max_map_count: -Allowing regular swap effectively restricts allocations to the local -node unless explicitly overridden by memory policies or cpuset -configurations. +This file contains the maximum number of memory map areas a process +may have. Memory map areas are used as a side-effect of calling +malloc, directly by mmap and mprotect, and also when loading shared +libraries. 
-============================================================= +While most applications need less than a thousand maps, certain +programs, particularly malloc debuggers, may consume lots of them, +e.g., up to one or two maps per allocation. -min_unmapped_ratio: +The default value is 65536. -This is available only on NUMA kernels. +============================================================== -A percentage of the total pages in each zone. Zone reclaim will only -occur if more than this percentage of pages are file backed and unmapped. -This is to insure that a minimal amount of local pages is still available for -file I/O even if the node is overallocated. +min_free_kbytes: -The default is 1 percent. +This is used to force the Linux VM to keep a minimum number +of kilobytes free. The VM uses this number to compute a pages_min +value for each lowmem zone in the system. Each lowmem zone gets +a number of reserved free pages based proportionally on its size. + +Some minimal amount of memory is needed to satisfy PF_MEMALLOC +allocations; if you set this to lower than 1024KB, your system will +become subtly broken, and prone to deadlock under high loads. + +Setting this too high will OOM your machine instantly. ============================================================= @@ -211,82 +306,73 @@ and may not be fast. ============================================================= -panic_on_oom +min_unmapped_ratio: -This enables or disables panic on out-of-memory feature. +This is available only on NUMA kernels. -If this is set to 0, the kernel will kill some rogue process, -called oom_killer. Usually, oom_killer can kill rogue processes and -system will survive. +A percentage of the total pages in each zone. Zone reclaim will only +occur if more than this percentage of pages are file backed and unmapped. +This is to insure that a minimal amount of local pages is still available for +file I/O even if the node is overallocated. -If this is set to 1, the kernel panics when out-of-memory happens. -However, if a process limits using nodes by mempolicy/cpusets, -and those nodes become memory exhaustion status, one process -may be killed by oom-killer. No panic occurs in this case. -Because other nodes' memory may be free. This means system total status -may be not fatal yet. +The default is 1 percent. -If this is set to 2, the kernel panics compulsorily even on the -above-mentioned. +============================================================== -The default value is 0. -1 and 2 are for failover of clustering. Please select either -according to your policy of failover. +mmap_min_addr -============================================================= +This file indicates the amount of address space which a user process will +be restricted from mmaping. Since kernel null dereference bugs could +accidentally operate based on the information in the first couple of pages +of memory userspace processes should not be allowed to write to them. By +default this value is set to 0 and no protections will be enforced by the +security module. Setting this value to something like 64k will allow the +vast majority of applications to work correctly and provide defense in depth +against future potential kernel bugs. -oom_dump_tasks +============================================================== -Enables a system-wide task dump (excluding kernel threads) to be -produced when the kernel performs an OOM-killing and includes such -information as pid, uid, tgid, vm size, rss, cpu, oom_adj score, and -name. 
This is helpful to determine why the OOM killer was invoked
-and to identify the rogue task that caused it.
+nr_hugepages
-If this is set to zero, this information is suppressed.  On very
-large systems with thousands of tasks it may not be feasible to dump
-the memory state information for each one.  Such systems should not
-be forced to incur a performance penalty in OOM conditions when the
-information may not be desired.
+Change the minimum size of the hugepage pool.
-If this is set to non-zero, this information is shown whenever the
-OOM killer actually kills a memory-hogging task.
+See Documentation/vm/hugetlbpage.txt
-The default value is 0.
+==============================================================
-=============================================================
+nr_overcommit_hugepages
-oom_kill_allocating_task
+Change the maximum size of the hugepage pool. The maximum is
+nr_hugepages + nr_overcommit_hugepages.
-This enables or disables killing the OOM-triggering task in
-out-of-memory situations.
+See Documentation/vm/hugetlbpage.txt
-If this is set to zero, the OOM killer will scan through the entire
-tasklist and select a task based on heuristics to kill.  This normally
-selects a rogue memory-hogging task that frees up a large amount of
-memory when killed.
+==============================================================
-If this is set to non-zero, the OOM killer simply kills the task that
-triggered the out-of-memory condition.  This avoids the expensive
-tasklist scan.
+nr_pdflush_threads
-If panic_on_oom is selected, it takes precedence over whatever value
-is used in oom_kill_allocating_task.
+The current number of pdflush threads.  This value is read-only.
+The value changes according to the number of dirty pages in the system.
-The default value is 0.
+When necessary, additional pdflush threads are created, one per second, up to
+nr_pdflush_threads_max.
 
 ==============================================================
 
-mmap_min_addr
+nr_trim_pages
 
-This file indicates the amount of address space which a user process will
-be restricted from mmaping.  Since kernel null dereference bugs could
-accidentally operate based on the information in the first couple of pages
-of memory userspace processes should not be allowed to write to them.  By
-default this value is set to 0 and no protections will be enforced by the
-security module.  Setting this value to something like 64k will allow the
-vast majority of applications to work correctly and provide defense in depth
-against future potential kernel bugs.
+This is available only on NOMMU kernels.
+
+This value adjusts the excess page trimming behaviour of power-of-2 aligned
+NOMMU mmap allocations.
+
+A value of 0 disables trimming of allocations entirely, while a value of 1
+trims excess pages aggressively. Any value >= 1 acts as the watermark where
+trimming of allocations is initiated.
+
+The default value is 1.
+
+See Documentation/nommu-mmap.txt for more information.
 
 ==============================================================
 
@@ -335,34 +421,199 @@ this is causing problems for your system/application.
 
 ==============================================================
 
-nr_hugepages
+oom_dump_tasks
 
-Change the minimum size of the hugepage pool.
+Enables a system-wide task dump (excluding kernel threads) to be
+produced when the kernel performs an OOM-killing and includes such
+information as pid, uid, tgid, vm size, rss, cpu, oom_adj score, and
+name.
This is helpful to determine why the OOM killer was invoked +and to identify the rogue task that caused it. -See Documentation/vm/hugetlbpage.txt +If this is set to zero, this information is suppressed. On very +large systems with thousands of tasks it may not be feasible to dump +the memory state information for each one. Such systems should not +be forced to incur a performance penalty in OOM conditions when the +information may not be desired. + +If this is set to non-zero, this information is shown whenever the +OOM killer actually kills a memory-hogging task. + +The default value is 0. ============================================================== -nr_overcommit_hugepages +oom_kill_allocating_task -Change the maximum size of the hugepage pool. The maximum is -nr_hugepages + nr_overcommit_hugepages. +This enables or disables killing the OOM-triggering task in +out-of-memory situations. -See Documentation/vm/hugetlbpage.txt +If this is set to zero, the OOM killer will scan through the entire +tasklist and select a task based on heuristics to kill. This normally +selects a rogue memory-hogging task that frees up a large amount of +memory when killed. + +If this is set to non-zero, the OOM killer simply kills the task that +triggered the out-of-memory condition. This avoids the expensive +tasklist scan. + +If panic_on_oom is selected, it takes precedence over whatever value +is used in oom_kill_allocating_task. + +The default value is 0. ============================================================== -nr_trim_pages +overcommit_memory: -This is available only on NOMMU kernels. +This value contains a flag that enables memory overcommitment. -This value adjusts the excess page trimming behaviour of power-of-2 aligned -NOMMU mmap allocations. +When this flag is 0, the kernel attempts to estimate the amount +of free memory left when userspace requests more memory. -A value of 0 disables trimming of allocations entirely, while a value of 1 -trims excess pages aggressively. Any value >= 1 acts as the watermark where -trimming of allocations is initiated. +When this flag is 1, the kernel pretends there is always enough +memory until it actually runs out. -The default value is 1. +When this flag is 2, the kernel uses a "never overcommit" +policy that attempts to prevent any overcommit of memory. -See Documentation/nommu-mmap.txt for more information. +This feature can be very useful because there are a lot of +programs that malloc() huge amounts of memory "just-in-case" +and don't use much of it. + +The default value is 0. + +See Documentation/vm/overcommit-accounting and +security/commoncap.c::cap_vm_enough_memory() for more information. + +============================================================== + +overcommit_ratio: + +When overcommit_memory is set to 2, the committed address +space is not permitted to exceed swap plus this percentage +of physical RAM. See above. + +============================================================== + +page-cluster + +page-cluster controls the number of pages which are written to swap in +a single attempt. The swap I/O size. + +It is a logarithmic value - setting it to zero means "1 page", setting +it to 1 means "2 pages", setting it to 2 means "4 pages", etc. + +The default value is three (eight pages at a time). There may be some +small benefits in tuning this to a different value if your workload is +swap-intensive. + +============================================================= + +panic_on_oom + +This enables or disables panic on out-of-memory feature. 
+
+If this is set to 0, the kernel will kill some rogue process via the
+oom_killer. Usually, the oom_killer can kill a rogue process and the
+system will survive.
+
+If this is set to 1, the kernel panics when out-of-memory happens.
+However, if a process limits its allocations to certain nodes using
+mempolicy/cpusets, and those nodes run out of memory, one process may
+be killed by the oom-killer. No panic occurs in this case, because the
+memory of other nodes may still be free and the system as a whole may
+not yet be in a fatal state.
+
+If this is set to 2, the kernel panics unconditionally even in the
+above-mentioned case.
+
+The default value is 0.
+1 and 2 are intended for failover in clustered setups. Please select
+either according to your failover policy.
+
+=============================================================
+
+percpu_pagelist_fraction
+
+This is the fraction of pages at most (high mark pcp->high) in each zone that
+are allocated for each per cpu page list.  The min value for this is 8.  It
+means that we don't allow more than 1/8th of pages in each zone to be
+allocated in any single per_cpu_pagelist.  This entry only changes the value
+of hot per cpu pagelists.  User can specify a number like 100 to allocate
+1/100th of each zone to each per cpu page list.
+
+The batch value of each per cpu pagelist is also updated as a result.  It is
+set to pcp->high/4.  The upper limit of batch is (PAGE_SHIFT * 8).
+
+The initial value is zero.  Kernel does not use this value at boot time to set
+the high water marks for each per cpu page list.
+
+==============================================================
+
+stat_interval
+
+The time interval at which vm statistics are updated.  The default
+is 1 second.
+
+==============================================================
+
+swappiness
+
+This control is used to define how aggressively the kernel will swap
+memory pages.  Higher values increase aggressiveness, lower values
+decrease the amount of swap.
+
+The default value is 60.
+
+==============================================================
+
+vfs_cache_pressure
+
+Controls the tendency of the kernel to reclaim the memory which is used for
+caching of directory and inode objects.
+
+At the default value of vfs_cache_pressure=100 the kernel will attempt to
+reclaim dentries and inodes at a "fair" rate with respect to pagecache and
+swapcache reclaim.  Decreasing vfs_cache_pressure causes the kernel to prefer
+to retain dentry and inode caches.  Increasing vfs_cache_pressure beyond 100
+causes the kernel to prefer to reclaim dentries and inodes.
+
+==============================================================
+
+zone_reclaim_mode:
+
+Zone_reclaim_mode allows someone to set more or less aggressive approaches to
+reclaim memory when a zone runs out of memory. If it is set to zero then no
+zone reclaim occurs. Allocations will be satisfied from other zones / nodes
+in the system.
+
+This value is ORed together from:
+
+1	= Zone reclaim on
+2	= Zone reclaim writes dirty pages out
+4	= Zone reclaim swaps pages
+
+zone_reclaim_mode is set during bootup to 1 if it is determined that pages
+from remote zones will cause a measurable performance reduction. The
+page allocator will then reclaim easily reusable pages (those page
+cache pages that are currently not used) before allocating off node pages.
+
+It may be beneficial to switch off zone reclaim if the system is
+used for a file server and all of memory should be used for caching files
+from disk. In that case the caching effect is more important than
+data locality.
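+
+As a sketch, assuming the usual /proc/sys/vm location of the sysctls,
+zone reclaim could be switched off on such a file server with:
+
+	echo 0 > /proc/sys/vm/zone_reclaim_mode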
+
+Allowing zone reclaim to write out pages stops processes that are
+writing large amounts of data from dirtying pages on other nodes. Zone
+reclaim will write out dirty pages if a zone fills up and so effectively
+throttle the process. This may decrease the performance of a single process
+since it cannot use all of system memory to buffer the outgoing writes
+anymore, but it preserves the memory on other nodes so that the performance
+of other processes running on other nodes will not be affected.
+
+Allowing regular swap effectively restricts allocations to the local
+node unless explicitly overridden by memory policies or cpuset
+configurations.
+
+============ End of Document =================================
--
cgit v1.2.3


From 89365e264104b52da6a61c4e227bb5a934764fa7 Mon Sep 17 00:00:00 2001
From: Andy Whitcroft
Date: Thu, 15 Jan 2009 13:50:50 -0800
Subject: sysrq documentation: remove the redundant updated date

git is maintaining the last update time much more accurately than the
internal update time.  Remove it.

Signed-off-by: Andy Whitcroft
Cc: Randy Dunlap
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 Documentation/sysrq.txt | 1 -
 1 file changed, 1 deletion(-)

(limited to 'Documentation')

diff --git a/Documentation/sysrq.txt b/Documentation/sysrq.txt
index 10a0263ebb3f..265f637b97d3 100644
--- a/Documentation/sysrq.txt
+++ b/Documentation/sysrq.txt
@@ -1,6 +1,5 @@
 Linux Magic System Request Key Hacks
 Documentation for sysrq.c
-Last update: 2007-AUG-04
 
 * What is the magic SysRq key?
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
--
cgit v1.2.3


From 47c33d9c1984ae4c5bd1f144024eacc14c5bc0c0 Mon Sep 17 00:00:00 2001
From: Andy Whitcroft
Date: Thu, 15 Jan 2009 13:50:51 -0800
Subject: sysrq documentation: document why the command header only is shown

Document the interactions between loglevel and the sysrq output.  Also
document how to work around it should output be required on the console.

Signed-off-by: Andy Whitcroft
Cc: Martin Mares
Cc: Randy Dunlap
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 Documentation/sysrq.txt | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

(limited to 'Documentation')

diff --git a/Documentation/sysrq.txt b/Documentation/sysrq.txt
index 265f637b97d3..9e592c718afb 100644
--- a/Documentation/sysrq.txt
+++ b/Documentation/sysrq.txt
@@ -210,6 +210,24 @@ within a function called by handle_sysrq, you must be aware that you are in
 a lock (you are also in an interrupt handler, which means don't sleep!), so
 you must call __handle_sysrq_nolock instead.
 
+* Why does only the command header appear on the console when I hit SysRq?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Sysrq output is subject to the same console loglevel control as all
+other console output.  This means that if the kernel was booted 'quiet',
+as is common on distro kernels, the output may not appear on the actual
+console, even though it will appear in the dmesg buffer, and be accessible
+via the dmesg command and to the consumers of /proc/kmsg.  As a specific
+exception, the header line from the sysrq command is passed to all console
+consumers as if the current loglevel were maximum.  If only the header
+is emitted, it is almost certain that the kernel loglevel is too low.
+Should you require the output on the console channel, you will need
+to temporarily raise the console loglevel using alt-sysrq-8 or:
+
+	echo 8 > /proc/sysrq-trigger
+
+Remember to return the loglevel to normal after triggering the sysrq
+command you are interested in.
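+
+As a sketch, assuming your kernel uses the usual built-in default
+console loglevel of 7, it can be restored with:
+
+	echo 7 > /proc/sys/kernel/printk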
+ * I have more questions, who can I ask? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ And I'll answer any questions about the registration system you got, also -- cgit v1.2.3 From 45ce80fb6b6f9594d1396d44dd7e7c02d596fef8 Mon Sep 17 00:00:00 2001 From: Li Zefan Date: Thu, 15 Jan 2009 13:50:59 -0800 Subject: cgroups: consolidate cgroup documents Move Documentation/cpusets.txt and Documentation/controllers/* to Documentation/cgroups/ Signed-off-by: Li Zefan Acked-by: KAMEZAWA Hiroyuki Acked-by: Balbir Singh Acked-by: Paul Menage Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/cgroups/cgroups.txt | 5 +- Documentation/cgroups/cpuacct.txt | 32 + Documentation/cgroups/cpusets.txt | 808 +++++++++++++++++++++++++ Documentation/cgroups/devices.txt | 52 ++ Documentation/cgroups/memcg_test.txt | 342 +++++++++++ Documentation/cgroups/memory.txt | 399 ++++++++++++ Documentation/cgroups/resource_counter.txt | 181 ++++++ Documentation/controllers/cpuacct.txt | 32 - Documentation/controllers/devices.txt | 52 -- Documentation/controllers/memcg_test.txt | 342 ----------- Documentation/controllers/memory.txt | 399 ------------ Documentation/controllers/resource_counter.txt | 181 ------ Documentation/cpusets.txt | 808 ------------------------- Documentation/scheduler/sched-design-CFS.txt | 2 +- include/linux/res_counter.h | 2 +- init/Kconfig | 9 +- kernel/cpuset.c | 2 +- 17 files changed, 1824 insertions(+), 1824 deletions(-) create mode 100644 Documentation/cgroups/cpuacct.txt create mode 100644 Documentation/cgroups/cpusets.txt create mode 100644 Documentation/cgroups/devices.txt create mode 100644 Documentation/cgroups/memcg_test.txt create mode 100644 Documentation/cgroups/memory.txt create mode 100644 Documentation/cgroups/resource_counter.txt delete mode 100644 Documentation/controllers/cpuacct.txt delete mode 100644 Documentation/controllers/devices.txt delete mode 100644 Documentation/controllers/memcg_test.txt delete mode 100644 Documentation/controllers/memory.txt delete mode 100644 Documentation/controllers/resource_counter.txt delete mode 100644 Documentation/cpusets.txt (limited to 'Documentation') diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt index e33ee74eee77..d9e5d6f41b92 100644 --- a/Documentation/cgroups/cgroups.txt +++ b/Documentation/cgroups/cgroups.txt @@ -1,7 +1,8 @@ CGROUPS ------- -Written by Paul Menage based on Documentation/cpusets.txt +Written by Paul Menage based on +Documentation/cgroups/cpusets.txt Original copyright statements from cpusets.txt: Portions Copyright (C) 2004 BULL SA. @@ -68,7 +69,7 @@ On their own, the only use for cgroups is for simple job tracking. The intention is that other subsystems hook into the generic cgroup support to provide new attributes for cgroups, such as accounting/limiting the resources which processes in a cgroup can -access. For example, cpusets (see Documentation/cpusets.txt) allows +access. For example, cpusets (see Documentation/cgroups/cpusets.txt) allows you to associate a set of CPUs and a set of memory nodes with the tasks in each cgroup. diff --git a/Documentation/cgroups/cpuacct.txt b/Documentation/cgroups/cpuacct.txt new file mode 100644 index 000000000000..bb775fbe43d7 --- /dev/null +++ b/Documentation/cgroups/cpuacct.txt @@ -0,0 +1,32 @@ +CPU Accounting Controller +------------------------- + +The CPU accounting controller is used to group tasks using cgroups and +account the CPU usage of these groups of tasks. 
+
+The CPU accounting controller supports multi-hierarchy groups. An accounting
+group accumulates the CPU usage of all of its child groups and the tasks
+directly present in its group.
+
+Accounting groups can be created by first mounting the cgroup filesystem.
+
+# mkdir /cgroups
+# mount -t cgroup -ocpuacct none /cgroups
+
+With the above step, the initial or the parent accounting group
+becomes visible at /cgroups. At bootup, this group includes all the
+tasks in the system. /cgroups/tasks lists the tasks in this cgroup.
+/cgroups/cpuacct.usage gives the CPU time (in nanoseconds) obtained by
+this group, which is essentially the CPU time obtained by all the tasks
+in the system.
+
+New accounting groups can be created under the parent group /cgroups.
+
+# cd /cgroups
+# mkdir g1
+# echo $$ > g1/tasks
+
+The above steps create a new group g1 and move the current shell
+process (bash) into it. CPU time consumed by this bash and its children
+can be obtained from g1/cpuacct.usage and the same is accumulated in
+/cgroups/cpuacct.usage also.
diff --git a/Documentation/cgroups/cpusets.txt b/Documentation/cgroups/cpusets.txt
new file mode 100644
index 000000000000..5c86c258c791
--- /dev/null
+++ b/Documentation/cgroups/cpusets.txt
@@ -0,0 +1,808 @@
+				CPUSETS
+				-------
+
+Copyright (C) 2004 BULL SA.
+Written by Simon.Derr@bull.net
+
+Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
+Modified by Paul Jackson
+Modified by Christoph Lameter
+Modified by Paul Menage
+Modified by Hidetoshi Seto
+
+CONTENTS:
+=========
+
+1. Cpusets
+  1.1 What are cpusets ?
+  1.2 Why are cpusets needed ?
+  1.3 How are cpusets implemented ?
+  1.4 What are exclusive cpusets ?
+  1.5 What is memory_pressure ?
+  1.6 What is memory spread ?
+  1.7 What is sched_load_balance ?
+  1.8 What is sched_relax_domain_level ?
+  1.9 How do I use cpusets ?
+2. Usage Examples and Syntax
+  2.1 Basic Usage
+  2.2 Adding/removing cpus
+  2.3 Setting flags
+  2.4 Attaching processes
+3. Questions
+4. Contact
+
+1. Cpusets
+==========
+
+1.1 What are cpusets ?
+----------------------
+
+Cpusets provide a mechanism for assigning a set of CPUs and Memory
+Nodes to a set of tasks.  In this document "Memory Node" refers to
+an on-line node that contains memory.
+
+Cpusets constrain the CPU and Memory placement of tasks to only
+the resources within a tasks current cpuset.  They form a nested
+hierarchy visible in a virtual file system.  These are the essential
+hooks, beyond what is already present, required to manage dynamic
+job placement on large systems.
+
+Cpusets use the generic cgroup subsystem described in
+Documentation/cgroups/cgroups.txt.
+
+Requests by a task, using the sched_setaffinity(2) system call to
+include CPUs in its CPU affinity mask, and using the mbind(2) and
+set_mempolicy(2) system calls to include Memory Nodes in its memory
+policy, are both filtered through that tasks cpuset, filtering out any
+CPUs or Memory Nodes not in that cpuset.  The scheduler will not
+schedule a task on a CPU that is not allowed in its cpus_allowed
+vector, and the kernel page allocator will not allocate a page on a
+node that is not allowed in the requesting tasks mems_allowed vector.
+
+User level code may create and destroy cpusets by name in the cgroup
+virtual file system, manage the attributes and permissions of these
+cpusets and which CPUs and Memory Nodes are assigned to each cpuset,
+specify and query to which cpuset a task is assigned, and list the
+task pids assigned to a cpuset.
+
+
+1.2 Why are cpusets needed ?
+---------------------------- + +The management of large computer systems, with many processors (CPUs), +complex memory cache hierarchies and multiple Memory Nodes having +non-uniform access times (NUMA) presents additional challenges for +the efficient scheduling and memory placement of processes. + +Frequently more modest sized systems can be operated with adequate +efficiency just by letting the operating system automatically share +the available CPU and Memory resources amongst the requesting tasks. + +But larger systems, which benefit more from careful processor and +memory placement to reduce memory access times and contention, +and which typically represent a larger investment for the customer, +can benefit from explicitly placing jobs on properly sized subsets of +the system. + +This can be especially valuable on: + + * Web Servers running multiple instances of the same web application, + * Servers running different applications (for instance, a web server + and a database), or + * NUMA systems running large HPC applications with demanding + performance characteristics. + +These subsets, or "soft partitions" must be able to be dynamically +adjusted, as the job mix changes, without impacting other concurrently +executing jobs. The location of the running jobs pages may also be moved +when the memory locations are changed. + +The kernel cpuset patch provides the minimum essential kernel +mechanisms required to efficiently implement such subsets. It +leverages existing CPU and Memory Placement facilities in the Linux +kernel to avoid any additional impact on the critical scheduler or +memory allocator code. + + +1.3 How are cpusets implemented ? +--------------------------------- + +Cpusets provide a Linux kernel mechanism to constrain which CPUs and +Memory Nodes are used by a process or set of processes. + +The Linux kernel already has a pair of mechanisms to specify on which +CPUs a task may be scheduled (sched_setaffinity) and on which Memory +Nodes it may obtain memory (mbind, set_mempolicy). + +Cpusets extends these two mechanisms as follows: + + - Cpusets are sets of allowed CPUs and Memory Nodes, known to the + kernel. + - Each task in the system is attached to a cpuset, via a pointer + in the task structure to a reference counted cgroup structure. + - Calls to sched_setaffinity are filtered to just those CPUs + allowed in that tasks cpuset. + - Calls to mbind and set_mempolicy are filtered to just + those Memory Nodes allowed in that tasks cpuset. + - The root cpuset contains all the systems CPUs and Memory + Nodes. + - For any cpuset, one can define child cpusets containing a subset + of the parents CPU and Memory Node resources. + - The hierarchy of cpusets can be mounted at /dev/cpuset, for + browsing and manipulation from user space. + - A cpuset may be marked exclusive, which ensures that no other + cpuset (except direct ancestors and descendents) may contain + any overlapping CPUs or Memory Nodes. + - You can list all the tasks (by pid) attached to any cpuset. + +The implementation of cpusets requires a few, simple hooks +into the rest of the kernel, none in performance critical paths: + + - in init/main.c, to initialize the root cpuset at system boot. + - in fork and exit, to attach and detach a task from its cpuset. + - in sched_setaffinity, to mask the requested CPUs by what's + allowed in that tasks cpuset. + - in sched.c migrate_all_tasks(), to keep migrating tasks within + the CPUs allowed by their cpuset, if possible. 
+ - in the mbind and set_mempolicy system calls, to mask the requested + Memory Nodes by what's allowed in that tasks cpuset. + - in page_alloc.c, to restrict memory to allowed nodes. + - in vmscan.c, to restrict page recovery to the current cpuset. + +You should mount the "cgroup" filesystem type in order to enable +browsing and modifying the cpusets presently known to the kernel. No +new system calls are added for cpusets - all support for querying and +modifying cpusets is via this cpuset file system. + +The /proc//status file for each task has four added lines, +displaying the tasks cpus_allowed (on which CPUs it may be scheduled) +and mems_allowed (on which Memory Nodes it may obtain memory), +in the two formats seen in the following example: + + Cpus_allowed: ffffffff,ffffffff,ffffffff,ffffffff + Cpus_allowed_list: 0-127 + Mems_allowed: ffffffff,ffffffff + Mems_allowed_list: 0-63 + +Each cpuset is represented by a directory in the cgroup file system +containing (on top of the standard cgroup files) the following +files describing that cpuset: + + - cpus: list of CPUs in that cpuset + - mems: list of Memory Nodes in that cpuset + - memory_migrate flag: if set, move pages to cpusets nodes + - cpu_exclusive flag: is cpu placement exclusive? + - mem_exclusive flag: is memory placement exclusive? + - mem_hardwall flag: is memory allocation hardwalled + - memory_pressure: measure of how much paging pressure in cpuset + +In addition, the root cpuset only has the following file: + - memory_pressure_enabled flag: compute memory_pressure? + +New cpusets are created using the mkdir system call or shell +command. The properties of a cpuset, such as its flags, allowed +CPUs and Memory Nodes, and attached tasks, are modified by writing +to the appropriate file in that cpusets directory, as listed above. + +The named hierarchical structure of nested cpusets allows partitioning +a large system into nested, dynamically changeable, "soft-partitions". + +The attachment of each task, automatically inherited at fork by any +children of that task, to a cpuset allows organizing the work load +on a system into related sets of tasks such that each set is constrained +to using the CPUs and Memory Nodes of a particular cpuset. A task +may be re-attached to any other cpuset, if allowed by the permissions +on the necessary cpuset file system directories. + +Such management of a system "in the large" integrates smoothly with +the detailed placement done on individual tasks and memory regions +using the sched_setaffinity, mbind and set_mempolicy system calls. + +The following rules apply to each cpuset: + + - Its CPUs and Memory Nodes must be a subset of its parents. + - It can't be marked exclusive unless its parent is. + - If its cpu or memory is exclusive, they may not overlap any sibling. + +These rules, and the natural hierarchy of cpusets, enable efficient +enforcement of the exclusive guarantee, without having to scan all +cpusets every time any of them change to ensure nothing overlaps a +exclusive cpuset. Also, the use of a Linux virtual file system (vfs) +to represent the cpuset hierarchy provides for a familiar permission +and name space for cpusets, with a minimum of additional kernel code. + +The cpus and mems files in the root (top_cpuset) cpuset are +read-only. 
The cpus file automatically tracks the value of +cpu_online_map using a CPU hotplug notifier, and the mems file +automatically tracks the value of node_states[N_HIGH_MEMORY]--i.e., +nodes with memory--using the cpuset_track_online_nodes() hook. + + +1.4 What are exclusive cpusets ? +-------------------------------- + +If a cpuset is cpu or mem exclusive, no other cpuset, other than +a direct ancestor or descendent, may share any of the same CPUs or +Memory Nodes. + +A cpuset that is mem_exclusive *or* mem_hardwall is "hardwalled", +i.e. it restricts kernel allocations for page, buffer and other data +commonly shared by the kernel across multiple users. All cpusets, +whether hardwalled or not, restrict allocations of memory for user +space. This enables configuring a system so that several independent +jobs can share common kernel data, such as file system pages, while +isolating each job's user allocation in its own cpuset. To do this, +construct a large mem_exclusive cpuset to hold all the jobs, and +construct child, non-mem_exclusive cpusets for each individual job. +Only a small amount of typical kernel memory, such as requests from +interrupt handlers, is allowed to be taken outside even a +mem_exclusive cpuset. + + +1.5 What is memory_pressure ? +----------------------------- +The memory_pressure of a cpuset provides a simple per-cpuset metric +of the rate that the tasks in a cpuset are attempting to free up in +use memory on the nodes of the cpuset to satisfy additional memory +requests. + +This enables batch managers monitoring jobs running in dedicated +cpusets to efficiently detect what level of memory pressure that job +is causing. + +This is useful both on tightly managed systems running a wide mix of +submitted jobs, which may choose to terminate or re-prioritize jobs that +are trying to use more memory than allowed on the nodes assigned them, +and with tightly coupled, long running, massively parallel scientific +computing jobs that will dramatically fail to meet required performance +goals if they start to use more memory than allowed to them. + +This mechanism provides a very economical way for the batch manager +to monitor a cpuset for signs of memory pressure. It's up to the +batch manager or other user code to decide what to do about it and +take action. + +==> Unless this feature is enabled by writing "1" to the special file + /dev/cpuset/memory_pressure_enabled, the hook in the rebalance + code of __alloc_pages() for this metric reduces to simply noticing + that the cpuset_memory_pressure_enabled flag is zero. So only + systems that enable this feature will compute the metric. + +Why a per-cpuset, running average: + + Because this meter is per-cpuset, rather than per-task or mm, + the system load imposed by a batch scheduler monitoring this + metric is sharply reduced on large systems, because a scan of + the tasklist can be avoided on each set of queries. + + Because this meter is a running average, instead of an accumulating + counter, a batch scheduler can detect memory pressure with a + single read, instead of having to read and accumulate results + for a period of time. + + Because this meter is per-cpuset rather than per-task or mm, + the batch scheduler can obtain the key information, memory + pressure in a cpuset, with a single read, rather than having to + query and accumulate results over all the (dynamically changing) + set of tasks in the cpuset. 
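+
+As a sketch, assuming the cpuset hierarchy is mounted at /dev/cpuset
+and a job runs in a hypothetical cpuset named "myjob", a batch manager
+could enable the feature and poll the metric with:
+
+	echo 1 > /dev/cpuset/memory_pressure_enabled
+	cat /dev/cpuset/myjob/memory_pressure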
+ +A per-cpuset simple digital filter (requires a spinlock and 3 words +of data per-cpuset) is kept, and updated by any task attached to that +cpuset, if it enters the synchronous (direct) page reclaim code. + +A per-cpuset file provides an integer number representing the recent +(half-life of 10 seconds) rate of direct page reclaims caused by +the tasks in the cpuset, in units of reclaims attempted per second, +times 1000. + + +1.6 What is memory spread ? +--------------------------- +There are two boolean flag files per cpuset that control where the +kernel allocates pages for the file system buffers and related in +kernel data structures. They are called 'memory_spread_page' and +'memory_spread_slab'. + +If the per-cpuset boolean flag file 'memory_spread_page' is set, then +the kernel will spread the file system buffers (page cache) evenly +over all the nodes that the faulting task is allowed to use, instead +of preferring to put those pages on the node where the task is running. + +If the per-cpuset boolean flag file 'memory_spread_slab' is set, +then the kernel will spread some file system related slab caches, +such as for inodes and dentries evenly over all the nodes that the +faulting task is allowed to use, instead of preferring to put those +pages on the node where the task is running. + +The setting of these flags does not affect anonymous data segment or +stack segment pages of a task. + +By default, both kinds of memory spreading are off, and memory +pages are allocated on the node local to where the task is running, +except perhaps as modified by the tasks NUMA mempolicy or cpuset +configuration, so long as sufficient free memory pages are available. + +When new cpusets are created, they inherit the memory spread settings +of their parent. + +Setting memory spreading causes allocations for the affected page +or slab caches to ignore the tasks NUMA mempolicy and be spread +instead. Tasks using mbind() or set_mempolicy() calls to set NUMA +mempolicies will not notice any change in these calls as a result of +their containing tasks memory spread settings. If memory spreading +is turned off, then the currently specified NUMA mempolicy once again +applies to memory page allocations. + +Both 'memory_spread_page' and 'memory_spread_slab' are boolean flag +files. By default they contain "0", meaning that the feature is off +for that cpuset. If a "1" is written to that file, then that turns +the named feature on. + +The implementation is simple. + +Setting the flag 'memory_spread_page' turns on a per-process flag +PF_SPREAD_PAGE for each task that is in that cpuset or subsequently +joins that cpuset. The page allocation calls for the page cache +is modified to perform an inline check for this PF_SPREAD_PAGE task +flag, and if set, a call to a new routine cpuset_mem_spread_node() +returns the node to prefer for the allocation. + +Similarly, setting 'memory_spread_slab' turns on the flag +PF_SPREAD_SLAB, and appropriately marked slab caches will allocate +pages from the node returned by cpuset_mem_spread_node(). + +The cpuset_mem_spread_node() routine is also simple. It uses the +value of a per-task rotor cpuset_mem_spread_rotor to select the next +node in the current tasks mems_allowed to prefer for the allocation. + +This memory placement policy is also known (in other contexts) as +round-robin or interleave. 
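+
+For example, assuming a cpuset hierarchy mounted at /dev/cpuset and a
+hypothetical cpuset named "myjob", both kinds of spreading could be
+turned on with:
+
+	echo 1 > /dev/cpuset/myjob/memory_spread_page
+	echo 1 > /dev/cpuset/myjob/memory_spread_slab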
+
+This policy can provide substantial improvements for jobs that need
+to place thread local data on the corresponding node, but that need
+to access large file system data sets that must be spread across
+several nodes in the job's cpuset in order to fit. Without this
+policy, especially for jobs that might have one thread reading in the
+data set, the memory allocation across the nodes in the job's cpuset
+can become very uneven.
+
+1.7 What is sched_load_balance ?
+--------------------------------
+
+The kernel scheduler (kernel/sched.c) automatically load balances
+tasks. If one CPU is underutilized, kernel code running on that
+CPU will look for tasks on other more overloaded CPUs and move those
+tasks to itself, within the constraints of such placement mechanisms
+as cpusets and sched_setaffinity.
+
+The algorithmic cost of load balancing and its impact on key shared
+kernel data structures such as the task list increases more than
+linearly with the number of CPUs being balanced. So the scheduler
+has support to partition the system's CPUs into a number of sched
+domains such that it only load balances within each sched domain.
+Each sched domain covers some subset of the CPUs in the system;
+no two sched domains overlap; some CPUs might not be in any sched
+domain and hence won't be load balanced.
+
+Put simply, it costs less to balance between two smaller sched domains
+than one big one, but doing so means that overloads in one of the
+two domains won't be load balanced to the other one.
+
+By default, there is one sched domain covering all CPUs, except those
+marked isolated using the kernel boot time "isolcpus=" argument.
+
+This default load balancing across all CPUs is not well suited for
+the following two situations:
+ 1) On large systems, load balancing across many CPUs is expensive.
+    If the system is managed using cpusets to place independent jobs
+    on separate sets of CPUs, full load balancing is unnecessary.
+ 2) Systems supporting realtime on some CPUs need to minimize
+    system overhead on those CPUs, including avoiding task load
+    balancing if that is not needed.
+
+When the per-cpuset flag "sched_load_balance" is enabled (the default
+setting), it requests that all the CPUs in that cpuset's allowed
+'cpus' be contained in a single sched domain, ensuring that load
+balancing can move a task (not otherwise pinned, as by
+sched_setaffinity) from any CPU in that cpuset to any other.
+
+When the per-cpuset flag "sched_load_balance" is disabled, the
+scheduler will avoid load balancing across the CPUs in that cpuset,
+--except-- in so far as is necessary because some overlapping cpuset
+has "sched_load_balance" enabled.
+
+So, for example, if the top cpuset has the flag "sched_load_balance"
+enabled, then the scheduler will have one sched domain covering all
+CPUs, and the setting of the "sched_load_balance" flag in any other
+cpusets won't matter, as we're already fully load balancing.
+
+Therefore in the above two situations, the top cpuset flag
+"sched_load_balance" should be disabled, and only some of the smaller,
+child cpusets should have this flag enabled.
+
+When doing this, you don't usually want to leave any unpinned tasks in
+the top cpuset that might use non-trivial amounts of CPU, as such tasks
+may be artificially constrained to some subset of CPUs, depending on
+the particulars of this flag setting in descendent cpusets. Even if
+such a task could use spare CPU cycles in some other CPUs, the kernel
+scheduler might not consider the possibility of load balancing that
+task to that underused CPU.
+
+Of course, tasks pinned to a particular CPU can be left in a cpuset
+that disables "sched_load_balance", as those tasks aren't going
+anywhere else anyway.
+
+There is an impedance mismatch here, between cpusets and sched domains.
+Cpusets are hierarchical and nest. Sched domains are flat; they don't
+overlap and each CPU is in at most one sched domain.
+
+It is necessary for sched domains to be flat because load balancing
+across partially overlapping sets of CPUs would risk unstable dynamics
+that would be beyond our understanding. So if each of two partially
+overlapping cpusets enables the flag 'sched_load_balance', then we
+form a single sched domain that is a superset of both. We won't move
+a task to a CPU outside its cpuset, but the scheduler load balancing
+code might waste some compute cycles considering that possibility.
+
+This mismatch is why there is not a simple one-to-one relation
+between which cpusets have the flag "sched_load_balance" enabled,
+and the sched domain configuration. If a cpuset enables the flag, it
+will get balancing across all its CPUs, but if it disables the flag,
+it will only be assured of no load balancing if no other overlapping
+cpuset enables the flag.
+
+If two cpusets have partially overlapping 'cpus' allowed, and only
+one of them has this flag enabled, then the other may find its
+tasks only partially load balanced, just on the overlapping CPUs.
+This is just the general case of the top_cpuset example given a few
+paragraphs above. In the general case, as in the top cpuset case,
+don't leave tasks that might use non-trivial amounts of CPU in
+such partially load balanced cpusets, as they may be artificially
+constrained to some subset of the CPUs allowed to them, for lack of
+load balancing to the other CPUs.
+
+1.7.1 sched_load_balance implementation details.
+------------------------------------------------
+
+The per-cpuset flag 'sched_load_balance' defaults to enabled (contrary
+to most cpuset flags.) When enabled for a cpuset, the kernel will
+ensure that it can load balance across all the CPUs in that cpuset
+(it makes sure that all the CPUs in the cpus_allowed of that cpuset
+are in the same sched domain.)
+
+If two overlapping cpusets both have 'sched_load_balance' enabled,
+then they will both be (must be) in the same sched domain.
+
+If, as is the default, the top cpuset has 'sched_load_balance' enabled,
+then by the above that means there is a single sched domain covering
+the whole system, regardless of any other cpuset settings.
+
+The kernel commits to user space that it will avoid load balancing
+where it can. It will pick as fine-grained a partition of sched
+domains as it can while still providing load balancing for any set
+of CPUs allowed to a cpuset having 'sched_load_balance' enabled.
+
+The internal kernel cpuset-to-scheduler interface passes from the
+cpuset code to the scheduler code a partition of the load balanced
+CPUs in the system. This partition is a set of subsets (represented
+as an array of cpumask_t) of CPUs, pairwise disjoint, that cover all
+the CPUs that must be load balanced.
+
+Whenever the 'sched_load_balance' flag changes, or CPUs come or go
+from a cpuset with this flag enabled, or a cpuset with this flag
+enabled is removed, the cpuset code builds a new such partition and
+passes it to the scheduler sched domain setup code, to have the sched
+domains rebuilt as necessary.
+
+This partition exactly defines what sched domains the scheduler should
+set up - one sched domain for each element (cpumask_t) in the
+partition.
+
+The scheduler remembers the currently active sched domain partitions.
+When the scheduler routine partition_sched_domains() is invoked from
+the cpuset code to update these sched domains, it compares the new
+partition requested with the current one, and updates its sched
+domains, removing the old and adding the new, for each change.
+
+
+1.8 What is sched_relax_domain_level ?
+--------------------------------------
+
+Within a sched domain, the scheduler migrates tasks in two ways: by
+periodic load balancing on the tick, and at the time of certain
+scheduling events.
+
+When a task is woken up, the scheduler tries to move it to an idle
+CPU. For example, if a task A running on CPU X activates another task
+B on the same CPU X, and if CPU Y is X's sibling and is idle, then the
+scheduler migrates task B to CPU Y so that task B can start on CPU Y
+without waiting for task A on CPU X.
+
+And if a CPU runs out of tasks in its runqueue, it tries to pull
+extra tasks from other busy CPUs to help them before it goes idle.
+
+Of course it takes some search cost to find movable tasks and/or idle
+CPUs, so the scheduler might not search all CPUs in the domain every
+time. In fact, on some architectures, the search range on these
+events is limited to the same socket or node where the CPU is
+located, while the load balance on the tick searches all CPUs.
+
+For example, assume CPU Z is relatively far from CPU X. Even if CPU Z
+is idle while CPU X and its siblings are busy, the scheduler can't
+migrate the woken task B from X to Z since Z is out of its search
+range. As a result, task B on CPU X needs to wait for task A, or to
+wait for the load balance on the next tick. For some applications in
+special situations, waiting one tick may be too long.
+
+The 'sched_relax_domain_level' file allows you to request a change of
+this search range. It takes an int value indicating the size of the
+search range in levels, as follows, or the initial value of -1, which
+indicates that the cpuset has no request:
+
+  -1 : no request. use system default or follow request of others.
+   0 : no search.
+   1 : search siblings (hyperthreads in a core).
+   2 : search cores in a package.
+   3 : search cpus in a node [= system wide on non-NUMA system]
+ ( 4 : search nodes in a chunk of node [on NUMA system] )
+ ( 5 : search system wide [on NUMA system] )
+
+The system default is architecture dependent and can be changed using
+the relax_domain_level= boot parameter.
+
+This file is per-cpuset and affects the sched domain to which the
+cpuset belongs. Therefore if the flag 'sched_load_balance' of a
+cpuset is disabled, then 'sched_relax_domain_level' has no effect,
+since there is no sched domain belonging to the cpuset.
+
+If multiple cpusets overlap and hence form a single sched domain, the
+largest value among them is used. Be careful: if one cpuset requests
+0 and the others request -1, then 0 is used.
+
+Note that modifying this file will have both good and bad effects,
+and whether they are acceptable or not depends on your situation.
+Don't modify this file if you are not sure.
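+
+As a usage sketch, a cpuset hosting a latency-sensitive job on a NUMA
+system might request wakeup searching up to the node level (the path
+rt_set and the value are only illustrative; see the table above):
+
+  /bin/echo 3 > /dev/cpuset/rt_set/sched_relax_domain_level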
+
+Increasing 'sched_relax_domain_level' is likely to benefit you if:
+ - the migration costs between CPUs can be assumed to be considerably
+   small (for you), due to your special application's behavior or
+   special hardware support for the CPU cache, etc.;
+ - the search cost doesn't have an impact (for you), or you can make
+   the search cost small enough, e.g. by keeping your cpusets compact;
+ - low latency is required, even if it sacrifices cache hit rate etc.
+
+
+1.9 How do I use cpusets ?
+--------------------------
+
+In order to minimize the impact of cpusets on critical kernel
+code, such as the scheduler, and due to the fact that the kernel
+does not support one task updating the memory placement of another
+task directly, the impact on a task of changing its cpuset CPU
+or Memory Node placement, or of changing to which cpuset a task
+is attached, is subtle.
+
+If a cpuset has its Memory Nodes modified, then for each task attached
+to that cpuset, the next time that the kernel attempts to allocate
+a page of memory for that task, the kernel will notice the change
+in the task's cpuset, and update its per-task memory placement to
+remain within the new cpuset's memory placement. If the task was using
+mempolicy MPOL_BIND, and the nodes to which it was bound overlap with
+its new cpuset, then the task will continue to use whatever subset
+of MPOL_BIND nodes are still allowed in the new cpuset. If the task
+was using MPOL_BIND and now none of its MPOL_BIND nodes are allowed
+in the new cpuset, then the task will be essentially treated as if it
+was MPOL_BIND bound to the new cpuset (even though its NUMA placement,
+as queried by get_mempolicy(), doesn't change). If a task is moved
+from one cpuset to another, then the kernel will adjust the task's
+memory placement, as above, the next time that the kernel attempts
+to allocate a page of memory for that task.
+
+If a cpuset has its 'cpus' modified, then each task in that cpuset
+will have its allowed CPU placement changed immediately. Similarly,
+if a task's pid is written to a cpuset's 'tasks' file, in either its
+current cpuset or another cpuset, then its allowed CPU placement is
+changed immediately. If such a task had been bound to some subset
+of its cpuset using the sched_setaffinity() call, the task will be
+allowed to run on any CPU allowed in its new cpuset, negating the
+effect of the prior sched_setaffinity() call.
+
+In summary, the memory placement of a task whose cpuset is changed is
+updated by the kernel, on the next allocation of a page for that task,
+but the processor placement is not updated until that task's pid is
+rewritten to the 'tasks' file of its cpuset. This is done to avoid
+impacting the scheduler code in the kernel with a check for changes
+in a task's processor placement.
+
+Normally, once a page is allocated (given a physical page
+of main memory) then that page stays on whatever node it
+was allocated, so long as it remains allocated, even if the
+cpuset's memory placement policy 'mems' subsequently changes.
+If the cpuset flag file 'memory_migrate' is set true, then when
+tasks are attached to that cpuset, any pages that task had
+allocated to it on nodes in its previous cpuset are migrated
+to the task's new cpuset. The relative placement of the page within
+the cpuset is preserved during these migration operations if possible.
+For example, if the page was on the second valid node of the prior
+cpuset, then the page will be placed on the second valid node of the
+new cpuset.
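+
+A minimal sketch of enabling this migration behavior (the cpuset name
+Charlie matches the example later in this document):
+
+  /bin/echo 1 > /dev/cpuset/Charlie/memory_migrate
+  # tasks attached to Charlie from now on will have their pages
+  # migrated into Charlie's mems, as described above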
+ +Also if 'memory_migrate' is set true, then if that cpusets +'mems' file is modified, pages allocated to tasks in that +cpuset, that were on nodes in the previous setting of 'mems', +will be moved to nodes in the new setting of 'mems.' +Pages that were not in the tasks prior cpuset, or in the cpusets +prior 'mems' setting, will not be moved. + +There is an exception to the above. If hotplug functionality is used +to remove all the CPUs that are currently assigned to a cpuset, +then all the tasks in that cpuset will be moved to the nearest ancestor +with non-empty cpus. But the moving of some (or all) tasks might fail if +cpuset is bound with another cgroup subsystem which has some restrictions +on task attaching. In this failing case, those tasks will stay +in the original cpuset, and the kernel will automatically update +their cpus_allowed to allow all online CPUs. When memory hotplug +functionality for removing Memory Nodes is available, a similar exception +is expected to apply there as well. In general, the kernel prefers to +violate cpuset placement, over starving a task that has had all +its allowed CPUs or Memory Nodes taken offline. + +There is a second exception to the above. GFP_ATOMIC requests are +kernel internal allocations that must be satisfied, immediately. +The kernel may drop some request, in rare cases even panic, if a +GFP_ATOMIC alloc fails. If the request cannot be satisfied within +the current tasks cpuset, then we relax the cpuset, and look for +memory anywhere we can find it. It's better to violate the cpuset +than stress the kernel. + +To start a new job that is to be contained within a cpuset, the steps are: + + 1) mkdir /dev/cpuset + 2) mount -t cgroup -ocpuset cpuset /dev/cpuset + 3) Create the new cpuset by doing mkdir's and write's (or echo's) in + the /dev/cpuset virtual file system. + 4) Start a task that will be the "founding father" of the new job. + 5) Attach that task to the new cpuset by writing its pid to the + /dev/cpuset tasks file for that cpuset. + 6) fork, exec or clone the job tasks from this founding father task. + +For example, the following sequence of commands will setup a cpuset +named "Charlie", containing just CPUs 2 and 3, and Memory Node 1, +and then start a subshell 'sh' in that cpuset: + + mount -t cgroup -ocpuset cpuset /dev/cpuset + cd /dev/cpuset + mkdir Charlie + cd Charlie + /bin/echo 2-3 > cpus + /bin/echo 1 > mems + /bin/echo $$ > tasks + sh + # The subshell 'sh' is now running in cpuset Charlie + # The next line should display '/Charlie' + cat /proc/self/cpuset + +In the future, a C library interface to cpusets will likely be +available. For now, the only way to query or modify cpusets is +via the cpuset file system, using the various cd, mkdir, echo, cat, +rmdir commands from the shell, or their equivalent from C. + +The sched_setaffinity calls can also be done at the shell prompt using +SGI's runon or Robert Love's taskset. The mbind and set_mempolicy +calls can be done at the shell prompt using the numactl command +(part of Andi Kleen's numa package). + +2. Usage Examples and Syntax +============================ + +2.1 Basic Usage +--------------- + +Creating, modifying, using the cpusets can be done through the cpuset +virtual filesystem. + +To mount it, type: +# mount -t cgroup -o cpuset cpuset /dev/cpuset + +Then under /dev/cpuset you can find a tree that corresponds to the +tree of the cpusets in the system. For instance, /dev/cpuset +is the cpuset that holds the whole system. 
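+
+As a quick check (PID 1 is only an example), any task's current cpuset
+can be read back through /proc:
+
+  cat /proc/1/cpuset
+  # prints '/' for a task in the top cpuset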
+ +If you want to create a new cpuset under /dev/cpuset: +# cd /dev/cpuset +# mkdir my_cpuset + +Now you want to do something with this cpuset. +# cd my_cpuset + +In this directory you can find several files: +# ls +cpu_exclusive memory_migrate mems tasks +cpus memory_pressure notify_on_release +mem_exclusive memory_spread_page sched_load_balance +mem_hardwall memory_spread_slab sched_relax_domain_level + +Reading them will give you information about the state of this cpuset: +the CPUs and Memory Nodes it can use, the processes that are using +it, its properties. By writing to these files you can manipulate +the cpuset. + +Set some flags: +# /bin/echo 1 > cpu_exclusive + +Add some cpus: +# /bin/echo 0-7 > cpus + +Add some mems: +# /bin/echo 0-7 > mems + +Now attach your shell to this cpuset: +# /bin/echo $$ > tasks + +You can also create cpusets inside your cpuset by using mkdir in this +directory. +# mkdir my_sub_cs + +To remove a cpuset, just use rmdir: +# rmdir my_sub_cs +This will fail if the cpuset is in use (has cpusets inside, or has +processes attached). + +Note that for legacy reasons, the "cpuset" filesystem exists as a +wrapper around the cgroup filesystem. + +The command + +mount -t cpuset X /dev/cpuset + +is equivalent to + +mount -t cgroup -ocpuset X /dev/cpuset +echo "/sbin/cpuset_release_agent" > /dev/cpuset/release_agent + +2.2 Adding/removing cpus +------------------------ + +This is the syntax to use when writing in the cpus or mems files +in cpuset directories: + +# /bin/echo 1-4 > cpus -> set cpus list to cpus 1,2,3,4 +# /bin/echo 1,2,3,4 > cpus -> set cpus list to cpus 1,2,3,4 + +2.3 Setting flags +----------------- + +The syntax is very simple: + +# /bin/echo 1 > cpu_exclusive -> set flag 'cpu_exclusive' +# /bin/echo 0 > cpu_exclusive -> unset flag 'cpu_exclusive' + +2.4 Attaching processes +----------------------- + +# /bin/echo PID > tasks + +Note that it is PID, not PIDs. You can only attach ONE task at a time. +If you have several tasks to attach, you have to do it one after another: + +# /bin/echo PID1 > tasks +# /bin/echo PID2 > tasks + ... +# /bin/echo PIDn > tasks + + +3. Questions +============ + +Q: what's up with this '/bin/echo' ? +A: bash's builtin 'echo' command does not check calls to write() against + errors. If you use it in the cpuset file system, you won't be + able to tell whether a command succeeded or failed. + +Q: When I attach processes, only the first of the line gets really attached ! +A: We can only return one error code per call to write(). So you should also + put only ONE pid. + +4. Contact +========== + +Web: http://www.bullopensource.org/cpuset diff --git a/Documentation/cgroups/devices.txt b/Documentation/cgroups/devices.txt new file mode 100644 index 000000000000..7cc6e6a60672 --- /dev/null +++ b/Documentation/cgroups/devices.txt @@ -0,0 +1,52 @@ +Device Whitelist Controller + +1. Description: + +Implement a cgroup to track and enforce open and mknod restrictions +on device files. A device cgroup associates a device access +whitelist with each cgroup. A whitelist entry has 4 fields. +'type' is a (all), c (char), or b (block). 'all' means it applies +to all types and all major and minor numbers. Major and minor are +either an integer or * for all. Access is a composition of r +(read), w (write), and m (mknod). + +The root device cgroup starts with rwm to 'all'. A child device +cgroup gets a copy of the parent. Administrators can then remove +devices from the whitelist or add new entries. 
A child cgroup can +never receive a device access which is denied by its parent. However +when a device access is removed from a parent it will not also be +removed from the child(ren). + +2. User Interface + +An entry is added using devices.allow, and removed using +devices.deny. For instance + + echo 'c 1:3 mr' > /cgroups/1/devices.allow + +allows cgroup 1 to read and mknod the device usually known as +/dev/null. Doing + + echo a > /cgroups/1/devices.deny + +will remove the default 'a *:* rwm' entry. Doing + + echo a > /cgroups/1/devices.allow + +will add the 'a *:* rwm' entry to the whitelist. + +3. Security + +Any task can move itself between cgroups. This clearly won't +suffice, but we can decide the best way to adequately restrict +movement as people get some experience with this. We may just want +to require CAP_SYS_ADMIN, which at least is a separate bit from +CAP_MKNOD. We may want to just refuse moving to a cgroup which +isn't a descendent of the current one. Or we may want to use +CAP_MAC_ADMIN, since we really are trying to lock down root. + +CAP_SYS_ADMIN is needed to modify the whitelist or move another +task to a new cgroup. (Again we'll probably want to change that). + +A cgroup may not be granted more permissions than the cgroup's +parent has. diff --git a/Documentation/cgroups/memcg_test.txt b/Documentation/cgroups/memcg_test.txt new file mode 100644 index 000000000000..19533f93b7a2 --- /dev/null +++ b/Documentation/cgroups/memcg_test.txt @@ -0,0 +1,342 @@ +Memory Resource Controller(Memcg) Implementation Memo. +Last Updated: 2008/12/15 +Base Kernel Version: based on 2.6.28-rc8-mm. + +Because VM is getting complex (one of reasons is memcg...), memcg's behavior +is complex. This is a document for memcg's internal behavior. +Please note that implementation details can be changed. + +(*) Topics on API should be in Documentation/cgroups/memory.txt) + +0. How to record usage ? + 2 objects are used. + + page_cgroup ....an object per page. + Allocated at boot or memory hotplug. Freed at memory hot removal. + + swap_cgroup ... an entry per swp_entry. + Allocated at swapon(). Freed at swapoff(). + + The page_cgroup has USED bit and double count against a page_cgroup never + occurs. swap_cgroup is used only when a charged page is swapped-out. + +1. Charge + + a page/swp_entry may be charged (usage += PAGE_SIZE) at + + mem_cgroup_newpage_charge() + Called at new page fault and Copy-On-Write. + + mem_cgroup_try_charge_swapin() + Called at do_swap_page() (page fault on swap entry) and swapoff. + Followed by charge-commit-cancel protocol. (With swap accounting) + At commit, a charge recorded in swap_cgroup is removed. + + mem_cgroup_cache_charge() + Called at add_to_page_cache() + + mem_cgroup_cache_charge_swapin() + Called at shmem's swapin. + + mem_cgroup_prepare_migration() + Called before migration. "extra" charge is done and followed by + charge-commit-cancel protocol. + At commit, charge against oldpage or newpage will be committed. + +2. Uncharge + a page/swp_entry may be uncharged (usage -= PAGE_SIZE) by + + mem_cgroup_uncharge_page() + Called when an anonymous page is fully unmapped. I.e., mapcount goes + to 0. If the page is SwapCache, uncharge is delayed until + mem_cgroup_uncharge_swapcache(). + + mem_cgroup_uncharge_cache_page() + Called when a page-cache is deleted from radix-tree. If the page is + SwapCache, uncharge is delayed until mem_cgroup_uncharge_swapcache(). + + mem_cgroup_uncharge_swapcache() + Called when SwapCache is removed from radix-tree. 
The charge itself + is moved to swap_cgroup. (If mem+swap controller is disabled, no + charge to swap occurs.) + + mem_cgroup_uncharge_swap() + Called when swp_entry's refcnt goes down to 0. A charge against swap + disappears. + + mem_cgroup_end_migration(old, new) + At success of migration old is uncharged (if necessary), a charge + to new page is committed. At failure, charge to old page is committed. + +3. charge-commit-cancel + In some case, we can't know this "charge" is valid or not at charging + (because of races). + To handle such case, there are charge-commit-cancel functions. + mem_cgroup_try_charge_XXX + mem_cgroup_commit_charge_XXX + mem_cgroup_cancel_charge_XXX + these are used in swap-in and migration. + + At try_charge(), there are no flags to say "this page is charged". + at this point, usage += PAGE_SIZE. + + At commit(), the function checks the page should be charged or not + and set flags or avoid charging.(usage -= PAGE_SIZE) + + At cancel(), simply usage -= PAGE_SIZE. + +Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y. + +4. Anonymous + Anonymous page is newly allocated at + - page fault into MAP_ANONYMOUS mapping. + - Copy-On-Write. + It is charged right after it's allocated before doing any page table + related operations. Of course, it's uncharged when another page is used + for the fault address. + + At freeing anonymous page (by exit() or munmap()), zap_pte() is called + and pages for ptes are freed one by one.(see mm/memory.c). Uncharges + are done at page_remove_rmap() when page_mapcount() goes down to 0. + + Another page freeing is by page-reclaim (vmscan.c) and anonymous + pages are swapped out. In this case, the page is marked as + PageSwapCache(). uncharge() routine doesn't uncharge the page marked + as SwapCache(). It's delayed until __delete_from_swap_cache(). + + 4.1 Swap-in. + At swap-in, the page is taken from swap-cache. There are 2 cases. + + (a) If the SwapCache is newly allocated and read, it has no charges. + (b) If the SwapCache has been mapped by processes, it has been + charged already. + + This swap-in is one of the most complicated work. In do_swap_page(), + following events occur when pte is unchanged. + + (1) the page (SwapCache) is looked up. + (2) lock_page() + (3) try_charge_swapin() + (4) reuse_swap_page() (may call delete_swap_cache()) + (5) commit_charge_swapin() + (6) swap_free(). + + Considering following situation for example. + + (A) The page has not been charged before (2) and reuse_swap_page() + doesn't call delete_from_swap_cache(). + (B) The page has not been charged before (2) and reuse_swap_page() + calls delete_from_swap_cache(). + (C) The page has been charged before (2) and reuse_swap_page() doesn't + call delete_from_swap_cache(). + (D) The page has been charged before (2) and reuse_swap_page() calls + delete_from_swap_cache(). + + memory.usage/memsw.usage changes to this page/swp_entry will be + Case (A) (B) (C) (D) + Event + Before (2) 0/ 1 0/ 1 1/ 1 1/ 1 + =========================================== + (3) +1/+1 +1/+1 +1/+1 +1/+1 + (4) - 0/ 0 - -1/ 0 + (5) 0/-1 0/ 0 -1/-1 0/ 0 + (6) - 0/-1 - 0/-1 + =========================================== + Result 1/ 1 1/ 1 1/ 1 1/ 1 + + In any cases, charges to this page should be 1/ 1. + + 4.2 Swap-out. + At swap-out, typical state transition is below. + + (a) add to swap cache. (marked as SwapCache) + swp_entry's refcnt += 1. + (b) fully unmapped. + swp_entry's refcnt += # of ptes. + (c) write back to swap. + (d) delete from swap cache. 
(remove from SwapCache) + swp_entry's refcnt -= 1. + + + At (b), the page is marked as SwapCache and not uncharged. + At (d), the page is removed from SwapCache and a charge in page_cgroup + is moved to swap_cgroup. + + Finally, at task exit, + (e) zap_pte() is called and swp_entry's refcnt -=1 -> 0. + Here, a charge in swap_cgroup disappears. + +5. Page Cache + Page Cache is charged at + - add_to_page_cache_locked(). + + uncharged at + - __remove_from_page_cache(). + + The logic is very clear. (About migration, see below) + Note: __remove_from_page_cache() is called by remove_from_page_cache() + and __remove_mapping(). + +6. Shmem(tmpfs) Page Cache + Memcg's charge/uncharge have special handlers of shmem. The best way + to understand shmem's page state transition is to read mm/shmem.c. + But brief explanation of the behavior of memcg around shmem will be + helpful to understand the logic. + + Shmem's page (just leaf page, not direct/indirect block) can be on + - radix-tree of shmem's inode. + - SwapCache. + - Both on radix-tree and SwapCache. This happens at swap-in + and swap-out, + + It's charged when... + - A new page is added to shmem's radix-tree. + - A swp page is read. (move a charge from swap_cgroup to page_cgroup) + It's uncharged when + - A page is removed from radix-tree and not SwapCache. + - When SwapCache is removed, a charge is moved to swap_cgroup. + - When swp_entry's refcnt goes down to 0, a charge in swap_cgroup + disappears. + +7. Page Migration + One of the most complicated functions is page-migration-handler. + Memcg has 2 routines. Assume that we are migrating a page's contents + from OLDPAGE to NEWPAGE. + + Usual migration logic is.. + (a) remove the page from LRU. + (b) allocate NEWPAGE (migration target) + (c) lock by lock_page(). + (d) unmap all mappings. + (e-1) If necessary, replace entry in radix-tree. + (e-2) move contents of a page. + (f) map all mappings again. + (g) pushback the page to LRU. + (-) OLDPAGE will be freed. + + Before (g), memcg should complete all necessary charge/uncharge to + NEWPAGE/OLDPAGE. + + The point is.... + - If OLDPAGE is anonymous, all charges will be dropped at (d) because + try_to_unmap() drops all mapcount and the page will not be + SwapCache. + + - If OLDPAGE is SwapCache, charges will be kept at (g) because + __delete_from_swap_cache() isn't called at (e-1) + + - If OLDPAGE is page-cache, charges will be kept at (g) because + remove_from_swap_cache() isn't called at (e-1) + + memcg provides following hooks. + + - mem_cgroup_prepare_migration(OLDPAGE) + Called after (b) to account a charge (usage += PAGE_SIZE) against + memcg which OLDPAGE belongs to. + + - mem_cgroup_end_migration(OLDPAGE, NEWPAGE) + Called after (f) before (g). + If OLDPAGE is used, commit OLDPAGE again. If OLDPAGE is already + charged, a charge by prepare_migration() is automatically canceled. + If NEWPAGE is used, commit NEWPAGE and uncharge OLDPAGE. + + But zap_pte() (by exit or munmap) can be called while migration, + we have to check if OLDPAGE/NEWPAGE is a valid page after commit(). + +8. LRU + Each memcg has its own private LRU. Now, it's handling is under global + VM's control (means that it's handled under global zone->lru_lock). + Almost all routines around memcg's LRU is called by global LRU's + list management functions under zone->lru_lock(). + + A special function is mem_cgroup_isolate_pages(). This scans + memcg's private LRU and call __isolate_lru_page() to extract a page + from LRU. 
+
+ (By __isolate_lru_page(), the page is removed from both the global
+ and the private LRU.)
+
+
+9. Typical Tests.
+
+ Tests for racy cases.
+
+ 9.1 Small limit to memcg.
+ When testing racy cases, it is a good idea to set the memcg's limit
+ very small, rather than to gigabytes. Many races were found in tests
+ under limits of a few KB or tens of MB.
+ (Memory behavior under GB limits and memory behavior under MB limits
+ show very different situations.)
+
+ 9.2 Shmem
+ Historically, memcg's shmem handling was poor and we saw a fair
+ amount of trouble here. This is because shmem is page cache but can
+ also be SwapCache. Testing with shmem/tmpfs is always a good test.
+
+ 9.3 Migration
+ For NUMA, migration is another special case. Cpusets are useful for
+ easy testing. The following is a sample script to set up migration:
+
+ mount -t cgroup -o cpuset none /opt/cpuset
+
+ mkdir /opt/cpuset/01
+ echo 1 > /opt/cpuset/01/cpuset.cpus
+ echo 0 > /opt/cpuset/01/cpuset.mems
+ echo 1 > /opt/cpuset/01/cpuset.memory_migrate
+ mkdir /opt/cpuset/02
+ echo 1 > /opt/cpuset/02/cpuset.cpus
+ echo 1 > /opt/cpuset/02/cpuset.mems
+ echo 1 > /opt/cpuset/02/cpuset.memory_migrate
+
+ With the above setup, when you move a task from 01 to 02, page
+ migration from node 0 to node 1 will occur. The following is a
+ script to migrate all tasks under a cpuset:
+ --
+ move_task()
+ {
+ for pid in $1
+ do
+ /bin/echo $pid >$2/tasks 2>/dev/null
+ echo -n $pid
+ echo -n " "
+ done
+ echo END
+ }
+
+ G1_TASK=`cat ${G1}/tasks`
+ G2_TASK=`cat ${G2}/tasks`
+ move_task "${G1_TASK}" ${G2} &
+ --
+ 9.4 Memory hotplug.
+ Memory hotplug is another good test.
+ To offline memory, do the following:
+ # echo offline > /sys/devices/system/memory/memoryXXX/state
+ (XXX is the number of the memory section)
+ This is an easy way to test page migration, too.
+
+ 9.5 mkdir/rmdir
+ When using hierarchy, the mkdir/rmdir test should be done.
+ Use tests like the following:
+
+ echo 1 >/opt/cgroup/01/memory/use_hierarchy
+ mkdir /opt/cgroup/01/child_a
+ mkdir /opt/cgroup/01/child_b
+
+ set a limit on 01.
+ add a limit to 01/child_b.
+ run jobs under child_a and child_b.
+
+ Create/delete the following groups at random while the jobs are
+ running:
+ /opt/cgroup/01/child_a/child_aa
+ /opt/cgroup/01/child_b/child_bb
+ /opt/cgroup/01/child_c
+
+ Running new jobs in a new group is also good.
+
+ 9.6 Mount with other subsystems.
+ Mounting with other subsystems is a good test because there are
+ races and lock dependencies with other cgroup subsystems.
+
+ example)
+ # mount -t cgroup none /cgroup -o cpuset,memory,cpu,devices
+
+ and do task moves, mkdir, rmdir, etc. under this.
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
new file mode 100644
index 000000000000..e1501964df1e
--- /dev/null
+++ b/Documentation/cgroups/memory.txt
@@ -0,0 +1,399 @@
+Memory Resource Controller
+
+NOTE: The Memory Resource Controller is generically referred to as the
+memory controller in this document. Do not confuse the memory controller
+used here with the memory controller that is used in hardware.
+
+Salient features
+
+a. Enable control of both RSS (mapped) and Page Cache (unmapped) pages
+b. The infrastructure allows easy addition of other types of memory to control
+c. Provides *zero overhead* for non memory controller users
+d. Provides a double LRU: global memory pressure causes reclaim from the
+ global LRU; a cgroup, on hitting a limit, reclaims from the per
+ cgroup LRU
+
+NOTE: Swap Cache (unmapped) is not accounted for at present.
+ +Benefits and Purpose of the memory controller + +The memory controller isolates the memory behaviour of a group of tasks +from the rest of the system. The article on LWN [12] mentions some probable +uses of the memory controller. The memory controller can be used to + +a. Isolate an application or a group of applications + Memory hungry applications can be isolated and limited to a smaller + amount of memory. +b. Create a cgroup with limited amount of memory, this can be used + as a good alternative to booting with mem=XXXX. +c. Virtualization solutions can control the amount of memory they want + to assign to a virtual machine instance. +d. A CD/DVD burner could control the amount of memory used by the + rest of the system to ensure that burning does not fail due to lack + of available memory. +e. There are several other use cases, find one or use the controller just + for fun (to learn and hack on the VM subsystem). + +1. History + +The memory controller has a long history. A request for comments for the memory +controller was posted by Balbir Singh [1]. At the time the RFC was posted +there were several implementations for memory control. The goal of the +RFC was to build consensus and agreement for the minimal features required +for memory control. The first RSS controller was posted by Balbir Singh[2] +in Feb 2007. Pavel Emelianov [3][4][5] has since posted three versions of the +RSS controller. At OLS, at the resource management BoF, everyone suggested +that we handle both page cache and RSS together. Another request was raised +to allow user space handling of OOM. The current memory controller is +at version 6; it combines both mapped (RSS) and unmapped Page +Cache Control [11]. + +2. Memory Control + +Memory is a unique resource in the sense that it is present in a limited +amount. If a task requires a lot of CPU processing, the task can spread +its processing over a period of hours, days, months or years, but with +memory, the same physical memory needs to be reused to accomplish the task. + +The memory controller implementation has been divided into phases. These +are: + +1. Memory controller +2. mlock(2) controller +3. Kernel user memory accounting and slab control +4. user mappings length controller + +The memory controller is the first controller developed. + +2.1. Design + +The core of the design is a counter called the res_counter. The res_counter +tracks the current memory usage and limit of the group of processes associated +with the controller. Each cgroup has a memory controller specific data +structure (mem_cgroup) associated with it. + +2.2. Accounting + + +--------------------+ + | mem_cgroup | + | (res_counter) | + +--------------------+ + / ^ \ + / | \ + +---------------+ | +---------------+ + | mm_struct | |.... | mm_struct | + | | | | | + +---------------+ | +---------------+ + | + + --------------+ + | + +---------------+ +------+--------+ + | page +----------> page_cgroup| + | | | | + +---------------+ +---------------+ + + (Figure 1: Hierarchy of Accounting) + + +Figure 1 shows the important aspects of the controller + +1. Accounting happens per cgroup +2. Each mm_struct knows about which cgroup it belongs to +3. Each page has a pointer to the page_cgroup, which in turn knows the + cgroup it belongs to + +The accounting is done as follows: mem_cgroup_charge() is invoked to setup +the necessary data structures and check if the cgroup that is being charged +is over its limit. If it is then reclaim is invoked on the cgroup. 
+
+More details can be found in the reclaim section of this document.
+If everything goes well, a page metadata structure called page_cgroup is
+allocated and associated with the page. This routine also adds the page to
+the per cgroup LRU.
+
+2.2.1 Accounting details
+
+All mapped anon pages (RSS) and cache pages (Page Cache) are accounted.
+(Some pages which can never be reclaimed and will not be on the global
+ LRU are not accounted; we only account pages under usual VM management.)
+
+RSS pages are accounted at page fault unless they've already been accounted
+for earlier. A file page will be accounted for as Page Cache when it's
+inserted into the inode (radix-tree). While it's mapped into the page
+tables of processes, duplicate accounting is carefully avoided.
+
+An RSS page is unaccounted when it's fully unmapped. A PageCache page is
+unaccounted when it's removed from the radix-tree.
+
+At page migration, accounting information is kept.
+
+Note: we only account pages on the LRU, because our purpose is to control
+the amount of used pages; pages not on the LRU tend to be out of control
+from the VM's point of view.
+
+2.3 Shared Page Accounting
+
+Shared pages are accounted on the basis of the first touch approach. The
+cgroup that first touches a page is accounted for the page. The principle
+behind this approach is that a cgroup that aggressively uses a shared
+page will eventually get charged for it (once it is uncharged from
+the cgroup that brought it in -- this will happen on memory pressure).
+
+Exception: if CONFIG_CGROUP_MEM_RES_CTLR_SWAP is not used:
+when you do a swapoff and swapped-out pages of shmem (tmpfs) are forcibly
+brought back into memory, charges for those pages are accounted against
+the caller of swapoff rather than the users of the shmem.
+
+
+2.4 Swap Extension (CONFIG_CGROUP_MEM_RES_CTLR_SWAP)
+The Swap Extension allows you to record charges for swap. A swapped-in
+page is charged back to the original page allocator if possible.
+
+When swap is accounted, the following files are added:
+ - memory.memsw.usage_in_bytes.
+ - memory.memsw.limit_in_bytes.
+
+The usage of mem+swap is limited by memsw.limit_in_bytes.
+
+Note: why 'mem+swap' rather than just swap?
+The global LRU (kswapd) can swap out arbitrary pages. Swapping out means
+moving a charge from memory to swap; there is no change in the usage of
+mem+swap.
+
+In other words, when we want to limit the usage of swap without affecting
+the global LRU, a mem+swap limit is better than just limiting swap, from
+the OS's point of view.
+
+2.5 Reclaim
+
+Each cgroup maintains a per cgroup LRU that consists of an active
+and inactive list. When a cgroup goes over its limit, we first try
+to reclaim memory from the cgroup so as to make space for the new
+pages that the cgroup has touched. If the reclaim is unsuccessful,
+an OOM routine is invoked to select and kill the bulkiest task in the
+cgroup.
+
+The reclaim algorithm has not been modified for cgroups, except that
+pages that are selected for reclaiming come from the per cgroup LRU
+list.
+
+2.6 Locking
+
+The memory controller uses the following locking hierarchy:
+
+1. zone->lru_lock is used for selecting pages to be isolated
+2. mem->per_zone->lru_lock protects the per cgroup LRU (per zone)
+3. lock_page_cgroup() is used to protect page->page_cgroup
+
+3. User Interface
+
+0. Configuration
+
+a. Enable CONFIG_CGROUPS
+b. Enable CONFIG_RESOURCE_COUNTERS
+c. Enable CONFIG_CGROUP_MEM_RES_CTLR
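+
+One way to confirm these options in a running kernel is shown below
+(this assumes CONFIG_IKCONFIG_PROC is set, so that /proc/config.gz
+exists; the check itself is only a convenience):
+
+# zcat /proc/config.gz | grep -e CONFIG_CGROUPS -e MEM_RES_CTLR
+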
+1. Prepare the cgroups
+# mkdir -p /cgroups
+# mount -t cgroup none /cgroups -o memory
+
+2. Make the new group and move bash into it
+# mkdir /cgroups/0
+# echo $$ > /cgroups/0/tasks
+
+Since we're now in the 0 cgroup,
+we can alter the memory limit:
+# echo 4M > /cgroups/0/memory.limit_in_bytes
+
+NOTE: We can use a suffix (k, K, m, M, g or G) to indicate values in kilo,
+mega or gigabytes.
+
+# cat /cgroups/0/memory.limit_in_bytes
+4194304
+
+NOTE: The interface has now changed to display the usage in bytes
+instead of pages.
+
+We can check the usage:
+# cat /cgroups/0/memory.usage_in_bytes
+1216512
+
+A successful write to this file does not guarantee a successful setting
+of the limit to the value written into the file. This can be due to a
+number of factors, such as rounding up to page boundaries or the total
+availability of memory on the system. The user is required to re-read
+this file after a write to guarantee the value committed by the kernel.
+
+# echo 1 > memory.limit_in_bytes
+# cat memory.limit_in_bytes
+4096
+
+The memory.failcnt field gives the number of times that the cgroup limit
+was exceeded.
+
+The memory.stat file gives accounting information. Currently, the number
+of cached pages, RSS, and active/inactive pages are shown.
+
+4. Testing
+
+Balbir posted lmbench, AIM9, LTP and vmmstress results [10] and [11].
+Apart from that, v6 has been tested with several applications and regular
+daily use. The controller has also been tested on the PPC64, x86_64 and
+UML platforms.
+
+4.1 Troubleshooting
+
+Sometimes a user might find that the application under a cgroup is
+terminated. There are several causes for this:
+
+1. The cgroup limit is too low (just too low to do anything useful)
+2. The user is using anonymous memory and swap is turned off or too low
+
+A sync followed by echo 1 > /proc/sys/vm/drop_caches will help get rid of
+some of the pages cached in the cgroup (page cache pages).
+
+4.2 Task migration
+
+When a task migrates from one cgroup to another, its charge is not
+carried forward. The pages allocated from the original cgroup still
+remain charged to it; the charge is dropped when the page is freed or
+reclaimed.
+
+4.3 Removing a cgroup
+
+A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a
+cgroup might have some charge associated with it, even though all
+tasks have migrated away from it.
+Such charges are freed (by default) or moved to the parent. When moved,
+both RSS and page cache charges are moved to the parent.
+If both of them are busy, rmdir() returns -EBUSY. See section 5.1 as well.
+
+Charges recorded in swap information are not updated at removal of a
+cgroup. The recorded information is discarded, and a cgroup which uses
+swap (swapcache) will be charged as its new owner.
+
+
+5. Misc. interfaces.
+
+5.1 force_empty
+ The memory.force_empty interface is provided to empty a cgroup's memory
+ usage. You can use this interface only when the cgroup has no tasks.
+ When anything is written to this file, e.g.
+
+ # echo 0 > memory.force_empty
+
+ almost all pages tracked by this memcg will be unmapped and freed. Some
+ pages cannot be freed because they are locked or in use; such pages are
+ moved to the parent and this cgroup will be empty. This may return
+ -EBUSY if the cgroup is too busy.
+
+ The typical use case for this interface is to call it before rmdir().
+ Because rmdir() moves all pages to the parent, some out-of-use page
+ caches can be moved to the parent. If you want to avoid that,
+ force_empty will be useful.
+
+5.2 stat file
+ The memory.stat file includes the following statistics (currently):
+ cache - # of pages from page cache and shmem.
+ rss - # of pages from anonymous memory.
+ pgpgin - # of charging events.
+ pgpgout - # of uncharging events.
+ active_anon - # of pages on the active LRU of anon/shmem.
+ inactive_anon - # of pages on the inactive LRU of anon/shmem.
+ active_file - # of pages on the active LRU of file cache.
+ inactive_file - # of pages on the inactive LRU of file cache.
+ unevictable - # of pages that cannot be reclaimed (mlocked etc.).
+
+ The entries below depend on CONFIG_DEBUG_VM.
+ inactive_ratio - VM internal parameter. (see mm/page_alloc.c)
+ recent_rotated_anon - VM internal parameter. (see mm/vmscan.c)
+ recent_rotated_file - VM internal parameter. (see mm/vmscan.c)
+ recent_scanned_anon - VM internal parameter. (see mm/vmscan.c)
+ recent_scanned_file - VM internal parameter. (see mm/vmscan.c)
+
+ Memo:
+ recent_rotated means the recent frequency of LRU rotation.
+ recent_scanned means the recent # of scans of the LRU.
+ These are shown to aid debugging; please see the code for the
+ precise meanings.
+
+
+5.3 swappiness
+ Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of
+ groups only.
+
+ The swappiness of the following cgroups can't be changed:
+ - the root cgroup (it uses /proc/sys/vm/swappiness).
+ - a cgroup which uses the hierarchy and has child cgroups.
+ - a cgroup which uses the hierarchy and is not the root of the
+   hierarchy.
+
+
+6. Hierarchy support
+
+The memory controller supports a deep hierarchy and hierarchical accounting.
+The hierarchy is created by creating the appropriate cgroups in the
+cgroup filesystem. Consider, for example, the following cgroup filesystem
+hierarchy:
+
+ root
+ / | \
+ / | \
+ a b c
+ | \
+ | \
+ d e
+
+In the diagram above, with hierarchical accounting enabled, all memory
+usage of e is accounted to its ancestors up until the root (i.e., c and
+root) that have memory.use_hierarchy enabled. If one of the ancestors
+goes over its limit, the reclaim algorithm reclaims from the tasks in
+the ancestor and the children of the ancestor.
+
+6.1 Enabling hierarchical accounting and reclaim
+
+The memory controller by default disables the hierarchy feature. Support
+can be enabled by writing 1 to the memory.use_hierarchy file of the root
+cgroup:
+
+# echo 1 > memory.use_hierarchy
+
+The feature can be disabled by:
+
+# echo 0 > memory.use_hierarchy
+
+NOTE1: Enabling/disabling will fail if the cgroup already has other
+cgroups created below it.
+
+NOTE2: This feature can be enabled/disabled per subtree.
+
+7. TODO
+
+1. Add support for accounting huge pages (as a separate controller)
+2. Make the per-cgroup scanner reclaim not-shared pages first
+3. Teach the controller to account for shared pages
+4. Start reclamation in the background when the limit is
+ not yet hit but the usage is getting closer
+
+Summary
+
+Overall, the memory controller has been a stable controller and has been
+commented on and discussed quite extensively in the community.
+
+References
+
+1. Singh, Balbir. RFC: Memory Controller, http://lwn.net/Articles/206697/
+2. Singh, Balbir. Memory Controller (RSS Control),
+ http://lwn.net/Articles/222762/
+3. Emelianov, Pavel. Resource controllers based on process cgroups,
+ http://lkml.org/lkml/2007/3/6/198
+4. Emelianov, Pavel. RSS controller based on process cgroups (v2),
+ http://lkml.org/lkml/2007/4/9/78
+5. Emelianov, Pavel. RSS controller based on process cgroups (v3),
+ http://lkml.org/lkml/2007/5/30/244
+6. Menage, Paul. Control Groups v10, http://lwn.net/Articles/236032/
+7. Vaidyanathan, Srinivasan. Control Groups: Pagecache accounting and control
+ subsystem (v3), http://lwn.net/Articles/235534/
+8. Singh, Balbir. RSS controller v2 test results (lmbench),
+ http://lkml.org/lkml/2007/5/17/232
+9. Singh, Balbir. RSS controller v2 AIM9 results,
+ http://lkml.org/lkml/2007/5/18/1
+10. Singh, Balbir. Memory controller v6 test results,
+ http://lkml.org/lkml/2007/8/19/36
+11. Singh, Balbir. Memory controller introduction (v6),
+ http://lkml.org/lkml/2007/8/17/69
+12. Corbet, Jonathan. Controlling memory use in cgroups,
+ http://lwn.net/Articles/243795/
diff --git a/Documentation/cgroups/resource_counter.txt b/Documentation/cgroups/resource_counter.txt
new file mode 100644
index 000000000000..f196ac1d7d25
--- /dev/null
+++ b/Documentation/cgroups/resource_counter.txt
@@ -0,0 +1,181 @@
+
+ The Resource Counter
+
+The resource counter, declared at include/linux/res_counter.h,
+is supposed to facilitate resource management by controllers
+by providing common stuff for accounting.
+
+This "stuff" includes the res_counter structure and routines
+to work with it.
+
+
+
+1. Crucial parts of the res_counter structure
+
+ a. unsigned long long usage
+
+ The usage value shows the amount of a resource that is consumed
+ by a group at a given time. The units of measurement should be
+ determined by the controller that uses this counter. E.g. it can
+ be bytes, items or any other unit the controller operates on.
+
+ b. unsigned long long max_usage
+
+ The maximal value of the usage over time.
+
+ This value is useful when gathering statistical information about
+ the particular group, as it shows the actual resource requirements
+ for a particular group, not just some usage snapshot.
+
+ c. unsigned long long limit
+
+ The maximal allowed amount of the resource that the group may
+ consume. If the group requests more resources, such that the
+ usage value would exceed the limit, the resource allocation is
+ rejected (see the next section).
+
+ d. unsigned long long failcnt
+
+ The failcnt stands for "failures counter". This is the number of
+ resource allocation attempts that failed.
+
+ e. spinlock_t lock
+
+ Protects changes of the above values.
+
+
+
+2. Basic accounting routines
+
+ a. void res_counter_init(struct res_counter *rc)
+
+ Initializes the resource counter. As usual, should be the first
+ routine called for a new counter.
+
+ b. int res_counter_charge[_locked]
+ (struct res_counter *rc, unsigned long val)
+
+ When a resource is about to be allocated it has to be accounted
+ with the appropriate resource counter (the controller should
+ determine which one to use on its own). This operation is called
+ "charging".
+
+ It is not very important which operation - resource allocation
+ or charging - is performed first, but:
+ * if the allocation is performed first, this may create a
+ temporary resource over-usage by the time the resource counter
+ is charged;
+ * if the charging is performed first, then it should be uncharged
+ on the error path (if one is taken).
+
+ c. void res_counter_uncharge[_locked]
+ (struct res_counter *rc, unsigned long val)
+
+ When a resource is released (freed) it should be de-accounted
+ from the resource counter it was accounted to. This is called
+ "uncharging".
+
+ The _locked routines imply that the res_counter->lock is taken.
+
+
+ 2.1 Other accounting routines
+
+ There are more routines that may help you with common needs, like
+ checking whether the limit is reached or resetting the max_usage
+ value. They are all declared in include/linux/res_counter.h.
+
+
+
+3. Analyzing the resource counter registrations
+
+ a. If the failcnt value constantly grows, this means that the counter's
+ limit is too tight. Either the group is misbehaving and consumes too
+ many resources, or the configuration is not suitable for the group
+ and the limit should be increased.
+
+ b. The max_usage value can be used to quickly tune the group. One may
+ set the limits to maximal values and either load the container with
+ a common pattern or leave it running for a while. After this the
+ max_usage value shows the amount of memory the container would
+ require during its common activity.
+
+ Setting the limit a bit above this value gives a pretty good
+ configuration that works in most of the cases.
+
+ c. If the max_usage is much less than the limit, but the failcnt value
+ is growing, then the group tries to allocate a big chunk of resource
+ at once.
+
+ d. If the max_usage is much less than the limit, but the failcnt value
+ is 0, then this group has been given a higher limit than it
+ requires. It is better to lower the limit a bit, leaving more of
+ the resource for the other groups.
+
+
+
+4. Communication with the control groups subsystem (cgroups)
+
+All the resource controllers that are using cgroups and resource counters
+should provide files (in the cgroup filesystem) to work with the resource
+counter fields. They are recommended to adhere to the following rules:
+
+ a. File names
+
+ Field name File name
+ ---------------------------------------------------
+ usage usage_in_<unit_of_measurement>
+ max_usage max_usage_in_<unit_of_measurement>
+ limit limit_in_<unit_of_measurement>
+ failcnt failcnt
+ lock no file :)
+
+ b. Reading from the file should show the corresponding field value in
+ the appropriate format.
+
+ c. Writing to the file
+
+ Field Expected behavior
+ ----------------------------------
+ usage prohibited
+ max_usage reset to usage
+ limit set the limit
+ failcnt reset to zero
+
+
+
+5. Usage example
+
+ a. Declare a task group (take a look at the cgroups subsystem for this)
+ and fold a res_counter into it
+
+ struct my_group {
+ struct res_counter res;
+
+ <other fields>
+ }
+
+ b. Put hooks in resource allocation/release paths
+
+ int alloc_something(...)
+ {
+ if (res_counter_charge(res_counter_ptr, amount) < 0)
+ return -ENOMEM;
+
+ <allocate the resource and return to the caller>
+ }
+
+ void release_something(...)
+ {
+ res_counter_uncharge(res_counter_ptr, amount);
+
+ <release the resource>
+ }
+
+ In order to keep the usage value self-consistent, both the
+ "res_counter_ptr" and the "amount" in release_something() should be
+ the same as they were in alloc_something() when the resource being
+ released was allocated.
+
+ c. Provide a way to read the res_counter values and to set them (the
+ cgroups subsystem can still help with it).
+
+ d. Compile and run :)
diff --git a/Documentation/controllers/cpuacct.txt b/Documentation/controllers/cpuacct.txt
deleted file mode 100644
index bb775fbe43d7..000000000000
--- a/Documentation/controllers/cpuacct.txt
+++ /dev/null
@@ -1,32 +0,0 @@
-CPU Accounting Controller
--------------------------
-
-The CPU accounting controller is used to group tasks using cgroups and
-account the CPU usage of these groups of tasks.
-
-The CPU accounting controller supports multi-hierarchy groups. An accounting
-group accumulates the CPU usage of all of its child groups and the tasks
-directly present in its group.
-
-Accounting groups can be created by first mounting the cgroup filesystem.
-
-# mkdir /cgroups
-# mount -t cgroup -ocpuacct none /cgroups
-
-With the above step, the initial or the parent accounting group
-becomes visible at /cgroups. At bootup, this group includes all the
-tasks in the system. /cgroups/tasks lists the tasks in this cgroup.
-/cgroups/cpuacct.usage gives the CPU time (in nanoseconds) obtained by -this group which is essentially the CPU time obtained by all the tasks -in the system. - -New accounting groups can be created under the parent group /cgroups. - -# cd /cgroups -# mkdir g1 -# echo $$ > g1 - -The above steps create a new group g1 and move the current shell -process (bash) into it. CPU time consumed by this bash and its children -can be obtained from g1/cpuacct.usage and the same is accumulated in -/cgroups/cpuacct.usage also. diff --git a/Documentation/controllers/devices.txt b/Documentation/controllers/devices.txt deleted file mode 100644 index 7cc6e6a60672..000000000000 --- a/Documentation/controllers/devices.txt +++ /dev/null @@ -1,52 +0,0 @@ -Device Whitelist Controller - -1. Description: - -Implement a cgroup to track and enforce open and mknod restrictions -on device files. A device cgroup associates a device access -whitelist with each cgroup. A whitelist entry has 4 fields. -'type' is a (all), c (char), or b (block). 'all' means it applies -to all types and all major and minor numbers. Major and minor are -either an integer or * for all. Access is a composition of r -(read), w (write), and m (mknod). - -The root device cgroup starts with rwm to 'all'. A child device -cgroup gets a copy of the parent. Administrators can then remove -devices from the whitelist or add new entries. A child cgroup can -never receive a device access which is denied by its parent. However -when a device access is removed from a parent it will not also be -removed from the child(ren). - -2. User Interface - -An entry is added using devices.allow, and removed using -devices.deny. For instance - - echo 'c 1:3 mr' > /cgroups/1/devices.allow - -allows cgroup 1 to read and mknod the device usually known as -/dev/null. Doing - - echo a > /cgroups/1/devices.deny - -will remove the default 'a *:* rwm' entry. Doing - - echo a > /cgroups/1/devices.allow - -will add the 'a *:* rwm' entry to the whitelist. - -3. Security - -Any task can move itself between cgroups. This clearly won't -suffice, but we can decide the best way to adequately restrict -movement as people get some experience with this. We may just want -to require CAP_SYS_ADMIN, which at least is a separate bit from -CAP_MKNOD. We may want to just refuse moving to a cgroup which -isn't a descendent of the current one. Or we may want to use -CAP_MAC_ADMIN, since we really are trying to lock down root. - -CAP_SYS_ADMIN is needed to modify the whitelist or move another -task to a new cgroup. (Again we'll probably want to change that). - -A cgroup may not be granted more permissions than the cgroup's -parent has. diff --git a/Documentation/controllers/memcg_test.txt b/Documentation/controllers/memcg_test.txt deleted file mode 100644 index 08d4d3ea0d79..000000000000 --- a/Documentation/controllers/memcg_test.txt +++ /dev/null @@ -1,342 +0,0 @@ -Memory Resource Controller(Memcg) Implementation Memo. -Last Updated: 2008/12/15 -Base Kernel Version: based on 2.6.28-rc8-mm. - -Because VM is getting complex (one of reasons is memcg...), memcg's behavior -is complex. This is a document for memcg's internal behavior. -Please note that implementation details can be changed. - -(*) Topics on API should be in Documentation/controllers/memory.txt) - -0. How to record usage ? - 2 objects are used. - - page_cgroup ....an object per page. - Allocated at boot or memory hotplug. Freed at memory hot removal. - - swap_cgroup ... an entry per swp_entry. - Allocated at swapon(). 
Freed at swapoff(). - - The page_cgroup has USED bit and double count against a page_cgroup never - occurs. swap_cgroup is used only when a charged page is swapped-out. - -1. Charge - - a page/swp_entry may be charged (usage += PAGE_SIZE) at - - mem_cgroup_newpage_charge() - Called at new page fault and Copy-On-Write. - - mem_cgroup_try_charge_swapin() - Called at do_swap_page() (page fault on swap entry) and swapoff. - Followed by charge-commit-cancel protocol. (With swap accounting) - At commit, a charge recorded in swap_cgroup is removed. - - mem_cgroup_cache_charge() - Called at add_to_page_cache() - - mem_cgroup_cache_charge_swapin() - Called at shmem's swapin. - - mem_cgroup_prepare_migration() - Called before migration. "extra" charge is done and followed by - charge-commit-cancel protocol. - At commit, charge against oldpage or newpage will be committed. - -2. Uncharge - a page/swp_entry may be uncharged (usage -= PAGE_SIZE) by - - mem_cgroup_uncharge_page() - Called when an anonymous page is fully unmapped. I.e., mapcount goes - to 0. If the page is SwapCache, uncharge is delayed until - mem_cgroup_uncharge_swapcache(). - - mem_cgroup_uncharge_cache_page() - Called when a page-cache is deleted from radix-tree. If the page is - SwapCache, uncharge is delayed until mem_cgroup_uncharge_swapcache(). - - mem_cgroup_uncharge_swapcache() - Called when SwapCache is removed from radix-tree. The charge itself - is moved to swap_cgroup. (If mem+swap controller is disabled, no - charge to swap occurs.) - - mem_cgroup_uncharge_swap() - Called when swp_entry's refcnt goes down to 0. A charge against swap - disappears. - - mem_cgroup_end_migration(old, new) - At success of migration old is uncharged (if necessary), a charge - to new page is committed. At failure, charge to old page is committed. - -3. charge-commit-cancel - In some case, we can't know this "charge" is valid or not at charging - (because of races). - To handle such case, there are charge-commit-cancel functions. - mem_cgroup_try_charge_XXX - mem_cgroup_commit_charge_XXX - mem_cgroup_cancel_charge_XXX - these are used in swap-in and migration. - - At try_charge(), there are no flags to say "this page is charged". - at this point, usage += PAGE_SIZE. - - At commit(), the function checks the page should be charged or not - and set flags or avoid charging.(usage -= PAGE_SIZE) - - At cancel(), simply usage -= PAGE_SIZE. - -Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y. - -4. Anonymous - Anonymous page is newly allocated at - - page fault into MAP_ANONYMOUS mapping. - - Copy-On-Write. - It is charged right after it's allocated before doing any page table - related operations. Of course, it's uncharged when another page is used - for the fault address. - - At freeing anonymous page (by exit() or munmap()), zap_pte() is called - and pages for ptes are freed one by one.(see mm/memory.c). Uncharges - are done at page_remove_rmap() when page_mapcount() goes down to 0. - - Another page freeing is by page-reclaim (vmscan.c) and anonymous - pages are swapped out. In this case, the page is marked as - PageSwapCache(). uncharge() routine doesn't uncharge the page marked - as SwapCache(). It's delayed until __delete_from_swap_cache(). - - 4.1 Swap-in. - At swap-in, the page is taken from swap-cache. There are 2 cases. - - (a) If the SwapCache is newly allocated and read, it has no charges. - (b) If the SwapCache has been mapped by processes, it has been - charged already. - - This swap-in is one of the most complicated work. 
In do_swap_page(), - following events occur when pte is unchanged. - - (1) the page (SwapCache) is looked up. - (2) lock_page() - (3) try_charge_swapin() - (4) reuse_swap_page() (may call delete_swap_cache()) - (5) commit_charge_swapin() - (6) swap_free(). - - Considering following situation for example. - - (A) The page has not been charged before (2) and reuse_swap_page() - doesn't call delete_from_swap_cache(). - (B) The page has not been charged before (2) and reuse_swap_page() - calls delete_from_swap_cache(). - (C) The page has been charged before (2) and reuse_swap_page() doesn't - call delete_from_swap_cache(). - (D) The page has been charged before (2) and reuse_swap_page() calls - delete_from_swap_cache(). - - memory.usage/memsw.usage changes to this page/swp_entry will be - Case (A) (B) (C) (D) - Event - Before (2) 0/ 1 0/ 1 1/ 1 1/ 1 - =========================================== - (3) +1/+1 +1/+1 +1/+1 +1/+1 - (4) - 0/ 0 - -1/ 0 - (5) 0/-1 0/ 0 -1/-1 0/ 0 - (6) - 0/-1 - 0/-1 - =========================================== - Result 1/ 1 1/ 1 1/ 1 1/ 1 - - In any cases, charges to this page should be 1/ 1. - - 4.2 Swap-out. - At swap-out, typical state transition is below. - - (a) add to swap cache. (marked as SwapCache) - swp_entry's refcnt += 1. - (b) fully unmapped. - swp_entry's refcnt += # of ptes. - (c) write back to swap. - (d) delete from swap cache. (remove from SwapCache) - swp_entry's refcnt -= 1. - - - At (b), the page is marked as SwapCache and not uncharged. - At (d), the page is removed from SwapCache and a charge in page_cgroup - is moved to swap_cgroup. - - Finally, at task exit, - (e) zap_pte() is called and swp_entry's refcnt -=1 -> 0. - Here, a charge in swap_cgroup disappears. - -5. Page Cache - Page Cache is charged at - - add_to_page_cache_locked(). - - uncharged at - - __remove_from_page_cache(). - - The logic is very clear. (About migration, see below) - Note: __remove_from_page_cache() is called by remove_from_page_cache() - and __remove_mapping(). - -6. Shmem(tmpfs) Page Cache - Memcg's charge/uncharge have special handlers of shmem. The best way - to understand shmem's page state transition is to read mm/shmem.c. - But brief explanation of the behavior of memcg around shmem will be - helpful to understand the logic. - - Shmem's page (just leaf page, not direct/indirect block) can be on - - radix-tree of shmem's inode. - - SwapCache. - - Both on radix-tree and SwapCache. This happens at swap-in - and swap-out, - - It's charged when... - - A new page is added to shmem's radix-tree. - - A swp page is read. (move a charge from swap_cgroup to page_cgroup) - It's uncharged when - - A page is removed from radix-tree and not SwapCache. - - When SwapCache is removed, a charge is moved to swap_cgroup. - - When swp_entry's refcnt goes down to 0, a charge in swap_cgroup - disappears. - -7. Page Migration - One of the most complicated functions is page-migration-handler. - Memcg has 2 routines. Assume that we are migrating a page's contents - from OLDPAGE to NEWPAGE. - - Usual migration logic is.. - (a) remove the page from LRU. - (b) allocate NEWPAGE (migration target) - (c) lock by lock_page(). - (d) unmap all mappings. - (e-1) If necessary, replace entry in radix-tree. - (e-2) move contents of a page. - (f) map all mappings again. - (g) pushback the page to LRU. - (-) OLDPAGE will be freed. - - Before (g), memcg should complete all necessary charge/uncharge to - NEWPAGE/OLDPAGE. - - The point is.... 
- - If OLDPAGE is anonymous, all charges will be dropped at (d) because - try_to_unmap() drops all mapcount and the page will not be - SwapCache. - - - If OLDPAGE is SwapCache, charges will be kept at (g) because - __delete_from_swap_cache() isn't called at (e-1) - - - If OLDPAGE is page-cache, charges will be kept at (g) because - remove_from_swap_cache() isn't called at (e-1) - - memcg provides following hooks. - - - mem_cgroup_prepare_migration(OLDPAGE) - Called after (b) to account a charge (usage += PAGE_SIZE) against - memcg which OLDPAGE belongs to. - - - mem_cgroup_end_migration(OLDPAGE, NEWPAGE) - Called after (f) before (g). - If OLDPAGE is used, commit OLDPAGE again. If OLDPAGE is already - charged, a charge by prepare_migration() is automatically canceled. - If NEWPAGE is used, commit NEWPAGE and uncharge OLDPAGE. - - But zap_pte() (by exit or munmap) can be called while migration, - we have to check if OLDPAGE/NEWPAGE is a valid page after commit(). - -8. LRU - Each memcg has its own private LRU. Now, it's handling is under global - VM's control (means that it's handled under global zone->lru_lock). - Almost all routines around memcg's LRU is called by global LRU's - list management functions under zone->lru_lock(). - - A special function is mem_cgroup_isolate_pages(). This scans - memcg's private LRU and call __isolate_lru_page() to extract a page - from LRU. - (By __isolate_lru_page(), the page is removed from both of global and - private LRU.) - - -9. Typical Tests. - - Tests for racy cases. - - 9.1 Small limit to memcg. - When you do test to do racy case, it's good test to set memcg's limit - to be very small rather than GB. Many races found in the test under - xKB or xxMB limits. - (Memory behavior under GB and Memory behavior under MB shows very - different situation.) - - 9.2 Shmem - Historically, memcg's shmem handling was poor and we saw some amount - of troubles here. This is because shmem is page-cache but can be - SwapCache. Test with shmem/tmpfs is always good test. - - 9.3 Migration - For NUMA, migration is an another special case. To do easy test, cpuset - is useful. Following is a sample script to do migration. - - mount -t cgroup -o cpuset none /opt/cpuset - - mkdir /opt/cpuset/01 - echo 1 > /opt/cpuset/01/cpuset.cpus - echo 0 > /opt/cpuset/01/cpuset.mems - echo 1 > /opt/cpuset/01/cpuset.memory_migrate - mkdir /opt/cpuset/02 - echo 1 > /opt/cpuset/02/cpuset.cpus - echo 1 > /opt/cpuset/02/cpuset.mems - echo 1 > /opt/cpuset/02/cpuset.memory_migrate - - In above set, when you moves a task from 01 to 02, page migration to - node 0 to node 1 will occur. Following is a script to migrate all - under cpuset. - -- - move_task() - { - for pid in $1 - do - /bin/echo $pid >$2/tasks 2>/dev/null - echo -n $pid - echo -n " " - done - echo END - } - - G1_TASK=`cat ${G1}/tasks` - G2_TASK=`cat ${G2}/tasks` - move_task "${G1_TASK}" ${G2} & - -- - 9.4 Memory hotplug. - memory hotplug test is one of good test. - to offline memory, do following. - # echo offline > /sys/devices/system/memory/memoryXXX/state - (XXX is the place of memory) - This is an easy way to test page migration, too. - - 9.5 mkdir/rmdir - When using hierarchy, mkdir/rmdir test should be done. - Use tests like the following. - - echo 1 >/opt/cgroup/01/memory/use_hierarchy - mkdir /opt/cgroup/01/child_a - mkdir /opt/cgroup/01/child_b - - set limit to 01. - add limit to 01/child_b - run jobs under child_a and child_b - - create/delete following groups at random while jobs are running. 
- /opt/cgroup/01/child_a/child_aa - /opt/cgroup/01/child_b/child_bb - /opt/cgroup/01/child_c - - running new jobs in new group is also good. - - 9.6 Mount with other subsystems. - Mounting with other subsystems is a good test because there is a - race and lock dependency with other cgroup subsystems. - - example) - # mount -t cgroup none /cgroup -t cpuset,memory,cpu,devices - - and do task move, mkdir, rmdir etc...under this. diff --git a/Documentation/controllers/memory.txt b/Documentation/controllers/memory.txt deleted file mode 100644 index e1501964df1e..000000000000 --- a/Documentation/controllers/memory.txt +++ /dev/null @@ -1,399 +0,0 @@ -Memory Resource Controller - -NOTE: The Memory Resource Controller has been generically been referred -to as the memory controller in this document. Do not confuse memory controller -used here with the memory controller that is used in hardware. - -Salient features - -a. Enable control of both RSS (mapped) and Page Cache (unmapped) pages -b. The infrastructure allows easy addition of other types of memory to control -c. Provides *zero overhead* for non memory controller users -d. Provides a double LRU: global memory pressure causes reclaim from the - global LRU; a cgroup on hitting a limit, reclaims from the per - cgroup LRU - -NOTE: Swap Cache (unmapped) is not accounted now. - -Benefits and Purpose of the memory controller - -The memory controller isolates the memory behaviour of a group of tasks -from the rest of the system. The article on LWN [12] mentions some probable -uses of the memory controller. The memory controller can be used to - -a. Isolate an application or a group of applications - Memory hungry applications can be isolated and limited to a smaller - amount of memory. -b. Create a cgroup with limited amount of memory, this can be used - as a good alternative to booting with mem=XXXX. -c. Virtualization solutions can control the amount of memory they want - to assign to a virtual machine instance. -d. A CD/DVD burner could control the amount of memory used by the - rest of the system to ensure that burning does not fail due to lack - of available memory. -e. There are several other use cases, find one or use the controller just - for fun (to learn and hack on the VM subsystem). - -1. History - -The memory controller has a long history. A request for comments for the memory -controller was posted by Balbir Singh [1]. At the time the RFC was posted -there were several implementations for memory control. The goal of the -RFC was to build consensus and agreement for the minimal features required -for memory control. The first RSS controller was posted by Balbir Singh[2] -in Feb 2007. Pavel Emelianov [3][4][5] has since posted three versions of the -RSS controller. At OLS, at the resource management BoF, everyone suggested -that we handle both page cache and RSS together. Another request was raised -to allow user space handling of OOM. The current memory controller is -at version 6; it combines both mapped (RSS) and unmapped Page -Cache Control [11]. - -2. Memory Control - -Memory is a unique resource in the sense that it is present in a limited -amount. If a task requires a lot of CPU processing, the task can spread -its processing over a period of hours, days, months or years, but with -memory, the same physical memory needs to be reused to accomplish the task. - -The memory controller implementation has been divided into phases. These -are: - -1. Memory controller -2. mlock(2) controller -3. 
Kernel user memory accounting and slab control
-4. user mappings length controller
-
-The memory controller is the first controller developed.
-
-2.1. Design
-
-The core of the design is a counter called the res_counter. The res_counter
-tracks the current memory usage and limit of the group of processes associated
-with the controller. Each cgroup has a memory controller specific data
-structure (mem_cgroup) associated with it.
-
-2.2. Accounting
-
-		+--------------------+
-		|  mem_cgroup        |
-		|  (res_counter)     |
-		+--------------------+
-		 /            ^      \
-		/             |       \
-	+---------------+     |     +---------------+
-	| mm_struct     |     |.... | mm_struct     |
-	|               |     |     |               |
-	+---------------+     |     +---------------+
-			      |
-			      +---------------+
-					      |
-	+---------------+	      +------+--------+
-	| page          +---------->  | page_cgroup  |
-	|               |	      |               |
-	+---------------+	      +---------------+
-
-	     (Figure 1: Hierarchy of Accounting)
-
-
-Figure 1 shows the important aspects of the controller
-
-1. Accounting happens per cgroup
-2. Each mm_struct knows about which cgroup it belongs to
-3. Each page has a pointer to the page_cgroup, which in turn knows the
-   cgroup it belongs to
-
-The accounting is done as follows: mem_cgroup_charge() is invoked to set up
-the necessary data structures and check if the cgroup that is being charged
-is over its limit. If it is, then reclaim is invoked on the cgroup.
-More details can be found in the reclaim section of this document.
-If everything goes well, a page meta-data structure called page_cgroup is
-allocated and associated with the page. This routine also adds the page to
-the per cgroup LRU.
-
-2.2.1 Accounting details
-
-All mapped anon pages (RSS) and cache pages (Page Cache) are accounted.
-(Some pages which can never be reclaimed, and will not be on the global LRU,
- are not accounted. We only account pages under usual VM management.)
-
-RSS pages are accounted at page_fault unless they've already been accounted
-for earlier. A file page will be accounted for as Page Cache when it's
-inserted into inode (radix-tree). While it's mapped into the page tables of
-processes, duplicate accounting is carefully avoided.
-
-An RSS page is unaccounted when it's fully unmapped. A PageCache page is
-unaccounted when it's removed from radix-tree.
-
-At page migration, accounting information is kept.
-
-Note: we only account pages-on-LRU because our purpose is to control the
-amount of used pages; not-on-LRU pages tend to be out-of-control from the
-VM's point of view.
-
-2.3 Shared Page Accounting
-
-Shared pages are accounted on the basis of the first touch approach. The
-cgroup that first touches a page is accounted for the page. The principle
-behind this approach is that a cgroup that aggressively uses a shared
-page will eventually get charged for it (once it is uncharged from
-the cgroup that brought it in -- this will happen on memory pressure).
-
-Exception: if CONFIG_CGROUP_MEM_RES_CTLR_SWAP is not used:
-when you do swapoff and force swapped-out pages of shmem (tmpfs) to
-be backed into memory, charges for the pages are accounted against the
-caller of swapoff rather than the users of shmem.
-
-
-2.4 Swap Extension (CONFIG_CGROUP_MEM_RES_CTLR_SWAP)
-The Swap Extension allows you to record charges for swap. A swapped-in page
-is charged back to the original page allocator if possible.
-
-When swap is accounted, the following files are added.
- - memory.memsw.usage_in_bytes.
- - memory.memsw.limit_in_bytes.
-
-Usage of mem+swap is limited by memsw.limit_in_bytes.
-
-Note: why 'mem+swap' rather than just swap?
-The global LRU (kswapd) can swap out arbitrary pages. Swapping a page out
-means moving its account from memory to swap... there is no change in the
-usage of mem+swap.
-
-In other words, when we want to limit the usage of swap without affecting
-the global LRU, a mem+swap limit is better than just limiting swap, from
-the OS point of view.
-
-2.5 Reclaim
-
-Each cgroup maintains a per cgroup LRU that consists of an active
-and inactive list. When a cgroup goes over its limit, we first try
-to reclaim memory from the cgroup so as to make space for the new
-pages that the cgroup has touched. If the reclaim is unsuccessful,
-an OOM routine is invoked to select and kill the bulkiest task in the
-cgroup.
-
-The reclaim algorithm has not been modified for cgroups, except that
-pages that are selected for reclaiming come from the per cgroup LRU
-list.
-
-2.6 Locking
-
-The memory controller uses the following locking hierarchy:
-
-1. zone->lru_lock is used for selecting pages to be isolated
-2. mem->per_zone->lru_lock protects the per cgroup LRU (per zone)
-3. lock_page_cgroup() is used to protect page->page_cgroup
-
-3. User Interface
-
-0. Configuration
-
-a. Enable CONFIG_CGROUPS
-b. Enable CONFIG_RESOURCE_COUNTERS
-c. Enable CONFIG_CGROUP_MEM_RES_CTLR
-
-1. Prepare the cgroups
-# mkdir -p /cgroups
-# mount -t cgroup none /cgroups -o memory
-
-2. Make the new group and move bash into it
-# mkdir /cgroups/0
-# echo $$ > /cgroups/0/tasks
-
-Now that we're in the 0 cgroup, we can alter the memory limit:
-# echo 4M > /cgroups/0/memory.limit_in_bytes
-
-NOTE: We can use a suffix (k, K, m, M, g or G) to indicate values in kilo,
-mega or gigabytes.
-
-# cat /cgroups/0/memory.limit_in_bytes
-4194304
-
-NOTE: The interface has now changed to display the usage in bytes
-instead of pages.
-
-We can check the usage:
-# cat /cgroups/0/memory.usage_in_bytes
-1216512
-
-A successful write to this file does not guarantee that the limit was set
-to the value written into the file. This can be due to a
-number of factors, such as rounding up to page boundaries or the total
-availability of memory on the system. The user is required to re-read
-this file after a write to see the value committed by the kernel.
-
-# echo 1 > memory.limit_in_bytes
-# cat memory.limit_in_bytes
-4096
-
-The memory.failcnt field gives the number of times that the cgroup limit was
-exceeded.
-
-The memory.stat file gives accounting information. Currently, the numbers of
-caches, RSS pages and active/inactive pages are shown.
-
-4. Testing
-
-Balbir posted lmbench, AIM9, LTP and vmmstress results [10] and [11].
-Apart from that, v6 has been tested with several applications and regular
-daily use. The controller has also been tested on the PPC64, x86_64 and
-UML platforms.
-
-4.1 Troubleshooting
-
-Sometimes a user might find that the application under a cgroup is
-terminated. There are several causes for this:
-
-1. The cgroup limit is too low (just too low to do anything useful)
-2. The user is using anonymous memory and swap is turned off or too low
-
-A sync followed by echo 1 > /proc/sys/vm/drop_caches will help get rid of
-some of the pages cached in the cgroup (page cache pages).
-
-4.2 Task migration
-
-When a task migrates from one cgroup to another, its charge is not
-carried forward. The pages allocated from the original cgroup still
-remain charged to it; the charge is dropped when the page is freed or
-reclaimed.
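For example, a minimal sketch of this behavior (hypothetical group names;
the tasks and memory.usage_in_bytes files are the ones described above):

# mkdir /cgroups/A /cgroups/B
# echo $$ > /cgroups/A/tasks		# the shell is charged to A from now on
  (let the shell allocate some memory)
# echo $$ > /cgroups/B/tasks		# migrate the shell to B
# cat /cgroups/A/memory.usage_in_bytes	# earlier charges remain with A
# cat /cgroups/B/memory.usage_in_bytes	# only new allocations charge B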
-
-4.3 Removing a cgroup
-
-A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a
-cgroup might have some charge associated with it, even though all
-tasks have migrated away from it.
-Such charges are freed (by default) or moved to the parent. When moved,
-both RSS and CACHES are moved to the parent.
-If both of them are busy, rmdir() returns -EBUSY. See section 5.1 also.
-
-Charges recorded in swap information are not updated at removal of a cgroup.
-The recorded information is discarded, and a cgroup which uses swap
-(swapcache) will be charged as the new owner of it.
-
-
-5. Misc. interfaces.
-
-5.1 force_empty
-  The memory.force_empty interface is provided to make a cgroup's memory
-  usage empty. You can use this interface only when the cgroup has no tasks.
-  When anything is written to this file, e.g.
-
-  # echo 0 > memory.force_empty
-
-  almost all pages tracked by this memcg will be unmapped and freed. Some
-  pages cannot be freed because they are locked or in use. Such pages are
-  moved to the parent and this cgroup will be empty. This may return -EBUSY
-  in some too-busy cases.
-
-  The typical use case for this interface is to call it before rmdir().
-  Because rmdir() moves all pages to the parent, some out-of-use page caches
-  can be moved to the parent. If you want to avoid that, force_empty will be
-  useful.
-
-5.2 stat file
-  The memory.stat file includes the following statistics (for now):
-	cache		- # of pages from page-cache and shmem.
-	rss		- # of pages from anonymous memory.
-	pgpgin		- # of charging events.
-	pgpgout		- # of uncharging events.
-	active_anon	- # of pages on the active LRU of anon, shmem.
-	inactive_anon	- # of pages on the inactive LRU of anon, shmem.
-	active_file	- # of pages on the active LRU of file-cache.
-	inactive_file	- # of pages on the inactive LRU of file cache.
-	unevictable	- # of pages which cannot be reclaimed (mlocked etc).
-
-	The following depend on CONFIG_DEBUG_VM.
-	inactive_ratio		- VM internal parameter. (see mm/page_alloc.c)
-	recent_rotated_anon	- VM internal parameter. (see mm/vmscan.c)
-	recent_rotated_file	- VM internal parameter. (see mm/vmscan.c)
-	recent_scanned_anon	- VM internal parameter. (see mm/vmscan.c)
-	recent_scanned_file	- VM internal parameter. (see mm/vmscan.c)
-
-	Memo:
-	  recent_rotated means the recent frequency of LRU rotation.
-	  recent_scanned means the recent # of scans of the LRU.
-	  These are shown for easier debugging; please see the code for the
-	  precise meanings.
-
-
-5.3 swappiness
-  Similar to /proc/sys/vm/swappiness, but affecting only a hierarchy of
-  groups.
-
-  The swappiness of the following cgroups can't be changed:
-  - the root cgroup (which uses /proc/sys/vm/swappiness).
-  - a cgroup which uses hierarchy and has a child cgroup.
-  - a cgroup which uses hierarchy and is not the root of the hierarchy.
-
-
-6. Hierarchy support
-
-The memory controller supports a deep hierarchy and hierarchical accounting.
-The hierarchy is created by creating the appropriate cgroups in the
-cgroup filesystem. Consider, for example, the following cgroup filesystem
-hierarchy
-
-		root
-	     /   |   \
-	    /    |    \
-	   a     b     c
-		       | \
-		       |  \
-		       d   e
-
-In the diagram above, with hierarchical accounting enabled, all memory
-usage of e is accounted to its ancestors up until the root (i.e., c and root)
-that have memory.use_hierarchy enabled. If one of the ancestors goes over its
-limit, the reclaim algorithm reclaims from the tasks in the ancestor and the
-children of the ancestor.
-
-6.1 Enabling hierarchical accounting and reclaim
-
-The memory controller by default disables the hierarchy feature.
Support -can be enabled by writing 1 to memory.use_hierarchy file of the root cgroup - -# echo 1 > memory.use_hierarchy - -The feature can be disabled by - -# echo 0 > memory.use_hierarchy - -NOTE1: Enabling/disabling will fail if the cgroup already has other -cgroups created below it. - -NOTE2: This feature can be enabled/disabled per subtree. - -7. TODO - -1. Add support for accounting huge pages (as a separate controller) -2. Make per-cgroup scanner reclaim not-shared pages first -3. Teach controller to account for shared-pages -4. Start reclamation in the background when the limit is - not yet hit but the usage is getting closer - -Summary - -Overall, the memory controller has been a stable controller and has been -commented and discussed quite extensively in the community. - -References - -1. Singh, Balbir. RFC: Memory Controller, http://lwn.net/Articles/206697/ -2. Singh, Balbir. Memory Controller (RSS Control), - http://lwn.net/Articles/222762/ -3. Emelianov, Pavel. Resource controllers based on process cgroups - http://lkml.org/lkml/2007/3/6/198 -4. Emelianov, Pavel. RSS controller based on process cgroups (v2) - http://lkml.org/lkml/2007/4/9/78 -5. Emelianov, Pavel. RSS controller based on process cgroups (v3) - http://lkml.org/lkml/2007/5/30/244 -6. Menage, Paul. Control Groups v10, http://lwn.net/Articles/236032/ -7. Vaidyanathan, Srinivasan, Control Groups: Pagecache accounting and control - subsystem (v3), http://lwn.net/Articles/235534/ -8. Singh, Balbir. RSS controller v2 test results (lmbench), - http://lkml.org/lkml/2007/5/17/232 -9. Singh, Balbir. RSS controller v2 AIM9 results - http://lkml.org/lkml/2007/5/18/1 -10. Singh, Balbir. Memory controller v6 test results, - http://lkml.org/lkml/2007/8/19/36 -11. Singh, Balbir. Memory controller introduction (v6), - http://lkml.org/lkml/2007/8/17/69 -12. Corbet, Jonathan, Controlling memory use in cgroups, - http://lwn.net/Articles/243795/ diff --git a/Documentation/controllers/resource_counter.txt b/Documentation/controllers/resource_counter.txt deleted file mode 100644 index f196ac1d7d25..000000000000 --- a/Documentation/controllers/resource_counter.txt +++ /dev/null @@ -1,181 +0,0 @@ - - The Resource Counter - -The resource counter, declared at include/linux/res_counter.h, -is supposed to facilitate the resource management by controllers -by providing common stuff for accounting. - -This "stuff" includes the res_counter structure and routines -to work with it. - - - -1. Crucial parts of the res_counter structure - - a. unsigned long long usage - - The usage value shows the amount of a resource that is consumed - by a group at a given time. The units of measurement should be - determined by the controller that uses this counter. E.g. it can - be bytes, items or any other unit the controller operates on. - - b. unsigned long long max_usage - - The maximal value of the usage over time. - - This value is useful when gathering statistical information about - the particular group, as it shows the actual resource requirements - for a particular group, not just some usage snapshot. - - c. unsigned long long limit - - The maximal allowed amount of resource to consume by the group. In - case the group requests for more resources, so that the usage value - would exceed the limit, the resource allocation is rejected (see - the next section). - - d. unsigned long long failcnt - - The failcnt stands for "failures counter". This is the number of - resource allocation attempts that failed. - - c. 
spinlock_t lock - - Protects changes of the above values. - - - -2. Basic accounting routines - - a. void res_counter_init(struct res_counter *rc) - - Initializes the resource counter. As usual, should be the first - routine called for a new counter. - - b. int res_counter_charge[_locked] - (struct res_counter *rc, unsigned long val) - - When a resource is about to be allocated it has to be accounted - with the appropriate resource counter (controller should determine - which one to use on its own). This operation is called "charging". - - This is not very important which operation - resource allocation - or charging - is performed first, but - * if the allocation is performed first, this may create a - temporary resource over-usage by the time resource counter is - charged; - * if the charging is performed first, then it should be uncharged - on error path (if the one is called). - - c. void res_counter_uncharge[_locked] - (struct res_counter *rc, unsigned long val) - - When a resource is released (freed) it should be de-accounted - from the resource counter it was accounted to. This is called - "uncharging". - - The _locked routines imply that the res_counter->lock is taken. - - - 2.1 Other accounting routines - - There are more routines that may help you with common needs, like - checking whether the limit is reached or resetting the max_usage - value. They are all declared in include/linux/res_counter.h. - - - -3. Analyzing the resource counter registrations - - a. If the failcnt value constantly grows, this means that the counter's - limit is too tight. Either the group is misbehaving and consumes too - many resources, or the configuration is not suitable for the group - and the limit should be increased. - - b. The max_usage value can be used to quickly tune the group. One may - set the limits to maximal values and either load the container with - a common pattern or leave one for a while. After this the max_usage - value shows the amount of memory the container would require during - its common activity. - - Setting the limit a bit above this value gives a pretty good - configuration that works in most of the cases. - - c. If the max_usage is much less than the limit, but the failcnt value - is growing, then the group tries to allocate a big chunk of resource - at once. - - d. If the max_usage is much less than the limit, but the failcnt value - is 0, then this group is given too high limit, that it does not - require. It is better to lower the limit a bit leaving more resource - for other groups. - - - -4. Communication with the control groups subsystem (cgroups) - -All the resource controllers that are using cgroups and resource counters -should provide files (in the cgroup filesystem) to work with the resource -counter fields. They are recommended to adhere to the following rules: - - a. File names - - Field name File name - --------------------------------------------------- - usage usage_in_ - max_usage max_usage_in_ - limit limit_in_ - failcnt failcnt - lock no file :) - - b. Reading from file should show the corresponding field value in the - appropriate format. - - c. Writing to file - - Field Expected behavior - ---------------------------------- - usage prohibited - max_usage reset to usage - limit set the limit - failcnt reset to zero - - - -5. Usage example - - a. Declare a task group (take a look at cgroups subsystem for this) and - fold a res_counter into it - - struct my_group { - struct res_counter res; - - - } - - b. 
Put hooks in resource allocation/release paths - - int alloc_something(...) - { - if (res_counter_charge(res_counter_ptr, amount) < 0) - return -ENOMEM; - - - } - - void release_something(...) - { - res_counter_uncharge(res_counter_ptr, amount); - - - } - - In order to keep the usage value self-consistent, both the - "res_counter_ptr" and the "amount" in release_something() should be - the same as they were in the alloc_something() when the releasing - resource was allocated. - - c. Provide the way to read res_counter values and set them (the cgroups - still can help with it). - - c. Compile and run :) diff --git a/Documentation/cpusets.txt b/Documentation/cpusets.txt deleted file mode 100644 index 5c86c258c791..000000000000 --- a/Documentation/cpusets.txt +++ /dev/null @@ -1,808 +0,0 @@ - CPUSETS - ------- - -Copyright (C) 2004 BULL SA. -Written by Simon.Derr@bull.net - -Portions Copyright (c) 2004-2006 Silicon Graphics, Inc. -Modified by Paul Jackson -Modified by Christoph Lameter -Modified by Paul Menage -Modified by Hidetoshi Seto - -CONTENTS: -========= - -1. Cpusets - 1.1 What are cpusets ? - 1.2 Why are cpusets needed ? - 1.3 How are cpusets implemented ? - 1.4 What are exclusive cpusets ? - 1.5 What is memory_pressure ? - 1.6 What is memory spread ? - 1.7 What is sched_load_balance ? - 1.8 What is sched_relax_domain_level ? - 1.9 How do I use cpusets ? -2. Usage Examples and Syntax - 2.1 Basic Usage - 2.2 Adding/removing cpus - 2.3 Setting flags - 2.4 Attaching processes -3. Questions -4. Contact - -1. Cpusets -========== - -1.1 What are cpusets ? ----------------------- - -Cpusets provide a mechanism for assigning a set of CPUs and Memory -Nodes to a set of tasks. In this document "Memory Node" refers to -an on-line node that contains memory. - -Cpusets constrain the CPU and Memory placement of tasks to only -the resources within a tasks current cpuset. They form a nested -hierarchy visible in a virtual file system. These are the essential -hooks, beyond what is already present, required to manage dynamic -job placement on large systems. - -Cpusets use the generic cgroup subsystem described in -Documentation/cgroups/cgroups.txt. - -Requests by a task, using the sched_setaffinity(2) system call to -include CPUs in its CPU affinity mask, and using the mbind(2) and -set_mempolicy(2) system calls to include Memory Nodes in its memory -policy, are both filtered through that tasks cpuset, filtering out any -CPUs or Memory Nodes not in that cpuset. The scheduler will not -schedule a task on a CPU that is not allowed in its cpus_allowed -vector, and the kernel page allocator will not allocate a page on a -node that is not allowed in the requesting tasks mems_allowed vector. - -User level code may create and destroy cpusets by name in the cgroup -virtual file system, manage the attributes and permissions of these -cpusets and which CPUs and Memory Nodes are assigned to each cpuset, -specify and query to which cpuset a task is assigned, and list the -task pids assigned to a cpuset. - - -1.2 Why are cpusets needed ? ----------------------------- - -The management of large computer systems, with many processors (CPUs), -complex memory cache hierarchies and multiple Memory Nodes having -non-uniform access times (NUMA) presents additional challenges for -the efficient scheduling and memory placement of processes. 
- -Frequently more modest sized systems can be operated with adequate -efficiency just by letting the operating system automatically share -the available CPU and Memory resources amongst the requesting tasks. - -But larger systems, which benefit more from careful processor and -memory placement to reduce memory access times and contention, -and which typically represent a larger investment for the customer, -can benefit from explicitly placing jobs on properly sized subsets of -the system. - -This can be especially valuable on: - - * Web Servers running multiple instances of the same web application, - * Servers running different applications (for instance, a web server - and a database), or - * NUMA systems running large HPC applications with demanding - performance characteristics. - -These subsets, or "soft partitions" must be able to be dynamically -adjusted, as the job mix changes, without impacting other concurrently -executing jobs. The location of the running jobs pages may also be moved -when the memory locations are changed. - -The kernel cpuset patch provides the minimum essential kernel -mechanisms required to efficiently implement such subsets. It -leverages existing CPU and Memory Placement facilities in the Linux -kernel to avoid any additional impact on the critical scheduler or -memory allocator code. - - -1.3 How are cpusets implemented ? ---------------------------------- - -Cpusets provide a Linux kernel mechanism to constrain which CPUs and -Memory Nodes are used by a process or set of processes. - -The Linux kernel already has a pair of mechanisms to specify on which -CPUs a task may be scheduled (sched_setaffinity) and on which Memory -Nodes it may obtain memory (mbind, set_mempolicy). - -Cpusets extends these two mechanisms as follows: - - - Cpusets are sets of allowed CPUs and Memory Nodes, known to the - kernel. - - Each task in the system is attached to a cpuset, via a pointer - in the task structure to a reference counted cgroup structure. - - Calls to sched_setaffinity are filtered to just those CPUs - allowed in that tasks cpuset. - - Calls to mbind and set_mempolicy are filtered to just - those Memory Nodes allowed in that tasks cpuset. - - The root cpuset contains all the systems CPUs and Memory - Nodes. - - For any cpuset, one can define child cpusets containing a subset - of the parents CPU and Memory Node resources. - - The hierarchy of cpusets can be mounted at /dev/cpuset, for - browsing and manipulation from user space. - - A cpuset may be marked exclusive, which ensures that no other - cpuset (except direct ancestors and descendents) may contain - any overlapping CPUs or Memory Nodes. - - You can list all the tasks (by pid) attached to any cpuset. - -The implementation of cpusets requires a few, simple hooks -into the rest of the kernel, none in performance critical paths: - - - in init/main.c, to initialize the root cpuset at system boot. - - in fork and exit, to attach and detach a task from its cpuset. - - in sched_setaffinity, to mask the requested CPUs by what's - allowed in that tasks cpuset. - - in sched.c migrate_all_tasks(), to keep migrating tasks within - the CPUs allowed by their cpuset, if possible. - - in the mbind and set_mempolicy system calls, to mask the requested - Memory Nodes by what's allowed in that tasks cpuset. - - in page_alloc.c, to restrict memory to allowed nodes. - - in vmscan.c, to restrict page recovery to the current cpuset. 
-
-You should mount the "cgroup" filesystem type in order to enable
-browsing and modifying the cpusets presently known to the kernel. No
-new system calls are added for cpusets - all support for querying and
-modifying cpusets is via this cpuset file system.
-
-The /proc/<pid>/status file for each task has four added lines,
-displaying the task's cpus_allowed (on which CPUs it may be scheduled)
-and mems_allowed (on which Memory Nodes it may obtain memory),
-in the two formats seen in the following example:
-
-  Cpus_allowed:   ffffffff,ffffffff,ffffffff,ffffffff
-  Cpus_allowed_list:      0-127
-  Mems_allowed:   ffffffff,ffffffff
-  Mems_allowed_list:      0-63
-
-Each cpuset is represented by a directory in the cgroup file system
-containing (on top of the standard cgroup files) the following
-files describing that cpuset:
-
- - cpus: list of CPUs in that cpuset
- - mems: list of Memory Nodes in that cpuset
- - memory_migrate flag: if set, move pages to cpusets nodes
- - cpu_exclusive flag: is cpu placement exclusive?
- - mem_exclusive flag: is memory placement exclusive?
- - mem_hardwall flag: is memory allocation hardwalled
- - memory_pressure: measure of how much paging pressure in cpuset
-
-In addition, the root cpuset only has the following file:
- - memory_pressure_enabled flag: compute memory_pressure?
-
-New cpusets are created using the mkdir system call or shell
-command. The properties of a cpuset, such as its flags, allowed
-CPUs and Memory Nodes, and attached tasks, are modified by writing
-to the appropriate file in that cpuset's directory, as listed above.
-
-The named hierarchical structure of nested cpusets allows partitioning
-a large system into nested, dynamically changeable, "soft-partitions".
-
-The attachment of each task, automatically inherited at fork by any
-children of that task, to a cpuset allows organizing the work load
-on a system into related sets of tasks such that each set is constrained
-to using the CPUs and Memory Nodes of a particular cpuset. A task
-may be re-attached to any other cpuset, if allowed by the permissions
-on the necessary cpuset file system directories.
-
-Such management of a system "in the large" integrates smoothly with
-the detailed placement done on individual tasks and memory regions
-using the sched_setaffinity, mbind and set_mempolicy system calls.
-
-The following rules apply to each cpuset:
-
- - Its CPUs and Memory Nodes must be a subset of its parent's.
- - It can't be marked exclusive unless its parent is.
- - If its cpu or memory is exclusive, they may not overlap any sibling.
-
-These rules, and the natural hierarchy of cpusets, enable efficient
-enforcement of the exclusive guarantee, without having to scan all
-cpusets every time any of them change to ensure nothing overlaps an
-exclusive cpuset. Also, the use of a Linux virtual file system (vfs)
-to represent the cpuset hierarchy provides for a familiar permission
-and name space for cpusets, with a minimum of additional kernel code.
-
-The cpus and mems files in the root (top_cpuset) cpuset are
-read-only. The cpus file automatically tracks the value of
-cpu_online_map using a CPU hotplug notifier, and the mems file
-automatically tracks the value of node_states[N_HIGH_MEMORY]--i.e.,
-nodes with memory--using the cpuset_track_online_nodes() hook.
-
-
-1.4 What are exclusive cpusets ?
---------------------------------
-
-If a cpuset is cpu or mem exclusive, no other cpuset, other than
-a direct ancestor or descendant, may share any of the same CPUs or
-Memory Nodes.
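For illustration, marking a cpuset exclusive is just a write to the flag
files listed in section 1.3 above (a sketch; 'my_cpuset' is a hypothetical
child cpuset, and the writes are rejected unless the parent cpuset is
exclusive too):

 # cd /dev/cpuset/my_cpuset
 # /bin/echo 1 > cpu_exclusive	# no sibling may overlap our CPUs
 # /bin/echo 1 > mem_exclusive	# no sibling may overlap our Memory Nodes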
- -A cpuset that is mem_exclusive *or* mem_hardwall is "hardwalled", -i.e. it restricts kernel allocations for page, buffer and other data -commonly shared by the kernel across multiple users. All cpusets, -whether hardwalled or not, restrict allocations of memory for user -space. This enables configuring a system so that several independent -jobs can share common kernel data, such as file system pages, while -isolating each job's user allocation in its own cpuset. To do this, -construct a large mem_exclusive cpuset to hold all the jobs, and -construct child, non-mem_exclusive cpusets for each individual job. -Only a small amount of typical kernel memory, such as requests from -interrupt handlers, is allowed to be taken outside even a -mem_exclusive cpuset. - - -1.5 What is memory_pressure ? ------------------------------ -The memory_pressure of a cpuset provides a simple per-cpuset metric -of the rate that the tasks in a cpuset are attempting to free up in -use memory on the nodes of the cpuset to satisfy additional memory -requests. - -This enables batch managers monitoring jobs running in dedicated -cpusets to efficiently detect what level of memory pressure that job -is causing. - -This is useful both on tightly managed systems running a wide mix of -submitted jobs, which may choose to terminate or re-prioritize jobs that -are trying to use more memory than allowed on the nodes assigned them, -and with tightly coupled, long running, massively parallel scientific -computing jobs that will dramatically fail to meet required performance -goals if they start to use more memory than allowed to them. - -This mechanism provides a very economical way for the batch manager -to monitor a cpuset for signs of memory pressure. It's up to the -batch manager or other user code to decide what to do about it and -take action. - -==> Unless this feature is enabled by writing "1" to the special file - /dev/cpuset/memory_pressure_enabled, the hook in the rebalance - code of __alloc_pages() for this metric reduces to simply noticing - that the cpuset_memory_pressure_enabled flag is zero. So only - systems that enable this feature will compute the metric. - -Why a per-cpuset, running average: - - Because this meter is per-cpuset, rather than per-task or mm, - the system load imposed by a batch scheduler monitoring this - metric is sharply reduced on large systems, because a scan of - the tasklist can be avoided on each set of queries. - - Because this meter is a running average, instead of an accumulating - counter, a batch scheduler can detect memory pressure with a - single read, instead of having to read and accumulate results - for a period of time. - - Because this meter is per-cpuset rather than per-task or mm, - the batch scheduler can obtain the key information, memory - pressure in a cpuset, with a single read, rather than having to - query and accumulate results over all the (dynamically changing) - set of tasks in the cpuset. - -A per-cpuset simple digital filter (requires a spinlock and 3 words -of data per-cpuset) is kept, and updated by any task attached to that -cpuset, if it enters the synchronous (direct) page reclaim code. - -A per-cpuset file provides an integer number representing the recent -(half-life of 10 seconds) rate of direct page reclaims caused by -the tasks in the cpuset, in units of reclaims attempted per second, -times 1000. - - -1.6 What is memory spread ? 
---------------------------- -There are two boolean flag files per cpuset that control where the -kernel allocates pages for the file system buffers and related in -kernel data structures. They are called 'memory_spread_page' and -'memory_spread_slab'. - -If the per-cpuset boolean flag file 'memory_spread_page' is set, then -the kernel will spread the file system buffers (page cache) evenly -over all the nodes that the faulting task is allowed to use, instead -of preferring to put those pages on the node where the task is running. - -If the per-cpuset boolean flag file 'memory_spread_slab' is set, -then the kernel will spread some file system related slab caches, -such as for inodes and dentries evenly over all the nodes that the -faulting task is allowed to use, instead of preferring to put those -pages on the node where the task is running. - -The setting of these flags does not affect anonymous data segment or -stack segment pages of a task. - -By default, both kinds of memory spreading are off, and memory -pages are allocated on the node local to where the task is running, -except perhaps as modified by the tasks NUMA mempolicy or cpuset -configuration, so long as sufficient free memory pages are available. - -When new cpusets are created, they inherit the memory spread settings -of their parent. - -Setting memory spreading causes allocations for the affected page -or slab caches to ignore the tasks NUMA mempolicy and be spread -instead. Tasks using mbind() or set_mempolicy() calls to set NUMA -mempolicies will not notice any change in these calls as a result of -their containing tasks memory spread settings. If memory spreading -is turned off, then the currently specified NUMA mempolicy once again -applies to memory page allocations. - -Both 'memory_spread_page' and 'memory_spread_slab' are boolean flag -files. By default they contain "0", meaning that the feature is off -for that cpuset. If a "1" is written to that file, then that turns -the named feature on. - -The implementation is simple. - -Setting the flag 'memory_spread_page' turns on a per-process flag -PF_SPREAD_PAGE for each task that is in that cpuset or subsequently -joins that cpuset. The page allocation calls for the page cache -is modified to perform an inline check for this PF_SPREAD_PAGE task -flag, and if set, a call to a new routine cpuset_mem_spread_node() -returns the node to prefer for the allocation. - -Similarly, setting 'memory_spread_slab' turns on the flag -PF_SPREAD_SLAB, and appropriately marked slab caches will allocate -pages from the node returned by cpuset_mem_spread_node(). - -The cpuset_mem_spread_node() routine is also simple. It uses the -value of a per-task rotor cpuset_mem_spread_rotor to select the next -node in the current tasks mems_allowed to prefer for the allocation. - -This memory placement policy is also known (in other contexts) as -round-robin or interleave. - -This policy can provide substantial improvements for jobs that need -to place thread local data on the corresponding node, but that need -to access large file system data sets that need to be spread across -the several nodes in the jobs cpuset in order to fit. Without this -policy, especially for jobs that might have one thread reading in the -data set, the memory allocation across the nodes in the jobs cpuset -can become very uneven. - -1.7 What is sched_load_balance ? --------------------------------- - -The kernel scheduler (kernel/sched.c) automatically load balances -tasks. 
If one CPU is underutilized, kernel code running on that -CPU will look for tasks on other more overloaded CPUs and move those -tasks to itself, within the constraints of such placement mechanisms -as cpusets and sched_setaffinity. - -The algorithmic cost of load balancing and its impact on key shared -kernel data structures such as the task list increases more than -linearly with the number of CPUs being balanced. So the scheduler -has support to partition the systems CPUs into a number of sched -domains such that it only load balances within each sched domain. -Each sched domain covers some subset of the CPUs in the system; -no two sched domains overlap; some CPUs might not be in any sched -domain and hence won't be load balanced. - -Put simply, it costs less to balance between two smaller sched domains -than one big one, but doing so means that overloads in one of the -two domains won't be load balanced to the other one. - -By default, there is one sched domain covering all CPUs, except those -marked isolated using the kernel boot time "isolcpus=" argument. - -This default load balancing across all CPUs is not well suited for -the following two situations: - 1) On large systems, load balancing across many CPUs is expensive. - If the system is managed using cpusets to place independent jobs - on separate sets of CPUs, full load balancing is unnecessary. - 2) Systems supporting realtime on some CPUs need to minimize - system overhead on those CPUs, including avoiding task load - balancing if that is not needed. - -When the per-cpuset flag "sched_load_balance" is enabled (the default -setting), it requests that all the CPUs in that cpusets allowed 'cpus' -be contained in a single sched domain, ensuring that load balancing -can move a task (not otherwised pinned, as by sched_setaffinity) -from any CPU in that cpuset to any other. - -When the per-cpuset flag "sched_load_balance" is disabled, then the -scheduler will avoid load balancing across the CPUs in that cpuset, ---except-- in so far as is necessary because some overlapping cpuset -has "sched_load_balance" enabled. - -So, for example, if the top cpuset has the flag "sched_load_balance" -enabled, then the scheduler will have one sched domain covering all -CPUs, and the setting of the "sched_load_balance" flag in any other -cpusets won't matter, as we're already fully load balancing. - -Therefore in the above two situations, the top cpuset flag -"sched_load_balance" should be disabled, and only some of the smaller, -child cpusets have this flag enabled. - -When doing this, you don't usually want to leave any unpinned tasks in -the top cpuset that might use non-trivial amounts of CPU, as such tasks -may be artificially constrained to some subset of CPUs, depending on -the particulars of this flag setting in descendent cpusets. Even if -such a task could use spare CPU cycles in some other CPUs, the kernel -scheduler might not consider the possibility of load balancing that -task to that underused CPU. - -Of course, tasks pinned to a particular CPU can be left in a cpuset -that disables "sched_load_balance" as those tasks aren't going anywhere -else anyway. - -There is an impedance mismatch here, between cpusets and sched domains. -Cpusets are hierarchical and nest. Sched domains are flat; they don't -overlap and each CPU is in at most one sched domain. - -It is necessary for sched domains to be flat because load balancing -across partially overlapping sets of CPUs would risk unstable dynamics -that would be beyond our understanding. 
So if each of two partially -overlapping cpusets enables the flag 'sched_load_balance', then we -form a single sched domain that is a superset of both. We won't move -a task to a CPU outside it cpuset, but the scheduler load balancing -code might waste some compute cycles considering that possibility. - -This mismatch is why there is not a simple one-to-one relation -between which cpusets have the flag "sched_load_balance" enabled, -and the sched domain configuration. If a cpuset enables the flag, it -will get balancing across all its CPUs, but if it disables the flag, -it will only be assured of no load balancing if no other overlapping -cpuset enables the flag. - -If two cpusets have partially overlapping 'cpus' allowed, and only -one of them has this flag enabled, then the other may find its -tasks only partially load balanced, just on the overlapping CPUs. -This is just the general case of the top_cpuset example given a few -paragraphs above. In the general case, as in the top cpuset case, -don't leave tasks that might use non-trivial amounts of CPU in -such partially load balanced cpusets, as they may be artificially -constrained to some subset of the CPUs allowed to them, for lack of -load balancing to the other CPUs. - -1.7.1 sched_load_balance implementation details. ------------------------------------------------- - -The per-cpuset flag 'sched_load_balance' defaults to enabled (contrary -to most cpuset flags.) When enabled for a cpuset, the kernel will -ensure that it can load balance across all the CPUs in that cpuset -(makes sure that all the CPUs in the cpus_allowed of that cpuset are -in the same sched domain.) - -If two overlapping cpusets both have 'sched_load_balance' enabled, -then they will be (must be) both in the same sched domain. - -If, as is the default, the top cpuset has 'sched_load_balance' enabled, -then by the above that means there is a single sched domain covering -the whole system, regardless of any other cpuset settings. - -The kernel commits to user space that it will avoid load balancing -where it can. It will pick as fine a granularity partition of sched -domains as it can while still providing load balancing for any set -of CPUs allowed to a cpuset having 'sched_load_balance' enabled. - -The internal kernel cpuset to scheduler interface passes from the -cpuset code to the scheduler code a partition of the load balanced -CPUs in the system. This partition is a set of subsets (represented -as an array of cpumask_t) of CPUs, pairwise disjoint, that cover all -the CPUs that must be load balanced. - -Whenever the 'sched_load_balance' flag changes, or CPUs come or go -from a cpuset with this flag enabled, or a cpuset with this flag -enabled is removed, the cpuset code builds a new such partition and -passes it to the scheduler sched domain setup code, to have the sched -domains rebuilt as necessary. - -This partition exactly defines what sched domains the scheduler should -setup - one sched domain for each element (cpumask_t) in the partition. - -The scheduler remembers the currently active sched domain partitions. -When the scheduler routine partition_sched_domains() is invoked from -the cpuset code to update these sched domains, it compares the new -partition requested with the current, and updates its sched domains, -removing the old and adding the new, for each change. - - -1.8 What is sched_relax_domain_level ? 
---------------------------------------
-
-In a sched domain, the scheduler migrates tasks in two ways: periodic load
-balancing on the tick, and at the time of certain scheduling events.
-
-When a task is woken up, the scheduler tries to move it to an idle CPU.
-For example, if task A running on CPU X activates another task B
-on the same CPU X, and if CPU Y is X's sibling and is idle,
-then the scheduler migrates task B to CPU Y so that task B can start on
-CPU Y without waiting for task A on CPU X.
-
-And if a CPU runs out of tasks in its runqueue, the CPU tries to pull
-extra tasks from other busy CPUs to help them before it goes idle.
-
-Of course it takes some search cost to find movable tasks and/or
-idle CPUs, so the scheduler might not search all CPUs in the domain
-every time. In fact, on some architectures, the search ranges on
-these events are limited to the same socket or node where the CPU is
-located, while the load balance on the tick searches all CPUs.
-
-For example, assume CPU Z is relatively far from CPU X. Even if CPU Z
-is idle while CPU X and its siblings are busy, the scheduler can't migrate
-the woken task B from X to Z since it is out of its search range.
-As a result, task B on CPU X needs to wait for task A or for the load
-balance on the next tick. For some applications in special situations,
-waiting 1 tick may be too long.
-
-The 'sched_relax_domain_level' file allows you to request changing
-this search range as you like. This file takes an int value which
-ideally indicates the size of the search range in levels, as follows;
-otherwise it holds the initial value -1, which indicates that the cpuset
-has no request.
-
-  -1  : no request. use system default or follow request of others.
-   0  : no search.
-   1  : search siblings (hyperthreads in a core).
-   2  : search cores in a package.
-   3  : search cpus in a node [= system wide on non-NUMA system]
- ( 4  : search nodes in a chunk of node [on NUMA system] )
- ( 5  : search system wide [on NUMA system] )
-
-The system default is architecture dependent. The system default
-can be changed using the relax_domain_level= boot parameter.
-
-This file is per-cpuset and affects the sched domain to which the cpuset
-belongs. Therefore if the flag 'sched_load_balance' of a cpuset
-is disabled, then 'sched_relax_domain_level' has no effect since
-there is no sched domain belonging to the cpuset.
-
-If multiple cpusets are overlapping and hence they form a single sched
-domain, the largest value among those is used. Be careful: if one
-requests 0 and the others are -1, then 0 is used.
-
-Note that modifying this file will have both good and bad effects,
-and whether it is acceptable or not depends on your situation.
-Don't modify this file if you are not sure.
-
-If your situation is:
- - The migration costs between each cpu can be assumed to be considerably
-   small (for you) due to your special application's behavior or
-   special hardware support for CPU cache etc.
- - The search cost doesn't have an impact (for you), or you can make
-   the search cost small enough by managing the cpusets to be compact etc.
- - Low latency is required even if it sacrifices cache hit rate etc.
-then increasing 'sched_relax_domain_level' would benefit you.
-
-
-1.9 How do I use cpusets ?
---------------------------

In order to minimize the impact of cpusets on critical kernel
code, such as the scheduler, and due to the fact that the kernel
does not support one task updating the memory placement of another
task directly, the impact on a task of changing its cpuset CPU
or Memory Node placement, or of changing to which cpuset a task
is attached, is subtle.

If a cpuset has its Memory Nodes modified, then for each task attached
to that cpuset, the next time that the kernel attempts to allocate
a page of memory for that task, the kernel will notice the change
in the task's cpuset, and update its per-task memory placement to
remain within the new cpuset's memory placement.  If the task was using
mempolicy MPOL_BIND, and the nodes to which it was bound overlap with
its new cpuset, then the task will continue to use whatever subset
of MPOL_BIND nodes are still allowed in the new cpuset.  If the task
was using MPOL_BIND and now none of its MPOL_BIND nodes are allowed
in the new cpuset, then the task will be essentially treated as if it
was MPOL_BIND bound to the new cpuset (even though its numa placement,
as queried by get_mempolicy(), doesn't change).  If a task is moved
from one cpuset to another, then the kernel will adjust the task's
memory placement, as above, the next time that the kernel attempts
to allocate a page of memory for that task.

If a cpuset has its 'cpus' modified, then each task in that cpuset
will have its allowed CPU placement changed immediately.  Similarly,
if a task's pid is written to a cpuset's 'tasks' file, in either its
current cpuset or another cpuset, then its allowed CPU placement is
changed immediately.  If such a task had been bound to some subset
of its cpuset using the sched_setaffinity() call, the task will be
allowed to run on any CPU allowed in its new cpuset, negating the
effect of the prior sched_setaffinity() call.

In summary, the memory placement of a task whose cpuset is changed is
updated by the kernel, on the next allocation of a page for that task,
but the processor placement is not updated, until that task's pid is
rewritten to the 'tasks' file of its cpuset.  This is done to avoid
impacting the scheduler code in the kernel with a check for changes
in a task's processor placement.

Normally, once a page is allocated (given a physical page
of main memory) then that page stays on whatever node it
was allocated, so long as it remains allocated, even if the
cpuset's memory placement policy 'mems' subsequently changes.
If the cpuset flag file 'memory_migrate' is set true, then when
tasks are attached to that cpuset, any pages that task had
allocated to it on nodes in its previous cpuset are migrated
to the task's new cpuset.  The relative placement of a page within
the cpuset is preserved during these migration operations if possible.
For example, if the page was on the second valid node of the prior
cpuset then the page will be placed on the second valid node of the
new cpuset.

Also if 'memory_migrate' is set true, then if that cpuset's
'mems' file is modified, pages allocated to tasks in that
cpuset, that were on nodes in the previous setting of 'mems',
will be moved to nodes in the new setting of 'mems'.
Pages that were not in the task's prior cpuset, or in the cpuset's
prior 'mems' setting, will not be moved.
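As a small illustration of the sched_setaffinity() interaction
described above, the following userspace sketch (not from the kernel
tree; the choice of CPU 2 is arbitrary) binds the calling task to one
CPU.  If this task's pid is later written to any cpuset 'tasks' file,
that binding is discarded and the task may again run on every CPU of
its cpuset.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        cpu_set_t mask;

        CPU_ZERO(&mask);
        CPU_SET(2, &mask);      /* run only on CPU 2 */

        /* Fails e.g. if CPU 2 is not in our cpuset. */
        if (sched_setaffinity(0, sizeof(mask), &mask) < 0) {
            perror("sched_setaffinity");
            return 1;
        }
        printf("pid %d bound to CPU 2\n", (int)getpid());
        return 0;
    }

There is an exception to the placement rules above.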
If hotplug functionality is used
to remove all the CPUs that are currently assigned to a cpuset,
then all the tasks in that cpuset will be moved to the nearest ancestor
with non-empty cpus.  But the moving of some (or all) tasks might fail
if the cpuset is bound to another cgroup subsystem which has some
restrictions on task attaching.  In this failing case, those tasks will
stay in the original cpuset, and the kernel will automatically update
their cpus_allowed to allow all online CPUs.  When memory hotplug
functionality for removing Memory Nodes is available, a similar exception
is expected to apply there as well.  In general, the kernel prefers to
violate cpuset placement, over starving a task that has had all
its allowed CPUs or Memory Nodes taken offline.

There is a second exception to the above.  GFP_ATOMIC requests are
kernel internal allocations that must be satisfied immediately.
The kernel may drop some request, in rare cases even panic, if a
GFP_ATOMIC allocation fails.  If the request cannot be satisfied within
the current task's cpuset, then we relax the cpuset, and look for
memory anywhere we can find it.  It's better to violate the cpuset
than stress the kernel.

To start a new job that is to be contained within a cpuset, the steps are:

 1) mkdir /dev/cpuset
 2) mount -t cgroup -ocpuset cpuset /dev/cpuset
 3) Create the new cpuset by doing mkdir's and write's (or echo's) in
    the /dev/cpuset virtual file system.
 4) Start a task that will be the "founding father" of the new job.
 5) Attach that task to the new cpuset by writing its pid to the
    /dev/cpuset tasks file for that cpuset.
 6) fork, exec or clone the job tasks from this founding father task.

For example, the following sequence of commands will set up a cpuset
named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
and then start a subshell 'sh' in that cpuset:

  mount -t cgroup -ocpuset cpuset /dev/cpuset
  cd /dev/cpuset
  mkdir Charlie
  cd Charlie
  /bin/echo 2-3 > cpus
  /bin/echo 1 > mems
  /bin/echo $$ > tasks
  sh
  # The subshell 'sh' is now running in cpuset Charlie
  # The next line should display '/Charlie'
  cat /proc/self/cpuset

In the future, a C library interface to cpusets will likely be
available.  For now, the only way to query or modify cpusets is
via the cpuset file system, using the various cd, mkdir, echo, cat,
rmdir commands from the shell, or their equivalent from C; a sketch
of such a C equivalent is shown below.

The sched_setaffinity calls can also be done at the shell prompt using
SGI's runon or Robert Love's taskset.  The mbind and set_mempolicy
calls can be done at the shell prompt using the numactl command
(part of Andi Kleen's numa package).
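As a rough C equivalent of the "Charlie" shell sequence above
(illustrative only; it assumes the cpuset file system is already
mounted at /dev/cpuset, and it does only basic error handling):

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Write a short string to a cpuset control file. */
    static int put(const char *path, const char *val)
    {
        int fd = open(path, O_WRONLY);

        if (fd < 0)
            return -1;
        if (write(fd, val, strlen(val)) < 0) {
            close(fd);
            return -1;
        }
        return close(fd);
    }

    int main(void)
    {
        char pid[32];

        if (mkdir("/dev/cpuset/Charlie", 0755) < 0 && errno != EEXIST) {
            perror("mkdir");
            return 1;
        }
        if (put("/dev/cpuset/Charlie/cpus", "2-3") < 0 ||
            put("/dev/cpuset/Charlie/mems", "1") < 0) {
            perror("configure Charlie");
            return 1;
        }
        /* Attach ourselves; the subshell below inherits the cpuset. */
        snprintf(pid, sizeof(pid), "%d", (int)getpid());
        if (put("/dev/cpuset/Charlie/tasks", pid) < 0) {
            perror("attach");
            return 1;
        }
        execl("/bin/sh", "sh", (char *)NULL);
        perror("execl");
        return 1;
    }


2. Usage Examples and Syntax
============================

2.1 Basic Usage
---------------

Creating, modifying and using cpusets can be done through the cpuset
virtual filesystem.

To mount it, type:
# mount -t cgroup -o cpuset cpuset /dev/cpuset

Then under /dev/cpuset you can find a tree that corresponds to the
tree of the cpusets in the system.  For instance, /dev/cpuset
is the cpuset that holds the whole system.

If you want to create a new cpuset under /dev/cpuset:
# cd /dev/cpuset
# mkdir my_cpuset

Now you want to do something with this cpuset.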
# cd my_cpuset

In this directory you can find several files:
# ls
cpu_exclusive   memory_migrate      mems                      tasks
cpus            memory_pressure     notify_on_release
mem_exclusive   memory_spread_page  sched_load_balance
mem_hardwall    memory_spread_slab  sched_relax_domain_level

Reading them will give you information about the state of this cpuset:
the CPUs and Memory Nodes it can use, the processes that are using
it, its properties.  By writing to these files you can manipulate
the cpuset.

Set some flags:
# /bin/echo 1 > cpu_exclusive

Add some cpus:
# /bin/echo 0-7 > cpus

Add some mems:
# /bin/echo 0-7 > mems

Now attach your shell to this cpuset:
# /bin/echo $$ > tasks

You can also create cpusets inside your cpuset by using mkdir in this
directory:
# mkdir my_sub_cs

To remove a cpuset, just use rmdir:
# rmdir my_sub_cs
This will fail if the cpuset is in use (has cpusets inside, or has
processes attached).

Note that for legacy reasons, the "cpuset" filesystem exists as a
wrapper around the cgroup filesystem.

The command

mount -t cpuset X /dev/cpuset

is equivalent to

mount -t cgroup -ocpuset X /dev/cpuset
echo "/sbin/cpuset_release_agent" > /dev/cpuset/release_agent

2.2 Adding/removing cpus
------------------------

This is the syntax to use when writing in the cpus or mems files
in cpuset directories:

# /bin/echo 1-4 > cpus      -> set cpus list to cpus 1,2,3,4
# /bin/echo 1,2,3,4 > cpus  -> set cpus list to cpus 1,2,3,4

2.3 Setting flags
-----------------

The syntax is very simple:

# /bin/echo 1 > cpu_exclusive  -> set flag 'cpu_exclusive'
# /bin/echo 0 > cpu_exclusive  -> unset flag 'cpu_exclusive'

2.4 Attaching processes
-----------------------

# /bin/echo PID > tasks

Note that it is PID, not PIDs.  You can only attach ONE task at a time.
If you have several tasks to attach, you have to do it one after another:

# /bin/echo PID1 > tasks
# /bin/echo PID2 > tasks
    ...
# /bin/echo PIDn > tasks


3. Questions
============

Q: What's up with this '/bin/echo'?
A: bash's builtin 'echo' command does not check calls to write() against
   errors.  If you use it in the cpuset file system, you won't be able
   to tell whether a command succeeded or failed.

Q: When I attach processes, only the first one on the line really gets
   attached!
A: We can only return one error code per call to write().  So you should
   also put only ONE pid per write.

4. Contact
==========

Web: http://www.bullopensource.org/cpuset

diff --git a/Documentation/scheduler/sched-design-CFS.txt b/Documentation/scheduler/sched-design-CFS.txt index 8398ca4ff4ed..6f33593e59e2 100644 --- a/Documentation/scheduler/sched-design-CFS.txt +++ b/Documentation/scheduler/sched-design-CFS.txt @@ -231,7 +231,7 @@ CPU bandwidth control purposes: This options needs CONFIG_CGROUPS to be defined, and lets the administrator create arbitrary groups of tasks, using the "cgroup" pseudo filesystem. See - Documentation/cgroups.txt for more information about this filesystem. + Documentation/cgroups/cgroups.txt for more information about this filesystem. Only one of these options to group tasks can be chosen and not both.
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h index dede0a2cfc45..4c5bcf6ca7e8 100644 --- a/include/linux/res_counter.h +++ b/include/linux/res_counter.h @@ -9,7 +9,7 @@ * * Author: Pavel Emelianov * - * See Documentation/controllers/resource_counter.txt for more + * See Documentation/cgroups/resource_counter.txt for more * info about what this counter is. */ diff --git a/init/Kconfig b/init/Kconfig index 56fd93c63c77..2af83825634e 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -323,8 +323,8 @@ config CGROUP_SCHED This option allows you to create arbitrary task groups using the "cgroup" pseudo filesystem and control the cpu bandwidth allocated to each such task group. - Refer to Documentation/cgroups.txt for more information - on "cgroup" pseudo filesystem. + Refer to Documentation/cgroups/cgroups.txt for more + information on "cgroup" pseudo filesystem. endchoice @@ -335,10 +335,9 @@ menuconfig CGROUPS use with process control subsystems such as Cpusets, CFS, memory controls or device isolation. See - - Documentation/cpusets.txt (Cpusets) - Documentation/scheduler/sched-design-CFS.txt (CFS) - - Documentation/cgroups/ (features for grouping, isolation) - - Documentation/controllers/ (features for resource control) + - Documentation/cgroups/ (features for grouping, isolation + and resource control) Say N if unsure. diff --git a/kernel/cpuset.c b/kernel/cpuset.c index 647c77a88fcb..a85678865c5e 100644 --- a/kernel/cpuset.c +++ b/kernel/cpuset.c @@ -568,7 +568,7 @@ update_domain_attr_tree(struct sched_domain_attr *dattr, struct cpuset *c) * load balancing domains (sched domains) as specified by that partial * partition. * - * See "What is sched_load_balance" in Documentation/cpusets.txt + * See "What is sched_load_balance" in Documentation/cgroups/cpusets.txt * for a background explanation of this. * * Does not return errors, on the theory that the callers of this -- cgit v1.2.3 From 65a67bd2644bef225ee318dde76016a4697218fa Mon Sep 17 00:00:00 2001 From: Marcus Meissner Date: Thu, 15 Jan 2009 13:51:00 -0800 Subject: Documentation/accounting/getdelays.c: fix endless loop When no option is passed to getdelays it just hangs, waiting for a reply which will never come. This patch prints usage() when no output marker is specified. Signed-off-by: Marcus Meissner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/accounting/getdelays.c | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'Documentation') diff --git a/Documentation/accounting/getdelays.c b/Documentation/accounting/getdelays.c index cc49400b4af8..7ea231172c85 100644 --- a/Documentation/accounting/getdelays.c +++ b/Documentation/accounting/getdelays.c @@ -392,6 +392,10 @@ int main(int argc, char *argv[]) goto err; } } + if (!maskset && !tid && !containerset) { + usage(); + goto err; + } do { int i; -- cgit v1.2.3 From 219beb291ba9275dd676578724103abed4cfbfe3 Mon Sep 17 00:00:00 2001 From: Pavel Machek Date: Thu, 15 Jan 2009 13:51:24 -0800 Subject: lis3: fix documentation to fit into 80 columns MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Fix lis3 documentation to fit into 80 columns. 
Signed-off-by: Pavel Machek Cc: Éric Piel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/hwmon/lis3lv02d | 30 +++++++++++++++++------------- 1 file changed, 17 insertions(+), 13 deletions(-) (limited to 'Documentation') diff --git a/Documentation/hwmon/lis3lv02d b/Documentation/hwmon/lis3lv02d index 65dfb0c0fd67..0fcfc4a7ccdc 100644 --- a/Documentation/hwmon/lis3lv02d +++ b/Documentation/hwmon/lis3lv02d @@ -13,18 +13,21 @@ Author: Description ----------- -This driver provides support for the accelerometer found in various HP laptops -sporting the feature officially called "HP Mobile Data Protection System 3D" or -"HP 3D DriveGuard". It detect automatically laptops with this sensor. Known models -(for now the HP 2133, nc6420, nc2510, nc8510, nc84x0, nw9440 and nx9420) will -have their axis automatically oriented on standard way (eg: you can directly -play neverball). The accelerometer data is readable via +This driver provides support for the accelerometer found in various HP +laptops sporting the feature officially called "HP Mobile Data +Protection System 3D" or "HP 3D DriveGuard". It detect automatically +laptops with this sensor. Known models (for now the HP 2133, nc6420, +nc2510, nc8510, nc84x0, nw9440 and nx9420) will have their axis +automatically oriented on standard way (eg: you can directly play +neverball). The accelerometer data is readable via /sys/devices/platform/lis3lv02d. Sysfs attributes under /sys/devices/platform/lis3lv02d/: position - 3D position that the accelerometer reports. Format: "(x,y,z)" -calibrate - read: values (x, y, z) that are used as the base for input class device operation. - write: forces the base to be recalibrated with the current position. +calibrate - read: values (x, y, z) that are used as the base for input + class device operation. + write: forces the base to be recalibrated with the current + position. rate - reports the sampling rate of the accelerometer device in HZ This driver also provides an absolute input class device, allowing @@ -39,11 +42,12 @@ the accelerometer are converted into a "standard" organisation of the axes * When the laptop is horizontal the position reported is about 0 for X and Y and a positive value for Z * If the left side is elevated, X increases (becomes positive) - * If the front side (where the touchpad is) is elevated, Y decreases (becomes negative) + * If the front side (where the touchpad is) is elevated, Y decreases + (becomes negative) * If the laptop is put upside-down, Z becomes negative -If your laptop model is not recognized (cf "dmesg"), you can send an email to the -authors to add it to the database. When reporting a new laptop, please include -the output of "dmidecode" plus the value of /sys/devices/platform/lis3lv02d/position -in these four cases. +If your laptop model is not recognized (cf "dmesg"), you can send an +email to the authors to add it to the database. When reporting a new +laptop, please include the output of "dmidecode" plus the value of +/sys/devices/platform/lis3lv02d/position in these four cases. -- cgit v1.2.3 From cd9f8e64c232eeb1247950007ee95de00b89ba7a Mon Sep 17 00:00:00 2001 From: Henrik Kretzschmar Date: Sun, 18 Jan 2009 10:40:24 +0100 Subject: sound: Remove removed OSS kernel parameters from doc Remove the already-removed OSS kernel parameters from kernel-parameters.txt: drop the kernel parameters for the OSS drivers of the es1371 (removed 10-2007/2.6.24) and cs4232 (removed 02-2008/2.6.25) chips from the kernel parameters documentation.
Signed-off-by: Henrik Kretzschmar Signed-off-by: Takashi Iwai --- Documentation/kernel-parameters.txt | 7 ------- 1 file changed, 7 deletions(-) (limited to 'Documentation') diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 8511d3532c27..d8362cf9909e 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -577,9 +577,6 @@ and is between 256 and 4096 characters. It is defined in the file a memory unit (amount[KMG]). See also Documentation/kdump/kdump.txt for a example. - cs4232= [HW,OSS] - Format: ,,,,, - cs89x0_dma= [HW,NET] Format: @@ -732,10 +729,6 @@ and is between 256 and 4096 characters. It is defined in the file Default value is 0. Value can be changed at runtime via /selinux/enforce. - es1371= [HW,OSS] - Format: ,[,[]] - See also header of sound/oss/es1371.c. - ether= [HW,NET] Ethernet cards parameters This option is obsoleted by the "netdev=" option, which has equivalent usage. See its documentation for details. -- cgit v1.2.3 From 32ed3f4640631ab7a4c0bc0f1463cf019d510341 Mon Sep 17 00:00:00 2001 From: Matthew Ranostay Date: Thu, 22 Jan 2009 20:53:29 -0500 Subject: ALSA: hda: Add STAC92HD83XXX_PWR_REF quirk Some revisions of the 92hd8xxx codecs do not support port power-down; using it causes capture streams, and at random also playback streams, to stop functioning entirely. Disabling it by default and adding an option to enable it manually fixes the issue on current and future revisions. Signed-off-by: Matthew Ranostay Signed-off-by: Takashi Iwai --- Documentation/sound/alsa/HD-Audio-Models.txt | 1 + sound/pci/hda/patch_sigmatel.c | 19 ++++++++++++------- 2 files changed, 13 insertions(+), 7 deletions(-) (limited to 'Documentation') diff --git a/Documentation/sound/alsa/HD-Audio-Models.txt b/Documentation/sound/alsa/HD-Audio-Models.txt index 64eb1100eec1..0f5d26bea80f 100644 --- a/Documentation/sound/alsa/HD-Audio-Models.txt +++ b/Documentation/sound/alsa/HD-Audio-Models.txt @@ -349,6 +349,7 @@ STAC92HD73* STAC92HD83* =========== ref Reference board + mic-ref Reference board with power managment for ports STAC9872 ======== diff --git a/sound/pci/hda/patch_sigmatel.c b/sound/pci/hda/patch_sigmatel.c index c553fdb2b149..3dd4eee70b7c 100644 --- a/sound/pci/hda/patch_sigmatel.c +++ b/sound/pci/hda/patch_sigmatel.c @@ -81,6 +81,7 @@ enum { enum { STAC_92HD83XXX_REF, + STAC_92HD83XXX_PWR_REF, STAC_92HD83XXX_MODELS }; @@ -1734,10 +1735,12 @@ static unsigned int ref92hd83xxx_pin_configs[14] = { static unsigned int *stac92hd83xxx_brd_tbl[STAC_92HD83XXX_MODELS] = { [STAC_92HD83XXX_REF] = ref92hd83xxx_pin_configs, + [STAC_92HD83XXX_PWR_REF] = ref92hd83xxx_pin_configs, }; static const char *stac92hd83xxx_models[STAC_92HD83XXX_MODELS] = { [STAC_92HD83XXX_REF] = "ref", + [STAC_92HD83XXX_PWR_REF] = "mic-ref", }; static struct snd_pci_quirk stac92hd83xxx_cfg_tbl[] = { @@ -4783,13 +4786,6 @@ static int patch_stac92hd83xxx(struct hda_codec *codec) AC_VERB_SET_CONNECT_SEL, num_dacs); spec->init = stac92hd83xxx_core_init; - switch (codec->vendor_id) { - case 0x111d7605: - break; - default: - spec->num_pwrs--; - } - spec->mixer = stac92hd83xxx_mixer; spec->num_pins = ARRAY_SIZE(stac92hd83xxx_pin_nids); spec->num_dmuxes = ARRAY_SIZE(stac92hd83xxx_dmux_nids); @@ -4815,6 +4811,15 @@ again: return err; } + switch (codec->vendor_id) { + case 0x111d7604: + case 0x111d7605: + if (spec->board_config == STAC_92HD83XXX_PWR_REF) + break; + spec->num_pwrs = 0; + break; + } + err = stac92xx_parse_auto_config(codec, 0x1d, 0);
if (!err) { if (spec->board_config < 0) { -- cgit v1.2.3 From e955281cd6afef7ad7ea11cae0ca71d78a7b2b2b Mon Sep 17 00:00:00 2001 From: Jesse Barnes Date: Mon, 26 Jan 2009 12:19:23 -0800 Subject: networking: document "nc" in addition to "netcat" in netconsole.txt It always annoyed me that the netconsole documentation didn't give me the correct command for my distro. Update it with a command line that actually works on my Fedora install. Signed-off-by: Jesse Barnes Signed-off-by: Randy Dunlap Signed-off-by: Matt Mackall Signed-off-by: David S. Miller --- Documentation/networking/netconsole.txt | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/networking/netconsole.txt b/Documentation/networking/netconsole.txt index 3c2f2b328638..8d022073e3ef 100644 --- a/Documentation/networking/netconsole.txt +++ b/Documentation/networking/netconsole.txt @@ -51,7 +51,8 @@ Built-in netconsole starts immediately after the TCP stack is initialized and attempts to bring up the supplied dev at the supplied address. -The remote host can run either 'netcat -u -l -p ' or syslogd. +The remote host can run either 'netcat -u -l -p ', +'nc -l -u ' or syslogd. Dynamic reconfiguration: ======================== -- cgit v1.2.3 From 096abd77038a2ff74efd194d074eadcde80fb97d Mon Sep 17 00:00:00 2001 From: James Lentini Date: Thu, 8 Jan 2009 13:13:26 -0500 Subject: update port number in NFS/RDMA documentation Update the NFS/RDMA documentation to use the new port number assigned by IANA. Signed-off-by: James Lentini Signed-off-by: J. Bruce Fields --- Documentation/filesystems/nfs-rdma.txt | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/filesystems/nfs-rdma.txt b/Documentation/filesystems/nfs-rdma.txt index 44bd766f2e5d..85eaeaddd27c 100644 --- a/Documentation/filesystems/nfs-rdma.txt +++ b/Documentation/filesystems/nfs-rdma.txt @@ -251,7 +251,7 @@ NFS/RDMA Setup Instruct the server to listen on the RDMA transport: - $ echo rdma 2050 > /proc/fs/nfsd/portlist + $ echo rdma 20049 > /proc/fs/nfsd/portlist - On the client system @@ -263,7 +263,7 @@ NFS/RDMA Setup Regardless of how the client was built (module or built-in), use this command to mount the NFS/RDMA server: - $ mount -o rdma,port=2050 :/ /mnt + $ mount -o rdma,port=20049 :/ /mnt To verify that the mount is using RDMA, run "cat /proc/mounts" and check the "proto" field for the given mount. -- cgit v1.2.3 From 720893fd5fb6de1f752f816a89e630f08ae8b20a Mon Sep 17 00:00:00 2001 From: Tsugikazu Shibata Date: Fri, 23 Jan 2009 09:59:50 +0900 Subject: Sync patch for jp_JP/stable_kernel_rules.txt Updated jp_JP/stable_kernel_rules.txt due to changes in the main version of the file. Also, this patch is already reviewed by Japanese translation community called JF. Signed-off-by: Tsugikazu Shibata Signed-off-by: Greg Kroah-Hartman --- Documentation/ja_JP/stable_kernel_rules.txt | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-) (limited to 'Documentation') diff --git a/Documentation/ja_JP/stable_kernel_rules.txt b/Documentation/ja_JP/stable_kernel_rules.txt index b3ffe870de33..14265837c4ce 100644 --- a/Documentation/ja_JP/stable_kernel_rules.txt +++ b/Documentation/ja_JP/stable_kernel_rules.txt @@ -12,11 +12,11 @@ file at first. 
================================== これは、 -linux-2.6.24/Documentation/stable_kernel_rules.txt +linux-2.6.29/Documentation/stable_kernel_rules.txt の和訳です。 翻訳団体: JF プロジェクト < http://www.linux.or.jp/JF/ > -翻訳日: 2007/12/30 +翻訳日: 2009/1/14 翻訳者: Tsugikazu Shibata 校正者: 武井伸光さん、 かねこさん (Seiji Kaneko) @@ -38,12 +38,15 @@ linux-2.6.24/Documentation/stable_kernel_rules.txt - ビルドエラー(CONFIG_BROKENになっているものを除く), oops, ハング、デー タ破壊、現実のセキュリティ問題、その他 "ああ、これはダメだね"という ようなものを修正しなければならない。短く言えば、重大な問題。 + - 新しい device ID とクオークも受け入れられる。 - どのように競合状態が発生するかの説明も一緒に書かれていない限り、 "理論的には競合状態になる"ようなものは不可。 - いかなる些細な修正も含めることはできない。(スペルの修正、空白のクリー ンアップなど) - - 対応するサブシステムメンテナが受け入れたものでなければならない。 - Documentation/SubmittingPatches の規則に従ったものでなければならない。 + - パッチ自体か同等の修正が Linus のツリーに既に存在しなければならない。 +  Linus のツリーでのコミットID を -stable へのパッチ投稿の際に引用す + ること。 -stable ツリーにパッチを送付する手続き- @@ -52,8 +55,10 @@ linux-2.6.24/Documentation/stable_kernel_rules.txt - 送信者はパッチがキューに受け付けられた際には ACK を、却下された場合 には NAK を受け取る。この反応は開発者たちのスケジュールによって、数 日かかる場合がある。 - - もし受け取られたら、パッチは他の開発者たちのレビューのために - -stable キューに追加される。 + - もし受け取られたら、パッチは他の開発者たちと関連するサブシステムの + メンテナーによるレビューのために -stable キューに追加される。 + - パッチに stable@kernel.org のアドレスが付加されているときには、それ + が Linus のツリーに入る時に自動的に stable チームに email される。 - セキュリティパッチはこのエイリアス (stable@kernel.org) に送られるべ きではなく、代わりに security@kernel.org のアドレスに送られる。 -- cgit v1.2.3 From 6a1b699678c8c0d45f88a37b32358a9e82bef6bb Mon Sep 17 00:00:00 2001 From: "Hans J. Koch" Date: Wed, 7 Jan 2009 00:12:37 +0100 Subject: UIO: Add missing documentation of features added recently MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The following features were added to the UIO framework in the near past: * Generic drivers for platform devices (uio_pdrv, uio_pdrv_genirq) * an "offset" sysfs attribute for memory mappings Unfortunately, all this went in without documentation (won't happen again...) This patch updates UIO documentation. Signed-off-by: Hans J. Koch Acked-by: Uwe Kleine-König Signed-off-by: Greg Kroah-Hartman --- Documentation/DocBook/uio-howto.tmpl | 88 ++++++++++++++++++++++++++++++++++++ 1 file changed, 88 insertions(+) (limited to 'Documentation') diff --git a/Documentation/DocBook/uio-howto.tmpl b/Documentation/DocBook/uio-howto.tmpl index b787e4721c90..52e1b79ce0e6 100644 --- a/Documentation/DocBook/uio-howto.tmpl +++ b/Documentation/DocBook/uio-howto.tmpl @@ -41,6 +41,12 @@ GPL version 2. + + 0.7 + 2008-12-23 + hjk + Added generic platform drivers and offset attribute. + 0.6 2008-12-05 @@ -312,6 +318,16 @@ interested in translating it, please email me pointed to by addr. + + + offset: The offset, in bytes, that has to be + added to the pointer returned by mmap() to get + to the actual device memory. This is important if the device's memory + is not page aligned. Remember that pointers returned by + mmap() are always page aligned, so it is good + style to always add this offset. + + @@ -594,6 +610,78 @@ framework to set up sysfs files for this region. Simply leave it alone. + +Using uio_pdrv for platform devices + + In many cases, UIO drivers for platform devices can be handled in a + generic way. In the same place where you define your + struct platform_device, you simply also implement + your interrupt handler and fill your + struct uio_info. A pointer to this + struct uio_info is then used as + platform_data for your platform device. + + + You also need to set up an array of struct resource + containing addresses and sizes of your memory mappings. 
This + information is passed to the driver using the + .resource and .num_resources + elements of struct platform_device. + + + You now have to set the .name element of + struct platform_device to + "uio_pdrv" to use the generic UIO platform device + driver. This driver will fill the mem[] array + according to the resources given, and register the device. + + + The advantage of this approach is that you only have to edit a file + you need to edit anyway. You do not have to create an extra driver. + + + + +Using uio_pdrv_genirq for platform devices + + Especially in embedded devices, you frequently find chips where the + irq pin is tied to its own dedicated interrupt line. In such cases, + where you can be really sure the interrupt is not shared, we can take + the concept of uio_pdrv one step further and use a + generic interrupt handler. That's what + uio_pdrv_genirq does. + + + The setup for this driver is the same as described above for + uio_pdrv, except that you do not implement an + interrupt handler. The .handler element of + struct uio_info must remain + NULL. The .irq_flags element + must not contain IRQF_SHARED. + + + You will set the .name element of + struct platform_device to + "uio_pdrv_genirq" to use this driver. + + + The generic interrupt handler of uio_pdrv_genirq + will simply disable the interrupt line using + disable_irq_nosync(). After doing its work, + userspace can reenable the interrupt by writing 0x00000001 to the UIO + device file. The driver already implements an + irq_control() to make this possible, you must not + implement your own. + + + Using uio_pdrv_genirq not only saves a few lines of + interrupt handler code. You also do not need to know anything about + the chip's internal registers to create the kernel part of the driver. + All you need to know is the irq number of the pin the chip is + connected to. + + + -- cgit v1.2.3 From 58092d1e0a896eb1d9163d58f93df7ed704fa8e1 Mon Sep 17 00:00:00 2001 From: Stephen Hemminger Date: Thu, 29 Jan 2009 16:16:31 -0800 Subject: net: update documentation ip aliases This documentation is old. Add a short note to describe why aliases are no long necessary, and remove the old contact/edit info. Signed-off-by: Stephen Hemminger Signed-off-by: David S. Miller --- Documentation/networking/alias.txt | 25 ++++++------------------- 1 file changed, 6 insertions(+), 19 deletions(-) (limited to 'Documentation') diff --git a/Documentation/networking/alias.txt b/Documentation/networking/alias.txt index cd12c2ff518a..85046f53fcfc 100644 --- a/Documentation/networking/alias.txt +++ b/Documentation/networking/alias.txt @@ -2,13 +2,13 @@ IP-Aliasing: ============ -IP-aliases are additional IP-addresses/masks hooked up to a base -interface by adding a colon and a string when running ifconfig. -This string is usually numeric, but this is not a must. - -IP-Aliases are avail if CONFIG_INET (`standard' IPv4 networking) -is configured in the kernel. +IP-aliases are an obsolete way to manage multiple IP-addresses/masks +per interface. Newer tools such as iproute2 support multiple +address/prefixes per interface, but aliases are still supported +for backwards compatibility. +An alias is formed by adding a colon and a string when running ifconfig. +This string is usually numeric, but this is not a must. o Alias creation. Alias creation is done by 'magic' interface naming: eg. to create a @@ -38,16 +38,3 @@ o Relationship with main device If the base device is shut down the added aliases will be deleted too. 
- - -Contact ------- -Please finger or e-mail me: - Juan Jose Ciarlante - -Updated by Erik Schoenfelder - -; local variables: -; mode: indented-text -; mode: auto-fill -; end: -- cgit v1.2.3 From b44d49ab0954accefba4c71274ab58abe1c25c52 Mon Sep 17 00:00:00 2001 From: Tim 'mithro' Ansell Date: Thu, 22 Jan 2009 15:06:41 +1100 Subject: lguest: disable the FORTIFY for lguest. This makes all the warnings go away when compiling lguest on Ubuntu Intrepid or greater. Signed-off-by: Timothy R Ansell Signed-off-by: Rusty Russell --- Documentation/lguest/Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/lguest/Makefile b/Documentation/lguest/Makefile index 725eef81cd48..1f4f9e888bd1 100644 --- a/Documentation/lguest/Makefile +++ b/Documentation/lguest/Makefile @@ -1,5 +1,5 @@ # This creates the demonstration utility "lguest" which runs a Linux guest. -CFLAGS:=-Wall -Wmissing-declarations -Wmissing-prototypes -O3 -I../../include -I../../arch/x86/include +CFLAGS:=-Wall -Wmissing-declarations -Wmissing-prototypes -O3 -I../../include -I../../arch/x86/include -U_FORTIFY_SOURCE LDLIBS:=-lz all: lguest -- cgit v1.2.3 From 9e9e3cbc62da43c66e894d5a61fa08b427e25202 Mon Sep 17 00:00:00 2001 From: Evgeniy Polyakov Date: Thu, 29 Jan 2009 14:25:09 -0800 Subject: mm: OOM documentation update Signed-off-by: Evgeniy Polyakov Acked-by: David Rientjes Cc: Randy Dunlap Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/filesystems/proc.txt | 28 ++++++++++++++++++++++++++++ 1 file changed, 28 insertions(+) (limited to 'Documentation') diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index bbebc3a43ac0..a87be42f8211 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -2027,6 +2027,34 @@ increase the likelihood of this process being killed by the oom-killer. Valid values are in the range -16 to +15, plus the special value -17, which disables oom-killing altogether for this process. +The process to be killed in an out-of-memory situation is selected among all others +based on its badness score. This value equals the original memory size of the process +and is then updated according to its CPU time (utime + stime) and the +run time (uptime - start time). The longer it runs the smaller is the score. +Badness score is divided by the square root of the CPU time and then by +the double square root of the run time. + +Swapped out tasks are killed first. Half of each child's memory size is added to +the parent's score if they do not share the same memory. Thus forking servers +are the prime candidates to be killed. Having only one 'hungry' child will make +parent less preferable than the child. + +/proc//oom_score shows process' current badness score. + +The following heuristics are then applied: + * if the task was reniced, its score doubles + * superuser or direct hardware access tasks (CAP_SYS_ADMIN, CAP_SYS_RESOURCE + or CAP_SYS_RAWIO) have their score divided by 4 + * if oom condition happened in one cpuset and checked task does not belong + to it, its score is divided by 8 + * the resulting score is multiplied by two to the power of oom_adj, i.e. + points <<= oom_adj when it is positive and + points >>= -(oom_adj) otherwise + +The task with the highest badness score is then selected and its children +are killed, process itself will be killed in an OOM situation when it does +not have children or some of them disabled oom like described above.
+ 2.13 /proc//oom_score - Display current oom-killer score ------------------------------------------------------------- -- cgit v1.2.3 From 8d50d369d1b3ccc4d96ce01f62146173ee7064ca Mon Sep 17 00:00:00 2001 From: KAMEZAWA Hiroyuki Date: Thu, 29 Jan 2009 14:25:14 -0800 Subject: memcg: update document to mention that swapoff should be tested Considering the recently found problem "memcg: fix refcnt handling at swapoff", it's better to mention swapoff behavior in the memcg_test document. Signed-off-by: KAMEZAWA Hiroyuki Acked-by: Balbir Singh Cc: Li Zefan Cc: Daisuke Nishimura Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/cgroups/memcg_test.txt | 24 ++++++++++++++++++++++-- 1 file changed, 22 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/cgroups/memcg_test.txt b/Documentation/cgroups/memcg_test.txt index 19533f93b7a2..523a9c16c400 100644 --- a/Documentation/cgroups/memcg_test.txt +++ b/Documentation/cgroups/memcg_test.txt @@ -1,6 +1,6 @@ Memory Resource Controller(Memcg) Implementation Memo. -Last Updated: 2008/12/15 -Base Kernel Version: based on 2.6.28-rc8-mm. +Last Updated: 2009/1/19 +Base Kernel Version: based on 2.6.29-rc2. Because VM is getting complex (one of reasons is memcg...), memcg's behavior is complex. This is a document for memcg's internal behavior. @@ -340,3 +340,23 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y. # mount -t cgroup none /cgroup -t cpuset,memory,cpu,devices and do task move, mkdir, rmdir etc...under this. + + 9.7 swapoff. + Besides management of swap is one of complicated parts of memcg, + call path of swap-in at swapoff is not same as usual swap-in path.. + It's worth to be tested explicitly. + + For example, test like following is good. + (Shell-A) + # mount -t cgroup none /cgroup -t memory + # mkdir /cgroup/test + # echo 40M > /cgroup/test/memory.limit_in_bytes + # echo 0 > /cgroup/test/tasks + Run malloc(100M) program under this. You'll see 60M of swaps. + (Shell-B) + # move all tasks in /cgroup/test to /cgroup + # /sbin/swapoff -a + # rmdir /test/cgroup + # kill malloc task. + + Of course, tmpfs v.s. swapoff test should be tested, too. -- cgit v1.2.3 From 5872fb94f85d2e4fdef94657bd14e1a492df9825 Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Thu, 29 Jan 2009 16:28:02 -0800 Subject: Documentation: move DMA-mapping.txt to Doc/PCI/ Move DMA-mapping.txt to Documentation/PCI/. DMA-mapping.txt was supposed to be moved from Documentation/ to Documentation/PCI/. The 00-INDEX files in those two directories were updated, along with a few other text files, but the file itself somehow escaped being moved, so move it and update more text files and source files with its new location. 
Signed-off-by: Randy Dunlap Acked-by: Greg Kroah-Hartman cc: Jesse Barnes Signed-off-by: Linus Torvalds --- Documentation/DMA-API.txt | 2 +- Documentation/IO-mapping.txt | 4 ++-- Documentation/block/biodoc.txt | 5 +++-- Documentation/usb/dma.txt | 11 ++++++----- arch/ia64/hp/common/sba_iommu.c | 12 ++++++------ arch/parisc/include/asm/dma-mapping.h | 2 +- arch/parisc/kernel/pci-dma.c | 2 +- arch/x86/include/asm/dma-mapping.h | 4 ++-- arch/x86/kernel/pci-gart_64.c | 2 +- drivers/parisc/sba_iommu.c | 18 +++++++++--------- drivers/staging/altpciechdma/altpciechdma.c | 4 ++-- include/media/videobuf-dma-sg.h | 2 +- 12 files changed, 35 insertions(+), 33 deletions(-) (limited to 'Documentation') diff --git a/Documentation/DMA-API.txt b/Documentation/DMA-API.txt index 52441694fe03..2a3fcc55e981 100644 --- a/Documentation/DMA-API.txt +++ b/Documentation/DMA-API.txt @@ -5,7 +5,7 @@ This document describes the DMA API. For a more gentle introduction phrased in terms of the pci_ equivalents (and actual examples) see -DMA-mapping.txt +Documentation/PCI/PCI-DMA-mapping.txt. This API is split into two pieces. Part I describes the API and the corresponding pci_ API. Part II describes the extensions to the API diff --git a/Documentation/IO-mapping.txt b/Documentation/IO-mapping.txt index 86edb61bdee6..78a440695e11 100644 --- a/Documentation/IO-mapping.txt +++ b/Documentation/IO-mapping.txt @@ -1,6 +1,6 @@ [ NOTE: The virt_to_bus() and bus_to_virt() functions have been - superseded by the functionality provided by the PCI DMA - interface (see Documentation/DMA-mapping.txt). They continue + superseded by the functionality provided by the PCI DMA interface + (see Documentation/PCI/PCI-DMA-mapping.txt). They continue to be documented below for historical purposes, but new code must not use them. --davidm 00/12/12 ] diff --git a/Documentation/block/biodoc.txt b/Documentation/block/biodoc.txt index 3c5434c83daf..5d2480d33b43 100644 --- a/Documentation/block/biodoc.txt +++ b/Documentation/block/biodoc.txt @@ -186,8 +186,9 @@ a virtual address mapping (unlike the earlier scheme of virtual address do not have a corresponding kernel virtual address space mapping) and low-memory pages. -Note: Please refer to DMA-mapping.txt for a discussion on PCI high mem DMA -aspects and mapping of scatter gather lists, and support for 64 bit PCI. +Note: Please refer to Documentation/PCI/PCI-DMA-mapping.txt for a discussion +on PCI high mem DMA aspects and mapping of scatter gather lists, and support +for 64 bit PCI. Special handling is required only for cases where i/o needs to happen on pages at physical memory addresses beyond what the device can support. In these diff --git a/Documentation/usb/dma.txt b/Documentation/usb/dma.txt index e8b50b7de9d9..cfdcd16e3abf 100644 --- a/Documentation/usb/dma.txt +++ b/Documentation/usb/dma.txt @@ -6,8 +6,9 @@ in the kernel usb programming guide (kerneldoc, from the source code). API OVERVIEW The big picture is that USB drivers can continue to ignore most DMA issues, -though they still must provide DMA-ready buffers (see DMA-mapping.txt). -That's how they've worked through the 2.4 (and earlier) kernels. +though they still must provide DMA-ready buffers (see +Documentation/PCI/PCI-DMA-mapping.txt). That's how they've worked through +the 2.4 (and earlier) kernels. OR: they can now be DMA-aware. @@ -62,8 +63,8 @@ and effects like cache-trashing can impose subtle penalties. force a consistent memory access ordering by using memory barriers. 
It's not using a streaming DMA mapping, so it's good for small transfers on systems where the I/O would otherwise thrash an IOMMU mapping. (See - Documentation/DMA-mapping.txt for definitions of "coherent" and "streaming" - DMA mappings.) + Documentation/PCI/PCI-DMA-mapping.txt for definitions of "coherent" and + "streaming" DMA mappings.) Asking for 1/Nth of a page (as well as asking for N pages) is reasonably space-efficient. @@ -93,7 +94,7 @@ WORKING WITH EXISTING BUFFERS Existing buffers aren't usable for DMA without first being mapped into the DMA address space of the device. However, most buffers passed to your driver can safely be used with such DMA mapping. (See the first section -of DMA-mapping.txt, titled "What memory is DMA-able?") +of Documentation/PCI/PCI-DMA-mapping.txt, titled "What memory is DMA-able?") - When you're using scatterlists, you can map everything at once. On some systems, this kicks in an IOMMU and turns the scatterlists into single diff --git a/arch/ia64/hp/common/sba_iommu.c b/arch/ia64/hp/common/sba_iommu.c index d98f0f4ff83f..6d5e6c5630e3 100644 --- a/arch/ia64/hp/common/sba_iommu.c +++ b/arch/ia64/hp/common/sba_iommu.c @@ -906,7 +906,7 @@ sba_mark_invalid(struct ioc *ioc, dma_addr_t iova, size_t byte_cnt) * @dir: R/W or both. * @attrs: optional dma attributes * - * See Documentation/DMA-mapping.txt + * See Documentation/PCI/PCI-DMA-mapping.txt */ dma_addr_t sba_map_single_attrs(struct device *dev, void *addr, size_t size, int dir, @@ -1024,7 +1024,7 @@ sba_mark_clean(struct ioc *ioc, dma_addr_t iova, size_t size) * @dir: R/W or both. * @attrs: optional dma attributes * - * See Documentation/DMA-mapping.txt + * See Documentation/PCI/PCI-DMA-mapping.txt */ void sba_unmap_single_attrs(struct device *dev, dma_addr_t iova, size_t size, int dir, struct dma_attrs *attrs) @@ -1102,7 +1102,7 @@ EXPORT_SYMBOL(sba_unmap_single_attrs); * @size: number of bytes mapped in driver buffer. * @dma_handle: IOVA of new buffer. * - * See Documentation/DMA-mapping.txt + * See Documentation/PCI/PCI-DMA-mapping.txt */ void * sba_alloc_coherent (struct device *dev, size_t size, dma_addr_t *dma_handle, gfp_t flags) @@ -1165,7 +1165,7 @@ sba_alloc_coherent (struct device *dev, size_t size, dma_addr_t *dma_handle, gfp * @vaddr: virtual address IOVA of "consistent" buffer. * @dma_handler: IO virtual address of "consistent" buffer. * - * See Documentation/DMA-mapping.txt + * See Documentation/PCI/PCI-DMA-mapping.txt */ void sba_free_coherent (struct device *dev, size_t size, void *vaddr, dma_addr_t dma_handle) { @@ -1420,7 +1420,7 @@ sba_coalesce_chunks(struct ioc *ioc, struct device *dev, * @dir: R/W or both. * @attrs: optional dma attributes * - * See Documentation/DMA-mapping.txt + * See Documentation/PCI/PCI-DMA-mapping.txt */ int sba_map_sg_attrs(struct device *dev, struct scatterlist *sglist, int nents, int dir, struct dma_attrs *attrs) @@ -1512,7 +1512,7 @@ EXPORT_SYMBOL(sba_map_sg_attrs); * @dir: R/W or both. 
* @attrs: optional dma attributes * - * See Documentation/DMA-mapping.txt + * See Documentation/PCI/PCI-DMA-mapping.txt */ void sba_unmap_sg_attrs(struct device *dev, struct scatterlist *sglist, int nents, int dir, struct dma_attrs *attrs) diff --git a/arch/parisc/include/asm/dma-mapping.h b/arch/parisc/include/asm/dma-mapping.h index 53af696f23d2..da6943380908 100644 --- a/arch/parisc/include/asm/dma-mapping.h +++ b/arch/parisc/include/asm/dma-mapping.h @@ -5,7 +5,7 @@ #include #include -/* See Documentation/DMA-mapping.txt */ +/* See Documentation/PCI/PCI-DMA-mapping.txt */ struct hppa_dma_ops { int (*dma_supported)(struct device *dev, u64 mask); void *(*alloc_consistent)(struct device *dev, size_t size, dma_addr_t *iova, gfp_t flag); diff --git a/arch/parisc/kernel/pci-dma.c b/arch/parisc/kernel/pci-dma.c index ccd61b9567a6..df47895db828 100644 --- a/arch/parisc/kernel/pci-dma.c +++ b/arch/parisc/kernel/pci-dma.c @@ -2,7 +2,7 @@ ** PARISC 1.1 Dynamic DMA mapping support. ** This implementation is for PA-RISC platforms that do not support ** I/O TLBs (aka DMA address translation hardware). -** See Documentation/DMA-mapping.txt for interface definitions. +** See Documentation/PCI/PCI-DMA-mapping.txt for interface definitions. ** ** (c) Copyright 1999,2000 Hewlett-Packard Company ** (c) Copyright 2000 Grant Grundler diff --git a/arch/x86/include/asm/dma-mapping.h b/arch/x86/include/asm/dma-mapping.h index 4035357f5b9d..132a134d12f2 100644 --- a/arch/x86/include/asm/dma-mapping.h +++ b/arch/x86/include/asm/dma-mapping.h @@ -2,8 +2,8 @@ #define _ASM_X86_DMA_MAPPING_H /* - * IOMMU interface. See Documentation/DMA-mapping.txt and DMA-API.txt for - * documentation. + * IOMMU interface. See Documentation/PCI/PCI-DMA-mapping.txt and + * Documentation/DMA-API.txt for documentation. */ #include diff --git a/arch/x86/kernel/pci-gart_64.c b/arch/x86/kernel/pci-gart_64.c index 00c2bcd41463..d5768b1af080 100644 --- a/arch/x86/kernel/pci-gart_64.c +++ b/arch/x86/kernel/pci-gart_64.c @@ -5,7 +5,7 @@ * This allows to use PCI devices that only support 32bit addresses on systems * with more than 4GB. * - * See Documentation/DMA-mapping.txt for the interface specification. + * See Documentation/PCI/PCI-DMA-mapping.txt for the interface specification. * * Copyright 2002 Andi Kleen, SuSE Labs. * Subject to the GNU General Public License v2 only. diff --git a/drivers/parisc/sba_iommu.c b/drivers/parisc/sba_iommu.c index 3fac8f81d59d..a70cf16ee1ad 100644 --- a/drivers/parisc/sba_iommu.c +++ b/drivers/parisc/sba_iommu.c @@ -668,7 +668,7 @@ sba_mark_invalid(struct ioc *ioc, dma_addr_t iova, size_t byte_cnt) * @dev: instance of PCI owned by the driver that's asking * @mask: number of address bits this PCI device can handle * - * See Documentation/DMA-mapping.txt + * See Documentation/PCI/PCI-DMA-mapping.txt */ static int sba_dma_supported( struct device *dev, u64 mask) { @@ -680,8 +680,8 @@ static int sba_dma_supported( struct device *dev, u64 mask) return(0); } - /* Documentation/DMA-mapping.txt tells drivers to try 64-bit first, - * then fall back to 32-bit if that fails. + /* Documentation/PCI/PCI-DMA-mapping.txt tells drivers to try 64-bit + * first, then fall back to 32-bit if that fails. * We are just "encouraging" 32-bit DMA masks here since we can * never allow IOMMU bypass unless we add special support for ZX1. */ @@ -706,7 +706,7 @@ static int sba_dma_supported( struct device *dev, u64 mask) * @size: number of bytes to map in driver buffer. * @direction: R/W or both. 
* - * See Documentation/DMA-mapping.txt + * See Documentation/PCI/PCI-DMA-mapping.txt */ static dma_addr_t sba_map_single(struct device *dev, void *addr, size_t size, @@ -785,7 +785,7 @@ sba_map_single(struct device *dev, void *addr, size_t size, * @size: number of bytes mapped in driver buffer. * @direction: R/W or both. * - * See Documentation/DMA-mapping.txt + * See Documentation/PCI/PCI-DMA-mapping.txt */ static void sba_unmap_single(struct device *dev, dma_addr_t iova, size_t size, @@ -861,7 +861,7 @@ sba_unmap_single(struct device *dev, dma_addr_t iova, size_t size, * @size: number of bytes mapped in driver buffer. * @dma_handle: IOVA of new buffer. * - * See Documentation/DMA-mapping.txt + * See Documentation/PCI/PCI-DMA-mapping.txt */ static void *sba_alloc_consistent(struct device *hwdev, size_t size, dma_addr_t *dma_handle, gfp_t gfp) @@ -892,7 +892,7 @@ static void *sba_alloc_consistent(struct device *hwdev, size_t size, * @vaddr: virtual address IOVA of "consistent" buffer. * @dma_handler: IO virtual address of "consistent" buffer. * - * See Documentation/DMA-mapping.txt + * See Documentation/PCI/PCI-DMA-mapping.txt */ static void sba_free_consistent(struct device *hwdev, size_t size, void *vaddr, @@ -927,7 +927,7 @@ int dump_run_sg = 0; * @nents: number of entries in list * @direction: R/W or both. * - * See Documentation/DMA-mapping.txt + * See Documentation/PCI/PCI-DMA-mapping.txt */ static int sba_map_sg(struct device *dev, struct scatterlist *sglist, int nents, @@ -1011,7 +1011,7 @@ sba_map_sg(struct device *dev, struct scatterlist *sglist, int nents, * @nents: number of entries in list * @direction: R/W or both. * - * See Documentation/DMA-mapping.txt + * See Documentation/PCI/PCI-DMA-mapping.txt */ static void sba_unmap_sg(struct device *dev, struct scatterlist *sglist, int nents, diff --git a/drivers/staging/altpciechdma/altpciechdma.c b/drivers/staging/altpciechdma/altpciechdma.c index 8e2b4ca0651d..f516140ca976 100644 --- a/drivers/staging/altpciechdma/altpciechdma.c +++ b/drivers/staging/altpciechdma/altpciechdma.c @@ -531,7 +531,7 @@ static int __devinit dma_test(struct ape_dev *ape, struct pci_dev *dev) goto fail; /* allocate and map coherently-cached memory for a DMA-able buffer */ - /* @see 2.6.26.2/Documentation/DMA-mapping.txt line 318 */ + /* @see Documentation/PCI/PCI-DMA-mapping.txt, near line 318 */ buffer_virt = (u8 *)pci_alloc_consistent(dev, PAGE_SIZE * 4, &buffer_bus); if (!buffer_virt) { printk(KERN_DEBUG "Could not allocate coherent DMA buffer.\n"); @@ -846,7 +846,7 @@ static int __devinit probe(struct pci_dev *dev, const struct pci_device_id *id) #if 1 // @todo For now, disable 64-bit, because I do not understand the implications (DAC!) /* query for DMA transfer */ - /* @see Documentation/DMA-mapping.txt */ + /* @see Documentation/PCI/PCI-DMA-mapping.txt */ if (!pci_set_dma_mask(dev, DMA_64BIT_MASK)) { pci_set_consistent_dma_mask(dev, DMA_64BIT_MASK); /* use 64-bit DMA */ diff --git a/include/media/videobuf-dma-sg.h b/include/media/videobuf-dma-sg.h index 90edd22d343c..dda47f0082e9 100644 --- a/include/media/videobuf-dma-sg.h +++ b/include/media/videobuf-dma-sg.h @@ -49,7 +49,7 @@ struct scatterlist* videobuf_pages_to_sg(struct page **pages, int nr_pages, * does memory allocation too using vmalloc_32(). * * videobuf_dma_*() - * see Documentation/DMA-mapping.txt, these functions to + * see Documentation/PCI/PCI-DMA-mapping.txt, these functions to * basically the same. 
The map function does also build a * scatterlist for the buffer (and unmap frees it ...) * -- cgit v1.2.3 From 0acbc6c651911dc9ffb4f59b34306bc1ccb751e5 Mon Sep 17 00:00:00 2001 From: Teemu Likonen Date: Thu, 29 Jan 2009 16:28:16 -0800 Subject: Documentation: update CodingStyle tips for Emacs users With the previous Emacs tips example the kernel style was made available for files in the kernel-tree only. This patch updates the tip to add a separate cc-mode indent style ("linux-tabs-only"). This makes it easy to switch between different indent styles and also makes the kernel style easily available for any filetype mode (c++, awk, ...) that is managed by the Emacs cc-mode. Signed-off-by: Teemu Likonen Signed-off-by: Randy Dunlap Signed-off-by: Linus Torvalds --- Documentation/CodingStyle | 15 +++++++++++---- 1 file changed, 11 insertions(+), 4 deletions(-) (limited to 'Documentation') diff --git a/Documentation/CodingStyle b/Documentation/CodingStyle index 1875e502f872..7b5762eded2c 100644 --- a/Documentation/CodingStyle +++ b/Documentation/CodingStyle @@ -483,6 +483,16 @@ values. To do the latter, you can stick the following in your .emacs file: (* (max steps 1) c-basic-offset))) +(add-hook 'c-mode-common-hook + (lambda () + ;; Add kernel style + (c-add-style + "linux-tabs-only" + '("linux" (c-offsets-alist + (arglist-cont-nonempty + c-lineup-gcc-asm-reg + c-lineup-arglist-tabs-only)))))) + (add-hook 'c-mode-hook (lambda () (let ((filename (buffer-file-name))) @@ -490,10 +500,7 @@ values. To do the latter, you can stick the following in your .emacs file: (when (and filename (string-match "~/src/linux-trees" filename)) (setq indent-tabs-mode t) - (c-set-style "linux") - (c-set-offset 'arglist-cont-nonempty - '(c-lineup-gcc-asm-reg - c-lineup-arglist-tabs-only)))))) + (c-set-style "linux-tabs-only"))))) This will make emacs go better with the kernel coding style for C files below ~/src/linux-trees. -- cgit v1.2.3 From 70221395ba980392ba98c1d78f6c9f77be03df9e Mon Sep 17 00:00:00 2001 From: Dan Carpenter Date: Thu, 29 Jan 2009 16:28:28 -0800 Subject: fix emacs indenting howto filename expansion I don't think emacs understands tilde expansion, so use "expand-file-name" to do that. Signed-off-by: Dan Carpenter Signed-off-by: Randy Dunlap Signed-off-by: Linus Torvalds --- Documentation/CodingStyle | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/CodingStyle b/Documentation/CodingStyle index 7b5762eded2c..72968cd5eaf3 100644 --- a/Documentation/CodingStyle +++ b/Documentation/CodingStyle @@ -498,7 +498,8 @@ values. To do the latter, you can stick the following in your .emacs file: (let ((filename (buffer-file-name))) ;; Enable kernel mode for the appropriate files (when (and filename - (string-match "~/src/linux-trees" filename)) + (string-match (expand-file-name "~/src/linux-trees") + filename)) (setq indent-tabs-mode t) (c-set-style "linux-tabs-only"))))) -- cgit v1.2.3 From 242f45da5b7bf63c50f1f18301750712e7885dd6 Mon Sep 17 00:00:00 2001 From: Bill Nottingham Date: Thu, 29 Jan 2009 16:28:40 -0800 Subject: Documentation/Changes: add required versions for new filesystems btrfs requires version 0.18 of its tools, and squashfs requires 4.0. ext3 should use and ext4 requires v1.41.4 of e2fsprogs. 
Signed-off-by: Bill Nottingham Signed-off-by: Randy Dunlap cc: Ted Tso Signed-off-by: Linus Torvalds --- Documentation/Changes | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/Changes b/Documentation/Changes index cb2b141b1c3e..b95082be4d5e 100644 --- a/Documentation/Changes +++ b/Documentation/Changes @@ -33,10 +33,12 @@ o Gnu make 3.79.1 # make --version o binutils 2.12 # ld -v o util-linux 2.10o # fdformat --version o module-init-tools 0.9.10 # depmod -V -o e2fsprogs 1.29 # tune2fs +o e2fsprogs 1.41.4 # e2fsck -V o jfsutils 1.1.3 # fsck.jfs -V o reiserfsprogs 3.6.3 # reiserfsck -V 2>&1|grep reiserfsprogs o xfsprogs 2.6.0 # xfs_db -V +o squashfs-tools 4.0 # mksquashfs -version +o btrfs-progs 0.18 # btrfsck o pcmciautils 004 # pccardctl -V o quota-tools 3.09 # quota -V o PPP 2.4.0 # pppd --version -- cgit v1.2.3 From 7598909e3ee2a08726276d6415b69dadb52d0d76 Mon Sep 17 00:00:00 2001 From: Nikanth Karthikesan Date: Tue, 27 Jan 2009 09:29:24 +0100 Subject: Mark mandatory elevator functions in the biodoc.txt biodoc.txt mentions that elevator functions marked with * are mandatory, but no function is marked with *. Mark the 3 functions which should be implemented by any io scheduler. Signed-off-by: Nikanth Karthikesan Signed-off-by: Jens Axboe --- Documentation/block/biodoc.txt | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'Documentation') diff --git a/Documentation/block/biodoc.txt b/Documentation/block/biodoc.txt index 5d2480d33b43..ecad6ee75705 100644 --- a/Documentation/block/biodoc.txt +++ b/Documentation/block/biodoc.txt @@ -954,14 +954,14 @@ elevator_allow_merge_fn called whenever the block layer determines results in some sort of conflict internally, this hook allows it to do that. -elevator_dispatch_fn fills the dispatch queue with ready requests. +elevator_dispatch_fn* fills the dispatch queue with ready requests. I/O schedulers are free to postpone requests by not filling the dispatch queue unless @force is non-zero. Once dispatched, I/O schedulers are not allowed to manipulate the requests - they belong to generic dispatch queue. -elevator_add_req_fn called to add a new request into the scheduler +elevator_add_req_fn* called to add a new request into the scheduler elevator_queue_empty_fn returns true if the merge queue is empty. Drivers shouldn't use this, but rather check @@ -991,7 +991,7 @@ elevator_activate_req_fn Called when device driver first sees a request. elevator_deactivate_req_fn Called when device driver decides to delay a request by requeueing it. -elevator_init_fn +elevator_init_fn* elevator_exit_fn Allocate and free any elevator specific storage for a queue. -- cgit v1.2.3
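For readers who want to see what the three mandatory elevator hooks of
this last patch look like in practice, here is a minimal illustrative
sketch (not part of any commit above), modeled loosely on the kernel's
own noop scheduler of this era; the hook names and exact signatures
vary between kernel versions, so treat this as an approximation.

    #include <linux/blkdev.h>
    #include <linux/elevator.h>
    #include <linux/module.h>
    #include <linux/slab.h>

    struct fifo_data {
        struct list_head queue;     /* pending requests, FIFO order */
    };

    /* elevator_add_req_fn*: accept a new request from the block layer. */
    static void fifo_add_request(struct request_queue *q, struct request *rq)
    {
        struct fifo_data *fd = q->elevator->elevator_data;

        list_add_tail(&rq->queuelist, &fd->queue);
    }

    /* elevator_dispatch_fn*: move one request to the dispatch queue. */
    static int fifo_dispatch(struct request_queue *q, int force)
    {
        struct fifo_data *fd = q->elevator->elevator_data;
        struct request *rq;

        if (list_empty(&fd->queue))
            return 0;
        rq = list_entry(fd->queue.next, struct request, queuelist);
        list_del_init(&rq->queuelist);
        elv_dispatch_sort(q, rq);
        return 1;
    }

    /* elevator_init_fn*: allocate per-queue scheduler state. */
    static void *fifo_init_queue(struct request_queue *q)
    {
        struct fifo_data *fd = kmalloc(sizeof(*fd), GFP_KERNEL);

        if (!fd)
            return NULL;
        INIT_LIST_HEAD(&fd->queue);
        return fd;
    }

    static void fifo_exit_queue(struct elevator_queue *e)
    {
        kfree(e->elevator_data);
    }

    static struct elevator_type elevator_fifo = {
        .ops = {
            .elevator_dispatch_fn   = fifo_dispatch,
            .elevator_add_req_fn    = fifo_add_request,
            .elevator_init_fn       = fifo_init_queue,
            .elevator_exit_fn       = fifo_exit_queue,
        },
        .elevator_name  = "fifo",
        .elevator_owner = THIS_MODULE,
    };

    static int __init fifo_init(void)
    {
        elv_register(&elevator_fifo);
        return 0;
    }
    module_init(fifo_init);
    MODULE_LICENSE("GPL");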