diff options
Diffstat (limited to 'Documentation')
71 files changed, 3936 insertions, 1136 deletions
diff --git a/Documentation/Changes b/Documentation/Changes index 5eaab0441d76..86b86399d61d 100644 --- a/Documentation/Changes +++ b/Documentation/Changes @@ -65,7 +65,7 @@ o isdn4k-utils 3.1pre1 # isdnctrl 2>&1|grep version o nfs-utils 1.0.5 # showmount --version o procps 3.2.0 # ps --version o oprofile 0.9 # oprofiled --version -o udev 058 # udevinfo -V +o udev 071 # udevinfo -V Kernel compilation ================== @@ -139,9 +139,14 @@ You'll probably want to upgrade. Ksymoops -------- -If the unthinkable happens and your kernel oopses, you'll need a 2.4 -version of ksymoops to decode the report; see REPORTING-BUGS in the -root of the Linux source for more information. +If the unthinkable happens and your kernel oopses, you may need the +ksymoops tool to decode it, but in most cases you don't. +In the 2.6 kernel it is generally preferred to build the kernel with +CONFIG_KALLSYMS so that it produces readable dumps that can be used as-is +(this also produces better output than ksymoops). +If for some reason your kernel is not build with CONFIG_KALLSYMS and +you have no way to rebuild and reproduce the Oops with that option, then +you can still decode that Oops with ksymoops. Module-Init-Tools ----------------- @@ -237,6 +242,12 @@ udev udev is a userspace application for populating /dev dynamically with only entries for devices actually present. udev replaces devfs. +FUSE +---- + +Needs libfuse 2.4.0 or later. Absolute minimum is 2.3.0 but mount +options 'direct_io' and 'kernel_cache' won't work. + Networking ========== @@ -390,6 +401,10 @@ udev ---- o <http://www.kernel.org/pub/linux/utils/kernel/hotplug/udev.html> +FUSE +---- +o <http://sourceforge.net/projects/fuse> + Networking ********** diff --git a/Documentation/CodingStyle b/Documentation/CodingStyle index 22e5f9036f3c..eb7db3c19227 100644 --- a/Documentation/CodingStyle +++ b/Documentation/CodingStyle @@ -410,7 +410,26 @@ Kernel messages do not have to be terminated with a period. Printing numbers in parentheses (%d) adds no value and should be avoided. - Chapter 13: References + Chapter 13: Allocating memory + +The kernel provides the following general purpose memory allocators: +kmalloc(), kzalloc(), kcalloc(), and vmalloc(). Please refer to the API +documentation for further information about them. + +The preferred form for passing a size of a struct is the following: + + p = kmalloc(sizeof(*p), ...); + +The alternative form where struct name is spelled out hurts readability and +introduces an opportunity for a bug when the pointer variable type is changed +but the corresponding sizeof that is passed to a memory allocator is not. + +Casting the return value which is a void pointer is redundant. The conversion +from void pointer to any other pointer type is guaranteed by the C programming +language. + + + Chapter 14: References The C Programming Language, Second Edition by Brian W. Kernighan and Dennis M. Ritchie. diff --git a/Documentation/DocBook/Makefile b/Documentation/DocBook/Makefile index fa3e29ad8a46..7018f5c6a447 100644 --- a/Documentation/DocBook/Makefile +++ b/Documentation/DocBook/Makefile @@ -10,7 +10,7 @@ DOCBOOKS := wanbook.xml z8530book.xml mcabook.xml videobook.xml \ kernel-hacking.xml kernel-locking.xml deviceiobook.xml \ procfs-guide.xml writing_usb_driver.xml \ sis900.xml kernel-api.xml journal-api.xml lsm.xml usb.xml \ - gadget.xml libata.xml mtdnand.xml librs.xml + gadget.xml libata.xml mtdnand.xml librs.xml rapidio.xml ### # The build process is as follows (targets): diff --git a/Documentation/DocBook/journal-api.tmpl b/Documentation/DocBook/journal-api.tmpl index 341aaa4ce481..2077f9a28c19 100644 --- a/Documentation/DocBook/journal-api.tmpl +++ b/Documentation/DocBook/journal-api.tmpl @@ -306,7 +306,7 @@ an example. </para> <sect1><title>Journal Level</title> !Efs/jbd/journal.c -!Efs/jbd/recovery.c +!Ifs/jbd/recovery.c </sect1> <sect1><title>Transasction Level</title> !Efs/jbd/transaction.c diff --git a/Documentation/DocBook/kernel-api.tmpl b/Documentation/DocBook/kernel-api.tmpl index d650ce36485f..a8316b1a3e3d 100644 --- a/Documentation/DocBook/kernel-api.tmpl +++ b/Documentation/DocBook/kernel-api.tmpl @@ -118,7 +118,7 @@ X!Ilib/string.c </sect1> <sect1><title>User Space Memory Access</title> !Iinclude/asm-i386/uaccess.h -!Iarch/i386/lib/usercopy.c +!Earch/i386/lib/usercopy.c </sect1> <sect1><title>More Memory Management Functions</title> !Iinclude/linux/rmap.h @@ -174,7 +174,6 @@ X!Ilib/string.c <title>The Linux VFS</title> <sect1><title>The Filesystem types</title> !Iinclude/linux/fs.h -!Einclude/linux/fs.h </sect1> <sect1><title>The Directory Cache</title> !Efs/dcache.c @@ -239,9 +238,9 @@ X!Ilib/string.c <title>Network device support</title> <sect1><title>Driver Support</title> !Enet/core/dev.c - </sect1> - <sect1><title>8390 Based Network Cards</title> -!Edrivers/net/8390.c +!Enet/ethernet/eth.c +!Einclude/linux/etherdevice.h +!Enet/core/wireless.c </sect1> <sect1><title>Synchronous PPP</title> !Edrivers/net/wan/syncppp.c @@ -266,7 +265,7 @@ X!Ekernel/module.c <chapter id="hardware"> <title>Hardware Interfaces</title> <sect1><title>Interrupt Handling</title> -!Ikernel/irq/manage.c +!Ekernel/irq/manage.c </sect1> <sect1><title>Resources Management</title> @@ -286,7 +285,9 @@ X!Edrivers/pci/search.c --> !Edrivers/pci/msi.c !Edrivers/pci/bus.c -!Edrivers/pci/hotplug.c +<!-- FIXME: Removed for now since no structured comments in source +X!Edrivers/pci/hotplug.c +--> !Edrivers/pci/probe.c !Edrivers/pci/rom.c </sect1> @@ -499,7 +500,7 @@ KAO --> !Edrivers/video/modedb.c </sect1> <sect1><title>Frame Buffer Macintosh Video Mode Database</title> -!Idrivers/video/macmodes.c +!Edrivers/video/macmodes.c </sect1> <sect1><title>Frame Buffer Fonts</title> <para> diff --git a/Documentation/DocBook/kernel-hacking.tmpl b/Documentation/DocBook/kernel-hacking.tmpl index 6367bba32d22..582032eea872 100644 --- a/Documentation/DocBook/kernel-hacking.tmpl +++ b/Documentation/DocBook/kernel-hacking.tmpl @@ -1105,7 +1105,7 @@ static struct block_device_operations opt_fops = { </listitem> <listitem> <para> - Function names as strings (__func__). + Function names as strings (__FUNCTION__). </para> </listitem> <listitem> diff --git a/Documentation/DocBook/libata.tmpl b/Documentation/DocBook/libata.tmpl index 375ae760dc1e..d260d92089ad 100644 --- a/Documentation/DocBook/libata.tmpl +++ b/Documentation/DocBook/libata.tmpl @@ -415,6 +415,362 @@ and other resources, etc. </sect1> </chapter> + <chapter id="libataEH"> + <title>Error handling</title> + + <para> + This chapter describes how errors are handled under libata. + Readers are advised to read SCSI EH + (Documentation/scsi/scsi_eh.txt) and ATA exceptions doc first. + </para> + + <sect1><title>Origins of commands</title> + <para> + In libata, a command is represented with struct ata_queued_cmd + or qc. qc's are preallocated during port initialization and + repetitively used for command executions. Currently only one + qc is allocated per port but yet-to-be-merged NCQ branch + allocates one for each tag and maps each qc to NCQ tag 1-to-1. + </para> + <para> + libata commands can originate from two sources - libata itself + and SCSI midlayer. libata internal commands are used for + initialization and error handling. All normal blk requests + and commands for SCSI emulation are passed as SCSI commands + through queuecommand callback of SCSI host template. + </para> + </sect1> + + <sect1><title>How commands are issued</title> + + <variablelist> + + <varlistentry><term>Internal commands</term> + <listitem> + <para> + First, qc is allocated and initialized using + ata_qc_new_init(). Although ata_qc_new_init() doesn't + implement any wait or retry mechanism when qc is not + available, internal commands are currently issued only during + initialization and error recovery, so no other command is + active and allocation is guaranteed to succeed. + </para> + <para> + Once allocated qc's taskfile is initialized for the command to + be executed. qc currently has two mechanisms to notify + completion. One is via qc->complete_fn() callback and the + other is completion qc->waiting. qc->complete_fn() callback + is the asynchronous path used by normal SCSI translated + commands and qc->waiting is the synchronous (issuer sleeps in + process context) path used by internal commands. + </para> + <para> + Once initialization is complete, host_set lock is acquired + and the qc is issued. + </para> + </listitem> + </varlistentry> + + <varlistentry><term>SCSI commands</term> + <listitem> + <para> + All libata drivers use ata_scsi_queuecmd() as + hostt->queuecommand callback. scmds can either be simulated + or translated. No qc is involved in processing a simulated + scmd. The result is computed right away and the scmd is + completed. + </para> + <para> + For a translated scmd, ata_qc_new_init() is invoked to + allocate a qc and the scmd is translated into the qc. SCSI + midlayer's completion notification function pointer is stored + into qc->scsidone. + </para> + <para> + qc->complete_fn() callback is used for completion + notification. ATA commands use ata_scsi_qc_complete() while + ATAPI commands use atapi_qc_complete(). Both functions end up + calling qc->scsidone to notify upper layer when the qc is + finished. After translation is completed, the qc is issued + with ata_qc_issue(). + </para> + <para> + Note that SCSI midlayer invokes hostt->queuecommand while + holding host_set lock, so all above occur while holding + host_set lock. + </para> + </listitem> + </varlistentry> + + </variablelist> + </sect1> + + <sect1><title>How commands are processed</title> + <para> + Depending on which protocol and which controller are used, + commands are processed differently. For the purpose of + discussion, a controller which uses taskfile interface and all + standard callbacks is assumed. + </para> + <para> + Currently 6 ATA command protocols are used. They can be + sorted into the following four categories according to how + they are processed. + </para> + + <variablelist> + <varlistentry><term>ATA NO DATA or DMA</term> + <listitem> + <para> + ATA_PROT_NODATA and ATA_PROT_DMA fall into this category. + These types of commands don't require any software + intervention once issued. Device will raise interrupt on + completion. + </para> + </listitem> + </varlistentry> + + <varlistentry><term>ATA PIO</term> + <listitem> + <para> + ATA_PROT_PIO is in this category. libata currently + implements PIO with polling. ATA_NIEN bit is set to turn + off interrupt and pio_task on ata_wq performs polling and + IO. + </para> + </listitem> + </varlistentry> + + <varlistentry><term>ATAPI NODATA or DMA</term> + <listitem> + <para> + ATA_PROT_ATAPI_NODATA and ATA_PROT_ATAPI_DMA are in this + category. packet_task is used to poll BSY bit after + issuing PACKET command. Once BSY is turned off by the + device, packet_task transfers CDB and hands off processing + to interrupt handler. + </para> + </listitem> + </varlistentry> + + <varlistentry><term>ATAPI PIO</term> + <listitem> + <para> + ATA_PROT_ATAPI is in this category. ATA_NIEN bit is set + and, as in ATAPI NODATA or DMA, packet_task submits cdb. + However, after submitting cdb, further processing (data + transfer) is handed off to pio_task. + </para> + </listitem> + </varlistentry> + </variablelist> + </sect1> + + <sect1><title>How commands are completed</title> + <para> + Once issued, all qc's are either completed with + ata_qc_complete() or time out. For commands which are handled + by interrupts, ata_host_intr() invokes ata_qc_complete(), and, + for PIO tasks, pio_task invokes ata_qc_complete(). In error + cases, packet_task may also complete commands. + </para> + <para> + ata_qc_complete() does the following. + </para> + + <orderedlist> + + <listitem> + <para> + DMA memory is unmapped. + </para> + </listitem> + + <listitem> + <para> + ATA_QCFLAG_ACTIVE is clared from qc->flags. + </para> + </listitem> + + <listitem> + <para> + qc->complete_fn() callback is invoked. If the return value of + the callback is not zero. Completion is short circuited and + ata_qc_complete() returns. + </para> + </listitem> + + <listitem> + <para> + __ata_qc_complete() is called, which does + <orderedlist> + + <listitem> + <para> + qc->flags is cleared to zero. + </para> + </listitem> + + <listitem> + <para> + ap->active_tag and qc->tag are poisoned. + </para> + </listitem> + + <listitem> + <para> + qc->waiting is claread & completed (in that order). + </para> + </listitem> + + <listitem> + <para> + qc is deallocated by clearing appropriate bit in ap->qactive. + </para> + </listitem> + + </orderedlist> + </para> + </listitem> + + </orderedlist> + + <para> + So, it basically notifies upper layer and deallocates qc. One + exception is short-circuit path in #3 which is used by + atapi_qc_complete(). + </para> + <para> + For all non-ATAPI commands, whether it fails or not, almost + the same code path is taken and very little error handling + takes place. A qc is completed with success status if it + succeeded, with failed status otherwise. + </para> + <para> + However, failed ATAPI commands require more handling as + REQUEST SENSE is needed to acquire sense data. If an ATAPI + command fails, ata_qc_complete() is invoked with error status, + which in turn invokes atapi_qc_complete() via + qc->complete_fn() callback. + </para> + <para> + This makes atapi_qc_complete() set scmd->result to + SAM_STAT_CHECK_CONDITION, complete the scmd and return 1. As + the sense data is empty but scmd->result is CHECK CONDITION, + SCSI midlayer will invoke EH for the scmd, and returning 1 + makes ata_qc_complete() to return without deallocating the qc. + This leads us to ata_scsi_error() with partially completed qc. + </para> + + </sect1> + + <sect1><title>ata_scsi_error()</title> + <para> + ata_scsi_error() is the current hostt->eh_strategy_handler() + for libata. As discussed above, this will be entered in two + cases - timeout and ATAPI error completion. This function + calls low level libata driver's eng_timeout() callback, the + standard callback for which is ata_eng_timeout(). It checks + if a qc is active and calls ata_qc_timeout() on the qc if so. + Actual error handling occurs in ata_qc_timeout(). + </para> + <para> + If EH is invoked for timeout, ata_qc_timeout() stops BMDMA and + completes the qc. Note that as we're currently in EH, we + cannot call scsi_done. As described in SCSI EH doc, a + recovered scmd should be either retried with + scsi_queue_insert() or finished with scsi_finish_command(). + Here, we override qc->scsidone with scsi_finish_command() and + calls ata_qc_complete(). + </para> + <para> + If EH is invoked due to a failed ATAPI qc, the qc here is + completed but not deallocated. The purpose of this + half-completion is to use the qc as place holder to make EH + code reach this place. This is a bit hackish, but it works. + </para> + <para> + Once control reaches here, the qc is deallocated by invoking + __ata_qc_complete() explicitly. Then, internal qc for REQUEST + SENSE is issued. Once sense data is acquired, scmd is + finished by directly invoking scsi_finish_command() on the + scmd. Note that as we already have completed and deallocated + the qc which was associated with the scmd, we don't need + to/cannot call ata_qc_complete() again. + </para> + + </sect1> + + <sect1><title>Problems with the current EH</title> + + <itemizedlist> + + <listitem> + <para> + Error representation is too crude. Currently any and all + error conditions are represented with ATA STATUS and ERROR + registers. Errors which aren't ATA device errors are treated + as ATA device errors by setting ATA_ERR bit. Better error + descriptor which can properly represent ATA and other + errors/exceptions is needed. + </para> + </listitem> + + <listitem> + <para> + When handling timeouts, no action is taken to make device + forget about the timed out command and ready for new commands. + </para> + </listitem> + + <listitem> + <para> + EH handling via ata_scsi_error() is not properly protected + from usual command processing. On EH entrance, the device is + not in quiescent state. Timed out commands may succeed or + fail any time. pio_task and atapi_task may still be running. + </para> + </listitem> + + <listitem> + <para> + Too weak error recovery. Devices / controllers causing HSM + mismatch errors and other errors quite often require reset to + return to known state. Also, advanced error handling is + necessary to support features like NCQ and hotplug. + </para> + </listitem> + + <listitem> + <para> + ATA errors are directly handled in the interrupt handler and + PIO errors in pio_task. This is problematic for advanced + error handling for the following reasons. + </para> + <para> + First, advanced error handling often requires context and + internal qc execution. + </para> + <para> + Second, even a simple failure (say, CRC error) needs + information gathering and could trigger complex error handling + (say, resetting & reconfiguring). Having multiple code + paths to gather information, enter EH and trigger actions + makes life painful. + </para> + <para> + Third, scattered EH code makes implementing low level drivers + difficult. Low level drivers override libata callbacks. If + EH is scattered over several places, each affected callbacks + should perform its part of error handling. This can be error + prone and painful. + </para> + </listitem> + + </itemizedlist> + </sect1> + </chapter> + <chapter id="libataExt"> <title>libata Library</title> !Edrivers/scsi/libata-core.c @@ -431,6 +787,722 @@ and other resources, etc. !Idrivers/scsi/libata-scsi.c </chapter> + <chapter id="ataExceptions"> + <title>ATA errors & exceptions</title> + + <para> + This chapter tries to identify what error/exception conditions exist + for ATA/ATAPI devices and describe how they should be handled in + implementation-neutral way. + </para> + + <para> + The term 'error' is used to describe conditions where either an + explicit error condition is reported from device or a command has + timed out. + </para> + + <para> + The term 'exception' is either used to describe exceptional + conditions which are not errors (say, power or hotplug events), or + to describe both errors and non-error exceptional conditions. Where + explicit distinction between error and exception is necessary, the + term 'non-error exception' is used. + </para> + + <sect1 id="excat"> + <title>Exception categories</title> + <para> + Exceptions are described primarily with respect to legacy + taskfile + bus master IDE interface. If a controller provides + other better mechanism for error reporting, mapping those into + categories described below shouldn't be difficult. + </para> + + <para> + In the following sections, two recovery actions - reset and + reconfiguring transport - are mentioned. These are described + further in <xref linkend="exrec"/>. + </para> + + <sect2 id="excatHSMviolation"> + <title>HSM violation</title> + <para> + This error is indicated when STATUS value doesn't match HSM + requirement during issuing or excution any ATA/ATAPI command. + </para> + + <itemizedlist> + <title>Examples</title> + + <listitem> + <para> + ATA_STATUS doesn't contain !BSY && DRDY && !DRQ while trying + to issue a command. + </para> + </listitem> + + <listitem> + <para> + !BSY && !DRQ during PIO data transfer. + </para> + </listitem> + + <listitem> + <para> + DRQ on command completion. + </para> + </listitem> + + <listitem> + <para> + !BSY && ERR after CDB tranfer starts but before the + last byte of CDB is transferred. ATA/ATAPI standard states + that "The device shall not terminate the PACKET command + with an error before the last byte of the command packet has + been written" in the error outputs description of PACKET + command and the state diagram doesn't include such + transitions. + </para> + </listitem> + + </itemizedlist> + + <para> + In these cases, HSM is violated and not much information + regarding the error can be acquired from STATUS or ERROR + register. IOW, this error can be anything - driver bug, + faulty device, controller and/or cable. + </para> + + <para> + As HSM is violated, reset is necessary to restore known state. + Reconfiguring transport for lower speed might be helpful too + as transmission errors sometimes cause this kind of errors. + </para> + </sect2> + + <sect2 id="excatDevErr"> + <title>ATA/ATAPI device error (non-NCQ / non-CHECK CONDITION)</title> + + <para> + These are errors detected and reported by ATA/ATAPI devices + indicating device problems. For this type of errors, STATUS + and ERROR register values are valid and describe error + condition. Note that some of ATA bus errors are detected by + ATA/ATAPI devices and reported using the same mechanism as + device errors. Those cases are described later in this + section. + </para> + + <para> + For ATA commands, this type of errors are indicated by !BSY + && ERR during command execution and on completion. + </para> + + <para>For ATAPI commands,</para> + + <itemizedlist> + + <listitem> + <para> + !BSY && ERR && ABRT right after issuing PACKET + indicates that PACKET command is not supported and falls in + this category. + </para> + </listitem> + + <listitem> + <para> + !BSY && ERR(==CHK) && !ABRT after the last + byte of CDB is transferred indicates CHECK CONDITION and + doesn't fall in this category. + </para> + </listitem> + + <listitem> + <para> + !BSY && ERR(==CHK) && ABRT after the last byte + of CDB is transferred *probably* indicates CHECK CONDITION and + doesn't fall in this category. + </para> + </listitem> + + </itemizedlist> + + <para> + Of errors detected as above, the followings are not ATA/ATAPI + device errors but ATA bus errors and should be handled + according to <xref linkend="excatATAbusErr"/>. + </para> + + <variablelist> + + <varlistentry> + <term>CRC error during data transfer</term> + <listitem> + <para> + This is indicated by ICRC bit in the ERROR register and + means that corruption occurred during data transfer. Upto + ATA/ATAPI-7, the standard specifies that this bit is only + applicable to UDMA transfers but ATA/ATAPI-8 draft revision + 1f says that the bit may be applicable to multiword DMA and + PIO. + </para> + </listitem> + </varlistentry> + + <varlistentry> + <term>ABRT error during data transfer or on completion</term> + <listitem> + <para> + Upto ATA/ATAPI-7, the standard specifies that ABRT could be + set on ICRC errors and on cases where a device is not able + to complete a command. Combined with the fact that MWDMA + and PIO transfer errors aren't allowed to use ICRC bit upto + ATA/ATAPI-7, it seems to imply that ABRT bit alone could + indicate tranfer errors. + </para> + <para> + However, ATA/ATAPI-8 draft revision 1f removes the part + that ICRC errors can turn on ABRT. So, this is kind of + gray area. Some heuristics are needed here. + </para> + </listitem> + </varlistentry> + + </variablelist> + + <para> + ATA/ATAPI device errors can be further categorized as follows. + </para> + + <variablelist> + + <varlistentry> + <term>Media errors</term> + <listitem> + <para> + This is indicated by UNC bit in the ERROR register. ATA + devices reports UNC error only after certain number of + retries cannot recover the data, so there's nothing much + else to do other than notifying upper layer. + </para> + <para> + READ and WRITE commands report CHS or LBA of the first + failed sector but ATA/ATAPI standard specifies that the + amount of transferred data on error completion is + indeterminate, so we cannot assume that sectors preceding + the failed sector have been transferred and thus cannot + complete those sectors successfully as SCSI does. + </para> + </listitem> + </varlistentry> + + <varlistentry> + <term>Media changed / media change requested error</term> + <listitem> + <para> + <<TODO: fill here>> + </para> + </listitem> + </varlistentry> + + <varlistentry><term>Address error</term> + <listitem> + <para> + This is indicated by IDNF bit in the ERROR register. + Report to upper layer. + </para> + </listitem> + </varlistentry> + + <varlistentry><term>Other errors</term> + <listitem> + <para> + This can be invalid command or parameter indicated by ABRT + ERROR bit or some other error condition. Note that ABRT + bit can indicate a lot of things including ICRC and Address + errors. Heuristics needed. + </para> + </listitem> + </varlistentry> + + </variablelist> + + <para> + Depending on commands, not all STATUS/ERROR bits are + applicable. These non-applicable bits are marked with + "na" in the output descriptions but upto ATA/ATAPI-7 + no definition of "na" can be found. However, + ATA/ATAPI-8 draft revision 1f describes "N/A" as + follows. + </para> + + <blockquote> + <variablelist> + <varlistentry><term>3.2.3.3a N/A</term> + <listitem> + <para> + A keyword the indicates a field has no defined value in + this standard and should not be checked by the host or + device. N/A fields should be cleared to zero. + </para> + </listitem> + </varlistentry> + </variablelist> + </blockquote> + + <para> + So, it seems reasonable to assume that "na" bits are + cleared to zero by devices and thus need no explicit masking. + </para> + + </sect2> + + <sect2 id="excatATAPIcc"> + <title>ATAPI device CHECK CONDITION</title> + + <para> + ATAPI device CHECK CONDITION error is indicated by set CHK bit + (ERR bit) in the STATUS register after the last byte of CDB is + transferred for a PACKET command. For this kind of errors, + sense data should be acquired to gather information regarding + the errors. REQUEST SENSE packet command should be used to + acquire sense data. + </para> + + <para> + Once sense data is acquired, this type of errors can be + handled similary to other SCSI errors. Note that sense data + may indicate ATA bus error (e.g. Sense Key 04h HARDWARE ERROR + && ASC/ASCQ 47h/00h SCSI PARITY ERROR). In such + cases, the error should be considered as an ATA bus error and + handled according to <xref linkend="excatATAbusErr"/>. + </para> + + </sect2> + + <sect2 id="excatNCQerr"> + <title>ATA device error (NCQ)</title> + + <para> + NCQ command error is indicated by cleared BSY and set ERR bit + during NCQ command phase (one or more NCQ commands + outstanding). Although STATUS and ERROR registers will + contain valid values describing the error, READ LOG EXT is + required to clear the error condition, determine which command + has failed and acquire more information. + </para> + + <para> + READ LOG EXT Log Page 10h reports which tag has failed and + taskfile register values describing the error. With this + information the failed command can be handled as a normal ATA + command error as in <xref linkend="excatDevErr"/> and all + other in-flight commands must be retried. Note that this + retry should not be counted - it's likely that commands + retried this way would have completed normally if it were not + for the failed command. + </para> + + <para> + Note that ATA bus errors can be reported as ATA device NCQ + errors. This should be handled as described in <xref + linkend="excatATAbusErr"/>. + </para> + + <para> + If READ LOG EXT Log Page 10h fails or reports NQ, we're + thoroughly screwed. This condition should be treated + according to <xref linkend="excatHSMviolation"/>. + </para> + + </sect2> + + <sect2 id="excatATAbusErr"> + <title>ATA bus error</title> + + <para> + ATA bus error means that data corruption occurred during + transmission over ATA bus (SATA or PATA). This type of errors + can be indicated by + </para> + + <itemizedlist> + + <listitem> + <para> + ICRC or ABRT error as described in <xref linkend="excatDevErr"/>. + </para> + </listitem> + + <listitem> + <para> + Controller-specific error completion with error information + indicating transmission error. + </para> + </listitem> + + <listitem> + <para> + On some controllers, command timeout. In this case, there may + be a mechanism to determine that the timeout is due to + transmission error. + </para> + </listitem> + + <listitem> + <para> + Unknown/random errors, timeouts and all sorts of weirdities. + </para> + </listitem> + + </itemizedlist> + + <para> + As described above, transmission errors can cause wide variety + of symptoms ranging from device ICRC error to random device + lockup, and, for many cases, there is no way to tell if an + error condition is due to transmission error or not; + therefore, it's necessary to employ some kind of heuristic + when dealing with errors and timeouts. For example, + encountering repetitive ABRT errors for known supported + command is likely to indicate ATA bus error. + </para> + + <para> + Once it's determined that ATA bus errors have possibly + occurred, lowering ATA bus transmission speed is one of + actions which may alleviate the problem. See <xref + linkend="exrecReconf"/> for more information. + </para> + + </sect2> + + <sect2 id="excatPCIbusErr"> + <title>PCI bus error</title> + + <para> + Data corruption or other failures during transmission over PCI + (or other system bus). For standard BMDMA, this is indicated + by Error bit in the BMDMA Status register. This type of + errors must be logged as it indicates something is very wrong + with the system. Resetting host controller is recommended. + </para> + + </sect2> + + <sect2 id="excatLateCompletion"> + <title>Late completion</title> + + <para> + This occurs when timeout occurs and the timeout handler finds + out that the timed out command has completed successfully or + with error. This is usually caused by lost interrupts. This + type of errors must be logged. Resetting host controller is + recommended. + </para> + + </sect2> + + <sect2 id="excatUnknown"> + <title>Unknown error (timeout)</title> + + <para> + This is when timeout occurs and the command is still + processing or the host and device are in unknown state. When + this occurs, HSM could be in any valid or invalid state. To + bring the device to known state and make it forget about the + timed out command, resetting is necessary. The timed out + command may be retried. + </para> + + <para> + Timeouts can also be caused by transmission errors. Refer to + <xref linkend="excatATAbusErr"/> for more details. + </para> + + </sect2> + + <sect2 id="excatHoplugPM"> + <title>Hotplug and power management exceptions</title> + + <para> + <<TODO: fill here>> + </para> + + </sect2> + + </sect1> + + <sect1 id="exrec"> + <title>EH recovery actions</title> + + <para> + This section discusses several important recovery actions. + </para> + + <sect2 id="exrecClr"> + <title>Clearing error condition</title> + + <para> + Many controllers require its error registers to be cleared by + error handler. Different controllers may have different + requirements. + </para> + + <para> + For SATA, it's strongly recommended to clear at least SError + register during error handling. + </para> + </sect2> + + <sect2 id="exrecRst"> + <title>Reset</title> + + <para> + During EH, resetting is necessary in the following cases. + </para> + + <itemizedlist> + + <listitem> + <para> + HSM is in unknown or invalid state + </para> + </listitem> + + <listitem> + <para> + HBA is in unknown or invalid state + </para> + </listitem> + + <listitem> + <para> + EH needs to make HBA/device forget about in-flight commands + </para> + </listitem> + + <listitem> + <para> + HBA/device behaves weirdly + </para> + </listitem> + + </itemizedlist> + + <para> + Resetting during EH might be a good idea regardless of error + condition to improve EH robustness. Whether to reset both or + either one of HBA and device depends on situation but the + following scheme is recommended. + </para> + + <itemizedlist> + + <listitem> + <para> + When it's known that HBA is in ready state but ATA/ATAPI + device in in unknown state, reset only device. + </para> + </listitem> + + <listitem> + <para> + If HBA is in unknown state, reset both HBA and device. + </para> + </listitem> + + </itemizedlist> + + <para> + HBA resetting is implementation specific. For a controller + complying to taskfile/BMDMA PCI IDE, stopping active DMA + transaction may be sufficient iff BMDMA state is the only HBA + context. But even mostly taskfile/BMDMA PCI IDE complying + controllers may have implementation specific requirements and + mechanism to reset themselves. This must be addressed by + specific drivers. + </para> + + <para> + OTOH, ATA/ATAPI standard describes in detail ways to reset + ATA/ATAPI devices. + </para> + + <variablelist> + + <varlistentry><term>PATA hardware reset</term> + <listitem> + <para> + This is hardware initiated device reset signalled with + asserted PATA RESET- signal. There is no standard way to + initiate hardware reset from software although some + hardware provides registers that allow driver to directly + tweak the RESET- signal. + </para> + </listitem> + </varlistentry> + + <varlistentry><term>Software reset</term> + <listitem> + <para> + This is achieved by turning CONTROL SRST bit on for at + least 5us. Both PATA and SATA support it but, in case of + SATA, this may require controller-specific support as the + second Register FIS to clear SRST should be transmitted + while BSY bit is still set. Note that on PATA, this resets + both master and slave devices on a channel. + </para> + </listitem> + </varlistentry> + + <varlistentry><term>EXECUTE DEVICE DIAGNOSTIC command</term> + <listitem> + <para> + Although ATA/ATAPI standard doesn't describe exactly, EDD + implies some level of resetting, possibly similar level + with software reset. Host-side EDD protocol can be handled + with normal command processing and most SATA controllers + should be able to handle EDD's just like other commands. + As in software reset, EDD affects both devices on a PATA + bus. + </para> + <para> + Although EDD does reset devices, this doesn't suit error + handling as EDD cannot be issued while BSY is set and it's + unclear how it will act when device is in unknown/weird + state. + </para> + </listitem> + </varlistentry> + + <varlistentry><term>ATAPI DEVICE RESET command</term> + <listitem> + <para> + This is very similar to software reset except that reset + can be restricted to the selected device without affecting + the other device sharing the cable. + </para> + </listitem> + </varlistentry> + + <varlistentry><term>SATA phy reset</term> + <listitem> + <para> + This is the preferred way of resetting a SATA device. In + effect, it's identical to PATA hardware reset. Note that + this can be done with the standard SCR Control register. + As such, it's usually easier to implement than software + reset. + </para> + </listitem> + </varlistentry> + + </variablelist> + + <para> + One more thing to consider when resetting devices is that + resetting clears certain configuration parameters and they + need to be set to their previous or newly adjusted values + after reset. + </para> + + <para> + Parameters affected are. + </para> + + <itemizedlist> + + <listitem> + <para> + CHS set up with INITIALIZE DEVICE PARAMETERS (seldomly used) + </para> + </listitem> + + <listitem> + <para> + Parameters set with SET FEATURES including transfer mode setting + </para> + </listitem> + + <listitem> + <para> + Block count set with SET MULTIPLE MODE + </para> + </listitem> + + <listitem> + <para> + Other parameters (SET MAX, MEDIA LOCK...) + </para> + </listitem> + + </itemizedlist> + + <para> + ATA/ATAPI standard specifies that some parameters must be + maintained across hardware or software reset, but doesn't + strictly specify all of them. Always reconfiguring needed + parameters after reset is required for robustness. Note that + this also applies when resuming from deep sleep (power-off). + </para> + + <para> + Also, ATA/ATAPI standard requires that IDENTIFY DEVICE / + IDENTIFY PACKET DEVICE is issued after any configuration + parameter is updated or a hardware reset and the result used + for further operation. OS driver is required to implement + revalidation mechanism to support this. + </para> + + </sect2> + + <sect2 id="exrecReconf"> + <title>Reconfigure transport</title> + + <para> + For both PATA and SATA, a lot of corners are cut for cheap + connectors, cables or controllers and it's quite common to see + high transmission error rate. This can be mitigated by + lowering transmission speed. + </para> + + <para> + The following is a possible scheme Jeff Garzik suggested. + </para> + + <blockquote> + <para> + If more than $N (3?) transmission errors happen in 15 minutes, + </para> + <itemizedlist> + <listitem> + <para> + if SATA, decrease SATA PHY speed. if speed cannot be decreased, + </para> + </listitem> + <listitem> + <para> + decrease UDMA xfer speed. if at UDMA0, switch to PIO4, + </para> + </listitem> + <listitem> + <para> + decrease PIO xfer speed. if at PIO3, complain, but continue + </para> + </listitem> + </itemizedlist> + </blockquote> + + </sect2> + + </sect1> + + </chapter> + <chapter id="PiixInt"> <title>ata_piix Internals</title> !Idrivers/scsi/ata_piix.c diff --git a/Documentation/DocBook/rapidio.tmpl b/Documentation/DocBook/rapidio.tmpl new file mode 100644 index 000000000000..1becf27ba27e --- /dev/null +++ b/Documentation/DocBook/rapidio.tmpl @@ -0,0 +1,160 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" + "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" [ + <!ENTITY rapidio SYSTEM "rapidio.xml"> + ]> + +<book id="RapidIO-Guide"> + <bookinfo> + <title>RapidIO Subsystem Guide</title> + + <authorgroup> + <author> + <firstname>Matt</firstname> + <surname>Porter</surname> + <affiliation> + <address> + <email>mporter@kernel.crashing.org</email> + <email>mporter@mvista.com</email> + </address> + </affiliation> + </author> + </authorgroup> + + <copyright> + <year>2005</year> + <holder>MontaVista Software, Inc.</holder> + </copyright> + + <legalnotice> + <para> + This documentation is free software; you can redistribute + it and/or modify it under the terms of the GNU General Public + License version 2 as published by the Free Software Foundation. + </para> + + <para> + This program is distributed in the hope that it will be + useful, but WITHOUT ANY WARRANTY; without even the implied + warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + See the GNU General Public License for more details. + </para> + + <para> + You should have received a copy of the GNU General Public + License along with this program; if not, write to the Free + Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, + MA 02111-1307 USA + </para> + + <para> + For more details see the file COPYING in the source + distribution of Linux. + </para> + </legalnotice> + </bookinfo> + +<toc></toc> + + <chapter id="intro"> + <title>Introduction</title> + <para> + RapidIO is a high speed switched fabric interconnect with + features aimed at the embedded market. RapidIO provides + support for memory-mapped I/O as well as message-based + transactions over the switched fabric network. RapidIO has + a standardized discovery mechanism not unlike the PCI bus + standard that allows simple detection of devices in a + network. + </para> + <para> + This documentation is provided for developers intending + to support RapidIO on new architectures, write new drivers, + or to understand the subsystem internals. + </para> + </chapter> + + <chapter id="bugs"> + <title>Known Bugs and Limitations</title> + + <sect1> + <title>Bugs</title> + <para>None. ;)</para> + </sect1> + <sect1> + <title>Limitations</title> + <para> + <orderedlist> + <listitem><para>Access/management of RapidIO memory regions is not supported</para></listitem> + <listitem><para>Multiple host enumeration is not supported</para></listitem> + </orderedlist> + </para> + </sect1> + </chapter> + + <chapter id="drivers"> + <title>RapidIO driver interface</title> + <para> + Drivers are provided a set of calls in order + to interface with the subsystem to gather info + on devices, request/map memory region resources, + and manage mailboxes/doorbells. + </para> + <sect1> + <title>Functions</title> +!Iinclude/linux/rio_drv.h +!Edrivers/rapidio/rio-driver.c +!Edrivers/rapidio/rio.c + </sect1> + </chapter> + + <chapter id="internals"> + <title>Internals</title> + + <para> + This chapter contains the autogenerated documentation of the RapidIO + subsystem. + </para> + + <sect1><title>Structures</title> +!Iinclude/linux/rio.h + </sect1> + <sect1><title>Enumeration and Discovery</title> +!Idrivers/rapidio/rio-scan.c + </sect1> + <sect1><title>Driver functionality</title> +!Idrivers/rapidio/rio.c +!Idrivers/rapidio/rio-access.c + </sect1> + <sect1><title>Device model support</title> +!Idrivers/rapidio/rio-driver.c + </sect1> + <sect1><title>Sysfs support</title> +!Idrivers/rapidio/rio-sysfs.c + </sect1> + <sect1><title>PPC32 support</title> +!Iarch/ppc/kernel/rio.c +!Earch/ppc/syslib/ppc85xx_rio.c +!Iarch/ppc/syslib/ppc85xx_rio.c + </sect1> + </chapter> + + <chapter id="credits"> + <title>Credits</title> + <para> + The following people have contributed to the RapidIO + subsystem directly or indirectly: + <orderedlist> + <listitem><para>Matt Porter<email>mporter@kernel.crashing.org</email></para></listitem> + <listitem><para>Randy Vinson<email>rvinson@mvista.com</email></para></listitem> + <listitem><para>Dan Malek<email>dan@embeddedalley.com</email></para></listitem> + </orderedlist> + </para> + <para> + The following people have contributed to this document: + <orderedlist> + <listitem><para>Matt Porter<email>mporter@kernel.crashing.org</email></para></listitem> + </orderedlist> + </para> + </chapter> +</book> diff --git a/Documentation/DocBook/usb.tmpl b/Documentation/DocBook/usb.tmpl index 705c442c7bf4..15ce0f21e5e0 100644 --- a/Documentation/DocBook/usb.tmpl +++ b/Documentation/DocBook/usb.tmpl @@ -291,7 +291,7 @@ !Edrivers/usb/core/hcd.c !Edrivers/usb/core/hcd-pci.c -!Edrivers/usb/core/buffer.c +!Idrivers/usb/core/buffer.c </chapter> <chapter> diff --git a/Documentation/DocBook/writing_usb_driver.tmpl b/Documentation/DocBook/writing_usb_driver.tmpl index 51f3bfb6fb6e..008a341234d0 100644 --- a/Documentation/DocBook/writing_usb_driver.tmpl +++ b/Documentation/DocBook/writing_usb_driver.tmpl @@ -345,8 +345,7 @@ if (!retval) { <programlisting> static inline void skel_delete (struct usb_skel *dev) { - if (dev->bulk_in_buffer != NULL) - kfree (dev->bulk_in_buffer); + kfree (dev->bulk_in_buffer); if (dev->bulk_out_buffer != NULL) usb_buffer_free (dev->udev, dev->bulk_out_size, dev->bulk_out_buffer, diff --git a/Documentation/MSI-HOWTO.txt b/Documentation/MSI-HOWTO.txt index 63edc5f847c4..3ec6c720b016 100644 --- a/Documentation/MSI-HOWTO.txt +++ b/Documentation/MSI-HOWTO.txt @@ -10,14 +10,22 @@ This guide describes the basics of Message Signaled Interrupts (MSI), the advantages of using MSI over traditional interrupt mechanisms, and how to enable your driver to use MSI or MSI-X. Also included is -a Frequently Asked Questions. +a Frequently Asked Questions (FAQ) section. + +1.1 Terminology + +PCI devices can be single-function or multi-function. In either case, +when this text talks about enabling or disabling MSI on a "device +function," it is referring to one specific PCI device and function and +not to all functions on a PCI device (unless the PCI device has only +one function). 2. Copyright 2003 Intel Corporation 3. What is MSI/MSI-X? Message Signaled Interrupt (MSI), as described in the PCI Local Bus -Specification Revision 2.3 or latest, is an optional feature, and a +Specification Revision 2.3 or later, is an optional feature, and a required feature for PCI Express devices. MSI enables a device function to request service by sending an Inbound Memory Write on its PCI bus to the FSB as a Message Signal Interrupt transaction. Because MSI is @@ -27,7 +35,7 @@ supported. A PCI device that supports MSI must also support pin IRQ assertion interrupt mechanism to provide backward compatibility for systems that -do not support MSI. In Systems, which support MSI, the bus driver is +do not support MSI. In systems which support MSI, the bus driver is responsible for initializing the message address and message data of the device function's MSI/MSI-X capability structure during device initial configuration. @@ -61,17 +69,17 @@ over the MSI capability structure as described below. - MSI and MSI-X both support per-vector masking. Per-vector masking is an optional extension of MSI but a required - feature for MSI-X. Per-vector masking provides the kernel - the ability to mask/unmask MSI when servicing its software - interrupt service routing handler. If per-vector masking is + feature for MSI-X. Per-vector masking provides the kernel the + ability to mask/unmask a single MSI while running its + interrupt service routine. If per-vector masking is not supported, then the device driver should provide the hardware/software synchronization to ensure that the device generates MSI when the driver wants it to do so. 4. Why use MSI? -As a benefit the simplification of board design, MSI allows board -designers to remove out of band interrupt routing. MSI is another +As a benefit to the simplification of board design, MSI allows board +designers to remove out-of-band interrupt routing. MSI is another step towards a legacy-free environment. Due to increasing pressure on chipset and processor packages to @@ -87,7 +95,7 @@ support. As a result, the PCI Express technology requires MSI support for better interrupt performance. Using MSI enables the device functions to support two or more -vectors, which can be configured to target different CPU's to +vectors, which can be configured to target different CPUs to increase scalability. 5. Configuring a driver to use MSI/MSI-X @@ -119,13 +127,13 @@ pci_enable_msi() explicitly. int pci_enable_msi(struct pci_dev *dev) -With this new API, any existing device driver, which like to have -MSI enabled on its device function, must call this API to enable MSI +With this new API, a device driver that wants to have MSI +enabled on its device function must call this API to enable MSI. A successful call will initialize the MSI capability structure with ONE vector, regardless of whether a device function is capable of supporting multiple messages. This vector replaces the -pre-assigned dev->irq with a new MSI vector. To avoid the conflict -of new assigned vector with existing pre-assigned vector requires +pre-assigned dev->irq with a new MSI vector. To avoid a conflict +of the new assigned vector with existing pre-assigned vector requires a device driver to call this API before calling request_irq(). 5.2.2 API pci_disable_msi @@ -137,14 +145,14 @@ when a device driver is unloading. This API restores dev->irq with the pre-assigned IOAPIC vector and switches a device's interrupt mode to PCI pin-irq assertion/INTx emulation mode. -Note that a device driver should always call free_irq() on MSI vector -it has done request_irq() on before calling this API. Failure to do -so results a BUG_ON() and a device will be left with MSI enabled and +Note that a device driver should always call free_irq() on the MSI vector +that it has done request_irq() on before calling this API. Failure to do +so results in a BUG_ON() and a device will be left with MSI enabled and leaks its vector. 5.2.3 MSI mode vs. legacy mode diagram -The below diagram shows the events, which switches the interrupt +The below diagram shows the events which switch the interrupt mode on the MSI-capable device function between MSI mode and PIN-IRQ assertion mode. @@ -155,9 +163,9 @@ PIN-IRQ assertion mode. ------------ pci_disable_msi ------------------------ -Figure 1.0 MSI Mode vs. Legacy Mode +Figure 1. MSI Mode vs. Legacy Mode -In Figure 1.0, a device operates by default in legacy mode. Legacy +In Figure 1, a device operates by default in legacy mode. Legacy in this context means PCI pin-irq assertion or PCI-Express INTx emulation. A successful MSI request (using pci_enable_msi()) switches a device's interrupt mode to MSI mode. A pre-assigned IOAPIC vector @@ -166,11 +174,11 @@ assigned MSI vector will replace dev->irq. To return back to its default mode, a device driver should always call pci_disable_msi() to undo the effect of pci_enable_msi(). Note that a -device driver should always call free_irq() on MSI vector it has done -request_irq() on before calling pci_disable_msi(). Failure to do so -results a BUG_ON() and a device will be left with MSI enabled and +device driver should always call free_irq() on the MSI vector it has +done request_irq() on before calling pci_disable_msi(). Failure to do +so results in a BUG_ON() and a device will be left with MSI enabled and leaks its vector. Otherwise, the PCI subsystem restores a device's -dev->irq with a pre-assigned IOAPIC vector and marks released +dev->irq with a pre-assigned IOAPIC vector and marks the released MSI vector as unused. Once being marked as unused, there is no guarantee that the PCI @@ -178,8 +186,8 @@ subsystem will reserve this MSI vector for a device. Depending on the availability of current PCI vector resources and the number of MSI/MSI-X requests from other drivers, this MSI may be re-assigned. -For the case where the PCI subsystem re-assigned this MSI vector -another driver, a request to switching back to MSI mode may result +For the case where the PCI subsystem re-assigns this MSI vector to +another driver, a request to switch back to MSI mode may result in being assigned a different MSI vector or a failure if no more vectors are available. @@ -208,12 +216,12 @@ Unlike the function pci_enable_msi(), the function pci_enable_msix() does not replace the pre-assigned IOAPIC dev->irq with a new MSI vector because the PCI subsystem writes the 1:1 vector-to-entry mapping into the field vector of each element contained in a second argument. -Note that the pre-assigned IO-APIC dev->irq is valid only if the device -operates in PIN-IRQ assertion mode. In MSI-X mode, any attempt of +Note that the pre-assigned IOAPIC dev->irq is valid only if the device +operates in PIN-IRQ assertion mode. In MSI-X mode, any attempt at using dev->irq by the device driver to request for interrupt service may result unpredictabe behavior. -For each MSI-X vector granted, a device driver is responsible to call +For each MSI-X vector granted, a device driver is responsible for calling other functions like request_irq(), enable_irq(), etc. to enable this vector with its corresponding interrupt service handler. It is a device driver's choice to assign all vectors with the same @@ -224,13 +232,13 @@ service handler. The PCI 3.0 specification has implementation notes that MMIO address space for a device's MSI-X structure should be isolated so that the -software system can set different page for controlling accesses to -the MSI-X structure. The implementation of MSI patch requires the PCI +software system can set different pages for controlling accesses to the +MSI-X structure. The implementation of MSI support requires the PCI subsystem, not a device driver, to maintain full control of the MSI-X -table/MSI-X PBA and MMIO address space of the MSI-X table/MSI-X PBA. -A device driver is prohibited from requesting the MMIO address space -of the MSI-X table/MSI-X PBA. Otherwise, the PCI subsystem will fail -enabling MSI-X on its hardware device when it calls the function +table/MSI-X PBA (Pending Bit Array) and MMIO address space of the MSI-X +table/MSI-X PBA. A device driver is prohibited from requesting the MMIO +address space of the MSI-X table/MSI-X PBA. Otherwise, the PCI subsystem +will fail enabling MSI-X on its hardware device when it calls the function pci_enable_msix(). 5.3.2 Handling MSI-X allocation @@ -274,9 +282,9 @@ For the case where fewer MSI-X vectors are allocated to a function than requested, the function pci_enable_msix() will return the maximum number of MSI-X vectors available to the caller. A device driver may re-send its request with fewer or equal vectors indicated -in a return. For example, if a device driver requests 5 vectors, but -the number of available vectors is 3 vectors, a value of 3 will be a -return as a result of pci_enable_msix() call. A function could be +in the return. For example, if a device driver requests 5 vectors, but +the number of available vectors is 3 vectors, a value of 3 will be +returned as a result of pci_enable_msix() call. A function could be designed for its driver to use only 3 MSI-X table entries as different combinations as ABC--, A-B-C, A--CB, etc. Note that this patch does not support multiple entries with the same vector. Such @@ -285,49 +293,46 @@ as ABBCC, AABCC, BCCBA, etc will result as a failure by the function pci_enable_msix(). Below are the reasons why supporting multiple entries with the same vector is an undesirable solution. - - The PCI subsystem can not determine which entry, which - generated the message, to mask/unmask MSI while handling + - The PCI subsystem cannot determine the entry that + generated the message to mask/unmask MSI while handling software driver ISR. Attempting to walk through all MSI-X table entries (2048 max) to mask/unmask any match vector is an undesirable solution. - - Walk through all MSI-X table entries (2048 max) to handle + - Walking through all MSI-X table entries (2048 max) to handle SMP affinity of any match vector is an undesirable solution. 5.3.4 API pci_enable_msix -int pci_enable_msix(struct pci_dev *dev, u32 *entries, int nvec) +int pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries, int nvec) This API enables a device driver to request the PCI subsystem -for enabling MSI-X messages on its hardware device. Depending on +to enable MSI-X messages on its hardware device. Depending on the availability of PCI vectors resources, the PCI subsystem enables -either all or nothing. +either all or none of the requested vectors. -Argument dev points to the device (pci_dev) structure. +Argument 'dev' points to the device (pci_dev) structure. -Argument entries is a pointer of unsigned integer type. The number of -elements is indicated in argument nvec. The content of each element -will be mapped to the following struct defined in /driver/pci/msi.h. +Argument 'entries' is a pointer to an array of msix_entry structs. +The number of entries is indicated in argument 'nvec'. +struct msix_entry is defined in /driver/pci/msi.h: struct msix_entry { u16 vector; /* kernel uses to write alloc vector */ u16 entry; /* driver uses to specify entry */ }; -A device driver is responsible for initializing the field entry of -each element with unique entry supported by MSI-X table. Otherwise, +A device driver is responsible for initializing the field 'entry' of +each element with a unique entry supported by MSI-X table. Otherwise, -EINVAL will be returned as a result. A successful return of zero -indicates the PCI subsystem completes initializing each of requested +indicates the PCI subsystem completed initializing each of the requested entries of the MSI-X table with message address and message data. Last but not least, the PCI subsystem will write the 1:1 -vector-to-entry mapping into the field vector of each element. A -device driver is responsible of keeping track of allocated MSI-X +vector-to-entry mapping into the field 'vector' of each element. A +device driver is responsible for keeping track of allocated MSI-X vectors in its internal data structure. -Argument nvec is an integer indicating the number of messages -requested. - -A return of zero indicates that the number of MSI-X vectors is +A return of zero indicates that the number of MSI-X vectors was successfully allocated. A return of greater than zero indicates MSI-X vector shortage. Or a return of less than zero indicates a failure. This failure may be a result of duplicate entries @@ -341,12 +346,12 @@ void pci_disable_msix(struct pci_dev *dev) This API should always be used to undo the effect of pci_enable_msix() when a device driver is unloading. Note that a device driver should always call free_irq() on all MSI-X vectors it has done request_irq() -on before calling this API. Failure to do so results a BUG_ON() and +on before calling this API. Failure to do so results in a BUG_ON() and a device will be left with MSI-X enabled and leaks its vectors. 5.3.6 MSI-X mode vs. legacy mode diagram -The below diagram shows the events, which switches the interrupt +The below diagram shows the events which switch the interrupt mode on the MSI-X capable device function between MSI-X mode and PIN-IRQ assertion mode (legacy). @@ -356,22 +361,22 @@ PIN-IRQ assertion mode (legacy). | | ===============> | | ------------ pci_disable_msix ------------------------ -Figure 2.0 MSI-X Mode vs. Legacy Mode +Figure 2. MSI-X Mode vs. Legacy Mode -In Figure 2.0, a device operates by default in legacy mode. A +In Figure 2, a device operates by default in legacy mode. A successful MSI-X request (using pci_enable_msix()) switches a device's interrupt mode to MSI-X mode. A pre-assigned IOAPIC vector stored in dev->irq will be saved by the PCI subsystem; however, unlike MSI mode, the PCI subsystem will not replace dev->irq with assigned MSI-X vector because the PCI subsystem already writes the 1:1 -vector-to-entry mapping into the field vector of each element +vector-to-entry mapping into the field 'vector' of each element specified in second argument. To return back to its default mode, a device driver should always call pci_disable_msix() to undo the effect of pci_enable_msix(). Note that a device driver should always call free_irq() on all MSI-X vectors it has done request_irq() on before calling pci_disable_msix(). Failure -to do so results a BUG_ON() and a device will be left with MSI-X +to do so results in a BUG_ON() and a device will be left with MSI-X enabled and leaks its vectors. Otherwise, the PCI subsystem switches a device function's interrupt mode from MSI-X mode to legacy mode and marks all allocated MSI-X vectors as unused. @@ -383,53 +388,56 @@ MSI/MSI-X requests from other drivers, these MSI-X vectors may be re-assigned. For the case where the PCI subsystem re-assigned these MSI-X vectors -to other driver, a request to switching back to MSI-X mode may result +to other drivers, a request to switch back to MSI-X mode may result being assigned with another set of MSI-X vectors or a failure if no more vectors are available. -5.4 Handling function implementng both MSI and MSI-X capabilities +5.4 Handling function implementing both MSI and MSI-X capabilities For the case where a function implements both MSI and MSI-X capabilities, the PCI subsystem enables a device to run either in MSI mode or MSI-X mode but not both. A device driver determines whether it wants MSI or MSI-X enabled on its hardware device. Once a device -driver requests for MSI, for example, it is prohibited to request for +driver requests for MSI, for example, it is prohibited from requesting MSI-X; in other words, a device driver is not permitted to ping-pong between MSI mod MSI-X mode during a run-time. 5.5 Hardware requirements for MSI/MSI-X support + MSI/MSI-X support requires support from both system hardware and individual hardware device functions. 5.5.1 System hardware support + Since the target of MSI address is the local APIC CPU, enabling -MSI/MSI-X support in Linux kernel is dependent on whether existing -system hardware supports local APIC. Users should verify their -system whether it runs when CONFIG_X86_LOCAL_APIC=y. +MSI/MSI-X support in the Linux kernel is dependent on whether existing +system hardware supports local APIC. Users should verify that their +system supports local APIC operation by testing that it runs when +CONFIG_X86_LOCAL_APIC=y. In SMP environment, CONFIG_X86_LOCAL_APIC is automatically set; however, in UP environment, users must manually set CONFIG_X86_LOCAL_APIC. Once CONFIG_X86_LOCAL_APIC=y, setting -CONFIG_PCI_MSI enables the VECTOR based scheme and -the option for MSI-capable device drivers to selectively enable -MSI/MSI-X. +CONFIG_PCI_MSI enables the VECTOR based scheme and the option for +MSI-capable device drivers to selectively enable MSI/MSI-X. Note that CONFIG_X86_IO_APIC setting is irrelevant because MSI/MSI-X vector is allocated new during runtime and MSI/MSI-X support does not depend on BIOS support. This key independency enables MSI/MSI-X -support on future IOxAPIC free platform. +support on future IOxAPIC free platforms. 5.5.2 Device hardware support + The hardware device function supports MSI by indicating the MSI/MSI-X capability structure on its PCI capability list. By default, this capability structure will not be initialized by the kernel to enable MSI during the system boot. In other words, the device function is running on its default pin assertion mode. Note that in many cases the hardware supporting MSI have bugs, -which may result in system hang. The software driver of specific -MSI-capable hardware is responsible for whether calling +which may result in system hangs. The software driver of specific +MSI-capable hardware is responsible for deciding whether to call pci_enable_msi or not. A return of zero indicates the kernel -successfully initializes the MSI/MSI-X capability structure of the +successfully initialized the MSI/MSI-X capability structure of the device function. The device function is now running on MSI/MSI-X mode. 5.6 How to tell whether MSI/MSI-X is enabled on device function @@ -439,10 +447,10 @@ pci_enable_msi()/pci_enable_msix() indicates to a device driver that its device function is initialized successfully and ready to run in MSI/MSI-X mode. -At the user level, users can use command 'cat /proc/interrupts' -to display the vector allocated for a device and its interrupt -MSI/MSI-X mode ("PCI MSI"/"PCI MSIX"). Below shows below MSI mode is -enabled on a SCSI Adaptec 39320D Ultra320. +At the user level, users can use the command 'cat /proc/interrupts' +to display the vectors allocated for devices and their interrupt +MSI/MSI-X modes ("PCI-MSI"/"PCI-MSI-X"). Below shows MSI mode is +enabled on a SCSI Adaptec 39320D Ultra320 controller. CPU0 CPU1 0: 324639 0 IO-APIC-edge timer @@ -453,8 +461,8 @@ enabled on a SCSI Adaptec 39320D Ultra320. 15: 1 0 IO-APIC-edge ide1 169: 0 0 IO-APIC-level uhci-hcd 185: 0 0 IO-APIC-level uhci-hcd -193: 138 10 PCI MSI aic79xx -201: 30 0 PCI MSI aic79xx +193: 138 10 PCI-MSI aic79xx +201: 30 0 PCI-MSI aic79xx 225: 30 0 IO-APIC-level aic7xxx 233: 30 0 IO-APIC-level aic7xxx NMI: 0 0 @@ -490,8 +498,8 @@ target address set as 0xfeexxxxx, as conformed to PCI specification 2.3 or latest, then it should work. Q4. From the driver point of view, if the MSI is lost because -of the errors occur during inbound memory write, then it may -wait for ever. Is there a mechanism for it to recover? +of errors occurring during inbound memory write, then it may +wait forever. Is there a mechanism for it to recover? A4. Since the target of the transaction is an inbound memory write, all transaction termination conditions (Retry, diff --git a/Documentation/RCU/torture.txt b/Documentation/RCU/torture.txt new file mode 100644 index 000000000000..e4c38152f7f7 --- /dev/null +++ b/Documentation/RCU/torture.txt @@ -0,0 +1,122 @@ +RCU Torture Test Operation + + +CONFIG_RCU_TORTURE_TEST + +The CONFIG_RCU_TORTURE_TEST config option is available for all RCU +implementations. It creates an rcutorture kernel module that can +be loaded to run a torture test. The test periodically outputs +status messages via printk(), which can be examined via the dmesg +command (perhaps grepping for "rcutorture"). The test is started +when the module is loaded, and stops when the module is unloaded. + +However, actually setting this config option to "y" results in the system +running the test immediately upon boot, and ending only when the system +is taken down. Normally, one will instead want to build the system +with CONFIG_RCU_TORTURE_TEST=m and to use modprobe and rmmod to control +the test, perhaps using a script similar to the one shown at the end of +this document. Note that you will need CONFIG_MODULE_UNLOAD in order +to be able to end the test. + + +MODULE PARAMETERS + +This module has the following parameters: + +nreaders This is the number of RCU reading threads supported. + The default is twice the number of CPUs. Why twice? + To properly exercise RCU implementations with preemptible + read-side critical sections. + +stat_interval The number of seconds between output of torture + statistics (via printk()). Regardless of the interval, + statistics are printed when the module is unloaded. + Setting the interval to zero causes the statistics to + be printed -only- when the module is unloaded, and this + is the default. + +verbose Enable debug printk()s. Default is disabled. + + +OUTPUT + +The statistics output is as follows: + + rcutorture: --- Start of test: nreaders=16 stat_interval=0 verbose=0 + rcutorture: rtc: 0000000000000000 ver: 1916 tfle: 0 rta: 1916 rtaf: 0 rtf: 1915 + rcutorture: Reader Pipe: 1466408 9747 0 0 0 0 0 0 0 0 0 + rcutorture: Reader Batch: 1464477 11678 0 0 0 0 0 0 0 0 + rcutorture: Free-Block Circulation: 1915 1915 1915 1915 1915 1915 1915 1915 1915 1915 0 + rcutorture: --- End of test + +The command "dmesg | grep rcutorture:" will extract this information on +most systems. On more esoteric configurations, it may be necessary to +use other commands to access the output of the printk()s used by +the RCU torture test. The printk()s use KERN_ALERT, so they should +be evident. ;-) + +The entries are as follows: + +o "ggp": The number of counter flips (or batches) since boot. + +o "rtc": The hexadecimal address of the structure currently visible + to readers. + +o "ver": The number of times since boot that the rcutw writer task + has changed the structure visible to readers. + +o "tfle": If non-zero, indicates that the "torture freelist" + containing structure to be placed into the "rtc" area is empty. + This condition is important, since it can fool you into thinking + that RCU is working when it is not. :-/ + +o "rta": Number of structures allocated from the torture freelist. + +o "rtaf": Number of allocations from the torture freelist that have + failed due to the list being empty. + +o "rtf": Number of frees into the torture freelist. + +o "Reader Pipe": Histogram of "ages" of structures seen by readers. + If any entries past the first two are non-zero, RCU is broken. + And rcutorture prints the error flag string "!!!" to make sure + you notice. The age of a newly allocated structure is zero, + it becomes one when removed from reader visibility, and is + incremented once per grace period subsequently -- and is freed + after passing through (RCU_TORTURE_PIPE_LEN-2) grace periods. + + The output displayed above was taken from a correctly working + RCU. If you want to see what it looks like when broken, break + it yourself. ;-) + +o "Reader Batch": Another histogram of "ages" of structures seen + by readers, but in terms of counter flips (or batches) rather + than in terms of grace periods. The legal number of non-zero + entries is again two. The reason for this separate view is + that it is easier to get the third entry to show up in the + "Reader Batch" list than in the "Reader Pipe" list. + +o "Free-Block Circulation": Shows the number of torture structures + that have reached a given point in the pipeline. The first element + should closely correspond to the number of structures allocated, + the second to the number that have been removed from reader view, + and all but the last remaining to the corresponding number of + passes through a grace period. The last entry should be zero, + as it is only incremented if a torture structure's counter + somehow gets incremented farther than it should. + + +USAGE + +The following script may be used to torture RCU: + + #!/bin/sh + + modprobe rcutorture + sleep 100 + rmmod rcutorture + dmesg | grep rcutorture: + +The output can be manually inspected for the error flag of "!!!". +One could of course create a more elaborate script that automatically +checked for such errors. diff --git a/Documentation/RCU/whatisRCU.txt b/Documentation/RCU/whatisRCU.txt index 354d89c78377..15da16861fa3 100644 --- a/Documentation/RCU/whatisRCU.txt +++ b/Documentation/RCU/whatisRCU.txt @@ -772,8 +772,6 @@ RCU pointer/list traversal: list_for_each_entry_rcu list_for_each_continue_rcu (to be deprecated in favor of new list_for_each_entry_continue_rcu) - hlist_for_each_rcu (to be deprecated in favor of - hlist_for_each_entry_rcu) hlist_for_each_entry_rcu RCU pointer update: diff --git a/Documentation/SubmittingPatches b/Documentation/SubmittingPatches index 7f43b040311e..237d54c44bc5 100644 --- a/Documentation/SubmittingPatches +++ b/Documentation/SubmittingPatches @@ -301,8 +301,84 @@ now, but you can do this to mark internal company procedures or just point out some special detail about the sign-off. +12) The canonical patch format -12) More references for submitting patches +The canonical patch subject line is: + + Subject: [PATCH 001/123] subsystem: summary phrase + +The canonical patch message body contains the following: + + - A "from" line specifying the patch author. + + - An empty line. + + - The body of the explanation, which will be copied to the + permanent changelog to describe this patch. + + - The "Signed-off-by:" lines, described above, which will + also go in the changelog. + + - A marker line containing simply "---". + + - Any additional comments not suitable for the changelog. + + - The actual patch (diff output). + +The Subject line format makes it very easy to sort the emails +alphabetically by subject line - pretty much any email reader will +support that - since because the sequence number is zero-padded, +the numerical and alphabetic sort is the same. + +The "subsystem" in the email's Subject should identify which +area or subsystem of the kernel is being patched. + +The "summary phrase" in the email's Subject should concisely +describe the patch which that email contains. The "summary +phrase" should not be a filename. Do not use the same "summary +phrase" for every patch in a whole patch series. + +Bear in mind that the "summary phrase" of your email becomes +a globally-unique identifier for that patch. It propagates +all the way into the git changelog. The "summary phrase" may +later be used in developer discussions which refer to the patch. +People will want to google for the "summary phrase" to read +discussion regarding that patch. + +A couple of example Subjects: + + Subject: [patch 2/5] ext2: improve scalability of bitmap searching + Subject: [PATCHv2 001/207] x86: fix eflags tracking + +The "from" line must be the very first line in the message body, +and has the form: + + From: Original Author <author@example.com> + +The "from" line specifies who will be credited as the author of the +patch in the permanent changelog. If the "from" line is missing, +then the "From:" line from the email header will be used to determine +the patch author in the changelog. + +The explanation body will be committed to the permanent source +changelog, so should make sense to a competent reader who has long +since forgotten the immediate details of the discussion that might +have led to this patch. + +The "---" marker line serves the essential purpose of marking for patch +handling tools where the changelog message ends. + +One good use for the additional comments after the "---" marker is for +a diffstat, to show what files have changed, and the number of inserted +and deleted lines per file. A diffstat is especially useful on bigger +patches. Other comments relevant only to the moment or the maintainer, +not suitable for the permanent changelog, should also go here. + +See more details on the proper patch format in the following +references. + + +13) More references for submitting patches Andrew Morton, "The perfect patch" (tpp). <http://www.zip.com.au/~akpm/linux/patches/stuff/tpp.txt> @@ -310,6 +386,14 @@ Andrew Morton, "The perfect patch" (tpp). Jeff Garzik, "Linux kernel patch submission format." <http://linux.yyz.us/patch-format.html> +Greg KH, "How to piss off a kernel subsystem maintainer" + <http://www.kroah.com/log/2005/03/31/> + +Kernel Documentation/CodingStyle + <http://sosdg.org/~coywolf/lxr/source/Documentation/CodingStyle> + +Linus Torvald's mail on the canonical patch format: + <http://lkml.org/lkml/2005/4/7/183> ----------------------------------- diff --git a/Documentation/arm/README b/Documentation/arm/README index a6f718e90a86..5ed6f3530b86 100644 --- a/Documentation/arm/README +++ b/Documentation/arm/README @@ -8,10 +8,9 @@ Compilation of kernel --------------------- In order to compile ARM Linux, you will need a compiler capable of - generating ARM ELF code with GNU extensions. GCC 2.95.1, EGCS - 1.1.2, and GCC 3.3 are known to be good compilers. Fortunately, you - needn't guess. The kernel will report an error if your compiler is - a recognized offender. + generating ARM ELF code with GNU extensions. GCC 3.3 is known to be + a good compiler. Fortunately, you needn't guess. The kernel will report + an error if your compiler is a recognized offender. To build ARM Linux natively, you shouldn't have to alter the ARCH = line in the top level Makefile. However, if you don't have the ARM Linux ELF diff --git a/Documentation/arm/Samsung-S3C24XX/Overview.txt b/Documentation/arm/Samsung-S3C24XX/Overview.txt index 3af4d29a8938..89aa89d526ac 100644 --- a/Documentation/arm/Samsung-S3C24XX/Overview.txt +++ b/Documentation/arm/Samsung-S3C24XX/Overview.txt @@ -81,7 +81,8 @@ Adding New Machines Any large scale modifications, or new drivers should be discussed on the ARM kernel mailing list (linux-arm-kernel) before being - attempted. + attempted. See http://www.arm.linux.org.uk/mailinglists/ for the + mailing list information. NAND @@ -120,6 +121,43 @@ Clock Management various clock units +Platform Data +------------- + + Whenever a device has platform specific data that is specified + on a per-machine basis, care should be taken to ensure the + following: + + 1) that default data is not left in the device to confuse the + driver if a machine does not set it at startup + + 2) the data should (if possible) be marked as __initdata, + to ensure that the data is thrown away if the machine is + not the one currently in use. + + The best way of doing this is to make a function that + kmalloc()s an area of memory, and copies the __initdata + and then sets the relevant device's platform data. Making + the function `__init` takes care of ensuring it is discarded + with the rest of the initialisation code + + static __init void s3c24xx_xxx_set_platdata(struct xxx_data *pd) + { + struct s3c2410_xxx_mach_info *npd; + + npd = kmalloc(sizeof(struct s3c2410_xxx_mach_info), GFP_KERNEL); + if (npd) { + memcpy(npd, pd, sizeof(struct s3c2410_xxx_mach_info)); + s3c_device_xxx.dev.platform_data = npd; + } else { + printk(KERN_ERR "no memory for xxx platform data\n"); + } + } + + Note, since the code is marked as __init, it should not be + exported outside arch/arm/mach-s3c2410/, or exported to + modules via EXPORT_SYMBOL() and related functions. + Port Contributors ----------------- @@ -149,6 +187,7 @@ Document Changes 06 Mar 2005 - BJD - Added Christer Weinigel 08 Mar 2005 - BJD - Added LCVR to list of people, updated introduction 08 Mar 2005 - BJD - Added section on adding machines + 09 Sep 2005 - BJD - Added section on platform data Document Author --------------- diff --git a/Documentation/block/biodoc.txt b/Documentation/block/biodoc.txt index 6dd274d7e1cf..2d65c2182161 100644 --- a/Documentation/block/biodoc.txt +++ b/Documentation/block/biodoc.txt @@ -906,9 +906,20 @@ Aside: 4. The I/O scheduler -I/O schedulers are now per queue. They should be runtime switchable and modular -but aren't yet. Jens has most bits to do this, but the sysfs implementation is -missing. +I/O scheduler, a.k.a. elevator, is implemented in two layers. Generic dispatch +queue and specific I/O schedulers. Unless stated otherwise, elevator is used +to refer to both parts and I/O scheduler to specific I/O schedulers. + +Block layer implements generic dispatch queue in ll_rw_blk.c and elevator.c. +The generic dispatch queue is responsible for properly ordering barrier +requests, requeueing, handling non-fs requests and all other subtleties. + +Specific I/O schedulers are responsible for ordering normal filesystem +requests. They can also choose to delay certain requests to improve +throughput or whatever purpose. As the plural form indicates, there are +multiple I/O schedulers. They can be built as modules but at least one should +be built inside the kernel. Each queue can choose different one and can also +change to another one dynamically. A block layer call to the i/o scheduler follows the convention elv_xxx(). This calls elevator_xxx_fn in the elevator switch (drivers/block/elevator.c). Oh, @@ -921,44 +932,36 @@ keeping work. The functions an elevator may implement are: (* are mandatory) elevator_merge_fn called to query requests for merge with a bio -elevator_merge_req_fn " " " with another request +elevator_merge_req_fn called when two requests get merged. the one + which gets merged into the other one will be + never seen by I/O scheduler again. IOW, after + being merged, the request is gone. elevator_merged_fn called when a request in the scheduler has been involved in a merge. It is used in the deadline scheduler for example, to reposition the request if its sorting order has changed. -*elevator_next_req_fn returns the next scheduled request, or NULL - if there are none (or none are ready). +elevator_dispatch_fn fills the dispatch queue with ready requests. + I/O schedulers are free to postpone requests by + not filling the dispatch queue unless @force + is non-zero. Once dispatched, I/O schedulers + are not allowed to manipulate the requests - + they belong to generic dispatch queue. -*elevator_add_req_fn called to add a new request into the scheduler +elevator_add_req_fn called to add a new request into the scheduler elevator_queue_empty_fn returns true if the merge queue is empty. Drivers shouldn't use this, but rather check if elv_next_request is NULL (without losing the request if one exists!) -elevator_remove_req_fn This is called when a driver claims ownership of - the target request - it now belongs to the - driver. It must not be modified or merged. - Drivers must not lose the request! A subsequent - call of elevator_next_req_fn must return the - _next_ request. - -elevator_requeue_req_fn called to add a request to the scheduler. This - is used when the request has alrnadebeen - returned by elv_next_request, but hasn't - completed. If this is not implemented then - elevator_add_req_fn is called instead. - elevator_former_req_fn elevator_latter_req_fn These return the request before or after the one specified in disk sort order. Used by the block layer to find merge possibilities. -elevator_completed_req_fn called when a request is completed. This might - come about due to being merged with another or - when the device completes the request. +elevator_completed_req_fn called when a request is completed. elevator_may_queue_fn returns true if the scheduler wants to allow the current context to queue a new request even if @@ -967,13 +970,33 @@ elevator_may_queue_fn returns true if the scheduler wants to allow the elevator_set_req_fn elevator_put_req_fn Must be used to allocate and free any elevator - specific storate for a request. + specific storage for a request. + +elevator_activate_req_fn Called when device driver first sees a request. + I/O schedulers can use this callback to + determine when actual execution of a request + starts. +elevator_deactivate_req_fn Called when device driver decides to delay + a request by requeueing it. elevator_init_fn elevator_exit_fn Allocate and free any elevator specific storage for a queue. -4.2 I/O scheduler implementation +4.2 Request flows seen by I/O schedulers +All requests seens by I/O schedulers strictly follow one of the following three +flows. + + set_req_fn -> + + i. add_req_fn -> (merged_fn ->)* -> dispatch_fn -> activate_req_fn -> + (deactivate_req_fn -> activate_req_fn ->)* -> completed_req_fn + ii. add_req_fn -> (merged_fn ->)* -> merge_req_fn + iii. [none] + + -> put_req_fn + +4.3 I/O scheduler implementation The generic i/o scheduler algorithm attempts to sort/merge/batch requests for optimal disk scan and request servicing performance (based on generic principles and device capabilities), optimized for: @@ -993,18 +1016,7 @@ request in sort order to prevent binary tree lookups. This arrangement is not a generic block layer characteristic however, so elevators may implement queues as they please. -ii. Last merge hint -The last merge hint is part of the generic queue layer. I/O schedulers must do -some management on it. For the most part, the most important thing is to make -sure q->last_merge is cleared (set to NULL) when the request on it is no longer -a candidate for merging (for example if it has been sent to the driver). - -The last merge performed is cached as a hint for the subsequent request. If -sequential data is being submitted, the hint is used to perform merges without -any scanning. This is not sufficient when there are multiple processes doing -I/O though, so a "merge hash" is used by some schedulers. - -iii. Merge hash +ii. Merge hash AS and deadline use a hash table indexed by the last sector of a request. This enables merging code to quickly look up "back merge" candidates, even when multiple I/O streams are being performed at once on one disk. @@ -1013,29 +1025,8 @@ multiple I/O streams are being performed at once on one disk. are far less common than "back merges" due to the nature of most I/O patterns. Front merges are handled by the binary trees in AS and deadline schedulers. -iv. Handling barrier cases -A request with flags REQ_HARDBARRIER or REQ_SOFTBARRIER must not be ordered -around. That is, they must be processed after all older requests, and before -any newer ones. This includes merges! - -In AS and deadline schedulers, barriers have the effect of flushing the reorder -queue. The performance cost of this will vary from nothing to a lot depending -on i/o patterns and device characteristics. Obviously they won't improve -performance, so their use should be kept to a minimum. - -v. Handling insertion position directives -A request may be inserted with a position directive. The directives are one of -ELEVATOR_INSERT_BACK, ELEVATOR_INSERT_FRONT, ELEVATOR_INSERT_SORT. - -ELEVATOR_INSERT_SORT is a general directive for non-barrier requests. -ELEVATOR_INSERT_BACK is used to insert a barrier to the back of the queue. -ELEVATOR_INSERT_FRONT is used to insert a barrier to the front of the queue, and -overrides the ordering requested by any previous barriers. In practice this is -harmless and required, because it is used for SCSI requeueing. This does not -require flushing the reorder queue, so does not impose a performance penalty. - -vi. Plugging the queue to batch requests in anticipation of opportunities for - merge/sort optimizations +iii. Plugging the queue to batch requests in anticipation of opportunities for + merge/sort optimizations This is just the same as in 2.4 so far, though per-device unplugging support is anticipated for 2.5. Also with a priority-based i/o scheduler, @@ -1069,7 +1060,7 @@ Aside: blk_kick_queue() to unplug a specific queue (right away ?) or optionally, all queues, is in the plan. -4.3 I/O contexts +4.4 I/O contexts I/O contexts provide a dynamically allocated per process data area. They may be used in I/O schedulers, and in the block layer (could be used for IO statis, priorities for example). See *io_context in drivers/block/ll_rw_blk.c, and diff --git a/Documentation/cachetlb.txt b/Documentation/cachetlb.txt index e132fb1163b0..7eb715e07eda 100644 --- a/Documentation/cachetlb.txt +++ b/Documentation/cachetlb.txt @@ -49,9 +49,6 @@ changes occur: page table operations such as what happens during fork, and exec. - Platform developers note that generic code will always - invoke this interface without mm->page_table_lock held. - 3) void flush_tlb_range(struct vm_area_struct *vma, unsigned long start, unsigned long end) @@ -72,9 +69,6 @@ changes occur: call flush_tlb_page (see below) for each entry which may be modified. - Platform developers note that generic code will always - invoke this interface with mm->page_table_lock held. - 4) void flush_tlb_page(struct vm_area_struct *vma, unsigned long addr) This time we need to remove the PAGE_SIZE sized translation @@ -93,9 +87,6 @@ changes occur: This is used primarily during fault processing. - Platform developers note that generic code will always - invoke this interface with mm->page_table_lock held. - 5) void flush_tlb_pgtables(struct mm_struct *mm, unsigned long start, unsigned long end) diff --git a/Documentation/connector/connector.txt b/Documentation/connector/connector.txt index 54a0a14bfbe3..57a314b14cf8 100644 --- a/Documentation/connector/connector.txt +++ b/Documentation/connector/connector.txt @@ -131,3 +131,47 @@ Netlink itself is not reliable protocol, that means that messages can be lost due to memory pressure or process' receiving queue overflowed, so caller is warned must be prepared. That is why struct cn_msg [main connector's message header] contains u32 seq and u32 ack fields. + +/*****************************************/ +Userspace usage. +/*****************************************/ +2.6.14 has a new netlink socket implementation, which by default does not +allow to send data to netlink groups other than 1. +So, if to use netlink socket (for example using connector) +with different group number userspace application must subscribe to +that group. It can be achieved by following pseudocode: + +s = socket(PF_NETLINK, SOCK_DGRAM, NETLINK_CONNECTOR); + +l_local.nl_family = AF_NETLINK; +l_local.nl_groups = 12345; +l_local.nl_pid = 0; + +if (bind(s, (struct sockaddr *)&l_local, sizeof(struct sockaddr_nl)) == -1) { + perror("bind"); + close(s); + return -1; +} + +{ + int on = l_local.nl_groups; + setsockopt(s, 270, 1, &on, sizeof(on)); +} + +Where 270 above is SOL_NETLINK, and 1 is a NETLINK_ADD_MEMBERSHIP socket +option. To drop multicast subscription one should call above socket option +with NETLINK_DROP_MEMBERSHIP parameter which is defined as 0. + +2.6.14 netlink code only allows to select a group which is less or equal to +the maximum group number, which is used at netlink_kernel_create() time. +In case of connector it is CN_NETLINK_USERS + 0xf, so if you want to use +group number 12345, you must increment CN_NETLINK_USERS to that number. +Additional 0xf numbers are allocated to be used by non-in-kernel users. + +Due to this limitation, group 0xffffffff does not work now, so one can +not use add/remove connector's group notifications, but as far as I know, +only cn_test.c test module used it. + +Some work in netlink area is still being done, so things can be changed in +2.6.15 timeframe, if it will happen, documentation will be updated for that +kernel. diff --git a/Documentation/cpusets.txt b/Documentation/cpusets.txt index d17b7d2dd771..a09a8eb80665 100644 --- a/Documentation/cpusets.txt +++ b/Documentation/cpusets.txt @@ -94,7 +94,7 @@ the available CPU and Memory resources amongst the requesting tasks. But larger systems, which benefit more from careful processor and memory placement to reduce memory access times and contention, and which typically represent a larger investment for the customer, -can benefit from explictly placing jobs on properly sized subsets of +can benefit from explicitly placing jobs on properly sized subsets of the system. This can be especially valuable on: diff --git a/Documentation/dell_rbu.txt b/Documentation/dell_rbu.txt index bcfa5c35036b..941343a7a265 100644 --- a/Documentation/dell_rbu.txt +++ b/Documentation/dell_rbu.txt @@ -13,6 +13,8 @@ the BIOS on Dell servers (starting from servers sold since 1999), desktops and notebooks (starting from those sold in 2005). Please go to http://support.dell.com register and you can find info on OpenManage and Dell Update packages (DUP). +Libsmbios can also be used to update BIOS on Dell systems go to +http://linux.dell.com/libsmbios/ for details. Dell_RBU driver supports BIOS update using the monilothic image and packetized image methods. In case of moniolithic the driver allocates a contiguous chunk @@ -22,8 +24,8 @@ would place each packet in contiguous physical memory. The driver also maintains a link list of packets for reading them back. If the dell_rbu driver is unloaded all the allocated memory is freed. -The rbu driver needs to have an application which will inform the BIOS to -enable the update in the next system reboot. +The rbu driver needs to have an application (as mentioned above)which will +inform the BIOS to enable the update in the next system reboot. The user should not unload the rbu driver after downloading the BIOS image or updating. @@ -33,6 +35,7 @@ The driver load creates the following directories under the /sys file system. /sys/class/firmware/dell_rbu/data /sys/devices/platform/dell_rbu/image_type /sys/devices/platform/dell_rbu/data +/sys/devices/platform/dell_rbu/packet_size The driver supports two types of update mechanism; monolithic and packetized. These update mechanism depends upon the BIOS currently running on the system. @@ -42,10 +45,30 @@ In case of packet mechanism the single memory can be broken in smaller chuks of contiguous memory and the BIOS image is scattered in these packets. By default the driver uses monolithic memory for the update type. This can be -changed to contiguous during the driver load time by specifying the load +changed to packets during the driver load time by specifying the load parameter image_type=packet. This can also be changed later as below echo packet > /sys/devices/platform/dell_rbu/image_type +In packet update mode the packet size has to be given before any packets can +be downloaded. It is done as below +echo XXXX > /sys/devices/platform/dell_rbu/packet_size +In the packet update mechanism, the user neesd to create a new file having +packets of data arranged back to back. It can be done as follows +The user creates packets header, gets the chunk of the BIOS image and +placs it next to the packetheader; now, the packetheader + BIOS image chunk +added to geather should match the specified packet_size. This makes one +packet, the user needs to create more such packets out of the entire BIOS +image file and then arrange all these packets back to back in to one single +file. +This file is then copied to /sys/class/firmware/dell_rbu/data. +Once this file gets to the driver, the driver extracts packet_size data from +the file and spreads it accross the physical memory in contiguous packet_sized +space. +This method makes sure that all the packets get to the driver in a single operation. + +In monolithic update the user simply get the BIOS image (.hdr file) and copies +to the data file as is without any change to the BIOS image itself. + Do the steps below to download the BIOS image. 1) echo 1 > /sys/class/firmware/dell_rbu/loading 2) cp bios_image.hdr /sys/class/firmware/dell_rbu/data @@ -53,20 +76,23 @@ Do the steps below to download the BIOS image. The /sys/class/firmware/dell_rbu/ entries will remain till the following is done. -echo -1 > /sys/class/firmware/dell_rbu/loading +echo -1 > /sys/class/firmware/dell_rbu/loading. +Until this step is completed the driver cannot be unloaded. +Also echoing either mono ,packet or init in to image_type will free up the +memory allocated by the driver. -Until this step is completed the drivr cannot be unloaded. +If an user by accident executes steps 1 and 3 above without executing step 2; +it will make the /sys/class/firmware/dell_rbu/ entries to disappear. +The entries can be recreated by doing the following +echo init > /sys/devices/platform/dell_rbu/image_type +NOTE: echoing init in image_type does not change it original value. Also the driver provides /sys/devices/platform/dell_rbu/data readonly file to -read back the image downloaded. This is useful in case of packet update -mechanism where the above steps 1,2,3 will repeated for every packet. -By reading the /sys/devices/platform/dell_rbu/data file all packet data -downloaded can be verified in a single file. -The packets are arranged in this file one after the other in a FIFO order. +read back the image downloaded. NOTE: -This driver requires a patch for firmware_class.c which has the addition -of request_firmware_nowait_nohotplug function to wortk +This driver requires a patch for firmware_class.c which has the modified +request_firmware_nowait function. Also after updating the BIOS image an user mdoe application neeeds to execute code which message the BIOS update request to the BIOS. So on the next reboot the BIOS knows about the new image downloaded and it updates it self. diff --git a/Documentation/device-mapper/snapshot.txt b/Documentation/device-mapper/snapshot.txt new file mode 100644 index 000000000000..a5009c8300f3 --- /dev/null +++ b/Documentation/device-mapper/snapshot.txt @@ -0,0 +1,74 @@ +Device-mapper snapshot support +============================== + +Device-mapper allows you, without massive data copying: + +*) To create snapshots of any block device i.e. mountable, saved states of +the block device which are also writable without interfering with the +original content; +*) To create device "forks", i.e. multiple different versions of the +same data stream. + + +In both cases, dm copies only the chunks of data that get changed and +uses a separate copy-on-write (COW) block device for storage. + + +There are two dm targets available: snapshot and snapshot-origin. + +*) snapshot-origin <origin> + +which will normally have one or more snapshots based on it. +Reads will be mapped directly to the backing device. For each write, the +original data will be saved in the <COW device> of each snapshot to keep +its visible content unchanged, at least until the <COW device> fills up. + + +*) snapshot <origin> <COW device> <persistent?> <chunksize> + +A snapshot of the <origin> block device is created. Changed chunks of +<chunksize> sectors will be stored on the <COW device>. Writes will +only go to the <COW device>. Reads will come from the <COW device> or +from <origin> for unchanged data. <COW device> will often be +smaller than the origin and if it fills up the snapshot will become +useless and be disabled, returning errors. So it is important to monitor +the amount of free space and expand the <COW device> before it fills up. + +<persistent?> is P (Persistent) or N (Not persistent - will not survive +after reboot). +The difference is that for transient snapshots less metadata must be +saved on disk - they can be kept in memory by the kernel. + + +How this is used by LVM2 +======================== +When you create the first LVM2 snapshot of a volume, four dm devices are used: + +1) a device containing the original mapping table of the source volume; +2) a device used as the <COW device>; +3) a "snapshot" device, combining #1 and #2, which is the visible snapshot + volume; +4) the "original" volume (which uses the device number used by the original + source volume), whose table is replaced by a "snapshot-origin" mapping + from device #1. + +A fixed naming scheme is used, so with the following commands: + +lvcreate -L 1G -n base volumeGroup +lvcreate -L 100M --snapshot -n snap volumeGroup/base + +we'll have this situation (with volumes in above order): + +# dmsetup table|grep volumeGroup + +volumeGroup-base-real: 0 2097152 linear 8:19 384 +volumeGroup-snap-cow: 0 204800 linear 8:19 2097536 +volumeGroup-snap: 0 2097152 snapshot 254:11 254:12 P 16 +volumeGroup-base: 0 2097152 snapshot-origin 254:11 + +# ls -lL /dev/mapper/volumeGroup-* +brw------- 1 root root 254, 11 29 ago 18:15 /dev/mapper/volumeGroup-base-real +brw------- 1 root root 254, 12 29 ago 18:15 /dev/mapper/volumeGroup-snap-cow +brw------- 1 root root 254, 13 29 ago 18:15 /dev/mapper/volumeGroup-snap +brw------- 1 root root 254, 10 29 ago 18:14 /dev/mapper/volumeGroup-base + diff --git a/Documentation/driver-model/driver.txt b/Documentation/driver-model/driver.txt index fabaca1ab1b0..59806c9761f7 100644 --- a/Documentation/driver-model/driver.txt +++ b/Documentation/driver-model/driver.txt @@ -14,8 +14,8 @@ struct device_driver { int (*probe) (struct device * dev); int (*remove) (struct device * dev); - int (*suspend) (struct device * dev, pm_message_t state, u32 level); - int (*resume) (struct device * dev, u32 level); + int (*suspend) (struct device * dev, pm_message_t state); + int (*resume) (struct device * dev); }; @@ -194,69 +194,13 @@ device; i.e. anything in the device's driver_data field. If the device is still present, it should quiesce the device and place it into a supported low-power state. - int (*suspend) (struct device * dev, pm_message_t state, u32 level); + int (*suspend) (struct device * dev, pm_message_t state); -suspend is called to put the device in a low power state. There are -several stages to successfully suspending a device, which is denoted in -the @level parameter. Breaking the suspend transition into several -stages affords the platform flexibility in performing device power -management based on the requirements of the system and the -user-defined policy. +suspend is called to put the device in a low power state. -SUSPEND_NOTIFY notifies the device that a suspend transition is about -to happen. This happens on system power state transitions to verify -that all devices can successfully suspend. + int (*resume) (struct device * dev); -A driver may choose to fail on this call, which should cause the -entire suspend transition to fail. A driver should fail only if it -knows that the device will not be able to be resumed properly when the -system wakes up again. It could also fail if it somehow determines it -is in the middle of an operation too important to stop. - -SUSPEND_DISABLE tells the device to stop I/O transactions. When it -stops transactions, or what it should do with unfinished transactions -is a policy of the driver. After this call, the driver should not -accept any other I/O requests. - -SUSPEND_SAVE_STATE tells the device to save the context of the -hardware. This includes any bus-specific hardware state and -device-specific hardware state. A pointer to this saved state can be -stored in the device's saved_state field. - -SUSPEND_POWER_DOWN tells the driver to place the device in the low -power state requested. - -Whether suspend is called with a given level is a policy of the -platform. Some levels may be omitted; drivers must not assume the -reception of any level. However, all levels must be called in the -order above; i.e. notification will always come before disabling; -disabling the device will come before suspending the device. - -All calls are made with interrupts enabled, except for the -SUSPEND_POWER_DOWN level. - - int (*resume) (struct device * dev, u32 level); - -Resume is used to bring a device back from a low power state. Like the -suspend transition, it happens in several stages. - -RESUME_POWER_ON tells the driver to set the power state to the state -before the suspend call (The device could have already been in a low -power state before the suspend call to put in a lower power state). - -RESUME_RESTORE_STATE tells the driver to restore the state saved by -the SUSPEND_SAVE_STATE suspend call. - -RESUME_ENABLE tells the driver to start accepting I/O transactions -again. Depending on driver policy, the device may already have pending -I/O requests. - -RESUME_POWER_ON is called with interrupts disabled. The other resume -levels are called with interrupts enabled. - -As with the various suspend stages, the driver must not assume that -any other resume calls have been or will be made. Each call should be -self-contained and not dependent on any external state. +Resume is used to bring a device back from a low power state. Attributes diff --git a/Documentation/driver-model/porting.txt b/Documentation/driver-model/porting.txt index ff2fef2107f0..98b233cb8b36 100644 --- a/Documentation/driver-model/porting.txt +++ b/Documentation/driver-model/porting.txt @@ -350,7 +350,7 @@ When a driver is registered, the bus's list of devices is iterated over. bus->match() is called for each device that is not already claimed by a driver. -When a device is successfully bound to a device, device->driver is +When a device is successfully bound to a driver, device->driver is set, the device is added to a per-driver list of devices, and a symlink is created in the driver's sysfs directory that points to the device's physical directory: diff --git a/Documentation/fb/vesafb.txt b/Documentation/fb/vesafb.txt index 62db6758d1c1..ee277dd204b0 100644 --- a/Documentation/fb/vesafb.txt +++ b/Documentation/fb/vesafb.txt @@ -146,10 +146,10 @@ pmipal Use the protected mode interface for palette changes. mtrr:n setup memory type range registers for the vesafb framebuffer where n: - 0 - disabled (equivalent to nomtrr) + 0 - disabled (equivalent to nomtrr) (default) 1 - uncachable 2 - write-back - 3 - write-combining (default) + 3 - write-combining 4 - write-through If you see the following in dmesg, choose the type that matches the diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt index b67189a8d8d4..decdf9917e0d 100644 --- a/Documentation/feature-removal-schedule.txt +++ b/Documentation/feature-removal-schedule.txt @@ -69,6 +69,22 @@ Who: Grant Coady <gcoady@gmail.com> --------------------------- +What: remove EXPORT_SYMBOL(panic_timeout) +When: April 2006 +Files: kernel/panic.c +Why: No modular usage in the kernel. +Who: Adrian Bunk <bunk@stusta.de> + +--------------------------- + +What: remove EXPORT_SYMBOL(insert_resource) +When: April 2006 +Files: kernel/resource.c +Why: No modular usage in the kernel. +Who: Adrian Bunk <bunk@stusta.de> + +--------------------------- + What: PCMCIA control ioctl (needed for pcmcia-cs [cardmgr, cardctl]) When: November 2005 Files: drivers/pcmcia/: pcmcia_ioctl.c diff --git a/Documentation/filesystems/dentry-locking.txt b/Documentation/filesystems/dentry-locking.txt new file mode 100644 index 000000000000..4c0c575a4012 --- /dev/null +++ b/Documentation/filesystems/dentry-locking.txt @@ -0,0 +1,173 @@ +RCU-based dcache locking model +============================== + +On many workloads, the most common operation on dcache is to look up a +dentry, given a parent dentry and the name of the child. Typically, +for every open(), stat() etc., the dentry corresponding to the +pathname will be looked up by walking the tree starting with the first +component of the pathname and using that dentry along with the next +component to look up the next level and so on. Since it is a frequent +operation for workloads like multiuser environments and web servers, +it is important to optimize this path. + +Prior to 2.5.10, dcache_lock was acquired in d_lookup and thus in +every component during path look-up. Since 2.5.10 onwards, fast-walk +algorithm changed this by holding the dcache_lock at the beginning and +walking as many cached path component dentries as possible. This +significantly decreases the number of acquisition of +dcache_lock. However it also increases the lock hold time +significantly and affects performance in large SMP machines. Since +2.5.62 kernel, dcache has been using a new locking model that uses RCU +to make dcache look-up lock-free. + +The current dcache locking model is not very different from the +existing dcache locking model. Prior to 2.5.62 kernel, dcache_lock +protected the hash chain, d_child, d_alias, d_lru lists as well as +d_inode and several other things like mount look-up. RCU-based changes +affect only the way the hash chain is protected. For everything else +the dcache_lock must be taken for both traversing as well as +updating. The hash chain updates too take the dcache_lock. The +significant change is the way d_lookup traverses the hash chain, it +doesn't acquire the dcache_lock for this and rely on RCU to ensure +that the dentry has not been *freed*. + + +Dcache locking details +====================== + +For many multi-user workloads, open() and stat() on files are very +frequently occurring operations. Both involve walking of path names to +find the dentry corresponding to the concerned file. In 2.4 kernel, +dcache_lock was held during look-up of each path component. Contention +and cache-line bouncing of this global lock caused significant +scalability problems. With the introduction of RCU in Linux kernel, +this was worked around by making the look-up of path components during +path walking lock-free. + + +Safe lock-free look-up of dcache hash table +=========================================== + +Dcache is a complex data structure with the hash table entries also +linked together in other lists. In 2.4 kernel, dcache_lock protected +all the lists. We applied RCU only on hash chain walking. The rest of +the lists are still protected by dcache_lock. Some of the important +changes are : + +1. The deletion from hash chain is done using hlist_del_rcu() macro + which doesn't initialize next pointer of the deleted dentry and + this allows us to walk safely lock-free while a deletion is + happening. + +2. Insertion of a dentry into the hash table is done using + hlist_add_head_rcu() which take care of ordering the writes - the + writes to the dentry must be visible before the dentry is + inserted. This works in conjunction with hlist_for_each_rcu() while + walking the hash chain. The only requirement is that all + initialization to the dentry must be done before + hlist_add_head_rcu() since we don't have dcache_lock protection + while traversing the hash chain. This isn't different from the + existing code. + +3. The dentry looked up without holding dcache_lock by cannot be + returned for walking if it is unhashed. It then may have a NULL + d_inode or other bogosity since RCU doesn't protect the other + fields in the dentry. We therefore use a flag DCACHE_UNHASHED to + indicate unhashed dentries and use this in conjunction with a + per-dentry lock (d_lock). Once looked up without the dcache_lock, + we acquire the per-dentry lock (d_lock) and check if the dentry is + unhashed. If so, the look-up is failed. If not, the reference count + of the dentry is increased and the dentry is returned. + +4. Once a dentry is looked up, it must be ensured during the path walk + for that component it doesn't go away. In pre-2.5.10 code, this was + done holding a reference to the dentry. dcache_rcu does the same. + In some sense, dcache_rcu path walking looks like the pre-2.5.10 + version. + +5. All dentry hash chain updates must take the dcache_lock as well as + the per-dentry lock in that order. dput() does this to ensure that + a dentry that has just been looked up in another CPU doesn't get + deleted before dget() can be done on it. + +6. There are several ways to do reference counting of RCU protected + objects. One such example is in ipv4 route cache where deferred + freeing (using call_rcu()) is done as soon as the reference count + goes to zero. This cannot be done in the case of dentries because + tearing down of dentries require blocking (dentry_iput()) which + isn't supported from RCU callbacks. Instead, tearing down of + dentries happen synchronously in dput(), but actual freeing happens + later when RCU grace period is over. This allows safe lock-free + walking of the hash chains, but a matched dentry may have been + partially torn down. The checking of DCACHE_UNHASHED flag with + d_lock held detects such dentries and prevents them from being + returned from look-up. + + +Maintaining POSIX rename semantics +================================== + +Since look-up of dentries is lock-free, it can race against a +concurrent rename operation. For example, during rename of file A to +B, look-up of either A or B must succeed. So, if look-up of B happens +after A has been removed from the hash chain but not added to the new +hash chain, it may fail. Also, a comparison while the name is being +written concurrently by a rename may result in false positive matches +violating rename semantics. Issues related to race with rename are +handled as described below : + +1. Look-up can be done in two ways - d_lookup() which is safe from + simultaneous renames and __d_lookup() which is not. If + __d_lookup() fails, it must be followed up by a d_lookup() to + correctly determine whether a dentry is in the hash table or + not. d_lookup() protects look-ups using a sequence lock + (rename_lock). + +2. The name associated with a dentry (d_name) may be changed if a + rename is allowed to happen simultaneously. To avoid memcmp() in + __d_lookup() go out of bounds due to a rename and false positive + comparison, the name comparison is done while holding the + per-dentry lock. This prevents concurrent renames during this + operation. + +3. Hash table walking during look-up may move to a different bucket as + the current dentry is moved to a different bucket due to rename. + But we use hlists in dcache hash table and they are + null-terminated. So, even if a dentry moves to a different bucket, + hash chain walk will terminate. [with a list_head list, it may not + since termination is when the list_head in the original bucket is + reached]. Since we redo the d_parent check and compare name while + holding d_lock, lock-free look-up will not race against d_move(). + +4. There can be a theoretical race when a dentry keeps coming back to + original bucket due to double moves. Due to this look-up may + consider that it has never moved and can end up in a infinite loop. + But this is not any worse that theoretical livelocks we already + have in the kernel. + + +Important guidelines for filesystem developers related to dcache_rcu +==================================================================== + +1. Existing dcache interfaces (pre-2.5.62) exported to filesystem + don't change. Only dcache internal implementation changes. However + filesystems *must not* delete from the dentry hash chains directly + using the list macros like allowed earlier. They must use dcache + APIs like d_drop() or __d_drop() depending on the situation. + +2. d_flags is now protected by a per-dentry lock (d_lock). All access + to d_flags must be protected by it. + +3. For a hashed dentry, checking of d_count needs to be protected by + d_lock. + + +Papers and other documentation on dcache locking +================================================ + +1. Scaling dcache with RCU (http://linuxjournal.com/article.php?sid=7124). + +2. http://lse.sourceforge.net/locking/dcache/dcache.html + + + diff --git a/Documentation/filesystems/devfs/README b/Documentation/filesystems/devfs/README index 54366ecc241f..aabfba24bc2e 100644 --- a/Documentation/filesystems/devfs/README +++ b/Documentation/filesystems/devfs/README @@ -1812,11 +1812,6 @@ it may overflow the messages buffer, but try to get as much of it as you can -if you get an Oops, run ksymoops to decode it so that the -names of the offending functions are provided. A non-decoded Oops is -pretty useless - - send a copy of your devfsd configuration file(s) send the bug report to me first. diff --git a/Documentation/filesystems/ntfs.txt b/Documentation/filesystems/ntfs.txt index a5fbc8e897fa..614de3124901 100644 --- a/Documentation/filesystems/ntfs.txt +++ b/Documentation/filesystems/ntfs.txt @@ -50,9 +50,14 @@ userspace utilities, etc. Features ======== -- This is a complete rewrite of the NTFS driver that used to be in the kernel. - This new driver implements NTFS read support and is functionally equivalent - to the old ntfs driver. +- This is a complete rewrite of the NTFS driver that used to be in the 2.4 and + earlier kernels. This new driver implements NTFS read support and is + functionally equivalent to the old ntfs driver and it also implements limited + write support. The biggest limitation at present is that files/directories + cannot be created or deleted. See below for the list of write features that + are so far supported. Another limitation is that writing to compressed files + is not implemented at all. Also, neither read nor write access to encrypted + files is so far implemented. - The new driver has full support for sparse files on NTFS 3.x volumes which the old driver isn't happy with. - The new driver supports execution of binaries due to mmap() now being @@ -78,7 +83,20 @@ Features - The new driver supports fsync(2), fdatasync(2), and msync(2). - The new driver supports readv(2) and writev(2). - The new driver supports access time updates (including mtime and ctime). - +- The new driver supports truncate(2) and open(2) with O_TRUNC. But at present + only very limited support for highly fragmented files, i.e. ones which have + their data attribute split across multiple extents, is included. Another + limitation is that at present truncate(2) will never create sparse files, + since to mark a file sparse we need to modify the directory entry for the + file and we do not implement directory modifications yet. +- The new driver supports write(2) which can both overwrite existing data and + extend the file size so that you can write beyond the existing data. Also, + writing into sparse regions is supported and the holes are filled in with + clusters. But at present only limited support for highly fragmented files, + i.e. ones which have their data attribute split across multiple extents, is + included. Another limitation is that write(2) will never create sparse + files, since to mark a file sparse we need to modify the directory entry for + the file and we do not implement directory modifications yet. Supported mount options ======================= @@ -439,6 +457,22 @@ ChangeLog Note, a technical ChangeLog aimed at kernel hackers is in fs/ntfs/ChangeLog. +2.1.25: + - Write support is now extended with write(2) being able to both + overwrite existing file data and to extend files. Also, if a write + to a sparse region occurs, write(2) will fill in the hole. Note, + mmap(2) based writes still do not support writing into holes or + writing beyond the initialized size. + - Write support has a new feature and that is that truncate(2) and + open(2) with O_TRUNC are now implemented thus files can be both made + smaller and larger. + - Note: Both write(2) and truncate(2)/open(2) with O_TRUNC still have + limitations in that they + - only provide limited support for highly fragmented files. + - only work on regular, i.e. uncompressed and unencrypted files. + - never create sparse files although this will change once directory + operations are implemented. + - Lots of bug fixes and enhancements across the board. 2.1.24: - Support journals ($LogFile) which have been modified by chkdsk. This means users can boot into Windows after we marked the volume dirty. diff --git a/Documentation/filesystems/ramfs-rootfs-initramfs.txt b/Documentation/filesystems/ramfs-rootfs-initramfs.txt new file mode 100644 index 000000000000..b3404a032596 --- /dev/null +++ b/Documentation/filesystems/ramfs-rootfs-initramfs.txt @@ -0,0 +1,195 @@ +ramfs, rootfs and initramfs +October 17, 2005 +Rob Landley <rob@landley.net> +============================= + +What is ramfs? +-------------- + +Ramfs is a very simple filesystem that exports Linux's disk caching +mechanisms (the page cache and dentry cache) as a dynamically resizable +ram-based filesystem. + +Normally all files are cached in memory by Linux. Pages of data read from +backing store (usually the block device the filesystem is mounted on) are kept +around in case it's needed again, but marked as clean (freeable) in case the +Virtual Memory system needs the memory for something else. Similarly, data +written to files is marked clean as soon as it has been written to backing +store, but kept around for caching purposes until the VM reallocates the +memory. A similar mechanism (the dentry cache) greatly speeds up access to +directories. + +With ramfs, there is no backing store. Files written into ramfs allocate +dentries and page cache as usual, but there's nowhere to write them to. +This means the pages are never marked clean, so they can't be freed by the +VM when it's looking to recycle memory. + +The amount of code required to implement ramfs is tiny, because all the +work is done by the existing Linux caching infrastructure. Basically, +you're mounting the disk cache as a filesystem. Because of this, ramfs is not +an optional component removable via menuconfig, since there would be negligible +space savings. + +ramfs and ramdisk: +------------------ + +The older "ram disk" mechanism created a synthetic block device out of +an area of ram and used it as backing store for a filesystem. This block +device was of fixed size, so the filesystem mounted on it was of fixed +size. Using a ram disk also required unnecessarily copying memory from the +fake block device into the page cache (and copying changes back out), as well +as creating and destroying dentries. Plus it needed a filesystem driver +(such as ext2) to format and interpret this data. + +Compared to ramfs, this wastes memory (and memory bus bandwidth), creates +unnecessary work for the CPU, and pollutes the CPU caches. (There are tricks +to avoid this copying by playing with the page tables, but they're unpleasantly +complicated and turn out to be about as expensive as the copying anyway.) +More to the point, all the work ramfs is doing has to happen _anyway_, +since all file access goes through the page and dentry caches. The ram +disk is simply unnecessary, ramfs is internally much simpler. + +Another reason ramdisks are semi-obsolete is that the introduction of +loopback devices offered a more flexible and convenient way to create +synthetic block devices, now from files instead of from chunks of memory. +See losetup (8) for details. + +ramfs and tmpfs: +---------------- + +One downside of ramfs is you can keep writing data into it until you fill +up all memory, and the VM can't free it because the VM thinks that files +should get written to backing store (rather than swap space), but ramfs hasn't +got any backing store. Because of this, only root (or a trusted user) should +be allowed write access to a ramfs mount. + +A ramfs derivative called tmpfs was created to add size limits, and the ability +to write the data to swap space. Normal users can be allowed write access to +tmpfs mounts. See Documentation/filesystems/tmpfs.txt for more information. + +What is rootfs? +--------------- + +Rootfs is a special instance of ramfs, which is always present in 2.6 systems. +(It's used internally as the starting and stopping point for searches of the +kernel's doubly-linked list of mount points.) + +Most systems just mount another filesystem over it and ignore it. The +amount of space an empty instance of ramfs takes up is tiny. + +What is initramfs? +------------------ + +All 2.6 Linux kernels contain a gzipped "cpio" format archive, which is +extracted into rootfs when the kernel boots up. After extracting, the kernel +checks to see if rootfs contains a file "init", and if so it executes it as PID +1. If found, this init process is responsible for bringing the system the +rest of the way up, including locating and mounting the real root device (if +any). If rootfs does not contain an init program after the embedded cpio +archive is extracted into it, the kernel will fall through to the older code +to locate and mount a root partition, then exec some variant of /sbin/init +out of that. + +All this differs from the old initrd in several ways: + + - The old initrd was a separate file, while the initramfs archive is linked + into the linux kernel image. (The directory linux-*/usr is devoted to + generating this archive during the build.) + + - The old initrd file was a gzipped filesystem image (in some file format, + such as ext2, that had to be built into the kernel), while the new + initramfs archive is a gzipped cpio archive (like tar only simpler, + see cpio(1) and Documentation/early-userspace/buffer-format.txt). + + - The program run by the old initrd (which was called /initrd, not /init) did + some setup and then returned to the kernel, while the init program from + initramfs is not expected to return to the kernel. (If /init needs to hand + off control it can overmount / with a new root device and exec another init + program. See the switch_root utility, below.) + + - When switching another root device, initrd would pivot_root and then + umount the ramdisk. But initramfs is rootfs: you can neither pivot_root + rootfs, nor unmount it. Instead delete everything out of rootfs to + free up the space (find -xdev / -exec rm '{}' ';'), overmount rootfs + with the new root (cd /newmount; mount --move . /; chroot .), attach + stdin/stdout/stderr to the new /dev/console, and exec the new init. + + Since this is a remarkably persnickity process (and involves deleting + commands before you can run them), the klibc package introduced a helper + program (utils/run_init.c) to do all this for you. Most other packages + (such as busybox) have named this command "switch_root". + +Populating initramfs: +--------------------- + +The 2.6 kernel build process always creates a gzipped cpio format initramfs +archive and links it into the resulting kernel binary. By default, this +archive is empty (consuming 134 bytes on x86). The config option +CONFIG_INITRAMFS_SOURCE (for some reason buried under devices->block devices +in menuconfig, and living in usr/Kconfig) can be used to specify a source for +the initramfs archive, which will automatically be incorporated into the +resulting binary. This option can point to an existing gzipped cpio archive, a +directory containing files to be archived, or a text file specification such +as the following example: + + dir /dev 755 0 0 + nod /dev/console 644 0 0 c 5 1 + nod /dev/loop0 644 0 0 b 7 0 + dir /bin 755 1000 1000 + slink /bin/sh busybox 777 0 0 + file /bin/busybox initramfs/busybox 755 0 0 + dir /proc 755 0 0 + dir /sys 755 0 0 + dir /mnt 755 0 0 + file /init initramfs/init.sh 755 0 0 + +One advantage of the text file is that root access is not required to +set permissions or create device nodes in the new archive. (Note that those +two example "file" entries expect to find files named "init.sh" and "busybox" in +a directory called "initramfs", under the linux-2.6.* directory. See +Documentation/early-userspace/README for more details.) + +If you don't already understand what shared libraries, devices, and paths +you need to get a minimal root filesystem up and running, here are some +references: +http://www.tldp.org/HOWTO/Bootdisk-HOWTO/ +http://www.tldp.org/HOWTO/From-PowerUp-To-Bash-Prompt-HOWTO.html +http://www.linuxfromscratch.org/lfs/view/stable/ + +The "klibc" package (http://www.kernel.org/pub/linux/libs/klibc) is +designed to be a tiny C library to statically link early userspace +code against, along with some related utilities. It is BSD licensed. + +I use uClibc (http://www.uclibc.org) and busybox (http://www.busybox.net) +myself. These are LGPL and GPL, respectively. + +In theory you could use glibc, but that's not well suited for small embedded +uses like this. (A "hello world" program statically linked against glibc is +over 400k. With uClibc it's 7k. Also note that glibc dlopens libnss to do +name lookups, even when otherwise statically linked.) + +Future directions: +------------------ + +Today (2.6.14), initramfs is always compiled in, but not always used. The +kernel falls back to legacy boot code that is reached only if initramfs does +not contain an /init program. The fallback is legacy code, there to ensure a +smooth transition and allowing early boot functionality to gradually move to +"early userspace" (I.E. initramfs). + +The move to early userspace is necessary because finding and mounting the real +root device is complex. Root partitions can span multiple devices (raid or +separate journal). They can be out on the network (requiring dhcp, setting a +specific mac address, logging into a server, etc). They can live on removable +media, with dynamically allocated major/minor numbers and persistent naming +issues requiring a full udev implementation to sort out. They can be +compressed, encrypted, copy-on-write, loopback mounted, strangely partitioned, +and so on. + +This kind of complexity (which inevitably includes policy) is rightly handled +in userspace. Both klibc and busybox/uClibc are working on simple initramfs +packages to drop into a kernel build, and when standard solutions are ready +and widely deployed, the kernel's legacy early boot code will become obsolete +and a candidate for the feature removal schedule. + +But that's a while off yet. diff --git a/Documentation/filesystems/relayfs.txt b/Documentation/filesystems/relayfs.txt index d24e1b0d4f39..d803abed29f0 100644 --- a/Documentation/filesystems/relayfs.txt +++ b/Documentation/filesystems/relayfs.txt @@ -15,7 +15,7 @@ retrieve the data as it becomes available. The format of the data logged into the channel buffers is completely up to the relayfs client; relayfs does however provide hooks which -allow clients to impose some stucture on the buffer data. Nor does +allow clients to impose some structure on the buffer data. Nor does relayfs implement any form of data filtering - this also is left to the client. The purpose is to keep relayfs as simple as possible. diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index f042c12e0ed2..ee4c0a8b8db7 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt @@ -3,7 +3,7 @@ Original author: Richard Gooch <rgooch@atnf.csiro.au> - Last updated on August 25, 2005 + Last updated on October 28, 2005 Copyright (C) 1999 Richard Gooch Copyright (C) 2005 Pekka Enberg @@ -11,62 +11,61 @@ This file is released under the GPLv2. -What is it? -=========== +Introduction +============ -The Virtual File System (otherwise known as the Virtual Filesystem -Switch) is the software layer in the kernel that provides the -filesystem interface to userspace programs. It also provides an -abstraction within the kernel which allows different filesystem -implementations to coexist. +The Virtual File System (also known as the Virtual Filesystem Switch) +is the software layer in the kernel that provides the filesystem +interface to userspace programs. It also provides an abstraction +within the kernel which allows different filesystem implementations to +coexist. +VFS system calls open(2), stat(2), read(2), write(2), chmod(2) and so +on are called from a process context. Filesystem locking is described +in the document Documentation/filesystems/Locking. -A Quick Look At How It Works -============================ -In this section I'll briefly describe how things work, before -launching into the details. I'll start with describing what happens -when user programs open and manipulate files, and then look from the -other view which is how a filesystem is supported and subsequently -mounted. - - -Opening a File --------------- - -The VFS implements the open(2), stat(2), chmod(2) and similar system -calls. The pathname argument is used by the VFS to search through the -directory entry cache (dentry cache or "dcache"). This provides a very -fast look-up mechanism to translate a pathname (filename) into a -specific dentry. - -An individual dentry usually has a pointer to an inode. Inodes are the -things that live on disc drives, and can be regular files (you know: -those things that you write data into), directories, FIFOs and other -beasts. Dentries live in RAM and are never saved to disc: they exist -only for performance. Inodes live on disc and are copied into memory -when required. Later any changes are written back to disc. The inode -that lives in RAM is a VFS inode, and it is this which the dentry -points to. A single inode can be pointed to by multiple dentries -(think about hardlinks). - -The dcache is meant to be a view into your entire filespace. Unlike -Linus, most of us losers can't fit enough dentries into RAM to cover -all of our filespace, so the dcache has bits missing. In order to -resolve your pathname into a dentry, the VFS may have to resort to -creating dentries along the way, and then loading the inode. This is -done by looking up the inode. - -To look up an inode (usually read from disc) requires that the VFS -calls the lookup() method of the parent directory inode. This method -is installed by the specific filesystem implementation that the inode -lives in. There will be more on this later. +Directory Entry Cache (dcache) +------------------------------ -Once the VFS has the required dentry (and hence the inode), we can do -all those boring things like open(2) the file, or stat(2) it to peek -at the inode data. The stat(2) operation is fairly simple: once the -VFS has the dentry, it peeks at the inode data and passes some of it -back to userspace. +The VFS implements the open(2), stat(2), chmod(2), and similar system +calls. The pathname argument that is passed to them is used by the VFS +to search through the directory entry cache (also known as the dentry +cache or dcache). This provides a very fast look-up mechanism to +translate a pathname (filename) into a specific dentry. Dentries live +in RAM and are never saved to disc: they exist only for performance. + +The dentry cache is meant to be a view into your entire filespace. As +most computers cannot fit all dentries in the RAM at the same time, +some bits of the cache are missing. In order to resolve your pathname +into a dentry, the VFS may have to resort to creating dentries along +the way, and then loading the inode. This is done by looking up the +inode. + + +The Inode Object +---------------- + +An individual dentry usually has a pointer to an inode. Inodes are +filesystem objects such as regular files, directories, FIFOs and other +beasts. They live either on the disc (for block device filesystems) +or in the memory (for pseudo filesystems). Inodes that live on the +disc are copied into the memory when required and changes to the inode +are written back to disc. A single inode can be pointed to by multiple +dentries (hard links, for example, do this). + +To look up an inode requires that the VFS calls the lookup() method of +the parent directory inode. This method is installed by the specific +filesystem implementation that the inode lives in. Once the VFS has +the required dentry (and hence the inode), we can do all those boring +things like open(2) the file, or stat(2) it to peek at the inode +data. The stat(2) operation is fairly simple: once the VFS has the +dentry, it peeks at the inode data and passes some of it back to +userspace. + + +The File Object +--------------- Opening a file requires another operation: allocation of a file structure (this is the kernel-side implementation of file @@ -74,51 +73,39 @@ descriptors). The freshly allocated file structure is initialized with a pointer to the dentry and a set of file operation member functions. These are taken from the inode data. The open() file method is then called so the specific filesystem implementation can do it's work. You -can see that this is another switch performed by the VFS. - -The file structure is placed into the file descriptor table for the -process. +can see that this is another switch performed by the VFS. The file +structure is placed into the file descriptor table for the process. Reading, writing and closing files (and other assorted VFS operations) is done by using the userspace file descriptor to grab the appropriate -file structure, and then calling the required file structure method -function to do whatever is required. - -For as long as the file is open, it keeps the dentry "open" (in use), -which in turn means that the VFS inode is still in use. - -All VFS system calls (i.e. open(2), stat(2), read(2), write(2), -chmod(2) and so on) are called from a process context. You should -assume that these calls are made without any kernel locks being -held. This means that the processes may be executing the same piece of -filesystem or driver code at the same time, on different -processors. You should ensure that access to shared resources is -protected by appropriate locks. +file structure, and then calling the required file structure method to +do whatever is required. For as long as the file is open, it keeps the +dentry in use, which in turn means that the VFS inode is still in use. Registering and Mounting a Filesystem -------------------------------------- +===================================== -If you want to support a new kind of filesystem in the kernel, all you -need to do is call register_filesystem(). You pass a structure -describing the filesystem implementation (struct file_system_type) -which is then added to an internal table of supported filesystems. You -can do: +To register and unregister a filesystem, use the following API +functions: -% cat /proc/filesystems + #include <linux/fs.h> -to see what filesystems are currently available on your system. + extern int register_filesystem(struct file_system_type *); + extern int unregister_filesystem(struct file_system_type *); -When a request is made to mount a block device onto a directory in -your filespace the VFS will call the appropriate method for the -specific filesystem. The dentry for the mount point will then be -updated to point to the root inode for the new filesystem. +The passed struct file_system_type describes your filesystem. When a +request is made to mount a device onto a directory in your filespace, +the VFS will call the appropriate get_sb() method for the specific +filesystem. The dentry for the mount point will then be updated to +point to the root inode for the new filesystem. -It's now time to look at things in more detail. +You can see all filesystems that are registered to the kernel in the +file /proc/filesystems. struct file_system_type -======================= +----------------------- This describes the filesystem. As of kernel 2.6.13, the following members are defined: @@ -197,8 +184,14 @@ A fill_super() method implementation has the following arguments: int silent: whether or not to be silent on error +The Superblock Object +===================== + +A superblock object represents a mounted filesystem. + + struct super_operations -======================= +----------------------- This describes how the VFS can manipulate the superblock of your filesystem. As of kernel 2.6.13, the following members are defined: @@ -286,9 +279,9 @@ or bottom half). a superblock. The second parameter indicates whether the method should wait until the write out has been completed. Optional. - write_super_lockfs: called when VFS is locking a filesystem and forcing - it into a consistent state. This function is currently used by the - Logical Volume Manager (LVM). + write_super_lockfs: called when VFS is locking a filesystem and + forcing it into a consistent state. This method is currently + used by the Logical Volume Manager (LVM). unlockfs: called when VFS is unlocking a filesystem and making it writable again. @@ -317,8 +310,14 @@ field. This is a pointer to a "struct inode_operations" which describes the methods that can be performed on individual inodes. +The Inode Object +================ + +An inode object represents an object within the filesystem. + + struct inode_operations -======================= +----------------------- This describes how the VFS can manipulate an inode in your filesystem. As of kernel 2.6.13, the following members are defined: @@ -394,51 +393,62 @@ otherwise noted. will probably need to call d_instantiate() just as you would in the create() method + rename: called by the rename(2) system call to rename the object to + have the parent and name given by the second inode and dentry. + readlink: called by the readlink(2) system call. Only required if you want to support reading symbolic links follow_link: called by the VFS to follow a symbolic link to the inode it points to. Only required if you want to support - symbolic links. This function returns a void pointer cookie + symbolic links. This method returns a void pointer cookie that is passed to put_link(). put_link: called by the VFS to release resources allocated by - follow_link(). The cookie returned by follow_link() is passed to - to this function as the last parameter. It is used by filesystems - such as NFS where page cache is not stable (i.e. page that was - installed when the symbolic link walk started might not be in the - page cache at the end of the walk). - - truncate: called by the VFS to change the size of a file. The i_size - field of the inode is set to the desired size by the VFS before - this function is called. This function is called by the truncate(2) - system call and related functionality. + follow_link(). The cookie returned by follow_link() is passed + to to this method as the last parameter. It is used by + filesystems such as NFS where page cache is not stable + (i.e. page that was installed when the symbolic link walk + started might not be in the page cache at the end of the + walk). + + truncate: called by the VFS to change the size of a file. The + i_size field of the inode is set to the desired size by the + VFS before this method is called. This method is called by + the truncate(2) system call and related functionality. permission: called by the VFS to check for access rights on a POSIX-like filesystem. - setattr: called by the VFS to set attributes for a file. This function is - called by chmod(2) and related system calls. + setattr: called by the VFS to set attributes for a file. This method + is called by chmod(2) and related system calls. - getattr: called by the VFS to get attributes of a file. This function is - called by stat(2) and related system calls. + getattr: called by the VFS to get attributes of a file. This method + is called by stat(2) and related system calls. setxattr: called by the VFS to set an extended attribute for a file. - Extended attribute is a name:value pair associated with an inode. This - function is called by setxattr(2) system call. + Extended attribute is a name:value pair associated with an + inode. This method is called by setxattr(2) system call. + + getxattr: called by the VFS to retrieve the value of an extended + attribute name. This method is called by getxattr(2) function + call. - getxattr: called by the VFS to retrieve the value of an extended attribute - name. This function is called by getxattr(2) function call. + listxattr: called by the VFS to list all extended attributes for a + given file. This method is called by listxattr(2) system call. - listxattr: called by the VFS to list all extended attributes for a given - file. This function is called by listxattr(2) system call. + removexattr: called by the VFS to remove an extended attribute from + a file. This method is called by removexattr(2) system call. - removexattr: called by the VFS to remove an extended attribute from a file. - This function is called by removexattr(2) system call. + +The Address Space Object +======================== + +The address space object is used to identify pages in the page cache. struct address_space_operations -=============================== +------------------------------- This describes how the VFS can manipulate mapping of a file to page cache in your filesystem. As of kernel 2.6.13, the following members are defined: @@ -502,8 +512,14 @@ struct address_space_operations { it. An example implementation can be found in fs/ext2/xip.c. +The File Object +=============== + +A file object represents a file opened by a process. + + struct file_operations -====================== +---------------------- This describes how the VFS can manipulate an open file. As of kernel 2.6.13, the following members are defined: @@ -661,7 +677,7 @@ of child dentries. Child dentries are basically like files in a directory. -Directory Entry Cache APIs +Directory Entry Cache API -------------------------- There are a number of functions defined which permit a filesystem to @@ -705,178 +721,24 @@ manipulate dentries: and the dentry is returned. The caller must use d_put() to free the dentry when it finishes using it. +For further information on dentry locking, please refer to the document +Documentation/filesystems/dentry-locking.txt. -RCU-based dcache locking model ------------------------------- -On many workloads, the most common operation on dcache is -to look up a dentry, given a parent dentry and the name -of the child. Typically, for every open(), stat() etc., -the dentry corresponding to the pathname will be looked -up by walking the tree starting with the first component -of the pathname and using that dentry along with the next -component to look up the next level and so on. Since it -is a frequent operation for workloads like multiuser -environments and web servers, it is important to optimize -this path. - -Prior to 2.5.10, dcache_lock was acquired in d_lookup and thus -in every component during path look-up. Since 2.5.10 onwards, -fast-walk algorithm changed this by holding the dcache_lock -at the beginning and walking as many cached path component -dentries as possible. This significantly decreases the number -of acquisition of dcache_lock. However it also increases the -lock hold time significantly and affects performance in large -SMP machines. Since 2.5.62 kernel, dcache has been using -a new locking model that uses RCU to make dcache look-up -lock-free. - -The current dcache locking model is not very different from the existing -dcache locking model. Prior to 2.5.62 kernel, dcache_lock -protected the hash chain, d_child, d_alias, d_lru lists as well -as d_inode and several other things like mount look-up. RCU-based -changes affect only the way the hash chain is protected. For everything -else the dcache_lock must be taken for both traversing as well as -updating. The hash chain updates too take the dcache_lock. -The significant change is the way d_lookup traverses the hash chain, -it doesn't acquire the dcache_lock for this and rely on RCU to -ensure that the dentry has not been *freed*. - - -Dcache locking details ----------------------- +Resources +========= + +(Note some of these resources are not up-to-date with the latest kernel + version.) + +Creating Linux virtual filesystems. 2002 + <http://lwn.net/Articles/13325/> + +The Linux Virtual File-system Layer by Neil Brown. 1999 + <http://www.cse.unsw.edu.au/~neilb/oss/linux-commentary/vfs.html> + +A tour of the Linux VFS by Michael K. Johnson. 1996 + <http://www.tldp.org/LDP/khg/HyperNews/get/fs/vfstour.html> -For many multi-user workloads, open() and stat() on files are -very frequently occurring operations. Both involve walking -of path names to find the dentry corresponding to the -concerned file. In 2.4 kernel, dcache_lock was held -during look-up of each path component. Contention and -cache-line bouncing of this global lock caused significant -scalability problems. With the introduction of RCU -in Linux kernel, this was worked around by making -the look-up of path components during path walking lock-free. - - -Safe lock-free look-up of dcache hash table -=========================================== - -Dcache is a complex data structure with the hash table entries -also linked together in other lists. In 2.4 kernel, dcache_lock -protected all the lists. We applied RCU only on hash chain -walking. The rest of the lists are still protected by dcache_lock. -Some of the important changes are : - -1. The deletion from hash chain is done using hlist_del_rcu() macro which - doesn't initialize next pointer of the deleted dentry and this - allows us to walk safely lock-free while a deletion is happening. - -2. Insertion of a dentry into the hash table is done using - hlist_add_head_rcu() which take care of ordering the writes - - the writes to the dentry must be visible before the dentry - is inserted. This works in conjunction with hlist_for_each_rcu() - while walking the hash chain. The only requirement is that - all initialization to the dentry must be done before hlist_add_head_rcu() - since we don't have dcache_lock protection while traversing - the hash chain. This isn't different from the existing code. - -3. The dentry looked up without holding dcache_lock by cannot be - returned for walking if it is unhashed. It then may have a NULL - d_inode or other bogosity since RCU doesn't protect the other - fields in the dentry. We therefore use a flag DCACHE_UNHASHED to - indicate unhashed dentries and use this in conjunction with a - per-dentry lock (d_lock). Once looked up without the dcache_lock, - we acquire the per-dentry lock (d_lock) and check if the - dentry is unhashed. If so, the look-up is failed. If not, the - reference count of the dentry is increased and the dentry is returned. - -4. Once a dentry is looked up, it must be ensured during the path - walk for that component it doesn't go away. In pre-2.5.10 code, - this was done holding a reference to the dentry. dcache_rcu does - the same. In some sense, dcache_rcu path walking looks like - the pre-2.5.10 version. - -5. All dentry hash chain updates must take the dcache_lock as well as - the per-dentry lock in that order. dput() does this to ensure - that a dentry that has just been looked up in another CPU - doesn't get deleted before dget() can be done on it. - -6. There are several ways to do reference counting of RCU protected - objects. One such example is in ipv4 route cache where - deferred freeing (using call_rcu()) is done as soon as - the reference count goes to zero. This cannot be done in - the case of dentries because tearing down of dentries - require blocking (dentry_iput()) which isn't supported from - RCU callbacks. Instead, tearing down of dentries happen - synchronously in dput(), but actual freeing happens later - when RCU grace period is over. This allows safe lock-free - walking of the hash chains, but a matched dentry may have - been partially torn down. The checking of DCACHE_UNHASHED - flag with d_lock held detects such dentries and prevents - them from being returned from look-up. - - -Maintaining POSIX rename semantics -================================== - -Since look-up of dentries is lock-free, it can race against -a concurrent rename operation. For example, during rename -of file A to B, look-up of either A or B must succeed. -So, if look-up of B happens after A has been removed from the -hash chain but not added to the new hash chain, it may fail. -Also, a comparison while the name is being written concurrently -by a rename may result in false positive matches violating -rename semantics. Issues related to race with rename are -handled as described below : - -1. Look-up can be done in two ways - d_lookup() which is safe - from simultaneous renames and __d_lookup() which is not. - If __d_lookup() fails, it must be followed up by a d_lookup() - to correctly determine whether a dentry is in the hash table - or not. d_lookup() protects look-ups using a sequence - lock (rename_lock). - -2. The name associated with a dentry (d_name) may be changed if - a rename is allowed to happen simultaneously. To avoid memcmp() - in __d_lookup() go out of bounds due to a rename and false - positive comparison, the name comparison is done while holding the - per-dentry lock. This prevents concurrent renames during this - operation. - -3. Hash table walking during look-up may move to a different bucket as - the current dentry is moved to a different bucket due to rename. - But we use hlists in dcache hash table and they are null-terminated. - So, even if a dentry moves to a different bucket, hash chain - walk will terminate. [with a list_head list, it may not since - termination is when the list_head in the original bucket is reached]. - Since we redo the d_parent check and compare name while holding - d_lock, lock-free look-up will not race against d_move(). - -4. There can be a theoretical race when a dentry keeps coming back - to original bucket due to double moves. Due to this look-up may - consider that it has never moved and can end up in a infinite loop. - But this is not any worse that theoretical livelocks we already - have in the kernel. - - -Important guidelines for filesystem developers related to dcache_rcu -==================================================================== - -1. Existing dcache interfaces (pre-2.5.62) exported to filesystem - don't change. Only dcache internal implementation changes. However - filesystems *must not* delete from the dentry hash chains directly - using the list macros like allowed earlier. They must use dcache - APIs like d_drop() or __d_drop() depending on the situation. - -2. d_flags is now protected by a per-dentry lock (d_lock). All - access to d_flags must be protected by it. - -3. For a hashed dentry, checking of d_count needs to be protected - by d_lock. - - -Papers and other documentation on dcache locking -================================================ - -1. Scaling dcache with RCU (http://linuxjournal.com/article.php?sid=7124). - -2. http://lse.sourceforge.net/locking/dcache/dcache.html +A small trail through the Linux kernel by Andries Brouwer. 2001 + <http://www.win.tue.nl/~aeb/linux/vfs/trail.html> diff --git a/Documentation/filesystems/xfs.txt b/Documentation/filesystems/xfs.txt index c7d5d0c7067d..74aeb142ae5f 100644 --- a/Documentation/filesystems/xfs.txt +++ b/Documentation/filesystems/xfs.txt @@ -19,15 +19,43 @@ Mount Options When mounting an XFS filesystem, the following options are accepted. - biosize=size - Sets the preferred buffered I/O size (default size is 64K). - "size" must be expressed as the logarithm (base2) of the - desired I/O size. - Valid values for this option are 14 through 16, inclusive - (i.e. 16K, 32K, and 64K bytes). On machines with a 4K - pagesize, 13 (8K bytes) is also a valid size. - The preferred buffered I/O size can also be altered on an - individual file basis using the ioctl(2) system call. + allocsize=size + Sets the buffered I/O end-of-file preallocation size when + doing delayed allocation writeout (default size is 64KiB). + Valid values for this option are page size (typically 4KiB) + through to 1GiB, inclusive, in power-of-2 increments. + + attr2/noattr2 + The options enable/disable (default is disabled for backward + compatibility on-disk) an "opportunistic" improvement to be + made in the way inline extended attributes are stored on-disk. + When the new form is used for the first time (by setting or + removing extended attributes) the on-disk superblock feature + bit field will be updated to reflect this format being in use. + + barrier + Enables the use of block layer write barriers for writes into + the journal and unwritten extent conversion. This allows for + drive level write caching to be enabled, for devices that + support write barriers. + + dmapi + Enable the DMAPI (Data Management API) event callouts. + Use with the "mtpt" option. + + grpid/bsdgroups and nogrpid/sysvgroups + These options define what group ID a newly created file gets. + When grpid is set, it takes the group ID of the directory in + which it is created; otherwise (the default) it takes the fsgid + of the current process, unless the directory has the setgid bit + set, in which case it takes the gid from the parent directory, + and also gets the setgid bit set if it is a directory itself. + + ihashsize=value + Sets the number of hash buckets available for hashing the + in-memory inodes of the specified mount point. If a value + of zero is used, the value selected by the default algorithm + will be displayed in /proc/mounts. ikeep/noikeep When inode clusters are emptied of inodes, keep them around @@ -35,12 +63,31 @@ When mounting an XFS filesystem, the following options are accepted. and is still the default for now. Using the noikeep option, inode clusters are returned to the free space pool. + inode64 + Indicates that XFS is allowed to create inodes at any location + in the filesystem, including those which will result in inode + numbers occupying more than 32 bits of significance. This is + provided for backwards compatibility, but causes problems for + backup applications that cannot handle large inode numbers. + + largeio/nolargeio + If "nolargeio" is specified, the optimal I/O reported in + st_blksize by stat(2) will be as small as possible to allow user + applications to avoid inefficient read/modify/write I/O. + If "largeio" specified, a filesystem that has a "swidth" specified + will return the "swidth" value (in bytes) in st_blksize. If the + filesystem does not have a "swidth" specified but does specify + an "allocsize" then "allocsize" (in bytes) will be returned + instead. + If neither of these two options are specified, then filesystem + will behave as if "nolargeio" was specified. + logbufs=value Set the number of in-memory log buffers. Valid numbers range from 2-8 inclusive. The default value is 8 buffers for filesystems with a - blocksize of 64K, 4 buffers for filesystems with a blocksize - of 32K, 3 buffers for filesystems with a blocksize of 16K + blocksize of 64KiB, 4 buffers for filesystems with a blocksize + of 32KiB, 3 buffers for filesystems with a blocksize of 16KiB and 2 buffers for all other configurations. Increasing the number of buffers may increase performance on some workloads at the cost of the memory used for the additional log buffers @@ -49,10 +96,10 @@ When mounting an XFS filesystem, the following options are accepted. logbsize=value Set the size of each in-memory log buffer. Size may be specified in bytes, or in kilobytes with a "k" suffix. - Valid sizes for version 1 and version 2 logs are 16384 (16k) and - 32768 (32k). Valid sizes for version 2 logs also include + Valid sizes for version 1 and version 2 logs are 16384 (16k) and + 32768 (32k). Valid sizes for version 2 logs also include 65536 (64k), 131072 (128k) and 262144 (256k). - The default value for machines with more than 32MB of memory + The default value for machines with more than 32MiB of memory is 32768, machines with less memory use 16384 by default. logdev=device and rtdev=device @@ -62,6 +109,11 @@ When mounting an XFS filesystem, the following options are accepted. optional, and the log section can be separate from the data section or contained within it. + mtpt=mountpoint + Use with the "dmapi" option. The value specified here will be + included in the DMAPI mount event, and should be the path of + the actual mountpoint that is used. + noalign Data allocations will not be aligned at stripe unit boundaries. @@ -91,13 +143,17 @@ When mounting an XFS filesystem, the following options are accepted. O_SYNC writes can be lost if the system crashes. If timestamp updates are critical, use the osyncisosync option. - quota/usrquota/uqnoenforce + uquota/usrquota/uqnoenforce/quota User disk quota accounting enabled, and limits (optionally) - enforced. + enforced. Refer to xfs_quota(8) for further details. - grpquota/gqnoenforce + gquota/grpquota/gqnoenforce Group disk quota accounting enabled and limits (optionally) - enforced. + enforced. Refer to xfs_quota(8) for further details. + + pquota/prjquota/pqnoenforce + Project disk quota accounting enabled and limits (optionally) + enforced. Refer to xfs_quota(8) for further details. sunit=value and swidth=value Used to specify the stripe unit and width for a RAID device or @@ -113,15 +169,21 @@ When mounting an XFS filesystem, the following options are accepted. The "swidth" option is required if the "sunit" option has been specified, and must be a multiple of the "sunit" value. + swalloc + Data allocations will be rounded up to stripe width boundaries + when the current end of file is being extended and the file + size is larger than the stripe width size. + + sysctls ======= The following sysctls are available for the XFS filesystem: fs.xfs.stats_clear (Min: 0 Default: 0 Max: 1) - Setting this to "1" clears accumulated XFS statistics + Setting this to "1" clears accumulated XFS statistics in /proc/fs/xfs/stat. It then immediately resets to "0". - + fs.xfs.xfssyncd_centisecs (Min: 100 Default: 3000 Max: 720000) The interval at which the xfssyncd thread flushes metadata out to disk. This thread will flush log activity out, and @@ -143,9 +205,9 @@ The following sysctls are available for the XFS filesystem: XFS_ERRLEVEL_HIGH: 5 fs.xfs.panic_mask (Min: 0 Default: 0 Max: 127) - Causes certain error conditions to call BUG(). Value is a bitmask; + Causes certain error conditions to call BUG(). Value is a bitmask; AND together the tags which represent errors which should cause panics: - + XFS_NO_PTAG 0 XFS_PTAG_IFLUSH 0x00000001 XFS_PTAG_LOGRES 0x00000002 @@ -155,7 +217,7 @@ The following sysctls are available for the XFS filesystem: XFS_PTAG_SHUTDOWN_IOERROR 0x00000020 XFS_PTAG_SHUTDOWN_LOGERROR 0x00000040 - This option is intended for debugging only. + This option is intended for debugging only. fs.xfs.irix_symlink_mode (Min: 0 Default: 0 Max: 1) Controls whether symlinks are created with mode 0777 (default) @@ -164,25 +226,37 @@ The following sysctls are available for the XFS filesystem: fs.xfs.irix_sgid_inherit (Min: 0 Default: 0 Max: 1) Controls files created in SGID directories. If the group ID of the new file does not match the effective group - ID or one of the supplementary group IDs of the parent dir, the - ISGID bit is cleared if the irix_sgid_inherit compatibility sysctl + ID or one of the supplementary group IDs of the parent dir, the + ISGID bit is cleared if the irix_sgid_inherit compatibility sysctl is set. fs.xfs.restrict_chown (Min: 0 Default: 1 Max: 1) Controls whether unprivileged users can use chown to "give away" a file to another user. - fs.xfs.inherit_sync (Min: 0 Default: 1 Max 1) - Setting this to "1" will cause the "sync" flag set - by the chattr(1) command on a directory to be + fs.xfs.inherit_sync (Min: 0 Default: 1 Max: 1) + Setting this to "1" will cause the "sync" flag set + by the xfs_io(8) chattr command on a directory to be inherited by files in that directory. - fs.xfs.inherit_nodump (Min: 0 Default: 1 Max 1) - Setting this to "1" will cause the "nodump" flag set - by the chattr(1) command on a directory to be + fs.xfs.inherit_nodump (Min: 0 Default: 1 Max: 1) + Setting this to "1" will cause the "nodump" flag set + by the xfs_io(8) chattr command on a directory to be inherited by files in that directory. - fs.xfs.inherit_noatime (Min: 0 Default: 1 Max 1) - Setting this to "1" will cause the "noatime" flag set - by the chattr(1) command on a directory to be + fs.xfs.inherit_noatime (Min: 0 Default: 1 Max: 1) + Setting this to "1" will cause the "noatime" flag set + by the xfs_io(8) chattr command on a directory to be inherited by files in that directory. + + fs.xfs.inherit_nosymlinks (Min: 0 Default: 1 Max: 1) + Setting this to "1" will cause the "nosymlinks" flag set + by the xfs_io(8) chattr command on a directory to be + inherited by files in that directory. + + fs.xfs.rotorstep (Min: 1 Default: 1 Max: 256) + In "inode32" allocation mode, this option determines how many + files the allocator attempts to allocate in the same allocation + group before moving to the next allocation group. The intent + is to control the rate at which the allocator moves between + allocation groups when allocating extents for new files. diff --git a/Documentation/firmware_class/firmware_sample_driver.c b/Documentation/firmware_class/firmware_sample_driver.c index 4bef8c25172c..d3ad2c24490a 100644 --- a/Documentation/firmware_class/firmware_sample_driver.c +++ b/Documentation/firmware_class/firmware_sample_driver.c @@ -13,6 +13,7 @@ #include <linux/kernel.h> #include <linux/init.h> #include <linux/device.h> +#include <linux/string.h> #include "linux/firmware.h" diff --git a/Documentation/firmware_class/firmware_sample_firmware_class.c b/Documentation/firmware_class/firmware_sample_firmware_class.c index 09eab2f1b373..57b956aecbc5 100644 --- a/Documentation/firmware_class/firmware_sample_firmware_class.c +++ b/Documentation/firmware_class/firmware_sample_firmware_class.c @@ -14,6 +14,8 @@ #include <linux/module.h> #include <linux/init.h> #include <linux/timer.h> +#include <linux/slab.h> +#include <linux/string.h> #include <linux/firmware.h> diff --git a/Documentation/hpet.txt b/Documentation/hpet.txt index 4e7cc8d3359b..e52457581f47 100644 --- a/Documentation/hpet.txt +++ b/Documentation/hpet.txt @@ -1,18 +1,21 @@ High Precision Event Timer Driver for Linux -The High Precision Event Timer (HPET) hardware is the future replacement for the 8254 and Real -Time Clock (RTC) periodic timer functionality. Each HPET can have up two 32 timers. It is possible -to configure the first two timers as legacy replacements for 8254 and RTC periodic. A specification -done by INTEL and Microsoft can be found at http://www.intel.com/labs/platcomp/hpet/hpetspec.htm. - -The driver supports detection of HPET driver allocation and initialization of the HPET before the -driver module_init routine is called. This enables platform code which uses timer 0 or 1 as the -main timer to intercept HPET initialization. An example of this initialization can be found in +The High Precision Event Timer (HPET) hardware is the future replacement +for the 8254 and Real Time Clock (RTC) periodic timer functionality. +Each HPET can have up two 32 timers. It is possible to configure the +first two timers as legacy replacements for 8254 and RTC periodic timers. +A specification done by Intel and Microsoft can be found at +<http://www.intel.com/hardwaredesign/hpetspec.htm>. + +The driver supports detection of HPET driver allocation and initialization +of the HPET before the driver module_init routine is called. This enables +platform code which uses timer 0 or 1 as the main timer to intercept HPET +initialization. An example of this initialization can be found in arch/i386/kernel/time_hpet.c. -The driver provides two APIs which are very similar to the API found in the rtc.c driver. -There is a user space API and a kernel space API. An example user space program is provided -below. +The driver provides two APIs which are very similar to the API found in +the rtc.c driver. There is a user space API and a kernel space API. +An example user space program is provided below. #include <stdio.h> #include <stdlib.h> @@ -290,9 +293,8 @@ The kernel API has three interfaces exported from the driver: hpet_unregister(struct hpet_task *tp) hpet_control(struct hpet_task *tp, unsigned int cmd, unsigned long arg) -The kernel module using this interface fills in the ht_func and ht_data members of the -hpet_task structure before calling hpet_register. hpet_control simply vectors to the hpet_ioctl -routine and has the same commands and respective arguments as the user API. hpet_unregister +The kernel module using this interface fills in the ht_func and ht_data +members of the hpet_task structure before calling hpet_register. +hpet_control simply vectors to the hpet_ioctl routine and has the same +commands and respective arguments as the user API. hpet_unregister is used to terminate usage of the HPET timer reserved by hpet_register. - - diff --git a/Documentation/hwmon/it87 b/Documentation/hwmon/it87 index 0d0195040d88..7f42e441c645 100644 --- a/Documentation/hwmon/it87 +++ b/Documentation/hwmon/it87 @@ -4,18 +4,18 @@ Kernel driver it87 Supported chips: * IT8705F Prefix: 'it87' - Addresses scanned: from Super I/O config space, or default ISA 0x290 (8 I/O ports) + Addresses scanned: from Super I/O config space (8 I/O ports) Datasheet: Publicly available at the ITE website http://www.ite.com.tw/ * IT8712F Prefix: 'it8712' Addresses scanned: I2C 0x28 - 0x2f - from Super I/O config space, or default ISA 0x290 (8 I/O ports) + from Super I/O config space (8 I/O ports) Datasheet: Publicly available at the ITE website http://www.ite.com.tw/ * SiS950 [clone of IT8705F] - Prefix: 'sis950' - Addresses scanned: from Super I/O config space, or default ISA 0x290 (8 I/O ports) + Prefix: 'it87' + Addresses scanned: from Super I/O config space (8 I/O ports) Datasheet: No longer be available Author: Christophe Gauthron <chrisg@0-in.com> diff --git a/Documentation/hwmon/lm90 b/Documentation/hwmon/lm90 index 2c4cf39471f4..438cb24cee5b 100644 --- a/Documentation/hwmon/lm90 +++ b/Documentation/hwmon/lm90 @@ -24,14 +24,14 @@ Supported chips: http://www.national.com/pf/LM/LM86.html * Analog Devices ADM1032 Prefix: 'adm1032' - Addresses scanned: I2C 0x4c + Addresses scanned: I2C 0x4c and 0x4d Datasheet: Publicly available at the Analog Devices website - http://products.analog.com/products/info.asp?product=ADM1032 + http://www.analog.com/en/prod/0,2877,ADM1032,00.html * Analog Devices ADT7461 Prefix: 'adt7461' - Addresses scanned: I2C 0x4c + Addresses scanned: I2C 0x4c and 0x4d Datasheet: Publicly available at the Analog Devices website - http://products.analog.com/products/info.asp?product=ADT7461 + http://www.analog.com/en/prod/0,2877,ADT7461,00.html Note: Only if in ADM1032 compatibility mode * Maxim MAX6657 Prefix: 'max6657' @@ -71,8 +71,8 @@ increased resolution of the remote temperature measurement. The different chipsets of the family are not strictly identical, although very similar. This driver doesn't handle any specific feature for now, -but could if there ever was a need for it. For reference, here comes a -non-exhaustive list of specific features: +with the exception of SMBus PEC. For reference, here comes a non-exhaustive +list of specific features: LM90: * Filter and alert configuration register at 0xBF. @@ -91,6 +91,7 @@ ADM1032: * Conversion averaging. * Up to 64 conversions/s. * ALERT is triggered by open remote sensor. + * SMBus PEC support for Write Byte and Receive Byte transactions. ADT7461 * Extended temperature range (breaks compatibility) @@ -119,3 +120,37 @@ The lm90 driver will not update its values more frequently than every other second; reading them more often will do no harm, but will return 'old' values. +PEC Support +----------- + +The ADM1032 is the only chip of the family which supports PEC. It does +not support PEC on all transactions though, so some care must be taken. + +When reading a register value, the PEC byte is computed and sent by the +ADM1032 chip. However, in the case of a combined transaction (SMBus Read +Byte), the ADM1032 computes the CRC value over only the second half of +the message rather than its entirety, because it thinks the first half +of the message belongs to a different transaction. As a result, the CRC +value differs from what the SMBus master expects, and all reads fail. + +For this reason, the lm90 driver will enable PEC for the ADM1032 only if +the bus supports the SMBus Send Byte and Receive Byte transaction types. +These transactions will be used to read register values, instead of +SMBus Read Byte, and PEC will work properly. + +Additionally, the ADM1032 doesn't support SMBus Send Byte with PEC. +Instead, it will try to write the PEC value to the register (because the +SMBus Send Byte transaction with PEC is similar to a Write Byte transaction +without PEC), which is not what we want. Thus, PEC is explicitely disabled +on SMBus Send Byte transactions in the lm90 driver. + +PEC on byte data transactions represents a significant increase in bandwidth +usage (+33% for writes, +25% for reads) in normal conditions. With the need +to use two SMBus transaction for reads, this overhead jumps to +50%. Worse, +two transactions will typically mean twice as much delay waiting for +transaction completion, effectively doubling the register cache refresh time. +I guess reliability comes at a price, but it's quite expensive this time. + +So, as not everyone might enjoy the slowdown, PEC can be disabled through +sysfs. Just write 0 to the "pec" file and PEC will be disabled. Write 1 +to that file to enable PEC again. diff --git a/Documentation/hwmon/smsc47b397 b/Documentation/hwmon/smsc47b397 index da9d80c96432..20682f15ae41 100644 --- a/Documentation/hwmon/smsc47b397 +++ b/Documentation/hwmon/smsc47b397 @@ -3,6 +3,7 @@ Kernel driver smsc47b397 Supported chips: * SMSC LPC47B397-NC + * SMSC SCH5307-NS Prefix: 'smsc47b397' Addresses scanned: none, address read from Super I/O config space Datasheet: In this file @@ -12,11 +13,14 @@ Authors: Mark M. Hoffman <mhoffman@lightlink.com> November 23, 2004 -The following specification describes the SMSC LPC47B397-NC sensor chip +The following specification describes the SMSC LPC47B397-NC[1] sensor chip (for which there is no public datasheet available). This document was provided by Craig Kelly (In-Store Broadcast Network) and edited/corrected by Mark M. Hoffman <mhoffman@lightlink.com>. +[1] And SMSC SCH5307-NS, which has a different device ID but is otherwise +compatible. + * * * * * Methods for detecting the HP SIO and reading the thermal data on a dc7100. @@ -127,7 +131,7 @@ OUT DX,AL The registers of interest for identifying the SIO on the dc7100 are Device ID (0x20) and Device Rev (0x21). -The Device ID will read 0X6F +The Device ID will read 0x6F (for SCH5307-NS, 0x81) The Device Rev currently reads 0x01 Obtaining the HWM Base Address. diff --git a/Documentation/hwmon/smsc47m1 b/Documentation/hwmon/smsc47m1 index 34e6478c1425..c15bbe68264e 100644 --- a/Documentation/hwmon/smsc47m1 +++ b/Documentation/hwmon/smsc47m1 @@ -12,6 +12,10 @@ Supported chips: http://www.smsc.com/main/datasheets/47m14x.pdf http://www.smsc.com/main/tools/discontinued/47m15x.pdf http://www.smsc.com/main/datasheets/47m192.pdf + * SMSC LPC47M997 + Addresses scanned: none, address read from Super I/O config space + Prefix: 'smsc47m1' + Datasheet: none Authors: Mark D. Studebaker <mdsxyz123@yahoo.com>, @@ -30,6 +34,9 @@ The 47M15x and 47M192 chips contain a full 'hardware monitoring block' in addition to the fan monitoring and control. The hardware monitoring block is not supported by the driver. +No documentation is available for the 47M997, but it has the same device +ID as the 47M15x and 47M192 chips and seems to be compatible. + Fan rotation speeds are reported in RPM (rotations per minute). An alarm is triggered if the rotation speed has dropped below a programmable limit. Fan readings can be divided by a programmable divider (1, 2, 4 or 8) to give diff --git a/Documentation/hwmon/sysfs-interface b/Documentation/hwmon/sysfs-interface index 346400519d0d..764cdc5480e7 100644 --- a/Documentation/hwmon/sysfs-interface +++ b/Documentation/hwmon/sysfs-interface @@ -272,3 +272,6 @@ beep_mask Bitmask for beep. eeprom Raw EEPROM data in binary form. Read only. + +pec Enable or disable PEC (SMBus only) + Read/Write diff --git a/Documentation/hwmon/via686a b/Documentation/hwmon/via686a index b82014cb7c53..a936fb3824b2 100644 --- a/Documentation/hwmon/via686a +++ b/Documentation/hwmon/via686a @@ -18,8 +18,9 @@ Authors: Module Parameters ----------------- -force_addr=0xaddr Set the I/O base address. Useful for Asus A7V boards - that don't set the address in the BIOS. Does not do a +force_addr=0xaddr Set the I/O base address. Useful for boards that + don't set the address in the BIOS. Look for a BIOS + upgrade before resorting to this. Does not do a PCI force; the via686a must still be present in lspci. Don't use this unless the driver complains that the base address is not set. @@ -63,3 +64,15 @@ miss once-only alarms. The driver only updates its values each 1.5 seconds; reading it more often will do no harm, but will return 'old' values. + +Known Issues +------------ + +This driver handles sensors integrated in some VIA south bridges. It is +possible that a motherboard maker used a VT82C686A/B chip as part of a +product design but was not interested in its hardware monitoring features, +in which case the sensor inputs will not be wired. This is the case of +the Asus K7V, A7V and A7V133 motherboards, to name only a few of them. +So, if you need the force_addr parameter, and end up with values which +don't seem to make any sense, don't look any further: your chip is simply +not wired for hardware monitoring. diff --git a/Documentation/i2c/busses/i2c-i810 b/Documentation/i2c/busses/i2c-i810 index 0544eb332887..83c3b9743c3c 100644 --- a/Documentation/i2c/busses/i2c-i810 +++ b/Documentation/i2c/busses/i2c-i810 @@ -2,6 +2,7 @@ Kernel driver i2c-i810 Supported adapters: * Intel 82810, 82810-DC100, 82810E, and 82815 (GMCH) + * Intel 82845G (GMCH) Authors: Frodo Looijaard <frodol@dds.nl>, diff --git a/Documentation/i2c/busses/i2c-viapro b/Documentation/i2c/busses/i2c-viapro index 702f5ac68c09..9363b8bd6109 100644 --- a/Documentation/i2c/busses/i2c-viapro +++ b/Documentation/i2c/busses/i2c-viapro @@ -4,17 +4,18 @@ Supported adapters: * VIA Technologies, Inc. VT82C596A/B Datasheet: Sometimes available at the VIA website - * VIA Technologies, Inc. VT82C686A/B + * VIA Technologies, Inc. VT82C686A/B Datasheet: Sometimes available at the VIA website * VIA Technologies, Inc. VT8231, VT8233, VT8233A, VT8235, VT8237 Datasheet: available on request from Via Authors: - Frodo Looijaard <frodol@dds.nl>, - Philip Edelbrock <phil@netroedge.com>, - Kyösti Mälkki <kmalkki@cc.hut.fi>, - Mark D. Studebaker <mdsxyz123@yahoo.com> + Frodo Looijaard <frodol@dds.nl>, + Philip Edelbrock <phil@netroedge.com>, + Kyösti Mälkki <kmalkki@cc.hut.fi>, + Mark D. Studebaker <mdsxyz123@yahoo.com>, + Jean Delvare <khali@linux-fr.org> Module Parameters ----------------- @@ -28,20 +29,22 @@ Description ----------- i2c-viapro is a true SMBus host driver for motherboards with one of the -supported VIA southbridges. +supported VIA south bridges. Your lspci -n listing must show one of these : - device 1106:3050 (VT82C596 function 3) - device 1106:3051 (VT82C596 function 3) + device 1106:3050 (VT82C596A function 3) + device 1106:3051 (VT82C596B function 3) device 1106:3057 (VT82C686 function 4) device 1106:3074 (VT8233) device 1106:3147 (VT8233A) - device 1106:8235 (VT8231) - devide 1106:3177 (VT8235) - devide 1106:3227 (VT8237) + device 1106:8235 (VT8231 function 4) + device 1106:3177 (VT8235) + device 1106:3227 (VT8237R) If none of these show up, you should look in the BIOS for settings like enable ACPI / SMBus or even USB. - +Except for the oldest chips (VT82C596A/B, VT82C686A and most probably +VT8231), this driver supports I2C block transactions. Such transactions +are mainly useful to read from and write to EEPROMs. diff --git a/Documentation/i2c/chips/x1205 b/Documentation/i2c/chips/x1205 new file mode 100644 index 000000000000..09407c991fe5 --- /dev/null +++ b/Documentation/i2c/chips/x1205 @@ -0,0 +1,38 @@ +Kernel driver x1205 +=================== + +Supported chips: + * Xicor X1205 RTC + Prefix: 'x1205' + Addresses scanned: none + Datasheet: http://www.intersil.com/cda/deviceinfo/0,1477,X1205,00.html + +Authors: + Karen Spearel <kas11@tampabay.rr.com>, + Alessandro Zummo <a.zummo@towertech.it> + +Description +----------- + +This module aims to provide complete access to the Xicor X1205 RTC. +Recently Xicor has merged with Intersil, but the chip is +still sold under the Xicor brand. + +This chip is located at address 0x6f and uses a 2-byte register addressing. +Two bytes need to be written to read a single register, while most +other chips just require one and take the second one as the data +to be written. To prevent corrupting unknown chips, the user must +explicitely set the probe parameter. + +example: + +modprobe x1205 probe=0,0x6f + +The module supports one more option, hctosys, which is used to set the +software clock from the x1205. On systems where the x1205 is the +only hardware rtc, this parameter could be used to achieve a correct +date/time earlier in the system boot sequence. + +example: + +modprobe x1205 probe=0,0x6f hctosys=1 diff --git a/Documentation/i2c/functionality b/Documentation/i2c/functionality index 41ffefbdc60c..60cca249e452 100644 --- a/Documentation/i2c/functionality +++ b/Documentation/i2c/functionality @@ -17,9 +17,10 @@ For the most up-to-date list of functionality constants, please check I2C_FUNC_I2C Plain i2c-level commands (Pure SMBus adapters typically can not do these) I2C_FUNC_10BIT_ADDR Handles the 10-bit address extensions - I2C_FUNC_PROTOCOL_MANGLING Knows about the I2C_M_REV_DIR_ADDR, - I2C_M_REV_DIR_ADDR and I2C_M_REV_DIR_NOSTART - flags (which modify the i2c protocol!) + I2C_FUNC_PROTOCOL_MANGLING Knows about the I2C_M_IGNORE_NAK, + I2C_M_REV_DIR_ADDR, I2C_M_NOSTART and + I2C_M_NO_RD_ACK flags (which modify the + I2C protocol!) I2C_FUNC_SMBUS_QUICK Handles the SMBus write_quick command I2C_FUNC_SMBUS_READ_BYTE Handles the SMBus read_byte command I2C_FUNC_SMBUS_WRITE_BYTE Handles the SMBus write_byte command diff --git a/Documentation/i2c/porting-clients b/Documentation/i2c/porting-clients index 4849dfd6961c..184fac2377aa 100644 --- a/Documentation/i2c/porting-clients +++ b/Documentation/i2c/porting-clients @@ -82,7 +82,7 @@ Technical changes: exit and exit_free. For i2c+isa drivers, labels should be named ERROR0, ERROR1 and ERROR2. Don't forget to properly set err before jumping to error labels. By the way, labels should be left-aligned. - Use memset to fill the client and data area with 0x00. + Use kzalloc instead of kmalloc. Use i2c_set_clientdata to set the client data (as opposed to a direct access to client->data). Use strlcpy instead of strcpy to copy the client name. diff --git a/Documentation/i2c/writing-clients b/Documentation/i2c/writing-clients index 077275722a7c..cff7b652588a 100644 --- a/Documentation/i2c/writing-clients +++ b/Documentation/i2c/writing-clients @@ -33,8 +33,8 @@ static struct i2c_driver foo_driver = { .command = &foo_command /* may be NULL */ } -The name can be chosen freely, and may be upto 40 characters long. Please -use something descriptive here. +The name field must match the driver name, including the case. It must not +contain spaces, and may be up to 31 characters long. Don't worry about the flags field; just put I2C_DF_NOTIFY into it. This means that your driver will be notified when new adapters are found. @@ -43,9 +43,6 @@ This is almost always what you want. All other fields are for call-back functions which will be explained below. -There use to be two additional fields in this structure, inc_use et dec_use, -for module usage count, but these fields were obsoleted and removed. - Extra client data ================= @@ -58,6 +55,7 @@ be very useful. An example structure is below. struct foo_data { + struct i2c_client client; struct semaphore lock; /* For ISA access in `sensors' drivers. */ int sysctl_id; /* To keep the /proc directory entry for `sensors' drivers. */ @@ -275,6 +273,7 @@ For now, you can ignore the `flags' parameter. It is there for future use. if (is_isa) { /* Discard immediately if this ISA range is already used */ + /* FIXME: never use check_region(), only request_region() */ if (check_region(address,FOO_EXTENT)) goto ERROR0; @@ -310,22 +309,15 @@ For now, you can ignore the `flags' parameter. It is there for future use. client structure, even though we cannot fill it completely yet. But it allows us to access several i2c functions safely */ - /* Note that we reserve some space for foo_data too. If you don't - need it, remove it. We do it here to help to lessen memory - fragmentation. */ - if (! (new_client = kmalloc(sizeof(struct i2c_client) + - sizeof(struct foo_data), - GFP_KERNEL))) { + if (!(data = kzalloc(sizeof(struct foo_data), GFP_KERNEL))) { err = -ENOMEM; goto ERROR0; } - /* This is tricky, but it will set the data to the right value. */ - client->data = new_client + 1; - data = (struct foo_data *) (client->data); + new_client = &data->client; + i2c_set_clientdata(new_client, data); new_client->addr = address; - new_client->data = data; new_client->adapter = adapter; new_client->driver = &foo_driver; new_client->flags = 0; @@ -451,7 +443,7 @@ much simpler than the attachment code, fortunately! release_region(client->addr,LM78_EXTENT); /* HYBRID SENSORS CHIP ONLY END */ - kfree(client); /* Frees client data too, if allocated at the same time */ + kfree(data); return 0; } @@ -576,12 +568,12 @@ SMBus communication extern s32 i2c_smbus_write_block_data(struct i2c_client * client, u8 command, u8 length, u8 *values); + extern s32 i2c_smbus_read_i2c_block_data(struct i2c_client * client, + u8 command, u8 *values); These ones were removed in Linux 2.6.10 because they had no users, but could be added back later if needed: - extern s32 i2c_smbus_read_i2c_block_data(struct i2c_client * client, - u8 command, u8 *values); extern s32 i2c_smbus_read_block_data(struct i2c_client * client, u8 command, u8 *values); extern s32 i2c_smbus_write_i2c_block_data(struct i2c_client * client, diff --git a/Documentation/ia64/mca.txt b/Documentation/ia64/mca.txt new file mode 100644 index 000000000000..a71cc6a67ef7 --- /dev/null +++ b/Documentation/ia64/mca.txt @@ -0,0 +1,194 @@ +An ad-hoc collection of notes on IA64 MCA and INIT processing. Feel +free to update it with notes about any area that is not clear. + +--- + +MCA/INIT are completely asynchronous. They can occur at any time, when +the OS is in any state. Including when one of the cpus is already +holding a spinlock. Trying to get any lock from MCA/INIT state is +asking for deadlock. Also the state of structures that are protected +by locks is indeterminate, including linked lists. + +--- + +The complicated ia64 MCA process. All of this is mandated by Intel's +specification for ia64 SAL, error recovery and and unwind, it is not as +if we have a choice here. + +* MCA occurs on one cpu, usually due to a double bit memory error. + This is the monarch cpu. + +* SAL sends an MCA rendezvous interrupt (which is a normal interrupt) + to all the other cpus, the slaves. + +* Slave cpus that receive the MCA interrupt call down into SAL, they + end up spinning disabled while the MCA is being serviced. + +* If any slave cpu was already spinning disabled when the MCA occurred + then it cannot service the MCA interrupt. SAL waits ~20 seconds then + sends an unmaskable INIT event to the slave cpus that have not + already rendezvoused. + +* Because MCA/INIT can be delivered at any time, including when the cpu + is down in PAL in physical mode, the registers at the time of the + event are _completely_ undefined. In particular the MCA/INIT + handlers cannot rely on the thread pointer, PAL physical mode can + (and does) modify TP. It is allowed to do that as long as it resets + TP on return. However MCA/INIT events expose us to these PAL + internal TP changes. Hence curr_task(). + +* If an MCA/INIT event occurs while the kernel was running (not user + space) and the kernel has called PAL then the MCA/INIT handler cannot + assume that the kernel stack is in a fit state to be used. Mainly + because PAL may or may not maintain the stack pointer internally. + Because the MCA/INIT handlers cannot trust the kernel stack, they + have to use their own, per-cpu stacks. The MCA/INIT stacks are + preformatted with just enough task state to let the relevant handlers + do their job. + +* Unlike most other architectures, the ia64 struct task is embedded in + the kernel stack[1]. So switching to a new kernel stack means that + we switch to a new task as well. Because various bits of the kernel + assume that current points into the struct task, switching to a new + stack also means a new value for current. + +* Once all slaves have rendezvoused and are spinning disabled, the + monarch is entered. The monarch now tries to diagnose the problem + and decide if it can recover or not. + +* Part of the monarch's job is to look at the state of all the other + tasks. The only way to do that on ia64 is to call the unwinder, + as mandated by Intel. + +* The starting point for the unwind depends on whether a task is + running or not. That is, whether it is on a cpu or is blocked. The + monarch has to determine whether or not a task is on a cpu before it + knows how to start unwinding it. The tasks that received an MCA or + INIT event are no longer running, they have been converted to blocked + tasks. But (and its a big but), the cpus that received the MCA + rendezvous interrupt are still running on their normal kernel stacks! + +* To distinguish between these two cases, the monarch must know which + tasks are on a cpu and which are not. Hence each slave cpu that + switches to an MCA/INIT stack, registers its new stack using + set_curr_task(), so the monarch can tell that the _original_ task is + no longer running on that cpu. That gives us a decent chance of + getting a valid backtrace of the _original_ task. + +* MCA/INIT can be nested, to a depth of 2 on any cpu. In the case of a + nested error, we want diagnostics on the MCA/INIT handler that + failed, not on the task that was originally running. Again this + requires set_curr_task() so the MCA/INIT handlers can register their + own stack as running on that cpu. Then a recursive error gets a + trace of the failing handler's "task". + +[1] My (Keith Owens) original design called for ia64 to separate its + struct task and the kernel stacks. Then the MCA/INIT data would be + chained stacks like i386 interrupt stacks. But that required + radical surgery on the rest of ia64, plus extra hard wired TLB + entries with its associated performance degradation. David + Mosberger vetoed that approach. Which meant that separate kernel + stacks meant separate "tasks" for the MCA/INIT handlers. + +--- + +INIT is less complicated than MCA. Pressing the nmi button or using +the equivalent command on the management console sends INIT to all +cpus. SAL picks one one of the cpus as the monarch and the rest are +slaves. All the OS INIT handlers are entered at approximately the same +time. The OS monarch prints the state of all tasks and returns, after +which the slaves return and the system resumes. + +At least that is what is supposed to happen. Alas there are broken +versions of SAL out there. Some drive all the cpus as monarchs. Some +drive them all as slaves. Some drive one cpu as monarch, wait for that +cpu to return from the OS then drive the rest as slaves. Some versions +of SAL cannot even cope with returning from the OS, they spin inside +SAL on resume. The OS INIT code has workarounds for some of these +broken SAL symptoms, but some simply cannot be fixed from the OS side. + +--- + +The scheduler hooks used by ia64 (curr_task, set_curr_task) are layer +violations. Unfortunately MCA/INIT start off as massive layer +violations (can occur at _any_ time) and they build from there. + +At least ia64 makes an attempt at recovering from hardware errors, but +it is a difficult problem because of the asynchronous nature of these +errors. When processing an unmaskable interrupt we sometimes need +special code to cope with our inability to take any locks. + +--- + +How is ia64 MCA/INIT different from x86 NMI? + +* x86 NMI typically gets delivered to one cpu. MCA/INIT gets sent to + all cpus. + +* x86 NMI cannot be nested. MCA/INIT can be nested, to a depth of 2 + per cpu. + +* x86 has a separate struct task which points to one of multiple kernel + stacks. ia64 has the struct task embedded in the single kernel + stack, so switching stack means switching task. + +* x86 does not call the BIOS so the NMI handler does not have to worry + about any registers having changed. MCA/INIT can occur while the cpu + is in PAL in physical mode, with undefined registers and an undefined + kernel stack. + +* i386 backtrace is not very sensitive to whether a process is running + or not. ia64 unwind is very, very sensitive to whether a process is + running or not. + +--- + +What happens when MCA/INIT is delivered what a cpu is running user +space code? + +The user mode registers are stored in the RSE area of the MCA/INIT on +entry to the OS and are restored from there on return to SAL, so user +mode registers are preserved across a recoverable MCA/INIT. Since the +OS has no idea what unwind data is available for the user space stack, +MCA/INIT never tries to backtrace user space. Which means that the OS +does not bother making the user space process look like a blocked task, +i.e. the OS does not copy pt_regs and switch_stack to the user space +stack. Also the OS has no idea how big the user space RSE and memory +stacks are, which makes it too risky to copy the saved state to a user +mode stack. + +--- + +How do we get a backtrace on the tasks that were running when MCA/INIT +was delivered? + +mca.c:::ia64_mca_modify_original_stack(). That identifies and +verifies the original kernel stack, copies the dirty registers from +the MCA/INIT stack's RSE to the original stack's RSE, copies the +skeleton struct pt_regs and switch_stack to the original stack, fills +in the skeleton structures from the PAL minstate area and updates the +original stack's thread.ksp. That makes the original stack look +exactly like any other blocked task, i.e. it now appears to be +sleeping. To get a backtrace, just start with thread.ksp for the +original task and unwind like any other sleeping task. + +--- + +How do we identify the tasks that were running when MCA/INIT was +delivered? + +If the previous task has been verified and converted to a blocked +state, then sos->prev_task on the MCA/INIT stack is updated to point to +the previous task. You can look at that field in dumps or debuggers. +To help distinguish between the handler and the original tasks, +handlers have _TIF_MCA_INIT set in thread_info.flags. + +The sos data is always in the MCA/INIT handler stack, at offset +MCA_SOS_OFFSET. You can get that value from mca_asm.h or calculate it +as KERNEL_STACK_SIZE - sizeof(struct pt_regs) - sizeof(struct +ia64_sal_os_state), with 16 byte alignment for all structures. + +Also the comm field of the MCA/INIT task is modified to include the pid +of the original task, for humans to use. For example, a comm field of +'MCA 12159' means that pid 12159 was running when the MCA was +delivered. diff --git a/Documentation/input/yealink.txt b/Documentation/input/yealink.txt index 85f095a7ad04..0962c5c948be 100644 --- a/Documentation/input/yealink.txt +++ b/Documentation/input/yealink.txt @@ -2,7 +2,6 @@ Driver documentation for yealink usb-p1k phones 0. Status ~~~~~~~~~ - The p1k is a relatively cheap usb 1.1 phone with: - keyboard full support, yealink.ko / input event API - LCD full support, yealink.ko / sysfs API @@ -17,9 +16,8 @@ For vendor documentation see http://www.yealink.com 1. Compilation (stand alone version) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - Currently only kernel 2.6.x.y versions are supported. -In order to build the yealink.ko module do: +In order to build the yealink.ko module do make @@ -28,6 +26,21 @@ the Makefile is pointing to the location where your kernel sources are located, default /usr/src/linux. +1.1 Troubleshooting +~~~~~~~~~~~~~~~~~~~ +Q: Module yealink compiled and installed without any problem but phone + is not initialized and does not react to any actions. +A: If you see something like: + hiddev0: USB HID v1.00 Device [Yealink Network Technology Ltd. VOIP USB Phone + in dmesg, it means that the hid driver has grabbed the device first. Try to + load module yealink before any other usb hid driver. Please see the + instructions provided by your distribution on module configuration. + +Q: Phone is working now (displays version and accepts keypad input) but I can't + find the sysfs files. +A: The sysfs files are located on the particular usb endpoint. On most + distributions you can do: "find /sys/ -name get_icons" for a hint. + 2. keyboard features ~~~~~~~~~~~~~~~~~~~~ diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 7086f0a90d14..5dffcfefc3c7 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -17,7 +17,7 @@ are specified on the kernel command line with the module name plus usbcore.blinkenlights=1 -The text in square brackets at the beginning of the description state the +The text in square brackets at the beginning of the description states the restrictions on the kernel for the said kernel parameter to be valid. The restrictions referred to are that the relevant option is valid if: @@ -27,8 +27,8 @@ restrictions referred to are that the relevant option is valid if: APM Advanced Power Management support is enabled. AX25 Appropriate AX.25 support is enabled. CD Appropriate CD support is enabled. - DEVFS devfs support is enabled. - DRM Direct Rendering Management support is enabled. + DEVFS devfs support is enabled. + DRM Direct Rendering Management support is enabled. EDD BIOS Enhanced Disk Drive Services (EDD) is enabled EFI EFI Partitioning (GPT) is enabled EIDE EIDE/ATAPI support is enabled. @@ -71,7 +71,7 @@ restrictions referred to are that the relevant option is valid if: SERIAL Serial support is enabled. SMP The kernel is an SMP kernel. SPARC Sparc architecture is enabled. - SWSUSP Software suspension is enabled. + SWSUSP Software suspend is enabled. TS Appropriate touchscreen support is enabled. USB USB support is enabled. USBHID USB Human Interface Device support is enabled. @@ -105,13 +105,13 @@ running once the system is up. See header of drivers/scsi/53c7xx.c. See also Documentation/scsi/ncr53c7xx.txt. - acpi= [HW,ACPI] Advanced Configuration and Power Interface - Format: { force | off | ht | strict } + acpi= [HW,ACPI] Advanced Configuration and Power Interface + Format: { force | off | ht | strict | noirq } force -- enable ACPI if default was off off -- disable ACPI if default was on noirq -- do not use ACPI for IRQ routing ht -- run only enough ACPI to enable Hyper Threading - strict -- Be less tolerant of platforms that are not + strict -- Be less tolerant of platforms that are not strictly ACPI specification compliant. See also Documentation/pm.txt, pci=noacpi @@ -119,20 +119,23 @@ running once the system is up. acpi_sleep= [HW,ACPI] Sleep options Format: { s3_bios, s3_mode } See Documentation/power/video.txt - + acpi_sci= [HW,ACPI] ACPI System Control Interrupt trigger mode - Format: { level | edge | high | low } + Format: { level | edge | high | low } - acpi_irq_balance [HW,ACPI] ACPI will balance active IRQs - default in APIC mode + acpi_irq_balance [HW,ACPI] + ACPI will balance active IRQs + default in APIC mode - acpi_irq_nobalance [HW,ACPI] ACPI will not move active IRQs (default) - default in PIC mode + acpi_irq_nobalance [HW,ACPI] + ACPI will not move active IRQs (default) + default in PIC mode - acpi_irq_pci= [HW,ACPI] If irq_balance, Clear listed IRQs for use by PCI + acpi_irq_pci= [HW,ACPI] If irq_balance, clear listed IRQs for + use by PCI Format: <irq>,<irq>... - acpi_irq_isa= [HW,ACPI] If irq_balance, Mark listed IRQs used by ISA + acpi_irq_isa= [HW,ACPI] If irq_balance, mark listed IRQs used by ISA Format: <irq>,<irq>... acpi_osi= [HW,ACPI] empty param disables _OSI @@ -145,14 +148,14 @@ running once the system is up. acpi_dbg_layer= [HW,ACPI] Format: <int> - Each bit of the <int> indicates an acpi debug layer, + Each bit of the <int> indicates an ACPI debug layer, 1: enable, 0: disable. It is useful for boot time debugging. After system has booted up, it can be set via /proc/acpi/debug_layer. acpi_dbg_level= [HW,ACPI] Format: <int> - Each bit of the <int> indicates an acpi debug level, + Each bit of the <int> indicates an ACPI debug level, 1: enable, 0: disable. It is useful for boot time debugging. After system has booted up, it can be set via /proc/acpi/debug_level. @@ -161,12 +164,13 @@ running once the system is up. acpi_generic_hotkey [HW,ACPI] Allow consolidated generic hotkey driver to - over-ride platform specific driver. + override platform specific driver. See also Documentation/acpi-hotkey.txt. enable_timer_pin_1 [i386,x86-64] Enable PIN 1 of APIC timer - Can be useful to work around chipset bugs (in particular on some ATI chipsets) + Can be useful to work around chipset bugs + (in particular on some ATI chipsets). The kernel tries to set a reasonable default. disable_timer_pin_1 [i386,x86-64] @@ -182,7 +186,7 @@ running once the system is up. adlib= [HW,OSS] Format: <io> - + advansys= [HW,SCSI] See header of drivers/scsi/advansys.c. @@ -192,7 +196,7 @@ running once the system is up. aedsp16= [HW,OSS] Audio Excel DSP 16 Format: <io>,<irq>,<dma>,<mss_io>,<mpu_io>,<mpu_irq> See also header of sound/oss/aedsp16.c. - + aha152x= [HW,SCSI] See Documentation/scsi/aha152x.txt. @@ -205,10 +209,6 @@ running once the system is up. aic79xx= [HW,SCSI] See Documentation/scsi/aic79xx.txt. - AM53C974= [HW,SCSI] - Format: <host-scsi-id>,<target-scsi-id>,<max-rate>,<max-offset> - See also header of drivers/scsi/AM53C974.c. - amijoy.map= [HW,JOY] Amiga joystick support Map of devices attached to JOY0DAT and JOY1DAT Format: <a>,<b> @@ -219,23 +219,24 @@ running once the system is up. connected to one of 16 gameports Format: <type1>,<type2>,..<type16> - apc= [HW,SPARC] Power management functions (SPARCstation-4/5 + deriv.) + apc= [HW,SPARC] + Power management functions (SPARCstation-4/5 + deriv.) Format: noidle Disable APC CPU standby support. SPARCstation-Fox does not play well with APC CPU idle - disable it if you have APC and your system crashes randomly. - apic= [APIC,i386] Change the output verbosity whilst booting + apic= [APIC,i386] Change the output verbosity whilst booting Format: { quiet (default) | verbose | debug } Change the amount of debugging information output when initialising the APIC and IO-APIC components. - + apm= [APM] Advanced Power Management See header of arch/i386/kernel/apm.c. applicom= [HW] Format: <mem>,<irq> - + arcrimi= [HW,NET] ARCnet - "RIM I" (entirely mem-mapped) cards Format: <io>,<irq>,<nodeID> @@ -250,38 +251,40 @@ running once the system is up. atkbd.reset= [HW] Reset keyboard during initialization - atkbd.set= [HW] Select keyboard code set - Format: <int> (2 = AT (default) 3 = PS/2) + atkbd.set= [HW] Select keyboard code set + Format: <int> (2 = AT (default), 3 = PS/2) atkbd.scroll= [HW] Enable scroll wheel on MS Office and similar keyboards atkbd.softraw= [HW] Choose between synthetic and real raw mode Format: <bool> (0 = real, 1 = synthetic (default)) - - atkbd.softrepeat= - [HW] Use software keyboard repeat + + atkbd.softrepeat= [HW] + Use software keyboard repeat autotest [IA64] awe= [HW,OSS] AWE32/SB32/AWE64 wave table synth Format: <io>,<memsize>,<isapnp> - + aztcd= [HW,CD] Aztech CD268 CDROM driver Format: <io>,0x79 (?) baycom_epp= [HW,AX25] Format: <io>,<mode> - + baycom_par= [HW,AX25] BayCom Parallel Port AX.25 Modem Format: <io>,<mode> See header of drivers/net/hamradio/baycom_par.c. - baycom_ser_fdx= [HW,AX25] BayCom Serial Port AX.25 Modem (Full Duplex Mode) + baycom_ser_fdx= [HW,AX25] + BayCom Serial Port AX.25 Modem (Full Duplex Mode) Format: <io>,<irq>,<mode>[,<baud>] See header of drivers/net/hamradio/baycom_ser_fdx.c. - baycom_ser_hdx= [HW,AX25] BayCom Serial Port AX.25 Modem (Half Duplex Mode) + baycom_ser_hdx= [HW,AX25] + BayCom Serial Port AX.25 Modem (Half Duplex Mode) Format: <io>,<irq>,<mode> See header of drivers/net/hamradio/baycom_ser_hdx.c. @@ -292,7 +295,8 @@ running once the system is up. blkmtd_count= bttv.card= [HW,V4L] bttv (bt848 + bt878 based grabber cards) - bttv.radio= Most important insmod options are available as kernel args too. + bttv.radio= Most important insmod options are available as + kernel args too. bttv.pll= See Documentation/video4linux/bttv/Insmod-options bttv.tuner= and Documentation/video4linux/bttv/CARDLIST @@ -318,15 +322,17 @@ running once the system is up. checkreqprot [SELINUX] Set initial checkreqprot flag value. Format: { "0" | "1" } See security/selinux/Kconfig help text. - 0 -- check protection applied by kernel (includes any implied execute protection). + 0 -- check protection applied by kernel (includes + any implied execute protection). 1 -- check protection requested by application. Default value is set via a kernel config option. - Value can be changed at runtime via /selinux/checkreqprot. - - clock= [BUGS=IA-32, HW] gettimeofday timesource override. + Value can be changed at runtime via + /selinux/checkreqprot. + + clock= [BUGS=IA-32,HW] gettimeofday timesource override. Forces specified timesource (if avaliable) to be used - when calculating gettimeofday(). If specicified timesource - is not avalible, it defaults to PIT. + when calculating gettimeofday(). If specicified + timesource is not avalible, it defaults to PIT. Format: { pit | tsc | cyclone | pmtmr } hpet= [IA-32,HPET] option to disable HPET and use PIT. @@ -336,17 +342,19 @@ running once the system is up. Format: { auto | [<io>,][<irq>] } com20020= [HW,NET] ARCnet - COM20020 chipset - Format: <io>[,<irq>[,<nodeID>[,<backplane>[,<ckp>[,<timeout>]]]]] + Format: + <io>[,<irq>[,<nodeID>[,<backplane>[,<ckp>[,<timeout>]]]]] com90io= [HW,NET] ARCnet - COM90xx chipset (IO-mapped buffers) Format: <io>[,<irq>] - com90xx= [HW,NET] ARCnet - COM90xx chipset (memory-mapped buffers) + com90xx= [HW,NET] + ARCnet - COM90xx chipset (memory-mapped buffers) Format: <io>[,<irq>[,<memstart>]] condev= [HW,S390] console device conmode= - + console= [KNL] Output console device and options. tty<n> Use the virtual console device <n>. @@ -367,7 +375,8 @@ running once the system is up. options are the same as for ttyS, above. cpcihp_generic= [HW,PCI] Generic port I/O CompactPCI driver - Format: <first_slot>,<last_slot>,<port>,<enum_bit>[,<debug>] + Format: + <first_slot>,<last_slot>,<port>,<enum_bit>[,<debug>] cpia_pp= [HW,PPT] Format: { parport<nr> | auto | none } @@ -384,10 +393,10 @@ running once the system is up. cs89x0_media= [HW,NET] Format: { rj45 | aui | bnc } - + cyclades= [HW,SERIAL] Cyclades multi-serial port adapter. - - dasd= [HW,NET] + + dasd= [HW,NET] See header of drivers/s390/block/dasd_devmap.c. db9.dev[2|3]= [HW,JOY] Multisystem joystick support via parallel port @@ -406,7 +415,7 @@ running once the system is up. dhash_entries= [KNL] Set number of hash buckets for dentry cache. - + digi= [HW,SERIAL] IO parameters + enable/disable command. @@ -424,11 +433,11 @@ running once the system is up. dtc3181e= [HW,SCSI] - earlyprintk= [IA-32, X86-64] + earlyprintk= [IA-32,X86-64] earlyprintk=vga earlyprintk=serial[,ttySn[,baudrate]] - Append ,keep to not disable it when the real console + Append ",keep" to not disable it when the real console takes over. Only vga or serial at a time, not both. @@ -451,7 +460,7 @@ running once the system is up. Format: {"of[f]" | "sk[ipmbr]"} See comment in arch/i386/boot/edd.S - eicon= [HW,ISDN] + eicon= [HW,ISDN] Format: <id>,<membase>,<irq> eisa_irq_edge= [PARISC,HW] @@ -462,12 +471,13 @@ running once the system is up. arch/i386/kernel/cpu/cpufreq/elanfreq.c. elevator= [IOSCHED] - Format: {"as"|"cfq"|"deadline"|"noop"} - See Documentation/block/as-iosched.txt - and Documentation/block/deadline-iosched.txt for details. + Format: {"as" | "cfq" | "deadline" | "noop"} + See Documentation/block/as-iosched.txt and + Documentation/block/deadline-iosched.txt for details. + elfcorehdr= [IA-32] - Specifies physical address of start of kernel core image - elf header. + Specifies physical address of start of kernel core + image elf header. See Documentation/kdump.txt for details. enforcing [SELINUX] Set initial enforcing status. @@ -485,7 +495,7 @@ running once the system is up. es1371= [HW,OSS] Format: <spdif>,[<nomix>,[<amplifier>]] See also header of sound/oss/es1371.c. - + ether= [HW,NET] Ethernet cards parameters This option is obsoleted by the "netdev=" option, which has equivalent usage. See its documentation for details. @@ -526,12 +536,13 @@ running once the system is up. gus= [HW,OSS] Format: <io>,<irq>,<dma>,<dma16> - + gvp11= [HW,SCSI] hashdist= [KNL,NUMA] Large hashes allocated during boot are distributed across NUMA nodes. Defaults on for IA-64, off otherwise. + Format: 0 | 1 (for off | on) hcl= [IA-64] SGI's Hardware Graph compatibility layer @@ -595,13 +606,13 @@ running once the system is up. ide?= [HW] (E)IDE subsystem Format: ide?=noprobe or chipset specific parameters. See Documentation/ide.txt. - + idebus= [HW] (E)IDE subsystem - VLB/PCI bus speed See Documentation/ide.txt. idle= [HW] Format: idle=poll or idle=halt - + ihash_entries= [KNL] Set number of hash buckets for inode cache. @@ -649,7 +660,7 @@ running once the system is up. firmware running. isapnp= [ISAPNP] - Format: <RDP>, <reset>, <pci_scan>, <verbosity> + Format: <RDP>,<reset>,<pci_scan>,<verbosity> isolcpus= [KNL,SMP] Isolate CPUs from the general scheduler. Format: <cpu number>,...,<cpu number> @@ -661,32 +672,33 @@ running once the system is up. "number of CPUs in system - 1". This option is the preferred way to isolate CPUs. The - alternative - manually setting the CPU mask of all tasks - in the system can cause problems and suboptimal load - balancer performance. + alternative -- manually setting the CPU mask of all + tasks in the system -- can cause problems and + suboptimal load balancer performance. isp16= [HW,CD] Format: <io>,<irq>,<dma>,<setup> - iucv= [HW,NET] + iucv= [HW,NET] js= [HW,JOY] Analog joystick See Documentation/input/joystick.txt. keepinitrd [HW,ARM] - kstack=N [IA-32, X86-64] Print N words from the kernel stack + kstack=N [IA-32,X86-64] Print N words from the kernel stack in oops dumps. l2cr= [PPC] - lapic [IA-32,APIC] Enable the local APIC even if BIOS disabled it. + lapic [IA-32,APIC] Enable the local APIC even if BIOS + disabled it. lasi= [HW,SCSI] PARISC LASI driver for the 53c700 chip Format: addr:<io>,irq:<irq> - llsc*= [IA64] - See function print_params() in arch/ia64/sn/kernel/llsc4.c. + llsc*= [IA64] See function print_params() in + arch/ia64/sn/kernel/llsc4.c. load_ramdisk= [RAM] List of ramdisks to load from floppy See Documentation/ramdisk.txt. @@ -713,8 +725,9 @@ running once the system is up. 7 (KERN_DEBUG) debug-level messages log_buf_len=n Sets the size of the printk ring buffer, in bytes. - Format is n, nk, nM. n must be a power of two. The - default is set in kernel config. + Format: { n | nk | nM } + n must be a power of two. The default size + is set in the kernel config file. lp=0 [LP] Specify parallel ports to use, e.g, lp=port[,port...] lp=none,parport0 (lp0 not configured, lp1 uses @@ -750,23 +763,23 @@ running once the system is up. ltpc= [NET] Format: <io>,<irq>,<dma> - mac5380= [HW,SCSI] - Format: <can_queue>,<cmd_per_lun>,<sg_tablesize>,<hostid>,<use_tags> + mac5380= [HW,SCSI] Format: + <can_queue>,<cmd_per_lun>,<sg_tablesize>,<hostid>,<use_tags> - mac53c9x= [HW,SCSI] - Format: <num_esps>,<disconnect>,<nosync>,<can_queue>,<cmd_per_lun>,<sg_tablesize>,<hostid>,<use_tags> + mac53c9x= [HW,SCSI] Format: + <num_esps>,<disconnect>,<nosync>,<can_queue>,<cmd_per_lun>,<sg_tablesize>,<hostid>,<use_tags> - machvec= [IA64] - Force the use of a particular machine-vector (machvec) in a generic - kernel. Example: machvec=hpzx1_swiotlb + machvec= [IA64] Force the use of a particular machine-vector + (machvec) in a generic kernel. + Example: machvec=hpzx1_swiotlb - mad16= [HW,OSS] - Format: <io>,<irq>,<dma>,<dma16>,<mpu_io>,<mpu_irq>,<joystick> + mad16= [HW,OSS] Format: + <io>,<irq>,<dma>,<dma16>,<mpu_io>,<mpu_irq>,<joystick> maui= [HW,OSS] Format: <io>,<irq> - - max_loop= [LOOP] Maximum number of loopback devices that can + + max_loop= [LOOP] Maximum number of loopback devices that can be mounted Format: <1-256> @@ -776,11 +789,11 @@ running once the system is up. max_addr=[KMG] [KNL,BOOT,ia64] All physical memory greater than or equal to this physical address is ignored. - max_luns= [SCSI] Maximum number of LUNs to probe + max_luns= [SCSI] Maximum number of LUNs to probe. Should be between 1 and 2^32-1. max_report_luns= - [SCSI] Maximum number of LUNs received + [SCSI] Maximum number of LUNs received. Should be between 1 and 16384. mca-pentium [BUGS=IA-32] @@ -796,11 +809,11 @@ running once the system is up. md= [HW] RAID subsystems devices and level See Documentation/md.txt. - + mdacon= [MDA] Format: <first>,<last> Specifies range of consoles to be captured by the MDA. - + mem=nn[KMG] [KNL,BOOT] Force usage of a specific amount of memory Amount of memory to be used when the kernel is not able to see the whole system memory or for test. @@ -851,15 +864,15 @@ running once the system is up. MTD_Partition= [MTD] Format: <name>,<region-number>,<size>,<offset> - MTD_Region= [MTD] - Format: <name>,<region-number>[,<base>,<size>,<buswidth>,<altbuswidth>] + MTD_Region= [MTD] Format: + <name>,<region-number>[,<base>,<size>,<buswidth>,<altbuswidth>] mtdparts= [MTD] See drivers/mtd/cmdline.c. mtouchusb.raw_coordinates= - [HW] Make the MicroTouch USB driver use raw coordinates ('y', default) - or cooked coordinates ('n') + [HW] Make the MicroTouch USB driver use raw coordinates + ('y', default) or cooked coordinates ('n') n2= [NET] SDL Inc. RISCom/N2 synchronous serial card @@ -880,7 +893,9 @@ running once the system is up. Format: <irq>,<io>,<mem_start>,<mem_end>,<name> Note that mem_start is often overloaded to mean something different and driver-specific. - + This usage is only documented in each driver source + file if at all. + nfsaddrs= [NFS] See Documentation/nfsroot.txt. @@ -893,8 +908,8 @@ running once the system is up. emulation library even if a 387 maths coprocessor is present. - noalign [KNL,ARM] - + noalign [KNL,ARM] + noapic [SMP,APIC] Tells the kernel to not make use of any IOAPICs that may be present in the system. @@ -905,19 +920,19 @@ running once the system is up. on "Classic" PPC cores. nocache [ARM] - + nodisconnect [HW,SCSI,M68K] Disables SCSI disconnects. noexec [IA-64] - noexec [IA-32, X86-64] + noexec [IA-32,X86-64] noexec=on: enable non-executable mappings (default) noexec=off: disable nn-executable mappings nofxsr [BUGS=IA-32] nohlt [BUGS=ARM] - + no-hlt [BUGS=IA-32] Tells the kernel that the hlt instruction doesn't work correctly and not to use it. @@ -948,8 +963,9 @@ running once the system is up. noresidual [PPC] Don't use residual data on PReP machines. - noresume [SWSUSP] Disables resume and restore original swap space. - + noresume [SWSUSP] Disables resume and restores original swap + space. + no-scroll [VGA] Disables scrollback. This is required for the Braillex ib80-piezo Braille reader made by F.H. Papenmeier (Germany). @@ -965,16 +981,16 @@ running once the system is up. nousb [USB] Disable the USB subsystem nowb [ARM] - + opl3= [HW,OSS] Format: <io> opl3sa= [HW,OSS] Format: <io>,<irq>,<dma>,<dma2>,<mpu_io>,<mpu_irq> - opl3sa2= [HW,OSS] - Format: <io>,<irq>,<dma>,<dma2>,<mss_io>,<mpu_io>,<ymode>,<loopback>[,<isapnp>,<multiple] - + opl3sa2= [HW,OSS] Format: + <io>,<irq>,<dma>,<dma2>,<mss_io>,<mpu_io>,<ymode>,<loopback>[,<isapnp>,<multiple] + oprofile.timer= [HW] Use timer interrupt instead of performance counters @@ -993,36 +1009,33 @@ running once the system is up. Format: <parport#> parkbd.mode= [HW] Parallel port keyboard adapter mode of operation, 0 for XT, 1 for AT (default is AT). - Format: <mode> - - parport=0 [HW,PPT] Specify parallel ports. 0 disables. - parport=auto Use 'auto' to force the driver to use - parport=0xBBB[,IRQ[,DMA]] any IRQ/DMA settings detected (the - default is to ignore detected IRQ/DMA - settings because of possible - conflicts). You can specify the base - address, IRQ, and DMA settings; IRQ and - DMA should be numbers, or 'auto' (for - using detected settings on that - particular port), or 'nofifo' (to avoid - using a FIFO even if it is detected). - Parallel ports are assigned in the - order they are specified on the command - line, starting with parport0. - - parport_init_mode= - [HW,PPT] Configure VIA parallel port to - operate in specific mode. This is - necessary on Pegasos computer where - firmware has no options for setting up - parallel port mode and sets it to - spp. Currently this function knows - 686a and 8231 chips. + Format: <mode> + + parport= [HW,PPT] Specify parallel ports. 0 disables. + Format: { 0 | auto | 0xBBB[,IRQ[,DMA]] } + Use 'auto' to force the driver to use any + IRQ/DMA settings detected (the default is to + ignore detected IRQ/DMA settings because of + possible conflicts). You can specify the base + address, IRQ, and DMA settings; IRQ and DMA + should be numbers, or 'auto' (for using detected + settings on that particular port), or 'nofifo' + (to avoid using a FIFO even if it is detected). + Parallel ports are assigned in the order they + are specified on the command line, starting + with parport0. + + parport_init_mode= [HW,PPT] + Configure VIA parallel port to operate in + a specific mode. This is necessary on Pegasos + computer where firmware has no options for setting + up parallel port mode and sets it to spp. + Currently this function knows 686a and 8231 chips. Format: [spp|ps2|epp|ecp|ecpepp] - pas2= [HW,OSS] - Format: <io>,<irq>,<dma>,<dma16>,<sb_io>,<sb_irq>,<sb_dma>,<sb_dma16> - + pas2= [HW,OSS] Format: + <io>,<irq>,<dma>,<dma16>,<sb_io>,<sb_irq>,<sb_dma>,<sb_dma16> + pas16= [HW,SCSI] See header of drivers/scsi/pas16.c. @@ -1032,64 +1045,67 @@ running once the system is up. See header of drivers/block/paride/pcd.c. See also Documentation/paride.txt. - pci=option[,option...] [PCI] various PCI subsystem options: - off [IA-32] don't probe for the PCI bus - bios [IA-32] force use of PCI BIOS, don't access - the hardware directly. Use this if your machine - has a non-standard PCI host bridge. - nobios [IA-32] disallow use of PCI BIOS, only direct - hardware access methods are allowed. Use this - if you experience crashes upon bootup and you - suspect they are caused by the BIOS. - conf1 [IA-32] Force use of PCI Configuration Mechanism 1. - conf2 [IA-32] Force use of PCI Configuration Mechanism 2. - nosort [IA-32] Don't sort PCI devices according to - order given by the PCI BIOS. This sorting is done - to get a device order compatible with older kernels. - biosirq [IA-32] Use PCI BIOS calls to get the interrupt - routing table. These calls are known to be buggy - on several machines and they hang the machine when used, - but on other computers it's the only way to get the - interrupt routing table. Try this option if the kernel - is unable to allocate IRQs or discover secondary PCI - buses on your motherboard. - rom [IA-32] Assign address space to expansion ROMs. - Use with caution as certain devices share address - decoders between ROMs and other resources. - irqmask=0xMMMM [IA-32] Set a bit mask of IRQs allowed to be assigned - automatically to PCI devices. You can make the kernel - exclude IRQs of your ISA cards this way. + pci=option[,option...] [PCI] various PCI subsystem options: + off [IA-32] don't probe for the PCI bus + bios [IA-32] force use of PCI BIOS, don't access + the hardware directly. Use this if your machine + has a non-standard PCI host bridge. + nobios [IA-32] disallow use of PCI BIOS, only direct + hardware access methods are allowed. Use this + if you experience crashes upon bootup and you + suspect they are caused by the BIOS. + conf1 [IA-32] Force use of PCI Configuration + Mechanism 1. + conf2 [IA-32] Force use of PCI Configuration + Mechanism 2. + nosort [IA-32] Don't sort PCI devices according to + order given by the PCI BIOS. This sorting is + done to get a device order compatible with + older kernels. + biosirq [IA-32] Use PCI BIOS calls to get the interrupt + routing table. These calls are known to be buggy + on several machines and they hang the machine + when used, but on other computers it's the only + way to get the interrupt routing table. Try + this option if the kernel is unable to allocate + IRQs or discover secondary PCI buses on your + motherboard. + rom [IA-32] Assign address space to expansion ROMs. + Use with caution as certain devices share + address decoders between ROMs and other + resources. + irqmask=0xMMMM [IA-32] Set a bit mask of IRQs allowed to be + assigned automatically to PCI devices. You can + make the kernel exclude IRQs of your ISA cards + this way. pirqaddr=0xAAAAA [IA-32] Specify the physical address - of the PIRQ table (normally generated - by the BIOS) if it is outside the - F0000h-100000h range. - lastbus=N [IA-32] Scan all buses till bus #N. Can be useful - if the kernel is unable to find your secondary buses - and you want to tell it explicitly which ones they are. - assign-busses [IA-32] Always assign all PCI bus - numbers ourselves, overriding - whatever the firmware may have - done. - usepirqmask [IA-32] Honor the possible IRQ mask - stored in the BIOS $PIR table. This is - needed on some systems with broken - BIOSes, notably some HP Pavilion N5400 - and Omnibook XE3 notebooks. This will - have no effect if ACPI IRQ routing is - enabled. - noacpi [IA-32] Do not use ACPI for IRQ routing - or for PCI scanning. - routeirq Do IRQ routing for all PCI devices. - This is normally done in pci_enable_device(), - so this option is a temporary workaround - for broken drivers that don't call it. - - firmware [ARM] Do not re-enumerate the bus but - instead just use the configuration - from the bootloader. This is currently - used on IXP2000 systems where the - bus has to be configured a certain way - for adjunct CPUs. + of the PIRQ table (normally generated + by the BIOS) if it is outside the + F0000h-100000h range. + lastbus=N [IA-32] Scan all buses thru bus #N. Can be + useful if the kernel is unable to find your + secondary buses and you want to tell it + explicitly which ones they are. + assign-busses [IA-32] Always assign all PCI bus + numbers ourselves, overriding + whatever the firmware may have done. + usepirqmask [IA-32] Honor the possible IRQ mask stored + in the BIOS $PIR table. This is needed on + some systems with broken BIOSes, notably + some HP Pavilion N5400 and Omnibook XE3 + notebooks. This will have no effect if ACPI + IRQ routing is enabled. + noacpi [IA-32] Do not use ACPI for IRQ routing + or for PCI scanning. + routeirq Do IRQ routing for all PCI devices. + This is normally done in pci_enable_device(), + so this option is a temporary workaround + for broken drivers that don't call it. + firmware [ARM] Do not re-enumerate the bus but instead + just use the configuration from the + bootloader. This is currently used on + IXP2000 systems where the bus has to be + configured a certain way for adjunct CPUs. pcmv= [HW,PCMCIA] BadgePAD 4 @@ -1127,19 +1143,20 @@ running once the system is up. [ISAPNP] Exclude DMAs for the autoconfiguration pnp_reserve_io= [ISAPNP] Exclude I/O ports for the autoconfiguration - Ranges are in pairs (I/O port base and size). + Ranges are in pairs (I/O port base and size). pnp_reserve_mem= - [ISAPNP] Exclude memory regions for the autoconfiguration + [ISAPNP] Exclude memory regions for the + autoconfiguration. Ranges are in pairs (memory base and size). profile= [KNL] Enable kernel profiling via /proc/profile - { schedule | <number> } - (param: schedule - profile schedule points} - (param: profile step/bucket size as a power of 2 for - statistical time based profiling) + Format: [schedule,]<number> + Param: "schedule" - profile schedule points. + Param: <number> - step/bucket size as a power of 2 for + statistical time based profiling. - processor.max_cstate= [HW, ACPI] + processor.max_cstate= [HW,ACPI] Limit processor to maximum C-state max_cstate=9 overrides any DMI blacklist limit. @@ -1147,27 +1164,28 @@ running once the system is up. before loading. See Documentation/ramdisk.txt. - psmouse.proto= [HW,MOUSE] Highest PS2 mouse protocol extension to - probe for (bare|imps|exps|lifebook|any). + psmouse.proto= [HW,MOUSE] Highest PS2 mouse protocol extension to + probe for; one of (bare|imps|exps|lifebook|any). psmouse.rate= [HW,MOUSE] Set desired mouse report rate, in reports per second. - psmouse.resetafter= - [HW,MOUSE] Try to reset the device after so many bad packets + psmouse.resetafter= [HW,MOUSE] + Try to reset the device after so many bad packets (0 = never). psmouse.resolution= [HW,MOUSE] Set desired mouse resolution, in dpi. psmouse.smartscroll= - [HW,MOUSE] Controls Logitech smartscroll autorepeat, + [HW,MOUSE] Controls Logitech smartscroll autorepeat. 0 = disabled, 1 = enabled (default). pss= [HW,OSS] Personal Sound System (ECHO ESC614) - Format: <io>,<mss_io>,<mss_irq>,<mss_dma>,<mpu_io>,<mpu_irq> + Format: + <io>,<mss_io>,<mss_irq>,<mss_dma>,<mpu_io>,<mpu_irq> pt. [PARIDE] See Documentation/paride.txt. quiet= [KNL] Disable log messages - + r128= [HW,DRM] raid= [HW,RAID] @@ -1176,10 +1194,9 @@ running once the system is up. ramdisk= [RAM] Sizes of RAM disks in kilobytes [deprecated] See Documentation/ramdisk.txt. - ramdisk_blocksize= - [RAM] + ramdisk_blocksize= [RAM] See Documentation/ramdisk.txt. - + ramdisk_size= [RAM] Sizes of RAM disks in kilobytes New name for the ramdisk parameter. See Documentation/ramdisk.txt. @@ -1195,7 +1212,8 @@ running once the system is up. reserve= [KNL,BUGS] Force the kernel to ignore some iomem area - resume= [SWSUSP] Specify the partition device for software suspension + resume= [SWSUSP] + Specify the partition device for software suspend rhash_entries= [KNL,NET] Set number of hash buckets for route cache @@ -1225,7 +1243,7 @@ running once the system is up. Format: <io>,<irq>,<dma>,<dma2> sbni= [NET] Granch SBNI12 leased line adapter - + sbpcd= [HW,CD] Soundblaster CD adapter Format: <io>,<type> See a comment before function sbpcd_setup() in @@ -1258,21 +1276,20 @@ running once the system is up. serialnumber [BUGS=IA-32] - sg_def_reserved_size= - [SCSI] - + sg_def_reserved_size= [SCSI] + sgalaxy= [HW,OSS] Format: <io>,<irq>,<dma>,<dma2>,<sgbase> shapers= [NET] Maximal number of shapers. - + sim710= [SCSI,HW] See header of drivers/scsi/sim710.c. simeth= [IA-64] simscsi= - + sjcd= [HW,CD] Format: <io>,<irq>,<dma> See header of drivers/cdrom/sjcd.c. @@ -1403,10 +1420,10 @@ running once the system is up. snd-wavefront= [HW,ALSA] snd-ymfpci= [HW,ALSA] - + sonicvibes= [HW,OSS] Format: <reverb> - + sonycd535= [HW,CD] Format: <io>[,<irq>] @@ -1423,7 +1440,7 @@ running once the system is up. sscape= [HW,OSS] Format: <io>,<irq>,<dma>,<mpu_io>,<mpu_irq> - + st= [HW,SCSI] SCSI tape parameters (buffers, etc.) See Documentation/scsi/st.txt. @@ -1443,10 +1460,8 @@ running once the system is up. stifb= [HW] Format: bpp:<bpp1>[:<bpp2>[:<bpp3>...]] - stram_swap= [HW,M68k] - swiotlb= [IA-64] Number of I/O TLB slabs - + switches= [HW,M68k] sym53c416= [HW,SCSI] @@ -1479,14 +1494,16 @@ running once the system is up. tp720= [HW,PS2] trix= [HW,OSS] MediaTrix AudioTrix Pro - Format: <io>,<irq>,<dma>,<dma2>,<sb_io>,<sb_irq>,<sb_dma>,<mpu_io>,<mpu_irq> - + Format: + <io>,<irq>,<dma>,<dma2>,<sb_io>,<sb_irq>,<sb_dma>,<mpu_io>,<mpu_irq> + tsdev.xres= [TS] Horizontal screen resolution. tsdev.yres= [TS] Vertical screen resolution. - turbografx.map[2|3]= - [HW,JOY] TurboGraFX parallel port interface - Format: <port#>,<js1>,<js2>,<js3>,<js4>,<js5>,<js6>,<js7> + turbografx.map[2|3]= [HW,JOY] + TurboGraFX parallel port interface + Format: + <port#>,<js1>,<js2>,<js3>,<js4>,<js5>,<js6>,<js7> See also Documentation/input/joystick-parport.txt u14-34f= [HW,SCSI] UltraStor 14F/34F SCSI host adapter @@ -1498,21 +1515,20 @@ running once the system is up. uart6850= [HW,OSS] Format: <io>,<irq> - usb-handoff [HW] Enable early USB BIOS -> OS handoff - usbhid.mousepoll= [USBHID] The interval which mice are to be polled at. - + video= [FB] Frame buffer configuration See Documentation/fb/modedb.txt. vga= [BOOT,IA-32] Select a particular video mode - See Documentation/i386/boot.txt and Documentation/svga.txt. + See Documentation/i386/boot.txt and + Documentation/svga.txt. Use vga=ask for menu. This is actually a boot loader parameter; the value is passed to the kernel using a special protocol. - vmalloc=nn[KMG] [KNL,BOOT] forces the vmalloc area to have an exact + vmalloc=nn[KMG] [KNL,BOOT] Forces the vmalloc area to have an exact size of <nn>. This can be used to increase the minimum size (128MB on x86). It can also be used to decrease the size and leave more room for directly @@ -1520,11 +1536,11 @@ running once the system is up. vmhalt= [KNL,S390] - vmpoff= [KNL,S390] - + vmpoff= [KNL,S390] + waveartist= [HW,OSS] Format: <io>,<irq>,<dma>,<dma2> - + wd33c93= [HW,SCSI] See header of drivers/scsi/wd33c93.c. @@ -1538,21 +1554,25 @@ running once the system is up. xd_geo= See header of drivers/block/xd.c. xirc2ps_cs= [NET,PCMCIA] - Format: <irq>,<irq_mask>,<io>,<full_duplex>,<do_sound>,<lockup_hack>[,<irq2>[,<irq3>[,<irq4>]]] - + Format: + <irq>,<irq_mask>,<io>,<full_duplex>,<do_sound>,<lockup_hack>[,<irq2>[,<irq3>[,<irq4>]]] +______________________________________________________________________ Changelog: +2000-06-?? Mr. Unknown The last known update (for 2.4.0) - the changelog was not kept before. - 2000-06-?? Mr. Unknown +2002-11-24 Petr Baudis <pasky@ucw.cz> + Randy Dunlap <randy.dunlap@verizon.net> Update for 2.5.49, description for most of the options introduced, references to other documentation (C files, READMEs, ..), added S390, PPC, SPARC, MTD, ALSA and OSS category. Minor corrections and reformatting. - 2002-11-24 Petr Baudis <pasky@ucw.cz> - Randy Dunlap <randy.dunlap@verizon.net> + +2005-10-19 Randy Dunlap <rdunlap@xenotime.net> + Lots of typos, whitespace, some reformatting. TODO: diff --git a/Documentation/keys-request-key.txt b/Documentation/keys-request-key.txt new file mode 100644 index 000000000000..5f2b9c5edbb5 --- /dev/null +++ b/Documentation/keys-request-key.txt @@ -0,0 +1,161 @@ + =================== + KEY REQUEST SERVICE + =================== + +The key request service is part of the key retention service (refer to +Documentation/keys.txt). This document explains more fully how that the +requesting algorithm works. + +The process starts by either the kernel requesting a service by calling +request_key(): + + struct key *request_key(const struct key_type *type, + const char *description, + const char *callout_string); + +Or by userspace invoking the request_key system call: + + key_serial_t request_key(const char *type, + const char *description, + const char *callout_info, + key_serial_t dest_keyring); + +The main difference between the two access points is that the in-kernel +interface does not need to link the key to a keyring to prevent it from being +immediately destroyed. The kernel interface returns a pointer directly to the +key, and it's up to the caller to destroy the key. + +The userspace interface links the key to a keyring associated with the process +to prevent the key from going away, and returns the serial number of the key to +the caller. + + +=========== +THE PROCESS +=========== + +A request proceeds in the following manner: + + (1) Process A calls request_key() [the userspace syscall calls the kernel + interface]. + + (2) request_key() searches the process's subscribed keyrings to see if there's + a suitable key there. If there is, it returns the key. If there isn't, and + callout_info is not set, an error is returned. Otherwise the process + proceeds to the next step. + + (3) request_key() sees that A doesn't have the desired key yet, so it creates + two things: + + (a) An uninstantiated key U of requested type and description. + + (b) An authorisation key V that refers to key U and notes that process A + is the context in which key U should be instantiated and secured, and + from which associated key requests may be satisfied. + + (4) request_key() then forks and executes /sbin/request-key with a new session + keyring that contains a link to auth key V. + + (5) /sbin/request-key execs an appropriate program to perform the actual + instantiation. + + (6) The program may want to access another key from A's context (say a + Kerberos TGT key). It just requests the appropriate key, and the keyring + search notes that the session keyring has auth key V in its bottom level. + + This will permit it to then search the keyrings of process A with the + UID, GID, groups and security info of process A as if it was process A, + and come up with key W. + + (7) The program then does what it must to get the data with which to + instantiate key U, using key W as a reference (perhaps it contacts a + Kerberos server using the TGT) and then instantiates key U. + + (8) Upon instantiating key U, auth key V is automatically revoked so that it + may not be used again. + + (9) The program then exits 0 and request_key() deletes key V and returns key + U to the caller. + +This also extends further. If key W (step 5 above) didn't exist, key W would be +created uninstantiated, another auth key (X) would be created [as per step 3] +and another copy of /sbin/request-key spawned [as per step 4]; but the context +specified by auth key X will still be process A, as it was in auth key V. + +This is because process A's keyrings can't simply be attached to +/sbin/request-key at the appropriate places because (a) execve will discard two +of them, and (b) it requires the same UID/GID/Groups all the way through. + + +====================== +NEGATIVE INSTANTIATION +====================== + +Rather than instantiating a key, it is possible for the possessor of an +authorisation key to negatively instantiate a key that's under construction. +This is a short duration placeholder that causes any attempt at re-requesting +the key whilst it exists to fail with error ENOKEY. + +This is provided to prevent excessive repeated spawning of /sbin/request-key +processes for a key that will never be obtainable. + +Should the /sbin/request-key process exit anything other than 0 or die on a +signal, the key under construction will be automatically negatively +instantiated for a short amount of time. + + +==================== +THE SEARCH ALGORITHM +==================== + +A search of any particular keyring proceeds in the following fashion: + + (1) When the key management code searches for a key (keyring_search_aux) it + firstly calls key_permission(SEARCH) on the keyring it's starting with, + if this denies permission, it doesn't search further. + + (2) It considers all the non-keyring keys within that keyring and, if any key + matches the criteria specified, calls key_permission(SEARCH) on it to see + if the key is allowed to be found. If it is, that key is returned; if + not, the search continues, and the error code is retained if of higher + priority than the one currently set. + + (3) It then considers all the keyring-type keys in the keyring it's currently + searching. It calls key_permission(SEARCH) on each keyring, and if this + grants permission, it recurses, executing steps (2) and (3) on that + keyring. + +The process stops immediately a valid key is found with permission granted to +use it. Any error from a previous match attempt is discarded and the key is +returned. + +When search_process_keyrings() is invoked, it performs the following searches +until one succeeds: + + (1) If extant, the process's thread keyring is searched. + + (2) If extant, the process's process keyring is searched. + + (3) The process's session keyring is searched. + + (4) If the process has a request_key() authorisation key in its session + keyring then: + + (a) If extant, the calling process's thread keyring is searched. + + (b) If extant, the calling process's process keyring is searched. + + (c) The calling process's session keyring is searched. + +The moment one succeeds, all pending errors are discarded and the found key is +returned. + +Only if all these fail does the whole thing fail with the highest priority +error. Note that several errors may have come from LSM. + +The error priority is: + + EKEYREVOKED > EKEYEXPIRED > ENOKEY + +EACCES/EPERM are only returned on a direct search of a specific keyring where +the basal keyring does not grant Search permission. diff --git a/Documentation/keys.txt b/Documentation/keys.txt index 0321ded4b9ae..31154882000a 100644 --- a/Documentation/keys.txt +++ b/Documentation/keys.txt @@ -195,8 +195,8 @@ KEY ACCESS PERMISSIONS ====================== Keys have an owner user ID, a group access ID, and a permissions mask. The mask -has up to eight bits each for user, group and other access. Only five of each -set of eight bits are defined. These permissions granted are: +has up to eight bits each for possessor, user, group and other access. Only +six of each set of eight bits are defined. These permissions granted are: (*) View @@ -224,6 +224,10 @@ set of eight bits are defined. These permissions granted are: keyring to a key, a process must have Write permission on the keyring and Link permission on the key. + (*) Set Attribute + + This permits a key's UID, GID and permissions mask to be changed. + For changing the ownership, group ID or permissions mask, being the owner of the key or having the sysadmin capability is sufficient. @@ -241,16 +245,16 @@ about the status of the key service: type, description and permissions. The payload of the key is not available this way: - SERIAL FLAGS USAGE EXPY PERM UID GID TYPE DESCRIPTION: SUMMARY - 00000001 I----- 39 perm 1f0000 0 0 keyring _uid_ses.0: 1/4 - 00000002 I----- 2 perm 1f0000 0 0 keyring _uid.0: empty - 00000007 I----- 1 perm 1f0000 0 0 keyring _pid.1: empty - 0000018d I----- 1 perm 1f0000 0 0 keyring _pid.412: empty - 000004d2 I--Q-- 1 perm 1f0000 32 -1 keyring _uid.32: 1/4 - 000004d3 I--Q-- 3 perm 1f0000 32 -1 keyring _uid_ses.32: empty - 00000892 I--QU- 1 perm 1f0000 0 0 user metal:copper: 0 - 00000893 I--Q-N 1 35s 1f0000 0 0 user metal:silver: 0 - 00000894 I--Q-- 1 10h 1f0000 0 0 user metal:gold: 0 + SERIAL FLAGS USAGE EXPY PERM UID GID TYPE DESCRIPTION: SUMMARY + 00000001 I----- 39 perm 1f3f0000 0 0 keyring _uid_ses.0: 1/4 + 00000002 I----- 2 perm 1f3f0000 0 0 keyring _uid.0: empty + 00000007 I----- 1 perm 1f3f0000 0 0 keyring _pid.1: empty + 0000018d I----- 1 perm 1f3f0000 0 0 keyring _pid.412: empty + 000004d2 I--Q-- 1 perm 1f3f0000 32 -1 keyring _uid.32: 1/4 + 000004d3 I--Q-- 3 perm 1f3f0000 32 -1 keyring _uid_ses.32: empty + 00000892 I--QU- 1 perm 1f000000 0 0 user metal:copper: 0 + 00000893 I--Q-N 1 35s 1f3f0000 0 0 user metal:silver: 0 + 00000894 I--Q-- 1 10h 003f0000 0 0 user metal:gold: 0 The flags are: @@ -361,6 +365,8 @@ The main syscalls are: /sbin/request-key will be invoked in an attempt to obtain a key. The callout_info string will be passed as an argument to the program. + See also Documentation/keys-request-key.txt. + The keyctl syscall functions are: @@ -533,8 +539,8 @@ The keyctl syscall functions are: (*) Read the payload data from a key: - key_serial_t keyctl(KEYCTL_READ, key_serial_t keyring, char *buffer, - size_t buflen); + long keyctl(KEYCTL_READ, key_serial_t keyring, char *buffer, + size_t buflen); This function attempts to read the payload data from the specified key into the buffer. The process must have read permission on the key to @@ -555,9 +561,9 @@ The keyctl syscall functions are: (*) Instantiate a partially constructed key. - key_serial_t keyctl(KEYCTL_INSTANTIATE, key_serial_t key, - const void *payload, size_t plen, - key_serial_t keyring); + long keyctl(KEYCTL_INSTANTIATE, key_serial_t key, + const void *payload, size_t plen, + key_serial_t keyring); If the kernel calls back to userspace to complete the instantiation of a key, userspace should use this call to supply data for the key before the @@ -576,8 +582,8 @@ The keyctl syscall functions are: (*) Negatively instantiate a partially constructed key. - key_serial_t keyctl(KEYCTL_NEGATE, key_serial_t key, - unsigned timeout, key_serial_t keyring); + long keyctl(KEYCTL_NEGATE, key_serial_t key, + unsigned timeout, key_serial_t keyring); If the kernel calls back to userspace to complete the instantiation of a key, userspace should use this call mark the key as negative before the @@ -637,6 +643,34 @@ call, and the key released upon close. How to deal with conflicting keys due to two different users opening the same file is left to the filesystem author to solve. +Note that there are two different types of pointers to keys that may be +encountered: + + (*) struct key * + + This simply points to the key structure itself. Key structures will be at + least four-byte aligned. + + (*) key_ref_t + + This is equivalent to a struct key *, but the least significant bit is set + if the caller "possesses" the key. By "possession" it is meant that the + calling processes has a searchable link to the key from one of its + keyrings. There are three functions for dealing with these: + + key_ref_t make_key_ref(const struct key *key, + unsigned long possession); + + struct key *key_ref_to_ptr(const key_ref_t key_ref); + + unsigned long is_key_possessed(const key_ref_t key_ref); + + The first function constructs a key reference from a key pointer and + possession information (which must be 0 or 1 and not any other value). + + The second function retrieves the key pointer from a reference and the + third retrieves the possession flag. + When accessing a key's payload contents, certain precautions must be taken to prevent access vs modification races. See the section "Notes on accessing payload contents" for more information. @@ -660,12 +694,18 @@ payload contents" for more information. If successful, the key will have been attached to the default keyring for implicitly obtained request-key keys, as set by KEYCTL_SET_REQKEY_KEYRING. + See also Documentation/keys-request-key.txt. + (*) When it is no longer required, the key should be released using: void key_put(struct key *key); - This can be called from interrupt context. If CONFIG_KEYS is not set then + Or: + + void key_ref_put(key_ref_t key_ref); + + These can be called from interrupt context. If CONFIG_KEYS is not set then the argument will not be parsed. @@ -689,13 +729,17 @@ payload contents" for more information. (*) If a keyring was found in the search, this can be further searched by: - struct key *keyring_search(struct key *keyring, - const struct key_type *type, - const char *description) + key_ref_t keyring_search(key_ref_t keyring_ref, + const struct key_type *type, + const char *description) This searches the keyring tree specified for a matching key. Error ENOKEY - is returned upon failure. If successful, the returned key will need to be - released. + is returned upon failure (use IS_ERR/PTR_ERR to determine). If successful, + the returned key will need to be released. + + The possession attribute from the keyring reference is used to control + access through the permissions mask and is propagated to the returned key + reference pointer if successful. (*) To check the validity of a key, this function can be called: @@ -732,7 +776,7 @@ More complex payload contents must be allocated and a pointer to them set in key->payload.data. One of the following ways must be selected to access the data: - (1) Unmodifyable key type. + (1) Unmodifiable key type. If the key type does not have a modify method, then the key's payload can be accessed without any form of locking, provided that it's known to be diff --git a/Documentation/m68k/kernel-options.txt b/Documentation/m68k/kernel-options.txt index e191baad8308..d5d3f064f552 100644 --- a/Documentation/m68k/kernel-options.txt +++ b/Documentation/m68k/kernel-options.txt @@ -626,7 +626,7 @@ ignored (others aren't affected). can be performed in optimal order. Not all SCSI devices support tagged queuing (:-(). -4.6 switches= +4.5 switches= ------------- Syntax: switches=<list of switches> @@ -661,28 +661,6 @@ correctly. earlier initialization ("ov_"-less) takes precedence. But the switching-off on reset still happens in this case. -4.5) stram_swap= ----------------- - -Syntax: stram_swap=<do_swap>[,<max_swap>] - - This option is available only if the kernel has been compiled with -CONFIG_STRAM_SWAP enabled. Normally, the kernel then determines -dynamically whether to actually use ST-RAM as swap space. (Currently, -the fraction of ST-RAM must be less or equal 1/3 of total memory to -enable this swapping.) You can override the kernel's decision by -specifying this option. 1 for <do_swap> means always enable the swap, -even if you have less alternate RAM. 0 stands for never swap to -ST-RAM, even if it's small enough compared to the rest of memory. - - If ST-RAM swapping is enabled, the kernel usually uses all free -ST-RAM as swap "device". If the kernel resides in ST-RAM, the region -allocated by it is obviously never used for swapping :-) You can also -limit this amount by specifying the second parameter, <max_swap>, if -you want to use parts of ST-RAM as normal system memory. <max_swap> is -in kBytes and the number should be a multiple of 4 (otherwise: rounded -down). - 5) Options for Amiga Only: ========================== diff --git a/Documentation/magic-number.txt b/Documentation/magic-number.txt index bd8eefa17587..af67faccf4de 100644 --- a/Documentation/magic-number.txt +++ b/Documentation/magic-number.txt @@ -120,7 +120,7 @@ ISDN_NET_MAGIC 0x49344C02 isdn_net_local_s drivers/isdn/i4l/isdn_net_li SAVEKMSG_MAGIC2 0x4B4D5347 savekmsg arch/*/amiga/config.c STLI_BOARDMAGIC 0x4bc6c825 stlibrd include/linux/istallion.h CS_STATE_MAGIC 0x4c4f4749 cs_state sound/oss/cs46xx.c -SLAB_C_MAGIC 0x4f17a36d kmem_cache_s mm/slab.c +SLAB_C_MAGIC 0x4f17a36d kmem_cache mm/slab.c COW_MAGIC 0x4f4f4f4d cow_header_v1 arch/um/drivers/ubd_user.c I810_CARD_MAGIC 0x5072696E i810_card sound/oss/i810_audio.c TRIDENT_CARD_MAGIC 0x5072696E trident_card sound/oss/trident.c diff --git a/Documentation/mips/AU1xxx_IDE.README b/Documentation/mips/AU1xxx_IDE.README new file mode 100644 index 000000000000..a7e4c4ea3560 --- /dev/null +++ b/Documentation/mips/AU1xxx_IDE.README @@ -0,0 +1,168 @@ +README for MIPS AU1XXX IDE driver - Released 2005-07-15 + +ABOUT +----- +This file describes the 'drivers/ide/mips/au1xxx-ide.c', related files and the +services they provide. + +If you are short in patience and just want to know how to add your hard disc to +the white or black list, go to the 'ADD NEW HARD DISC TO WHITE OR BLACK LIST' +section. + + +LICENSE +------- + +Copyright (c) 2003-2005 AMD, Personal Connectivity Solutions + +This program is free software; you can redistribute it and/or modify it under +the terms of the GNU General Public License as published by the Free Software +Foundation; either version 2 of the License, or (at your option) any later +version. + +THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, +INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND +FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR +BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR +CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF +SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS +INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN +CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) +ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE +POSSIBILITY OF SUCH DAMAGE. + +You should have received a copy of the GNU General Public License along with +this program; if not, write to the Free Software Foundation, Inc., +675 Mass Ave, Cambridge, MA 02139, USA. + +Note: for more information, please refer "AMD Alchemy Au1200/Au1550 IDE + Interface and Linux Device Driver" Application Note. + + +FILES, CONFIGS AND COMPATABILITY +-------------------------------- + +Two files are introduced: + + a) 'include/asm-mips/mach-au1x00/au1xxx_ide.h' + containes : struct _auide_hwif + struct drive_list_entry dma_white_list + struct drive_list_entry dma_black_list + timing parameters for PIO mode 0/1/2/3/4 + timing parameters for MWDMA 0/1/2 + + b) 'drivers/ide/mips/au1xxx-ide.c' + contains the functionality of the AU1XXX IDE driver + +Four configs variables are introduced: + + CONFIG_BLK_DEV_IDE_AU1XXX_PIO_DBDMA - enable the PIO+DBDMA mode + CONFIG_BLK_DEV_IDE_AU1XXX_MDMA2_DBDMA - enable the MWDMA mode + CONFIG_BLK_DEV_IDE_AU1XXX_BURSTABLE_ON - set Burstable FIFO in DBDMA + controler + CONFIG_BLK_DEV_IDE_AU1XXX_SEQTS_PER_RQ - maximum transfer size + per descriptor + +If MWDMA is enabled and the connected hard disc is not on the white list, the +kernel switches to a "safe mwdma mode" at boot time. In this mode the IDE +performance is substantial slower then in full speed mwdma. In this case +please add your hard disc to the white list (follow instruction from 'ADD NEW +HARD DISC TO WHITE OR BLACK LIST' section). + + +SUPPORTED IDE MODES +------------------- + +The AU1XXX IDE driver supported all PIO modes - PIO mode 0/1/2/3/4 - and all +MWDMA modes - MWDMA 0/1/2 -. There is no support for SWDMA and UDMA mode. + +To change the PIO mode use the program hdparm with option -p, e.g. +'hdparm -p0 [device]' for PIO mode 0. To enable the MWDMA mode use the option +-X, e.g. 'hdparm -X32 [device]' for MWDMA mode 0. + + +PERFORMANCE CONFIGURATIONS +-------------------------- + +If the used system doesn't need USB support enable the following kernel configs: + +CONFIG_IDE=y +CONFIG_BLK_DEV_IDE=y +CONFIG_IDE_GENERIC=y +CONFIG_BLK_DEV_IDEPCI=y +CONFIG_BLK_DEV_GENERIC=y +CONFIG_BLK_DEV_IDEDMA_PCI=y +CONFIG_IDEDMA_PCI_AUTO=y +CONFIG_BLK_DEV_IDE_AU1XXX=y +CONFIG_BLK_DEV_IDE_AU1XXX_MDMA2_DBDMA=y +CONFIG_BLK_DEV_IDE_AU1XXX_BURSTABLE_ON=y +CONFIG_BLK_DEV_IDE_AU1XXX_SEQTS_PER_RQ=128 +CONFIG_BLK_DEV_IDEDMA=y +CONFIG_IDEDMA_AUTO=y + +If the used system need the USB support enable the following kernel configs for +high IDE to USB throughput. + +CONFIG_BLK_DEV_IDEDISK=y +CONFIG_IDE_GENERIC=y +CONFIG_BLK_DEV_IDEPCI=y +CONFIG_BLK_DEV_GENERIC=y +CONFIG_BLK_DEV_IDEDMA_PCI=y +CONFIG_IDEDMA_PCI_AUTO=y +CONFIG_BLK_DEV_IDE_AU1XXX=y +CONFIG_BLK_DEV_IDE_AU1XXX_MDMA2_DBDMA=y +CONFIG_BLK_DEV_IDE_AU1XXX_SEQTS_PER_RQ=128 +CONFIG_BLK_DEV_IDEDMA=y +CONFIG_IDEDMA_AUTO=y + + +ADD NEW HARD DISC TO WHITE OR BLACK LIST +---------------------------------------- + +Step 1 : detect the model name of your hard disc + + a) connect your hard disc to the AU1XXX + + b) boot your kernel and get the hard disc model. + + Example boot log: + + --snipped-- + Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2 + ide: Assuming 50MHz system bus speed for PIO modes; override with idebus=xx + Au1xxx IDE(builtin) configured for MWDMA2 + Probing IDE interface ide0... + hda: Maxtor 6E040L0, ATA DISK drive + ide0 at 0xac800000-0xac800007,0xac8001c0 on irq 64 + hda: max request size: 64KiB + hda: 80293248 sectors (41110 MB) w/2048KiB Cache, CHS=65535/16/63, (U)DMA + --snipped-- + + In this example 'Maxtor 6E040L0'. + +Step 2 : edit 'include/asm-mips/mach-au1x00/au1xxx_ide.h' + + Add your hard disc to the dma_white_list or dma_black_list structur. + +Step 3 : Recompile the kernel + + Enable MWDMA support in the kernel configuration. Recompile the kernel and + reboot. + +Step 4 : Tests + + If you have add a hard disc to the white list, please run some stress tests + for verification. + + +ACKNOWLEDGMENTS +--------------- + +These drivers wouldn't have been done without the base of kernel 2.4.x AU1XXX +IDE driver from AMD. + +Additional input also from: +Matthias Lenk <matthias.lenk@amd.com> + +Happy hacking! +Enrico Walther <enrico.walther@amd.com> diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt index a55f0f95b171..b0fe41da007b 100644 --- a/Documentation/networking/bonding.txt +++ b/Documentation/networking/bonding.txt @@ -777,7 +777,7 @@ doing so is the same as described in the "Configuring Multiple Bonds Manually" section, below. NOTE: It has been observed that some Red Hat supplied kernels -are apparently unable to rename modules at load time (the "-obonding1" +are apparently unable to rename modules at load time (the "-o bond1" part). Attempts to pass that option to modprobe will produce an "Operation not permitted" error. This has been reported on some Fedora Core kernels, and has been seen on RHEL 4 as well. On kernels @@ -883,7 +883,8 @@ the above does not work, and the second bonding instance never sees its options. In that case, the second options line can be substituted as follows: -install bonding1 /sbin/modprobe bonding -obond1 mode=balance-alb miimon=50 +install bond1 /sbin/modprobe --ignore-install bonding -o bond1 \ + mode=balance-alb miimon=50 This may be repeated any number of times, specifying a new and unique name in place of bond1 for each subsequent instance. diff --git a/Documentation/networking/decnet.txt b/Documentation/networking/decnet.txt index c6bd25f5d61d..e6c39c5831f5 100644 --- a/Documentation/networking/decnet.txt +++ b/Documentation/networking/decnet.txt @@ -176,8 +176,6 @@ information (_most_ of which _is_ _essential_) includes: - Which client caused the problem ? - How much data was being transferred ? - Was the network congested ? - - If there was a kernel panic, please run the output through ksymoops - before sending it to me, otherwise its _useless_. - How can the problem be reproduced ? - Can you use tcpdump to get a trace ? (N.B. Most (all?) versions of tcpdump don't understand how to dump DECnet properly, so including diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index ab65714d95fc..65895bb51414 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -309,7 +309,7 @@ tcp_tso_win_divisor - INTEGER can be consumed by a single TSO frame. The setting of this parameter is a choice between burstiness and building larger TSO frames. - Default: 8 + Default: 3 tcp_frto - BOOLEAN Enables F-RTO, an enhanced recovery algorithm for TCP retransmission @@ -355,10 +355,14 @@ ip_dynaddr - BOOLEAN Default: 0 icmp_echo_ignore_all - BOOLEAN + If set non-zero, then the kernel will ignore all ICMP ECHO + requests sent to it. + Default: 0 + icmp_echo_ignore_broadcasts - BOOLEAN - If either is set to true, then the kernel will ignore either all - ICMP ECHO requests sent to it or just those to broadcast/multicast - addresses, respectively. + If set non-zero, then the kernel will ignore all ICMP ECHO and + TIMESTAMP requests sent to it via broadcast/multicast. + Default: 1 icmp_ratelimit - INTEGER Limit the maximal rates for sending ICMP packets whose type matches diff --git a/Documentation/networking/s2io.txt b/Documentation/networking/s2io.txt index 6726b524ec45..bd528ffbeb4b 100644 --- a/Documentation/networking/s2io.txt +++ b/Documentation/networking/s2io.txt @@ -1,48 +1,153 @@ -S2IO Technologies XFrame 10 Gig adapter. -------------------------------------------- - -I. Module loadable parameters. -When loaded as a module, the driver provides a host of Module loadable -parameters, so the device can be tuned as per the users needs. -A list of the Module params is given below. -(i) ring_num: This can be used to program the number of - receive rings used in the driver. -(ii) ring_len: This defines the number of descriptors each ring - can have. There can be a maximum of 8 rings. -(iii) frame_len: This is an array of size 8. Using this we can - set the maximum size of the received frame that can - be steered into the corrsponding receive ring. -(iv) fifo_num: This defines the number of Tx FIFOs thats used in - the driver. -(v) fifo_len: Each element defines the number of - Tx descriptors that can be associated with each - corresponding FIFO. There are a maximum of 8 FIFOs. -(vi) tx_prio: This is a bool, if module is loaded with a non-zero - value for tx_prio multi FIFO scheme is activated. -(vii) rx_prio: This is a bool, if module is loaded with a non-zero - value for tx_prio multi RING scheme is activated. -(viii) latency_timer: The value given against this param will be - loaded into the latency timer register in PCI Config - space, else the register is left with its reset value. - -II. Performance tuning. - By changing a few sysctl parameters. - Copy the following lines into a file and run the following command, - "sysctl -p <file_name>" -### IPV4 specific settings -net.ipv4.tcp_timestamps = 0 # turns TCP timestamp support off, default 1, reduces CPU use -net.ipv4.tcp_sack = 0 # turn SACK support off, default on -# on systems with a VERY fast bus -> memory interface this is the big gainer -net.ipv4.tcp_rmem = 10000000 10000000 10000000 # sets min/default/max TCP read buffer, default 4096 87380 174760 -net.ipv4.tcp_wmem = 10000000 10000000 10000000 # sets min/pressure/max TCP write buffer, default 4096 16384 131072 -net.ipv4.tcp_mem = 10000000 10000000 10000000 # sets min/pressure/max TCP buffer space, default 31744 32256 32768 - -### CORE settings (mostly for socket and UDP effect) -net.core.rmem_max = 524287 # maximum receive socket buffer size, default 131071 -net.core.wmem_max = 524287 # maximum send socket buffer size, default 131071 -net.core.rmem_default = 524287 # default receive socket buffer size, default 65535 -net.core.wmem_default = 524287 # default send socket buffer size, default 65535 -net.core.optmem_max = 524287 # maximum amount of option memory buffers, default 10240 -net.core.netdev_max_backlog = 300000 # number of unprocessed input packets before kernel starts dropping them, default 300 ----End of performance tuning file--- +Release notes for Neterion's (Formerly S2io) Xframe I/II PCI-X 10GbE driver. + +Contents +======= +- 1. Introduction +- 2. Identifying the adapter/interface +- 3. Features supported +- 4. Command line parameters +- 5. Performance suggestions +- 6. Available Downloads + + +1. Introduction: +This Linux driver supports Neterion's Xframe I PCI-X 1.0 and +Xframe II PCI-X 2.0 adapters. It supports several features +such as jumbo frames, MSI/MSI-X, checksum offloads, TSO, UFO and so on. +See below for complete list of features. +All features are supported for both IPv4 and IPv6. + +2. Identifying the adapter/interface: +a. Insert the adapter(s) in your system. +b. Build and load driver +# insmod s2io.ko +c. View log messages +# dmesg | tail -40 +You will see messages similar to: +eth3: Neterion Xframe I 10GbE adapter (rev 3), Version 2.0.9.1, Intr type INTA +eth4: Neterion Xframe II 10GbE adapter (rev 2), Version 2.0.9.1, Intr type INTA +eth4: Device is on 64 bit 133MHz PCIX(M1) bus + +The above messages identify the adapter type(Xframe I/II), adapter revision, +driver version, interface name(eth3, eth4), Interrupt type(INTA, MSI, MSI-X). +In case of Xframe II, the PCI/PCI-X bus width and frequency are displayed +as well. + +To associate an interface with a physical adapter use "ethtool -p <ethX>". +The corresponding adapter's LED will blink multiple times. + +3. Features supported: +a. Jumbo frames. Xframe I/II supports MTU upto 9600 bytes, +modifiable using ifconfig command. + +b. Offloads. Supports checksum offload(TCP/UDP/IP) on transmit +and receive, TSO. + +c. Multi-buffer receive mode. Scattering of packet across multiple +buffers. Currently driver supports 2-buffer mode which yields +significant performance improvement on certain platforms(SGI Altix, +IBM xSeries). + +d. MSI/MSI-X. Can be enabled on platforms which support this feature +(IA64, Xeon) resulting in noticeable performance improvement(upto 7% +on certain platforms). + +e. NAPI. Compile-time option(CONFIG_S2IO_NAPI) for better Rx interrupt +moderation. + +f. Statistics. Comprehensive MAC-level and software statistics displayed +using "ethtool -S" option. + +g. Multi-FIFO/Ring. Supports up to 8 transmit queues and receive rings, +with multiple steering options. + +4. Command line parameters +a. tx_fifo_num +Number of transmit queues +Valid range: 1-8 +Default: 1 + +b. rx_ring_num +Number of receive rings +Valid range: 1-8 +Default: 1 + +c. tx_fifo_len +Size of each transmit queue +Valid range: Total length of all queues should not exceed 8192 +Default: 4096 + +d. rx_ring_sz +Size of each receive ring(in 4K blocks) +Valid range: Limited by memory on system +Default: 30 + +e. intr_type +Specifies interrupt type. Possible values 1(INTA), 2(MSI), 3(MSI-X) +Valid range: 1-3 +Default: 1 + +5. Performance suggestions +General: +a. Set MTU to maximum(9000 for switch setup, 9600 in back-to-back configuration) +b. Set TCP windows size to optimal value. +For instance, for MTU=1500 a value of 210K has been observed to result in +good performance. +# sysctl -w net.ipv4.tcp_rmem="210000 210000 210000" +# sysctl -w net.ipv4.tcp_wmem="210000 210000 210000" +For MTU=9000, TCP window size of 10 MB is recommended. +# sysctl -w net.ipv4.tcp_rmem="10000000 10000000 10000000" +# sysctl -w net.ipv4.tcp_wmem="10000000 10000000 10000000" + +Transmit performance: +a. By default, the driver respects BIOS settings for PCI bus parameters. +However, you may want to experiment with PCI bus parameters +max-split-transactions(MOST) and MMRBC (use setpci command). +A MOST value of 2 has been found optimal for Opterons and 3 for Itanium. +It could be different for your hardware. +Set MMRBC to 4K**. + +For example you can set +For opteron +#setpci -d 17d5:* 62=1d +For Itanium +#setpci -d 17d5:* 62=3d + +For detailed description of the PCI registers, please see Xframe User Guide. + +b. Ensure Transmit Checksum offload is enabled. Use ethtool to set/verify this +parameter. +c. Turn on TSO(using "ethtool -K") +# ethtool -K <ethX> tso on + +Receive performance: +a. By default, the driver respects BIOS settings for PCI bus parameters. +However, you may want to set PCI latency timer to 248. +#setpci -d 17d5:* LATENCY_TIMER=f8 +For detailed description of the PCI registers, please see Xframe User Guide. +b. Use 2-buffer mode. This results in large performance boost on +on certain platforms(eg. SGI Altix, IBM xSeries). +c. Ensure Receive Checksum offload is enabled. Use "ethtool -K ethX" command to +set/verify this option. +d. Enable NAPI feature(in kernel configuration Device Drivers ---> Network +device support ---> Ethernet (10000 Mbit) ---> S2IO 10Gbe Xframe NIC) to +bring down CPU utilization. + +** For AMD opteron platforms with 8131 chipset, MMRBC=1 and MOST=1 are +recommended as safe parameters. +For more information, please review the AMD8131 errata at +http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/26310.pdf + +6. Available Downloads +Neterion "s2io" driver in Red Hat and Suse 2.6-based distributions is kept up +to date, also the latest "s2io" code (including support for 2.4 kernels) is +available via "Support" link on the Neterion site: http://www.neterion.com. + +For Xframe User Guide (Programming manual), visit ftp site ns1.s2io.com, +user: linuxdocs password: HALdocs + +7. Support +For further support please contact either your 10GbE Xframe NIC vendor (IBM, +HP, SGI etc.) or click on the "Support" link on the Neterion site: +http://www.neterion.com. diff --git a/Documentation/oops-tracing.txt b/Documentation/oops-tracing.txt index 66eaaab7773d..c563842ed805 100644 --- a/Documentation/oops-tracing.txt +++ b/Documentation/oops-tracing.txt @@ -1,6 +1,6 @@ NOTE: ksymoops is useless on 2.6. Please use the Oops in its original format (from dmesg, etc). Ignore any references in this or other docs to "decoding -the Oops" or "running it through ksymoops". If you post an Oops fron 2.6 that +the Oops" or "running it through ksymoops". If you post an Oops from 2.6 that has been run through ksymoops, people will just tell you to repost it. Quick Summary diff --git a/Documentation/power/video.txt b/Documentation/power/video.txt index 526d6dd267ea..912bed87c758 100644 --- a/Documentation/power/video.txt +++ b/Documentation/power/video.txt @@ -11,9 +11,9 @@ boot video card. (Kernel usually does not even contain video card driver -- vesafb and vgacon are widely used). This is not problem for swsusp, because during swsusp resume, BIOS is -run normally so video card is normally initialized. S3 has absolutely -no chance of working with SMP/HT. Be sure it to turn it off before -testing (swsusp should work ok, OTOH). +run normally so video card is normally initialized. It should not be +problem for S1 standby, because hardware should retain its state over +that. There are a few types of systems where video works after S3 resume: @@ -64,7 +64,7 @@ your video card (good luck getting docs :-(). Maybe suspending from X (proper X, knowing your hardware, not XF68_FBcon) might have better chance of working. -Table of known working systems: +Table of known working notebooks: Model hack (or "how to do it") ------------------------------------------------------------------------------ @@ -73,7 +73,7 @@ Acer TM 242FX vbetool (6) Acer TM C110 video_post (8) Acer TM C300 vga=normal (only suspend on console, not in X), vbetool (6) or video_post (8) Acer TM 4052LCi s3_bios (2) -Acer TM 636Lci s3_bios vga=normal (2) +Acer TM 636Lci s3_bios,s3_mode (4) Acer TM 650 (Radeon M7) vga=normal plus boot-radeon (5) gets text console back Acer TM 660 ??? (*) Acer TM 800 vga=normal, X patches, see webpage (5) or vbetool (6) @@ -137,6 +137,13 @@ Toshiba Satellite P10-554 s3_bios,s3_mode (4)(****) Toshiba M30 (2) xor X with nvidia driver using internal AGP Uniwill 244IIO ??? (*) +Known working desktop systems +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Mainboard Graphics card hack (or "how to do it") +------------------------------------------------------------------------------ +Asus A7V8X nVidia RIVA TNT2 model 64 s3_bios,s3_mode (4) + (*) from http://www.ubuntulinux.org/wiki/HoaryPMResults, not sure which options to use. If you know, please tell me. diff --git a/Documentation/s390/driver-model.txt b/Documentation/s390/driver-model.txt index 19461958e2bd..df09758bf3fe 100644 --- a/Documentation/s390/driver-model.txt +++ b/Documentation/s390/driver-model.txt @@ -8,11 +8,10 @@ All devices which can be addressed by means of ccws are called 'CCW devices' - even if they aren't actually driven by ccws. All ccw devices are accessed via a subchannel, this is reflected in the -structures under root/: +structures under devices/: -root/ - - sys - - legacy +devices/ + - system/ - css0/ - 0.0.0000/0.0.0815/ - 0.0.0001/0.0.4711/ @@ -36,7 +35,7 @@ availability: Can be 'good' or 'boxed'; 'no path' or 'no device' for online: An interface to set the device online and offline. In the special case of the device being disconnected (see the - notify function under 1.2), piping 0 to online will focibly delete + notify function under 1.2), piping 0 to online will forcibly delete the device. The device drivers can add entries to export per-device data and interfaces. @@ -222,7 +221,7 @@ and are called 'chp0.<chpid>'. They have no driver and do not belong to any bus. Please note, that unlike /proc/chpids in 2.4, the channel path objects reflect only the logical state and not the physical state, since we cannot track the latter consistently due to lacking machine support (we don't need to be aware -of anyway). +of it anyway). status - Can be 'online' or 'offline'. Piping 'on' or 'off' sets the chpid logically online/offline. @@ -235,12 +234,16 @@ status - Can be 'online' or 'offline'. 3. System devices ----------------- -Note: cpus may yet be added here. - 3.1 xpram --------- -xpram shows up under sys/ as 'xpram'. +xpram shows up under devices/system/ as 'xpram'. + +3.2 cpus +-------- + +For each cpu, a directory is created under devices/system/cpu/. Each cpu has an +attribute 'online' which can be 0 or 1. 4. Other devices diff --git a/Documentation/scsi/LICENSE.qla2xxx b/Documentation/scsi/LICENSE.qla2xxx new file mode 100644 index 000000000000..9e15b4f9cd28 --- /dev/null +++ b/Documentation/scsi/LICENSE.qla2xxx @@ -0,0 +1,45 @@ +Copyright (c) 2003-2005 QLogic Corporation +QLogic Linux Fibre Channel HBA Driver + +This program includes a device driver for Linux 2.6 that may be +distributed with QLogic hardware specific firmware binary file. +You may modify and redistribute the device driver code under the +GNU General Public License as published by the Free Software +Foundation (version 2 or a later version). + +You may redistribute the hardware specific firmware binary file +under the following terms: + + 1. Redistribution of source code (only if applicable), + must retain the above copyright notice, this list of + conditions and the following disclaimer. + + 2. Redistribution in binary form must reproduce the above + copyright notice, this list of conditions and the + following disclaimer in the documentation and/or other + materials provided with the distribution. + + 3. The name of QLogic Corporation may not be used to + endorse or promote products derived from this software + without specific prior written permission + +REGARDLESS OF WHAT LICENSING MECHANISM IS USED OR APPLICABLE, +THIS PROGRAM IS PROVIDED BY QLOGIC CORPORATION "AS IS'' AND ANY +EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A +PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR +BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED +TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, +DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON +ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, +OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE +POSSIBILITY OF SUCH DAMAGE. + +USER ACKNOWLEDGES AND AGREES THAT USE OF THIS PROGRAM WILL NOT +CREATE OR GIVE GROUNDS FOR A LICENSE BY IMPLICATION, ESTOPPEL, OR +OTHERWISE IN ANY INTELLECTUAL PROPERTY RIGHTS (PATENT, COPYRIGHT, +TRADE SECRET, MASK WORK, OR OTHER PROPRIETARY RIGHT) EMBODIED IN +ANY OTHER QLOGIC HARDWARE OR SOFTWARE EITHER SOLELY OR IN +COMBINATION WITH THIS PROGRAM. diff --git a/Documentation/serial/driver b/Documentation/serial/driver index 87856d3cfb67..42ef9970bc86 100644 --- a/Documentation/serial/driver +++ b/Documentation/serial/driver @@ -116,12 +116,15 @@ hardware. line becoming inactive or the tty layer indicating we want to stop transmission due to an XOFF character. + The driver should stop transmitting characters as soon as + possible. + Locking: port->lock taken. Interrupts: locally disabled. This call must not sleep start_tx(port) - start transmitting characters. + Start transmitting characters. Locking: port->lock taken. Interrupts: locally disabled. @@ -281,26 +284,31 @@ hardware. Other functions --------------- -uart_update_timeout(port,cflag,quot) +uart_update_timeout(port,cflag,baud) Update the FIFO drain timeout, port->timeout, according to the - number of bits, parity, stop bits and quotient. + number of bits, parity, stop bits and baud rate. Locking: caller is expected to take port->lock Interrupts: n/a -uart_get_baud_rate(port,termios) +uart_get_baud_rate(port,termios,old,min,max) Return the numeric baud rate for the specified termios, taking account of the special 38400 baud "kludge". The B0 baud rate is mapped to 9600 baud. + If the baud rate is not within min..max, then if old is non-NULL, + the original baud rate will be tried. If that exceeds the + min..max constraint, 9600 baud will be returned. termios will + be updated to the baud rate in use. + + Note: min..max must always allow 9600 baud to be selected. + Locking: caller dependent. Interrupts: n/a -uart_get_divisor(port,termios,oldtermios) - Return the divsor (baud_base / baud) for the selected baud rate - specified by termios. If the baud rate is out of range, try - the original baud rate specified by oldtermios (if non-NULL). - If that fails, try 9600 baud. +uart_get_divisor(port,baud) + Return the divsor (baud_base / baud) for the specified baud + rate, appropriately rounded. If 38400 baud and custom divisor is selected, return the custom divisor instead. @@ -308,6 +316,46 @@ uart_get_divisor(port,termios,oldtermios) Locking: caller dependent. Interrupts: n/a +uart_match_port(port1,port2) + This utility function can be used to determine whether two + uart_port structures describe the same port. + + Locking: n/a + Interrupts: n/a + +uart_write_wakeup(port) + A driver is expected to call this function when the number of + characters in the transmit buffer have dropped below a threshold. + + Locking: port->lock should be held. + Interrupts: n/a + +uart_register_driver(drv) + Register a uart driver with the core driver. We in turn register + with the tty layer, and initialise the core driver per-port state. + + drv->port should be NULL, and the per-port structures should be + registered using uart_add_one_port after this call has succeeded. + + Locking: none + Interrupts: enabled + +uart_unregister_driver() + Remove all references to a driver from the core driver. The low + level driver must have removed all its ports via the + uart_remove_one_port() if it registered them with uart_add_one_port(). + + Locking: none + Interrupts: enabled + +uart_suspend_port() + +uart_resume_port() + +uart_add_one_port() + +uart_remove_one_port() + Other notes ----------- diff --git a/Documentation/sound/alsa/ALSA-Configuration.txt b/Documentation/sound/alsa/ALSA-Configuration.txt index 13cba955cb5a..2f27f391c7cc 100644 --- a/Documentation/sound/alsa/ALSA-Configuration.txt +++ b/Documentation/sound/alsa/ALSA-Configuration.txt @@ -167,7 +167,7 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. spdif - Support SPDIF I/O - Default: disabled - Module supports autoprobe and multiple chips (max 8). + This module supports one chip and autoprobe. The power-management is supported. @@ -206,7 +206,7 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. See "AC97 Quirk Option" section below. spdif_aclink - S/PDIF transfer over AC-link (default = 1) - This module supports up to 8 cards and autoprobe. + This module supports one card and autoprobe. ATI IXP has two different methods to control SPDIF output. One is over AC-link and another is over the "direct" SPDIF output. The @@ -218,7 +218,7 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. Module for ATI IXP 150/200/250 AC97 modem controllers. - Module supports up to 8 cards. + This module supports one card and autoprobe. Note: The default index value of this module is -2, i.e. the first slot is excluded. @@ -637,7 +637,7 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. model - force the model name position_fix - Fix DMA pointer (0 = auto, 1 = none, 2 = POSBUF, 3 = FIFO size) - Module supports up to 8 cards. + This module supports one card and autoprobe. Each codec may have a model table for different configurations. If your machine isn't listed there, the default (usually minimal) @@ -663,6 +663,10 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. adjusted. Appearing only when compiled with $CONFIG_SND_DEBUG=y + ALC260 + hp HP machines + fujitsu Fujitsu S7020 + CMI9880 minimal 3-jack in back min_fp 3-jack in back, 2-jack in front @@ -811,7 +815,7 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. semaphores (e.g. on some ASUS laptops) (default off) - Module supports autoprobe and multiple bus-master chips (max 8). + This module supports one chip and autoprobe. Note: the latest driver supports auto-detection of chip clock. if you still encounter too fast playback, specify the clock @@ -830,7 +834,7 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. ac97_clock - AC'97 codec clock base (0 = auto-detect) - This module supports up to 8 cards and autoprobe. + This module supports one card and autoprobe. Note: The default index value of this module is -2, i.e. the first slot is excluded. @@ -950,8 +954,10 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. use_cache - 0 or 1 (disabled by default) vaio_hack - alias buffer_top=0x25a800 reset_workaround - enable AC97 RESET workaround for some laptops + reset_workaround2 - enable extended AC97 RESET workaround for some + other laptops - Module supports autoprobe and multiple chips (max 8). + This module supports one chip and autoprobe. The power-management is supported. @@ -980,6 +986,11 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. workaround is enabled automatically. For other laptops with a hard freeze, you can try reset_workaround=1 option. + Note: Dell Latitude CSx laptops have another problem regarding + AC97 RESET. On these laptops, reset_workaround2 option is + turned on as default. This option is worth to try if the + previous reset_workaround option doesn't help. + Note: This driver is really crappy. It's a porting from the OSS driver, which is a result of black-magic reverse engineering. The detection of codec will fail if the driver is loaded *after* @@ -1310,7 +1321,7 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. ac97_quirk - AC'97 workaround for strange hardware See "AC97 Quirk Option" section below. - Module supports autoprobe and multiple bus-master chips (max 8). + This module supports one chip and autoprobe. Note: on some SMP motherboards like MSI 694D the interrupts might not be generated properly. In such a case, please try to @@ -1352,7 +1363,7 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. ac97_clock - AC'97 codec clock base (default 48000Hz) - Module supports up to 8 cards. + This module supports one card and autoprobe. Note: The default index value of this module is -2, i.e. the first slot is excluded. diff --git a/Documentation/sound/alsa/DocBook/writing-an-alsa-driver.tmpl b/Documentation/sound/alsa/DocBook/writing-an-alsa-driver.tmpl index 24e85520890b..260334c98d95 100644 --- a/Documentation/sound/alsa/DocBook/writing-an-alsa-driver.tmpl +++ b/Documentation/sound/alsa/DocBook/writing-an-alsa-driver.tmpl @@ -18,8 +18,8 @@ </affiliation> </author> - <date>March 6, 2005</date> - <edition>0.3.4</edition> + <date>October 6, 2005</date> + <edition>0.3.5</edition> <abstract> <para> @@ -30,7 +30,7 @@ <legalnotice> <para> - Copyright (c) 2002-2004 Takashi Iwai <email>tiwai@suse.de</email> + Copyright (c) 2002-2005 Takashi Iwai <email>tiwai@suse.de</email> </para> <para> @@ -1433,25 +1433,10 @@ <informalexample> <programlisting> <![CDATA[ - if (chip->res_port) { - release_resource(chip->res_port); - kfree_nocheck(chip->res_port); - } + release_and_free_resource(chip->res_port); ]]> </programlisting> </informalexample> - - As you can see, the resource pointer is also to be freed - via <function>kfree_nocheck()</function> after - <function>release_resource()</function> is called. You - cannot use <function>kfree()</function> here, because on ALSA, - <function>kfree()</function> may be a wrapper to its own - allocator with the memory debugging. Since the resource pointer - is allocated externally outside the ALSA, it must be released - via the native - <function>kfree()</function>. - <function>kfree_nocheck()</function> is used for that; it calls - the native <function>kfree()</function> without wrapper. </para> <para> @@ -2190,8 +2175,7 @@ struct _snd_pcm_runtime { unsigned int rate_den; /* -- SW params -- */ - int tstamp_timespec; /* use timeval (0) or timespec (1) */ - snd_pcm_tstamp_t tstamp_mode; /* mmap timestamp is updated */ + struct timespec tstamp_mode; /* mmap timestamp is updated */ unsigned int period_step; unsigned int sleep_min; /* min ticks to sleep */ snd_pcm_uframes_t xfer_align; /* xfer size need to be a multiple */ @@ -3709,8 +3693,7 @@ struct _snd_pcm_runtime { <para> Here, the chip instance is retrieved via <function>snd_kcontrol_chip()</function> macro. This macro - converts from kcontrol->private_data to the type defined by - <type>chip_t</type>. The + just accesses to kcontrol->private_data. The kcontrol->private_data field is given as the argument of <function>snd_ctl_new()</function> (see the later subsection @@ -5998,32 +5981,23 @@ struct _snd_pcm_runtime { The first argument is the expression to evaluate, and the second argument is the action if it fails. When <constant>CONFIG_SND_DEBUG</constant>, is set, it will show an - error message such as <computeroutput>BUG? (xxx) (called from - yyy)</computeroutput>. When no debug flag is set, this is - ignored. + error message such as <computeroutput>BUG? (xxx)</computeroutput> + together with stack trace. </para> - </section> - - <section id="useful-functions-snd-runtime-check"> - <title><function>snd_runtime_check()</function></title> <para> - This macro is quite similar with - <function>snd_assert()</function>. Unlike - <function>snd_assert()</function>, the expression is always - evaluated regardless of - <constant>CONFIG_SND_DEBUG</constant>. When - <constant>CONFIG_SND_DEBUG</constant> is set, the macro will - show a message like <computeroutput>ERROR (xx) (called from - yyy)</computeroutput>. + When no debug flag is set, this macro is ignored. </para> </section> <section id="useful-functions-snd-bug"> <title><function>snd_BUG()</function></title> <para> - It calls <function>snd_assert(0,)</function> -- that is, just - prints the error message at the point. It's useful to show that - a fatal error happens there. + It shows <computeroutput>BUG?</computeroutput> message and + stack trace as well as <function>snd_assert</function> at the point. + It's useful to show that a fatal error happens there. + </para> + <para> + When no debug flag is set, this macro is ignored. </para> </section> </chapter> diff --git a/Documentation/sparse.txt b/Documentation/sparse.txt index 5df44dc894e5..3f1c5464b1c9 100644 --- a/Documentation/sparse.txt +++ b/Documentation/sparse.txt @@ -41,9 +41,9 @@ sure that bitwise types don't get mixed up (little-endian vs big-endian vs cpu-endian vs whatever), and there the constant "0" really _is_ special. -Modify top-level Makefile to say +Use -CHECK = sparse -Wbitwise + make C=[12] CF=-Wbitwise or you don't get any checking at all. @@ -51,9 +51,9 @@ or you don't get any checking at all. Where to get sparse ~~~~~~~~~~~~~~~~~~~ -With BK, you can just get it from +With git, you can just get it from - bk://sparse.bkbits.net/sparse + rsync://rsync.kernel.org/pub/scm/devel/sparse/sparse.git and DaveJ has tar-balls at diff --git a/Documentation/usb/URB.txt b/Documentation/usb/URB.txt index d59b95cc6f1b..a49e5f2c2b46 100644 --- a/Documentation/usb/URB.txt +++ b/Documentation/usb/URB.txt @@ -1,5 +1,6 @@ Revised: 2000-Dec-05. Again: 2002-Jul-06 +Again: 2005-Sep-19 NOTE: @@ -18,8 +19,8 @@ called USB Request Block, or URB for short. and deliver the data and status back. - Execution of an URB is inherently an asynchronous operation, i.e. the - usb_submit_urb(urb) call returns immediately after it has successfully queued - the requested action. + usb_submit_urb(urb) call returns immediately after it has successfully + queued the requested action. - Transfers for one URB can be canceled with usb_unlink_urb(urb) at any time. @@ -94,8 +95,9 @@ To free an URB, use void usb_free_urb(struct urb *urb) -You may not free an urb that you've submitted, but which hasn't yet been -returned to you in a completion callback. +You may free an urb that you've submitted, but which hasn't yet been +returned to you in a completion callback. It will automatically be +deallocated when it is no longer in use. 1.4. What has to be filled in? @@ -145,30 +147,36 @@ to get seamless ISO streaming. 1.6. How to cancel an already running URB? -For an URB which you've submitted, but which hasn't been returned to -your driver by the host controller, call +There are two ways to cancel an URB you've submitted but which hasn't +been returned to your driver yet. For an asynchronous cancel, call int usb_unlink_urb(struct urb *urb) It removes the urb from the internal list and frees all allocated -HW descriptors. The status is changed to reflect unlinking. After -usb_unlink_urb() returns with that status code, you can free the URB -with usb_free_urb(). +HW descriptors. The status is changed to reflect unlinking. Note +that the URB will not normally have finished when usb_unlink_urb() +returns; you must still wait for the completion handler to be called. -There is also an asynchronous unlink mode. To use this, set the -the URB_ASYNC_UNLINK flag in urb->transfer flags before calling -usb_unlink_urb(). When using async unlinking, the URB will not -normally be unlinked when usb_unlink_urb() returns. Instead, wait -for the completion handler to be called. +To cancel an URB synchronously, call + + void usb_kill_urb(struct urb *urb) + +It does everything usb_unlink_urb does, and in addition it waits +until after the URB has been returned and the completion handler +has finished. It also marks the URB as temporarily unusable, so +that if the completion handler or anyone else tries to resubmit it +they will get a -EPERM error. Thus you can be sure that when +usb_kill_urb() returns, the URB is totally idle. 1.7. What about the completion handler? The handler is of the following type: - typedef void (*usb_complete_t)(struct urb *); + typedef void (*usb_complete_t)(struct urb *, struct pt_regs *) -i.e. it gets just the URB that caused the completion call. +I.e., it gets the URB that caused the completion call, plus the +register values at the time of the corresponding interrupt (if any). In the completion handler, you should have a look at urb->status to detect any USB errors. Since the context parameter is included in the URB, you can pass information to the completion handler. @@ -176,17 +184,11 @@ you can pass information to the completion handler. Note that even when an error (or unlink) is reported, data may have been transferred. That's because USB transfers are packetized; it might take sixteen packets to transfer your 1KByte buffer, and ten of them might -have transferred succesfully before the completion is called. +have transferred succesfully before the completion was called. NOTE: ***** WARNING ***** -Don't use urb->dev field in your completion handler; it's cleared -as part of giving urbs back to drivers. (Addressing an issue with -ownership of periodic URBs, which was otherwise ambiguous.) Instead, -use urb->context to hold all the data your driver needs. - -NOTE: ***** WARNING ***** -Also, NEVER SLEEP IN A COMPLETION HANDLER. These are normally called +NEVER SLEEP IN A COMPLETION HANDLER. These are normally called during hardware interrupt processing. If you can, defer substantial work to a tasklet (bottom half) to keep system latencies low. You'll probably need to use spinlocks to protect data structures you manipulate @@ -229,24 +231,10 @@ ISO data with some other event stream. Interrupt transfers, like isochronous transfers, are periodic, and happen in intervals that are powers of two (1, 2, 4 etc) units. Units are frames for full and low speed devices, and microframes for high speed ones. - -Currently, after you submit one interrupt URB, that urb is owned by the -host controller driver until you cancel it with usb_unlink_urb(). You -may unlink interrupt urbs in their completion handlers, if you need to. - -After a transfer completion is called, the URB is automagically resubmitted. -THIS BEHAVIOR IS EXPECTED TO BE REMOVED!! - -Interrupt transfers may only send (or receive) the "maxpacket" value for -the given interrupt endpoint; if you need more data, you will need to -copy that data out of (or into) another buffer. Similarly, you can't -queue interrupt transfers. -THESE RESTRICTIONS ARE EXPECTED TO BE REMOVED!! - -Note that this automagic resubmission model does make it awkward to use -interrupt OUT transfers. The portable solution involves unlinking those -OUT urbs after the data is transferred, and perhaps submitting a final -URB for a short packet. - The usb_submit_urb() call modifies urb->interval to the implemented interval value that is less than or equal to the requested interval value. + +In Linux 2.6, unlike earlier versions, interrupt URBs are not automagically +restarted when they complete. They end when the completion handler is +called, just like other URBs. If you want an interrupt URB to be restarted, +your completion handler must resubmit it. diff --git a/Documentation/video4linux/bttv/README.freeze b/Documentation/video4linux/bttv/README.freeze index 51f8d4379a94..4259dccc8287 100644 --- a/Documentation/video4linux/bttv/README.freeze +++ b/Documentation/video4linux/bttv/README.freeze @@ -27,9 +27,9 @@ information out of a register+stack dump printed by the kernel on protection faults (so-called "kernel oops"). If you run into some kind of deadlock, you can try to dump a call trace -for each process using sysrq-t (see Documentation/sysrq.txt). ksymoops -will translate these dumps into kernel symbols too. This way it is -possible to figure where *exactly* some process in "D" state is stuck. +for each process using sysrq-t (see Documentation/sysrq.txt). +This way it is possible to figure where *exactly* some process in "D" +state is stuck. I've seen reports that bttv 0.7.x crashes whereas 0.8.x works rock solid for some people. Thus probably a small buglet left somewhere in bttv diff --git a/Documentation/vm/hugetlbpage.txt b/Documentation/vm/hugetlbpage.txt index 1b9bcd1fe98b..1ad9af1ca4d0 100644 --- a/Documentation/vm/hugetlbpage.txt +++ b/Documentation/vm/hugetlbpage.txt @@ -13,12 +13,13 @@ This optimization is more critical now as bigger and bigger physical memories Users can use the huge page support in Linux kernel by either using the mmap system call or standard SYSv shared memory system calls (shmget, shmat). -First the Linux kernel needs to be built with CONFIG_HUGETLB_PAGE (present -under Processor types and feature) and CONFIG_HUGETLBFS (present under file -system option on config menu) config options. +First the Linux kernel needs to be built with the CONFIG_HUGETLBFS +(present under "File systems") and CONFIG_HUGETLB_PAGE (selected +automatically when CONFIG_HUGETLBFS is selected) configuration +options. The kernel built with hugepage support should show the number of configured -hugepages in the system by running the "cat /proc/meminfo" command. +hugepages in the system by running the "cat /proc/meminfo" command. /proc/meminfo also provides information about the total number of hugetlb pages configured in the kernel. It also displays information about the @@ -38,19 +39,19 @@ in the kernel. /proc/sys/vm/nr_hugepages indicates the current number of configured hugetlb pages in the kernel. Super user can dynamically request more (or free some -pre-configured) hugepages. -The allocation( or deallocation) of hugetlb pages is posible only if there are +pre-configured) hugepages. +The allocation (or deallocation) of hugetlb pages is possible only if there are enough physically contiguous free pages in system (freeing of hugepages is -possible only if there are enough hugetlb pages free that can be transfered +possible only if there are enough hugetlb pages free that can be transfered back to regular memory pool). Pages that are used as hugetlb pages are reserved inside the kernel and can -not be used for other purposes. +not be used for other purposes. Once the kernel with Hugetlb page support is built and running, a user can use either the mmap system call or shared memory system calls to start using the huge pages. It is required that the system administrator preallocate -enough memory for huge page purposes. +enough memory for huge page purposes. Use the following command to dynamically allocate/deallocate hugepages: @@ -80,9 +81,9 @@ memory (huge pages) allowed for that filesystem (/mnt/huge). The size is rounded down to HPAGE_SIZE. The option nr_inode sets the maximum number of inodes that /mnt/huge can use. If the size or nr_inode options are not provided on command line then no limits are set. For size and nr_inodes -options, you can use [G|g]/[M|m]/[K|k] to represent giga/mega/kilo. For -example, size=2K has the same meaning as size=2048. An example is given at -the end of this document. +options, you can use [G|g]/[M|m]/[K|k] to represent giga/mega/kilo. For +example, size=2K has the same meaning as size=2048. An example is given at +the end of this document. read and write system calls are not supported on files that reside on hugetlb file systems. |