diff options
author | Linus Torvalds <torvalds@linux-foundation.org> | 2016-12-13 08:58:13 +0300 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2016-12-13 08:58:13 +0300 |
commit | e7aa8c2eb11ba69b1b69099c3c7bd6be3087b0ba (patch) | |
tree | f63906f41699c8e38af9d12b063e2ceab0286ef2 /Documentation/core-api | |
parent | e34bac726d27056081d0250c0e173e4b155aa340 (diff) | |
parent | 868c97a846a73e937d835b09b8c885a69df50ec8 (diff) | |
download | linux-e7aa8c2eb11ba69b1b69099c3c7bd6be3087b0ba.tar.xz |
Merge tag 'docs-4.10' of git://git.lwn.net/linux
Pull documentation update from Jonathan Corbet:
"These are the documentation changes for 4.10.
It's another busy cycle for the docs tree, as the sphinx conversion
continues. Highlights include:
- Further work on PDF output, which remains a bit of a pain but
should be more solid now.
- Five more DocBook template files converted to Sphinx. Only 27 to
go... Lots of plain-text files have also been converted and
integrated.
- Images in binary formats have been replaced with more
source-friendly versions.
- Various bits of organizational work, including the renaming of
various files discussed at the kernel summit.
- New documentation for the device_link mechanism.
... and, of course, lots of typo fixes and small updates"
* tag 'docs-4.10' of git://git.lwn.net/linux: (193 commits)
dma-buf: Extract dma-buf.rst
Update Documentation/00-INDEX
docs: 00-INDEX: document directories/files with no docs
docs: 00-INDEX: remove non-existing entries
docs: 00-INDEX: add missing entries for documentation files/dirs
docs: 00-INDEX: consolidate process/ and admin-guide/ description
scripts: add a script to check if Documentation/00-INDEX is sane
Docs: change sh -> awk in REPORTING-BUGS
Documentation/core-api/device_link: Add initial documentation
core-api: remove an unexpected unident
ppc/idle: Add documentation for powersave=off
Doc: Correct typo, "Introdution" => "Introduction"
Documentation/atomic_ops.txt: convert to ReST markup
Documentation/local_ops.txt: convert to ReST markup
Documentation/assoc_array.txt: convert to ReST markup
docs-rst: parse-headers.pl: cleanup the documentation
docs-rst: fix media cleandocs target
docs-rst: media/Makefile: reorganize the rules
docs-rst: media: build SVG from graphviz files
docs-rst: replace bayer.png by a SVG image
...
Diffstat (limited to 'Documentation/core-api')
-rw-r--r-- | Documentation/core-api/assoc_array.rst | 551 | ||||
-rw-r--r-- | Documentation/core-api/atomic_ops.rst | 658 | ||||
-rw-r--r-- | Documentation/core-api/conf.py | 10 | ||||
-rw-r--r-- | Documentation/core-api/debug-objects.rst | 310 | ||||
-rw-r--r-- | Documentation/core-api/index.rst | 33 | ||||
-rw-r--r-- | Documentation/core-api/local_ops.rst | 206 | ||||
-rw-r--r-- | Documentation/core-api/tracepoint.rst | 55 | ||||
-rw-r--r-- | Documentation/core-api/workqueue.rst | 394 |
8 files changed, 2217 insertions, 0 deletions
diff --git a/Documentation/core-api/assoc_array.rst b/Documentation/core-api/assoc_array.rst new file mode 100644 index 000000000000..d83cfff9ea43 --- /dev/null +++ b/Documentation/core-api/assoc_array.rst @@ -0,0 +1,551 @@ +======================================== +Generic Associative Array Implementation +======================================== + +Overview +======== + +This associative array implementation is an object container with the following +properties: + +1. Objects are opaque pointers. The implementation does not care where they + point (if anywhere) or what they point to (if anything). +.. note:: Pointers to objects _must_ be zero in the least significant bit. + +2. Objects do not need to contain linkage blocks for use by the array. This + permits an object to be located in multiple arrays simultaneously. + Rather, the array is made up of metadata blocks that point to objects. + +3. Objects require index keys to locate them within the array. + +4. Index keys must be unique. Inserting an object with the same key as one + already in the array will replace the old object. + +5. Index keys can be of any length and can be of different lengths. + +6. Index keys should encode the length early on, before any variation due to + length is seen. + +7. Index keys can include a hash to scatter objects throughout the array. + +8. The array can iterated over. The objects will not necessarily come out in + key order. + +9. The array can be iterated over whilst it is being modified, provided the + RCU readlock is being held by the iterator. Note, however, under these + circumstances, some objects may be seen more than once. If this is a + problem, the iterator should lock against modification. Objects will not + be missed, however, unless deleted. + +10. Objects in the array can be looked up by means of their index key. + +11. Objects can be looked up whilst the array is being modified, provided the + RCU readlock is being held by the thread doing the look up. + +The implementation uses a tree of 16-pointer nodes internally that are indexed +on each level by nibbles from the index key in the same manner as in a radix +tree. To improve memory efficiency, shortcuts can be emplaced to skip over +what would otherwise be a series of single-occupancy nodes. Further, nodes +pack leaf object pointers into spare space in the node rather than making an +extra branch until as such time an object needs to be added to a full node. + + +The Public API +============== + +The public API can be found in ``<linux/assoc_array.h>``. The associative +array is rooted on the following structure:: + + struct assoc_array { + ... + }; + +The code is selected by enabling ``CONFIG_ASSOCIATIVE_ARRAY`` with:: + + ./script/config -e ASSOCIATIVE_ARRAY + + +Edit Script +----------- + +The insertion and deletion functions produce an 'edit script' that can later be +applied to effect the changes without risking ``ENOMEM``. This retains the +preallocated metadata blocks that will be installed in the internal tree and +keeps track of the metadata blocks that will be removed from the tree when the +script is applied. + +This is also used to keep track of dead blocks and dead objects after the +script has been applied so that they can be freed later. The freeing is done +after an RCU grace period has passed - thus allowing access functions to +proceed under the RCU read lock. + +The script appears as outside of the API as a pointer of the type:: + + struct assoc_array_edit; + +There are two functions for dealing with the script: + +1. Apply an edit script:: + + void assoc_array_apply_edit(struct assoc_array_edit *edit); + +This will perform the edit functions, interpolating various write barriers +to permit accesses under the RCU read lock to continue. The edit script +will then be passed to ``call_rcu()`` to free it and any dead stuff it points +to. + +2. Cancel an edit script:: + + void assoc_array_cancel_edit(struct assoc_array_edit *edit); + +This frees the edit script and all preallocated memory immediately. If +this was for insertion, the new object is _not_ released by this function, +but must rather be released by the caller. + +These functions are guaranteed not to fail. + + +Operations Table +---------------- + +Various functions take a table of operations:: + + struct assoc_array_ops { + ... + }; + +This points to a number of methods, all of which need to be provided: + +1. Get a chunk of index key from caller data:: + + unsigned long (*get_key_chunk)(const void *index_key, int level); + +This should return a chunk of caller-supplied index key starting at the +*bit* position given by the level argument. The level argument will be a +multiple of ``ASSOC_ARRAY_KEY_CHUNK_SIZE`` and the function should return +``ASSOC_ARRAY_KEY_CHUNK_SIZE bits``. No error is possible. + + +2. Get a chunk of an object's index key:: + + unsigned long (*get_object_key_chunk)(const void *object, int level); + +As the previous function, but gets its data from an object in the array +rather than from a caller-supplied index key. + + +3. See if this is the object we're looking for:: + + bool (*compare_object)(const void *object, const void *index_key); + +Compare the object against an index key and return ``true`` if it matches and +``false`` if it doesn't. + + +4. Diff the index keys of two objects:: + + int (*diff_objects)(const void *object, const void *index_key); + +Return the bit position at which the index key of the specified object +differs from the given index key or -1 if they are the same. + + +5. Free an object:: + + void (*free_object)(void *object); + +Free the specified object. Note that this may be called an RCU grace period +after ``assoc_array_apply_edit()`` was called, so ``synchronize_rcu()`` may be +necessary on module unloading. + + +Manipulation Functions +---------------------- + +There are a number of functions for manipulating an associative array: + +1. Initialise an associative array:: + + void assoc_array_init(struct assoc_array *array); + +This initialises the base structure for an associative array. It can't fail. + + +2. Insert/replace an object in an associative array:: + + struct assoc_array_edit * + assoc_array_insert(struct assoc_array *array, + const struct assoc_array_ops *ops, + const void *index_key, + void *object); + +This inserts the given object into the array. Note that the least +significant bit of the pointer must be zero as it's used to type-mark +pointers internally. + +If an object already exists for that key then it will be replaced with the +new object and the old one will be freed automatically. + +The ``index_key`` argument should hold index key information and is +passed to the methods in the ops table when they are called. + +This function makes no alteration to the array itself, but rather returns +an edit script that must be applied. ``-ENOMEM`` is returned in the case of +an out-of-memory error. + +The caller should lock exclusively against other modifiers of the array. + + +3. Delete an object from an associative array:: + + struct assoc_array_edit * + assoc_array_delete(struct assoc_array *array, + const struct assoc_array_ops *ops, + const void *index_key); + +This deletes an object that matches the specified data from the array. + +The ``index_key`` argument should hold index key information and is +passed to the methods in the ops table when they are called. + +This function makes no alteration to the array itself, but rather returns +an edit script that must be applied. ``-ENOMEM`` is returned in the case of +an out-of-memory error. ``NULL`` will be returned if the specified object is +not found within the array. + +The caller should lock exclusively against other modifiers of the array. + + +4. Delete all objects from an associative array:: + + struct assoc_array_edit * + assoc_array_clear(struct assoc_array *array, + const struct assoc_array_ops *ops); + +This deletes all the objects from an associative array and leaves it +completely empty. + +This function makes no alteration to the array itself, but rather returns +an edit script that must be applied. ``-ENOMEM`` is returned in the case of +an out-of-memory error. + +The caller should lock exclusively against other modifiers of the array. + + +5. Destroy an associative array, deleting all objects:: + + void assoc_array_destroy(struct assoc_array *array, + const struct assoc_array_ops *ops); + +This destroys the contents of the associative array and leaves it +completely empty. It is not permitted for another thread to be traversing +the array under the RCU read lock at the same time as this function is +destroying it as no RCU deferral is performed on memory release - +something that would require memory to be allocated. + +The caller should lock exclusively against other modifiers and accessors +of the array. + + +6. Garbage collect an associative array:: + + int assoc_array_gc(struct assoc_array *array, + const struct assoc_array_ops *ops, + bool (*iterator)(void *object, void *iterator_data), + void *iterator_data); + +This iterates over the objects in an associative array and passes each one to +``iterator()``. If ``iterator()`` returns ``true``, the object is kept. If it +returns ``false``, the object will be freed. If the ``iterator()`` function +returns ``true``, it must perform any appropriate refcount incrementing on the +object before returning. + +The internal tree will be packed down if possible as part of the iteration +to reduce the number of nodes in it. + +The ``iterator_data`` is passed directly to ``iterator()`` and is otherwise +ignored by the function. + +The function will return ``0`` if successful and ``-ENOMEM`` if there wasn't +enough memory. + +It is possible for other threads to iterate over or search the array under +the RCU read lock whilst this function is in progress. The caller should +lock exclusively against other modifiers of the array. + + +Access Functions +---------------- + +There are two functions for accessing an associative array: + +1. Iterate over all the objects in an associative array:: + + int assoc_array_iterate(const struct assoc_array *array, + int (*iterator)(const void *object, + void *iterator_data), + void *iterator_data); + +This passes each object in the array to the iterator callback function. +``iterator_data`` is private data for that function. + +This may be used on an array at the same time as the array is being +modified, provided the RCU read lock is held. Under such circumstances, +it is possible for the iteration function to see some objects twice. If +this is a problem, then modification should be locked against. The +iteration algorithm should not, however, miss any objects. + +The function will return ``0`` if no objects were in the array or else it will +return the result of the last iterator function called. Iteration stops +immediately if any call to the iteration function results in a non-zero +return. + + +2. Find an object in an associative array:: + + void *assoc_array_find(const struct assoc_array *array, + const struct assoc_array_ops *ops, + const void *index_key); + +This walks through the array's internal tree directly to the object +specified by the index key.. + +This may be used on an array at the same time as the array is being +modified, provided the RCU read lock is held. + +The function will return the object if found (and set ``*_type`` to the object +type) or will return ``NULL`` if the object was not found. + + +Index Key Form +-------------- + +The index key can be of any form, but since the algorithms aren't told how long +the key is, it is strongly recommended that the index key includes its length +very early on before any variation due to the length would have an effect on +comparisons. + +This will cause leaves with different length keys to scatter away from each +other - and those with the same length keys to cluster together. + +It is also recommended that the index key begin with a hash of the rest of the +key to maximise scattering throughout keyspace. + +The better the scattering, the wider and lower the internal tree will be. + +Poor scattering isn't too much of a problem as there are shortcuts and nodes +can contain mixtures of leaves and metadata pointers. + +The index key is read in chunks of machine word. Each chunk is subdivided into +one nibble (4 bits) per level, so on a 32-bit CPU this is good for 8 levels and +on a 64-bit CPU, 16 levels. Unless the scattering is really poor, it is +unlikely that more than one word of any particular index key will have to be +used. + + +Internal Workings +================= + +The associative array data structure has an internal tree. This tree is +constructed of two types of metadata blocks: nodes and shortcuts. + +A node is an array of slots. Each slot can contain one of four things: + +* A NULL pointer, indicating that the slot is empty. +* A pointer to an object (a leaf). +* A pointer to a node at the next level. +* A pointer to a shortcut. + + +Basic Internal Tree Layout +-------------------------- + +Ignoring shortcuts for the moment, the nodes form a multilevel tree. The index +key space is strictly subdivided by the nodes in the tree and nodes occur on +fixed levels. For example:: + + Level: 0 1 2 3 + =============== =============== =============== =============== + NODE D + NODE B NODE C +------>+---+ + +------>+---+ +------>+---+ | | 0 | + NODE A | | 0 | | | 0 | | +---+ + +---+ | +---+ | +---+ | : : + | 0 | | : : | : : | +---+ + +---+ | +---+ | +---+ | | f | + | 1 |---+ | 3 |---+ | 7 |---+ +---+ + +---+ +---+ +---+ + : : : : | 8 |---+ + +---+ +---+ +---+ | NODE E + | e |---+ | f | : : +------>+---+ + +---+ | +---+ +---+ | 0 | + | f | | | f | +---+ + +---+ | +---+ : : + | NODE F +---+ + +------>+---+ | f | + | 0 | NODE G +---+ + +---+ +------>+---+ + : : | | 0 | + +---+ | +---+ + | 6 |---+ : : + +---+ +---+ + : : | f | + +---+ +---+ + | f | + +---+ + +In the above example, there are 7 nodes (A-G), each with 16 slots (0-f). +Assuming no other meta data nodes in the tree, the key space is divided +thusly:: + + KEY PREFIX NODE + ========== ==== + 137* D + 138* E + 13[0-69-f]* C + 1[0-24-f]* B + e6* G + e[0-57-f]* F + [02-df]* A + +So, for instance, keys with the following example index keys will be found in +the appropriate nodes:: + + INDEX KEY PREFIX NODE + =============== ======= ==== + 13694892892489 13 C + 13795289025897 137 D + 13889dde88793 138 E + 138bbb89003093 138 E + 1394879524789 12 C + 1458952489 1 B + 9431809de993ba - A + b4542910809cd - A + e5284310def98 e F + e68428974237 e6 G + e7fffcbd443 e F + f3842239082 - A + +To save memory, if a node can hold all the leaves in its portion of keyspace, +then the node will have all those leaves in it and will not have any metadata +pointers - even if some of those leaves would like to be in the same slot. + +A node can contain a heterogeneous mix of leaves and metadata pointers. +Metadata pointers must be in the slots that match their subdivisions of key +space. The leaves can be in any slot not occupied by a metadata pointer. It +is guaranteed that none of the leaves in a node will match a slot occupied by a +metadata pointer. If the metadata pointer is there, any leaf whose key matches +the metadata key prefix must be in the subtree that the metadata pointer points +to. + +In the above example list of index keys, node A will contain:: + + SLOT CONTENT INDEX KEY (PREFIX) + ==== =============== ================== + 1 PTR TO NODE B 1* + any LEAF 9431809de993ba + any LEAF b4542910809cd + e PTR TO NODE F e* + any LEAF f3842239082 + +and node B:: + + 3 PTR TO NODE C 13* + any LEAF 1458952489 + + +Shortcuts +--------- + +Shortcuts are metadata records that jump over a piece of keyspace. A shortcut +is a replacement for a series of single-occupancy nodes ascending through the +levels. Shortcuts exist to save memory and to speed up traversal. + +It is possible for the root of the tree to be a shortcut - say, for example, +the tree contains at least 17 nodes all with key prefix ``1111``. The +insertion algorithm will insert a shortcut to skip over the ``1111`` keyspace +in a single bound and get to the fourth level where these actually become +different. + + +Splitting And Collapsing Nodes +------------------------------ + +Each node has a maximum capacity of 16 leaves and metadata pointers. If the +insertion algorithm finds that it is trying to insert a 17th object into a +node, that node will be split such that at least two leaves that have a common +key segment at that level end up in a separate node rooted on that slot for +that common key segment. + +If the leaves in a full node and the leaf that is being inserted are +sufficiently similar, then a shortcut will be inserted into the tree. + +When the number of objects in the subtree rooted at a node falls to 16 or +fewer, then the subtree will be collapsed down to a single node - and this will +ripple towards the root if possible. + + +Non-Recursive Iteration +----------------------- + +Each node and shortcut contains a back pointer to its parent and the number of +slot in that parent that points to it. None-recursive iteration uses these to +proceed rootwards through the tree, going to the parent node, slot N + 1 to +make sure progress is made without the need for a stack. + +The backpointers, however, make simultaneous alteration and iteration tricky. + + +Simultaneous Alteration And Iteration +------------------------------------- + +There are a number of cases to consider: + +1. Simple insert/replace. This involves simply replacing a NULL or old + matching leaf pointer with the pointer to the new leaf after a barrier. + The metadata blocks don't change otherwise. An old leaf won't be freed + until after the RCU grace period. + +2. Simple delete. This involves just clearing an old matching leaf. The + metadata blocks don't change otherwise. The old leaf won't be freed until + after the RCU grace period. + +3. Insertion replacing part of a subtree that we haven't yet entered. This + may involve replacement of part of that subtree - but that won't affect + the iteration as we won't have reached the pointer to it yet and the + ancestry blocks are not replaced (the layout of those does not change). + +4. Insertion replacing nodes that we're actively processing. This isn't a + problem as we've passed the anchoring pointer and won't switch onto the + new layout until we follow the back pointers - at which point we've + already examined the leaves in the replaced node (we iterate over all the + leaves in a node before following any of its metadata pointers). + + We might, however, re-see some leaves that have been split out into a new + branch that's in a slot further along than we were at. + +5. Insertion replacing nodes that we're processing a dependent branch of. + This won't affect us until we follow the back pointers. Similar to (4). + +6. Deletion collapsing a branch under us. This doesn't affect us because the + back pointers will get us back to the parent of the new node before we + could see the new node. The entire collapsed subtree is thrown away + unchanged - and will still be rooted on the same slot, so we shouldn't + process it a second time as we'll go back to slot + 1. + +.. note:: + + Under some circumstances, we need to simultaneously change the parent + pointer and the parent slot pointer on a node (say, for example, we + inserted another node before it and moved it up a level). We cannot do + this without locking against a read - so we have to replace that node too. + + However, when we're changing a shortcut into a node this isn't a problem + as shortcuts only have one slot and so the parent slot number isn't used + when traversing backwards over one. This means that it's okay to change + the slot number first - provided suitable barriers are used to make sure + the parent slot number is read after the back pointer. + +Obsolete blocks and leaves are freed up after an RCU grace period has passed, +so as long as anyone doing walking or iteration holds the RCU read lock, the +old superstructure should not go away on them. diff --git a/Documentation/core-api/atomic_ops.rst b/Documentation/core-api/atomic_ops.rst new file mode 100644 index 000000000000..55e43f1c80de --- /dev/null +++ b/Documentation/core-api/atomic_ops.rst @@ -0,0 +1,658 @@ +======================================================= +Semantics and Behavior of Atomic and Bitmask Operations +======================================================= + +:Author: David S. Miller + +This document is intended to serve as a guide to Linux port +maintainers on how to implement atomic counter, bitops, and spinlock +interfaces properly. + +Atomic Type And Operations +========================== + +The atomic_t type should be defined as a signed integer and +the atomic_long_t type as a signed long integer. Also, they should +be made opaque such that any kind of cast to a normal C integer type +will fail. Something like the following should suffice:: + + typedef struct { int counter; } atomic_t; + typedef struct { long counter; } atomic_long_t; + +Historically, counter has been declared volatile. This is now discouraged. +See :ref:`Documentation/process/volatile-considered-harmful.rst +<volatile_considered_harmful>` for the complete rationale. + +local_t is very similar to atomic_t. If the counter is per CPU and only +updated by one CPU, local_t is probably more appropriate. Please see +:ref:`Documentation/core-api/local_ops.rst <local_ops>` for the semantics of +local_t. + +The first operations to implement for atomic_t's are the initializers and +plain reads. :: + + #define ATOMIC_INIT(i) { (i) } + #define atomic_set(v, i) ((v)->counter = (i)) + +The first macro is used in definitions, such as:: + + static atomic_t my_counter = ATOMIC_INIT(1); + +The initializer is atomic in that the return values of the atomic operations +are guaranteed to be correct reflecting the initialized value if the +initializer is used before runtime. If the initializer is used at runtime, a +proper implicit or explicit read memory barrier is needed before reading the +value with atomic_read from another thread. + +As with all of the ``atomic_`` interfaces, replace the leading ``atomic_`` +with ``atomic_long_`` to operate on atomic_long_t. + +The second interface can be used at runtime, as in:: + + struct foo { atomic_t counter; }; + ... + + struct foo *k; + + k = kmalloc(sizeof(*k), GFP_KERNEL); + if (!k) + return -ENOMEM; + atomic_set(&k->counter, 0); + +The setting is atomic in that the return values of the atomic operations by +all threads are guaranteed to be correct reflecting either the value that has +been set with this operation or set with another operation. A proper implicit +or explicit memory barrier is needed before the value set with the operation +is guaranteed to be readable with atomic_read from another thread. + +Next, we have:: + + #define atomic_read(v) ((v)->counter) + +which simply reads the counter value currently visible to the calling thread. +The read is atomic in that the return value is guaranteed to be one of the +values initialized or modified with the interface operations if a proper +implicit or explicit memory barrier is used after possible runtime +initialization by any other thread and the value is modified only with the +interface operations. atomic_read does not guarantee that the runtime +initialization by any other thread is visible yet, so the user of the +interface must take care of that with a proper implicit or explicit memory +barrier. + +.. warning:: + + ``atomic_read()`` and ``atomic_set()`` DO NOT IMPLY BARRIERS! + + Some architectures may choose to use the volatile keyword, barriers, or + inline assembly to guarantee some degree of immediacy for atomic_read() + and atomic_set(). This is not uniformly guaranteed, and may change in + the future, so all users of atomic_t should treat atomic_read() and + atomic_set() as simple C statements that may be reordered or optimized + away entirely by the compiler or processor, and explicitly invoke the + appropriate compiler and/or memory barrier for each use case. Failure + to do so will result in code that may suddenly break when used with + different architectures or compiler optimizations, or even changes in + unrelated code which changes how the compiler optimizes the section + accessing atomic_t variables. + +Properly aligned pointers, longs, ints, and chars (and unsigned +equivalents) may be atomically loaded from and stored to in the same +sense as described for atomic_read() and atomic_set(). The READ_ONCE() +and WRITE_ONCE() macros should be used to prevent the compiler from using +optimizations that might otherwise optimize accesses out of existence on +the one hand, or that might create unsolicited accesses on the other. + +For example consider the following code:: + + while (a > 0) + do_something(); + +If the compiler can prove that do_something() does not store to the +variable a, then the compiler is within its rights transforming this to +the following:: + + tmp = a; + if (a > 0) + for (;;) + do_something(); + +If you don't want the compiler to do this (and you probably don't), then +you should use something like the following:: + + while (READ_ONCE(a) < 0) + do_something(); + +Alternatively, you could place a barrier() call in the loop. + +For another example, consider the following code:: + + tmp_a = a; + do_something_with(tmp_a); + do_something_else_with(tmp_a); + +If the compiler can prove that do_something_with() does not store to the +variable a, then the compiler is within its rights to manufacture an +additional load as follows:: + + tmp_a = a; + do_something_with(tmp_a); + tmp_a = a; + do_something_else_with(tmp_a); + +This could fatally confuse your code if it expected the same value +to be passed to do_something_with() and do_something_else_with(). + +The compiler would be likely to manufacture this additional load if +do_something_with() was an inline function that made very heavy use +of registers: reloading from variable a could save a flush to the +stack and later reload. To prevent the compiler from attacking your +code in this manner, write the following:: + + tmp_a = READ_ONCE(a); + do_something_with(tmp_a); + do_something_else_with(tmp_a); + +For a final example, consider the following code, assuming that the +variable a is set at boot time before the second CPU is brought online +and never changed later, so that memory barriers are not needed:: + + if (a) + b = 9; + else + b = 42; + +The compiler is within its rights to manufacture an additional store +by transforming the above code into the following:: + + b = 42; + if (a) + b = 9; + +This could come as a fatal surprise to other code running concurrently +that expected b to never have the value 42 if a was zero. To prevent +the compiler from doing this, write something like:: + + if (a) + WRITE_ONCE(b, 9); + else + WRITE_ONCE(b, 42); + +Don't even -think- about doing this without proper use of memory barriers, +locks, or atomic operations if variable a can change at runtime! + +.. warning:: + + ``READ_ONCE()`` OR ``WRITE_ONCE()`` DO NOT IMPLY A BARRIER! + +Now, we move onto the atomic operation interfaces typically implemented with +the help of assembly code. :: + + void atomic_add(int i, atomic_t *v); + void atomic_sub(int i, atomic_t *v); + void atomic_inc(atomic_t *v); + void atomic_dec(atomic_t *v); + +These four routines add and subtract integral values to/from the given +atomic_t value. The first two routines pass explicit integers by +which to make the adjustment, whereas the latter two use an implicit +adjustment value of "1". + +One very important aspect of these two routines is that they DO NOT +require any explicit memory barriers. They need only perform the +atomic_t counter update in an SMP safe manner. + +Next, we have:: + + int atomic_inc_return(atomic_t *v); + int atomic_dec_return(atomic_t *v); + +These routines add 1 and subtract 1, respectively, from the given +atomic_t and return the new counter value after the operation is +performed. + +Unlike the above routines, it is required that these primitives +include explicit memory barriers that are performed before and after +the operation. It must be done such that all memory operations before +and after the atomic operation calls are strongly ordered with respect +to the atomic operation itself. + +For example, it should behave as if a smp_mb() call existed both +before and after the atomic operation. + +If the atomic instructions used in an implementation provide explicit +memory barrier semantics which satisfy the above requirements, that is +fine as well. + +Let's move on:: + + int atomic_add_return(int i, atomic_t *v); + int atomic_sub_return(int i, atomic_t *v); + +These behave just like atomic_{inc,dec}_return() except that an +explicit counter adjustment is given instead of the implicit "1". +This means that like atomic_{inc,dec}_return(), the memory barrier +semantics are required. + +Next:: + + int atomic_inc_and_test(atomic_t *v); + int atomic_dec_and_test(atomic_t *v); + +These two routines increment and decrement by 1, respectively, the +given atomic counter. They return a boolean indicating whether the +resulting counter value was zero or not. + +Again, these primitives provide explicit memory barrier semantics around +the atomic operation:: + + int atomic_sub_and_test(int i, atomic_t *v); + +This is identical to atomic_dec_and_test() except that an explicit +decrement is given instead of the implicit "1". This primitive must +provide explicit memory barrier semantics around the operation:: + + int atomic_add_negative(int i, atomic_t *v); + +The given increment is added to the given atomic counter value. A boolean +is return which indicates whether the resulting counter value is negative. +This primitive must provide explicit memory barrier semantics around +the operation. + +Then:: + + int atomic_xchg(atomic_t *v, int new); + +This performs an atomic exchange operation on the atomic variable v, setting +the given new value. It returns the old value that the atomic variable v had +just before the operation. + +atomic_xchg must provide explicit memory barriers around the operation. :: + + int atomic_cmpxchg(atomic_t *v, int old, int new); + +This performs an atomic compare exchange operation on the atomic value v, +with the given old and new values. Like all atomic_xxx operations, +atomic_cmpxchg will only satisfy its atomicity semantics as long as all +other accesses of \*v are performed through atomic_xxx operations. + +atomic_cmpxchg must provide explicit memory barriers around the operation, +although if the comparison fails then no memory ordering guarantees are +required. + +The semantics for atomic_cmpxchg are the same as those defined for 'cas' +below. + +Finally:: + + int atomic_add_unless(atomic_t *v, int a, int u); + +If the atomic value v is not equal to u, this function adds a to v, and +returns non zero. If v is equal to u then it returns zero. This is done as +an atomic operation. + +atomic_add_unless must provide explicit memory barriers around the +operation unless it fails (returns 0). + +atomic_inc_not_zero, equivalent to atomic_add_unless(v, 1, 0) + + +If a caller requires memory barrier semantics around an atomic_t +operation which does not return a value, a set of interfaces are +defined which accomplish this:: + + void smp_mb__before_atomic(void); + void smp_mb__after_atomic(void); + +For example, smp_mb__before_atomic() can be used like so:: + + obj->dead = 1; + smp_mb__before_atomic(); + atomic_dec(&obj->ref_count); + +It makes sure that all memory operations preceding the atomic_dec() +call are strongly ordered with respect to the atomic counter +operation. In the above example, it guarantees that the assignment of +"1" to obj->dead will be globally visible to other cpus before the +atomic counter decrement. + +Without the explicit smp_mb__before_atomic() call, the +implementation could legally allow the atomic counter update visible +to other cpus before the "obj->dead = 1;" assignment. + +A missing memory barrier in the cases where they are required by the +atomic_t implementation above can have disastrous results. Here is +an example, which follows a pattern occurring frequently in the Linux +kernel. It is the use of atomic counters to implement reference +counting, and it works such that once the counter falls to zero it can +be guaranteed that no other entity can be accessing the object:: + + static void obj_list_add(struct obj *obj, struct list_head *head) + { + obj->active = 1; + list_add(&obj->list, head); + } + + static void obj_list_del(struct obj *obj) + { + list_del(&obj->list); + obj->active = 0; + } + + static void obj_destroy(struct obj *obj) + { + BUG_ON(obj->active); + kfree(obj); + } + + struct obj *obj_list_peek(struct list_head *head) + { + if (!list_empty(head)) { + struct obj *obj; + + obj = list_entry(head->next, struct obj, list); + atomic_inc(&obj->refcnt); + return obj; + } + return NULL; + } + + void obj_poke(void) + { + struct obj *obj; + + spin_lock(&global_list_lock); + obj = obj_list_peek(&global_list); + spin_unlock(&global_list_lock); + + if (obj) { + obj->ops->poke(obj); + if (atomic_dec_and_test(&obj->refcnt)) + obj_destroy(obj); + } + } + + void obj_timeout(struct obj *obj) + { + spin_lock(&global_list_lock); + obj_list_del(obj); + spin_unlock(&global_list_lock); + + if (atomic_dec_and_test(&obj->refcnt)) + obj_destroy(obj); + } + +.. note:: + + This is a simplification of the ARP queue management in the generic + neighbour discover code of the networking. Olaf Kirch found a bug wrt. + memory barriers in kfree_skb() that exposed the atomic_t memory barrier + requirements quite clearly. + +Given the above scheme, it must be the case that the obj->active +update done by the obj list deletion be visible to other processors +before the atomic counter decrement is performed. + +Otherwise, the counter could fall to zero, yet obj->active would still +be set, thus triggering the assertion in obj_destroy(). The error +sequence looks like this:: + + cpu 0 cpu 1 + obj_poke() obj_timeout() + obj = obj_list_peek(); + ... gains ref to obj, refcnt=2 + obj_list_del(obj); + obj->active = 0 ... + ... visibility delayed ... + atomic_dec_and_test() + ... refcnt drops to 1 ... + atomic_dec_and_test() + ... refcount drops to 0 ... + obj_destroy() + BUG() triggers since obj->active + still seen as one + obj->active update visibility occurs + +With the memory barrier semantics required of the atomic_t operations +which return values, the above sequence of memory visibility can never +happen. Specifically, in the above case the atomic_dec_and_test() +counter decrement would not become globally visible until the +obj->active update does. + +As a historical note, 32-bit Sparc used to only allow usage of +24-bits of its atomic_t type. This was because it used 8 bits +as a spinlock for SMP safety. Sparc32 lacked a "compare and swap" +type instruction. However, 32-bit Sparc has since been moved over +to a "hash table of spinlocks" scheme, that allows the full 32-bit +counter to be realized. Essentially, an array of spinlocks are +indexed into based upon the address of the atomic_t being operated +on, and that lock protects the atomic operation. Parisc uses the +same scheme. + +Another note is that the atomic_t operations returning values are +extremely slow on an old 386. + + +Atomic Bitmask +============== + +We will now cover the atomic bitmask operations. You will find that +their SMP and memory barrier semantics are similar in shape and scope +to the atomic_t ops above. + +Native atomic bit operations are defined to operate on objects aligned +to the size of an "unsigned long" C data type, and are least of that +size. The endianness of the bits within each "unsigned long" are the +native endianness of the cpu. :: + + void set_bit(unsigned long nr, volatile unsigned long *addr); + void clear_bit(unsigned long nr, volatile unsigned long *addr); + void change_bit(unsigned long nr, volatile unsigned long *addr); + +These routines set, clear, and change, respectively, the bit number +indicated by "nr" on the bit mask pointed to by "ADDR". + +They must execute atomically, yet there are no implicit memory barrier +semantics required of these interfaces. :: + + int test_and_set_bit(unsigned long nr, volatile unsigned long *addr); + int test_and_clear_bit(unsigned long nr, volatile unsigned long *addr); + int test_and_change_bit(unsigned long nr, volatile unsigned long *addr); + +Like the above, except that these routines return a boolean which +indicates whether the changed bit was set _BEFORE_ the atomic bit +operation. + +WARNING! It is incredibly important that the value be a boolean, +ie. "0" or "1". Do not try to be fancy and save a few instructions by +declaring the above to return "long" and just returning something like +"old_val & mask" because that will not work. + +For one thing, this return value gets truncated to int in many code +paths using these interfaces, so on 64-bit if the bit is set in the +upper 32-bits then testers will never see that. + +One great example of where this problem crops up are the thread_info +flag operations. Routines such as test_and_set_ti_thread_flag() chop +the return value into an int. There are other places where things +like this occur as well. + +These routines, like the atomic_t counter operations returning values, +must provide explicit memory barrier semantics around their execution. +All memory operations before the atomic bit operation call must be +made visible globally before the atomic bit operation is made visible. +Likewise, the atomic bit operation must be visible globally before any +subsequent memory operation is made visible. For example:: + + obj->dead = 1; + if (test_and_set_bit(0, &obj->flags)) + /* ... */; + obj->killed = 1; + +The implementation of test_and_set_bit() must guarantee that +"obj->dead = 1;" is visible to cpus before the atomic memory operation +done by test_and_set_bit() becomes visible. Likewise, the atomic +memory operation done by test_and_set_bit() must become visible before +"obj->killed = 1;" is visible. + +Finally there is the basic operation:: + + int test_bit(unsigned long nr, __const__ volatile unsigned long *addr); + +Which returns a boolean indicating if bit "nr" is set in the bitmask +pointed to by "addr". + +If explicit memory barriers are required around {set,clear}_bit() (which do +not return a value, and thus does not need to provide memory barrier +semantics), two interfaces are provided:: + + void smp_mb__before_atomic(void); + void smp_mb__after_atomic(void); + +They are used as follows, and are akin to their atomic_t operation +brothers:: + + /* All memory operations before this call will + * be globally visible before the clear_bit(). + */ + smp_mb__before_atomic(); + clear_bit( ... ); + + /* The clear_bit() will be visible before all + * subsequent memory operations. + */ + smp_mb__after_atomic(); + +There are two special bitops with lock barrier semantics (acquire/release, +same as spinlocks). These operate in the same way as their non-_lock/unlock +postfixed variants, except that they are to provide acquire/release semantics, +respectively. This means they can be used for bit_spin_trylock and +bit_spin_unlock type operations without specifying any more barriers. :: + + int test_and_set_bit_lock(unsigned long nr, unsigned long *addr); + void clear_bit_unlock(unsigned long nr, unsigned long *addr); + void __clear_bit_unlock(unsigned long nr, unsigned long *addr); + +The __clear_bit_unlock version is non-atomic, however it still implements +unlock barrier semantics. This can be useful if the lock itself is protecting +the other bits in the word. + +Finally, there are non-atomic versions of the bitmask operations +provided. They are used in contexts where some other higher-level SMP +locking scheme is being used to protect the bitmask, and thus less +expensive non-atomic operations may be used in the implementation. +They have names similar to the above bitmask operation interfaces, +except that two underscores are prefixed to the interface name. :: + + void __set_bit(unsigned long nr, volatile unsigned long *addr); + void __clear_bit(unsigned long nr, volatile unsigned long *addr); + void __change_bit(unsigned long nr, volatile unsigned long *addr); + int __test_and_set_bit(unsigned long nr, volatile unsigned long *addr); + int __test_and_clear_bit(unsigned long nr, volatile unsigned long *addr); + int __test_and_change_bit(unsigned long nr, volatile unsigned long *addr); + +These non-atomic variants also do not require any special memory +barrier semantics. + +The routines xchg() and cmpxchg() must provide the same exact +memory-barrier semantics as the atomic and bit operations returning +values. + +.. note:: + + If someone wants to use xchg(), cmpxchg() and their variants, + linux/atomic.h should be included rather than asm/cmpxchg.h, unless the + code is in arch/* and can take care of itself. + +Spinlocks and rwlocks have memory barrier expectations as well. +The rule to follow is simple: + +1) When acquiring a lock, the implementation must make it globally + visible before any subsequent memory operation. + +2) When releasing a lock, the implementation must make it such that + all previous memory operations are globally visible before the + lock release. + +Which finally brings us to _atomic_dec_and_lock(). There is an +architecture-neutral version implemented in lib/dec_and_lock.c, +but most platforms will wish to optimize this in assembler. :: + + int _atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock); + +Atomically decrement the given counter, and if will drop to zero +atomically acquire the given spinlock and perform the decrement +of the counter to zero. If it does not drop to zero, do nothing +with the spinlock. + +It is actually pretty simple to get the memory barrier correct. +Simply satisfy the spinlock grab requirements, which is make +sure the spinlock operation is globally visible before any +subsequent memory operation. + +We can demonstrate this operation more clearly if we define +an abstract atomic operation:: + + long cas(long *mem, long old, long new); + +"cas" stands for "compare and swap". It atomically: + +1) Compares "old" with the value currently at "mem". +2) If they are equal, "new" is written to "mem". +3) Regardless, the current value at "mem" is returned. + +As an example usage, here is what an atomic counter update +might look like:: + + void example_atomic_inc(long *counter) + { + long old, new, ret; + + while (1) { + old = *counter; + new = old + 1; + + ret = cas(counter, old, new); + if (ret == old) + break; + } + } + +Let's use cas() in order to build a pseudo-C atomic_dec_and_lock():: + + int _atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock) + { + long old, new, ret; + int went_to_zero; + + went_to_zero = 0; + while (1) { + old = atomic_read(atomic); + new = old - 1; + if (new == 0) { + went_to_zero = 1; + spin_lock(lock); + } + ret = cas(atomic, old, new); + if (ret == old) + break; + if (went_to_zero) { + spin_unlock(lock); + went_to_zero = 0; + } + } + + return went_to_zero; + } + +Now, as far as memory barriers go, as long as spin_lock() +strictly orders all subsequent memory operations (including +the cas()) with respect to itself, things will be fine. + +Said another way, _atomic_dec_and_lock() must guarantee that +a counter dropping to zero is never made visible before the +spinlock being acquired. + +.. note:: + + Note that this also means that for the case where the counter is not + dropping to zero, there are no memory ordering requirements. diff --git a/Documentation/core-api/conf.py b/Documentation/core-api/conf.py new file mode 100644 index 000000000000..db1f7659f3da --- /dev/null +++ b/Documentation/core-api/conf.py @@ -0,0 +1,10 @@ +# -*- coding: utf-8; mode: python -*- + +project = "Core-API Documentation" + +tags.add("subproject") + +latex_documents = [ + ('index', 'core-api.tex', project, + 'The kernel development community', 'manual'), +] diff --git a/Documentation/core-api/debug-objects.rst b/Documentation/core-api/debug-objects.rst new file mode 100644 index 000000000000..ac926fd55a64 --- /dev/null +++ b/Documentation/core-api/debug-objects.rst @@ -0,0 +1,310 @@ +============================================ +The object-lifetime debugging infrastructure +============================================ + +:Author: Thomas Gleixner + +Introduction +============ + +debugobjects is a generic infrastructure to track the life time of +kernel objects and validate the operations on those. + +debugobjects is useful to check for the following error patterns: + +- Activation of uninitialized objects + +- Initialization of active objects + +- Usage of freed/destroyed objects + +debugobjects is not changing the data structure of the real object so it +can be compiled in with a minimal runtime impact and enabled on demand +with a kernel command line option. + +Howto use debugobjects +====================== + +A kernel subsystem needs to provide a data structure which describes the +object type and add calls into the debug code at appropriate places. The +data structure to describe the object type needs at minimum the name of +the object type. Optional functions can and should be provided to fixup +detected problems so the kernel can continue to work and the debug +information can be retrieved from a live system instead of hard core +debugging with serial consoles and stack trace transcripts from the +monitor. + +The debug calls provided by debugobjects are: + +- debug_object_init + +- debug_object_init_on_stack + +- debug_object_activate + +- debug_object_deactivate + +- debug_object_destroy + +- debug_object_free + +- debug_object_assert_init + +Each of these functions takes the address of the real object and a +pointer to the object type specific debug description structure. + +Each detected error is reported in the statistics and a limited number +of errors are printk'ed including a full stack trace. + +The statistics are available via /sys/kernel/debug/debug_objects/stats. +They provide information about the number of warnings and the number of +successful fixups along with information about the usage of the internal +tracking objects and the state of the internal tracking objects pool. + +Debug functions +=============== + +.. kernel-doc:: lib/debugobjects.c + :functions: debug_object_init + +This function is called whenever the initialization function of a real +object is called. + +When the real object is already tracked by debugobjects it is checked, +whether the object can be initialized. Initializing is not allowed for +active and destroyed objects. When debugobjects detects an error, then +it calls the fixup_init function of the object type description +structure if provided by the caller. The fixup function can correct the +problem before the real initialization of the object happens. E.g. it +can deactivate an active object in order to prevent damage to the +subsystem. + +When the real object is not yet tracked by debugobjects, debugobjects +allocates a tracker object for the real object and sets the tracker +object state to ODEBUG_STATE_INIT. It verifies that the object is not +on the callers stack. If it is on the callers stack then a limited +number of warnings including a full stack trace is printk'ed. The +calling code must use debug_object_init_on_stack() and remove the +object before leaving the function which allocated it. See next section. + +.. kernel-doc:: lib/debugobjects.c + :functions: debug_object_init_on_stack + +This function is called whenever the initialization function of a real +object which resides on the stack is called. + +When the real object is already tracked by debugobjects it is checked, +whether the object can be initialized. Initializing is not allowed for +active and destroyed objects. When debugobjects detects an error, then +it calls the fixup_init function of the object type description +structure if provided by the caller. The fixup function can correct the +problem before the real initialization of the object happens. E.g. it +can deactivate an active object in order to prevent damage to the +subsystem. + +When the real object is not yet tracked by debugobjects debugobjects +allocates a tracker object for the real object and sets the tracker +object state to ODEBUG_STATE_INIT. It verifies that the object is on +the callers stack. + +An object which is on the stack must be removed from the tracker by +calling debug_object_free() before the function which allocates the +object returns. Otherwise we keep track of stale objects. + +.. kernel-doc:: lib/debugobjects.c + :functions: debug_object_activate + +This function is called whenever the activation function of a real +object is called. + +When the real object is already tracked by debugobjects it is checked, +whether the object can be activated. Activating is not allowed for +active and destroyed objects. When debugobjects detects an error, then +it calls the fixup_activate function of the object type description +structure if provided by the caller. The fixup function can correct the +problem before the real activation of the object happens. E.g. it can +deactivate an active object in order to prevent damage to the subsystem. + +When the real object is not yet tracked by debugobjects then the +fixup_activate function is called if available. This is necessary to +allow the legitimate activation of statically allocated and initialized +objects. The fixup function checks whether the object is valid and calls +the debug_objects_init() function to initialize the tracking of this +object. + +When the activation is legitimate, then the state of the associated +tracker object is set to ODEBUG_STATE_ACTIVE. + + +.. kernel-doc:: lib/debugobjects.c + :functions: debug_object_deactivate + +This function is called whenever the deactivation function of a real +object is called. + +When the real object is tracked by debugobjects it is checked, whether +the object can be deactivated. Deactivating is not allowed for untracked +or destroyed objects. + +When the deactivation is legitimate, then the state of the associated +tracker object is set to ODEBUG_STATE_INACTIVE. + +.. kernel-doc:: lib/debugobjects.c + :functions: debug_object_destroy + +This function is called to mark an object destroyed. This is useful to +prevent the usage of invalid objects, which are still available in +memory: either statically allocated objects or objects which are freed +later. + +When the real object is tracked by debugobjects it is checked, whether +the object can be destroyed. Destruction is not allowed for active and +destroyed objects. When debugobjects detects an error, then it calls the +fixup_destroy function of the object type description structure if +provided by the caller. The fixup function can correct the problem +before the real destruction of the object happens. E.g. it can +deactivate an active object in order to prevent damage to the subsystem. + +When the destruction is legitimate, then the state of the associated +tracker object is set to ODEBUG_STATE_DESTROYED. + +.. kernel-doc:: lib/debugobjects.c + :functions: debug_object_free + +This function is called before an object is freed. + +When the real object is tracked by debugobjects it is checked, whether +the object can be freed. Free is not allowed for active objects. When +debugobjects detects an error, then it calls the fixup_free function of +the object type description structure if provided by the caller. The +fixup function can correct the problem before the real free of the +object happens. E.g. it can deactivate an active object in order to +prevent damage to the subsystem. + +Note that debug_object_free removes the object from the tracker. Later +usage of the object is detected by the other debug checks. + + +.. kernel-doc:: lib/debugobjects.c + :functions: debug_object_assert_init + +This function is called to assert that an object has been initialized. + +When the real object is not tracked by debugobjects, it calls +fixup_assert_init of the object type description structure provided by +the caller, with the hardcoded object state ODEBUG_NOT_AVAILABLE. The +fixup function can correct the problem by calling debug_object_init +and other specific initializing functions. + +When the real object is already tracked by debugobjects it is ignored. + +Fixup functions +=============== + +Debug object type description structure +--------------------------------------- + +.. kernel-doc:: include/linux/debugobjects.h + :internal: + +fixup_init +----------- + +This function is called from the debug code whenever a problem in +debug_object_init is detected. The function takes the address of the +object and the state which is currently recorded in the tracker. + +Called from debug_object_init when the object state is: + +- ODEBUG_STATE_ACTIVE + +The function returns true when the fixup was successful, otherwise +false. The return value is used to update the statistics. + +Note, that the function needs to call the debug_object_init() function +again, after the damage has been repaired in order to keep the state +consistent. + +fixup_activate +--------------- + +This function is called from the debug code whenever a problem in +debug_object_activate is detected. + +Called from debug_object_activate when the object state is: + +- ODEBUG_STATE_NOTAVAILABLE + +- ODEBUG_STATE_ACTIVE + +The function returns true when the fixup was successful, otherwise +false. The return value is used to update the statistics. + +Note that the function needs to call the debug_object_activate() +function again after the damage has been repaired in order to keep the +state consistent. + +The activation of statically initialized objects is a special case. When +debug_object_activate() has no tracked object for this object address +then fixup_activate() is called with object state +ODEBUG_STATE_NOTAVAILABLE. The fixup function needs to check whether +this is a legitimate case of a statically initialized object or not. In +case it is it calls debug_object_init() and debug_object_activate() +to make the object known to the tracker and marked active. In this case +the function should return false because this is not a real fixup. + +fixup_destroy +-------------- + +This function is called from the debug code whenever a problem in +debug_object_destroy is detected. + +Called from debug_object_destroy when the object state is: + +- ODEBUG_STATE_ACTIVE + +The function returns true when the fixup was successful, otherwise +false. The return value is used to update the statistics. + +fixup_free +----------- + +This function is called from the debug code whenever a problem in +debug_object_free is detected. Further it can be called from the debug +checks in kfree/vfree, when an active object is detected from the +debug_check_no_obj_freed() sanity checks. + +Called from debug_object_free() or debug_check_no_obj_freed() when +the object state is: + +- ODEBUG_STATE_ACTIVE + +The function returns true when the fixup was successful, otherwise +false. The return value is used to update the statistics. + +fixup_assert_init +------------------- + +This function is called from the debug code whenever a problem in +debug_object_assert_init is detected. + +Called from debug_object_assert_init() with a hardcoded state +ODEBUG_STATE_NOTAVAILABLE when the object is not found in the debug +bucket. + +The function returns true when the fixup was successful, otherwise +false. The return value is used to update the statistics. + +Note, this function should make sure debug_object_init() is called +before returning. + +The handling of statically initialized objects is a special case. The +fixup function should check if this is a legitimate case of a statically +initialized object or not. In this case only debug_object_init() +should be called to make the object known to the tracker. Then the +function should return false because this is not a real fixup. + +Known Bugs And Assumptions +========================== + +None (knock on wood). diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst new file mode 100644 index 000000000000..2872ca1a52f1 --- /dev/null +++ b/Documentation/core-api/index.rst @@ -0,0 +1,33 @@ +====================== +Core API Documentation +====================== + +This is the beginning of a manual for core kernel APIs. The conversion +(and writing!) of documents for this manual is much appreciated! + +Core utilities +============== + +.. toctree:: + :maxdepth: 1 + + assoc_array + atomic_ops + local_ops + workqueue + +Interfaces for kernel debugging +=============================== + +.. toctree:: + :maxdepth: 1 + + debug-objects + tracepoint + +.. only:: subproject + + Indices + ======= + + * :ref:`genindex` diff --git a/Documentation/core-api/local_ops.rst b/Documentation/core-api/local_ops.rst new file mode 100644 index 000000000000..1062ddba62c7 --- /dev/null +++ b/Documentation/core-api/local_ops.rst @@ -0,0 +1,206 @@ + +.. _local_ops: + +================================================= +Semantics and Behavior of Local Atomic Operations +================================================= + +:Author: Mathieu Desnoyers + + +This document explains the purpose of the local atomic operations, how +to implement them for any given architecture and shows how they can be used +properly. It also stresses on the precautions that must be taken when reading +those local variables across CPUs when the order of memory writes matters. + +.. note:: + + Note that ``local_t`` based operations are not recommended for general + kernel use. Please use the ``this_cpu`` operations instead unless there is + really a special purpose. Most uses of ``local_t`` in the kernel have been + replaced by ``this_cpu`` operations. ``this_cpu`` operations combine the + relocation with the ``local_t`` like semantics in a single instruction and + yield more compact and faster executing code. + + +Purpose of local atomic operations +================================== + +Local atomic operations are meant to provide fast and highly reentrant per CPU +counters. They minimize the performance cost of standard atomic operations by +removing the LOCK prefix and memory barriers normally required to synchronize +across CPUs. + +Having fast per CPU atomic counters is interesting in many cases: it does not +require disabling interrupts to protect from interrupt handlers and it permits +coherent counters in NMI handlers. It is especially useful for tracing purposes +and for various performance monitoring counters. + +Local atomic operations only guarantee variable modification atomicity wrt the +CPU which owns the data. Therefore, care must taken to make sure that only one +CPU writes to the ``local_t`` data. This is done by using per cpu data and +making sure that we modify it from within a preemption safe context. It is +however permitted to read ``local_t`` data from any CPU: it will then appear to +be written out of order wrt other memory writes by the owner CPU. + + +Implementation for a given architecture +======================================= + +It can be done by slightly modifying the standard atomic operations: only +their UP variant must be kept. It typically means removing LOCK prefix (on +i386 and x86_64) and any SMP synchronization barrier. If the architecture does +not have a different behavior between SMP and UP, including +``asm-generic/local.h`` in your architecture's ``local.h`` is sufficient. + +The ``local_t`` type is defined as an opaque ``signed long`` by embedding an +``atomic_long_t`` inside a structure. This is made so a cast from this type to +a ``long`` fails. The definition looks like:: + + typedef struct { atomic_long_t a; } local_t; + + +Rules to follow when using local atomic operations +================================================== + +* Variables touched by local ops must be per cpu variables. +* *Only* the CPU owner of these variables must write to them. +* This CPU can use local ops from any context (process, irq, softirq, nmi, ...) + to update its ``local_t`` variables. +* Preemption (or interrupts) must be disabled when using local ops in + process context to make sure the process won't be migrated to a + different CPU between getting the per-cpu variable and doing the + actual local op. +* When using local ops in interrupt context, no special care must be + taken on a mainline kernel, since they will run on the local CPU with + preemption already disabled. I suggest, however, to explicitly + disable preemption anyway to make sure it will still work correctly on + -rt kernels. +* Reading the local cpu variable will provide the current copy of the + variable. +* Reads of these variables can be done from any CPU, because updates to + "``long``", aligned, variables are always atomic. Since no memory + synchronization is done by the writer CPU, an outdated copy of the + variable can be read when reading some *other* cpu's variables. + + +How to use local atomic operations +================================== + +:: + + #include <linux/percpu.h> + #include <asm/local.h> + + static DEFINE_PER_CPU(local_t, counters) = LOCAL_INIT(0); + + +Counting +======== + +Counting is done on all the bits of a signed long. + +In preemptible context, use ``get_cpu_var()`` and ``put_cpu_var()`` around +local atomic operations: it makes sure that preemption is disabled around write +access to the per cpu variable. For instance:: + + local_inc(&get_cpu_var(counters)); + put_cpu_var(counters); + +If you are already in a preemption-safe context, you can use +``this_cpu_ptr()`` instead:: + + local_inc(this_cpu_ptr(&counters)); + + + +Reading the counters +==================== + +Those local counters can be read from foreign CPUs to sum the count. Note that +the data seen by local_read across CPUs must be considered to be out of order +relatively to other memory writes happening on the CPU that owns the data:: + + long sum = 0; + for_each_online_cpu(cpu) + sum += local_read(&per_cpu(counters, cpu)); + +If you want to use a remote local_read to synchronize access to a resource +between CPUs, explicit ``smp_wmb()`` and ``smp_rmb()`` memory barriers must be used +respectively on the writer and the reader CPUs. It would be the case if you use +the ``local_t`` variable as a counter of bytes written in a buffer: there should +be a ``smp_wmb()`` between the buffer write and the counter increment and also a +``smp_rmb()`` between the counter read and the buffer read. + + +Here is a sample module which implements a basic per cpu counter using +``local.h``:: + + /* test-local.c + * + * Sample module for local.h usage. + */ + + + #include <asm/local.h> + #include <linux/module.h> + #include <linux/timer.h> + + static DEFINE_PER_CPU(local_t, counters) = LOCAL_INIT(0); + + static struct timer_list test_timer; + + /* IPI called on each CPU. */ + static void test_each(void *info) + { + /* Increment the counter from a non preemptible context */ + printk("Increment on cpu %d\n", smp_processor_id()); + local_inc(this_cpu_ptr(&counters)); + + /* This is what incrementing the variable would look like within a + * preemptible context (it disables preemption) : + * + * local_inc(&get_cpu_var(counters)); + * put_cpu_var(counters); + */ + } + + static void do_test_timer(unsigned long data) + { + int cpu; + + /* Increment the counters */ + on_each_cpu(test_each, NULL, 1); + /* Read all the counters */ + printk("Counters read from CPU %d\n", smp_processor_id()); + for_each_online_cpu(cpu) { + printk("Read : CPU %d, count %ld\n", cpu, + local_read(&per_cpu(counters, cpu))); + } + del_timer(&test_timer); + test_timer.expires = jiffies + 1000; + add_timer(&test_timer); + } + + static int __init test_init(void) + { + /* initialize the timer that will increment the counter */ + init_timer(&test_timer); + test_timer.function = do_test_timer; + test_timer.expires = jiffies + 1; + add_timer(&test_timer); + + return 0; + } + + static void __exit test_exit(void) + { + del_timer_sync(&test_timer); + } + + module_init(test_init); + module_exit(test_exit); + + MODULE_LICENSE("GPL"); + MODULE_AUTHOR("Mathieu Desnoyers"); + MODULE_DESCRIPTION("Local Atomic Ops"); diff --git a/Documentation/core-api/tracepoint.rst b/Documentation/core-api/tracepoint.rst new file mode 100644 index 000000000000..6b44bec0de43 --- /dev/null +++ b/Documentation/core-api/tracepoint.rst @@ -0,0 +1,55 @@ +=============================== +The Linux Kernel Tracepoint API +=============================== + +:Author: Jason Baron +:Author: William Cohen + +Introduction +============ + +Tracepoints are static probe points that are located in strategic points +throughout the kernel. 'Probes' register/unregister with tracepoints via +a callback mechanism. The 'probes' are strictly typed functions that are +passed a unique set of parameters defined by each tracepoint. + +From this simple callback mechanism, 'probes' can be used to profile, +debug, and understand kernel behavior. There are a number of tools that +provide a framework for using 'probes'. These tools include Systemtap, +ftrace, and LTTng. + +Tracepoints are defined in a number of header files via various macros. +Thus, the purpose of this document is to provide a clear accounting of +the available tracepoints. The intention is to understand not only what +tracepoints are available but also to understand where future +tracepoints might be added. + +The API presented has functions of the form: +``trace_tracepointname(function parameters)``. These are the tracepoints +callbacks that are found throughout the code. Registering and +unregistering probes with these callback sites is covered in the +``Documentation/trace/*`` directory. + +IRQ +=== + +.. kernel-doc:: include/trace/events/irq.h + :internal: + +SIGNAL +====== + +.. kernel-doc:: include/trace/events/signal.h + :internal: + +Block IO +======== + +.. kernel-doc:: include/trace/events/block.h + :internal: + +Workqueue +========= + +.. kernel-doc:: include/trace/events/workqueue.h + :internal: diff --git a/Documentation/core-api/workqueue.rst b/Documentation/core-api/workqueue.rst new file mode 100644 index 000000000000..ffdec94fbca1 --- /dev/null +++ b/Documentation/core-api/workqueue.rst @@ -0,0 +1,394 @@ +==================================== +Concurrency Managed Workqueue (cmwq) +==================================== + +:Date: September, 2010 +:Author: Tejun Heo <tj@kernel.org> +:Author: Florian Mickler <florian@mickler.org> + + +Introduction +============ + +There are many cases where an asynchronous process execution context +is needed and the workqueue (wq) API is the most commonly used +mechanism for such cases. + +When such an asynchronous execution context is needed, a work item +describing which function to execute is put on a queue. An +independent thread serves as the asynchronous execution context. The +queue is called workqueue and the thread is called worker. + +While there are work items on the workqueue the worker executes the +functions associated with the work items one after the other. When +there is no work item left on the workqueue the worker becomes idle. +When a new work item gets queued, the worker begins executing again. + + +Why cmwq? +========= + +In the original wq implementation, a multi threaded (MT) wq had one +worker thread per CPU and a single threaded (ST) wq had one worker +thread system-wide. A single MT wq needed to keep around the same +number of workers as the number of CPUs. The kernel grew a lot of MT +wq users over the years and with the number of CPU cores continuously +rising, some systems saturated the default 32k PID space just booting +up. + +Although MT wq wasted a lot of resource, the level of concurrency +provided was unsatisfactory. The limitation was common to both ST and +MT wq albeit less severe on MT. Each wq maintained its own separate +worker pool. A MT wq could provide only one execution context per CPU +while a ST wq one for the whole system. Work items had to compete for +those very limited execution contexts leading to various problems +including proneness to deadlocks around the single execution context. + +The tension between the provided level of concurrency and resource +usage also forced its users to make unnecessary tradeoffs like libata +choosing to use ST wq for polling PIOs and accepting an unnecessary +limitation that no two polling PIOs can progress at the same time. As +MT wq don't provide much better concurrency, users which require +higher level of concurrency, like async or fscache, had to implement +their own thread pool. + +Concurrency Managed Workqueue (cmwq) is a reimplementation of wq with +focus on the following goals. + +* Maintain compatibility with the original workqueue API. + +* Use per-CPU unified worker pools shared by all wq to provide + flexible level of concurrency on demand without wasting a lot of + resource. + +* Automatically regulate worker pool and level of concurrency so that + the API users don't need to worry about such details. + + +The Design +========== + +In order to ease the asynchronous execution of functions a new +abstraction, the work item, is introduced. + +A work item is a simple struct that holds a pointer to the function +that is to be executed asynchronously. Whenever a driver or subsystem +wants a function to be executed asynchronously it has to set up a work +item pointing to that function and queue that work item on a +workqueue. + +Special purpose threads, called worker threads, execute the functions +off of the queue, one after the other. If no work is queued, the +worker threads become idle. These worker threads are managed in so +called worker-pools. + +The cmwq design differentiates between the user-facing workqueues that +subsystems and drivers queue work items on and the backend mechanism +which manages worker-pools and processes the queued work items. + +There are two worker-pools, one for normal work items and the other +for high priority ones, for each possible CPU and some extra +worker-pools to serve work items queued on unbound workqueues - the +number of these backing pools is dynamic. + +Subsystems and drivers can create and queue work items through special +workqueue API functions as they see fit. They can influence some +aspects of the way the work items are executed by setting flags on the +workqueue they are putting the work item on. These flags include +things like CPU locality, concurrency limits, priority and more. To +get a detailed overview refer to the API description of +``alloc_workqueue()`` below. + +When a work item is queued to a workqueue, the target worker-pool is +determined according to the queue parameters and workqueue attributes +and appended on the shared worklist of the worker-pool. For example, +unless specifically overridden, a work item of a bound workqueue will +be queued on the worklist of either normal or highpri worker-pool that +is associated to the CPU the issuer is running on. + +For any worker pool implementation, managing the concurrency level +(how many execution contexts are active) is an important issue. cmwq +tries to keep the concurrency at a minimal but sufficient level. +Minimal to save resources and sufficient in that the system is used at +its full capacity. + +Each worker-pool bound to an actual CPU implements concurrency +management by hooking into the scheduler. The worker-pool is notified +whenever an active worker wakes up or sleeps and keeps track of the +number of the currently runnable workers. Generally, work items are +not expected to hog a CPU and consume many cycles. That means +maintaining just enough concurrency to prevent work processing from +stalling should be optimal. As long as there are one or more runnable +workers on the CPU, the worker-pool doesn't start execution of a new +work, but, when the last running worker goes to sleep, it immediately +schedules a new worker so that the CPU doesn't sit idle while there +are pending work items. This allows using a minimal number of workers +without losing execution bandwidth. + +Keeping idle workers around doesn't cost other than the memory space +for kthreads, so cmwq holds onto idle ones for a while before killing +them. + +For unbound workqueues, the number of backing pools is dynamic. +Unbound workqueue can be assigned custom attributes using +``apply_workqueue_attrs()`` and workqueue will automatically create +backing worker pools matching the attributes. The responsibility of +regulating concurrency level is on the users. There is also a flag to +mark a bound wq to ignore the concurrency management. Please refer to +the API section for details. + +Forward progress guarantee relies on that workers can be created when +more execution contexts are necessary, which in turn is guaranteed +through the use of rescue workers. All work items which might be used +on code paths that handle memory reclaim are required to be queued on +wq's that have a rescue-worker reserved for execution under memory +pressure. Else it is possible that the worker-pool deadlocks waiting +for execution contexts to free up. + + +Application Programming Interface (API) +======================================= + +``alloc_workqueue()`` allocates a wq. The original +``create_*workqueue()`` functions are deprecated and scheduled for +removal. ``alloc_workqueue()`` takes three arguments - @``name``, +``@flags`` and ``@max_active``. ``@name`` is the name of the wq and +also used as the name of the rescuer thread if there is one. + +A wq no longer manages execution resources but serves as a domain for +forward progress guarantee, flush and work item attributes. ``@flags`` +and ``@max_active`` control how work items are assigned execution +resources, scheduled and executed. + + +``flags`` +--------- + +``WQ_UNBOUND`` + Work items queued to an unbound wq are served by the special + worker-pools which host workers which are not bound to any + specific CPU. This makes the wq behave as a simple execution + context provider without concurrency management. The unbound + worker-pools try to start execution of work items as soon as + possible. Unbound wq sacrifices locality but is useful for + the following cases. + + * Wide fluctuation in the concurrency level requirement is + expected and using bound wq may end up creating large number + of mostly unused workers across different CPUs as the issuer + hops through different CPUs. + + * Long running CPU intensive workloads which can be better + managed by the system scheduler. + +``WQ_FREEZABLE`` + A freezable wq participates in the freeze phase of the system + suspend operations. Work items on the wq are drained and no + new work item starts execution until thawed. + +``WQ_MEM_RECLAIM`` + All wq which might be used in the memory reclaim paths **MUST** + have this flag set. The wq is guaranteed to have at least one + execution context regardless of memory pressure. + +``WQ_HIGHPRI`` + Work items of a highpri wq are queued to the highpri + worker-pool of the target cpu. Highpri worker-pools are + served by worker threads with elevated nice level. + + Note that normal and highpri worker-pools don't interact with + each other. Each maintain its separate pool of workers and + implements concurrency management among its workers. + +``WQ_CPU_INTENSIVE`` + Work items of a CPU intensive wq do not contribute to the + concurrency level. In other words, runnable CPU intensive + work items will not prevent other work items in the same + worker-pool from starting execution. This is useful for bound + work items which are expected to hog CPU cycles so that their + execution is regulated by the system scheduler. + + Although CPU intensive work items don't contribute to the + concurrency level, start of their executions is still + regulated by the concurrency management and runnable + non-CPU-intensive work items can delay execution of CPU + intensive work items. + + This flag is meaningless for unbound wq. + +Note that the flag ``WQ_NON_REENTRANT`` no longer exists as all +workqueues are now non-reentrant - any work item is guaranteed to be +executed by at most one worker system-wide at any given time. + + +``max_active`` +-------------- + +``@max_active`` determines the maximum number of execution contexts +per CPU which can be assigned to the work items of a wq. For example, +with ``@max_active`` of 16, at most 16 work items of the wq can be +executing at the same time per CPU. + +Currently, for a bound wq, the maximum limit for ``@max_active`` is +512 and the default value used when 0 is specified is 256. For an +unbound wq, the limit is higher of 512 and 4 * +``num_possible_cpus()``. These values are chosen sufficiently high +such that they are not the limiting factor while providing protection +in runaway cases. + +The number of active work items of a wq is usually regulated by the +users of the wq, more specifically, by how many work items the users +may queue at the same time. Unless there is a specific need for +throttling the number of active work items, specifying '0' is +recommended. + +Some users depend on the strict execution ordering of ST wq. The +combination of ``@max_active`` of 1 and ``WQ_UNBOUND`` is used to +achieve this behavior. Work items on such wq are always queued to the +unbound worker-pools and only one work item can be active at any given +time thus achieving the same ordering property as ST wq. + + +Example Execution Scenarios +=========================== + +The following example execution scenarios try to illustrate how cmwq +behave under different configurations. + + Work items w0, w1, w2 are queued to a bound wq q0 on the same CPU. + w0 burns CPU for 5ms then sleeps for 10ms then burns CPU for 5ms + again before finishing. w1 and w2 burn CPU for 5ms then sleep for + 10ms. + +Ignoring all other tasks, works and processing overhead, and assuming +simple FIFO scheduling, the following is one highly simplified version +of possible sequences of events with the original wq. :: + + TIME IN MSECS EVENT + 0 w0 starts and burns CPU + 5 w0 sleeps + 15 w0 wakes up and burns CPU + 20 w0 finishes + 20 w1 starts and burns CPU + 25 w1 sleeps + 35 w1 wakes up and finishes + 35 w2 starts and burns CPU + 40 w2 sleeps + 50 w2 wakes up and finishes + +And with cmwq with ``@max_active`` >= 3, :: + + TIME IN MSECS EVENT + 0 w0 starts and burns CPU + 5 w0 sleeps + 5 w1 starts and burns CPU + 10 w1 sleeps + 10 w2 starts and burns CPU + 15 w2 sleeps + 15 w0 wakes up and burns CPU + 20 w0 finishes + 20 w1 wakes up and finishes + 25 w2 wakes up and finishes + +If ``@max_active`` == 2, :: + + TIME IN MSECS EVENT + 0 w0 starts and burns CPU + 5 w0 sleeps + 5 w1 starts and burns CPU + 10 w1 sleeps + 15 w0 wakes up and burns CPU + 20 w0 finishes + 20 w1 wakes up and finishes + 20 w2 starts and burns CPU + 25 w2 sleeps + 35 w2 wakes up and finishes + +Now, let's assume w1 and w2 are queued to a different wq q1 which has +``WQ_CPU_INTENSIVE`` set, :: + + TIME IN MSECS EVENT + 0 w0 starts and burns CPU + 5 w0 sleeps + 5 w1 and w2 start and burn CPU + 10 w1 sleeps + 15 w2 sleeps + 15 w0 wakes up and burns CPU + 20 w0 finishes + 20 w1 wakes up and finishes + 25 w2 wakes up and finishes + + +Guidelines +========== + +* Do not forget to use ``WQ_MEM_RECLAIM`` if a wq may process work + items which are used during memory reclaim. Each wq with + ``WQ_MEM_RECLAIM`` set has an execution context reserved for it. If + there is dependency among multiple work items used during memory + reclaim, they should be queued to separate wq each with + ``WQ_MEM_RECLAIM``. + +* Unless strict ordering is required, there is no need to use ST wq. + +* Unless there is a specific need, using 0 for @max_active is + recommended. In most use cases, concurrency level usually stays + well under the default limit. + +* A wq serves as a domain for forward progress guarantee + (``WQ_MEM_RECLAIM``, flush and work item attributes. Work items + which are not involved in memory reclaim and don't need to be + flushed as a part of a group of work items, and don't require any + special attribute, can use one of the system wq. There is no + difference in execution characteristics between using a dedicated wq + and a system wq. + +* Unless work items are expected to consume a huge amount of CPU + cycles, using a bound wq is usually beneficial due to the increased + level of locality in wq operations and work item execution. + + +Debugging +========= + +Because the work functions are executed by generic worker threads +there are a few tricks needed to shed some light on misbehaving +workqueue users. + +Worker threads show up in the process list as: :: + + root 5671 0.0 0.0 0 0 ? S 12:07 0:00 [kworker/0:1] + root 5672 0.0 0.0 0 0 ? S 12:07 0:00 [kworker/1:2] + root 5673 0.0 0.0 0 0 ? S 12:12 0:00 [kworker/0:0] + root 5674 0.0 0.0 0 0 ? S 12:13 0:00 [kworker/1:0] + +If kworkers are going crazy (using too much cpu), there are two types +of possible problems: + + 1. Something being scheduled in rapid succession + 2. A single work item that consumes lots of cpu cycles + +The first one can be tracked using tracing: :: + + $ echo workqueue:workqueue_queue_work > /sys/kernel/debug/tracing/set_event + $ cat /sys/kernel/debug/tracing/trace_pipe > out.txt + (wait a few secs) + ^C + +If something is busy looping on work queueing, it would be dominating +the output and the offender can be determined with the work item +function. + +For the second type of problems it should be possible to just check +the stack trace of the offending worker thread. :: + + $ cat /proc/THE_OFFENDING_KWORKER/stack + +The work item's function should be trivially visible in the stack +trace. + + +Kernel Inline Documentations Reference +====================================== + +.. kernel-doc:: include/linux/workqueue.h |