<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/include/linux/memcontrol.h, branch v4.2-rc2</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v4.2-rc2</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v4.2-rc2'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2015-06-25T23:00:17+00:00</updated>
<entry>
<title>Merge branch 'for-4.2/writeback' of git://git.kernel.dk/linux-block</title>
<updated>2015-06-25T23:00:17+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2015-06-25T23:00:17+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=e4bc13adfd016fc1036838170288b5680d1a98b0'/>
<id>urn:sha1:e4bc13adfd016fc1036838170288b5680d1a98b0</id>
<content type='text'>
Pull cgroup writeback support from Jens Axboe:
 "This is the big pull request for adding cgroup writeback support.

  This code has been in development for a long time, and it has been
  simmering in for-next for a good chunk of this cycle too.  This is one
  of those problems that has been talked about for at least half a
  decade, finally there's a solution and code to go with it.

  Also see last weeks writeup on LWN:

        http://lwn.net/Articles/648292/"

* 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits)
  writeback, blkio: add documentation for cgroup writeback support
  vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB
  writeback: do foreign inode detection iff cgroup writeback is enabled
  v9fs: fix error handling in v9fs_session_init()
  bdi: fix wrong error return value in cgwb_create()
  buffer: remove unusued 'ret' variable
  writeback: disassociate inodes from dying bdi_writebacks
  writeback: implement foreign cgroup inode bdi_writeback switching
  writeback: add lockdep annotation to inode_to_wb()
  writeback: use unlocked_inode_to_wb transaction in inode_congested()
  writeback: implement unlocked_inode_to_wb transaction and use it for stat updates
  writeback: implement [locked_]inode_to_wb_and_lock_list()
  writeback: implement foreign cgroup inode detection
  writeback: make writeback_control track the inode being written back
  writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb()
  mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use
  writeback: implement memcg writeback domain based throttling
  writeback: reset wb_domain-&gt;dirty_limit[_tstmp] when memcg domain size changes
  writeback: implement memcg wb_domain
  writeback: update wb_over_bg_thresh() to use wb_domain aware operations
  ...
</content>
</entry>
<entry>
<title>writeback: implement memcg writeback domain based throttling</title>
<updated>2015-06-02T14:38:13+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2015-05-22T22:23:35+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=c2aa723a6093633ae4ec15b08a4db276643cab3e'/>
<id>urn:sha1:c2aa723a6093633ae4ec15b08a4db276643cab3e</id>
<content type='text'>
While cgroup writeback support now connects memcg and blkcg so that
writeback IOs are properly attributed and controlled, the IO back
pressure propagation mechanism implemented in balance_dirty_pages()
and its subroutines wasn't aware of cgroup writeback.

Processes belonging to a memcg may have access to only subset of total
memory available in the system and not factoring this into dirty
throttling rendered it completely ineffective for processes under
memcg limits and memcg ended up building a separate ad-hoc degenerate
mechanism directly into vmscan code to limit page dirtying.

The previous patches updated balance_dirty_pages() and its subroutines
so that they can deal with multiple wb_domain's (writeback domains)
and defined per-memcg wb_domain.  Processes belonging to a non-root
memcg are bound to two wb_domains, global wb_domain and memcg
wb_domain, and should be throttled according to IO pressures from both
domains.  This patch updates dirty throttling code so that it repeats
similar calculations for the two domains - the differences between the
two are few and minor - and applies the lower of the two sets of
resulting constraints.

wb_over_bg_thresh(), which controls when background writeback
terminates, is also updated to consider both global and memcg
wb_domains.  It returns true if dirty is over bg_thresh for either
domain.

This makes the dirty throttling mechanism operational for memcg
domains including writeback-bandwidth-proportional dirty page
distribution inside them but the ad-hoc memcg throttling mechanism in
vmscan is still in place.  The next patch will rip it out.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Jens Axboe &lt;axboe@kernel.dk&gt;
Cc: Jan Kara &lt;jack@suse.cz&gt;
Cc: Wu Fengguang &lt;fengguang.wu@intel.com&gt;
Cc: Greg Thelen &lt;gthelen@google.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
</content>
</entry>
<entry>
<title>writeback: implement memcg wb_domain</title>
<updated>2015-06-02T14:38:13+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2015-05-22T22:23:33+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=841710aa6e4acd066ab9fe8c8cb6f4e4e6709d83'/>
<id>urn:sha1:841710aa6e4acd066ab9fe8c8cb6f4e4e6709d83</id>
<content type='text'>
Dirtyable memory is distributed to a wb (bdi_writeback) according to
the relative bandwidth the wb is writing out in the whole system.
This distribution is global - each wb is measured against all other
wb's and gets the proportinately sized portion of the memory in the
whole system.

For cgroup writeback, the amount of dirtyable memory is scoped by
memcg and thus each wb would need to be measured and controlled in its
memcg.  IOW, a wb will belong to two writeback domains - the global
and memcg domains.

The previous patches laid the groundwork to support the two wb_domains
and this patch implements memcg wb_domain.  memcg-&gt;cgwb_domain is
initialized on css online and destroyed on css release,
wb-&gt;memcg_completions is added, and __wb_writeout_inc() is updated to
increment completions against both global and memcg wb_domains.

The following patches will update balance_dirty_pages() and its
subroutines to actually consider memcg wb_domain for throttling.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Jens Axboe &lt;axboe@kernel.dk&gt;
Cc: Jan Kara &lt;jack@suse.cz&gt;
Cc: Wu Fengguang &lt;fengguang.wu@intel.com&gt;
Cc: Greg Thelen &lt;gthelen@google.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
</content>
</entry>
<entry>
<title>writeback: make backing_dev_info host cgroup-specific bdi_writebacks</title>
<updated>2015-06-02T14:33:35+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2015-05-22T21:13:37+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=52ebea749aaed195245701a8f90a23d672c7a933'/>
<id>urn:sha1:52ebea749aaed195245701a8f90a23d672c7a933</id>
<content type='text'>
For the planned cgroup writeback support, on each bdi
(backing_dev_info), each memcg will be served by a separate wb
(bdi_writeback).  This patch updates bdi so that a bdi can host
multiple wbs (bdi_writebacks).

On the default hierarchy, blkcg implicitly enables memcg.  This allows
using memcg's page ownership for attributing writeback IOs, and every
memcg - blkcg combination can be served by its own wb by assigning a
dedicated wb to each memcg.  This means that there may be multiple
wb's of a bdi mapped to the same blkcg.  As congested state is per
blkcg - bdi combination, those wb's should share the same congested
state.  This is achieved by tracking congested state via
bdi_writeback_congested structs which are keyed by blkcg.

bdi-&gt;wb remains unchanged and will keep serving the root cgroup.
cgwb's (cgroup wb's) for non-root cgroups are created on-demand or
looked up while dirtying an inode according to the memcg of the page
being dirtied or current task.  Each cgwb is indexed on bdi-&gt;cgwb_tree
by its memcg id.  Once an inode is associated with its wb, it can be
retrieved using inode_to_wb().

Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all
pages will keep being associated with bdi-&gt;wb.

v3: inode_attach_wb() in account_page_dirtied() moved inside
    mapping_cap_account_dirty() block where it's known to be !NULL.
    Also, an unnecessary NULL check before kfree() removed.  Both
    detected by the kbuild bot.

v2: Updated so that wb association is per inode and wb is per memcg
    rather than blkcg.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Cc: kbuild test robot &lt;fengguang.wu@intel.com&gt;
Cc: Dan Carpenter &lt;dan.carpenter@oracle.com&gt;
Cc: Jens Axboe &lt;axboe@kernel.dk&gt;
Cc: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
</content>
</entry>
<entry>
<title>memcg: implement mem_cgroup_css_from_page()</title>
<updated>2015-06-02T14:33:34+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2015-05-28T00:00:02+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=ad7fa852d3d2816d68a138ebc5bc8967aeb7fd86'/>
<id>urn:sha1:ad7fa852d3d2816d68a138ebc5bc8967aeb7fd86</id>
<content type='text'>
Implement mem_cgroup_css_from_page() which returns the
cgroup_subsys_state of the memcg associated with a given page on the
default hierarchy.  This will be used by cgroup writeback support.

This function assumes that page-&gt;mem_cgroup association doesn't change
until the page is released, which is true on the default hierarchy as
long as replace_page_cache_page() is not used.  As the only user of
replace_page_cache_page() is FUSE which won't support cgroup writeback
for the time being, this works for now, and replace_page_cache_page()
will soon be updated so that the invariant actually holds.

Note that the RCU protected page-&gt;mem_cgroup access is consistent with
other usages across memcg but ultimately incorrect.  These unlocked
accesses are missing required barriers.  page-&gt;mem_cgroup should be
made an RCU pointer and updated and accessed using RCU operations.

v4: Instead of triggering WARN, return the root css on the traditional
    hierarchies.  This makes the function a lot easier to deal with
    especially as there's no light way to synchronize against
    hierarchy rebinding.

v3: s/mem_cgroup_migrate()/mem_cgroup_css_from_page()/

v2: Trigger WARN if the function is used on the traditional
    hierarchies and add comment about the assumed invariant.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Michal Hocko &lt;mhocko@suse.cz&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
</content>
</entry>
<entry>
<title>memcg: add mem_cgroup_root_css</title>
<updated>2015-06-02T14:33:33+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2015-05-22T21:13:20+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=56161634e4824380a67243a4cf3fa52eb1e5d836'/>
<id>urn:sha1:56161634e4824380a67243a4cf3fa52eb1e5d836</id>
<content type='text'>
Add global mem_cgroup_root_css which points to the root memcg css.
This will be used by cgroup writeback support.  If memcg is disabled,
it's defined as ERR_PTR(-EINVAL).

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
aCc: Michal Hocko &lt;mhocko@suse.cz&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
</content>
</entry>
<entry>
<title>memcg: add per cgroup dirty page accounting</title>
<updated>2015-06-02T14:33:33+00:00</updated>
<author>
<name>Greg Thelen</name>
<email>gthelen@google.com</email>
</author>
<published>2015-05-22T21:13:16+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=c4843a7593a9df3ff5b1806084cefdfa81dd7c79'/>
<id>urn:sha1:c4843a7593a9df3ff5b1806084cefdfa81dd7c79</id>
<content type='text'>
When modifying PG_Dirty on cached file pages, update the new
MEM_CGROUP_STAT_DIRTY counter.  This is done in the same places where
global NR_FILE_DIRTY is managed.  The new memcg stat is visible in the
per memcg memory.stat cgroupfs file.  The most recent past attempt at
this was http://thread.gmane.org/gmane.linux.kernel.cgroups/8632

The new accounting supports future efforts to add per cgroup dirty
page throttling and writeback.  It also helps an administrator break
down a container's memory usage and provides evidence to understand
memcg oom kills (the new dirty count is included in memcg oom kill
messages).

The ability to move page accounting between memcg
(memory.move_charge_at_immigrate) makes this accounting more
complicated than the global counter.  The existing
mem_cgroup_{begin,end}_page_stat() lock is used to serialize move
accounting with stat updates.
Typical update operation:
	memcg = mem_cgroup_begin_page_stat(page)
	if (TestSetPageDirty()) {
		[...]
		mem_cgroup_update_page_stat(memcg)
	}
	mem_cgroup_end_page_stat(memcg)

Summary of mem_cgroup_end_page_stat() overhead:
- Without CONFIG_MEMCG it's a no-op
- With CONFIG_MEMCG and no inter memcg task movement, it's just
  rcu_read_lock()
- With CONFIG_MEMCG and inter memcg  task movement, it's
  rcu_read_lock() + spin_lock_irqsave()

A memcg parameter is added to several routines because their callers
now grab mem_cgroup_begin_page_stat() which returns the memcg later
needed by for mem_cgroup_update_page_stat().

Because mem_cgroup_begin_page_stat() may disable interrupts, some
adjustments are needed:
- move __mark_inode_dirty() from __set_page_dirty() to its caller.
  __mark_inode_dirty() locking does not want interrupts disabled.
- use spin_lock_irqsave(tree_lock) rather than spin_lock_irq() in
  __delete_from_page_cache(), replace_page_cache_page(),
  invalidate_complete_page2(), and __remove_mapping().

   text    data     bss      dec    hex filename
8925147 1774832 1785856 12485835 be84cb vmlinux-!CONFIG_MEMCG-before
8925339 1774832 1785856 12486027 be858b vmlinux-!CONFIG_MEMCG-after
                            +192 text bytes
8965977 1784992 1785856 12536825 bf4bf9 vmlinux-CONFIG_MEMCG-before
8966750 1784992 1785856 12537598 bf4efe vmlinux-CONFIG_MEMCG-after
                            +773 text bytes

Performance tests run on v4.0-rc1-36-g4f671fe2f952.  Lower is better for
all metrics, they're all wall clock or cycle counts.  The read and write
fault benchmarks just measure fault time, they do not include I/O time.

* CONFIG_MEMCG not set:
                            baseline                              patched
  kbuild                 1m25.030000(+-0.088% 3 samples)       1m25.426667(+-0.120% 3 samples)
  dd write 100 MiB          0.859211561 +-15.10%                  0.874162885 +-15.03%
  dd write 200 MiB          1.670653105 +-17.87%                  1.669384764 +-11.99%
  dd write 1000 MiB         8.434691190 +-14.15%                  8.474733215 +-14.77%
  read fault cycles       254.0(+-0.000% 10 samples)            253.0(+-0.000% 10 samples)
  write fault cycles     2021.2(+-3.070% 10 samples)           1984.5(+-1.036% 10 samples)

* CONFIG_MEMCG=y root_memcg:
                            baseline                              patched
  kbuild                 1m25.716667(+-0.105% 3 samples)       1m25.686667(+-0.153% 3 samples)
  dd write 100 MiB          0.855650830 +-14.90%                  0.887557919 +-14.90%
  dd write 200 MiB          1.688322953 +-12.72%                  1.667682724 +-13.33%
  dd write 1000 MiB         8.418601605 +-14.30%                  8.673532299 +-15.00%
  read fault cycles       266.0(+-0.000% 10 samples)            266.0(+-0.000% 10 samples)
  write fault cycles     2051.7(+-1.349% 10 samples)           2049.6(+-1.686% 10 samples)

* CONFIG_MEMCG=y non-root_memcg:
                            baseline                              patched
  kbuild                 1m26.120000(+-0.273% 3 samples)       1m25.763333(+-0.127% 3 samples)
  dd write 100 MiB          0.861723964 +-15.25%                  0.818129350 +-14.82%
  dd write 200 MiB          1.669887569 +-13.30%                  1.698645885 +-13.27%
  dd write 1000 MiB         8.383191730 +-14.65%                  8.351742280 +-14.52%
  read fault cycles       265.7(+-0.172% 10 samples)            267.0(+-0.000% 10 samples)
  write fault cycles     2070.6(+-1.512% 10 samples)           2084.4(+-2.148% 10 samples)

As expected anon page faults are not affected by this patch.

tj: Updated to apply on top of the recent cancel_dirty_page() changes.

Signed-off-by: Sha Zhengju &lt;handai.szj@gmail.com&gt;
Signed-off-by: Greg Thelen &lt;gthelen@google.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
</content>
</entry>
<entry>
<title>gfp: add __GFP_NOACCOUNT</title>
<updated>2015-05-15T00:55:51+00:00</updated>
<author>
<name>Vladimir Davydov</name>
<email>vdavydov@parallels.com</email>
</author>
<published>2015-05-14T22:16:55+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=8f4fc071b1926d0b20336e2b3f8ab85c94c734c5'/>
<id>urn:sha1:8f4fc071b1926d0b20336e2b3f8ab85c94c734c5</id>
<content type='text'>
Not all kmem allocations should be accounted to memcg.  The following
patch gives an example when accounting of a certain type of allocations to
memcg can effectively result in a memory leak.  This patch adds the
__GFP_NOACCOUNT flag which if passed to kmalloc and friends will force the
allocation to go through the root cgroup.  It will be used by the next
patch.

Note, since in case of kmemleak enabled each kmalloc implies yet another
allocation from the kmemleak_object cache, we add __GFP_NOACCOUNT to
gfp_kmemleak_mask.

Alternatively, we could introduce a per kmem cache flag disabling
accounting for all allocations of a particular kind, but (a) we would not
be able to bypass accounting for kmalloc then and (b) a kmem cache with
this flag set could not be merged with a kmem cache without this flag,
which would increase the number of global caches and therefore
fragmentation even if the memory cgroup controller is not used.

Despite its generic name, currently __GFP_NOACCOUNT disables accounting
only for kmem allocations while user page allocations are always charged.
To catch abusing of this flag, a warning is issued on an attempt of
passing it to mem_cgroup_try_charge.

Signed-off-by: Vladimir Davydov &lt;vdavydov@parallels.com&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Michal Hocko &lt;mhocko@suse.cz&gt;
Cc: Christoph Lameter &lt;cl@linux.com&gt;
Cc: Pekka Enberg &lt;penberg@kernel.org&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Joonsoo Kim &lt;iamjoonsoo.kim@lge.com&gt;
Cc: Greg Thelen &lt;gthelen@google.com&gt;
Cc: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
Cc: &lt;stable@vger.kernel.org&gt;	[4.0.x]
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>list_lru: introduce per-memcg lists</title>
<updated>2015-02-13T02:54:09+00:00</updated>
<author>
<name>Vladimir Davydov</name>
<email>vdavydov@parallels.com</email>
</author>
<published>2015-02-12T22:59:10+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=60d3fd32a7a9da4c8c93a9f89cfda22a0b4c65ce'/>
<id>urn:sha1:60d3fd32a7a9da4c8c93a9f89cfda22a0b4c65ce</id>
<content type='text'>
There are several FS shrinkers, including super_block::s_shrink, that
keep reclaimable objects in the list_lru structure.  Hence to turn them
to memcg-aware shrinkers, it is enough to make list_lru per-memcg.

This patch does the trick.  It adds an array of lru lists to the
list_lru_node structure (per-node part of the list_lru), one for each
kmem-active memcg, and dispatches every item addition or removal to the
list corresponding to the memcg which the item is accounted to.  So now
the list_lru structure is not just per node, but per node and per memcg.

Not all list_lrus need this feature, so this patch also adds a new
method, list_lru_init_memcg, which initializes a list_lru as memcg
aware.  Otherwise (i.e.  if initialized with old list_lru_init), the
list_lru won't have per memcg lists.

Just like per memcg caches arrays, the arrays of per-memcg lists are
indexed by memcg_cache_id, so we must grow them whenever
memcg_nr_cache_ids is increased.  So we introduce a callback,
memcg_update_all_list_lrus, invoked by memcg_alloc_cache_id if the id
space is full.

The locking is implemented in a manner similar to lruvecs, i.e.  we have
one lock per node that protects all lists (both global and per cgroup) on
the node.

Signed-off-by: Vladimir Davydov &lt;vdavydov@parallels.com&gt;
Cc: Dave Chinner &lt;david@fromorbit.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Michal Hocko &lt;mhocko@suse.cz&gt;
Cc: Greg Thelen &lt;gthelen@google.com&gt;
Cc: Glauber Costa &lt;glommer@gmail.com&gt;
Cc: Alexander Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Christoph Lameter &lt;cl@linux.com&gt;
Cc: Pekka Enberg &lt;penberg@kernel.org&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Joonsoo Kim &lt;iamjoonsoo.kim@lge.com&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>memcg: add rwsem to synchronize against memcg_caches arrays relocation</title>
<updated>2015-02-13T02:54:09+00:00</updated>
<author>
<name>Vladimir Davydov</name>
<email>vdavydov@parallels.com</email>
</author>
<published>2015-02-12T22:59:01+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=05257a1a3dcc196c197714b5c9a8dd35b7f6aefc'/>
<id>urn:sha1:05257a1a3dcc196c197714b5c9a8dd35b7f6aefc</id>
<content type='text'>
We need a stable value of memcg_nr_cache_ids in kmem_cache_create()
(memcg_alloc_cache_params() wants it for root caches), where we only
hold the slab_mutex and no memcg-related locks.  As a result, we have to
update memcg_nr_cache_ids under the slab_mutex, which we can only take
on the slab's side (see memcg_update_array_size).  This looks awkward
and will become even worse when per-memcg list_lru is introduced, which
also wants stable access to memcg_nr_cache_ids.

To get rid of this dependency between the memcg_nr_cache_ids and the
slab_mutex, this patch introduces a special rwsem.  The rwsem is held
for writing during memcg_caches arrays relocation and memcg_nr_cache_ids
updates.  Therefore one can take it for reading to get a stable access
to memcg_caches arrays and/or memcg_nr_cache_ids.

Currently the semaphore is taken for reading only from
kmem_cache_create, right before taking the slab_mutex, so right now
there's no much point in using rwsem instead of mutex.  However, once
list_lru is made per-memcg it will allow list_lru initializations to
proceed concurrently.

Signed-off-by: Vladimir Davydov &lt;vdavydov@parallels.com&gt;
Cc: Dave Chinner &lt;david@fromorbit.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Michal Hocko &lt;mhocko@suse.cz&gt;
Cc: Greg Thelen &lt;gthelen@google.com&gt;
Cc: Glauber Costa &lt;glommer@gmail.com&gt;
Cc: Alexander Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Christoph Lameter &lt;cl@linux.com&gt;
Cc: Pekka Enberg &lt;penberg@kernel.org&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Joonsoo Kim &lt;iamjoonsoo.kim@lge.com&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
</feed>
