diff options
| author | Tejun Heo <tj@kernel.org> | 2019-08-26 19:06:56 +0300 | 
|---|---|---|
| committer | Jens Axboe <axboe@kernel.dk> | 2019-08-27 18:22:38 +0300 | 
| commit | 97b27821b4854ca744946dae32a3f2fd55bcd5bc (patch) | |
| tree | a89a3f0e814f684b87f9dd2198818d351b52c8aa /tools/perf/scripts/python/stat-cpi.py | |
| parent | d62241c7a406f0680d702bd974f6f17e28ab8e5d (diff) | |
| download | linux-97b27821b4854ca744946dae32a3f2fd55bcd5bc.tar.xz | |
writeback, memcg: Implement foreign dirty flushing
There's an inherent mismatch between memcg and writeback.  The former
trackes ownership per-page while the latter per-inode.  This was a
deliberate design decision because honoring per-page ownership in the
writeback path is complicated, may lead to higher CPU and IO overheads
and deemed unnecessary given that write-sharing an inode across
different cgroups isn't a common use-case.
Combined with inode majority-writer ownership switching, this works
well enough in most cases but there are some pathological cases.  For
example, let's say there are two cgroups A and B which keep writing to
different but confined parts of the same inode.  B owns the inode and
A's memory is limited far below B's.  A's dirty ratio can rise enough
to trigger balance_dirty_pages() sleeps but B's can be low enough to
avoid triggering background writeback.  A will be slowed down without
a way to make writeback of the dirty pages happen.
This patch implements foreign dirty recording and foreign mechanism so
that when a memcg encounters a condition as above it can trigger
flushes on bdi_writebacks which can clean its pages.  Please see the
comment on top of mem_cgroup_track_foreign_dirty_slowpath() for
details.
A reproducer follows.
write-range.c::
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <fcntl.h>
  #include <sys/types.h>
  static const char *usage = "write-range FILE START SIZE\n";
  int main(int argc, char **argv)
  {
	  int fd;
	  unsigned long start, size, end, pos;
	  char *endp;
	  char buf[4096];
	  if (argc < 4) {
		  fprintf(stderr, usage);
		  return 1;
	  }
	  fd = open(argv[1], O_WRONLY);
	  if (fd < 0) {
		  perror("open");
		  return 1;
	  }
	  start = strtoul(argv[2], &endp, 0);
	  if (*endp != '\0') {
		  fprintf(stderr, usage);
		  return 1;
	  }
	  size = strtoul(argv[3], &endp, 0);
	  if (*endp != '\0') {
		  fprintf(stderr, usage);
		  return 1;
	  }
	  end = start + size;
	  while (1) {
		  for (pos = start; pos < end; ) {
			  long bread, bwritten = 0;
			  if (lseek(fd, pos, SEEK_SET) < 0) {
				  perror("lseek");
				  return 1;
			  }
			  bread = read(0, buf, sizeof(buf) < end - pos ?
					       sizeof(buf) : end - pos);
			  if (bread < 0) {
				  perror("read");
				  return 1;
			  }
			  if (bread == 0)
				  return 0;
			  while (bwritten < bread) {
				  long this;
				  this = write(fd, buf + bwritten,
					       bread - bwritten);
				  if (this < 0) {
					  perror("write");
					  return 1;
				  }
				  bwritten += this;
				  pos += bwritten;
			  }
		  }
	  }
  }
repro.sh::
  #!/bin/bash
  set -e
  set -x
  sysctl -w vm.dirty_expire_centisecs=300000
  sysctl -w vm.dirty_writeback_centisecs=300000
  sysctl -w vm.dirtytime_expire_seconds=300000
  echo 3 > /proc/sys/vm/drop_caches
  TEST=/sys/fs/cgroup/test
  A=$TEST/A
  B=$TEST/B
  mkdir -p $A $B
  echo "+memory +io" > $TEST/cgroup.subtree_control
  echo $((1<<30)) > $A/memory.high
  echo $((32<<30)) > $B/memory.high
  rm -f testfile
  touch testfile
  fallocate -l 4G testfile
  echo "Starting B"
  (echo $BASHPID > $B/cgroup.procs
   pv -q --rate-limit 70M < /dev/urandom | ./write-range testfile $((2<<30)) $((2<<30))) &
  echo "Waiting 10s to ensure B claims the testfile inode"
  sleep 5
  sync
  sleep 5
  sync
  echo "Starting A"
  (echo $BASHPID > $A/cgroup.procs
   pv < /dev/urandom | ./write-range testfile 0 $((2<<30)))
v2: Added comments explaining why the specific intervals are being used.
v3: Use 0 @nr when calling cgroup_writeback_by_id() to use best-effort
    flushing while avoding possible livelocks.
v4: Use get_jiffies_64() and time_before/after64() instead of raw
    jiffies_64 and arthimetic comparisons as suggested by Jan.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Diffstat (limited to 'tools/perf/scripts/python/stat-cpi.py')
0 files changed, 0 insertions, 0 deletions
