Age | Commit message (Collapse) | Author | Files | Lines |
|
This function is not used anywhere in the kernel.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Both functions are entirely unused.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Add complete support for full hierarchical scheduling, with a cgroups
interface. Full hierarchical scheduling is implemented through the
'entity' abstraction: both bfq_queues, i.e., the internal BFQ queues
associated with processes, and groups are represented in general by
entities. Given the bfq_queues associated with the processes belonging
to a given group, the entities representing these queues are sons of
the entity representing the group. At higher levels, if a group, say
G, contains other groups, then the entity representing G is the parent
entity of the entities representing the groups in G.
Hierarchical scheduling is performed as follows: if the timestamps of
a leaf entity (i.e., of a bfq_queue) change, and such a change lets
the entity become the next-to-serve entity for its parent entity, then
the timestamps of the parent entity are recomputed as a function of
the budget of its new next-to-serve leaf entity. If the parent entity
belongs, in its turn, to a group, and its new timestamps let it become
the next-to-serve for its parent entity, then the timestamps of the
latter parent entity are recomputed as well, and so on. When a new
bfq_queue must be set in service, the reverse path is followed: the
next-to-serve highest-level entity is chosen, then its next-to-serve
child entity, and so on, until the next-to-serve leaf entity is
reached, and the bfq_queue that this entity represents is set in
service.
Writeback is accounted for on a per-group basis, i.e., for each group,
the async I/O requests of the processes of the group are enqueued in a
distinct bfq_queue, and the entity associated with this queue is a
child of the entity associated with the group.
Weights can be assigned explicitly to groups and processes through the
cgroups interface, differently from what happens, for single
processes, if the cgroups interface is not used (as explained in the
description of the previous patch). In particular, since each node has
a full scheduler, each group can be assigned its own weight.
Signed-off-by: Fabio Checconi <fchecconi@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
For ease of management it would be nice for users to specify that the
device node for a nbd device is destroyed once it is disconnected and
there are no more users. Add a client flag and enable this operation to
happen.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Allow users to query the status of existing nbd devices. Right now this
only returns whether or not the device is connected, but could be
extended in the future to include more information.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Sometimes we like to upgrade our server without making all of our
clients freak out and reconnect. This patch provides a way to specify a
dead connection timeout to allow us to pause all requests and wait for
new connections to be opened. With this in place I can take down the
nbd server for less than the dead connection timeout time and bring it
back up and everything resumes gracefully.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Provide a mechanism to notify userspace that there's been a link problem
on a NBD device. This will allow userspace to re-establish a connection
and provide the new socket to the device without disrupting the device.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
We want to be able to reconnect dead connections to existing block
devices, so add a reconfigure netlink command. We will also allow users
to change their timeout on the fly, but everything else will require a
disconnect and reconnect. You won't be able to add more connections
either, simply replace dead connections with new more lively
connections.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
The existing ioctl interface for configuring NBD devices is a bit
cumbersome and hard to extend. The other problem is we leave a
userspace app sitting in it's syscall until the device disconnects,
which is less than ideal.
This patch introduces a netlink interface for adding and disconnecting
nbd devices. This has the benefits of being easily extendable without
breaking older userspace applications, and allows us to configure a nbd
device without leaving a userspace app sitting waiting for the device to
disconnect.
With this interface we also gain the ability to configure more devices
than are preallocated at insmod time. We also have gained the ability
to not specify a particular device and be provided one for us so that
userspace doesn't need to find a free device to configure.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Target initialization has two responsibilities: creating the target
partition and instantiating the target. This patch enables to create a
factory partition (e.g., do not trigger recovery on the given target).
This is useful for target development and for being able to restore the
device state at any moment in time without requiring a full-device
erase.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
According to the OCSSD 1.2 specification, the 0x200 hint enables the
media scrambler for the read/write opcode, providing that the controller
has been correctly configured by the firmware. Rename the macro to
represent this meaning.
Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Until now erases have been submitted as synchronous commands through a
dedicated erase function. In order to enable targets implementing
asynchronous erases, refactor the erase path so that it uses the normal
async I/O submission functions. If a target requires sync I/O, it can
implement it internally. Also, adapt rrpc to use the new erase path.
Signed-off-by: Javier González <javier@cnexlabs.com>
Fixed spelling error.
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Currently, this callback is called right after put_request() and has no
distinguishable purpose. Instead, let's call it before put_request() as
soon as I/O has completed on the request, before we account it in
blk-stat. With this, Kyber can enable stats when it sees a latency
outlier and make sure the outlier gets accounted.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
blk_mq_finish_request() is required for schedulers that define their own
put_request(). blk_mq_run_hw_queue() is required for schedulers that
hold back requests to be run later.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
This operation supports the use case of limiting the number of bits that
can be allocated for a given operation. Rather than setting aside some
bits at the end of the bitmap, we can set aside bits in each word of the
bitmap. This means we can keep the allocation hints spread out and
support sbitmap_resize() nicely at the cost of lower granularity for the
allowed depth.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
This drivers was added in 2008, but as far as a I can tell we never had a
single platform that actually registered resources for the platform driver.
It's also been unmaintained for a long time and apparently has a ATA mode
that can be driven using the IDE/libata subsystem.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Now that we use the proper REQ_OP_WRITE_ZEROES operation everywhere we can
kill this hack.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
This avoids fallbacks to explicit zeroing in (__)blkdev_issue_zeroout if
the caller doesn't want them.
Also clean up the convoluted check for the return condition that this
new flag is added to.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
If this flag is set logical provisioning capable device should
release space for the zeroed blocks if possible, if it is not set
devices should keep the blocks anchored.
Also remove an out of sync kerneldoc comment for a static function
that would have become even more out of data with this change.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Turn the existing discard flag into a new BLKDEV_ZERO_UNMAP flag with
similar semantics, but without referring to diѕcard.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Copy & paste from the REQ_OP_WRITE_SAME code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Make life easy for implementations that needs to send a data buffer
to the device (e.g. SCSI) by numbering it as a data out command.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Schedulers need to be informed when a hardware queue is added or removed
at runtime so they can allocate/free per-hardware queue data. So,
replace the blk_mq_sched_init_hctx_data() helper, which only makes sense
at init time, with .init_hctx() and .exit_hctx() hooks.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
We've added a considerable amount of fixes for stalls and issues
with the blk-mq scheduling in the 4.11 series since forking
off the for-4.12/block branch. We need to do improvements on
top of that for 4.12, so pull in the previous fixes to make
our lives easier going forward.
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
To improve scalability, if hardware queues are shared, restart
a single hardware queue in round-robin fashion. Rename
blk_mq_sched_restart_queues() to reflect the new semantics.
Remove blk_mq_sched_mark_restart_queue() because this function
has no callers. Remove flag QUEUE_FLAG_RESTART because this
patch removes the code that uses this flag.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Introduce a function that runs a hardware queue unconditionally
after a delay. Note: there is already a function that stops and
restarts a hardware queue after a delay, namely blk_mq_delay_queue().
This function will be used in the next patch in this series.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Long Li <longli@microsoft.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Currently only dm and md/raid5 bios trigger
trace_block_bio_complete(). Now that we have bio_chain() and
bio_inc_remaining(), it is not possible, in general, for a driver to
know when the bio is really complete. Only bio_endio() knows that.
So move the trace_block_bio_complete() call to bio_endio().
Now trace_block_bio_complete() pairs with trace_block_bio_queue().
Any bio for which a 'queue' event is traced, will subsequently
generate a 'complete' event.
There are a few cases where completion tracing is not wanted.
1/ If blk_update_request() has already generated a completion
trace event at the 'request' level, there is no point generating
one at the bio level too. In this case the bi_sector and bi_size
will have changed, so the bio level event would be wrong
2/ If the bio hasn't actually been queued yet, but is being aborted
early, then a trace event could be confusing. Some filesystems
call bio_endio() but do not want tracing.
3/ The bio_integrity code interposes itself by replacing bi_end_io,
then restoring it and calling bio_endio() again. This would produce
two identical trace events if left like that.
To handle these, we introduce a flag BIO_TRACE_COMPLETION and only
produce the trace event when this is set.
We address point 1 above by clearing the flag in blk_update_request().
We address point 2 above by only setting the flag when
generic_make_request() is called.
We address point 3 above by clearing the flag after generating a
completion event.
When bio_split() is used on a bio, particularly in blk_queue_split(),
there is an extra complication. A new bio is split off the front, and
may be handle directly without going through generic_make_request().
The old bio, which has been advanced, is passed to
generic_make_request(), so it will trigger a trace event a second
time.
Probably the best result when a split happens is to see a single
'queue' event for the whole bio, then multiple 'complete' events - one
for each component. To achieve this was can:
- copy the BIO_TRACE_COMPLETION flag to the new bio in bio_split()
- avoid generating a 'queue' event if BIO_TRACE_COMPLETION is already set.
This way, the split-off bio won't create a queue event, the original
won't either even if it re-submitted to generic_make_request(),
but both will produce completion events, each for their own range.
So if generic_make_request() is called (which generates a QUEUED
event), then bi_endio() will create a single COMPLETE event for each
range that the bio is split into, unless the driver has explicitly
requested it not to.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
The comment for the 'flags' field of 'bio' mentions
"command" which is no longer stored there, and doesn't
mention the bvec pool number, which is.
BIO_RESET_BITS is set in such a way that it would need to be
updated if new bits were added, which is easy to miss.
BVEC_POOL_BITS is larger than needed. The BVEC_POOL_IDX()
ranges from 0 to 6, so 3 bits are sufficient.
This patch make improvements in each of these areas.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
In elevator_switch(), if blk_mq_init_sched() fails, we attempt to fall
back to the original scheduler. However, at this point, we've already
torn down the original scheduler's tags, so this causes a crash. Doing
the fallback like the legacy elevator path is much harder for mq, so fix
it by just falling back to none, instead.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
After commit 64c7f1d1572c, we went from 1 to 2 holes in my
test setup. If we move the timeout field a bit, we remove
both of those holes and shrink struct request by 8 bytes.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Instead of bloating the generic struct request with it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
The block layer core sets blk_mq_queue_data.list but no block
drivers read that member. Hence remove it and also the code that
is used to set this member.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
As Dan Carpenter pointed out: mixing 16-bit nvme status with 32-bit
error status from driver. Corrected comment on fcp request struct
status field, and converted done routine to explicitly set nvme status
codes for nvme status.
Signed-off-by: James Smart <james.smart@broadcom.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Update FC-NVME definitions to match FC-NVME r1.14 (16-020vB) plus
change voted in by 2/22 FC-NVME Adhoc (see HOSTID below).
Includes the following:
- Addition of "status_code" field to ERSP IU
- Addition of FC-NVME LS RJT reason_codes and reason_explanations
- CreateAssociation payload, HostID field shortened to 16 bytes
Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Several locations in the stack need to handle ipv4/ipv6
(with scope) and port strings conversion to sockaddr.
Add a helper that takes either AF_INET, AF_INET6 or
AF_UNSPEC (for wildcard) to centralize this handling.
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
The enum values for QPTYPE, PRTYPE and CMS are off by 1 from the
values defined in figure 42 of the NVM Express over Fabrics 1.0:
http://www.nvmexpress.org/wp-content/uploads/NVMe_over_Fabrics_1_0_Gold_20160605-1.pdf
Fix our enums to match the final spec.
Signed-off-by: Roland Dreier <roland@purestorage.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
|
|
I inadvertently applied the v5 version of this patch, whereas
the agreed upon version was v5. Revert this one so we can apply
the right one.
This reverts commit 7fc6b87a9ff537e7df32b1278118ce9c5bcd6788.
|
|
Now that the remaining drivers have been converted to one request queue
per gendisk, let's warn if a request queue gets registered more than
once. This will catch future drivers which might do it inadvertently or
any old drivers that I may have missed.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
As the .q_usage_counter is used by both legacy and
mq path, we need to block new I/O if queue becomes
dead in blk_queue_enter().
So rename it and we can use this function in both
paths.
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
blkg_conf_prep() currently calls blkg_lookup_create() while holding
request queue spinlock. This means allocating memory for struct
blkcg_gq has to be made non-blocking. This causes occasional -ENOMEM
failures in call paths like below:
pcpu_alloc+0x68f/0x710
__alloc_percpu_gfp+0xd/0x10
__percpu_counter_init+0x55/0xc0
cfq_pd_alloc+0x3b2/0x4e0
blkg_alloc+0x187/0x230
blkg_create+0x489/0x670
blkg_lookup_create+0x9a/0x230
blkg_conf_prep+0x1fb/0x240
__cfqg_set_weight_device.isra.105+0x5c/0x180
cfq_set_weight_on_dfl+0x69/0xc0
cgroup_file_write+0x39/0x1c0
kernfs_fop_write+0x13f/0x1d0
__vfs_write+0x23/0x120
vfs_write+0xc2/0x1f0
SyS_write+0x44/0xb0
entry_SYSCALL_64_fastpath+0x18/0xad
In the code path above, percpu allocator cannot call vmalloc() due to
queue spinlock.
A failure in this call path gives grief to tools which are trying to
configure io weights. We see occasional failures happen shortly after
reboots even when system is not under any memory pressure. Machines
with a lot of cpus are more vulnerable to this condition.
Update blkg_create() function to temporarily drop the rcu and queue
locks when it is allowed by gfp mask.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Pull KVM fixes from Paolo Bonzini:
"All x86-specific, apart from some arch-independent syzkaller fixes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86: cleanup the page tracking SRCU instance
KVM: nVMX: fix nested EPT detection
KVM: pci-assign: do not map smm memory slot pages in vt-d page tables
KVM: kvm_io_bus_unregister_dev() should never fail
KVM: VMX: Fix enable VPID conditions
KVM: nVMX: Fix nested VPID vmx exec control
KVM: x86: correct async page present tracepoint
kvm: vmx: Flush TLB when the APIC-access address changes
KVM: x86: use pic/ioapic destructor when destroy vm
KVM: x86: check existance before destroy
KVM: x86: clear bus pointer when destroyed
KVM: Documentation: document MCE ioctls
KVM: nVMX: don't reset kvm mmu twice
PTP: fix ptr_ret.cocci warnings
kvm: fix usage of uninit spinlock in avic_vm_destroy()
KVM: VMX: downgrade warning on unexpected exit code
|
|
User configures latency target, but the latency threshold for each
request size isn't fixed. For a SSD, the IO latency highly depends on
request size. To calculate latency threshold, we sample some data, eg,
average latency for request size 4k, 8k, 16k, 32k .. 1M. The latency
threshold of each request size will be the sample latency (I'll call it
base latency) plus latency target. For example, the base latency for
request size 4k is 80us and user configures latency target 60us. The 4k
latency threshold will be 80 + 60 = 140us.
To sample data, we calculate the order base 2 of rounded up IO sectors.
If the IO size is bigger than 1M, it will be accounted as 1M. Since the
calculation does round up, the base latency will be slightly smaller
than actual value. Also if there isn't any IO dispatched for a specific
IO size, we will use the base latency of smaller IO size for this IO
size.
But we shouldn't sample data at any time. The base latency is supposed
to be latency where disk isn't congested, because we use latency
threshold to schedule IOs between cgroups. If disk is congested, the
latency is higher, using it for scheduling is meaningless. Hence we only
do the sampling when block throttling is in the LOW limit, with
assumption disk isn't congested in such state. If the assumption isn't
true, eg, low limit is too high, calculated latency threshold will be
higher.
Hard disk is completely different. Latency depends on spindle seek
instead of request size. Currently this feature is SSD only, we probably
can use a fixed threshold like 4ms for hard disk though.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Currently there is no way to know the request size when the request is
finished. Next patch will need this info. We could add extra field to
record the size, but blk_issue_stat has enough space to record it, so
this patch just overloads blk_issue_stat. With this, we will have 49bits
to track time, which still is very long time.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
A cgroup gets assigned a low limit, but the cgroup could never dispatch
enough IO to cross the low limit. In such case, the queue state machine
will remain in LIMIT_LOW state and all other cgroups will be throttled
according to low limit. This is unfair for other cgroups. We should
treat the cgroup idle and upgrade the state machine to lower state.
We also have a downgrade logic. If the state machine upgrades because of
cgroup idle (real idle), the state machine will downgrade soon as the
cgroup is below its low limit. This isn't what we want. A more
complicated case is cgroup isn't idle when queue is in LIMIT_LOW. But
when queue gets upgraded to lower state, other cgroups could dispatch
more IO and this cgroup can't dispatch enough IO, so the cgroup is below
its low limit and looks like idle (fake idle). In this case, the queue
should downgrade soon. The key to determine if we should do downgrade is
to detect if cgroup is truely idle.
Unfortunately it's very hard to determine if a cgroup is real idle. This
patch uses the 'think time check' idea from CFQ for the purpose. Please
note, the idea doesn't work for all workloads. For example, a workload
with io depth 8 has disk utilization 100%, hence think time is 0, eg,
not idle. But the workload can run higher bandwidth with io depth 16.
Compared to io depth 16, the io depth 8 workload is idle. We use the
idea to roughly determine if a cgroup is idle.
We treat a cgroup idle if its think time is above a threshold (by
default 1ms for SSD and 100ms for HD). The idea is think time above the
threshold will start to harm performance. HD is much slower so a longer
think time is ok.
The patch (and the latter patches) uses 'unsigned long' to track time.
We convert 'ns' to 'us' with 'ns >> 10'. This is fast but loses
precision, should not a big deal.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc driver fixes from Greg KH:
"A smattering of different small fixes for some random driver
subsystems. Nothing all that major, just resolutions for reported
issues and bugs.
All have been in linux-next with no reported issues"
* tag 'char-misc-4.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (21 commits)
extcon: int3496: Set the id pin to direction-input if necessary
extcon: int3496: Use gpiod_get instead of gpiod_get_index
extcon: int3496: Add dependency on X86 as it's Intel specific
extcon: int3496: Add GPIO ACPI mapping table
extcon: int3496: Rename GPIO pins in accordance with binding
vmw_vmci: handle the return value from pci_alloc_irq_vectors correctly
ppdev: fix registering same device name
parport: fix attempt to write duplicate procfiles
auxdisplay: img-ascii-lcd: add missing sentinel entry in img_ascii_lcd_matches
Drivers: hv: vmbus: Don't leak memory when a channel is rescinded
Drivers: hv: vmbus: Don't leak channel ids
Drivers: hv: util: don't forget to init host_ts.lock
Drivers: hv: util: move waiting for release to hv_utils_transport itself
vmbus: remove hv_event_tasklet_disable/enable
vmbus: use rcu for per-cpu channel list
mei: don't wait for os version message reply
mei: fix deadlock on mei reset
intel_th: pci: Add Gemini Lake support
intel_th: pci: Add Denverton SOC support
intel_th: Don't leak module refcount on failure to activate
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull IIO driver fixes from Greg KH:
"Here are some small IIO driver fixes for 4.11-rc4 that resolve a
number of tiny reported issues. All of these have been in linux-next
for a while with no reported issues"
* tag 'staging-4.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
iio: imu: st_lsm6dsx: fix FIFO_CTRL2 overwrite during watermark configuration
iio: adc: ti_am335x_adc: fix fifo overrun recovery
iio: sw-device: Fix config group initialization
iio: magnetometer: ak8974: remove incorrect __exit markups
iio: hid-sensor-trigger: Change get poll value function order to avoid sensor properties losing after resume from S3
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB/PHY fixes from Greg KH:
"Here are a number of small USB and PHY driver fixes for 4.11-rc4.
Nothing major here, just an bunch of small fixes, and a handfull of
good fixes from Johan for devices with crazy descriptors. There are a
few new device ids in here as well.
All of these have been in linux-next with no reported issues"
* tag 'usb-4.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (26 commits)
usb: gadget: f_hid: fix: Don't access hidg->req without spinlock held
usb: gadget: udc: remove pointer dereference after free
usb: gadget: f_uvc: Sanity check wMaxPacketSize for SuperSpeed
usb: gadget: f_uvc: Fix SuperSpeed companion descriptor's wBytesPerInterval
usb: gadget: acm: fix endianness in notifications
usb: dwc3: gadget: delay unmap of bounced requests
USB: serial: qcserial: add Dell DW5811e
usb: hub: Fix crash after failure to read BOS descriptor
ACM gadget: fix endianness in notifications
USB: usbtmc: fix probe error path
USB: usbtmc: add missing endpoint sanity check
USB: serial: option: add Quectel UC15, UC20, EC21, and EC25 modems
usb: musb: fix possible spinlock deadlock
usb: musb: dsps: fix iounmap in error and exit paths
usb: musb: cppi41: don't check early-TX-interrupt for Isoch transfer
usb-core: Add LINEAR_FRAME_INTR_BINTERVAL USB quirk
uwb: i1480-dfu: fix NULL-deref at probe
uwb: hwa-rc: fix NULL-deref at probe
USB: wusbcore: fix NULL-deref at probe
USB: uss720: fix NULL-deref at probe
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt
Pull fscrypto fixes from Ted Ts'o:
"A code cleanup and bugfix for fs/crypto"
* tag 'fscrypt-for-linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt:
fscrypt: eliminate ->prepare_context() operation
fscrypt: remove broken support for detecting keyring key revocation
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging
Pull hwmon fixes from Guenter Roeck:
- bug fixes in asus_atk0110, it87 and max31790 drivers
- added missing API definition to hwmon core
* tag 'hwmon-for-linus-v4.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
hwmon: (asus_atk0110) fix uninitialized data access
hwmon: Add missing HWMON_T_ALARM
hwmon: (it87) Avoid registering the same chip on both SIO addresses
hwmon: (max31790) Set correct PWM value
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
Pull rdma fixes from Doug Ledford:
"This has been a slow -rc cycle for the RDMA subsystem. We really
haven't had a lot of rc fixes come in. This pull request is the first
of this entire rc cycle and it has all of the suitable fixes so far
and it's still only about 20 patches. The fix for the minor breakage
cause by the dma mapping patchset is in here, as well as a couple
other potential oops fixes, but the rest is more minor.
Summary:
- fix for dma_ops change in this kernel, resolving the s390, powerpc,
and IOMMU operation
- a few other oops fixes
- the rest are all minor fixes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma:
IB/qib: fix false-postive maybe-uninitialized warning
RDMA/iser: Fix possible mr leak on device removal event
IB/device: Convert ib-comp-wq to be CPU-bound
IB/cq: Don't process more than the given budget
IB/rxe: increment msn only when completing a request
uapi: fix rdma/mlx5-abi.h userspace compilation errors
IB/core: Restore I/O MMU, s390 and powerpc support
IB/rxe: Update documentation link
RDMA/ocrdma: fix a type issue in ocrdma_put_pd_num()
IB/rxe: double free on error
RDMA/vmw_pvrdma: Activate device on ethernet link up
RDMA/vmw_pvrdma: Dont hardcode QP header page
RDMA/vmw_pvrdma: Cleanup unused variables
infiniband: Fix alignment of mmap cookies to support VIPT caching
IB/core: Protect against self-requeue of a cq work item
i40iw: Receive netdev events post INET_NOTIFIER state
|