summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2018-08-08cfq: Annotate fall-through in a switch statementBart Van Assche1-0/+1
This patch avoids that gcc complains about fall-through when building with W=1. Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-07blk-wbt: Avoid lock contention and thundering herd issue in wbt_waitAnchal Agarwal1-31/+24
I am currently running a large bare metal instance (i3.metal) on EC2 with 72 cores, 512GB of RAM and NVME drives, with a 4.18 kernel. I have a workload that simulates a database workload and I am running into lockup issues when writeback throttling is enabled,with the hung task detector also kicking in. Crash dumps show that most CPUs (up to 50 of them) are all trying to get the wbt wait queue lock while trying to add themselves to it in __wbt_wait (see stack traces below). [ 0.948118] CPU: 45 PID: 0 Comm: swapper/45 Not tainted 4.14.51-62.38.amzn1.x86_64 #1 [ 0.948119] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017 [ 0.948120] task: ffff883f7878c000 task.stack: ffffc9000c69c000 [ 0.948124] RIP: 0010:native_queued_spin_lock_slowpath+0xf8/0x1a0 [ 0.948125] RSP: 0018:ffff883f7fcc3dc8 EFLAGS: 00000046 [ 0.948126] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7fce2a00 [ 0.948128] RDX: 000000000000001c RSI: 0000000000740001 RDI: ffff887f7709ca68 [ 0.948129] RBP: 0000000000000002 R08: 0000000000b80000 R09: 0000000000000000 [ 0.948130] R10: ffff883f7fcc3d78 R11: 000000000de27121 R12: 0000000000000002 [ 0.948131] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000 [ 0.948132] FS: 0000000000000000(0000) GS:ffff883f7fcc0000(0000) knlGS:0000000000000000 [ 0.948134] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 0.948135] CR2: 000000c424c77000 CR3: 0000000002010005 CR4: 00000000003606e0 [ 0.948136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 0.948137] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 0.948138] Call Trace: [ 0.948139] <IRQ> [ 0.948142] do_raw_spin_lock+0xad/0xc0 [ 0.948145] _raw_spin_lock_irqsave+0x44/0x4b [ 0.948149] ? __wake_up_common_lock+0x53/0x90 [ 0.948150] __wake_up_common_lock+0x53/0x90 [ 0.948155] wbt_done+0x7b/0xa0 [ 0.948158] blk_mq_free_request+0xb7/0x110 [ 0.948161] __blk_mq_complete_request+0xcb/0x140 [ 0.948166] nvme_process_cq+0xce/0x1a0 [nvme] [ 0.948169] nvme_irq+0x23/0x50 [nvme] [ 0.948173] __handle_irq_event_percpu+0x46/0x300 [ 0.948176] handle_irq_event_percpu+0x20/0x50 [ 0.948179] handle_irq_event+0x34/0x60 [ 0.948181] handle_edge_irq+0x77/0x190 [ 0.948185] handle_irq+0xaf/0x120 [ 0.948188] do_IRQ+0x53/0x110 [ 0.948191] common_interrupt+0x87/0x87 [ 0.948192] </IRQ> .... [ 0.311136] CPU: 4 PID: 9737 Comm: run_linux_amd64 Not tainted 4.14.51-62.38.amzn1.x86_64 #1 [ 0.311137] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017 [ 0.311138] task: ffff883f6e6a8000 task.stack: ffffc9000f1ec000 [ 0.311141] RIP: 0010:native_queued_spin_lock_slowpath+0xf5/0x1a0 [ 0.311142] RSP: 0018:ffffc9000f1efa28 EFLAGS: 00000046 [ 0.311144] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7f722a00 [ 0.311145] RDX: 0000000000000035 RSI: 0000000000d80001 RDI: ffff887f7709ca68 [ 0.311146] RBP: 0000000000000202 R08: 0000000000140000 R09: 0000000000000000 [ 0.311147] R10: ffffc9000f1ef9d8 R11: 000000001a249fa0 R12: ffff887f7709ca68 [ 0.311148] R13: ffffc9000f1efad0 R14: 0000000000000000 R15: ffff887f7709ca00 [ 0.311149] FS: 000000c423f30090(0000) GS:ffff883f7f700000(0000) knlGS:0000000000000000 [ 0.311150] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 0.311151] CR2: 00007feefcea4000 CR3: 0000007f7016e001 CR4: 00000000003606e0 [ 0.311152] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 0.311153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 0.311154] Call Trace: [ 0.311157] do_raw_spin_lock+0xad/0xc0 [ 0.311160] _raw_spin_lock_irqsave+0x44/0x4b [ 0.311162] ? prepare_to_wait_exclusive+0x28/0xb0 [ 0.311164] prepare_to_wait_exclusive+0x28/0xb0 [ 0.311167] wbt_wait+0x127/0x330 [ 0.311169] ? finish_wait+0x80/0x80 [ 0.311172] ? generic_make_request+0xda/0x3b0 [ 0.311174] blk_mq_make_request+0xd6/0x7b0 [ 0.311176] ? blk_queue_enter+0x24/0x260 [ 0.311178] ? generic_make_request+0xda/0x3b0 [ 0.311181] generic_make_request+0x10c/0x3b0 [ 0.311183] ? submit_bio+0x5c/0x110 [ 0.311185] submit_bio+0x5c/0x110 [ 0.311197] ? __ext4_journal_stop+0x36/0xa0 [ext4] [ 0.311210] ext4_io_submit+0x48/0x60 [ext4] [ 0.311222] ext4_writepages+0x810/0x11f0 [ext4] [ 0.311229] ? do_writepages+0x3c/0xd0 [ 0.311239] ? ext4_mark_inode_dirty+0x260/0x260 [ext4] [ 0.311240] do_writepages+0x3c/0xd0 [ 0.311243] ? _raw_spin_unlock+0x24/0x30 [ 0.311245] ? wbc_attach_and_unlock_inode+0x165/0x280 [ 0.311248] ? __filemap_fdatawrite_range+0xa3/0xe0 [ 0.311250] __filemap_fdatawrite_range+0xa3/0xe0 [ 0.311253] file_write_and_wait_range+0x34/0x90 [ 0.311264] ext4_sync_file+0x151/0x500 [ext4] [ 0.311267] do_fsync+0x38/0x60 [ 0.311270] SyS_fsync+0xc/0x10 [ 0.311272] do_syscall_64+0x6f/0x170 [ 0.311274] entry_SYSCALL_64_after_hwframe+0x42/0xb7 In the original patch, wbt_done is waking up all the exclusive processes in the wait queue, which can cause a thundering herd if there is a large number of writer threads in the queue. The original intention of the code seems to be to wake up one thread only however, it uses wake_up_all() in __wbt_done(), and then uses the following check in __wbt_wait to have only one thread actually get out of the wait loop: if (waitqueue_active(&rqw->wait) && rqw->wait.head.next != &wait->entry) return false; The problem with this is that the wait entry in wbt_wait is define with DEFINE_WAIT, which uses the autoremove wakeup function. That means that the above check is invalid - the wait entry will have been removed from the queue already by the time we hit the check in the loop. Secondly, auto-removing the wait entries also means that the wait queue essentially gets reordered "randomly" (e.g. threads re-add themselves in the order they got to run after being woken up). Additionally, new requests entering wbt_wait might overtake requests that were queued earlier, because the wait queue will be (temporarily) empty after the wake_up_all, so the waitqueue_active check will not stop them. This can cause certain threads to starve under high load. The fix is to leave the woken up requests in the queue and remove them in finish_wait() once the current thread breaks out of the wait loop in __wbt_wait. This will ensure new requests always end up at the back of the queue, and they won't overtake requests that are already in the wait queue. With that change, the loop in wbt_wait is also in line with many other wait loops in the kernel. Waking up just one thread drastically reduces lock contention, as does moving the wait queue add/remove out of the loop. A significant drop in lockdep's lock contention numbers is seen when running the test application on the patched kernel. Signed-off-by: Anchal Agarwal <anchalag@amazon.com> Signed-off-by: Frank van der Linden <fllinden@amazon.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-07target/loop: depend on SCSIChristoph Hellwig1-0/+1
The target loopback driver is a low-level driver for the SCSI subsystem, and as such needs to depend on it. Fixes: 8a39a047 ("target: don't depend on SCSI") Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-06xen-blkfront: use true and false for boolean valuesGustavo A. R. Silva1-2/+2
Return statements in functions returning bool should use true or false instead of an integer value. This code was detected with the help of Coccinelle. Acked-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-06lightnvm: remove minor version check for 2.0Matias Bjørling1-6/+0
A minor version number increase should not break backwards compatibility. Fixes: 3cb98f84d368b ("lightnvm: add minor version to generic geometry") Reviewed-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-06Merge branch 'nvme-4.19' of git://git.infradead.org/nvme into for-4.19/block2Jens Axboe9-74/+880
Pull NVMe changes from Christoph: "This contains the support for TP4004, Asymmetric Namespace Access, which makes NVMe multipathing usable in practice." * 'nvme-4.19' of git://git.infradead.org/nvme: nvmet: use Retain Async Event bit to clear AEN nvmet: support configuring ANA groups nvmet: add minimal ANA support nvmet: track and limit the number of namespaces per subsystem nvmet: keep a port pointer in nvmet_ctrl nvme: add ANA support nvme: remove nvme_req_needs_failover nvme: simplify the API for getting log pages nvme.h: add ANA definitions nvme.h: add support for the log specific field Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-06Merge tag 'v4.18-rc6' into for-4.19/block2Jens Axboe487-2406/+4164
Pull in 4.18-rc6 to get the NVMe core AEN change to avoid a merge conflict down the line. Signed-of-by: Jens Axboe <axboe@kernel.dk>
2018-08-03scsi: Check sense buffer size at build timeKees Cook3-8/+18
To avoid introducing problems like those fixed in commit f7068114d45e ("sr: pass down correctly sized SCSI sense buffer"), this creates a macro wrapper for scsi_execute() that verifies the size of the sense buffer similar to what was done for command string sizes in commit 3756f6401c30 ("exec: avoid gcc-8 warning for get_task_comm"). Another solution could be to add a length argument to scsi_execute(), but this function already takes a lot of arguments and Jens was not fond of that approach. Additionally, this moves the SCSI_SENSE_BUFFERSIZE definition into scsi_device.h, and removes a redundant include for scsi_device.h from scsi_cmnd.h. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-03libata-scsi: Move sense buffers onto stackKees Cook1-12/+6
To support future compile-time sizeof() checks that will be able to validate the length of sense buffers, this removes the only dynamically allocated sense buffers in the tree by putting the 96 byte sense buffers on the stack. Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-03cdrom: Use struct scsi_sense_hdr internallyKees Cook2-3/+7
This removes more casts of struct request_sense and uses the standard struct scsi_sense_hdr instead. This also fixes any possible stale values since the prior code did not check the sense length. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-03ide-cd: Remove redundant sense bufferKees Cook1-8/+9
This is already able to process the sense buffer, so remove the redundant parsing during the failure path. This also fixes any possible stale values since the prior code did not check the sense length. Acked-by: David S. Miller <davem@davemloft.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-03block: Switch struct packet_command to use struct scsi_sense_hdrKees Cook7-65/+63
There is a lot of needless struct request_sense usage in the CDROM code. These can all be struct scsi_sense_hdr instead, to avoid any confusion over their respective structure sizes. This patch is a lot of noise changing "sense" to "sshdr", but the final code is more readable to distinguish between "sense" meaning "struct request_sense" and "sshdr" meaning "struct scsi_sense_hdr". Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-03target: don't depend on SCSIChristoph Hellwig1-2/+3
The core target code only needs code from scsi_common.c, which is now separately selectable. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-03scsi: build scsi_common.o for all scsi passthrough request usersChristoph Hellwig2-2/+2
Split scsi_common.o out of SCSI so that non-SCSI users can pull it in easily for future sense buffer helper usage. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-03scsi: cxlflash: Drop unused sense buffersKees Cook2-11/+4
This removes the unused sense buffer in read_cap16() and write_same16(). Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Matthew R. Ochs <mrochs@linux.vnet.ibm.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-03ide-cd: Drop unused sense buffersKees Cook3-44/+28
This drops unused sense buffers from: cdrom_eject() cdrom_read_capacity() cdrom_read_tocentry() ide_cd_lockdoor() ide_cd_read_toc() Acked-by: David S. Miller <davem@davemloft.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-02blk-mq: fix updating tags depthMing Lei1-4/+4
The passed 'nr' from userspace represents the total depth, meantime inside 'struct blk_mq_tags', 'nr_tags' stores the total tag depth, and 'nr_reserved_tags' stores the reserved part. There are two issues in blk_mq_tag_update_depth() now: 1) for growing tags, we should have used the passed 'nr', and keep the number of reserved tags not changed. 2) the passed 'nr' should have been used for checking against 'tags->nr_tags', instead of number of the normal part. This patch fixes the above two cases, and avoids kernel crash caused by wrong resizing sbitmap queue. Cc: "Ewan D. Milne" <emilne@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Omar Sandoval <osandov@fb.com> Tested by: Marco Patalano <mpatalan@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-02block: really disable runtime-pm for blk-mqMing Lei1-2/+4
Runtime PM isn't ready for blk-mq yet, and commit 765e40b675a9 ("block: disable runtime-pm for blk-mq") tried to disable it. Unfortunately, it can't take effect in that way since user space still can switch it on via 'echo auto > /sys/block/sdN/device/power/control'. This patch disables runtime-pm for blk-mq really by pm_runtime_disable() and fixes all kinds of PM related kernel crash. Cc: Tomas Janousek <tomi@nomi.cz> Cc: Przemek Socha <soprwa@gmail.com> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: <stable@vger.kernel.org> Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-02aoe: mark expected switch fall-throughGustavo A. R. Silva1-0/+1
In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. Addresses-Coverity-ID: 114722 ("Missing break in switch") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-02block: make iolatency avg_lat exponentially decayDennis Zhou (Facebook)2-24/+57
Currently, avg_lat is calculated by accumulating the mean of every window in a long running cumulative average. As time goes on, the metric becomes less and less useful due to the accumulated history. This patch reuses the same calculation done in load averages to make the avg_lat metric more lively. Unlike load averages, the avg only advances when a window elapses (due to an io). Idle periods extend the most recent window. Bucketing is used to limit the history of avg_lat by binding it to the window size. So, the window range for 1/exp (decay rate) is [1 min, 2.5 min) when windows elapse immediately. The current sample window size is exposed in the debug info to enable calculation of the window range. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-01blk-cgroup: clear the throttle queue on forkJosef Bacik1-0/+5
We were hitting a panic in production where we put too many times on the request queue. This is because we'd get the throttle_queue of the parent if we fork()'ed while we needed to be throttled, but we didn't have a reference on it. Instead just clear these flags on fork so the child doesn't pay for the sins of its father. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-01blk-cgroup: hold the queue ref during throttlingJosef Bacik1-1/+1
The blkg lifetime is protected by the queue lifetime, so we need to put the queue _after_ we're done using the blkg. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-01blk-iolatency: fix blkg leak in timer_fnJosef Bacik1-1/+1
At this point we have a ref on the blkg, we need to drop it if we don't have a iolat. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-01block/bsg-lib: use PTR_ERR_OR_ZERO to simplify the flow pathzhong jiang1-3/+2
Simplify the code by using the PTR_ERR_OR_ZERO, instead of the open code. It is better. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: zhong jiang <zhongjiang@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-31t10-pi: provide empty t10_pi_complete() for !CONFIG_BLK_DEV_INTEGRITYJens Axboe1-0/+11
Fixes a link failure whtn BLK_DEV_INTEGRITY isn't defined. Fixes: 10c41ddd6132 ("block: move dif_prepare/dif_complete functions to block layer") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-30block: blk_init_allocated_queue() set q->fq as NULL in the fail casexiao jin1-0/+1
We find the memory use-after-free issue in __blk_drain_queue() on the kernel 4.14. After read the latest kernel 4.18-rc6 we think it has the same problem. Memory is allocated for q->fq in the blk_init_allocated_queue(). If the elevator init function called with error return, it will run into the fail case to free the q->fq. Then the __blk_drain_queue() uses the same memory after the free of the q->fq, it will lead to the unpredictable event. The patch is to set q->fq as NULL in the fail case of blk_init_allocated_queue(). Fixes: commit 7c94e1c157a2 ("block: introduce blk_flush_queue to drive flush machinery") Cc: <stable@vger.kernel.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: xiao jin <jin.xiao@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-30nvme: use blk API to remap ref tags for IOs with metadataMax Gurtovoy3-82/+20
Also moved the logic of the remapping to the nvme core driver instead of implementing it in the nvme pci driver. This way all the other nvme transport drivers will benefit from it (in case they'll implement metadata support). Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-30block: move dif_prepare/dif_complete functions to block layerMax Gurtovoy5-125/+118
Currently these functions are implemented in the scsi layer, but their actual place should be the block layer since T10-PI is a general data integrity feature that is used in the nvme protocol as well. Also, use the tuple size from the integrity profile since it may vary between integrity types. Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-30block: move ref_tag calculation func to the block layerMax Gurtovoy6-12/+16
Currently this function is implemented in the scsi layer, but it's actual place should be the block layer since T10-PI is a general data integrity feature that is used in the nvme protocol as well. Suggested-by: Christoph Hellwig <hch@lst.de> Cc: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-30block: don't account for split bio's size in cgroup statsJosef Bacik1-2/+8
We need to check in blkcg_bio_issue_check if the bio is flagged as QUEUE_ENTERED, because if it is then we've already accounted for the size of the IO in the cgroup stats. We can still however account for the extra IO since it'll be another request. Reported-by: Tejun Heo <tj@kernel.org> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-28pktcdvd: Fix possible Spectre-v1 for pkt_devsJinbum Park1-1/+3
User controls @dev_minor which to be used as index of pkt_devs. So, It can be exploited via Spectre-like attack. (speculative execution) This kind of attack leaks address of pkt_devs, [1] It leads an attacker to bypass security mechanism such as KASLR. So sanitize @dev_minor before using it to prevent attack. [1] https://github.com/jinb-park/linux-exploit/ tree/master/exploit-remaining-spectre-gadget/leak_pkt_devs.c Signed-off-by: Jinbum Park <jinb.park7@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-27nvmet: use Retain Async Event bit to clear AENChaitanya Kulkarni1-2/+15
In the current implementation, we clear the AEN bit when we get the "get log page" command if given log page is associated with AEN. This patch allows optionally retaining the AEN for the ctrl under consideration when Retain Asynchronous Event (RAE) bit is set as a part of "get log page" command. This allows the host to read the Log page and optionally retaining the AEN associated with this log page when using userspace tools like nvme-cli. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> [hch: also use the new helper in the just merged ANA code] Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-27nvmet: support configuring ANA groupsChristoph Hellwig4-4/+236
Allow creating non-default ANA groups (group ID > 1). Groups are created either by assigning the group ID to a namespace, or by creating a configfs group object under a specific port. All namespaces assigned to a group that doesn't have a configfs object for a given port are marked as inaccessible. Allow changing the ANA state on a per-port basis by creating an ana_groups directory under each port, and another directory with an ana_state file in it. The default ANA group 1 directory is created automatically for each port. For all changes in ANA configuration the ANA change AEN is sent. We only keep a global changecount instead of additional per-group changecounts to keep the implementation as simple as possible. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
2018-07-27nvmet: add minimal ANA supportChristoph Hellwig4-4/+142
Add support for Asynchronous Namespace Access as specified in NVMe 1.3 TP 4004. Just add a default ANA group 1 that is optimized on all ports. This is (and will remain) the default assignment for any namespace not epxlicitly assigned to another ANA group. The ANA state can be manually changed through the configfs interface, including the change state. Includes fixes and improvements from Hannes Reinecke. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
2018-07-27nvmet: track and limit the number of namespaces per subsystemChristoph Hellwig3-1/+16
TP 4004 introduces a new 'Maximum Number of Allocated Namespaces' field in the Identify controller data to help the host size resources. Put an upper limit on the supported namespaces to be able to support this value as supporting 32-bits worth of namespaces would lead to very large buffers. The limit is completely arbitrary at this point. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
2018-07-27nvmet: keep a port pointer in nvmet_ctrlChristoph Hellwig2-0/+4
This will be needed for the ANA AEN code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
2018-07-27nvme: add ANA supportChristoph Hellwig3-27/+408
Add support for Asynchronous Namespace Access as specified in NVMe 1.3 TP 4004. With ANA each namespace attached to a controller belongs to an ANA group that describes the characteristics of accessing the namespaces through this controller. In the optimized and non-optimized states namespaces can be accessed regularly, although in a multi-pathing environment we should always prefer to access a namespace through a controller where an optimized relationship exists. Namespaces in Inaccessible, Permanent-Loss or Change state for a given controller should not be accessed. The states are updated through reading the ANA log page, which is read once during controller initialization, whenever the ANA change notice AEN is received, or when one of the ANA specific status codes that signal a state change is received on a command. The ANA state is kept in the nvme_ns structure, which makes the checks in the fast path very simple. Updating the ANA state when reading the log page is also very simple, the only downside is that finding the initial ANA state when scanning for namespaces is a bit cumbersome. The gendisk for a ns_head is only registered once a live path for it exists. Without that the kernel would hang during partition scanning. Includes fixes and improvements from Hannes Reinecke. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
2018-07-27nvme: remove nvme_req_needs_failoverChristoph Hellwig3-14/+2
Now that we just call out to blk_path_error there isn't really any good reason to not merge it into the only caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
2018-07-27nvme: simplify the API for getting log pagesChristoph Hellwig3-25/+16
Merge nvme_get_log and nvme_get_log_ext into a single helper, which takes a plain nsid instead of the nvme_ns pointer. Also add support for the log specific field while we're at it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
2018-07-27nvme.h: add ANA definitionsChristoph Hellwig1-3/+47
Add various defintions from NVMe 1.3 TP 4004. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
2018-07-27nvme.h: add support for the log specific fieldChristoph Hellwig1-1/+1
NVMe 1.3 added a new log specific field to the get log page CQ defintion, add it to our get_log_page SQ structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
2018-07-27partitions/aix: append null character to print data from diskMauricio Faria de Oliveira1-2/+6
Even if properly initialized, the lvname array (i.e., strings) is read from disk, and might contain corrupt data (e.g., lack the null terminating character for strings). So, make sure the partition name string used in pr_warn() has the null terminating character. Fixes: 6ceea22bbbc8 ("partitions: add aix lvm partition support files") Suggested-by: Daniel J. Axtens <daniel.axtens@canonical.com> Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-27partitions/aix: fix usage of uninitialized lv_info and lvname structuresMauricio Faria de Oliveira1-2/+3
The if-block that sets a successful return value in aix_partition() uses 'lvip[].pps_per_lv' and 'n[].name' potentially uninitialized. For example, if 'numlvs' is zero or alloc_lvn() fails, neither is initialized, but are used anyway if alloc_pvd() succeeds after it. So, make the alloc_pvd() call conditional on their initialization. This has been hit when attaching an apparently corrupted/stressed AIX LUN, misleading the kernel to pr_warn() invalid data and hang. [...] partition (null) (11 pp's found) is not contiguous [...] partition (null) (2 pp's found) is not contiguous [...] partition (null) (3 pp's found) is not contiguous [...] partition (null) (64 pp's found) is not contiguous Fixes: 6ceea22bbbc8 ("partitions: add aix lvm partition support files") Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-27bcache: stop using the deprecated get_seconds()Arnd Bergmann2-8/+8
The get_seconds function is deprecated now since it returns a 32-bit value that will eventually overflow, and we are replacing it throughout the kernel with ktime_get_seconds() or ktime_get_real_seconds() that return a time64_t. bcache uses get_seconds() to read the current system time and store it in the superblock as well as in uuid_entry structures that are user visible. Unfortunately, the two structures in are still limited to 32 bits, so this won't fix any real problems but will still overflow in year 2106. Let's at least document that properly, in case we get an updated format in the future it can be fixed. We still have a long time before the overflow and checking the tools at https://github.com/koverstreet/bcache-tools reveals no access to any of them. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-27bcache: do not assign in if condition in bcache_device_init()Florian Schmaus1-5/+11
Fixes an error condition reported by checkpatch.pl which is caused by assigning a variable in an if condition. Signed-off-by: Florian Schmaus <flo@geekplace.eu> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-27bcache: do not assign in if condition in bcache_init()Florian Schmaus1-3/+9
Fixes an error condition reported by checkpatch.pl which is caused by assigning a variable in an if condition. Signed-off-by: Florian Schmaus <flo@geekplace.eu> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-27bcache: free heap cache_set->flush_btree in bch_journal_freeShenghui Wang1-0/+1
Free the cache_set->flush_bree heap memory on journal free. Signed-off-by: Wang Sheng-Hui <shhuiw@foxmail.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-27bcache: do not assign in if condition register_bcache()Florian Schmaus1-2/+6
Fixes an error condition reported by checkpatch.pl which is caused by assigning a variable in an if condition. Signed-off-by: Florian Schmaus <flo@geekplace.eu> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-27bcache: fix I/O significant decline while backend devices registeringTang Junhui1-3/+26
I attached several backend devices in the same cache set, and produced lots of dirty data by running small rand I/O writes in a long time, then I continue run I/O in the others cached devices, and stopped a cached device, after a mean while, I register the stopped device again, I see the running I/O in the others cached devices dropped significantly, sometimes even jumps to zero. In currently code, bcache would traverse each keys and btree node to count the dirty data under read locker, and the writes threads can not get the btree write locker, and when there is a lot of keys and btree node in the registering device, it would last several seconds, so the write I/Os in others cached device are blocked and declined significantly. In this patch, when a device registering to a ache set, which exist others cached devices with running I/Os, we get the amount of dirty data of the device in an incremental way, and do not block other cached devices all the time. Patch v2: Rename some variables and macros name as Coly suggested. Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-27bcache: calculate the number of incremental GC nodes according to the total ↵Tang Junhui1-2/+35
of btree nodes This patch base on "[PATCH] bcache: finish incremental GC". Since incremental GC would stop 100ms when front side I/O comes, so when there are many btree nodes, if GC only processes constant (100) nodes each time, GC would last a long time, and the front I/Os would run out of the buckets (since no new bucket can be allocated during GC), and I/Os be blocked again. So GC should not process constant nodes, but varied nodes according to the number of btree nodes. In this patch, GC is divided into constant (100) times, so when there are many btree nodes, GC can process more nodes each time, otherwise GC will process less nodes each time (but no less than MIN_GC_NODES). Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>