1
0
Fork 0
mirror of synced 2025-03-06 20:59:54 +01:00
Commit graph

751 commits

Author SHA1 Message Date
Ruozhu Li
8b77fa6fdc nvme: fix use after free when disconnecting a reconnecting ctrl
A crash happens when trying to disconnect a reconnecting ctrl:

 1) The network was cut off when the connection was just established,
    scan work hang there waiting for some IOs complete.  Those I/Os were
    retried because we return BLK_STS_RESOURCE to blk in reconnecting.
 2) After a while, I tried to disconnect this connection.  This
    procedure also hangs because it tried to obtain ctrl->scan_lock.
    It should be noted that now we have switched the controller state
    to NVME_CTRL_DELETING.
 3) In nvme_check_ready(), we always return true when ctrl->state is
    NVME_CTRL_DELETING, so those retrying I/Os were issued to the bottom
    device which was already freed.

To fix this, when ctrl->state is NVME_CTRL_DELETING, issue cmd to bottom
device only when queue state is live.  If not, return host path error to
the block layer

Signed-off-by: Ruozhu Li <liruozhu@huawei.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-12-07 18:21:16 +01:00
Keith Busch
d39ad2a45c nvme: disable namespace access for unsupported metadata
The only fabrics target that supports metadata handling through the
separate integrity buffer is RDMA. It is currently usable only if the
size is 8B per block and formatted for protection information. If an
rdma target were to export a namespace with a different format (ex:
4k+64B), the driver will not be able to submit valid read/write commands
for that namespace.

Suppress setting the metadata feature in the namespace so that the
gendisk capacity will be set to 0. This will prevent read/write access
through the block stack, but will continue to allow ioctl passthrough
commands.

Cc: Max Gurtovoy <mgurtovoy@nvidia.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-12-06 08:52:08 +01:00
Keith Busch
16cc33b237 nvme: show subsys nqn for duplicate cntlids
The driver assigned nvme handle isn't persistent across reboots, so is
not enough information to match up where the collisions are occuring.
Add the subsys nqn string to the output so that it can more easily be
identified later.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=215099
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-12-06 08:52:08 +01:00
Klaus Jensen
00b33cf3da nvme: fix write zeroes pi
Write Zeroes sets PRACT when block integrity is enabled (as it should),
but neglects to also set the reftag which is expected by reads. This
causes protection errors on reads.

Fix this by setting the reftag for type 1 and 2 (for type 3, reads will
not check the reftag).

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-11-23 17:22:41 +01:00
Enzo Matsumiya
5a6254d55e nvme-pci: add NO APST quirk for Kioxia device
This particular Kioxia device times out and aborts I/O during any load,
but it's more easily observable with discards (fstrim).

The device gets to a state that is also not possible to use
"nvme set-feature" to disable APST.
Booting with nvme_core.default_ps_max_latency=0 solves the issue.

We had a dozen or so of these devices behaving this same way in
customer environments.

Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-11-23 17:22:41 +01:00
Linus Torvalds
3e28850cbd for-5.16/block-2021-11-09
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmGKqAcQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpojrD/4yA+GgV+jWeIepYWvU81TQFpt9AJmzWbrY
 uryj4dy7EdMjun+JkAP8k4qreqvTZRsJMkr9dhmS4qaM8/Vt8K/RU/0n/lxNVmqc
 1//ZaTS6DURVAc52GHIXD3q4cv8pHofTZZlrj1Hgz35shlOayStGJtktH5f8uQl4
 5Yxjh+HKr15Chym+fKlbR6T7BgVxxNyhT9q89BgUwMAJX+1KRVtwtkyVK5IbObFy
 zOeiC+n9niQ6iJHcLoqb7LjfBOs/VjdNOQYGSCAnrBxuQ8GnEP2xDw2nvFlOPE12
 5tWEwTgAX7381ilbL6VvNTlTafIs/Axt8mI0cY/OMW7ApiHwO3rXjQSqA4yrnKCJ
 h6M1QavqThd2DtMnOi0U5wwgtD2UjS+CMpK5XFxeIyl6GqTgZcaWm3VqRnG68KZD
 r5+o99GKWCHy0cckxq2WiWJouReeNZ9u9R6HNDw0Vb8UNyWgBR+v2MkX+SHS/c85
 2gXm10hwBH7BFnC4X8ceiuT/bm7xm9S6D/3LCVitlUTBRfqobsQEQjSciPeoOtL0
 rRSTKob7jtokiB2q01wx3q1jnUMpxE1fqJkpLjUvebTzw+a+xfPwy0nNTGq0XXIv
 WMVRRpSWCZm04Ru0q/K8cj0GOyur5x+ilefZ1V+/sRU5dVmGuJgbJUxei1HPC6eV
 z9Rn0aFv4g==
 =1GPi
 -----END PGP SIGNATURE-----

Merge tag 'for-5.16/block-2021-11-09' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - Set of fixes for the batched tag allocation (Ming, me)

 - add_disk() error handling fix (Luis)

 - Nested queue quiesce fixes (Ming)

 - Shared tags init error handling fix (Ye)

 - Misc cleanups (Jean, Ming, me)

* tag 'for-5.16/block-2021-11-09' of git://git.kernel.dk/linux-block:
  nvme: wait until quiesce is done
  scsi: make sure that request queue queiesce and unquiesce balanced
  scsi: avoid to quiesce sdev->request_queue two times
  blk-mq: add one API for waiting until quiesce is done
  blk-mq: don't free tags if the tag_set is used by other device in queue initialztion
  block: fix device_add_disk() kobject_create_and_add() error handling
  block: ensure cached plug request matches the current queue
  block: move queue enter logic into blk_mq_submit_bio()
  block: make bio_queue_enter() fast-path available inline
  block: split request allocation components into helpers
  block: have plug stored requests hold references to the queue
  blk-mq: update hctx->nr_active in blk_mq_end_request_batch()
  blk-mq: add RQF_ELV debug entry
  blk-mq: only try to run plug merge if request has same queue with incoming bio
  block: move RQF_ELV setting into allocators
  dm: don't stop request queue after the dm device is suspended
  block: replace always false argument with 'false'
  block: assign correct tag before doing prefetch of request
  blk-mq: fix redundant check of !e expression
2021-11-09 11:20:07 -08:00
Ming Lei
26af1cd003 nvme: wait until quiesce is done
NVMe uses one atomic flag to check if quiesce is needed. If quiesce is
started, the helper returns immediately. This way is wrong, since we
have to wait until quiesce is done.

Fixes: e70feb8b3e ("blk-mq: support concurrent queue quiesce/unquiesce")
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20211109071144.181581-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-09 08:14:27 -07:00
Linus Torvalds
643a7234e0 for-5.16/drivers-2021-10-29
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmF8KFsQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgph1ZEACwNuHkAZcIgNzKhzuLP9OjMhv9vV+q254G
 /EcM31e+qgRioMd0ihbVsgW76jOwLEmb3ldKGcN+0Wo5+Sv9Im8+wAWYY1REOZO5
 ZTUBfAzhEh63/EtqTFiU8U+7dmXqy4z7NaICnhlynjwkd3IT+I561os6kcqwJMMr
 G+Q1Cnk9rgCMIoLOCoVThIpjmjyZzF33qJb2VEIkHfkot62iNdpABWaSASF+CCba
 z8LfbvLAYz3YLl4thXlLJFU282T5y7gzgSomGvX4F0rMJSbqFbgoNEPxaYw9CvzC
 uC6MnYCYdCdvVkWVm1b8I8LYzPd5GrpVOSh3JQGvuA4Ppv2IyJCDSruYGgVUlhao
 cVPzuHCqNCfKk0ykYVRZy9oKiBk5wmFeKM/lSHu408y8VNraPNIAEpB6sA9qGr22
 AYr8lNh3JDr0g8dtFsDOq+7u3MANW0KQozfzwTPZo6NjzEE1D2jIg39Ljiijo9+Y
 3pU8pitIAhsKd2KhW1H6LmtJbF4dX756VKYDXOhzgORU0NZYgvGhBIj9tAdpQR0S
 xeae5Kj0/wBGcqR/owf/n1EY/q7rWgNDETnsBhbmzMZyhwH3L6zhT+bfD8YoQCHY
 ueyqhyIUe4YBxTrIpICqwDlqaMYAmQ0jRaci+bK9ovVlQ89FQ9o/BE2COPlI/DGX
 w+rUmmoX4g==
 =HiWU
 -----END PGP SIGNATURE-----

Merge tag 'for-5.16/drivers-2021-10-29' of git://git.kernel.dk/linux-block

Pull block driver updates from Jens Axboe:

 - paride driver cleanups (Christoph)

 - Remove cryptoloop support (Christoph)

 - null_blk poll support (me)

 - Now that add_disk() supports proper error handling, add it to various
   drivers (Luis)

 - Make ataflop actually work again (Michael)

 - s390 dasd fixes (Stefan, Heiko)

 - nbd fixes (Yu, Ye)

 - Remove redundant wq flush in mtip32xx (Christophe)

 - NVMe updates
      - fix a multipath partition scanning deadlock (Hannes Reinecke)
      - generate uevent once a multipath namespace is operational again
        (Hannes Reinecke)
      - support unique discovery controller NQNs (Hannes Reinecke)
      - fix use-after-free when a port is removed (Israel Rukshin)
      - clear shadow doorbell memory on resets (Keith Busch)
      - use struct_size (Len Baker)
      - add error handling support for add_disk (Luis Chamberlain)
      - limit the maximal queue size for RDMA controllers (Max Gurtovoy)
      - use a few more symbolic names (Max Gurtovoy)
      - fix error code in nvme_rdma_setup_ctrl (Max Gurtovoy)
      - add support for ->map_queues on FC (Saurav Kashyap)
      - support the current discovery subsystem entry (Hannes Reinecke)
      - use flex_array_size and struct_size (Len Baker)

 - bcache fixes (Christoph, Coly, Chao, Lin, Qing)

 - MD updates (Christoph, Guoqing, Xiao)

 - Misc fixes (Dan, Ding, Jiapeng, Shin'ichiro, Ye)

* tag 'for-5.16/drivers-2021-10-29' of git://git.kernel.dk/linux-block: (117 commits)
  null_blk: Fix handling of submit_queues and poll_queues attributes
  block: ataflop: Fix warning comparing pointer to 0
  bcache: replace snprintf in show functions with sysfs_emit
  bcache: move uapi header bcache.h to bcache code directory
  nvmet: use flex_array_size and struct_size
  nvmet: register discovery subsystem as 'current'
  nvmet: switch check for subsystem type
  nvme: add new discovery log page entry definitions
  block: ataflop: more blk-mq refactoring fixes
  block: remove support for cryptoloop and the xor transfer
  mtd: add add_disk() error handling
  rnbd: add error handling support for add_disk()
  um/drivers/ubd_kern: add error handling support for add_disk()
  m68k/emu/nfblock: add error handling support for add_disk()
  xen-blkfront: add error handling support for add_disk()
  bcache: add error handling support for add_disk()
  dm: add add_disk() error handling
  block: aoe: fixup coccinelle warnings
  nvmet: use struct_size over open coded arithmetic
  nvme: drop scan_lock and always kick requeue list when removing namespaces
  ...
2021-11-01 09:27:38 -07:00
Hannes Reinecke
e5ea42faa7 nvme: display correct subsystem NQN
With discovery controllers supporting unique subsystem NQNs the
actual subsystem NQN might be different from that one passed in
via the connect args. So add a helper to display the resulting
subsystem NQN.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20 19:16:02 +02:00
Hannes Reinecke
20e8b689c9 nvme: Add connect option 'discovery'
Add a connect option 'discovery' to specify that the connection
should be made to a discovery controller, not a normal I/O controller.
With discovery controllers supporting unique subsystem NQNs we
cannot easily distinguish by the subsystem NQN if this should be
a discovery connection, but we need this information to blank out
options not supported by discovery controllers.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20 19:16:02 +02:00
Hannes Reinecke
954ae16681 nvme: expose subsystem type in sysfs attribute 'subsystype'
With unique discovery controller NQNs we cannot distinguish the
subsystem type by the NQN alone, but need to check the subsystem
type, too.
So expose the subsystem type in a new sysfs attribute 'subsystype'.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20 19:16:02 +02:00
Ming Lei
9e6a6b1212 nvme: paring quiesce/unquiesce
The current blk_mq_quiesce_queue() and blk_mq_unquiesce_queue() always
stops and starts the queue unconditionally. And there can be concurrent
quiesce/unquiesce coming from different unrelated code paths, so
unquiesce may come unexpectedly and start queue too early.

Prepare for supporting concurrent quiesce/unquiesce from multiple
contexts, so that we can address the above issue.

NVMe has very complicated quiesce/unquiesce use pattern, add one atomic
bit for makeiing sure that blk-mq quiece/unquiesce is always called in
pair.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211014081710.1871747-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 18:27:58 -06:00
Ming Lei
ebc9b95260 nvme: prepare for pairing quiescing and unquiescing
Add two helpers so that we can prepare for pairing quiescing and
unquiescing which will be done in next patch.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211014081710.1871747-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 18:27:58 -06:00
Ming Lei
6ca1d9027e nvme: apply nvme API to quiesce/unquiesce admin queue
Apply the added two APIs to quiesce/unquiesce admin queue.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211014081710.1871747-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 18:27:58 -06:00
Ming Lei
a277654baf nvme: add APIs for stopping/starting admin queue
Add two APIs for stopping and starting admin queue.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211014081710.1871747-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 18:27:58 -06:00
Jens Axboe
a9a7e30fd9 nvme: don't memset() the normal read/write command
This memset in the fast path costs a lot of cycles on my setup. Here's a
top-of-profile of doing ~6.7M IOPS:

+    5.90%  io_uring  [nvme]            [k] nvme_queue_rq
+    5.32%  io_uring  [nvme_core]       [k] nvme_setup_cmd
+    5.17%  io_uring  [kernel.vmlinux]  [k] io_submit_sqes
+    4.97%  io_uring  [kernel.vmlinux]  [k] blkdev_direct_IO

and a perf diff with this patch:

     0.92%     +4.40%  [nvme_core]       [k] nvme_setup_cmd

reducing it from 5.3% to only 0.9%. This takes it from the 2nd most
cycle consumer to something that's mostly irrelevant.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 12:41:09 -06:00
Jens Axboe
9c3d29296f nvme: move command clear into the various setup helpers
We don't have to worry about doing extra memsets by moving it outside
the protection of RQF_DONTPREP, as nvme doesn't do partial completions.

This is in preparation for making the read/write fast path not do a full
memset of the command.

Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19 12:40:51 -06:00
Jens Axboe
c234a65392 nvme: add support for batched completion of polled IO
Take advantage of struct io_comp_batch, if passed in to the nvme poll
handler. If it's set, rather than complete each request individually
inline, store them in the io_comp_batch list. We only do so for requests
that will complete successfully, anything else will be completed inline as
before.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 14:40:45 -06:00
Christoph Hellwig
6ce913fe3e block: rename REQ_HIPRI to REQ_POLLED
Unlike the RWF_HIPRI userspace ABI which is intentionally kept vague,
the bio flag is specific to the polling implementation, so rename and
document it properly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:36 -06:00
Christoph Hellwig
fe45e630a1 block: move integrity handling out of <linux/blkdev.h>
Split the integrity/metadata handling definitions out into a new header.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-17-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18 06:17:02 -06:00
Jens Axboe
baa0ab2ba2 nvme fixes for Linux 5.15:
- fix the abort command id (Keith Busch)
  - nvme: fix per-namespace chardev deletion (Adam Manzanares)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmFoKrwLHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYOCGw//fXeDv+hBmuhfmMX5ac889bAT9iWQcg86jb/2UBEF
 KmtFJI6vQO7J4amCImSEoYD55WZ85ZZo+i+Q61abLep493MXINnrm/ZkNtVp0kbT
 h+pokfxbwo4IKPOCmWqVhXxspaIQL++J6TDMvAzuv4S4YAxeD/W+4bXshMu8E4aD
 XZZo2uj1EVYDeHY1rasCElMAw4DZDdaIL7Zn11qqQJDWK+XwP3YJ773/rXoG3i/j
 Kq0jtB/wWRc3VrherMXtEOz1Eu1MVOhUhr6YF2vitv+qBzmJKMResO1bVP619vqQ
 MoEGZmqUfUxLY4CcWbtV4iIIBK+H3ySaEoHeNLu31aiucRcAWCLIwHLvwrc2cSpP
 BozCuA03iRyGtcLDftfQfD3jyzeB8qmIrOaXd/blBlPAuQXzilF/WyB+IzRXorYs
 kxtwdWLTbh3nWMQrEG9gh6Fx02Hbf+dNdXb/GN0NFDJDD6BfPyAut38lMxdRKMAz
 9q4HfPTszyP78sU54N6qULVDVTJZ8F7sjY90OueZCBT3VjwsyCd7eO4sti1Avu4Y
 KmWxKfKQyGCmlls0oIrCFm7A0tjHFJhCdoNN0w3DmpMy6GN0cBE3aI3WAb1MAWan
 Qp6eI9DXuHwsLW8fqLv9+lWP3BI59qj2ClSy3OBGqAWiW1ifyq4cPUOR80vh22c2
 EHA=
 =eQZ/
 -----END PGP SIGNATURE-----

Merge tag 'nvme-5.15-2021-10-14' of git://git.infradead.org/nvme into block-5.15

Pull NVMe fixes from Christoph:

"nvme fixes for Linux 5.15:

 - fix the abort command id (Keith Busch)
 - nvme: fix per-namespace chardev deletion (Adam Manzanares)"

* tag 'nvme-5.15-2021-10-14' of git://git.infradead.org/nvme:
  nvme: fix per-namespace chardev deletion
  nvme-pci: Fix abort command id
2021-10-14 09:07:14 -06:00
Adam Manzanares
be5eb93354 nvme: fix per-namespace chardev deletion
Decrease reference count of chardevice during char device deletion in
order to fix a memory leak.  Add a release callabck for the device
associated chardev and move ida_simple_remove into the release function.

Fixes: 2637baed78 ("nvme: introduce generic per-namespace chardev")
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Suggested-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Adam Manzanares <a.manzanares@samsung.com>
Reviewed-by: Javier González <javier@javigon.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-14 08:07:47 +02:00
Keith Busch
a2941f6aa7 nvme: add command id quirk for apple controllers
Some apple controllers use the command id as an index to implementation
specific data structures and will fail if the value is out of bounds.
The nvme driver's recently introduced command sequence number breaks
this controller.

Provide a quirk so these spec incompliant controllers can function as
before. The driver will not have the ability to detect bad completions
when this quirk is used, but we weren't previously checking this anyway.

The quirk bit was selected so that it can readily apply to stable.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=214509
Cc: Sven Peter <sven@svenpeter.dev>
Reported-by: Orlando Chamberlain <redecorating@protonmail.com>
Reported-by: Aditya Garg <gargaditya08@live.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Sven Peter <sven@svenpeter.dev>
Link: https://lore.kernel.org/r/20210927154306.387437-1-kbusch@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-27 10:02:07 -06:00
Christoph Hellwig
298ba0e3d4 nvme: keep ctrl->namespaces ordered
Various places in the nvme code that rely on ctrl->namespace to be
ordered.  Ensure that the namespae is inserted into the list at the
right position from the start instead of sorting it after the fact.

Fixes: 540c801c65 ("NVMe: Implement namespace list scanning")
Reported-by: Anton Eidelman <anton.eidelman@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
2021-09-21 09:17:15 +02:00
Jens Axboe
65ed1e692f nvme fixes for Linux 5.15
- fix ANA state updates when a namespace is not present (Anton Eidelman)
  - nvmet: fix a width vs precision bug in nvmet_subsys_attr_serial_show
    (Dan Carpenter)
  - avoid race in shutdown namespace removal (Daniel Wagner)
  - fix io_work priority inversion in nvme-tcp (Keith Busch)
  - destroy cm id before destroy qp to avoid use after free (Ruozhu Li)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmFB+aQLHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYOfPg//fREqsvY1eMkexP/fOyrpVpLH1lLLM2SHUYwyQOtg
 4CVxl2KU8YGyGLGpAMO6ToctrNV83oUQUQlgy7PyQQpUzr6m4sJm7Lbxzvqhyh96
 MKeHRcWo173twiCLICx4GADSmlwofmwUSzBXdhEtt1gNmg26T+nnDewHpSqLY5Ax
 BNAtFkAGTZodupVKUNsI+tqAyXY3Bgk069YqDiZfBCqU62JzLj6QX0aVXcpS2kcv
 bVpJ0kbnuoUkHG/eIrLZeiyvikxQWx1KaLaL3Y24wrdkVWmZJpCYbg/gXq5v9GUQ
 U8yq/iuOSTtfE0+7hNCyNz8tNLfzxejd0jDCfQcPwXOuzIfNkdzdqhkhste2HYB5
 uriDj1eTQq1siH6kKu8oYQBMSxMSw9wM5q71d5QX2XJzw3hl9bqw5GqiT8Gjk1pX
 K759Tj+IMFNa1SqUGI5IBfkHz4vvuwOr+pO60fwd3CzajOI8mNnNAuJ+1I2UJaEP
 cgq+coD950BsWCyhL+XO/NhYPUSugI331JQqlIkVPPYxqBCJK22c7rZusM0Sm4Gq
 aweWKIxtI9cbsrlf6WfJdyHf5clZ9AjMmWEzHFbDeGqUhb3pTFoT5H/UqSw9/6xe
 pi7iKS6lBoZAoWKfyt9bOKfmaDlcrylAxPQ/LfyJGSTy5Buj5bcc4FO5rRJ+BIay
 t88=
 =wad8
 -----END PGP SIGNATURE-----

Merge tag 'nvme-5.15-2021-09-15' of git://git.infradead.org/nvme into block-5.15

Pull NVMe fixes from Christoph:

"nvme fixes for Linux 5.15

 - fix ANA state updates when a namespace is not present (Anton Eidelman)
 - nvmet: fix a width vs precision bug in nvmet_subsys_attr_serial_show
   (Dan Carpenter)
 - avoid race in shutdown namespace removal (Daniel Wagner)
 - fix io_work priority inversion in nvme-tcp (Keith Busch)
 - destroy cm id before destroy qp to avoid use after free (Ruozhu Li)"

* tag 'nvme-5.15-2021-09-15' of git://git.infradead.org/nvme:
  nvme-tcp: fix io_work priority inversion
  nvme-rdma: destroy cm id before destroy qp to avoid use after free
  nvme-multipath: fix ANA state updates when a namespace is not present
  nvme: avoid race in shutdown namespace removal
  nvmet: fix a width vs precision bug in nvmet_subsys_attr_serial_show()
2021-09-15 07:53:32 -06:00
Christoph Hellwig
9da4c7276e nvme: remove the call to nvme_update_disk_info in nvme_ns_remove
There is no need to explicitly unregister the integrity profile when
deleting the gendisk.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Link: https://lore.kernel.org/r/20210914070657.87677-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-14 20:03:30 -06:00
Daniel Wagner
9edceaf430 nvme: avoid race in shutdown namespace removal
When we remove the siblings entry, we update ns->head->list, hence we
can't separate the removal and test for being empty. They have to be
in the same critical section to avoid a race.

To avoid breaking the refcounting imbalance again, add a list empty
check to nvme_find_ns_head.

Fixes: 5396fdac56 ("nvme: fix refcounting imbalance when all paths are down")
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Tested-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-09-13 17:04:04 +02:00
Luis Chamberlain
ab3994f6ef nvme: add error handling support for add_disk()
We never checked for errors on add_disk() as this function
returned void. Now that this is fixed, use the shiny new
error handling.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-09-06 10:08:09 +02:00
Daniel Wagner
041bd1a1fc nvme: only call synchronize_srcu when clearing current path
The function nmve_mpath_clear_current_path returns true if the current
path has changed. In this case we have to wait for all concurrent
submissions to finish. But if we didn't change the current path, there
is no point in waiting for another RCU period to finish.

Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-09-06 10:03:11 +02:00
Tatsuya Sasaki
b58da2d270 nvme: update keep alive interval when kato is modified
Currently the connection between host and NVMe-oF target gets
disconnected by keep-alive timeout when a user connects to a target
with a relatively large kato value and then sets the smaller kato
with a set features command (e.g. connects with 60 seconds kato value
and then sets 10 seconds kato value).

The cause is that keep alive command interval on the host, which is
defined as unsigned int kato in nvme_ctrl structure, does not follow
the kato value changes.

This patch updates the keep alive interval in the following steps when
the kato is modified by a set features command: stops the keep alive
work queue, then sets the kato as new timer value and re-start the queue.

Signed-off-by: Tatsuya Sasaki <tatsuya6.sasaki@kioxia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-09-06 10:03:11 +02:00
Adam Manzanares
43dc987828 nvme: move nvme_multi_css into nvme.h
Preparatory patch in order to reuse nvme_multi_css in the nvme target
code.

Signed-off-by: Adam Manzanares <a.manzanares@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-09-06 09:59:02 +02:00
Hannes Reinecke
e7d65803e2 nvme-multipath: revalidate paths during rescan
When triggering a rescan due to a namespace resize we will be
receiving AENs on every controller, triggering a rescan of all
attached namespaces. If multipath is active only the current path and
the ns_head disk will be updated, the other paths will still refer to
the old size until AENs for the remaining controllers are received.

If I/O comes in before that it might be routed to one of the old
paths, triggering an I/O failure with 'access beyond end of device'.
With this patch the old paths are skipped from multipath path
selection until the controller serving these paths has been rescanned.

Signed-off-by: Hannes Reinecke <hare@suse.de>
[dwagner: - introduce NVME_NS_READY flag instead of NVME_NS_INVALIDATE
          - use 'revalidate' instead of 'invalidate' which
	    follows the zoned device code path.
	  - clear NVME_NS_READY before clearing current_path]
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-09-06 09:59:02 +02:00
Linus Torvalds
9a1d6c9e3f for-5.15/drivers-2021-08-30
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmEs6LsQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpqnLD/9c8v7WTLjrDR6FLD8fHUmkwk9ss6OeyYJC
 Z62QOyk6BqNOu6FAwBax9wFaboXdUqOdpJU0PVQ7WJ5wBiCQ9DAZY6T+iwW0jE79
 +iOSqdXHVLAIyIM9GplzLH5AH3tx4445bX7fRWwWX1OgmSidkAhb25FusCvpcpHx
 1k+9dSLClLeHPR6jVT3k6tHv2RzPSw+/vYOggeWYA0YYPfoCx/Ft0uwO+PjKpvLQ
 Je5jASlLGYCXazswJBZgfjbroA97EuaLOmHHIHrwhkkFsbV6ewv6mlmanbMEs4fX
 Wh+axTt8so27g6gbw31EOcGsxTi0B37Jx9MOrSla6NdJoZkFE2sn6K+D5k4oeSrg
 QgYXL00U62eSgWmgSB0f0X081cQfI+FUMe5u5S368WdrgCPfaXl11zHw8nXw8gEW
 UvqR4zr3hQd4piXsIWl2bwZrmpPBCeB8iStLq3C92RLPFT6hJO3GM/ZmwTn+0HT0
 lMXzoEdkPywkKWi8aBbSgzXiGknNl8HAYnwMhcQjiHbYQOycGkI9pigJDNY9Ox1l
 fYHFSompmJ/XK8cIiU7QIglXEXJky5jQ89Ni0ryCstOaP20tPxWtkpOCgidXfNGz
 4lmQV8D5aBTUFs6ifPjXfiXUmDiU3SaxiFhAqaEkGII9BbkrNhlibB4LBAU+toi1
 Q0yGhGR/mg==
 =4uWF
 -----END PGP SIGNATURE-----

Merge tag 'for-5.15/drivers-2021-08-30' of git://git.kernel.dk/linux-block

Pull block driver updates from Jens Axboe:
 "Sitting on top of the core block changes, here are the driver changes
  for the 5.15 merge window:

   - NVMe updates via Christoph:
       - suspend improvements for devices with an HMB (Keith Busch)
       - handle double completions more gacefull (Sagi Grimberg)
       - cleanup the selects for the nvme core code a bit (Sagi Grimberg)
       - don't update queue count when failing to set io queues (Ruozhu Li)
       - various nvmet connect fixes (Amit Engel)
       - cleanup lightnvm leftovers (Keith Busch, me)
       - small cleanups (Colin Ian King, Hou Pu)
       - add tracing for the Set Features command (Hou Pu)
       - CMB sysfs cleanups (Keith Busch)
       - add a mutex_destroy call (Keith Busch)

   - remove lightnvm subsystem. It's served its purpose and ultimately
     led to zoned nvme support, we no longer need it (Christoph)

   - revert floppy O_NDELAY fix (Denis)

   - nbd fixes (Hou, Pavel, Baokun)

   - nbd locking fixes (Tetsuo)

   - nbd device removal fixes (Christoph)

   - raid10 rcu warning fix (Xiao)

   - raid1 write behind fix (Guoqing)

   - rnbd fixes (Gioh, Md Haris)

   - misc fixes (Colin)"

* tag 'for-5.15/drivers-2021-08-30' of git://git.kernel.dk/linux-block: (42 commits)
  Revert "floppy: reintroduce O_NDELAY fix"
  raid1: ensure write behind bio has less than BIO_MAX_VECS sectors
  md/raid10: Remove unnecessary rcu_dereference in raid10_handle_discard
  nbd: remove nbd->destroy_complete
  nbd: only return usable devices from nbd_find_unused
  nbd: set nbd->index before releasing nbd_index_mutex
  nbd: prevent IDR lookups from finding partially initialized devices
  nbd: reset NBD to NULL when restarting in nbd_genl_connect
  nbd: add missing locking to the nbd_dev_add error path
  nvme: remove the unused NVME_NS_* enum
  nvme: remove nvm_ndev from ns
  nvme: Have NVME_FABRICS select NVME_CORE instead of transport drivers
  block: nbd: add sanity check for first_minor
  nvmet: check that host sqsize does not exceed ctrl MQES
  nvmet: avoid duplicate qid in connect cmd
  nvmet: pass back cntlid on successful completion
  nvme-rdma: don't update queue count when failing to set io queues
  nvme-tcp: don't update queue count when failing to set io queues
  nvme-tcp: pair send_mutex init with destroy
  nvme: allow user toggling hmb usage
  ...
2021-08-30 19:01:46 -07:00
Christoph Hellwig
5f432cceb3 nvme: use blk_mq_alloc_disk
Switch to use the blk_mq_alloc_disk helper for allocating the
request_queue and gendisk.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Link: https://lore.kernel.org/r/20210816131910.615153-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23 12:54:30 -06:00
Christoph Hellwig
3973e15fa5 nvme: use bvec_virt
Use bvec_virt instead of open coding it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20210804095634.460779-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-16 10:50:33 -06:00
Sagi Grimberg
e7006de6c2 nvme: code command_id with a genctr for use-after-free validation
We cannot detect a (perhaps buggy) controller that is sending us
a completion for a request that was already completed (for example
sending a completion twice), this phenomenon was seen in the wild
a few times.

So to protect against this, we use the upper 4 msbits of the nvme sqe
command_id to use as a 4-bit generation counter and verify it matches
the existing request generation that is incrementing on every execution.

The 16-bit command_id structure now is constructed by:
| xxxx | xxxxxxxxxxxx |
  gen    request tag

This means that we are giving up some possible queue depth as 12 bits
allow for a maximum queue depth of 4095 instead of 65536, however we
never create such long queues anyways so no real harm done.

Suggested-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Acked-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Tested-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-08-16 14:42:22 +02:00
Christoph Hellwig
9ea9b9c483 remove the lightnvm subsystem
Lightnvm supports the OCSSD 1.x and 2.0 specs which were early attempts
to produce Open Channel SSDs and never made it into the NVMe spec
proper.  They have since been superceeded by NVMe enhancements such
as ZNS support.  Remove the support per the deprecation schedule.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210812132308.38486-1-hch@lst.de
Reviewed-by: Matias Bjørling <mb@lightnvm.io>
Reviewed-by: Javier González <javier@javigon.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-14 15:54:09 -06:00
Christoph Hellwig
50b4aecfbb block: remove GENHD_FL_UP
Just check inode_unhashed on the whole device bdev inode instead,
and provide a helper to check for that information.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210809064028.1198327-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-12 10:29:36 -06:00
Christoph Hellwig
5eba200526 nvme: remove the GENHD_FL_UP check in nvme_ns_remove
Early probe failure never reaches nvme_ns_remove, so GENHD_FL_UP must
be set at this point.  Remove the check.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210809064028.1198327-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-12 10:29:36 -06:00
Christoph Hellwig
471aa704db block: pass a gendisk to blk_queue_update_readahead
.. and rename the function to disk_update_readahead.  This is in
preparation for moving the BDI from the request_queue to the gendisk.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210809141744.1203023-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-09 11:52:28 -06:00
Christoph Hellwig
aaeb7bb061 nvme: set the PRACT bit when using Write Zeroes with T10 PI
When using Write Zeroes on a namespace that has protection
information enabled they behavior without the PRACT bit
counter-intuitive and will generally lead to validation failures
when reading the written blocks.  Fix this by always setting the
PRACT bit that generates matching PI data on the fly.

Fixes: 6e02318eae ("nvme: add support for the Write Zeroes command")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-07-21 17:24:10 +02:00
Hannes Reinecke
5396fdac56 nvme: fix refcounting imbalance when all paths are down
When the last path to a ns_head drops the current code
removes the ns_head from the subsystem list, but will only
delete the disk itself if the last reference to the ns_head
drops. This is causing an refcounting imbalance eg when
applications have a reference to the disk, as then they'll
never get notified that the disk is in fact dead.
This patch moves the call 'del_gendisk' into nvme_mpath_check_last_path(),
ensuring that the disk can be properly removed and applications get the
appropriate notifications.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-07-21 09:55:40 +02:00
Keith Busch
ae5e6886b4 nvme: use return value from blk_execute_rq()
We don't have an nvme status to report if the driver's .queue_rq()
returns an error without dispatching the requested nvme command. Check
the return value from blk_execute_rq() for all passthrough commands so
the caller may know their command was not successful.

If the command is from the target passthrough interface and fails to
dispatch, synthesize the response back to the host as a internal target
error.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Link: https://lore.kernel.org/r/20210610214437.641245-5-kbusch@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-30 15:35:45 -06:00
Keith Busch
be42a33b92 nvme: use blk_execute_rq() for passthrough commands
The generic blk_execute_rq() knows how to handle polled completions. Use
that instead of implementing an nvme specific handler.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Link: https://lore.kernel.org/r/20210610214437.641245-3-kbusch@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-30 15:35:38 -06:00
Linus Torvalds
440462198d for-5.14/drivers-2021-06-29
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmDbd5UQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpvsNEADCJKP81boFzRcdJo7EqaNDAzZyKOIg9Oq7
 4GZE0Wm0SgA6+04bKrNVd9KLcKvQ+NC1pK7UJemSSH2y9ir+zHfyYgAV0/+wFmYm
 NgHlDjBvf80XSI5wezcb6MxZT+R7IaIpDsW1ZvV9hFtPSncn5o2OIWiSdJtHT/Rv
 enlgZPc7OwNWoVMX8eR58IoO0k3S6GLpctUZHt/AUukaKgoOks0X523qhEPf3Upr
 RkbIZuqLWVgpdT6457iSE/OijUczD4thTI8bdprxzhgimOm2vV52sO6F5HtHc7GX
 qW+PWYUaiUk7UpObuOuyv0yyUG45ii73iY1W0w66RiyCjVTgtpdwwMQ38VlBcoOg
 zcE1jneAEJt6TiS6zfRaER/10JoCIG4gp1+apPuaXud/o3BqWI0cagVHAgaLziBI
 F7bDJkbJZIR6GrWMgemBI+mc5/LACBePxzPGLScKFptejtQ/ysfZQ6aCLROJWB2U
 4EnysAaUBf6tywj30JqfQvqFNGkHIgY95FKiXJW6GzqqwgBouNf48vS15BgkwI+2
 EijcqUhlOVNfc3RIc0ZL5c9KcPIN9t5sqBrWZe3wgCErhxAx6w6Za9nDdP+US9bl
 /apCpvDFlu59g8n1wtkNE/uC+XqdKDwsplYhnfpX0FGni5wIknhQq3bSe4dPFgSn
 pG5VMrw3pA==
 =D6dS
 -----END PGP SIGNATURE-----

Merge tag 'for-5.14/drivers-2021-06-29' of git://git.kernel.dk/linux-block

Pull block driver updates from Jens Axboe:
 "Pretty calm round, mostly just NVMe and a bit of MD:

   - NVMe updates (via Christoph)
        - improve the APST configuration algorithm (Alexey Bogoslavsky)
        - look for StorageD3Enable on companion ACPI device
          (Mario Limonciello)
        - allow selecting the network interface for TCP connections
          (Martin Belanger)
        - misc cleanups (Amit Engel, Chaitanya Kulkarni, Colin Ian King,
          Christoph)
        - move the ACPI StorageD3 code to drivers/acpi/ and add quirks
          for certain AMD CPUs (Mario Limonciello)
        - zoned device support for nvmet (Chaitanya Kulkarni)
        - fix the rules for changing the serial number in nvmet
          (Noam Gottlieb)
        - various small fixes and cleanups (Dan Carpenter, JK Kim,
          Chaitanya Kulkarni, Hannes Reinecke, Wesley Sheng, Geert
          Uytterhoeven, Daniel Wagner)

   - MD updates (Via Song)
        - iostats rewrite (Guoqing Jiang)
        - raid5 lock contention optimization (Gal Ofri)

   - Fall through warning fix (Gustavo)

   - Misc fixes (Gustavo, Jiapeng)"

* tag 'for-5.14/drivers-2021-06-29' of git://git.kernel.dk/linux-block: (78 commits)
  nvmet: use NVMET_MAX_NAMESPACES to set nn value
  loop: Fix missing discard support when using LOOP_CONFIGURE
  nvme.h: add missing nvme_lba_range_type endianness annotations
  nvme: remove zeroout memset call for struct
  nvme-pci: remove zeroout memset call for struct
  nvmet: remove zeroout memset call for struct
  nvmet: add ZBD over ZNS backend support
  nvmet: add Command Set Identifier support
  nvmet: add nvmet_req_bio put helper for backends
  nvmet: add req cns error complete helper
  block: export blk_next_bio()
  nvmet: remove local variable
  nvmet: use nvme status value directly
  nvmet: use u32 type for the local variable nsid
  nvmet: use u32 for nvmet_subsys max_nsid
  nvmet: use req->cmd directly in file-ns fast path
  nvmet: use req->cmd directly in bdev-ns fast path
  nvmet: make ver stable once connection established
  nvmet: allow mn change if subsys not discovered
  nvmet: make sn stable once connection was established
  ...
2021-06-30 12:21:16 -07:00
Linus Torvalds
df668a5fe4 for-5.14/block-2021-06-29
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmDbXAwQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpr0HEADDJaSgjpnWQwH1RVLNagJa9KnktxZYsEs+
 as3QmDdpKRG3rEC9bdE7FLe/xq3WBaO5j1hTQ9P6IguqLyS1Df72DtTlKyaCrZoe
 zv9eIlY4lZUfksE2nzWmlN9uG0FBVXeEQpHCLSNbUZeK1zvV6+NNhQqw2kc0sEqu
 hReUFeMUbsMcu/w5T3XMVJNsTMCql9wta2H0q5hONQyJQSrIwa1D+sUdE5I8fO4j
 bnoYX9yxHX26EztX1UJiGRgoq5Trz7LY7hAfljKSkewpFwiHE2vBdq2L0C2RKsIV
 tTs2DjMCMQyPNeA7WAG8HlR4aPG+7+/fuBP1KJHkykjWXglWN7OqISuBv6rrBgQs
 gNRnZ4qmb1CzD6aLEBk59nHt6po6eMxXIW856YktKy8rKcrgK29qP44Z+oomkPKo
 ZjQ0wqN5CvpObM/dIKxl9bAJ4zQDHBt49d5nTTQLfWl/mgevu6ZNWD/hONyCQmFy
 zKKqQ/wkxWHutOsjC5/MKNb3ZRNH9tt9X+HfULO2DU6IqqifYw/ex4z4MVsBopJC
 7pPfd81kgC73TgXe1AaCwHqNWsrqYCuTK0ew1CtGudlS3lucMwtap4GBiCgg5gbu
 M8pEgwO4OcCLHyRUc8zdfqI7HumbprbFmojPkwGSEe0ofVD74lMhzbUj5jvTYY2B
 t8D2XcgyOA==
 =lhon
 -----END PGP SIGNATURE-----

Merge tag 'for-5.14/block-2021-06-29' of git://git.kernel.dk/linux-block

Pull core block updates from Jens Axboe:

 - disk events cleanup (Christoph)

 - gendisk and request queue allocation simplifications (Christoph)

 - bdev_disk_changed cleanups (Christoph)

 - IO priority improvements (Bart)

 - Chained bio completion trace fix (Edward)

 - blk-wbt fixes (Jan)

 - blk-wbt enable/disable fix (Zhang)

 - Scheduler dispatch improvements (Jan, Ming)

 - Shared tagset scheduler improvements (John)

 - BFQ updates (Paolo, Luca, Pietro)

 - BFQ lock inversion fix (Jan)

 - Documentation improvements (Kir)

 - CLONE_IO block cgroup fix (Tejun)

 - Remove of ancient and deprecated block dump feature (zhangyi)

 - Discard merge fix (Ming)

 - Misc fixes or followup fixes (Colin, Damien, Dan, Long, Max, Thomas,
   Yang)

* tag 'for-5.14/block-2021-06-29' of git://git.kernel.dk/linux-block: (129 commits)
  block: fix discard request merge
  block/mq-deadline: Remove a WARN_ON_ONCE() call
  blk-mq: update hctx->dispatch_busy in case of real scheduler
  blk: Fix lock inversion between ioc lock and bfqd lock
  bfq: Remove merged request already in bfq_requests_merged()
  block: pass a gendisk to bdev_disk_changed
  block: move bdev_disk_changed
  block: add the events* attributes to disk_attrs
  block: move the disk events code to a separate file
  block: fix trace completion for chained bio
  block/partitions/msdos: Fix typo inidicator -> indicator
  block, bfq: reset waker pointer with shared queues
  block, bfq: check waker only for queues with no in-flight I/O
  block, bfq: avoid delayed merge of async queues
  block, bfq: boost throughput by extending queue-merging times
  block, bfq: consider also creation time in delayed stable merge
  block, bfq: fix delayed stable merge check
  block, bfq: let also stably merged queues enjoy weight raising
  blk-wbt: make sure throttle is enabled properly
  blk-wbt: introduce a new disable state to prevent false positive by rwb_enabled()
  ...
2021-06-30 12:12:56 -07:00
Chaitanya Kulkarni
cc72c44267 nvme: remove zeroout memset call for struct
Declare and initialize structure variables to zero values so that we can
remove zeroout memset calls in the host/core.c.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-06-17 15:51:21 +02:00
Christoph Hellwig
f1cf35e17e nvme: remove nvme_{get,put}_ns_from_disk
Now that only one caller is left remove the helpers by restructuring
nvme_pr_command so that it has two helpers for sending a command of to a
given nsid using either the ns_head for multipath, or the namespace
stored in the gendisk.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2021-06-03 10:29:26 +03:00
Christoph Hellwig
8b4fb0f968 nvme: split nvme_report_zones
Split multipath support out of nvme_report_zones into a separate helper
and simplify the non-multipath version as a result.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2021-06-03 10:29:26 +03:00
Martin Belanger
3ede8f72a9 nvme-tcp: allow selecting the network interface for connections
In our application, we need a way to force TCP connections to go out a
specific IP interface instead of letting Linux select the interface
based on the routing tables.

Add the 'host-iface' option to allow specifying the interface to use.
When the option host-iface is specified, the driver uses the specified
interface to set the option SO_BINDTODEVICE on the TCP socket before
connecting.

This new option is needed in addtion to the existing host-traddr for
the following reasons:

Specifying an IP interface by its associated IP address is less
intuitive than specifying the actual interface name and, in some cases,
simply doesn't work. That's because the association between interfaces
and IP addresses is not predictable. IP addresses can be changed or can
change by themselves over time (e.g. DHCP). Interface names are
predictable [1] and will persist over time. Consider the following
configuration.

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state ...
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 100.0.0.100/24 scope global lo
       valid_lft forever preferred_lft forever
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc ...
    link/ether 08:00:27:21:65:ec brd ff:ff:ff:ff:ff:ff
    inet 100.0.0.100/24 scope global enp0s3
       valid_lft forever preferred_lft forever
3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc ...
    link/ether 08:00:27:4f:95:5c brd ff:ff:ff:ff:ff:ff
    inet 100.0.0.100/24 scope global enp0s8
       valid_lft forever preferred_lft forever

The above is a VM that I configured with the same IP address
(100.0.0.100) on all interfaces. Doing a reverse lookup to identify the
unique interface associated with 100.0.0.100 does not work here. And
this is why the option host_iface is required. I understand that the
above config does not represent a standard host system, but I'm using
this to prove a point: "We can never know how users will configure
their systems". By te way, The above configuration is perfectly fine
by Linux.

The current TCP implementation for host_traddr performs a
bind()-before-connect(). This is a common construct to set the source
IP address on a TCP socket before connecting. This has no effect on how
Linux selects the interface for the connection. That's because Linux
uses the Weak End System model as described in RFC1122 [2]. On the other
hand, setting the Source IP Address has benefits and should be supported
by linux-nvme. In fact, setting the Source IP Address is a mandatory
FedGov requirement (e.g. connection to a RADIUS/TACACS+ server).
Consider the following configuration.

$ ip addr list dev enp0s8
3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc ...
    link/ether 08:00:27:4f:95:5c brd ff:ff:ff:ff:ff:ff
    inet 192.168.56.101/24 brd 192.168.56.255 scope global enp0s8
       valid_lft 426sec preferred_lft 426sec
    inet 192.168.56.102/24 scope global secondary enp0s8
       valid_lft forever preferred_lft forever
    inet 192.168.56.103/24 scope global secondary enp0s8
       valid_lft forever preferred_lft forever
    inet 192.168.56.104/24 scope global secondary enp0s8
       valid_lft forever preferred_lft forever

Here we can see that several addresses are associated with interface
enp0s8. By default, Linux always selects the default IP address,
192.168.56.101, as the source address when connecting over interface
enp0s8. Some users, however, want the ability to specify a different
source address (e.g., 192.168.56.102, 192.168.56.103, ...). The option
host_traddr can be used as-is to perform this function.

In conclusion, I believe that we need 2 options for TCP connections.
One that can be used to specify an interface (host-iface). And one that
can be used to set the source address (host-traddr). Users should be
allowed to use one or the other, or both, or none. Of course, the
documentation for host_traddr will need some clarification. It should
state that when used for TCP connection, this option only sets the
source address. And the documentation for host_iface should say that
this option is only available for TCP connections.

References:
[1] https://www.freedesktop.org/wiki/Software/systemd/PredictableNetworkInterfaceNames/
[2] https://tools.ietf.org/html/rfc1122

Tested both IPv4 and IPv6 connections.

Signed-off-by: Martin Belanger <martin.belanger@dell.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-06-03 10:29:24 +03:00