Skip to content

Fpin v10 patches#10

Draft
johnmeneghini wants to merge 11 commits intonvme-6.18from
fpin_v16
Draft

Fpin v10 patches#10
johnmeneghini wants to merge 11 commits intonvme-6.18from
fpin_v16

Conversation

@johnmeneghini
Copy link
Owner

Add fc_host_fpin_set_nvme_rport_marginal() function to
evaluate the FPIN LI TLV information and set the 'marginal'
path status for all affected nvme rports.

Co-developed-by: Hannes Reinecke hare@kernel.org
Signed-off-by: Hannes Reinecke hare@kernel.org
Tested-by: Bryan Gurney bgurney@redhat.com
Signed-off-by: John Meneghini jmeneghi@redhat.com

Refactor and fc_rport_set_marginal_state smp safe by holding
shost->host_lock around all rport->port_state accesses.

Call nvme_fc_modify_rport_fpin_state() when FC_PORTSTATE_MARGINAL is set
or cleared. This allows the user to quickly set or clear the
NVME_CTRL_MARGINAL state from sysfs.

E.g.:

 echo "Marginal" > /sys/class/fc_remote_ports/rport-13:0-5/port_state
 echo "Online" > /sys/class/fc_remote_ports/rport-13:0-5/port_state

Note: nvme_fc_modify_rport_fpin_state() will only affect rports that have FC_PORT_ROLE_NVME_TARGET set.

Signed-off-by: John Meneghini jmeneghi@redhat.com

Hannes Reinecke and others added 11 commits September 23, 2025 08:28
Introduce 'union fc_tlv_desc' to have a common structure for all FC
ELS TLV structures and avoid type casts.

[bgurney: The cast inside the union fc_tlv_next_desc() has "u8",
which causes a failure to build.  Use "__u8" instead.]

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Justin Tee <justin.tee@broadcom.com>
Tested-by: Bryan Gurney <bgurney@redhat.com>
Reviewed-by: John Meneghini <jmeneghi@redhat.com>
Tested-by: Muneendra Kumar <muneendra.kumar@broadcom.com>
Add a new controller flag, NVME_CTRL_MARGINAL, to help multipath I/O
policies to react to a path that is set to a "marginal" state.

The flag is cleared on controller reset, which is often the case when
faulty cabling or transceiver hardware is replaced.

Signed-off-by: Bryan Gurney <bgurney@redhat.com>
FPIN LI (link integrity) messages are received when the attached
fabric detects hardware errors. In response to these messages I/O
should be directed away from the affected ports, and only used
if the 'optimized' paths are unavailable.
To handle this a new controller flag 'NVME_CTRL_MARGINAL' is added
which will cause the multipath scheduler to skip these paths when
checking for 'optimized' paths. They are, however, still eligible
for non-optimized path selected. The flag is cleared upon reset as then the
faulty hardware might be replaced.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Tested-by: Bryan Gurney <bgurney@redhat.com>
Reviewed-by: John Meneghini <jmeneghi@redhat.com>
Tested-by: Muneendra Kumar <muneendra.kumar@broadcom.com>
If a controller has received a link integrity or congestion event, and
has the NVME_CTRL_MARGINAL flag set, emit "marginal" in the state
instead of "live", to identify the marginal paths.

Co-developed-by: John Meneghini <jmeneghi@redhat.com>
Signed-off-by: John Meneghini <jmeneghi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Tested-by: Muneendra Kumar <muneendra.kumar@broadcom.com>
Signed-off-by: Bryan Gurney <bgurney@redhat.com>
Exclude marginal paths from queue-depth io policy. In the case where all
paths are marginal and no optimized or non-optimized path is found, we
fall back to __nvme_find_path which selects the best marginal path.

Tested-by: Bryan Gurney <bgurney@redhat.com>
Signed-off-by: John Meneghini <jmeneghi@redhat.com>
Add nvme_fc_modify_rport_fpin_state() and supporting functions. This
function is called by the SCSI FC transport and driver layer to set or
clear the 'marginal' path status for a specific rport.

Co-developed-by: Hannes Reinecke <hare@kernel.org>
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Tested-by: Bryan Gurney <bgurney@redhat.com>
Signed-off-by: John Meneghini <jmeneghi@redhat.com>
Add fc_host_fpin_set_nvme_rport_marginal() function to
evaluate the FPIN LI TLV information and set the 'marginal'
path status for all affected nvme rports.

Co-developed-by: Hannes Reinecke <hare@kernel.org>
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Tested-by: Bryan Gurney <bgurney@redhat.com>
Signed-off-by: John Meneghini <jmeneghi@redhat.com>
Call fc_host_fpin_set_nvme_rport_marginal() to enable FPIN notifications
for NVMe.

Co-developed-by: Hannes Reinecke <hare@kernel.org>
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Tested-by: Bryan Gurney <bgurney@redhat.com>
Signed-off-by: John Meneghini <jmeneghi@redhat.com>
Call fc_host_fpin_set_nvme_rport_marginal() to enable FPIN notifications
for NVMe.

Co-developed-by: Hannes Reinecke <hare@kernel.org>
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Tested-by: Bryan Gurney <bgurney@redhat.com>
Signed-off-by: John Meneghini <jmeneghi@redhat.com>
Refactor and fc_rport_set_marginal_state smp safe by holding
`shost->host_lock` around all `rport->port_state` accesses.

Call nvme_fc_modify_rport_fpin_state() when FC_PORTSTATE_MARGINAL is set
or cleared.  This allows the user to quickly set or clear the
NVME_CTRL_MARGINAL state from sysfs.

E.g.:

 echo "Marginal" > /sys/class/fc_remote_ports/rport-13:0-5/port_state
 echo "Online" > /sys/class/fc_remote_ports/rport-13:0-5/port_state

Note: nvme_fc_modify_rport_fpin_state() will only affect rports that
      have FC_PORT_ROLE_NVME_TARGET set.

Signed-off-by: John Meneghini <jmeneghi@redhat.com>
purex_item.iocb is defined as a 64-element u8 array, but 64 is the
minimum size and it can be allocated larger. This makes it a standard
empty flex array.

This was motivated by field-spanning write warnings during FPIN testing.

  >  kernel: memcpy: detected field-spanning write (size 60) of single
  >  field "((uint8_t *)fpin_pkt + buffer_copy_offset)"
  >  at drivers/scsi/qla2xxx/qla_isr.c:1221 (size 44)

I removed the outer wrapper from the iocb flex array, so that it can be
linked to `purex_item.size` with `__counted_by`.

These changes remove the default minimum 64-byte allocation, requiring
further changes.

  In `struct scsi_qla_host` the embedded `default_item` is now followed
  by `__default_item_iocb[QLA_DEFAULT_PAYLOAD_SIZE]` to reserve space
  that will be used as `default_item.iocb`. This is wrapped using the
  `TRAILING_OVERLAP()` macro helper, which effectively creates a union
  between flexible-array member `default_item.iocb` and
  `__default_item_iocb`.

  Since `struct pure_item` now contains a flexible-array member, the
  helper must be placed at the end of `struct scsi_qla_host` to prevent
  a `-Wflex-array-member-not-at-end` warning.

  `qla24xx_alloc_purex_item()` is adjusted to no longer expect the
  default minimum size to be part of `sizeof(struct purex_item)`,
  the entire flexible array size is added to the structure size for
  allocation.

This also slightly changes the layout of the purex_item struct, as
2-bytes of padding are added between `size` and `iocb`. The resulting
size is the same, but iocb is shifted 2-bytes (the original `purex_item`
structure was padded at the end, after the 64-byte defined array size).
I don't think this is a problem.

In qla_os.c:qla24xx_process_purex_rdp()

To avoid a null pointer dereference the vha->default_item should be set
to 0 last if the item pointer passed to the function matches.  Also use
a local variable to avoid multiple de-referencing of the item.

Tested-by: Bryan Gurney <bgurney@redhat.com>
Co-developed-by: Chris Leech <cleech@redhat.com>
Signed-off-by: Chris Leech <cleech@redhat.com>
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
johnmeneghini pushed a commit that referenced this pull request Sep 30, 2025
smc_lo_register_dmb() allocates DMB buffers with kzalloc(), which are
later passed to get_page() in smc_rx_splice(). Since kmalloc memory is
not page-backed, this triggers WARN_ON_ONCE() in get_page() and prevents
holding a refcount on the buffer. This can lead to use-after-free if
the memory is released before splice_to_pipe() completes.

Use folio_alloc() instead, ensuring DMBs are page-backed and safe for
get_page().

WARNING: CPU: 18 PID: 12152 at ./include/linux/mm.h:1330 smc_rx_splice+0xaf8/0xe20 [smc]
CPU: 18 UID: 0 PID: 12152 Comm: smcapp Kdump: loaded Not tainted 6.17.0-rc3-11705-g9cf4672ecfee #10 NONE
Hardware name: IBM 3931 A01 704 (z/VM 7.4.0)
Krnl PSW : 0704e00180000000 000793161032696c (smc_rx_splice+0xafc/0xe20 [smc])
           R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
Krnl GPRS: 0000000000000000 001cee80007d3001 00077400000000f8 0000000000000005
           0000000000000001 001cee80007d3006 0007740000001000 001c000000000000
           000000009b0c99e0 0000000000001000 001c0000000000f8 001c000000000000
           000003ffcc6f7c88 0007740003e98000 0007931600000005 000792969b2ff7b8
Krnl Code: 0007931610326960: af000000		mc	0,0
           0007931610326964: a7f4ff43		brc	15,00079316103267ea
          #0007931610326968: af000000		mc	0,0
          >000793161032696c: a7f4ff3f		brc	15,00079316103267ea
           0007931610326970: e320f1000004	lg	%r2,256(%r15)
           0007931610326976: c0e53fd1b5f5	brasl	%r14,000793168fd5d560
           000793161032697c: a7f4fbb5		brc	15,00079316103260e6
           0007931610326980: b904002b		lgr	%r2,%r11
Call Trace:
 smc_rx_splice+0xafc/0xe20 [smc]
 smc_rx_splice+0x756/0xe20 [smc])
 smc_rx_recvmsg+0xa74/0xe00 [smc]
 smc_splice_read+0x1ce/0x3b0 [smc]
 sock_splice_read+0xa2/0xf0
 do_splice_read+0x198/0x240
 splice_file_to_pipe+0x7e/0x110
 do_splice+0x59e/0xde0
 __do_splice+0x11a/0x2d0
 __s390x_sys_splice+0x140/0x1f0
 __do_syscall+0x122/0x280
 system_call+0x6e/0x90
Last Breaking-Event-Address:
smc_rx_splice+0x960/0xe20 [smc]
---[ end trace 0000000000000000 ]---

Fixes: f7a2207 ("net/smc: implement DMB-related operations of loopback-ism")
Reviewed-by: Mahanta Jambigi <mjambigi@linux.ibm.com>
Signed-off-by: Sidraya Jayagond <sidraya@linux.ibm.com>
Link: https://patch.msgid.link/20250917184220.801066-1-sidraya@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
johnmeneghini pushed a commit that referenced this pull request Sep 30, 2025
Running sha224_kunit on a KMSAN-enabled kernel results in a crash in
kmsan_internal_set_shadow_origin():

    BUG: unable to handle page fault for address: ffffbc3840291000
    #PF: supervisor read access in kernel mode
    #PF: error_code(0x0000) - not-present page
    PGD 1810067 P4D 1810067 PUD 192d067 PMD 3c17067 PTE 0
    Oops: 0000 [#1] SMP NOPTI
    CPU: 0 UID: 0 PID: 81 Comm: kunit_try_catch Tainted: G                 N  6.17.0-rc3 #10 PREEMPT(voluntary)
    Tainted: [N]=TEST
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
    RIP: 0010:kmsan_internal_set_shadow_origin+0x91/0x100
    [...]
    Call Trace:
    <TASK>
    __msan_memset+0xee/0x1a0
    sha224_final+0x9e/0x350
    test_hash_buffer_overruns+0x46f/0x5f0
    ? kmsan_get_shadow_origin_ptr+0x46/0xa0
    ? __pfx_test_hash_buffer_overruns+0x10/0x10
    kunit_try_run_case+0x198/0xa00

This occurs when memset() is called on a buffer that is not 4-byte aligned
and extends to the end of a guard page, i.e.  the next page is unmapped.

The bug is that the loop at the end of kmsan_internal_set_shadow_origin()
accesses the wrong shadow memory bytes when the address is not 4-byte
aligned.  Since each 4 bytes are associated with an origin, it rounds the
address and size so that it can access all the origins that contain the
buffer.  However, when it checks the corresponding shadow bytes for a
particular origin, it incorrectly uses the original unrounded shadow
address.  This results in reads from shadow memory beyond the end of the
buffer's shadow memory, which crashes when that memory is not mapped.

To fix this, correctly align the shadow address before accessing the 4
shadow bytes corresponding to each origin.

Link: https://lkml.kernel.org/r/20250911195858.394235-1-ebiggers@kernel.org
Fixes: 2ef3cec ("kmsan: do not wipe out origin when doing partial unpoisoning")
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Tested-by: Alexander Potapenko <glider@google.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Marco Elver <elver@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
johnmeneghini pushed a commit that referenced this pull request Jan 14, 2026
Initial rss_hdr allocation uses virtio_device->device,
but virtnet_set_queues() frees using net_device->device.
This device mismatch causing below devres warning

[ 3788.514041] ------------[ cut here ]------------
[ 3788.514044] WARNING: drivers/base/devres.c:1095 at devm_kfree+0x84/0x98, CPU#16: vdpa/1463
[ 3788.514054] Modules linked in: octep_vdpa virtio_net virtio_vdpa [last unloaded: virtio_vdpa]
[ 3788.514064] CPU: 16 UID: 0 PID: 1463 Comm: vdpa Tainted: G        W           6.18.0 #10 PREEMPT
[ 3788.514067] Tainted: [W]=WARN
[ 3788.514069] Hardware name: Marvell CN106XX board (DT)
[ 3788.514071] pstate: 63400009 (nZCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[ 3788.514074] pc : devm_kfree+0x84/0x98
[ 3788.514076] lr : devm_kfree+0x54/0x98
[ 3788.514079] sp : ffff800084e2f220
[ 3788.514080] x29: ffff800084e2f220 x28: ffff0003b2366000 x27: 000000000000003f
[ 3788.514085] x26: 000000000000003f x25: ffff000106f17c10 x24: 0000000000000080
[ 3788.514089] x23: ffff00045bb8ab08 x22: ffff00045bb8a000 x21: 0000000000000018
[ 3788.514093] x20: ffff0004355c3080 x19: ffff00045bb8aa00 x18: 0000000000080000
[ 3788.514098] x17: 0000000000000040 x16: 000000000000001f x15: 000000000007ffff
[ 3788.514102] x14: 0000000000000488 x13: 0000000000000005 x12: 00000000000fffff
[ 3788.514106] x11: ffffffffffffffff x10: 0000000000000005 x9 : ffff800080c8c05c
[ 3788.514110] x8 : ffff800084e2eeb8 x7 : 0000000000000000 x6 : 000000000000003f
[ 3788.514115] x5 : ffff8000831bafe0 x4 : ffff800080c8b010 x3 : ffff0004355c3080
[ 3788.514119] x2 : ffff0004355c3080 x1 : 0000000000000000 x0 : 0000000000000000
[ 3788.514123] Call trace:
[ 3788.514125]  devm_kfree+0x84/0x98 (P)
[ 3788.514129]  virtnet_set_queues+0x134/0x2e8 [virtio_net]
[ 3788.514135]  virtnet_probe+0x9c0/0xe00 [virtio_net]
[ 3788.514139]  virtio_dev_probe+0x1e0/0x338
[ 3788.514144]  really_probe+0xc8/0x3a0
[ 3788.514149]  __driver_probe_device+0x84/0x170
[ 3788.514152]  driver_probe_device+0x44/0x120
[ 3788.514155]  __device_attach_driver+0xc4/0x168
[ 3788.514158]  bus_for_each_drv+0x8c/0xf0
[ 3788.514161]  __device_attach+0xa4/0x1c0
[ 3788.514164]  device_initial_probe+0x1c/0x30
[ 3788.514168]  bus_probe_device+0xb4/0xc0
[ 3788.514170]  device_add+0x614/0x828
[ 3788.514173]  register_virtio_device+0x214/0x258
[ 3788.514175]  virtio_vdpa_probe+0xa0/0x110 [virtio_vdpa]
[ 3788.514179]  vdpa_dev_probe+0xa8/0xd8
[ 3788.514183]  really_probe+0xc8/0x3a0
[ 3788.514186]  __driver_probe_device+0x84/0x170
[ 3788.514189]  driver_probe_device+0x44/0x120
[ 3788.514192]  __device_attach_driver+0xc4/0x168
[ 3788.514195]  bus_for_each_drv+0x8c/0xf0
[ 3788.514197]  __device_attach+0xa4/0x1c0
[ 3788.514200]  device_initial_probe+0x1c/0x30
[ 3788.514203]  bus_probe_device+0xb4/0xc0
[ 3788.514206]  device_add+0x614/0x828
[ 3788.514209]  _vdpa_register_device+0x58/0x88
[ 3788.514211]  octep_vdpa_dev_add+0x104/0x228 [octep_vdpa]
[ 3788.514215]  vdpa_nl_cmd_dev_add_set_doit+0x2d0/0x3c0
[ 3788.514218]  genl_family_rcv_msg_doit+0xe4/0x158
[ 3788.514222]  genl_rcv_msg+0x218/0x298
[ 3788.514225]  netlink_rcv_skb+0x64/0x138
[ 3788.514229]  genl_rcv+0x40/0x60
[ 3788.514233]  netlink_unicast+0x32c/0x3b0
[ 3788.514237]  netlink_sendmsg+0x170/0x3b8
[ 3788.514241]  __sys_sendto+0x12c/0x1c0
[ 3788.514246]  __arm64_sys_sendto+0x30/0x48
[ 3788.514249]  invoke_syscall.constprop.0+0x58/0xf8
[ 3788.514255]  do_el0_svc+0x48/0xd0
[ 3788.514259]  el0_svc+0x48/0x210
[ 3788.514264]  el0t_64_sync_handler+0xa0/0xe8
[ 3788.514268]  el0t_64_sync+0x198/0x1a0
[ 3788.514271] ---[ end trace 0000000000000000 ]---

Fix by using virtio_device->device consistently for
allocation and deallocation

Fixes: 4944be2 ("virtio_net: Allocate rss_hdr with devres")
Signed-off-by: Kommula Shiva Shankar <kshankar@marvell.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Link: https://patch.msgid.link/20260102101900.692770-1-kshankar@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants