Introduce 'union fc_tlv_desc' to have a common structure for all FC ELS TLV structures and avoid type casts. [bgurney: The cast inside the union fc_tlv_next_desc() has "u8", which causes a failure to build. Use "__u8" instead.] Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Justin Tee <justin.tee@broadcom.com> Tested-by: Bryan Gurney <bgurney@redhat.com> Reviewed-by: John Meneghini <jmeneghi@redhat.com> Tested-by: Muneendra Kumar <muneendra.kumar@broadcom.com>
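A rough user-space sketch of the idea, for illustration only: a union overlays a generic TLV header with a specific descriptor type, so a walker can step through a buffer without casting between struct types. All `demo_*` names and field layouts are hypothetical and are not copied from the kernel headers. The byte-pointer cast uses `uint8_t` here; per the note above, the UAPI header needs `__u8` instead.

```c
#include <stddef.h>
#include <stdint.h>
#include <arpa/inet.h>	/* ntohl(), htonl() */

/* Hypothetical stand-ins for FC ELS descriptors; layouts are illustrative. */
struct demo_tlv_hdr {
	uint32_t desc_tag;	/* big-endian descriptor tag */
	uint32_t desc_len;	/* big-endian value length, header excluded */
};

struct demo_li_desc {
	uint32_t desc_tag;
	uint32_t desc_len;
	uint32_t event_count;	/* hypothetical payload field */
};

/* One common type for every descriptor, so callers need no struct casts. */
union demo_tlv_desc {
	struct demo_tlv_hdr hdr;
	struct demo_li_desc li;
};

/* Step to the next descriptor: header size plus the advertised value length. */
static inline union demo_tlv_desc *demo_tlv_next_desc(union demo_tlv_desc *desc)
{
	size_t step = sizeof(desc->hdr) + ntohl(desc->hdr.desc_len);

	return (union demo_tlv_desc *)((uint8_t *)desc + step);
}

/* Count descriptors with a given tag, stopping at a truncated descriptor. */
static unsigned int demo_count_tag(void *buf, size_t len, uint32_t tag)
{
	union demo_tlv_desc *desc = buf;
	unsigned int hits = 0;

	while (len >= sizeof(desc->hdr)) {
		size_t step = sizeof(desc->hdr) + ntohl(desc->hdr.desc_len);

		if (step > len)
			break;
		if (ntohl(desc->hdr.desc_tag) == tag)
			hits++;
		len -= step;
		desc = demo_tlv_next_desc(desc);
	}
	return hits;
}
```

The sketch assumes the buffer is 4-byte aligned and that value lengths are multiples of 4, so each descriptor header stays aligned.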
Add a new controller flag, NVME_CTRL_MARGINAL, to help multipath I/O policies to react to a path that is set to a "marginal" state. The flag is cleared on controller reset, which is often the case when faulty cabling or transceiver hardware is replaced. Signed-off-by: Bryan Gurney <bgurney@redhat.com>
FPIN LI (link integrity) messages are received when the attached fabric detects hardware errors. In response to these messages, I/O should be directed away from the affected ports, which should only be used if the 'optimized' paths are unavailable. To handle this, a new controller flag 'NVME_CTRL_MARGINAL' is added, which causes the multipath scheduler to skip these paths when checking for 'optimized' paths. They are, however, still eligible for non-optimized path selection. The flag is cleared upon reset, as the faulty hardware might have been replaced by then. Signed-off-by: Hannes Reinecke <hare@kernel.org> Tested-by: Bryan Gurney <bgurney@redhat.com> Reviewed-by: John Meneghini <jmeneghi@redhat.com> Tested-by: Muneendra Kumar <muneendra.kumar@broadcom.com>
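The selection rule described above can be sketched in user-space C roughly as follows. This is not the nvme multipath code: the `demo_*` types and the boolean `marginal` field (standing in for the NVME_CTRL_MARGINAL flag) are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical ANA states; the kernel has more states than these. */
enum demo_ana_state {
	DEMO_ANA_OPTIMIZED,
	DEMO_ANA_NONOPTIMIZED,
	DEMO_ANA_INACCESSIBLE,
};

struct demo_path {
	enum demo_ana_state ana;
	bool marginal;		/* stands in for NVME_CTRL_MARGINAL */
};

/*
 * Return the index of the path to use, or -1 if none is usable.
 * A healthy optimized path wins immediately. A marginal optimized path
 * is not treated as optimized, but it remains a candidate in the
 * non-optimized (fallback) selection.
 */
static int demo_select_path(const struct demo_path *paths, size_t n)
{
	int fallback = -1;

	for (size_t i = 0; i < n; i++) {
		const struct demo_path *p = &paths[i];

		if (p->ana == DEMO_ANA_INACCESSIBLE)
			continue;
		if (p->ana == DEMO_ANA_OPTIMIZED && !p->marginal)
			return (int)i;
		if (fallback < 0)
			fallback = (int)i;
	}
	return fallback;
}
```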
If a controller has received a link integrity or congestion event, and has the NVME_CTRL_MARGINAL flag set, emit "marginal" in the state instead of "live", to identify the marginal paths. Co-developed-by: John Meneghini <jmeneghi@redhat.com> Signed-off-by: John Meneghini <jmeneghi@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Tested-by: Muneendra Kumar <muneendra.kumar@broadcom.com> Signed-off-by: Bryan Gurney <bgurney@redhat.com>
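As a minimal illustration of the reporting change, assuming a hypothetical `demo_ctrl_state` enum and a boolean in place of the real flag bit, the state string could be chosen like this:

```c
#include <stdbool.h>

/* Hypothetical subset of controller states, for illustration only. */
enum demo_ctrl_state {
	DEMO_CTRL_NEW,
	DEMO_CTRL_LIVE,
	DEMO_CTRL_RESETTING,
};

/* Report "marginal" in place of "live" when the marginal flag is set. */
static const char *demo_state_name(enum demo_ctrl_state state, bool marginal)
{
	switch (state) {
	case DEMO_CTRL_NEW:
		return "new";
	case DEMO_CTRL_LIVE:
		return marginal ? "marginal" : "live";
	case DEMO_CTRL_RESETTING:
		return "resetting";
	}
	return "unknown";
}
```

Only the live state is affected in this sketch; other states keep their usual names even if the flag happens to be set.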
Exclude marginal paths from queue-depth io policy. In the case where all paths are marginal and no optimized or non-optimized path is found, we fall back to __nvme_find_path which selects the best marginal path. Tested-by: Bryan Gurney <bgurney@redhat.com> Signed-off-by: John Meneghini <jmeneghi@redhat.com>
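A simplified sketch of that policy, under stated assumptions: the `demo_qd_*` names are hypothetical, and `demo_qd_fallback()` is only a stand-in for __nvme_find_path, whose real selection criteria are not reproduced here.

```c
#include <stdbool.h>
#include <stddef.h>

enum demo_qd_ana {
	QD_ANA_OPTIMIZED,
	QD_ANA_NONOPTIMIZED,
	QD_ANA_INACCESSIBLE,
};

struct demo_qd_path {
	enum demo_qd_ana ana;
	bool marginal;		/* stands in for NVME_CTRL_MARGINAL */
	unsigned int depth;	/* outstanding requests on this path */
};

/* Stand-in fallback: first usable path, preferring an optimized one. */
static int demo_qd_fallback(const struct demo_qd_path *paths, size_t n)
{
	int nonopt = -1;

	for (size_t i = 0; i < n; i++) {
		if (paths[i].ana == QD_ANA_OPTIMIZED)
			return (int)i;
		if (paths[i].ana == QD_ANA_NONOPTIMIZED && nonopt < 0)
			nonopt = (int)i;
	}
	return nonopt;
}

/* Pick the least-loaded healthy path; use marginal paths only as a last resort. */
static int demo_qd_select(const struct demo_qd_path *paths, size_t n)
{
	int best_opt = -1, best_nonopt = -1;

	for (size_t i = 0; i < n; i++) {
		const struct demo_qd_path *p = &paths[i];

		if (p->marginal)
			continue;	/* excluded from the queue-depth scan */
		if (p->ana == QD_ANA_OPTIMIZED) {
			if (best_opt < 0 || p->depth < paths[best_opt].depth)
				best_opt = (int)i;
		} else if (p->ana == QD_ANA_NONOPTIMIZED) {
			if (best_nonopt < 0 || p->depth < paths[best_nonopt].depth)
				best_nonopt = (int)i;
		}
	}
	if (best_opt >= 0)
		return best_opt;
	if (best_nonopt >= 0)
		return best_nonopt;
	return demo_qd_fallback(paths, n);
}
```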
Add nvme_fc_modify_rport_fpin_state() and supporting functions. This function is called by the SCSI FC transport and driver layer to set or clear the 'marginal' path status for a specific rport. Co-developed-by: Hannes Reinecke <hare@kernel.org> Signed-off-by: Hannes Reinecke <hare@kernel.org> Tested-by: Bryan Gurney <bgurney@redhat.com> Signed-off-by: John Meneghini <jmeneghi@redhat.com>
Add fc_host_fpin_set_nvme_rport_marginal() function to evaluate the FPIN LI TLV information and set the 'marginal' path status for all affected nvme rports. Co-developed-by: Hannes Reinecke <hare@kernel.org> Signed-off-by: Hannes Reinecke <hare@kernel.org> Tested-by: Bryan Gurney <bgurney@redhat.com> Signed-off-by: John Meneghini <jmeneghi@redhat.com>
Call fc_host_fpin_set_nvme_rport_marginal() to enable FPIN notifications for NVMe. Co-developed-by: Hannes Reinecke <hare@kernel.org> Signed-off-by: Hannes Reinecke <hare@kernel.org> Tested-by: Bryan Gurney <bgurney@redhat.com> Signed-off-by: John Meneghini <jmeneghi@redhat.com>
Call fc_host_fpin_set_nvme_rport_marginal() to enable FPIN notifications for NVMe. Co-developed-by: Hannes Reinecke <hare@kernel.org> Signed-off-by: Hannes Reinecke <hare@kernel.org> Tested-by: Bryan Gurney <bgurney@redhat.com> Signed-off-by: John Meneghini <jmeneghi@redhat.com>
Refactor fc_rport_set_marginal_state() and make it SMP safe by holding
`shost->host_lock` around all `rport->port_state` accesses.
Call nvme_fc_modify_rport_fpin_state() when FC_PORTSTATE_MARGINAL is set
or cleared. This allows the user to quickly set or clear the
NVME_CTRL_MARGINAL state from sysfs.
E.g.:
echo "Marginal" > /sys/class/fc_remote_ports/rport-13:0-5/port_state
echo "Online" > /sys/class/fc_remote_ports/rport-13:0-5/port_state
Note: nvme_fc_modify_rport_fpin_state() will only affect rports that
have FC_PORT_ROLE_NVME_TARGET set.
Signed-off-by: John Meneghini <jmeneghi@redhat.com>
purex_item.iocb is defined as a 64-element u8 array, but 64 is the minimum size and it can be allocated larger. This change makes it a standard empty flex array.
This was motivated by field-spanning write warnings during FPIN testing:
> kernel: memcpy: detected field-spanning write (size 60) of single
> field "((uint8_t *)fpin_pkt + buffer_copy_offset)"
> at drivers/scsi/qla2xxx/qla_isr.c:1221 (size 44)
I removed the outer wrapper from the iocb flex array, so that it can be linked to `purex_item.size` with `__counted_by`.
These changes remove the default minimum 64-byte allocation, requiring further changes.
In `struct scsi_qla_host` the embedded `default_item` is now followed by `__default_item_iocb[QLA_DEFAULT_PAYLOAD_SIZE]` to reserve space that will be used as `default_item.iocb`. This is wrapped using the `TRAILING_OVERLAP()` macro helper, which effectively creates a union between the flexible-array member `default_item.iocb` and `__default_item_iocb`.
Since `struct purex_item` now contains a flexible-array member, the helper must be placed at the end of `struct scsi_qla_host` to prevent a `-Wflex-array-member-not-at-end` warning.
`qla24xx_alloc_purex_item()` is adjusted to no longer expect the default minimum size to be part of `sizeof(struct purex_item)`; the entire flexible array size is added to the structure size for allocation.
This also slightly changes the layout of the purex_item struct, as 2 bytes of padding are added between `size` and `iocb`. The resulting size is the same, but iocb is shifted by 2 bytes (the original `purex_item` structure was padded at the end, after the 64-byte defined array size). I don't think this is a problem.
In qla_os.c:qla24xx_process_purex_rdp(), to avoid a NULL pointer dereference, vha->default_item should be set to 0 last if the item pointer passed to the function matches it. Also use a local variable to avoid dereferencing the item multiple times.
Tested-by: Bryan Gurney <bgurney@redhat.com> Co-developed-by: Chris Leech <cleech@redhat.com> Signed-off-by: Chris Leech <cleech@redhat.com> Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
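For readers unfamiliar with the pattern, here is a small user-space sketch of a flexible-array member whose length is tracked by a sibling `size` field, with the whole array size added to the structure size at allocation time. The `demo_*` names and fields are hypothetical; in the kernel the array would carry `__counted_by(size)` and the allocation would typically use a helper such as `struct_size()`.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical analogue of purex_item; not the driver's real layout. */
struct demo_purex_item {
	void *owner;
	uint16_t size;		/* number of bytes available in iocb[] */
	uint8_t iocb[];		/* flexible array, sized at allocation time */
};

/* Allocate the header plus the full payload; nothing is baked into sizeof. */
static struct demo_purex_item *demo_alloc_item(uint16_t size)
{
	struct demo_purex_item *item;

	item = calloc(1, sizeof(*item) + (size_t)size);
	if (!item)
		return NULL;
	item->size = size;
	return item;
}

/* Copy a packet into the item, refusing writes that would span past iocb[]. */
static int demo_fill_item(struct demo_purex_item *item,
			  const uint8_t *pkt, size_t len)
{
	if (len > item->size)
		return -1;
	memcpy(item->iocb, pkt, len);
	return 0;
}
```

Because the bound lives next to the array, a copy helper can check it directly instead of relying on a fixed 64-byte assumption.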
johnmeneghini pushed a commit that referenced this pull request on Sep 30, 2025
smc_lo_register_dmb() allocates DMB buffers with kzalloc(), which are later passed to get_page() in smc_rx_splice(). Since kmalloc memory is not page-backed, this triggers WARN_ON_ONCE() in get_page() and prevents holding a refcount on the buffer. This can lead to use-after-free if the memory is released before splice_to_pipe() completes.
Use folio_alloc() instead, ensuring DMBs are page-backed and safe for get_page().
WARNING: CPU: 18 PID: 12152 at ./include/linux/mm.h:1330 smc_rx_splice+0xaf8/0xe20 [smc]
CPU: 18 UID: 0 PID: 12152 Comm: smcapp Kdump: loaded Not tainted 6.17.0-rc3-11705-g9cf4672ecfee #10 NONE
Hardware name: IBM 3931 A01 704 (z/VM 7.4.0)
Krnl PSW : 0704e00180000000 000793161032696c (smc_rx_splice+0xafc/0xe20 [smc])
R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
Krnl GPRS: 0000000000000000 001cee80007d3001 00077400000000f8 0000000000000005
0000000000000001 001cee80007d3006 0007740000001000 001c000000000000
000000009b0c99e0 0000000000001000 001c0000000000f8 001c000000000000
000003ffcc6f7c88 0007740003e98000 0007931600000005 000792969b2ff7b8
Krnl Code: 0007931610326960: af000000 mc 0,0
0007931610326964: a7f4ff43 brc 15,00079316103267ea
#0007931610326968: af000000 mc 0,0
>000793161032696c: a7f4ff3f brc 15,00079316103267ea
0007931610326970: e320f1000004 lg %r2,256(%r15)
0007931610326976: c0e53fd1b5f5 brasl %r14,000793168fd5d560
000793161032697c: a7f4fbb5 brc 15,00079316103260e6
0007931610326980: b904002b lgr %r2,%r11
Call Trace:
smc_rx_splice+0xafc/0xe20 [smc]
smc_rx_splice+0x756/0xe20 [smc])
smc_rx_recvmsg+0xa74/0xe00 [smc]
smc_splice_read+0x1ce/0x3b0 [smc]
sock_splice_read+0xa2/0xf0
do_splice_read+0x198/0x240
splice_file_to_pipe+0x7e/0x110
do_splice+0x59e/0xde0
__do_splice+0x11a/0x2d0
__s390x_sys_splice+0x140/0x1f0
__do_syscall+0x122/0x280
system_call+0x6e/0x90
Last Breaking-Event-Address:
smc_rx_splice+0x960/0xe20 [smc]
---[ end trace 0000000000000000 ]---
Fixes: f7a2207 ("net/smc: implement DMB-related operations of loopback-ism")
Reviewed-by: Mahanta Jambigi <mjambigi@linux.ibm.com> Signed-off-by: Sidraya Jayagond <sidraya@linux.ibm.com> Link: https://patch.msgid.link/20250917184220.801066-1-sidraya@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
johnmeneghini pushed a commit that referenced this pull request on Sep 30, 2025
Running sha224_kunit on a KMSAN-enabled kernel results in a crash in
kmsan_internal_set_shadow_origin():
BUG: unable to handle page fault for address: ffffbc3840291000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 1810067 P4D 1810067 PUD 192d067 PMD 3c17067 PTE 0
Oops: 0000 [#1] SMP NOPTI
CPU: 0 UID: 0 PID: 81 Comm: kunit_try_catch Tainted: G N 6.17.0-rc3 #10 PREEMPT(voluntary)
Tainted: [N]=TEST
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
RIP: 0010:kmsan_internal_set_shadow_origin+0x91/0x100
[...]
Call Trace:
<TASK>
__msan_memset+0xee/0x1a0
sha224_final+0x9e/0x350
test_hash_buffer_overruns+0x46f/0x5f0
? kmsan_get_shadow_origin_ptr+0x46/0xa0
? __pfx_test_hash_buffer_overruns+0x10/0x10
kunit_try_run_case+0x198/0xa00
This occurs when memset() is called on a buffer that is not 4-byte aligned
and extends to the end of a guard page, i.e. the next page is unmapped.
The bug is that the loop at the end of kmsan_internal_set_shadow_origin()
accesses the wrong shadow memory bytes when the address is not 4-byte
aligned. Since each 4 bytes are associated with an origin, it rounds the
address and size so that it can access all the origins that contain the
buffer. However, when it checks the corresponding shadow bytes for a
particular origin, it incorrectly uses the original unrounded shadow
address. This results in reads from shadow memory beyond the end of the
buffer's shadow memory, which crashes when that memory is not mapped.
To fix this, correctly align the shadow address before accessing the 4
shadow bytes corresponding to each origin.
Link: https://lkml.kernel.org/r/20250911195858.394235-1-ebiggers@kernel.org
Fixes: 2ef3cec ("kmsan: do not wipe out origin when doing partial unpoisoning")
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Tested-by: Alexander Potapenko <glider@google.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Marco Elver <elver@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
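The off-by-alignment problem described in the KMSAN commit above can be illustrated with plain index arithmetic. This is a hypothetical model, not the KMSAN code: offsets are byte indexes into a buffer's shadow, each origin slot covers 4 bytes, and the function reports the highest shadow index that a per-origin scan would touch.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_ORIGIN_SIZE 4u	/* one origin per 4 bytes of memory */

/*
 * Highest shadow byte index read when checking every origin slot that
 * overlaps [off, off + size). With align_shadow == false the scan starts
 * from the unrounded offset (the buggy behaviour); with true it starts
 * from the offset rounded down to the origin granularity (the fix).
 */
static size_t demo_last_shadow_index(size_t off, size_t size, bool align_shadow)
{
	size_t pad = off % DEMO_ORIGIN_SIZE;
	size_t start = off - pad;	/* address rounded down */
	size_t span = (size + pad + DEMO_ORIGIN_SIZE - 1) /
		      DEMO_ORIGIN_SIZE * DEMO_ORIGIN_SIZE;	/* size rounded up */
	size_t shadow_base = align_shadow ? start : off;

	return shadow_base + span - 1;
}
```

For a 3-byte buffer at offset 5 that ends exactly at offset 8, the aligned scan stops at index 7, while the unaligned scan reaches index 8, one byte past the end. That extra read is what faults when the following page is unmapped.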
johnmeneghini pushed a commit that referenced this pull request on Jan 14, 2026
Initial rss_hdr allocation uses virtio_device->device, but virtnet_set_queues() frees using net_device->device. This device mismatch causes the devres warning below:
[ 3788.514041] ------------[ cut here ]------------
[ 3788.514044] WARNING: drivers/base/devres.c:1095 at devm_kfree+0x84/0x98, CPU#16: vdpa/1463
[ 3788.514054] Modules linked in: octep_vdpa virtio_net virtio_vdpa [last unloaded: virtio_vdpa]
[ 3788.514064] CPU: 16 UID: 0 PID: 1463 Comm: vdpa Tainted: G W 6.18.0 #10 PREEMPT
[ 3788.514067] Tainted: [W]=WARN
[ 3788.514069] Hardware name: Marvell CN106XX board (DT)
[ 3788.514071] pstate: 63400009 (nZCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[ 3788.514074] pc : devm_kfree+0x84/0x98
[ 3788.514076] lr : devm_kfree+0x54/0x98
[ 3788.514079] sp : ffff800084e2f220
[ 3788.514080] x29: ffff800084e2f220 x28: ffff0003b2366000 x27: 000000000000003f
[ 3788.514085] x26: 000000000000003f x25: ffff000106f17c10 x24: 0000000000000080
[ 3788.514089] x23: ffff00045bb8ab08 x22: ffff00045bb8a000 x21: 0000000000000018
[ 3788.514093] x20: ffff0004355c3080 x19: ffff00045bb8aa00 x18: 0000000000080000
[ 3788.514098] x17: 0000000000000040 x16: 000000000000001f x15: 000000000007ffff
[ 3788.514102] x14: 0000000000000488 x13: 0000000000000005 x12: 00000000000fffff
[ 3788.514106] x11: ffffffffffffffff x10: 0000000000000005 x9 : ffff800080c8c05c
[ 3788.514110] x8 : ffff800084e2eeb8 x7 : 0000000000000000 x6 : 000000000000003f
[ 3788.514115] x5 : ffff8000831bafe0 x4 : ffff800080c8b010 x3 : ffff0004355c3080
[ 3788.514119] x2 : ffff0004355c3080 x1 : 0000000000000000 x0 : 0000000000000000
[ 3788.514123] Call trace:
[ 3788.514125] devm_kfree+0x84/0x98 (P)
[ 3788.514129] virtnet_set_queues+0x134/0x2e8 [virtio_net]
[ 3788.514135] virtnet_probe+0x9c0/0xe00 [virtio_net]
[ 3788.514139] virtio_dev_probe+0x1e0/0x338
[ 3788.514144] really_probe+0xc8/0x3a0
[ 3788.514149] __driver_probe_device+0x84/0x170
[ 3788.514152] driver_probe_device+0x44/0x120
[ 3788.514155] __device_attach_driver+0xc4/0x168
[ 3788.514158] bus_for_each_drv+0x8c/0xf0
[ 3788.514161] __device_attach+0xa4/0x1c0
[ 3788.514164] device_initial_probe+0x1c/0x30
[ 3788.514168] bus_probe_device+0xb4/0xc0
[ 3788.514170] device_add+0x614/0x828
[ 3788.514173] register_virtio_device+0x214/0x258
[ 3788.514175] virtio_vdpa_probe+0xa0/0x110 [virtio_vdpa]
[ 3788.514179] vdpa_dev_probe+0xa8/0xd8
[ 3788.514183] really_probe+0xc8/0x3a0
[ 3788.514186] __driver_probe_device+0x84/0x170
[ 3788.514189] driver_probe_device+0x44/0x120
[ 3788.514192] __device_attach_driver+0xc4/0x168
[ 3788.514195] bus_for_each_drv+0x8c/0xf0
[ 3788.514197] __device_attach+0xa4/0x1c0
[ 3788.514200] device_initial_probe+0x1c/0x30
[ 3788.514203] bus_probe_device+0xb4/0xc0
[ 3788.514206] device_add+0x614/0x828
[ 3788.514209] _vdpa_register_device+0x58/0x88
[ 3788.514211] octep_vdpa_dev_add+0x104/0x228 [octep_vdpa]
[ 3788.514215] vdpa_nl_cmd_dev_add_set_doit+0x2d0/0x3c0
[ 3788.514218] genl_family_rcv_msg_doit+0xe4/0x158
[ 3788.514222] genl_rcv_msg+0x218/0x298
[ 3788.514225] netlink_rcv_skb+0x64/0x138
[ 3788.514229] genl_rcv+0x40/0x60
[ 3788.514233] netlink_unicast+0x32c/0x3b0
[ 3788.514237] netlink_sendmsg+0x170/0x3b8
[ 3788.514241] __sys_sendto+0x12c/0x1c0
[ 3788.514246] __arm64_sys_sendto+0x30/0x48
[ 3788.514249] invoke_syscall.constprop.0+0x58/0xf8
[ 3788.514255] do_el0_svc+0x48/0xd0
[ 3788.514259] el0_svc+0x48/0x210
[ 3788.514264] el0t_64_sync_handler+0xa0/0xe8
[ 3788.514268] el0t_64_sync+0x198/0x1a0
[ 3788.514271] ---[ end trace 0000000000000000 ]---
Fix by using virtio_device->device consistently for allocation and deallocation.
Fixes: 4944be2 ("virtio_net: Allocate rss_hdr with devres")
Signed-off-by: Kommula Shiva Shankar <kshankar@marvell.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Link: https://patch.msgid.link/20260102101900.692770-1-kshankar@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>