
Conversation

@gautschimi
Contributor

This commit adds the option to increase the FIFO depth in the xbar to values > 1 in order to support multiple outstanding transactions.

Why this is beneficial:
The Ibex instruction cache issues two 32-bit requests to the flash controller. Inside xbar_main, a pipeline register is inserted to break the critical path to the flash. This pipeline register uses a FIFO of depth 1 for req and rsp data, which effectively inserts a bubble after each request and response because the FIFO is immediately full. Ibex and flash_ctrl can both handle up to two outstanding transactions.

The performance impact of this bubble is low because the instruction cache reads the critical word first and thereby hides most of the additional latency introduced by the depth-1 FIFO. Nonetheless, in phases with many cache misses, performance can be improved at the cost of one additional FIFO entry.
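As a rough illustration, here is a toy cycle model in Python (my own sketch, not the actual prim_fifo RTL) of a FIFO without pass-through whose ready signal is derived from the registered occupancy: with depth 1, every granted request is followed by a bubble; with depth 2, requests are granted back to back once the pipeline is primed.

```python
# Toy model: push/pop decisions use the occupancy registered at the start of
# the cycle, so a simultaneous pop does not free space for a push in the same
# cycle (assumed simplification of the real FIFO behaviour).
def granted_requests(depth, cycles=8):
    count = 0
    granted = []
    for _ in range(cycles):
        can_push = count < depth   # upstream request is granted only if not full
        can_pop = count > 0        # downstream consumes one entry per cycle
        granted.append(can_push)
        count += (1 if can_push else 0) - (1 if can_pop else 0)
    return granted

print(granted_requests(1))  # [True, False, True, False, ...] -> a bubble after every request
print(granted_requests(2))  # [True, True, True, True, ...]   -> back-to-back requests
```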

Signed-off-by: Michael Gautschi <[email protected]>
gautschimi force-pushed the xbar_fifo_depth_support branch from 2972f23 to 40202d2 on November 28, 2025 10:57
@gautschimi
Contributor Author

master:
[waveform screenshot]

Requests are granted every second cycle, with one bubble after each request and response.

fifo_depth=2:
[waveform screenshot]

Requests are granted in two consecutive cycles to fetch a full cache line; there are no more bubbles between cache-line requests and responses.

gautschimi marked this pull request as ready for review on November 28, 2025 11:19
@vogelpi
Contributor


Thanks, this looks good and it seems like a nice improvement!

Two questions:

  • Could you observe any performance improvement e.g. for CoreMark?
  • Shall we also enable this for Darjeeling (executes from a big SRAM but also has the I-Cache present)?

rsp_fifo_pass = True

# FIFO depth option. default is 1
# If pipeline is false or req/rsp_fifo_pass are true, this field has no meaning
Contributor


Why does the depth have no meaning if either of the pass options is true? I guess it's just a limitation of the current implementation, right?

Contributor Author


  • If req/rsp_fifo_pass is set, requests can pass through immediately. I guess in this case we could also make the depth variable; on the other hand, the requests and responses can be buffered by the receiving side if pass-through is set.
  • If pipeline is false, the depth is set to 0 (see the sketch below).
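As a rough sketch of how the effective depth could be resolved from these options (assumed logic that mirrors the explanation above, not the actual tlgen code):

```python
def effective_fifo_depth(pipeline: bool, fifo_pass: bool, fifo_depth: int) -> int:
    if not pipeline:
        return 0          # no pipeline register at all: depth is forced to 0
    if fifo_pass:
        return 1          # pass-through: requests forward combinationally and a
                          # single entry only buffers on back-pressure, so the
                          # configured depth is ignored (assumed default of 1)
    return fifo_depth     # registered FIFO: the configured depth takes effect
```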

@gautschimi
Contributor Author

Thanks, this looks good and it seems like a nice improvement!

Two questions:

* Could you observe any performance improvement e.g. for CoreMark?

* Shall we also enable this for Darjeeling (executes from a big SRAM but also has the I-Cache present)?
1. I checked CoreMark, but the performance benefit is very small (<1%), which surprised me a bit. Looking into the cache in more detail explains it: the cache fetches the critical instruction first and immediately passes it to the core, so the miss penalty itself does not improve. With this change, the second instruction arrives immediately after the first one; before, it arrived two cycles later, which can cause one additional stall. CoreMark is a rather small benchmark, though, and there are not enough cache misses for this to make a noticeable difference.

2. As far as I can see, there is no pipeline in the Darjeeling xbar. However, there is a pipeline in the core wrapper:
   https://github.com/lowRISC/opentitan/blob/master/hw/ip_templates/rv_core_ibex/rtl/rv_core_ibex.sv.tpl#L158-L163

Earlgrey sets this parameter to 0, while Darjeeling sets it to 1 (which also sets the FIFO depth to 2, similar to this PR).

I think these are the main differences:

  • Darjeeling: the pipeline register in the core wrapper is enabled for instruction and data requests
  • Darjeeling: all requests (flash, SRAM, ROM, peripherals) are pipelined
  • Earlgrey: the pipeline is only added for requests to the flash (instruction and data)

I'm not sure which is better; I think both can be justified.
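To summarize the placement differences listed above, here is an illustrative Python structure (the key names are my own; only the stated facts are taken from the list, not from the actual top-level configuration files):

```python
# Where the extra pipeline stage sits in each top, per the comparison above.
PIPELINE_PLACEMENT = {
    "darjeeling": {
        "core_wrapper_pipeline": True,   # pipeline register in the rv_core_ibex wrapper,
                                         # covering instruction and data requests
        "pipelined_targets": ["flash", "sram", "rom", "peripherals"],
        "xbar_pipeline": None,           # no extra pipeline stage in the xbar
    },
    "earlgrey": {
        "core_wrapper_pipeline": False,  # wrapper pipeline parameter set to 0
        "pipelined_targets": ["flash"],  # instruction and data requests to the flash
        "xbar_pipeline": "flash device port",
    },
}
```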
