
F-Stack ff_rss_check() Optimization Introduction

johnjiang edited this page Oct 28, 2025 · 3 revisions


Table of Contents

  1. Overview
  2. Background and Problem Analysis
  3. Optimization Solution: Static RSS Port Table
  4. Configuration Instructions
  5. Performance Improvement
  6. Other Parameter Optimizations and Adjustments
  7. Summary

1. Overview

This document introduces an optimization of the ff_rss_check() function in F-Stack. It adds a precomputed static port lookup table, which significantly improves performance for applications that act as clients and open large numbers of short-lived connections. It addresses a problem in high-concurrency scenarios, where the original ff_rss_check() function could become a performance bottleneck.

  • Core Optimization Commit: e54aa4317b5d81f9f8643e491d8ec0ec1e72282a

2. Background and Problem Analysis

The ff_rss_check() function in F-Stack plays a critical role in allocating a suitable local source port when an application acts as a client and initiates a new TCP connection.

  • Original Mechanism: Each time a new connection was established, ff_rss_check() was called dynamically. It uses a hash calculation over the connection's 4-tuple (the target service's IP and port plus the local IP and port) to find an RSS (Receive Side Scaling) friendly local port. Such a port is one for which return packets are distributed to the queue of the CPU core (process) that owns the connection.
  • Performance Bottleneck: Some scenarios create short-lived connections frequently, such as HTTP requests without keep-alive. In these scenarios, every connection establishment ran port selection and conflict checks one or more times. This involved multiple toeplitz_hash calculations, each costing more than 300 TSC cycles on average. On average, the number of calls per connection was at least the number of queues (processes) on that NIC port. Under high concurrent load, this consumed considerable CPU time and became the bottleneck limiting the connection establishment rate.
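The C sketch below shows what one dynamic check costs and why it gets repeated. It is a minimal illustration under stated assumptions, not F-Stack's code. The following are all hypothetical:

- the 40-byte key, which is the default key from Microsoft's RSS specification;
- the queue mapping (hash & (reta_size - 1)) % nb_queues;
- the 1024-65535 port range;
- the helper names rss_check() and pick_port_dynamic().

The Toeplitz routine itself is the standard bitwise algorithm.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed RSS key: the default 40-byte key from Microsoft's RSS specification.
 * A real deployment may program the NIC with a different key. */
static const uint8_t rss_key[40] = {
    0x6d, 0x5a, 0x56, 0xda, 0x25, 0x5b, 0x0e, 0xc2, 0x41, 0x67,
    0x25, 0x3d, 0x43, 0xa3, 0x8f, 0xb0, 0xd0, 0xca, 0x2b, 0xcb,
    0xae, 0x7b, 0x30, 0xb4, 0x77, 0xcb, 0x2d, 0xa3, 0x80, 0x30,
    0xf2, 0x0c, 0x6a, 0x42, 0xb7, 0x3b, 0xbe, 0xac, 0x01, 0xfa,
};

static void put_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24); p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);  p[3] = (uint8_t)v;
}

static void put_be16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v >> 8); p[1] = (uint8_t)v;
}

/* Bitwise Toeplitz hash: walk the input MSB first and, for every set bit,
 * XOR in the 32-bit key window that starts at that bit position. */
static uint32_t toeplitz_hash(const uint8_t *key, size_t keylen,
                              const uint8_t *data, size_t datalen)
{
    uint32_t hash = 0;
    uint32_t v = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
                 ((uint32_t)key[2] << 8) | key[3];

    for (size_t i = 0; i < datalen; i++) {
        for (int b = 7; b >= 0; b--) {
            if (data[i] & (1u << b))
                hash ^= v;
            v <<= 1;                            /* slide the window one bit */
            if (i + 4 < keylen && (key[i + 4] & (1u << b)))
                v |= 1;                         /* pull in the next key bit */
        }
    }
    return hash;
}

/* Hypothetical stand-in for ff_rss_check(): hash (remote addr, local addr,
 * remote port, local port) the way a return packet is hashed, then map the
 * hash to a queue with an assumed (hash & (reta_size - 1)) % nb_queues rule. */
static int rss_check(uint32_t raddr, uint32_t laddr, uint16_t rport,
                     uint16_t lport, unsigned reta_size, unsigned nb_queues,
                     unsigned queueid)
{
    assert(reta_size > 0 && (reta_size & (reta_size - 1)) == 0);
    uint8_t data[12];
    put_be32(data, raddr);
    put_be32(data + 4, laddr);
    put_be16(data + 8, rport);
    put_be16(data + 10, lport);
    uint32_t h = toeplitz_hash(rss_key, sizeof rss_key, data, sizeof data);
    return (h & (reta_size - 1)) % nb_queues == queueid;
}

/* Simplified dynamic path: draw random local ports until one passes the RSS
 * check (port-occupancy checks omitted). Returns the number of attempts, or
 * 0 if nothing matched within max_attempts. */
static unsigned pick_port_dynamic(uint32_t raddr, uint16_t rport, uint32_t laddr,
                                  unsigned reta_size, unsigned nb_queues,
                                  unsigned queueid, uint16_t *out)
{
    const unsigned max_attempts = 65536;
    for (unsigned n = 1; n <= max_attempts; n++) {
        uint16_t lport = (uint16_t)(1024 + rand() % (65536 - 1024));
        if (rss_check(raddr, laddr, rport, lport, reta_size, nb_queues, queueid)) {
            *out = lport;
            return n;
        }
    }
    return 0;
}
```

For example, pick_port_dynamic(raddr, 80, laddr, 128, 8, 3, &lport) keeps drawing ports until one hashes to queue 3 of 8. If the hash spreads ports evenly, each draw succeeds with probability about 1/8. A connection therefore costs about 8 full Toeplitz hashes before any port-conflict check runs.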

3. Optimization Solution: Static RSS Port Table

The core idea of the optimization is to shift from "dynamic calculation" to "static lookup".

  1. Pre-initialize Static Table:
    • During application startup (ff_init()), a static port lookup table (ff_rss_tbl) is pre-calculated and initialized based on predefined rules in the configuration file.
    • The table holds precomputed mappings of the form {{remote address, remote port} : {local address, [all available local ports]}}. The available local ports are those for which return packets come back to this process's queue. The table also keeps auxiliary data such as the start/end indexes of the available ports and the index of the last selected port.
  2. Efficient Port Selection:
    • When a new connection needs to be established, the system first attempts to look it up in this pre-generated static table.
    • If an unoccupied local port matching the connection's target address and port is found, it is used directly. This path is almost lock-free and cheap (about 100-250 TSC cycles on average).
    • Most importantly, the static table removes the need for repeated RSS calculations. A single lookup yields a local port that guarantees return packets from the remote side are processed by the current process's queue. This greatly speeds up source port selection and thereby raises overall QPS.
  3. Graceful Fallback:
    • The system may find that all eligible ports in the static table are occupied, or that the connection's remote address, remote port, and local address are not configured in the table. In either case, it automatically falls back to the original dynamic ff_rss_check() process, ensuring functional correctness.
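Steps 1 to 3 can be sketched in C as follows. This is a hypothetical layout and lookup, not the actual ff_rss_tbl implementation. The following are assumptions made for illustration:

- the struct fields;
- the port_in_use occupancy map, which stands in for the in_pcblookup_local() conflict check;
- the 1024-65535 port range;
- toy_rss_check(), a placeholder for the real Toeplitz-based check.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define LPORT_MIN 1024u      /* assumed local port range */
#define LPORT_MAX 65535u

/* Hypothetical entry layout; the real ff_rss_tbl structures may differ. */
struct rss_tbl_entry {
    uint32_t raddr;   /* remote (target server) address: "saddr" in config.ini */
    uint16_t rport;   /* remote port: "sport" */
    uint32_t laddr;   /* local address: "daddr" */
    uint16_t *ports;  /* local ports whose return traffic lands on our queue */
    uint32_t nports;
    uint32_t last;    /* index of the last port handed out */
};

/* Hypothetical occupancy map standing in for the in_pcblookup_local() check. */
static uint8_t port_in_use[65536];

/* Placeholder for the real Toeplitz-based ff_rss_check(): pretend this process
 * owns queue 3 of 8 and that a port's queue is simply lport % 8. */
static int toy_rss_check(uint32_t raddr, uint32_t laddr, uint16_t rport,
                         uint16_t lport)
{
    (void)raddr; (void)laddr; (void)rport;
    return lport % 8 == 3;
}

/* Startup (ff_init() time): run the RSS check once per candidate local port
 * and keep every port that passes. Returns 0 on success, -1 if out of memory. */
static int rss_tbl_init(struct rss_tbl_entry *e, uint32_t raddr, uint16_t rport,
                        uint32_t laddr,
                        int (*check)(uint32_t, uint32_t, uint16_t, uint16_t))
{
    assert(e != NULL && check != NULL);
    e->raddr = raddr;
    e->rport = rport;
    e->laddr = laddr;
    e->nports = 0;
    e->last = 0;
    e->ports = malloc((LPORT_MAX - LPORT_MIN + 1) * sizeof *e->ports);
    if (e->ports == NULL)
        return -1;
    for (uint32_t p = LPORT_MIN; p <= LPORT_MAX; p++)
        if (check(raddr, laddr, rport, (uint16_t)p))
            e->ports[e->nports++] = (uint16_t)p;
    return 0;
}

/* Connect time: find the entry for (raddr, rport, laddr) and scan its ports
 * round-robin, starting after the last one handed out. Returns a free port,
 * or 0 when the tuple is not configured or every eligible port is busy; in
 * both cases the caller falls back to the dynamic ff_rss_check() path. */
static uint16_t rss_tbl_select(struct rss_tbl_entry *tbl, size_t n,
                               uint32_t raddr, uint16_t rport, uint32_t laddr)
{
    for (size_t i = 0; i < n; i++) {
        struct rss_tbl_entry *e = &tbl[i];
        if (e->raddr != raddr || e->rport != rport || e->laddr != laddr)
            continue;
        for (uint32_t k = 1; k <= e->nports; k++) {
            uint32_t idx = (e->last + k) % e->nports;
            if (!port_in_use[e->ports[idx]]) {
                e->last = idx;
                return e->ports[idx];
            }
        }
        return 0;   /* all eligible ports occupied */
    }
    return 0;       /* tuple not in the table */
}
```

In this sketch, the expensive RSS check runs once per port at startup instead of once per connection attempt. At connect time, the cost is a table match plus one occupancy probe per candidate port. The table is private to one process, so no locking is needed here.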

4. Configuration Instructions

This feature must be enabled and defined manually in the configuration file (e.g., config.ini). The relevant configuration section is as follows:

[ff_rss_check_tbl]
; Enable or disable the static ff_rss_check table.
; If enabled, F-Stack will initialize this table at APP startup.
; Thereafter, when the APP acts as a client connecting to a server, it will first try to select a local port from this table.
enable=1

; Define port allocation rules, separated by ';'. Each rule has the form <port_id>,<saddr>,<sport>,<daddr>.
; For a given target service (saddr + sport), a rule specifies which local address (daddr) to use;
; port_id is the NIC port ID used to obtain the NIC queue configuration.
rss_tbl=0,192.168.2.10,80,192.168.1.10;0,192.168.2.11,80,192.168.1.10;0,192.168.2.12,80,192.168.1.10

Configuration Parameter Details:

  • enable: Master switch. Must be set to 1 to enable this optimization feature.
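  • rss_tbl: Defines port allocation rules. Each rule lists four comma-separated fields in this order: the NIC port ID (used to look up the queue configuration), the target server address (saddr), the target server port (sport), and the local source address (daddr) to use. Multiple rules are separated by semicolons.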
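Here is a minimal C parser sketch for the rss_tbl format, accepting exactly the comma/semicolon layout shown in the example above. It is not F-Stack's configuration loader. The following are assumptions:

- the struct rss_rule type;
- the parse_rss_tbl() name;
- keeping addresses in network byte order.

```c
#define _POSIX_C_SOURCE 200809L   /* for strdup() and strtok_r() */
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>

/* One parsed rule; field names follow the config comments, the type is hypothetical. */
struct rss_rule {
    uint16_t port_id;   /* NIC port ID, used to look up the queue configuration */
    uint32_t saddr;     /* target server address, network byte order */
    uint16_t sport;     /* target server port, host byte order */
    uint32_t daddr;     /* local source address, network byte order */
};

/* Parse "port_id,saddr,sport,daddr;port_id,saddr,sport,daddr;..." into rules[].
 * Returns the number of rules, or -1 on a malformed rule or too many rules. */
static int parse_rss_tbl(const char *spec, struct rss_rule *rules, int max_rules)
{
    assert(spec != NULL && rules != NULL);
    char *copy = strdup(spec);   /* strtok_r() modifies its input */
    if (copy == NULL)
        return -1;

    int n = 0;
    char *save = NULL;
    for (char *tok = strtok_r(copy, ";", &save); tok != NULL;
         tok = strtok_r(NULL, ";", &save)) {
        unsigned port_id, sport;
        char s[INET_ADDRSTRLEN], d[INET_ADDRSTRLEN];
        struct in_addr sa, da;

        if (n == max_rules ||
            sscanf(tok, "%u,%15[^,],%u,%15s", &port_id, s, &sport, d) != 4 ||
            port_id > UINT16_MAX || sport > UINT16_MAX ||
            inet_pton(AF_INET, s, &sa) != 1 || inet_pton(AF_INET, d, &da) != 1) {
            free(copy);
            return -1;
        }
        rules[n].port_id = (uint16_t)port_id;
        rules[n].saddr = sa.s_addr;
        rules[n].sport = (uint16_t)sport;
        rules[n].daddr = da.s_addr;
        n++;
    }
    free(copy);
    return n;
}
```

Applied to the example rss_tbl value above, it returns 3 rules. All three are on NIC port 0, with target port 80 and local address 192.168.1.10.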

5. Performance Improvement

According to internal test data, this optimization brought significant performance improvements:

  • Test Scenario: A client (wrk) sends HTTP requests to F-Stack Nginx over persistent connections. F-Stack Nginx acts as a reverse proxy. As a client, it frequently opens short TCP connections (without keep-alive) to multiple upstream servers.

[Image: test scenario]

  • Test Data: As shown in the table below.

[Image: test data table]


  • Typical QPS improvement is about 2-6%, and the improvement ratio tends to increase with the number of processes because more processes mean the original dynamic ff_rss_check() calculation requires more attempts on average.
  • In specific scenarios, the improvement exceeds 35%. This happens mainly because of the original dynamic method's random port selection: at certain process counts, it makes far more attempts than the mathematical expectation (roughly the total number of processes plus a few). The extra ff_rss_check() calls consume substantial CPU time.
    • [Note] Different upstream server configurations (number of servers, IPs, ports, etc.) might trigger the same issue at other process counts. In this test scenario, it appeared with 8 and 16 processes. The exact cause has not yet been determined, but checks of the static table, port occupancy, and requests all came back normal.
    • The tables below show ff_rss_check() call-count statistics for 8-process and 12-process Nginx. With 8 processes, the number of random selection attempts clearly exceeds the mathematical expectation (about 8-12 attempts); the same holds for 16 processes.
      • randomtime refers to the kernel parameter net.inet.ip.portrange.randomtime. Because of F-Stack's characteristics, fully random source port selection works better when F-Stack acts as a client. F-Stack does not need strong randomness here: an earlier optimization already replaced the random function with a much faster pseudo-random one.
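The "mathematical expectation" referred to above can be made explicit with a simplified model, not a measurement. The model assumes:

- each of N processes owns one queue;
- candidate ports are drawn uniformly at random;
- the RSS hash spreads the ports evenly across queues.

Under these assumptions, each attempt succeeds with probability 1/N, so the attempt count K is geometrically distributed. The cost line reuses the per-call figures from Section 2.

```latex
% Assumed model: p = 1/N success probability per attempt, attempts independent.
K \sim \mathrm{Geometric}(p = 1/N), \qquad
\mathbb{E}[K] = \frac{1}{p} = N, \qquad
\Pr[K > k] = \left(1 - \frac{1}{N}\right)^{k}

% Example with N = 8 processes:
\mathbb{E}[K] = 8, \qquad
\Pr[K > 24] = \left(\tfrac{7}{8}\right)^{24} \approx 0.041

% Cost per connection, using more than 300 TSC cycles per check:
\text{dynamic} \gtrsim 8 \times 300 = 2400 \text{ TSC cycles}
\quad\text{vs.}\quad 100 \text{ to } 250 \text{ TSC cycles for one static-table lookup}
```

Under this model, attempt counts far above N should be rare, which is why the 8-process and 16-process statistics stand out.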

[Image: ff_rss_check() call-count statistics for 8-process and 12-process Nginx]

  • Perf top screenshots:

[Image: perf top, 8 processes, intermittent randomness]

    • 8 processes, intermittent randomness (mostly non-random): very high number of ff_rss_check() calls; a significant hotspot.

[Image: perf top, 8 processes, fully random]

    • 8 processes, fully random: fewer ff_rss_check() calls and a smaller hotspot.

[Image: perf top, 12 processes]

    • 12 processes: the ff_rss_check() hotspot decreases substantially.
  • QPS performance comparison: 8 processes, intermittent randomness vs. fully random port selection.

[Image: QPS comparison, 8 processes]

6. Other Parameter Optimizations and Adjustments

F-Stack has adjusted some related parameters accordingly. The specifics are as follows. Other services can adjust their configurations flexibly according to their own characteristics.

7. Summary

This optimization of ff_rss_check() is a typical example of F-Stack's pursuit of performance. It removes the core bottleneck in the following ways:

  1. Space for Time: Trades memory for speed by replacing runtime dynamic calculation with a precomputed static table.
  2. Fewer Calls: Significantly reduces the number of retried ff_rss_check() and in_pcblookup_local() calls during port selection. This improves overall performance by 2-6% in general and by over 35% in specific extreme scenarios.
  3. Ensures Compatibility: Guarantees functional correctness under any circumstances through the fallback mechanism.

Recommended scenarios for enabling this feature: any application that uses F-Stack as a client and creates new connections at a high rate, especially short-lived connections. After enabling it, predefine the frequently used target service addresses in the configuration file to gain significant performance benefits.


Note: The initial version of the Chinese document was generated using DeepSeek AI and then manually adjusted. Illustrations were generated and modified by Yuanbao based on human-provided prompts. This English translation aims to faithfully represent the content and structure of the provided Chinese document.

Translated by Yuanbao from the Chinese document "F-Stack ff_rss_check()优化介绍" (F-Stack ff_rss_check() Optimization Introduction).

