[Celestica] Leh800bcls: Fix bcmcfg_parse_error in AgentTrafficPfc*Test tests case due to fragile hardcoded YAML indentation#1316
Conversation
…t tests case due to fragile hardcoded YAML indentation
|
Is this a problem with the config file that you are using? Why is this specific to Leh800bcls? Aren't other platforms using same format? |
@msomasundaran Great question! This is actually not a problem with the configuration file itself, but rather an issue with a fragile assumption in our config-overwriting utility class. This is not an issue specific to leh800bcls; ladakh800bcls also faces a similar problem. Furthermore, YAML syntax allows any consistent number of spaces for indentation. The base config file of leh800bcls is 100% valid, its embedded yamlConfig uses 12 spaces of indentation for properties under logical tables like TM_THD_CONFIG. |
I am still not clear why we don't see this problem on icecube, tahan, montblanc platforms |
@msomasundaran Whether the test passes depends on the YAML formatting. If the YAML alignment matches the original 6-space indentation used by FBOSS, like this, there won't be any issues. I currently don't have access to all the previous platform test data. However, looking back, icecube was also reporting errors. I am not completely certain if this is the final version, so I will reach out to the test owner tomorrow to clarify the details. |
|
@gang-tao Are you using your own config file? I do not see a Leh specific config file in fboss/oss/hw_test_configs |
|
@mikechoifb has imported this pull request. If you are a Meta employee, you can view this in D109721654. |
@ravi861 Yeah, the current configuration is my own. I'm not quite sure when the configurations can be placed here—I didn't see the other ones like santa or ladakh800blcs either. I looked around here and found that a few of the indents are indeed different. At the same level across different configurations, their indentation varies—some have 12 spaces, while others have 6. |
|
@msomasundaran I asked the Santa testers for their configuration, and their indent is set to 6, as shown below: |
Pre-submission checklist
pip install -r requirements-dev.txt && pre-commit installpre-commit runclang-format.............................................................Passed
shellcheck...........................................(no files to check)Skipped
shfmt................................................(no files to check)Skipped
trim trailing whitespace.................................................Passed
fix end of files.........................................................Passed
check yaml...........................................(no files to check)Skipped
check json...........................................(no files to check)Skipped
check for merge conflicts................................................Passed
ruff check...........................................(no files to check)Skipped
ruff format..........................................(no files to check)Skipped
Prevent sai_impl in fboss manifest.......................................Passed
Summary
When running AgentTrafficPfcTests and AgentNetworkAIQosTests (such as verifyPfcWithDefaultCfg) in cold boot mode, the hardware agents (fboss_hw_agent-sai_impl) crash during the SDK initialization stage (init soc). The SDK throws a fatal YAML block mapping parse error: :bcmcfg_parse_error: :1900:7: did not find expected key while parsing a block mapping. This halts the SDKLT Base Driver component initialization (Component BD (2)), causing the hardware agents to exit with SIGABRT and leading to test suite timeouts/failures.
Root Cause
The issue stems from a fragile, string-replacement-based YAML configuration override mechanism used to enable lossless traffic configurations dynamically during PFC tests.
In fboss/agent/test/utils/NetworkAITestUtils.cpp, the utility::applyBackendAsicConfig function rewrites the platform's yamlConfig by replacing "LOSSY" with a multiline string:
This hardcodes exactly 6 leading spaces of indentation for the newly inserted SKIP_BUFFER_RESERVATION: 1 attribute.
While this hardcoded indentation aligns properly on before platforms whose original YAML configs use 6-space indentation, it breaks catastrophically on current platforms like leh800bcls.
In the agent configuration file loaded on the device, the original TM_THD_CONFIG logic table uses 12 spaces of indentation:
When the string replacement executes on this platform, the generated YAML results in an inconsistent indentation structure:
The line " SKIP_BUFFER_RESERVATION: 1" is indented with only 6 spaces (sitting invalidly between the 4 spaces of 0: and the 8 spaces of TM_THD_CONFIG:).
According to YAML block mapping rules, this indentation mismatch is a syntax violation. Consequently, Broadcom SDKLT's YAML parser (bcmcfg) fails during the early config-parsing phase at column 7 of that line (exactly where the key character S in SKIP_BUFFER is located):
Since the config parsing fails, the Base Driver initialization aborts, causing the hardware agents to crash.
Solution
Refactored utility::applyBackendAsicConfig in fboss/agent/test/utils/NetworkAITestUtils.cpp to dynamically find the beginning of the line containing "LOSSY", compute its exact leading indentation string (whether spaces or tabs), and prepend this exact indentation sequence to the newly inserted "SKIP_BUFFER_RESERVATION: 1" line.
Test Plan
All 7 previously crashing tests now pass successfully on both npu0 and npu1 test: