Multi SPI configuration causes occasional corrupted data transfer (IDFGH-11187) #12354

Open
3 tasks done
lilalaunestift opened this issue Oct 4, 2023 · 42 comments
Labels
Status: Opened Issue is new Type: Bug bugs in IDF

Comments

@lilalaunestift

Answers checklist.

  • I have read the documentation ESP-IDF Programming Guide and the issue is not addressed there.
  • I have updated my IDF branch (master or release) to the latest version and checked that the issue is present there.
  • I have searched the issue tracker for a similar issue and not found a similar issue.

IDF version.

v5.1.1-1-gd3c99ed3b8

Espressif SoC revision.

ESP32-D0WD-V3 (revision v3.0)

Operating System used.

Linux

How did you build your project?

VS Code IDE

If you are using Windows, please specify command line type.

None

Development Kit.

Custom Board

Power Supply used.

External 3.3V

What is the expected behavior?

Two SPI buses are used:

  1. The HSPI is configured as master. A DM9051 controller is the slave. The DM9051 driver from ESP-IDF is used.
  2. The VSPI is configured as slave and communicates with another controller.

Reliable data transfer on both configured SPI buses is expected.

What is the actual behavior?

The communication of HSPI master is unstable. Approx. 10% of the messages are corrupted somehow.
This can be observed for both incoming and outgoing data:
For incoming data over the MISO line, it can be observed that data on the SPI bus sent by dm9051 is correct (via Logic Analyzer), but partly faulty data can be found in the receive buffer.
For outgoing data (MOSI), there is correct data in the send buffer, but partly faulty data can be observed on the SPI bus.

Steps to reproduce.

  1. Configuration of the HSPI master:
#define IN_MODULE_nINT_PIN          GPIO_NUM_4
#define SPI_MODULE_MISO_PIN         GPIO_NUM_12
#define SPI_MODULE_MOSI_PIN         GPIO_NUM_13
#define SPI_MODULE_CLK_PIN          GPIO_NUM_14
#define SPI_MODULE_nCS_PIN          GPIO_NUM_15

#define ETHERNET_SPI_HOST           HSPI_HOST
#define ETHERNET_SPI_CLK            5000000
#define ETHERNET_DMA_CHAN           1

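// forward declarations of the helpers defined below
esp_err_t Ethernet_initSpi(void);
esp_err_t Ethernet_initMacPhyController(esp_eth_mac_t **poOutMac, esp_eth_phy_t **poOutPhy);

esp_err_t Ethernet_init(void)
{
    esp_err_t esp_err;
    esp_eth_mac_t *poEthMac = NULL;
    esp_eth_phy_t *poEthPhy = NULL;

    esp_err = Ethernet_initSpi();
    if (esp_err != ESP_OK)
        return esp_err;

    esp_err = Ethernet_initMacPhyController(&poEthMac, &poEthPhy);
    return esp_err;
}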

esp_err_t Ethernet_initSpi(void)
{
    oEthernet.hSpiHandle = NULL;

    spi_bus_config_t oBusConfig = {
            .miso_io_num = SPI_MODULE_MISO_PIN,
            .mosi_io_num = SPI_MODULE_MOSI_PIN,
            .sclk_io_num = SPI_MODULE_CLK_PIN,
            .quadwp_io_num = -1,
            .quadhd_io_num = -1,
    };
    ESP_ERROR_CHECK(spi_bus_initialize(ETHERNET_SPI_HOST, &oBusConfig, ETHERNET_DMA_CHAN));
    return ESP_OK;
}

esp_err_t Ethernet_initMacPhyController(esp_eth_mac_t **poOutMac, esp_eth_phy_t **poOutPhy)
{
    eth_mac_config_t oMacConfig = ETH_MAC_DEFAULT_CONFIG();
    eth_phy_config_t oPhyConfig = ETH_PHY_DEFAULT_CONFIG();
    oPhyConfig.autonego_timeout_ms = 0;
    oPhyConfig.phy_addr = 1;
    oPhyConfig.reset_gpio_num = -1;

    spi_device_interface_config_t oSpiDevConfig = {
            .command_bits = 1,
            .address_bits = 7,
            .mode = 0,
            .clock_speed_hz = ETHERNET_SPI_CLK,
            .spics_io_num = SPI_MODULE_nCS_PIN,
            .queue_size = 20
    };

    eth_dm9051_config_t oDm9051Config = ETH_DM9051_DEFAULT_CONFIG(ETHERNET_SPI_HOST, &oSpiDevConfig);
    oDm9051Config.int_gpio_num = IN_MODULE_nINT_PIN;

    *poOutMac = esp_eth_mac_new_dm9051(&oDm9051Config, &oMacConfig);
    *poOutPhy = esp_eth_phy_new_dm9051(&oPhyConfig);

    return ESP_OK;
}
  2. Configuration of the VSPI slave:
#define OUT_MSP_nINT_PIN            GPIO_NUM_16
#define OUT_MSP_nRDY_PIN            GPIO_NUM_32
#define SPI_MSP_MISO_PIN            GPIO_NUM_19
#define SPI_MSP_MOSI_PIN            GPIO_NUM_23
#define SPI_MSP_CLK_PIN             GPIO_NUM_18
#define SPI_MSP_nCS_PIN             GPIO_NUM_5

#define MSP_HOST                    VSPI_HOST
#define MSP_DMA_CHAN                2

esp_err_t Slave_init(uint32_t nMaxLen)
{
    esp_err_t esp_err;

    // Configuration for the SPI bus
    spi_bus_config_t buscfg =
    {
        .mosi_io_num        = SPI_MSP_MOSI_PIN,
        .miso_io_num        = SPI_MSP_MISO_PIN,
        .sclk_io_num        = SPI_MSP_CLK_PIN,
        .quadwp_io_num      = -1,
        .quadhd_io_num      = -1,
        .max_transfer_sz    = nMaxLen,
        .flags              = SPICOMMON_BUSFLAG_SLAVE,
        .intr_flags         = ESP_INTR_FLAG_LOWMED
    };

    // Configuration for the SPI slave interface
    spi_slave_interface_config_t slvcfg =
    {
        .spics_io_num   = SPI_MSP_nCS_PIN,
        .flags          = 0,
        .queue_size     = 2,    // at least 2
        .mode           = 1,
        .post_setup_cb  = &Msp_postSetupCb,
        .post_trans_cb  = &Msp_postTransCb
    };
    
    //Initialize SPI slave interface
    esp_err = spi_slave_initialize(MSP_HOST, &buscfg, &slvcfg, MSP_DMA_CHAN);

    return esp_err;
}

sdkconfig file:
sdkconfig.txt

Debug Logs.

Here is an example of the observed data corruption:


Data in the buffer as passed to the dm9051 driver (in emac_dm9051_transmit()):

28 6b 35 b2 71 f9 a8 03 2a ee c4 67 08 00 45 00
00 54 9e ba 40 00 ff 01 f7 66 c0 a8 b2 1b c0 a8
b2 1a 00 00 00 c7 00 01 2f 5c 59 65 15 65 00 00
00 00 9b 3e 07 00 00 00 00 00 10 11 12 13 14 15
16 17 18 19 1a 1b 1c 1d 1e 1f 20 21 22 23 24 25
26 27 28 29 2a 2b 2c 2d 2e 2f 30 31 32 33 34 35
36 37

Data observed on the SPI bus via logic analyzer:

28 6B 35 B2 71 F9 A8 03 2A EE C4 67 08 00 45 00 
00 54 9E BA 40 00 FF 01 F7 66 C0 A8 B2 1B C0 A8 
B2 1A 00 00 00 C7 00 01 2F 5C 59 65 15 65 00 00 
00 00 9B 00 00 9B 3E 07 00 00 00 00 00 10 11 12 
13 00 00 00 00 9B 3E 07 00 00 00 00 00 10 11 12 
13 00 00 00 00 9B 3E 07 00 00 00 00 00 10 11 12 
13 00
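
To pin down where the two dumps diverge, a byte-wise comparison is enough. The following is only a sketch: `first_mismatch`, `kExpected` and `kObserved` are hypothetical names, and the arrays hold just the first 64 bytes of each dump above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: index of the first differing byte, or len if equal. */
size_t first_mismatch(const uint8_t *a, const uint8_t *b, size_t len)
{
    for (size_t i = 0; i < len; ++i) {
        if (a[i] != b[i])
            return i;
    }
    return len;
}

/* First 64 bytes of the buffer passed to the driver (copied from the dump). */
const uint8_t kExpected[64] = {
    0x28,0x6b,0x35,0xb2,0x71,0xf9,0xa8,0x03,0x2a,0xee,0xc4,0x67,0x08,0x00,0x45,0x00,
    0x00,0x54,0x9e,0xba,0x40,0x00,0xff,0x01,0xf7,0x66,0xc0,0xa8,0xb2,0x1b,0xc0,0xa8,
    0xb2,0x1a,0x00,0x00,0x00,0xc7,0x00,0x01,0x2f,0x5c,0x59,0x65,0x15,0x65,0x00,0x00,
    0x00,0x00,0x9b,0x3e,0x07,0x00,0x00,0x00,0x00,0x00,0x10,0x11,0x12,0x13,0x14,0x15
};

/* First 64 bytes seen on the bus by the logic analyzer (copied from the dump). */
const uint8_t kObserved[64] = {
    0x28,0x6b,0x35,0xb2,0x71,0xf9,0xa8,0x03,0x2a,0xee,0xc4,0x67,0x08,0x00,0x45,0x00,
    0x00,0x54,0x9e,0xba,0x40,0x00,0xff,0x01,0xf7,0x66,0xc0,0xa8,0xb2,0x1b,0xc0,0xa8,
    0xb2,0x1a,0x00,0x00,0x00,0xc7,0x00,0x01,0x2f,0x5c,0x59,0x65,0x15,0x65,0x00,0x00,
    0x00,0x00,0x9b,0x00,0x00,0x9b,0x3e,0x07,0x00,0x00,0x00,0x00,0x00,0x10,0x11,0x12
};
```

For these two arrays the helper returns 51, so the first 51 bytes of the frame reach the bus unchanged before the repetition begins.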

More Information.

  • The issue first appeared after an update from IDF v4.3 to IDF v5.1. With IDF v4.3 the communication with the dm9051 over the HSPI master works as expected.
  • I tried the Ethernet example for the dm9051 that ships with IDF v5.1 on our hardware setup and it works fine.
  • I moved the initialization shown above into the Ethernet example and it still works fine.
  • When I deactivate the VSPI slave, the HSPI master starts working without any issues.

Is it possible that there is an issue with how DMA usage is balanced between the two buses? It seems that the data is corrupted somewhere between the SPI bus and the dm9051 send/receive buffer. At the same time, I would not suspect an SPI issue, since most of the data is correct and the faulty parts are not arbitrary data (see log above).

@lilalaunestift lilalaunestift added the Type: Bug bugs in IDF label Oct 4, 2023
@espressif-bot espressif-bot added the Status: Opened Issue is new label Oct 4, 2023
@github-actions github-actions bot changed the title Multi SPI configuration causes occasional corrupted data transfer Multi SPI configuration causes occasional corrupted data transfer (IDFGH-11187) Oct 4, 2023
@kostaond
Collaborator

kostaond commented Oct 10, 2023

@lilalaunestift thanks for the nice, detailed report. Could you please try lowering the SPI CLK for the DM9051 to 20 MHz?

Edit: I see, it's actually 5 MHz... What is the SPI CLK of the slave?

@lilalaunestift
Author

Hi, the clock of the slave SPI is running at 4.096 MHz.

@kostaond
Collaborator

One more question...

I moved the shown initialization above to the ethernet example and it is still working fine

Do you mean solely Ethernet, or Ethernet and the SPI slave?

@lilalaunestift
Author

I moved only the Ethernet to the dm9051 example.

@lilalaunestift
Author

Hey,
I did some more measurements with the logic analyzer. I captured both SPI buses this time.
The following two pictures show the two constellations I found in which the SPI master transmits wrong data on the bus. The corruption starts shortly after the SPI slave has finished receiving (the red and purple markers in the capture show approximately where the corruption starts). Again, I can observe that some part of the data is sent repeatedly.
[Image: BrokenFrame]
[Image: BrokenFrame2]

On the other hand, I found transmissions where also both buses are active, but there is no corruption on the SPI master:
[Image: OkFrame]
Here it seems that the SPI slave is considerate of the SPI master.

Hope this helps somehow.
Greetings

@kostaond
Collaborator

@lilalaunestift thank you for the report, we will try to reproduce it and let you know.

@lilalaunestift
Author

Hey @kostaond,
is there any news on this issue so far? Are you able to reproduce this behavior?
Greetings

@kostaond
Collaborator

Hi @lilalaunestift, my colleague is trying to reproduce it. However, he hasn't been able to reproduce it yet...

@kostaond
Collaborator

kostaond commented Nov 1, 2023

@lilalaunestift how long are the expected transactions at the SPI slave? If they are less than or equal to 32B, could you please try to disable DMA at the slave interface?

@kostaond
Collaborator

kostaond commented Nov 1, 2023

We haven't been able to reproduce it. We've based our setup on the SPI Slave example. We have two ESP32s: one as master and one as slave with the DM9051. Could you please provide more information about your SPI slave code?

@KaeLL
Contributor

KaeLL commented Nov 1, 2023

@kostaond

If they are less than or equal to 32B, could you please try to disable DMA at the slave interface?

Care to elaborate on that?

@lilalaunestift
Author

@lilalaunestift how long are the expected transactions at the SPI slave? If they are less than or equal to 32B, could you please try to disable DMA at the slave interface?

The transactions are longer than 32B. Disabling the DMA is unfortunately not an option here.

We haven't been able to reproduce it. We've based our setup on the SPI Slave example. We have two ESP32s: one as master and one as slave with the DM9051. Could you please provide more information about your SPI slave code?

Just for clarification: are you also using a setup where the ESP32 is master and slave at the same time (see picture)?
[Image: 2023-11-03 07_49_17-Window]

@kostaond
Collaborator

kostaond commented Nov 3, 2023

Just for clarification: are you also using a setup where the ESP32 is master and slave at the same time (see picture)?

Yes, we added the DM9051 to the SPI Slave example.

Could you please provide more details about your setup? For example, slave code, what is the traffic (size, period), etc.

@lilalaunestift
Author

lilalaunestift commented Nov 7, 2023

Hey, sorry for the delay.

Regarding our slave code:
99.9% of the data has a length of 201B and is transmitted periodically every 4ms from master to slave. The clock frequency of the slave is 4.096 MHz.
Besides the four SPI lines, there are two additional lines for the communication. Both of them are controlled by the ESP:

  1. MSP_READY: this line signals to the SPI master that the ESP is ready for an SPI transaction
  2. MSP_INT: this line signals to the SPI master that the ESP wants to transmit something

Sending and receiving are done sequentially. So while the ESP slave is sending, the master is only receiving and vice versa.
Here is the part where the slave code is interacting with the SPI driver.

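#include <assert.h>
#include <stdint.h>
#include <string.h>

#include "../inc/Clock.h"
#include "../inc/Crc16.h"

#include "driver/gpio.h"
#include "driver/spi_slave.h"
#include "esp_attr.h"
#include "esp_intr_alloc.h"
#include "freertos/FreeRTOS.h"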



static void Msp_postSetupCb(spi_slave_transaction_t* pTrans);
static void Msp_postTransCb(spi_slave_transaction_t* pTrans);

uint8_t acSnd[268];

typedef struct SMsp
{
    spi_slave_transaction_t t0;
    uint32_t nCpuTick;
}
Msp_t;

Msp_t oMsp;

esp_err_t Msp_init(uint32_t nMaxLen)
{
    esp_err_t esp_err;
    
    memset(&acSnd, 0xff, sizeof(acSnd));
    memset(&oMsp, 0x00, sizeof(Msp_t));

    // Configuration for the SPI bus
    spi_bus_config_t buscfg =
    {
        .mosi_io_num        = SPI_MSP_MOSI_PIN,
        .miso_io_num        = SPI_MSP_MISO_PIN,
        .sclk_io_num        = SPI_MSP_CLK_PIN,
        .quadwp_io_num      = -1,
        .quadhd_io_num      = -1,
        .max_transfer_sz    = nMaxLen,
        .flags              = SPICOMMON_BUSFLAG_SLAVE,
        .intr_flags         = ESP_INTR_FLAG_LOWMED
    };

    // Configuration for the SPI slave interface
    spi_slave_interface_config_t slvcfg =
    {
        .spics_io_num   = SPI_MSP_nCS_PIN,
        .flags          = 0,
        .queue_size     = 2,    // at least 2
        .mode           = 1,
        .post_setup_cb  = &Msp_postSetupCb,
        .post_trans_cb  = &Msp_postTransCb
    };
    
    //Initialize SPI slave interface
    esp_err = spi_slave_initialize(MSP_HOST, &buscfg, &slvcfg, MSP_DMA_CHAN);
    assert(esp_err == ESP_OK);

    return esp_err;
}

void Msp_transReady(void)
{
    // ready to transmit
    gpio_set_level(OUT_MSP_nINT_PIN, 0);
}

void Msp_writeBlock(uint8_t* acSndData, uint8_t* acRcvData, uint32_t nMaxLen)
{
    esp_err_t esp_err;

    oMsp.t0.length      = nMaxLen << 3;
    oMsp.t0.rx_buffer   = acRcvData;
    oMsp.t0.trans_len   = 0;
    oMsp.t0.tx_buffer   = acSndData;
    oMsp.t0.user        = (void*)1;

    esp_err = spi_slave_queue_trans(MSP_HOST, &oMsp.t0, 0);
    assert(esp_err == ESP_OK);
}

void Msp_readBlock(uint8_t* acRcvData, uint32_t nMaxLen)
{
    esp_err_t esp_err;

    oMsp.t0.length      = nMaxLen << 3;
    oMsp.t0.rx_buffer   = acRcvData;
    oMsp.t0.trans_len   = 0;
    oMsp.t0.tx_buffer   = acSnd;
    oMsp.t0.user        = (void*)0;

    esp_err = spi_slave_queue_trans(MSP_HOST, &oMsp.t0, 0);
    assert(esp_err == ESP_OK);
}

void Msp_getTransResult(void)
{
    esp_err_t esp_err;
    spi_slave_transaction_t * pTrans = NULL;

    esp_err = spi_slave_get_trans_result(MSP_HOST, &pTrans, 0);
    // no finished transaction available: pTrans must not be dereferenced
    if (esp_err != ESP_OK || pTrans == NULL)
        return;

    if ((uint32_t)pTrans->user != 0)
    {
        // not ready to transmit
        gpio_set_level(OUT_MSP_nINT_PIN, 1);
    }
    if (pTrans->tx_buffer == acSnd)
        pTrans->tx_buffer = NULL;

    SioMsp_onEvTransComplete(   pTrans->tx_buffer,
                                pTrans->rx_buffer,
                                pTrans->trans_len   );
}

// called after a transaction is queued and ready for pickup by master.
static void IRAM_ATTR Msp_postSetupCb(spi_slave_transaction_t* pTrans)
{
    // wait 1 µs if necessary! MSP must see this negative edge!
    while (Clock_getCpuTicks() - oMsp.nCpuTick < eTick_tm1us);

    gpio_set_level(OUT_MSP_nRDY_PIN, 0);
}

// called after transaction is sent/received.
static void IRAM_ATTR Msp_postTransCb(spi_slave_transaction_t* pTrans)
{
    BaseType_t xHigherPriorityTaskWoken;

    // not ready to receive or transmit
    gpio_set_level(OUT_MSP_nRDY_PIN, 1);

    oMsp.nCpuTick = Clock_getCpuTicks();

    // call spi_slave_get_trans_result ...
    xHigherPriorityTaskWoken = pdFALSE;

    SioMsp_onPostTransFromISR(&xHigherPriorityTaskWoken);

    if (xHigherPriorityTaskWoken == pdTRUE)
        portYIELD_FROM_ISR();
}

Thanks and Greetings

@lilalaunestift
Author

Hey @kostaond ,
did you find any useful information in the shared code that could help?

Assuming the problem is related to the DMA, is there anything I can do to track the issue down or provide additional debug information? The documentation does not give much information about the topic, so I don't really know where to start.
Greetings

@kostaond
Collaborator

@lilalaunestift sorry for not replying, I was busy with other tasks. However, the provided code still leaves room for uncertainty. We invested quite some time in the previous attempt using a modified SPI Slave example. Therefore I would much appreciate it if you could provide a fully functioning minimal project with which you are able to demonstrate the issue. We need to reproduce it on our side to move forward. I discussed it with the team responsible for SPI and they indicated that the issue could be on the HW design side (PCB)...

@lilalaunestift
Author

Ok, I will try to create a minimal project. I guess this will take some days till I find the time. I will let you know.
Regarding the HW design:
The reason why we have not investigated the PCB side so far is that everything was working fine with IDF 4.3 and earlier versions.
Is there any specific assumption about what could cause the issue on the PCB? I could then ask our HW guy to have a look at it.
Thanks and greetings

@lilalaunestift
Author

Hey,
it took some time, but I managed to create a minimal example for the esp32 with which I am able to reproduce the issue.
SPI_Issue_min_example.zip
Some explanations:

  1. The example can receive messages from an external SPI master device and does nothing with them. In my setup the attached SPI master transmits 204B of data every 4ms at a bus frequency of 4.096 MHz.
  2. I made the Espressif basic Ethernet example part of the project (with some smaller changes) and use it to drive the dm9051.

The mentioned additional pins for the SPI bus are not used in this minimum project (they are set to a fixed state and do not participate in the communication).

If I now use 'ping' to send ICMP packets to the esp32, roughly 7-10% of the messages are lost or damaged.
When deactivating the Slave_task, 100% are received.
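
As a back-of-the-envelope figure for this traffic pattern, one can estimate how much of the time the slave bus is clocking data. This is only a sketch under assumptions: `bus_busy_fraction` is a hypothetical helper, it uses the 4.096 MHz slave clock stated earlier in the thread, and it ignores chip-select setup time and gaps between bytes.

```c
/* Hypothetical helper: fraction of each period during which SCLK is active.
 * frame_bytes: bytes per transfer, sclk_hz: SPI clock, period_s: repeat period. */
double bus_busy_fraction(unsigned frame_bytes, double sclk_hz, double period_s)
{
    double transfer_s = (frame_bytes * 8.0) / sclk_hz; /* 8 clock cycles per byte */
    return transfer_s / period_s;
}
```

With 204 bytes at 4.096 MHz every 4 ms, `bus_busy_fraction(204, 4.096e6, 0.004)` comes out just under 0.1, i.e. the slave bus is active roughly 10% of the time. That is in the same range as the observed 7-10% loss, which fits the picture of corruption only when both buses are active at once, but it is not proof of it.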

I'm still using the same setup regarding IDF and HW as mentioned in the beginning.

Greetings.

@lilalaunestift
Author

Hey,
have you already found the time to take a look into the example?
Greetings

@kostaond
Collaborator

Hi @lilalaunestift, yes, we've given it a try but ran into some trouble. I'll get back to you once there is something to share. Please be patient.

@lilalaunestift
Author

Ok, great. Thank you very much for the update.

@lilalaunestift
Author

Hi @kostaond,
may I ask for a small update on this topic? Is it possible to reproduce the issue with the provided example?
Greetings

@kostaond
Collaborator

kostaond commented Jan 24, 2024

Hi @lilalaunestift, we had issues with the SPI master... In the end, I needed to implement it on a bare-metal SAM3S MCU to achieve the 4 ms period. Therefore it took some time to find appropriate hardware and to prepare all the infrastructure and the test setup. Anyway, I was able to reproduce the issue with the minimal code example you provided.

The good thing is that I probably found the root cause of the issue. Your Rx buffer is not 32-bit aligned:

typedef struct SData
{
    uint8_t acData[258];  // !!!
}
Data_t;

Memory alignment is required by the DMA engine; otherwise the DMA may write incorrectly or not in a boundary-aligned manner. When I changed the Rx buffer size to 256B and transmitted the SPI message with the same size, there were no lost ping packets (I tried with ping 10.10.10.104 -i 0.5).

The problem is that the driver didn't report an error, as it should have, when incorrect alignment was used. I've already reported this issue to the SPI colleagues.
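
Until the driver reports this itself, an application could guard its own Rx buffers before queuing a transaction. A minimal sketch, assuming the requirement is exactly "address and length are multiples of 4 bytes"; `dma_rx_buffer_ok` is a hypothetical helper and not part of ESP-IDF.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical application-side check (not an ESP-IDF function):
 * true if both the buffer address and the length are 32-bit aligned. */
bool dma_rx_buffer_ok(const void *buf, size_t len_bytes)
{
    return buf != NULL
        && ((uintptr_t)buf % 4u) == 0   /* start address on a word boundary */
        && (len_bytes % 4u) == 0;       /* length is a whole number of words */
}
```

With this check, a 258-byte buffer like `acData[258]` above would be rejected because of its length, while a 256-byte buffer at a word-aligned address would pass.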

@KaeLL
Contributor

KaeLL commented Jan 24, 2024

@kostaond Do the restrictions described on the linked page also apply to spi_master?

@kostaond
Collaborator

@kostaond Do the restrictions described on the linked page also apply to spi_master?

Very good question. Yes, they apply. I'm not sure whether the check is correctly implemented in the code, though. I asked the SPI team to double-check.

@lilalaunestift
Author

Hi @kostaond,
that's good news! Thank you very much for the effort!
I will look into this topic tomorrow and then provide some feedback.

@lilalaunestift
Author

Hi @kostaond,
I did some tests with your suggested change but it seems that the issue still persists.

The non-word-aligned buffer you mentioned is something I introduced while creating the minimal example. Sorry for that. In our actual code, the Data struct is only one part of the bigger struct Frame_t, which acts as the receive buffer. For simplification I removed the other parts, so only Data_t was left. In the actual code there are asserts that make sure the buffer is word aligned and has the correct length:

#pragma pack(push, 1)

typedef struct SFrame
{
    union
    {
        Data_t      Data;
        Packet_t    Packet;
    };
    uint16_t    nLen;
    struct
    {
        ESioAddr_t  eSioDstPortAddr;
        ESioAddr_t  eSioSrcPortAddr;
    };
}
Frame_t;

#pragma pack(pop)

// make sure that some properties hold:
_Static_assert(sizeof(Frame_t) == 268, "wrong Frame_t Size");
_Static_assert(sizeof(Frame_t) %  4 == 0, "Frame_t Array must be word aligned");
_Static_assert(sizeof(Data_t) == 258, "wrong Data_t Size");
_Static_assert(sizeof(Packet_t) < 258, "wrong Packet_t Size");
_Static_assert(OFFSET(Frame_t, Data) % 4 == 0, "wrong Data Offset");
_Static_assert(OFFSET(Frame_t, Packet) % 4 == 0, "wrong Packet Offset");

The actual call to Msp_readBlock looks like this:

...
static void SioMsp_rcvBuffer(Frame_t* pRcvFrame)
{
    Msp_readBlock(&pRcvFrame->Data.acData[0], sizeof(Frame_t));
}
...

where sizeof(Frame_t) is applied as length to the spi_slave_transaction_t struct.
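
One detail worth separating out: the asserts above constrain the size of Frame_t and the offsets of its members, not the address at which an instance is placed. With `#pragma pack(1)`, compilers such as GCC typically give the struct an alignment requirement of 1, so the address has to be pinned separately. A minimal sketch, assuming C11; `DemoFrame_t`, `g_demoFrame` and `demo_frame_is_word_aligned` are hypothetical stand-ins, and the two address fields are assumed to take 4 bytes each.

```c
#include <stddef.h>
#include <stdint.h>

#pragma pack(push, 1)
typedef struct {
    uint8_t  acData[258];  /* payload, same size as Data_t in the thread */
    uint16_t nLen;
    uint8_t  aAddr[8];     /* stand-in for the two ESioAddr_t fields (assumed size) */
} DemoFrame_t;
#pragma pack(pop)

/* Size and offset checks, as in the original asserts. */
_Static_assert(sizeof(DemoFrame_t) == 268, "wrong DemoFrame_t size");
_Static_assert(sizeof(DemoFrame_t) % 4 == 0, "size must be a multiple of 4");
_Static_assert(offsetof(DemoFrame_t, acData) % 4 == 0, "wrong acData offset");

/* Explicitly request word alignment for the instance itself. */
_Alignas(4) DemoFrame_t g_demoFrame;

/* Run-time confirmation that the instance starts on a word boundary. */
int demo_frame_is_word_aligned(void)
{
    return ((uintptr_t)&g_demoFrame % 4u) == 0;
}
```

Whether the real receive buffers already end up word aligned depends on how they are declared or allocated, which the thread does not show.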

Anyways, I tested your suggested changes with the provided minimal example and I got the following results:

receive buffer size | lost ping packets
--------------------|------------------
258B                | 7-10%
256B                | 7-10%
260B                | 7-10%
204B                | 1-2%
208B                | 7-10%

The data transmitted by our SPI master is (in 98% of the cases) 204B in length.
If I make the receive buffer fit this length, I get a much better result (but still too many packets are lost). If I just increase the buffer by 4B to 208B, I'm back to the huge packet loss of more than 7%...

Can you confirm this behavior with your setup?

By the way, I did the tests with:

ping <ip> -c 100 -i 0.5

Greetings

@kostaond
Collaborator

kostaond commented Feb 2, 2024

@lilalaunestift if I set the acData buffer to a size greater than the data actually transmitted from the master, I am able to reproduce the issue. In other words:

  • If acData[256] and SPI frames transmitted from master are 204B, I observe ping loss.
  • If acData[204] and SPI frames transmitted from master are 204B, I do NOT observe ping loss.
  • If acData[256] and SPI frames transmitted from master are 256B, I do NOT observe ping loss.

This could be your workaround. However, something is probably wrong somewhere. I'll pass it to SPI team. My work is done here since it is beyond my specialization... I'm responsible for Ethernet...

@lilalaunestift
Author

Hey @kostaond,

I assume that the cause of the 1-2% losses I still observe when the buffer is configured to 204B in length is that some of our messages transmitted by the SPI master are shorter than 204B. So there are still some occasions where the size of the buffer and the message don't match.

Anyways, thank you very much for your effort so far. I will then wait for some information from the SPI team.
Greetings

@wanckl
Collaborator

wanckl commented Feb 6, 2024

IDF SPI team hide in corner and scare a lot 🤣

@wanckl
Collaborator

wanckl commented Feb 6, 2024

@lilalaunestift As far as I remember, due to the DMA hardware architecture of the ESP32, the actual transfer length in the RX direction needs to be word-aligned, regardless of whether the ESP32 is master or slave.

That means that if you use the ESP32 as a slave with DMA, you need to configure the slave RX buffer address and length to be word-aligned. In addition, the master side should send an actual length that is also word-aligned.

Though that can't explain: If acData[256] and SPI frames transmitted from master are 204B, I observe ping loss

However, the chips after the ESP32 (S2, C3, ...) don't have this limitation. If you have one of them, it should work without issue.

@lilalaunestift
Author

Hey @wanckl,
yes, the buffers must be word aligned, and they all are. After kostaond mentioned the issue he found in the minimal example, I double-checked our actual code base (see the code snippet from last week).
Also, I want to point out again that everything was working fine for two years with older versions of the IDF, where the word alignment restriction was the same. The issue started with the update to IDFv5.1.

Changing to another type of the esp32 is not an option since the product is already in the market for two years with the esp32.

@wanckl
Collaborator

wanckl commented Feb 7, 2024

@lilalaunestift yes,

but you mentioned that "some of our messages transmitted by the SPI master are shorter than 204B". So I want to point out that the master transfer length also needs to be aligned to 4 bytes; otherwise the ESP32 slave will also receive broken packets.

By the way, do you mean that even 5.0 works fine?

@lilalaunestift
Author

Ok, sorry. I didn't get that you were talking about the length of the transmitted data.
But this is also always word aligned. This line in the master code ensures that the transmitted length is always a multiple of 4:

oSioSpi.nSndCnt     = (pPacket->Header.cTotLen + 3) & 0xfffc;
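
The same rounding can be written as a standalone helper to see what it does to the lengths mentioned in this thread. A minimal sketch; `round_up_to_word` is a hypothetical name, and it uses a full-width mask instead of the 16-bit `0xfffc` from the master code, which is only equivalent while the length fits in 16 bits.

```c
#include <stdint.h>

/* Hypothetical helper: round a byte count up to the next multiple of 4. */
uint32_t round_up_to_word(uint32_t nBytes)
{
    return (nBytes + 3u) & ~(uint32_t)3u;
}
```

For example, `round_up_to_word(201)` gives 204, which matches the 201B payload and the 204B transfers reported earlier, and a length that is already a multiple of 4 is left unchanged.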

IDF5.0 is not tested. The issue occurred while updating the code from IDFv4.3 to IDFv5.1 (the latest version at that time).

@wanckl
Collaborator

wanckl commented Feb 7, 2024

@lilalaunestift

So the issue now is on the slave side: the slave sometimes can't receive correct data, right? The slave send direction and the master side are OK.
Then, could you tell at what point the slave transaction breaks, and give the details of the broken transaction?

Besides, I think the IDF SPI team is also leaving for the Spring Festival, so there may be no update for several days.

@kostaond
Collaborator

kostaond commented Feb 7, 2024

So the issue now is on the slave side: the slave sometimes can't receive correct data, right? The slave send direction and the master side are OK.
Then, could you tell at what point the slave transaction breaks, and give the details of the broken transaction?

It's even worse. It seems the slave transactions are OK but they somehow affect transmit side of SPI master (SPI Ethernet DM9051) which is connected to the other SPI interface.

@lilalaunestift
Author

@wanckl

Details of broken transaction

Regarding the data corruption (see also one of the first comments):

Here is an example of the observed data corruption:


Data in the buffer as passed to the dm9051 driver (in emac_dm9051_transmit()):

28 6b 35 b2 71 f9 a8 03 2a ee c4 67 08 00 45 00
00 54 9e ba 40 00 ff 01 f7 66 c0 a8 b2 1b c0 a8
b2 1a 00 00 00 c7 00 01 2f 5c 59 65 15 65 00 00
00 00 9b 3e 07 00 00 00 00 00 10 11 12 13 14 15
16 17 18 19 1a 1b 1c 1d 1e 1f 20 21 22 23 24 25
26 27 28 29 2a 2b 2c 2d 2e 2f 30 31 32 33 34 35
36 37

Data observed on the SPI bus via logic analyzer:

28 6B 35 B2 71 F9 A8 03 2A EE C4 67 08 00 45 00 
00 54 9E BA 40 00 FF 01 F7 66 C0 A8 B2 1B C0 A8 
B2 1A 00 00 00 C7 00 01 2F 5C 59 65 15 65 00 00 
00 00 9B 00 00 9B 3E 07 00 00 00 00 00 10 11 12 
13 00 00 00 00 9B 3E 07 00 00 00 00 00 10 11 12 
13 00 00 00 00 9B 3E 07 00 00 00 00 00 10 11 12 
13 00

In this example data should be transmitted to the ethernet controller.
Correct data is passed to the driver, but on the SPI bus I can observe that some part of the data suddenly gets repeated.

The same can be observed when data is received from the ethernet controller:
Data on the SPI bus looks fine, but data in the receive buffer is corrupted as shown above.

When is the transaction broken

As Kostaond says, the SPI slave communication is fine. It seems that the SPI Slave causes issues on the SPI master.
If I deactivate the SPI slave communication, the SPI master works fine and communicates without problems with the ethernet board. If I activate the SPI slave again, I get the described issues on the SPI master.

In the attached pictures here you can see that the issue occurs, when SPI master transmits/receives data and SPI slave receives data at the same time.
#12354 (comment)

The first part of the received data is correct. But when the transaction on the SPI slave is finished, the data corruption on the SPI master starts and I can observe the repeating pattern as shown above.

And it seems that the issue on the SPI master only occurs, when the received data on the SPI slave is shorter than the specified receive buffer size.

@lilalaunestift
Author

Hi again,
is there any news on this issue you can share?
Greetings

@wanckl
Collaborator

wanckl commented Mar 7, 2024

😢 A bit busy recently

@keLimbum

Hey @wanckl,

is there any news on this topic? Updating the IDF has now become important for us since we need some of the new features.

I updated to IDFv5.4 and I still can observe the described issue.

To summarize the issue:

We are using two SPI buses: the HSPI is configured as master and communicates with a DM9051 board, while the VSPI is configured as slave and communicates with an MSP430i controller.

With IDF4.3 everything was working fine. Since the update to IDFv5.0 (and higher) I can observe corrupted data on the HSPI master bus. The corruption happens when data transmission occurs on both SPI buses simultaneously.

Following is an example of the data corruption (with IDF v5.4):

I updated the given minimum example earlier in this thread to the latest IDF Ethernet example:

Tools.iConTrace.Firmware.Esp.zip

I try to receive and answer ICMP packets.

Reading the dm9051 input buffer, I can verify that the packet is received correctly:

Received data:

e2 e2 e6 1d 8d d0 28 6b  35 b2 71 f9 08 00 45 00
00 54 2a a5 00 00 80 01  2a 82 c0 a8 b2 1b c0 a8
b2 15 08 00 b8 49 00 0e  00 5e 56 3f d8 67 00 00
00 00 4a d0 07 00 00 00  00 00 10 11 12 13 14 15
16 17 18 19 1a 1b 1c 1d  1e 1f 20 21 22 23 24 25
26 27 28 29 2a 2b 2c 2d  2e 2f 30 31 32 33 34 35
36 37

In the transmit buffer of the dm9051 driver I can observe that the answer is passed to the driver correctly:

Data to transmit:

28 6b 35 b2 71 f9 e2 e2  e6 1d 8d d0 08 00 45 00
00 54 2a a5 00 00 40 01  6a 82 c0 a8 b2 15 c0 a8
b2 1b 00 00 c0 49 00 0e  00 5e 56 3f d8 67 00 00
00 00 4a d0 07 00 00 00  00 00 10 11 12 13 14 15
16 17 18 19 1a 1b 1c 1d  1e 1f 20 21 22 23 24 25
26 27 28 29 2a 2b 2c 2d  2e 2f 30 31 32 33 34 35
36 37

But when observing the data transmitted on the SPI bus (with a logic analyzer), I can see that the data is corrupted:

[Image: logic analyzer capture of both SPI buses]
Here you can see that a message is received on the VSPI bus (upper) while a message is transmitted on the HSPI bus (lower) at the same time.

The first part of the message is correct, but the second part is corrupted:

[Image]

The corrupted answer reads as follows (captured with Wireshark):

28 6b 35 b2 71 f9 e2 e2  e6 1d 8d d0 08 00 45 00
00 54 2a a5 00 00 40 01  6a 82 c0 a8 b2 15 c0 a8
b2 1b 00 00 c0 49 00 0e  00 5e 56 3f d8 67 00 00
00 00 4a d0 07 00 00 4a  d0 07 00 00 00 00 00 10
11 12 13 00 00 00 00 4a  d0 07 00 00 00 00 00 10
11 12 13 00 00 00 00 4a  d0 07 00 00 00 00 00 10
11 12

Parts of the payload are suddenly repeated. This always starts shortly after a message on the VSPI bus has been received completely (see picture). My assumption is that the DMA starts copying the received VSPI message at that point and interferes with the data for the HSPI bus.
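
The repetition can be quantified from the capture itself. The sketch below looks for the smallest shift p at which every byte equals the byte p positions later; `smallest_period` and `kCorruptedTail` are hypothetical names, and the array holds the last 34 bytes of the Wireshark dump above (offsets 64 to 97).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: smallest p such that buf[i] == buf[i + p] for all
 * valid i; returns len if the data does not repeat at all. */
size_t smallest_period(const uint8_t *buf, size_t len)
{
    for (size_t p = 1; p < len; ++p) {
        size_t i = 0;
        while (i + p < len && buf[i] == buf[i + p])
            ++i;
        if (i + p == len)
            return p;
    }
    return len;
}

/* Last 34 bytes of the corrupted answer (copied from the capture). */
const uint8_t kCorruptedTail[34] = {
    0x11,0x12,0x13,0x00,0x00,0x00,0x00,0x4a,0xd0,0x07,0x00,0x00,0x00,0x00,0x00,0x10,
    0x11,0x12,0x13,0x00,0x00,0x00,0x00,0x4a,0xd0,0x07,0x00,0x00,0x00,0x00,0x00,0x10,
    0x11,0x12
};
```

For this tail the helper returns 16, i.e. the corrupted part repeats the same 16-byte block. Whether that block size corresponds to anything in the DMA or SPI hardware is not established in this thread.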

Receive and send buffer both are word aligned. See also this post and the following for that topic.

Regards,
lilalaunestift (but with a new account)

@keLimbum

One more question:
for the internal EMAC it's possible to set the DMA burst length to 4 to avoid frame corruption:

e82dce7

Is this issue related? And is there a similar option for use with an external mac?

@kostaond kostaond assigned wanckl and unassigned kostaond Mar 18, 2025
6 participants