
Handle Event Queue Overflow #1048

Open
Jamidd wants to merge 3 commits into openvehicles:master from TucarApp:tucar/feat/event-overflow-manager

Conversation


@Jamidd Jamidd commented Aug 13, 2024

This PR aims to prevent overflow in the event queue. When the number of available task slots drops to less than a fifth of the original capacity, ticker events are discarded. If more than 30 ticker events are discarded, the OVMS will be restarted, as this indicates that too much time has passed since the last ticker event was processed.

@dexterbg
Member

Jaime,

the PR again mixes two completely independent changes: besides the queue checking, it includes an undocumented and potentially hazardous change to the event task stack size, raising it from 8K to 12K.

Any substantial new stack requirements need to be discussed first, best on the developer list. Stack is allocated from the internal 8 bit RAM segment, which is rather tight already. Depending on the vehicle, there will be additional tasks needing space, and less free 8 bit RAM can make the system less stable in a situation of multiple concurrent command tasks being executed (e.g. via API calls).

So there must be a good & valid reason to generally increase a task stack size. The event task runs all the event handling, thus already has a rather large stack of 8K. What caused a stack overflow for you, did you verify it was in the event task, and how can we reproduce this?

If the situation is setup/config specific (e.g. only occurring when using an MQTT server), instead of a general RAM hit, we should make the event stack size configurable.

On the actual PR topic:

Please try to avoid typos in system messages ("droped"), and be very precise about what is happening, as this is info for developers: "Timer service / ticker timer has died" isn't actually correct. The timer service & tickers are still working perfectly here; the issue is the event task needing too much time to process events.

And that leads to a potentially much more useful addition idea: how about also collecting statistics on which event handlers are the culprits? The event task could log suspiciously long event handling runs, and it could gather average and max handling times, so developers can take a look at which handler needs optimization / fixing.

I added some info on the event currently being handled for the crash aftermath (see m_current_started), but that will only tell about the last handler before the crash. General performance statistics could be very helpful in tracking down issues before they become fatal. Take @frogonwheels' approach for the OBD performance stats as an example.

Regards,
Michael
