Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Another LUAI_ASSERT crash, possibly caused by use of strings and/or calls to external functions (e.g. table.concat) #562

Open
ewmailing opened this issue Mar 7, 2023 · 6 comments

Comments

@ewmailing
Copy link

I believe I have encountered another really hard to isolate/reproduce Pallene bug. This is going to be a terrible bug report because it has been really difficult to isolate and reproduce, and I so far have only encountered it as a piece of a much larger program.
But I want to report this bug so it is at least known and on the radar.

If I compile Lua with -DLUAI_ASSERT, then I (sometimes) get the following assertion failure:

lua: lapi.c:1015: lua_callk: Assertion `((nresults) == (-1) || (L->ci->top - L->top >= (nresults) - (nargs))) && "results from function overflow current stack size"' failed.
Aborted (core dumped)

I only get this sometimes, using the same data inputs. Also, I only have triggered this (thus far) when using LuaLanes. I've been scrutinizing both Lanes and other modules I use, but I don't think they are the problem. I think Lanes relationship is a red-herring. At best, Lanes might be stressing out the system in such a way that it forces the bug to materialize more easily.
But this bug can go for weeks without happening, and then all of a sudden, every single run crashes.

If I pallenec --emit-lua for my entire project, I never get any crashes/assertions, so that's why I believe this is a Pallene bug.

I've been trying to isolate this for a few months now, by moving groups of Pallene functions that I considered suspect, to Lua.
I think I have finally isolated things down just enough that it might be just small enough to start a discussion.

I will be attaching some files. But as a high level summary, the code when it was crashing seemed to be coming from the function:

function m.GenerateBackTestPlotForChartJS(backTestPlot:BackTestPlot, src_file_path:string, tgt_file_path:string, file_basename:string, wants_monolithic:boolean, use_remote_deps:boolean) : (string, string)

It used to be a big giant function, so I've been refactoring and breaking it down for the past few weeks.
There are now 6 helper functions calls:
Generate_JSArrayData_1_*
Generate_JSArrayData_2_*
Generate_JSArrayData_3_*
Generate_JSArrayData_4_*
Generate_JSArrayData_5_*
Generate_JSArrayData_6_*

My crashes stopped for about 2 weeks after this refactoring, so I thought maybe it was the size of the original function that was causing the problem, but the crashes returned.

So continuing on, I started moving the 6 helper functions back and forth between Pallene and Lua. I noticed that keeping 1-4 in Pallene didn't seem to affect crashes, but
5 and 6 seemed to crash often.

I also noticed there was another commonality between 5-6, and a difference between 1-4:
Functions 5 & 6 both call another local helper function, which calls out to externally imported Lua functions (table.concat in this case).

This helper function is:
local function GenerateCodeForChartJSSeries(series_plot:Series, var_name_prefix:string, index:integer) : (string, string, string)
and in the middle/end of the function, there in indirect call to table.concat().

        local ret_data_decl = g_luaFuncs_concat(js_data_array_decl, "", 1, #js_data_array_decl)

So my currently thinking is that this external call may be responsible for the bug. To try to test this theory:

I moved the following to Lua:
m.GenerateBackTestPlotForChartJS
Generate_JSArrayData_5_*
Generate_JSArrayData_6_*

But I modified 5 & 6 so their calls to GenerateCodeForChartJSSeries go to Pallene. This continued to crash.

Then I modified GenerateCodeForChartJSSeries so it no longer calls table.concat. Originally this helper would return the resulting string as a return parameter. I made a change to skip the table.concat() and instead return the array of strings.
Then back in 5 & 6, I call table.concat() in Lua on the now returned array of strings to get back to where I was.

Thus far, my crashes have disappeared. (But as I warned, this crash can be really random, so I can't prove this change is a coincidence or not.)

Also, I did worry maybe table.concat was secretly returning more values than I expected, so I did actually write a wrapper around table.concat to make sure it only ever returns 1 value. Despite my wrapper, I would still get crashes. So I don't think this is the problem.

But for the moment, my theory is that Pallene is incorrectly generating code relating to the calling of external functions (i.e. table.concat), or has something to do with string code in general.

I have lots of places in my codebase that do call external functions in Pallene, and they haven't crashed like this. But the big difference with this module for me, is that this is one of the few areas where I do a lot of string handling in Pallene. Most of my other modules are centered around floats and integers, and not strings.

That said, these functions discussed here are one minor hotspot in my program. While working around this by converting to Lua, my run times for a particular dataset went from about 25 seconds to 28 seconds (~12% slower). So originally I wasn't sure if Pallene would give me any benefit for this section of code, but it turned out to give a pretty good boost when it doesn't crash.

Attached are immediate before & afters of the code discussed here, with the table.concat workaround. This is just the Pallene code. I am not including all the rest of the program because I think it would be too much. (If you really need it, I can provide it.) So there is no test driver program to run/crash. However, you might be able to look/compile at the Pallene generated code, and maybe something might reveal itself to you.

chart_pln_suspectconcat_before.pln.txt
chart_pln_workaround_after.pln.txt

@hugomg
Copy link
Member

hugomg commented Mar 7, 2023

Recently I've been thinking about changing the internal Pallene calling convention to aid debugging. Currently, from the point of view of Lua, the first Pallene call creates a new Lua stack frame and all the immediately nested Pallene calls reuse the same stack frame. I was wondering if we should change this so that every Pallene call creates a new Lua stack frame. That way, if someone raises an exception then the full Pallene call stack would appear in the stack trace. It might also help us keep this class bugs under control. We might be better able to identify which function is is responsible for the bug, because there are separate stack frames for each function; and also it would bring the Pallene calling convention closer to the calling convention that the various LUAI_ASSERT macros expect.

The downside of doing this is it might increase the overhead of Pallene function calls. (I don't know how much, we'd have to try it out...) Anyway, what do you think about this?

@ewmailing
Copy link
Author

My initial reaction is that I like the idea, but only as part of a debug or non-optimized compile mode. Then in an optimized mode (e.g. -O2 or -O3), geared for performance/release, you use the current (presumably faster) technique. I suspect this should also allow you to just add more debugging stuff as you see convenient without agonizing too much over how much it might slow down production code.

Because Pallene is really the last option before needing to rewrite things in a different language, squeezing as much performance out of it as possible is always at the top of my mind.

@bjornbm
Copy link
Contributor

bjornbm commented Mar 8, 2023

@ewmailing: Out of curiousity, when you convert to Lua (as a work-around for the crashes), do you use Pallene's --emit-lua flag or covert “by hand”?

@ewmailing
Copy link
Author

@bjornbm I use --emit-lua. This was the fastest and most reliable way to determine if the crashes were caused by Pallene C generated code or not.

As a slight caveat, the last experiment I discussed to remove table.concat were hand edits on the --emit-lua output. I actually discovered I had a minor copy/paste bug on two of the lines I changed. It doesn't affect the crashing issue discussed here...just produces slightly wrong output for the results of my real program.

@ewmailing
Copy link
Author

@hugomg Your suggestion about the stack frame inspired a different idea in my head. This is an unrelated tangential/supplemental idea to be considered independently and in addition to.

I've never done this before, but how hard is it to inject information into the generated C code so that the native C debugger can be used to help figure out the whereabouts of where a problem might be?

Is the starting place as simple as adding the #line preprocessor directive that maps it back the line number of the .pln file?

https://stackoverflow.com/questions/20085545/how-to-provide-source-level-debugging-info-when-my-compiler-targets-c
https://stackoverflow.com/questions/63702046/how-to-debug-and-step-into-custom-language-sources-transpiled-to-c

Seems like gcc, clang, and Visual Studio all support the #line directive.

I'm thinking out loud here... so for debugging, if we build Lua with debug symbols, as well as the Pallene generated C code, then in a crash...would we be able to see the entire native C stack trace all the way into both Pallene generated code and Lua, plus also know what line number in the .pln file this mapped to? Would this be helpful information?

@hugomg
Copy link
Member

hugomg commented Mar 11, 2023

I'm not sure #line annotations would help much in our original bug. The big problem is not about locating where the crash is, but reproducing the crash and identifying the real culprit (which is far from the crash location).

That said, #line annotations might be useful for other reasons... I'll open a separate discussion for it :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants