-
Notifications
You must be signed in to change notification settings - Fork 29
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Another LUAI_ASSERT crash, possibly caused by use of strings and/or calls to external functions (e.g. table.concat) #562
Comments
Recently I've been thinking about changing the internal Pallene calling convention to aid debugging. Currently, from the point of view of Lua, the first Pallene call creates a new Lua stack frame and all the immediately nested Pallene calls reuse the same stack frame. I was wondering if we should change this so that every Pallene call creates a new Lua stack frame. That way, if someone raises an exception then the full Pallene call stack would appear in the stack trace. It might also help us keep this class bugs under control. We might be better able to identify which function is is responsible for the bug, because there are separate stack frames for each function; and also it would bring the Pallene calling convention closer to the calling convention that the various LUAI_ASSERT macros expect. The downside of doing this is it might increase the overhead of Pallene function calls. (I don't know how much, we'd have to try it out...) Anyway, what do you think about this? |
My initial reaction is that I like the idea, but only as part of a debug or non-optimized compile mode. Then in an optimized mode (e.g. -O2 or -O3), geared for performance/release, you use the current (presumably faster) technique. I suspect this should also allow you to just add more debugging stuff as you see convenient without agonizing too much over how much it might slow down production code. Because Pallene is really the last option before needing to rewrite things in a different language, squeezing as much performance out of it as possible is always at the top of my mind. |
@ewmailing: Out of curiousity, when you convert to Lua (as a work-around for the crashes), do you use Pallene's |
@bjornbm I use --emit-lua. This was the fastest and most reliable way to determine if the crashes were caused by Pallene C generated code or not. As a slight caveat, the last experiment I discussed to remove table.concat were hand edits on the --emit-lua output. I actually discovered I had a minor copy/paste bug on two of the lines I changed. It doesn't affect the crashing issue discussed here...just produces slightly wrong output for the results of my real program. |
@hugomg Your suggestion about the stack frame inspired a different idea in my head. This is an unrelated tangential/supplemental idea to be considered independently and in addition to. I've never done this before, but how hard is it to inject information into the generated C code so that the native C debugger can be used to help figure out the whereabouts of where a problem might be? Is the starting place as simple as adding the #line preprocessor directive that maps it back the line number of the .pln file? https://stackoverflow.com/questions/20085545/how-to-provide-source-level-debugging-info-when-my-compiler-targets-c Seems like gcc, clang, and Visual Studio all support the #line directive. I'm thinking out loud here... so for debugging, if we build Lua with debug symbols, as well as the Pallene generated C code, then in a crash...would we be able to see the entire native C stack trace all the way into both Pallene generated code and Lua, plus also know what line number in the .pln file this mapped to? Would this be helpful information? |
I'm not sure #line annotations would help much in our original bug. The big problem is not about locating where the crash is, but reproducing the crash and identifying the real culprit (which is far from the crash location). That said, #line annotations might be useful for other reasons... I'll open a separate discussion for it :) |
I believe I have encountered another really hard to isolate/reproduce Pallene bug. This is going to be a terrible bug report because it has been really difficult to isolate and reproduce, and I so far have only encountered it as a piece of a much larger program.
But I want to report this bug so it is at least known and on the radar.
If I compile Lua with -DLUAI_ASSERT, then I (sometimes) get the following assertion failure:
I only get this sometimes, using the same data inputs. Also, I only have triggered this (thus far) when using LuaLanes. I've been scrutinizing both Lanes and other modules I use, but I don't think they are the problem. I think Lanes relationship is a red-herring. At best, Lanes might be stressing out the system in such a way that it forces the bug to materialize more easily.
But this bug can go for weeks without happening, and then all of a sudden, every single run crashes.
If I pallenec --emit-lua for my entire project, I never get any crashes/assertions, so that's why I believe this is a Pallene bug.
I've been trying to isolate this for a few months now, by moving groups of Pallene functions that I considered suspect, to Lua.
I think I have finally isolated things down just enough that it might be just small enough to start a discussion.
I will be attaching some files. But as a high level summary, the code when it was crashing seemed to be coming from the function:
function m.GenerateBackTestPlotForChartJS(backTestPlot:BackTestPlot, src_file_path:string, tgt_file_path:string, file_basename:string, wants_monolithic:boolean, use_remote_deps:boolean) : (string, string)
It used to be a big giant function, so I've been refactoring and breaking it down for the past few weeks.
There are now 6 helper functions calls:
Generate_JSArrayData_1_*
Generate_JSArrayData_2_*
Generate_JSArrayData_3_*
Generate_JSArrayData_4_*
Generate_JSArrayData_5_*
Generate_JSArrayData_6_*
My crashes stopped for about 2 weeks after this refactoring, so I thought maybe it was the size of the original function that was causing the problem, but the crashes returned.
So continuing on, I started moving the 6 helper functions back and forth between Pallene and Lua. I noticed that keeping 1-4 in Pallene didn't seem to affect crashes, but
5 and 6 seemed to crash often.
I also noticed there was another commonality between 5-6, and a difference between 1-4:
Functions 5 & 6 both call another local helper function, which calls out to externally imported Lua functions (table.concat in this case).
This helper function is:
local function GenerateCodeForChartJSSeries(series_plot:Series, var_name_prefix:string, index:integer) : (string, string, string)
and in the middle/end of the function, there in indirect call to table.concat().
So my currently thinking is that this external call may be responsible for the bug. To try to test this theory:
I moved the following to Lua:
m.GenerateBackTestPlotForChartJS
Generate_JSArrayData_5_*
Generate_JSArrayData_6_*
But I modified 5 & 6 so their calls to GenerateCodeForChartJSSeries go to Pallene. This continued to crash.
Then I modified GenerateCodeForChartJSSeries so it no longer calls table.concat. Originally this helper would return the resulting string as a return parameter. I made a change to skip the table.concat() and instead return the array of strings.
Then back in 5 & 6, I call table.concat() in Lua on the now returned array of strings to get back to where I was.
Thus far, my crashes have disappeared. (But as I warned, this crash can be really random, so I can't prove this change is a coincidence or not.)
Also, I did worry maybe table.concat was secretly returning more values than I expected, so I did actually write a wrapper around table.concat to make sure it only ever returns 1 value. Despite my wrapper, I would still get crashes. So I don't think this is the problem.
But for the moment, my theory is that Pallene is incorrectly generating code relating to the calling of external functions (i.e. table.concat), or has something to do with string code in general.
I have lots of places in my codebase that do call external functions in Pallene, and they haven't crashed like this. But the big difference with this module for me, is that this is one of the few areas where I do a lot of string handling in Pallene. Most of my other modules are centered around floats and integers, and not strings.
That said, these functions discussed here are one minor hotspot in my program. While working around this by converting to Lua, my run times for a particular dataset went from about 25 seconds to 28 seconds (~12% slower). So originally I wasn't sure if Pallene would give me any benefit for this section of code, but it turned out to give a pretty good boost when it doesn't crash.
Attached are immediate before & afters of the code discussed here, with the table.concat workaround. This is just the Pallene code. I am not including all the rest of the program because I think it would be too much. (If you really need it, I can provide it.) So there is no test driver program to run/crash. However, you might be able to look/compile at the Pallene generated code, and maybe something might reveal itself to you.
chart_pln_suspectconcat_before.pln.txt
chart_pln_workaround_after.pln.txt
The text was updated successfully, but these errors were encountered: