Skip to content

Tracking: Hidden Errors #371

@ProfLander

Description

@ProfLander

Recently, a spate of hidden errors have been revealed in xray-monolith (#363, #365, #366, #367, #369, various in ProfLander:xray-monolith)

This is down to the following developments:

  • Introduction of new luabind (bf92f5c)
  • Functional Clang builds via CMake (downstream)
  • Experimentation with the compiler's Address Sanitizer

This issue is for tracking discussion of the broader issue, and tractable solutions to address it while limiting user impact.

Address Sanitizer

#366 was found as the result of my experimenting with /fsanitize=address; a compiler feature which emits additional runtime memory sanity checking.

This is the least-impactful part of the picture, since it's free-standing; any address sanity problems that exist are - as far as we know right now - benign with regard to release builds, aside from as a possible source of rare 'random' crashes reported on occasion by the userbase. OXR's commit and PR lists are useful reference here, as they've been using ASan, UBsan, and LSan (Address, Undefined Behaviour, and Leak sanitizers) to verify against for some time.

This will be tricky to formally integrate into the existing VS2022 solution, as it levies a ~2x slowdown on top of existing overhead, and so would require another hard-to-maintain build preset alongside Verified etc.

CMake allows locally-overridable custom variables within build presets, so this - along with Clang-specific settings for UBSan and LSan - have been integrated in my current local WIP commit.

Clang

My CMake integration recently grew the ability to produce functioning builds via the Clang compiler, which is significantly stricter than MSVC on account of full C++ standards-compliance and more sophisticated (ergo, more brittle) codegen and optimization.

After opening it to testing via Anomaly Discord, three primary error sources have emerged:

Differences in exception handling

Where MSVC uses Structured Exception Handling to render hardware exceptions (ex. Memory Access Violation) catchable, Clang's strict adherence to the C++ standard treats them as hard errors.

This reveals issues like #368, where the removal of a method implementation at some point in Anomaly's history created a memory access violation within ALife brain code, which was caught and hidden by the SEH-empowered luabind exception handler. Or #367, where a design flaw in localization switching resulted in use-after-free due to the distinction between script objects and their underlying concrete C++ value.

As per the luabind manual, this behaviour can be reproduced on MSVC by defining a hard-throwing exception translator, but is likely inadvisable for mainline until the resulting crashes can be fixed.

Undefined Behaviour

4e77c3a..063437f is mainly comprised of commits targeting undefined behaviour, such as the invocation of methods on null references. MSVC permits these calls before failing at some point deep in an invalid call stack, while Clang fails as soon as a null pointer is turned into a reference.

This requires that the code to be corrected with proper pointer access and checking, but is ultimately cleaner and saner.

Local regressions

As part of the CMake work, I fixed all of X-Ray's MSVC-generated compiler warnings, and all cases where spec violations resulted in Clang compiler failures.

These changes have for the most part been inert, but some introduced some logic bugs (incorrect but syntactically valid cast, narrowing conversion) which introduced new crashes, or broke existing systems in subtle ways, such as a failure in XML BOM stripping causing ~50% of stalker definitions - and their related objects - to silently fail loading.

At time of writing, all known fork-level issues have been documented in the tracker, or were fixed prior to its formal introduction.

Lander-Modding#5

I also note that Clang exhibits some ~45k+ compiler warnings at time of writing, resulting in an extremely verbose (i.e. potentially unresponsive, depending on terminal implementation) compile process. I intend to fix these once the build is gameplay-stable, and suspect that many are duplicates arising from X-Ray's include structure, but the scope of the task remains considerable.

luabind

The new build of luabind exhibits different behaviour in exception handling, propagating certain previously-silent catches to lua_error().
This is a more correct behaviour, as it does not mask fatal failures, but exposes crashes to users where they were previously hidden by the implementation.

The X8 door crash is the first major example, but it is likely that more such cases have been revealed following this change.

Supplemental: Exception Handling

The Lua exception-handling story for Anomaly is twofold:

luabind

luabind is compiled with the following defines in non-debug builds; quotes from the manual:

LUABIND_NO_ERROR_CHECKING

If this macro is defined, all the Lua code is expected only to make legal calls. If illegal function calls are made (e.g. giving parameters that doesn’t match the function signature) they will not be detected by luabind and the application will probably crash. Error checking could be disabled when shipping a release build (given that no end-user has access to write custom Lua code). Note that function parameter matching will be done if a function is overloaded, since otherwise it’s impossible to know which one was called. Functions will still be able to throw exceptions when error checking is disabled.

If a function throws an exception it will be caught by luabind and propagated with lua_error().

Safe to say, Anomaly and its extensive suite of user mods do not qualify for given that no end-user has access to write custom Lua code.
As alluded to in #370, this is demonstrably troublesome in practice: Luabind's parameter checking (and the resulting visible errors) still occur, but only in cases where it is structurally critical, thus masking errors in common cases like vector and matrix constructor misuse, and potentially extending to endemic issues like busyhands.

Discussion suggests that this define is present because it causes errors when enabled, but has proven to be more of a band-aid than a workable solution given the number of issues being obscured by compiler and library behavior.

LUABIND_NO_EXCEPTIONS

This define will disable all usage of try, catch and throw in luabind. This will in many cases disable run-time errors, when performing invalid casts or calling Lua functions that fails or returns values that cannot be converted by the given policy. luabind requires that no function called directly or indirectly by luabind throws an exception (throwing exceptions through Lua has undefined behavior).

Where exceptions are the only way to get an error report from luabind, they will be replaced with calls to the callback functions set with set_error_callback() and set_cast_failed_callback().

Initially, this would suggest that xray-monolith's Lua exception formulation is fundamentally unsound: While where exceptions are the only way introduces some ambiguity, if setting this flag promises that no Lua-called C++ will throw exceptions at runtime, and we're seeing all these underlying errors, then surely the rule is being violated constantly in release builds.

However:

LuaJIT

A relevant excerpt from lj_err.c:

** LuaJIT can either use internal or external frame unwinding:
**
** - Internal frame unwinding (INT) is free-standing and doesn't require
**   any OS or library support.
**
** - External frame unwinding (EXT) uses the system-provided unwind handler.
**
** Pros and Cons:
**
** - EXT requires unwind tables for *all* functions on the C stack between
**   the pcall/catch and the error/throw. This is the default on x64,
**   but needs to be manually enabled on x86/PPC for non-C++ code.
**
** - INT is faster when actually throwing errors (but this happens rarely).
**   Setting up error handlers is zero-cost in any case.
**
** - EXT provides full interoperability with C++ exceptions. You can throw
**   Lua errors or C++ exceptions through a mix of Lua frames and C++ frames.
**   C++ destructors are called as needed. C++ exceptions caught by pcall
**   are converted to the string "C++ exception". Lua errors can be caught
**   with catch (...) in C++.
**
** - INT has only limited support for automatically catching C++ exceptions
**   on POSIX systems using DWARF2 stack unwinding. Other systems may use
**   the wrapper function feature. Lua errors thrown through C++ frames
**   cannot be caught by C++ code and C++ destructors are not run.
**
** EXT is the default on x64 systems, INT is the default on all other systems.

In short, the x64-default EXT-mode frame unwinding:

  • Renders Lua C++ exception-safe, allowing destructors to be called while unwinding through what would otherwise be unprotected C
    • Thus, the LUABIND_NO_EXCEPTIONS C insanity implied above is rendered inert, and xray-monolith's usage (and tight coupling with release builds) is a sound formulation.
  • Replaces all exception error information with the string "C++ Exception", which propagates to lua_error and into the popup

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions