Hypothesis not finding failing examples #77
Hmm. Unfortunately I don't have any silver-bullet style advice; I think this is just a case where failing examples are a very small fraction of your input space. If you can work out how to summarise "goodness" of an example as a real number (or several), you could use `target()` to steer generation towards better examples. Otherwise... you can probably make examples somewhat more diverse by complicating your strategies. This usually isn't worth it, but does kinda work.
Thanks for responding. Actually, now that I think about it, the problem might not be the empty tuples. I think the issue is that I have a lot of different cases for this particular function that raise `NotImplementedError`. I currently just ignore `NotImplementedError` in the test, so that it can "automatically" work when those cases become implemented. I actually used to `assume(False)` for `NotImplementedError`, but I removed it when hypothesis told me there were too many filtered tests. But perhaps I should have heeded that warning. I'm pretty sure that once I implement more cases, the error that I currently can't find would be found much more easily. Maybe I should make a separate test with a more restrictive set of strategies, which I know will be implemented. Is there a way to decorate the same test twice to generate two different tests?

Another question, which I think will actually be a bigger problem going forward: is there some way to tell hypothesis that two strategies should be "mostly" related to one another in some way? To be more concrete, suppose you have a boolean array index:

```python
# The actual strategies used in ndindex are a little bit more complicated than this
from hypothesis import given
from hypothesis.extra.numpy import arrays
from hypothesis.strategies import integers, tuples
from numpy import bool_

shapes = tuples(integers(0, 10))
boolean_arrays = arrays(bool_, shapes)

@given(boolean_arrays, shapes)
def test_booleanarray(idx, shape):
    ...
```

Maybe I can use something like `shared()` for this?
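One way to express that kind of relationship is Hypothesis's `shared()` strategy, which is what the later commit in this thread ends up using. A minimal sketch, mirroring the simplified example above rather than the real ndindex strategies:

```python
from hypothesis import given
from hypothesis.extra.numpy import arrays
from hypothesis.strategies import integers, shared, tuples
from numpy import bool_

shapes = tuples(integers(0, 10))

# Every draw from shared_shapes within a single example returns the same
# value, so the boolean array's shape and the test array's shape stay in sync.
shared_shapes = shared(shapes, key="array-shape")
boolean_arrays = arrays(bool_, shared_shapes)

@given(boolean_arrays, shared_shapes)
def test_booleanarray(idx, shape):
    assert idx.shape == shape  # the two arguments are now related
```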
See issue Quansight-Labs#77. This implementation of checking whether the array shapes are subsequences of the test array shape doesn't seem to be fine-grained enough for hypothesis to find a simple bug, so I think it will need to be improved.
I played with using `target()`, and I've had mixed results. I eventually did get it to find the intentional bug, but only after many examples. It isn't really clear to me whether the targeting made it find the bug any faster (and I'm not sure how to test that, especially now that the failure is in the database). Based on the output of my print statements, it doesn't really seem to be smart enough to increase the target values.
For related shapes, you probably want some combination of `shared()` and `flatmap()`. Yeah... targeting does tend to increase the score pretty well, but it only kicks in halfway through, and arrays are a hard structure for it. Coming up with a metric that correlates well with bugs is often a lot harder than it sounds, too... As to empirically checking the impact, you can run a number of tests with different random seeds and compare the failure rates.
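For reference, a minimal sketch of what `target()` looks like in practice; the `closeness` metric below is a made-up stand-in for whatever "goodness" score makes sense for the ndindex tests, not the one actually used:

```python
from hypothesis import given, target
from hypothesis.strategies import integers, tuples

shapes = tuples(integers(0, 10), integers(0, 10))

@given(shapes, shapes)
def test_with_target(a, b):
    # Hypothesis biases later generation towards examples that score highly here.
    # This (hypothetical) metric rewards shapes that agree in their leading
    # dimensions, i.e. the "related shapes" cases discussed above.
    closeness = sum(x == y for x, y in zip(a, b))
    target(float(closeness), label="matching leading dimensions")
    # ... the real assertions would go here ...
```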
Nice.
I've been using this to test the effectiveness of the strategies (after introducing a bug that would only be found if a good example is generated):

```python
from hypothesis import seed, settings, Phase
from ndindex.tests.test_newshape import test_newshape_hypothesis

import warnings
warnings.filterwarnings('error')

# Run only the generate and shrink phases with a large example budget.
func = settings(phases=[Phase.generate, Phase.shrink], max_examples=10000)(test_newshape_hypothesis)

N = 20
failures = 0
for i in range(N):
    print(i)
    try:
        # Seed each run differently so the runs are independent.
        seed(i)(func)()
    except KeyboardInterrupt:
        break
    except BaseException as e:
        print(e)
        failures += 1
print(failures, N, failures/N)
```

The base run with no shared strategies gets 0 failures (I didn't attempt to see how many runs are needed until it finds one). With a shared shape I can get it up to 95%, which falls down to 75% if I also mix in nonshared shapes.
…shape

This is achieved with the `shared()` strategy. Without this, boolean arrays almost never match the shape of the array where they are indexed, except in the scalar boolean case, meaning almost no boolean array behavior was properly tested other than the `IndexError` behavior. We do still include random shapes to test that behavior, as it doesn't reduce the number of good examples by much. Fixes part of issue Quansight-Labs#77. See the discussion there for more information.
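For anyone following along, mixing shared and independent shapes could look roughly like this; the names are illustrative, not the actual strategies in ndindex/tests/helpers.py:

```python
from hypothesis.strategies import integers, one_of, shared, tuples

shapes = tuples(integers(0, 10))
shared_shapes = shared(shapes, key="array-shape")

# Draw either the shared (matching) shape or an unrelated one, so boolean
# indexing is usually valid but the IndexError path still gets exercised.
mixed_shapes = one_of(shared_shapes, shapes)
```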
I'm a little disappointed that hypothesis never found the failing example here (#80), even after running millions of tests. I think the issue with the `as_subindex` tests is that there are so many cases that give `NotImplementedError` that if a tuple index contains any index that isn't implemented, the whole test is more or less skipped. I used to filter these, but I was getting filtering warnings, so I changed it to just ignore `NotImplementedError`. But I guess I shouldn't have ignored them. Trying to filter these cases isn't going to be easy, especially in a way that doesn't make hypothesis give warnings about it. The best solution is to just implement more of the missing cases, so that this isn't a problem any more.
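For context, the two approaches being compared look roughly like this (a sketch only, not the actual ndindex test code; the helper names are made up):

```python
from hypothesis import assume

def check_ignoring(idx, index):
    # Approach 1: silently ignore unimplemented cases. The test "passes" even
    # if most generated examples never reach the interesting assertions.
    try:
        result = idx.as_subindex(index)
    except NotImplementedError:
        return
    # ... assertions on result ...

def check_filtering(idx, index):
    # Approach 2: reject unimplemented cases with assume(False), so Hypothesis
    # warns when too large a fraction of examples is being filtered out.
    try:
        result = idx.as_subindex(index)
    except NotImplementedError:
        assume(False)
    # ... assertions on result ...
```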
Yep. The usual workaround in the meantime would be to restrict your slices strategy so that it only generates cases you know you can handle 😕
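A rough sketch of such a restriction; which slice forms are actually handled is an assumption here, so the bounds below are purely illustrative:

```python
from hypothesis.strategies import builds, integers, none, one_of

# Only generate slices with non-negative bounds and a positive step, on the
# (illustrative) assumption that these are the cases as_subindex already handles.
handled_slices = builds(
    slice,
    one_of(none(), integers(0, 10)),  # start
    one_of(none(), integers(0, 10)),  # stop
    one_of(none(), integers(1, 10)),  # step
)
```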
At #76 (particularly the most recent commit, 2006911), I implemented the logic for `as_subindex` wrong. But the hypothesis test for `as_subindex` just isn't finding a failing example. I even ran it with `--max-examples=1000000`, which took 2 hours to run, and found nothing.

I tried reducing the broad `ndindices` strategies to strategies that I knew would encompass the bug. I got nothing until I narrowed it down to just a very restricted set of strategies (compare that to the default `ndindices` and `shapes` strategies at https://github.com/asmeurer/ndindex/blob/2006911d291f5af0681b64bd9ba6b96eed5d5eec/ndindex/tests/helpers.py). And even then I still had to run with `--max-examples=1000`.

A shrunk example that fails is:
@Zac-HD do you have any suggestions on how I can improve the situation here? I understand that the search space for my strategies is pretty big, but I would expect to at least get something useful out of it after running for a few hours. By the way, if you want to run the above test, you may have to install NumPy from git (sorry).
Is there maybe some way to tell hypothesis that certain things are more important than others? When I look at the verbose output, it seems to be trying a lot of things that aren't really going to make a difference for the most part, like very large integers. It also really likes producing empty tuples. Those are good to test, but they generally represent trivial cases, and in a lot of instances, if an empty tuple is a problem, I would expect a larger example to shrink to it. I'm OK with it testing these things, but watching the verbose output, it feels like it cares about them a little bit too much. Or should I not focus too much on what hypothesis is doing internally?
I know I can add `@example` decorators, but that seems a lot like I would just be fooling myself into thinking hypothesis is useful, because the `@example`s don't actually influence the example generation from the strategies (they also don't shrink, which is kind of annoying). So an `@example` might cover a case I know to be wrong, but hypothesis wouldn't ever actually find cases that I don't know about.
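For reference, pinning a known-bad case would look something like this; the concrete index and shape below are placeholders, not the actual shrunk example from this issue:

```python
from hypothesis import example, given
from ndindex.tests.helpers import ndindices, shapes  # the strategies linked above

@example((0,), (1,))  # placeholder for a known-failing (index, shape) pair
@given(ndindices, shapes)
def test_as_subindex_example(idx, shape):
    ...
```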