-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add parse_index function for parsing NDIndex from string #91
base: main
Are you sure you want to change the base?
Conversation
I've now also tested this code in cpy39 (after some teething issues, I managed to use |
I've now added unittests for There's still a couple of minor flaws in how I'm using hypothesis to generate string representations of ellipses (always Hypothesis is tricky! And interesting. I've never used it before |
ndindex/literal_eval_index.py
Outdated
@@ -0,0 +1,67 @@ | |||
import ast |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a module level docstring
The code looks reasonable to me. @asmeurer - will you take a look at this please, whenever you get a chance? Thanks! |
@example('slice(-12, -72, 14)') | ||
@example('3:15, -5, slice(12, -14), (1,2,3)') | ||
@example('..., -5, slice(12, -14), (1,2,3)') | ||
@example('3:15, -5, slice(12, -14), (1,2,3), Ellipsis') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are not valid indices, at least as far as ndindex is concerned. Nested tuples are not allowed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did see
Why did you decide to be more restrictive about this syntax than numpy
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason is that I wanted to avoid confusion from both Tuple(1, 2)
and Tuple((1, 2))
being valid and meaning different things. See #17.
I also had thought that using a tuple to represent an array was deprecated, but now I can't find that. Certainly there is ambiguity even at the NumPy level. If you expect a tuple to act like a fancy index, then you will be surprised that a[array([1, 2])]
will give a different thing than a[(1, 2)]
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Particularly this comment #17 (comment). Perhaps this should be added to the docs. I do have https://quansight.github.io/ndindex/type-confusion.html#tuple, but that is more about best practices and doesn't really discuss why things are the way they are.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, so it will be a little difficult to implement disallowing nested tuples indexes, since there are a bunch of places where we do need to parse ast.Tuple
regardless. I think the way to do this will be to include some kind of guard in the ast.Tuple
handling clause of _convert
that will raise an error if it gets called twice. So far I've got:
class _Guard:
def __init__(self):
self.val = False
def __call__(self):
if self.val:
return True
else:
self.val = True
return False
and then
_nested_tuple_guard = _Guard()
def _convert(node):
if isinstance(node, ast.Tuple):
if _nested_tuple_guard():
raise ValueError(f'tuples inside of tuple indices are not supported: {node!r}')
return tuple(map(_convert, node.elts))
...
There's probably a cleaner way to implement the guard
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the simpler way is to not worry about nonsense combinations here, and let ndindex()
handle it at the end.
Thanks for doing the work on this. This is looking good. Some comments:
@given(ndindices)
def test_literal_eval_index(idx):
assert literal_eval_index(str(idx)) == idx This doesn't handle the cases of non-conforming indices, though. For that, we would need some strategies that generate ast. I think hypothesis doesn't have first-class support for this. We could try something like https://github.com/Zac-HD/hypothesmith, but I think it might not be easy. So for this, I think we can just have a bunch of normal tests that syntax features outside of the allowed list are not valid. And also test that things that look valid from the ast but aren't actually valid fail properly. |
Actually I think it needs to look like this @given(ndindices)
def test_literal_eval_index(idx):
index = ndindex(idx)
str_index = str(index)
repr_index = repr(index)
try:
assert literal_eval_index(str_index) == idx
except SyntaxError:
# str() gives invalid syntax for array indices
assume(False)
assert literal_eval_index(repr_index) == idx This will test both the str and repr forms, and properly skip the test for the str form of an array (which isn't valid Python). This still won't handle a lot of combinations, including slice literals, but we'd have to create custom strategies to do that. Although I've also considered that perhaps |
Regarding Python 3.9, it is out now, so we need to support it. See #92 |
@asmeurer I tried the approach to writing @asmeurer @scopatz Can you please give this PR another review and let me know if anything still needs to change? Since the version of |
@asmeurer I played around with
Basically, whenever I fixed one issue I ran into another, so all of the complete fixes I came up so far involve radical changes to |
I'll close/reopen this PR to trigger the new py39 test |
c08db5a
to
6174a67
Compare
The PR is now passing for py39 as well |
@@ -65,6 +66,91 @@ def __init__(self, f): | |||
def __get__(self, obj, owner): | |||
return self.f(owner) | |||
|
|||
class _Guard: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this guard stuff is needed. The Tuple constructor should catch this after calling ndindex() at the end (and will give a better error message than what this currently gives).
node_or_string = node_or_string.slice | ||
|
||
def _raise_malformed_node(node): | ||
raise ValueError(f'malformed node or string: {node!r}, {ast.dump(node)!r}') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The node
part here isn't particularly helpful alongside the dump.
Also I don't understand why the literal_eval error message says "node or string". How can the string be malformed? If there is a syntax error, it raises a different exception. I like this better:
raise ValueError(f'malformed node or string: {node!r}, {ast.dump(node)!r}') | |
raise ValueError('unsupported node when parsing index: {ast.dump(node)}') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think your language is clearer in any case. I'll pull this suggestion in
|
||
def slices(start=one_of(none(), ints()), stop=one_of(none(), ints()), | ||
step=one_of(none(), ints())): | ||
step=one_of(none(), ints_nonzero())): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This includes zero step because it is used to test the slice constructor. Although it would probably be better to use this for the tests that don't actually test that (see the discussion at #46). It may be better to work on this in a separate PR, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this since I kept getting ValueErrors
due to the related check in the Slice
constructor.
So what I don't get is how this wasn't causing errors previously for tests using the @given(ndindices)
strategy. Wouldn't that just mean that (due to one of the many weird quirks of hypothesis) @given(ndindices)
just happened to never attempt to generate ndindex(slice(x, y, 0))
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ndindices uses filter(_doesnt_raise)
which make it only generate valid indices. Actually I suspect this means several tests
ints(), | ||
slices(), | ||
Tuples, | ||
).map(lambda x: f'{x}') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Put strategies in helpers.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can include arrays here too, although you'll have to use the repr form in that case.
@example('..., 3:15, -5, slice(-12, -72, 14)') | ||
@given(ndindexStrs) | ||
def test_parse_index_hypothesis(ixStr): | ||
assert eval(f'_dummy[{ixStr}]') == parse_index(ixStr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can check here if eval gives an error than so should parse_index.
One idea to get better test coverage would be to >>> str(Slice(1, 2))
'1:2'
>>> str(Tuple(1, 2))
'(1, 2)' and then test both str and repr of ndindex(idx) (and ndindex(idx).raw) (str for arrays will still have to be skipped because it uses non-valid syntax). It may also be a good idea to support the ndindex classes as valid names, so |
|
||
def parse_index(node_or_string): | ||
""" | ||
"Safely" (needs validation) evaluate an expression node or a string containing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does "needs validation" here mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it depends on what we mean by "safely". parse_index('('*1000 + ')'*1000)
raises MemoryError (but so does ast.literal_eval('('*1000 + ')'*1000)
). It shouldn't be possible to execute arbitrary code, though, as far as I can tell.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doesn't really mean anything, mostly just a reflection of my paranoia about building a) any parser, and b) something meant to take raw user input. My goal has been to make something that's at least as safe as the existing literal_eval
. At this point I've examined the code and the surrounding issues enough that I'm reasonably confident this has actually been archived, so I'll edit out the wishy-washy language
def parse_index(node_or_string): | ||
""" | ||
"Safely" (needs validation) evaluate an expression node or a string containing | ||
a (limited subset) of valid numpy index or slice expressions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docstring should probably clarify what is and isn't allowed. I guess we aren't allowing any kind of expressions that create numbers.
Actually, we may at one point want to support just anything that parses, and make the current behavior under a safe=True
flag, which I guess should be the default. But we can worry about that later.
Also this docstring needs examples. and should be added to Sphinx.
Fixes #86, see #86 for discussion.
Adds a
literal_eval_index
function, built off of a modified version of theast.literal_eval
function from the cpy std lib.I've tested the
literal_eval_index
function in both cpy37 and cpy38, seems to be working pretty well. Still needs some work to integrate this into thendindex
package (including adding tests).