StackOverflowException in Xeger #5

moodmosaic · 2012-10-21T16:50:57Z

The following regular expressions create a StackOverflowException:

^[a-zA-Z0-9\-\.]+\.(com|org|net|mil|edu|COM|ORG|NET|MIL|EDU)$
(http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&amp;:/~\+#]*[\w\-\@?^=%&amp;/~\+#])?
^(http|https|ftp)\://([a-zA-Z0-9\.\-]+(\:[a-zA-Z0-9\.&amp;%\$\-]+)*@)?((25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9])\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[0-9])|([a-zA-Z0-9\-]+\.)*[a-zA-Z0-9\-]+\.[a-zA-Z]{2,4})(\:[0-9]+)?(/[^/][a-zA-Z0-9\.\,\?\'\\/\+&amp;%\$#\=~_\-@]*)*$
(?:(?:\+?1\s*(?:[.-]\s*)?)?(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?([0-9]{4})(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?$

The Xeger class keep traversing the Automaton's States resulting to a StackOverflowException. This happens in the IList<Transition> GetSortedTransitions() method of the State class.

The issue appeared after this commit which addresses this issue.

The text was updated successfully, but these errors were encountered:

moodmosaic · 2013-11-05T11:51:24Z

As @ymotton points out the following regular expression also throws:

@"(.*)\r\n\r\n"

edwardskrod · 2017-10-30T22:49:07Z

@moodmosaic Did you guys ever figure this out?

moodmosaic · 2017-10-31T10:26:29Z

@edwardskrod, no we haven't. See also #13 (comment).

trylock · 2018-11-13T00:18:21Z

There is a non-zero probability that pretty much any non-trivial regular expression will throw a StackOverflowException. This is because generally there will be cycles in its automaton. ~~However, the expected number of steps of the Xeger algorithm is bounded (see http://web.cs.elte.hu/~lovasz/erdos.pdf)~~ Simply rewriting the Generate method to use a for loop or something like that instead of recursion should do the trick. I can send a pull request when I have time, unless there is another problem with this I'm not aware of.

Also, fun fact about the linked issue: If we pick one of 2^16 characters at random, the mean time to get an English letter is 2^16/26 (about 2521) trials.

moodmosaic · 2018-11-13T09:09:42Z

I can send a pull request when I have time

Please do. There's no need for us to strictly follow the Java version (dk.brics.automaton and Xeger). 🚀

trylock · 2018-11-18T18:30:13Z

I finally had a little time to think about this properly, First of all, the linked document is obviously not relevant to this discussion as it deals with undirected graphs. Under current conditions, there are some regular expressions where the expected number of steps is exponential (for example: .*some long suffix). Another disadvantage of the old algorithm is that the generated words depend heavily on the graph structure. Consider this regexp: a{1,4}, for the Xeger algorithm, P(a) = 1/2, P(aa) = 1/4, P(aaa) = 1/8, P(aaaa) = 1/8

I thought about possible alternatives and there is a lot of them. It all depends on our requirements.

uniform distribution of generated words
uniform/custom distribution of generated word lengths
uniform/custom distribution of cycle lengths (for example in [a-z]+@[a-zA-Z0-9_-]+ we might want the part before @ to have the same expected length as the part after @)

Of course, I assume we restrict generated words to some finite subset (like words of length at most n).

These requirements go against each other. For example, in general you can't have both 1) and 2) (.* - there is a lot more longer words). It would be nice to have multiple generators.

As for the Xeger algorithm, I'd rather get rid of it completely. However, I understand that a lot of code may depend on its probability distribution. We can maybe create a modified version which will stop after a random number of steps and generate the rest of the word without using cycles. This shouldn't break the distribution too much and it will reduce the expected number of steps greatly. What do you think?

moodmosaic · 2018-11-20T04:22:46Z

I would definitely go with different generators, with an API that also supports LINQ, like this one. This one is from Hedgehog but the idea about how the API can look like is similar.

We can have a Gen type that takes care of generating the data we want. – On distribution, I'd rather have uniform distribution by default, and have some generator(s) that allow you to use a weighted distribution.

(All examples I use come from Hedgehog, the property testing tool I'm working on, but the concept of data generation applies here as well.)

At this point we might wonder if it's worth keeping the Xeger algorithm around; if I could go back in time, I wouldn't port Xeger and instead I would only port dk.brics.automaton instead of both...

For now, I would keep Xeger around, perhaps refactor it to use the (new) Gen class but I wouldn't change its public methods if possible, to avoid breaking other peoples code until the next major version bump.

ghost assigned moodmosaic Oct 21, 2012

moodmosaic mentioned this issue Oct 22, 2013

Added testcase that throws a StackOverflowException #10

Closed

moodmosaic added bug help wanted labels Jul 1, 2015

moodmosaic removed their assignment Jul 1, 2015

gregsdennis mentioned this issue Jan 20, 2022

running xeger.Generate() causes stack overflow exception #57

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

StackOverflowException in Xeger #5

StackOverflowException in Xeger #5

moodmosaic commented Oct 21, 2012

moodmosaic commented Nov 5, 2013

edwardskrod commented Oct 30, 2017

moodmosaic commented Oct 31, 2017

trylock commented Nov 13, 2018 •

edited

Loading

moodmosaic commented Nov 13, 2018

trylock commented Nov 18, 2018

moodmosaic commented Nov 20, 2018

StackOverflowException in Xeger #5

StackOverflowException in Xeger #5

Comments

moodmosaic commented Oct 21, 2012

moodmosaic commented Nov 5, 2013

edwardskrod commented Oct 30, 2017

moodmosaic commented Oct 31, 2017

trylock commented Nov 13, 2018 • edited Loading

moodmosaic commented Nov 13, 2018

trylock commented Nov 18, 2018

moodmosaic commented Nov 20, 2018

trylock commented Nov 13, 2018 •

edited

Loading