Regexp in java problem
I found some problems while testing my NLP system. I have java regex "(.*\\.\\s*)*Dendryt.*"
and for a string "v Table of Contents List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . "
it just doesn't stop calculating.
It is clear that this complexity of regular expressions is very high, I will try to refactor it. Do you have some suggestions for me for future regex development ???
Thanks.
a source to share
You are facing catastrophic backtracking while repeating a group containing repeated quantifiers. Then the subsequent combinatorial explosion (if there is enough input) will result in a stack overflow (tada!).
Simplified, your regex is trying
(.*\.\s*)
matches any sequence of characters , including periods and spaces , then a period followed by zero or more spaces, then
(...)*
repeat this number of times.
Dendryt
Only then does he try to match "Dendrite".
Since this fails, the engine retreats, trying to rearrange the other. The possibilities are almost endless ...
To illustrate here a screenshot of the RegexBuddy regex debugger on a simplified version of your data:
RegexBuddy Screenshot http://img714.imageshack.us/img714/3275/screen017.png
Engine fails after 1 million shifts.
Your regex will be slightly better (remember to avoid backslashes when converting it to a Java string):
(.*)(\.\s*)*+Dendryt
In this case *+
, the so-called possessive quantifier will refuse to return after it matches. So the regex engine can crash much faster, but it's still bad because it (.*)
matches anything, even dots.
([^.]*)(\.\s*)*+Dendryt
is safe unless your data can contain dots before the "dotted line" bit. In general, please state your requirements a little more clearly, then a better regex can be created.
a source to share
Try the following:
"[^.]*+(?>\\.\\s*)*+Dendryt.*"
[^.]*+
consumes everything up to the first point, but +
does *
possessive , so the regex will never give up from that point.
(?>\\.\\s*)
is an atomic group : it matches a period and any subsequent spaces as if it were a separate unit. If the regex engine should go back to that point, it will then slide right over it, where the group will start matching.
But it won't go back to this point because I did the group quantifier as well. I would like to illustrate the use of atomic groups, but I could use a possessor instead \\s*
- or both.
Potential quantifiers and atomic groups completely disable the countdown, but they may not always be usable. When you have to allow retreat, keep it to a minimum; don't let quantifiers consume more than they need to. And especially, as Tim said, avoid nested quantifiers and quantified subexpressions that can match the same thing.
In fact, it is a good exercise to avoid using .*
and .+
; it makes you think about the mechanism behind it. If you don't want something specific, you have to think about what you don't want, like when I used [^.]*
instead of the first .*
in your regex.
a source to share