Google Answers: How to rewrite a Java regexp that requires exponential time

View Question

Q: How to rewrite a Java regexp that requires exponential time ( No Answer, 0 Comments )

Question

Subject: How to rewrite a Java regexp that requires exponential time
Category: Computers > Programming
Asked by: willisboyce-ga
List Price: $25.00

Posted: 06 Jul 2003 16:35 PDT
Expires: 07 Jul 2003 00:25 PDT
Question ID: 225830

I have a method that removes HTML tags from an input string.  The
method allows the caller to specify some HTML tags which are
permitted, so the
method must be able to identify each tag.  The method performs the
same
validation on tag attributes.  My general strategy is to extract the
HTML tags from
the input string using one regular expression, then use a second
regular
expression to extract the attributes from the tags.  This all works
very well, except
that some inputs cause this method, specifically the call to
Matcher.find method
corresponding to the first regexp, to run for a very, very long time. 
According to
this page:

http://developer.java.sun.com/developer/bugParade/bugs/4771934.html

the time required to evaluate a match for some regular expressions can
increase
exponentially with the length of the input.  Apparently that is what
is happening
in this case.

Here is the regular expression in question:

/**
 * Pattern to pull HTML tags out of text.  A tag is considered to be a
"<", optionally followed by a "/",
 * followed by a group of word letters, optionally followed by a group
of attributes, with or without quotes,
 * followed by a ">".
 */
private static final Pattern TAG_PATTERN =
Pattern.compile("<(/?)(\\w+)\\s*(\\w+\\s*(=\\s*([^\"'>][^\\s>]*|\"[^\"]*\"|'[^']*'))?\\s*)*[^/>]*(/[^/>]+)*(/?)>",Pattern.CASE_INSENSITIVE);

My questions:  How do I know if a particular regular expression will
require
exponential time (in other words, what are the constructs that trigger
this behavior),
and, if I have a regexp that requires exponential time, what is the
strategy for
rewriting it so that it runs efficiently?

Please note:  Obviously I need to rewrite this regular expression, so
I will not
accept any answers that do not allow me to achieve that goal.  For
example, telling
me to "eliminate backtracking in the regexp" doesn't help because I
don't know
what constructs cause backtracking.

Please email me at wboyce@panix.com if you would like a test program
and some
input data that exposes the problem.

Clarification of Question by willisboyce-ga on 07 Jul 2003 00:24 PDT

I found my answer.  Sorry if you were working on this.  I purchased
"Master Regular Expressions" by Jeffrey E. F. Friedl and was able to
figure out what is going on.  "Backtracking" can occur whenever there
are two or more possible matching paths at some point in the regular
expression (for example, "a?" can match either a or nothing).  At such
points, the matcher tries one of the paths (the "match a" path in this
case, since "?" is a greedy qualifier and prefers to match rather than
not match) and must backtrack to the decision point if the chosen path
ultimately fails to match.  A complex regular expression such as mine
can have many decision points and may have to do a lot of
backtracking, especially if the expression includes a lot of greedy
qualifiers that must be forced to give back a lot of what they
initially matched.  Lazy qualifiers and lookahead and lookbehind
qualifiers can be used to reduced the amount of backtracking
necessary.

Answer

There is no answer at this time.

Comments

There are no comments at this time.

Important Disclaimer: Answers and comments provided on Google Answers are general information, and are not intended to substitute for informed professional medical, psychiatric, psychological, tax, legal, investment, accounting, or other professional advice. Google does not endorse, and expressly disclaims liability for any product, manufacturer, distributor, service or service provider mentioned or any opinion expressed in answers or comments. Please read carefully the Google Answers Terms of Service.

If you feel that you have found inappropriate content, please let us know by emailing us at answers-support@google.com with the question ID listed above. Thank you.

Search Google Answers for

Google Home - Answers FAQ - Terms of Service - Privacy Policy