I have a method that removes HTML tags from an input string. The
method allows the caller to specify some HTML tags which are
permitted, so the
method must be able to identify each tag. The method performs the
same
validation on tag attributes. My general strategy is to extract the
HTML tags from
the input string using one regular expression, then use a second
regular
expression to extract the attributes from the tags. This all works
very well, except
that some inputs cause this method, specifically the call to
Matcher.find method
corresponding to the first regexp, to run for a very, very long time.
According to
this page:
http://developer.java.sun.com/developer/bugParade/bugs/4771934.html
the time required to evaluate a match for some regular expressions can
increase
exponentially with the length of the input. Apparently that is what
is happening
in this case.
Here is the regular expression in question:
/**
* Pattern to pull HTML tags out of text. A tag is considered to be a
"<", optionally followed by a "/",
* followed by a group of word letters, optionally followed by a group
of attributes, with or without quotes,
* followed by a ">".
*/
private static final Pattern TAG_PATTERN =
Pattern.compile("<(/?)(\\w+)\\s*(\\w+\\s*(=\\s*([^\"'>][^\\s>]*|\"[^\"]*\"|'[^']*'))?\\s*)*[^/>]*(/[^/>]+)*(/?)>",Pattern.CASE_INSENSITIVE);
My questions: How do I know if a particular regular expression will
require
exponential time (in other words, what are the constructs that trigger
this behavior),
and, if I have a regexp that requires exponential time, what is the
strategy for
rewriting it so that it runs efficiently?
Please note: Obviously I need to rewrite this regular expression, so
I will not
accept any answers that do not allow me to achieve that goal. For
example, telling
me to "eliminate backtracking in the regexp" doesn't help because I
don't know
what constructs cause backtracking.
Please email me at wboyce@panix.com if you would like a test program
and some
input data that exposes the problem. |