|
dev
newsgroups
|
|||||||||||||||||||||||
|
|||||||||||||||||||||||
regex high cpu utilizationtake the following 2 c# lines: 1) str = Regex.Replace(str, ".*AAA", ""); 2) str = Regex.Replace(str, "^.*AAA", ""); notice that the only difference is that the pattern in line 2 has a starter marker (^). if str is large and does not contain the pattern, line 1 takes much much longer to complete than line 2 and cpu goes nearly full throttle. i have reproduced this behavior in .net 1.1 and 2.0. but jscript does not exhibit this. is this expected behavior? both lines do the exact same operation, expect for line 2 we're explicitly instruction regex to start at the beginning. i'm baffled. thanks, rh
Show quote
"rh" <rhashem***@hotmail.com> wrote: This is one possible expected behaviour under the current> take the following 2 c# lines: > 1) str = Regex.Replace(str, ".*AAA", ""); > 2) str = Regex.Replace(str, "^.*AAA", ""); > > notice that the only difference is that the pattern in line 2 has a > starter > marker (^). if str is large and does not contain the pattern, line 1 > takes much much longer to complete than line 2 and cpu goes nearly > full throttle. > > i have reproduced this behavior in .net 1.1 and 2.0. but jscript does > not exhibit this. > > is this expected behavior? both lines do the exact same operation, > expect for line 2 we're explicitly instruction regex to start at the > beginning. implementation, which does some kinds of optimizations (such as compiling to a dynamic assembly in memory if you wish) but not others (as you have seen). Here's the way each one works in a string of length n, assuming no optimization of any kind: 1) Start at first character, match next character * n times end of string found, so back-track 1 try to match A, then end of string found | backtrack to .* backtrack 1, try to match AA, then end of string | backtrack to .* backtrack 1, try to match AAA, if found then success not found, so backtrack the .* matching Start at second character, match next character * n-1 times end of string found, so back-track 1 try to match A, then end of string found backtrack 1, try to match AA, then end of string backtrack 1, try to match AA, then end of string not found, so backtrack the .* matching Start at third character, ... As you can see, this will take time proportional to n*n, or quadratic time. 2) Start at first character, match next character * n times end of string found, so back-track 1 try to match A, then end of string found backtrack 1, try to match AA, then end of string backtrack 1, try to match AA, then end of string not found, so *NO MATCH* This one can exit early because the '^' forces the expression to only match the start of the string. It takes time proportional to n, or linear time. Now, semantically, one can see that '.*AAA' is simply the start of the string up to AAA, and the whole string could be found quickly by just searching for AAA. Since the '*' operator in .NET Regex is greedy, the '^' is implied. I don't know if the JScript test you ran has the same semantics, but it appears it certainly has a different implementation with some more optimizations. To make it quicker, you can use '^' yourself. (I think this is a .net framework issue, not a general issue or a C# issue, so I suppose it should have been posted to that newsgroup only.) -- Barry |
|||||||||||||||||||||||