regex syntax

I am new to using both dotnet and regex. I have done the basic reading to the point where I thought I knew how to use regex to extract a date string. But I ran into problems.

What is the best regex expression to look for month names, or date strings for that matter?

From my testing, I could use

"((JAN)|(FEB)|(MAR)|(APR)|(MAY)|(JUN)|(JUL)|(AUG)|(SEP)|(OCT)|(NOV)|(DEC))"

not

"([ADFJMNOS][ACEOPU][BCGLNPRTVY])"

In other words, I have a syntax problem with the month pattern.

I am working towards handling the various date formats I deal with. My objective is to get the entire date string and parse it into yyyy-mm-dd, or whatever the dotnet conversion routine will take. I will have to deal with many long strings of 64K to 200K. This is the reason I am looking for a good regex expression, to minimize processing delays.

I know I have to deal with
yyyy-mm-dd (and variants thereof with dot or slash as separator instead of dash, single-digit month or day)
yyyy-MMM-dd (or just a space instead of -)
MMM d, yy (or yyyy)
and the tougher ones like
d MMM yyyy
d MMM yy

Have a look at regexlib.com for customized expressions.
--
Regards,
Alvin Bruney
[Shameless Author Plug]
The Microsoft Office Web Components Black Book with .NET
available at www.lulu.com/owc, Amazon, B&H etc
Forthcoming VSTO.NET

Thank you.
However, I had no luck accessing that content. All I got was the green logos; I did not see anything.

jg wrote:
> I know I have to deal with
> yyyy-mm-dd ( and variants thereof with dot or slash as separator instead
> of dash, single digit month or day)
> yyyy-MMM-dd ( or just space instead of -)
> MMM d, yy ( or yyyy)
> and the tougher ones like
> d MMM yyyy
> d MMM yy

I have created a regex for you that works with all those samples. Here it is:

    (?<year>\d{4})[-\./\s](?<month>\d{1,2})[-\./\s](?<day>\d{1,2})$ |
    (?<year>\d{4})[-\s](?<month>JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)[-\s](?<day>\d{1,2})$ |
    (?<month>JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)\s(?<day>\d{1,2}),\s*?(?<year>\d{4}|\d{2})$ |
    (?<day>\d{1,2})\s(?<month>JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)\s(?<year>\d{4}|\d{2})$

I tried this with the following samples, constructed from the templates you gave:

    2005-03-08
    2005.03.08
    2005/03/08
    2005 03 08
    2005 3 08
    2005 3 8
    2005 03 8
    2005-MAR-08
    2005 MAR 08
    2005 MAR 8
    MAR 8, 2005
    MAR 08, 2005
    MAR 8, 05
    MAR 08, 05
    8 MAR 2005
    8 MAR 05
    08 MAR 2005
    08 MAR 05

As you can see, the expression is composed of four different parts. Each of these has a $ sign at the end, which you'll want to get rid of before using the expression with your own long string. The $ is only needed to test the expression in Regulator with multiple samples.

I tried this with the IgnoreWhitespace and the IgnoreCase options switched on.

Hope this helps!

(If you have any trouble with the regex, I could send you the saved Regulator file. Just in case things get mangled in the message or something.)

Oliver Sturm
--
omnibus ex nihilo ducendis sufficit unum
Spaces inserted to prevent google email destruction:
MSN oliver @ sturmnet.org Jabber sturm @ amessage.de
ICQ 27142619 http://www.sturmnet.org/blog

That is absolutely wonderful and helpful. Thank you very much. Your efforts
are well appreciated. Thank you very much again for testing and explaining.

I will try that out.

Great, it works even after taking out the $ and the spaces around the |. I
did add \b before the entire expression to make sure the first part of the date is on a word boundary. This way I can avoid some supposedly low-probability errors, like some strange catalogue dot or dash notations.

Now all I have to do is make it work with January, February, ... (fully spelled month names). I guess I can always add another 12 | parts to the month expressions.

jg wrote:
> Great, it works even after taking out the $ and the space around the |.. I
> did add \b before the entire expression to make sure the first part of the
> date is on the word boundary. This way I can avoid some supposedly low
> probability errors like some strange catalogue dot or dash notations

Sure, I didn't know your exact circumstances, so you'd have to make modifications to my sample to make it work for you completely.

> Now all I have to do is to make it work with January, February,... ( fully
> spelled month names). I guess I can always add another 12 | parts to the
> month expressions

Sure you can. If you find the whole thing growing too much, maybe you could define the various parts you need (the month expression, the day expression, the two-digit year, the four-digit year) as string constants in your code and use String.Format to put them together to form the complete regular expression before you use it. That way it might be a bit more maintainable; otherwise you'll have to make every change to one of the parts in many places, increasing the probability of an error.

Oliver Sturm

Thank you again. You are wonderfully helpful.
I did find the pattern string getting too huge, so I started to split the date pattern into 3 components before using them to compose the final pattern, although I did not use the String.Format method.

jg wrote:
> I did find the pattern string getting too huge. So I started to split date
> pattern into 3 components before using them to compose the final pattern,
> although I did not use the string format method.

Well, if you ask me, you should always use String.Format when putting together strings from more than two parts. A String.Format call can create an arbitrarily complicated string in one operation, while a concatenation a + b + c takes two operations at least. Strings are immutable in .NET, so a + b + c will end up allocating several new strings before the final result is ready.

The argument against this is that the compiler might get rid of some of the overhead for you, at least when a, b and c are static strings. But I don't like to depend on that, especially when the String.Format call is usually so much more readable:

    "At " + time.ToString() + ", the user " + user + " had a problem accessing the " + resource + " resource."

    String.Format("At {0}, the user {1} had a problem accessing the {2} resource.", time, user, resource);

Oliver Sturm

Now I see. Pardon my ignorance.
Thank you again. Much appreciated.

Oliver Sturm <oli***@sturmnet.org> wrote:
> Well, if you ask me, you should always use String.Format when putting
> together strings from more than two parts.

I disagree.

> A String.Format call can create an arbitrarily complicated string in one
> operation, while a concatenation a + b + c takes two operations at least.

What do you count as an operation? Bear in mind that String.Format has to do a lot more work in terms of parsing etc - I very much doubt that there are many cases where it's more efficient.

> Strings are immutable in .NET, so a + b + c will end up allocating several
> new strings before the final result is ready.

That's not true if a, b and c are already strings. a+b+c will simply result in a call to String.Concat(a, b, c) which creates one string without creating any intermediate ones. It's not like a+b+c is compiled into (a+b)+c, evaluating a+b first.

    string a = "a";
    string b = "b";
    string c = "c";
    string x = a+b+c;

is compiled into:

    IL_0000: ldstr "a"
    IL_0005: stloc.0
    IL_0006: ldstr "b"
    IL_000b: stloc.1
    IL_000c: ldstr "c"
    IL_0011: stloc.2
    IL_0012: ldloc.0
    IL_0013: ldloc.1
    IL_0014: ldloc.2
    IL_0015: call string [mscorlib]System.String::Concat(string, string, string)
    IL_001a: stloc.3

> The argument against this is that the compiler might get rid of some of
> the overhead for you, at least when a, b and c are static strings. But I
> don't like to depend on that

You can depend on it in C# at least - it's in the specification, IIRC.

> especially when the String.Format call is usually so much better readable:
>
> "At " + time.ToString() + ", the user " + user + "had a problem
> accessing the " + resource + "resource."
>
> String.Format("At {0}, the user {1} had a problem accessing the {2}
> resource.", time, user, resource);

Sometimes String.Format is more readable; sometimes it's less readable. In almost all cases, readability should be the key to determining which to use.
--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too

Jon Skeet [C# MVP] wrote:
>>Well, if you ask me, you should always use String.Format when putting
>>together strings from more than two parts.
>
> I disagree.

I guess I should have qualified my statement better. I might have added conditions like "and at least one of the parts is not a string in itself".

>>The argument against this is that the compiler might get rid of some of
>>the overhead for you, at least when a, b and c are static strings. But I
>>don't like to depend on that
>
> You can depend on it in C# at least - it's in the specification, IIRC.

I would readily assume it even without reading the specs. I would make a test if it were in any way important to me. Until then, I wouldn't depend on it.

>>especially when the String.Format call is usually so much better readable:
>>
>> "At " + time.ToString() + ", the user " + user + "had a problem
>>accessing the " + resource + "resource."
>>
>> String.Format("At {0}, the user {1} had a problem accessing the {2}
>>resource.", time, user, resource);
>
> Sometimes String.Format is more readable; sometimes it's less readable.
> In almost all cases, readability should be the key to determining which
> to use.

Right, that was my most important point as well. But apart from concatenations of literal strings or variables/constants holding strings, I can't imagine cases where the + concatenation would be more readable (see above, IMO). Even in these cases I might tend to use String.Format because during the course of development I find it much easier to extend and change. I can always change it if the profiler says it's a problem.

Oliver Sturm

Oliver Sturm <oli***@sturmnet.org> wrote:
> >>Well, if you ask me, you should always use String.Format when putting
> >>together strings from more than two parts.
> >
> > I disagree.
>
> I guess I should have qualified my statement better. I might have added
> conditions like "and at least one of the parts is not a string in itself".

Do you have evidence that String.Format doesn't itself convert the arguments to intermediate strings? If it does, I can't see that using it is saving any operations.

> > You can depend on it in C# at least - it's in the specification, IIRC.
>
> I would readily assume it even without reading the specs. I would make a
> test if it were in any way important to me. Until then, I wouldn't
> depend on it.

Well, take it from me - you *can* depend on it. (That's assuming that by "static" you mean "constant".)

> > Sometimes String.Format is more readable; sometimes it's less readable.
> > In almost all cases, readability should be the key to determining which
> > to use.
>
> Right, that was my most important point as well. But apart from
> concatenations of literal strings or variables/constants holding
> strings, I can't imagine cases where the + concatenation would be more
> readable (see above, IMO). Even in these cases I might tend to use
> String.Format because during the course of development I find it much
> easier to extend and change. I can always change it if the profiler says
> it's a problem.

In cases with a single parameter you want at the end of the string, I think it's more readable to have:

    string x = "Age: "+age;

than:

    string x = string.Format("Age: {0}", age);

It's very easy to change the former to the latter if you ever *do* want to do anything more complicated.

Jon Skeet
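
For readers who want to try the approach discussed earlier in the thread, here is a minimal C# sketch, not the code any of the posters actually used. It assumes the four-part expression Oliver Sturm posted, with the $ anchors removed and a leading \b added as jg described, and builds it from string constants with String.Format as Oliver suggested. Everything else is an assumption made for illustration: the names DateExtractor and ExtractIsoDates, the trailing \b, the use of [0-9] instead of \d, the optional full month spellings, and the pivot that maps two-digit years 00-49 to 20xx and 50-99 to 19xx.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class DateExtractor
{
    // Building blocks, each defined once (hypothetical names).
    // [0-9] rather than \d, so only ASCII digits match and int.Parse cannot fail.
    const string Year4 = @"(?<year>[0-9]{4})";
    const string Year = @"(?<year>[0-9]{4}|[0-9]{2})";
    const string MonthNumber = @"(?<month>[0-9]{1,2})";
    const string Day = @"(?<day>[0-9]{1,2})";
    // Abbreviated or fully spelled month names.
    const string MonthName =
        @"(?<month>JAN(?:UARY)?|FEB(?:RUARY)?|MAR(?:CH)?|APR(?:IL)?|MAY|JUNE?|JULY?" +
        @"|AUG(?:UST)?|SEP(?:TEMBER)?|OCT(?:OBER)?|NOV(?:EMBER)?|DEC(?:EMBER)?)";

    // {0}..{4} below are String.Format placeholders, not regex quantifiers.
    // .NET allows the same group name in several branches of an alternation.
    static readonly Regex DatePattern = new Regex(
        string.Format(
            @"\b(?:{0}[-./\s]{1}[-./\s]{2}" +   // yyyy-mm-dd, yyyy.mm.dd, yyyy/mm/dd, yyyy mm dd
            @"|{0}[-\s]{3}[-\s]{2}" +           // yyyy-MMM-dd, yyyy MMM dd
            @"|{3}\s{2},\s*{4}" +               // MMM d, yy or MMM d, yyyy
            @"|{2}\s{3}\s{4})\b",               // d MMM yy or d MMM yyyy
            Year4, MonthNumber, Day, MonthName, Year),
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant | RegexOptions.Compiled);

    static readonly string[] MonthAbbreviations =
        { "JAN", "FEB", "MAR", "APR", "MAY", "JUN", "JUL", "AUG", "SEP", "OCT", "NOV", "DEC" };

    // Returns every valid date found in text, normalized to yyyy-MM-dd.
    public static List<string> ExtractIsoDates(string text)
    {
        var result = new List<string>();
        foreach (Match m in DatePattern.Matches(text))
        {
            int year = int.Parse(m.Groups["year"].Value, CultureInfo.InvariantCulture);
            // Assumed pivot for two-digit years: 00-49 => 20xx, 50-99 => 19xx.
            if (m.Groups["year"].Length == 2) year += year < 50 ? 2000 : 1900;

            string monthText = m.Groups["month"].Value;
            int month = char.IsDigit(monthText[0])
                ? int.Parse(monthText, CultureInfo.InvariantCulture)
                : Array.IndexOf(MonthAbbreviations, monthText.Substring(0, 3).ToUpperInvariant()) + 1;
            int day = int.Parse(m.Groups["day"].Value, CultureInfo.InvariantCulture);

            // Skip matches that look like dates but are not (for example 2005-13-40).
            if (year < 1 || month < 1 || month > 12 || day < 1 || day > DateTime.DaysInMonth(year, month))
                continue;
            result.Add(new DateTime(year, month, day).ToString("yyyy-MM-dd", CultureInfo.InvariantCulture));
        }
        return result;
    }

    public static void Main()
    {
        string text = "Shipped 2005-03-08, invoiced MAR 9, 05, paid 10 March 2005.";
        foreach (string iso in ExtractIsoDates(text))
            Console.WriteLine(iso);
    }
}
```

Run as written, Main should print 2005-03-08, 2005-03-09 and 2005-03-10, one per line. The Regex object is created once and reused, which matters more for 64K to 200K inputs than the choice between String.Format and + when assembling the pattern. RegexOptions.Compiled is optional and worth measuring rather than assuming.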