regex expression

What is a good general regex expression for the HTML <img ....> tag?
I tried
"<img [/:.a-z =0-9\""_;&]*\->", RegexOptions.IgnoreCase)
but it is not quite working.

Thank you for your time.

Hi,

You have to use the lazy modifier "?" so that the "*" quantifier doesn't match the trailing ">". In your example the "*" won't match the trailing ">", so I think it's the "\-" that is causing you problems.

Try the following expression:

Regex re = new Regex(
    @"<img .*?(/>|</img>)", // this pattern accounts for the XHTML as well as HTML standards
    RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

--
Dave Sexton

Thank you.

However, I found that the new expression failed to find the <img> tag in

<P><IMG height=168 src="test.bmp" width=235 border=0> Brought to you by <FONT size=4>Test ABC Inc.</FONT></P>

So I removed the slash before the >, thus:

myregex = New Regex("<img .*?(>|</img>)", RegexOptions.IgnoreCase Or RegexOptions.ExplicitCapture) ' this is in vb

What problems might I encounter with the modified expression? Please bear with my lack of knowledge of XML.

BTW, what about <object ... type="image/png">? Any chance of that being mixed with non-image content such as scripts or applets? At the moment the <object ..> tags for images seem to be a nest of hornets.

Hi,

You are correct that it doesn't properly handle img tags commonly found in HTML documents. Sorry about that.

If you don't have to account for a closing </img> tag then the following should work for HTML and most standard XHTML documents:

"<img .*?>"

In your other recent post your expression will work essentially the same as the one above.

> BTW what about <object ... type="image/png"> any chance of that being mixed
> with non image such as scripts or applets? At the moment the <object ..>
> tags for image seem to be a nest of hornets.

If you need to match that as well then you'll have to use a more complex expression:

"<object .*?type=\"image/.*?\".*?(/>|</object>)"

HTH

--
Dave Sexton
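
For readers following along, here is a minimal C# sketch of the two <img> expressions discussed so far, run against GS's sample markup, plus the <object> expression. The class and method names (ImgTagDemo, FirstMatch) and the two <object> strings are made up for illustration and do not come from the thread; the sample is joined onto one line on the assumption that "." is left at its default of not matching newlines.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ImgTagDemo
{
    // GS's sample markup, joined onto one line because "." does not cross newlines by default.
    public const string Sample =
        "<P><IMG height=168 src=\"test.bmp\" width=235 border=0> " +
        "Brought to you by <FONT size=4>Test ABC Inc.</FONT></P>";

    // Hypothetical <object> markup, invented for this sketch.
    public const string ImageObject =
        "<object data=\"logo.png\" type=\"image/png\"></object>";
    public const string OtherObject =
        "<object data=\"movie.swf\" type=\"application/x-shockwave-flash\"></object>";

    // Dave's <object> expression, written here as a regular C# string.
    public const string ObjectPattern =
        "<object .*?type=\"image/.*?\".*?(/>|</object>)";

    // Returns the first match, or an empty string when nothing matches.
    public static string FirstMatch(string pattern, string input)
    {
        return Regex.Match(input, pattern, RegexOptions.IgnoreCase).Value;
    }

    public static void Main()
    {
        // Requires "/>" or "</img>", which plain HTML <IMG ...> tags do not have.
        Console.WriteLine("[" + FirstMatch(@"<img .*?(/>|</img>)", Sample) + "]");

        // Stops at the first ">" after "<img ".
        Console.WriteLine("[" + FirstMatch(@"<img .*?>", Sample) + "]");

        // Only <object> tags whose type starts with "image/" are matched.
        Console.WriteLine("[" + FirstMatch(ObjectPattern, ImageObject) + "]");
        Console.WriteLine("[" + FirstMatch(ObjectPattern, OtherObject) + "]");
    }
}
```

Note that every ".*?" in these expressions is free to run past the end of a tag as long as it stays on the same line, so a non-image <object> followed later on the same line by an image one could still produce a match that spans both.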

Wonderful explanation and help. Thank you.

BTW, I initially tried <img .*>, which had a greedy propensity for gobbling up everything to the last > in the line of HTML.

So is the "?>" what makes it a non-greedy expression?

Hi,

Glad I could help.

? is a lazy modifier to the * quantifier. On its own, * is greedy: it matches as much as it can and only gives characters back as needed, so the match ends at the last occurrence of the remainder of the expression. With the lazy modifier, the regex in this case matches everything up to the first occurrence of the remainder of the expression and then matches the remainder once (the trailing > in this case), but no more. The lazy modifier is, in that respect, loosely like a positive look-ahead assertion on the remainder of the expression, except that it will keep extending the match, up to the end of the input string if necessary, until the remainder is found.

HTH

--
Dave Sexton
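
The greedy/lazy difference described above can be seen in a few lines of C#. This is only a sketch; the class name LazyVsGreedyDemo and the sample line are invented for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyVsGreedyDemo
{
    // Hypothetical one-line input containing several ">" characters.
    public const string Line = "<img src=\"a.gif\"> some text <b>bold</b>";

    // Greedy: ".*" runs to the end of the line, then backs off to the LAST ">".
    public static string Greedy(string input)
    {
        return Regex.Match(input, "<img .*>").Value;
    }

    // Lazy: ".*?" grows one character at a time and stops at the FIRST ">".
    public static string Lazy(string input)
    {
        return Regex.Match(input, "<img .*?>").Value;
    }

    public static void Main()
    {
        Console.WriteLine(Greedy(Line)); // the whole line
        Console.WriteLine(Lazy(Line));   // only <img src="a.gif">
    }
}
```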

I have actually changed the grouping now:

myregex = New Regex("<img .*?(>|(/>)|(</img>))", RegexOptions.IgnoreCase Or RegexOptions.ExplicitCapture) ' this is in vb

I hope it does catch the XML <img .../> tags.

I have not gotten around to dealing with <image .../>; I don't expect to run into them in my application for the next year or more.

GS wrote:
> what is a good general regex expression for html <img ....> tag?
> I tried
> "<img [/:.a-z =0-9\""_;&]*\->", RegexOptions.IgnoreCase)
> but it is not quite working
>
> thank you for your time

It looks like you've already had a working answer, but I still want to comment on a few issues.

By default, the . does not match newlines, so image tags like these

<img
src = "http://.../img.gif"
/>

won't be matched if your expression is <img.*>. Adding or removing the ? to make it <img.*?> doesn't change things. There is an option to allow . to match newlines, but that option is potentially very resource intensive (with a greedy .*, if your input is 2MB, it will match 2MB and start backtracking from there).

A safer expression would be the following: <img[^>]*>. This matches everything between <img and > that is not a > itself. This will work in most cases. There's one problem though: > is allowed within quotes if you follow the standards. This can also be caught in regex:

<img("[^"]*"|'[^']*'|[^>])*>

If you want to catch the corresponding </img> tag as well, things get harder, though this is still possible to a certain degree.

First we match everything up to the end of the tag

<img("[^"]*"|'[^']*'|[^>])*

and then we match either /> or >......</img>

(/>|>.*?</img>)

As you can see I added the lazy modifier again, but this will suffer the same issues as before, so is there a better solution, you might ask... And of course there is :).

By using a negative look-ahead we can match everything that is not the start of </img as follows:

((?!</img).)*

Combine this with what we already had and you get this:

<img("[^"]*"|'[^']*'|[^>])*(/>|>((?!</img).)*</img>)

Only one issue left to tackle. The </img> tag does not necessarily have the closing > directly after the tag name. Whitespace is allowed in the closing tag. This can easily be added:

<img("[^"]*"|'[^']*'|[^>])*(/>|>((?!</img).)*</img\s+>)

Kind regards,

Jesse Houwing

Great answer with learning details. Thank you. Keep up the good work.
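
The first three points in Jesse's post (the dot and newlines, the negated character class, and > inside quotes) can be checked with a short C# sketch. The class name TagBoundaryDemo, the Find helper, and both sample strings are invented here for illustration; RegexOptions.Singleline is assumed to be the "option to allow . to match newlines" that the post refers to.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagBoundaryDemo
{
    // Hypothetical tag spread over three lines.
    public const string MultiLineTag = "<img\nsrc = \"img.gif\"\n/>";

    // Hypothetical tag with a ">" inside a quoted attribute value.
    public const string QuotedGt = "<img alt=\"a > b\" src=\"x.gif\">";

    // Jesse's quote-aware expression: quoted runs are consumed as a unit.
    public const string QuoteAware = @"<img(""[^""]*""|'[^']*'|[^>])*>";

    // Returns the first match, or an empty string when nothing matches.
    public static string Find(string pattern, string input,
                              RegexOptions options = RegexOptions.None)
    {
        return Regex.Match(input, pattern, options).Value;
    }

    public static void Main()
    {
        // "." stops at the newline, so the multi-line tag is not found.
        Console.WriteLine("[" + Find("<img.*?>", MultiLineTag) + "]");

        // With Singleline, "." also matches "\n" and the tag is found.
        Console.WriteLine("[" + Find("<img.*?>", MultiLineTag, RegexOptions.Singleline) + "]");

        // [^>] matches newlines without any option.
        Console.WriteLine("[" + Find("<img[^>]*>", MultiLineTag) + "]");

        // [^>]* is cut short by the ">" inside the alt attribute...
        Console.WriteLine("[" + Find("<img[^>]*>", QuotedGt) + "]");

        // ...while the quote-aware expression returns the whole tag.
        Console.WriteLine("[" + Find(QuoteAware, QuotedGt) + "]");
    }
}
```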

Hi Jesse,

Thanks for bringing up those points, but I wouldn't worry about performance or memory consumption issues related to the Singleline flag when matching patterns in an HTML document. Pattern matching is slow by nature, and in this case it might not be executed in a batch process where performance would really be a concern. Also, any expression will probably perform well when executed against any standard-sized HTML document. I think my solution with the addition of the Singleline option should be fine.

If the user experiences performance issues due to the expression, only then would I recommend that a more complex expression be used. A more complex expression is much harder to write and debug, but it may perform better. Therefore, the user must make a trade-off decision, but I wouldn't recommend sacrificing ease of writing and debugging (and therefore understanding) to address performance concerns that aren't real. Once it's known that the expression is not going to perform well, the trade-off can be made.

Anyway, I followed your post and your points seemed to make perfect sense, but your expression didn't work when I tested it on the following document:

string html = @"<html>
<head></head>
<body>
<img src=""test.jpg""></img>
</body>
</html>
";

0 matches.

And it didn't work on this document either:

string html = @"<html>
<head></head>
<body>
<img src=""test.jpg""></img>
<img src=""test.jpg"" />
<img src=""test.jpg""></img>
</body>
</html>
";

1 match, but it's invalid:

{<img src="next.jpg"></img>
<img src="next.jpg" />}

Here's the code I used to test your expression:

System.Text.RegularExpressions.Regex re = new System.Text.RegularExpressions.Regex(
    @"<img(""[^""]*""|'[^']*'|[^>])*(/>|>((?!</img).)*</img\s+>)");

foreach (System.Text.RegularExpressions.Match match in re.Matches(html))
{
    match.GetType(); // break point in debugger
}

I didn't even attempt to do any debugging of my own :)

--
Dave Sexton
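
A likely explanation for both results, offered as a reading of the expression rather than something confirmed in the thread: the final "\s+" requires at least one whitespace character before the closing ">", so a plain "</img>" can never satisfy it. The single "invalid" match then appears to come from backtracking, where [^>] consumes a quote character and "[^"]*" pairs the closing quote of one tag with the opening quote of the next, until the "/>" of the second tag is reached. The sketch below compares the posted expression with a variant that uses "\s*" instead; that variant, the class name ClosingTagDemo, and the one-tag-per-line layout of the documents are assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ClosingTagDemo
{
    // Jesse's expression as posted: "\s+" demands whitespace before the final ">".
    public const string Posted =
        @"<img(""[^""]*""|'[^']*'|[^>])*(/>|>((?!</img).)*</img\s+>)";

    // Assumed fix (not from the thread): "\s*" makes that whitespace optional.
    public const string Relaxed =
        @"<img(""[^""]*""|'[^']*'|[^>])*(/>|>((?!</img).)*</img\s*>)";

    // Dave's two test documents, assumed to have one tag per line.
    public const string OneTag =
        "<html>\n<head></head>\n<body>\n" +
        "<img src=\"test.jpg\"></img>\n" +
        "</body>\n</html>\n";

    public const string ThreeTags =
        "<html>\n<head></head>\n<body>\n" +
        "<img src=\"test.jpg\"></img>\n" +
        "<img src=\"test.jpg\" />\n" +
        "<img src=\"test.jpg\"></img>\n" +
        "</body>\n</html>\n";

    // Number of non-overlapping matches of pattern in input.
    public static int Count(string pattern, string input)
    {
        return Regex.Matches(input, pattern).Count;
    }

    public static void Main()
    {
        Console.WriteLine(Count(Posted, OneTag));     // 0, as Dave reported
        Console.WriteLine(Count(Posted, ThreeTags));  // 1, spanning the first two tags
        Console.WriteLine(Count(Relaxed, OneTag));    // 1
        Console.WriteLine(Count(Relaxed, ThreeTags)); // 3
    }
}
```

The relaxed variant only addresses the closing-tag whitespace. It still lets [^>] consume a stray quote, so markup with unbalanced quotes, or several tags on one line, may need a stricter expression or an HTML parser.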