RegEx substring

Ok I am new to RegEx and what I am trying to do is find a substring.

I have a string that constantly changes. This string is pulled from an Atom feed from a blog. I need to strip the HTML formatting from this string and just grab the inner text.

If this is my string: "<div>Hello my name is bryan and I am learning regex!</div>"

I need to be able to just grab what is in between <div> and </div>.

I thought this would work, but it still grabs the div's code...

Regex: <div>.*?</div>

How can I modify this expression to eliminate the divs?

Thanks
Bryan

<[^>]*>
This matches all HTML markup. It is the opposite of what you want. If you remove all text matched by this regular expression, what's left over is what you want.

--
HTH,

Kevin Spencer
Microsoft MVP
Chicken Salad Surgery

It takes a tough man to make a tender chicken salad.

This may not help me, because the text I am parsing is code from a blog
and likely to include formatting tags. I would want to keep all the formatting markup, whether it be styles, fonts, line breaks, etc. I just need to eliminate the first div and last div.

Not a problem.
(?<=<div[^>]*>).*?(?=</div>)

I'll explain:

This uses a positive LookBehind and a positive LookAhead. The LookBehind and LookAhead are zero-width assertions: they indicate that the Match must be preceded by or followed by a certain pattern, but the text they match is not included in the Match. So, only the text between them is.

In addition, a div may have attributes, so I added an expression to the LookBehind, indicating that the opening div tag can have any characters in it other than the '>' character, prior to the closing '>' character.

--
HTH,

Kevin Spencer
Microsoft MVP
Chicken Salad Surgery

What You Seek Is What You Get.
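For anyone trying this pattern in another engine: a variable-length LookBehind such as (?<=<div[^>]*>) is accepted by engines like .NET's, but Python's re module only allows fixed-width LookBehind. The sketch below is therefore a translation under that assumption, not the code from the post. It shows the lookaround form for a bare <div> and a capture-group form for a div with attributes. The function names and the second sample string are made up for illustration.

```python
import re

# Fixed-width lookbehind: accepted by Python's re because "<div>" has a known length.
LOOKAROUND = re.compile(r"(?<=<div>).*?(?=</div>)", re.DOTALL)

# Capture-group equivalent (swapped in for the variable-length lookbehind):
# the opening tag may carry attributes, and group 1 holds the inner text.
CAPTURE = re.compile(r"<div[^>]*>(.*?)</div>", re.DOTALL)


def inner_lookaround(html):
    """Return the text between a bare <div> and the next </div>, or None."""
    m = LOOKAROUND.search(html)
    return m.group(0) if m else None


def inner_capture(html):
    """Return the text between <div ...> and the next </div>, or None."""
    m = CAPTURE.search(html)
    return m.group(1) if m else None


sample = "<div>Hello my name is bryan and I am learning regex!</div>"
print(inner_lookaround(sample))
print(inner_capture('<div class="post">Hello <b>bryan</b><br/></div>'))
```

Both forms leave inner markup such as <b> or <br/> untouched, because only the surrounding div tags fall outside the returned text.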
Thanks a million, Kevin
That line of code was golden! I appreciate your time and effort very much.

Thanks again,

Bryan
http://www.staga.net
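A caveat worth checking against real feed data: a lazy .*? stops at the first </div>, so a post body that itself contains a nested div would be cut short, and '.' does not cross line breaks unless a single-line (dot-all) option is set. Below is a minimal sketch, assuming the whole string is wrapped in exactly one outer div: it anchors the pattern to both ends of the string and lets a greedy group run to the last </div>. The strip-every-tag approach is included only for contrast. All names are hypothetical.

```python
import re

# Anchored to the whole string; the greedy (.*) backtracks to the LAST </div>.
OUTER_DIV = re.compile(
    r"\A\s*<div[^>]*>(.*)</div>\s*\Z",
    re.DOTALL | re.IGNORECASE,
)

# Plain tag pattern: a '<', any run of characters other than '>', then '>'.
ANY_TAG = re.compile(r"<[^>]*>")


def strip_outer_div(html):
    """Remove only the first and last div tags; return input unchanged if not wrapped."""
    m = OUTER_DIV.match(html)
    return m.group(1) if m else html


def strip_all_tags(html):
    """Remove every tag, keeping only the text between tags."""
    return ANY_TAG.sub("", html)


wrapped = '<div class="entry"><div>a</div><p>b</p></div>'
print(strip_outer_div(wrapped))                          # <div>a</div><p>b</p>
print(strip_all_tags("<div>Hello <b>bryan</b></div>"))   # Hello bryan
```

Regular expressions remain a rough tool for HTML; if the feed markup is irregular, an HTML parser is likely the safer choice.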