Author
6 Sep 2006 10:27 PM
bryanmig
Ok I am new to RegEx and what I am trying to do is find a substring.

I have a string that constantly changes.  This string is pulled from an
Atom feed from a blog.  I need to strip the HTML formatting from this
string and just grab the inner text.

If this is my string:  "<div>Hello my name is bryan and I am learning
regex!</div>"

I need to be able to just grab what is in between <div> and </div>

I thought this would work, but the match still includes the div tags...

Regex:   <div>.*?</div>

How can I modify this expression to exclude the div tags?


Thanks
Bryan

Author
7 Sep 2006 11:16 AM
Kevin Spencer
<[^>]*>

This matches all HTML markup, which is the opposite of what you want. If you
remove all text matched by this regular expression, what's left over is the
text you want.
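
As a rough illustration of the remove-the-markup approach, here is a minimal sketch in Python rather than .NET. It assumes a plain tag pattern, `<[^>]*>`, and the function name `strip_tags` is made up for this example. Replacing every tag with an empty string leaves only the text:

```python
import re

# Anything that looks like a tag: '<', a run of non-'>' characters, then '>'.
TAG = re.compile(r"<[^>]*>")

def strip_tags(html: str) -> str:
    """Remove everything that looks like an HTML tag, keeping the text in between."""
    return TAG.sub("", html)

print(strip_tags("<div>Hello my name is bryan and I am learning regex!</div>"))
```

This is only a heuristic: a '>' inside an attribute value, or markup inside comments and scripts, would confuse it.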

--
HTH,

Kevin Spencer
Microsoft MVP
Chicken Salad Surgery

It takes a tough man to make a tender chicken salad.


Author
7 Sep 2006 2:22 PM
bryanmig@gmail.com
This may not help me, because the text I am parsing is code from a blog
and likely to include formatting tags.  I would want to keep all the
formatting markup, whether it be styles, fonts, line breaks, etc.  I
just need to eliminate the first div and last div.


Author
7 Sep 2006 2:55 PM
Kevin Spencer
Not a problem.

(?<=<div[^>]*>).*?(?=</div>)

I'll explain:

This uses a positive LookBehind and a positive LookAhead. Both are zero-width
assertions: they require that the Match be preceded by, or followed by, a
certain pattern, but the text they match is not included in the Match. So
only the text between them is returned.

In addition, a div may have attributes, so I added an expression to the
LookBehind, indicating that the opening div tag can have any characters in
it other than the '>' character, prior to the closing '>' character.
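
For illustration, here is the same idea sketched in Python. One assumption to flag: Python's `re` module only accepts fixed-width lookbehind, so the `[^>]*` part cannot go inside `(?<=...)` there. For a div with attributes, the sketch uses a capture group instead, which is a substitute technique rather than the expression above. The helper names are made up for the example.

```python
import re

# Fixed-width lookbehind is fine in Python, but it only covers a bare <div>.
BARE_DIV = re.compile(r"(?<=<div>).*?(?=</div>)", re.DOTALL)

# For <div ...attributes...>, capture the content in group 1 instead.
ANY_DIV = re.compile(r"<div[^>]*>(.*?)</div>", re.DOTALL)

def inner_bare(html):
    """Text between a bare <div> and the first </div>, or None if absent."""
    m = BARE_DIV.search(html)
    return m.group(0) if m else None

def inner_any(html):
    """Text between the first <div ...> and the first </div>, or None if absent."""
    m = ANY_DIV.search(html)
    return m.group(1) if m else None

print(inner_bare("<div>Hello <b>bryan</b></div>"))
print(inner_any('<div class="entry">Hello<br/>regex!</div>'))
```

Two caveats apply to either form. The lazy `.*?` stops at the first `</div>`, so content containing nested divs would be cut short. And `.` does not cross line breaks unless told to: that is `re.DOTALL` here, and `RegexOptions.Singleline` in .NET.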

--
HTH,

Kevin Spencer
Microsoft MVP
Chicken Salad Surgery

What You Seek Is What You Get.

Author
18 Sep 2006 7:35 PM
bryanmig@gmail.com
Thanks a million, Kevin

That line of code was golden!
I appreciate your time and effort very much.

Thanks again,
Bryan
http://www.staga.net
