HTTPWebRequest with non-English Characters...

Author
16 Feb 2006 2:48 PM
Paul W
I'm using VB.NET 2K5 and SQL Server 2K5...

I'm trying to use a HTTPWebRequest to pull HTML from a given web page
and a HTTPWebResponse to move it into a streamreader and then screen
scrape it into my database.  No problem, it works great, except....

One of the URLs I'm working with has an ö.  When I put the string with
this URL into the HTTPWebRequest, I don't get the correct page back
from the website.  It seems the HTTPWebRequest translated the ö to a
?? and I get the wrong page.

How do I correct this?  Is this a setting in .NET or on the OS level by
changing the language?

Thanks for any help you can give...

Paul W

Author
16 Feb 2006 3:44 PM
Vadym Stetsyak
Hello, Paul!

PW> I'm trying to use a HTTPWebRequest to pull HTML from a given web page
PW> and a HTTPWebResponse to move it into a streamreader and then screen
PW> scrape it into my database.  No problem, it works great, except....

PW> One of the URLs I'm working with has an ö.  When I put the string with
PW> this URL into the HTTPWebRequest, I don't get the correct page back
PW> from the website.  It seems the HTTPWebRequest translated the ö to a
PW> ?? and I get the wrong page.

How do you 'decode' the content retrieved?
Take a look at the ContentEncoding property of HttpWebResponse. You can use the encoding specified there to 'decode' the message...

Also, it is important to tell the web server which encoding you want the content in.
The Accept-Charset request header can be used for this. Before making the request with HttpWebRequest you can specify, for instance, Accept-Charset: UTF-8, and then you can expect to get the content in UTF-8.

To decode the content you can use the System.Text.Encoding.UTF8 property of the Encoding class...
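
A rough sketch of the two steps above, written in Python only to keep it short and self-contained (it is not the .NET code from this thread). The URL, the function names and the ISO-8859-1 fallback are assumptions made for illustration. The request is only built, not sent.

```python
from email.message import Message
from urllib.request import Request


def build_request(url: str) -> Request:
    # Ask the server for UTF-8 content; the server is free to ignore this.
    return Request(url, headers={"Accept-Charset": "utf-8"})


def decode_body(body: bytes, content_type: str, fallback: str = "iso-8859-1") -> str:
    # Read the charset parameter out of a Content-Type header value,
    # e.g. "text/html; charset=ISO-8859-1".
    header = Message()
    header["Content-Type"] = content_type
    # The fallback charset is an assumption, used when the header names none.
    charset = header.get_content_charset() or fallback
    return body.decode(charset, errors="replace")


# Hypothetical URL, used only to show the header being attached.
request = build_request("http://www.example.com/")

# Simulated response body and header instead of a real network call.
raw = "Flöhli".encode("iso-8859-1")
text = decode_body(raw, "text/html; charset=ISO-8859-1")
print(text)
```

The point is that the bytes are decoded with whatever charset the response declares, rather than with a hard-coded one.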

--
Regards, Vadym Stetsyak
www: http://vadmyst.blogspot.com
Author
16 Feb 2006 4:24 PM
Paul W
I'll check it out when I get home.  I guess you are saying that however
I 'decode' the content retrieved is how the URL is sent out in the
first place?

I would understand if I got the correct page and just the characters
were wrong in the stream, but I'm getting the wrong page back in the
first place, not the correct page with 'bad' data in the stream.

I think I found another solution.  The word in the URL is Flöhli, but
looking again at the web page, the link on the page translates the
ö to %F6, so Flöhli becomes Fl%F6hli.  I'll have to try that
too.  Any idea why this happens?

PW
Author
16 Feb 2006 4:38 PM
Vadym Stetsyak
Hello, Paul!

PW> I would understand if I got the correct page and just the characters
PW> were wrong in the stream, but I'm getting the wrong page back in the
PW> first place, not the correct page with 'bad' data in the stream.

What do you mean by wrong page? page from wrong url?

PW> I think I found another solution.  The word in the URL is Flöhli, but
PW> looking again at the web page, the link on the page translates the
PW> ö to %F6, so Flöhli becomes Fl%F6hli.  I'll have to try that
PW> too.  Any idea why this happens?

It is the percent-encoded representation of the non-ASCII characters in the URL: %F6 is the byte for ö in ISO-8859-1.
Take a URL with a space, where the space becomes %20:
http://my url will become http://my%20url

That is normal behavior, and is known as url encoding.
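
To make the %F6 concrete, here is a small sketch in Python (not .NET, just for illustration). It shows that the escaped form depends on which charset the character is turned into bytes with before escaping; the variable names are made up for the example.

```python
from urllib.parse import quote, unquote

word = "Flöhli"

# ISO-8859-1 stores ö as the single byte 0xF6, which is what the page's links use.
latin1_form = quote(word, encoding="iso-8859-1")

# UTF-8 stores ö as two bytes, 0xC3 0xB6, so the escaped form is longer.
utf8_form = quote(word, encoding="utf-8")

print(latin1_form)  # Fl%F6hli
print(utf8_form)    # Fl%C3%B6hli

# Unescaping has to use the same charset that was used for escaping.
roundtrip = unquote(latin1_form, encoding="iso-8859-1")
print(roundtrip)    # Flöhli
```

So a site that expects Fl%F6hli will most likely not find the page if it is sent Fl%C3%B6hli, and the other way round.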
--
Regards, Vadym Stetsyak
www: http://vadmyst.blogspot.com
Author
16 Feb 2006 6:11 PM
Paul W
> What do you mean by wrong page? page from wrong url?

Example (not real): I was shooting for the page http://Flöhli and I
pass that to the HTTPWebRequest.  However, it reads it as
http://Fl??hli

I didn't realise that the ö would need to be translated to %F6 in the
URL.  Now that you reminded me about translating spaces to %20 in a URL
it all clicks.  Don't know why it didn't before.

Thanks!
Author
17 Feb 2006 6:20 PM
Joerg Jooss
Thus wrote Vadym,

> Hello, Paul!
>
PW>> I would understand if I got the correct page and just the
PW>> characters were wrong in the stream, but I'm getting the wrong page
PW>> back in the first place, not the correct page with 'bad' data in
PW>> the stream.
PW>>
> What do you mean by wrong page? page from wrong url?
>
PW>> I think I found another solution.  The word in the URL is Flöhli,
PW>> but looking again at the web page, the link on the page translates
PW>> the ö to %F6, so Flöhli becomes Fl%F6hli.  I'll have to try that
PW>> too.  Any idea why this happens?
> It is the percent-encoded representation of the non-ASCII characters in the URL: %F6 is the byte for ö in ISO-8859-1.
> Take a URL with a space, where the space becomes %20:
> http://my url will become http://my%20url
> That is normal behavior, and is known as url encoding.

But not in the hostname, which is still subject to RFC 1034, unless IDN support
is available (.NET 2.0).
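
For the hostname case, a minimal sketch in Python (again not .NET) of what IDN support does: the name is converted to an ASCII-only "xn--" form instead of being percent-encoded. The hostname is a made-up example, and Python's built-in idna codec follows the older IDNA 2003 rules.

```python
# Hypothetical hostname, for illustration only.
host = "flöhli.example"

# Convert each non-ASCII label to its ASCII-compatible "xn--" form.
ascii_host = host.encode("idna")
print(ascii_host)

# The conversion is reversible.
restored = ascii_host.decode("idna")
print(restored)
```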

Cheers,
--
Joerg Jooss
news-re***@joergjooss.de
