HTTPWebRequest with non-English Characters

I'm using VB.NET 2K5 and SQL Server 2K5...
I'm trying to use an HttpWebRequest to pull HTML from a given web page and an HttpWebResponse to move it into a StreamReader, and then screen-scrape it into my database. No problem, it works great, except... One of the URLs I'm working with has an ö. When I put the string with this URL into the HttpWebRequest, I don't get the correct page back from the website. It seems the HttpWebRequest translated the ö to a ?? and I get the wrong page.

How do I correct this? Is this a setting in .NET, or something at the OS level that requires changing the language?

Thanks for any help you can give...

Paul W

Hello, Paul!
PW> I'm trying to use an HttpWebRequest to pull HTML from a given web page
PW> and an HttpWebResponse to move it into a StreamReader, and then
PW> screen-scrape it into my database. No problem, it works great, except...
PW> One of the URLs I'm working with has an ö. When I put the string with
PW> this URL into the HttpWebRequest, I don't get the correct page back
PW> from the website. It seems the HttpWebRequest translated the ö to a
PW> ?? and I get the wrong page.

How do you 'decode' the content retrieved? Take a look at the ContentEncoding property of HttpWebResponse. You can use the encoding specified there to 'decode' the message...

Also, it is important that you tell the web server in what encoding you want to get the content. The Accept-Charset header can be used for this. Before making the request with HttpWebRequest you can specify, for instance, Accept-Charset: UTF-8, and then you can expect to get the content in the UTF-8 encoding. To decode the content you can use the System.Text.Encoding.UTF8 property of the Encoding class...

I'll check it out when I get home. I guess you are saying that however I 'decode' the content retrieved is how the URL is sent out in the first place? I would understand if I got the correct page and just the characters were wrong in the stream, but I'm getting the wrong page back in the first place, not the correct page with 'bad' data in the stream.

I think I found another solution. The word in the URL is Flöhli, but looking again at the web page, the URL link on the page translates the ö to %F6. So Flöhli becomes Fl%F6hli. I'll have to try that too. Any idea why this happens?

PW

Hello, Paul!
PW> I would understand if I got the correct page and just the characters
PW> were wrong in the stream, but I'm getting the wrong page back in the
PW> first place, not the correct page with 'bad' data in the stream.

What do you mean by wrong page? A page from the wrong URL?

PW> I think I found another solution. The word in the URL is Flöhli, but
PW> looking again at the web page, the URL link on the page translates the
PW> ö to %F6. So Flöhli becomes Fl%F6hli. I'll have to try that
PW> too. Any idea why this happens?

It is the unicode representation of the non-ASCII symbols in the URL. Take a URL with a space (= %20):

http://my url will become http://my%20url

That is normal behavior, and is known as URL encoding.

> What do you mean by wrong page? A page from the wrong URL?

Example (not real): I was shooting for the page http://Flöhli and I pass that to the HttpWebRequest. However, it reads it as http://Fl??hli

I didn't realise that the ö would need to be translated to %F6 in the URL. Now that you reminded me about translating spaces to %20 in a URL, it all clicks. Don't know why it didn't before. Thanks!

Thus wrote Vadym,
> Hello, Paul!
>
> PW>> I would understand if I got the correct page and just the
> PW>> characters were wrong in the stream, but I'm getting the wrong page
> PW>> back in the first place, not the correct page with 'bad' data in
> PW>> the stream.
>
> What do you mean by wrong page? A page from the wrong URL?
>
> PW>> I think I found another solution. The word in the URL is Flöhli,
> PW>> but looking again at the web page, the URL link on the page
> PW>> translates the ö to %F6. So Flöhli becomes Fl%F6hli. I'll have
> PW>> to try that too. Any idea why this happens?
>
> It is the unicode representation of the non-ASCII symbols in the URL.

But not in the hostname, which is still subject to RFC 1034, unless IDN support is available (.NET 2.0).

> Take a URL with a space (= %20):
> http://my url will become http://my%20url
> That is normal behavior, and is known as URL encoding.

Cheers,
--
Joerg Jooss
news-re***@joergjooss.de
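
The two points made in the thread (percent-encoding for the path, IDN for the hostname) can be illustrated with a short sketch. It is written in Python rather than VB.NET purely for illustration, and the helper names `percent_encode` and `idna_host` and the host `flöhli.example` are hypothetical. It assumes the server expects Latin-1 bytes in the path, which is what the `%F6` link on the page suggests; a server expecting UTF-8 would need `%C3%B6` instead. In .NET, `System.Web.HttpUtility.UrlEncode(String, Encoding)` and `System.Globalization.IdnMapping.GetAscii` appear to play comparable roles, though that mapping is not verified here.

```python
from urllib.parse import quote


def percent_encode(text: str, charset: str) -> str:
    """Percent-encode a URL path segment using the given character set.

    The right charset depends on the server; it is an assumption here.
    """
    return quote(text, safe="/", encoding=charset)


def idna_host(host: str) -> str:
    """Convert a hostname with non-ASCII characters to its ASCII (IDNA) form.

    Percent-encoding does not apply to hostnames; each label is
    converted to an "xn--" Punycode label instead.
    """
    return host.encode("idna").decode("ascii")


if __name__ == "__main__":
    # ö is the single byte 0xF6 in Latin-1, matching the link seen on the page.
    print(percent_encode("Flöhli", "latin-1"))  # Fl%F6hli
    # In UTF-8 the same character is the two bytes 0xC3 0xB6.
    print(percent_encode("Flöhli", "utf-8"))    # Fl%C3%B6hli
    # Hypothetical host name, shown only to illustrate the IDN conversion.
    print(idna_host("flöhli.example"))
```

Which of the two path encodings a given site accepts is something to check against the links the site itself emits, as Paul did when he spotted `%F6`.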