
DataFormats.HTML with foreign character questions

Author
16 Oct 2006 1:11 PM
Leon
Hi,

I recently had to process the HTML clipboard format (both retrieval and
posting) in C#, and I have run into a problem when the HTML fragment contains
foreign characters such as Chinese, i.e. real multi-byte UTF-8 sequences.

When I examine the HTML clipboard data from Clipboard.GetData(
DataFormats.Html ), the return type is System.String. For the most part it is
correct, except for the data between <!--StartFragment--> and
<!--EndFragment-->, which should contain UTF-8 sequences. For some reason the
returned byte sequences are wrong in certain places.

I then used unmanaged code via C++/CLI to retrieve the same data and
compared it byte for byte with what Clipboard.GetData() returned, to see the
real difference. They are indeed different. The data I retrieve with the
Win32 API GetClipboardData() is correct, while the data from
Clipboard.GetData() is corrupted.

Does anyone know why Microsoft chose to use System.String for this
rather than Byte[], which would be more appropriate? In the end, I wrote my
own retrieval function in C++/CLI that returns a Byte[], so that I can use
Encoding.UTF8.GetString() to convert the fragment correctly.
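
For anyone comparing notes, here is a minimal sketch of how the fragment bytes can be located once the raw buffer is in hand. It is plain standard C++ rather than my C++/CLI code, and it assumes the buffer is the CF_HTML data (the "HTML Format" clipboard format) already copied out of GetClipboardData(). The function names read_offset and extract_fragment are made up for illustration. It relies on the StartFragment: and EndFragment: header fields, which are byte offsets from the start of the data.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: reads the decimal number that follows `key`
// (e.g. "StartFragment:") in the CF_HTML header.
// Returns false when the key or its digits are missing.
bool read_offset(const std::string& data, const std::string& key,
                 std::size_t& value) {
    const std::size_t pos = data.find(key);
    if (pos == std::string::npos) return false;
    std::size_t i = pos + key.size();
    if (i >= data.size() || data[i] < '0' || data[i] > '9') return false;
    value = 0;
    while (i < data.size() && data[i] >= '0' && data[i] <= '9') {
        value = value * 10 + static_cast<std::size_t>(data[i] - '0');
        ++i;
    }
    return true;
}

// Hypothetical helper: copies the raw (still UTF-8 encoded) fragment bytes
// out of a CF_HTML buffer. The offsets count bytes, not characters, so the
// slice is taken before any text decoding happens.
bool extract_fragment(const std::string& data, std::string& fragment) {
    std::size_t start = 0, end = 0;
    if (!read_offset(data, "StartFragment:", start)) return false;
    if (!read_offset(data, "EndFragment:", end)) return false;
    if (start > end || end > data.size()) return false;
    fragment = data.substr(start, end - start);
    return true;
}
```

The bytes that come back are what I would then hand to Encoding.UTF8.GetString() on the managed side. The sketch does no validation beyond bounds checking.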

When posting with Clipboard.SetData(), I have also observed changes in
certain byte sequences, so I am writing my own posting function as well.
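
A rough sketch of the posting side, again in portable standard C++ and only an illustration: it wraps an already UTF-8 encoded fragment in a CF_HTML payload and fills in the header offsets as byte counts. The name build_cf_html and the minimal <html><body> wrapper are assumptions of mine, and handing the resulting bytes to the clipboard (for example through SetClipboardData()) is not shown.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: builds a CF_HTML payload around a UTF-8 fragment.
// All four offsets are byte offsets from the start of the payload.
std::string build_cf_html(const std::string& fragment_utf8) {
    const std::string prefix = "<html><body>\r\n<!--StartFragment-->";
    const std::string suffix = "<!--EndFragment-->\r\n</body></html>";

    // Fixed-width, zero-padded fields keep the header length constant,
    // assuming every offset fits in 8 decimal digits.
    const char* header_format =
        "Version:0.9\r\n"
        "StartHTML:%08u\r\n"
        "EndHTML:%08u\r\n"
        "StartFragment:%08u\r\n"
        "EndFragment:%08u\r\n";

    char header[128];
    // First pass with zeros, only to measure the header length.
    const int header_len =
        std::snprintf(header, sizeof header, header_format, 0u, 0u, 0u, 0u);

    const unsigned start_html = static_cast<unsigned>(header_len);
    const unsigned start_fragment =
        start_html + static_cast<unsigned>(prefix.size());
    const unsigned end_fragment =
        start_fragment + static_cast<unsigned>(fragment_utf8.size());
    const unsigned end_html =
        end_fragment + static_cast<unsigned>(suffix.size());

    // Second pass with the real offsets.
    std::snprintf(header, sizeof header, header_format,
                  start_html, end_html, start_fragment, end_fragment);

    return std::string(header) + prefix + fragment_utf8 + suffix;
}
```

Because the offsets are computed from the encoded byte lengths, a fragment with Chinese text gets a larger EndFragment value than its character count would suggest.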

Is there any logical explanation, or has this been dealt with before? I am
using .NET 2.0.

Thanks.

Leon
