
Possible bug in UnicodeEncoding

Author
12 Sep 2006 4:31 PM
KrippZ
Hello!

We here at the office have discovered something odd. Can somebody
please verify this potential bug for us?

This code generates a byte buffer and fills it with 256 bytes ranging from
0 to 255. The bug appears when one Unicode encoder gets the bytes of a
string that another Unicode encoder decoded from the byte buffer.

The byte buffers should not differ, but in .NET 2.0 they do.
We have run the test code in VS 2003 and VS 2005, and the results from
2003 don't differ.

Bytes 216, 217 and 222, 223 seem to go missing.

        // Requires: using System.IO;
        static void Main(string[] args)
        {
            // Fill a buffer with every byte value from 0 to 255.
            byte[] bytearrBuffer = new byte[256];
            for (int i = 0; i < 256; i++)
            {
                bytearrBuffer[i] = (byte)i;
            }
            WriteBuffer(bytearrBuffer, "Buffer.txt");

            // Decode the raw bytes as UTF-16, then encode the string again.
            WriteBuffer(
                new System.Text.UnicodeEncoding().GetBytes(
                    new System.Text.UnicodeEncoding().GetString(bytearrBuffer)),
                "Buffer2.txt");
        }


        public static void WriteBuffer(byte[] arrbyteBuffer, string filename)
        {
            try
            {
                string sLogFileName = Path.Combine("c:\\", filename);

                FileStream fs = new FileStream(sLogFileName, FileMode.Create,
                    FileAccess.Write, FileShare.Write);
                BinaryWriter bw = new BinaryWriter(fs);

                // Each byte is written as its decimal text; BinaryWriter
                // stores every string with a length prefix.
                for (int i = 0; i < arrbyteBuffer.Length; i++)
                {
                    bw.Write(arrbyteBuffer[i].ToString());
                }

                bw.Flush();
                bw.Close();
            }
            catch
            {
                // Errors are silently ignored in this test code.
            }
        }

Cheers
//KrippZ

Author
13 Sep 2006 8:37 PM
Mattias Sjögren
>We here at the office have discovered something odd. Can somebody
>please verify this potential bug for us?

I wouldn't call it a bug. There's no guarantee that a random byte
array will come back the same after an
Encoding.GetString/Encoding.GetBytes roundtrip. Some byte values may
have special meaning or may be invalid according to that encoding. So
you can't take an arbitrary blob and decode it to a string like that.
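
One way to see that the blob is invalid for the encoding is to ask
UnicodeEncoding to throw on invalid bytes instead of quietly repairing
them. The following is only a minimal sketch: the class name
StrictDecodeSketch and the helper IsValidUtf16 are made up for
illustration and are not part of the original test code.

```csharp
using System;
using System.Text;

public static class StrictDecodeSketch
{
    // Hypothetical helper: returns true if the bytes decode cleanly as
    // little-endian UTF-16, false if the strict decoder rejects them.
    public static bool IsValidUtf16(byte[] data)
    {
        // Arguments: bigEndian = false, byteOrderMark = false,
        // throwOnInvalidBytes = true.
        UnicodeEncoding strict = new UnicodeEncoding(false, false, true);
        try
        {
            strict.GetString(data);
            return true;
        }
        catch (ArgumentException)
        {
            // DecoderFallbackException derives from ArgumentException.
            return false;
        }
    }

    public static void Main()
    {
        // Same buffer as in the original post: every byte value 0..255.
        byte[] buffer = new byte[256];
        for (int i = 0; i < 256; i++)
        {
            buffer[i] = (byte)i;
        }
        Console.WriteLine(IsValidUtf16(buffer));
        Console.WriteLine(IsValidUtf16(Encoding.Unicode.GetBytes("hello")));
    }
}
```

With the strict decoder the problem surfaces as an exception at decode
time rather than as silently changed bytes later on.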


Mattias

--
Mattias Sjögren [C# MVP]  mattias @ mvps.org
http://www.msjogren.net/dotnet/ | http://www.dotnetinterop.com
Please reply only to the newsgroup.
Author
13 Sep 2006 9:12 PM
Jon Skeet [C# MVP]
<Kri***@gmail.com> wrote:
> We here at the office have discovered something odd. Can somebody
> please verify this potential bug for us?

Not a bug, or at least not the bug you think it is.

> This code generates a byte buffer and fills it with 256 bytes ranging from
> 0 to 255. The bug appears when one Unicode encoder gets the bytes of a
> string that another Unicode encoder decoded from the byte buffer.
>
> The byte buffers should not differ, but in .NET 2.0 they do.
> We have run the test code in VS 2003 and VS 2005, and the results from
> 2003 don't differ.
>
> Bytes 216, 217 and 222, 223 seem to go missing.

Your byte array isn't a valid Unicode-encoded string. 0xd800 to 0xdfff
are reserved for surrogate pairs - you need to have a value in
[0xd800-0xdbff] followed by [0xdc00-0xdfff]. So, 216/217 isn't valid,
and neither is 222/223.

In fact, 218/219 and 220/221 shouldn't be valid either, which just goes
to show: garbage in, garbage out.

The moral of the story is that you shouldn't treat arbitrary binary
data as text.
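
If the binary data really does have to travel as a string, a binary-to-text
encoding such as Base64 is the usual tool. This is a minimal sketch and not
something from the original test code; the class name BinaryAsTextSketch and
the helpers RoundTrip and SameBytes are hypothetical.

```csharp
using System;

public static class BinaryAsTextSketch
{
    // Hypothetical helper: turns arbitrary bytes into Base64 text and back.
    public static byte[] RoundTrip(byte[] data)
    {
        string text = Convert.ToBase64String(data);
        return Convert.FromBase64String(text);
    }

    // Hypothetical helper: byte-by-byte comparison of two arrays.
    public static bool SameBytes(byte[] a, byte[] b)
    {
        if (a.Length != b.Length)
        {
            return false;
        }
        for (int i = 0; i < a.Length; i++)
        {
            if (a[i] != b[i])
            {
                return false;
            }
        }
        return true;
    }

    public static void Main()
    {
        // Same buffer as in the original post: every byte value 0..255.
        byte[] buffer = new byte[256];
        for (int i = 0; i < 256; i++)
        {
            buffer[i] = (byte)i;
        }
        Console.WriteLine(SameBytes(buffer, RoundTrip(buffer)));
    }
}
```

Base64 is defined for any byte sequence, so the round trip does not depend
on the bytes happening to form valid UTF-16.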

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet   Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Author
13 Sep 2006 10:10 PM
Jon Skeet [C# MVP]
Jon Skeet [C# MVP] <sk***@pobox.com> wrote:
> > The byte buffers should not differ, but in .NET 2.0 they do.
> > We have run the test code in VS 2003 and VS 2005, and the results from
> > 2003 don't differ.
> >
> > Bytes 216, 217 and 222, 223 seem to go missing.
>
> Your byte array isn't a valid Unicode-encoded string. 0xd800 to 0xdfff
> are reserved for surrogate pairs - you need to have a value in
> [0xd800-0xdbff] followed by [0xdc00-0xdfff]. So, 216/217 isn't valid,
> and neither is 222/223.
>
> In fact, 218/219 and 220/221 shouldn't be valid either, which just goes
> to show: garbage in, garbage out.

Sorry, I've realised what I'd done wrong in the above analysis. My
general principle was right (as was the conclusion that the byte array
didn't represent a valid Unicode string) but the logic was off.

This bit is right:
> Your byte array isn't a valid Unicode-encoded string. 0xd800 to 0xdfff
> are reserved for surrogate pairs - you need to have a value in
> [0xd800-0xdbff] followed by [0xdc00-0xdfff].

and the bytes 216-225 end up being 16-bit values of:

0xd9d8 0xdbda 0xdddc 0xdfde 0xe1e0

Now, the Encoding looks at the first of those (0xd9d8), which is a high
surrogate, and expects a low surrogate to follow. It doesn't, so it
presumably ignores the character. It moves on to 0xdbda, which is
"correctly" followed by 0xdddc, so those end up forming a surrogate pair.
The 0xdfde should have been preceded by a high surrogate, so it ignores it
and moves on to the rest - which are valid in themselves.
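
The classification of those five values can be checked directly with
char.IsHighSurrogate and char.IsLowSurrogate. This is a rough sketch;
the class name SurrogateWalkSketch and the helper Classify are made up
for illustration.

```csharp
using System;

public static class SurrogateWalkSketch
{
    // Hypothetical helper: labels a single UTF-16 code unit.
    public static string Classify(char unit)
    {
        if (char.IsHighSurrogate(unit))
        {
            return "high surrogate";
        }
        if (char.IsLowSurrogate(unit))
        {
            return "low surrogate";
        }
        return "ordinary";
    }

    public static void Main()
    {
        // Bytes 216..225 of the test buffer.
        byte[] bytes = { 216, 217, 218, 219, 220, 221, 222, 223, 224, 225 };
        for (int i = 0; i < bytes.Length; i += 2)
        {
            // Little-endian UTF-16: the low byte comes first.
            char unit = (char)(bytes[i] | (bytes[i + 1] << 8));
            Console.WriteLine("0x{0:x4} {1}", (int)unit, Classify(unit));
        }
    }
}
```

A high surrogate is only valid when a low surrogate comes immediately
after it, and a low surrogate only when a high surrogate comes
immediately before it, so the labels show which units are left unpaired.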

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet   Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
