Possible bug in UnicodeEncoding

We here at the office have discovered something odd. Can somebody please verify this potential bug for us?

This code generates a byte buffer and fills it with 256 bytes ranging from 0 to 255. The bug appears when one Unicode encoder gets its bytes from a string that another Unicode encoder produced from the byte buffer.

The byte buffers should not differ, but in .NET 2.0 they do. We have run the test code in VS 2003 and VS 2005, and the results in 2003 don't differ.

Bytes 216, 217 and 222, 223 seem to go missing?!?

    static void Main(string[] args)
    {
        byte[] bytearrBuffer = new byte[256];
        for (int i = 0; i < 256; i++)
        {
            bytearrBuffer[i] = (byte)i;
        }
        WriteBuffer(bytearrBuffer, "Buffer.txt");
        WriteBuffer(new System.Text.UnicodeEncoding().GetBytes(new System.Text.UnicodeEncoding().GetString(bytearrBuffer)), "Buffer2.txt");
    }

    public static void WriteBuffer(byte[] arrbyteBuffer, string filename)
    {
        try
        {
            string sLogFileName = Path.Combine("c:\\", filename);
            FileStream fs = new FileStream(sLogFileName, FileMode.Create, FileAccess.Write, FileShare.Write);
            BinaryWriter bw = new BinaryWriter(fs);
            for (int i = 0; i < arrbyteBuffer.Length; i++)
            {
                bw.Write(arrbyteBuffer[i].ToString());
            }
            bw.Flush();
            bw.Close();
        }
        catch { }
    }

Cheers
//KrippZ

> We here at the office have discovered something odd. Can somebody
> please verify this potential bug for us?

I wouldn't call it a bug. There's no guarantee that a random byte array will come back the same after an Encoding.GetString/Encoding.GetBytes roundtrip. Some byte values may have special meaning or may be invalid according to that encoding. So you can't take an arbitrary blob and decode it to a string like that.

Mattias

--
Mattias Sjögren [C# MVP] mattias @ mvps.org
http://www.msjogren.net/dotnet/ | http://www.dotnetinterop.com
Please reply only to the newsgroup.

<Kri***@gmail.com> wrote:
> We here at the office have discovered something odd. Can somebody
> please verify this potential bug for us?

Not a bug, or at least not the bug you think it is.

> This code generates a byte buffer fills it with 256 bytes ranging from
> 0 to 255, and the bug appears when the Unicode Encoder gets the bytes
> from another Unicode Encoder that gives it a string from a bytebuffer.
>
> The bytebuffers should not differ but in .NET 2.0 they do.
> We have run the testcode in VS 2003 and VS 2005 and the results of
> 2003 don't differ.
>
> bytes 216,217 and 222, 223 seem to go missing?!?

Your byte array isn't a valid Unicode-encoded string. 0xd800 to 0xdfff are reserved for surrogate pairs - you need to have a value in [0xd800-0xdbff] followed by [0xdc00-0xdfff]. So, 216/217 isn't valid, and neither is 222/223.

In fact, 218/219 and 220/221 shouldn't be valid either, which just goes to show: garbage in, garbage out.

The moral of the story is that you shouldn't treat arbitrary binary data as text.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too

Jon Skeet [C# MVP] <sk***@pobox.com> wrote:
> > The bytebuffers should not differ but in .NET 2.0 they do.
> > We have run the testcode in VS 2003 and VS 2005 and the results of
> > 2003 don't differ.
> >
> > bytes 216,217 and 222, 223 seem to go missing?!?
>
> Your byte array isn't a valid Unicode-encoded string. 0xd800 to 0xdfff
> are reserved for surrogate pairs - you need to have a value in
> [0xd800-0xdbff] followed by [0xdc00-0xdfff]. So, 216/217 isn't valid,
> and neither is 222/223.
>
> In fact, 218/219 and 220/221 shouldn't be valid either, which just goes
> to show: garbage in, garbage out.

Sorry, I've realised what I'd done wrong in the above analysis. My general principle was right (as was the conclusion that the byte array didn't represent a valid Unicode string) but the logic was off.

This bit is right:

> Your byte array isn't a valid Unicode-encoded string. 0xd800 to 0xdfff
> are reserved for surrogate pairs - you need to have a value in
> [0xd800-0xdbff] followed by [0xdc00-0xdfff].

and the bytes 216-225 end up being 16-bit values of:

0xd9d8 0xdbda 0xdddc 0xdfde 0xe1e0

Now, the Encoding looks at the first of those (0xd9d8), which is a high surrogate, and expects a low surrogate to follow. It doesn't, so it presumably ignores the character. It moves on to 0xdbda, which is "correctly" followed by 0xdddc, so those end up forming a surrogate pair. The 0xdfde is a low surrogate that should have been preceded by a high surrogate, so it ignores it and moves on to the rest - which are valid in themselves.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
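
For anyone following along, here is a rough sketch of the arithmetic behind that analysis. It rebuilds the same 0..255 buffer, pairs up bytes 216-225 the way a little-endian UTF-16 decoder would, and labels each resulting 16-bit value as a high surrogate, low surrogate, or ordinary code unit. The class name SurrogateDemo and the helpers ToCodeUnit and Classify are made up for this illustration. The Base64 round trip at the end is not something suggested in the thread; it is just one common way to carry arbitrary bytes in a string without loss, shown here as an assumption about what the original poster might want.

```csharp
using System;

public static class SurrogateDemo
{
    // Combine two bytes into one UTF-16 code unit, low byte first
    // (UnicodeEncoding is little-endian by default).
    public static char ToCodeUnit(byte low, byte high)
    {
        return (char)(low | (high << 8));
    }

    // Label a code unit using the framework's own surrogate checks.
    public static string Classify(char unit)
    {
        if (char.IsHighSurrogate(unit)) return "high surrogate";
        if (char.IsLowSurrogate(unit)) return "low surrogate";
        return "ordinary";
    }

    public static void Main()
    {
        // Same buffer as in the original post: bytes 0..255.
        byte[] buffer = new byte[256];
        for (int i = 0; i < buffer.Length; i++)
        {
            buffer[i] = (byte)i;
        }

        // Walk bytes 216..225 two at a time.
        for (int i = 216; i < 226; i += 2)
        {
            char unit = ToCodeUnit(buffer[i], buffer[i + 1]);
            Console.WriteLine("bytes {0},{1} -> 0x{2:X4} ({3})",
                buffer[i], buffer[i + 1], (int)unit, Classify(unit));
        }
        // Prints:
        // bytes 216,217 -> 0xD9D8 (high surrogate)
        // bytes 218,219 -> 0xDBDA (high surrogate)
        // bytes 220,221 -> 0xDDDC (low surrogate)
        // bytes 222,223 -> 0xDFDE (low surrogate)
        // bytes 224,225 -> 0xE1E0 (ordinary)

        // Assumption, not from the thread: Base64 as a lossless way to
        // put arbitrary bytes into a string and get them back.
        string text = Convert.ToBase64String(buffer);
        byte[] back = Convert.FromBase64String(text);

        bool same = back.Length == buffer.Length;
        for (int i = 0; same && i < buffer.Length; i++)
        {
            same = back[i] == buffer[i];
        }
        Console.WriteLine("Base64 round trip intact: {0}", same);
    }
}
```

The output lines up with the explanation above: the first high surrogate (0xD9D8) has no low surrogate after it, the middle two values form a valid pair, and the trailing low surrogate (0xDFDE) has no high surrogate before it. What a given framework version does with the two unpaired values (drop them, replace them, or throw) is not something this sketch tries to show.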