StreamReader.StreamReader(String, bool) bug - no BOM detection

During my app testing I discovered the following bug in .NET v2.0 (have not tested 1.1 yet).

The StreamReader constructors that are supposed to detect the byte order mark fail to do so. A simple test case is below: just feed it files with different BOMs and you can see that the StreamReader encoding is always the default UTF8Encoding, regardless of the file's BOM.

In case someone needs BOM detection, use the code below instead of the StreamReader constructors.

    // Needs: using System.IO; using System.Text; using System.Diagnostics;
    // "path" is the file under test.
    StreamReader reader = null;
    FileStream file = null;
    Encoding enc = null;
    try
    {
        file = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
        if (file.CanSeek)
        {
            byte[] bom = new byte[4]; // Get the byte-order mark, if there is one
            file.Read(bom, 0, 4);
            if (bom[0] == 0xef && bom[1] == 0xbb && bom[2] == 0xbf)
            {
                enc = Encoding.UTF8; // utf-8
            }
            else if (bom[0] == 0xff && bom[1] == 0xfe)
            {
                enc = Encoding.Unicode; // utf-16 little endian (a utf-32 LE BOM, ff fe 00 00, also lands here)
            }
            else if (bom[0] == 0xfe && bom[1] == 0xff)
            {
                enc = Encoding.BigEndianUnicode; // utf-16 big endian
            }
            else if (bom[0] == 0 && bom[1] == 0 && bom[2] == 0xfe && bom[3] == 0xff)
            {
                enc = new UTF32Encoding(true, true); // utf-32 big endian
            }
            else
            {
                enc = Encoding.ASCII; // no BOM found
            }
            file.Close();
        }
        reader = new StreamReader(path, true);
        Trace.WriteLine("StreamReader encoding: " + reader.CurrentEncoding);
        Trace.WriteLine("BOM detected encoding: " + enc); // enc stays null if the stream is not seekable
    }
    catch (Exception ex)
    {
        Trace.WriteLine(ex.ToString());
    }
    finally
    {
        if (reader != null) reader.Close();
        if (file != null) file.Close();
    }

Cheers,
http://sourceforge.net/projects/ngmp

Polanski24 <infod***@aster.pl> wrote:
> During my app testing I discovered the following bug in .NET v2.0 (have
> not tested 1.1 yet).
>
> Constructors of StreamReader supposed to detect byte order mark fail to
> do so.

No, they don't. You're trying to use CurrentEncoding prior to reading any data. From the docs:

<quote>
Property Value
The current character encoding used by the current reader. The value can be different after the first call to any Read method of StreamReader, since encoding autodetection is not done until the first call to a Read method.
</quote>

I don't believe there's a bug at all. If you run the code below, you'll see it doing the right thing. In the last case, the same data as a previous test case is used, claiming to be UTF-8 but then using little endian UTF-16 data. That's the only case in which things go "wrong" (understandably) - it copes with all the rest.

    using System;
    using System.IO;

    class Test
    {
        static void Main (string[] args)
        {
            byte[] littleEndian = new byte[] {0xff, 0xfe, 0x41, 0x00, 0x42, 0x00};
            byte[] bigEndian = new byte[] {0xfe, 0xff, 0x00, 0x41, 0x00, 0x42};
            byte[] utf8 = new byte[] {0xef, 0xbb, 0xbf, 0x41, 0x42};
            byte[] utf8DuffData = new byte[] {0xef, 0xbb, 0x41, 0x00, 0x42, 0x00};

            ShowEncoding ("Big endian", bigEndian);
            ShowEncoding ("Little endian", littleEndian);
            ShowEncoding ("UTF-8", utf8);
            ShowEncoding ("UTF-8 with little endian UTF-16 data", utf8DuffData);
        }

        static void ShowEncoding (string correct, byte[] data)
        {
            using (MemoryStream ms = new MemoryStream(data))
            {
                using (StreamReader reader = new StreamReader(ms, true))
                {
                    Console.WriteLine (correct);
                    // Read first: autodetection only happens on the first read,
                    // so CurrentEncoding is only meaningful afterwards.
                    string line = reader.ReadLine();
                    Console.WriteLine (reader.CurrentEncoding);
                    Console.WriteLine (line);
                }
            }
        }
    }

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too

Hello!
Thanks for the reply. I agree, with some hesitation, that it's not a bug but rather a design flaw (or bug). The reason is very simple: there is no way to seek freely in the stream using StringReader, since it works in one direction only, yet during code execution it is usually necessary to detect the encoding before any reads or processing are done.

Since StreamReader internally uses a Stream to go through the data, its constructor should do the check internally, then rewind the Stream position to 0 and wait for the first read operation, instead of providing an incorrect value in the CurrentEncoding property. This is at least the way I would design that feature.

Cheers
http://sourceforge.net/projects/ngmp

Polanski24 wrote:
> Hello!
>
> Thanks for reply. I do with some hesitation agree that it's not a bug
> but it's rather design flaw (or bug). The reason is very simple -
> there is no way to freely seek in the stream using StringReader - it
> works only in one direction - but during code execution usually it is
> necessary to detect encoding before any reads or processing is done.
>
> Since StreamReader uses internally Stream to go through data it should
> in the constructor code do the check internally then rewind Stream
> position to 0 and wait for first read operation instead of providing
> incorrect value in CurrentEncoding property. This is at least the way
> I would design that feature.

.... which would then make it useless on non-seekable streams, like the stream from a decompressor or a network socket. Poor choice, IMO.

-cd

Polanski24 <infod***@aster.pl> wrote:
> Thanks for reply. I do with some hesitation agree that it's not a bug
> but it's rather design flaw (or bug). The reason is very simple - there
> is no way to freely seek in the stream using StringReader - it works
> only in one direction - but during code execution usually it is
> necessary to detect encoding before any reads or processing is done.

I very rarely find that necessary, actually. If I don't know what the encoding is before I start, I rarely care about it at all, so long as I'm getting the right text data.

> Since StreamReader uses internally Stream to go through data it should
> in the constructor code do the check internally then rewind Stream
> position to 0 and wait for first read operation instead of providing
> incorrect value in CurrentEncoding property. This is at least the way I
> would design that feature.

You're making some assumptions there:

1) Reading the stream when you haven't been asked to won't have any nasty side effects
2) The stream can be rewound

Now, assuming you meant to write StreamReader rather than StringReader in the first paragraph, all you need to do is use the BaseStream property, set the position on *that*, and then call StreamReader.DiscardBufferedData.

So, you can get the behaviour you want by:

1) Find the position of the stream
2) Create the StreamReader
3) Call StreamReader.Read()
4) Set StreamReader.BaseStream.Position = <whatever it was before>
5) Call StreamReader.DiscardBufferedData

This will still cause problems if the stream isn't seekable, of course. In that case, you'd have to read the first character and remember it for when you first wanted actual data.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
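
For readers who want to try the five steps listed above, here is a minimal sketch of how they might look in code. The class name EncodingProbe and the method CreateWithDetectedEncoding are hypothetical names made up for this illustration, not framework APIs; only StreamReader, BaseStream, CurrentEncoding, Read and DiscardBufferedData come from the thread. It assumes a seekable stream and simply refuses anything else.

```csharp
using System;
using System.IO;
using System.Text;

static class EncodingProbe
{
    // Hypothetical helper (not a framework API) that follows the five steps
    // from the post above. Assumes the stream is seekable.
    public static StreamReader CreateWithDetectedEncoding(Stream stream, out Encoding detected)
    {
        if (!stream.CanSeek)
            throw new NotSupportedException("This approach needs a seekable stream.");

        long start = stream.Position;                          // 1) remember the position
        StreamReader reader = new StreamReader(stream, true);  // 2) create the reader
        reader.Read();                                         // 3) first read triggers BOM detection
        detected = reader.CurrentEncoding;                     //    now reflects the detected encoding
        reader.BaseStream.Position = start;                    // 4) rewind the underlying stream
        reader.DiscardBufferedData();                          // 5) drop what the reader had buffered
        return reader;
    }
}
```

Called on a MemoryStream holding the little endian test data from earlier in the thread, detected should come back as the UTF-16 little endian encoding, with the stream back at its starting position. The thread does not say how the reader treats the BOM bytes when they are read a second time after the rewind, so it is worth checking whether the first character returned is U+FEFF before relying on this.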