Large text file - in memory (> 60 MB)

My program needs to search a large text file (> 60 MB).

At this time I'm using a StreamReader to read the file into a string variable (objString = sr.ReadToEnd). Before reading the file, the process running my program uses about 10 MB; after reading the text file into the string, it uses over 200 MB. I would expect the program to use between 70 and 100 MB.

Is there a more efficient way of storing this data in memory while still being able to search through it?

TIA,
Jurjen.

Keep in mind that .NET strings are UTF-16, so reading an ANSI text file will typically double the size in bytes.

If you read the file into a byte array, you'll use less memory, but you won't be able to use the .NET string searching facilities (e.g. System.String member functions, regular expressions, etc.). Depending on your searching requirements, you might be able to use a simpler search facility, such as the one in this article:

http://www.codeproject.com/cs/algorithms/BoyerMooreSearch.asp

(but you'd have to modify the code to search a byte array instead of a char array).

If your file is MBCS or UTF-8, you're likely best off just sticking with the .NET string classes.

-cd

As CD has said, this is expected, as strings are UTF-16. My question would be how you are searching this file.

Are you just doing keyword searches? Depending on the type of search, you might be much better off building an index of the file and loading the index into memory.

Cheers,

Greg
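The doubling that CD and Greg describe can be checked on a single 80-character record. This is only a minimal sketch, assuming the file holds single-byte (ANSI/ASCII) text; the class and method names are made up for illustration.

```csharp
using System;
using System.Text;

static class StringSizeDemo
{
    // Bytes needed for the characters of an in-memory .NET string (UTF-16).
    public static int Utf16Bytes(string text)
    {
        return Encoding.Unicode.GetByteCount(text);
    }

    // Bytes the same text occupies when stored as single-byte ASCII.
    public static int SingleByteBytes(string text)
    {
        return Encoding.ASCII.GetByteCount(text);
    }

    static void Main()
    {
        string record = new string('x', 80);            // one 80-character record
        Console.WriteLine(SingleByteBytes(record));     // prints 80
        Console.WriteLine(Utf16Bytes(record));          // prints 160
    }
}
```

By the same arithmetic, the character data of a 60 MB single-byte file needs roughly 120 MB as a string. The sketch says nothing about where the rest of the reported 200 MB goes.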
Greg,

The text file consists of records of 80 characters separated by a newline. Each record has a record type, 1 through 9. A set of records starts with record type 1 and ends with record type 9, at which point the next set starts with record type 1.

I search the contents of the file for the search criteria entered by the user, for instance 2742281. When I find this sequence, I have to make sure it was found in exactly the right position within the record, so that I know I have compared it to the right field. Then I have to show the record found (which should be record type 1) and all following records until I find record type 9 (or EOF).

I could create an index, but that would complicate the app. I also thought of creating a DataTable to ease the search, but I'm pretty sure memory consumption would be even worse...

I was just wondering why the current app is consuming so much memory, which is now clear to me. I guess my client will have to make the decision: a cheap app which uses a lot of memory, or a slightly more expensive app which uses less memory.

Regards,
Jurjen.
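The position check Jurjen describes can be done with modulo arithmetic when every record has the same length. The following is a sketch under assumptions, not Jurjen's actual code: it assumes each record is exactly 80 characters plus a fixed-length line terminator (so the stride is 81 for "\n" or 82 for "\r\n"), and the names FindRecord, column and stride are hypothetical. The field column used in the demo (10) is invented.

```csharp
using System;

static class RecordSearch
{
    // Returns the 0-based record number of the first record that contains
    // key starting exactly at the given column, or -1 if there is none.
    // ASSUMPTION: every record occupies exactly `stride` characters.
    public static int FindRecord(string data, string key, int column, int stride)
    {
        int start = 0;
        while (true)
        {
            int pos = data.IndexOf(key, start, StringComparison.Ordinal);
            if (pos < 0)
                return -1;                 // no further occurrence
            if (pos % stride == column)
                return pos / stride;       // hit sits in the right field
            start = pos + 1;               // wrong position, keep looking
        }
    }

    static string MakeRecord(string field, int column)
    {
        return (new string(' ', column) + field).PadRight(80);
    }

    static void Main()
    {
        // Record 0 has the digits in the wrong column, record 1 in the right one.
        string data = MakeRecord("2742281", 20) + "\n" + MakeRecord("2742281", 10) + "\n";
        Console.WriteLine(FindRecord(data, "2742281", 10, 81)); // prints 1
    }
}
```

Walking on from the returned record number until a type 9 record is found is left out, because the thread does not say where in the record the type is stored.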
Hi Jurjen,

Sounds to me like you could just use ReadLine() and do a search per record. You should use the encoding used in the file. If you don't specify an encoding, UTF-8 is used. You would need some added logic to keep track of an entire record set, which can be a string[] of length 9.

    // Needs: using System; using System.IO; using System.Text;
    // GetRecordNumber and SearchRecordSet are not defined in this snippet.
    StreamReader sr = new StreamReader("", Encoding.Default);
    string s = null;
    string[] recordset = new string[9];
    int index = 0;

    while ((s = sr.ReadLine()) != null)
    {
        int i = GetRecordNumber(s);

        if (i > (index + 1))
            ; // missing record

        recordset[index] = s;

        index++;

        if (i == 9) // complete record
        {
            if (SearchRecordSet(recordset))
                return true;

            Array.Clear(recordset, 0, 9);
            index = 0;
        }
    }

PS! Your system clock is a bit too fast

--
Happy Coding!
Morten Wennevik [C# MVP]
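Morten's snippet leaves GetRecordNumber and SearchRecordSet undefined. The sketch below is one possible self-contained variant of the same line-by-line idea, not Morten's implementation. It assumes (purely for illustration) that the record type is the first character of each line and that the search key is compared against the type 1 record at a fixed column. It also uses a List instead of a fixed string[9], so a set may hold any number of records.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class LineSearch
{
    // ASSUMPTION: the record type is the first character of the line ('1'..'9').
    static int GetRecordNumber(string line)
    {
        return (line.Length > 0 && char.IsDigit(line[0])) ? line[0] - '0' : 0;
    }

    // ASSUMPTION: the key must appear in the type 1 record at a fixed column.
    static bool SetMatches(List<string> set, string key, int column)
    {
        if (set.Count == 0 || GetRecordNumber(set[0]) != 1)
            return false;
        string first = set[0];
        return first.Length >= column + key.Length
            && string.CompareOrdinal(first, column, key, 0, key.Length) == 0;
    }

    // Returns the first matching record set (type 1 up to type 9 or EOF), or null.
    public static List<string> FindSet(TextReader reader, string key, int column)
    {
        List<string> set = new List<string>();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            int type = GetRecordNumber(line);
            if (type == 1 && set.Count > 0)
            {
                // Previous set was never closed by a type 9 record.
                if (SetMatches(set, key, column)) return set;
                set = new List<string>();
            }
            set.Add(line);
            if (type == 9)
            {
                if (SetMatches(set, key, column)) return set;
                set = new List<string>();
            }
        }
        return SetMatches(set, key, column) ? set : null; // set ended by EOF
    }

    static void Main()
    {
        string text = "1AAA2742281\n2xxx\n9end\n1BBB5550000\n5yyy\n9end\n";
        List<string> hit = FindSet(new StringReader(text), "5550000", 4);
        Console.WriteLine(hit == null ? "not found" : hit.Count + " records");
    }
}
```

For a real file the StringReader would be replaced by a StreamReader opened with the file's encoding, as Morten suggests.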
Morten,

Thanks for your reply. I understand what you're doing in the code, but isn't reading line by line slow? The file is over 64 MB in size; reading it line by line to do a search seems like a lot of overhead, especially when the user does many searches while running the app. It would mean reading and searching the > 64 MB file many times. That's why I opted to keep the file in memory, which might not be the best idea.

I'm currently trying to get some more time from my client to optimize this by creating an index of the file (which doesn't change that often), searching through that, and retrieving the part of the text file corresponding to the index entry...

Jurjen.
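The index Jurjen is considering could look roughly like the sketch below: one pass over the file records the line number of every type 1 record under its key, and later lookups seek straight to that record. This is an illustration under assumptions, not a design taken from the thread. It assumes the record type is the first character, a key of known width at a known column, fixed-length records, and a single-byte encoding. All names and parameters are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

static class RecordIndex
{
    // Maps the key field of each type 1 record to the 0-based line numbers where it occurs.
    public static Dictionary<string, List<int>> Build(TextReader reader, int keyColumn, int keyLength)
    {
        Dictionary<string, List<int>> index =
            new Dictionary<string, List<int>>(StringComparer.Ordinal);
        string line;
        int lineNumber = 0;
        while ((line = reader.ReadLine()) != null)
        {
            if (line.Length > 0 && line[0] == '1' && line.Length >= keyColumn + keyLength)
            {
                string key = line.Substring(keyColumn, keyLength);
                List<int> hits;
                if (!index.TryGetValue(key, out hits))
                {
                    hits = new List<int>();
                    index.Add(key, hits);
                }
                hits.Add(lineNumber);
            }
            lineNumber++;
        }
        return index;
    }

    // Reads one record directly from the file using the fixed record length.
    // stride = record length plus line terminator length, in bytes.
    public static string ReadRecord(Stream file, int lineNumber, int recordLength, int stride)
    {
        byte[] buffer = new byte[recordLength];
        file.Seek((long)lineNumber * stride, SeekOrigin.Begin);
        int read = 0;
        while (read < recordLength)
        {
            int n = file.Read(buffer, read, recordLength - read);
            if (n == 0) break;              // end of file
            read += n;
        }
        return Encoding.ASCII.GetString(buffer, 0, read);
    }

    static void Main()
    {
        string text = "1AAA2742281\n2xxxxxxxxxx\n9endxxxxxxx\n1BBB5550000\n9endxxxxxxx\n";
        Dictionary<string, List<int>> index = Build(new StringReader(text), 4, 7);
        MemoryStream file = new MemoryStream(Encoding.ASCII.GetBytes(text));
        int lineNumber = index["5550000"][0];
        Console.WriteLine(ReadRecord(file, lineNumber, 11, 12));
    }
}
```

The index holds one entry per type 1 record rather than the whole file, so only the keys and line numbers stay in memory. The trade-off is the extra pass to build it whenever the file changes.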
I haven't done speed tests, but you may well find that the time taken to locate a record set by reading the file line by line is not much, considering the processing power of today's computers.

--
Happy Coding!
Morten Wennevik [C# MVP]

I created a ~80 MB text file consisting of 80 characters per line, with a known word on the second-to-last line.

Opening and searching line by line took ~2.4 seconds each time.
Opening and reading the entire file before a search took ~6.4 seconds, with ~0.4 seconds for each search.

Added complexity in the processing code will have less impact on the first method in percentage terms, considering that you then already have a complete record set, whereas in the second method you would need additional searches.

--
Happy Coding!
Morten Wennevik [C# MVP]

Given time, the fastest option is probably using a FileStream and a search algorithm like Boyer-Moore, but the complexity of the code would also increase accordingly.

--
Happy Coding!
Morten Wennevik [C# MVP]

See my earlier reply to the original post. That's exactly what I do in a program that reads similar fixed-format text files: read the entire file into a byte array via a single call to FileStream.Read and then use a Boyer-Moore search on that. It's about 5x faster than reading the entire file into a string and using string.IndexOf. There's a link to the Boyer-Moore implementation in my earlier post. The version in the article works on strings, but it's straightforward to convert it to work on byte arrays.

I found the byte array search using BM to be about 2x faster than the "Find" function in Visual Studio (which is quite fast), but about 2x slower than the "Find" function in Notepad, which is also Boyer-Moore (well, QuickSearch, actually) but is written in C instead of C#. The array bounds checking on access to the supplemental arrays used by BM really hurts the performance, but it's still quite speedy.

-cd

Ah, sorry, didn't catch the last part of your first post.

--
Happy Coding!
Morten Wennevik [C# MVP]
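As a rough illustration of the byte-array search CD describes, here is a sketch using the Horspool simplification of Boyer-Moore (bad-character table only). It is not the CodeProject code linked earlier and not CD's program; the class and method names are hypothetical, and the search key is assumed to be encoded with the same single-byte encoding as the file.

```csharp
using System;
using System.Text;

static class ByteSearch
{
    // Boyer-Moore-Horspool: offset of the first occurrence of pattern in data
    // at or after start, or -1. An empty pattern is treated as "not found".
    public static int IndexOf(byte[] data, byte[] pattern, int start)
    {
        int m = pattern.Length;
        if (start < 0) start = 0;
        if (m == 0 || data.Length - start < m)
            return -1;

        // How far the window may shift, keyed by the byte under its last position.
        int[] skip = new int[256];
        for (int i = 0; i < 256; i++) skip[i] = m;
        for (int i = 0; i < m - 1; i++) skip[pattern[i]] = m - 1 - i;

        int pos = start;
        while (pos <= data.Length - m)
        {
            int j = m - 1;
            while (j >= 0 && data[pos + j] == pattern[j]) j--;   // compare right to left
            if (j < 0)
                return pos;
            pos += skip[data[pos + m - 1]];
        }
        return -1;
    }

    static void Main()
    {
        // For a real file the bytes could come from System.IO.File.ReadAllBytes(path).
        byte[] data = Encoding.ASCII.GetBytes("1AAA2742281\n2xxx\n9end\n1BBB5550000\n");
        byte[] key = Encoding.ASCII.GetBytes("5550000");
        Console.WriteLine(IndexOf(data, key, 0)); // prints 26
    }
}
```

The returned byte offset can then be checked against the expected field position in the same way as a string offset, since one byte equals one character in a single-byte file.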
Hi Jurjen,

My napkin calculations show that you have a 750,000-record table that doesn't change often and that needs to be searched many times by each user.

Dude... is SQL Server really so bad an option? C'mon! Why write this capability into your app when it is available to you for free?

--
--- Nick Malik [Microsoft]
MCSD, CFPS, Certified Scrummaster
http://blogs.msdn.com/nickmalik

Disclaimer: Opinions expressed in this forum are my own, and not representative of my employer.
I do not answer questions on behalf of my employer. I'm just a programmer helping programmers.
--
Nick,

SQL isn't an option, as this is supposed to be a very simple (+/- 2 hour) solution to a problem. The program will most probably only be used for a week or so. Doing SQL would be overkill in this specific situation.

I would like to thank everyone for their replies.

Jurjen.