10-19-2006 11:44 AM
04-29-2011 11:34 AM
I'm resurrecting this thread, as I can't imagine someone hasn't come up with an efficient solution to this problem. Reading arbitrary lines from a large data file seems like it would be a fairly common technique!
Jarrod's VI does allow you to update the refnum, but -as you'd expect- the time required to loop through each line grows with file size. As an example, I'm trying to read in a file that will almost always have at least 2M lines, but the data in consecutive lines is very well defined (e.g. 0, 1, 2, 3, etc). With this knowledge of starting value and increment, has anyone come up with a way to efficiently go to an arbitrary point (say, line 1,495,290)?
I know there are some real pros out there and I bet this would benefit a lot of folks.
04-29-2011 01:38 PM
@joshmont wrote:
Jarrod's VI does allow you to update the refnum, but -as you'd expect- the time required to loop through each line grows with file size. As an example, I'm trying to read in a file that will almost always have at least 2M lines, but the data in consecutive lines is very well defined (e.g. 0, 1, 2, 3, etc). With this knowledge of starting value and increment, has anyone come up with a way to efficiently go to an arbitrary point (say, line 1,495,290)?
I know there are some real pros out there and I bet this would benefit a lot of folks.
Can you be a bit more specific about how your lines are formatted? If you know that every line has exactly the same number of characters, you can use "Set File Position" (File I/O -> Advanced File Operations) to jump to a specific location in the file. However, this only works if each line has the same number of characters. It's not enough for them to have similar formatting (for example, you might have a file in which each line has 5 numbers separated by tabs, but if some of those numbers have 3 digits and others have 4, then you have different numbers of characters). This is one advantage of binary files - any given datatype will always contain the same number of bytes regardless of the actual value it contains.
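Since LabVIEW is graphical, here is a hedged textual sketch of the same seek-by-record idea in Python (a stand-in for "Set File Position", not the actual VI; the record length and file name are illustrative assumptions):

```python
# Sketch of fixed-length-record seeking, assuming every line,
# newline included, is exactly RECORD_LEN bytes.
RECORD_LEN = 16  # bytes per line, including "\r\n" (assumption)

def read_line_at(path, line_number):
    """Jump straight to line `line_number` (0-based) without looping."""
    with open(path, "rb") as f:
        f.seek(line_number * RECORD_LEN)   # O(1) jump, no per-line scan
        return f.read(RECORD_LEN).decode("ascii").rstrip("\r\n")
```

The jump cost is constant regardless of file size, which is exactly what the loop-per-line approach lacks.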
04-29-2011 01:51 PM
@nathand wrote:
Can you be a bit more specific about how your lines are formatted? If you know that every line has exactly the same number of characters, you can use "Set File Position" (File I/O -> Advanced File Operations) to jump to a specific location in the file. However, this only works if each line has the same number of characters. It's not enough for them to have similar formatting (for example, you might have a file in which each line has 5 numbers separated by tabs, but if some of those numbers have 3 digits and others have 4, then you have different numbers of characters). This is one advantage of binary files - any given datatype will always contain the same number of bytes regardless of the actual value it contains.
Right, I forgot to mention that the reason it's been a problem is that the number of characters per line is not consistent. I've included a sample from a data file; imagine that it continues to at least 2x10^6. I'm not familiar with binary files. Perhaps this is something I should read more about?
04-29-2011 01:57 PM
@joshmont wrote:
Right, I forgot to mention that the reason it's been a problem is that the number of characters per line is not consistent. I've included a sample from a data file; imagine that it continues to at least 2x10^6. I'm not familiar with binary files. Perhaps this is something I should read more about?
Depends on what you're doing with the files, but if you don't need them to be human-readable, binary is probably the way to go. You don't run the risk of losing any precision due to formatting (the bytes are written to the file exactly as the computer stores them) and you don't have to do any conversion from strings to numbers, saving time and memory. For a small file that's never an issue, but once you're talking about millions of lines it can make a difference. I have no experience with TDMS but you might look into that as well (it's also a binary format, but with some additional information to help with formatting and organizing the data).
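To illustrate why fixed-size binary records make random access trivial, here is a Python sketch (not LabVIEW; the file layout and function names are illustrative assumptions): a float64 is always 8 bytes, so record k starts at byte k * 8 no matter what values the file holds.

```python
import struct

def write_doubles(path, values):
    """Write each value as a big-endian float64 (8 bytes each)."""
    with open(path, "wb") as f:
        for v in values:
            f.write(struct.pack(">d", v))

def read_double(path, index):
    """Read record `index` directly: fixed size means a direct jump."""
    with open(path, "rb") as f:
        f.seek(index * 8)
        return struct.unpack(">d", f.read(8))[0]
```

No string parsing, no precision loss from formatting, and the seek cost is independent of how many records precede the one you want.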
06-16-2011 02:36 PM - edited 06-16-2011 02:38 PM
You can get "close" to a line jump.
Create a sampling distribution (say, 10 KB samples at 100 points) and count the lines in each sample. Then either:
a) fit those points to a model, and use the model to interpolate the byte location you want to jump to; or
b) compute a simple ratio (lines per byte) and use that.
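A Python sketch of option (b), the lines-per-byte ratio, combined with the original poster's knowledge that the first field counts 0, 1, 2, ... (all names here are illustrative assumptions, not LabVIEW code): jump to the interpolated byte offset, back off if the jump overshot, then scan forward to the exact line.

```python
def find_line(path, target, sample_bytes=10_000):
    """Locate the line whose first field equals `target`, assuming
    the first column counts 0, 1, 2, ... as in the poster's files."""
    with open(path, "rb") as f:
        chunk = f.read(sample_bytes)
        bytes_per_line = len(chunk) / max(chunk.count(b"\n"), 1)
        guess = int(target * bytes_per_line)
        while guess > 0:
            f.seek(guess)
            f.readline()                      # discard the partial line we landed in
            start = f.tell()
            raw = f.readline()
            if raw and int(raw.split()[0].decode()) <= target:
                f.seek(start)                 # at or before the target: scan from here
                break
            guess //= 2                       # overshot (or hit EOF): back off
        else:
            f.seek(0)                         # worst case: scan from the top
        for raw in iter(f.readline, b""):
            if int(raw.split()[0].decode()) == target:
                return raw.decode().rstrip()
    return None
```

The estimate is only approximate when line lengths drift (e.g. the numbers grow more digits deeper in the file), which is why option (a) fits a model instead of a single ratio; the forward scan absorbs the residual error either way.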
Similar to the algorithm I used when building "alc", an approximate line-counting tool, which uses a linear model only.
Or you can index the file once, storing line numbers and their byte offsets in a separate file, and seek from there.
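A minimal Python sketch of that indexing idea (stride, names, and paths are illustrative assumptions): record the byte offset of every Nth line in one pass, then any lookup seeks to the nearest checkpoint and scans at most N-1 lines.

```python
STRIDE = 1000  # index every 1000th line (assumption)

def build_index(data_path):
    """One pass over the file: index[k] = byte offset of line k*STRIDE."""
    index = []
    with open(data_path, "rb") as f:
        offset = 0
        for lineno, raw in enumerate(iter(f.readline, b"")):
            if lineno % STRIDE == 0:
                index.append(offset)
            offset += len(raw)
    return index

def read_line(data_path, index, line_number):
    """Seek to the nearest checkpoint, then scan at most STRIDE-1 lines."""
    with open(data_path, "rb") as f:
        f.seek(index[line_number // STRIDE])
        for _ in range(line_number % STRIDE):
            f.readline()
        return f.readline().decode().rstrip("\r\n")
```

In practice you would persist the index (e.g. in that separate file) so the one-time full scan is paid only once per data file.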