
Best way to scan a file?

Posted: Mon Dec 03, 2007 6:21 pm
by GameZelda
Well, I'm relatively new here, so I don't know if this is the appropriate forum :oops:

I'm coding a file ripper for a game (in C), but my problem is that it's just TOOO slow :(

My method is simple: read a DWORD and compare it with the magic. If it matches, extract the file. If not, go back 3 bytes and keep scanning.

But if I do the same search in Hex Workshop, it gives me the result much faster. What's the secret? Checking it character by character?

The question is just... what's the best way to scan a big file?

Posted: Mon Dec 03, 2007 7:21 pm
by Rheini
Well, this is a string matching problem. You can decrease scanning time by preprocessing the search string. An example is the Knuth-Morris-Pratt algorithm, which searches for occurrences of a "word" W within a main "text string" S by exploiting the observation that, when a mismatch occurs, the word itself embodies enough information to determine where the next match could begin, bypassing re-examination of previously matched characters.
This algorithm already gives a good improvement in speed.
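A minimal sketch of the idea in C (the prefix table is the preprocessing step; the names are just for illustration, and the pattern length is capped at 64 to keep it short):

Code:
#include <stdio.h>
#include <string.h>

/* Minimal KMP sketch: report every occurrence of `pat` (length m) in
   `text` (length n).  The prefix table `pi` is the preprocessing step. */
static void kmp_search(const unsigned char *text, size_t n,
                       const unsigned char *pat,  size_t m)
{
    size_t pi[64];                         /* sketch: assumes 1 <= m <= 64 */
    if (m == 0 || m > 64)
        return;

    /* build the prefix table: pi[i] = length of the longest proper
       prefix of pat[0..i] that is also a suffix of it */
    pi[0] = 0;
    for (size_t i = 1, k = 0; i < m; i++) {
        while (k > 0 && pat[i] != pat[k]) k = pi[k - 1];
        if (pat[i] == pat[k]) k++;
        pi[i] = k;
    }

    /* scan: on a mismatch, fall back via the table instead of rewinding */
    for (size_t i = 0, k = 0; i < n; i++) {
        while (k > 0 && text[i] != pat[k]) k = pi[k - 1];
        if (text[i] == pat[k]) k++;
        if (k == m) {
            printf("match at offset %zu\n", i - m + 1);
            k = pi[k - 1];
        }
    }
}

int main(void)
{
    const unsigned char text[] = "xxABCDyyBCDzz";
    kmp_search(text, strlen((const char *)text),
               (const unsigned char *)"BCD", 3);
    return 0;
}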
But a (general) file ripper needs to compare many signatures, so you might also want to have a look at other algorithms like Aho-Corasick.
I personally used this one as a test in my own small scanning engine :)

Posted: Tue Dec 04, 2007 3:56 pm
by Darkstar
I, too, would use Aho-Corasick. You don't have to worry about storing the generated graph somewhere: just use the canonical algorithm to build the scanning graph at runtime. Normally you have only a few hundred signatures at most, so the time it takes to build the graph is quite small compared to the search time itself.

Advantages:
Aho-Corasick trivially supports multiple matching patterns (i.e. if you have a file magic ABCD and another one BCD it will find both in the string XXABCDYY)
It's blazingly fast, especially when combined with a smart buffering algorithm (so that you don't have to read each byte from the file by itself). The stdio buffers work well enough in practice.
The time it takes to scan does not depend on the number of patterns you're searching for (more patterns only take longer when building the graph). You only have to take one step in the graph for each byte read, and that takes constant time.
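A rough C sketch of the canonical build-at-runtime algorithm (all sizes and names here are just examples; I use a flat goto table for brevity and cap it at 32 patterns so the output per node can be a plain bitmask):

Code:
#include <stdio.h>

#define ALPHA    256
#define MAXNODES 4096               /* plenty for a few hundred short magics */

static int      trans[MAXNODES][ALPHA];  /* goto function (0 = root) */
static int      fail[MAXNODES];          /* failure links */
static unsigned out[MAXNODES];           /* bit i set => pattern i ends here */
static int      nnodes = 1;

static void ac_add(const unsigned char *pat, size_t len, int id)
{
    int s = 0;
    for (size_t i = 0; i < len; i++) {
        if (!trans[s][pat[i]])
            trans[s][pat[i]] = nnodes++;
        s = trans[s][pat[i]];
    }
    out[s] |= 1u << id;
}

/* breadth-first pass: fill the failure links and redirect the missing
   transitions so the trie becomes a full automaton */
static void ac_build(void)
{
    int queue[MAXNODES], head = 0, tail = 0;
    for (int c = 0; c < ALPHA; c++)
        if (trans[0][c])
            queue[tail++] = trans[0][c];     /* depth-1 nodes fail to root */
    while (head < tail) {
        int s = queue[head++];
        for (int c = 0; c < ALPHA; c++) {
            int t = trans[s][c];
            if (t) {
                fail[t] = trans[fail[s]][c];
                out[t] |= out[fail[t]];      /* inherit shorter matches */
                queue[tail++] = t;
            } else {
                trans[s][c] = trans[fail[s]][c];
            }
        }
    }
}

/* one step per byte; returns the new state so the caller can keep
   scanning across buffer refills */
static int ac_scan(int s, const unsigned char *buf, size_t len, long base)
{
    for (size_t i = 0; i < len; i++) {
        s = trans[s][buf[i]];
        for (unsigned m = out[s], id = 0; m; m >>= 1, id++)
            if (m & 1)
                printf("pattern %u ends at offset %ld\n", id, base + (long)i);
    }
    return s;
}

int main(void)
{
    ac_add((const unsigned char *)"ABCD", 4, 0);
    ac_add((const unsigned char *)"BCD", 3, 1);
    ac_build();
    const unsigned char text[] = "XXABCDYY";
    ac_scan(0, text, 8, 0);                  /* reports both patterns */
    return 0;
}

Because ac_scan() takes and returns the automaton state, you can feed it one buffer after another and still catch matches that straddle buffer boundaries.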

BTW, does anyone know if there are other similar parallel-matching algorithms available? Aho-Corasick was the only one I could find.

Posted: Tue Dec 04, 2007 4:58 pm
by Rheini
How do you implement the graph? The simple way is to use two-dimensional arrays; linear lists are also possible, and I found an interesting approach here: http://linux.thai.net/~thep/datrie/datrie.html
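For comparison, the two simple layouts could look roughly like this in C (field names made up):

Code:
/* Flat layout: one 256-entry transition row per node.  One array index
   per byte read, but memory-hungry when the trie has many nodes. */
struct node_array {
    int next[256];              /* 0 = no transition */
    int fail;
    int output;
};

/* Sparse layout: a linked list of (byte, child) edges per node.  Much
   smaller, but each step may have to walk the list. */
struct edge {
    unsigned char byte;
    struct node_list *child;
    struct edge *next;
};
struct node_list {
    struct edge *edges;
    struct node_list *fail;
    int output;
};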

I don't know any other algorithm like AC. In the end it simply combines tries with the "failure function" idea from KMP.

I thought a lot about how to efficiently implement matching with wildcards to support signatures like 5A 78 ?? FF ?? ??. I found some ideas in these slides: http://www.cs.uku.fi/~kilpelai/BSA05/le ... ides04.pdf
The drawback is that you need an array as big as the text to store occurrence counts. I wonder if this is possible without so much memory usage :?:

Posted: Thu Dec 06, 2007 1:06 pm
by Rheini
Rheini wrote:I thought a lot about how to efficiently implement matching with wildcards to support signatures like 5A 78 ?? FF ?? ??. I found some ideas in these slides: http://www.cs.uku.fi/~kilpelai/BSA05/le ... ides04.pdf
The drawback is that you need an array as big as the text to store occurrence counts. I wonder if this is possible without so much memory usage :?:
Another problem is that their approach can only search for a single string with wildcards. And you have to go through the whole array and check the counts after the scan, so it also roughly doubles the time needed.

I've got an idea that theoretically solves most of these problems. Why not use a kind of sliding window (implemented as a circular buffer) to store the occurrence counts?
Every time something is written into the buffer, the slot that is about to be overwritten is checked for a match.
Since we want to use more than one search string, we have to store an identifier for the keyword alongside the occurrence count.
So what if there are two strings with the same beginning or something similar? This could be solved by using an array of linked lists.

Hope this is somewhat understandable. If you find any flaws in my idea please tell me.
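A rough sketch of what I mean, for a single pattern with ?? wildcards (the pieces are matched naively here, the keyword-identifier part is left out for now, and all names and sizes are just for illustration):

Code:
#include <stdio.h>
#include <string.h>

/* One fixed piece of a wildcard pattern, e.g. 5A 78 ?? FF splits into
   the pieces {5A 78, offset 0} and {FF, offset 3}. */
struct piece {
    const unsigned char *bytes;
    size_t len;
    size_t off;                 /* offset of the piece inside the pattern */
};

/* Sliding-window counting scan: instead of an occurrence-count array as
   big as the text, keep counts only for the last `span` candidate start
   positions in a ring buffer, and check each slot right before reusing it. */
static void wildcard_scan(const unsigned char *text, size_t n,
                          const struct piece *pieces, size_t npieces,
                          size_t span)     /* full pattern length incl. ?? */
{
    size_t ring[64] = {0};                 /* sketch: assumes span <= 64 */

    for (size_t i = 0; i <= n; i++) {
        /* the slot we are about to reuse belongs to start position i - span;
           every piece of a match starting there has been seen by now */
        if (i >= span && ring[(i - span) % span] == npieces)
            printf("wildcard match at offset %zu\n", i - span);
        ring[i % span] = 0;                /* slot now counts for position i */

        if (i == n)                        /* final iteration only flushes */
            break;

        /* naive piece matching: which pieces end at position i? */
        for (size_t k = 0; k < npieces; k++) {
            const struct piece *pc = &pieces[k];
            if (i + 1 >= pc->len + pc->off &&
                memcmp(text + i + 1 - pc->len, pc->bytes, pc->len) == 0)
                ring[(i + 1 - pc->len - pc->off) % span]++;
        }
    }
}

int main(void)
{
    /* pattern 5A 78 ?? FF over a small test buffer */
    const unsigned char text[] = { 0x00, 0x5A, 0x78, 0x11, 0xFF, 0x22 };
    const unsigned char p0[] = { 0x5A, 0x78 };
    const unsigned char p1[] = { 0xFF };
    struct piece pieces[] = { { p0, 2, 0 }, { p1, 1, 3 } };
    wildcard_scan(text, sizeof text, pieces, 2, 4);  /* match at offset 1 */
    return 0;
}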

Posted: Mon Dec 10, 2007 11:05 am
by Darkstar
Rheini wrote:How do you implement the graph? The simple way is to use two-dimensional arrays; linear lists are also possible, and I found an interesting approach here: http://linux.thai.net/~thep/datrie/datrie.html
I'm currently using real heap objects for each node (linked to each other by pointers), but I'm moving to a simple linear array (each node is an entry in the array). The trie method sounds interesting, but I guess until I implement lots and lots of patterns I won't notice any relevant speed gain (my ripper currently contains about 50 magics for some 10 different file formats).
Rheini wrote:I thought a lot about how to efficiently implement matching with wildcards to support signatures like 5A 78 ?? FF ?? ??.
In my implementation I avoid the problem by shortening the pattern to (in your example) 5A 78 and then letting the extractor function handle the final validation. It creates a few false positives ("few" meaning a few hundred to a few thousand in a normal file), but since I have to check the format for validity anyway (some file formats have very generic magics like "M.K." or hex 00 01), it's a simple "hack" to add a false-positive checking stage to them.
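Schematically, something like this (the names and the example validator are made up, just to show the shape):

Code:
#include <stdio.h>

/* signature table: a short scan magic plus a validator that does the
   real format check and weeds out the false positives */
struct sig {
    const unsigned char *magic;
    size_t magic_len;
    /* returns nonzero if the data at `p` really is this format */
    int (*validate)(const unsigned char *p, size_t remaining);
};

static int validate_example(const unsigned char *p, size_t remaining)
{
    /* e.g. re-check the bytes behind the wildcard, header sizes, etc. */
    return remaining >= 4 && p[3] == 0xFF;
}

static const unsigned char example_magic[] = { 0x5A, 0x78 };
static const struct sig sigs[] = {
    { example_magic, 2, validate_example },
};

/* called once the fast scan found sigs[k].magic at text + pos */
static void on_hit(const struct sig *s, const unsigned char *text,
                   size_t n, size_t pos)
{
    if (s->validate(text + pos, n - pos))
        printf("confirmed match at offset %zu\n", pos);
    /* otherwise it was one of the false positives and is simply skipped */
}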

Posted: Mon Dec 10, 2007 1:52 pm
by Rheini
Darkstar wrote:I'm currently using real heap objects for each node (linked to each other by pointers), but I'm moving to a simple linear array (each node is an entry in the array).
So you mean an array of linked lists?
By the way, AC suggests using an array for frequently used states like the root state and linked lists for the other ones.
Darkstar wrote:In my implementation I avoid the problem by shortening the pattern to (in your example) 5A 78 and then letting the extractor function handle the final validation.
OK, I guess this is the best way for file rippers because, as you said, the format has to be validated anyway. But wildcard support is necessary for other applications, I think.

Posted: Sat Dec 15, 2007 7:45 pm
by Rahly
You guys missed the obvious.

You should read from the disk in bulk reads, not DWORD by DWORD. Disk I/O is atrocious in terms of speed. By using a larger buffer you give that speed back to the user. The ideal buffer size is largely a matter of opinion; optimally you would read the number of sectors in a track, since reading one track at a time is generally the fastest method, but IMO you should at least read in blocks of the sector size. Disks use read-ahead caching, so seeking back 3 bytes causes a cache miss and a reread.

So, if your sector size is 4k, then a buffer of 10*4096 bytes, or 10 sectors, would seriously improve your speed. For the algorithmic searching, you might want to do a small calculation so you don't have to go back 3 characters at all. There are many string searching algorithms, like the ones posted above, that fly through chunks of data.
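Something along these lines (the buffer size is just an example; scan_chunk stands in for whatever matcher you use and carries its state across refills):

Code:
#include <stdio.h>
#include <stdlib.h>

#define CHUNK (64 * 1024)              /* example size; tune to taste */

/* read the file in large blocks instead of DWORD-sized reads and feed
   each block to the matcher; `state` is whatever the matcher needs to
   carry across block boundaries (e.g. an automaton state) */
int scan_file(const char *path,
              int (*scan_chunk)(int state, const unsigned char *buf,
                                size_t len, long file_offset))
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;

    unsigned char *buf = malloc(CHUNK);
    if (!buf) { fclose(f); return -1; }

    long offset = 0;
    int state = 0;
    size_t got;

    while ((got = fread(buf, 1, CHUNK, f)) > 0) {
        state = scan_chunk(state, buf, got, offset);
        offset += (long)got;
    }

    free(buf);
    fclose(f);
    return 0;
}

The ac_scan() sketch from earlier in the thread happens to fit that callback shape.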

P.S. Shouldn't this be in the coding forum?

Posted: Sat Dec 15, 2007 9:30 pm
by Rheini
File mapping is another option worth considering: http://msdn2.microsoft.com/en-us/library/aa366556.aspx

Posted: Sun Dec 16, 2007 8:17 pm
by Darkstar
Disk buffering is indeed helpful. The stdlib does its own little buffering with a few bytes/kbytes, so that's a first step. When in doubt, you can also open the file with a special "sequential" flag, which optimizes disk I/O for this access pattern.

File mapping is indeed a nice option, because it handles the file in 4k blocks at the lowest level possible (on a page basis), and because you can access the file with standard stdlib routines (strstr, strchr, ...).
However, on a 32-bit OS it's not possible to mmap() more than 1-2 GB at once, so bigger files are a bit more complicated to handle with mmap().
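For reference, a bare-bones Win32 sketch (it maps the whole file in one view, so it runs into exactly that address space limit for big files; the names are illustrative):

Code:
#include <windows.h>

/* map a whole file read-only and hand the view to some scan() function;
   note that zero-length files cannot be mapped */
int scan_mapped(const char *path,
                void (*scan)(const unsigned char *data, size_t size))
{
    HANDLE file = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                              OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, NULL);
    if (file == INVALID_HANDLE_VALUE)
        return -1;

    LARGE_INTEGER size;
    if (!GetFileSizeEx(file, &size)) { CloseHandle(file); return -1; }

    HANDLE map = CreateFileMappingA(file, NULL, PAGE_READONLY, 0, 0, NULL);
    if (!map) { CloseHandle(file); return -1; }

    /* one view over the whole file; a chunked version would map smaller
       windows and move them along the file instead */
    const unsigned char *view = MapViewOfFile(map, FILE_MAP_READ, 0, 0, 0);
    if (view) {
        scan(view, (size_t)size.QuadPart);
        UnmapViewOfFile(view);
    }

    CloseHandle(map);
    CloseHandle(file);
    return view ? 0 : -1;
}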