friendsofwatto wrote:
I usually try to find some documentation on the internet about the format, or even some open-source code that is easy to read.
Always good advice, and I can second that. But I only go looking after I have had a look at the archive myself and could not figure it out within a quarter of an hour.
friendsofwatto wrote:If I can't find anything, or the game is new, it means I have to discover it myself. I use a little program I wrote myself to hack apart the file formats, but it basically functions similarly to a hex editor.
The way forward is definitely using a good hex editor. I use Hex Workshop from BreakPoint Software. To me, that's the best tool available for pure hex editing. It incorporates a hex calculator (to easily convert back and forth between hex and decimal), but more importantly, it lets you bookmark positions in the file you have open, colour-map pieces of it, and so on. It also shows you the values at your current cursor position (e.g. viewed as byte, int, long, string etc.), all in one window.
I have written a tool in the past to help me figure out the formats. It was called MexScan and I could run it to scan a new putative archive. It ran on scripts that I could easily feed it. It would search for file identity tags, it would search for pieces of text to identify directory entries and it would pick up simple repeating patterns (e.g. set lengths of those entries) and point out the location of the start of a possible header or tail.
I never used it.
Why? Because I had already become completely adept at spotting an archive format from miles away, just by opening it in Hex Workshop. Using MexScan would just have been a waste of time.
friendsofwatto wrote:
Most archives are of 2 different formats - either directory-based or chunk-based....Finally, some archives actually store the directory in a separate file with the same name - for example, archive.arc and archive.idx - where the idx file contains the directory, and the arc file is the file data.
Well, I have come across some very weird formats in my time, and I wish things were always that easy. Unfortunately, although what you say is true, you will find a lot of different formats nonetheless. As an example, there's this PAK format from Painkiller. Although they used the tail-approach (stuffing the file information at the back), they had invented some kind of encoding for the filename strings. This really was quite a puzzle for me, hence the satisfaction when I cracked their code. Besides, they used Zlib compression for their resources, which I can spot from 10 miles away.
Another example is the TRE structure from Star Wars Galaxies, which uses compressed tails so that prying eyes can't easily identify a record structure. Too bad for them that they still pointed to the beginning of the compressed tail from somewhere at the start of the file, and I could easily spot the compression technique used. Thus, I could uncompress the tail and examine the actual information in there to write the TRE Archiver.
Sometimes, though, you will come across archive formats that are fully integrated into the game code, so that only the game code knows what to expect from the archive (i.e. the archive has no "universal" structure that will always apply when new archives are created). If this is the case, the best bet is to look for identity tags, such as the aforementioned "RIFF", and extract the resources they mark with a resource ripper.
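A minimal sketch of that kind of tag scan, assuming the whole archive fits in memory. The tag table and the function names are my own illustration, not taken from any particular ripper. For RIFF, the four bytes after the tag hold the little-endian length of the data that follows, so the full resource is that length plus the 8-byte header.

```python
import struct

# Hypothetical tag table; extend it with whatever signatures you recognise.
KNOWN_TAGS = {
    b"RIFF": "RIFF container (e.g. WAV, AVI)",
    b"OggS": "Ogg stream",
    b"\x89PNG": "PNG image",
}

def find_tags(data, tags=KNOWN_TAGS):
    """Return a sorted list of (offset, tag) for every occurrence of a known tag."""
    hits = []
    for tag in tags:
        pos = data.find(tag)
        while pos != -1:
            hits.append((pos, tag))
            pos = data.find(tag, pos + 1)
    return sorted(hits)

def riff_size(data, offset):
    """Size in bytes of the RIFF resource starting at offset (header included)."""
    (length,) = struct.unpack_from("<I", data, offset + 4)
    return 8 + length
```

A hit is only a candidate: four matching bytes can turn up by chance inside compressed or binary data, so check that the size field points somewhere sensible before cutting the resource out.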
friendsofwatto wrote:
At the start of the directory, the offset to the files will be small, so you will look for 4-byte numbers where the last few bytes are null or zero. You are mainly looking for patterns in the directory here, so move through the directory until you find another 4-bytes with a few null bytes at the end. If the second 4-byte number is bigger than the first, it usually indicates the offset to the file.
True, yet not always the case. In some instances the offsets are not in ascending order (i.e. the resources are not listed in the order in which they were saved in the archive). Also, in many cases the archive will only list offsets, and not sizes. This saves the programmers some space, especially if the resources are not compressed and are saved in the same order as they appear in the header or tail. To calculate a size, one only needs to subtract the offset of the file of interest from the offset of the next file in the list. What happens when you reach the last file depends on where the resource information was saved (header or tail). You calculate the size of the last entry by subtracting its offset either from the end of the archive (the total size of the archive) or from the start of the tail (if the tail comes directly after the last file).
There are many variations possible to this theme, however, and it will require logic to find ways to universally be able to open these variations.
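The size calculation for an offsets-only directory can be sketched as follows. This assumes the resources are stored back to back with no padding between them; the function name and the `data_end` parameter are mine. Sorting first takes care of directories that list their entries out of order.

```python
def sizes_from_offsets(offsets, data_end):
    """Map each file offset to its size.

    data_end is the total size of the archive when the directory sits in a
    header, or the offset where the tail starts when it sits in a tail.
    """
    ordered = sorted(set(offsets))
    # Each file ends where the next one starts; the last one ends at data_end.
    boundaries = ordered[1:] + [data_end]
    return {start: end - start for start, end in zip(ordered, boundaries)}
```

For example, `sizes_from_offsets([300, 100, 180], 420)` gives 80 bytes for the file at 100, 120 for the one at 180, and 120 for the one at 300.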
One variable you could also look for is the Number of Resources. Usually you will find this value somewhere in the archive. It is handy in two ways. First, it helps you identify the purpose of an unknown number in the file: once you have figured out the directory structure you can calculate the number of files yourself, and if that matches the unknown number you have successfully identified it AND gained good support for the putative structure. Second, it tells you how many entries to expect when reading the directory.
However, there are also a lot of cases where people have not used such a variable. In this case, you must use other logical techniques to calculate the number of entries in an archive.
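As a rough illustration of that cross-check, assuming fixed-length directory entries and a directory whose start and end you have already located (all names here are hypothetical):

```python
def entries_in_directory(dir_start, dir_end, entry_size):
    """Number of fixed-size entries that fit exactly, or None if they do not."""
    span = dir_end - dir_start
    if entry_size <= 0 or span < 0 or span % entry_size != 0:
        return None
    return span // entry_size

def count_matches(candidate, dir_start, dir_end, entry_size):
    """True if an unknown header value equals the computed entry count."""
    return entries_in_directory(dir_start, dir_end, entry_size) == candidate
```

When the archive stores no count at all, `entries_in_directory` on its own is one of the "other logical techniques": the directory length divided by the entry length gives the number of entries. With variable-length entries you have to walk the directory instead.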
Furthermore, regarding the strings that make up the names of the resources in an archive, there are a number of possibilities, of which I will list some.
First, a string may be null-terminated. That is, you will see a string of characters (e.g. "C:\sounds\arg.wav") that is immediately followed by a 0-byte. This way, entries in the header of an archive may vary in length, because the strings in them vary in length. No matter, just read the null-terminated string and continue processing the information that comes after it like a normal entry.
Second, somewhere before the string you may find a number that actually is the length of the string that follows. So, you read that number prior to reading the string, so you know how long it is, and continue processing the stuff that comes after the string right after the last character of the string.
Third, strings may have a fixed size. In Doom .wad files, for instance, every name occupies exactly 8 bytes (shorter names are padded with null bytes). In SadCom's .SAD files, strings have 256 bytes reserved for them but are not always 256 characters long: they are still null-terminated within that space. Just read all 256 bytes, cut the string at the first null, and process normally thereafter.
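The three string layouts can be sketched as three small readers. Each returns the string and the position where the rest of the entry continues. The 4-byte little-endian length prefix is an assumption for illustration (real formats also use 1- or 2-byte prefixes), as is the latin-1 decoding.

```python
import struct

def read_cstring(data, pos):
    """Null-terminated: read up to the next 0-byte and skip over it."""
    end = data.index(b"\x00", pos)
    return data[pos:end].decode("latin-1"), end + 1

def read_prefixed(data, pos):
    """Length-prefixed: a number first, then that many characters."""
    (length,) = struct.unpack_from("<I", data, pos)  # assumed 4-byte LE prefix
    start = pos + 4
    return data[start:start + length].decode("latin-1"), start + length

def read_fixed(data, pos, width):
    """Fixed-size field: read the full width, cut at the first 0-byte if any."""
    raw = data[pos:pos + width]
    return raw.split(b"\x00", 1)[0].decode("latin-1"), pos + width
```

`read_fixed(data, pos, 8)` covers the Doom case and `read_fixed(data, pos, 256)` the .SAD case; in both, the next field starts a fixed distance later no matter how short the name is.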
friendsofwatto wrote:
Make sure, however, to choose an archive which is likely to contain files you will recognise. For example, I usually try to choose an audio archive, or sound archive, because it is easy to confirm whether your field guesses are correct.
Yes, if you're new to this, it's a good way to start. As you examine more files, you will remember more ID tags, so eventually you can open just about any file. Also, the more files you examine, the more easily you will pick up patterns and structures. I have examined hundreds at least.
friendsofwatto wrote:
The difficult parts come when you are faced with archives that contain compression or encryption. The best thing to do, if you expect compression, is to extract a file from the archive (once you know the offset and size of the file from the directory) and run it through some standard decompressors (such as ZLIB and GZIP) to see if they work. If not, then you are pretty much stuck.
You can spot ZLib-compressed data by just looking for an "x" (hex 78) at the start of the compressed resource. If there's an 'x' as the first byte, then you have a good chance that it was ZLib compressed.
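A slightly stricter version of the 'x' test, as a sketch using Python's standard zlib module. Besides the first byte, it checks the second header byte: in a zlib stream the two bytes read as one big-endian 16-bit number are always a multiple of 31. The function names are my own.

```python
import zlib

def looks_like_zlib(data):
    """Check the two-byte zlib header: deflate method, valid window, checksum."""
    if len(data) < 2:
        return False
    cmf, flg = data[0], data[1]
    return (cmf & 0x0F) == 8 and (cmf >> 4) <= 7 and (cmf * 256 + flg) % 31 == 0

def try_inflate(data):
    """Return the decompressed bytes, or None if this is not zlib data."""
    if not looks_like_zlib(data):
        return None
    try:
        # decompressobj tolerates extra bytes after the end of the stream,
        # which helps when the exact compressed size is not known yet.
        return zlib.decompressobj().decompress(data)
    except zlib.error:
        return None
```

The header test alone still gives false positives now and then, so the real confirmation is whether the decompressed output looks like a recognisable file.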
The early Painkiller files just used an actual PKZip archive and altered the identity tags that the normal .ZIP format saves (i.e. the "PK" at the beginning of the file and throughout the file was removed). Quake 3, Thief 2 and many more just use standard .ZIP files, but with another file extension (e.g. .PK3 in the case of Quake 3). If you open a file and the first two characters are "PK", you can bet you have a normal ZIP archive. Just try renaming it to *.ZIP and open it with WinRAR or any other .ZIP archive handler.
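The same "PK" check can be done without renaming anything. This is a small sketch with Python's standard zipfile module; the function names are hypothetical. It only handles unmodified ZIP archives and does not attempt to repair archives whose tags were altered.

```python
import io
import zipfile

def looks_like_zip(data):
    """True if the data starts with the 'PK' identity tag."""
    return data[:2] == b"PK"

def list_zip_members(data):
    """Return the member names if the data opens as a ZIP archive, else None."""
    if not looks_like_zip(data):
        return None
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            return archive.namelist()
    except zipfile.BadZipFile:
        return None
```

Reading a .PK3 into memory and passing the bytes to `list_zip_members` should give the same listing a ZIP tool shows after renaming.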
friendsofwatto wrote:
PS - when you get better at it, it will usually only take a few minutes to discover the format of a game.
Practice makes easy. As I have often said, there are only so many ways to put together a game resource archive. Although I am quite often still surprised at the new things these programmers come up with, the majority of the archives out there will become easy for you once you have had the necessary practice. Look at it as a puzzle, a riddle. There are many easy riddles, but some are tough, even tough beyond unravelling. That's just bad luck then. But remember to check the internet often to see if someone has done the impossible. You could also try to reverse engineer impossible formats by using disassemblers on the executable. You can also look at all the binaries that come with the game to see if there's some trace of the compression method the game uses (e.g. find company markers, look inside DLLs for "zlib" or something similar).
Good stuff from WATTO, and I hope my addition to this swift tutorial will explain things further to you.