Hello!
I made some progress on the file extraction from .dat files. So far my tool works on vox.dat, movie.dat and demo.dat. Other .dat files use a different data structure as far as I know. My aim is to replace the audio files with the ones in another language provided by the PSX version. I don't know yet whether this is feasible, but I think it's worth trying.
Before releasing the tool I want to add some basic features, like being able to only analyze the dat file without extracting every file, or extracting audio files only.
Anyway, I want to document here how the data structure works. I have understood about 90% of the structure. Hopefully someone can help me understand what the few "unknown" values actually represent.
Let's start. These are the first bytes of vox.dat from Disc 1:
Data is stored as a "dictionary", and is split into multiple "pages". Each page has the following structure:
Blue rows represent dictionary "definitions", which tell how many and what kind of data are stored in every dictionary.
Green rows represent data in the dictionary.
Red rows represent zeros used for padding.
Let's see the first dictionary definition in detail. Every definition is composed of 16 bytes organized as follows:
00 00 00 10 = 0x10 (in big endian) is an identifier for "dictionary definition"
00 00 00 10 = 0x10 = 16 bytes is the length of the definition
00 00 00 00 are all zeros in the definition
00 01 00 04 is the defined data type
Data types, in particular, are split into two words (2-byte numbers). The second word (in big endian) is the type of datum stored. So far I have encountered the following five types:
0x01 = audio
0x04 = subtitle
0x02 = unknown
0x06 = unknown
0x07 = unknown
The first word (again, in big endian) represents the language:
0x00 = default
0x01 = english
0x02 = french
0x03 = german
0x04 = italian
0x05 = spanish
0x07 = japanese
So the first definition actually defines a subtitle (0x04) in english (0x01). The second definition, instead, defines a subtitle (0x04) in japanese (0x07).
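A minimal Python sketch of the definition layout described above. The names (parse_definition, DATA_TYPES, LANGUAGES) are made up for illustration, and this is not necessarily how the tool itself is implemented; the table values are the ones listed in this post.

```python
import struct

# Values as listed in this post; anything else is reported as raw hex.
DATA_TYPES = {0x01: "audio", 0x02: "unknown", 0x04: "subtitle",
              0x06: "unknown", 0x07: "unknown"}
LANGUAGES = {0x00: "default", 0x01: "english", 0x02: "french",
             0x03: "german", 0x04: "italian", 0x05: "spanish",
             0x07: "japanese"}

def parse_definition(raw: bytes) -> dict:
    """Decode one 16-byte dictionary definition (all fields big endian)."""
    # identifier, length, zero field, then the two 2-byte words of the type
    ident, length, _zeros, lang, kind = struct.unpack(">IIIHH", raw[:16])
    if ident != 0x10:
        raise ValueError(f"not a dictionary definition: id 0x{ident:x}")
    return {
        "length": length,
        "language": LANGUAGES.get(lang, hex(lang)),
        "type": DATA_TYPES.get(kind, hex(kind)),
    }

# The first definition of vox.dat, as shown above.
first = parse_definition(bytes.fromhex("00000010 00000010 00000000 00010004"))
print(first)
```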
Let's now focus on the green part that represents the data, here called a dictionary entry. Each entry is identified by a 16-byte header. The first entry has
00 01 00 04 = 0x04 (subtitle) and 0x01 (english) identifies the data
00 00 00 40 = 0x40 = 64 bytes is the length of data (with header)
00 00 00 00 (unknown1) are all zeros here
00 00 00 00 (unknown2) are all zeros here
The second entry is very similar. The only difference is in the subtitle language (jap instead of eng).
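The entry header can be sketched in the same way. EntryHeader and parse_entry_header are hypothetical names, and payload_size simply assumes that the stored length includes the 16-byte header, as stated above.

```python
import struct
from typing import NamedTuple

class EntryHeader(NamedTuple):
    language: int   # first word of the type field
    data_type: int  # second word of the type field
    length: int     # length of the entry, header included
    unknown1: int
    unknown2: int

    @property
    def payload_size(self) -> int:
        # Bytes that follow the 16-byte header.
        return self.length - 16

def parse_entry_header(raw: bytes) -> EntryHeader:
    """Decode the 16-byte big-endian header of a dictionary entry."""
    return EntryHeader(*struct.unpack(">HHIII", raw[:16]))

# Header of the first entry shown above (english subtitle, 64 bytes).
hdr = parse_entry_header(bytes.fromhex("00010004 00000040 00000000 00000000"))
print(hdr, hdr.payload_size)
```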
Finally, a padding section is found in the red area:
00 00 00 F0 = 0xF0 identifies the padding
00 00 00 10 = 0x10 = 16 bytes is the length of this header
00 00 01 2C (unknown1) I don't know what this represents. It is not the number of 00 used for padding.
00 00 00 00 (unknown2) are all zeros here
The zeros used for padding end at the file offset 0x800 and mark the end of a dictionary page. Every page in the file is always aligned to 0x800 (so for example, it can end at 0x800, 0x1000, 0x1800 and so on).
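The alignment rule is plain integer arithmetic. A small sketch, with PAGE_SIZE, next_page_boundary and padding_needed as illustrative names only:

```python
PAGE_SIZE = 0x800  # every page ends on a multiple of this offset

def next_page_boundary(offset: int) -> int:
    """Smallest multiple of PAGE_SIZE that is >= offset."""
    return (offset + PAGE_SIZE - 1) // PAGE_SIZE * PAGE_SIZE

def padding_needed(offset: int) -> int:
    """Number of bytes between offset and the end of its page."""
    return next_page_boundary(offset) - offset

print(hex(next_page_boundary(0x801)), padding_needed(0x7F0))
```

Something like this would also be needed when writing a modified page back, to pad it up to the next boundary.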
Since I'm interested in audio files, I want to show how they are stored. In vox.dat, the first audio file appears at the offset 0x26b090:
Audio files are organized as chunks of 8192 bytes. However, before the first chunk one can find the following header:
00 00 00 01 = 0x01 (audio) and 0x00 (default language)
00 00 00 20 = 0x20 = 32 bytes is the length of data (with header)
00 00 00 00 (unknown1) are all zeros here
00 00 00 08 (unknown2) = 0x08
This header is followed by 16 bytes of data, which I'm not able to interpret:
00 00 00 07 = 0x07 (data type unknown)?
00 00 20 00 = 0x2000 = 8192 bytes?
66 66 66 66 (unknown1)
66 66 66 66 (unknown2)
Then, right after this 32-byte chunk, one can find:
00 00 00 01 = 0x01 (audio) and 0x00 (default language)
00 00 20 10 = 0x2010 = 8208 bytes is the length of data (with header) -> 8192 + 16
00 00 00 00 (unknown1) are all zeros here
00 00 20 00 (unknown2) = 0x2000 = 8192 bytes. For audio files, this represents the encapsulated data length.
Finally, the first 8192-byte chunk of audio data occurs.
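Putting the headers together, one page can be walked chunk by chunk. This is only a sketch under the assumptions made in this post: every chunk starts with a 16-byte big-endian header, the length field includes the header, 0xF0 marks the padding that closes the page, and for audio chunks unknown2 is the real data length. Chunk, iter_chunks and payload are hypothetical names, not the tool's own.

```python
import struct
from typing import Iterator, NamedTuple

HEADER = struct.Struct(">HHIII")  # language, type, length, unknown1, unknown2
TYPE_AUDIO = 0x01
TYPE_PADDING = 0xF0

class Chunk(NamedTuple):
    offset: int
    language: int
    data_type: int
    length: int
    unknown1: int
    unknown2: int

def iter_chunks(page: bytes) -> Iterator[Chunk]:
    """Yield every chunk of a page until the padding marker is reached."""
    offset = 0
    while offset + HEADER.size <= len(page):
        lang, kind, length, unk1, unk2 = HEADER.unpack_from(page, offset)
        # Stop at the padding header, or on a length that cannot be valid
        # (this also prevents an endless loop on corrupt data).
        if kind == TYPE_PADDING or length < HEADER.size:
            break
        yield Chunk(offset, lang, kind, length, unk1, unk2)
        offset += length

def payload(page: bytes, chunk: Chunk) -> bytes:
    """Return the data after the header; for audio, cut at unknown2."""
    start = chunk.offset + HEADER.size
    size = chunk.length - HEADER.size
    if chunk.data_type == TYPE_AUDIO:
        size = min(size, chunk.unknown2)  # assumption: unknown2 = real length
    return page[start:start + size]
```

Definitions (0x10) are returned like any other chunk here, since their length field also covers exactly their 16 bytes.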
The analysis on this page reveals the following data:
I: Chunk Type audio (0x0001) Lang def (0x0000) Length 32 Data size 16 Unk 0 Data (no pad) 8 at file offset 0x90
I: Chunk Type audio (0x0001) Lang def (0x0000) Length 8208 Data size 8192 Unk 0 Data (no pad) 8192 at file offset 0xb0
I: Chunk Type unk (0x0006) Lang eng (0x0001) Length 256 Data size 240 Unk 0 Data (no pad) 0 at file offset 0x20c0
I: Chunk Type unk (0x0006) Lang ger (0x0003) Length 256 Data size 240 Unk 0 Data (no pad) 0 at file offset 0x21c0
I: Chunk Type unk (0x0006) Lang fra (0x0002) Length 256 Data size 240 Unk 0 Data (no pad) 0 at file offset 0x22c0
I: Chunk Type unk (0x0006) Lang ita (0x0004) Length 256 Data size 240 Unk 0 Data (no pad) 0 at file offset 0x23c0
I: Chunk Type unk (0x0006) Lang esp (0x0005) Length 256 Data size 240 Unk 0 Data (no pad) 0 at file offset 0x24c0
I: Chunk Type unk (0x0006) Lang jap (0x0007) Length 256 Data size 240 Unk 0 Data (no pad) 0 at file offset 0x25c0
I: Chunk Type unk (0x0007) Lang def (0x0000) Length 624 Data size 608 Unk 1 Data (no pad) 0 at file offset 0x26c0
I: Chunk Type audio (0x0001) Lang def (0x0000) Length 8208 Data size 8192 Unk 1 Data (no pad) 8192 at file offset 0x2930
I: Chunk Type audio (0x0001) Lang def (0x0000) Length 8208 Data size 8192 Unk 2 Data (no pad) 8192 at file offset 0x4940
I: Chunk Type audio (0x0001) Lang def (0x0000) Length 8208 Data size 8192 Unk 3 Data (no pad) 8192 at file offset 0x6950
I: Chunk Type audio (0x0001) Lang def (0x0000) Length 8208 Data size 8192 Unk 4 Data (no pad) 8192 at file offset 0x8960
I: Chunk Type audio (0x0001) Lang def (0x0000) Length 8208 Data size 8192 Unk 5 Data (no pad) 8192 at file offset 0xa970
I: Chunk Type audio (0x0001) Lang def (0x0000) Length 8208 Data size 8192 Unk 6 Data (no pad) 8192 at file offset 0xc980
I: Chunk Type audio (0x0001) Lang def (0x0000) Length 8208 Data size 8192 Unk 576 Data (no pad) 8192 at file offset 0xe990
I: Chunk Type audio (0x0001) Lang def (0x0000) Length 8208 Data size 8192 Unk 1209 Data (no pad) 8192 at file offset 0x109a0
I: Chunk Type audio (0x0001) Lang def (0x0000) Length 8208 Data size 8192 Unk 1588 Data (no pad) 8192 at file offset 0x129b0
I: Chunk Type audio (0x0001) Lang def (0x0000) Length 8208 Data size 8192 Unk 2081 Data (no pad) 8192 at file offset 0x149c0
I: Chunk Type audio (0x0001) Lang def (0x0000) Length 8208 Data size 8192 Unk 2518 Data (no pad) 8192 at file offset 0x169d0
I: Chunk Type audio (0x0001) Lang def (0x0000) Length 8208 Data size 8192 Unk 3022 Data (no pad) 8192 at file offset 0x189e0
I: Chunk Type audio (0x0001) Lang def (0x0000) Length 8208 Data size 8192 Unk 3650 Data (no pad) 8192 at file offset 0x1a9f0
I: Chunk Type audio (0x0001) Lang def (0x0000) Length 8208 Data size 8192 Unk 4028 Data (no pad) 8192 at file offset 0x1ca00
I: Chunk Type audio (0x0001) Lang def (0x0000) Length 8208 Data size 8192 Unk 4611 Data (no pad) 8192 at file offset 0x1ea10
I: Chunk Type audio (0x0001) Lang def (0x0000) Length 8208 Data size 8192 Unk 5328 Data (no pad) 8192 at file offset 0x20a20
I: Chunk Type audio (0x0001) Lang def (0x0000) Length 8208 Data size 8192 Unk 5749 Data (no pad) 8192 at file offset 0x22a30
I: Chunk Type audio (0x0001) Lang def (0x0000) Length 8208 Data size 8192 Unk 6229 Data (no pad) 8192 at file offset 0x24a40
I: Chunk Type audio (0x0001) Lang def (0x0000) Length 8208 Data size 8192 Unk 6699 Data (no pad) 8192 at file offset 0x26a50
I: Chunk Type audio (0x0001) Lang def (0x0000) Length 8208 Data size 8192 Unk 7270 Data (no pad) 8192 at file offset 0x28a60
I: Chunk Type audio (0x0001) Lang def (0x0000) Length 8208 Data size 8192 Unk 7765 Data (no pad) 8192 at file offset 0x2aa70
I: Chunk Type audio (0x0001) Lang def (0x0000) Length 8208 Data size 8192 Unk 7995 Data (no pad) 8192 at file offset 0x2ca80
I: Chunk Type audio (0x0001) Lang def (0x0000) Length 8208 Data size 8192 Unk 8369 Data (no pad) 8192 at file offset 0x2ea90
I: Chunk Type audio (0x0001) Lang def (0x0000) Length 8208 Data size 8192 Unk 8925 Data (no pad) 8192 at file offset 0x30aa0
I: Chunk Type audio (0x0001) Lang def (0x0000) Length 8208 Data size 8192 Unk 9441 Data (no pad) 8192 at file offset 0x32ab0
I: Chunk Type audio (0x0001) Lang def (0x0000) Length 8208 Data size 8192 Unk 10032 Data (no pad) 8192 at file offset 0x34ac0
I: Chunk Type audio (0x0001) Lang def (0x0000) Length 8208 Data size 8192 Unk 10554 Data (no pad) 8192 at file offset 0x36ad0
I: Chunk Type audio (0x0001) Lang def (0x0000) Length 8208 Data size 8192 Unk 10977 Data (no pad) 8192 at file offset 0x38ae0
I: Chunk Type audio (0x0001) Lang def (0x0000) Length 2448 Data size 2432 Unk 11392 Data (no pad) 2419 at file offset 0x3aaf0
where "Unk" is actually "unknown1". As you can see, "unknown1" increases by 1 up to 6, then it increases by seemingly random amounts that do not appear to be related to the data size. The same behaviour can be observed with audio files in the other pages. Hopefully someone can help me understand the meaning of this value in particular, since it may be important for replacing the audio files inside the .dat.
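To look at the increments more easily, the analysis output can be fed to a few lines of Python that keep only the audio chunks and print the difference between consecutive "Unk" values. This is just a helper sketch: the function names are made up and the regular expression assumes the exact log format shown above.

```python
import re

def audio_unknown1(log_text):
    """Collect the 'Unk' value of every audio line in the analysis output."""
    values = []
    for line in log_text.splitlines():
        match = re.search(r"Type audio .* Unk (\d+) ", line)
        if match:
            values.append(int(match.group(1)))
    return values

def deltas(values):
    """Difference between each value and the one before it."""
    return [b - a for a, b in zip(values, values[1:])]

sample = """\
I: Chunk Type unk (0x0007) Lang def (0x0000) Length 624 Data size 608 Unk 1 Data (no pad) 0 at file offset 0x26c0
I: Chunk Type audio (0x0001) Lang def (0x0000) Length 8208 Data size 8192 Unk 6 Data (no pad) 8192 at file offset 0xc980
I: Chunk Type audio (0x0001) Lang def (0x0000) Length 8208 Data size 8192 Unk 576 Data (no pad) 8192 at file offset 0xe990
I: Chunk Type audio (0x0001) Lang def (0x0000) Length 8208 Data size 8192 Unk 1209 Data (no pad) 8192 at file offset 0x109a0
"""
print(deltas(audio_unknown1(sample)))
```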
Thank you for the attention (if you read up to this point).
Big thanks to the authors of "demux_dat", which was very helpful at the beginning of my research.