File Formats

What to do with a file you don't understand

Understanding common file formats

If you want to understand how a file format works, start with its documentation online. Many formats are binary rather than ASCII-readable, so you'll want to use a hex editor, such as xxd, hexedit or hexyl.
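To make it concrete what such a hex editor shows, here is a minimal Python sketch of an xxd-style dump. The function name `hexdump` and the exact column layout are choices made for this illustration and do not match any of the tools above byte for byte.

```python
def hexdump(data: bytes, width: int = 16) -> str:
    """Render bytes as an offset, hex columns and a printable-ASCII column."""
    lines = []
    for offset in range(0, len(data), width):
        row = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in row)
        # Non-printable bytes are shown as '.', as hex editors usually do.
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in row)
        lines.append(f"{offset:08x}  {hex_part:<{width * 3 - 1}}  {text}")
    return "\n".join(lines)


if __name__ == "__main__":
    # The first bytes of a PNG file: the signature and the start of IHDR.
    print(hexdump(b"\x89PNG\r\n\x1a\n" + b"\x00\x00\x00\rIHDR"))
```

Magic bytes such as the `PNG` string in the first row are usually the quickest hint about which format you are looking at.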

A big collection of file formats made by Ange Albertini is the following (just scroll through until you find your format):

A big collection of drawings of file formats to understand them quickly

Another tool, which can automatically decode a file and show you its raw fields, is fq:

Example fq output showing the chunks of a PNG

CRCs: Cyclic Redundancy Checks

File formats often use a Cyclic Redundancy Check (CRC) to detect whether the bytes have been altered or slightly corrupted. Think of it as a checksum that combines all bytes into a small extra set of bytes that is different if you change even a single bit. CRCs are not as strong as cryptographic hash algorithms and only output a few bytes. Preventing collisions is not their purpose; they are meant purely to detect accidental changes.
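As a small illustration of the single-bit property, the following Python sketch uses the standard library's `zlib.crc32` (which implements the CRC-32 variant also used by PNG and ZIP). The record contents are made up for the example.

```python
import zlib

# Hypothetical record contents, chosen only for this demonstration.
data = bytearray(b"REC\x00 some record payload")
original = zlib.crc32(data)

# Flip a single bit in the first byte and recompute the checksum.
tampered = bytearray(data)
tampered[0] ^= 0x01
changed = zlib.crc32(tampered)

print(f"original: {original:08x}")
print(f"one bit flipped: {changed:08x}")
print("checksums differ:", original != changed)
```

Note that anyone who changes the data on purpose can simply recompute the CRC afterwards, which is why it only protects against accidental changes.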

Because there are many different types of CRCs, a site like the following makes it easy to compare your data and output to reverse engineer exactly what algorithm was used. Then you can use this knowledge to create a correct checksum for any arbitrary data:

Quickly view lots of well-known CRCs of different sizes based on your input

CRC RevEng is another tool, built for calculating the CRC parameters from enough samples, so the algorithm does not have to be a well-known one.

Compilation

Download and extract the source code, then run make. If you run into the following error, do as it says and change the BMP_BIT and BMP_SUB values inside config.h:

Then you can install the tool using sudo ln -s "$(pwd)"/reveng /usr/bin/reveng.

Usage

Given a word length (often 8, 16 or 32), this tool can algorithmically find the parameters of a CRC. You need to provide hex strings that each end with their CRC. Such records can often be recognized as templated data (e.g. lots of nulls or similar data) followed by 1, 2 or 4 random-looking bytes, which are the CRC. Take the following example:

It looks like the last 4 bytes of each row are pretty random. To crack the exact algorithm used, we simply provide the rows to reveng as hex strings, along with the 32-bit width we guessed:

52 45 43 00 02 00 00 00 04 00 00 00 05 00 00 00 D9 D1 49 38  REC...............I8

It found all parameters, and the preset name "CRC-32/ISO-HDLC". This is a well-known variant. Next, we can predict the CRC for any sequence of data by specifying a preset:

This correctly computes the CRC for the first string! If the tool did not find a named preset, you can still give it the raw parameters to achieve the same result:
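Separately from reveng itself, it can help to see what those raw parameters mean. The following is a minimal bit-by-bit Python sketch of a parameterized CRC; the function names `reflect` and `crc` are made up for this example, and it assumes a width of at least 8 bits. The CRC-32/ISO-HDLC parameter values used at the bottom are the published ones for that preset.

```python
import zlib


def reflect(value: int, width: int) -> int:
    """Reverse the lowest `width` bits of `value`."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def crc(data: bytes, width: int, poly: int, init: int,
        refin: bool, refout: bool, xorout: int) -> int:
    """Bit-by-bit CRC for a given parameter set (assumes width >= 8)."""
    top_bit = 1 << (width - 1)
    mask = (1 << width) - 1
    reg = init
    for byte in data:
        if refin:
            byte = reflect(byte, 8)      # feed each input byte LSB-first
        reg ^= byte << (width - 8)       # mix the byte into the top of the register
        for _ in range(8):
            if reg & top_bit:
                reg = ((reg << 1) ^ poly) & mask
            else:
                reg = (reg << 1) & mask
    if refout:
        reg = reflect(reg, width)        # reflect the final register
    return reg ^ xorout


# Parameters of the CRC-32/ISO-HDLC preset.
value = crc(b"123456789", width=32, poly=0x04C11DB7, init=0xFFFFFFFF,
            refin=True, refout=True, xorout=0xFFFFFFFF)
print(f"{value:08x}")                      # the preset's "check" value
print(value == zlib.crc32(b"123456789"))   # zlib implements the same variant
```

Once the parameters are known, a function like this lets you produce a valid CRC for any data you craft yourself.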

Binwalk

Sometimes data is hidden inside another file by simply pasting it into the host file, meaning the bytes of the secret file sit somewhere inside the other file. Using binwalk you can check a file for known file signatures to see if it embeds something. Using the following command you can also recursively extract all of these into a .extracted folder:
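To illustrate the idea behind a signature scan (this is not binwalk's implementation, and its signature database is far larger), here is a Python sketch that searches a blob for a handful of well-known magic bytes. The `SIGNATURES` table and the `scan_signatures` name are assumptions made for this example.

```python
# A few well-known magic byte sequences; real tools know many more.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"PK\x03\x04": "ZIP local file header",
    b"\xff\xd8\xff": "JPEG image",
    b"%PDF-": "PDF document",
    b"\x1f\x8b\x08": "gzip data",
}


def scan_signatures(data: bytes) -> list[tuple[int, str]]:
    """Return (offset, description) for every signature found in `data`."""
    hits = []
    for magic, name in SIGNATURES.items():
        start = 0
        while (pos := data.find(magic, start)) != -1:
            hits.append((pos, name))
            start = pos + 1
    return sorted(hits)


if __name__ == "__main__":
    # Hypothetical host file: a PNG signature with a ZIP pasted in later.
    blob = b"\x89PNG\r\n\x1a\n" + b"junk" + b"PK\x03\x04rest of archive"
    for offset, name in scan_signatures(blob):
        print(f"{offset:#x}: {name}")
```

Short signatures produce false positives in random-looking data, so a hit is a lead to investigate rather than proof of an embedded file.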


You can also use binwalk to understand an unknown file better, for example by looking at its entropy. Entropy measures how random a sequence of bytes is. Simple ASCII text is fairly predictable and stays within a small range of byte values, so its entropy is low. For completely random or encrypted bytes the entropy is very high, close to 1. You can get a graph of the entropy of the file using the -E flag:

An example of a firmware image showing various amounts of entropy (source)

This can give a good idea about what parts of a file you could look at.
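The measurement behind such a graph can be sketched in a few lines of Python. This example computes Shannon entropy per block and scales it to the 0..1 range described above; the function names and the 1024-byte block size are assumptions for the illustration, not binwalk's exact settings.

```python
import math
import os
from collections import Counter


def entropy(block: bytes) -> float:
    """Shannon entropy of `block`, scaled so 1.0 means 8 bits per byte."""
    if not block:
        return 0.0
    total = len(block)
    counts = Counter(block)
    bits = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return bits / 8


def entropy_profile(data: bytes, block_size: int = 1024):
    """Return (offset, entropy) for each consecutive block of `data`."""
    return [(off, entropy(data[off:off + block_size]))
            for off in range(0, len(data), block_size)]


if __name__ == "__main__":
    # Hypothetical file: readable text followed by random (encrypted-looking) bytes.
    text = b"Simple ASCII text is pretty predictable. " * 50
    data = text[:2048] + os.urandom(2048)
    for offset, value in entropy_profile(data):
        print(f"{offset:#06x}  {value:.3f}")
```

A sudden jump in the per-block value marks the boundary where compressed or encrypted data begins.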

PNG

Image files like PNG can have a lot of hidden info. It's a relatively complex file format with a lot of room for secrets.

A quick way to check whether a file is a completely valid PNG is pngcheck:

PNG files consist of chunks of bytes that each describe something about the image. Every chunk holds a 4-byte length, a 4-byte type, the data, and a 4-byte CRC checksum. The most common one is IDAT, which contains the pixel data of the image. An image always ends with an IEND chunk, so the file ends with IEND followed by its 4 checksum bytes.

You might see custom chunks being used to embed data, or data appended to the end, after IEND.
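Both cases can be spotted by walking the chunk list yourself. The Python sketch below parses the chunks, verifies each CRC (a CRC-32 over the chunk type and data) and returns anything left after IEND. The sample image, the private chunk name `seCr` and the helper names are made up for this example.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(ctype: bytes, body: bytes) -> bytes:
    """Build one chunk: length, type, data, CRC-32 over type + data."""
    crc = zlib.crc32(ctype + body)
    return struct.pack(">I", len(body)) + ctype + body + struct.pack(">I", crc)


def parse_png(data: bytes):
    """Return (chunks, trailing); chunks holds (type, length, crc_ok)."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("missing PNG signature")
    pos = len(PNG_SIGNATURE)
    chunks = []
    while pos + 12 <= len(data):
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        end = pos + 12 + length
        if end > len(data):
            raise ValueError("truncated chunk")
        body = data[pos + 8:end - 4]
        (stored,) = struct.unpack(">I", data[end - 4:end])
        crc_ok = stored == zlib.crc32(ctype + body)
        chunks.append((ctype.decode("latin-1"), length, crc_ok))
        pos = end
        if ctype == b"IEND":
            break  # anything after this point is appended data
    return chunks, data[pos:]


# Hypothetical 1x1 grayscale image with a made-up private chunk "seCr".
ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
sample = (
    PNG_SIGNATURE
    + make_chunk(b"IHDR", ihdr)
    + make_chunk(b"seCr", b"custom chunk data")
    + make_chunk(b"IDAT", zlib.compress(b"\x00\x00"))
    + make_chunk(b"IEND", b"")
    + b"appended after IEND"
)

chunks, trailing = parse_png(sample)
for name, length, crc_ok in chunks:
    print(name, length, "ok" if crc_ok else "BAD CRC")
print("trailing:", trailing)
```

Unknown chunk types and a non-empty `trailing` value are the two things worth inspecting more closely.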

Embed Raw Data (Polyglots)

You might find applications that let you upload files, where you can either give the file a .php extension to create a web shell or set the Content-Type to text/html so that tags inside the raw bytes are rendered, leading to Cross-Site Scripting (XSS). In either case, the application might validate or even transform your image in a way that does not preserve all the original bytes, breaking your payload.

While you might be able to include metadata with tools like exiftool, these might be stripped by the server upon saving your file:

Another trick is simply appending data to the end of the file. This would not pass as a valid PNG anymore, but could survive on the server:

Lastly, there is a technique that is more resistant to transformation, using the IDAT chunks. These normally contain DEFLATE-compressed data representing the pixels themselves, but this process can be reversed to obtain a sequence of pixels that compresses into a payload like:

If the payload above is executed, you can provide a function you want to call like system() as the query parameter 0, and an argument you want to give the function in a 1 body parameter.

The process of creating these, along with a few example payloads, is described in the following post, which also shows an XSS payload based on the same idea:

Archives (ZIP, TAR, 7z, etc.)

See the Archives page.
