Encodings
Normally you see text like this, in plain ASCII. But sometimes you want to represent some special characters that have special functions or can't be seen. This is where you could use various encodings to represent the bytes in a different way.
ASCII is a simple set of 128 characters, including many common ones we recognize, like letters, numbers, and some special characters. In the 0_ and 1_ columns you can also find some non-printable characters. These characters cannot be seen normally, but have a special meaning. Take 0a, for example: it is represented in the table as LF, which stands for Line Feed. This is the newline character produced when you press Enter while writing text.
You might notice that the "most significant nibble" only goes up to 7. This is because ASCII only has 128 characters, instead of the 256 possible bytes. This means there are 128 more bytes that are not in ASCII but can still exist.
Many systems nowadays understand Unicode, a superset of ASCII, and quite a big one at that. There are over 100,000 different symbols defined in the standard, with new ones added regularly. Among all these characters, some have special properties when changing case or normalizing.
The site below has a searchable table of all known Unicode transformations for Uppercase, Lowercase, Normalize NFC, and Normalize NFKC. These can be useful for bypassing filters:
Because these symbols take up more than 1 byte, they can also be useful for overflowing data. In high-level programming languages, a length check on a string often returns the number of characters instead of the number of bytes. By inserting emoji, for example, the reported length can stay very small while the number of bytes is much larger:
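A minimal Python sketch of that mismatch, assuming the text is stored as UTF-8 (the variable names are purely illustrative):

```python
# A short string that ends in two emoji (U+1F600), each a single character.
text = "hi\U0001F600\U0001F600"

char_count = len(text)                  # Python counts Unicode code points
byte_count = len(text.encode("utf-8"))  # bytes needed once encoded as UTF-8

print(char_count)  # 4
print(byte_count)  # 10
```

A check that only looks at the character count would accept this value as 4 long, while a buffer sized in bytes needs room for 10.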
As you can see in the table above, numbers are sometimes written with the characters A-F included. This is known as hexadecimal, or just "hex", because it allows for 16 values per digit (0-9 and A-F). A common way to indicate that a number is in hex format is to add 0x in front of it, like 0x2a.
Compared to decimal, every digit gains 6 more possible values, meaning it can store more information in fewer digits. The nice thing about hex is that 2 digits can store 16x16=256 values, exactly the number of possible byte values (2^8=256). This makes it really useful for representing bytes, and that is what it is often used for.
As we saw with ASCII, all characters can be assigned a number. We can convert this number to hex to get the hexadecimal representation of the character and keep doing this for all the characters. Eventually, we end up with a big string of hex characters that represent the original string:
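As a small sketch of this process in Python, assuming the example string "Hello" (chosen here for illustration):

```python
text = "Hello"

# Look up the number of every character and write it as two hex digits.
hex_digits = "".join(f"{ord(ch):02x}" for ch in text)
print(hex_digits)  # 48656c6c6f

# The built-in bytes methods give the same result and also convert back.
assert text.encode("ascii").hex() == hex_digits
print(bytes.fromhex(hex_digits))  # b'Hello'
```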
You can use Python or CyberChef to easily convert to hex.
Base64 is another very common way to represent data just like hex. Hex is not very efficient as it always takes up 2 digits per byte. Base64 is better at this by having 64 possible characters to represent any bytes. It is a very useful encoding for representing non-printable characters as printable characters, and works as follows:
First, start by converting your desired bytes to binary (1's and 0's):
Then we take this big stream of binary and split it into chunks of 6 bits:
You may notice that at the very end there are only two bits left, 01, not a full 6 bits. We'll just fill these up with 0's.
Then finally we use the Base64 alphabet to convert these 6-bit values back to printable characters. As a last step, add = characters until the length of the string is a multiple of 4:
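The steps above can be sketched by hand in Python. The input "Hello, world!" is an assumption picked for illustration (it happens to leave the two bits 01 at the end), and base64_encode_by_hand is a hypothetical helper name. The result is compared against the standard library to check the sketch.

```python
import base64
import string

# Standard Base64 alphabet: A-Z, a-z, 0-9, '+' and '/'.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def base64_encode_by_hand(data: bytes) -> str:
    # Step 1: write every byte as 8 binary digits.
    bits = "".join(f"{byte:08b}" for byte in data)
    # Step 2: split into 6-bit chunks, filling the last chunk with 0's.
    chunks = [bits[i:i + 6].ljust(6, "0") for i in range(0, len(bits), 6)]
    # Step 3: map each 6-bit value to a character of the alphabet.
    encoded = "".join(ALPHABET[int(chunk, 2)] for chunk in chunks)
    # Step 4: pad with '=' until the length is a multiple of 4.
    return encoded + "=" * (-len(encoded) % 4)

result = base64_encode_by_hand(b"Hello, world!")
print(result)  # SGVsbG8sIHdvcmxkIQ==
assert result == base64.b64encode(b"Hello, world!").decode()
```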
This resulting string is our Base64 string. It will always only contain characters from the Base64 alphabet which makes it easy for systems because they won't have to deal with unexpected characters.
Decoding the string works in the same way but in reverse. First, you would convert the Base64 string back to the binary stream using the base64 alphabet, and then take chunks of 8 from it to get back the original bytes.
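The reverse direction can be sketched the same way. This is a simplified illustration that assumes well-formed input using the standard alphabet; base64_decode_by_hand is a hypothetical helper name.

```python
import string

# Standard Base64 alphabet: A-Z, a-z, 0-9, '+' and '/'.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def base64_decode_by_hand(encoded: str) -> bytes:
    # Drop the '=' padding; it carries no data.
    stripped = encoded.rstrip("=")
    # Turn every character back into its 6-bit value.
    bits = "".join(f"{ALPHABET.index(ch):06b}" for ch in stripped)
    # Read full chunks of 8 bits; leftover filler bits are discarded.
    usable = len(bits) - len(bits) % 8
    return bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))

print(base64_decode_by_hand("SGVsbG8sIHdvcmxkIQ=="))  # b'Hello, world!'
```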
You can use Python or CyberChef to easily convert to Base64.
Note: Sometimes flags or other secrets are 'hidden' in Base64. You can search through files for a specific string using Grep, but you can also search in Base64 by first encoding your search string in Base64, and then taking off any = padding and the last character (because it can change depending on the data that follows). This will allow you to search for a string in Base64, and you may find encoded flags (see this writeup).
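A rough sketch of building such a search term in Python. The prefix "flag{" and the helper name base64_search_term are assumptions made for illustration.

```python
import base64

def base64_search_term(plain: str) -> str:
    encoded = base64.b64encode(plain.encode()).decode()
    # Remove the padding, then drop the last character, which depends on
    # whatever bytes follow the search string in the real data.
    return encoded.rstrip("=")[:-1]

term = base64_search_term("flag{")
print(term)  # ZmxhZ3

# The term is a prefix of any Base64 string whose data starts with "flag{".
haystack = base64.b64encode(b"flag{example}").decode()
assert haystack.startswith(term)
```

Note that this only matches when the searched string starts at the beginning of the encoded data, or at an offset that is a multiple of 3 bytes. At other offsets the bits line up differently and the Base64 characters change.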
Base64 is by far the most common format, but there are a few more similar base encodings. One example is Base32, which works the same as Base64 but uses chunks of 5 bits per output character, so it takes up more space but has a more limited character set. This is useful for case-insensitive systems, such as DNS domain names.
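To get a feel for the size difference, here is a small comparison using Python's standard base64 module; the input bytes are just an example.

```python
import base64

data = b"Hello, world!"

b64 = base64.b64encode(data).decode()
b32 = base64.b32encode(data).decode()

print(len(b64), b64)  # 20 SGVsbG8sIHdvcmxkIQ==
print(len(b32), b32)  # 24 characters, using only A-Z, 2-7 and '=' padding

# Base32 has no lowercase letters, so changing case loses no information.
assert base64.b32decode(b32.lower(), casefold=True) == data
```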
Another variant is Base58, which has a slightly smaller alphabet than Base64 because it removes easily misread characters like I, l, O and 0. It is mostly used in cryptocurrency addresses like Bitcoin, but it can look very similar to Base64.
These other bases are also found in Python and CyberChef recipes, often with an option to specify a custom alphabet yourself because they are not always standardized.
Strings are always stored as bytes on a computer. Let's take the "Hello" string for example:
Integers are also just stored as bytes on a computer:
But what if we read a string as bytes from memory like it was an integer? We would get the Big Endian integer representation.
This type of encoding is pretty common when working with mathematical cryptosystems like RSA, because they work with numbers instead of strings. That way the math works the same and you can just convert it back to a string at the end.
You can use hex functions or the PyCryptodome library in Python to easily convert to this integer notation.
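A minimal sketch of this conversion using only built-in Python methods, with the "Hello" string from above. PyCryptodome's bytes_to_long and long_to_bytes in Crypto.Util.number perform the same conversion, but they are not needed for this example.

```python
text = b"Hello"

# Read the bytes as one big-endian number (most significant byte first).
number = int.from_bytes(text, "big")
print(number)       # 310939249775
print(hex(number))  # 0x48656c6c6f

# Convert back: work out how many bytes the number needs, then unpack it.
length = (number.bit_length() + 7) // 8
print(number.to_bytes(length, "big"))  # b'Hello'
```

The hex form of the number is simply the hex encoding of the string, which is why hex functions work for this conversion too.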
Python has lots of functions to easily convert to and between these different representations. One thing you need to keep in mind when working with Python is the difference between strings and bytestrings. They will come up very often and it's important to understand what they both allow you to do.
Normal strings are defined just by using " or ' quotes. A string is a series of characters used to store normal text data. But everything in a computer is stored in bytes, so these characters need to be encoded to bytes before they are stored.
This is where bytestrings come in. They are the encoded variant of a string and are the exact bytes the string is made of. This means you have much more direct control over the bytes. Bytestrings are defined just like a normal string with quotes, but with a b prepended to the quotes.
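A short illustration of the two types side by side, assuming UTF-8 as the encoding and an emoji (U+1F600) picked as an example:

```python
text = "Hi \U0001F600"            # a normal string ending in an emoji
raw = b"Hi \xf0\x9f\x98\x80"      # the same text as a bytestring (UTF-8)

print(type(text), len(text))  # <class 'str'> 4
print(type(raw), len(raw))    # <class 'bytes'> 7
print(text.encode("utf-8"))   # b'Hi \xf0\x9f\x98\x80'
```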
In the example above you can see that the unprintable bytes that make up the emoji are shown as \x[hex]. This notation stands for one byte, written as two hexadecimal digits.
To convert between strings and bytestrings you can use the .encode() and .decode() methods on the objects. And since bytestrings are basically just a list of integers from 0-255 under the hood, you can even use the bytes() function to convert a list of integers directly to a bytestring.
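These conversions might look as follows; the sample text and the UTF-8 encoding are assumptions for the example.

```python
text = "caf\u00e9"  # "café"; the accented letter is not part of ASCII

encoded = text.encode("utf-8")  # str -> bytes
print(encoded)                  # b'caf\xc3\xa9'
print(encoded.decode("utf-8"))  # bytes -> str, prints café

# Converting a bytestring to a list shows its integers from 0 to 255.
print(list(encoded))            # [99, 97, 102, 195, 169]

# bytes() builds a bytestring from such a list of integers.
print(bytes([72, 101, 108, 108, 111]))  # b'Hello'
```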
Some useful functions not mentioned above for various things:
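A few built-in helpers that tend to come up when converting between representations, given here as an illustrative selection rather than a complete list:

```python
print(ord("A"))               # 65, character -> number
print(chr(65))                # A, number -> character
print(hex(42))                # 0x2a, number -> hex string
print(int("2a", 16))          # 42, hex string -> number
print(bin(42))                # 0b101010, number -> binary string
print(bytes.fromhex("4869"))  # b'Hi', hex string -> bytes
print(b"Hi".hex())            # 4869, bytes -> hex string
```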
There are a lot of encodings out there, which are very useful for machines but often not very readable for humans. There are a few tricks, though, to quickly recognize the telltale signs of certain encodings.
The first thing to know is that the English letters go from 65 to 122 in ASCII. The lowercase letters start at 97 and are the most common, so if you see a list of decimal numbers around 97-122 you can be pretty sure that it is just the ASCII integer representation, and you can decode it from decimal: CyberChef
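Decoding such a list is also a one-liner in Python. The numbers below are a made-up sample, assumed to be separated by spaces.

```python
encoded = "102 108 97 103"  # hypothetical list of decimal ASCII values

# Turn every decimal number back into its character.
decoded = "".join(chr(int(value)) for value in encoded.split())
print(decoded)  # flag
```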
Next, the hexadecimal encoding is very similar. It's just the ASCII values converted to hex, where the letters go from 0x41 to 0x7a. The lowercase letters start from 0x61, so a list of values from 0x61 to 0x7a is likely to be hex-encoded text. Hex characters are often written without any spacing because every byte always takes up exactly 2 digits, so a lot of 6's and 7's is what you should look for. Then you can of course decode from hex: CyberChef
Finally, Base64 has a few indicators. The first and most obvious is the = signs at the end, which are the padding that Base64 often needs. Almost no other encoding does this, so it's a clear sign of some base encoding. Apart from that, text in the Base64 character set often looks like random characters. But because Base64 converts the input in fixed groups of 3 bytes, the same start of a string always encodes the same way, so we can recognize the start of a string like {" for JSON, which will look like eyJ or eyI. Then you know there is a JSON value when you decode it: CyberChef
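A quick way to see where these prefixes come from, sketched in Python. The JSON samples and the helper name looks_like_base64_json are made up for illustration.

```python
import base64

def looks_like_base64_json(candidate: str) -> bool:
    # '{"' always encodes to "ey" plus a third character that depends on
    # the next byte: "J" for letters, "I" for digits, spaces and quotes.
    return candidate.startswith(("eyJ", "eyI"))

token = base64.b64encode(b'{"user": "admin"}').decode()
print(token[:3])                      # eyJ
print(looks_like_base64_json(token))  # True

# A key that starts with a digit produces the other prefix.
print(base64.b64encode(b'{"1": 2}').decode()[:3])  # eyI
```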