MIME Base64 Encoding/Decoding
Email messages, their headers and their attachments are frequently subjected to a form of encoding that converts all data into a form that can be represented on screen or paper using a sequence of printable characters. This is known as MIME encoding. A detailed discussion of such issues can be found here. A commonly used technique for accomplishing this is known as Base64 encoding. In Base64 the bits of the original data stream are mapped onto the alphanumeric characters, i.e.
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789
and the two characters + and /. Taking six bits at a time from the original data gives a number in the range 0..63. This number is treated as the index of a character in the reduced character set table shown below:
| Index | Character | Index | Character | Index | Character | Index | Character |
|---|---|---|---|---|---|---|---|
| 0 | A | 16 | Q | 32 | g | 48 | w |
| 1 | B | 17 | R | 33 | h | 49 | x |
| 2 | C | 18 | S | 34 | i | 50 | y |
| 3 | D | 19 | T | 35 | j | 51 | z |
| 4 | E | 20 | U | 36 | k | 52 | 0 |
| 5 | F | 21 | V | 37 | l | 53 | 1 |
| 6 | G | 22 | W | 38 | m | 54 | 2 |
| 7 | H | 23 | X | 39 | n | 55 | 3 |
| 8 | I | 24 | Y | 40 | o | 56 | 4 |
| 9 | J | 25 | Z | 41 | p | 57 | 5 |
| 10 | K | 26 | a | 42 | q | 58 | 6 |
| 11 | L | 27 | b | 43 | r | 59 | 7 |
| 12 | M | 28 | c | 44 | s | 60 | 8 |
| 13 | N | 29 | d | 45 | t | 61 | 9 |
| 14 | O | 30 | e | 46 | u | 62 | + |
| 15 | P | 31 | f | 47 | v | 63 | / |
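
The table lookup can be expressed in a few lines of code. The following is a minimal Python sketch rather than the Delphi source that accompanies this article; the names `ALPHABET`, `index_to_char` and `char_to_index` are our own, chosen purely for illustration.

```python
import string

# The 64-character Base64 alphabet in index order: A-Z, a-z, 0-9, '+', '/'.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def index_to_char(index: int) -> str:
    """Return the Base64 character for a six-bit value in the range 0..63."""
    if not 0 <= index <= 63:
        raise ValueError("index must be in the range 0..63")
    return ALPHABET[index]


def char_to_index(char: str) -> int:
    """Return the six-bit value for a Base64 character (the inverse lookup)."""
    return ALPHABET.index(char)


print(index_to_char(19))   # T
print(char_to_index("l"))  # 37
```

Storing the alphabet as a single string keeps the index of each character identical to its position in the table above.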
The operation of Base64 encoding is best understood by studying a simple example. Consider the encoding of the word MIME, illustrated below:
| Character | M | I | M | E | #0 | #0 |
|---|---|---|---|---|---|---|
| Hex Code | 4D | 49 | 4D | 45 | 00 | 00 |
| Binary Code | 01001101 | 01001001 | 01001101 | 01000101 | 00000000 | 00000000 |

| MIME Code | 010011 | 010100 | 100101 | 001101 | 010001 | 010000 | 000000 | 000000 |
|---|---|---|---|---|---|---|---|---|
| MIME Value | 19 | 20 | 37 | 13 | 17 | 16 | 0 | 0 |
| MIME Char | T | U | l | N | R | Q | A | A |
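
The rows of this example can be reproduced with a short script. This is an illustrative Python sketch, not the author's Delphi code; the helper `six_bit_groups` is a hypothetical name, and working on a string of binary digits is chosen for readability rather than speed.

```python
ALPHABET = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "abcdefghijklmnopqrstuvwxyz"
            "0123456789+/")


def six_bit_groups(data: bytes) -> list[int]:
    """Split data (length must be a multiple of 3) into six-bit values."""
    if len(data) % 3 != 0:
        raise ValueError("length must be a multiple of 3")
    # Write every byte as eight binary digits, most significant bit first.
    bits = "".join(f"{byte:08b}" for byte in data)
    # Cut the bit string into six-bit slices and read each slice as a number.
    return [int(bits[i:i + 6], 2) for i in range(0, len(bits), 6)]


padded = b"MIME" + b"\x00\x00"  # four data bytes plus two null padding bytes
values = six_bit_groups(padded)
chars = "".join(ALPHABET[v] for v in values)

print(values)  # [19, 20, 37, 13, 17, 16, 0, 0]
print(chars)   # TUlNRQAA
```

The output still ends in the two A characters produced by the padding bytes; how these are handled is covered in the points that follow.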
While the technique is largely self-explanatory, there are several points that require emphasis:
- For the technique to work, the length of the data must be a multiple of 3 bytes. We accomplish this by padding the data with null characters; the four-character example needs two.
- Next we proceed to take six bits at a time from the binary representation of the string, starting with the six most significant bits of the first character. This results in a slight complication when working with Delphi: when several bytes are read into a single integer, the little endian format used by Intel processors places the first byte in the least significant position. Consequently, some extra work has to be done to get at the "right" bits.
- This leaves two unused bits - the least significant ones - which we "bank" for use with the subsequent byte. When they are used they become the most significant bits in the new MIME code that is generated.
- In practice it is convenient to deal with three bytes of original data at a time. The 24 constituent bits of these bytes result in 4 MIME code characters each 6 bits long.
- The padding bytes appended to the data before encoding will result in extra A characters being appended to the MIME code string. We cannot simply leave them in the encoded string since it would be impossible to distinguish between A characters resulting from genuine zero bytes in the original data and A characters generated by padding bytes. We get round this problem by replacing them with an equivalent number of = characters. Thus, in the example the final MIME Base64 encoded string would be TUlNRQ==.
- It may not be immediately obvious why the number of false A characters is the same as the number of padding bytes added. Think of it this way:
  - Data with a length of 3n + 1 bytes requires two padding bytes.
  - The last triplet of bytes in the padded data stream consists of one real byte and two padding bytes.
  - Encoding the real byte generates two characters: one for its six most significant bits and the second for the remaining 2 bits combined with the 4 most significant bits "borrowed" from the first null padding byte that follows.
  - The remaining 12 bits are all from the padding bytes and are all 0s. They generate two A characters.
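
The points above can be pulled together into a small encoder. What follows is a minimal Python sketch and not the Delphi implementation offered with this article; `encode_base64` is a name of our own. It handles three bytes at a time and passes an explicit "big" byte order to `int.from_bytes`, which is this sketch's counterpart to the extra work needed on a little endian machine. Python's standard `base64` module is imported only to cross-check the result.

```python
import base64  # standard library, used here only to cross-check the sketch

ALPHABET = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "abcdefghijklmnopqrstuvwxyz"
            "0123456789+/")


def encode_base64(data: bytes) -> str:
    """Encode data as Base64, three input bytes (24 bits) at a time."""
    # Number of null bytes needed to reach a multiple of 3: 0, 1 or 2.
    pad_count = (3 - len(data) % 3) % 3
    padded = data + b"\x00" * pad_count

    out = []
    for i in range(0, len(padded), 3):
        # "big" puts the first byte in the most significant position,
        # so the six most significant bits come from the first byte.
        group = int.from_bytes(padded[i:i + 3], "big")
        out.append(ALPHABET[(group >> 18) & 0x3F])
        out.append(ALPHABET[(group >> 12) & 0x3F])
        out.append(ALPHABET[(group >> 6) & 0x3F])
        out.append(ALPHABET[group & 0x3F])

    encoded = "".join(out)
    if pad_count:
        # Replace the false 'A' characters with the same number of '='.
        encoded = encoded[:-pad_count] + "=" * pad_count
    return encoded


print(encode_base64(b"MIME"))  # TUlNRQ==

# Lengths 3n, 3n + 1 and 3n + 2 should all agree with the standard library.
for sample in (b"MIM", b"MIME", b"MIMEs"):
    assert encode_base64(sample) == base64.b64encode(sample).decode("ascii")
```

Only the final triplet can contain padding bytes, so the = substitution is applied once, to the tail of the finished string.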
The reader will be able to perform a similar analysis for data with a length of 3n + 2, which requires a single padding byte.
Decoding is a relatively simple affair. However, some precautions are required to ensure that the string provided for decoding does not contain characters that do not belong to the Base64 subset. Once again, byte order constraints may impose the need for some extra gymnastics. These issues are discussed at length in comments to the source code available for download from the link below.
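
To make the decoding precautions concrete, here is a minimal Python sketch of a strict decoder. It is not the downloadable Delphi source; `decode_base64` and `DECODE_TABLE` are hypothetical names. The sketch assumes the input is a single unbroken string and rejects everything outside the Base64 subset, including line breaks, which a decoder for real mail bodies would normally have to skip.

```python
ALPHABET = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "abcdefghijklmnopqrstuvwxyz"
            "0123456789+/")

# Reverse lookup: Base64 character -> six-bit value.
DECODE_TABLE = {char: index for index, char in enumerate(ALPHABET)}


def decode_base64(text: str) -> bytes:
    """Decode a Base64 string, rejecting characters outside the subset."""
    if len(text) % 4 != 0:
        raise ValueError("encoded length must be a multiple of 4")

    # Count the trailing '=' characters; at most two are meaningful.
    pad_count = len(text) - len(text.rstrip("="))
    if pad_count > 2:
        raise ValueError("at most two '=' padding characters are allowed")

    body = text[:len(text) - pad_count]
    for char in body:
        if char not in DECODE_TABLE:
            raise ValueError(f"invalid Base64 character: {char!r}")

    # Undo the '=' substitution: each '=' stands for a six-bit zero ('A').
    restored = body + "A" * pad_count

    out = bytearray()
    for i in range(0, len(restored), 4):
        group = 0
        for char in restored[i:i + 4]:
            group = (group << 6) | DECODE_TABLE[char]
        # "big" writes the most significant byte first, matching the encoder.
        out += group.to_bytes(3, "big")

    # Drop the null bytes that were only there as padding.
    return bytes(out[:len(out) - pad_count])


print(decode_base64("TUlNRQ=="))  # b'MIME'
```

Validating the whole string before decoding means a malformed input raises an error instead of silently producing corrupt bytes.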
Download