|
| 1 | +# Actually Portable Executable Specification v0.1 |
| 2 | + |
| 3 | +Actually Portable Executable (APE) is an executable file format that |
| 4 | +polyglots the Windows Portable Executable (PE) format with a UNIX Sixth |
| 5 | +Edition style shell script that doesn't have a shebang. This makes it |
| 6 | +possible to produce a single-file binary that executes on the stock |
| 7 | +installations of many OSes and architectures. |
| 8 | + |
| 9 | +## Supported OSes and Architectures |
| 10 | + |
| 11 | +- AMD64 |
| 12 | + - Linux |
| 13 | + - MacOS |
| 14 | + - Windows |
| 15 | + - FreeBSD |
| 16 | + - OpenBSD |
| 17 | + - NetBSD |
| 18 | + - BIOS |
| 19 | + |
| 20 | +- ARM64 |
| 21 | + - Linux |
| 22 | + - MacOS |
| 23 | + - FreeBSD |
| 24 | + - Windows (non-native) |
| 25 | + |
| 26 | +## File Header |
| 27 | + |
| 28 | +APE defines three separate file magics, all of which are 8 characters |
| 29 | +long. Any file that starts with one of these magic values can be |
| 30 | +considered an APE program. |
| 31 | + |
| 32 | +### (1) APE MZ Magic |
| 33 | + |
| 34 | +- ASCII: `MZqFpD='` |
| 35 | +- Hex: 4d 5a 71 46 70 44 3d 27 |
| 36 | + |
| 37 | +This is the canonical magic used by almost all APE programs. It enables |
| 38 | +maximum portability between OSes. When interpreted as a shell script, it |
| 39 | +assigns a single-quoted string to an unused variable. The shell will |
| 40 | +then ignore subsequent binary content that's placed inside the string. |
| 41 | + |
| 42 | +It is strongly recommended that this magic value be immediately followed |
| 43 | +by a newline (\n or hex 0a) character. Some shells, e.g. FreeBSD SH and |
| 44 | +Zsh, impose a binary safety check before handing off files that don't |
| 45 | +have a shebang to `/bin/sh`. That check applies to the first line, which |
| 46 | +can't contain NUL characters. |
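The first-line constraint can be checked mechanically. The following is a minimal sketch, not taken from any shell's source: the helper name `ape_first_line_is_text` is hypothetical, and it only models the rule as stated here (no NUL byte before the first newline), not the exact heuristics that FreeBSD SH or Zsh apply.

```c
#include <stdbool.h>
#include <stddef.h>

// Hypothetical helper: returns true if a newline occurs within the
// first n bytes of buf and no NUL byte comes before it.
static bool ape_first_line_is_text(const unsigned char *buf, size_t n) {
  for (size_t i = 0; i < n; ++i) {
    if (buf[i] == '\n') return true;  // reached the end of the first line
    if (buf[i] == 0) return false;    // NUL inside the first line
  }
  return false;  // no newline found at all
}
```

A magic that is immediately followed by `\n` passes this check no matter what binary content comes after it.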
| 47 | + |
| 48 | +The letters were carefully chosen so as to be valid x86 instructions in |
| 49 | +all operating modes. This makes it possible to store a BIOS bootloader |
| 50 | +disk image inside an APE binary. For example, simple CLI programs built |
| 51 | +with Cosmopolitan Libc will boot from BIOS into long mode if they're |
| 52 | +treated as a floppy disk image. |
| 53 | + |
| 54 | +The letters also allow for the possibility of being treated on x86-64 as |
| 55 | +a flat executable, where the PE / ELF / Mach-O executable structures are |
| 56 | +ignored, and execution simply begins at the beginning of the file, |
| 57 | +similar to how MS-DOS .COM binaries work. |
| 58 | + |
| 59 | +The 0x4a relative offset of the magic causes execution to jump into the |
| 60 | +MS-DOS stub defined by Portable Executable. APE binaries built by Cosmo |
| 61 | +Libc use tricks in the MS-DOS stub to check the operating mode and then |
| 62 | +jump to the appropriate entrypoint, e.g. `_start()`. |
| 63 | + |
| 64 | +#### Decoded as i8086 |
| 65 | + |
| 66 | +```asm |
| 67 | + dec %bp |
| 68 | + pop %dx |
| 69 | + jno 0x4a |
| 70 | + jo 0x4a |
| 71 | +``` |
| 72 | + |
| 73 | +#### Decoded as i386 |
| 74 | + |
| 75 | +```asm |
| 76 | + dec %ebp |
| 77 | + pop %edx |
| 78 | + jno 0x4a |
| 79 | + jo 0x4a |
| 80 | +``` |
| 81 | + |
| 82 | +#### Decoded as x86-64 |
| 83 | + |
| 84 | +```asm |
| 85 | + rex.WRB |
| 86 | + pop %r10 |
| 87 | + jno 0x4a |
| 88 | + jo 0x4a |
| 89 | +``` |
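The 0x4a target can be verified by hand from the hex bytes: a short conditional jump is two bytes long (opcode plus a signed 8-bit displacement), and the displacement is relative to the address of the next instruction. The sketch below does that arithmetic; the function name `ape_short_jump_target` is hypothetical and exists only for illustration.

```c
#include <stdint.h>

// Target of a two-byte short jump (opcode + rel8) located at file
// offset pos. The displacement is relative to the next instruction,
// which begins at pos + 2.
static int ape_short_jump_target(const unsigned char *magic, int pos) {
  return pos + 2 + (int8_t)magic[pos + 1];
}
```

For the MZ magic, `jno` sits at offset 2 with displacement 0x46 and `jo` sits at offset 4 with displacement 0x44, so both calls return 0x4a (2 + 2 + 0x46 and 4 + 2 + 0x44). Since one of the two conditions always holds, the jump is effectively unconditional.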
| 90 | + |
| 91 | +### (2) APE UNIX-Only Magic |
| 92 | + |
| 93 | +- ASCII: `jartsr='` |
| 94 | +- Hex: 6a 61 72 74 73 72 3d 27 |
| 95 | + |
| 96 | +Being a novel executable format that was first published in 2020, the |
| 97 | +APE file format is less understood by industry tools compared to the PE, |
| 98 | +ELF, and Mach-O executable file formats, which have been around for |
| 99 | +decades. For this reason, APE programs that use the MZ magic above can |
| 100 | +attract attention from Windows AV software, which may be unwanted by |
| 101 | +developers who aren't interested in targeting the Windows platform. |
| 102 | +Therefore the `jartsr='` magic is defined which enables the creation of |
| 103 | +APE binaries that can safely target all non-Windows platforms. Even |
| 104 | +though this magic is less common, APE interpreters and binfmt-misc |
| 105 | +installations MUST support this. |
| 106 | + |
| 107 | +It is strongly recommended that this magic value be immediately followed |
| 108 | +by a newline (\n or hex 0a) character. Some shells, e.g. FreeBSD SH and |
| 109 | +Zsh, impose a binary safety check before handing off files that don't |
| 110 | +have a shebang to `/bin/sh`. That check applies to the first line, which |
| 111 | +can't contain NUL characters. |
| 112 | + |
| 113 | +The letters were carefully chosen so as to be valid x86 instructions in |
| 114 | +all operating modes. This makes it possible to store a BIOS bootloader |
| 115 | +disk image inside an APE binary. For example, simple CLI programs built |
| 116 | +with Cosmopolitan Libc will boot from BIOS into long mode if they're |
| 117 | +treated as a floppy disk image. |
| 118 | + |
| 119 | +The letters also allow for the possibility of being treated on x86-64 as |
| 120 | +a flat executable, where the PE / ELF / Mach-O executable structures are |
| 121 | +ignored, and execution simply begins at the beginning of the file, |
| 122 | +similar to how MS-DOS .COM binaries work. |
| 123 | + |
| 124 | +The 0x78 relative offset of the magic causes execution to jump into the |
| 125 | +MS-DOS stub defined by Portable Executable. APE binaries built by Cosmo |
| 126 | +Libc use tricks in the MS-DOS stub to check the operating mode and then |
| 127 | +jump to the appropriate entrypoint, e.g. `_start()`. |
| 128 | + |
| 129 | +#### Decoded as i8086 / i386 / x86-64 |
| 130 | + |
| 131 | +```asm |
| 132 | + push $0x61 |
| 133 | + jb 0x78 |
| 134 | + jae 0x78 |
| 135 | +``` |
| 136 | + |
| 137 | +### (3) APE Debug Magic |
| 138 | + |
| 139 | +- ASCII: `APEDBG='` |
| 140 | +- Hex: 41 50 45 44 42 47 3d 27 |
| 141 | + |
| 142 | +While APE files must be valid shell scripts, in practice, UNIX systems |
| 143 | +will oftentimes be configured to provide a faster, safer alternative to |
| 144 | +loading an APE binary through `/bin/sh`. The Linux Kernel can be patched |
| 145 | +to have execve() recognize the APE format and directly load its embedded |
| 146 | +ELF header. Linux systems can also use binfmt-misc to recognize APE's MZ |
| 147 | +and jartsr magic, and pass them to a userspace program named `ape` that |
| 148 | +acts as an interpreter. In such environments, the need sometimes arises |
| 149 | +to test that the `/bin/sh` code path is working correctly, in which |
| 150 | +case the `APEDBG='` magic is RECOMMENDED. |
| 151 | + |
| 152 | +APE interpreters, execve() implementations, and binfmt-misc installs |
| 153 | +MUST ignore this magic. If necessary, steps can be taken to help files |
| 154 | +with this magic be passed to `/bin/sh` like a normal shebang-less shell |
| 155 | +script for execution. |
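The rules for the three magics can be summarized in a small classifier. This is a minimal sketch rather than the code of any existing loader: the enum and the function names `ape_classify` and `ape_loader_should_handle` are hypothetical.

```c
#include <stddef.h>
#include <string.h>

enum ape_magic {
  APE_MAGIC_NONE,   // not an APE file
  APE_MAGIC_MZ,     // MZqFpD='
  APE_MAGIC_UNIX,   // jartsr='
  APE_MAGIC_DEBUG,  // APEDBG=' (loaders leave this one to /bin/sh)
};

// Classifies a file by its first 8 bytes.
static enum ape_magic ape_classify(const void *buf, size_t n) {
  if (n < 8) return APE_MAGIC_NONE;
  if (!memcmp(buf, "MZqFpD='", 8)) return APE_MAGIC_MZ;
  if (!memcmp(buf, "jartsr='", 8)) return APE_MAGIC_UNIX;
  if (!memcmp(buf, "APEDBG='", 8)) return APE_MAGIC_DEBUG;
  return APE_MAGIC_NONE;
}

// True if an interpreter, execve() implementation, or binfmt-misc
// style loader should claim the file. The debug magic is deliberately
// excluded so that such files fall through to the shell.
static int ape_loader_should_handle(enum ape_magic m) {
  return m == APE_MAGIC_MZ || m == APE_MAGIC_UNIX;
}
```

Keeping classification separate from the loading decision lets a tool still report `APEDBG='` files as APE programs while declining to load them directly.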
| 156 | + |
| 157 | +## Embedded ELF Header |
| 158 | + |
| 159 | +APE binaries MAY embed an ELF header inside them. Unlike conventional |
| 160 | +executable file formats, this header is not stored at a fixed offset. |
| 161 | +It's instead encoded as octal escape codes in a shell script `printf` |
| 162 | +statement. For example: |
| 163 | + |
| 164 | +``` |
| 165 | +printf '\177ELF\2\1\1\011\0\0\0\0\0\0\0\0\2\0\076\0\1\0\0\0\166\105\100\000\000\000\000\000\060\013\000\000\000\000\000\000\000\000\000\000\000\000\000\000\165\312\1\1\100\0\070\0\005\000\0\0\000\000\000\000' |
| 166 | +``` |
| 167 | + |
| 168 | +This `printf` statement MUST appear in the first 8192 bytes of the APE |
| 169 | +executable, so as to limit how much of the initial portion of a file an |
| 170 | +interpreter must load. |
| 171 | + |
| 172 | +Multiple such `printf` statements MAY appear in the first 8192 bytes, in |
| 173 | +order to specify multiple architectures. For example, fat binaries built |
| 174 | +by the `apelink` program (provided by Cosmo Libc) will have two encoded |
| 175 | +ELF headers, for amd64 and arm64, each of which points into the proper |
| 176 | +file offsets for their respective native code. Therefore, kernels and |
| 177 | +interpreters which load the APE format directly MUST check the |
| 178 | +`e_machine` field of the `Elf64_Ehdr` that's decoded from the octal |
| 179 | +codes, before accepting a `printf` shell statement as valid. |
| 180 | + |
| 181 | +These printf statements MUST always use only unescaped ASCII characters |
| 182 | +or octal escape codes. These printf statements MUST NOT use space saving |
| 183 | +escape codes such as `\n`. For example, rather than saying `\n` it would |
| 184 | +be valid to say `\012` instead. It's also valid to say `\12` but only if |
| 185 | +the encoded character that follows isn't an octal digit. |
| 186 | + |
| 187 | +For example, the following algorithm may be used for parsing octal: |
| 188 | + |
| 189 | +```c |
| 190 | +static int ape_parse_octal(const unsigned char page[8192], int i, int *pc) |
| 191 | +{ |
| 192 | + int c; |
| 193 | + if ('0' <= page[i] && page[i] <= '7') { |
| 194 | + c = page[i++] - '0'; |
| 195 | + if ('0' <= page[i] && page[i] <= '7') { |
| 196 | + c *= 8; |
| 197 | + c += page[i++] - '0'; |
| 198 | + if ('0' <= page[i] && page[i] <= '7') { |
| 199 | + c *= 8; |
| 200 | + c += page[i++] - '0'; |
| 201 | + } |
| 202 | + } |
| 203 | + *pc = c; |
| 204 | + } |
| 205 | + return i; |
| 206 | +} |
| 207 | +``` |
| 208 | + |
| 209 | +APE aware interpreters SHOULD only take `e_machine` into consideration. |
| 210 | +It is the responsibility of the `_start()` function to detect the OS. |
| 211 | +Therefore, multiple `printf` statements are only embedded in the shell |
| 212 | +script for different CPU architectures. |
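To illustrate how the octal rules and the `e_machine` check fit together, here is a minimal sketch that decodes the body of a `printf` argument and reads the machine field. The function names and the `*_SKETCH` constants are hypothetical; the constants mirror the `EM_X86_64` and `EM_AARCH64` values from `<elf.h>`, and the decoder assumes it is handed the text that follows the opening single quote.

```c
#include <stddef.h>

#define EM_X86_64_SKETCH 62    // value of EM_X86_64 in <elf.h>
#define EM_AARCH64_SKETCH 183  // value of EM_AARCH64 in <elf.h>

// Decodes plain characters and 1-3 digit octal escapes into out,
// stopping at the closing single quote or the end of the string.
// Returns the byte count, or -1 on a non-octal escape or overflow.
static int ape_decode_printf(const char *s, unsigned char *out, int cap) {
  int n = 0;
  while (*s && *s != '\'') {
    int c;
    if (*s == '\\') {
      ++s;
      if (*s < '0' || *s > '7') return -1;  // e.g. \n is not allowed
      c = 0;
      for (int k = 0; k < 3 && '0' <= *s && *s <= '7'; ++k)
        c = c * 8 + (*s++ - '0');
    } else {
      c = (unsigned char)*s++;
    }
    if (n == cap) return -1;
    out[n++] = c & 255;
  }
  return n;
}

// Reads the little-endian e_machine field (offset 18 of Elf64_Ehdr).
// Returns -1 if the header is too short or lacks the ELF magic.
static int ape_elf_machine(const unsigned char *h, int n) {
  if (n < 20) return -1;
  if (h[0] != 0177 || h[1] != 'E' || h[2] != 'L' || h[3] != 'F') return -1;
  return h[18] | (h[19] << 8);
}
```

Applied to the example header shown earlier, the decoded `e_machine` bytes are `\076\0`, which is 62 (`EM_X86_64`), and the OS ABI byte at offset 7 is `\011`, which is 9 (`ELFOSABI_FREEBSD`). A loader running on another architecture would skip this statement and keep scanning for the next `printf`.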
| 213 | + |
| 214 | +The OS ABI field of an APE embedded `Elf64_Ehdr` SHOULD be set to |
| 215 | +`ELFOSABI_FREEBSD`, since it's the only UNIX OS APE supports that |
| 216 | +actually checks the field. However different values MAY be chosen for |
| 217 | +binaries that don't intend to have FreeBSD in their support vector. |
| 218 | + |
| 219 | +Counter-intuitively, the ARM64 ELF header is used on the MacOS ARM64 |
| 220 | +platform when loading from fat binaries. |
| 221 | + |
| 222 | +## Embedded Mach-O Header (x86-64 only) |
| 223 | + |
| 224 | +APE shell scripts that support MacOS on AMD64 must use the `dd` command |
| 225 | +in a very specific way to specify how the embedded binary Mach-O header |
| 226 | +is copied backward to the start of the file. For example: |
| 227 | + |
| 228 | +``` |
| 229 | +dd if="$o" of="$o" bs=8 skip=433 count=66 conv=notrunc |
| 230 | +``` |
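In the example above, `bs=8 skip=433 count=66` reads 66 × 8 = 528 bytes starting at byte offset 433 × 8 = 3464 and writes them at offset 0, while `conv=notrunc` leaves the rest of the file intact. A tool that already holds the file in memory could reproduce the same effect without invoking `dd`. The following is a minimal sketch under that assumption; the function name `ape_apply_dd` is hypothetical and the multiplication is not checked for overflow.

```c
#include <stddef.h>
#include <string.h>

// In-memory equivalent of
//   dd if=F of=F bs=BS skip=SKIP count=COUNT conv=notrunc
// Copies COUNT blocks starting at block SKIP to the start of the image.
// Returns 0 on success, or -1 if the range lies outside the image.
static int ape_apply_dd(unsigned char *image, size_t size,
                        size_t bs, size_t skip, size_t count) {
  size_t off = bs * skip;   // byte offset of the embedded header
  size_t len = bs * count;  // byte length of the embedded header
  if (off > size || len > size - off) return -1;
  memmove(image, image + off, len);  // memmove tolerates overlap
  return 0;
}
```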
| 231 | + |
| 232 | +These `dd` statements have traditionally been generated by the GNU as |
| 233 | +and ld.bfd programs by encoding ASCII into 64-bit linker relocations, |
| 234 | +which necessitated a fixed width for integer values. It took several |
| 235 | +iterations over APE's history before we eventually got it right: |
| 236 | + |
| 237 | +- `arg=" 9293"` is how we originally had ape do it |
| 238 | +- `arg=$(( 9293))` b/c busybox sh disliked quoted space |
| 239 | +- `arg=9293 ` is generated by modern apelink program |
| 240 | + |
| 241 | +Software that parses the APE file format and needs to extract the |
| 242 | +Mach-O x86-64 header SHOULD support old binaries that use the previous |
| 243 | +encodings. To keep backwards compatibility simple, the following |
| 244 | +regular expression may be used, which generalizes to all defined |
| 245 | +formats: |
| 246 | + |
| 247 | +```c |
| 248 | +regcomp(&rx, |
| 249 | + "bs=" // dd block size arg |
| 250 | + "(['\"] *)?" // #1 optional quote w/ space |
| 251 | + "(\\$\\(\\( *)?" // #2 optional math w/ space |
| 252 | + "([[:digit:]]+)" // #3 |
| 253 | + "( *\\)\\))?" // #4 optional math w/ space |
| 254 | + "( *['\"])?" // #5 optional quote w/ space |
| 255 | + " +" // |
| 256 | + "skip=" // dd skip arg |
| 257 | + "(['\"] *)?" // #6 optional quote w/ space |
| 258 | + "(\\$\\(\\( *)?" // #7 optional math w/ space |
| 259 | + "([[:digit:]]+)" // #8 |
| 260 | + "( *\\)\\))?" // #9 optional math w/ space |
| 261 | + "( *['\"])?" // #10 optional quote w/ space |
| 262 | + " +" // |
| 263 | + "count=" // dd count arg |
| 264 | + "(['\"] *)?" // #11 optional quote w/ space |
| 265 | + "(\\$\\(\\( *)?" // #12 optional math w/ space |
| 266 | + "([[:digit:]]+)", // #13 |
| 267 | + REG_EXTENDED); |
| 268 | +``` |
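As a usage illustration, the pattern above can be run with POSIX `regexec()` and the three numeric values read from capture groups #3, #8, and #13. This is a minimal sketch, not the code from `assimilate.c`: the struct `ape_dd_args` and the function `ape_parse_dd` are hypothetical names, and the regular expression is the same one with its string pieces joined.

```c
#include <regex.h>
#include <stdlib.h>

// Parsed arguments of the Mach-O `dd` statement.
struct ape_dd_args {
  long bs, skip, count;
};

// Extracts bs/skip/count from a shell line. Returns 0 on success,
// or -1 if the pattern fails to compile or does not match.
static int ape_parse_dd(const char *line, struct ape_dd_args *out) {
  regex_t rx;
  regmatch_t m[14];  // whole match plus 13 capture groups
  int rc = regcomp(&rx,
      "bs=(['\"] *)?(\\$\\(\\( *)?([[:digit:]]+)( *\\)\\))?( *['\"])? +"
      "skip=(['\"] *)?(\\$\\(\\( *)?([[:digit:]]+)( *\\)\\))?( *['\"])? +"
      "count=(['\"] *)?(\\$\\(\\( *)?([[:digit:]]+)",
      REG_EXTENDED);
  if (rc) return -1;
  rc = regexec(&rx, line, 14, m, 0);
  regfree(&rx);
  if (rc) return -1;
  // strtol stops at the first non-digit, so no substring copy is needed.
  out->bs = strtol(line + m[3].rm_so, NULL, 10);
  out->skip = strtol(line + m[8].rm_so, NULL, 10);
  out->count = strtol(line + m[13].rm_so, NULL, 10);
  return 0;
}
```

The same call accepts all three historical encodings: the quoted form with leading spaces, the `$(( ))` arithmetic form, and the modern form with trailing spaces.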
| 269 | + |
| 270 | +For further details, see the canonical implementation in |
| 271 | +`cosmopolitan/tool/build/assimilate.c`. |