AArch64 Bootstrapping 1 - Seed Groundwork
I got a new laptop[1] recently, and in the spirit of how open and accessible and tinkerable the whole system is, I thought it would be fun to try and do a little bootstrapping challenge. Starting from a tiny hand-made ELF executable “seed,” I’m going to try to build a complete programming environment from scratch.
Ideally I’d like to have started off by cat
ting raw hexadecimal into a file
and building up from there,[2] or something to that effect, but my new
laptop uses AArch64, also known as Arm64.[3] That’s the instruction set
used in most smartphones and newer Macs, and as a RISC-ish ISA it’s quite a bit
cleaner than the convoluted mess x86/64 has grown into, but it’s also
significantly more difficult to write by hand. I gave it a good shot, but on
top of the instruction fields being weirdly aligned and my near-complete
inexperience with directly using the instruction set, I finally relented when I
realized that ELF executables have to be little-endian,[4] meaning I’d
have to write the raw, unfamiliar instructions backwards.
Fortunately, the endianness thing is a pretty straightforward problem to solve, so I thought I’d try using that as a warmup. What I’d like is a program that takes in a space-delimited stream of varying sizes of numbers in hexadecimal, and spits them out the other side “compiled” to binary in little-endian byte order. Giving it a first go in C to make sure I had my implementation concept down right, I came up with this:
// hex.c
#include <stdio.h>
int main(void) {
char c = 0; // Input character
int count = 0; // Hex digits parsed for this number
long long v = 0; // Current value
while ((c = getchar()) != EOF) {
// Skip comments
if (c == '\\') { while ((c = getchar()) != EOF && c != '\n') {} }
// Process incoming digits and add to running value
else if (('0' <= c) && (c <= '9')) { v = (v << 4) + (c - '0'); count++; }
else if (('a' <= c) && (c <= 'f')) { v = (v << 4) + (c - 'a' + 10); count++; }
// Send the finished number out the other end
else if (c == ' ') {
// Turn the number of hex digits into a number of bytes
if (count & 1) count++;
count >>= 1;
// Emit each byte, littlest first
for (; count>0; count--) {
c = v & 0xFF;
putchar(c);
v >>= 8;
}
}
}
}
That worked like a charm, but C is still pretty high-level for where we’re
going, so I’m not going to spend too long going over it here. Instead, I
rewrote it in assembly. Given that I went for a mostly-direct translation, odds
are there’s plenty more efficiency that could be squeezed out of it in terms of
speed or instruction count, but I’m really only planning on using this on
itself and the very next stage, so I figure if it works correctly, that’s good
enough. I used the GNU as
assembler because I was already familiar with it,
and I find the ephemeral numbered labels especially nifty.
Starting off, we set up a couple “global” variables, then begin the main loop by pulling in a new character and checking if we’ve reached the end of the input, and exiting if so.
// hex.s
.section .data
buf:
.byte 0 // input/output character buffer
.section .text
.global _start
_start:
mov x19, #0 // value
mov x20, #0 // count
loop:
mov x8, #63 // read
mov x0, #0 // stdin
adr x1, buf
mov x2, #1
svc 0
cmp x0, #0 // count of bytes read; 0 means EOF
b.ne 1f
exit:
mov x8, #93 // exit
mov x0, #0 // error code (no error)
svc 0
System calls are made in AArch64 Linux by loading the system call
number[5] into the register x8
, plus the syscall arguments into some
of the other registers, and then running the “supervisor call” instruction svc 0
which handles the transition over to the kernel’s control.
For the read
syscall (63), the arguments are the source file descriptor, a
pointer to the target character buffer(!), and the requested length. Note that
the call specifically requires a destination buffer; as far as I can tell,
there’s no way to request a single char read directly into a register. So, we
need to load a register with a pointer to our one-byte buffer, then after the
syscall returns, we load the character from there.
AArch64 has a special instruction, adr
, for loading a register with an
absolute memory address given a program-counter-relative offset. My
understanding is that this is the idiomatic way to do this, even though it
might prove to be a bit of a pain when we’re manually calculating addresses and
offsets later.
Once the syscall has placed the byte into the buffer, we pull it out and check to see if it’s a backslash, meaning we’re at the start of a comment. If so, we enter an inner loop where we read characters repeatedly until we find the end of the line, then we start the main loop again from the top.
1:
adr x1, buf
ldurb w9, [x1]
cmp w9, #92 // backslash; start of comment
b.ne 2f
comment:
mov x8, #63 // read
mov x0, #0 // stdin
adr x1, buf
mov x2, #1
svc 0
cmp x0, #0
b.eq exit
adr x1, buf
ldurb w9, [x1]
cmp w9, '\n'
b.ne comment
b loop
If it’s not a comment, then it might be a hex digit we need to collect into the
current value. I broke this part up into three sections: one that finds the
value of decimal digits, one that finds the value of hex digits a
through f
(case sensitive!), and one that shifts the digit into the current large value
we’re building. Once that’s done, we head back up to the top of the main loop
to process the next character.
2:
cmp w9, '0'
b.lt 3f
cmp w9, '9'
b.gt 3f
sub w9, w9, '0'
b 4f
3:
cmp w9, 'a'
b.lt 5f
cmp w9, 'f'
b.gt 5f
sub w9, w9, ('a' - 10)
b 4f
4:
lsl x19, x19, #4
add x19, x19, w9, uxtb
add x20, x20, #1
b loop
Note the use of the… flag? uxtb
in the add
instruction on line 69. This
signals the processor to treat the value in w9
as an unsigned (i.e.
zero-extended) byte for the purposes of the addition. Technically we only need
a nibble (4 bits, not 8), but since we know the range of values in w9
is only
0–15, byte-sized extension works fine. To be honest, you could probably just
use the full-width x9
and avoid needing to specify an extension scheme at
all, but I felt like it would be nice to keep using w9
for consistency. It
also showed me something about the ISA structure that I wouldn’t otherwise have
thought to look into.
If it’s not a comment and it’s not a hex digit, then it’s either a character we’re ignoring (like a line break) or it’s a space, which we use as the cue to flush the current value to the output.
5:
cmp w9, ' '
b.ne loop
To flush the output, first we normalize the count of nibbles into a count of
bytes: we round up (i.e. increment if it’s odd), then we shift right one bit
(dividing the count by two). The tst
instruction used here differs from the
cmp
instruction we’ve been using everywhere else in that it does the
comparison using a bitwise “and” operation, rather than subtraction.
tst x20, #1
cinc x20, x20, ne
lsr x20, x20, #1
Finally, we enter a loop for each byte. We store the unsigned low-order byte
from our big value in x19
(accessed through the 32-bit sized w19
), send it
out using the write
syscall, shift our big value over by 8 to put the next
byte in the low-order spot, then decrement the byte count until we’re done
handling this value and we jump back up to the top of the main loop to start on
the next value.
6:
cmp x20, #0
b.eq loop
mov x8, #64 // write
mov x0, #1 // stdout
adr x1, buf
sturb w19, [x1]
mov x2, #1
svc 0
lsr x19, x19, #8
sub x20, x20, #1
b 6b
That wasn’t too bad! In the next post, we’ll test this new “language” out with a few small executables, then rewrite this program in hex itself, then maybe start on the next stage, a tiny little Forth implementation.
Until then!