AArch64 Bootstrapping 1 - Seed Groundwork

2024-08-17

I got a new laptop^[1] recently, and in the spirit of how open and accessible and tinkerable the whole system is, I thought it would be fun to try and do a little bootstrapping challenge. Starting from a tiny hand-made ELF executable “seed,” I’m going to try to build a complete programming environment from scratch.

Ideally I’d like to have started off by catting raw hexadecimal into a file and building up from there,^[2] or something to that effect, but my new laptop uses AArch64, also known as Arm64.^[3] That’s the instruction set used in most smartphones and newer Macs, and as a RISC-ish ISA it’s quite a bit cleaner than the convoluted mess x86/64 has grown into, but it’s also significantly more difficult to write by hand. I gave it a good shot, but on top of the instruction fields being weirdly aligned and my near-complete inexperience with directly using the instruction set, I finally relented when I realized that ELF executables have to be little-endian,^[4] meaning I’d have to write the raw, unfamiliar instructions backwards.

Fortunately, the endianness thing is a pretty straightforward problem to solve, so I thought I’d try using that as a warmup. What I’d like is a program that takes in a space-delimited stream of varying sizes of numbers in hexadecimal, and spits them out the other side “compiled” to binary in little-endian byte order. Giving it a first go in C to make sure I had my implementation concept down right, I came up with this:

// hex.c
#include <stdio.h>

int main(void) {
  char c = 0;      // Input character
  int count = 0;   // Hex digits parsed for this number
  long long v = 0; // Current value

  while ((c = getchar()) != EOF) {

    // Skip comments
    if (c == '\\') { while ((c = getchar()) != EOF && c != '\n') {} }

    // Process incoming digits and add to running value
    else if (('0' <= c) && (c <= '9')) { v = (v << 4) + (c - '0'); count++; }
    else if (('a' <= c) && (c <= 'f')) { v = (v << 4) + (c - 'a' + 10); count++; }

    // Send the finished number out the other end
    else if (c == ' ') {

      // Turn the number of hex digits into a number of bytes
      if (count & 1) count++;
      count >>= 1;

      // Emit each byte, littlest first
      for (; count>0; count--) {
        c = v & 0xFF;
        putchar(c);
        v >>= 8;
      }
    }
  }
}

That worked like a charm, but C is still pretty high-level for where we’re going, so I’m not going to spend too long going over it here. Instead, I rewrote it in assembly. Given that I went for a mostly-direct translation, odds are there’s plenty more efficiency that could be squeezed out of it in terms of speed or instruction count, but I’m really only planning on using this on itself and the very next stage, so I figure if it works correctly, that’s good enough. I used the GNU as assembler because I was already familiar with it, and I find the ephemeral numbered labels especially nifty.

Starting off, we set up a couple “global” variables, then begin the main loop by pulling in a new character and checking if we’ve reached the end of the input, and exiting if so.

// hex.s
.section .data
buf:
  .byte 0       // input/output character buffer

.section .text
.global _start
_start:
  mov x19, #0   // value
  mov x20, #0   // count

loop:
  mov x8, #63   // read
  mov x0, #0    // stdin
  adr x1, buf
  mov x2, #1
  svc 0

  cmp x0, #0    // count of bytes read; 0 means EOF
  b.ne 1f

exit:
  mov x8, #93   // exit
  mov x0, #0    // error code (no error)
  svc 0

System calls are made in AArch64 Linux by loading the system call number^[5] into the register x8, plus the syscall arguments into some of the other registers, and then running the “supervisor call” instruction svc 0 which handles the transition over to the kernel’s control.

For the read syscall (63), the arguments are the source file descriptor, a pointer to the target character buffer(!), and the requested length. Note that the call specifically requires a destination buffer; as far as I can tell, there’s no way to request a single char read directly into a register. So, we need to load a register with a pointer to our one-byte buffer, then after the syscall returns, we load the character from there.

AArch64 has a special instruction, adr, for loading a register with an absolute memory address given a program-counter-relative offset. My understanding is that this is the idiomatic way to do this, even though it might prove to be a bit of a pain when we’re manually calculating addresses and offsets later.

Once the syscall has placed the byte into the buffer, we pull it out and check to see if it’s a backslash, meaning we’re at the start of a comment. If so, we enter an inner loop where we read characters repeatedly until we find the end of the line, then we start the main loop again from the top.

1:
  adr x1, buf
  ldurb w9, [x1]

  cmp w9, #92   // backslash; start of comment
  b.ne 2f

comment:
  mov x8, #63   // read
  mov x0, #0    // stdin
  adr x1, buf
  mov x2, #1
  svc 0

  cmp x0, #0
  b.eq exit

  adr x1, buf
  ldurb w9, [x1]
  cmp w9, '\n'
  b.ne comment
  b loop

If it’s not a comment, then it might be a hex digit we need to collect into the current value. I broke this part up into three sections: one that finds the value of decimal digits, one that finds the value of hex digits a through f (case sensitive!), and one that shifts the digit into the current large value we’re building. Once that’s done, we head back up to the top of the main loop to process the next character.

2:
  cmp w9, '0'
  b.lt 3f
  cmp w9, '9'
  b.gt 3f

  sub w9, w9, '0'
  b 4f

3:
  cmp w9, 'a'
  b.lt 5f
  cmp w9, 'f'
  b.gt 5f

  sub w9, w9, ('a' - 10)
  b 4f

4:
  lsl x19, x19, #4
  add x19, x19, w9, uxtb
  add x20, x20, #1
  b loop

Note the use of the… flag? uxtb in the add instruction on line 69. This signals the processor to treat the value in w9 as an unsigned (i.e. zero-extended) byte for the purposes of the addition. Technically we only need a nibble (4 bits, not 8), but since we know the range of values in w9 is only 0–15, byte-sized extension works fine. To be honest, you could probably just use the full-width x9 and avoid needing to specify an extension scheme at all, but I felt like it would be nice to keep using w9 for consistency. It also showed me something about the ISA structure that I wouldn’t otherwise have thought to look into.

If it’s not a comment and it’s not a hex digit, then it’s either a character we’re ignoring (like a line break) or it’s a space, which we use as the cue to flush the current value to the output.

5:
  cmp w9, ' '
  b.ne loop

To flush the output, first we normalize the count of nibbles into a count of bytes: we round up (i.e. increment if it’s odd), then we shift right one bit (dividing the count by two). The tst instruction used here differs from the cmp instruction we’ve been using everywhere else in that it does the comparison using a bitwise “and” operation, rather than subtraction.

  tst x20, #1
  cinc x20, x20, ne
  lsr x20, x20, #1

Finally, we enter a loop for each byte. We store the unsigned low-order byte from our big value in x19 (accessed through the 32-bit sized w19), send it out using the write syscall, shift our big value over by 8 to put the next byte in the low-order spot, then decrement the byte count until we’re done handling this value and we jump back up to the top of the main loop to start on the next value.

6:
  cmp x20, #0
  b.eq loop

  mov x8, #64   // write
  mov x0, #1    // stdout
  adr x1, buf
  sturb w19, [x1]
  mov x2, #1
  svc 0

  lsr x19, x19, #8

  sub x20, x20, #1
  b 6b

That wasn’t too bad! In the next post, we’ll test this new “language” out with a few small executables, then rewrite this program in hex itself, then maybe start on the next stage, a tiny little Forth implementation.

Until then!

Files: hex.c hex.s

#bootstrapping #aarch64 #assembly