Tenerife Skunkworks

Boldly going where few have gone before

Creating Mac binaries on any platform, by hand and without using a linker

I’m in love with Forth but there are no commercial Forth environments for Mac OSX. GForth is a free, fast and portable implementation of ANS Forth but it requires GCC and does not allow for binary distribution of code that uses foreign functions.

There are two excellent commercial implementations of ANS Forth and both run on Linux. I asked one of the companies if I could port their Forth to the Mac and promptly ended up with a tarball on my lap. There were no C or assembler files, it was all Forth source code.

The proper bootstrapping approach turned out to be to generate a Mac kernel on Linux, copy it over to the Mac and use it to compile the rest of the Forth environment. It’s called cross-compiling!

This required me to investigate how Mac binaries are laid out and how I could generate them without using gcc or a linker.

I would like to explain how I did it. Let’s start with a simple C program. Feel free to browse the full source code.

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
  printf("Hello world!\n");
  exit(0);
}

It can’t get any simpler!

gcc hello.c -o hello
./hello
Hello world!

What does it look like in assembler, though?

.cstring
LC0:
    .ascii "Hello world!\0"
    .text
.globl _main
_main:
    pushl   %ebp
    movl    %esp, %ebp
    pushl   %ebx
    subl    $20, %esp
    call    L3
"L00000000001$pb":
L3:
    popl    %ebx
    leal    LC0-"L00000000001$pb"(%ebx), %eax
    movl    %eax, (%esp)
    call    L_puts$stub
    movl    $0, (%esp)
    call    L_exit$stub
    .section __IMPORT,__jump_table,symbol_stubs,self_modifying_code+pure_instructions,5
L_exit$stub:
    .indirect_symbol _exit
    hlt ; hlt ; hlt ; hlt ; hlt
L_puts$stub:
    .indirect_symbol _puts
    hlt ; hlt ; hlt ; hlt ; hlt
    .subsections_via_symbols

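The IMPORT section is where gcc allocates stubs for external functions. The dynamic linker will replace these with a jump to the real function once libc is loaded. Note that gcc has swapped printf for puts here, since the string is a plain literal that ends in a newline.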

What the code above does not include is proper alignment of the stack before the calls to printf and exit. This is required by the Mac OSX ABI IA-32 Function Calling Conventions. It’s a sleight of hand on the part of gcc, which inserts a prolog before invoking our main function.

This prolog sets up the stack and gets hold of our program arguments, i.e. argc, argv and envp.

Breakpoint 1, 0x00001f6c in start ()
(gdb) disas
Dump of assembler code for function start:
0x00001f68:    push   $0x0
0x00001f6a:    mov    %esp,%ebp
0x00001f6c:    and    $0xfffffff0,%esp ; <-- stack alignment
0x00001f6f:    sub    $0x10,%esp  ; <-- and here too!
0x00001f72:    mov    0x4(%ebp),%ebx
0x00001f75:    mov    %ebx,0x0(%esp)
0x00001f79:    lea    0x8(%ebp),%ecx
0x00001f7c:    mov    %ecx,0x4(%esp)
0x00001f80:    add    $0x1,%ebx
0x00001f83:    shl    $0x2,%ebx
0x00001f86:    add    %ecx,%ebx
0x00001f88:    mov    %ebx,0x8(%esp)
0x00001f8c:    mov    (%ebx),%eax
0x00001f8e:    add    $0x4,%ebx
0x00001f91:    test   %eax,%eax
0x00001f93:    jne    0x1f8c
0x00001f95:    mov    %ebx,0xc(%esp)
0x00001f99:    call   0x1fca
0x00001f9e:    mov    %eax,0x0(%esp)
0x00001fa2:    call   0x3000
0x00001fa7:    hlt
End of assembler dump.
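
The argument shuffling in the disassembly can be sketched in C. This is only an illustration: it assumes the initial stack holds argc, then the argv pointers, a NULL, the envp pointers and another NULL, and the struct and function names are made up for this example. The fourth value is simply what the disassembly stores at 0xc(%esp), the address just past the NULL that ends envp.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names: a C view of what the start prolog computes. */
struct start_args {
    int    argc;
    char **argv;
    char **envp;
    char **after_envp;   /* fourth value the prolog hands to main */
};

/* sp points at the word holding argc, as it is found on entry to start. */
struct start_args parse_initial_stack(void **sp)
{
    struct start_args a;
    char **p;

    a.argc = (int)(intptr_t)sp[0];   /* mov 0x4(%ebp),%ebx          */
    a.argv = (char **)&sp[1];        /* lea 0x8(%ebp),%ecx          */
    a.envp = a.argv + a.argc + 1;    /* skip argv and its NULL      */

    p = a.envp;
    while (*p != NULL)               /* the mov/add/test/jne loop   */
        p++;
    a.after_envp = p + 1;            /* just past envp's NULL       */
    return a;
}
```

Feeding it an array that mimics the stack (argc, two argv entries, NULL, one envp entry, NULL) should give back pointers into that same array, which is all the prolog does before calling main.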

Let’s tidy things up into a single NASM file. It’s less verbose than GAS and I much prefer it.

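bits  32

section .text

GLOBAL start
extern _printf, _exit

start:
  and esp, 0xFFFFFFF0         ; align the stack on a 16-byte boundary
  sub esp, 0x10               ; one 4-byte argument plus 12 bytes of padding
  mov dword [esp], hello.msg
  call _printf
  add esp, 0x10               ; pop the argument and padding
  sub esp, 0x10               ; room for exit's argument, still aligned
  mov dword [esp], 0          ; exit status goes on the stack, not in eax
  call _exit
  hlt

section .data

hello.msg db 'Hello, World!', 0x0a, 0x00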

The stubs are taken care of by nasm in Mach-O mode (-f macho below) and the code still works.

nasm -f macho hello.asm -o hello.o
ld hello.o -o hello -lc

./hello
Hello, World!

otool is indispensable for any sort of involved Mac forensics and the Mach-O file format is very well explained by Apple.

otool -l hello
hello:
Load command 0
      cmd LC_SEGMENT
  cmdsize 56
  segname __PAGEZERO
   vmaddr 0x00000000
   vmsize 0x00001000
  fileoff 0
 filesize 0
  maxprot 0x00000000
 initprot 0x00000000
   nsects 0
    flags 0x0
...
Load command 8
     cmd LC_UUID
 cmdsize 24
    uuid 0xce 0x2c 0xd0 0xae 0xbb 0x29 0xb4 0xc5
         0xba 0x70 0x39 0x06 0x18 0x30 0x42 0x7b
Load command 9
        cmd LC_UNIXTHREAD
    cmdsize 80
     flavor i386_THREAD_STATE
      count i386_THREAD_STATE_COUNT
        eax 0x00000000 ebx    0x00000000 ecx 0x00000000 edx 0x00000000
        edi 0x00000000 esi    0x00000000 ebp 0x00000000 esp 0x00000000
        ss  0x00000000 eflags 0x00000000 eip 0x00001fd0 cs  0x00000000
        ds  0x00000000 es     0x00000000 fs  0x00000000 gs  0x00000000
Load command 10
          cmd LC_LOAD_DYLIB
      cmdsize 52
         name /usr/lib/libSystem.B.dylib (offset 24)
   time stamp 2 Thu Jan  1 01:00:02 1970
      current version 111.1.3
compatibility version 1.0.0

The Mach-O header is normally generated by the compiler and the linker (GCC & LD) but I’m using neither so I have to generate the header by hand. It’s doable, as long as NASM is instructed to simply dump a binary image to disk (-f bin) and it actually works!

nasm -f bin hello1.asm -o hello1
chmod +x hello1
./hello1
Hello, World!

Note that this can be done on any platform NASM runs on. I did it on Linux but assume it will work just as well on Windows.

Now, let’s take a good look at the code…

We need to tell NASM we are in 32-bit mode and that program code starts on the second VM page (0x1000 or 4096). The first page (PAGEZERO) is there to catch null pointer references.

;;; File: hello1.asm
;;; Build: nasm -f bin -o hello1 hello1.asm && chmod +x hello1

bits  32
org   0x1000

The header specifies that this is an x86-32 binary and a full-fledged executable file and that there are 10 load commands in the header.

mhdr:
   dd 0xFEEDFACE  ; magic
   dd 7           ; cputype
   dd 3           ; cpusubtype
   dd 2           ; filetype
   dd 10          ; ncmds
   dd sizeofcmds  ; sizeofcmds
   dd 0x85        ; flags
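
For readers who prefer names to magic numbers, here is a small C sketch of the same seven words. The constant values are the ones published in Apple’s mach-o/loader.h; the struct and function names are my own and exist only for this example, and the reading of 0x85 as three flag bits is an interpretation of the value used above.

```c
#include <stdint.h>

/* Seven 32-bit words, in the order they are emitted in the listing. */
struct mach_header_32 {
    uint32_t magic, cputype, cpusubtype, filetype;
    uint32_t ncmds, sizeofcmds, flags;
};

/* Values as defined in <mach-o/loader.h> and <mach/machine.h>. */
#define MH_MAGIC             0xFEEDFACEu
#define CPU_TYPE_X86         7u
#define CPU_SUBTYPE_I386_ALL 3u
#define MH_EXECUTE           2u
#define MH_NOUNDEFS          0x01u   /* no undefined references left */
#define MH_DYLDLINK          0x04u   /* input for the dynamic linker */
#define MH_TWOLEVEL          0x80u   /* two-level namespace bindings */

/* Hypothetical helper: fills in the header used in this article. */
struct mach_header_32 make_header(uint32_t ncmds, uint32_t sizeofcmds)
{
    struct mach_header_32 h;
    h.magic      = MH_MAGIC;
    h.cputype    = CPU_TYPE_X86;
    h.cpusubtype = CPU_SUBTYPE_I386_ALL;
    h.filetype   = MH_EXECUTE;
    h.ncmds      = ncmds;
    h.sizeofcmds = sizeofcmds;
    h.flags      = MH_NOUNDEFS | MH_DYLDLINK | MH_TWOLEVEL;   /* 0x85 */
    return h;
}
```

The header is 28 bytes long, so the first load command starts at file offset 28.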

PAGEZERO is where you end up when dereferencing a 0 pointer. This page is protected from reading and writing so any access to it causes a page fault and a memory access violation. This segment does not take any space in the file so its filesize is set to 0.

;;; Load command #0

pagezero:
   dd 1              ; LC_SEGMENT
   dd _pagezero      ; size
   db '__PAGEZERO'   ; segname
   times 6 db 0      ; padding to 16 chars
   dd 0              ; vmaddr
   dd 0x1000         ; vmsize
   dd 0              ; fileoff
   dd 0              ; filesize
   dd 0              ; maxprot
   dd 0              ; initprot
   dd 0              ; nsects
   dd 0              ; flags
_pagezero equ $-pagezero

The text segment is where our code lives. It’s readable and executable (initprot). The load commands that form part of the Mach-O header itself need to be loaded somewhere. Here, they are part of the text segment which is why the segment starts at the beginning of the file (fileoff 0).

;;; Load command #1

code:
   dd 1              ; LC_SEGMENT
   dd _code          ; size
   db '__TEXT'       ; segname
   times 10 db 0     ; padding to 16 chars
   dd 0x1000         ; vmaddr
   dd 0x1000         ; vmsize
   dd 0              ; fileoff
   dd 0x1000         ; filesize
   dd 7              ; maxprot
   dd 5              ; initprot
   dd 1              ; nsects
   dd 0              ; flags

sect1:               ; section 0
   db '__text'       ; sectname
   times 10 db 0     ; padding to 16 chars
   db '__TEXT'       ; segname
   times 10 db 0     ; padding to 16 chars
   dd start          ; addr
   dd codesize       ; size
   dd start-$$       ; offset
   dd 0              ; align on 2^0
   dd 0              ; reloff
   dd 0              ; nreloc
   dd 0x80000400     ; flags
   dd 0              ; reserved1
   dd 0              ; reserved2
_code equ $-code
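
The sizes that NASM computes with `equ $-label` can be checked against the record layouts. The sketch below restates the two records as C structs, with field order taken from the listings above; the struct and function names are assumptions made for this example.

```c
#include <stdint.h>

/* LC_SEGMENT command, fields in the order used in the listings. */
struct segment_command_32 {
    uint32_t cmd, cmdsize;
    char     segname[16];
    uint32_t vmaddr, vmsize, fileoff, filesize;
    uint32_t maxprot, initprot, nsects, flags;
};

/* One section header; nsects of these follow the segment command. */
struct section_32 {
    char     sectname[16];
    char     segname[16];
    uint32_t addr, size, offset, align, reloff, nreloc, flags;
    uint32_t reserved1, reserved2;
};

/* Hypothetical helper: the cmdsize for a segment with nsects sections. */
uint32_t segment_cmdsize(uint32_t nsects)
{
    return (uint32_t)(sizeof(struct segment_command_32)
                      + nsects * sizeof(struct section_32));
}
```

With no sections this gives 56, the cmdsize otool reports for __PAGEZERO, and each section header adds another 68 bytes.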

The data segment holds our “Hello world!” string.

;;; Load command #2

data:
   dd 1              ; LC_SEGMENT
   dd _data          ; size
   db '__DATA'       ; segname
   times 10 db 0     ; padding to 16 chars
   dd 0x2000         ; vmaddr
   dd 0x1000         ; vmsize
   dd 0x1000         ; fileoff
   dd 0x1000         ; filesize
   dd 7              ; maxprot
   dd 3              ; initprot
   dd 1              ; nsects
   dd 0              ; flags

sect2:               ; section 0
   db '__const'      ; sectname
   times 9 db 0      ; padding to 16 chars
   db '__DATA'       ; segname
   times 10 db 0     ; padding to 16 chars
   dd 0x2000         ; addr
   dd 15             ; size, our string
   dd 4096           ; offset
   dd 0              ; align on 2^0
   dd 0              ; reloff
   dd 0              ; nreloc
   dd 0              ; flags
   dd 0              ; reserved1
   dd 0              ; reserved2
_data equ $-data
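
The vmaddr and fileoff values in these segments follow one simple rule: the file is mapped starting at 0x1000, so a byte at file offset N lands at address 0x1000 + N. A minimal sketch of that arithmetic, with made-up helper names, assuming the page-per-segment layout used in this article:

```c
#include <stdint.h>

#define IMAGE_BASE      0x1000u   /* matches "org 0x1000"          */
#define IMAGE_PAGE_SIZE 0x1000u   /* one 4096-byte page per segment */

/* Address at which the byte at the given file offset is mapped. */
uint32_t vmaddr_for_fileoff(uint32_t fileoff)
{
    return IMAGE_BASE + fileoff;
}

/* Round a file offset up to the next page boundary. */
uint32_t page_align(uint32_t offset)
{
    return (offset + IMAGE_PAGE_SIZE - 1u) & ~(IMAGE_PAGE_SIZE - 1u);
}
```

The data segment, for instance, sits at file offset 0x1000 and therefore at address 0x2000.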

The IMPORT segment holds our jump table, the stubs for printf and exit. The dynamic linker will fill in the stubs for us with a jump to printf and exit in libc. This segment needs to be readable, writable and executable (initprot).

;;; Load command #3

stubs:
   dd 1              ; LC_SEGMENT
   dd _stubs         ; size
   db '__IMPORT'     ; segname
   times 8 db 0      ; padding to 16 chars
   dd 0x3000         ; vmaddr
   dd 0x1000         ; vmsize
   dd 0x2000         ; fileoff
   dd 0x1000         ; filesize
   dd 7              ; maxprot
   dd 7              ; initprot
   dd 1              ; nsects
   dd 0              ; flags

sect3:               ; section 0
   db '__jump_table' ; sectname
   times 4 db 0      ; padding to 16 chars
   db '__IMPORT'     ; segname
   times 8 db 0      ; padding to 16 chars
   dd 0x3000         ; addr
   dd 10             ; size, two stubs
   dd 0x2000         ; offset
   dd 6              ; align on 2^6
   dd 0              ; reloff
   dd 0              ; nreloc
   dd 0x04000008     ; flags
   dd 0              ; reserved1
   dd 5              ; reserved2, stub size
_stubs equ $-stubs
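
The maxprot and initprot numbers are bit masks: 1 for read, 2 for write and 4 for execute, which are the VM_PROT values from the Mach headers. A small decoder, whose function name is invented for this example, makes the values in the segments above easier to read:

```c
#include <stdint.h>

/* Bit values as defined in <mach/vm_prot.h>. */
#define VM_PROT_READ    0x1u
#define VM_PROT_WRITE   0x2u
#define VM_PROT_EXECUTE 0x4u

/* Hypothetical helper: writes an "rwx"-style string into out[4]. */
void prot_to_string(uint32_t prot, char out[4])
{
    out[0] = (prot & VM_PROT_READ)    ? 'r' : '-';
    out[1] = (prot & VM_PROT_WRITE)   ? 'w' : '-';
    out[2] = (prot & VM_PROT_EXECUTE) ? 'x' : '-';
    out[3] = '\0';
}
```

So the text segment’s initprot of 5 reads as r-x, the data segment’s 3 as rw- and the IMPORT segment’s 7 as rwx.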

The LINKEDIT segment holds the symbol table.

;;; Load command #4

linkage:
   dd 1              ; LC_SEGMENT
   dd _linkage       ; size
   db '__LINKEDIT'   ; link table
   times 6 db 0      ; padding
   dd 0x4000         ; vmaddr
   dd 0x1000         ; vmsize
   dd symbols-$$     ; fileoff
   dd _symbols       ; filesize
   dd 7              ; maxprot
   dd 1              ; initprot
   dd 0              ; nsects
   dd 0              ; flags
_linkage equ $-linkage

This load command describes our symbol table, including where the symbols and the strings naming them are located. I believe it’s mostly for the benefit of the debugger.

;;; Load command #5

symtab:
   dd 2              ; LC_SYMTAB
   dd _symtab        ; size
   dd symbols-$$     ; symoff
   dd 4              ; nsyms
   dd strings-$$     ; stroff
   dd _strings       ; strsize
_symtab equ $-symtab

This load command describes the dynamic symbol table. This is how the dynamic linker knows to plug the stubs (indirect).

;;; Load command #6

dysymtab:
   dd 0x0b           ; LC_DYSYMTAB
   dd _dysymtab      ; size
   dd 0              ; ilocalsym
   dd 1              ; nlocalsym
   dd 1              ; iextdefsym
   dd 1              ; nextdefsym, start is the only defined external symbol
   dd 2              ; iundefsym
   dd 2              ; nundefsym
   dd 0              ; tocoff
   dd 0              ; ntoc
   dd 0              ; modtaboff
   dd 0              ; nmodtab
   dd 0              ; extrefsymoff
   dd 0              ; nextrefsyms
   dd indirect-$$    ; indirectsymoff
   dd 2              ; nindirectsyms
   dd 0              ; extreloff
   dd 0              ; nextrel
   dd 0              ; locreloff
   dd 0              ; nlocrel
_dysymtab equ $-dysymtab

My guess is as good as yours here. I’m not ready to use a dynamic linker of my own but this is a distinct possibility! This load command clearly provides for it.

;;; Load command #7

dylinker:
   dd 0x0e           ; LC_LOAD_DYLINKER
   dd _dylinker      ; size
   dd 12             ; nameoff
   db '/usr/lib/dyld', 0
   align 4
_dylinker equ $-dylinker

This load command specifies the contents of the registers at startup. I haven’t seen anything other than EIP populated, though. The program will not run unless this load command is present!

;;; Load command #8

thrstate:
   dd 0x5            ; LC_UNIXTHREAD
   dd _thrstate      ; size
   dd 0x01           ; i386_THREAD_STATE
   dd 0x10           ; i386_THREAD_STATE_COUNT
   times 10 dd 0x00  ; cpu thread state
   dd start          ; eip
   times 05 dd 0x00  ;
_thrstate equ $-thrstate
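
Why ten zero words before eip and five after? The register order is the one otool prints for LC_UNIXTHREAD earlier in this article. Writing it down as a C struct (the struct and function names are mine, not Apple’s) lets the compiler do the counting:

```c
#include <stddef.h>
#include <stdint.h>

/* Registers in the order otool prints them for i386_THREAD_STATE. */
struct i386_thread_state_32 {
    uint32_t eax, ebx, ecx, edx, edi, esi, ebp, esp;
    uint32_t ss, eflags, eip, cs, ds, es, fs, gs;
};

/* The whole LC_UNIXTHREAD command as laid out in the listing. */
struct thread_command_32 {
    uint32_t cmd;       /* 0x5                              */
    uint32_t cmdsize;
    uint32_t flavor;    /* 0x01                             */
    uint32_t count;     /* 32-bit words of state that follow */
    struct i386_thread_state_32 state;
};

/* Hypothetical helper: index of eip, counted in 32-bit words. */
uint32_t eip_word_index(void)
{
    return (uint32_t)(offsetof(struct i386_thread_state_32, eip)
                      / sizeof(uint32_t));
}
```

eip is word 10 of 16, and the whole command comes to 80 bytes, which agrees with the cmdsize in the otool output.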

We can have as many LC_LOAD_DYLIB load commands as there are dynamic libraries we would like to use. I’m only using libc since that’s where printf and exit live. I could have created stubs for dlopen, dlclose, dlsym and dlerror and used them to load libc and pull out printf and exit. Why bother, though, when the dynamic linker can do it for us?

;;; Load command #9

dylib:
   dd 0x0c           ; LC_LOAD_DYLIB
   dd _dylib         ; size
   dd 0x18           ; nameoff
   dd 0x02           ; timestamp
   dd 0x006F0103     ; currentver
   dd 0x00010000     ; compatver
   db '/usr/lib/libSystem.B.dylib', 0
   align 4
_dylib equ $-dylib
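
Two details of this load command are easy to get wrong by hand: the packed version numbers and the padded command size. The sketch below, with helper names invented for the example, shows the arithmetic. It assumes versions are packed as major in the upper 16 bits followed by one byte each for minor and patch, which is consistent with otool printing 0x006F0103 as 111.1.3.

```c
#include <stdint.h>
#include <string.h>

/* Pack a dylib version as 16 bits major, 8 bits minor, 8 bits patch. */
uint32_t pack_version(uint32_t major, uint32_t minor, uint32_t patch)
{
    return (major << 16) | (minor << 8) | patch;
}

/* Size of a load command made of `fixed` bytes followed by a
   NUL-terminated path, padded to a multiple of 4 ("align 4"). */
uint32_t path_cmdsize(uint32_t fixed, const char *path)
{
    uint32_t raw = fixed + (uint32_t)strlen(path) + 1u;
    return (raw + 3u) & ~3u;
}
```

For the libSystem command the fixed part is 24 bytes (nameoff 0x18), giving 52 bytes in total, the same cmdsize otool shows. The dyld command has a 12-byte fixed part and comes to 28 bytes.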

It was a long road through the Mach-O header but we can finally relax and get some work done. There isn’t much to do apart from printing hello world and exiting but note the alignment of the stack on a 16-byte boundary, before each function call.

I’m taking the easy way out and aligning the stack one extra time, at the beginning of the program. This makes the rest of the alignment work much easier!

All values in the stack are 32-bit values. We are pushing a single argument which requires us to pad the stack with 12 more bytes (sub esp, 0x10). We pop arguments and padding right after the call to printf.

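GLOBAL start

start:

  and esp, 0xFFFFFFF0         ; align the stack on a 16-byte boundary
  sub esp, 0x10               ; one 4-byte argument plus 12 bytes of padding
  mov dword [esp], hello.msg
  call _printf
  add esp, 0x10               ; pop the argument and padding
  sub esp, 0x10               ; room for exit's argument, still aligned
  mov dword [esp], 0          ; exit status goes on the stack, not in eax
  call _exit
  hlt

codesize equ $-start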
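
The padding rule generalizes beyond one argument. As a rough sketch, assuming every argument is a 32-bit value and esp is already 16-byte aligned before the arguments are stored, the amount to subtract is the argument size rounded up to a multiple of 16. The function names are made up for this example.

```c
#include <stdint.h>

/* What "and esp, 0xFFFFFFF0" does to a stack pointer. */
uint32_t align_down_16(uint32_t esp)
{
    return esp & 0xFFFFFFF0u;
}

/* Bytes to subtract from an aligned esp to pass nargs 32-bit arguments. */
uint32_t call_frame_bytes(uint32_t nargs)
{
    return (nargs * 4u + 15u) & ~15u;
}

/* How many of those bytes are padding rather than arguments. */
uint32_t padding_bytes(uint32_t nargs)
{
    return call_frame_bytes(nargs) - nargs * 4u;
}
```

One argument needs a 16-byte frame with 12 bytes of padding, as in the code above; five arguments would need 32 bytes.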

Data and stubs are easy. Note the alignment to a page boundary. A jump to a 32-bit address takes 5 bytes, thus 5 halt instructions are used for each stub.

;;; Data

align 4096

hello.msg db 'Hello, World!', 0x0a, 0x00

;;; Stubs

align 4096

_printf:
  times 5 hlt

_exit:
  times 5 hlt
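
To see why five bytes are enough, here is how a near jump is encoded on x86: the opcode byte 0xE9 followed by a 32-bit displacement measured from the end of the instruction. I have not traced what dyld actually writes into the stubs, so treat this as a sketch of the standard encoding rather than a description of dyld; the function name is invented.

```c
#include <stdint.h>

/* Fill stub[5] with "jmp rel32" from stub_addr to target. */
void encode_jmp_rel32(uint32_t stub_addr, uint32_t target, uint8_t stub[5])
{
    /* The displacement is relative to the next instruction. */
    uint32_t rel = target - (stub_addr + 5u);

    stub[0] = 0xE9;                     /* jmp rel32 opcode       */
    stub[1] = (uint8_t)(rel);           /* little-endian rel32    */
    stub[2] = (uint8_t)(rel >> 8);
    stub[3] = (uint8_t)(rel >> 16);
    stub[4] = (uint8_t)(rel >> 24);
}
```

A jump from the second stub at 0x3005 back to 0x3000, for example, encodes as E9 F6 FF FF FF, a displacement of -10.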

The symbol table has a well-defined format and each symbol needs to be described in excruciating detail!

;;; Linkage

align 4096

symbols:           ; symbol table

; hello.msg

dd str01off    ; nstrx
db 0x0e        ; type
db 0x02        ; sect
dw 0x00        ; desc
dd hello.msg   ; value

; start

dd str02off    ; nstrx
db 0x0f        ; type
db 0x01        ; sect
dw 0x00        ; desc
dd start       ; value

; _printf

dd str03off    ; nstrx
db 0x01        ; type N_EXT
db 0x00        ; sect
dw 0x0101      ; desc
dd _printf     ; value

; _exit

dd str04off    ; nstrx
db 0x01        ; type N_EXT
db 0           ; sect
dw 0x0101      ; desc
dd _exit       ; value
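
The type and desc bytes are packed bit fields. The sketch below decodes the values used in the four entries above with the masks from Apple’s mach-o/nlist.h. The struct mirrors the dd/db/db/dw/dd layout of the listing, desc is treated as unsigned for simplicity, and the helper names are my own.

```c
#include <stdint.h>

/* One symbol table entry, 12 bytes, matching dd/db/db/dw/dd above. */
struct nlist_32 {
    uint32_t n_strx;    /* offset of the name in the string table */
    uint8_t  n_type;
    uint8_t  n_sect;    /* 1-based section number, 0 for none     */
    uint16_t n_desc;
    uint32_t n_value;
};

/* Masks and values as defined in <mach-o/nlist.h>. */
#define N_EXT  0x01u    /* external symbol               */
#define N_TYPE 0x0eu    /* mask for the type bits        */
#define N_UNDF 0x00u    /* undefined, resolved by dyld   */
#define N_SECT 0x0eu    /* defined in section n_sect     */

int is_external(uint8_t n_type)  { return (n_type & N_EXT) != 0; }
int is_undefined(uint8_t n_type) { return (n_type & N_TYPE) == N_UNDF; }
int is_in_section(uint8_t n_type) { return (n_type & N_TYPE) == N_SECT; }

/* With two-level namespaces the high byte of n_desc is the 1-based
   index of the LC_LOAD_DYLIB command that provides the symbol. */
unsigned library_ordinal(uint16_t n_desc)
{
    return (n_desc >> 8) & 0xffu;
}
```

Read this way, 0x0e is a local symbol defined in a section, 0x0f is the same but external, and 0x01 is an undefined external. A desc of 0x0101 points at the first dylib, libSystem, and its low byte marks the reference as lazily bound.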

The indirect symbol table tells the dynamic linker that elements 2 and 3 of the symbol table need to be looked up and their stubs plugged.

indirect:         ; indirect symbol table

   dd 0x02        ; _printf
   dd 0x03        ; _exit

The string table names the symbols above.

strings:          ; string table

      db 0x20, 0x00

str01 db 'hello.msg', 0x00
str02 db 'start', 0x00
str03 db '_printf', 0x00
str04 db '_exit', 0x00

str01off equ str01 - strings
str02off equ str02 - strings
str03off equ str03 - strings
str04off equ str04 - strings

_strings equ $-strings
_symbols equ $-symbols
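
NASM works out the string offsets for us, but the same bookkeeping is easy to reproduce if you generate the table from another tool. A minimal sketch in C, with a hypothetical function name, assuming names are simply appended one after another with a terminating NUL:

```c
#include <stddef.h>
#include <string.h>

/* Append name (with its NUL) to the table and return the offset it was
   stored at, or (size_t)-1 if the table has no room left. */
size_t strtab_add(char *table, size_t *used, size_t capacity, const char *name)
{
    size_t len = strlen(name) + 1;
    size_t off = *used;

    if (len > capacity - off)
        return (size_t)-1;
    memcpy(table + off, name, len);
    *used = off + len;
    return off;
}
```

Starting with the two bytes 0x20, 0x00 already in place, adding hello.msg, start, _printf and _exit in that order yields the offsets 2, 12, 18 and 26, for a table of 32 bytes.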

I don’t expect you to generate Mac binaries by hand on Linux or Windows but I hope this tutorial will be of help if you ever decide to try!


Mnesia Unlimited

Mnesia is the Erlang embedded distributed DBMS that supports high scalability and fault tolerance through replication. Mnesia has been used to great success in all kinds of applications but it's not without limitations. These limitations mainly stem from the underlying ETS and DETS mechanisms used to implement Mnesia tables.

There have been attempts to hack around Mnesia limitations before, e.g. I wrote a Mnesia backend for Amazon S3 just last summer. All these attempts to add external table functionality to Mnesia suffered from being hard to extend, or were closed source implementations.

Thanks to the Dukes of Erl, I finally got a chance to do a proper Mnesia extension, the one to rule them all! The result is a series of about 50 careful git patches that support all the features of regular Mnesia tables such as indexing, replication, distribution, fragmentation and backup, as well as set or bag semantics. Best of all, this functionality is available to anyone using OTP release R11B5. 

The external table API is supplied as an Erlang behavior so all you need to do is supply a callback module that conforms to it. 

You can now use Mnesia as a front-end to disk-based hash tables like SleepyCat/BerkeleyDB or Tokyo Cabinet, a memory-mapped table or MySQL.

You are no longer constrained by the scalability or size limitations of ETS and DETS so go wild and let me know what you come up with. Oh, and please petition the Erlang/OTP team to include this work in the official Erlang distribution!

Mnesia 4.3.5 with the extension mechanism is available from Google Code and so is the Tokyo Cabinet external table.


Hacking the Mac OSX Unified Buffer Cache

Files read and written get cached in the Unified Buffer Cache (UBC) on Mac OSX.

The UBC was hindering me because I was processing a huge file in chunks but throwing out each chunk, never to be reused again, after writing out the processed chunk. I would see gigabytes of memory get eaten away by the UBC until the system started swapping and became unresponsive.

UBC cannot be limited to a given maximum amount of memory.

UBC cannot be inspected programmatically.

UBC can be cleared by running ‘purge’, which allocates a lot of memory to force the cache to clear. The following bit of code can be used to turn caching off for a particular file:

fcntl(fd, F_GLOBAL_NOCACHE, 1);

This can be done in any process and the file can be closed afterwards. The setting persists throughout the lifetime of the file. If the file is removed and re-created then the setting is lost.
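
Since the setting sticks to the file, a tiny helper that opens the file, flips the flag and closes it again is all that is needed. This is a sketch rather than tested production code: the function name and return codes are my own, and the `#ifdef` is only there so the file still compiles on systems that do not define F_GLOBAL_NOCACHE.

```c
#include <fcntl.h>
#include <unistd.h>

/* Turn off buffer caching for the file at path.
   Returns 0 on success, -1 if the file cannot be opened or the fcntl
   call fails, -2 when built without F_GLOBAL_NOCACHE. */
int set_global_nocache(const char *path)
{
    int rc;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
#ifdef F_GLOBAL_NOCACHE
    rc = (fcntl(fd, F_GLOBAL_NOCACHE, 1) == -1) ? -1 : 0;
#else
    rc = -2;                 /* not a Mac OSX build */
#endif
    close(fd);               /* the setting outlives the descriptor */
    return rc;
}
```

Call it once, for example as set_global_nocache("/db1/cdr.csv"), before the process that reads the file gets going.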

How can you tell if the setting took hold and the file is indeed NOT being cached?

dd if=/db1/cdr.csv bs=1m count=1024 of=/dev/null
1073741824 bytes transferred in 12.030688 secs (89250242 bytes/sec)

dd if=/db1/cdr.csv bs=1m count=1024 of=/dev/null
1073741824 bytes transferred in 11.867947 secs (90474101 bytes/sec)

dd if=/db1/cdr.csv bs=1m count=1024 of=/dev/null
1073741824 bytes transferred in 12.037562 secs (89199278 bytes/sec)

I read the file three times and the speed is about the same each time, so nothing is being served from the cache.

What about a regular file that’s cached by default?

dd if=/db1/cdr1.csv bs=1m count=1024 of=/dev/null
1073741824 bytes transferred in 11.505857 secs (93321325 bytes/sec)

dd if=/db1/cdr1.csv bs=1m count=1024 of=/dev/null
1073741824 bytes transferred in 0.500468 secs (2145475416 bytes/sec)

Notice that reading from the cache is much faster the second time around.

Kudos to Dominic Giampaolo from Apple for explaining all this to me!


Writing a Mac OSX USB device driver that implements SCSI pass-through

I’ve been on a coding tear since the beginning of this year, when I decided to dump Erlang and focus on all things low-level. I’ve been much happier since, although not much richer. Do you need a Mac OSX device driver written? Talk to me!

In this post I will explain how I wrote a Mac OSX USB device driver for the IntellaSys 24-core CPU on a thumbstick, also known as FORTHdrive. I will skip the parts that are reasonably clear from Apple documentation and focus on the bits I had trouble with. I will also leave two-machine driver and kernel debugging over FireWire for another post.

It will be helpful for you to first read about IOKit fundamentals, as well as Mass storage device driver programming and the SCSI device architecture model device interface. SCSI in particular is how I started down this slippery slope.

Introduction

The USB flash drive format is popular with hardware vendors. It’s possible to buy security tokens on a thumb stick and even 24-core Forth processors. The stick will most likely have a small disk partition that will house the vendor development kit or tools. It will look like a regular flash drive to the operating system (OS) and the OS will use SCSI over USB to access the data.

The manufacturer will implement vendor-specific SCSI commands to give you access to the core functionality of the device, such as the encryption API of a security token or storing and fetching data from a custom CPU. The OS will let you send custom SCSI commands to a SCSI device. This is called SCSI pass-through. You can use SGIO for SCSI pass-through on Linux. This boils down to a series of ioctl calls from your application and all is well… except on Mac OSX.

In its infinite wisdom, Apple decided to disable SCSI pass-through lest you send a format command to an attached device or do something equally evil. Apple really really wants you to go through official and established channels to talk to devices under Mac OSX, particularly SCSI devices. Apple did not and cannot establish channels for every custom device out there, which means that the hard work to implement SCSI pass-through on Mac OSX falls squarely on your shoulders.

Writing a Mac OSX device driver is not particularly hard. It took me all of about a week to get my driver ready and working. There’s definitely a dearth of information on writing Mac OSX device drivers and existing examples are too simple to be of much use.

I hunted far and wide (and way back in time!) through various Apple driver development lists to collect the information I needed and I’m summarizing it for you here, as well as providing full working source code to my driver.

Did I mention that Mac OSX drivers are written in C++? Not C, not Objective-C but C++! The original IOKit used to be called DriverKit and was written in Objective-C. Apple, apparently, felt C++ would be easier on third-party driver writers. Say what you want but C++ does simplify reuse. You don’t need to re-implement the full driver, you can subclass and change or add tiny bits and pieces.

Fundamentals

Your application lives in user land whereas the driver lives in kernel land. The two cannot talk to one another, except through a Mach port. Normally, your application would first locate the driver in the I/O registry. The SimpleUserClient and VendorSpecificType00 examples that Apple provides for developers show you how this is done.

Once you get a handle to your driver (service), you can open a connection to it like this:

io_connect_t connect;
kern_return_t kernResult = IOServiceOpen(service, mach_task_self(), 0, &connect);

This gets you a handle that you can use to access your driver in kernel land.

User client

Once you get through the Mach port, you land in something called the user client. The user client mechanism is designed to allow calls from a user process to be dispatched to any IOService-based object in the kernel. Your driver would normally be a subclass of IOService but you would not access it directly. You would create a series of “adapter” functions that verify and perhaps massage the data and then pass it to your driver.

You can invoke user client functions that are set up via the external method dispatch table. This is a series of structures that describe each method of your user client, including the function pointer, number of integer arguments that the method takes in, number of integer values it returns and the same for structures. The table will look like this:

const IOExternalMethodDispatch UserClientClassName::Methods[kNumberOfMethods] =
{
  { // kS24ClientOpen
    (IOExternalMethodAction) &UserClientClassName::sOpenUserClient,
    0,                            // number of scalar (integer) inputs
    0,                            // size of the input structure
    0,                            // number of scalar (integer) outputs
    0                             // size of the output structure
  },
  ...
};

The SimpleUserClient example shows you how to set up and use various external method configurations.

Your user land method invocation will end up in externalMethod below. This is where you will look up your method using the selector to index your method table.

IOReturn UserClientClassName::externalMethod(uint32_t selector, IOExternalMethodArguments* arguments,
    IOExternalMethodDispatch* dispatch, OSObject* target, void* reference)
{
    IOLog("%s[%p]::%s(%d, %p, %p, %p, %p)\n", getName(), this, __FUNCTION__,
        selector, arguments, dispatch, target, reference);

    if (selector < (uint32_t) kNumberOfMethods)
    {
        dispatch = (IOExternalMethodDispatch *) &Methods[selector];

        if (!target)
            target = this;
    }

    // Hand the chosen entry to the superclass, which checks the argument
    // counts and calls the method (the pattern used in SimpleUserClient).
    return super::externalMethod(selector, arguments, dispatch, target, reference);
}
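
The selector lookup is easier to see without the IOKit machinery around it. The following plain C sketch is an analogy, not kernel code: a table of function pointers indexed by selector, with a bounds check and an argument count check before the call. The method names, the two sample methods and the error codes are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

typedef int (*method_fn)(void *target, const uint64_t *args, uint32_t nargs);

struct method_entry {
    method_fn function;
    uint32_t  expected_args;   /* scalar inputs the method takes */
};

static int open_client(void *target, const uint64_t *args, uint32_t nargs)
{
    (void)target; (void)args; (void)nargs;
    return 0;
}

static int add_two(void *target, const uint64_t *args, uint32_t nargs)
{
    (void)nargs;
    *(uint64_t *)target = args[0] + args[1];   /* target receives the sum */
    return 0;
}

enum { kOpen, kAddTwo, kNumberOfMethods };

static const struct method_entry methods[kNumberOfMethods] = {
    [kOpen]   = { open_client, 0 },
    [kAddTwo] = { add_two,     2 },
};

/* Returns -1 for an unknown selector, -2 for a wrong argument count,
   otherwise whatever the selected method returns. */
int dispatch(uint32_t selector, void *target,
             const uint64_t *args, uint32_t nargs)
{
    const struct method_entry *m;

    if (selector >= (uint32_t)kNumberOfMethods)
        return -1;
    m = &methods[selector];
    if (nargs != m->expected_args)
        return -2;
    return m->function(target, args, nargs);
}
```

In the driver, the superclass performs the argument checks for you based on the counts in the dispatch table, which is why getting those four numbers right matters.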

A lot of methods in the user client are boilerplate but you do not want to miss initWithTask! This is the method where you should take owningTask and save it. This is the Mach task of your user land application and you will need it to map memory buffers from user space to kernel space. owningTask here will correspond to mach_task_self() in the call to IOServiceOpen above.

bool UserClientClassName::initWithTask(task_t owningTask, void* securityToken, UInt32 type)
{
    bool success = super::initWithTask(owningTask, securityToken, type);

    // This IOLog must follow super::initWithTask because getName relies on the superclass initialization.
    IOLog("%s[%p]::%s(%p, %p, %ld)\n", getName(), this, __FUNCTION__, owningTask, securityToken, type);

    fTask = owningTask;
    fProvider = NULL;

    return success;
}

Methods in your dispatch table will be static and you will need a way to map those to methods of your user client class. Fortunately, every method has a target argument just for this purpose.

IOReturn UserClientClassName::sInit(UserClientClassName* target, void* reference, IOExternalMethodArguments* arguments)
{
    return target->S24IO(NULL, 0, 0, 0, kIODirectionNone);
}

The code above does not use any of the external arguments, but this method does:

IOReturn UserClientClassName::sRead(UserClientClassName* target, void* reference, IOExternalMethodArguments* arguments)
{
    return target->S24IO(
        arguments->scalarInput[0],
        arguments->scalarInput[1],
        arguments->scalarInput[2],
        0,
        kIODirectionIn
        );
}

Simply pull your values from the external method arguments and pass them to a method in your user client class, e.g. S24IO. fProvider is our driver handle. We set it up in the start method, which is invoked as a result of calling IOServiceOpen in our user land application.
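
The static-to-instance hop can be sketched on its own in plain C++. MockUserClient and its members are hypothetical and stand in for a real user client class. The only point is that a static function with a table-friendly signature recovers the object from its target argument and forwards to an ordinary member function.

```cpp
#include <cstdint>

// Hypothetical stand-in for a user client class; not I/O Kit code.
class MockUserClient {
public:
    // Ordinary member function that does the real work.
    int IO(uint64_t address, uint64_t size) {
        lastAddress = address;
        lastSize = size;
        ++calls;
        return 0;
    }

    // Static trampoline: it has no 'this', so it can live in a table of
    // plain function pointers. The object arrives through 'target'.
    static int sRead(void* target, const uint64_t* scalars) {
        MockUserClient* self = static_cast<MockUserClient*>(target);
        return self->IO(scalars[0], scalars[1]);
    }

    uint64_t lastAddress = 0;
    uint64_t lastSize = 0;
    int calls = 0;
};
```

A caller that only holds a function pointer and an opaque target, which is roughly the position the dispatch code is in, can still reach the right object this way.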

Talking to the user client

To talk to the driver’s user client from your application you will invoke methods like IOConnectCallScalarMethod and friends. The SimpleUserClient example shows how this is done.

Passing buffers into the kernel

Apple has guidelines for how to allocate and share memory with user space from an I/O Kit driver, but what do you do if you need to pass a buffer from user space into the kernel? Simple! The kernel works with I/O memory descriptors and we need to create one for our user space buffer, like so:

IOMemoryDescriptor *iomd = IOMemoryDescriptor::withAddress(
    (vm_address_t)buffer,
    size,
    direction,
    fTask
);

See IOMemoryDescriptor documentation for more details.

Note fTask and direction above. You must tell the kernel which task this memory pointer belongs to so that the kernel can properly translate this address into physical memory. You also must tell the kernel whether you are going to be reading from this memory buffer or writing to it. This is what direction is for.

This is by no means exhaustive documentation for IOMemoryDescriptor. Please read about IOBufferMemoryDescriptor and feel free to poke around further.

We are still in the driver adapter and glue code here but we are getting close to the driver itself.

Driver

The salient points here are the InitializeDeviceSupport method and the way to send SCSI commands to the device.

Use InitializeDeviceSupport if you need to send SCSI commands to your device during driver initialization. Do not use the probe method for this since the command gate (don’t ask!) will not be allocated yet and you will panic the kernel.

Here I’m initializing my device by sending it the vendor-specific initialization command in S24Init().

bool com_wagerlabs_driver_SEAforth24::InitializeDeviceSupport(void)
{
    bool result = false;

    result = super::InitializeDeviceSupport();

    if ( result == true )
        result = (S24Init() == kIOReturnSuccess);

    return result;
}

The S24SyncIO method is the heart and soul of my driver. Your driver will look different but things are easy and downhill from this point on since you have everything you need to send any kind of SCSI command to your device. You just need to go through a few more steps before you are done.

1) You get hold of a SCSI task.

req = GetSCSITask();

require(req != NULL, ErrorExit);

2) You populate the SCSI Command Descriptor Block (CDB) according to your vendor’s instructions.

switch (kind)
{
    case kS24Write:
        direction = kSCSIDataTransfer_FromInitiatorToTarget;
        b1 = 0xFB;
        b2 = 0x00;
        break;
    case kS24WriteLast:
        direction = kSCSIDataTransfer_FromInitiatorToTarget;
        b1 = 0xFB;
        b2 = 0x02;
        break;
    case kS24Read:
        direction = kSCSIDataTransfer_FromTargetToInitiator;
        b1 = 0xFB;
        b2 = 0x01;
        break;
    default: // kS24Init
        direction = kSCSIDataTransfer_NoDataTransfer;
        b1 = 0xFA;
        b2 = 0x00;
}

SetCommandDescriptorBlock(req, 0x20, b1, b2, 0x00, 0x00, 0x00, 0x00, hi, lo, 0x00);
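
If you want to check the byte layout outside the kernel, the same mapping can be written as a pure function. This is a sketch under assumptions: BuildS24Cdb is a hypothetical helper, the enum values are made up for the example, and it assumes that hi and lo are the high and low bytes of a 16-bit transfer length, which the listing above does not spell out.

```cpp
#include <array>
#include <cstdint>

// Kind names follow the listing above; the enum itself is hypothetical.
enum S24Kind { kS24Init, kS24Write, kS24WriteLast, kS24Read };

// Build the 10-byte CDB for a given command kind.
// ASSUMPTION: bytes 7 and 8 carry a 16-bit length, high byte first.
std::array<uint8_t, 10> BuildS24Cdb(S24Kind kind, uint16_t length) {
    uint8_t b1 = 0xFA;  // defaults match the kS24Init case
    uint8_t b2 = 0x00;

    switch (kind) {
        case kS24Write:     b1 = 0xFB; b2 = 0x00; break;
        case kS24WriteLast: b1 = 0xFB; b2 = 0x02; break;
        case kS24Read:      b1 = 0xFB; b2 = 0x01; break;
        case kS24Init:      break;
    }

    const uint8_t hi = static_cast<uint8_t>(length >> 8);
    const uint8_t lo = static_cast<uint8_t>(length & 0xFF);

    return {{0x20, b1, b2, 0x00, 0x00, 0x00, 0x00, hi, lo, 0x00}};
}
```

Keeping the byte layout in one small function like this makes it easy to unit test on any machine before the bytes ever reach SetCommandDescriptorBlock.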

3) You set a timeout for completion of your SCSI request and the data transfer direction.

SetTimeoutDuration(req, 10000);
SetDataTransferDirection(req, direction);

4) Mac OSX uses virtual memory, which means that at the time of your SCSI command your buffer may be paged out to disk and not in physical memory. It’s crucial that you tell Mac OSX to prepare your memory buffer, which maps it back into memory and does any housekeeping needed for your driver to access it.

Other than that, don’t forget to tell the SCSI task to use your buffer and tell it how many bytes you are looking to transfer. It’s not necessary to set the direction of the transfer (from driver to device or vice versa) if this has already been set in the I/O memory descriptor (which is what we did).

if (buffer != NULL)
{
    buffer->prepare();
    SetDataBuffer(req, buffer);
    SetRequestedDataTransferCount(req, buffer->getLength());
}

5) Finally, send the command to the device and tell Mac OSX that you are done using your memory buffer for direct memory access (DMA) by invoking the complete method of the I/O memory descriptor. You will also want to check the status of your SCSI request and the number of bytes transferred. If your request was unsuccessful, you may also want to use the SCSI Request Sense command (http://en.wikipedia.org/wiki/SCSI_Request_Sense_Command).

serviceResponse = SendCommand(req, 10000);

if (buffer != NULL)
{
    buffer->complete();
}
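
Every successful prepare needs a matching complete, including on early error exits. One way to make that pairing hard to forget is a small scope guard. The sketch below is not part of the driver described here: PreparedBuffer is a hypothetical helper, MockDescriptor is a test double, and the guard only assumes a type with prepare() and complete() methods where prepare() returns 0 on success.

```cpp
// Hypothetical scope guard: calls prepare() on construction and complete()
// on destruction, but only if prepare() succeeded (returned 0).
template <typename Descriptor>
class PreparedBuffer {
public:
    explicit PreparedBuffer(Descriptor* d)
        : desc_(d), prepared_(d != nullptr && d->prepare() == 0) {}

    ~PreparedBuffer() {
        if (prepared_)
            desc_->complete();
    }

    PreparedBuffer(const PreparedBuffer&) = delete;
    PreparedBuffer& operator=(const PreparedBuffer&) = delete;

    // True only when there is a buffer and it was prepared successfully.
    bool ok() const { return prepared_; }

private:
    Descriptor* desc_;
    bool prepared_;
};

// Test double standing in for a memory descriptor; counts the calls.
struct MockDescriptor {
    int prepareCalls = 0;
    int completeCalls = 0;
    int prepareResult = 0;  // 0 means success

    int prepare() { ++prepareCalls; return prepareResult; }
    int complete() { ++completeCalls; return 0; }
};
```

Note that ok() is also false when no buffer was supplied, so a command without a data phase (such as the init command) would need to treat a NULL buffer separately.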

That’s it folks! Let me know if I have omitted something crucial and I’ll try to expand this post as time allows.
