Tenerife Skunkworks

Boldly going where few have gone before

Logo contest for AlgoKit

I’m running a logo contest for AlgoKit, please let me know if any of the logos strike a chord with you. Feel free to comment here or in the Facebook discussion thread if we are friends.

The contest prize is just $500 (guaranteed and I may up that) and I started the contest of Friday. I received over 70 entries already and need your help to reduce them down to a manageable handful. You can read my feeble attempt at a design brief but here is the essence…

The market is retail traders and small trading shops. AlgoKit is my new name for the web-based translator between trading languages.

It’s just EasyLanguage to NinjaTrader C# for now and it’s not on the web yet. The translator itself is written in OCaml, the backend is written in OCaml using Ocsigen. Few example translations are available but I’m working on it.

Going forward, my plan is to target more C#-based platforms and translate from more trading languages.

The single outstanding feature then is making it possible for users to switch trading platforms at a single push of a button.

Once that goes well, I will let traders run all these strategies in my “cloud”. Then you won’t need a PC for your trading, will be able to co-locate strategies at the exchange datacenter, etc.

Loading mentions Retweet

Filed under  //   algokit   design  
Posted August 2, 2010
// 4 Comments

Grand Central Dispatch, C++ and inter-thread communication

Now that Grand Central Dispatch is part of iPhone OS 4.0, I would like to showcase its unique benefits for inter-thread communication.

Passing data from one thread to another is normally a royal pain on the Mac, something involving the use of Mach ports, something I’d rather not talk about. Ever! Assuming that your threads use Core Foundation or Cocoa, your threads have a run loop and you can shuffle bits of data back and forth with no trouble whatsoever. You can even do it in C++.

We will look at running a code block in another thread and waiting for it to finish (synchronous execution), as well as running a block in another thread and having it run a callback that we supply (asynchronous execution).

Let’s grab a few required header files and get started!

1 #include 
2 #include 
3 #include 
4 
5 #include 
6 #include

This is the chunk of data we will be throwing around. Feel free to expand it as you see fit.

1 class Data
2 {  
3 public:  
4   Data() {}
5 };

We want to be good memory managers and collect more than a few bits of data.

1 typedef std::tr1::shared_ptr DataRef;
2 typedef std::vector BitsOfData;

A block type is different from a function pointer in that it uses a ‘hat’ () instead of a ‘star’ (*). We want to know how our operation went so we return an integer error code.

1 // Custom block
2 typedef int (^OurBlock)(void);

We need a callback type for asynchronous execution. Once or block runs on some other thread, it will invoke our callback and pass the error code from the operation (block) that we just executed. Note that our callback is also a block.

1 // Generic callback     
2 typedef void (^OurCallback)(int status);

We are using C++ so let’s make another callback, one that takes both the status from the block we just executed and the bits of data we collected in the process.

1 // Data bits callback    
2 typedef void (^DataBitsCallback)(int status, BitsOfData& bits);

We will assume that there’s a single thread that we want to talk to, running somewhere, all lonely. There is nothing preventing you from passing the run loop as an argument, though. You do need to have a reference to the run loop, though, no way around it!

1 // Run loop for the thread we want to talk to    
2 extern CFRunLoopRef __TheOtherRunLoop;

Synchronous execution of a block uses a semaphore for synchronization. We want to tell the other thread to run the block for us and we want to sit quietly until we are told that the block has finished executing.

 1 // Run synchronously
 2 
 3 int RunSync(OurBlock block)
 4 {
 5   // use an error code to capture the result of the operation
 6 
 7   __block int status = 0;  
 8 
 9   // and a semaphore to wait until the operation has completed
10 
11   dispatch_semaphore_t sema = dispatch_semaphore_create(0);
12 
13   // enqueue the block on the target run loop
14 
15   CFRunLoopPerformBlock(__TheOtherRunLoop, kCFRunLoopDefaultMode, ^{
16 
17     // run the block we created previously
18 
19     status = block();
20 
21     // and signal that we are done
22 
23     dispatch_semaphore_signal(sema);
24 
25   });
26 
27   // wake up the target run loop 
28 
29   CFRunLoopWakeUp(__TheOtherRunLoop);
30 
31   // hold on until we are done above
32 
33   dispatch_semaphore_wait(sema, DISPATCH_TIME_FOREVER);
34 
35   // dispose of the semaphore 
36 
37   dispatch_release(sema);
38 
39   // and return the error code
40 
41   return status;
42 }

Note that we both wait forever on the semaphore and give the run loop of the target thread a gentle bump to remind it that there’s work to do.

Running a block asynchronously is conceptually simple: we tell the other thread to run the block, the other thread tells us to run the callback once it’s done. There are no semaphores and no waiting.

 1 // Run asynchronously
 2 
 3 void RunAsync(OurBlock block, OurCallback callback)
 4 {
 5   // callback will run in the caller thread 
 6   // and use the caller's run loop 
 7 
 8   CFRunLoopRef callbackRunLoop = CFRunLoopGetCurrent();
 9 
10   // enqueue a block on the target run loop 
11 
12   CFRunLoopPerformBlock(__TheOtherRunLoop, kCFRunLoopDefaultMode, ^{
13 
14     // do the work
15 
16     int status = block();
17 
18     // enqueue a block on the caller's run loop 
19 
20     CFRunLoopPerformBlock(callbackRunLoop, kCFRunLoopDefaultMode, ^{
21 
22       // and make the block invoke the callback and pass the status code
23 
24       callback(status);
25 
26     });    
27 
28     // wake up the caller's run loop 
29 
30     CFRunLoopWakeUp(callbackRunLoop);
31   });
32 
33   // wake up the target run loop
34 
35   CFRunLoopWakeUp(__TheOtherRunLoop);
36 }

This is what a sample operation (block) look like. Note that we need to allocate the block on the heap rather than the stack (default). This is where Block_copy comes in. We want nothing more than to collect a few bits of data into a vector, making sure no exceptions excape the block. We will not need to change the block code for synchronous and asynchronous execution. Hurray for code reuse!

 1 OurBlock MakeCollectBits(BitsOfData& bits)
 2 {
 3   // allocate block on the heap
 4 
 5   return Block_copy(^{
 6 
 7     int status = 0;
 8 
 9     cout << "We right here!" << endl;
10 
11     // no exceptions can excape the block!
12 
13     try
14     {
15       DataRef bit1(new Data);
16       bits.push_back(ref1);
17     } 
18     catch (...)
19     {
20       cout << "Unknown exception" << endl;
21       status = -1;
22     }
23 
24     return status;
25   });
26 }

Here is what synchronous collection of bits looks like. Note that we need to use deallocate a block and release its resources once we no longer need it. This is where Block_release comes in.

 1 // Synchronous gathering of data bits
 2 
 3 int CollectBits(BitsOfData& bits)
 4 {
 5   // create the block and "capture" bits 
 6 
 7   OurBlock block = MakeCollectBits(bits);  
 8 
 9   // run the block synchronously
10 
11   int status = RunSync(block);
12 
13   // release memory allocated to the block
14 
15   Block_release(block);
16 
17   // we are done
18 
19   return status;
20 }

Running asynchronously requires us to run a given callback after we are done. We also must release the block we were asked to run. We work around this by creating a callback on the fly that first releases the block and then runs the callback we were given.

 1 // Asynchronous gathering of data bits
 2 
 3 void CollectBitsAsync(BitsOfData& bits, DataBitsCallback callback)
 4 {
 5   // create a closure as before
 6 
 7   OurBlock block = MakeCollectBits(bits);                               
 8 
 9   // run asynchronously and trigger a callback created on the fly
10 
11   RunAsync(block, ^(int status) {    
12 
13     // we are inside "OurCallback" here and need to clean up 
14     // the bit collection block first and foremost
15 
16     Block_release(block);
17 
18     // invoke the data bits callback, 
19     // giving it what we have collected
20 
21     callback(status, bits);
22   });
23 }

Please feel free to ask any questions!

Loading mentions Retweet

Filed under  //   c++   gcd   libdispatch   mac  
Posted April 21, 2010
// 0 Comments

Tracking IO patterns in memory-mapped dynamic libraries

The Mac OSX dynamic linker uses mmap to load dynamic libraries into memory. The memory range occupied by the libraries is then backed by a set of virtual memory pages, chunks of 4096 bytes each.

Using mmap is efficient because pages are lazily loaded from disk as needed and the virtual memory pager is free to evict them behind the scenes when memory is needed for something else.

Each page is filled with code for functions that live in the dynamic library and pages are fetched from disk when a call is made to a function in the dynamic library.

Pages should be accessed sequentially for best performance but how do you find out if this is the case? The only way to find out is to plug into the virtual memory manager and track when pages backing a particular dynamic library are paged-in from disk.

This is doable but a tad complicated as it requires access to internal kernel structures. As always, DTrace comes to the rescue! Note that the following DTrace script will only work on Snow Leopard. Let’s take it apart and see how it works…

First we define a an alias for an internal kernel type. This is not necessary but saves on typing.

1 typedef struct nameidata* nameidata_t;

A dynamic library is usually opened with a call to dlopen. The first argument (arg0) is the library path.

1 pid$target::dlopen:entry
2 /arg0/
3 {
4  self->ppath = arg0;
5  self->dylib = 1;
6 }

The Mac OSX dynamic linker (dyld) does not use dlopen on dynamic libraries, though, so we need to do different.

1 pid$target::dyld??loadPhase5*:entry,
2 pid$target::dyld??loadPhase4*:entry
3 /!self->dylib && arg0/
4 {
5  self->path = copyinstr(arg0);
6  self->func = probefunc;
7  self->dylib = 1;
8 }

We want to trigger subsequent probes only if we entered via one of the two entry points above. We need to reset the flag once we return and this is what we do below.

 1 pid$target::dlopen:return
 2 {
 3  self->dylib = 0;
 4 }
 5 
 6 pid$target::dyld??loadPhase5*:return,
 7 pid$target::dyld??loadPhase4*:return
 8 /self->func != 0 && self->func == probefunc/
 9 {
10  self->dylib = 0;
11 }

It often happens that memory we want to access in DTrace is not available at the function entry point. This is what happens sometimes with the dlopen entry probe. The open call is triggered by dlopen so we can copy the file path into kernel space using the self->ppath pointer we saved earlier.

1 /* file name memory should be wired in by now */
2 
3 pid$target::open:entry
4 /self->dylib && self->ppath && self->path == 0/
5 {
6  self->path = copyinstr(self->ppath);
7 }

The Mac OSX virtual memory manager identifies a file by its virtual node (vnode) in the virtual filesystem (VFS) layer. We need to somehow match up the dynamic library name with its vnode and this is where the vn_open_auth entry probe comes in.

Note that unlike the DTrace pid$target provider we used before, we use the Function Boundary Tracing (FBT) provider since the virtual filesystem layer lives in the kernel.

1 fbt::vn_open_auth:entry
2 {
3  self->ndp = (nameidata_t)arg0;
4 }

This clause is not absolutely necessary but I’m populating self->curpath as a shortcut, to be used a few times in later probes.

1 /* wait to make sure ndp and vnode are fully populated */
2 
3 fbt::vn_open_auth:return
4 /self->path != 0/
5 {
6  self->curpath = stringof((self->ndp)->ni_pathbuf);
7 }

It’s not obvious how to best match up a vnode with a file name so I cheated and studied the XNU kernel source code which Apple makes available. The vnode is not populated until vn_open_auth returns so we have to wait until it happens fo fetch the path.

We are almost done with our task, just need to save the library name and library path to vnode mappings if our library names match.

 1 /* make sure we are opening the same file */
 2 
 3 fbt::vn_open_auth:return
 4 /self->curpath != 0 && self->path == self->curpath/
 5 {
 6  this->vp = (vnode_t)(self->ndp)->ni_vp;
 7  this->lib = stringof((this->vp)->v_name);
 8  self->lib[this->lib] = self->path;
 9  self->vnode[this->lib] = this->vp; 
10 }

We do need to clean up sometime and so we do.

 1 fbt::vn_open_auth:return
 2 {
 3  self->path = 0;
 4  self->ppath = 0;
 5  self->curpath = 0;
 6  self->ndp = 0;
 7  self->func = 0;
 8  self->dylib = 0;
 9 }

We use the same function boundary tracing (fbt) DTrace provider to track pageins. We print the file offset of the data we are paging in, as well as the size.

 1 fbt::vnode_pagein:entry
 2 {
 3  self->v_name = stringof(((vnode_t)arg0)->v_name);
 4 }
 5 
 6 /* vnode pointers should match but v_name seems more secure */
 7 
 8 fbt::vnode_pagein:entry
 9 /self->lib[self->v_name] != 0/
10 {
11  printf("vnode_pagein: %s, offset: %u, size: %u\n", 
12    self->v_name, arg3, arg4);
13 }

It does look from the output that we are loading multiple pages at the same time, e.g. size 1019904 corresponds to 249 pages. Page access looks quite random, though, which is killing us.

Now that we know what the access pattern is, we should try to first identify the symbols that are being accessed in each set of pages and then make the linker rearrange the code such that page access is more sequential and less random.

Note that you are not likely to see page-ins when running this DTrace script unless you have just restarted your Mac and are running Firefox for the first time. This is because the dynamic libraries will be stored in the Unified Buffer Cache (UBC) after first access and there won’t be any subsequent disk access until they are evicted from the cache.

I wrote about hacking the Unified Buffer Cache before but the same technique does not work with mmap-ed data since the virtual manager and the cache are the same thing. Evicting the libraries would involve allocating at at least twice as much virtual memory as there’s RAM and then touching each page to make sure it’s cached. This is unlikely to correspond to normal use of Firefox, though.

Loading mentions Retweet

Filed under  //   dtrace   firefox   mmap   optimization  
Posted November 2, 2009
// 1 Comment

Where does time go?

The dyld shared cache lives in /var/db/dyld/. The two files of interest are dyld_shared_cache_i386.map (for x86-32) and shared_region_roots/Applications.paths. Both are regular text files. The former shows the contents of the shared cache for the i386 architecture and the latter is what update_dyld_shared_cache inspects.

There’s no prebinding on newer versions of Mac OSX and the dyld shared cache is automatically updated as needed. Tracing Safari disk activity during startup reveals that basically all its dynamic libraries are pulled from the dyld shared cache.

It’s possible to add Firefox (…/Firefox.app/Contents/MacOS/firefox-bin) to Applications.paths and the change will persist across reboots. Unfortunately, only a handful of libraries that Firefox uses are pulled into the cache by update_dyld_shared_cache. I’m speculating that this may have something to do with @executable_path/XUL and friends (otool -L …/firefox-bin).

Safari uses absolute paths to frameworks in /System/Library/Frameworks so I speculate that relative paths are what is preventing XUL and others from going into the cache.

It’s possible to fix relative library paths in a given library, e.g. fix.sh XUL where fix.sh looks like this

 1 #!/bin/bash 
 2 
 3 function dylibs () {
 4   otool -L $1 |grep executable_path|awk '{print $1}'|cut -d"/" -f2
 5 }
 6 
 7 for i in `dylibs $1`
 8 do
 9         install_name_tool -change @executable_path/$i `pwd`/$i $1
10 done
11 
12 install_name_tool -id `pwd`/$1 $1

Firefox has to be recompiled with LDFLAGS=-header-pad_max_install_names__ in MOZCONFIG_ to make this happen since new library paths are greater than the space allocated in the Mach-O binary. See the man page for install_name_tool for details.

It’s possible to force dynamic libraries into the cache by putting dynamic library paths into shared_region_roots/Applications.paths instead of executables. I wasn’t successful in caching XUL, though, regardless of what I did. XUL is the Firefox dynamic library, it doesn’t even have a dylib extension.

In the end it doesn’t seem to matter since there’s a baffling lack of difference between Firefox and Safari cold stats, despite Safari pulling everything from the cache and Firefox using a large number of non-cached dylibs.

Here’s the cold startup stats for Safari

 1 sync && purge && DYLD_PRINT_STATISTICS=1 /Applications/Safari.app/Contents/MacOS/Safari
 2 total time: 696.9 milliseconds (100.0%)
 3 total images loaded:  116 (114 from dyld shared cache, 114 needed no fixups)
 4 total segments mapped: 5, into 30 pages with 10 pages pre-fetched
 5 total images loading time: 204.9 milliseconds (29.4%)
 6 total rebase fixups:  1,298
 7 total rebase fixups time: 0.1 milliseconds (0.0%)
 8 total binding fixups: 2,476
 9 total binding symbol lookups: 234, average images searched per symbol: 1.6
10 total binding fixups time: 80.5 milliseconds (11.5%)
11 total bindings lazily fixed up: 3 of 901
12 total init time time: 411.3 milliseconds (59.0%)
13 total images with weak exports:  1

and Firefox

 1 total time: 731.2 milliseconds (100.0%)
 2 total images loaded:  106 (93 from dyld shared cache, 56 needed no fixups)
 3 total segments mapped: 49, into 5903 pages with 684 pages pre-fetched
 4 total images loading time: 235.3 milliseconds (32.1%)
 5 total rebase fixups:  149,011
 6 total rebase fixups time: 3.7 milliseconds (0.5%)
 7 total binding fixups: 24,932
 8 total binding symbol lookups: 797, average images searched per symbol: 2.3
 9 total binding fixups time: 149.9 milliseconds (20.5%)
10 total bindings lazily fixed up: 45 of 19,109
11 total init time time: 342.2 milliseconds (46.8%)
12 total images with weak exports:  3

Notice a large and significant difference? Me neither.

The other thing that I cannot explain at the moment is where the rest of the startup time goes, e.g.

1 ./cold.sh startup.d
2 Total: 10001.723521ms

So it took less than 1 second to dynamically link Firefox but where did the other 9 seconds of startup go?

cold.sh is rather simple

1 #!/bin/bash
2 
3 cmd="./Minefield.app/Contents/MacOS/firefox-bin -no-remote -foreground -P 2"
4 
5 sync && purge && dtrace -x dynvarsize=64m -x evaltime=exec -c "$cmd" -wZs $1

and the startup.d script doesn’t do much either

 1 #pragma D option quiet
 2 
 3 BEGIN
 4 {
 5  start = timestamp;
 6 }
 7 
 8 /* stop tracing here */
 9 
10 mozilla$target:::main-entry
11 {
12  exit(0);
13 }
14 
15 END
16 {
17  this->total = timestamp - start;
18  printf("Total: %u.%06ums\n", this->total / 1000000, this->total % 1000000);
19 }

main-entry is my static probe, built into the Firefox source code. It fires once the main Firefox function, XRE_main, is entered.

I don’t have an explanation yet but 9 seconds is a very large difference but it just may be DTrace because of similar cold startup timings for Safari and Firefox.

Loading mentions Retweet

Filed under  //   dyld   firefox   mac   optimization   safari  
Posted August 28, 2009
// 0 Comments

Faster Mac Firefox

I started at Mozilla this Monday and my focus is on making Firefox run faster which includes startup and page rendering.

I thought I’d start by checking dynamic linking time of Firefox and Safari.

My testbed consists of

  • 3 years old 2.16Ghz CoreDuo MacBook Pro, 2Gb of 667Mhz DDR2 memory, 100Gb 7200 RPM hard drive
  • recent 2.93Ghz Core2Duo MacBook Pro, 4Gb of 1067Mhz DDR3 memory, 256Gb Apple SSD

Running on the latter machine, I have

 1 total time: 5.8 seconds (100.0%)
 2 total images loaded:  116 (114 from dyld shared cache, 114 needed no fixups)
 3 total segments mapped: 5, into 30 pages with 10 pages pre-fetched
 4 total images loading time: 2.4 seconds (42.5%)
 5 total rebase fixups:  1,298
 6 total rebase fixups time: 0.1 milliseconds (0.0%)
 7 total binding fixups: 2,444
 8 total binding symbol lookups: 236, average images searched per symbol: 1.6
 9 total binding fixups time: 895.5 milliseconds (15.3%)
10 total bindings lazily fixed up: 3 of 889
11 total init time time: 2.4 seconds (42.1%)
12 total images with weak exports:  1

for Safari, and

 1 total time: 5.9 seconds (100.0%)
 2 total images loaded:  105 (92 from dyld shared cache, 56 needed no fixups)
 3 total segments mapped: 49, into 5894 pages with 684 pages pre-fetched
 4 total images loading time: 4.4 seconds (74.4%)
 5 total rebase fixups:  148,829
 6 total rebase fixups time: 5.3 milliseconds (0.0%)
 7 total binding fixups: 24,885
 8 total binding symbol lookups: 794, average images searched per symbol: 2.3
 9 total binding fixups time: 350.1 milliseconds (5.9%)
10 total bindings lazily fixed up: 45 of 17,764
11 total init time time: 1.1 seconds (19.5%)
12 total images with weak exports:  3

for Firefox.

From the above I conclude that the dynamic linker is bloody fast, taking minimal time to fix up thousands of bindings (library function pointers, etc.).

I can also conclude that Safari takes twice as long to run its static initializers (total init time).

Dynamic linking is not the bottleneck then, so I must keep digging!

Here are the results for the faster machine and Safari

 1 total time: 623.0 milliseconds (100.0%)
 2 total images loaded:  116 (114 from dyld shared cache, 114 needed no fixups)
 3 total segments mapped: 5, into 30 pages with 10 pages pre-fetched
 4 total images loading time: 216.7 milliseconds (34.7%)
 5 total rebase fixups:  1,298
 6 total rebase fixups time: 0.1 milliseconds (0.0%)
 7 total binding fixups: 2,476
 8 total binding symbol lookups: 234, average images searched per symbol: 1.6
 9 total binding fixups time: 49.2 milliseconds (7.9%)
10 total bindings lazily fixed up: 3 of 901
11 total init time time: 356.9 milliseconds (57.2%)
12 total images with weak exports:  1

as well as Firefox

 1 total time: 792.2 milliseconds (100.0%)
 2 total images loaded:  105 (92 from dyld shared cache, 56 needed no fixups)
 3 total segments mapped: 49, into 5894 pages with 684 pages pre-fetched
 4 total images loading time: 281.0 milliseconds (35.4%)
 5 total rebase fixups:  148,829
 6 total rebase fixups time: 2.6 milliseconds (0.3%)
 7 total binding fixups: 24,885
 8 total binding symbol lookups: 794, average images searched per symbol: 2.3
 9 total binding fixups time: 66.7 milliseconds (8.4%)
10 total bindings lazily fixed up: 45 of 17,764
11 total init time time: 441.8 milliseconds (55.7%)
12 total images with weak exports:  3

Loading mentions Retweet

Filed under  //   firefox   mac   optimization   safari  
Posted August 21, 2009
// 0 Comments

How to set up an ejabberd cluster on Amazon EC2 in 6 easy steps

1) Edit /etc/init.d/ejabberd

You need node=`hostname -f` since `hostname -s` does not work here.

2) Edit /etc/init.d/ejabberd

Use -name ejabberd@$node instead of -sname ejabberd everywhere. This applies to -sname ejabberdctl as well.

3) Edit /etc/init.d/ejabberd add mnesia_extra_db_nodes

See the start() function, find the line that says -detached and add the following right above

-mnesia extra_db_nodes \"[' ... hostname -f of a running node ... ']\" 

4) Remove the Mnesia db tables

cd /var/lib/ejabberd/spool && rm -f *

5) Edit /etc/ejabberd/ejabberdctl.cfg

Make sure you have this at the very end

ERLANG_NODE=ejabberd@`hostname -f`

6) Make sure your .erlang.cookie files are the same on all nodes

This will work with MySQL. Enjoy!

Loading mentions Retweet

Filed under  //   ec2   ejabberd   erlang  
Posted February 7, 2009
// 0 Comments

Creating Mac binaries on any platform, by hand and without using a linker

I’m in love with Forth but there are no commercial Forth environments for Mac OSX. GForth is a free, fast and portable implementation of ANS Forth but it requires GCC and does not allow for binary distribution of code that uses foreign functions.

There are two excellent commercial implementations of ANS Forth and both run on Linux. I asked one of the companies if I could port their Forth to the Mac and promptly ended up with a tarball on my lap. There were no C or assembler files, it was all Forth source code.

The proper bootstrapping approach turned out to generate a Mac kernel on Linux, copy it over to the Mac and use it to compile the rest of the Forth environment. It’s called cross-compiling!

This required me to investigate how Mac binaries are laid out and how I could generate them without using gcc or a linker.

I would like to explain how I did it. Let’s start with a simple C program and feel free to browse the full source code.

1 #include 
2 #include 
3 
4 int main(int argc, char **argv)
5 {
6   printf("Hello world!\n");
7   exit(0);
8 }

It can’t get any simpler!

1 gcc hello.c -o hello
2 ./hello
3 Hello world!

What does it look like in assembler, though?

 1 .cstring
 2 LC0:
 3     .ascii "Hello world!
 1 .cstring
 2 LC0:
 3     .ascii "Hello world!
 1 .cstring
 2 LC0:
 3     .ascii "Hello world!
 1 .cstring
 2 LC0:
 3     .ascii "Hello world!
 1 .cstring
 2 LC0:
 3     .ascii "Hello world!
 1 .cstring
 2 LC0:
 3     .ascii "Hello world!
 1 .cstring
 2 LC0:
 3     .ascii "Hello world!
 1 .cstring
 2 LC0:
 3     .ascii "Hello world!
 1 .cstring
 2 LC0:
 3     .ascii "Hello world!
 1 .cstring
 2 LC0:
 3     .ascii "Hello world![[posterous_whitelist_block_2]]"
 4     .text
 5 .globl _main
 6 _main:
 7     pushl   %ebp
 8     movl    %esp, %ebp
 9     pushl   %ebx
10     subl    $20, %esp
11     call    L3
12 "L00000000001$pb":
13 L3:
14     popl    %ebx
15     leal    LC0-"L00000000001$pb"(%ebx), %eax
16     movl    %eax, (%esp)
17     call    L_puts$stub
18     movl    $0, (%esp)
19     call    L_exit$stub
20     .section __IMPORT,__jump_table,symbol_stubs,self_modifying_code+pure_instructions,5
21 L_exit$stub:
22     .indirect_symbol _exit
23     hlt ; hlt ; hlt ; hlt ; hlt
24 L_puts$stub:
25     .indirect_symbol _puts
26     hlt ; hlt ; hlt ; hlt ; hlt
27     .subsections_via_symbols
" 4 .text 5 .globl _main 6 _main: 7 pushl %ebp 8 movl %esp, %ebp 9 pushl %ebx 10 subl $20, %esp 11 call L3 12 "L00000000001$pb": 13 L3: 14 popl %ebx 15 leal LC0-"L00000000001$pb"(%ebx), %eax 16 movl %eax, (%esp) 17 call L_puts$stub 18 movl $0, (%esp) 19 call L_exit$stub 20 .section __IMPORT,__jump_table,symbol_stubs,self_modifying_code+pure_instructions,5 21 L_exit$stub: 22 .indirect_symbol _exit 23 hlt ; hlt ; hlt ; hlt ; hlt 24 L_puts$stub: 25 .indirect_symbol _puts 26 hlt ; hlt ; hlt ; hlt ; hlt 27 .subsections_via_symbols
" 4 .text 5 .globl _main 6 _main: 7 pushl %ebp 8 movl %esp, %ebp 9 pushl %ebx 10 subl $20, %esp 11 call L3 12 "L00000000001$pb": 13 L3: 14 popl %ebx 15 leal LC0-"L00000000001$pb"(%ebx), %eax 16 movl %eax, (%esp) 17 call L_puts$stub 18 movl $0, (%esp) 19 call L_exit$stub 20 .section __IMPORT,__jump_table,symbol_stubs,self_modifying_code+pure_instructions,5 21 L_exit$stub: 22 .indirect_symbol _exit 23 hlt ; hlt ; hlt ; hlt ; hlt 24 L_puts$stub: 25 .indirect_symbol _puts 26 hlt ; hlt ; hlt ; hlt ; hlt 27 .subsections_via_symbols
" 4 .text 5 .globl _main 6 _main: 7 pushl %ebp 8 movl %esp, %ebp 9 pushl %ebx 10 subl $20, %esp 11 call L3 12 "L00000000001$pb": 13 L3: 14 popl %ebx 15 leal LC0-"L00000000001$pb"(%ebx), %eax 16 movl %eax, (%esp) 17 call L_puts$stub 18 movl $0, (%esp) 19 call L_exit$stub 20 .section __IMPORT,__jump_table,symbol_stubs,self_modifying_code+pure_instructions,5 21 L_exit$stub: 22 .indirect_symbol _exit 23 hlt ; hlt ; hlt ; hlt ; hlt 24 L_puts$stub: 25 .indirect_symbol _puts 26 hlt ; hlt ; hlt ; hlt ; hlt 27 .subsections_via_symbols
" 4 .text 5 .globl _main 6 _main: 7 pushl %ebp 8 movl %esp, %ebp 9 pushl %ebx 10 subl $20, %esp 11 call L3 12 "L00000000001$pb": 13 L3: 14 popl %ebx 15 leal LC0-"L00000000001$pb"(%ebx), %eax 16 movl %eax, (%esp) 17 call L_puts$stub 18 movl $0, (%esp) 19 call L_exit$stub 20 .section __IMPORT,__jump_table,symbol_stubs,self_modifying_code+pure_instructions,5 21 L_exit$stub: 22 .indirect_symbol _exit 23 hlt ; hlt ; hlt ; hlt ; hlt 24 L_puts$stub: 25 .indirect_symbol _puts 26 hlt ; hlt ; hlt ; hlt ; hlt 27 .subsections_via_symbols
" 4 .text 5 .globl _main 6 _main: 7 pushl %ebp 8 movl %esp, %ebp 9 pushl %ebx 10 subl $20, %esp 11 call L3 12 "L00000000001$pb": 13 L3: 14 popl %ebx 15 leal LC0-"L00000000001$pb"(%ebx), %eax 16 movl %eax, (%esp) 17 call L_puts$stub 18 movl $0, (%esp) 19 call L_exit$stub 20 .section __IMPORT,__jump_table,symbol_stubs,self_modifying_code+pure_instructions,5 21 L_exit$stub: 22 .indirect_symbol _exit 23 hlt ; hlt ; hlt ; hlt ; hlt 24 L_puts$stub: 25 .indirect_symbol _puts 26 hlt ; hlt ; hlt ; hlt ; hlt 27 .subsections_via_symbols
" 4 .text 5 .globl _main 6 _main: 7 pushl %ebp 8 movl %esp, %ebp 9 pushl %ebx 10 subl $20, %esp 11 call L3 12 "L00000000001$pb": 13 L3: 14 popl %ebx 15 leal LC0-"L00000000001$pb"(%ebx), %eax 16 movl %eax, (%esp) 17 call L_puts$stub 18 movl $0, (%esp) 19 call L_exit$stub 20 .section __IMPORT,__jump_table,symbol_stubs,self_modifying_code+pure_instructions,5 21 L_exit$stub: 22 .indirect_symbol _exit 23 hlt ; hlt ; hlt ; hlt ; hlt 24 L_puts$stub: 25 .indirect_symbol _puts 26 hlt ; hlt ; hlt ; hlt ; hlt 27 .subsections_via_symbols
" 4 .text 5 .globl _main 6 _main: 7 pushl %ebp 8 movl %esp, %ebp 9 pushl %ebx 10 subl $20, %esp 11 call L3 12 "L00000000001$pb": 13 L3: 14 popl %ebx 15 leal LC0-"L00000000001$pb"(%ebx), %eax 16 movl %eax, (%esp) 17 call L_puts$stub 18 movl $0, (%esp) 19 call L_exit$stub 20 .section __IMPORT,__jump_table,symbol_stubs,self_modifying_code+pure_instructions,5 21 L_exit$stub: 22 .indirect_symbol _exit 23 hlt ; hlt ; hlt ; hlt ; hlt 24 L_puts$stub: 25 .indirect_symbol _puts 26 hlt ; hlt ; hlt ; hlt ; hlt 27 .subsections_via_symbols
" 4 .text 5 .globl _main 6 _main: 7 pushl %ebp 8 movl %esp, %ebp 9 pushl %ebx 10 subl $20, %esp 11 call L3 12 "L00000000001$pb": 13 L3: 14 popl %ebx 15 leal LC0-"L00000000001$pb"(%ebx), %eax 16 movl %eax, (%esp) 17 call L_puts$stub 18 movl $0, (%esp) 19 call L_exit$stub 20 .section __IMPORT,__jump_table,symbol_stubs,self_modifying_code+pure_instructions,5 21 L_exit$stub: 22 .indirect_symbol _exit 23 hlt ; hlt ; hlt ; hlt ; hlt 24 L_puts$stub: 25 .indirect_symbol _puts 26 hlt ; hlt ; hlt ; hlt ; hlt 27 .subsections_via_symbols
" 4 .text 5 .globl _main 6 _main: 7 pushl %ebp 8 movl %esp, %ebp 9 pushl %ebx 10 subl $20, %esp 11 call L3 12 "L00000000001$pb": 13 L3: 14 popl %ebx 15 leal LC0-"L00000000001$pb"(%ebx), %eax 16 movl %eax, (%esp) 17 call L_puts$stub 18 movl $0, (%esp) 19 call L_exit$stub 20 .section __IMPORT,__jump_table,symbol_stubs,self_modifying_code+pure_instructions,5 21 L_exit$stub: 22 .indirect_symbol _exit 23 hlt ; hlt ; hlt ; hlt ; hlt 24 L_puts$stub: 25 .indirect_symbol _puts 26 hlt ; hlt ; hlt ; hlt ; hlt 27 .subsections_via_symbols

The IMPORT section is where gcc allocates stubs for external functions. The dynamic linker will replace these with a jump to the real printf once libc is loaded.

What the code above does not include is proper alignment of the stack before the calls to printf and exit. This is required according to the Mac OSX ABI IA-32 Function Calling Conventions. It’s a slight of hand on the part of gcc which inserts a prolog before invoking our main function.

This prolog sets up the stack and gets hold of our program arguments, i.e. argc, argv and envp.

 1 Breakpoint 1, 0x00001f6c in start ()
 2 (gdb) disas
 3 Dump of assembler code for function start:
 4 0x00001f68 :    push   $0x0
 5 0x00001f6a :    mov    %esp,%ebp
 6 0x00001f6c :    and    $0xfffffff0,%esp ; <-- stack alignment
 7 0x00001f6f :    sub    $0x10,%esp  ; <-- and here too!
 8 0x00001f72 :    mov    0x4(%ebp),%ebx
 9 0x00001f75 :    mov    %ebx,0x0(%esp)
10 0x00001f79 :    lea    0x8(%ebp),%ecx
11 0x00001f7c :    mov    %ecx,0x4(%esp)
12 0x00001f80 :    add    $0x1,%ebx
13 0x00001f83 :    shl    $0x2,%ebx
14 0x00001f86 :    add    %ecx,%ebx
15 0x00001f88 :    mov    %ebx,0x8(%esp)
16 0x00001f8c :    mov    (%ebx),%eax
17 0x00001f8e :    add    $0x4,%ebx
18 0x00001f91 :    test   %eax,%eax
19 0x00001f93 :    jne    0x1f8c 
20 0x00001f95 :    mov    %ebx,0xc(%esp)
21 0x00001f99 :    call   0x1fca 
22 0x00001f9e :    mov    %eax,0x0(%esp)
23 0x00001fa2 :    call   0x3000 
24 0x00001fa7 :    hlt    
25 End of assembler dump.

Let’s tidy things up into a single NASM file. It’s less verbose than GAS and I much prefer it.

 1 bits  32
 2 
 3 section .text
 4 
 5 GLOBAL start
 6 extern _printf, _exit
 7 
 8 start:
 9   and esp, 0xFFFFFFF0
10   sub esp, 0x10
11   mov dword [esp], hello.msg
12   call _printf
13   add esp, 0x10
14   mov eax, 0          ; set return code
15   call _exit
16   hlt
17 
18 section .data
19 
20 hello.msg db 'Hello, World!', 0x0a, 0x00

The stubs are taken care of by nasm in Mach-O mode (-f macho below) and the code still works.

1 nasm -f macho hello.asm -o hello.o
2 ld hello.o -o hello -lc
3 
4 ./hello
5 Hello, World!

otool is indispensable for any sort of involved Mac forensics and the Mach-O file format is very well explained by Apple.

 1 otool -l hello
 2 hello:
 3 Load command 0
 4       cmd LC_SEGMENT
 5   cmdsize 56
 6   segname __PAGEZERO
 7    vmaddr 0x00000000
 8    vmsize 0x00001000
 9   fileoff 0
10  filesize 0
11   maxprot 0x00000000
12  initprot 0x00000000
13    nsects 0
14     flags 0x0
15 ...
16 Load command 8
17      cmd LC_UUID
18  cmdsize 24
19    uuid 0xce 0x2c 0xd0 0xae 0xbb 0x29 0xb4 0xc5
20         0xba 0x70 0x39 0x06 0x18 0x30 0x42 0x7b
21 Load command 9
22         cmd LC_UNIXTHREAD
23     cmdsize 80
24      flavor i386_THREAD_STATE
25       count i386_THREAD_STATE_COUNT
26         eax 0x00000000 ebx    0x00000000 ecx 0x00000000 edx 0x00000000
27         edi 0x00000000 esi    0x00000000 ebp 0x00000000 esp 0x00000000
28         ss  0x00000000 eflags 0x00000000 eip 0x00001fd0 cs  0x00000000
29         ds  0x00000000 es     0x00000000 fs  0x00000000 gs  0x00000000
30 Load command 10
31           cmd LC_LOAD_DYLIB
32       cmdsize 52
33          name /usr/lib/libSystem.B.dylib (offset 24)
34    time stamp 2 Thu Jan  1 01:00:02 1970
35       current version 111.1.3
36 compatibility version 1.0.0

The Mach-O header is normally generated by the compiler and the linker (GCC & LD) but I’m using neither so I have to generate the header by hand. It’s doable, as long as NASM is instructed to simply dump a binary image to disk (-f bin) and it actually works!

1 nasm -f bin hello1.asm -o hello1
2 chmod +x hello1
3 ./hello1
4 Hello, World!

Note that this can be done on any platform NASM runs on. I did it on Linux but assume it will work just as well on Windows.

Now, let’s take a good look at the code…

We need to tell NASM we are in 32-bit mode and that program code starts on the second VM page (0x1000 or 4096). The first page (PAGEZERO) is there to catch null pointer references.

1 ;;; File: hello1.asm
2 ;;; Build: nasm -f bin -o hello1 hello1.asm && chmod +x hello1
3 
4 bits  32
5 org   0x1000

The header specifies that this is an x86-32 binary and a full-fledged executable file and that there are 10 load commands in the header.

1 mhdr:
2    dd 0xFEEDFACE  ; magic
3    dd 7           ; cputype
4    dd 3           ; cpusubtype
5    dd 2           ; filetype
6    dd 10          ; ncmds
7    dd sizeofcmds  ; sizeofcmds
8    dd 0x85        ; flags

PAGEZERO is where you end up when dereferencing a 0 pointer. This page is protected from reading and writing so any access to it causes a page fault and a memory access violation. This segment does not take any space in the file so its filesize is set to 0.

 1 ;;; Load command #0
 2 
 3 pagezero: 
 4    dd 1              ; LC_SEGMENT
 5    dd _pagezero      ; size
 6    db '__PAGEZERO'   ; segname
 7    times 6 db 0      ; padding to 16 chars
 8    dd 0              ; vmaddr
 9    dd 0x1000         ; vmsize
10    dd 0              ; fileoff
11    dd 0              ; filesize
12    dd 0              ; maxprot
13    dd 0              ; initprot
14    dd 0              ; nsects
15    dd 0              ; flags
16 _pagezero equ $-pagezero

The text segment is where our code lives. It’s readable and executable (initprot). The load commands that form part of the Mach-O header itself need to be loaded somewhere. Here, they are part of the text segment which is why the segment starts at the beginning of the file (fileoff 0).

 1 ;;; Load command #1
 2 
 3 code: 
 4    dd 1              ; LC_SEGMENT
 5    dd _code          ; size
 6    db '__TEXT'       ; segname 
 7    times 10 db 0     ; padding to 16 chars
 8    dd 0x1000         ; vmaddr
 9    dd 0x1000         ; vmsize
10    dd 0              ; fileoff
11    dd 0x1000         ; filesize
12    dd 7              ; maxprot
13    dd 5              ; initprot
14    dd 1              ; nsects
15    dd 0              ; flags
16 
17 sect1:               ; section 0
18    db '__text'       ; sectname
19    times 10 db 0     ; padding to 16 chars
20    db '__TEXT'       ; segname 
21    times 10 db 0     ; padding to 16 chars
22    dd start          ; addr
23    dd codesize       ; size
24    dd start-$$       ; offset
25    dd 0              ; align on 2^0
26    dd 0              ; reloff
27    dd 0              ; nreloc
28    dd 0x80000400     ; flags
29    dd 0              ; reserved1
30    dd 0              ; reserved2
31 _code equ $-code

The data segment holds our “Hello world!” string.

 1 ;;; Load command #2
 2 
 3 data: 
 4    dd 1              ; LC_SEGMENT
 5    dd _data          ; size
 6    db '__DATA'       ; segname 
 7    times 10 db 0     ; padding to 16 chars
 8    dd 0x2000         ; vmaddr
 9    dd 0x1000         ; vmsize
10    dd 0x1000         ; fileoff
11    dd 0x1000         ; filesize
12    dd 7              ; maxprot
13    dd 3              ; initprot
14    dd 1              ; nsects
15    dd 0              ; flags
16 
17 sect2:               ; section 0
18    db '__const'      ; sectname
19    times 9 db 0      ; padding to 16 chars
20    db '__DATA'       ; segname 
21    times 10 db 0     ; padding to 16 chars
22    dd 0x2000         ; addr
23    dd 15             ; size, our string 
24    dd 4096           ; offset
25    dd 0              ; align on 2^0
26    dd 0              ; reloff
27    dd 0              ; nreloc
28    dd 0              ; flags
29    dd 0              ; reserved1
30    dd 0              ; reserved2
31 _data equ $-data

The IMPORT segment holds our jump table, the stubs for printf and exit. The dynamic linker will fill in the stubs for us with a jump to printf and exit in libc. This segment needs to be readable, writable and executable (initprot).

 1 ;;; Load command #3
 2 
 3 stubs: 
 4    dd 1              ; LC_SEGMENT
 5    dd _stubs         ; size
 6    db '__IMPORT'     ; segname 
 7    times 8 db 0      ; padding to 16 chars
 8    dd 0x3000         ; vmaddr
 9    dd 0x1000         ; vmsize
10    dd 0x2000         ; fileoff
11    dd 0x1000         ; filesize
12    dd 7              ; maxprot
13    dd 7              ; initprot
14    dd 1              ; nsects
15    dd 0              ; flags
16 
17 sect3:               ; section 0
18    db '__jump_table' ; sectname
19    times 4 db 0      ; padding to 16 chars
20    db '__IMPORT'     ; segname 
21    times 8 db 0      ; padding to 16 chars
22    dd 0x3000         ; addr
23    dd 10             ; size, two stubs
24    dd 0x2000         ; offset
25    dd 6              ; align on 2^6
26    dd 0              ; reloff
27    dd 0              ; nreloc
28    dd 0x04000008     ; flags
29    dd 0              ; reserved1
30    dd 5              ; reserved2, stub size
31 _stubs equ $-stubs

The LINKEDIT segment holds the symbol table.

 1 ;;; Load command #4
 2 
 3 linkage: 
 4    dd 1              ; LC_SEGMENT
 5    dd _linkage       ; size
 6    db '__LINKEDIT'   ; link table 
 7    times 6 db 0      ; padding 
 8    dd 0x4000         ; vmaddr
 9    dd 0x1000         ; vmsize
10    dd symbols-$$     ; fileoff
11    dd _symbols       ; filesize
12    dd 7              ; maxprot
13    dd 1              ; initprot
14    dd 0              ; nsects
15    dd 0              ; flags
16 _linkage equ $-linkage

This segment describes our symbol table, including where the symbols and the strings naming them are located. I believe it’s mostly for the benefit of the debugger.

 1 ;;; Load command #5
 2 
 3 symtab: 
 4    dd 2              ; LC_SYMTAB
 5    dd _symtab        ; size
 6    dd symbols-$$     ; symoff
 7    dd 4              ; nsyms
 8    dd strings-$$     ; stroff
 9    dd _strings       ; strsize
10 _symtab equ $-symtab

This load command describes the dynamic symbol table. This is how the dynamic linker knows to plug the stubs (indirect).

 1 ;;; Load command #6
 2 
 3 dysymtab: 
 4    dd 0x0b           ; LC_DYSYMTAB
 5    dd _dysymtab      ; size
 6    dd 0              ; ilocalsym
 7    dd 1              ; nlocalsym
 8    dd 1              ; iextdefsym
 9    dd 2              ; nextdefsym
10    dd 2              ; iundefsym
11    dd 2              ; nundefsym
12    dd 0              ; tocoff
13    dd 0              ; ntoc
14    dd 0              ; modtaboff
15    dd 0              ; nmodtab
16    dd 0              ; extrefsymoff
17    dd 0              ; nextrefsyms
18    dd indirect-$$    ; indirectsymoff
19    dd 2              ; nindirectsyms
20    dd 0              ; extreloff
21    dd 0              ; nextrel
22    dd 0              ; locreloff
23    dd 0              ; nlocrel
24 _dysymtab equ $-dysymtab

My guess is as good as yours here. I’m not ready to use a dynamic linker of my own but this is a distinct possibility! This load command clearly provides for it.

 1 ;;; Load command #7
 2 
 3 dylinker: 
 4    dd 0x0e           ; LC_LOAD_DYLINKER
 5    dd _dylinker      ; size
 6    dd 12             ; nameoff
 7    db '/usr/lib/dyld', 0
 8    align 4
 9 _dylinker equ $-dylinker

This load command specifies the contents of the registers at startup. I haven’t seen anything other than EIP populated, though. The program will not run unless this load command is present!

 1 ;;; Load command #8
 2 
 3 thrstate:
 4    dd 0x5            ; LC_UNIXTHREAD
 5    dd _thrstate      ; size
 6    dd 0x01           ; i386_THREAD_STATE
 7    dd 0x10           ; i386_THREAD_STATE_COUNT
 8    times 10 dd 0x00  ; cpu thread state
 9    dd start          ; eip
10    times 05 dd 0x00  ; 
11 _thrstate equ $-thrstate

We can have as many dylib segments as dynamic libraries we would like to use. I’m only using libc since that’s where printf and exit live. I could have created stubs for dlopen, dlclose, dlsym and dlerror and used them to load libc and pull out printf and exit. Why bother, though, when the dynamic linker can do it for us?

 1 ;;; Load command #9
 2 
 3 dylib: 
 4    dd 0x0c           ; LC_LOAD_DYLIB
 5    dd _dylib         ; size
 6    dd 0x18           ; nameoff
 7    dd 0x02           ; timestamp
 8    dd 0x006F0103     ; currentver
 9    dd 0x00010000     ; compatver
10    db '/usr/lib/libSystem.B.dylib', 0
11    align 4
12 _dylib equ $-dylib

It was a long road through the Mach-O header but we can finally relax and get some work done. There isn’t much to do apart from printing hello world and exiting but note the alignment of the stack on a 16-byte boundary, before each function call.

I’m taking the easy way out and aligning the stack one extra time, at the beginning of the program. This makes the rest of the alignment work much easier!

All values in the stack are 32-bit values. We are pushing a single argument which requires us to pad the stack with 12 more bytes (sub esp, 0x10). We pop arguments and padding right after the call to printf.

 1 GLOBAL start
 2 
 3 start:
 4 
 5   and esp, 0xFFFFFFF0
 6   sub esp, 0x10
 7   mov dword [esp], hello.msg
 8   call _printf
 9   add esp, 0x10
10   mov eax, 0          ; set return code
11   call _exit
12   hlt
13 
14 codesize equ $-start

Data and stubs are easy. Note the alignment to a page boundary. A jump to a 32-bit address takes 5 bytes, thus 5 halt instructions are used for each stub.

 1 ;;; Data
 2 
 3 align 4096
 4 
 5 hello.msg db 'Hello, World!', 0x0a, 0x00
 6 
 7 ;;; Stubs
 8 
 9 align 4096
10 
11 _printf:
12   times 5 hlt
13 
14 _exit:
15   times 5 hlt

The symbol table has a well-defined format and each symbol needs to be described in excruciating detail!

 1 ;;; Linkage
 2 
 3 align 4096
 4 
 5 symbols:           ; symbol table
 6 
 7 ; hello.msg
 8 
 9 dd str01off    ; nstrx
10 db 0x0e        ; type
11 db 0x02        ; sect
12 dw 0x00        ; desc
13 dd hello.msg   ; value
14 
15 ; start
16 
17 dd str02off    ; nstrx
18 db 0x0f        ; type
19 db 0x01        ; sect
20 dw 0x00        ; desc
21 dd start       ; value
22 
23 ; _printf
24 
25 dd str03off    ; nstrx
26 db 0x01        ; type N_EXT
27 db 0x00        ; sect
28 dw 0x0101      ; desc
29 dd _printf     ; value
30 
31 ; _exit
32 
33 dd str04off    ; nstrx
34 db 0x01        ; type N_EXT
35 db 0           ; sect
36 dw 0x0101      ; desc
37 dd _exit       ; value

The indirect symbol table tells the dynamic linker that elements 2 and 3 of the symbol table need to be looked up and their stubs plugged.

1 indirect:         ; indirect symbol table
2 
3    dd 0x02        ; _printf        
4    dd 0x03        ; _exit

The string table names the symbols above.

 1 strings:          ; string table
 2 
 3       db 0x20, 0x00
 4 
 5 str01 db 'hello.msg', 0x00
 6 str02 db 'start', 0x00
 7 str03 db '_printf', 0x00
 8 str04 db '_exit', 0x00
 9 
10 str01off equ str01 - strings
11 str02off equ str02 - strings
12 str03off equ str03 - strings
13 str04off equ str04 - strings
14 
15 _strings equ $-strings   
16 _symbols equ $-symbols

I don’t expect you to generate Mac binaries by hand on Linux or Windows but I hope this tutorial will be of help if you ever decide to try!

Loading mentions Retweet

Filed under  //   asm   hacks   mac   mach-o  
Posted January 26, 2009
// 0 Comments

Optimizing Erlang: Array vs Tuple

This is the first in my series of posts on optimizing Erlang. I plan to tackle optimizing Mnesia, profiling and scalability.

You need to use arrays of up to 10,000 elements. Erlang offers you tuples as well as fixed-size and extensible pseudo-arrays. What is the fastest option?

Let us start the death match by pitting arrays against tuples in a death match. Trees were an option before the array module became available, so lets throw in trees just for laughs.

I'm running Mac OSX Leopard 10.5.4 on a Mac Pro 2x2.8Ghz Quad-Core Intel Xeon with 14Gb 800Mhz DDR2 FB-DIMM.

Erlang (BEAM) emulator version 5.6.3 [source] [64-bit] [smp:8] [async-threads:0] [kernel-poll:false]

27> arr:test().
Fixed-size array: get:     2921us, set:     5902us
Extensible array: get:     3336us, set:     8144us
Tuple:            get:      632us, set:   107467us
Tree:             get:     4321us, set:    45256us
ok

30> arr:test(100000).
Fixed-size array: get:    35314us, set:    74653us
Extensible array: get:    35349us, set:    74059us
Tuple:            get:     6411us, set: 24304490us
Tree:             get:    53681us, set:   632795us
ok

Note that timer:tc returns time in microseconds. I ran each test 3 times and the results above are from the third iteration.

Trees in Erlang (gb_trees) are built on top of regular tuples and so is the array module. The array module is much more efficient about using tuples than a regular tree, though, and this is the reason why it's so much faster.

The tuple test pre-allocates a tuple of 10k or 100k elements. There's no destructive assignment in Erlang and so the same large tuple needs to be allocated and discarded on every set operation. It's very inefficient to allocate and discard a large tuple on every set operation, thus naive tuple set is very slow.

The array module uses an efficient tree-like internal representation:

%% A tree is either a leaf, with LEAFSIZE elements (the "base"), an
%% internal node with LEAFSIZE+1 elements, or an unexpanded tree,
%% represented by a single integer: the number of elements that may be
%% stored in the tree when it is expanded. The last element of an
%% internal node caches the number of elements that may be stored in
%% each of its subtrees.
%%
%% Note that to update an entry in a tree of height h = log[b] n, the
%% total number of written words is (b+1)+(h-1)*(b+2), since tuples use
%% a header word on the heap. 4 is the optimal base for minimizing the
%% number of words written, but causes higher trees, which takes time.
%% The best compromise between speed and memory usage seems to lie
%% around 8-10. Measurements indicate that the optimum base for speed is
%% 24 - above that, it gets slower again due to the high memory usage.
%% Base 10 is a good choice, giving 2/3 of the possible speedup from
%% base 4, but only using 1/3 more memory. (Base 24 uses 65% more memory
%% per write than base 10, but the speedup is only 21%.)

It's far more efficient to allocate small tuples on every set and this is why the array module wins hands down. 

Use the code below to replicate my results on your hardware.

    

Loading mentions Retweet

Filed under  //   erlang   performance  
Posted August 25, 2008
// 1 Comment

AlgoScript

I'm on my third iteration of a translator from EasyLanguage. The first two versions were written in Haskell and OCaml and I'm using Lisp now. My goal is to produce code for a trading engine that runs in a shared library or DLL and can be embedded in other products such as NinjaTrader or TradeStation.

The original translator produced C# code but found this approach untenable. Every trading platform I looked at has a different set of trading functions. Generating C# code would have required me to write a library of supporting functions for every target platform to plug in the holes. 

I would have to write the code and test my libraries over and over again. It would also have required me to become an expert in every trading platform I wanted to translate for and made expanding my market rather tedious. Last but not least, anyone could grok my logic by looking at a translation or two and then write the code themselves using the libraries that I have painstakingly produced.

It struck me that I could translate into an intermediate language and build an embeddable execution engine that could run in every trading product I would target with my translator. All the trading products I looked at support DLLs. So long as I supplied my engine as a DLL and exported a set of functions, I could take in price quotes and return buy or sell instructions. 

Targeting my own trading engine simplifies development and testing and lets me focus on adding value to my own products instead of the products of others. I can focus on producing the best embeddable trading engine ever. I will depend on the host platform for price quotes and sending orders to the exchange, at least initially, but will add market data and execution interfaces over time.

Most of the trading products that I'm aware of run on the Windows desktop and are either written using .NET or and are migrating to .NET as we speak. These trading products use C# as their trading language and the differences between them are becoming less and less pronounced. 

I have no intention of slugging it out in the extremely crowded desktop trading space. The embeddable cross-platform trading space, on the other hand, is a great niche. Think unattended execution of trading strategies, grid-based analysis of massive volumes of market data and other mouth-watering goodies. 

The main issue to consider is the choice of trading language for the embeddable engine. Just as with C# on .NET, it's a choice determined by the implementation language. A Haskell-like DSL would have been nice but I shudder at the thought of Haskell as a DLL. I'm sorry but I could not resist the poke!

The OCaml syntax is quite rigid, although the LexiFI folks have hacked it to suit their needs. I could use Camlp4 but I had a very unpleasant experience with it. I mean do you dig the <:expr<, $lid:tbl$, $lid:x$. I do not!

I would like to present a translation of the EKam Scalper in AlgoScript. A Lisp by any other name would smell as sweet?

Loading mentions Retweet

Filed under  //   compilers   lisp   trading  
Posted July 31, 2008
// 0 Comments

Mnesia Unlimited

Mnesia is the Erlang embbedded distributed DBMS, that supports high scalability and fault tolerance through replication. Mnesia has been used to great success in all kinds of applications but it's not without limitations. These limitations mainly stem from the underlying ETS and DETS mechanisms used to implement Mnesia tables. 

There have been attempts to hack around Mnesia limitations before, e.g I wrote a Mnesia backend for Amazon S3 just last summer. All these attempts to add external table functionality to Mnesia suffered from being hard to extend, or were closed source implementations. 

Thanks to the Dukes of Erl, I finally got a chance to do a proper Mnesia extension, the one to rule them all! The result is a series of about 50 careful git patches that support all the features of regular Mnesia tables such as indexing, replication, distribution, fragmentation and backup, as well as set or bag semantics. Best of all, this functionality is available to anyone using OTP release R11B5. 

The external table API is supplied as an Erlang behavior so all you need to do is supply a callback module that conforms to it. 

You can now use Mnesia as a front-end to disk-based hash tables like SleepyCat/BerkeleyDB or Tokyo Cabinet, a memory-mapped table or MySQL.

You are no longer constrained by the scalability or size limitations of ETS and DETS so go wild and let me know what you come up with. Oh, and please petition the Erlang/OTP team to include this work in the official Erlang distribution!

Mnesia 4.3.5 with the extension mechanism is available from Google Code and so is the Tokyo Cabinet external table.

Loading mentions Retweet

Filed under  //   erlang   hacks   mnesia  
Posted June 23, 2008
// 2 Comments