Tenerife Skunkworks

Trading and technology

Grand Central Dispatch

Now that Grand Central Dispatch is part of iPhone OS 4.0, I would like to showcase its unique benefits for inter-thread communication.

Passing data from one thread to another is normally a royal pain on the Mac, involving Mach ports and other things I’d rather not talk about. Ever! If your threads use Core Foundation or Cocoa, though, they have a run loop and you can shuffle bits of data back and forth with no trouble whatsoever. You can even do it in C++.

We will look at running a code block in another thread and waiting for it to finish (synchronous execution), as well as running a block in another thread and having it run a callback that we supply (asynchronous execution).

Tracking IO Patterns in Memory-mapped Dynamic Libraries

The Mac OSX dynamic linker uses mmap to load dynamic libraries into memory. The memory range occupied by the libraries is then backed by a set of virtual memory pages, chunks of 4096 bytes each.

Using mmap is efficient because pages are lazily loaded from disk as needed and the virtual memory pager is free to evict them behind the scenes when memory is needed for something else.

Each page is filled with the code of functions that live in the dynamic library, and a page is fetched from disk the first time a call lands in a function stored on it.

Firefox Startup: Where Does Time Go?

The dyld shared cache lives in /var/db/dyld/. The two files of interest are dyld_shared_cache_i386.map (for x86-32) and shared_region_roots/Applications.paths. Both are regular text files. The former shows the contents of the shared cache for the i386 architecture and the latter is what update_dyld_shared_cache inspects.

There’s no prebinding on newer versions of Mac OSX and the dyld shared cache is automatically updated as needed. Tracing Safari disk activity during startup reveals that basically all its dynamic libraries are pulled from the dyld shared cache.

How to Set Up an Ejabberd Cluster on Amazon EC2 in 6 Easy Steps

17 Feb 2009 – Tenerife

1) Edit /etc/init.d/ejabberd

Use

node=`hostname -f`

since `hostname -s` does not work here: distributed Erlang with long names (-name) requires fully-qualified hostnames.

2) Edit /etc/init.d/ejabberd

Use

-name ejabberd@$node

instead of

-sname ejabberd

everywhere. This applies to

-sname ejabberdctl

too.

3) Edit /etc/init.d/ejabberd to add mnesia extra_db_nodes

See the start() function, find the line that says

-detached"

and add the following right above it:

-mnesia extra_db_nodes \"[' ... hostname -f of a running node ... ']\" \

4) Remove the Mnesia db tables

cd /var/lib/ejabberd/spool && rm -f *

5) Edit /etc/ejabberd/ejabberdctl.cfg

Make sure you have this at the very end

ERLANG_NODE=ejabberd@`hostname -f`

6) Make sure your .erlang.cookie files are the same on all nodes

This will work with MySQL. Enjoy!
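Pulled together, the init-script changes from steps 1 to 3 look roughly like this (a heavily abbreviated sketch, not the full script; the node-list placeholder is left as in step 3):

```shell
# /etc/init.d/ejabberd (fragment) -- sketch only
node=`hostname -f`

start() {
    su ejabberd -c "erl \
        -name ejabberd@$node \
        -mnesia extra_db_nodes \"[' ... hostname -f of a running node ... ']\" \
        -detached"
}
```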

Creating Mac Binaries on Any Platform

I’m in love with Forth but there are no commercial Forth environments for Mac OSX. Gforth is a free, fast and portable implementation of ANS Forth, but it requires GCC and does not allow for binary distribution of code that uses foreign functions.

There are two excellent commercial implementations of ANS Forth and both run on Linux. I asked one of the companies if I could port their Forth to the Mac and promptly ended up with a tarball on my lap. There were no C or assembler files; it was all Forth source code.

The proper bootstrapping approach turned out to be generating a Mac kernel on Linux, copying it over to the Mac and using it to compile the rest of the Forth environment. It’s called cross-compiling!

Optimizing Erlang - a Death Match of Arrays and Tuples

25 August 2008 – Tenerife

This is the first in my series of posts on optimizing Erlang. I plan to tackle optimizing Mnesia, profiling and scalability.

Suppose you need arrays of up to 10,000 elements. Erlang offers you tuples as well as fixed-size and extensible pseudo-arrays. What is the fastest option?

Let us start the death match by pitting arrays against tuples. Trees were an option before the array module became available, so let’s throw in gb_trees just for laughs.

I’m running Mac OSX Leopard 10.5.4 on a Mac Pro with two 2.8GHz quad-core Intel Xeons and 14GB of 800MHz DDR2 FB-DIMM memory.


Erlang (BEAM) emulator version 5.6.3 [source] [64-bit] [smp:8] [async-threads:0] [kernel-poll:false]

27> arr:test().
Fixed-size array: get:     2921µs, set:     5902µs
Extensible array: get:     3336µs, set:     8144µs
Tuple:            get:      632µs, set:   107467µs
Tree:             get:     4321µs, set:    45256µs
ok

30> arr:test(100000).
Fixed-size array: get:    35314µs, set:    74653µs
Extensible array: get:    35349µs, set:    74059µs
Tuple:            get:     6411µs, set: 24304490µs
Tree:             get:    53681µs, set:   632795µs
ok

Note that timer:tc returns time in microseconds. I ran each test 3 times and the results above are from the third iteration.

Trees in Erlang (gb_trees) are built on top of regular tuples and so is the array module. The array module is much more efficient about using tuples than a regular tree, though, and this is the reason why it’s so much faster.

The tuple test pre-allocates a tuple of 10k or 100k elements. There’s no destructive assignment in Erlang, so every set operation allocates a fresh copy of the entire large tuple and discards the old one. This is why the naive tuple set is so slow.

The array module uses an efficient tree-like internal representation:

%% A tree is either a leaf, with LEAFSIZE elements (the "base"), an
%% internal node with LEAFSIZE+1 elements, or an unexpanded tree,
%% represented by a single integer: the number of elements that may be
%% stored in the tree when it is expanded. The last element of an
%% internal node caches the number of elements that may be stored in
%% each of its subtrees.
%%
%% Note that to update an entry in a tree of height h = log[b] n, the
%% total number of written words is (b+2)*h, since tuples use a
%% header word on the heap. 4 is the optimal base for minimizing the
%% number of words written, but causes higher trees, which takes time.
%% The best compromise between speed and memory usage seems to lie
%% around 8-10. Measurements indicate that the optimum base for speed
%% is 24 - above that, it gets slower again due to the high memory
%% usage. Base 10 is a good choice, giving 2/3 of the possible speedup
%% from base 4, but only using 1/3 more memory. (Base 24 uses 65% more
%% memory per write than base 10, but the speedup is only 21%.)

It’s far more efficient to allocate small tuples on every set and this is why the array module wins hands down.

Use this code to replicate my results on your hardware.
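A minimal version of the benchmark, assuming a modern OTP (timer:tc/1) and with module, function and output names of my own choosing, might look like:

```erlang
-module(arr).
-export([test/0, test/1]).

%% Time N gets and N sets against each structure, in microseconds,
%% roughly reproducing the tables above.
test() -> test(10000).

test(N) ->
    Seq = lists:seq(0, N - 1),
    Arr = array:new(N),                      % fixed-size array
    Tup = erlang:make_tuple(N, 0),           % pre-allocated tuple
    Tree = lists:foldl(fun(I, T) -> gb_trees:insert(I, 0, T) end,
                       gb_trees:empty(), Seq),
    {ArrGet, _} = timer:tc(fun() -> [array:get(I, Arr) || I <- Seq] end),
    {ArrSet, _} = timer:tc(fun() ->
        lists:foldl(fun(I, A) -> array:set(I, I, A) end, Arr, Seq) end),
    {TupGet, _} = timer:tc(fun() -> [element(I + 1, Tup) || I <- Seq] end),
    {TupSet, _} = timer:tc(fun() ->
        lists:foldl(fun(I, T) -> setelement(I + 1, T, I) end, Tup, Seq) end),
    {TreeGet, _} = timer:tc(fun() -> [gb_trees:get(I, Tree) || I <- Seq] end),
    {TreeSet, _} = timer:tc(fun() ->
        lists:foldl(fun(I, T) -> gb_trees:update(I, I, T) end, Tree, Seq) end),
    io:format("array: get ~pus set ~pus~n"
              "tuple: get ~pus set ~pus~n"
              "tree:  get ~pus set ~pus~n",
              [ArrGet, ArrSet, TupGet, TupSet, TreeGet, TreeSet]).
```

Compile with erlc arr.erl and run arr:test() or arr:test(100000) from the shell.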

Hacking the Mac OSX Unified Buffer Cache

Files read and written get cached in the Unified Buffer Cache (UBC) on Mac OSX.

The UBC was hindering me: I was processing a huge file in chunks, writing out each processed chunk and never touching it again. I would see gigabytes of memory get eaten away by the UBC until the system started swapping and became unresponsive.

Writing a Mac OSX USB Device Driver That Implements SCSI Pass-through

I’ve been on a coding tear since the beginning of this year, when I decided to dump Erlang and focus on all things low-level. I’ve been much happier since, although not much richer. Do you need a Mac OSX device driver written? Talk to me!

In this post I will explain how I wrote a Mac OSX USB device driver for the IntellaSys 24-core CPU on a thumbstick, also known as FORTHdrive. I will skip the parts that are reasonably clear from Apple documentation and focus on the bits I had trouble with. I will also leave two-machine driver and kernel debugging over FireWire for another post.

Parsing Text and Binary Files With Erlang

Erlang originated in the telecommunications industry where one of the major tasks is conversion of text and binary data from one format to another. This is a task that Erlang excels at!

Parsing text and binary data is something you will do very often in the course of writing your super-scalable internet servers, so let’s take a look at some efficient approaches.
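As a taste of why Erlang excels here, the bit syntax can pull apart a made-up length-prefixed frame in a single pattern match (module name and record layout are mine):

```erlang
-module(parse_demo).
-export([frame/1]).

%% Parse a frame laid out as: 16-bit big-endian length, then Len bytes
%% of payload, then whatever is left over in the buffer.
frame(<<Len:16/big, Payload:Len/binary, Rest/binary>>) ->
    {Payload, Rest}.
```

In the shell, parse_demo:frame(<<0,5,"hello","!">>) returns {<<"hello">>, <<"!">>} — the payload and the unconsumed remainder.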

Transparency and Masking in Lispworks CAPI

CAPI is a good cross-platform GUI toolkit and has been used to write apps like Prime Trader. Your apps will have a native look on each platform and you won’t have to do anything special to make it happen. Still, bits of platform-specific code are sometimes required, and one example is alpha-blending.

There’s no support for alpha-blending in CAPI. You can do masking by specifying a transparent color in your images, but this is way too basic to render a poker room, so I have to resort to platform-specific code to make it happen.