Posterous
Joel is using Posterous to post everything online. Shouldn't you?
Dsc_5799_-_version_2__1__thumb
 

Tenerife Skunkworks

Boldly going where few have gone before

Optimizing Erlang: Array vs Tuple

This is the first in my series of posts on optimizing Erlang. I plan to tackle optimizing Mnesia, profiling and scalability.

You need to use arrays of up to 10,000 elements. Erlang offers you tuples as well as fixed-size and extensible pseudo-arrays. What is the fastest option?

Let us start the death match by pitting arrays against tuples in a death match. Trees were an option before the array module became available, so lets throw in trees just for laughs.

I'm running Mac OSX Leopard 10.5.4 on a Mac Pro 2x2.8Ghz Quad-Core Intel Xeon with 14Gb 800Mhz DDR2 FB-DIMM.

Erlang (BEAM) emulator version 5.6.3 [source] [64-bit] [smp:8] [async-threads:0] [kernel-poll:false]

27> arr:test().
Fixed-size array: get:     2921us, set:     5902us
Extensible array: get:     3336us, set:     8144us
Tuple:            get:      632us, set:   107467us
Tree:             get:     4321us, set:    45256us
ok

30> arr:test(100000).
Fixed-size array: get:    35314us, set:    74653us
Extensible array: get:    35349us, set:    74059us
Tuple:            get:     6411us, set: 24304490us
Tree:             get:    53681us, set:   632795us
ok

Note that timer:tc returns time in microseconds. I ran each test 3 times and the results above are from the third iteration.

Trees in Erlang (gb_trees) are built on top of regular tuples and so is the array module. The array module is much more efficient about using tuples than a regular tree, though, and this is the reason why it's so much faster.

The tuple test pre-allocates a tuple of 10k or 100k elements. There's no destructive assignment in Erlang and so the same large tuple needs to be allocated and discarded on every set operation. It's very inefficient to allocate and discard a large tuple on every set operation, thus naive tuple set is very slow.

The array module uses an efficient tree-like internal representation:

%% A tree is either a leaf, with LEAFSIZE elements (the "base"), an
%% internal node with LEAFSIZE+1 elements, or an unexpanded tree,
%% represented by a single integer: the number of elements that may be
%% stored in the tree when it is expanded. The last element of an
%% internal node caches the number of elements that may be stored in
%% each of its subtrees.
%%
%% Note that to update an entry in a tree of height h = log[b] n, the
%% total number of written words is (b+1)+(h-1)*(b+2), since tuples use
%% a header word on the heap. 4 is the optimal base for minimizing the
%% number of words written, but causes higher trees, which takes time.
%% The best compromise between speed and memory usage seems to lie
%% around 8-10. Measurements indicate that the optimum base for speed is
%% 24 - above that, it gets slower again due to the high memory usage.
%% Base 10 is a good choice, giving 2/3 of the possible speedup from
%% base 4, but only using 1/3 more memory. (Base 24 uses 65% more memory
%% per write than base 10, but the speedup is only 21%.)

It's far more efficient to allocate small tuples on every set and this is why the array module wins hands down. 

Use the code below to replicate my results on your hardware.

    

Filed under  //   erlang   performance  
Posted August 25, 2008
// 1 Comment

Hacking the Mac OSX Unified Buffer Cache

Files read and written get cached in the Unified Buffer Cache (UBC) on Mac OSX.
The UBC was hindering me because I was processing a huge file in chunks but throwing out each chunk, never to be reused again, after writing out the processed chunk. I would see gigabytes of memory get eaten away by the UBC until the system started swapping and became unresponsive.

UBC cannot be limited to a given maximum amount of memory.

UBC cannot be inspected programmatically.

UBC can be cleared by running 'purge' which allocates a lot of memory to force the cache to clear. The following bit of code can be used turn caching off for a particular file:

 
fcntl(fd, F_GLOBAL_NOCACHE, 1)

This can be done in any process and the file can be closed after. The setting persists through out the lifetime of the file. If the file is removed and re-created then the setting is lost.

How can you tell if the setting took hold and the file is indeed NOT being cached?

 
dd if=/db1/cdr.csv bs=1m count=1024 of=/dev/null 
1073741824 bytes transferred in 12.030688 secs (89250242 bytes/sec) 
 
dd if=/db1/cdr.csv bs=1m count=1024 of=/dev/null 
1073741824 bytes transferred in 11.867947 secs (90474101 bytes/sec) 
 
dd if=/db1/cdr.csv bs=1m count=1024 of=/dev/null 
1073741824 bytes transferred in 12.037562 secs (89199278 bytes/sec) 

Tried reading the file three times. Speed is about the same.

What about a regular file that's cached by default?

 
dd if=/db1/cdr1.csv bs=1m count=1024 of=/dev/null 
1073741824 bytes transferred in 11.505857 secs (93321325 bytes/sec) 
 
dd if=/db1/cdr1.csv bs=1m count=1024 of=/dev/null 
1073741824 bytes transferred in 0.500468 secs (2145475416 bytes/sec) 

Notice that reading from the cache is much faster the second time around.

Kudos to Dominic Giampaolo from Apple for explaining all this to me!

Filed under  //   hacks   mac   performance  
Posted March 4, 2008
// 1 Comment

memcpy in Lisp

The title is not correct since I'm trying to copy an array of pixels x 3 into an array of pixels x 4 while setting every 4th byte from another array of pixels. I tried different forms of optimizing things in LispWorks and always used these optimization settings:

(eval-when (compile)
  (hcl:toggle-source-debugging nil))

(eval-when (compile load eval)
  (defvar *optimize* '(optimize (safety 0)
                       (space 0)
                       (debug 0)
                       (float 0)
                       (hcl:fixnum-safety 0)
                       (speed 3))))

opengl:gl-vector-aref is actually a macro that expands to fli:dereference ... :index ...

Fastest code:

It positively beats the other two approaches and for some reason has 0 memory allocations while the two examples below have about 150 each. The code is specific to LispWorks.

Slow approach #1:

Slow approach #2 (mixes up the bytes someplace):

Filed under  //   lisp   lispworks   performance  
Posted March 5, 2006
// 0 Comments