Tenerife Skunkworks

Boldly going where few have gone before

Tracking IO patterns in memory-mapped dynamic libraries

The Mac OSX dynamic linker uses mmap to load dynamic libraries into memory. The memory range occupied by the libraries is then backed by a set of virtual memory pages, chunks of 4096 bytes each.

Using mmap is efficient because pages are lazily loaded from disk as needed and the virtual memory pager is free to evict them behind the scenes when memory is needed for something else.

Each page is filled with code for functions that live in the dynamic library and pages are fetched from disk when a call is made to a function in the dynamic library.

Pages should be accessed sequentially for best performance but how do you find out if this is the case? The only way to find out is to plug into the virtual memory manager and track when pages backing a particular dynamic library are paged-in from disk.

This is doable but a tad complicated as it requires access to internal kernel structures. As always, DTrace comes to the rescue! Note that the following DTrace script will only work on Snow Leopard. Let’s take it apart and see how it works…

First we define a an alias for an internal kernel type. This is not necessary but saves on typing.

1 typedef struct nameidata* nameidata_t;

A dynamic library is usually opened with a call to dlopen. The first argument (arg0) is the library path.

1 pid$target::dlopen:entry
2 /arg0/
3 {
4  self->ppath = arg0;
5  self->dylib = 1;
6 }

The Mac OSX dynamic linker (dyld) does not use dlopen on dynamic libraries, though, so we need to do different.

1 pid$target::dyld??loadPhase5*:entry,
2 pid$target::dyld??loadPhase4*:entry
3 /!self->dylib && arg0/
4 {
5  self->path = copyinstr(arg0);
6  self->func = probefunc;
7  self->dylib = 1;
8 }

We want to trigger subsequent probes only if we entered via one of the two entry points above. We need to reset the flag once we return and this is what we do below.

 1 pid$target::dlopen:return
 2 {
 3  self->dylib = 0;
 4 }
 5 
 6 pid$target::dyld??loadPhase5*:return,
 7 pid$target::dyld??loadPhase4*:return
 8 /self->func != 0 && self->func == probefunc/
 9 {
10  self->dylib = 0;
11 }

It often happens that memory we want to access in DTrace is not available at the function entry point. This is what happens sometimes with the dlopen entry probe. The open call is triggered by dlopen so we can copy the file path into kernel space using the self->ppath pointer we saved earlier.

1 /* file name memory should be wired in by now */
2 
3 pid$target::open:entry
4 /self->dylib && self->ppath && self->path == 0/
5 {
6  self->path = copyinstr(self->ppath);
7 }

The Mac OSX virtual memory manager identifies a file by its virtual node (vnode) in the virtual filesystem (VFS) layer. We need to somehow match up the dynamic library name with its vnode and this is where the vn_open_auth entry probe comes in.

Note that unlike the DTrace pid$target provider we used before, we use the Function Boundary Tracing (FBT) provider since the virtual filesystem layer lives in the kernel.

1 fbt::vn_open_auth:entry
2 {
3  self->ndp = (nameidata_t)arg0;
4 }

This clause is not absolutely necessary but I’m populating self->curpath as a shortcut, to be used a few times in later probes.

1 /* wait to make sure ndp and vnode are fully populated */
2 
3 fbt::vn_open_auth:return
4 /self->path != 0/
5 {
6  self->curpath = stringof((self->ndp)->ni_pathbuf);
7 }

It’s not obvious how to best match up a vnode with a file name so I cheated and studied the XNU kernel source code which Apple makes available. The vnode is not populated until vn_open_auth returns so we have to wait until it happens fo fetch the path.

We are almost done with our task, just need to save the library name and library path to vnode mappings if our library names match.

 1 /* make sure we are opening the same file */
 2 
 3 fbt::vn_open_auth:return
 4 /self->curpath != 0 && self->path == self->curpath/
 5 {
 6  this->vp = (vnode_t)(self->ndp)->ni_vp;
 7  this->lib = stringof((this->vp)->v_name);
 8  self->lib[this->lib] = self->path;
 9  self->vnode[this->lib] = this->vp; 
10 }

We do need to clean up sometime and so we do.

 1 fbt::vn_open_auth:return
 2 {
 3  self->path = 0;
 4  self->ppath = 0;
 5  self->curpath = 0;
 6  self->ndp = 0;
 7  self->func = 0;
 8  self->dylib = 0;
 9 }

We use the same function boundary tracing (fbt) DTrace provider to track pageins. We print the file offset of the data we are paging in, as well as the size.

 1 fbt::vnode_pagein:entry
 2 {
 3  self->v_name = stringof(((vnode_t)arg0)->v_name);
 4 }
 5 
 6 /* vnode pointers should match but v_name seems more secure */
 7 
 8 fbt::vnode_pagein:entry
 9 /self->lib[self->v_name] != 0/
10 {
11  printf("vnode_pagein: %s, offset: %u, size: %u\n", 
12    self->v_name, arg3, arg4);
13 }

It does look from the output that we are loading multiple pages at the same time, e.g. size 1019904 corresponds to 249 pages. Page access looks quite random, though, which is killing us.

Now that we know what the access pattern is, we should try to first identify the symbols that are being accessed in each set of pages and then make the linker rearrange the code such that page access is more sequential and less random.

Note that you are not likely to see page-ins when running this DTrace script unless you have just restarted your Mac and are running Firefox for the first time. This is because the dynamic libraries will be stored in the Unified Buffer Cache (UBC) after first access and there won’t be any subsequent disk access until they are evicted from the cache.

I wrote about hacking the Unified Buffer Cache before but the same technique does not work with mmap-ed data since the virtual manager and the cache are the same thing. Evicting the libraries would involve allocating at at least twice as much virtual memory as there’s RAM and then touching each page to make sure it’s cached. This is unlikely to correspond to normal use of Firefox, though.

Filed under  //   dtrace   firefox   mmap   optimization  

Where does time go?

The dyld shared cache lives in /var/db/dyld/. The two files of interest are dyld_shared_cache_i386.map (for x86-32) and shared_region_roots/Applications.paths. Both are regular text files. The former shows the contents of the shared cache for the i386 architecture and the latter is what update_dyld_shared_cache inspects.

There’s no prebinding on newer versions of Mac OSX and the dyld shared cache is automatically updated as needed. Tracing Safari disk activity during startup reveals that basically all its dynamic libraries are pulled from the dyld shared cache.

It’s possible to add Firefox (…/Firefox.app/Contents/MacOS/firefox-bin) to Applications.paths and the change will persist across reboots. Unfortunately, only a handful of libraries that Firefox uses are pulled into the cache by update_dyld_shared_cache. I’m speculating that this may have something to do with @executable_path/XUL and friends (otool -L …/firefox-bin).

Safari uses absolute paths to frameworks in /System/Library/Frameworks so I speculate that relative paths are what is preventing XUL and others from going into the cache.

It’s possible to fix relative library paths in a given library, e.g. fix.sh XUL where fix.sh looks like this

 1 #!/bin/bash 
 2 
 3 function dylibs () {
 4   otool -L $1 |grep executable_path|awk '{print $1}'|cut -d"/" -f2
 5 }
 6 
 7 for i in `dylibs $1`
 8 do
 9         install_name_tool -change @executable_path/$i `pwd`/$i $1
10 done
11 
12 install_name_tool -id `pwd`/$1 $1

Firefox has to be recompiled with LDFLAGS=-header-pad_max_install_names__ in MOZCONFIG_ to make this happen since new library paths are greater than the space allocated in the Mach-O binary. See the man page for install_name_tool for details.

It’s possible to force dynamic libraries into the cache by putting dynamic library paths into shared_region_roots/Applications.paths instead of executables. I wasn’t successful in caching XUL, though, regardless of what I did. XUL is the Firefox dynamic library, it doesn’t even have a dylib extension.

In the end it doesn’t seem to matter since there’s a baffling lack of difference between Firefox and Safari cold stats, despite Safari pulling everything from the cache and Firefox using a large number of non-cached dylibs.

Here’s the cold startup stats for Safari

 1 sync && purge && DYLD_PRINT_STATISTICS=1 /Applications/Safari.app/Contents/MacOS/Safari
 2 total time: 696.9 milliseconds (100.0%)
 3 total images loaded:  116 (114 from dyld shared cache, 114 needed no fixups)
 4 total segments mapped: 5, into 30 pages with 10 pages pre-fetched
 5 total images loading time: 204.9 milliseconds (29.4%)
 6 total rebase fixups:  1,298
 7 total rebase fixups time: 0.1 milliseconds (0.0%)
 8 total binding fixups: 2,476
 9 total binding symbol lookups: 234, average images searched per symbol: 1.6
10 total binding fixups time: 80.5 milliseconds (11.5%)
11 total bindings lazily fixed up: 3 of 901
12 total init time time: 411.3 milliseconds (59.0%)
13 total images with weak exports:  1

and Firefox

 1 total time: 731.2 milliseconds (100.0%)
 2 total images loaded:  106 (93 from dyld shared cache, 56 needed no fixups)
 3 total segments mapped: 49, into 5903 pages with 684 pages pre-fetched
 4 total images loading time: 235.3 milliseconds (32.1%)
 5 total rebase fixups:  149,011
 6 total rebase fixups time: 3.7 milliseconds (0.5%)
 7 total binding fixups: 24,932
 8 total binding symbol lookups: 797, average images searched per symbol: 2.3
 9 total binding fixups time: 149.9 milliseconds (20.5%)
10 total bindings lazily fixed up: 45 of 19,109
11 total init time time: 342.2 milliseconds (46.8%)
12 total images with weak exports:  3

Notice a large and significant difference? Me neither.

The other thing that I cannot explain at the moment is where the rest of the startup time goes, e.g.

1 ./cold.sh startup.d
2 Total: 10001.723521ms

So it took less than 1 second to dynamically link Firefox but where did the other 9 seconds of startup go?

cold.sh is rather simple

1 #!/bin/bash
2 
3 cmd="./Minefield.app/Contents/MacOS/firefox-bin -no-remote -foreground -P 2"
4 
5 sync && purge && dtrace -x dynvarsize=64m -x evaltime=exec -c "$cmd" -wZs $1

and the startup.d script doesn’t do much either

 1 #pragma D option quiet
 2 
 3 BEGIN
 4 {
 5  start = timestamp;
 6 }
 7 
 8 /* stop tracing here */
 9 
10 mozilla$target:::main-entry
11 {
12  exit(0);
13 }
14 
15 END
16 {
17  this->total = timestamp - start;
18  printf("Total: %u.%06ums\n", this->total / 1000000, this->total % 1000000);
19 }

main-entry is my static probe, built into the Firefox source code. It fires once the main Firefox function, XRE_main, is entered.

I don’t have an explanation yet but 9 seconds is a very large difference but it just may be DTrace because of similar cold startup timings for Safari and Firefox.

Filed under  //   dyld   firefox   mac   optimization   safari  

Faster Mac Firefox

I started at Mozilla this Monday and my focus is on making Firefox run faster which includes startup and page rendering.

I thought I’d start by checking dynamic linking time of Firefox and Safari.

My testbed consists of

  • 3 years old 2.16Ghz CoreDuo MacBook Pro, 2Gb of 667Mhz DDR2 memory, 100Gb 7200 RPM hard drive
  • recent 2.93Ghz Core2Duo MacBook Pro, 4Gb of 1067Mhz DDR3 memory, 256Gb Apple SSD

Running on the latter machine, I have

 1 total time: 5.8 seconds (100.0%)
 2 total images loaded:  116 (114 from dyld shared cache, 114 needed no fixups)
 3 total segments mapped: 5, into 30 pages with 10 pages pre-fetched
 4 total images loading time: 2.4 seconds (42.5%)
 5 total rebase fixups:  1,298
 6 total rebase fixups time: 0.1 milliseconds (0.0%)
 7 total binding fixups: 2,444
 8 total binding symbol lookups: 236, average images searched per symbol: 1.6
 9 total binding fixups time: 895.5 milliseconds (15.3%)
10 total bindings lazily fixed up: 3 of 889
11 total init time time: 2.4 seconds (42.1%)
12 total images with weak exports:  1

for Safari, and

 1 total time: 5.9 seconds (100.0%)
 2 total images loaded:  105 (92 from dyld shared cache, 56 needed no fixups)
 3 total segments mapped: 49, into 5894 pages with 684 pages pre-fetched
 4 total images loading time: 4.4 seconds (74.4%)
 5 total rebase fixups:  148,829
 6 total rebase fixups time: 5.3 milliseconds (0.0%)
 7 total binding fixups: 24,885
 8 total binding symbol lookups: 794, average images searched per symbol: 2.3
 9 total binding fixups time: 350.1 milliseconds (5.9%)
10 total bindings lazily fixed up: 45 of 17,764
11 total init time time: 1.1 seconds (19.5%)
12 total images with weak exports:  3

for Firefox.

From the above I conclude that the dynamic linker is bloody fast, taking minimal time to fix up thousands of bindings (library function pointers, etc.).

I can also conclude that Safari takes twice as long to run its static initializers (total init time).

Dynamic linking is not the bottleneck then, so I must keep digging!

Here are the results for the faster machine and Safari

 1 total time: 623.0 milliseconds (100.0%)
 2 total images loaded:  116 (114 from dyld shared cache, 114 needed no fixups)
 3 total segments mapped: 5, into 30 pages with 10 pages pre-fetched
 4 total images loading time: 216.7 milliseconds (34.7%)
 5 total rebase fixups:  1,298
 6 total rebase fixups time: 0.1 milliseconds (0.0%)
 7 total binding fixups: 2,476
 8 total binding symbol lookups: 234, average images searched per symbol: 1.6
 9 total binding fixups time: 49.2 milliseconds (7.9%)
10 total bindings lazily fixed up: 3 of 901
11 total init time time: 356.9 milliseconds (57.2%)
12 total images with weak exports:  1

as well as Firefox

 1 total time: 792.2 milliseconds (100.0%)
 2 total images loaded:  105 (92 from dyld shared cache, 56 needed no fixups)
 3 total segments mapped: 49, into 5894 pages with 684 pages pre-fetched
 4 total images loading time: 281.0 milliseconds (35.4%)
 5 total rebase fixups:  148,829
 6 total rebase fixups time: 2.6 milliseconds (0.3%)
 7 total binding fixups: 24,885
 8 total binding symbol lookups: 794, average images searched per symbol: 2.3
 9 total binding fixups time: 66.7 milliseconds (8.4%)
10 total bindings lazily fixed up: 45 of 17,764
11 total init time time: 441.8 milliseconds (55.7%)
12 total images with weak exports:  3

Filed under  //   firefox   mac   optimization   safari