Tenerife Skunkworks

Trading and technology

Upgrading Your Erlang Cluster on Amazon EC2


13 Oct 2007 – Tenerife

This article describes how to upgrade an Erlang cluster in one fell swoop once you have deployed it on Amazon EC2.

Why not the Erlang/OTP upgrade procedure

The standard and sanctioned way of deploying and upgrading Erlang applications is described in chapters 10-12 of the OTP Design Principles. Calling the upgrade procedure complex is an understatement.

While bowing to the OTP application packaging procedure, I wanted a way of upgrading applications with a “push of a button”. More precisely, I wanted to be able to type make:all() to rebuild my application and then type sync:all() to push the updated modules to all nodes in my cluster. These nodes were previously set up as “diskless” Amazon EC2 nodes that fetch their code from the boot server, since I didn’t want to reinvent the application packaging wheel.

The sync application

The principal application deployed in the cluster is the “sync” app. This is a gen_server set up according to chapter 2 of the OTP Design Principles. The gen_server handles requests to restart the Erlang node without shutting down, to set environment variables, and to upgrade the code by application or by process. Each sync gen_server joins the ‘SYNC’ distributed named process group, and this is what enables upgrading the whole cluster in one fell swoop.

The sync server will invoke init:restart/0 to restart the node without shutting down upon receiving the RESTART request. This is incredibly handy since the restart sequence takes the contents of the Erlang VM to the trash can and then repeats the same steps taken by the Erlang VM when it is started from the command line. Which is to say that the VM loads the boot file from the boot server, parses the boot file, downloads the applications and runs them. If we have upgraded the code on the boot server then the Erlang VM will run new code after a restart.
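The restart handler itself is tiny. A minimal sketch of such a gen_server clause, assuming a cast message named restart (the actual message name in the sync app may differ):

```erlang
%% Sketch of the sync server's restart handling. The 'restart'
%% message name is an assumption for illustration.
handle_cast(restart, State) ->
    %% Throw away the running VM contents and re-run the boot
    %% sequence, re-fetching the boot file and code from the
    %% boot server.
    init:restart(),
    {noreply, State}.
```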

Upgrading by application or by process

The above procedure is quite intrusive since all apps running in the Erlang VM are killed. Any Erlang node will normally be running a number of apps and you may want to upgrade just one or two of them. This is where the “upgrade by application” procedure comes in.

application:get_application/1 will give you the name of the application that a module belongs to. I build a unique list of applications that my changed modules belong to and then stop each application with application:stop/1, re-load changed modules and start the application with application:start/1.
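A minimal sketch of that procedure might look like this (upgrade_apps/1 is a hypothetical helper name, not the actual sync code):

```erlang
%% Restart every application that owns one of the changed modules.
upgrade_apps(ChangedMods) ->
    %% Unique list of the applications the changed modules belong to.
    Apps = lists:usort([App || Mod <- ChangedMods,
                               {ok, App} <- [application:get_application(Mod)]]),
    lists:foreach(fun application:stop/1, Apps),
    %% Re-load the changed modules while the applications are down.
    [begin code:purge(Mod), {module, Mod} = code:load_file(Mod) end
     || Mod <- ChangedMods],
    lists:foreach(fun application:start/1, Apps).
```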

The “upgrade process by process” procedure first grabs a list of all processes running on the same node as the sync gen_server. It does this by calling processes(). I check whether each process is running the code in one of the modified modules using erlang:check_process_code/2. Next, I suspend the affected processes with erlang:suspend_process/1, re-load the changed modules, resume the processes with erlang:resume_process/1, and I’m done.
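Sketched out (the real sync code may differ in the details), the per-process upgrade looks roughly like this:

```erlang
%% Suspend every process running old code, re-load the changed
%% modules, then let the processes continue. upgrade_procs/1 is
%% an illustrative helper name.
upgrade_procs(ChangedMods) ->
    Affected = lists:usort([Pid || Pid <- processes(),
                                   Mod <- ChangedMods,
                                   erlang:check_process_code(Pid, Mod)]),
    [erlang:suspend_process(Pid) || Pid <- Affected],
    [begin code:purge(Mod), {module, Mod} = code:load_file(Mod) end
     || Mod <- ChangedMods],
    [erlang:resume_process(Pid) || Pid <- Affected].
```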

Reloading modules for fun and profit

I’m still not absolutely sure if I got reloading of changed modules right but it looks like this:


    load_modules([]) ->
        ok;
    load_modules([Mod|T]) ->
        %% Remove the old version of the code, then load the new one.
        code:purge(Mod),
        code:soft_purge(Mod),
        {module, Mod} = code:load_file(Mod),
        load_modules(T).

The need to call code:soft_purge/1 after code:purge/1 was determined empirically.

Everything I have described thus far is small bits of code. The biggest chunk of code in the sync server figures out what modules were modified.

What to reload: Inspecting module versions

Remember my original intent to run make:all/0 followed by sync:all/0 to upgrade all nodes in the cluster at the same time? It’s only possible because 1) you can grab the module version from a module loaded into memory, 2) you can grab the same from a module on disk and 3), crucially, modules are not reloaded when make:all/0 is run.

The module version defaults to the MD5 checksum of the module if no -vsn(Vsn) attribute is specified. For the life of me I can’t remember where Module:module_info() is documented but this is what you use to grab the attributes of the module. It’s a property list so you can use proplists:get_value/2 to grab the vsn property and thus the module version.
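In code, grabbing the in-memory version boils down to a couple of lines (loaded_vsn/1 is an illustrative name):

```erlang
%% Version of a module as currently loaded in this VM. The value
%% of the vsn attribute is a list, typically with a single element.
loaded_vsn(Mod) ->
    proplists:get_value(vsn, Mod:module_info(attributes)).
```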

To take advantage of local processing power, the API initiating the upgrade request does no work apart from inspecting the SYNC distributed named process group and telling each sync gen_server in the group to initiate the upgrade procedure. This means that each module loaded into the Erlang node hosting the sync server needs to be checked for changes.

Grabbing the version of the BEAM file holding the code for a given module is done using beam_lib:version/1. This is complicated by the fact that all of the Erlang EC2 nodes in the cluster download their code from the boot server. Normally, beam_lib:version/1 takes either a module name, a file name or a binary.

I haven’t documented why I’m not using a module name or a file name in the boot server scenario but I must have found them not to work. I had to resort to fetching the module BEAM file from the boot server and inspecting that. Fortunately, traffic between EC2 instances is free and fast and the same applies to your LAN.

To find out if a module is modified I grab the list of loaded modules with code:all_loaded/0 and inspect each module with code:is_loaded/1. I skip preloaded modules (see documentation for code:is_loaded) and use the path returned otherwise to instruct erl_prim_loader:get_file/1 to fetch the BEAM file. I then pass the file contents to beam_lib:version/1 and I have my disk version. After that it’s a simple matter of comparing the two versions and reloading the module if they differ.
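Put together, a sketch of the change detection, assuming the node runs with -loader inet so that erl_prim_loader fetches files from the boot server (helper names are illustrative):

```erlang
%% Return the loaded modules whose BEAM file on the boot server
%% differs from the version in memory.
changed_modules() ->
    [Mod || {Mod, Path} <- code:all_loaded(),
            is_list(Path),                    % skip preloaded modules
            changed(Mod, Path)].

changed(Mod, Path) ->
    case erl_prim_loader:get_file(Path) of
        {ok, Bin, _FullName} ->
            {ok, {Mod, DiskVsn}} = beam_lib:version(Bin),
            LoadedVsn = proplists:get_value(vsn, Mod:module_info(attributes)),
            DiskVsn =/= LoadedVsn;
        error ->
            false
    end.
```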

Setting Up Erlang on Amazon EC2


12 Oct 2007 – Tenerife

This article describes a project that I recently completed for a startup company. The code is proprietary and cannot be published but the company has graciously allowed me to write about my experience.

Why Erlang and Amazon EC2?

There’s no need to introduce the Amazon Elastic Compute Cloud (EC2) since everyone knows about it by now. In essence, EC2 allows you to rent computing power by the hour. That hour is just $0.10, which works out to about $70 per month. The virtual server that Amazon provides is called an instance. The important bit is that you are completely in control of the operating system that the instance runs and the software installed on it.

Amazon lets you run scores of instances at any given time. Major benefits are realized when EC2 instances work as a cluster, though. Think of GoogleBot, a page crawler that indexes your site’s content. Such a crawler would surely benefit from being run on as many machines as possible, all indexing different pages and working in parallel. Once the crawler is finished, you can shut the machines down until next time.

Amazon does not provide tools to cluster your instances or replicate data among them. This is a task that Erlang copes with extremely well, so Amazon EC2 and Erlang are a match made in heaven!

How to set up Erlang on Amazon EC2

How do you start with Erlang and EC2? You need to build a Linux image that runs Erlang upon startup and automatically starts a new Erlang node. This node should then contact an existing Erlang node to join your Erlang cluster.

CEAN is a great way to set up the necessary components of Erlang on your new instance. Set up CEAN and have it install just the Erlang applications that you need. Create a script that will run Erlang when Linux starts. Make sure to adjust $HOME in this script and set $PROGNAME to start.sh in cean/start.sh. Use cean:install/1 to pull in the inets and sasl packages. You will likely need the compiler package as well.

The EC2 API lets you pass arguments to your newly started instance and these arguments can be retrieved by your Erlang code. One of the arguments you absolutely must pass is the name of an existing instance that is already part of your Erlang cluster. By connecting to Erlang running on the existing instance your new node will automatically become aware of the rest of the cluster.
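A sketch of the joining step, assuming the existing node's name is passed as a plain -peer command-line flag (the flag name is made up; any flag readable via init:get_argument/1 will do):

```erlang
%% Connect to an existing cluster member named on the command line,
%% e.g. erl ... -peer existing@10.251.1.2
join_cluster() ->
    case init:get_argument(peer) of
        {ok, [[Peer]]} ->
            %% pong means we are now connected; knowledge of the rest
            %% of the cluster propagates automatically.
            pong = net_adm:ping(list_to_atom(Peer));
        error ->
            ignore
    end.
```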

Upgrading your Erlang code

The software available to your instance is normally part of your instance image. It’s quite cumbersome to rebuild an image every time you deploy a software update, though. It’s much better to push software updates to your instances whenever an update is available. Note that these updates need to be pushed to every instance in your Erlang cluster and reloaded every time an instance restarts. Fortunately, Erlang makes all this easy.

The boot server facility is probably one of the least documented and appreciated pieces of the Erlang infrastructure but one that comes in most handy here. A boot server enables Erlang nodes to fetch their configuration files and code from a central location, over the network. This neatly sidesteps the issue of pushing upgrades to your Erlang cluster. All you need to do is restart your instances one by one and have them fetch new software.

Note that you don’t need to physically restart the EC2 instances themselves. All you need to do is tell your Erlang nodes to reboot without exiting the VM. This is done using init:restart/0.

The boot server

The Erlang boot server lives in the erl_boot_server module and keeps a list of slave hosts authorized to connect to it. You can use erl -man erl_boot_server to read up on the boot server API.

The boot server will not have any hosts authorized to connect to it upon startup. A new EC2 instance that you are starting up needs to be added to the boot server slave list BEFORE you attempt to start a new Erlang node. This is easily accomplished by starting a “controller” node that will issue an RPC call to the boot server and add its own IP address to the boot server’s slave list.

Once the controller node adds the internal Amazon instance address to the boot server authorized slave list, it can start the worker node and safely exit. Now that the boot server knows about the new slave it will allow connection and the worker node will successfully fetch its software from the boot server.
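The controller’s job amounts to a single RPC; a sketch, with example node and host names:

```erlang
%% Run on the controller node: authorize SlaveHost with the boot
%% server, which must already be running on BootNode.
authorize(BootNode, SlaveHost) ->
    ok = rpc:call(BootNode, erl_boot_server, add_slave, [SlaveHost]).

%% e.g. authorize('boot@10.251.0.1', "10.251.1.2").
```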

The boot server and all the slave nodes must share the same Erlang cookie. The cookie is stored in ~/.erlang.cookie. All nodes must also share the same OTP version.

So long as all the nodes are part of the same EC2 security group we should be reasonably secure that no node outside of our group will be able to make use of our boot server. This security is also aided by the requirement that all Erlang nodes in the cluster must use the same cookie to talk to each other. It’s convenient to assign one of the existing instances as a boot server since it will then be within the EC2 security group.

Setting up

I use a script like this to start slave nodes. The path specified is on the boot server.


#!/bin/sh

COOKIE=RRFJBVGLSOUPFLWVEYJP
BOOT=/Users/joelr/work/erlang/sync/ebin/diskless
HOST=192.168.1.33
ID=diskless

erl -name $ID -boot $BOOT -setcookie $COOKIE -id $ID -loader inet -hosts $HOST -mode embedded ${1+"$@"}

You will also need to create a boot file, which must be built with full paths inside (the local option to systools:make_script/2).

To build a boot file I use these two lines of Erlang:


code:add_path("./ebin"). 
systools:make_script("diskless", [local, {outdir, "./ebin"}]).

A script file is made from rel and app files; a boot file is made from the script file.

Note that diskless is the same boot file name that is used in the shell script above. I’m assuming that it lives in ./ebin and so I add it to the code path for make_script to find it.

If everything is done correctly, your new EC2 instances will now fetch their code from the boot server upon startup and whenever you restart the Erlang nodes running on them with init:restart/0.

It may not always be convenient to pull updates from the boot server. I will describe a push facility that I implemented in another post.

HiPE for Mac X86

It started innocently enough with my wondering why High-Performance Erlang (HiPE) was not available on my shiny new MacBook Pro. This turned into a long but awesome email exchange and culminated, less than two weeks later, in victory: my port of HiPE to Mac Intel.

I learned a fair number of things about the Mac OSX kernel, the FPU, floating-point exceptions, Intel 32-bit architecture and SSE2 along the way. I also had an awesome time with assembler code.

I could not have done it without the mach_override library and numerous tips from Mikael Pettersson as well as other folks on the Erlang Questions mailing list. I will be shipping my patches to the HiPE team this week.

Haskell vs Erlang, Reloaded


01 Jan 2006 – Tenerife

On Dec 29, 2005, at 8:22 AM, Simon Peyton-Jones wrote in response to me:


Using Haskell for this networking app forced me to focus on all the issues but the business logic. Type constraints, binary IO and serialization, minimizing memory use and fighting laziness, timers, tweaking concurrency and setting up message channels, you name it.

That’s a disappointing result. Mostly I think Haskell lets you precisely focus on the logic of your program, because lots else is taken care of behind the scenes. You found precisely the reverse.

It’d be interesting to understand which of these issues are

  • language issues
  • library issues
  • compiler/run-time issues

My (ill-informed) hypothesis is that better libraries would have solved much of your problems. A good example is a fast, generic serialization library.

If you felt able (sometime) to distill your experience under headings like the above, only more concretely and precisely, I think it might help to motivate Haskellers to start solving them.

Please browse through the original Haskell Postmortem thread for background info. Then read this post and head to the thread describing my Erlang rewrite experience.

Other threads relevant to my experience, including crashes and Glasgow Haskell Compiler runtime issues, are included at the bottom of this post.

Goals

The goal of my project was to be able to thoroughly test a poker server using poker bots. Each poker bot was to exercise different parts of the server by speaking the poker protocol, which consists of 150+ binary messages. The poker server itself is written in C++ and runs on Windows.

Easy scripting was an essential requirement since customer’s QA techs were not programmers but needed to be able to write the bots. Another key requirement was to be able to launch at least 4,000 poker bots from a single machine.

This app is all about binary IO, thousands of threads/processes and easy serialization. All I ever wanted to do was send packets back and forth, analyze them and have thousands of poker bots running on my machine doing the same. Lofty but simple goal :-). Little did I know!

Summary

I spent a few months writing a poker server in Erlang but fell in love with Haskell after reading Simon’s Composing Financial Contracts paper. When I was offered the chance to write a stress tool for an existing poker server, I thought I would write it in Haskell, since my customer expressed a concern about QA techs having to learn Erlang and the Haskell syntax looked clean and elegant.

Overall, I spent 10-11 weeks on the Haskell version of the project. The end result did not turn out as elegant as I wanted it to be and wasn’t easy on the QA techs. They told me, in retrospect, that the Erlang code was easier to understand and they preferred it.

It took me less than 1 week to rewrite the app in Erlang. It’s the end of that week and I’m already way past the state of the Haskell version. The Erlang code, at 3900 lines of code (LOC) including examples, is about 50% of the Haskell code. 

It’s far easier to rewrite the code when you know the application, of course, but this rewrite did not involve a lot of domain knowledge. I also translated to Erlang the original code in the Pickler Combinators paper.

Issues

I spent the first few weeks of the project coding the packets using [Word8] serialization. This proved to be naive as the app ate HUGE amounts of memory. I didn’t concern myself with applying strictness at that point. 

I ran into a dead end on Windows for some reason. The app seemed to hang frequently when running hundreds of threads on Windows and did it in ways that were distinctly different from Mac OSX. The customer had FreeBSD and agreed to run my app on it.

Running on FreeBSD did not improve things, and that’s when I started looking deeply into strictness optimizations, etc. After 10-11 weeks with this app I was still nowhere near my goals and I had no guarantee that all my issues would be resolved with this tweak or that.

Runtime issues

What threw me off almost right away is the large number of GHC runtime issues that I stumbled upon. I was trying to do serialization and heavy concurrency which I learned to take for granted with Erlang but it turned out that this area of GHC has not been exercised enough. 

Records

Haskell is supposed to be about declarative programming. Haskell programs should look like specifications. Haskell is supposed to be succinct and Succinctness is Power, according to Paul Graham.

One area where this breaks down quickly is records. 

Compare Erlang


  
    -record(pot, {
      profit = 0,
      amounts = []
     }).

with Haskell


  
    data Pot = Pot
        {
         pProfit :: !Word64,
         pAmounts :: ![Word64] -- Word16/
        } deriving (Show, Typeable)

    mkPot :: Pot
    mkPot =
        Pot
        {
         pProfit = 333,
         pAmounts = []
        }

The Haskell version requires twice as many lines of code just to initialize the structures with meaningful defaults. I have 164 records in the program, some of which are rather large. Renaming and using the Haskell accessor functions gets rather tedious after a while, and there’s nothing elegant in having to explain to the customer how xyFoo is really different from zFoo when they really mean the same thing. This might seem like no big deal but, again, I have a lot of records.

I tried creating classes for each “kind” of field and I tried using HList to put these fields together into records. This seems like 1) a terrible hack compared to the Erlang version and 2) not very efficient. I did not get to measuring efficiency with a profiler but I did have GHC run out of memory trying to compile my HList code. SPJ fixed this but I decided not to take it further.

Static typing

The records issue is a language issue just like static typing working against me with events. Part of the reason why the Erlang code is 1/2 the size of the Haskell code is that Erlang is dynamically typed. I just post an event of any type I want. I basically post tuples of various sizes but Haskell requires me to either use Dynamic or define the events in advance. 

Yes, I did retrofit the code with


data Event a = Foo | Bar | Baz a

late in the development cycle but it was a major pain in the rear, especially when it came to my bot monad.

Speaking of monads… There’s not a lot of beauty in this:

type ScriptState b = ErrorT String (StateT (World b) IO)
type ScriptResult b = IO (Either String (), World b)
type Dispatcher b = Event -> (ScriptState b) Status
data Status
    = Start
    | Eat !(Maybe Event)
    | Skip
    deriving Show
instance Show (String, Dispatcher b) where
    show (tag, _) = show tag
runScript :: World b -> (ScriptState b) () -> ScriptResult b
runScript world = flip runStateT world . runErrorT

and then this:

withFilter :: Event
           -> (Event -> (ScriptState b) ())
           -> (ScriptState b) ()
withFilter event fun =
    do w <- get
       let p = trace_filter w
       unless (p event) $ fun event

In fact, I still don’t have anywhere near in-depth knowledge of how to write my own monad.

Erlang is free of side effects (built-in functions aside), just like Haskell, but pattern-matching becomes far easier and the code becomes much smaller when you don’t have to deal with static typing. To wit:

%%% Matching on a tuple
handshake(Bot, _, {udp, _, _, ?SRV_PORT, <<?SRV_ERROR, Code:32>>}) ->
    bot:trace(Bot, 85, "handshake: ~w: Error: ~w~n", [?LINE, Code]),
    erlang:error({handshake_error, Code});
% Matching on a tuple of a different size (records are tuples in Erlang)
handshake(Bot, [_, _, Event], #srv_handshake{}) ->
    Bot1 = bot:pop(Bot),
    bot:post(Bot1, Event),
    {eat, Bot1};
% Yet another tuple
handshake(Bot, Args, X = {tcp_closed, _}) ->
    bot:trace(Bot, 85, "Connection closed during handshake, retrying"),
    Bot1 = retry(Bot, Args, X),
    {eat, Bot1};
Concurrency

Concurrency in Haskell deserves praise, especially when used together with STM. Threads are lightweight (1024 bytes on the heap) and easy to launch, and STM is a beautiful thing. Nothing beats being able to just send yourself a message, though. This is something that you can easily do in Erlang.

Erlang processes (327 bytes starting up, including heap) come with a message queue and you retrieve messages with “selective receive” that uses the same pattern-matching facilities as everything else.

%%% Dispatch event
run(_, {keep_going, Bot})
  when is_record(Bot, bot) ->
    receive
        {tcp, _, <<Packet/binary>>} ->
            Event = unpickle(Bot, Packet),
            run(Bot, handle(Bot, Event));
        {script, Event} ->
            case Event of
                {tables, [H|T]} ->
                    trace(Bot, 95, "Event: {tables, [~w, ~w more]}",
                          [H, length(T)]);
                _ ->
                    trace(Bot, 95, "Event: ~p", [Event])
            end,
            run(Bot, handle(Bot, Event));
        Any ->
            run(Bot, handle(Bot, Any))
    end;

This code just works. It collects network messages, events, timer events, you name it. Posting an event is also easy.

post(Bot, Event) ->
    self() ! {script, Event}.

I tried implementing this scheme using STM.TChan but failed. The best example of this is my logger. The most natural way to implement logging seemed to be by reading from a TChan in a loop and printing out the messages. I launched several thousand threads, all logging to the single TChan. Bummer, I think I ran out of memory.

Follow-up discussions on Haskell-Cafe narrowed the issue down to the logger thread not being able to keep up. I took this for granted and implemented a single-slot logger. This worked and reduced memory consumption drastically but I believe introduced locking delays in other places since threads could only log sequentially.

Erlang provides the disk_log module that logs to disk anything sent to the logger process. The logger can be located anywhere on a network of Erlang nodes (physical machines or VMs) but I’m using a local logger without any major problems so far.
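Using disk_log is a few lines; a sketch, with made-up log and file names:

```erlang
%% Open a disk log, write an arbitrary Erlang term to it, close it.
log_term(Term) ->
    {ok, Log} = disk_log:open([{name, bots}, {file, "bots.log"}]),
    ok = disk_log:log(Log, Term),
    disk_log:close(Log).
```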

Could the difference be due to differences in the scheduler implementation?

The Erlang version of my code has a separate socket reader process that sends incoming packets as messages to the process that opened the socket. This is the standard way of doing things in Erlang. Network packets get collected in the same message queue as everything else. It’s the natural way and the right way.

I tried to do the same with Haskell by attaching a TChan mailbox to my threads. Big bummer, I quickly ran out of memory. The socket readers were quick to post messages to the TChan but the threads reading from it apparently weren’t quick enough. This is my unscientific take on it.

Moving to single-slot mailboxes did wonders to lower memory consumption but introduced other problems since I could no longer send a message to myself from the poker bot thread. The socket reader would stick a packet into a TMVar and then the poker bot code would try to stick one in and block. This caused a deadlock since the bot code would never finish to let the thread loop empty the TMVar.

I ended up creating a bunch of single-slot mailboxes, one for the socket reader, one for messages posted from the poker bot code, one for outside messages like “quit now”, etc. Thanks to STM the code to read any available messages was elegant and probably efficient too but overall the approach looks hackish.

fetch :: (ScriptState b) (Integer, Integer, Integer, (Event))
fetch =
    do w <- get
       liftIO $ atomically $
           readQ (killbox w) `orElse`
           readQ (scriptbox w) `orElse`
           readQ (timerbox w) `orElse`
           readQ (netbox w)

I had to replace this code with some other hack to be able to run retainer profiling since it does not work with STM.

I also had issues with asynchronous exceptions (killThread blocking?), including crashes with the threaded runtime.

Serialization

This horse has been beaten to death by now. I will just say that thinking of Haskell binary IO and serialization makes me cringe. Binary IO is so damn easy and efficient with Erlang that I look forward to it, especially after I wrote the Erlang version of the Pickler Combinators. Please refer to Bit Syntax for more information. I would give an arm and a leg to stick to binary IO in Erlang rather than process XML or other textual messages, just because it’s so easy.

With Haskell I tried reading network packets as a list of bytes, which was elegant but not very efficient. I also tried serialization based on Ptr Word8 and IOUArray. I don’t think there’s a lot of difference between the two efficiency-wise. allocaBytes is implemented on top of byte arrays, for example.

allocaBytes :: Int -> (Ptr a -> IO b) -> IO b
allocaBytes (I# size) action = IO $ \ s ->
    case newPinnedByteArray# size s      of { (# s, mbarr# #) ->
    case unsafeFreezeByteArray# mbarr# s of { (# s, barr#  #) ->
    let addr = Ptr (byteArrayContents# barr#) in
    case action addr    of { IO action ->
    case action s       of { (# s, r #) ->
    case touch# barr# s of { s ->
    (# s, r #)
    }}}}}

I would have preferred serialization on top of byte arrays, since you can inspect them and see the data. There’s no version of Storable for arrays, though. Not unless you use a StorableArray, and then it can only be an array of that instance of Storable.

Inspecting the environment

Erlang has plenty of tools to inspect your environment. You can get the number of processes running, a list of process ids, the state of each process, etc. This is very convenient for debugging.

Other libraries

I can log any Erlang term to disk, store it in a database, etc. This makes my life significantly easier.

Conclusion

I was able to finish the Erlang version 10 times faster and with 1/2 the code. Even if I cut the 10-11 weeks spent on the Haskell version in half to account for the learning curve, I would still come out way ahead with Erlang.

This is due to language issues where static typing and records worked against me. It is also due to the many GHC/runtime issues that I stumbled upon, especially with regard to concurrency, networking and binary IO. Last but not least, it is due to the much better library support on the Erlang side.

I would not have been able to get the Haskell version as far as I did without the enthusiastic support from Haskell-Cafe, #haskell and the Haskell Headquarters. I can’t even imagine one of the chief Erlang designers logging in to my machine to troubleshoot some issues. Simon Marlow did! And that brings up another issue…

Ericsson has a whole team of developers hacking the Erlang distribution all day. I don’t know the size of the team but I would think 10-15 people, maybe more. My understanding is that a separate bigger team hacks away at the OTP libraries. The flagship Ericsson AXD 301 switch has something like 1.7 million lines of Erlang code and the team that worked on it consisted of 100-300 people.

You cannot compare the weight of the biggest telco thrown behind Erlang to the weight of Simon Marlow and Simon Peyton-Jones behind GHC, although the two Simons are without a trace of doubt VERY HEAVY.

I would love to be able to hack away at GHC to bring it on par with Erlang. I’m not dumb and I learn very quickly. Still, it’s probably a losing proposition. Just like specialist and narrowly focused companies dominate their market niches, specialist languages win the day.

Erlang is the specialist language narrowly focused on networked, highly-scalable concurrent applications. I doubt any other language can beat Erlang at what it does. Haskell is also a specialist language. Do I hear cries of disbelief? How come?

Haskell is a specialist language for doing extremely complex things that few outside of the tight-knit Haskell Ph.D. community will understand or (heresy!) even care about. Also, things-probably-could-be-done-much-simpler™. Think Djinn, Zipper-based file server/OS, GADTs, Fundeps, Existential types, Comonads, Delimited Continuations, Yampa.

That said, I love Haskell because it forever twisted my brain into a different shape and I think I’m overall a much better coder now. I also have a much better understanding of why LexiFi was implemented in OCaml while based on the Composing Financial Contracts (Haskell) paper.

Forward-looking statements

I paid my dues. I felt the pain. I fought the fight but did not win. I drank the poison Kool-aid. I still have an itch to try to fit Haskell into trading or some other domain like AI or robotics but it seems to me that some things in Haskell are unnecessarily complex, to the detriment of my productivity.

I started looking at Yampa a few weeks back and my interest picked up significantly after Frag was released. I would like to make the Frag/Quake monsters super-intelligent, for example. Still, looking at Yampa I cannot comprehend why coding robotics has to be so complex.

My bet is that doing the same on top of a highly concurrent architecture and message-passing would be much easier if less declarative. To that end I will try to port Frag to Erlang and see what comes out. I suspect that I will be proven right.

Haskell-Cafe discussions related to my project

Design:

Runtime issues: