Tenerife Skunkworks

Trading and technology

Parsing Text and Binary Files With Erlang

Erlang originated in the telecommunications industry where one of the major tasks is conversion of text and binary data from one format to another. This is a task that Erlang excels at!

Parsing text and binary data is something that you will be doing very often in the course of writing your super-scalable internet servers so lets take a look at some efficient approaches to parsing text and binary data.

Strings vs binaries

Erlang does not have a built-in string data type. Strings are simulated on top of lists of integers. In a 32-bit Erlang virtual machine (VM) an integer is 4 bytes and we need 4 more bytes for a pointer to the next element of the list, for a total of 8 bytes per “character”. In 64-bit VM this number doubles.

Why should you care?

Each network connection to your servers will require certain amount of memory to send and receive data. A lot of protocols used on the internet, such as XML, are text and quite verbose at that. Imagine receiving a 10 kilobyte XML message, for example. Converting this message to a string for processing will inflate its size to 80K or 160K respectively.

When network connections to your server number in the thousands, it becomes necessary to minimize the amount of memory each connection requires for processing data that is sent and received. Any received message will also become garbage once it’s converted to an Erlang data structure and this garbage will need to be collected. The less garbage we generate, the less work the garbage collector has to do and the more responsive our server will become.

Lets keep binary data we receive from the network as binary data and avoid converting it to strings. Parsing of binary data is specially fast and convenient with special syntax for constructing binaries and matching binary patterns as well as bit strings and binary comprehensions. String processing enjoys no such advantage.

Remember that all Erlang input and output functions can deal with binary data. Any program that sticks to binary data processing will work much faster than a similar program that converts binary data to strings for processing!

That said, lets hammer a final nail into the Erlang string coffin and look at how we can process text as binary data.

Processing text files as binaries

Suppose we have a comma-delimited text file to parse. We need to split each line into a list of fields and collect our lines into a list. The file is not too large so we can afford to load it into memory in one fell swoop. This is our code in a nutshell.

1
2
3
4
5
6
7
8
9
-module(act).
-compile([export_all]).

parse(Filename) when is_list(Filename) ->
    {ok, Bin} = file:read_file(Filename),
    parse(Bin).

parse(Bin) when is_binary(Bin) ->
    parse(Bin, [], [], []).

Note that the empty lists are the initial values for our accumulators. Field is the list of characters that represents the current field we are processing. Line is the list of fields we have gathered while processing the current line. Finally, Acc is the list of lines we have gathered while processing our file.

We also have two functions named parse here: one that takes a Filename string (a list of integers) and another that takes a binary. Keeping the function name the same and using guards is strictly a matter of taste. I could have just as well called the second parse function parse1 or do_parse.

Erlang functions are distinguished based on their number of arguments and there’s a special fun_name/num_args notation to reflect that. Our parse function above would be written as parse/1. When the number of arguments is the same, the functions are distinguished based on guards such as is_list or is_binary above.

It’s good coding practice to present a neat interface to the outside world. parse/1 above would serve that purpose and parse/4 below would not. parse/1 takes just one argument whereas parse/4 clutters the interface with three extra arguments. As a user of the parsing module I would not know the purpose of the Field, Line and Acc arguments to parse/4 below.

1
2
parse(<<$\,, Rest/binary>>, Field, Line, Acc) ->
    parse(Rest, [], [lists:reverse(Field)|Line], Acc);

$_, matches the tab character and Rest matches the rest of the binary. It’s quite fast to add to the beginning of the list but the accumulated list needs to be reversed when we are done. Also, we almost certainly have accumulated a field by the time we hit the tab. This is why we reverse the field accumulator, prepend it to the accumulated list of fields and start our next iteration (recurse) with an empty field accumulator.

1
2
parse(<<$\r, Rest/binary>>, Field, Line, Acc) ->
    parse(Rest, Field, Line, Acc);

Just in case we have been given a file produced on Windows, we skip the carriage return character ($\r) and start on our next recursive iteration.

1
2
3
4
parse(<<$\n, Rest/binary>>, Field, Line, Acc) ->
    Field1 = lists:reverse(Field),
    FieldList = [Field1|Line],
    parse(Rest, [], [], [FieldList|Acc]);

A new line ($\n) means that we have hit the end of our current line and need to start processing a new one. We start this processing with empty field and line accumulators and prepend the list of fields to the lines accumulator Acc.

1
2
parse(<<Char, Rest/binary>>, Field, Line, Acc) ->
    parse(Rest, [Char|Field], Line, Acc);

We prepend any character not matching our field or line delimiters to the field accumulator and keep going.

1
2
parse(<<>>, [], [], Acc) ->
    {ok, lists:reverse(Acc)};

We do run out of binary data to process at some point in time and detect this fact by matching the empty binary <<>>. If our field and line accumulators are empty then we are done. We do need to reverse the list of lines that we have accumulated to return them in their original order.

1
2
parse(<<>>, Field, Line, Acc) ->
    parse(<<$\n>>, Field, Line, Acc).

What do we do if our accumulators are not empty by the time we are done processing? We can add custom processing code, of course, but wouldn’t it be better to leverage code that we have already written? We already go through the proper motions when we find a new line so to make our job easier we make it look like we found one. We continue parsing by creating a small fake binary with a single new line character.

Processing binary data the hard way

You did not misread it! Yes, the customary way of processing Erlang binaries is the hard way. It’s low level and involves lots of typing and a good deal of code duplication. It’s also the fastest and most efficient way.

I will show you a less efficient but more structured way to process binary data later in this chapter. We need to learn to walk before we learn to run so lets take a look at how you normally process binary data in Erlang.

Here’s a chunk of the binary protocol that my OpenPoker server uses.

Packet format:

1
2
3
4
0  1   2        N
+--+---+--- ... +
| Size | Body   |
+------+--- ... +

Body:

1
2
3
4
0      1           N
+------+--- ... ---+
| Type | Arguments |
+------+--- ... ---+

Each packet starts with a 2-byte packet size then a 1-byte packet type and the data payload. The body of a NOTIFY_JOIN command will then look like this:

1
2
3
4
0    1     5     9      10    12
+----+-----+-----+-------+-----+
| 21 | GID | PID | Seat# | Seq |
+----+-----+-----+-------+-----+

GID and PID are 4-byte integers, the seat number is a byte and the sequence number a 2-byte integer. We need one function to read this command from a binary packet and return something easy to deal with, e.g. a tuple.

1
2
read(<<?PP_NOTIFY_JOIN, GID:32, PID:32, SeatNum, Seq:16>>) ->
    {21, GID, PID, SeatNum, Seq};

To send the command out through the socket, we first need to convert the tuple to a binary.

1
2
3
4
5
6
7
write({21, GID, Player, SeatNum, Seq})
when is_number(GID),
     is_pid(Player),
     is_number(SeatNum),
     is_number(Seq) ->
    PID = gen_server:call(Player, 'ID'),
    <<21, GID:32, PID:32, SeatNum, Seq:16>>;

There are scores of commands in the OpenPoker protocol and the number will grow as new functionality is added. I did write code like the above for each and every command in the OpenPoker protocol and I wish I knew a way that enabled more code reuse.

What you should walk away with here is that reading and writing binary data in Erlang is simple and straightforward.

Pickler combinators

The OpenPoker protocol handling code is quite verbose. The protocol is also very much flat as it does not involve nested data structures. There is a way to describe reading and writing of structured data and generally save ourselves time and typing.

Andrew Kennedy coined the term pickler combinator in his 2004 ‘Functional Pearl of the same name. He wrote that

The tedium of writing pickling and unpickling functions by hand is relieved using a combinator library similar in spirit to the well-known parser combinators. Picklers for primitive types are combined to support tupling, alternation, recursion, and structure sharing.

Andrew Kennedy’s implementation used SML and is an Erlang book. We are still still functional programmers, though, so let’s see what we can do…

1
2
3
4
5
6
7
8
9
-module(pickle).

-export([pickle/2, unpickle/2, test/0]).
-export([byte/0, short/0, sshort/0,
   int/0, sint/0, long/0, slong/0]).
-export([list/2, choice/2, optional/1, wrap/2,
   tuple/1, record/2, binary/1, wstring/0]).

-compile([export_all]).

Let’s design and implement a pickling module that will save us a lot of typing down the road. The goal is to save us a lot of typing and enable us to describe our packet formats in terms of bytes, words and strings, as opposed to bits and binaries.

1
2
3
4
5
6
7
%%% Pickle and unpickle. We accumulate into a list.

pickle({Pickler, _}, Value) ->
    lists:reverse(Pickler([], Value)).

unpickle({_, Pickler}, Bin) ->
    element(1, Pickler(Bin)).

pickle and unpickle are responsible for doing the work for us and like a good manager they delegate the bulk of the work to their underlings. Note that the pickler combinator is represented by a two-element tuple where the first element is the function used for pickling and the second element for unpickling.

To pickle any Erlang term, we give pickle the tuple representing the pickler combinator as well as the value. Binaries, while convenient, don’t lend themselves to accumulating values so we accumulate the pickled data into a list. Fortunately for us, the Erlang input/output system can take lists and convert them to binaries for us.

To unpickle a binary we invoke the un-pickler on it and take the first element of the resulting tuple. Why does the unpickler return a tuple? We may have data left over from processing and it needs to be stored somewhere. The first element of the tuple stores the result of the unpickle operation and the second stores the remainder of the data.

Without further ado lets implement a pickler combinator for serializing byte data.

1
2
3
4
5
6
7
8
9
10
11
%%% Byte

byte() ->
    {fun write_byte/2, fun read_byte/1}.

write_byte(Acc, Byte) ->
    [<<Byte:8>>|Acc].

read_byte(Bin) ->
    <<Byte:8, Rest/binary>> = Bin,
    {Byte, Rest}.

byte is the name of the combinator and byte/0 simply returns a tuple of the pickler and unpickler functions. This is the pattern that we will be using over and over again. Also, picklers take the list serving as accumulator as their fist argument and the Erlang value as their second argument.

To pickle a byte we simply tell Erlang to prepend the binary representation of the byte <<Byte:8>> to the accumulator list. To unpickle a byte, read_byte/1 splits the binary into the byte itself and the remainder of the data and returns both as a tuple.

Simple, isn’t it?

A pickler combinator for unsigned short values stored in little-endian format looks like this.

1
2
3
4
5
6
7
8
9
10
11
%%% Unsigned short

short() ->
    {fun write_short/2, fun read_short/1}.

write_short(Acc, Word) ->
    [<<Word:16/little>>|Acc].

read_short(Bin) ->
    <<Word:16/little, Rest/binary>> = Bin,
    {Word, Rest}.

Signed short is implemented similarly, the only difference is that we use little-signed instead of little.

1
2
3
4
5
6
7
8
9
10
11
%%% Signed short

sshort() ->
    {fun write_sshort/2, fun read_sshort/1}.

write_sshort(Acc, Word) ->
    [<<Word:16/little-signed>>|Acc].

read_sshort(Bin) ->
    <<Word:16/little-signed, Rest/binary>> = Bin,
    {Word, Rest}.

I will skip the implementation of signed and unsigned integers and long integers. You can find the full code at the end of this book. It’s much more interesting to ponder the design of a list combinator.

1
2
3
4
5
6
%%% List. We supply a pickler for list length 
%%% as well as a pickler for list elements.

list(Len, Elem) ->
    {fun(Acc, List) -> write_list(Len, Elem, Acc, List) end,
     fun(Bin) -> read_list(Len, Elem, Bin) end }.

To pickle and unpickle a list we need to pickle the length of the list and then pickle each element of the list in sequence. The only requirement for the picker and unpickler functions is to take certain arguments and return values in the format that the library expects.

list/2 returns a tuple of anonymous functions that follow this convention. The functions that are ultimately invoked look a bit different.

1
2
3
4
write_list({Len, _}, {Elem, _}, Acc, List) ->
    Acc1 = Len(Acc, length(List)),
    Fun = fun(A, Acc2) -> Elem(Acc2, A) end,
    lists:foldr(Fun, Acc1, List).

write_list/4 takes both the pickler for the length of the list and the pickler for each element. Remember that each pickler returns a list of pickled binary chunks. The length pickler itself is primed with the list of previously pickled chunks given to write_list/4 (Acc). Acc1 is the list of binary chunks resulting from the pickling of the list length (Len) together with the list of chunks accumulated before the call to write_list/4.

I won’t be explaining lists:foldr/3 but we are creating an anonymous function (Fun) that will be invoked for each element of the to-be-pickled list we are given (List). This anonymous function will then invoke the list element pickler and return the list of binary chunks we have pickled so far, to prime list:foldr/3 on its next iteration.

Reading a pickled list aka unpickling is, again, a two step operation.

1
2
3
read_list({_, Len}, {_, Elem}, Bin) ->
    {N, Bin1} = Len(Bin),
    read_list(N, [], Elem, Bin1).

First we unpickle the list length. This gives us both the number of elements to read and the remainder of our binary data. We then proceed to unpickle the list itself using the data that was left over.

1
2
3
4
read_list(0, Acc, _, Bin) -> {Acc, Bin};
read_list(N, Acc, Elem, Bin) ->
    {E, Bin1} = Elem(Bin),
    read_list(N - 1, [E|Acc], Elem, Bin1).

There are no for loops in Erlang but they are easily emulated with recursion. We unpickle each element of the list, accumulate it and proceed to unpickle the next element after decrementing the number of elements read. We finish once we have reached 0.

Often, we will need to store different data depending on some flag. This requires us to both store the value of the flag and store the data. The choice combinator implements this alternative selection.

1
2
3
4
5
6
7
8
9
%%% Alternative selection. This could probably use some
%%% deeper thinking. Otherwise, we take a pickler for the tag
%%% as well as a tuple of two functions. The first one
%%% returns the tag value and a pickler based on the supplied
%%% value. The second one selects a pickler based on a tag value.

choice(Tag, Choice) ->
    {fun(Acc, Value) -> write_choice(Tag, Choice, Acc, Value) end,
     fun(Bin) -> read_choice(Tag, Choice, Bin) end }.

There’s not much to say about choice itself. It looks a lot like the list combinator that we have just implemented.

1
2
3
4
5
6
write_choice({Tag, _}, {Choice, _}, Acc, Value)
  when is_function(Tag),
       is_function(Choice) ->
    {T, {Pickler, _}} = Choice(Value),
    Acc1 = Tag(Acc, T),
    Pickler(Acc1, Value).

The choice helper function (selector) analyzes the data we are pickling and returns both a tag to store and the combinator to use for the data. We serialize the tag and the value right after it.

1
2
3
4
5
6
read_choice({_, Tag}, {_, Choice}, Bin)
  when is_function(Tag),
       is_function(Choice) ->
    {T, Bin1} = Tag(Bin),
    {_, Pickler} = Choice(T),
    Pickler(Bin1).

We work in the opposite direction to unpickle. We first unpickle the tag and let the choice function (selector) give us the combinator to use for the data. We use this combinator to unpickle the data.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
%%% Optional value. Use 'none' to indicate no value.

optional(Pickler) ->
    {fun(Acc, Value) -> write_optional(Pickler, Acc, Value) end,
     fun(Bin) -> read_optional(Pickler, Bin) end}.

write_optional(_, Acc, none) ->
    [<<0>>|Acc];

write_optional({Pickler, _}, Acc, Value) ->
    Pickler([<<1>>|Acc], Value).

read_optional({_, Pickler}, Bin) ->
    <<Opt:8, Bin1/binary>> = Bin,
    case Opt of
      0 -> {none, Bin1};
      _ -> Pickler(Bin1)
    end.

To serialize optional values (i.e. value or nothing) we pickle 0 when there’s no value and 1 otherwise.

The value of the wrapper combinator may not be readily apparent but bear with me, it will come in handy to implement serialization of enumerations!

1
2
3
4
5
6
7
8
9
10
11
12
13
14
%%% Wrapper. Take a pickler and a wrapper tuple of two functions
%%% where the first one is used to convert the value before 
%%% pickling and the second one after unpickling.

wrap(Wrap, Pickler) ->
    {fun(Acc, Value) -> write_wrap(Wrap, Pickler, Acc, Value) end,
     fun(Bin) -> read_wrap(Wrap, Pickler, Bin) end}.

write_wrap({Wrap, _}, {Pickler, _}, Acc, Value) ->
    Pickler(Acc, Wrap(Value)).

read_wrap({_, Wrap}, {_, Pickler}, Bin) ->
    {Value, Bin1} = Pickler(Bin),
    {Wrap(Value), Bin1}.

This combinator resembles the list combinator we implemented previously but instead of taking a combinator used to serialize the list length, it takes a helper function used to transform the data before pickling and after unpickling.

Erlang does not support enumerations but I want to have enumerations where values increase sequentially such as {cow, sheep, horse}. I also want to be able to explicitly assign a value to each element of the enumeration, e.g. [{cow, 10}, {sheep, 100}]. Finally, the whole point of the exercise is to be able to marshal these enumerations back and forth.

We will assume that enumerated values given as a tuple start from 1.

1
2
enum(Enum, Pickler) ->
    wrap(wrap_enum(Enum), Pickler).

The enum/2 combinator takes both the format of the enumeration (list or tuple) as well as the pickler for the enumeration value. It’s clear in its simplicity but it’s also clear that it’s hiding something!

What is wrap_enum/1 and why do we need it?

1
2
3
4
5
6
7
wrap_enum(Enum)
  when is_tuple(Enum) ->
    wrap_enum_1(prep_enum_tuple(Enum));

wrap_enum(Enum)
  when is_list(Enum) ->
    wrap_enum_1(prep_enum_list(Enum)).

Recall that the format of our enumeration can be given both as a tuple and a list. We will pre-process the enumeration format to convert it to a list when it’s given to us as a tuple.

1
2
3
4
5
6
7
8
9
10
11
prep_enum_tuple(Enum)
  when is_tuple(Enum) ->
    prep_enum_tuple(Enum, size(Enum), [], []).

prep_enum_tuple(_, 0, Acc1, Acc2) ->
    {Acc1, Acc2};

prep_enum_tuple(Enum, N, Acc1, Acc2) ->
    prep_enum_tuple(Enum, N - 1,
        [{element(N, Enum), N}|Acc1],
        [{N, element(N, Enum)}|Acc2]).

The above will convert {cow, sheep, horse} into a pair (tuple of two elements) of lists [{cow, 1}, {sheep, 2}, {horse, 3}] and [{1, cow}, {2, sheep}, {3, horse}]. We need the regular list to convert cow into 1 for pickling and the inverted list to convert 1 back into cow.

1
2
3
4
5
6
prep_enum_list(Enum)
  when is_list(Enum) ->
    % expect a list of {tag, #value} pairs
    Inv = fun({Key, Val}) -> {Val, Key} end,
    InvEnum = lists:map(Inv, Enum),
    {Enum, InvEnum}.

The only thing we need to do when the format of the enumeration is a list is to create the inverted list of pairs. The anonymous function Inv swaps {cow, 100} for {100, cow}. We use lists:map/2 to apply our anonymous function to each element of the enumeration specification list, thus inverting each element.

And here’s the ultimate wrapper, the function that we spent so much time gearing up to!

1
2
3
4
5
6
7
8
9
wrap_enum_1({List1, List2}) ->
    F = fun(A, B) -> A < B end,
    %% gb_trees needs an ordered list
    Dict1 = lists:sort(F, List1),
    Dict2 = lists:sort(F, List2),
    Tree1 = gb_trees:from_orddict(Dict1),
    Tree2 = gb_trees:from_orddict(Dict2),
    {fun(Key) -> gb_trees:get(Key, Tree1) end,
     fun(Key) -> gb_trees:get(Key, Tree2) end}.

The gb_trees module implements a lightweight hash table using Erlang tuples. We order our enumeration lists and convert them to gb_trees to make lookups simple. We save two trees, one to look up cow using one and another to look one up using cow.

This completes our processing of enumerations. There’s one last hill to climb and then I promise you that we will be in combinator nirvana! But first we need to handle serialization of tuples and Erlang records.

1
2
3
4
5
6
7
%%% Tuple. Uses a tuple of picklers of the same size.

tuple(Picklers)
  when is_tuple(Picklers) ->
    wrap({fun tuple_to_list/1,
    fun list_to_tuple/1},
   tuple_0(tuple_to_list(Picklers))).

We reuse the wraping code we just develop to ensure that each tuple is pickled as a list and that each list is converted back to a tuple after we deserialize it.

1
2
3
4
5
record(Tag, Picklers)
  when is_tuple(Picklers) ->
    wrap({fun(Record) -> record_to_list(Tag, Record) end,
    fun(List) -> list_to_record(Tag, List) end},
   tuple_0(tuple_to_list(Picklers))).

We rely on Erlang records being tuples and just add the record tag as the first element when unpickling the record.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
write_tuple_0([], Acc, _) ->
    Acc;

write_tuple_0([{Pickler, _}|Rest], Acc, [Value|Tuple]) ->
    write_tuple_0(Rest, Pickler(Acc, Value), Tuple).

read_tuple_0(Picklers, Bin) ->
    read_tuple_0(Picklers, Bin, []).

read_tuple_0([], Bin, Acc) ->
    {lists:reverse(Acc), Bin};

read_tuple_0([{_, Pickler}|Rest], Bin, Acc) ->
    {Value, Bin1} = Pickler(Bin),
    read_tuple_0(Rest, Bin1, [Value|Acc]).

We serialize a tuple by requiring a tuple of combinators of the same length.We then convert the tuple of combinators to a list to simplify processing. We pickle and unpickle recursively, using the accumulator idiom.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
%%% It's convenient to be able to convert the tuple
%%% to a list first as there's no erlang:prepend_element/2.

tuple_0(Picklers)
  when is_list(Picklers) ->
    {fun(Acc, Value) -> write_tuple_0(Picklers, Acc, Value) end,
     fun(Bin) -> read_tuple_0(Picklers, Bin) end}.

record_to_list(Tag, Record)
  when is_atom(Tag) ->
    lists:nthtail(1, tuple_to_list(Record)).

list_to_record(Tag, List)
  when is_atom(Tag),
       is_list(List) ->
    list_to_tuple([Tag|List]).

Erlang records are regular tuples where the first element of the tuple is the record tag. When pickling a record we convert the record to a list and drop the first element (tag). When unpickling we prepend the tag. We never store the tag itself since the record combinator knows it.

We’ll skip the implementation of the binary combinator since it’s nothing fancy. You can find the full code at the end of the book.

Testing

Now is a good time to tackle testing.

Erlang macro facilities could definitely be better but they are also nothing to sneeze at! Lets define a couple of macros to help us with writing our unit tests.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
-define(error1(Expr, Expected, Actual),
  io:format("~s is ~w instead of ~w at ~w:~w~n",
      [??Expr, Actual, Expected, ?MODULE, ?LINE])).

-define(match(Expected, Expr),
        fun() ->
    Actual = (catch (Expr)),
    case Actual of
        Expected ->
      {success, Actual};
        _ ->
      ?error1(Expr, Expected, Actual),
      erlang:error("match failed", Actual)
    end
  end()).

We will be using match/2 a lot to make sure that whatever we pickle matches whatever we get after unpickling.

1
2
3
4
check(Pickler, Value) ->
    X = pickle(Pickler, Value),
    Bin = list_to_binary(X),
    unpickle(Pickler, Bin).

This little function is our testing workhorse. It takes a pickler combinator and a value and proceed to pickle a value and unpicle it using the same combinator.

1
2
3
4
5
6
7
test1() ->
    X = 16#ff,
    ?match(X, check(byte(), X)).

test3() ->
    X = -1,
    ?match(X, check(sshort(), X)).

This is what our tests look like. We give check a combinator and a value and use our match macro to check results. I’ll skip a few non-essential tests since you have the full source code at the end of the book. I do want to spend time on other tests, though, since these tests server as the manual and documentation for our pickler combinator library.

1
2
3
test8() ->
    X = "Wazzup!",
    ?match(X, check(list(int(), byte()), X)).

We serialize the length of the list as an integer and each elment as a byte. Erlang strings are nothing but lists of integers but ASCII character values fit within a byte. There’s nothing preventing you from using the int combinator to serialize elements of the list, though.

1
2
3
value2tag(Action)
  when is_list(Action) ->
    {0, list(byte(), byte())};

We want to serialize either a list of bytes where the length of the list is also stored as a byte or a long value. We use 0 as a tag for the list.

1
2
value2tag(_) ->
    {1, long()}.

And store 1 when the following value is a long.

1
2
3
4
5
tag2value(0) ->
    list(byte(), byte());

tag2value(1) ->
    long().

We reverse things when unpickling. If a tag value of 0 is found then we return the list combinator and otherwise return the long one.

1
2
selector() ->
    {fun value2tag/1, fun tag2value/1}.

selector simply combines the above tag convertors together.

1
2
3
4
5
test9() ->
    X1 = "Just testing",
    X2 = 16#ffff,
    ?match(X1, check(choice(byte(), selector()), X1)),
    ?match(X2, check(choice(byte(), selector()), X2)).

We need to make sure that pickling and unpickling both the string (list) and the long value works. And it does!

1
2
3
4
5
test10() ->
    X1 = none,
    X2 = 55,
    ?match(X1, check(optional(byte()), X1)),
    ?match(X2, check(optional(byte()), X2)).

We use none to stand for no value.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
test11() ->
    %% tuple enum
    Enum1 = {cow, sheep, horse},
    {FROM1, TO1} = wrap_enum(Enum1),
    ?match(1, FROM1(cow)),
    ?match(2, FROM1(sheep)),
    ?match(3, FROM1(horse)),
    ?match(cow, TO1(1)),
    ?match(sheep, TO1(2)),
    ?match(horse, TO1(3)),
    %% list enum
    Enum2 = [{cow, 20}, {sheep, 30}, {horse, 40}],
    {FROM2, TO2} = wrap_enum(Enum2),
    ?match(20, FROM2(cow)),
    ?match(30, FROM2(sheep)),
    ?match(40, FROM2(horse)),
    ?match(cow, TO2(20)),
    ?match(sheep, TO2(30)),
    ?match(horse, TO2(40)).

Personally, I found the enumeration combinator the most tedious to describe, probably because there’s so much supporting code. That supporting code needs to be tested and we do it here. wrap_enum works like a charm!

1
2
3
4
5
test12() ->
    Enum1 = {cow, sheep, horse},
    Enum2 = [{cow, 20}, {sheep, 30}, {horse, 40}],
    ?match(cow, check(enum(Enum1, byte()), cow)),
    ?match(sheep, check(enum(Enum2, byte()), sheep)).

Once we know our supporting code works, we can proceed with the rest. Here we test enumerations given as a tuple and a list of key/value pairs. Farm animals galore!

1
2
3
4
test13() ->
    Tuple = {"Joel", 16#ff00, none},
    Spec = {list(byte(),byte()), short(), optional(byte())},
    ?match(Tuple, check(tuple(Spec), Tuple)).

Tuple combinators, remember those? We use a tuple of combinators to pickle a tuple of values. The size of both tuples must be the same but the values can be anything. Here we use a string, a short and an optional byte.

1
2
3
-record(foo, { a, b }).
-record(bar, { c, d }).
-record(baz, { e, f }).

Records are important to an Erlang programmer, even though they are syntactic sugar on top of tuples. I would be quickly left without hair to tear out, if I always had to use element/2 to access tuple elements by number.

We define some records for the purposes of testing.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
test14() ->
    R = #baz{ e = 30, f = "Enough nesting!" },
    R1 = #foo{ a = 10, b = #bar{ c = 20, d = R }},
    Pickler = record(foo, {
           byte(),
           record(bar, {
        int(),
        record(baz, {
           sshort(),
           list(byte(), byte())
          })
             })
          }),
    ?match(R1, check(Pickler, R1)).

Here’s the mother of all tests where we pickle a bunch of nested records with mixed values and restore them back to their original glory. This concludes our presentation of pickler combinators. Please feel free to use them as you see fit and don’t forget to send me any ideas for improvement!

The last thing to note is that the pickler combinator approach is definitely less efficient than processing binaries directly. Still, it is suitable for lots of applications where programming productivity is more important than ultimate efficiency.

Full pickler combinator source code is available.

Comments