Tuesday, March 17, 2009

Lies, Damned Lies, and Benchmarks

Erlang/OTP R13A was released today with a number of major SMP improvements. I've been playing with R13 snapshots for a while and wrote a simple HTTP server to compare the SMP performance on R12 and R13. This server uses {packet, http} to decode requests, increments a counter with a transactional mnesia:read/3 and mnesia:write/1, and responds with the counter's previous value. You'll find the source here.
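The counter step described above could look roughly like the following. This is a minimal sketch, not the benchmark's actual source: the module name counter_sketch, the table name counter, and the key hits are all assumptions made for illustration, and the table is kept in RAM only.

```erlang
-module(counter_sketch).
-export([setup/0, next/0]).

%% Hypothetical record/table layout; the real server's schema is not shown in the post.
-record(counter, {name, value}).

%% Start mnesia and create an in-memory counter table on this node.
setup() ->
    ok = mnesia:start(),
    {atomic, ok} = mnesia:create_table(counter,
        [{attributes, record_info(fields, counter)},
         {ram_copies, [node()]}]),
    ok.

%% Increment the counter inside a transaction and return its previous value.
next() ->
    F = fun() ->
            %% mnesia:read/3 with a write lock, since the row is updated below.
            Prev = case mnesia:read(counter, hits, write) of
                       [#counter{value = V}] -> V;
                       []                    -> 0
                   end,
            ok = mnesia:write(#counter{name = hits, value = Prev + 1}),
            Prev
        end,
    {atomic, Prev} = mnesia:transaction(F),
    Prev.
```

Taking the write lock in the read keeps two concurrent requests from both seeing the same previous value.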

I ran the HTTP server on an x86_64 CentOS 5 machine running Linux 2.6.18-53.el5. The server has two quad-core Intel Xeon E5450 CPUs and 8GB of RAM. Erlang/OTP R12B-5 and R13A were compiled from source and run as erl -pa ebin +SN -s ehttpd start where N indicated the number of schedulers to run.

To get performance numbers I ran ab on another server connected via a 100 Mb/s private VLAN as ab -c N -n 100000 http://10.0.0.32:8889/ where N was the number of concurrent requests. ab was run 3 times for each value of N and the following chart shows the average requests/sec with 4 and 8 schedulers.



R13A's SMP improvements include multiple run queues and improved locking. It also supports binding schedulers to specific CPU cores and hardware threads. Binding isn't enabled by default, so the following chart shows the result of setting erlang:system_flag(scheduler_bind_type, thread_no_node_processor_spread) and running with 100 concurrent requests.
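To check what a running node is actually doing with its schedulers, the values can be read back with erlang:system_info/1. The following is only a small inspection sketch; the module name sched_info is made up for illustration, and it reads the settings rather than changing them.

```erlang
-module(sched_info).
-export([summary/0]).

%% Collect the scheduler count and the current binding settings of this node.
summary() ->
    [{schedulers,        erlang:system_info(schedulers)},
     {schedulers_online, erlang:system_info(schedulers_online)},
     {bind_type,         erlang:system_info(scheduler_bind_type)},
     {bindings,          erlang:system_info(scheduler_bindings)}].
```

On a node where binding has not been enabled, bind_type is expected to report an unbound setting and every entry of bindings to be unbound as well.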



There is a lot missing from these benchmarks: I didn't test kernel polling and only generated load from one client machine. The drop between 500 and 1000 concurrent requests on R13A +S8 looks too steep and may be the result of using ab. That said, the SMP optimizations in R13 are looking very promising!

Tuesday, January 20, 2009

Erlang and PostgreSQL Redux

I spent last weekend implementing encoding and decoding of PostgreSQL's date and time types. With this complete, epgsql is nearing a 1.0 release. Aside from bug fixes I've also implemented support for the 'returning' clause to return rows from an insert, update, or delete. The README file has been updated and includes a section describing how PostgreSQL types are represented in Erlang.

One thing you may notice is a lack of support for connection pooling, which I have remedied with epgsql_pool, available from this mercurial repository:

http://glozer.net/src/epgsql_pool

I debated including connection pools in epgsql itself, but I don't think they are the correct approach in Erlang. Instead I create pools of application-specific database accessor processes which are supervised and hold a persistent connection. This provides both a nice abstraction and the ability to prepare named statements during initialization rather than when a request comes in.

So use epgsql_pool at your own risk; I probably won't be maintaining it =)

Friday, January 2, 2009

Erlang and PostgreSQL

Despite the lack of blog posts, the last four months at heysan have been quite busy. Most of December was spent developing a new web application, entirely in Erlang, which complements our existing service. This application makes extensive use of PostgreSQL 8.3, which eventually led me to resurrect and release my own database driver, available in this mercurial repository:

http://glozer.net/src/epgsql

Note: this is a 'static' HTTP repository, so if you're using a version of mercurial older than 1.1 the URL is static-http://glozer.net/src/epgsql.

Yes, I've inflicted yet-another PostgreSQL driver with an incompatible API (and even conflicting module names!) on the Erlang community. I'm very sorry =) But please, hear me out! I'd like to present my case and hopefully convince you to use my driver. While I must admit implementing PostgreSQL's network protocol was rather fun and entertaining, I did attempt to use the existing drivers before embarking on the project of writing a new one.

pgsql (originally from jungerl) is a small and simple driver, which was also forked by Process One for use in ejabberd. I was able to get up and running quickly with this driver, since it supports the "trusted" auth mechanism and has a simple API. On the other hand pgsql:squery/2 drops null columns from results and pgsql:pquery/3 fails during row decoding when nulls are present. All columns are returned as Erlang lists, which results in a lot of escaping overhead for bytea columns.

psql (originally from Erlang Consulting) is quite the opposite of pgsql with a very complicated API that requires you to configure connection pools using the application environment. This driver is also prone to hanging in various states, for example when trying to use an authentication method other than MD5. On the bright side it does decode column values into native Erlang types, but fails on very simple queries such as: psql:sql_query(C, "select 1, null").

Since neither driver was suitable for my needs, I spent Christmas vacation writing a new one, epgsql, which features a simple API, converts common db types to native Erlang types, and uses PostgreSQL's binary format for common types. For example the SQL statement "select 1" will return a numeric 1, "select true" will return the atom true, and "select 'hi'::bytea" will return a binary.

This type conversion is used in the extended query protocol, i.e. pgsql:equery and pgsql:parse, pgsql:bind, pgsql:execute:


{ok, C} = pgsql:connect("localhost", []),
{ok, Cols, Rows} = pgsql:equery(C, "select 1, null"),
[{column,<<"?column?">>,int4,4,-1,0}, {column,<<"?column?">>,unknown,-2,-1,0}] = Cols,
[{1,null}] = Rows.


The simple query protocol used by pgsql:squery returns values (other than null, which is always the atom 'null') as Erlang binaries:


{ok, C} = pgsql:connect("localhost", []),
{ok, Cols, Rows} = pgsql:squery(C, "select 1, null"),
[{column,<<"?column?">>,int4,4,-1,0}, {column,<<"?column?">>,unknown,-2,-1,0}] = Cols,
[{<<"1">>,null}] = Rows.


Documentation for epgsql is a little sparse at the moment, but the README file covers the basics. I intend to continue developing and supporting this driver so please send me bug reports and enhancement requests!

Sunday, August 31, 2008

Hunting Bugs

Our Erlang gateways were developed and deployed in phases starting with AIM/ICQ, GTalk, Yahoo, and finally MSN. Aside from minor protocol implementation bugs there were no problems and we were very satisfied with stability and performance. However, not long after releasing the MSN gateway we noticed that it used a ton of memory, and also periodically suffered massive spikes in memory use which invoked the Linux kernel's OOM killer. This led to a crash course in debugging running Erlang apps, and a great appreciation for the years of real-world lessons that have influenced the features and design of Erlang/OTP.

The first thing I looked into is why the gateway would eventually use several gigabytes of memory with only a few hundred users online. Since each Erlang process has its own heap, I started by looking for which processes were using the most memory. erlang:processes/0 returns a list of all running processes, and erlang:process_info/1 provides a ton of information about a process including heap use, stack size, etc. So I wrote a quick script to dump the process info of all processes to a file, sorted by total memory use. This was run on the live gateway instance.
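A script along those lines might look like the sketch below. It is not the original script: the module name mem_top is an assumption, it uses erlang:process_info/2 with the memory item instead of dumping everything from process_info/1, and it returns the result rather than writing it to a file.

```erlang
-module(mem_top).
-export([top/1]).

%% Return the N processes using the most memory as {Pid, Bytes} pairs,
%% largest first. A process that exits during the scan makes
%% process_info/2 return 'undefined'; the generator pattern skips it.
top(N) ->
    Usage = [{Pid, Bytes}
             || Pid <- erlang:processes(),
                {memory, Bytes} <- [erlang:process_info(Pid, memory)]],
    Sorted = lists:reverse(lists:keysort(2, Usage)),
    lists:sublist(Sorted, N).
```

Calling mem_top:top(10) from a remote shell attached to the live node gives a quick first look before digging into individual processes with process_info/1.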

It turned out that only a few active MSN sessions were using the majority of the heap, and these sessions were for users with very large contact lists. After initial login, one session could be using > 1GB of heap.

Newer versions of the MSNP protocol use SOAP requests to get authorization tokens, contact lists, allow/block lists, etc. My initial implementation was very simple, using inets to submit the HTTP request, reading the full response body as a list, and then parsing that list with xmerl. These responses could be very large and since the gateway was running on a 64bit Erlang VM, each character would occupy 16 bytes of memory. xmerl's representation of an XML document also requires quite a bit of storage. A simple XML document such as:

<a><b>foo</b><c/></a>
is represented as:

{xmlElement,a,a,[],
    {xmlNamespace,[],[]},
    [],1,[],
    [{xmlElement,b,b,[],
         {xmlNamespace,[],[]},
         [{a,1}],
         1,[],
         [{xmlText,[{b,1},{a,1}],1,[],"foo",text}],
         [],"/tmp/",undeclared},
     {xmlElement,c,c,[],
         {xmlNamespace,[],[]},
         [{a,1}],
         2,[],[],[],undefined,undeclared}],
    [],"/tmp/",undeclared}
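The cost of holding a response body as a list rather than a binary can be measured directly. The sketch below is an illustration only: the module name str_size is made up, and it relies on erts_debug:flat_size/1, an undocumented helper that reports the heap size of a term in machine words.

```erlang
-module(str_size).
-export([compare/1]).

%% Build an N-character string both ways and return
%% {BytesAsList, BytesAsBinary}.
compare(N) ->
    Str = lists:duplicate(N, $a),
    Bin = list_to_binary(Str),
    %% Each list cell takes two words (element + tail pointer).
    ListBytes = erts_debug:flat_size(Str) * erlang:system_info(wordsize),
    %% A binary stores one byte per character, plus a small fixed header.
    {ListBytes, byte_size(Bin)}.
```

On a 64-bit VM the list side comes out at 16 bytes per character, which is the overhead described above.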


So I rewrote my SOAP module to use the streaming method http:request/4 which returns the HTTP response as a series of binary chunks. xmerl doesn't support parsing binaries so I switched to erlsom, which does, and also converted the XML to a very simple and compact format:

{a,[],
    [{b,[],[<<"foo">>]},
     {c,[],[]}]}


After making these changes the amount of memory used per login decreased by 2.5-3x. However the gateway was still occasionally using up all available memory and dying at what appeared to be random intervals. My best guess was that something in the protocol stream was triggering this problem so I updated the gateway to log each login attempt, and ran tcpdump to capture all MSN traffic. Eventually I was able to correlate the crashes with incoming status text messages from certain contacts of a few heysan users.

MSNP transports status text as an XML payload of the UBX command:

<Data><CurrentMedia></CurrentMedia><PSM>status text</PSM></Data>

I was still using xmerl to parse this small XML document and grab the cdata from the <PSM> tag. The status text of some contacts contained combinations of UTF-8 text and numeric unicode entities such as &#x3A;. Simply attempting to parse these small XML documents would cause xmerl to allocate more than 8GB of memory and thus kill the emulator. Parsing the UBX payload with erlsom instead of xmerl completely resolved the problem, but was a bit of a letdown after so much time spent hunting such an esoteric bug.

UPDATE: the crash described above is fixed in xmerl-1.1.10, which is included in Erlang/OTP R12B-4.

Sunday, June 22, 2008

Why Erlang Gateways? - Part 3

After protocol decoding and client implementation, the third major challenge was making sure the gateways were scalable. Each gateway instance must handle a large number of active client sessions and multiple instances must be clusterable using some form of session-based load balancing.

Every heysan user has a jabber account that is registered with one or more gateways. Upon login the jabber server sends a presence probe to each gateway which then connects the user to the appropriate legacy network. A client session consists of three Erlang processes that communicate via message passing: socket reader, FSM, and translator of protocol-specific events into a common format. This style of concurrency oriented programming is quite elegant, but is not possible in many languages due to the overhead of native OS threads.

Erlang's SMP VM distributes runnable processes across all available CPUs. IM clients spend most of their time waiting on network IO, so the gateways run with the +Ktrue command line option which instructs the VM to use epoll/kqueue/poll rather than select() to efficiently determine which processes are runnable. We have yet to explore the limits of a single gateway instance, but our most popular network has reached > 700 active clients and load on the gateway server was very low.

Of course at some point a single gateway instance will become constrained by the finite amount of CPU power, RAM, IO bandwidth, etc available on a single server. So the gateways have been designed to support multiple instances running in a cluster. Persistent data such as user credentials are stored in OTP's distributed database mnesia and accessible from every instance. Our jabber server, ejabberd, provides built-in support for load-balancing sessions across multiple instances of a gateway. Bringing a new instance online is a simple matter of telling it to join an existing cluster and updating the ejabberd config file.

So, in the final analysis, Erlang/OTP served as an ideal platform for building jabber gateways. That isn't particularly surprising, since such a system isn't far from Erlang's original telecom applications, and ejabberd has already proven itself in large deployments. In upcoming posts I intend to discuss some of Erlang's other libraries as well as some of the difficulties encountered due to MSNP v15's extensive use of SOAP and XML.

Tuesday, June 17, 2008

Why Erlang Gateways? - Part 2

Erlang's bit syntax provides a very convenient method of decoding and encoding binary protocols like OSCAR and YMSG. MSNP is a textual protocol that is also easily parsed with gen_tcp's {packet, line} and {active, once} options. Each protocol implementation has a process which reads incoming packets, does preliminary decoding, and forwards the data to a finite state machine.
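For a line-based protocol, the reader process boils down to re-arming {active, once} after every message. Here is a minimal loopback sketch of that pattern; the module name line_sketch and the two command lines sent by the throwaway server are invented for the demonstration and are not taken from the gateway code.

```erlang
-module(line_sketch).
-export([read_lines/2, demo/0]).

%% Collect up to N lines from Sock. {active, once} delivers exactly one
%% packet as a message and then goes passive again, so the mailbox cannot
%% be flooded by a fast sender.
read_lines(_Sock, 0) ->
    [];
read_lines(Sock, N) ->
    ok = inet:setopts(Sock, [{active, once}]),
    receive
        {tcp, Sock, Line}  -> [Line | read_lines(Sock, N - 1)];
        {tcp_closed, Sock} -> []
    after 5000 ->
        []
    end.

%% Loopback demonstration: the server writes two lines in a single send,
%% and {packet, line} still hands them to the client one at a time.
demo() ->
    {ok, Listen} = gen_tcp:listen(0, [binary, {active, false},
                                      {ip, {127,0,0,1}}]),
    {ok, Port} = inet:port(Listen),
    spawn(fun() ->
              {ok, S} = gen_tcp:accept(Listen),
              ok = gen_tcp:send(S, <<"VER 1 MSNP15\r\nCVR 2 ok\r\n">>),
              %% Keep the socket open until the client hangs up.
              _ = gen_tcp:recv(S, 0, 5000)
          end),
    {ok, Sock} = gen_tcp:connect({127,0,0,1}, Port,
                                 [binary, {packet, line}, {active, false}]),
    Lines = read_lines(Sock, 2),
    ok = gen_tcp:close(Sock),
    ok = gen_tcp:close(Listen),
    Lines.
```

Each line arrives with its trailing "\r\n" still attached, so a real reader would strip that before splitting the command into arguments.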

The language and VM are great, but one of the major benefits to writing programs in Erlang is OTP which provides a large standard library and design principles that have been developed and tested in real-world use. One example of this is gen_fsm, a generic behavior for event-based FSMs which is particularly useful for implementing network protocols. The gen_fsm module handles initialization, synchronous and asynchronous event delivery, error reporting, debugging, etc. All the programmer must do is implement a callback module which handles events and makes state change decisions.

Execution of a gen_fsm begins with a call to gen_fsm:start or gen_fsm:start_link, which invokes the callback module's init/1 function. init performs any necessary initialization and determines the FSM's starting state. Here is msnp_fsm:init/1:

init([Username, Password, Client, Opts]) ->
    process_flag(trap_exit, true),

    Host = proplists:get_value(host, Opts, ?MSNP_HOST),
    Port = proplists:get_value(port, Opts, ?MSNP_PORT),

    State = #state{client   = Client,
                   user     = #user{username = Username, password = Password},
                   contacts = [],
                   pending  = dict:new(),
                   sessions = []},
    gen_fsm:send_event(self(), {connect, Host, Port}),
    {ok, protocol_negotiation, State}.

The FSM begins in state protocol_negotiation, and immediately receives an asynchronous event {connect, Host, Port} telling it to connect to the server. Next, gen_fsm will call msnp_fsm:protocol_negotiation/2, passing it the event and FSM state:

protocol_negotiation({connect, Host, Port}, State) ->
    {ok, Sock} = msnp_sock:start_link(self(), Host, Port),
    msnp_sock:send_cmd(Sock, "VER", ["MSNP15", "CVR0"]),
    {next_state, protocol_negotiation, State#state{sock = Sock}};

The FSM remains in protocol_negotiation until it receives a matching VER command from the server, then transitions through sso_auth, waiting_for_profile, synchronizing, and eventually to the ready state. Each state function is quite simple, and with multiple function clauses to handle different events the code is easy to read. Here are two ready/2 event handlers that handle contact status changes:

ready(#cmd{cmd = "ILN", args = [_, Code, Name, _, Nick | _]}, State) ->
    State2 = send_status_update(Name, url_util:decode(Nick), Code, State),
    {next_state, ready, State2};

ready(#cmd{cmd = "NLN", args = [Code, Name, _, Nick | _]}, State) ->
    State2 = send_status_update(Name, url_util:decode(Nick), Code, State),
    {next_state, ready, State2};

msnp_sock parses incoming packets and sends the resulting event to the msnp_fsm as a cmd record. Erlang's pattern matching provides a very concise way to determine which clause to invoke and binds the parameters needed to variables such as Name and Nick.

OTP behaviors like gen_fsm also provide much useful debugging functionality. When an event handler fails for any reason a log message is generated with the FSM's state, the last message received, the cause, etc. gen_fsm:start_link can also be called with the option {debug, [trace]} which will log every incoming event and state change. Running systems can be inspected in real-time thanks to functions such as sys:get_status/1 which displays the state of running processes.

With the combination of Erlang's bit-syntax and OTP library it is pretty easy to implement an IM client. Indeed most of the difficulty is due to lack of official documentation and incomplete reverse engineering. In the next post I'll cover one of the most interesting and important aspects of the gateways: scalability.

Saturday, June 14, 2008

Why Erlang Gateways? - Part 1

Jabber gateways (XEP-0100, also known as transports) provide a bridge from XMPP to legacy IM network protocols like OSCAR, used by AIM and ICQ; YMSG, used by Yahoo Messenger; and MSNP, used by MSN/Windows Live Messenger. Unfortunately these protocols are all proprietary and lacking comprehensive documentation. AOL has released partial specs for parts of OSCAR but neglected to include the sections necessary for ICQ compatibility, and there is an obsolete draft RFC for MSNP. Both of those protocols have been reverse engineered and documented fairly well, but there are no open source implementations suitable for use as jabber gateways. The various Python transports (PyAIMt, PyMSNt, PyYIMt) appear fairly unmaintained and libpurple's API is better suited to clients than gateways, not to mention supporting very outdated versions of MSNP and YMSG.

Fortunately Erlang is an excellent language for implementing network protocols! In this post I'll cover one reason: a data type and syntax for manipulating binary data. The data type is called a binary (which as of R12B is a subset of a new type, bitstring) and stores bytes, very similar to a byte array in C, C++, Java, etc. However, unlike a byte array, the bit syntax provides access to an arbitrary number of bits. Let's jump right into an example using the syntax for a FLAP header which frames all OSCAR packets:

<<16#2A:8, Chan:8, Seq_Num:16, Size:16>>

The FLAP header is 6 bytes long. Byte 1 is an asterisk (character code 0x2A), byte 2 is a channel number, bytes 3-4 are a sequence number, and bytes 5-6 contain the size of the remaining payload. When this Erlang expression is matched against an incoming FLAP packet it assigns the channel number to Chan, sequence number to Seq_Num, and size to Size. Size:16 is actually a shortcut, taking advantage of bit syntax defaults, for Size:16/big-unsigned-integer which means a 16 bit unsigned integer, encoded as big-endian.
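Wrapped into functions, the same pattern can split whole frames off a receive buffer. This is a small sketch under my own assumptions (the module name flap_sketch and the return shapes are invented here), not the code from the actual OSCAR implementation.

```erlang
-module(flap_sketch).
-export([decode/1, encode/3]).

%% Try to take one complete FLAP frame from the front of Buffer.
%% Size is bound by the header and reused in the same pattern to
%% cut out exactly that many payload bytes.
decode(<<16#2A:8, Chan:8, SeqNum:16, Size:16, Payload:Size/binary, Rest/binary>>) ->
    {ok, {Chan, SeqNum, Payload}, Rest};
decode(<<16#2A:8, _/binary>>) ->
    more;                      % valid start, but the frame is incomplete
decode(<<>>) ->
    more;
decode(_) ->
    {error, bad_frame}.        % first byte is not the 0x2A marker

%% Build a frame; integers default to big-endian unsigned.
encode(Chan, SeqNum, Payload) when is_binary(Payload) ->
    <<16#2A:8, Chan:8, SeqNum:16, (byte_size(Payload)):16, Payload/binary>>.
```

For example, flap_sketch:encode(2, 7, <<"abc">>) yields <<42,2,0,7,0,3,97,98,99>>, and decoding that binary returns the channel, sequence number, and payload with an empty remainder.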

Most OSCAR packets are sent on channel 2 which indicates another level of framing called SNAC. When a FLAP header arrives on channel 2 we read Size bytes, which start with a SNAC header:

<<Family:16, Subtype:16, Flags:16, Req_Id:32, Bin/binary>>

This looks quite similar to FLAP, except for the last part. Until now we've only been pulling integers of various sizes out of the binary, but the bit syntax also supports floats and binaries. When matched against a SNAC packet, this expression stores the first 10 bytes in Family, Subtype, Flags, and Req_Id, and all remaining bytes in Bin. The values of Family and Subtype determine the format of the data in Bin.
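As a function, the SNAC match might be sketched like this. The module name snac_sketch and the tuple it returns are assumptions chosen for the example.

```erlang
-module(snac_sketch).
-export([decode/1]).

%% Split a SNAC packet into its 10-byte header fields and the body.
%% The trailing Bin/binary segment has no size, so it takes whatever
%% bytes are left after the fixed-width integers.
decode(<<Family:16, Subtype:16, Flags:16, ReqId:32, Bin/binary>>) ->
    {ok, {Family, Subtype, Flags, ReqId}, Bin};
decode(_) ->
    {error, short_header}.
```

A caller would then dispatch on Family and Subtype to pick the right pattern for the body.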

So far we've been looking at the bit syntax patterns used to decompose binary data. What does the data actually look like? Well, it's just a sequence of bytes, which Erlang displays as a comma-separated list of numeric values surrounded by << and >>. This is easier to demonstrate using another common OSCAR data type, string16, which is a UTF-8 string, prefixed with a 2 byte length.

Here is what the string "hello world" would look like as a string16 binary: <<0,11,104,101,108,108,111,32,119,111,114,108,100>>. The first two bytes are the length, and the remaining bytes are UTF-8 character codes. When matched against the pattern

<<Len:16, Str:Len/binary>>

Len will contain 11 and Str will contain <<"hello world">>, which is a shortcut for <<104,101,108,108,111,32,119,111,114,108,100>> (Erlang does not have a string data type, but sequences of Latin-1 characters can be enclosed in quotes in both lists and binaries).
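The same pattern works in both directions. Below is a minimal sketch of a string16 codec; the module name string16 and its function names are made up for illustration, and the input is assumed to already be a UTF-8 encoded binary.

```erlang
-module(string16).
-export([encode/1, decode/1]).

%% Prefix a binary with its length as a 16-bit big-endian integer.
%% The guard rejects strings too long for a 2-byte length field.
encode(Str) when is_binary(Str), byte_size(Str) < 16#10000 ->
    <<(byte_size(Str)):16, Str/binary>>.

%% Read one string16 and return it along with any trailing bytes.
decode(<<Len:16, Str:Len/binary, Rest/binary>>) ->
    {Str, Rest}.
```

Encoding <<"hello world">> produces the 13-byte binary shown above, and decoding it gives back the 11-byte string with an empty remainder.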

These examples of the bit syntax have been taken from my actual OSCAR implementation, but are only a small part of the puzzle. In the next post I'll discuss modeling protocols using finite state machines, via Erlang's gen_fsm behavior. Here's a teaser snippet:

ready(#packet{family = 16#13, subtype = 16#19, data = Data}, State) ->
    <<Len:8, Contact:Len/binary, _Rest/binary>> = Data,
    State#state.client ! {incoming_contact_add, Contact},
    {next_state, ready, State};