Saturday, June 14, 2008

Why Erlang Gateways? - Part 1

Jabber gateways (XEP-0100, also known as transports) provide a bridge from XMPP to legacy IM network protocols like OSCAR, used by AIM and ICQ; YMSG, used by Yahoo Messenger; and MSNP, used by MSN/Windows Live Messenger. Unfortunately these protocols are all proprietary and lack comprehensive documentation. AOL has released specs for parts of OSCAR but neglected to include the sections necessary for ICQ compatibility, and there is an obsolete draft RFC for MSNP. Both of those protocols have been reverse engineered and documented fairly well, but there are no open source implementations suitable for use as Jabber gateways. The various Python transports (PyAIMt, PyMSNt, PyYIMt) appear fairly unmaintained, and libpurple's API is better suited to clients than gateways, not to mention that it supports very outdated versions of MSNP and YMSG.

Fortunately Erlang is an excellent language for implementing network protocols! In this post I'll cover one reason: a data type and syntax for manipulating binary data. The data type is called a binary (which as of R12B is a subset of a new type, bitstring) and stores bytes, very similar to a byte array in C, C++, Java, etc. However, unlike a byte array, a binary can be taken apart with the bit syntax, which gives access to fields of any width in bits. Let's jump right into an example using the syntax for a FLAP header, which frames all OSCAR packets:

<<16#2A:8, Chan:8, Seq_Num:16, Size:16>>

The FLAP header is 6 bytes long. Byte 1 is an asterisk (character code 0x2A), byte 2 is a channel number, bytes 3-4 are a sequence number, and bytes 5-6 contain the size of the remaining payload. When this Erlang expression is matched against an incoming FLAP packet it assigns the channel number to Chan, sequence number to Seq_Num, and size to Size. Size:16 is actually a shortcut, taking advantage of bit syntax defaults, for Size:16/big-unsigned-integer which means a 16 bit unsigned integer, encoded as big-endian.
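
To make the matching concrete, here is a minimal sketch of the header pattern used in a function head. The module name flap_sketch, the function names, and the return tuples are made up for illustration and are not taken from the OSCAR implementation discussed in this post; the extra Rest/binary segment is an assumption that the caller hands over a buffer which may contain more than the header.

```erlang
-module(flap_sketch).
-export([parse_header/1, header/3]).

%% Hypothetical helper: match the 6-byte FLAP header at the front of a
%% binary. Returns the decoded fields plus whatever bytes follow it.
parse_header(<<16#2A:8, Chan:8, Seq_Num:16, Size:16, Rest/binary>>) ->
    {ok, {Chan, Seq_Num, Size}, Rest};
parse_header(Bin) when is_binary(Bin), byte_size(Bin) < 6 ->
    %% Fewer than 6 bytes: a full header has not arrived yet.
    {more, Bin};
parse_header(_) ->
    %% Anything else: 6 or more bytes that do not start with 0x2A,
    %% or a term that is not a binary at all.
    {error, bad_header}.

%% The same syntax builds a header when used as an expression.
header(Chan, Seq_Num, Size) ->
    <<16#2A:8, Chan:8, Seq_Num:16, Size:16>>.
```

With this sketch, flap_sketch:parse_header(<<16#2A, 2, 0, 1, 0, 4, 1, 2, 3, 4>>) returns {ok, {2, 1, 4}, <<1,2,3,4>>}, and flap_sketch:header(2, 1, 4) returns <<42,2,0,1,0,4>>.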

Most OSCAR packets are sent on channel 2 which indicates another level of framing called SNAC. When a FLAP header arrives on channel 2 we read Size bytes, which start with a SNAC header:

<<Family:16, Subtype:16, Flags:16, Req_Id:32, Bin/binary>>

This looks quite similar to FLAP, except for the last part. Until now we've only been pulling integers of various sizes out of the binary, but the bit syntax also supports floats and binaries. When matched against a SNAC packet, this expression stores the first 10 bytes in Family, Subtype, Flags, and Req_Id, and all remaining bytes in Bin. The values of Family and Subtype determine the format of the data in Bin.
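
As a rough sketch, the SNAC pattern can be wrapped in a function that splits a channel 2 payload into header fields and data. The module name snac_sketch, the function name parse/1, and the short_snac error atom are hypothetical and chosen only for this example.

```erlang
-module(snac_sketch).
-export([parse/1]).

%% Hypothetical helper: split a channel 2 FLAP payload into the 10-byte
%% SNAC header fields and the family/subtype-specific data after them.
parse(<<Family:16, Subtype:16, Flags:16, Req_Id:32, Bin/binary>>) ->
    {ok, {Family, Subtype, Flags, Req_Id}, Bin};
parse(_) ->
    %% Fewer than 10 bytes (or not a binary): no complete SNAC header.
    {error, short_snac}.
```

For example, snac_sketch:parse(<<0,16#13, 0,16#19, 0,0, 0,0,0,1, 5,"alice">>) returns {ok, {19, 25, 0, 1}, <<5,"alice">>}. The trailing Bin/binary segment has no size, so it takes every byte left after the header, including none at all.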

So far we've been looking at the bit syntax patterns used to decompose binary data. What does the data actually look like? Well, it's just a sequence of bytes, which Erlang displays as a comma-separated list of numeric values surrounded by << and >>. This is easier to demonstrate using another common OSCAR data type, string16, which is a UTF-8 string prefixed with a 2 byte length.

Here is what the string "hello world" would look like as a string16 binary: <<0,11,104,101,108,108,111,32,119,111,114,108,100>>. The first two bytes are the length, and the remaining bytes are UTF-8 character codes. When matched against the pattern

<<Len:16, Str:Len/binary>>

Len will contain 11 and Str will contain <<"hello world">>, which is a shortcut for <<104,101,108,108,111,32,119,111,114,108,100>> (Erlang does not have a string data type, but sequences of Latin-1 characters can be enclosed in quotes in both lists and binaries).
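
Here is a small sketch of string16 encoding and decoding built on that pattern. The module name string16_sketch, the function names, and the truncated error atom are invented for this example, and the Rest/binary segment assumes a string16 may be followed by other fields in the same packet.

```erlang
-module(string16_sketch).
-export([encode/1, decode/1]).

%% Prefix a binary with its length in bytes as a 16-bit big-endian
%% integer. The guard rejects strings too long for a 2-byte length.
encode(Str) when is_binary(Str), byte_size(Str) =< 16#FFFF ->
    <<(byte_size(Str)):16, Str/binary>>.

%% Read the length, then take exactly that many bytes as the string.
%% Len is bound by the first segment and reused as the size of the second.
decode(<<Len:16, Str:Len/binary, Rest/binary>>) ->
    {ok, Str, Rest};
decode(_) ->
    %% Missing length prefix, or fewer bytes than the prefix promises.
    {error, truncated}.
```

Calling string16_sketch:encode(<<"hello world">>) returns <<0,11,104,101,108,108,111,32,119,111,114,108,100>>, and passing that binary to string16_sketch:decode/1 returns {ok, <<"hello world">>, <<>>}.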

These examples of the bit syntax have been taken from my actual OSCAR implementation, but are only a small part of the puzzle. In the next post I'll discuss modeling protocols using finite state machines, via Erlang's gen_fsm behavior. Here's a teaser snippet:

ready(#packet{family = 16#13, subtype = 16#19, data = Data}, State) ->
    <<Len:8, Contact:Len/binary, _Rest/binary>> = Data,
    State#state.client ! {incoming_contact_add, Contact},
    {next_state, ready, State};

8 comments:

Mietek said...

Excellent! I'm looking forward to the rest of the series.

I'm curious to know -- are you planning on using jabberlang? Or have you perhaps coaxed a prerelease version of exmpp out of Process One?

dayjah said...

nice post - it's approachable :) I once attempted to do a Java dhcpd server - and protocol stuff in Java sucks primarily because of the lack of concept of unsigned ints, I had massive inaccuracies in my bit shifting when trying to make ints be unsigned whilst maintaining the correct value. How, if at all, does Erlang deal with that?

Luke Gorrie said...

Hi Will! Nice to see you popping up as an Erlang hacker at a Y-Combinator company. :-)

Will Glozer said...

Thanks @mietek, and yes I'm using a pre-release of exmpp =)

@dayjah the bit syntax supports both signed and unsigned integers, but Erlang itself has no fixed-length integer type so there's no messing around with shifting.

Hey @Luke! LTNS, thanks!

Mietek said...

Will, would you mind saying how an interested party can get access to exmpp?

Bosky said...

hi,
A couple of days back, I tried coming up with a space complexity calculator for pattern matching binaries, which we're using to implement auto-complete and suggestive keywords.

It seems to work decently, although I'd love to have your comments as well. While I'm at it, I'd love to have a look at exmpp too :)

http://paste.lisp.org/display/63271#4

Keep Clicking,
Bhasker V Kode

Jamaal said...

If you hit this entry through google, the exmpp library is now available for us all at http://support.process-one.net/doc/display/EXMPP/

cheers.