In a previous post we discussed the design of zlog,
our implementation of the CORFU distributed
shared-log protocol on top of Ceph. A key component of the system is the
sequencer server that reports the current log tail to clients. In this post
we'll discuss the implementation and performance of the sequencer in
zlog.
The fast path of the sequencer server is simple. It contains an in-memory
counter that is incremented when a client requests the next position in the
log. Our current implementation receives requests from clients that are about
60 bytes in size (log identifier, epoch value), and responds with the
current counter and epoch totaling about 16 bytes. The current design achieves
approximately 180K requests per second with 4 server threads on the loopback
interface, and approximately 100K requests per second over the 10 Gb ethernet
network on Google Compute Engine.
In contrast, the authors of the Tango paper report sequencer performance of
approximately 500K requests per second over a 1 Gb ethernet link. While 100K
requests per second for our implementation is more than enough to get started
with some interesting applications, it would be nice to reach parity with the
results in the Tango paper.
In the remainder of this post we'll discuss the current implementation. If
anyone has suggestions on improving performance, that would be great :)
Protocol
Our implementation uses Google Protocol Buffers to encode messages between
clients and the sequencer. Below is the specification for the messages. The
MSeqRequest is sent by clients, and the MSeqReply is the message type of
each server reply.
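Roughly, the definitions look like this (the field names and numbering here
are illustrative rather than the exact .proto file in the zlog tree):

    message MSeqRequest {
      required uint64 epoch = 1;
      required string pool = 2;
      required string name = 3;
      required bool next = 4;   // advance the tail, or just read it
    }

    message MSeqReply {
      enum Status {
        OK = 0;
        INIT_LOG = 1;  // log not registered yet; the client should retry
      }
      optional Status status = 1 [default = OK];
      required uint64 position = 2;
    }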
Next we'll discuss the server and client designs.
Server
When the server starts up it binds to the port given by --port and starts a
set of worker threads controlled by --nthreads. The server can also be daemonized, but this
code has been removed for brevity. The LogManager object tracks log metadata
and is used by the server.
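Trimmed down, startup looks something like the following sketch. It uses Boost
program_options for the flags; the Server and LogManager types are covered in
the next sections, and the exact zlog code differs a bit:

    #include <boost/asio.hpp>
    #include <boost/program_options.hpp>

    namespace po = boost::program_options;

    int main(int argc, char **argv)
    {
      int port;
      int nthreads;

      // --port selects the listening port, --nthreads the number of worker
      // threads that service the I/O service object.
      po::options_description desc("options");
      desc.add_options()
        ("port", po::value<int>(&port)->required(), "listen port")
        ("nthreads", po::value<int>(&nthreads)->default_value(1), "worker threads");

      po::variables_map vm;
      po::store(po::parse_command_line(argc, argv, desc), vm);
      po::notify(vm);

      // The LogManager tracks per-log metadata (tail counters, epochs).
      LogManager log_mgr;

      // The Server owns the I/O service object and the worker threads.
      Server server(port, nthreads, &log_mgr);
      server.run();

      return 0;
    }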
The server design uses Boost ASIO and is based on the multi-threaded HTTP
server example available on the Boost ASIO website. The server contains a
single I/O service object, and we set the TCP_NODELAY option for all new connections.
When the server starts, a set of threads all call the run method on the I/O
service object.
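Putting that together, the server looks roughly like this (class and helper
names are my shorthand for this post rather than the exact zlog code; the
Session type is sketched in the next section):

    #include <boost/asio.hpp>
    #include <boost/bind.hpp>
    #include <boost/thread.hpp>

    class Server {
     public:
      Server(int port, int nthreads, LogManager *log_mgr)
        : acceptor_(io_service_,
            boost::asio::ip::tcp::endpoint(boost::asio::ip::tcp::v4(), port)),
          nthreads_(nthreads), log_mgr_(log_mgr)
      {
        start_accept();
      }

      void run() {
        // Every worker thread calls run() on the single I/O service object.
        boost::thread_group threads;
        for (int i = 0; i < nthreads_; i++)
          threads.create_thread(boost::bind(&Server::io_run, this));
        threads.join_all();
      }

     private:
      // Helper so each worker thread just runs the shared I/O service.
      void io_run() { io_service_.run(); }

      void start_accept() {
        Session *session = new Session(io_service_, log_mgr_);
        acceptor_.async_accept(session->socket(),
            boost::bind(&Server::handle_accept, this, session,
              boost::asio::placeholders::error));
      }

      void handle_accept(Session *session,
          const boost::system::error_code& err) {
        if (!err) {
          // Disable Nagle's algorithm on every new connection.
          session->socket().set_option(boost::asio::ip::tcp::no_delay(true));
          session->start();
        } else {
          delete session;
        }
        start_accept();
      }

      boost::asio::io_service io_service_;
      boost::asio::ip::tcp::acceptor acceptor_;
      int nthreads_;
      LogManager *log_mgr_;
    };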
New connections are handled by the state machine implemented in the Session
object that we'll see next.
Session
The Session object implements the sequencer protocol state machine. Each
connection starts with the client sending us a 32-bit message header.
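A sketch of the Session class and the initial read is shown below. The member
names are my own, and I'm assuming the 32-bit header carries the length of the
protobuf body in network byte order:

    #include <stdint.h>
    #include <arpa/inet.h>
    #include <cerrno>
    #include <cstring>
    #include <boost/asio.hpp>
    #include <boost/bind.hpp>
    #include "zlog.pb.h"  // generated protobuf header (name is an assumption)

    class Session {
     public:
      Session(boost::asio::io_service& io_service, LogManager *log_mgr)
        : socket_(io_service), log_mgr_(log_mgr) {}

      boost::asio::ip::tcp::socket& socket() { return socket_; }

      // Kick off the state machine: read the fixed-size message header.
      void start() {
        boost::asio::async_read(socket_,
            boost::asio::buffer(hdr_buf_, sizeof(hdr_buf_)),
            boost::bind(&Session::handle_hdr, this,
              boost::asio::placeholders::error,
              boost::asio::placeholders::bytes_transferred));
      }

     private:
      void handle_hdr(const boost::system::error_code& err, size_t size);
      void handle_msg(const boost::system::error_code& err, size_t size);
      void process_request();
      void handle_reply(const boost::system::error_code& err, size_t size);

      boost::asio::ip::tcp::socket socket_;
      LogManager *log_mgr_;

      char hdr_buf_[sizeof(uint32_t)];
      char msg_buf_[4096];    // incoming MSeqRequest bytes
      char reply_buf_[4096];  // outgoing header + MSeqReply bytes
      MSeqRequest req_;
    };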
Next we sanity check the header and read the rest of the message.
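In the sketch, the handler validates the client-supplied length before
trusting it and then reads the protobuf-encoded body:

    void Session::handle_hdr(const boost::system::error_code& err, size_t size)
    {
      if (err) {
        delete this;
        return;
      }

      // Decode the 32-bit length prefix (network byte order assumed).
      uint32_t msg_size;
      memcpy(&msg_size, hdr_buf_, sizeof(msg_size));
      msg_size = ntohl(msg_size);

      // Sanity check the header before reading the rest of the message.
      if (msg_size == 0 || msg_size > sizeof(msg_buf_)) {
        delete this;
        return;
      }

      boost::asio::async_read(socket_,
          boost::asio::buffer(msg_buf_, msg_size),
          boost::bind(&Session::handle_msg, this,
            boost::asio::placeholders::error,
            boost::asio::placeholders::bytes_transferred));
    }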
Next we handle the interesting part of the message. We start by resetting an
MSeqRequest and initializing it with the received data.
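In the sketch the MSeqRequest is a Session member, so it is cleared before
being parsed from the received bytes:

    void Session::handle_msg(const boost::system::error_code& err, size_t size)
    {
      if (err) {
        delete this;
        return;
      }

      // Reset the request object and initialize it from the received data.
      req_.Clear();
      if (!req_.ParseFromArray(msg_buf_, size) || !req_.IsInitialized()) {
        delete this;
        return;
      }

      process_request();
    }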
Next we prepare a MSeqReply structure and contact the LogManager with the
information in the request. We'll discuss the LogManager in the next
section.
Once the LogManager is finished, we record any errors that occurred, set the
position reported by the LogManager, and send the reply back to the
client.
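Putting those steps together looks roughly like the following. The
ReadSequence signature and the reply fields follow the sketches in this post,
and error handling is simplified:

    void Session::process_request()
    {
      MSeqReply reply;

      // Ask the LogManager for the tail of the requested log instance.
      uint64_t seq = 0;
      int ret = log_mgr_->ReadSequence(req_.pool(), req_.name(),
          req_.epoch(), req_.next(), &seq);
      if (ret == -EAGAIN) {
        // Log not registered yet: tell the client to retry shortly.
        reply.set_status(MSeqReply::INIT_LOG);
      }
      reply.set_position(seq);

      // Frame the reply with the same 32-bit length header and send it.
      uint32_t msg_size = reply.ByteSize();
      uint32_t be_size = htonl(msg_size);
      memcpy(reply_buf_, &be_size, sizeof(be_size));
      reply.SerializeToArray(reply_buf_ + sizeof(be_size), msg_size);

      boost::asio::async_write(socket_,
          boost::asio::buffer(reply_buf_, sizeof(be_size) + msg_size),
          boost::bind(&Session::handle_reply, this,
            boost::asio::placeholders::error,
            boost::asio::placeholders::bytes_transferred));
    }

    void Session::handle_reply(const boost::system::error_code& err, size_t size)
    {
      if (err) {
        delete this;
        return;
      }
      // Wait for the next request header on this connection.
      start();
    }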
Next we'll present the LogManager component of the sequencer server.
Log Manager
The LogManager manages the mapping between log identifiers and the actual
tail counter. A tail is represented by a Sequence object:
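Something along these lines (the method names are assumptions):

    #include <stdint.h>

    // A tail counter. read() returns the current value; inc() advances it
    // when a client asks for the next position rather than just the tail.
    class Sequence {
     public:
      Sequence() : seq_(0) {}
      explicit Sequence(uint64_t init) : seq_(init) {}

      uint64_t read() const { return seq_; }
      void inc() { seq_++; }

     private:
      uint64_t seq_;
    };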
A log is identified by the tuple (name, pool) and stored in the LogManager
in the Log struct that includes the current epoch value for the log
instance:
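For example (again a sketch of the shape described above):

    #include <string>

    // Per-log state tracked by the LogManager. The (pool, name) pair
    // identifies the log, and epoch records the instance's current epoch.
    struct Log {
      Log() : epoch(0) {}
      Log(const std::string& pool, const std::string& name, uint64_t epoch)
        : pool(pool), name(name), epoch(epoch) {}

      std::string pool;
      std::string name;
      uint64_t epoch;
      Sequence seq;
    };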
The important part is when the log sequences are read and incremented. This is
handled by the ReadSequence method called from the server Session object
in response to a client request. The routine looks up the log in a std::map
and if it exists the sequence value is returned, and optionally incremented.
If the log isn't found in the index, the client receives an -EAGAIN message
and a request to register the log is placed in a queue. A separate thread
handles the registration of a new log in the LogManager because it is an
expensive process that would otherwise block other client requests.
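A simplified version of the lookup path is shown below. It uses the Log and
Sequence sketches above; the signature, locking, and registration queue are
all simplified from the real code:

    #include <stdint.h>
    #include <cerrno>
    #include <list>
    #include <map>
    #include <string>
    #include <utility>
    #include <boost/thread.hpp>

    class LogManager {
     public:
      // Returns 0 and fills in *seq on success, or -EAGAIN if the log has
      // not been registered with the sequencer yet.
      int ReadSequence(const std::string& pool, const std::string& name,
          uint64_t epoch, bool increment, uint64_t *seq)
      {
        boost::unique_lock<boost::mutex> l(lock_);

        std::map<std::pair<std::string, std::string>, Log>::iterator it =
          logs_.find(std::make_pair(pool, name));

        if (it == logs_.end()) {
          // Unknown log: queue it for registration, wake the background
          // thread that runs the expensive tail-finding process, and tell
          // the client to retry.
          pending_.push_back(std::make_pair(pool, name));
          cond_.notify_one();
          return -EAGAIN;
        }

        // (The request's epoch is also validated against it->second.epoch;
        // that check is omitted in this sketch.)

        *seq = it->second.seq.read();
        if (increment)
          it->second.seq.inc();

        return 0;
      }

     private:
      boost::mutex lock_;
      boost::condition_variable cond_;
      std::map<std::pair<std::string, std::string>, Log> logs_;
      std::list<std::pair<std::string, std::string> > pending_;
    };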
The process of registering a new log includes the slow tail finding mechanism,
but is omitted here because it isn't part of the fast path. Next we'll show
the client side, which is very straightforward.
Client
A client starts by connecting to the sequencer server.
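Roughly like this (the address and port are just examples):

    #include <arpa/inet.h>
    #include <cstring>
    #include <iostream>
    #include <vector>
    #include <boost/asio.hpp>
    #include "zlog.pb.h"  // generated protobuf header (name is an assumption)

    boost::asio::io_service io_service;
    boost::asio::ip::tcp::socket socket(io_service);

    // Resolve and connect to the sequencer server.
    boost::asio::ip::tcp::resolver resolver(io_service);
    boost::asio::ip::tcp::resolver::query query("localhost", "5678");
    boost::asio::connect(socket, resolver.resolve(query));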
Client requests are in the form of a MSeqRequest object and include the
metadata for identifying the log.
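Continuing the sketch, the request is filled in, framed with the same 32-bit
length header used by the server, and written with a blocking call (the field
values here are examples):

    // Fill in the request with the log's metadata.
    MSeqRequest req;
    req.set_epoch(0);        // epoch learned when the log was opened
    req.set_pool("rbd");     // example pool name
    req.set_name("mylog");   // example log name
    req.set_next(true);      // ask the sequencer to advance the tail

    // Frame the request with a 32-bit length header and send it. The
    // boost::asio::write call blocks until the whole buffer is written.
    uint32_t msg_size = req.ByteSize();
    uint32_t be_size = htonl(msg_size);
    std::vector<char> out(sizeof(be_size) + msg_size);
    memcpy(&out[0], &be_size, sizeof(be_size));
    req.SerializeToArray(&out[sizeof(be_size)], msg_size);
    boost::asio::write(socket, boost::asio::buffer(out));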
The write call is blocking, so after it returns we read the response from
the same socket.
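Reading the reply just reverses the framing; the status and position fields
follow the message sketch from earlier:

    // Read the 32-bit reply header, then the MSeqReply body.
    uint32_t reply_size;
    boost::asio::read(socket,
        boost::asio::buffer(&reply_size, sizeof(reply_size)));
    reply_size = ntohl(reply_size);

    std::vector<char> in(reply_size);
    boost::asio::read(socket, boost::asio::buffer(in));

    MSeqReply reply;
    reply.ParseFromArray(&in[0], in.size());

    if (reply.status() == MSeqReply::OK)
      std::cout << "tail position: " << reply.position() << std::endl;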
And that's it. Performance is OK at 100K requests per second, but it would be
nice to reach the reported 500K requests per second from the Tango paper.
Thoughts
I haven't had much time to focus on optimizing the server.
Some simple experiments suggest that the std::map lookup and the shared
locking aren't currently a bottleneck.
Apparently there is locking in the I/O service, and we could use multiple
I/O service objects, but I'm not familiar enough with Boost ASIO internals to
know if this is worth the effort.
I'll be revisiting sequencer performance in the future, and will now be
focusing on building some applications on top of zlog for demonstration.