The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

TCP Protocol Design

I'm implementing a TCP protocol and would like to use a human-readable message format rather than some binary format.  My reading on SMTP and HTTP indicates that these are sometimes known as "line oriented" protocols since they typically use newlines to delimit information in requests/responses.

My question is: how do I efficiently implement code to receive such messages?  Some sample (teaching) programs I've seen read a byte at a time..  Bleck..  Were I designing a binary protocol, I could put the message length in a header, but since this will be strictly text, I need some other method.

My thinking is that I could peek the size of the receive buffer, read all the data, and continue this process until I encounter TWO newlines.

Can anyone offer tips on how this is done in the real world?
New To Sockets
Sunday, June 18, 2006
 
 
This may answer some of your questions:

http://tangentsoft.net/wskfaq
sgf
Sunday, June 18, 2006
 
 
One application I've seen writes an 'ascii' string with the number of bytes to follow, terminated by a newline.  Then, the message bytes (matching the number in the ascii string) follow.  It's nice if the last character is a new-line, but that's optional.

To read this, you do have to read the initial line byte-by-byte, looking for the ending newline.  Once you've done that, you convert that string into a number, and then read that number of bytes from the port.

Hopefully, that number of bytes is already in that port's buffer -- but you do have to code some workaround should all the message bytes not have reached the port's buffers yet.  The simplest fix is to check the actual number read, and if they don't match the number requested, do a 'sleep' and re-read until you do get all the bytes.
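
Something like this, say (untested sketch: blocking BSD-style sockets, minimal error handling):

    /* Untested sketch: read "<length>\n" a byte at a time, then read
       exactly that many payload bytes. Assumes a connected, blocking
       SOCK_STREAM socket. */
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    /* Returns the payload length from the ASCII header, or -1 on error. */
    static long read_length_line(int fd)
    {
        char line[16];
        size_t i = 0;
        for (;;) {
            char c;
            if (recv(fd, &c, 1, 0) != 1)
                return -1;                 /* peer closed, or error */
            if (c == '\n')
                break;
            if (i < sizeof line - 1)       /* silently drop overflow digits */
                line[i++] = c;
        }
        line[i] = '\0';
        return strtol(line, NULL, 10);
    }

    /* Loops until all len bytes have arrived. */
    static int read_exact(int fd, char *buf, size_t len)
    {
        size_t got = 0;
        while (got < len) {
            ssize_t n = recv(fd, buf + got, len - got, 0);
            if (n <= 0)
                return -1;                 /* peer closed, or error */
            got += (size_t)n;
        }
        return 0;
    }

(With a blocking socket the sleep turns out to be unnecessary -- recv in read_exact simply waits until more bytes arrive; the sleep-and-retry approach matters mainly for non-blocking reads.)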
AllanL5
Sunday, June 18, 2006
 
 
To efficiently receive an arbitrary stream, you simply call recv repeatedly, passing in a largeish receive buffer. Recv can return as soon as some (any) data has been received from the network, even if the buffer you passed in has not been entirely filled. The data typically arrives in chunks of several hundred bytes at a time, so this is hundreds of times more efficient than reading one byte at a time. It isn't the most efficient way possible, but it's simple.
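
e.g. a minimal sketch (process() here is just a stand-in for whatever parser you hang off the stream):

    /* Sketch: pull whatever the network has delivered, in large chunks. */
    #include <sys/types.h>
    #include <sys/socket.h>

    extern void process(const char *data, size_t n);  /* hypothetical parser hook */

    static void drain(int sock)
    {
        char buf[4096];
        for (;;) {
            ssize_t n = recv(sock, buf, sizeof buf, 0);
            if (n == 0)
                break;                   /* peer closed the connection */
            if (n < 0)
                break;                   /* error; a real program checks errno */
            process(buf, (size_t)n);     /* hand the chunk to the parser */
        }
    }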
Christopher Wells
Sunday, June 18, 2006
 
 
I've seen lots of solutions.  Some even use a Cobolesque approach with a fixed-length numeric length header field (e.g. 6 digits with left-zero fill).  Others use 2 (or 4, or 6) hex digits.

Anything besides a length prefix in a header means you potentially must deal with escaping whatever symbol(s) comprise your ETX/EOM marker.

And there are more insects to worry about...

It turns out that the use of message numbers and sentinel strings is still of value even when using TCP.  The reason isn't so much corrupted streams as man-in-the-middle injection, denial of service, and buffer overrun attacks.

I know one format that uses a 3-field message prefix header.  The first 2 bytes must be 0xABCD, then there is a 16-bit unsigned rollover message sequence number, then a 16-bit unsigned binary "payload length" field.  If there isn't a 0xABCD at the head of the buffer, bytes are skipped until there IS one.  Then the message number must match expectations, else the receiver skips ahead to the next 0xABCD.  Three strikes (or a timeout) and you're out - the receiver sends a "that's all folks" message and disconnects.

One could do the same with human-readable text headers too.
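
The resync scan is only a few lines of C.  A sketch (I'm assuming big-endian fields here; the sequence-number check and the three-strikes logic are left to the caller):

    /* Sketch: locate the next complete frame in an accumulated buffer.
       Header: 0xAB 0xCD magic, 16-bit sequence number, 16-bit payload
       length (big-endian assumed). Bytes before the magic are skipped. */
    #include <stdint.h>
    #include <stddef.h>

    /* Returns the offset of the payload and sets *seq and *paylen;
       returns -1 if no magic was found or the frame isn't complete yet. */
    static long next_frame(const uint8_t *buf, size_t len,
                           uint16_t *seq, uint16_t *paylen)
    {
        for (size_t i = 0; i + 6 <= len; i++) {
            if (buf[i] != 0xAB || buf[i + 1] != 0xCD)
                continue;                  /* skip garbage until magic */
            *seq    = (uint16_t)((buf[i + 2] << 8) | buf[i + 3]);
            *paylen = (uint16_t)((buf[i + 4] << 8) | buf[i + 5]);
            if (i + 6 + *paylen > len)
                return -1;                 /* header seen, payload still in flight */
            return (long)(i + 6);
        }
        return -1;
    }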
Glen Hamer
Sunday, June 18, 2006
 
 
I know that recv can return fewer bytes than requested, but I haven't seen this in practice unless:

1) The socket was closed by the client
2) The client called shutdown with SD_SEND

Perhaps my simple test program is too simple - maybe I should be using select() and nonblocking sockets?
New To Sockets
Monday, June 19, 2006
 
 
> in practice I haven't seen recv return fewer bytes than requested

What platform are you using?

The help for recv in the Windows SDK says "For connection-oriented sockets (type SOCK_STREAM for example), calling recv will return as much data as is currently available—up to the size of the buffer specified."

For Unix the recv call can include a MSG_WAITALL flag, which isn't used in the Windows SDK, but not specifying that flag looks like it would result in Windows-style behaviour.

http://www.google.ca/search?hl=en&safe=off&q=recv+example shows several recv examples, including http://tangentsoft.net/wskfaq/examples/packetize.html which shows how to repacketize a stream.

> Maybe I should be using select() and nonblocking sockets?

An efficiency problem with using select() is that it results in twice as many function calls: a call to select followed by a call to recv.

> nonblocking sockets

Maybe; I thought the [only] difference between a blocking and a non-blocking socket was whether recv returns immediately when there's no data to be read; but I haven't checked.

>> continue this process until I encounter TWO newlines
> you potentially must deal with escaping whatever symbol(s) comprise your ETX/EOM marker

In an ASCII-over-TCP protocol the end-of-message is traditionally signalled by a '.' all by itself on the last line, e.g. http://en.wikipedia.org/wiki/POP3#Dialog_example

Page 2 of http://www.ietf.org/rfc/rfc1939.txt says:

  Responses to certain commands are multi-line.  In these cases, which
  are clearly indicated below, after sending the first line of the
  response and a CRLF, any additional lines are sent, each terminated
  by a CRLF pair.  When all lines of the response have been sent, a
  final line is sent, consisting of a termination octet (decimal code
  046, ".") and a CRLF pair.  If any line of the multi-line response
  begins with the termination octet, the line is "byte-stuffed" by
  pre-pending the termination octet to that line of the response.

which means that "CRLF.CRLF" marks the end of the message, and that if "." appears at the beginning of a line within the message then it's replaced with ".." before being sent.
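
So the sender side amounts to something like this (sketch; one line at a time, CRLF appended elsewhere):

    /* Sketch: POP3-style byte-stuffing of one response line (the line
       excludes its CRLF). 'out' must have room for len + 1 bytes.
       Returns the stuffed length. The receiver does the reverse: a lone
       "." ends the message, and ".." at the start of a line becomes ".". */
    #include <stddef.h>
    #include <string.h>

    static size_t stuff_line(const char *line, size_t len, char *out)
    {
        size_t n = 0;
        if (len > 0 && line[0] == '.')
            out[n++] = '.';              /* prepend the termination octet */
        memcpy(out + n, line, len);
        return n + len;
    }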
Christopher Wells
Monday, June 19, 2006
 
 
Stick with an ASCII protocol. You will thank yourself in the end; it's easier to debug and easier to understand than a binary protocol.

To receive lines you do as previously suggested: read as much as you can from the socket, stuff it into an intermediate buffer, and then scan that buffer for lines. When you find a newline, remove that line from the intermediate buffer, leaving any remaining characters (they will be part of the next line).

Using select with non-blocking sockets is always a good idea; it reduces the CPU overhead on most operating systems by allowing the system to do other things while your application is waiting for new data.
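
On a Unix-flavoured system that setup looks roughly like this (sketch for a single socket; a real server would put all its descriptors in the fd_set):

    /* Sketch: switch a socket to non-blocking mode, then wait until it
       is readable before calling recv. */
    #include <fcntl.h>
    #include <sys/select.h>

    static int make_nonblocking(int sock)
    {
        int flags = fcntl(sock, F_GETFL, 0);
        return fcntl(sock, F_SETFL, flags | O_NONBLOCK);
    }

    /* Returns >0 when sock is readable, <0 on error (no timeout here). */
    static int wait_readable(int sock)
    {
        fd_set rfds;
        FD_ZERO(&rfds);
        FD_SET(sock, &rfds);
        return select(sock + 1, &rfds, NULL, NULL, NULL);
    }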
Brian Lane
Monday, June 19, 2006
 
 
Plaintext protocols are surely no worse than using XML payloads.

Yes, an application-level stream assembly buffer for use in accumulating messages/datagrams is pretty routine stuff.  For output, many message-over-TCP programs also use an outbound buffer (a lightweight "message queue") and drive actual transmission of the data with send completion events/callbacks... assuming you're using async I/O operations rather than wasting a thread on output to one socket.

Most of this can be packaged up fairly neatly as a socket wrapper component or library for reuse (debug once, profit endlessly).
Tonya Anon
Monday, June 19, 2006
 
 
ASCII?  ASCII?  Must be a Unix guy or what?

If only my life were that simple, but I have EBCDIC and Unicode to consider as well.  Not only ISN'T it 1982 anymore, it's ALSO still 1964!
Krish T
Monday, June 19, 2006
 
 
Most internet protocols use text, or at least use text on the command channel and binary on the data channel. But they generally don't send a lot of complex commands.

If your application has large structured data that isn't mostly text fields, then go binary.

But go all the way. Once you serialize it into and out of binary format your performance is toast anyway. So if you are going to do it stupidly just use text.
son of parnas
Monday, June 19, 2006
 
 
Son of Parnas:

So you're saying that I should make the decision to go either all binary or all text, because hybrids perform just as badly as text?

Also, here's another question: Given how fast processors are when compared to I/O channels, is it really terrible to use a text-based protocol?  At first blush, I cringed at the idea of reading a buffer, scanning it to find delimiters, etc., when I could just use a binary protocol that would lead with a message size.

However, after thinking about it, I figured-- the choice must depend on what proportion of the time is spent computing a response.  In other words-- what's the true bottleneck?

I would welcome comments or criticism on the above statement.  I'm new to developing socket-based server apps and am trying to understand the issues.

Thanks again, I always feel like this is a great forum to learn.
New To Sockets
Monday, June 19, 2006
 
 
Right.  Where's the bottleneck?

If network I/O isn't a bottleneck, there are plenty of HTTP libraries out there where you can use XML out of the box.  I hate XML as much as the next guy, maybe even more, but writing a text protocol from scratch will mean you have to deal with a lot of weird bugs from invalid/untrusted data.  For example, will a port-scan from an unknown client crash your app?
Grant
Monday, June 19, 2006
 
 
I think Parnas was saying that for a 'command' channel, text would be fine, but for large amounts of binary data on the 'data' channel, binary would be better.

That's because the 'command' channel doesn't usually get very large packets compared to a 'data' channel, so expanding a little into text is not a problem.

I still think the "<text number string><lf><number of binary bytes>" can be pretty efficient.  You read the text string byte by byte -- but it's short.  You can then read the 'number of bytes' in a single read (if all of them have been received already).

I've gotten data from Alaska to Washington D.C. using this, and it works very reliably.  And yes, over that distance, occasionally you do get the 'front half' of the binary data in one read, then the second half in the next read.
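
For completeness, the sending side is just a couple of calls (sketch; real code should loop on short sends):

    /* Sketch: transmit a message as "<decimal length>\n" followed by
       the payload bytes. */
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    static int send_msg(int sock, const void *payload, size_t len)
    {
        char header[32];
        int hlen = snprintf(header, sizeof header, "%zu\n", len);
        if (send(sock, header, (size_t)hlen, 0) != hlen)
            return -1;
        if (send(sock, payload, len, 0) != (ssize_t)len)
            return -1;          /* a robust version loops on short sends */
        return 0;
    }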
AllanL5
Monday, June 19, 2006
 
 
> I cringed at the idea of reading a buffer, scanning it to find delimiters

My rule of thumb, first order approximation, is that any code I write takes negligible time compared to any O/S call (e.g. recv) or library call (e.g. malloc).

> In other words-- what's the true bottleneck?

I'm guessing that you're not trying to get a throughput of megabytes per second.

One advantage of binary over ASCII is network bandwidth: e.g. a dial-up link at 28.8 Kbps is more of a bottleneck than a typical CPU.
Christopher Wells
Monday, June 19, 2006
 
 
My primary objection to XML (always, I'm not a big fan) is the overhead involved.  The fact that there are well known parsers available, though, does make it attractive as an envelope for messages.
New To Sockets
Monday, June 19, 2006
 
 
Actually though, now that I think it through, you could still take advantage of HTTP returning plain text data.  HTTP itself is reasonably lightweight if it meets your requirements.
Grant
Monday, June 19, 2006
 
 
Search for Type-Length-Value (TLV). Basically, each message that you send or receive must start with a Type/Length header. If a given Type always implies the same length you don't need a Length.

Doing this lets you know how much data you are expected to send or receive. This can be done at the ASCII or binary level but it's better at the binary level.
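
A sketch of the receive side, assuming a 1-byte Type and a 2-byte big-endian Length (the field widths are a design choice), plus a read_exact() helper that loops until all the requested bytes have arrived, like the one posted earlier in the thread:

    /* Sketch: read one TLV record: 1-byte type, 2-byte length, then
       'length' bytes of value. 'value' must be able to hold 65535
       bytes, or the length must be validated against the buffer size
       first -- that's the buffer-overrun trap mentioned earlier. */
    #include <stdint.h>
    #include <stddef.h>

    int read_exact(int fd, char *buf, size_t len);   /* loops over recv() */

    static int read_tlv(int sock, uint8_t *type, uint16_t *len, uint8_t *value)
    {
        uint8_t hdr[3];
        if (read_exact(sock, (char *)hdr, sizeof hdr) < 0)
            return -1;
        *type = hdr[0];
        *len  = (uint16_t)((hdr[1] << 8) | hdr[2]);
        return read_exact(sock, (char *)value, *len);
    }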
Wayne B
Monday, June 19, 2006
 
 
Yep, Unix guy. For a good discussion on the benefits of using plain text see Eric Raymond's "Art of Unix Programming", Chapter 5. Textuality - http://www.catb.org/~esr/writings/taoup/html/ch05s01.html
Brian Lane
Monday, June 19, 2006
 
 
You can't overestimate the value of being able to point telnet to your port and do real work. You can also look for web protocol stuff so someone can point their browser at the port and get an interface too.

Once you go to binary you can't do this sort of thing.
son of parnas
Monday, June 19, 2006
 
 
> You can't overestimate the value of being able to point telnet to your port and do real work.

This is a huge factor in debugging stuff, also being able to use something like "sock" as a server to point your client at.
Arethuza
Tuesday, June 20, 2006
 
 
"I know that recv can return fewer bytes than requested, but I haven't seen this in practice unless"

Your tests are too simple. You DO always need to cater for this because it CAN and WILL happen. Often developers test on a single machine using the loopback interface (where you're unlikely to see short recvs) or between machines on a network segment that is relatively quiet, etc. Out in the wild you WILL see recvs that return less than you expect, so you should ALWAYS write the recv code to assume that each recv may return as little as a single byte...

A few years ago I wrote a tool to help with testing this kind of thing; it's on CodeProject, if you're interested: http://www.codeproject.com/cs/internet/testingsocketservers.asp

As for your original question, as some of the other posters have said: read into an intermediate buffer, look for line ends, extract complete commands, move the remaining data to the 'front' of the buffer, look for line ends again (etc.), and when you don't find any more complete commands, read again into the same intermediate buffer, starting just past the data you haven't processed yet.
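
In outline (sketch; handle_command() is a stand-in for your protocol logic, and a real version must cope with a line that overflows the buffer):

    /* Sketch: read into an intermediate buffer, peel off complete
       newline-terminated commands, slide the remainder to the front. */
    #include <string.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    extern void handle_command(const char *cmd, size_t len);  /* hypothetical */

    static void pump(int sock)
    {
        char buf[4096];
        size_t used = 0;
        for (;;) {
            ssize_t n = recv(sock, buf + used, sizeof buf - used, 0);
            if (n <= 0)
                break;                        /* closed, or error */
            used += (size_t)n;
            char *nl;
            while ((nl = memchr(buf, '\n', used)) != NULL) {
                size_t cmdlen = (size_t)(nl - buf) + 1;  /* include the '\n' */
                handle_command(buf, cmdlen);
                memmove(buf, buf + cmdlen, used - cmdlen);
                used -= cmdlen;
            }
        }
    }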
Len Holgate
Tuesday, June 20, 2006
 
 
I found this in my archives, I am not sure who it was from:

    But I think you're confused about how "heavy" HTTP is.
    Look:

    GET /some-file HTTP/1.0\r\n
    \r\n

    Holy shit! That's it!

    Then you get the data sendfile(2)'d back to you after this huge header:

    HTTP/1.1 200 OK
    Content-Length: 23983

    Man, that's heavy.
son of parnas
Tuesday, June 20, 2006
 
 
With respect to what I said earlier about recv returning fewer bytes:  I understand that stream-based sockets do not preserve message boundaries, but what I WAS seeing was that if I requested 20 bytes and only 10 bytes were available, recv() would block.

My sample server is now more sophisticated and puts the sockets into nonblocking mode and waits on a set using select.  I don't see this behavior any more (And of course am wondering if I ever did...)

Thanks for the help. I think I agree with Parnas et al. - unless the protocol really needs to be tight, using text makes things easier to see/debug (although probably slightly more work to program).
New To Sockets
Tuesday, June 20, 2006
 
 
"but what I WAS seeing was that if I request 20 bytes and only 10 bytes were available, recv() would block"

recv will return 10 bytes (when you asked for 20 bytes) if and only if the sending side closes the socket, I think

PS:
  sorry for such ugly English ))
Iskandar Zaynutdinov
Friday, July 07, 2006
 
 

This topic is archived. No further replies will be accepted.
