Luke Gorrie's blog

21st Century Network Hacking

Echoing Packets With Snabb Switch

Just for fun, here is some simple Snabb Switch example code to retransmit every packet that is received on a set of Ethernet ports. It’s part of the basic selftest code in port.lua. The coding style is starting to settle down lately and so it’s more tempting for me to share little snippets of code.

function Port:echo ()
   local inputs, outputs = self.inputs, self.outputs
   repeat
      for i = 1,#inputs do
         local input, output = inputs[i], outputs[i]
         input:sync_receive()
         while input:can_receive() and output:can_transmit() do
            local buf = input:receive()
            output:transmit(buf)
            buffer.deref(buf)
         end
         while input:can_add_receive_buffer() do
            input:add_receive_buffer(buffer.allocate())
         end
         output:sync_transmit()
      end
      C.usleep(1)
   until coroutine.yield("echo") == nil
end

Reasonably easy to understand, I hope?

The features of this program that I would like to draw attention to are:

  1. It is written in a high-level language, LuaJIT.
  2. On one Xeon core it handles 12 million packets per second spread across 20 x 10GbE network ports. This loop runs in about 85 nanoseconds per packet. And there is heaps of room for optimization left, both in the single-core and multi-core contexts.
  3. It uses device drivers directly. The input and output objects are each one of our drivers for Intel 1G, Intel 10G, or Virtio ethernet devices. The “buffers” are blocks of physical memory that are directly used for hardware DMA.
  4. It’s written in a pretty natural style and yet doesn’t need to garbage collect. The only objects being frequently allocated and freed are the packet buffers, and those are straightforwardly reference counted with buffer.ref() and buffer.deref().
  5. There are no confusing concepts like abrupt processor interrupts, threads, locks, or special memory allocation flags. It’s plain and simple high-level user space code just like, say, the JavaScript on a web page.
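To make point 4 a little more concrete, here is a toy model of reference-counted buffers in plain Lua. It is only a sketch under assumptions: the names buffer.allocate(), buffer.ref() and buffer.deref() come from the text above, but the free list, the refcount field and everything else here are hypothetical and are not the actual Snabb Switch implementation, which manages DMA memory rather than Lua tables. The assumption illustrated is that a transmit queue holds its own reference, which would explain why the echo loop can call buffer.deref(buf) right after output:transmit(buf).

```lua
-- Toy model of reference-counted packet buffers.
-- Hypothetical sketch only: not the Snabb Switch implementation.
local buffer = {}
local freelist = {}   -- recycled buffers waiting to be reused
local allocated = 0   -- number of fresh buffers ever created

-- Return a buffer holding one reference, reusing a freed one when possible.
function buffer.allocate()
   local b = table.remove(freelist)
   if not b then
      allocated = allocated + 1
      b = { id = allocated }
   end
   b.refcount = 1
   return b
end

-- Take an extra reference, e.g. while a transmit queue holds the buffer.
function buffer.ref(b)
   b.refcount = b.refcount + 1
   return b
end

-- Drop a reference; the last drop puts the buffer back on the free list.
function buffer.deref(b)
   assert(b.refcount > 0, "deref of a free buffer")
   b.refcount = b.refcount - 1
   if b.refcount == 0 then freelist[#freelist + 1] = b end
end

-- Usage: a pretend transmit path keeps its own reference until "sent".
local buf = buffer.allocate()
buffer.ref(buf)      -- the transmit queue takes a reference
buffer.deref(buf)    -- the caller is done; the buffer is still in use
buffer.deref(buf)    -- transmit completes; the buffer is recycled
local again = buffer.allocate()
print(again == buf, allocated)
```

In this model the second allocate() hands back the same table, so in steady state no new objects are created and the garbage collector has nothing to do.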

This, my friends, is an idea for how we could write high-speed packet networking code over the next decade or so. I hope to win you over to this way of thinking :-)

If you want to know more then check out the project homepage, browse the code on GitHub, or browse an early draft of the book.

Recent Talks

I have given a few talks recently:

SLIME at the Emacs Conference (16 min). History of the SLIME project and tips for people writing new Emacs-based IDEs.

Snabb Switch at the Swiss OpenStack User Group (8 min). Introducing the Snabb Switch project to people in the local OpenStack cloud computing community.

Snabb Switch at the EduPERT workshop (27 min). What networking problems can you address with x86 servers? and more on Snabb Switch.

Teclo Networks at ECLM 2011 (40 min). The early days of teclo.net the telecom startup company founded by Common Lisp hackers.

This has been really fun!

Routing Traffic Between Local Applications on Linux

Want to create a really interesting virtual network on your own host and test it with ordinary applications? Great! Here is how.

We will make the address 192.168.100.1 act like 127.0.0.1 but route packets through a custom network topology before processing them.

First start with any custom topology. In this example: west and east are endpoints with an Open vSwitch bridge ovs in between. (This would be great for applying OpenFlow rules to packets sent between local applications.)

ip link add west type veth peer name west_to_ovs
ip link add east type veth peer name east_to_ovs

ovs-vsctl add-br ovs
ovs-vsctl add-port ovs west_to_ovs
ovs-vsctl add-port ovs east_to_ovs

for i in west west_to_ovs east east_to_ovs ovs; do
  ip link set $i up
done

Now assign addresses and routes for these interfaces. Packets sent to 192.168.100.1 should first be routed into interface west then switched via ovs and finally delivered to east for processing.

ip addr add 192.168.100.2 dev west
ip addr add 192.168.100.1 dev east

ip route add 192.168.100.1 dev west
ip route add 192.168.100.2 dev east

The ingredients are in place but they don’t work yet. If you ping 192.168.100.1 then the packets are sent to lo instead of being routed through the bridge.

And that brings us to the trick: Policy Routing.

First make Linux globally “forget” that these addresses are local.

ip route del 192.168.100.2 table local
ip route del 192.168.100.1 table local

Now packets sent to 192.168.100.1 do get routed down the right path. They are not processed at the other end though, because Linux no longer remembers that the addresses are local. We are halfway there.

Next create separate routing tables strictly for when packets are received after they have traversed the switch. These tables remember that the addresses are local.

ip rule add iif west lookup 100
ip route add local 192.168.100.2 dev west table 100
echo 1 > /proc/sys/net/ipv4/conf/west/accept_local

ip rule add iif east lookup 101
ip route add local 192.168.100.1 dev east table 101
echo 1 > /proc/sys/net/ipv4/conf/east/accept_local

Now we are done!

If you connect to 192.168.100.1 then your packets will first traverse the bridge and then be processed locally. The setup is symmetric so that return traffic will be routed back through the bridge too. This will work with all your favourite programs like ping, curl, apache, etc. Check it out by running tcpdump on west or east.

Go ahead and create interesting virtual networks on your own machine.

Snabb - My Lab

I love my little company, Snabb. It is my laboratory.

My lab makes me productive in society. I write really cool open source code for anybody to use, and I earn money by traveling around meeting interesting people and helping them to solve important problems. I have “hard fun” looking for ways to do these things in harmony.

My lab is where I can be creative. I can ask myself, after 25 years of thoughtful programming, what’s the right way to do fast ethernet I/O? and then develop my answer: a small 10G Ethernet device driver written in LuaJIT and embedded in the application. Fun!

My lab lets me work with friendly people in the open source world. I get to take part in the conversation on how to write networking software, and I can collaborate on equal footing with other clever hackers who have their own interests and labs, big and small.

My lab lets me buy fun toys (without asking permission :-)). Any day now I will receive a server with twenty 10G ethernet ports to share with everybody hacking on the Snabb Switch project. We get to break new ground together in the spirit of creative fun.

My lab lets me create software of enduring value. I do not have to constrain my code with secrecy and license restrictions, which means that it can take on a life of its own. The software can follow its own strong tendency to spread and thrive. I have just given a talk about SLIME at the first Emacs conference – 10-year-old work on a 37-year-old editor and still in widespread use!

My lab lets me choose to do all of these things, all by myself, and it would let me choose to do something completely different if that was what I wanted. It lets me decide when to work from my house in the Swiss Alps, a beach in Thailand, my family’s home in Australia, a cosy cafe in a big city, or a friendly office. It makes me abundantly wealthy in freedom and independence.

In summary I feel that I am close to a local maximum for creativity and freedom.

The problem has never been to “make” the company a success, but rather to preserve the great situation that I already had from the first day. On the one hand I need to avoid things like running out of money, selling shares, and signing employment contracts, and on the other hand I need to keep writing cool open source software and finding people with important problems I can help with. This way I can keep on being proud of my company.

"A man is a success if he wakes up in the morning and goes to bed at night and in between does what he wants to do." - Bob Dylan

Know what I mean? I would be happy to hear from you on luke@snabb.co.

Cute Code

Today was one of those pleasant days when I know it’s not the time for serious programming.

I had a lovely twitter conversation with Dimitri Fontaine, Tony Finch, and Jan Lehnardt. I started out wanting to recommend a data structure to @rahul-mr for easily garbage collecting Snabb Switch’s ethernet forwarding table. Trouble is, I don’t know the name of the data structure, so I posted this implementation in a gist and asked if anybody knows what it’s called. Here’s the code:

-- Table structure that "garbage collects" not-recently-used items in O(1) time.

-- Three operations are supported:
--   insert(k,v): add a new value
--   lookup(k): lookup an existing value
--   age(): delete old entries (that have not been used since previous call to age())

-- Initialize 'old' and 'new' to empty tables
local old, new = {}, {}

function insert(k, v)
  new[k] = v
end

function lookup(k)
  if new[k] then
    -- Found in new table
    return new[k]
  elseif old[k] then
    -- Migrate from old table to new table
    new[k] = old[k]
    return old[k]
  else
    -- Not found
    return nil
  end
end

function age()
  -- Entries in 'old' are dropped, entries in 'new' become old.
  old = new
  new = {} -- empty table
end

And the twitterverse helped me rewrite this much more concisely:

local old, new = {}, {}
function insert(k, v) new[k] = v; return v               end
function lookup(k)    return new[k] or insert(k, old[k]) end
function age()        old, new = new, {}                 end

which I find rather aesthetically satisfying.
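As a quick check of how the ageing behaves, here is a small usage sketch of the concise version. The keys and values are made-up examples (think MAC address to port name), and the functions are declared local only so that the snippet is self-contained.

```lua
local old, new = {}, {}
local function insert(k, v) new[k] = v; return v               end
local function lookup(k)    return new[k] or insert(k, old[k]) end
local function age()        old, new = new, {}                 end

insert("aa:bb", "port1")
insert("cc:dd", "port2")
age()                               -- both entries are now in 'old'
assert(lookup("aa:bb") == "port1")  -- used, so it migrates back into 'new'
age()                               -- "cc:dd" was not used, so it is dropped
assert(lookup("aa:bb") == "port1")  -- survived because it was looked up
assert(lookup("cc:dd") == nil)      -- gone after two age() calls without use
```

So an entry survives as long as it is looked up at least once between consecutive calls to age(). One caveat of the concise form: because lookup relies on `or`, a stored value of false is treated the same as a missing entry.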

Rewriting code more concisely is one of my favorite activities. Lisp is my usual tool of choice for this purpose, so I tried a translation just for fun:

(defvar *new* (make-hash-table))
(defvar *old* (make-hash-table))

(defun insert (k v)
  (when v (setf (gethash k *new*) v)))

(defun lookup (k)
  (or (gethash k *new*) (insert k (gethash k *old*))))

(defun age ()
  (rotatef *old* *new*)
  (clrhash *new*))

and I found it interesting that the Lua version is so much more compact than the Lisp version. Sure, I’ve compacted it with whitespace tweakery and so on, but each version is as concise as I feel comfortable making it. So I wonder if Lua is becoming my preferred vehicle for writing pseudo-pseudo-code and indulging in cutenesses?

Then having cuteness cross my mind I couldn’t help but think back to cute code I’ve worked on before. I wrote my own favorite bit of cute production code in the OLPC firmware HD Audio driver. You see, the firmware is allowed to hard-code knowledge of the physical design of the laptop and motherboard. This bit of firmware code tells the audio chip explicit details such as the size and color of each physical audio jack in the laptop. The chip can later provide this information to the operating system for presentation to a user, for example in an audio mixer application.

porta  config(  1/8" green left hp-out jack     )config
porta  config(  1/8" green left hp-out jack     )config
portb  config(  1/8" pink left mic-in jack      )config
portc  config(  builtin internal top mic-in no-detect other-analog )config
portd  config(  unused line-out no-detect       )config
porte  config(  unused line-out no-detect       )config
portf  config(  unused line-out no-detect       )config
portg  config(  builtin internal front speaker no-detect other-analog )config
porth  config(  unused line-out no-detect       )config
porti  config(  unused line-out no-detect       )config
portj  config(  unused line-out no-detect       )config
portk  config(  unused line-out no-detect       )config

The code reads a bit like English - 1/8” pink left mic-in jack - but is actually purely imperative Forth code that was inspired by the wonderful book Thinking Forth.

So it goes!

P.S. I never did work out the canonical name of the data structure. Please drop me an email on luke@snabb.co if you know.

Snabb Switch’s Kernel-bypass Networking

Snabb Switch has built-in kernel-bypass networking support. The switch is engineered as firmware and that means regarding hardware specifications on equal footing with software APIs when designing subsystems. My personal conclusion for this project is that hardware network interfaces from Intel compare favourably with software interfaces from third-party API developers. So I have written the code to talk directly to the hardware.

Here are the most notable features of the design:

  • It’s special-purpose: exactly the features that benefit the application.
  • It’s written 100% from scratch, including the base ethernet device driver.
  • It’s small: about 1 KLOC of source code; about 20KB of object code.
  • It’s fast: tens of millions of packets per second.
  • It supports Intel hardware: great quality in all shapes and sizes.
  • It runs on any modern Linux kernel. No kernel module or device driver needed.
  • It’s portable to other platforms: Give it access to hardware (physical RAM, raw PCI) and it takes care of everything else.
  • It’s independent. No forks to maintain, no upstreams to feed, no license fees to pay.

This design suits Snabb Switch very well. In the future I foresee support for more Intel NICs, more advanced NIC features, more operating systems, and more hardware families. This will be a fun challenge over the months and years ahead.

Self-reliance FTW!

(Comment on Snabb Switch Reddit)

Kernel-bypass Networking

Kernel-bypass networking is gaining popularity. This means moving control of Ethernet hardware directly into userspace processes to avoid the overhead of communicating with the operating system kernel. This gives userspace all of the raw performance traditionally enjoyed by the kernel – and all of the responsibility too. This is important for certain specialized applications that can gain as much as 20x more performance.

Here is a selection of the many kernel-bypass solutions that are available:

These products each take their own design approaches and it’s interesting to consider choices that they make.

  • Customized kernel device driver. netmap and DNA both fork standard Intel drivers with extensions to map I/O memory into userspace.
  • Custom hardware. Myricom and Napatech both distribute bespoke device drivers for their own custom hardware (ASIC for Myricom and FPGA for Napatech).
  • Userspace library. These solutions each provide unique libraries to access their extensions. The scope varies tremendously: Ethernet I/O, libpcap compatibility, hardware-assisted traffic dispatching for multiprocessing, buffer memory management, all the way up to entire TCP/IP socket layers.
  • Licensing. netmap is open-source, DNA requires a modest license for its userspace library, Napatech requires an NDA and depends on very expensive hardware.

If you are developing high-speed (10+ Gbps) networking applications then you should seriously consider using one of these solutions. If you are an expert on one of these solutions then please tell us about it on the Snabb Switch Reddit!

Snabb Switch’s LuaJIT Ethernet Device Driver

I am having fun while writing the Snabb Switch Ethernet device driver. This is like Intel’s standard E1000/IGB/IXGBE driver for Linux except that:

  • It is written in LuaJIT. (It’s my first LuaJIT program, I’m a newbie.)
  • It runs in a normal Linux userspace process but talks directly to hardware.
  • It is tailor made for one application, a simple hypervisor-friendly ethernet switch.

I have been optimizing the LuaJIT selftest code to transmit ethernet packets in a loop. I am pretty encouraged by the performance that I see: 3.1% CPU utilization on a low-end Hetzner EX6 machine to saturate a 1Gbps ethernet port with tiny packets. That is 28 nanoseconds of CPU time per packet.

I hope the details will be interesting. It is not so often that people write about low-level networking in high-level dynamic programming languages, is it? So, to give a quick taste, the driver source code is in intel.lua and the selftest main loop works like this:

local deadline = C.get_time_ns() + 10000000000LL
repeat
   while tx_load() > 0.75 do C.usleep(10000) end
   for i = 1, tx_available() do
      add_txbuf(dma_phys, 60)
   end
   flush_tx()
until C.get_time_ns() > deadline

and here is what that means:

  • Sleep while the hardware transmit queue is at least 75% full (tx_load).
  • Fill up the transmit queue with 60-byte packets (add_txbuf).
  • Tell the NIC hardware to process the transmit queue (flush_tx).
  • Stop after 10 seconds have elapsed (get_time_ns).

This is all accomplished by directly controlling the NIC using memory-mapped register I/O and DMA with shared memory. The only operating system calls here are to sleep and check the time.
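For readers who want to play with the control flow without any hardware, here is a simulation of that loop in plain Lua. The function names tx_load, tx_available, add_txbuf and flush_tx are taken from the snippet above, but their bodies here are hypothetical stand-ins: the ring of 8 slots, the head and tail counters, and the nic_send helper that plays the role of the NIC are all assumptions for illustration, and no registers or DMA memory are involved.

```lua
-- Simulated transmit ring. Hypothetical sketch: the real driver uses
-- memory-mapped NIC registers and DMA descriptors instead of counters.
local ring_size = 8
local head = 0           -- next slot the pretend hardware will send
local tail = 0           -- next slot software will fill
local visible_tail = 0   -- tail value the pretend hardware has been told about
local sent = 0           -- packets "transmitted" so far

local function tx_queued()    return (tail - head) % ring_size end
-- One slot stays empty so a full ring can be told apart from an empty one.
local function tx_available() return ring_size - 1 - tx_queued() end
local function tx_load()      return tx_queued() / ring_size end

-- Queue one packet; address and size are ignored in this simulation.
local function add_txbuf(address, size)
   assert(tx_available() > 0, "transmit ring is full")
   tail = (tail + 1) % ring_size
end

-- Publish the new tail, as a write to a tail register would.
local function flush_tx() visible_tail = tail end

-- Stand-in for the NIC: send up to 'max' published packets.
local function nic_send(max)
   while max > 0 and head ~= visible_tail do
      head = (head + 1) % ring_size
      sent = sent + 1
      max = max - 1
   end
end

for round = 1, 5 do
   -- Instead of sleeping, let the pretend NIC drain one packet at a time.
   while tx_load() > 0.75 do nic_send(1) end
   for i = 1, tx_available() do add_txbuf(0, 60) end
   flush_tx()
end
print(sent, tx_queued())
```

The first round fills the empty ring; every later round waits until the load has dropped to 75% and then tops the ring up again, which is the same rhythm as the selftest loop.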

This is a really fun sort of programming to be doing!

Going forward I am really excited to see how much of a production quality Ethernet switch can be written in a high-level dynamic programming language, and how neatly any parts that are ultimately written in C can be integrated into the whole. This is an open source project and you are welcome to join in the fun too!

(Comments welcome on the Snabb Switch Reddit.)

Tech Mesh 2012 Trip Report

I had a great time at the Tech Mesh conference in London this week. Hearty thanks to everybody who made it happen. These are my impressions.

Stuart Bailey’s talk was heartwarming. He’s an Erlang guy who’s finally meeting other Erlang people in the flesh. “Honey, you wouldn’t believe it, I can talk about Erlang here and they don’t look at me like I’m crazy.” I think it’s a beautiful moment that many of us can relate to.

Amazingly, Lisp code was on screens all over the place with no fuss being made whatsoever. This was all Clojure. Rich Hickey was there. I’d heard him speak once before, five years ago at the Lisp 50th anniversary event at OOPSLA, where everybody looked to him as the great hope to give Lisp a fresh start. Looks to me like he has delivered on what he promised. Great work Rich! That is no small feat.

I met a lot of friendly and interesting people. Haskell hackers, Xen hackers, even another Queenslander like me. I found the boyish enthusiasm of Simon Peyton Jones and Joe Armstrong very infectious, as always. I was also really glad to meet up with a lot of my old friends from the Stockholm Erlang scene.

The language runtime panel reminded me of one idea that’s been rattling around in my head forever. Take it as given that (a) the Erlang VM is great for concurrency because it gives you efficient process isolation and (b) hardware advances are making the Linux kernel’s process isolation more efficient every year. So when will it be time to start using the Linux kernel as an Erlang-like language runtime environment? If you strip away all the layers of crud on top, are we already there? This seems like a valid research question.

I like the overall conference theme of helping to introduce niche ideas to a wider group of people. I am a bit outside the target demographic myself. Tech Mesh is full of ideas with ten thousand devotees trying to spread themselves to the next million. I am more comfortable with the smaller memes myself, when everybody thinks you are mad and you have to work hard to convince the first ten or hundred people that you’re not. That is why I am working on high-performance networking firmware in userspace with LuaJIT device drivers. I suppose everybody has their own ideal proximity to mainstream thinking and that every technology is a moving target in that respect.

The organization was great. Plentiful food, coffee, and other beverages. A bit chilly but hey, this is England. The venue was right near the British Museum so I finally saw the Rosetta Stone for the first time. Thanks everybody!

Firmware vs. Software

Firmware is great stuff. It’s the same as software and at the same time it’s different. It’s a bit like visible and ultraviolet light: they are fundamentally the same “stuff” but they are on different parts of the spectrum.

  1. Firmware compiles to a self-contained image. The build process creates this one file and it has no external dependencies. There is a specific piece of hardware that can execute the image file.
  2. Firmware has known space targets. They are usually measured in kilobytes. You always keep an eye on how much of your ROM and RAM budgets are used up.
  3. Firmware is intimate with hardware. You don’t use libraries and frameworks, you control microchips. Your documentation is hardware data sheets and errata lists. Your job is to understand the hardware completely and make it sing.
  4. Firmware is self-reliant. You take responsibility for every byte you ship in your image. That means being prepared to study and learn to debug any code that you choose to reuse. You always keep an eye on the total transitive complexity of the code you depend on.

My favorite piece of firmware is Open Firmware. This is shipped on a 1MB ROM chip in the OLPC XO instead of a traditional BIOS. It’s a complete and self-sufficient software environment that has been carefully crafted over the past few decades by super-hacker Mitch Bradley. If you have an OLPC XO and would like to learn about firmware then I can recommend studying Mitch’s Forth Lessons.

These days we can choose to deploy our applications as firmware, software, and increasingly as full-scale OS distributions – kilobytes, megabytes, or gigabytes. There is sometimes one style that’s clearly the best and other times several options are all reasonable choices.

Firmware is fun and refreshing to write. I have chosen to develop my new Snabb Switch project as firmware. I will post a lot more about the implications of this decision over time.