Cross-host Container Communication with rkt and flannel

The v0.8 release of rkt, a container runtime, introduced many valuable features. One notable feature is the ability to run rkt with flannel, a software-defined network for containers, with very little setup. This makes it easy to give every container in your cluster a unique IP over which the containers can communicate with each other.

Setting up rkt with flannel

Let's walk through setting up rkt with flannel on CoreOS. We start with CoreOS image 808.0 or later and bring up three instances clustered together using the following cloud-config:

#cloud-config
coreos:
  units:
    - name: etcd2.service
      command: start
    - name: flanneld.service
      drop-ins:
        - name: 50-network-config.conf
          content: |
            [Service]
            ExecStartPre=/usr/bin/etcdctl set /coreos.com/network/config '{ "network": "10.1.0.0/16" }'
      command: start
  etcd2:
    discovery: $YOUR_DISCOVERY_TOKEN
    advertise-client-urls: http://$public_ipv4:2379
    initial-advertise-peer-urls: http://$private_ipv4:2380
    listen-client-urls: http://0.0.0.0:2379
    listen-peer-urls: http://$private_ipv4:2380
write_files:
  - path: "/etc/rkt/net.d/10-containernet.conf"
    permissions: "0644"
    owner: "root"
    content: |
      {
        "name": "containernet",
        "type": "flannel"
      }

Once the instances have booted, we can confirm that flannel is up and running by checking for the flannel0 interface.
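A quick way to do that, along with checking the subnet flannel was assigned on this host, is shown below; the subnet values will differ from host to host:

$ ip addr show flannel0
$ cat /run/flannel/subnet.env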

If you look back at the cloud-config, you will notice that the write_files section has written out a 10-containernet.conf file. This describes a network that rkt containers will join. In our case the configuration is very simple: it gives the network a name and specifies that it will work with flannel. We will look into the specifics of the "type" field shortly.

We are now ready to launch a container with rkt to test out the setup. We will be using an Alpine Linux Docker container with an entrypoint set to /bin/sh. Start the rkt container as follows:

$ sudo rkt run --private-net --interactive --insecure-skip-verify docker://quay.io/coreos/alpine-sh:latest
Downloading f4fddc471ec2: [====================================] 2.49 MB/2.49 MB
Downloading 577f81886e20: [====================================] 32 B/32 B
2015/09/16 19:17:06 Preparing stage1
2015/09/16 19:17:07 Loading image sha512-14f9c6504e687e4b902461437ddb3d4c3d84c039bf9111d5d165a52e380942b7
2015/09/16 19:17:07 Writing pod manifest
2015/09/16 19:17:07 Setting up stage1
2015/09/16 19:17:07 Writing image manifest
2015/09/16 19:17:07 Wrote filesystem to /var/lib/rkt/pods/run/6a35d365-565b-4c61-898e-2e2929c2ff38
2015/09/16 19:17:07 Writing image manifest
2015/09/16 19:17:07 Pivoting to filesystem /var/lib/rkt/pods/run/6a35d365-565b-4c61-898e-2e2929c2ff38
2015/09/16 19:17:07 Execing /init
/ #

The --private-net option instructs rkt to allocate a separate networking stack for the container and have it join the networks configured in /etc/rkt/net.d. To confirm that the container has joined the "containernet" network, look at its interfaces:

/ # ifconfig
eth0      Link encap:Ethernet  HWaddr 96:97:A2:15:4F:A7
          inet addr:10.1.93.2  Bcast:0.0.0.0  Mask:255.255.255.0
          inet6 addr: fe80::9497:a2ff:fe15:4fa7/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:8973  Metric:1
          RX packets:18 errors:0 dropped:0 overruns:0 frame:0
          TX packets:5 errors:0 dropped:1 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:2184 (2.1 KiB)  TX bytes:418 (418.0 B)

eth1      Link encap:Ethernet  HWaddr 3A:87:1C:29:9A:57
          inet addr:172.16.28.3  Bcast:0.0.0.0  Mask:255.255.255.254
          inet6 addr: fe80::3887:1cff:fe29:9a57/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:6 errors:0 dropped:0 overruns:0 frame:0
          TX packets:5 errors:0 dropped:1 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:508 (508.0 B)  TX bytes:418 (418.0 B)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

The IP of eth0 is within the flannel network that we defined (10.1.0.0/16). Note that eth1 is the so-called "default" network that is automatically added by rkt. It is there to allow the container to communicate with the host and the Internet.
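To see how traffic is split between the two interfaces, you can check the container's routing table; expect a 10.1.0.0/16 route via eth0 and a default route via eth1 (the exact output will vary with the subnets assigned on your hosts):

/ # route -n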

Bring up the same Alpine container on the other two instances and note their eth0 IPs. The containers should now be able to ping each other by their flannel (eth0) IPs.
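For example, if the container on one of the other instances came up with 10.1.77.2 on eth0 (an illustrative address; substitute the one you noted), you can verify connectivity like this:

/ # ping -c 3 10.1.77.2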

Warning: On CoreOS, the above setup cannot currently be mixed with Docker, because only a single flannel subnet is allocated to each host. Starting the docker.service unit will cause the docker0 bridge to be assigned the same flannel subnet, leading to conflicts. Support for running Docker and rkt side-by-side with flannel will be added in the future.

Looking behind the curtain

rkt uses the Container Network Interface (CNI) to power its networking plugins. The "type" field in the network conf file refers to the CNI plugin that will set up the network. With that in mind, let's look at what CNI's flannel plugin goes through to attach the container to the "containernet" network.

A CNI plugin is a simple executable file that runs when the container comes up and runs again when the container is torn down during the garbage collection cycle. As we'll see, the flannel plugin itself does surprisingly little: it is actually a wrapper around two lower-level plugins. When executed to add a container to the network, it combines the information from /etc/rkt/net.d/10-containernet.conf and /run/flannel/subnet.env to generate a configuration for the plugins to which it delegates the work.
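Although rkt drives this for you, it is useful to see the shape of a CNI plugin invocation: the runtime passes its parameters through CNI_* environment variables and pipes the network configuration to the plugin on stdin. Roughly like this (the plugin and namespace paths here are illustrative, not rkt's actual layout):

$ CNI_COMMAND=ADD \
  CNI_CONTAINERID=6a35d365-565b-4c61-898e-2e2929c2ff38 \
  CNI_NETNS=/var/run/netns/example \
  CNI_IFNAME=eth0 \
  CNI_PATH=/usr/lib/rkt/plugins/net \
  /usr/lib/rkt/plugins/net/flannel < /etc/rkt/net.d/10-containernet.conf

CNI_PATH tells the flannel plugin where to find the "bridge" and "host-local" plugins it delegates to.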

/run/flannel/subnet.env is written out by flannel on startup and contains information such as the subnet that it was assigned:

FLANNEL_NETWORK=10.1.0.0/16
FLANNEL_SUBNET=10.1.93.1/24
FLANNEL_MTU=8973
FLANNEL_IPMASQ=true

CNI's flannel plugin uses this data to synthesize the following configuration for the "bridge" and "host-local" plugins:

{
    "name": "containernet",
    "type": "bridge",
    "mtu": 8973,
    "ipMasq": false,
    "isGateway": true,
    "ipam": {
        "type": "host-local",
        "subnet": "10.1.93.0/24",
        "routes": [ { "dst": "10.1.0.0/16" } ]
    }
}

It then executes the "bridge" plugin, which does the following (you can observe the result on the host with the commands shown after this list):

  • creates a Linux bridge on the host
  • executes the "host-local" plugin to get an IP for both the container and the bridge (gateway) within 10.1.93.0/24
  • assigns an IP to the bridge
  • creates a veth pair
  • plugs one end of the veth pair into the bridge
  • moves the other end of the veth into the container and assigns it an IP
  • ensures that the MTU on both the bridge and the veths is 8973 (to match flannel)
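The result of these steps is visible from the host. The bridge and veth interface names are chosen by the plugin, so treat the commands below as a way to poke around rather than as exact names to expect:

$ ip link show type bridge
$ ip link show type veth
$ bridge link show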

The above flow illustrates a key design decision of CNI: a plugin gets full control over both the host and container networking namespaces and is expected to do everything needed to connect the container to the network. It is, however, encouraged to delegate some of the work to other plugins. Giving plugins complete control over the namespaces provides the most flexibility and allows plugin authors to better integrate their networking solutions with rkt and other CNI-compatible container runtimes.
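To make that concrete, here is a deliberately tiny sketch of what a CNI plugin can look like. It is not part of rkt or flannel; it only brings up the loopback interface inside the container's namespace, and a real plugin would also print a JSON result on stdout describing the interfaces and IPs it configured:

#!/bin/sh
# Minimal illustrative CNI plugin: bring up loopback inside the container's
# network namespace on ADD, do nothing on DEL.
case "$CNI_COMMAND" in
ADD)
    # CNI_NETNS is the path to the container's network namespace.
    nsenter --net="$CNI_NETNS" ip link set lo up
    ;;
DEL)
    # Nothing to tear down for loopback.
    ;;
esac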

If you'd like to learn more about CNI plugins and rkt, please join us at our next CoreOS Meetup in San Francisco on Monday, September 21.