Exploring Docker container networking: the virtualized physical and data link layers.
The default networking model used by Docker is a simple and familiar pattern: a multihomed host connecting an “internal” network to “external” ones, providing address translation as it forwards packets back and forth. In this case, the internal network is one or more virtualized segments connecting local containers, and the external networks are the Docker host’s other network connections. The Docker host will typically route packets between these networks.
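For a quick peek at the address translation half of that model, the masquerading and port-publishing rules live in the host’s NAT table; the exact chains and rules vary by Docker version and configuration, so treat this as a rough sketch rather than canonical output:
sudo iptables -t nat -S POSTROUTING   # expect a MASQUERADE rule covering the container subnet
sudo iptables -t nat -S DOCKER        # per-container DNAT rules for published ports (chain name may vary)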
The private internal network segment is built around a kernel Ethernet bridge. Each container is “wired” to the Docker host’s bridge by one half of a veth(4) interface pair. The other half is placed in the container’s network namespace and manifests as its eth0 interface.
Our Test Setup
We’ll start up a container with a socket printer to facilitate further investigation. Run a daemonized container with netcat(1) listening on 80/tcp.
Note: you'll want the OpenBSD flavor of netcat(1), which provides the -k flag, instructing netcat to loop and listen for additional connections.
jereme@buttercup $ docker run -d -p 8080:80 --name=test_ct debian nc -lkp 80
5d0b5875fcde3e67b02db0a356c129a34b02ad6ae70b6c076ef53845fbba3acb
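For a quick sanity check, we can poke the published port from the host and confirm the listener sees the data; this assumes a netcat binary on the host as well:
echo hello | nc -w1 localhost 8080   # send a line to the published port, giving up after 1 second
docker logs test_ct                  # nc inside the container prints what it received to stdout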
In this simple topology we have a single container, test_ct, with a network interface, eth0, connected to the default Docker network, which is named bridge.
Network Segment Design
Each of these networks is built around a kernel Ethernet bridge. The default Docker network is named "bridge" and uses an Ethernet bridge named docker0.
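If you create additional bridge-driver networks, Docker will typically back each one with its own kernel bridge as well. A quick sketch, using a hypothetical network name my_net:
docker network create --driver bridge my_net   # create a user-defined bridge network
docker network ls                              # my_net now appears alongside the defaults
ip link show type bridge                       # a new br-<id> bridge interface should show up
docker network rm my_net                       # clean up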
We can list out the currently defined Docker networks.
jereme@buttercup $ docker network ls
NETWORK ID          NAME                DRIVER              SCOPE
be324648dcae        bridge              bridge              local
7c1aff0016a4        host                host                local
70a1ccaba5b5        none                null                local
…and then inspect a particular network, dumping out all sorts of configuration details and the state of currently connected containers.
jereme@buttercup $ docker network inspect bridge
[
    {
        "Name": "bridge",
        "Id": "be324648dcae32e5b3f61ed5824e3c2cce1c249c26886b48adb5ca1c21719659",
        "Created": "2018-07-02T20:17:26.877137429-04:00",
        "Scope": "local",
        "Driver": "bridge",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "172.17.0.1/16",
                    "IPRange": "172.17.0.0/16",
                    "Gateway": "172.17.0.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {
            "5d0b5875fcde3e67b02db0a356c129a34b02ad6ae70b6c076ef53845fbba3acb": {
                "Name": "test_ct",
                "EndpointID": "804ccbc32cdec13d835d6eaf7719d3a3bae8798a47e3f04ff0d4b183281afd44",
                "MacAddress": "02:42:ac:11:00:02",
                "IPv4Address": "172.17.0.2/16",
                "IPv6Address": ""
            }
        },
        "Options": {
            "com.docker.network.bridge.default_bridge": "true",
            "com.docker.network.bridge.enable_icc": "true",
            "com.docker.network.bridge.enable_ip_masquerade": "true",
            "com.docker.network.bridge.host_binding_ipv4": "0.0.0.0",
            "com.docker.network.bridge.name": "docker0",
            "com.docker.network.driver.mtu": "1500"
        },
        "Labels": {}
    }
]
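When you only need a field or two, it can be easier to filter that JSON than to scan the whole dump. A couple of options, using jq (which we'll lean on again below) or docker's built-in Go templates:
docker network inspect bridge | jq '.[].IPAM.Config'        # just the IPAM configuration
docker network inspect bridge | jq '.[].Containers'         # just the connected containers
docker network inspect -f '{{json .IPAM.Config}}' bridge    # the IPAM block again, via a Go template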
Examining a Bridge
Using brctl(8), from bridge-utils, we can examine the bridge interface, docker0, which forms the base of the bridge Docker network.
Here we see the bridge has a single connected interface: veth8c9981e:
jereme@buttercup $ sudo brctl show docker0
bridge name     bridge id               STP enabled     interfaces
docker0         8000.024278a263b6       no              veth8c9981e
Bridges are simply network interfaces and they can be managed like any other network link, with ip(8), from iproute2.
jereme@buttercup $ ip link list dev docker0
50: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
link/ether 02:42:78:a2:63:b6 brd ff:ff:ff:ff:ff:ff
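On distributions where bridge-utils isn't installed (brctl(8) is considered legacy in many places), iproute2 can show the same membership:
ip -o link show master docker0   # interfaces enslaved to the docker0 bridge
bridge link show                 # per-port bridge state, via the iproute2 bridge(8) tool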
Together, these two interfaces, docker0 and veth8c9981e, form a single broadcast domain. As we add additional containers to the bridge Docker network, we will see additional veth interfaces added to that bridge, expanding the network segment.
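We can watch that happen by starting a second throwaway container and listing the bridge's ports again; the container name and command here are arbitrary:
docker run -d --name=test_ct2 debian sleep 3600   # a second container on the default bridge network
sudo brctl show docker0                           # a second veth port should now be listed
docker rm -f test_ct2                             # clean up; its veth disappears with it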
The design is quite scalable. Here is a production Docker host with many hundreds of containers on the docker0 bridge.
jereme@gnarly-carrot.nyc2 $ sudo brctl show docker0 | head
bridge name     bridge id               STP enabled     interfaces
docker0         8000.021c744cae6f       no              veth0032851
                                                        veth00ef9dd
                                                        veth01570e4
                                                        veth033485b
                                                        veth040a5dc
                                                        veth05e46cd
                                                        veth061f1f7
                                                        veth07251d6
                                                        veth072e807
jereme@gnarly-carrot.nyc2 $ sudo brctl show docker0 | grep -c veth
349
Connecting Containers to a Bridge
Containers are connected to a given Docker network’s Ethernet bridge via a pair of interconnected veth(4) interfaces. These interface pairs can be thought of as the two ends of a tunnel. This is the crux of Docker container network connectivity.
We’ve seen the beginnings of this already: one half of a veth pair, like veth8c9981e above, remains in the Docker host’s network namespace where it’s connected to the specified bridge. The other half is placed into the network namespace of the container and manifests as its eth0 interface. In so doing, we establish the layer 2 path upon which the rest of the container’s network connectivity will be built.
Using our test_ct as an example, recall the connected interface, veth8c9981e, on bridge docker0.
jereme@buttercup $ sudo brctl show docker0
bridge name     bridge id               STP enabled     interfaces
docker0         8000.024278a263b6       no              veth8c9981e
We can use ethtool(8) to get veth8c9981e’s peer_ifindex, the other end of the tunnel, so to speak.
jereme@buttercup $ sudo ethtool -S veth8c9981e
NIC statistics:
peer_ifindex: 75
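If ethtool isn't handy, sysfs exposes the same relationship: for veth devices, iflink should hold the ifindex of the peer interface (worth verifying on your kernel):
cat /sys/class/net/veth8c9981e/iflink                # host side: the peer's index (75 in our case)
docker exec test_ct cat /sys/class/net/eth0/iflink   # container side: 76, the host veth's index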
Looking through the list of interfaces on our Docker host, we don’t see an interface with index 75…
jereme@buttercup $ ip -o link list | cut -d : -f 1,2
1: lo
2: eth0
3: wlan0
50: docker0
74: tun0
76: veth8c9981e@if75
…but if we examine the interfaces in our test_ct container, we find our associated peer interface.
jereme@buttercup $ docker exec test_ct ip -o link list | cut -d : -f 1,2
1: lo
75: eth0@if76
Docker has handled the work of moving the peer interface into our container’s network namespace. In so doing, our container is now connected to our Docker network’s designated bridge interface.
Containers are processes running with isolated namespaces and other resource partitioning, like control groups. This is a simplification, but not untrue.
Accordingly, this nicely buttoned-up docker(1) invocation…
jereme@buttercup $ docker exec test_ct ip link show dev eth0
75: eth0@if76: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0
…is equivalent to using nsenter(1), from the util-linux package, to run ip in the net and mount namespaces of our container’s process.
jereme@buttercup $ ct_pid=$(docker inspect test_ct | jq .[].State.Pid)
jereme@buttercup $ ps $ct_pid
PID TTY STAT TIME COMMAND
7581 ? Ss 0:00 nc -lkp 80
jereme@buttercup $ sudo nsenter --net --mount --target $ct_pid ip link show dev eth0
75: eth0@if76: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0
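Another handy trick, since Docker doesn't register its network namespaces where ip(8) looks by default, is to expose the container's netns under /var/run/netns so that ip netns can operate on it; a rough sketch, reusing $ct_pid from above:
sudo mkdir -p /var/run/netns
sudo ln -sf /proc/$ct_pid/ns/net /var/run/netns/test_ct   # alias the container's netns for iproute2
sudo ip netns exec test_ct ip link show dev eth0          # same view as the nsenter invocation above
sudo rm /var/run/netns/test_ct                            # remove the alias when done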
Manually Adding a Second veth Interface to a Container
You can do this manually if you want to explore a bit deeper.
Here we create a new veth pair: ve_A and ve_B, which are allocated indices 90 and 89 respectively. The interface names do not include the @{peer} part; that's just a helpful detail provided by ip(8). You won't see this in /sys/class/net.
jereme@buttercup $ sudo ip link add ve_A type veth peer name ve_B
jereme@buttercup $ ip -o link list | cut -d : -f 1,2
1: lo
2: eth0
3: wlan0
50: docker0
74: tun0
76: veth8c9981e@if75
89: ve_B@ve_A
90: ve_A@ve_B
We then move ve_B into test_ct’s network namespace.
jereme@buttercup $ sudo ip link set ve_B netns $(docker inspect test_ct | jq .[].State.Pid)
The other half of our veth pair, ve_A, remains in our namespace. Notice how ip now displays ve_A@ve_B as ve_A@if89 (89 being the peer interface’s index).
jereme@buttercup $ ip -o link list | cut -d : -f 1,2
1: lo
2: eth0
3: wlan0
50: docker0
74: tun0
76: veth8c9981e@if75
90: ve_A@if89
Looking back into our container, we see the newly added interface.
jereme@buttercup $ docker exec test_ct ip -o link list | cut -d : -f 1,2
1: lo
75: eth0@if76
89: ve_B@if90
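To finish wiring it up at layer 2, the host half can be enslaved to docker0 just like the Docker-managed veths, and both ends brought up; addressing is a topic for the next post, so this sketch stops there (reusing $ct_pid from earlier, and noting that the final command tears down both halves):
sudo ip link set ve_A master docker0                      # attach the host half to the docker0 bridge
sudo ip link set ve_A up
sudo nsenter --net --target $ct_pid ip link set ve_B up   # bring up the container half from the host
sudo brctl show docker0                                   # ve_A is now listed alongside veth8c9981e
sudo ip link del ve_A                                     # deleting one end of a veth pair removes the other, too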
Regarding Interface Indices
From what we’ve seen so far it looks like interface indices are global to the kernel. After all, we created a veth pair and moved interface 89 into an existing container and observed that it retained the index 89. However, I believe there is actually one interface index set per network namespace, and we actually have two distinct 89’s, so to speak.
I’m glad the index value is retained as we move interfaces between namespaces, but I believe the index spaces are actually distinct. As an example, you can see each namespace has an index 1 for its loopback interface, lo.
jereme@buttercup $ ip link list dev lo
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
jereme@buttercup $ docker exec test_ct ip link list dev lo
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
I suppose it’s possible that there’s some special handling for lo, but I don’t believe that’s the case.
Consider dev_new_index from net/core/dev.c:
/**
 * dev_new_index - allocate an ifindex
 * @net: the applicable net namespace
 *
 * Returns a suitable unique value for a new device interface
 * number. The caller must hold the rtnl semaphore or the
 * dev_base_lock to be sure it remains unique.
 */
static int dev_new_index(struct net *net)
{
        int ifindex = net->ifindex;

        for (;;) {
                if (++ifindex <= 0)
                        ifindex = 1;
                if (!__dev_get_by_index(net, ifindex))
                        return net->ifindex = ifindex;
        }
}
I get the impression, from lots of mailing list posts, that in the early days of namespaces interface indices were still global, but I don’t think this is the case anymore. Corrections here (as always) are most welcome.
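One more way to convince yourself, without involving Docker at all, is to create a scratch network namespace and watch it hand out its own indices starting from 1; a small sketch using a throwaway namespace name, and assuming the dummy module is available:
sudo ip netns add scratch_ns
sudo ip netns exec scratch_ns ip -o link list             # only lo, at index 1, independent of the host's lo
sudo ip netns exec scratch_ns ip link add d0 type dummy   # add a device; it gets the next free index in this netns
sudo ip netns exec scratch_ns ip -o link list
sudo ip netns del scratch_ns                              # clean up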
Conclusion
So there you have it: some of the fundamentals of how interfaces are organized and bridged into networks to interconnect containers. In the next post in this series, we’ll move up the stack a bit and look at addressing and routing of traffic between containers and beyond the Docker host itself.
Cover photo by Samuel Sianipar