aboutgitcodebugslistschat
path: root/README.md
blob: b3fff6da34ac05577a862cb29614c1d209333b21 (plain) (blame)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
# passt: Plug A Simple Socket Transport

_passt_ implements a translation layer between a Layer-2 network interface (tap)
and native Layer-4 sockets (TCP, UDP, ICMP/ICMPv6 echo) on a host. It doesn't
require any capabilities or privileges, and it can be used as a simple
replacement for Slirp.

<img src="/builds/passt_overview.png" usemap="#image-map" class="bright" style="z-index: 20; position: relative;">
<map name="image-map" id="map_overview">
    <area class="map_area" target="_blank" href="https://man7.org/linux/man-pages/man7/tcp.7.html" coords="229,275,246,320,306,294,287,249" shape="poly">
    <area class="map_area" target="_blank" href="https://lwn.net/Articles/420799/" coords="230,201,243,246,297,232,289,186" shape="poly">
    <area class="map_area" target="_blank" href="https://man7.org/linux/man-pages/man7/udp.7.html" coords="234,129,236,175,297,169,293,126" shape="poly">
    <area class="map_area" target="_blank" href="https://en.wiktionary.org/wiki/passen#German" coords="387,516,841,440,847,476,393,553" shape="poly">
    <area class="map_area" target="_blank" href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/ipv4/udp.c" coords="398,123,520,157" shape="rect">
    <area class="map_area" target="_blank" href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/ipv4/ping.c" coords="397,164,517,197" shape="rect">
    <area class="map_area" target="_blank" href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/ipv4/tcp.c" coords="398,203,516,237" shape="rect">
    <area class="map_area" target="_blank" href="https://man7.org/linux/man-pages/man7/unix.7.html" coords="569,306,674,359" shape="rect">
    <area class="map_area" target="_blank" href="https://passt.top/passt/tree/udp.c" coords="719,152,740,176,792,134,768,108" shape="poly">
    <area class="map_area" target="_blank" href="https://passt.top/passt/tree/icmp.c" coords="727,206,827,120,854,150,754,238" shape="poly">
    <area class="map_area" target="_blank" href="https://passt.top/passt/tree/tcp.c" coords="730,273,774,326,947,176,902,119" shape="poly">
    <area class="map_area" target="_blank" href="https://passt.top/passt/tree/igmp.c" coords="865,273,912,295" shape="rect">
    <area class="map_area" target="_blank" href="https://passt.top/passt/tree/arp.c" coords="854,300,897,320" shape="rect">
    <area class="map_area" target="_blank" href="https://passt.top/passt/tree/ndp.c" coords="869,325,909,344" shape="rect">
    <area class="map_area" target="_blank" href="https://passt.top/passt/tree/mld.c" coords="924,267,964,289" shape="rect">
    <area class="map_area" target="_blank" href="https://passt.top/passt/tree/dhcpv6.c" coords="918,297,986,317" shape="rect">
    <area class="map_area" target="_blank" href="https://passt.top/passt/tree/dhcp.c" coords="931,328,981,352" shape="rect">
    <area class="map_area" target="_blank" href="https://man7.org/linux/man-pages/man7/udp.7.html" coords="1073,115,1059,154,1120,176,1133,137" shape="poly">
    <area class="map_area" target="_blank" href="https://lwn.net/Articles/420799/" coords="966,113,942,152,1000,175,1017,136" shape="poly">
    <area class="map_area" target="_blank" href="https://man7.org/linux/man-pages/man7/tcp.7.html" coords="1059,175,1039,213,1098,237,1116,197" shape="poly">
    <area class="map_area" target="_blank" href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/ipv4/udp.c" coords="1203,154,1326,189" shape="rect">
    <area class="map_area" target="_blank" href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/ipv4/ping.c" coords="1202,195,1327,228" shape="rect">
    <area class="map_area" target="_blank" href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/ipv4/tcp.c" coords="1204,236,1327,269" shape="rect">
    <area class="map_area" target="_blank" href="https://en.wikipedia.org/wiki/OSI_model#Layer_architecture" coords="1159,52,1325,147" shape="rect">
    <area class="map_area" target="_blank" href="https://man7.org/linux/man-pages/man4/veth.4.html" coords="1119,351,1157,339,1198,340,1236,345,1258,359,1229,377,1176,377,1139,375,1114,365" shape="poly">
    <area class="map_area" target="_blank" href="https://man7.org/linux/man-pages/man4/veth.4.html" coords="1044,471,1090,461,1126,462,1150,464,1176,479,1160,491,1121,500,1081,501,1044,491,1037,483" shape="poly">
    <area class="map_area" target="_blank" href="https://man7.org/linux/man-pages/man7/network_namespaces.7.html" coords="240,379,524,452" shape="rect">
    <area class="map_area" target="_blank" href="https://man7.org/linux/man-pages/man7/netlink.7.html" coords="1119,278,1117,293,1165,304,1169,288" shape="poly">
    <area class="map_area" target="_blank" href="https://passt.top/passt/tree/passt.c#n195" coords="989,294,1040,264,1089,280,986,344" shape="poly">
</map>
<canvas id="map_highlight" style="border: 0px; z-index: 10; position: fixed; pointer-events: none"></canvas>
<script>
function canvas_position(el) {
	var rect = el.getBoundingClientRect();
	var canvas = document.getElementById('map_highlight');

	canvas.width = rect.right - rect.left;
	canvas.height = rect.bottom - rect.top;
	canvas.style.left = rect.left + 'px';
	canvas.style.top = rect.top + 'px';
}

function map_hover() {
	var coords = this.coords.split(',');
	var canvas = document.getElementById('map_highlight');
	var ctx = canvas.getContext('2d');

	canvas_position(this);

	ctx.fillStyle = 'rgba(255, 255, 255, .3)';
	ctx.lineWidth = 1.5;
	ctx.strokeStyle = 'rgba(255, 255, 100, 1)';

	ctx.beginPath();
	ctx.setLineDash([15, 15]);
	if (this.shape == "poly") {
		ctx.moveTo(coords[0], coords[1]);
		for (item = 2; item < coords.length - 1; item += 2) {
			ctx.lineTo(coords[item], coords[item + 1])
		}
	} else if (this.shape == "rect") {
		ctx.rect(coords[0], coords[1],
			 coords[2] - coords[0], coords[3] - coords[1]);
	}

	ctx.closePath();
	ctx.stroke();
	ctx.fill();
}

function map_out() {
	var canvas = document.getElementById('map_highlight');
	var ctx = canvas.getContext('2d');

	ctx.clearRect(0, 0, canvas.width, canvas.height);
}

var map_areas = document.getElementsByClassName("map_area");

for (var i = 0; i < map_areas.length; i++) {
	map_areas[i].onmouseover = map_hover;
	map_areas[i].onmouseout = map_out;
}
</script>

- [General idea](#general-idea)
- [Non-functional Targets](#non-functional-targets)
- [Interfaces and Environment](#interfaces-and-environment)
- [Services](#services)
- [Addresses](#addresses)
- [Protocols](#protocols)
- [Ports](#ports)
- [Try it](#try-it)
- [Contribute](#contribute)

## General idea

When container workloads are moved to virtual machines, the network traffic is
typically forwarded by interfaces operating at data link level. Some components
in the containers ecosystem (such as _service meshes_), however, expect
applications to run locally, with visible sockets and processes, for the
purposes of socket redirection, monitoring, port mapping.

To solve this issue, user mode networking as provided e.g. by _Slirp_,
_libslirp_, _slirp4netns_ can be used. However, these existing solutions
implement a full TCP/IP stack, replaying traffic on sockets that are local to
the pod of the service mesh. This creates the illusion of application processes
running on the same host, eventually separated by user namespaces.

While being almost transparent to the service mesh infrastructure, that kind of
solution comes with a number of downsides:

* three different TCP/IP stacks (guest, adaptation and host) need to be
  traversed for every service request. There are no chances to implement
  zero-copy mechanisms, and the amount of context switches increases
  dramatically
* addressing needs to be coordinated to create the pretense of consistent
  addresses and routes between guest and host environments. This typically needs
  a NAT with masquerading, or some form of packet bridging
* the traffic seen by the service mesh and observable externally is a distant
  replica of the packets forwarded to and from the guest environment:
  * TCP congestion windows and network buffering mechanisms in general operate
    differently from what would be naturally expected by the application
  * protocols carrying addressing information might pose additional challenges,
    as the applications don't see the same set of addresses and routes as they
    would if deployed with regular containers

_passt_ implements a thinner layer between guest and host, that only implements
what's strictly needed to pretend processes are running locally. A further, full
TCP/IP stack is not necessarily needed. Some sort of TCP adaptation is needed,
however, as this layer runs without the `CAP_NET_RAW` capability: we can't
create raw IP sockets on the pod, and therefore need to map packets at Layer-2
to Layer-4 sockets offered by the host kernel.

The problem and this approach are illustrated in more detail, with diagrams,
[here](https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Networking.md).

## Non-functional Targets

Security and maintainability goals:

* no dynamic memory allocation
* ~2 000 LoC target
* no external dependencies

## Interfaces and Environment

_passt_ exchanges packets with _qemu_ via UNIX domain socket, using the `socket`
back-end in qemu. Currently, qemu can only connect to a listening process via
TCP. Two temporary solutions are available:

* a [patch](https://passt.top/passt/tree/qemu) for qemu
* a wrapper, [qrap](https://passt.top/passt/tree/qrap.c), that connects to a
  UNIX domain socket and starts qemu, which can now use the file descriptor
  that's already opened

This approach, compared to using a _tap_ device, doesn't require any security
capabilities, as we don't need to create any interface.

## Services

_passt_ provides some minimalistic implementations of networking services that
can't practically run on the host:

* [ARP proxy](https://passt.top/passt/tree/arp.c), that resolves the address of
  the host (which is used as gateway) to the original MAC address of the host
* [DHCP server](https://passt.top/passt/tree/dhcp.c), a simple implementation
  handing out one single IPv4 address to the guest, namely, the same address as
  the first one configured for the upstream host interface, and passing the
  nameservers configured on the host
* [NDP proxy](https://passt.top/passt/tree/ndp.c), which can also assign prefix
  and nameserver using SLAAC
* [DHCPv6 server](https://passt.top/passt/tree/dhcpv6.c): a simple
  implementation handing out one single IPv6 address to the guest, namely, the
  the same address as the first one configured for the upstream host interface,
  and passing the first nameserver configured on the host

## Addresses

For IPv4, the guest is assigned, via DHCP, the same address as the upstream
interface of the host, and the same default gateway as the default gateway of
the host. Addresses are translated in case the guest is seen using a different
address from the assigned one.

For IPv6, the guest is assigned, via SLAAC, the same prefix as the upstream
interface of the host, the same default route as the default route of the
host, and, if a DHCPv6 client is running on the guest, also the same address as
the upstream address of the host. This means that, with a DHCPv6 client on the
guest, addresses don't need to be translated. Should the client use a different
address, the destination address is translated for packets going to the guest.

For UDP and TCP, for both IPv4 and IPv6, packets addressed to a loopback address
are forwarded to the guest with their source address changed to the address of
the gateway or first hop of the default route. This mapping is reversed as the
guest replies to those packets (on the same TCP connection, or using destination
port and address that were used as source for UDP).

## Protocols

_passt_ supports TCP, UDP and ICMP/ICMPv6 echo (requests and replies). More
details about the TCP implementation are available
[here](https://passt.top/passt/tree/tcp.c), and for the UDP
implementation [here](https://passt.top/passt/tree/udp.c).

An IGMP proxy is currently work in progress.

## Ports

To avoid the need for explicit port mapping configuration, _passt_ binds to all
unbound non-ephemeral (0-49152) TCP and UDP ports. Binding to low ports (0-1023)
will fail without additional capabilities, and ports already bound (service
proxies, etc.) will also not be used.

UDP ephemeral ports are bound dynamically, as the guest uses them.

Service proxies and other services running in the container need to be started
before _passt_ starts.

## Try it

* build from source:

        git clone https://passt.top/passt
        cd passt
        make

  * to make _passt_ not fork into background when it starts, and to get verbose
    debug information, build with:

            CFLAGS="-DDEBUG" make

* a static build for x86_64 as of the latest commit is also available for
  convenience [here](https://passt.top/builds/static/). These binaries are
  simply built with:

        CFLAGS="-static" make

* run the demo script, that creates a network namespace called `passt`, sets up
  sets up a _veth_ pair and and addresses, together with NAT for IPv4 and NDP
  proxying for IPv6, then starts _passt_ in the network namespace:

        doc/demo.sh

* from the same network namespace, start qemu. At the moment, qemu doesn't
  support UNIX domain sockets for the `socket` back-end. Two alternatives:

  * use the _qrap_ wrapper, which maps a tap socket descriptor to _passt_'s
    UNIX domain socket, for example:

            ip netns exec passt ./qrap 5 qemu-system-x86_64 ... -net socket,fd=5 -net nic,model=virtio ...

  * or patch qemu with [this patch](https://passt.top/passt/tree/qemu/0001-net-Allow-also-UNIX-domain-sockets-to-be-used-as-net.patch)
    and start it like this:

            qemu-system-x86_64 ... -net socket,connect=/tmp/passt.socket -net nic,model=virtio

* alternatively, you can use libvirt, with [this patch](https://passt.top/passt/tree/libvirt/0001-conf-Introduce-support-for-UNIX-domain-socket-as-qem.patch),
  to start qemu (with the patch mentioned above), with this kind of network
  interface configuration:

        <interface type='client'>
          <mac address='52:54:00:02:6b:60'/>
          <source path='/tmp/passt.socket'/>
          <model type='virtio'/>
          <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
        </interface>

* and that's it, you should now have TCP connections, UDP, and ICMP/ICMPv6
  echo working from/to the guest for IPv4 and IPv6

* to connect to a service on the VM, just connect to the same port directly
  with the address of the network namespace. For example, to ssh to the guest,
  from the main namespace on the host:

        ssh 192.0.2.2

## Contribute

Send patches and issue reports to [sbrivio@redhat.com](mailto:sbrivio@redhat.com).