<feed xmlns='http://www.w3.org/2005/Atom'>
<title>passt/pcap.c, branch 2023_03_21.1ee2f7c</title>
<subtitle>Plug A Simple Socket Transport</subtitle>
<link rel='alternate' type='text/html' href='https://passt.top/passt/'/>
<entry>
<title>udp: Use tap_send_frames()</title>
<updated>2023-01-23T17:55:04+00:00</updated>
<author>
<name>David Gibson</name>
<email>david@gibson.dropbear.id.au</email>
</author>
<published>2023-01-06T00:43:22+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=54502cca7f7a89eadce92bb7a2b6ab61b878a5c6'/>
<id>54502cca7f7a89eadce92bb7a2b6ab61b878a5c6</id>
<content type='text'>
To send frames on the tap interface, the UDP uses a fairly complicated two
level batching.  First multiple frames are gathered into a single "message"
for the qemu stream socket, then multiple messages are send with
sendmmsg().  We now have tap_send_frames() which already deals with sending
a number of frames, including batching and handling partial sends.  Use
that to considerably simplify things.

This does make a couple of behavioural changes:
  * We used to split messages to keep them under 32kiB (except when a
    single frame was longer than that).  The comments claim this is
    needed to stop qemu from closing the connection, but we don't have any
    equivalent logic for TCP.  I wasn't able to reproduce the problem with
    this series, although it was apparently easy to reproduce earlier.

    My suspicion is that there was never an inherent need to keep messages
    small, however with larger messages (and default kernel buffer sizes)
    the chances of needing more than one resend for partial send()s is
    greatly increased.  We used not to correctly handle that case of
    multiple resends, but now we do.

  * Previously when we got a partial send on UDP, we would resend the
    remainder of the entire "message", including multiple frames.  The
    common code now only resends the remainder of a single frame, simply
    dropping any frames which weren't even partially sent.  This is what
    TCP always did and is probably a better idea for UDP too.

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
To send frames on the tap interface, the UDP uses a fairly complicated two
level batching.  First multiple frames are gathered into a single "message"
for the qemu stream socket, then multiple messages are send with
sendmmsg().  We now have tap_send_frames() which already deals with sending
a number of frames, including batching and handling partial sends.  Use
that to considerably simplify things.

This does make a couple of behavioural changes:
  * We used to split messages to keep them under 32kiB (except when a
    single frame was longer than that).  The comments claim this is
    needed to stop qemu from closing the connection, but we don't have any
    equivalent logic for TCP.  I wasn't able to reproduce the problem with
    this series, although it was apparently easy to reproduce earlier.

    My suspicion is that there was never an inherent need to keep messages
    small, however with larger messages (and default kernel buffer sizes)
    the chances of needing more than one resend for partial send()s is
    greatly increased.  We used not to correctly handle that case of
    multiple resends, but now we do.

  * Previously when we got a partial send on UDP, we would resend the
    remainder of the entire "message", including multiple frames.  The
    common code now only resends the remainder of a single frame, simply
    dropping any frames which weren't even partially sent.  This is what
    TCP always did and is probably a better idea for UDP too.

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>pcap: Replace pcapm() with pcap_multiple()</title>
<updated>2023-01-23T17:54:27+00:00</updated>
<author>
<name>David Gibson</name>
<email>david@gibson.dropbear.id.au</email>
</author>
<published>2023-01-06T00:43:06+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=d3089eb0eab412e34bd91f7853e3e72429912edb'/>
<id>d3089eb0eab412e34bd91f7853e3e72429912edb</id>
<content type='text'>
pcapm() captures multiple frames from a msghdr, however the only thing it
cares about in the msghdr is the list of buffers, where it assumes there is
one frame to capture per buffer.  That's what we want for its single caller
but it's not the only obvious choice here (one frame per msghdr would
arguably make more sense in isolation).  In addition pcapm() has logic
that only makes sense in the context of the passt specific path its called
from: it skips the first 4 bytes of each buffer, because those have the
qemu vnet_len rather than the frame proper.

Make this clearer by replacing pcapm() with pcap_multiple() which more
explicitly takes one struct iovec per frame, and parameterizes how much of
each buffer to skip (i.e. the offset of the frame within the buffer).

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
pcapm() captures multiple frames from a msghdr, however the only thing it
cares about in the msghdr is the list of buffers, where it assumes there is
one frame to capture per buffer.  That's what we want for its single caller
but it's not the only obvious choice here (one frame per msghdr would
arguably make more sense in isolation).  In addition pcapm() has logic
that only makes sense in the context of the passt specific path its called
from: it skips the first 4 bytes of each buffer, because those have the
qemu vnet_len rather than the frame proper.

Make this clearer by replacing pcapm() with pcap_multiple() which more
explicitly takes one struct iovec per frame, and parameterizes how much of
each buffer to skip (i.e. the offset of the frame within the buffer).

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>pcap: Introduce pcap_frame() helper</title>
<updated>2023-01-23T17:48:51+00:00</updated>
<author>
<name>David Gibson</name>
<email>david@gibson.dropbear.id.au</email>
</author>
<published>2023-01-06T00:43:05+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=cb3c1ce30760aff10a4952a1c3b85f61a6eeacfd'/>
<id>cb3c1ce30760aff10a4952a1c3b85f61a6eeacfd</id>
<content type='text'>
pcap(), pcapm() and pcapmm() duplicate some code, for the actual writing
to the capture file.  The purpose of pcapm() and pcapmm() not calling
pcap() seems to be to avoid repeatedly calling gettimeofday() and to avoid
printing errors for every packet in a batch if there's a problem.  We can
accomplish that while still sharing code by adding a new helper which
takes the packet timestamp as a parameter.

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
pcap(), pcapm() and pcapmm() duplicate some code, for the actual writing
to the capture file.  The purpose of pcapm() and pcapmm() not calling
pcap() seems to be to avoid repeatedly calling gettimeofday() and to avoid
printing errors for every packet in a batch if there's a problem.  We can
accomplish that while still sharing code by adding a new helper which
takes the packet timestamp as a parameter.

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Move logging functions to a new file, log.c</title>
<updated>2022-10-14T15:38:25+00:00</updated>
<author>
<name>Stefano Brivio</name>
<email>sbrivio@redhat.com</email>
</author>
<published>2022-09-24T07:53:15+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=da152331cf2e8537bc3651e10eb8b72d751721c3'/>
<id>da152331cf2e8537bc3651e10eb8b72d751721c3</id>
<content type='text'>
Logging to file is going to add some further complexity that we don't
want to squeeze into util.c.

Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
Reviewed-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Logging to file is going to add some further complexity that we don't
want to squeeze into util.c.

Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
Reviewed-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>conf: Make the argument to --pcap option mandatory</title>
<updated>2022-08-30T17:17:57+00:00</updated>
<author>
<name>David Gibson</name>
<email>david@gibson.dropbear.id.au</email>
</author>
<published>2022-08-26T04:58:32+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=bf95322fc1ef2d0d7a21cb0e6a9bf01902024859'/>
<id>bf95322fc1ef2d0d7a21cb0e6a9bf01902024859</id>
<content type='text'>
The --pcap or -p option can be used with or without an argument.  If given,
the argument gives the name of the file to save a packet trace to.  If
omitted, we generate a default name in /tmp.

Generating the default name isn't particularly useful though, since making
a suitable name can easily be done by the caller.  Remove this feature.

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The --pcap or -p option can be used with or without an argument.  If given,
the argument gives the name of the file to save a packet trace to.  If
omitted, we generate a default name in /tmp.

Generating the default name isn't particularly useful though, since making
a suitable name can easily be done by the caller.  Remove this feature.

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>treewide: Invalid type in argument to printf format specifier, CWE-686</title>
<updated>2022-04-05T16:47:04+00:00</updated>
<author>
<name>Stefano Brivio</name>
<email>sbrivio@redhat.com</email>
</author>
<published>2022-04-04T21:35:32+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=dbd0a7035c9a7a59f17d1312063ffa7fc3cd6860'/>
<id>dbd0a7035c9a7a59f17d1312063ffa7fc3cd6860</id>
<content type='text'>
Harmless except for two bad debugging prints.

Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Harmless except for two bad debugging prints.

Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>treewide: Fix android-cloexec-* clang-tidy warnings, re-enable checks</title>
<updated>2022-03-29T13:35:38+00:00</updated>
<author>
<name>Stefano Brivio</name>
<email>sbrivio@redhat.com</email>
</author>
<published>2022-03-27T11:10:26+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=62c3edd9575cb7dfef0d86392e4b87a4c14d8dee'/>
<id>62c3edd9575cb7dfef0d86392e4b87a4c14d8dee</id>
<content type='text'>
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>treewide: Mark constant references as const</title>
<updated>2022-03-29T13:35:38+00:00</updated>
<author>
<name>Stefano Brivio</name>
<email>sbrivio@redhat.com</email>
</author>
<published>2022-03-26T06:23:21+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=48582bf47f5ef7a1bf136ca455d182addad08028'/>
<id>48582bf47f5ef7a1bf136ca455d182addad08028</id>
<content type='text'>
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>pcap: Fix mistake in printed string</title>
<updated>2022-03-25T12:21:13+00:00</updated>
<author>
<name>Stefano Brivio</name>
<email>sbrivio@redhat.com</email>
</author>
<published>2022-03-15T00:00:52+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=5c0426981eb97dc9836d0c7b50f6c102b9359ed9'/>
<id>5c0426981eb97dc9836d0c7b50f6c102b9359ed9</id>
<content type='text'>
Packets are saved *to* a file, not *at* it.

Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Packets are saved *to* a file, not *at* it.

Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>passt, pasta: Namespace-based sandboxing, defer seccomp policy application</title>
<updated>2022-02-21T12:41:13+00:00</updated>
<author>
<name>Stefano Brivio</name>
<email>sbrivio@redhat.com</email>
</author>
<published>2022-02-07T20:11:37+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=0515adceaa8f69a1d85ae3c8c550c37dd49b0c47'/>
<id>0515adceaa8f69a1d85ae3c8c550c37dd49b0c47</id>
<content type='text'>
To reach (at least) a conceptually equivalent security level as
implemented by --enable-sandbox in slirp4netns, we need to create a
new mount namespace and pivot_root() into a new (empty) mountpoint, so
that passt and pasta can't access any filesystem resource after
initialisation.

While at it, also detach IPC, PID (only for passt, to prevent
vulnerabilities based on the knowledge of a target PID), and UTS
namespaces.

With this approach, if we apply the seccomp filters right after the
configuration step, the number of allowed syscalls grows further. To
prevent this, defer the application of seccomp policies after the
initialisation phase, before the main loop, that's where we expect bad
things to happen, potentially. This way, we get back to 22 allowed
syscalls for passt and 34 for pasta, on x86_64.

While at it, move #syscalls notes to specific code paths wherever it
conceptually makes sense.

We have to open all the file handles we'll ever need before
sandboxing:

- the packet capture file can only be opened once, drop instance
  numbers from the default path and use the (pre-sandbox) PID instead

- /proc/net/tcp{,v6} and /proc/net/udp{,v6}, for automatic detection
  of bound ports in pasta mode, are now opened only once, before
  sandboxing, and their handles are stored in the execution context

- the UNIX domain socket for passt is also bound only once, before
  sandboxing: to reject clients after the first one, instead of
  closing the listening socket, keep it open, accept and immediately
  discard new connection if we already have a valid one

Clarify the (unchanged) behaviour for --netns-only in the man page.

To actually make passt and pasta processes run in a separate PID
namespace, we need to unshare(CLONE_NEWPID) before forking to
background (if configured to do so). Introduce a small daemon()
implementation, __daemon(), that additionally saves the PID file
before forking. While running in foreground, the process itself can't
move to a new PID namespace (a process can't change the notion of its
own PID): mention that in the man page.

For some reason, fork() in a detached PID namespace causes SIGTERM
and SIGQUIT to be ignored, even if the handler is still reported as
SIG_DFL: add a signal handler that just exits.

We can now drop most of the pasta_child_handler() implementation,
that took care of terminating all processes running in the same
namespace, if pasta started a shell: the shell itself is now the
init process in that namespace, and all children will terminate
once the init process exits.

Issuing 'echo $$' in a detached PID namespace won't return the
actual namespace PID as seen from the init namespace: adapt
demo and test setup scripts to reflect that.

Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
To reach (at least) a conceptually equivalent security level as
implemented by --enable-sandbox in slirp4netns, we need to create a
new mount namespace and pivot_root() into a new (empty) mountpoint, so
that passt and pasta can't access any filesystem resource after
initialisation.

While at it, also detach IPC, PID (only for passt, to prevent
vulnerabilities based on the knowledge of a target PID), and UTS
namespaces.

With this approach, if we apply the seccomp filters right after the
configuration step, the number of allowed syscalls grows further. To
prevent this, defer the application of seccomp policies after the
initialisation phase, before the main loop, that's where we expect bad
things to happen, potentially. This way, we get back to 22 allowed
syscalls for passt and 34 for pasta, on x86_64.

While at it, move #syscalls notes to specific code paths wherever it
conceptually makes sense.

We have to open all the file handles we'll ever need before
sandboxing:

- the packet capture file can only be opened once, drop instance
  numbers from the default path and use the (pre-sandbox) PID instead

- /proc/net/tcp{,v6} and /proc/net/udp{,v6}, for automatic detection
  of bound ports in pasta mode, are now opened only once, before
  sandboxing, and their handles are stored in the execution context

- the UNIX domain socket for passt is also bound only once, before
  sandboxing: to reject clients after the first one, instead of
  closing the listening socket, keep it open, accept and immediately
  discard new connection if we already have a valid one

Clarify the (unchanged) behaviour for --netns-only in the man page.

To actually make passt and pasta processes run in a separate PID
namespace, we need to unshare(CLONE_NEWPID) before forking to
background (if configured to do so). Introduce a small daemon()
implementation, __daemon(), that additionally saves the PID file
before forking. While running in foreground, the process itself can't
move to a new PID namespace (a process can't change the notion of its
own PID): mention that in the man page.

For some reason, fork() in a detached PID namespace causes SIGTERM
and SIGQUIT to be ignored, even if the handler is still reported as
SIG_DFL: add a signal handler that just exits.

We can now drop most of the pasta_child_handler() implementation,
that took care of terminating all processes running in the same
namespace, if pasta started a shell: the shell itself is now the
init process in that namespace, and all children will terminate
once the init process exits.

Issuing 'echo $$' in a detached PID namespace won't return the
actual namespace PID as seen from the init namespace: adapt
demo and test setup scripts to reflect that.

Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</pre>
</div>
</content>
</entry>
</feed>
