<feed xmlns='http://www.w3.org/2005/Atom'>
<title>passt/tcp_conn.h, branch 2026_01_17.81c97f6</title>
<subtitle>Plug A Simple Socket Transport</subtitle>
<link rel='alternate' type='text/html' href='https://passt.top/passt/'/>
<entry>
<title>tcp: Adaptive interval based on RTT for socket-side acknowledgement checks</title>
<updated>2025-12-08T08:15:36+00:00</updated>
<author>
<name>Stefano Brivio</name>
<email>sbrivio@redhat.com</email>
</author>
<published>2025-12-03T19:04:21+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=000601ba86da0d876fc91e0813a1e752540666f1'/>
<id>000601ba86da0d876fc91e0813a1e752540666f1</id>
<content type='text'>
A fixed 10 ms ACK_INTERVAL timer value served us relatively well until
the previous change, because we would generally cause retransmissions
for non-local outbound transfers with relatively high (&gt; 100 Mbps)
bandwidth and non-local but low (&lt; 5 ms) RTT.

Now that retransmissions are less frequent, we don't have a proper
trigger to check for acknowledged bytes on the socket, and will
generally block the sender for a significant amount of time while
we could acknowledge more data, instead.

Store the RTT reported by the kernel using an approximation (exponent),
to keep flow storage size within two (typical) cachelines. Check for
socket updates when half of this time elapses: it should be a good
indication of the one-way delay we're interested in (peer to us).

Representable values are between 100 us and 3.2768 s, and any value
outside this range is clamped to these bounds. This choice appears
to be a good trade-off between additional overhead and throughput.
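
As a rough illustration of the approximation described above, here is
a minimal sketch. It assumes a 4-bit exponent over a 100 us base
(100 us * 2^15 = 3.2768 s, matching the stated bounds) and rounds up
to the next representable value. The names and the rounding direction
are assumptions for illustration, not the actual implementation.

```python
RTT_BASE_US = 100   # lower bound stated in the commit message
RTT_EXP_MAX = 15    # assumption: 100 us * 2**15 = 3.2768 s upper bound


def rtt_to_exp(rtt_us):
    """Smallest exponent e with RTT_BASE_US * 2**e >= rtt_us, clamped."""
    exp = 0
    while RTT_EXP_MAX > exp and rtt_us > RTT_BASE_US * 2 ** exp:
        exp += 1
    return exp


def exp_to_rtt_us(exp):
    """Approximate RTT in microseconds represented by an exponent."""
    return RTT_BASE_US * 2 ** exp


def ack_check_interval_us(exp):
    """Half of the approximated RTT, as an estimate of one-way delay."""
    return exp_to_rtt_us(exp) // 2
```

With these assumptions, any RTT of 100 us or less maps to exponent 0,
and any RTT above 3.2768 s maps to exponent 15.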

This mechanism partially overlaps with the "low RTT" destinations,
which we use to infer that a socket is connected to an endpoint on
the same machine (while possibly in a different namespace) if the
RTT is reported as 10 us or less.

This change doesn't, however, conflict with it: we are reading
TCP_INFO parameters for local connections anyway, so we can always
store the RTT approximation opportunistically.

Then, if the RTT is "low", we don't really need a timer to
acknowledge data as we'll always acknowledge everything to the
sender right away. However, we have limited space in the array where
we store addresses of local destinations, so the low RTT property of a
connection might toggle frequently. Because of this, it's actually
helpful to always have the RTT approximation stored.

This could probably benefit from a future rework, though, introducing
a more integrated approach between these two mechanisms.

Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
A fixed 10 ms ACK_INTERVAL timer value served us relatively well until
the previous change, because we would generally cause retransmissions
for non-local outbound transfers with relatively high (&gt; 100 Mbps)
bandwidth and non-local but low (&lt; 5 ms) RTT.

Now that retransmissions are less frequent, we don't have a proper
trigger to check for acknowledged bytes on the socket, and will
generally block the sender for a significant amount of time while
we could acknowledge more data, instead.

Store the RTT reported by the kernel using an approximation (exponent),
to keep flow storage size within two (typical) cachelines. Check for
socket updates when half of this time elapses: it should be a good
indication of the one-way delay we're interested in (peer to us).

Representable values are between 100 us and 3.2768 s, and any value
outside this range is clamped to these bounds. This choice appears
to be a good trade-off between additional overhead and throughput.

This mechanism partially overlaps with the "low RTT" destinations,
which we use to infer that a socket is connected to an endpoint on
the same machine (while possibly in a different namespace) if the
RTT is reported as 10 us or less.

This change doesn't, however, conflict with it: we are reading
TCP_INFO parameters for local connections anyway, so we can always
store the RTT approximation opportunistically.

Then, if the RTT is "low", we don't really need a timer to
acknowledge data as we'll always acknowledge everything to the
sender right away. However, we have limited space in the array where
we store addresses of local destinations, so the low RTT property of a
connection might toggle frequently. Because of this, it's actually
helpful to always have the RTT approximation stored.

This could probably benefit from a future rework, though, introducing
a more integrated approach between these two mechanisms.

Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tcp: Clamp the retry timeout</title>
<updated>2025-12-02T22:05:08+00:00</updated>
<author>
<name>Yumei Huang</name>
<email>yuhuang@redhat.com</email>
</author>
<published>2025-12-02T03:00:07+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=1a834879a2f7ab138c12cd65c610f71eece8a939'/>
<id>1a834879a2f7ab138c12cd65c610f71eece8a939</id>
<content type='text'>
Clamp the TCP retry timeout as the Linux kernel does. If a retry occurs
during the handshake and the RTO is below 3 seconds, re-initialise
it to 3 seconds for data retransmissions according to RFC 6298.
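
A minimal sketch of the two rules follows. The 1 s initial value and
the 3 s re-initialisation come from RFC 6298. The 120 s upper bound
is an assumption based on the default Linux TCP_RTO_MAX, and the
function names are hypothetical.

```python
RTO_INIT_S = 1              # initial RTO suggested by RFC 6298
RTO_MAX_S = 120             # assumption: default Linux TCP_RTO_MAX
RTO_AFTER_SYN_RETRY_S = 3   # RFC 6298, section 5.7


def backoff_rto(rto_init, retries, rto_max=RTO_MAX_S):
    """Exponential backoff: double per retry, clamped to rto_max."""
    return min(rto_init * 2 ** retries, rto_max)


def data_rto(rto, syn_retried):
    """RTO to use once data transfer begins.

    If the SYN had to be retried, an RTO below 3 s is raised to 3 s.
    """
    if syn_retried:
        return max(rto, RTO_AFTER_SYN_RETRY_S)
    return rto
```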

Suggested-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
Signed-off-by: Yumei Huang &lt;yuhuang@redhat.com&gt;
Reviewed-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Clamp the TCP retry timeout as the Linux kernel does. If a retry occurs
during the handshake and the RTO is below 3 seconds, re-initialise
it to 3 seconds for data retransmissions according to RFC 6298.

Suggested-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
Signed-off-by: Yumei Huang &lt;yuhuang@redhat.com&gt;
Reviewed-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tcp: Rename "retrans" to "retries"</title>
<updated>2025-12-02T22:05:08+00:00</updated>
<author>
<name>Yumei Huang</name>
<email>yuhuang@redhat.com</email>
</author>
<published>2025-12-02T03:00:03+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=785214c6a781a3a8814c7e066899d855004d0d77'/>
<id>785214c6a781a3a8814c7e066899d855004d0d77</id>
<content type='text'>
Rename "retrans" to "retries" so it can be used for SYN retries.

Signed-off-by: Yumei Huang &lt;yuhuang@redhat.com&gt;
Reviewed-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Rename "retrans" to "retries" so it can be used for SYN retries.

Signed-off-by: Yumei Huang &lt;yuhuang@redhat.com&gt;
Reviewed-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tcp, flow: Replace per-connection in_epoll flag with an epollid in flow_common</title>
<updated>2025-10-30T14:32:50+00:00</updated>
<author>
<name>Laurent Vivier</name>
<email>lvivier@redhat.com</email>
</author>
<published>2025-10-21T21:01:13+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=dd5302dd7bf518aa2c50a9819ee06ea2d6fd0061'/>
<id>dd5302dd7bf518aa2c50a9819ee06ea2d6fd0061</id>
<content type='text'>
The in_epoll boolean flag in tcp_tap_conn and tcp_splice_conn only tracked
whether a connection was registered with epoll, not which epoll instance.
This limited flexibility for future multi-epoll support.

Replace the boolean with an epollid field in flow_common that identifies
which epoll instance the flow is registered with.
Use FLOW_EPOLLID_INVALID to indicate when a flow is not registered with
any epoll instance. An epoll_id_to_fd[] mapping table translates
epoll ids to their corresponding epoll file descriptors.

Add helper functions:
- flow_in_epoll() to check if a flow is registered with epoll
- flow_epollfd() to retrieve the epoll fd for a flow's thread
- flow_epollid_register() to register an epoll fd with an epollid
- flow_epollid_set() to set the epollid of a flow
- flow_epollid_clear() to reset the epoll id of a flow
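
The relationship between these helpers can be sketched as follows.
This is an illustrative model only: the table size, the sentinel
value, and the Flow class are assumptions, and the real helpers
operate on C structures.

```python
FLOW_EPOLLID_INVALID = -1   # assumption: sentinel value for illustration
EPOLLID_MAX = 8             # assumption: hypothetical table size

# Maps epoll ids to epoll file descriptors; -1 means "not registered"
epoll_id_to_fd = [-1] * EPOLLID_MAX


class Flow:
    """Stand-in for flow_common, holding only the epollid field."""
    def __init__(self):
        self.epollid = FLOW_EPOLLID_INVALID


def flow_epollid_register(epollid, epollfd):
    """Associate an epoll id with an epoll file descriptor."""
    epoll_id_to_fd[epollid] = epollfd


def flow_epollid_set(flow, epollid):
    """Record which epoll instance the flow is registered with."""
    flow.epollid = epollid


def flow_epollid_clear(flow):
    """Mark the flow as not registered with any epoll instance."""
    flow.epollid = FLOW_EPOLLID_INVALID


def flow_in_epoll(flow):
    """Check whether the flow is registered with an epoll instance."""
    return flow.epollid != FLOW_EPOLLID_INVALID


def flow_epollfd(flow):
    """Look up the epoll fd for a registered flow."""
    if not flow_in_epoll(flow):
        raise ValueError("flow is not registered with epoll")
    return epoll_id_to_fd[flow.epollid]
```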

This change also simplifies tcp_timer_ctl() and conn_flag_do() by removing
the need to pass the context 'c', since the epoll fd is now directly
accessible from the flow structure via flow_epollfd().

Add a defensive check at the beginning of tcp_flow_repair_queue() to
avoid a false positive with "make clang-tidy":
  error: The 1st argument to 'send' is &lt; 0 but should be &gt;= 0
   3230 |                 ssize_t rc = send(conn-&gt;sock, p, MIN(len, chunk), 0);

Signed-off-by: Laurent Vivier &lt;lvivier@redhat.com&gt;
Reviewed-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The in_epoll boolean flag in tcp_tap_conn and tcp_splice_conn only tracked
whether a connection was registered with epoll, not which epoll instance.
This limited flexibility for future multi-epoll support.

Replace the boolean with an epollid field in flow_common that identifies
which epoll instance the flow is registered with.
Use FLOW_EPOLLID_INVALID to indicate when a flow is not registered with
any epoll instance. An epoll_id_to_fd[] mapping table translates
epoll ids to their corresponding epoll file descriptors.

Add helper functions:
- flow_in_epoll() to check if a flow is registered with epoll
- flow_epollfd() to retrieve the epoll fd for a flow's thread
- flow_epollid_register() to register an epoll fd with an epollid
- flow_epollid_set() to set the epollid of a flow
- flow_epollid_clear() to reset the epoll id of a flow

This change also simplifies tcp_timer_ctl() and conn_flag_do() by removing
the need to pass the context 'c', since the epoll fd is now directly
accessible from the flow structure via flow_epollfd().

Add a defensive check at the beginning of tcp_flow_repair_queue() to
avoid a false positive with "make clang-tidy":
  error: The 1st argument to 'send' is &lt; 0 but should be &gt;= 0
   3230 |                 ssize_t rc = send(conn-&gt;sock, p, MIN(len, chunk), 0);

Signed-off-by: Laurent Vivier &lt;lvivier@redhat.com&gt;
Reviewed-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>treewide: By default, don't quit source after migration, keep sockets open</title>
<updated>2025-07-29T15:57:01+00:00</updated>
<author>
<name>Stefano Brivio</name>
<email>sbrivio@redhat.com</email>
</author>
<published>2025-07-17T08:38:17+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=a8782865c342eb2682cca292d5bf92b567344351'/>
<id>a8782865c342eb2682cca292d5bf92b567344351</id>
<content type='text'>
We are hitting an issue in the KubeVirt integration where some data is
still sent to the source instance even after migration is complete. As
we exit, the kernel closes our sockets and resets connections. The
resulting RST segments are sent to peers, effectively terminating
connections that were meanwhile migrated.

At the moment, this is not done intentionally, but in the future
KubeVirt might enable OVN-Kubernetes features where source and
destination nodes are explicitly getting mirrored traffic for a while,
in order to decrease migration downtime.

By default, don't quit after migration is completed on the source: the
previous behaviour can be enabled with the new, but deprecated,
--migrate-exit option. After migration (as source), the -1 / --one-off
option has no effect.

Also, by default, keep migrated TCP sockets open (in repair mode) as
long as we're running, and ignore events on any epoll descriptor
representing data channels. The previous behaviour can be enabled with
the new, equally deprecated, --migrate-no-linger option.

By keeping sockets open, and not exiting, we prevent the kernel
running on the source node from sending out RST segments if further data
reaches us.

Reported-by: Nir Dothan &lt;ndothan@redhat.com&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
We are hitting an issue in the KubeVirt integration where some data is
still sent to the source instance even after migration is complete. As
we exit, the kernel closes our sockets and resets connections. The
resulting RST segments are sent to peers, effectively terminating
connections that were meanwhile migrated.

At the moment, this is not done intentionally, but in the future
KubeVirt might enable OVN-Kubernetes features where source and
destination nodes are explicitly getting mirrored traffic for a while,
in order to decrease migration downtime.

By default, don't quit after migration is completed on the source: the
previous behaviour can be enabled with the new, but deprecated,
--migrate-exit option. After migration (as source), the -1 / --one-off
option has no effect.

Also, by default, keep migrated TCP sockets open (in repair mode) as
long as we're running, and ignore events on any epoll descriptor
representing data channels. The previous behaviour can be enabled with
the new, equally deprecated, --migrate-no-linger option.

By keeping sockets open, and not exiting, we prevent the kernel
running on the source node from sending out RST segments if further data
reaches us.

Reported-by: Nir Dothan &lt;ndothan@redhat.com&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>migrate, tcp: Migrate RFC 7323 timestamp</title>
<updated>2025-03-19T14:27:27+00:00</updated>
<author>
<name>David Gibson</name>
<email>david@gibson.dropbear.id.au</email>
</author>
<published>2025-03-19T05:14:22+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=cfb3740568ab291d7be00e457658c45ce9367ed5'/>
<id>cfb3740568ab291d7be00e457658c45ce9367ed5</id>
<content type='text'>
Currently our migration of the state of TCP sockets omits the RFC 7323
timestamp.  In some circumstances that can result in data sent from the
target machine not being received, because it is discarded on the peer due
to PAWS checking.

Add code to dump and restore the timestamp across migration.
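
To illustrate why a missing timestamp can cause drops, here is a
simplified sketch of the PAWS comparison from RFC 7323: a segment is
rejected if its timestamp value is older than the most recent one the
receiver recorded, compared modulo 2^32. The function name is
hypothetical, and the exceptions in the RFC (such as long idle
periods and RST segments) are left out.

```python
TS_MOD = 2 ** 32    # TCP timestamps are 32-bit values
TS_HALF = 2 ** 31   # window used for modular "older than" comparison


def paws_reject(seg_tsval, ts_recent):
    """True if seg_tsval is older than ts_recent in modular arithmetic."""
    diff = (ts_recent - seg_tsval) % TS_MOD
    return diff > 0 and TS_HALF > diff
```

If timestamps sent after migration restart from a lower value than
the one the peer last recorded, this check fails on the peer and the
data is discarded.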

Link: https://bugs.passt.top/show_bug.cgi?id=115
Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
[sbrivio: Minor style fixes]
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Currently our migration of the state of TCP sockets omits the RFC 7323
timestamp.  In some circumstances that can result in data sent from the
target machine not being received, because it is discarded on the peer due
to PAWS checking.

Add code to dump and restore the timestamp across migration.

Link: https://bugs.passt.top/show_bug.cgi?id=115
Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
[sbrivio: Minor style fixes]
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tcp: Don't pass both flow pointer and flow index</title>
<updated>2025-02-18T12:33:10+00:00</updated>
<author>
<name>David Gibson</name>
<email>david@gibson.dropbear.id.au</email>
</author>
<published>2025-02-18T08:59:23+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=ba0823f8a0e60d4fc0cb21179aaf64940509156a'/>
<id>ba0823f8a0e60d4fc0cb21179aaf64940509156a</id>
<content type='text'>
tcp_flow_migrate_source_ext() is passed both the index of the flow it
operates on and the pointer to the connection structure.  However, the
former is trivially derived from the latter.  Simplify the interface.

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
tcp_flow_migrate_source_ext() is passed both the index of the flow it
operates on and the pointer to the connection structure.  However, the
former is trivially derived from the latter.  Simplify the interface.

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tcp: Remove spurious prototype for tcp_flow_migrate_shrink_window</title>
<updated>2025-02-18T12:33:08+00:00</updated>
<author>
<name>David Gibson</name>
<email>david@gibson.dropbear.id.au</email>
</author>
<published>2025-02-18T08:59:22+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=854bc7b1a3b4e5443ea071e49b3a68198dbb88b3'/>
<id>854bc7b1a3b4e5443ea071e49b3a68198dbb88b3</id>
<content type='text'>
This function existed in drafts of the migration code, but not the final
version.  Get rid of the prototype.

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This function existed in drafts of the migration code, but not the final
version.  Get rid of the prototype.

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tcp: More type safety for tcp_flow_migrate_target_ext()</title>
<updated>2025-02-18T12:32:52+00:00</updated>
<author>
<name>David Gibson</name>
<email>david@gibson.dropbear.id.au</email>
</author>
<published>2025-02-18T08:59:21+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=e56c8038fc23a349ff4a457c6b447f927ac1a56e'/>
<id>e56c8038fc23a349ff4a457c6b447f927ac1a56e</id>
<content type='text'>
tcp_flow_migrate_target_ext() takes a raw union flow *, although it is TCP
specific, and requires a FLOW_TYPE_TCP entry.  Our usual convention is that
such functions should take a struct tcp_tap_conn * instead.  Convert it to
do so.

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
tcp_flow_migrate_target_ext() takes a raw union flow *, although it is TCP
specific, and requires a FLOW_TYPE_TCP entry.  Our usual convention is that
such functions should take a struct tcp_tap_conn * instead.  Convert it to
do so.

Signed-off-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>migrate: Migrate TCP flows</title>
<updated>2025-02-17T07:29:03+00:00</updated>
<author>
<name>Stefano Brivio</name>
<email>sbrivio@redhat.com</email>
</author>
<published>2025-02-13T12:14:13+00:00</published>
<link rel='alternate' type='text/html' href='https://passt.top/passt/commit/?id=89ecf2fd40adab549bdf25cdb68996f56d67b13e'/>
<id>89ecf2fd40adab549bdf25cdb68996f56d67b13e</id>
<content type='text'>
This implements flow preparation on the source, transfer of data with
a format roughly inspired by struct tcp_tap_conn, plus a specific
structure for parameters that don't fit in the flow table, and flow
insertion on the target, with all the appropriate window options,
window scaling, MSS, etc.

Contents of pending queues are transferred as well.

The target side is rather convoluted because we first need to create
sockets and switch them to repair mode, before we can apply options
that are *not* stored in the flow table. This also means that, if
we're testing this on the same machine, in the same namespace, we need
to close the listening socket on the source before we can start moving
data.

Further, we need to connect() the socket on the target before we can
restore data queues, but we can't do that (again, on the same machine)
as long as the matching source socket is open, which implies an
arbitrary limit on queue sizes we can transfer, because we can only
dump pending queues on the source as long as the socket is open, of
course.

Co-authored-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Reviewed-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Tested-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This implements flow preparation on the source, transfer of data with
a format roughly inspired by struct tcp_tap_conn, plus a specific
structure for parameters that don't fit in the flow table, and flow
insertion on the target, with all the appropriate window options,
window scaling, MSS, etc.

Contents of pending queues are transferred as well.

The target side is rather convoluted because we first need to create
sockets and switch them to repair mode, before we can apply options
that are *not* stored in the flow table. This also means that, if
we're testing this on the same machine, in the same namespace, we need
to close the listening socket on the source before we can start moving
data.

Further, we need to connect() the socket on the target before we can
restore data queues, but we can't do that (again, on the same machine)
as long as the matching source socket is open, which implies an
arbitrary limit on queue sizes we can transfer, because we can only
dump pending queues on the source as long as the socket is open, of
course.

Co-authored-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Reviewed-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Tested-by: David Gibson &lt;david@gibson.dropbear.id.au&gt;
Signed-off-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
</pre>
</div>
</content>
</entry>
</feed>
