Commit message | Author | Age | Files | Lines | |
---|---|---|---|---|---|
* | checksum: Stream load into four registers at a time with > 128 bytes | Stefano Brivio | 2021-10-15 | 1 | -3/+47 |
| | ...and further interleave register usage. This brings the csum() overhead reported by perf(1) for 30 seconds of 64KiB TCP IPv4 frames, host to guest, from 7.2% to 5.8%. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> | | | | |
* | checksum: Interleave lo/hi sums while folding into 128-bit sums, drop TODO | Stefano Brivio | 2021-10-15 | 1 | -3/+3 |
| | I left a TODO and never checked -- this actually seems to slightly improve CPIs on AMD Naples (two 128-bit FMA units glued together). Signed-off-by: Stefano Brivio <sbrivio@redhat.com> | | | | |
* | checksum: Introduce AVX2 implementation, unify helpers | Stefano Brivio | 2021-07-26 | 1 | -0/+292 |
| | Provide an AVX2-based function using compiler intrinsics for TCP/IP-style checksums. The load/unpack/add idea and implementation are largely based on code from BESS (the Berkeley Extensible Software Switch), licensed as 3-Clause BSD, with a number of modifications to further decrease pipeline stalls and to minimise cache pollution. This considerably speeds up data paths from sockets to tap interfaces, decreasing the overhead of checksum computation, with 16-64KiB packet buffers, from approximately 11% to 7%. The rest is just syscalls at this point. While at it, provide convenience targets in the Makefile for avx2, avx2_debug, and debug builds -- these simply add target-specific CFLAGS to the build. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> | | | | |
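
The load/widen/add idea mentioned in these commit messages can be illustrated without intrinsics. What follows is a minimal, portable sketch and not the code from the commits: it assumes plain C with four independent 64-bit accumulators standing in for vector registers, and the names `csum_sketch()` and `csum_fold64()` are hypothetical. It loads 32-bit words, adds them into wider accumulators so that carries are not lost, and folds to 16 bits only once at the end. The result is in host byte order and can be stored directly into a checksum field.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fold a 64-bit sum down to 16 bits, adding carries back in
 * (ones' complement, "end-around carry" arithmetic).
 */
static uint16_t csum_fold64(uint64_t sum)
{
	sum = (sum & 0xffffffff) + (sum >> 32);	/* at most 33 bits */
	sum = (sum & 0xffffffff) + (sum >> 32);	/* at most 32 bits */
	sum = (sum & 0xffff) + (sum >> 16);	/* at most 17 bits */
	sum = (sum & 0xffff) + (sum >> 16);	/* at most 16 bits */
	return (uint16_t)sum;
}

/* Hypothetical helper: TCP/IP-style checksum of buf, host byte order.
 *
 * Four independent accumulators break the dependency chain between
 * consecutive additions, loosely mirroring the use of several vector
 * registers. Widening 32-bit loads to 64 bits defers all carry
 * handling to the final fold.
 */
uint16_t csum_sketch(const void *buf, size_t len)
{
	const uint8_t *p = buf;
	uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0, sum;
	uint32_t w[4];
	uint16_t h;

	/* Main loop: 16 bytes per iteration, one 32-bit word per sum */
	while (len >= sizeof(w)) {
		memcpy(w, p, sizeof(w));	/* alignment-safe load */
		s0 += w[0];
		s1 += w[1];
		s2 += w[2];
		s3 += w[3];
		p += sizeof(w);
		len -= sizeof(w);
	}

	/* No overflow here for packet-sized buffers (well below 16 GiB) */
	sum = s0 + s1 + s2 + s3;

	/* Tail: remaining 16-bit words */
	while (len >= sizeof(h)) {
		memcpy(&h, p, sizeof(h));
		sum += h;
		p += sizeof(h);
		len -= sizeof(h);
	}

	/* Odd trailing byte: pad with a zero byte after it */
	if (len) {
		h = 0;
		memcpy(&h, p, 1);
		sum += h;
	}

	return (uint16_t)~csum_fold64(sum);
}
```

Summing 32-bit words instead of 16-bit ones gives the same result after folding, because 2^16 is congruent to 1 modulo 0xffff. A receiver can verify a buffer that already contains its checksum by calling `csum_sketch()` over it and checking that the result is zero.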