Fork of github.com/xtaci/kcp-go with faster RS codec and highwayhash for checksums
Loki Verloren b7d5985a4f typo 2 weeks ago
assets tidied up readme a bit to start with 2 weeks ago
.gitignore lint 2 years ago
.travis.yml Bump Go versions and use '.x' to always get latest patch versions 6 months ago
LICENSE lint 2 years ago
README.md typo 2 weeks ago
crypt.go simplify buffer creation in cryptos 8 months ago
crypt_test.go reset timer before entering the crypto benchmark loop 8 months ago
entropy.go lint 4 months ago
fec.go len->cap 2 weeks ago
fec_test.go rename typeFEC->typeParity 1 month ago
kcp.go len->cap 2 weeks ago
kcp_test.go lint 4 months ago
readloop_generic.go conditional build for linux and others 4 weeks ago
readloop_linux.go move batchSize to readloop_linux.go 4 weeks ago
sess.go move batchSize to readloop_linux.go 4 weeks ago
sess_test.go add WriteBuffers function to send a vector of slice in batch 1 month ago
snmp.go add InPkts & OutPkts counters 2 years ago
updater.go fix time comparsion on edge 1 month ago

README.md

kcp9

A fork of KCP using the fastest checksum and RS codecs available

The original version of this library uses the klauspost RS library, but the templexxx version is a lot faster.

The original also uses a number of insecure hash functions for checksums. In this version all of that is removed and HighwayHash is used instead: a hash function suitable for simple checksums, with extremely good cache locality and strong collision resistance.

Aside from eliminating the slow and insecure hash functions and retargeting to the templexxx Reed-Solomon library, nothing else is altered.


TODO: old version readme to be revised later

kcp-go

Introduction

kcp-go is a Production-Grade Reliable-UDP library for golang.

This library intends to provide smooth, resilient, ordered, error-checked and anonymous delivery of streams over UDP packets. It has been battle-tested with the open-source project kcptun. Millions of devices (from low-end MIPS routers to high-end servers) have deployed kcp-go-powered programs in a variety of forms, such as online games, live broadcasting, file synchronization and network acceleration.

Latest Release

Features

  1. Designed for latency-sensitive scenarios.
  2. Cache-friendly and memory-optimized design, offering an extremely high-performance core.
  3. Handles >5K concurrent connections on a single commodity server.
  4. Compatible with net.Conn and net.Listener, a drop-in replacement for net.TCPConn.
  5. FEC (Forward Error Correction) support with Reed-Solomon codes.
  6. Packet-level encryption support with AES, TEA, 3DES, Blowfish, Cast5, Salsa20, etc. in CFB mode, which generates completely anonymous packets.
  7. Only a fixed number of goroutines are created for the entire server application; the cost of context switches between goroutines has been taken into consideration.
  8. Compatible with skywind3000’s C version with various improvements.

Documentation

For complete documentation, see the associated Godoc.

Specification

Frame Format

+-----------------+
| SESSION         |
+-----------------+
| KCP(ARQ)        |
+-----------------+
| FEC(OPTIONAL)   |
+-----------------+
| CRYPTO(OPTIONAL)|
+-----------------+
| UDP(PACKET)     |
+-----------------+
| IP              |
+-----------------+
| LINK            |
+-----------------+
| PHY             |
+-----------------+
(LAYER MODEL OF KCP-GO)

Usage

Client: full demo

kcpconn, err := kcp.DialWithOptions("192.168.0.1:10000", nil, 10, 3)

Server: full demo

lis, err := kcp.ListenWithOptions(":10000", nil, 10, 3)

Benchmark

  Model Name:	MacBook Pro
  Model Identifier:	MacBookPro14,1
  Processor Name:	Intel Core i5
  Processor Speed:	3.1 GHz
  Number of Processors:	1
  Total Number of Cores:	2
  L2 Cache (per Core):	256 KB
  L3 Cache:	4 MB
  Memory:	8 GB
$ go test -v -run=^$ -bench .
beginning tests, encryption:salsa20, fec:10/3
goos: darwin
goarch: amd64
pkg: github.com/xtaci/kcp-go
BenchmarkSM4-4                 	   50000	     32180 ns/op	  93.23 MB/s	       0 B/op	       0 allocs/op
BenchmarkAES128-4              	  500000	      3285 ns/op	 913.21 MB/s	       0 B/op	       0 allocs/op
BenchmarkAES192-4              	  300000	      3623 ns/op	 827.85 MB/s	       0 B/op	       0 allocs/op
BenchmarkAES256-4              	  300000	      3874 ns/op	 774.20 MB/s	       0 B/op	       0 allocs/op
BenchmarkTEA-4                 	  100000	     15384 ns/op	 195.00 MB/s	       0 B/op	       0 allocs/op
BenchmarkXOR-4                 	20000000	        89.9 ns/op	33372.00 MB/s	       0 B/op	       0 allocs/op
BenchmarkBlowfish-4            	   50000	     26927 ns/op	 111.41 MB/s	       0 B/op	       0 allocs/op
BenchmarkNone-4                	30000000	        45.7 ns/op	65597.94 MB/s	       0 B/op	       0 allocs/op
BenchmarkCast5-4               	   50000	     34258 ns/op	  87.57 MB/s	       0 B/op	       0 allocs/op
Benchmark3DES-4                	   10000	    117149 ns/op	  25.61 MB/s	       0 B/op	       0 allocs/op
BenchmarkTwofish-4             	   50000	     33538 ns/op	  89.45 MB/s	       0 B/op	       0 allocs/op
BenchmarkXTEA-4                	   30000	     45666 ns/op	  65.69 MB/s	       0 B/op	       0 allocs/op
BenchmarkSalsa20-4             	  500000	      3308 ns/op	 906.76 MB/s	       0 B/op	       0 allocs/op
BenchmarkCRC32-4               	20000000	        65.2 ns/op	15712.43 MB/s
BenchmarkCsprngSystem-4        	 1000000	      1150 ns/op	  13.91 MB/s
BenchmarkCsprngMD5-4           	10000000	       145 ns/op	 110.26 MB/s
BenchmarkCsprngSHA1-4          	10000000	       158 ns/op	 126.54 MB/s
BenchmarkCsprngNonceMD5-4      	10000000	       153 ns/op	 104.22 MB/s
BenchmarkCsprngNonceAES128-4   	100000000	        19.1 ns/op	 837.81 MB/s
BenchmarkFECDecode-4           	 1000000	      1119 ns/op	1339.61 MB/s	    1606 B/op	       2 allocs/op
BenchmarkFECEncode-4           	 2000000	       832 ns/op	1801.83 MB/s	      17 B/op	       0 allocs/op
BenchmarkFlush-4               	 5000000	       272 ns/op	       0 B/op	       0 allocs/op
BenchmarkEchoSpeed4K-4         	    5000	    259617 ns/op	  15.78 MB/s	    5451 B/op	     149 allocs/op
BenchmarkEchoSpeed64K-4        	    1000	   1706084 ns/op	  38.41 MB/s	   56002 B/op	    1604 allocs/op
BenchmarkEchoSpeed512K-4       	     100	  14345505 ns/op	  36.55 MB/s	  482597 B/op	   13045 allocs/op
BenchmarkEchoSpeed1M-4         	      30	  34859104 ns/op	  30.08 MB/s	 1143773 B/op	   27186 allocs/op
BenchmarkSinkSpeed4K-4         	   50000	     31369 ns/op	 130.57 MB/s	    1566 B/op	      30 allocs/op
BenchmarkSinkSpeed64K-4        	    5000	    329065 ns/op	 199.16 MB/s	   21529 B/op	     453 allocs/op
BenchmarkSinkSpeed256K-4       	     500	   2373354 ns/op	 220.91 MB/s	  166332 B/op	    3554 allocs/op
BenchmarkSinkSpeed1M-4         	     300	   5117927 ns/op	 204.88 MB/s	  310378 B/op	    6988 allocs/op
PASS
ok  	github.com/xtaci/kcp-go	50.349s

Typical Flame Graph

Flame Graph in kcptun

Key Design Considerations

  1. slice vs. container/list

kcp.flush() loops through the send queue to check for retransmissions every 20ms (the interval).

I’ve written a benchmark comparing a sequential loop through a slice and through a container/list here:

https://github.com/xtaci/notes/blob/master/golang/benchmark2/cachemiss_test.go

BenchmarkLoopSlice-4   	2000000000	         0.39 ns/op
BenchmarkLoopList-4    	100000000	        54.6 ns/op

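A list structure incurs heavy cache misses compared to a slice, which has better locality. With 5000 connections, a window size of 32 and a 20ms interval, one round of kcp.flush() across all connections visits 5000 * 32 = 160,000 elements, which at the per-element figures above costs about 62us (0.3% CPU) using a slice and 8.7ms (43.5% CPU) using a list.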

  2. Timing accuracy vs. syscall clock_gettime

Timing is critical to the RTT estimator: inaccurate timing leads to false retransmissions in KCP. However, calling time.Now() costs 42 cycles (10.5ns on a 4GHz CPU, 15.6ns on my 2.7GHz MacBook Pro).

The benchmark for time.Now() lies here:

https://github.com/xtaci/notes/blob/master/golang/benchmark2/syscall_test.go

BenchmarkNow-4         	100000000	        15.6 ns/op

In kcp-go, the current clock time is updated upon return from each kcp.output() call, and a single kcp.flush() operation queries the system time only once. Most of the time, 5000 connections cost 5000 * 15.6ns = 78us (a fixed cost when no packet needs to be sent). For a 10MB/s data transfer with a 1400-byte MTU, kcp.output() is called around 7500 times per second, which costs 117us of time.Now() per second.

  3. Memory management

Primary memory allocations are made from a global buffer pool, xmit.Buf. When kcp-go needs to allocate some bytes, it takes them from that pool, which returns a slice with a fixed capacity of 1500 bytes (mtuLimit). The rx queue, tx queue and fec queue all get their bytes from there and return them to the pool after use, which avoids unnecessary zeroing of bytes. The pool maintains a high watermark for slice objects: in-flight objects from the pool survive periodic garbage collection, while the pool keeps the ability to return memory to the runtime when idle.
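
A pool of this kind can be sketched with sync.Pool from the standard library. The code below is a minimal illustration, not the library's actual code: bufPool, getBuf and putBuf are names chosen for this example, and only the 1500-byte mtuLimit is taken from the description above.

```go
package main

import (
	"fmt"
	"sync"
)

// mtuLimit is the fixed buffer size described in the text.
const mtuLimit = 1500

// bufPool hands out fixed-size byte slices. sync.Pool may drop idle
// objects during garbage collection, which lets memory go back to the
// runtime when the pool is not in use.
var bufPool = sync.Pool{
	New: func() interface{} { return make([]byte, mtuLimit) },
}

// getBuf returns a buffer of length mtuLimit. The contents are not
// zeroed, so callers must overwrite whatever they intend to read.
func getBuf() []byte {
	return bufPool.Get().([]byte)
}

// putBuf returns a buffer to the pool, restoring its full length first.
// Slices that did not come from the pool (too small) are ignored.
func putBuf(b []byte) {
	if cap(b) < mtuLimit {
		return
	}
	bufPool.Put(b[:mtuLimit])
}

func main() {
	b := getBuf()
	n := copy(b, "payload")
	fmt.Println(len(b), cap(b), n)
	putBuf(b)
}
```

Handing the slice back without clearing it is what avoids the zeroing cost mentioned above; the trade-off is that a reused buffer still contains old data.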

Connection Termination

Control messages like SYN/FIN/RST in TCP are not defined in KCP, so you need a keepalive/heartbeat mechanism at the application level. A real-world example is to run a multiplexing protocol over the session, such as smux (which has an embedded keepalive mechanism); see kcptun for an example.

FAQ

Q: I’m handling >5K connections on my server and the CPU utilization is very high.

A: A standalone agent or gate server for running kcp-go is suggested, not only because of CPU utilization but also because it matters for the precision of RTT measurements (timing), which indirectly affects retransmission. Increasing the update interval with SetNoDelay, for example conn.SetNoDelay(1, 40, 1, 1), will dramatically reduce system load at the cost of lower performance.

Who is using this?

  1. https://github.com/xtaci/kcptun -- A Secure Tunnel Based On KCP over UDP.
  2. https://github.com/getlantern/lantern -- Lantern delivers fast access to the open Internet.
  3. https://github.com/smallnest/rpcx -- A RPC service framework based on net/rpc like alibaba Dubbo and weibo Motan.
  4. https://github.com/gonet2/agent -- A gateway for games with stream multiplexing.
  5. https://github.com/syncthing/syncthing -- Open Source Continuous File Synchronization.

Links

  1. https://github.com/xtaci/libkcp -- FEC enhanced KCP session library for iOS/Android in C++
  2. https://github.com/skywind3000/kcp -- A Fast and Reliable ARQ Protocol
  3. https://github.com/klauspost/reedsolomon -- Reed-Solomon Erasure Coding in Go