  Branch: refs/heads/stable1-proposed
  Home:   https://github.com/kronosnet/kronosnet
  Commit: 4df82e5fd847423b164f4fba70e20fd0026639ce
      https://github.com/kronosnet/kronosnet/commit/4df82e5fd847423b164f4fba70e20f...
  Author: Fabio M. Di Nitto fdinitto@redhat.com
  Date:   2019-09-09 (Mon, 09 Sep 2019)
  Changed paths:
    M libknet/threads_heartbeat.c
    M libknet/threads_rx.c
  Log Message:
  -----------
  [links] stabilize latency calculation when nodes are not responsive
The following scenario is more of a corner case than the norm, but this change handles it better:

1) 2-node cluster (corosync), node A and node B
2) kill -stop $(pidof corosync) on node A
3) node B will continue to send ping packets to node A
4) node A is accumulating those ping packets in the kernel network socket
5) wait some seconds and unpause node A
6) node A will start processing the ping packets in the queue and send pong replies to node B
7) node B will see an extreme increase in latency due to those "obsoleted" ping/pong packets
8) node B, as latency increases, will take longer and longer to notice that node A is down, due to the pong_timeout adjustment for latency (required for the initial cluster spike).
the solution:
1) Use average latency, rather than latency_max, to calculate pong_timeout_adj. Average latency will go down again over time, while latency_max is never reset.
2) The RX thread will filter out all pong packets that have a higher latency than the currently configured pong_timeout. This barrier should have been in place all along.
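The two points above can be sketched in isolation as follows. This is a minimal illustration, not the code from this commit: the struct, its field names, both function names, and the "timeout plus average latency" formula are assumptions made for the example and do not reflect libknet's internal data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-link state; names are illustrative, not libknet's. */
struct link_state {
	uint64_t pong_timeout_us; /* configured pong timeout */
	uint64_t latency_avg_us;  /* running mean of accepted samples */
	uint64_t latency_max_us;  /* highest accepted sample, never reset */
	uint64_t samples;         /* number of accepted samples */
};

/*
 * RX-side barrier (point 2): a pong whose measured latency exceeds the
 * configured pong_timeout is treated as stale and does not touch the
 * statistics. Returns true if the sample was accepted.
 */
static bool link_record_pong(struct link_state *link, uint64_t latency_us)
{
	assert(link != NULL);

	if (latency_us > link->pong_timeout_us) {
		return false; /* stale pong, e.g. queued while the peer was paused */
	}

	link->samples++;
	/* cumulative mean over all accepted samples */
	link->latency_avg_us =
		(link->latency_avg_us * (link->samples - 1) + latency_us) /
		link->samples;
	if (latency_us > link->latency_max_us) {
		link->latency_max_us = latency_us;
	}
	return true;
}

/*
 * Adjusted timeout (point 1): derived from the average, which can decrease
 * again, instead of the maximum, which cannot. The exact formula here is
 * an assumption for illustration only.
 */
static uint64_t link_pong_timeout_adj(const struct link_state *link)
{
	assert(link != NULL);
	return link->pong_timeout_us + link->latency_avg_us;
}
```

With a 5000 us timeout in this sketch, samples of 1000 us and 3000 us are accepted, while a pong that sat in the socket queue for several seconds is dropped and leaves the average untouched. Later low samples pull the average, and with it the adjusted timeout, back down, whereas latency_max_us stays at its peak.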
This solution reduces the latency spike on node B to a reasonable level, and the value will stabilize over time: as more latency samples are collected, the average latency decreases again.
Please be aware that, with this change, using a pong_timeout smaller than the actual link latency will simply cause the link to be marked down.
Signed-off-by: Fabio M. Di Nitto fdinitto@redhat.com