On 12/6/2017 5:54 AM, Fabio M. Di Nitto wrote:
Hey Chrissie,
2) there is a spurious link down event on MTU change. I can reproduce
this one without crypto and with the ping_data test (I set it to a 10 second
interval to avoid waiting minutes).
I figured this one out and it's not very pretty.
The workaround is super simple:
diff --git a/libknet/threads_heartbeat.c b/libknet/threads_heartbeat.c
index ffe2f99..0e0eea0 100644
--- a/libknet/threads_heartbeat.c
+++ b/libknet/threads_heartbeat.c
@@ -155,8 +155,8 @@ static void _adjust_pong_timeouts(knet_handle_t knet_h)
 	struct knet_link *dst_link;
 	int link_idx;
-	if (pthread_rwlock_wrlock(&knet_h->global_rwlock) != 0) {
-		log_debug(knet_h, KNET_SUB_HEARTBEAT, "Unable to get write lock");
+	if (pthread_rwlock_trywrlock(&knet_h->global_rwlock) != 0) {
+		log_debug(knet_h, KNET_SUB_HEARTBEAT, "_adjust_pong_timeouts: Unable to get write lock");
 		return;
 	}
but the root cause is the PMTUd thread holding a read lock for a very long
time when we hit the pthread_cond_timedwait and the response never
comes back from the other node (the reason is irrelevant).
That same read lock can block many other operations, either internal (for
example the dstcache handler) or external (any API call that requires a
write lock), for an unpredictable amount of time.
I don't have a full solution yet. The only way I can think this could
work is to have the PMTUd use a different lock set and be notified of
any configuration change from the main lock.
So in theory, on each run, the PMTUd would cache link/host information
with a normal read lock, release the lock, get its own internal lock,
and on any relevant config changes, be notified that the cache is
invalid and restart (or something along those lines...).
The cache could be as simple as a timestamp of when we started the
PMTUd process, compared against what's recorded in memory by the
host/link config; we surely don't need a copy of the whole thing.
Any ideas or suggestions are welcome.
Cheers
Fabio