[Koha-bugs] [Bug 27078] Starman hanging in 3-node Koha cluster when 1 node goes offline.

Wed Nov 25 00:16:31 CET 2020

https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=27078

--- Comment #11 from David Cook <dcook at prosentient.com.au> ---
Thanks for providing more information, Christian.

I'm not familiar with your hostnames but I would've done something like
"tcpdump dst 10.10.100.51" on the remaining nodes.

It sounds like Elasticsearch and MariaDB are handling the missing node well.

What about Memcached? I suspect that's the last distributed service that your
surviving nodes could still be trying to query. Is it referenced by IP address
or by hostname in your koha-conf.xml?

Koha uses Cache::Memcached::Fast::Safe, which is built on
Cache::Memcached::Fast:
https://metacpan.org/pod/release/RAZ/Cache-Memcached-Fast-0.26/lib/Cache/Memcached/Fast.pm
Koha mostly uses the defaults. Take a look at "max_failures". With the default
(0), the Memcached client doesn't track failures at all: even if it gets a
timeout from a server, it will keep trying that server on later requests.
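
As a rough illustration of why that setting matters, here is a toy Python model
of the behaviour the Cache::Memcached::Fast documentation describes for
"max_failures" and "failure_timeout". It is only a sketch, not the library's
code: the ServerHealth class, the simulate() helper and the 1 ms cost of a
skipped server are made up for the example, and it assumes every request would
otherwise contact the dead server.

```python
class ServerHealth:
    """Toy model of per-server failure tracking (hypothetical, not the library's code)."""

    def __init__(self, max_failures=0, failure_timeout=10.0):
        self.max_failures = max_failures        # 0 means "always retry"
        self.failure_timeout = failure_timeout  # seconds
        self.failures = []                      # timestamps of recent failures
        self.dead_until = 0.0                   # server is skipped until this time

    def should_try(self, now):
        if self.max_failures == 0:
            return True
        return now >= self.dead_until

    def record_failure(self, now):
        if self.max_failures == 0:
            return
        # Only failures inside the last failure_timeout seconds count.
        cutoff = now - self.failure_timeout
        self.failures = [t for t in self.failures if t > cutoff]
        self.failures.append(now)
        if len(self.failures) >= self.max_failures:
            self.dead_until = now + self.failure_timeout
            self.failures.clear()


def simulate(max_failures, requests=100, connect_timeout=0.25):
    """Return the seconds spent waiting on a dead server over sequential requests."""
    health = ServerHealth(max_failures=max_failures)
    now = 0.0
    waited = 0.0
    for _ in range(requests):
        if health.should_try(now):
            # The connect attempt blocks for the full timeout, then fails.
            now += connect_timeout
            waited += connect_timeout
            health.record_failure(now)
        else:
            # Dead server is skipped; assume the request costs about 1 ms.
            now += 0.001
    return waited


print(simulate(max_failures=0))  # every request pays the connect timeout
print(simulate(max_failures=3))  # only the first three requests pay it
```

In this model, 100 requests cost 25 seconds of blocking with max_failures at 0,
but only 0.75 seconds with max_failures at 3, because the dead server is
skipped for the next failure_timeout window.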

The default connect_timeout looks like 0.25 seconds, but the documentation
warns that the overall connect can take longer than that. Koha does lean hard
on Memcached, so I wouldn't be surprised if this is the issue. You could try
the tcpdump again, this time targeting the Memcached traffic specifically
(port 11211 by default).

(In reply to Christian McDonald from comment #10)
> Another really weird thing, is that if I bring up a virgin replacement node,
> that ONLY has debian 10 installed and has the original node's static ip
> address (so in my case 10.10.100.51), the other two nodes immediately start
> performing as expected again.
> 

This doesn't surprise me at all. As I've been saying, it sounds like Koha (or
rather a subcomponent of Koha) is trying to contact the missing node. Since
it's missing, it is probably blocking while it waits for a response and
eventually times out. In your latest case, when you have a node available at
10.10.100.51, it is probably responding quite quickly by refusing the
connection, and that's why the nodes are performing well.
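
The difference between a refused connection and one that has to time out can
be measured directly. The following is a minimal Python sketch, not anything
Koha does itself; probe() is a hypothetical helper, and the 0.25 second
timeout simply mirrors the connect_timeout mentioned above.

```python
import socket
import time


def probe(host, port, timeout=0.25):
    """Try one TCP connect and return (outcome, seconds elapsed)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            outcome = "connected"
    except ConnectionRefusedError:
        # The host is up but nothing listens on the port: fails almost at once.
        outcome = "refused"
    except (socket.timeout, TimeoutError):
        # No answer at all (host down or packets dropped): blocks for `timeout`.
        outcome = "timeout"
    except OSError:
        # Other failures, such as "no route to host".
        outcome = "unreachable"
    return outcome, time.monotonic() - start
```

Running something like probe("10.10.100.51", 11211) from Node B or Node C,
once with the node offline and once with the bare replacement node up, should
show whether the connect blocks until the timeout or is refused quickly. The
port 11211 is the Memcached default and is an assumption about this setup.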

At this point, I think that you've narrowed it down to the network connection. 

I would suggest playing around with your Memcached config. For example, remove
Node A's Memcached from the server list on Node B and Node C, then take down
Node A and see if that makes a difference.

I recall now that you said "However, when one node goes offline, the two
remaining nodes become really sluggish." Can you clarify what you mean by
"really sluggish"? How much slower are they?

My gut is telling me it's the Memcached setup causing the problems, but it
could be something else. Still, I hope that you pursue this angle.

You may also try adding "max_failures => 3" or something like that to
Koha::Cache::_initialize_memcached to see if that helps. 

I doubt many people are using Memcached clusters, so it wouldn't surprise me if
Koha's Memcached client config isn't optimal.
