[Koha-bugs] [Bug 27078] Starman hanging in 3-node Koha cluster when 1 node goes offline.

bugzilla-daemon at bugs.koha-community.org bugzilla-daemon at bugs.koha-community.org
Tue Nov 24 05:14:41 CET 2020


https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=27078

--- Comment #7 from David Cook <dcook at prosentient.com.au> ---
(In reply to Christian McDonald from comment #6)
> Well Koha doesn't really know it is part of a cluster, does it? It still
> talks to a single MariaDB instance (at localhost, just like a standalone
> instance), and the same can be applied to ES (again a single localhost:9200
> ES node could be configured on each, just like a standalone deployment). 
> 
> Apache isn't aware it's part of a cluster. Plack/Starman aren't aware
> either. Really Koha is still "standalone" from the perspective of Koha
> itself, it just so happens that the DB, file-system and index are
> distributed.
> 

Right, Koha isn't aware of the clusters, but it relies on components that are
clustered, so if those components experience latency, then Koha will experience
latency.

> So what I mean about taking nodes offline is that, when all nodes are online
> and all services running, everything is very performant. As I would expect.
> However, when one node goes offline, the Koha application itself becomes
> very slow to respond to page loads...lots of spinning browser tabs waiting
> for a response...but it will eventually respond. My question is, why? Again,
> Koha as an application isn't aware that it is part of a cluster, it doesn't
> know that its database, file system and index are replicated under-the-hood.
> 

And that's why Jonathan and I were saying that you should profile the
application to see where it's hanging. 

It might be trying to do database I/O, but the database might not be responding
because it's trying to synchronize with the missing DB node. I'm not familiar
with Galera, but multi-master replication sounds like it would need synchronous
consistency, which could be slow; and if it were trying to synchronize (until it
times out) with a node that isn't there, that could be a source of latency.
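
If it does turn out to be Galera, its wsrep status variables should tell you
whether the surviving nodes still form a Primary Component and are synced. A
minimal sketch of how you might interpret the rows returned by
`SHOW STATUS LIKE 'wsrep_%'` (the wording of the verdicts is just illustrative,
not Galera's own):

```python
def galera_health(wsrep_status):
    """Classify a node from the rows of `SHOW STATUS LIKE 'wsrep_%'`.

    wsrep_status maps variable names to string values, e.g.
    {"wsrep_cluster_status": "Primary", "wsrep_cluster_size": "2", ...}
    """
    # A node outside the Primary Component has lost quorum and refuses writes.
    if wsrep_status.get("wsrep_cluster_status") != "Primary":
        return "non-Primary: node has lost quorum and will refuse queries"
    # A Primary node that isn't Synced (e.g. acting as a state-transfer
    # donor) can still serve requests, but with degraded latency.
    if wsrep_status.get("wsrep_local_state_comment") != "Synced":
        return "Primary but not Synced: expect degraded performance"
    return "healthy, cluster size " + wsrep_status.get("wsrep_cluster_size", "?")
```

With three nodes and one powered off, you'd expect wsrep_cluster_size to drop
to 2 while wsrep_cluster_status stays Primary; if it doesn't, that's your
answer.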

> Here's what's weird. Like I said, when all nodes and their services are
> online, everything is fine. However, say if I "systemctl stop" Maria,
> elastic, Apache, and memcached on a single node (say on Node A), everything
> still is fine when connecting to either of the remaining nodes (Nodes B and
> C). However, if I then power down Node A (remember, all its koha-related
> services had been stopped) or pull node A's network connection, node B and C
> become very slow to serve page requests. Again, this isn't because of some
> performance degradation of Maria (3 nodes can withstand 1 node offline) or
> Elastic (again, one node offline is okay). 
> 

When you say that you "power down Node A", what kind of power down is that? Is
it graceful or forced? Pulling out node A's network connection is definitely a
forced outage, so I could see that having an impact on a cluster. 

If you're sure about Galera's and Elastic's fault tolerance, I'd say look at
Memcached. 
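
One quick way to test that theory is to talk to each memcached node directly
over its plain-text protocol: a node that answers `stats` promptly is fine,
while one that hangs until the timeout is exactly the kind of thing that would
stall every Plack worker touching it. A rough stdlib-only sketch (the host and
port are whatever your koha-conf.xml points at):

```python
import socket

def memcached_stats(host, port=11211, timeout=2.0):
    """Fetch the output of memcached's `stats` command as a dict.

    A reply that only arrives after `timeout` (rather than a quick
    connection refusal) suggests the client is still waiting on a
    dead peer somewhere in the chain.
    """
    with socket.create_connection((host, port), timeout=timeout) as s:
        s.sendall(b"stats\r\n")
        buf = b""
        # The stats listing is a series of "STAT <name> <value>" lines
        # terminated by a bare "END" line.
        while not buf.endswith(b"END\r\n"):
            chunk = s.recv(4096)
            if not chunk:
                break
            buf += chunk
    stats = {}
    for line in buf.decode().splitlines():
        if line.startswith("STAT "):
            _, key, value = line.split(" ", 2)
            stats[key] = value
    return stats
```

Running it against each node in turn, and timing it, should show quickly
whether memcached is where the stall lives.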

Even if the systems are fault tolerant, I imagine there will be some latency
due to cluster re-balancing and comms timeouts. But I haven't run these
particular clusters. I've just had issues with latency on other clustered
systems when the remaining systems are trying to determine that the dead
members are actually dead. 
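
That would also fit your stopped-services-versus-powered-off observation: a
host that is up but has the service stopped refuses the TCP connection
immediately, while a powered-off (or unplugged) host silently drops the
packets, so every client blocks for its full connect timeout. A small
illustration (192.0.2.1 is the TEST-NET documentation range standing in for
the dead node; depending on your routing, the OS may fail fast with "network
unreachable" instead of timing out):

```python
import socket
import time

# Find a local port that is guaranteed to be closed: bind to an
# ephemeral port, note its number, then release it.
probe = socket.socket()
probe.bind(("127.0.0.1", 0))
closed_port = probe.getsockname()[1]
probe.close()

# Case 1: host up, service stopped. The kernel answers the SYN with a
# RST, so the client fails in well under a second.
start = time.monotonic()
try:
    socket.create_connection(("127.0.0.1", closed_port), timeout=5)
except ConnectionRefusedError:
    pass
print(f"service stopped: failed in {time.monotonic() - start:.3f}s")

# Case 2: host powered off or unplugged. On a real LAN the SYNs just
# vanish, so the client retries until its timeout expires -- here the
# full 3 seconds, for every worker, on every request.
start = time.monotonic()
try:
    socket.create_connection(("192.0.2.1", 80), timeout=3)
except OSError:
    pass
print(f"host gone: gave up after {time.monotonic() - start:.3f}s")
```

If each of MariaDB, Elastic, and memcached waits on the dead peer in turn,
those timeouts could compound into exactly the sluggish-but-eventually-loading
pages you're describing.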

> What can I do to monitor performance, or poke into Plack/Starman? The reason
> why I have a hunch Plack/Starman are involved here is because when I disable
> Plack on all nodes, the loss of a single node doesn't impact the performance
> of the remaining nodes... granted, with Plack disabled they are all noticeably
> slower, which is expected.

I really don't think it's going to be Plack/Starman. Everything suggests it's
one of the distributed systems. However, the easiest way to find out is by
profiling the Koha application. 

Now I haven't done this in a while... but you'll want to look at Devel::NYTProf,
and you might find some clues about configuration at
https://wiki.koha-community.org/wiki/Plack.

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are watching all bug changes.


More information about the Koha-bugs mailing list