Server opens too many cluster connections (#922)
Andrzej Wójcik (Tigase) opened 7 years ago

The server opens a lot of cluster connections and their number increases over time. It is confirmed via netstat that these connections are actually opened between the cluster nodes.

Andrzej Wójcik (Tigase) commented 7 years ago

I could not find anything in the logs or in the code which could lead to this situation. However, when I reviewed old statistics data, it became clear that this happens from time to time: there may be 2 days without any new ports being opened, and then 20 new ports may start being opened on each cluster node every day. As I've checked in the itemAdded() method of ClusterConnectionManager, the server will try to open 5 new ports for each node which is added. If we assume that this is called for each node, then the method would be called 4 times on each node, which would give us 2 calls for each cluster node to which we have to connect.
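To illustrate the effect, here is a minimal sketch of such an itemAdded() handler (CONNECTIONS_PER_NODE, openConnection() and the port number are hypothetical illustrations, not the actual ClusterConnectionManager code):

```java
import java.io.IOException;
import java.net.Socket;

// Hypothetical sketch, not the real ClusterConnectionManager: every call
// opens a fixed number of sockets, and nothing closes the sockets opened
// by a previous call for the same node, so each spurious re-add of an
// item grows the total by CONNECTIONS_PER_NODE.
class ClusterConnector {

    static final int CONNECTIONS_PER_NODE = 5;
    static final int CLUSTER_PORT = 5277; // assumed cluster port

    void itemAdded(String hostname) throws IOException {
        for (int i = 0; i < CONNECTIONS_PER_NODE; i++) {
            openConnection(hostname);
        }
    }

    void openConnection(String hostname) throws IOException {
        Socket socket = new Socket(hostname, CLUSTER_PORT);
        // ... register the socket and start the cluster handshake ...
    }
}
```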

I assume that itemRemoved() is called as well, but it does nothing. I wonder if we should stop all connections to the node for which itemRemoved() is called; this would allow us to control/limit the number of open connections between nodes. Alternatively, we could close a newly opened connection whenever we already have more open connections between the nodes than allowed.
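The second alternative could be sketched like this (hypothetical bookkeeping, just to show the idea): keep a per-node list of open connections and close any socket beyond the allowed maximum:

```java
import java.io.IOException;
import java.net.Socket;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class ConnectionLimiter {

    static final int MAX_CONNS_PER_NODE = 5;

    // Open sockets per remote node hostname.
    final Map<String, List<Socket>> connsPerNode = new ConcurrentHashMap<>();

    // Called whenever a new cluster connection (incoming or outgoing) is
    // established; drops the connection if the node is already at the limit.
    void onConnectionEstablished(String node, Socket sock) throws IOException {
        List<Socket> conns = connsPerNode.computeIfAbsent(
                node, n -> Collections.synchronizedList(new ArrayList<>()));
        if (conns.size() >= MAX_CONNS_PER_NODE) {
            sock.close(); // over the limit: refuse the extra connection
            return;
        }
        conns.add(sock);
    }
}
```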

%kobit %wojtek What do you think?

wojciech.kapcia@tigase.net commented 7 years ago

Andrzej Wójcik wrote:

I assume that itemRemoved() is called as well, but it does nothing. I wonder if we should stop all connections to the node for which itemRemoved() is called; this would allow us to control/limit the number of open connections between nodes. Alternatively, we could close a newly opened connection whenever we already have more open connections between the nodes than allowed.

I was thinking about what could lead to this, i.e. what causes itemRemoved() and itemAdded() to be called without nodes actually disconnecting - were the items removed from the table itself (which would trigger a call to itemRemoved() and then a re-add afterwards), or could this be an effect of removing stale nodes (i.e. tigase.cluster.repo.ClConConfigRepository#itemLoaded would remove the item and re-add it)?
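For reference, that stale-node path could be sketched roughly like this (hypothetical constants and stubs, not the verbatim itemLoaded implementation):

```java
// Sketch of the stale-item check: a row older than the timeout is removed
// (firing itemRemoved() on listeners) while a fresh row loaded later
// re-adds it (firing itemAdded()), producing the remove/re-add pair
// without any real disconnect.
class StaleItemCheck {

    static final long AUTORELOAD_INTERVAL_MS = 15_000L;
    static final long STALE_TIMEOUT_MS = 5 * AUTORELOAD_INTERVAL_MS;

    void itemLoaded(ClusterRepoItem item) {
        long age = System.currentTimeMillis() - item.getLastUpdate();
        if (age > STALE_TIMEOUT_MS) {
            removeItem(item.getHostname()); // listeners see itemRemoved()
        } else {
            addItem(item);                  // new items fire itemAdded()
        }
    }

    // Stubs standing in for the real repository operations.
    void removeItem(String hostname) { /* ... */ }
    void addItem(ClusterRepoItem item) { /* ... */ }

    interface ClusterRepoItem {
        String getHostname();
        long getLastUpdate();
    }
}
```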

Closing all connections on itemRemoved seems more logical IMHO, but we would have to make sure that the logic correctly detects the node being removed.

Artur Hefczyc commented 7 years ago

It's not a super critical problem, although it might be for a large number of cluster nodes I guess.

Artur Hefczyc commented 7 years ago

One more comment. We should be careful about stopping all connections to a specific cluster node as this results in removing the cluster node from the cluster. Then reconnecting the node forces resynchronization of the node with the rest of the cluster.

Andrzej Wójcik (Tigase) commented 7 years ago

I've found a particular case in which this could happen.

We used the following workflow for refreshing data:

  1. Deprecate items found in memory (using timeout 5*15s)
  2. Load items from the database and add them if not older than 5*15s.
  3. Create outgoing and accept incoming connections

Then after 15 seconds:

  1. Deprecate items found in memory (using timeout 5*15s)
  2. Load items from the database and add them if not older than 5*15s.
  3. Create outgoing and accept incoming connections

Here is what I suppose happened:

  1. It is possible that there was a valid entry for node2 loaded during the first run by node1, but it was about to expire (i.e. 4.5*15s since the last update). I assume that at this time node2 was down (not stopped properly, i.e. killed or crashed).
  2. Then Tigase on node2 was restarted and updated its entry in the tig_cluster_nodes table, which allowed node2 and node1 to establish connections between those two nodes.
  3. Tigase at node1 started a refresh of cluster node items, deprecated the item for node2 as it was too old, and removed this entry.
  4. Tigase at node1 reloaded items from the database and re-added the item for node2, as it was now valid. (This triggered the opening of connections from node1 to node2, but not from node2 to node1.)
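In code, the original reload cycle could be sketched as follows (hypothetical helper names), which shows where the remove/re-add pair comes from:

```java
// Sketch of the ORIGINAL reload cycle. An item that was ~4.5*15s old at
// one run crosses the 5*15s threshold before the next run: step 1 removes
// it (itemRemoved, which did nothing), step 2 re-adds it from the freshly
// updated database row (itemAdded), and step 3 opens another set of sockets.
class ReloadCycleOld {

    static final long UPDATE_INTERVAL_MS = 15_000L; // 15s refresh period

    void reload() {
        deprecateOlderThan(5 * UPDATE_INTERVAL_MS); // step 1: may remove a node...
        loadFromDatabase(5 * UPDATE_INTERVAL_MS);   // step 2: ...that the fresh
                                                    // DB row immediately re-adds
        connectNewItems();                          // step 3: opens more sockets
    }

    // Stubs standing in for the real repository/connection operations.
    void deprecateOlderThan(long timeoutMs) { /* ... */ }
    void loadFromDatabase(long maxAgeMs) { /* ... */ }
    void connectNewItems() { /* ... */ }
}
```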

I've modified our workflow for reloading items, so now it looks like this:

  1. Deprecate items found in memory (using timeout 6*15s; note the 6 here instead of 5)
  2. Load items from the database and add them if not older than 5*15s.
  3. Create outgoing and accept incoming connection
  4. Deprecate items found in memory (using timeout 5*15s)

So it is no longer possible (or almost impossible) for an item to be loaded and then expire within one refresh interval, which would cause the item to be removed and then re-added.
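Sketched with the same hypothetical helpers as above, the modified cycle looks like this:

```java
// Sketch of the FIXED reload cycle: the pre-deprecation in step 1 uses a
// longer timeout, so anything removed there is already too old to be
// re-added by step 2 within the same cycle.
class ReloadCycleFixed {

    static final long UPDATE_INTERVAL_MS = 15_000L; // 15s refresh period

    void reload() {
        // 1. Pre-deprecate with the LONGER timeout (6*15s): an item removed
        //    here is older than 90s, and step 2 only adds rows newer than
        //    75s, so a remove/re-add pair cannot happen within one cycle.
        deprecateOlderThan(6 * UPDATE_INTERVAL_MS);
        // 2. Load items from the database, adding only rows newer than 5*15s.
        loadFromDatabase(5 * UPDATE_INTERVAL_MS);
        // 3. Create outgoing and accept incoming connections for new items.
        connectNewItems();
        // 4. Deprecate with the normal timeout (5*15s); items this old were
        //    necessarily not refreshed by step 2.
        deprecateOlderThan(5 * UPDATE_INTERVAL_MS);
    }

    // Stubs standing in for the real repository/connection operations.
    void deprecateOlderThan(long timeoutMs) { /* ... */ }
    void loadFromDatabase(long maxAgeMs) { /* ... */ }
    void connectNewItems() { /* ... */ }
}
```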

Moreover, I've added code so that if an item is removed from the repository, all connections to the node represented by this item are stopped. I suppose this should fix this issue for good.
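A sketch of what that change might look like (hypothetical names, reusing the per-node bookkeeping idea from the limiter sketch above; the actual implementation is in the Tigase repositories):

```java
import java.io.IOException;
import java.net.Socket;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class NodeDisconnector {

    // Open sockets per remote node hostname.
    final Map<String, List<Socket>> connsPerNode = new ConcurrentHashMap<>();

    // Called when the repository item for a node is removed: close every
    // connection to that node so spurious re-adds start from a clean slate.
    void itemRemoved(String hostname) {
        List<Socket> conns = connsPerNode.remove(hostname);
        if (conns == null) {
            return;
        }
        for (Socket s : conns) {
            try {
                s.close();
            } catch (IOException e) {
                // best effort: keep closing the remaining sockets
            }
        }
    }
}
```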

Wojtek, %kobit Do you have any objections or comments about those changes? (Those changes are already committed to the Tigase code repositories).

Artur Hefczyc commented 7 years ago

Looks good to me.

Type: Bug
Priority: Minor
Assignee:
RedmineID: 6658
Version: tigase-server-8.0.0
Spent time: 30h
Issue Votes: 0
Watchers: 0
Reference: tigase/_server/server-core#922