Cluster connections improved (#148)
Artur Hefczyc opened 1 decade ago
Due Date
2015-09-29

The following improvements have to be made to clustering connections, to improve the reliability of packet delivery between cluster nodes and to reduce delays.

  1. A dedicated pair of connections for each queue priority, so that packets with higher priority (cluster, system) do not have to wait for lower-priority packets (presences)

  2. Packet delivery confirmation for each cluster connection

Artur Hefczyc commented 10 years ago

Andrzej, I am moving it to the next version, but if you think you can still do it before the end of Jan 2015, please revert it to 7.0.

Andrzej Wójcik (Tigase) commented 9 years ago

I understand the idea of creating separate priority queues, however I have no idea how it would help, as the majority of packets are sent between cluster nodes using separate cluster commands (PacketForwardCmd, etc.), so they will all be in the same priority queue, right?

So how would this change help?

Artur Hefczyc commented 9 years ago

That's the point. You said that the majority of packets are "PacketForwardCmd" and the rest are related to the clustering synchronization traffic. So these would be the first 2 types of traffic we want to separate:

  1. Normal XMPP traffic - messages, presences, IQ

  2. Cluster traffic related to cluster cache updates

If we could get these 2 types of traffic separated onto different connections, that would be a great improvement. My tests and experiments show that lots of performance issues are caused by late cluster cache updates. The problem is especially apparent during mass user disconnects. We have a huge number of presence packets flowing, which delays cluster cache updates. Late cache updates make the system think that some users are still online when they are in fact already offline, and this causes unnecessary presence traffic between cluster nodes.

I think super fast cluster cache updates are critical for Tigase's performance and correctness, so our first step should focus in this direction.

Andrzej Wójcik (Tigase) commented 9 years ago

Then I think there will still be a problem, as presences are forwarded inside PacketForwardCmd, so this change may not fix it.

I suppose this could be addressed by using priority queues for the outgoing queues of packets being sent to remote nodes.

Artur Hefczyc commented 9 years ago

What I meant is this:

  1. For the default clustering strategy, everything goes to a single connection (or is evenly distributed over all open connections).

  2. For the advanced clustering strategy we would have dedicated connection(s) for all packets related to cluster cache updates: SyncOnlineCmd, TrafficStatisticsCmd, UserConnectedCmd, UserDisconnectedCmd, UserPresenceCmd; all the rest of the traffic (PacketForwardCmd and other normal XMPP traffic) goes to separate connection(s).

Andrzej Wójcik (Tigase) commented 9 years ago

OK, I get this, but I think it may not be a good idea. Say we now have 2 connections, we use both for traffic, and both run at 80% of their limit, with important packets making up 70% of the traffic.

So on connections we have:

  1. 70% important packets - 56% of bandwidth, 30% other packets - 24% of bandwidth

  2. 70% important packets - 56% of bandwidth, 30% other packets - 24% of bandwidth

If we change this so that one connection is for important packets and the other for the rest, then we will get:

  1. important packets only - 2*56% = 112% of bandwidth (overloaded)

  2. other packets - 2*24% = 48% (OK)

In this case, to handle the same amount of traffic we would need 4 connections, 2 of which might be rarely used.
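The load arithmetic above can be double-checked with a few lines; the numbers are the assumed ones from the example (2 connections at 80% load, important packets being 70% of the traffic):

```java
public class LoadMathSketch {
    public static void main(String[] args) {
        double perConnectionLoad = 0.80; // each of the 2 connections at 80% of its limit
        double importantShare = 0.70;    // important packets are 70% of the traffic

        // Mixed traffic: each connection carries both kinds of packets.
        double importantPerConn = perConnectionLoad * importantShare;       // 56% of capacity
        double otherPerConn = perConnectionLoad * (1 - importantShare);     // 24% of capacity

        // After splitting: one connection gets ALL important traffic, the other the rest.
        double importantOnlyConn = 2 * importantPerConn; // 112% -> overloaded
        double otherOnlyConn = 2 * otherPerConn;         // 48%  -> fine

        System.out.printf("important-only: %.0f%%%n", importantOnlyConn * 100);
        System.out.printf("other-only: %.0f%%%n", otherOnlyConn * 100);
    }
}
```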

I think it might be better to change the processing of packets by assigning System priority to the proper packets (i.e. SyncOnlineCmd, TrafficStatisticsCmd, UserConnectedCmd, UserDisconnectedCmd, UserPresenceCmd), while leaving the others unchanged. This would change the ordering in the PriorityQueues used for sending packets over the clustering connection, and would speed up delivery of packets from the more important group.
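The ordering effect described above can be sketched with a plain java.util.PriorityQueue; the Priority enum and command names here are simplified stand-ins for Tigase's own types, not the actual implementation:

```java
import java.util.Comparator;
import java.util.PriorityQueue;

public class PriorityOrderingSketch {
    // Simplified stand-in for a priority enum: lower ordinal = more urgent.
    enum Priority { SYSTEM, CLUSTER, HIGH, NORMAL }

    static final class Cmd {
        final String name;
        final Priority priority;
        Cmd(String name, Priority priority) { this.name = name; this.priority = priority; }
    }

    public static void main(String[] args) {
        // Outgoing queue ordered by priority, so SYSTEM-level commands drain first.
        PriorityQueue<Cmd> outQueue =
                new PriorityQueue<>(Comparator.comparing((Cmd c) -> c.priority));

        // Forwarded presences arrive first, then a cluster cache-update command.
        outQueue.add(new Cmd("packet-forward (presence)", Priority.CLUSTER));
        outQueue.add(new Cmd("packet-forward (presence)", Priority.CLUSTER));
        outQueue.add(new Cmd("UserDisconnectedCmd", Priority.SYSTEM));

        // Despite arriving last, the SYSTEM-priority command is sent first.
        System.out.println(outQueue.poll().name);
    }
}
```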

I'm suggesting this solution because keeping traffic balanced across the cluster connections should be better than unbalancing it.

What do you think about this approach?

Artur Hefczyc commented 9 years ago

Andrzej Wójcik wrote:

OK, I get this, but I think it may not be a good idea. Say we now have 2 connections, we use both for traffic, and both run at 80% of their limit, with important packets making up 70% of the traffic.

No, I have already run tests for this. The important traffic is just a fraction of the whole traffic. I do not remember the exact numbers right now, I can dig them up, but it is about 1% or less.

In this case, to handle the same amount of traffic we would need 4 connections, 2 of which might be rarely used.

I am not 100% sure, but right now Tigase opens 4 connections to each other cluster node. And to be honest, I think that the most reasonable setup would be: 2 connections for important traffic and 4 connections for all other traffic. We could actually make this either configurable or automatic, based on the number of CPU cores.

I think it might be better to change the processing of packets by assigning System priority to the proper packets (i.e. SyncOnlineCmd, TrafficStatisticsCmd, UserConnectedCmd, UserDisconnectedCmd, UserPresenceCmd), while leaving the others unchanged. This would change the ordering in the PriorityQueues used for sending packets over the clustering connection, and would speed up delivery of packets from the more important group.

I'm suggesting this solution because keeping traffic balanced across the cluster connections should be better than unbalancing it.

What do you think about this approach?

I think we should implement both. I thought that the cluster traffic used priority queues already, but maybe I only planned this and never finished it. In any case, I think it is critical to have both ideas implemented.

Andrzej Wójcik (Tigase) commented 9 years ago

%kobit I would like to mention only one thing - clustering uses priority queues, but important commands such as SyncOnlineCmd, TrafficStatisticsCmd, UserConnectedCmd, UserDisconnectedCmd, and UserPresenceCmd use the Cluster priority level, and the less important PacketForwardCmd uses the same Cluster priority. This may cause the more important commands to wait, even when we have only 2 connections (now we have 4 - 2 incoming and 2 outgoing, as I remember).

I still think that an additional connection for important packets may not be needed if we change/fix the priorities of each cluster command, i.e. decrease the priority of PacketForwardCmd, as the more important commands would then be delivered without waiting in the queue behind PacketForwardCmd.

Artur Hefczyc commented 9 years ago

Yes, PacketForwardCmd definitely needs to have a lower priority than cluster. However, this priority change alone does not solve the problem. It does help a little when the system is not overloaded, that is, when queues are empty or very short. But then even priority does not matter that much. When traffic is very high and queues start to grow, the cluster traffic goes into a queue which is already full and has to wait. It may even happen that a cluster packet is dropped due to queue overflow. The cluster traffic is critical, and we should not allow it to be dropped in any case.

Please also note that, in some cases, if the cluster traffic is able to bypass other traffic and reach all cluster nodes very quickly, it may even significantly reduce the load on the whole system. A good example is a mass disconnect. This generates a huge storm of presence updates. However, if information about the disconnected users quickly reaches all cluster nodes, many of the presence updates will not be sent/generated, because the cluster already knows that a buddy is offline.

So I think both changes are critical.

Andrzej Wójcik (Tigase) commented 9 years ago

I implemented priorities for cluster commands, so it is now possible to set a priority for each cluster command.

I decided to keep the most important cluster commands at the CLUSTER priority level, while I changed the less important cluster commands (packet-forward, clustering commands from MUC and PubSub) to use the HIGH priority level.

To set the priority level for a command, use the setPriority method on the CommandListener before adding the listener. This way, every command handled by this CommandListener implementation will be assigned that priority.
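As a rough illustration of that pattern (stub types only - the real CommandListener and Priority live in the Tigase code base and differ in detail):

```java
// Stub types illustrating the setPriority-before-registration pattern;
// this is NOT the actual Tigase API, just a self-contained sketch.
public class SetPrioritySketch {
    enum Priority { SYSTEM, CLUSTER, HIGH, NORMAL }

    // Stand-in for a cluster CommandListener carrying a configurable priority.
    static class CommandListener {
        private final String name;
        private Priority priority = Priority.CLUSTER;
        CommandListener(String name) { this.name = name; }
        void setPriority(Priority p) { this.priority = p; }
        Priority getPriority() { return priority; }
        String getName() { return name; }
    }

    public static void main(String[] args) {
        // Set the priority BEFORE the listener is registered, as described above.
        CommandListener forward = new CommandListener("packet-forward");
        forward.setPriority(Priority.HIGH); // demoted below CLUSTER

        CommandListener userDisconnected = new CommandListener("user-disconnected");
        userDisconnected.setPriority(Priority.CLUSTER); // kept at CLUSTER

        System.out.println(forward.getName() + " -> " + forward.getPriority());
        System.out.println(userDisconnected.getName() + " -> " + userDisconnected.getPriority());
    }
}
```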

This should improve the handling and processing of commands: since a PriorityQueue is used, the most important commands will be delivered faster to the recipient cluster node.

I created a new interface, ClusterConnectionSelectorIfc, whose implementations are used to select which cluster connection (XMPPIOService instance) will be used to send a packet to the recipient cluster node.

By default we use the ClusterConnectionSelector class, which implements this feature and uses separate connections for commands with a priority higher than or equal to the CLUSTER level, while other commands use the other cluster connections. By default I set the number of connections for important packets to 1, but this can be changed to e.g. 2 by adding the following line to the etc/init.properties file:

cl-comp/cluster-sys-connections-per-node[I]=2

In case of misconfiguration, or if there are fewer open connections than the number of connections assigned to important commands, the server will use any connection to send any command.
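The selection and fallback behaviour could look roughly like the following pure function; the names and the exact distribution rule are illustrative assumptions, not the actual ClusterConnectionSelector code:

```java
public class ConnectionSelectorSketch {
    enum Priority { SYSTEM, CLUSTER, HIGH, NORMAL }

    /**
     * Pick a connection index for a packet. The first sysConnections entries
     * are reserved for commands at CLUSTER priority or higher; if there are
     * not enough open connections, fall back to using any connection.
     */
    static int selectConnection(Priority p, int sysConnections, int openConnections, int hash) {
        if (openConnections <= sysConnections) {
            // Misconfiguration / too few connections: use any connection.
            return Math.abs(hash) % openConnections;
        }
        if (p.ordinal() <= Priority.CLUSTER.ordinal()) {
            // Important command: one of the reserved "system" connections.
            return Math.abs(hash) % sysConnections;
        }
        // Everything else: the remaining connections.
        return sysConnections + Math.abs(hash) % (openConnections - sysConnections);
    }

    public static void main(String[] args) {
        // With 3 open connections and 1 reserved, CLUSTER traffic uses index 0,
        // other traffic uses indexes 1..2.
        System.out.println(selectConnection(Priority.CLUSTER, 1, 3, 7));
        System.out.println(selectConnection(Priority.HIGH, 1, 3, 7));
        // Only 1 connection open: fall back to it regardless of priority.
        System.out.println(selectConnection(Priority.CLUSTER, 1, 1, 7));
    }
}
```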

I also created the ClusterConnectionSelectorOld class, which also implements the new connection selection interface but uses the old mechanism - any command may be sent over any connection. I added this as a possible alternative, so it can be enabled if needed by adding the following line to the etc/init.properties file:

cl-comp/connection-selector=tigase.cluster.ClusterConnectionSelectorOld

I also increased the number of cluster connections opened between cluster nodes to 3: since by default only 1 connection is assigned to important commands, I needed to increase the number of opened connections to compensate for this change.

For now, I have used tests to confirm that it works correctly (custom tests and TTS tests).

Artur Hefczyc commented 9 years ago

Thank you, I like your approach and solution.

However, I would like 2 minor changes, that is, the defaults set to:

  1. Default number of connections between cluster nodes - 5

  2. Default number of connections for CLUSTER level traffic set to 2

Please also run some tests with these new default settings to ensure that the multiple-connection setup for CLUSTER traffic is handled correctly.

Andrzej Wójcik (Tigase) commented 9 years ago

I changed the default settings to the suggested values and executed the tests, which worked fine.

Type
Task
Priority
Normal
Assignee
RedmineID
813
Version
tigase-server-7.1.0
Estimation
40h
Spent time
132h
Reference
tigase/_server/server-core#148