Tigase chooses wrong component to route traffic to (#198)
Closed
Rui Ferrao opened 1 decade ago

I have a Tigase cluster configured with two MUCs in a load-balanced setup, exactly as described here:

http://www.tigase.org/content/load-balancing-external-components-cluster-mode

My setup uses two different servers, each one running an instance of Tigase Server and an instance of Tigase MUC.

So server one has XMPP1 + MUC1 and server two has XMPP2 + MUC2.

Everything seems to work fine except for one thing, which is the MUC ping functionality.

Let's say I connect a user to XMPP2. I then create a new room (it gets created in MUC2). MUC2 pings the user.

I see the following traffic:

Ping: MUC2 -> XMPP2 -> User

Ping result: User -> XMPP2 -> MUC1

MUC1 then returns an error back, as the room does not reside there.

I hope the explanation is clear enough. I am still trying to go through the code to see if I can pinpoint the problem, but I decided to open the bug immediately.

Rui Ferrao commented 1 decade ago

Actually, I realized that the problem with MUC load balancing and Tigase choosing the wrong component is not limited to pings.

In the same scenario as above, this also happens to other types of traffic, such as presence stanzas, although for a different reason.

This time it is not the 'to' address that causes Tigase to make the wrong decision, but the fact that traffic coming over the cluster connection is not taken into account for this decision by tigase.server.ext.ComponentProtocol.getXMPPIOService().

The first thing this method does is:

// First we check whether the receiver has sent to us a packet through one
// of connections as this would be wise to send response on the same connection
for (ComponentConnection componentConnection : conns) {
      ComponentIOService serv = componentConnection.getService();
      if (serv != null && serv.isConnected() && serv.isRecentJID(p.getStanzaTo())) {
         result = serv;
         break;
      }
}

But the cluster link is not part of this check. If the receiver has sent us a packet through the cluster link, it is not taken into consideration, and we then fall back to the LB, which will only make the right decision by pure luck.
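To illustrate what that check implies, here is a rough sketch of per-connection "recent JID" tracking: each connection remembers the source JIDs of the packets it has delivered, so a reply addressed to one of those JIDs can be sent back over the same connection. This is hypothetical code with made-up names, not the actual ComponentIOService implementation:

import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of per-connection "recent JID" tracking, similar in
// spirit to ComponentIOService.isRecentJID(). Not Tigase source code.
class RecentJidTracker {

      private static final int MAX_ENTRIES = 10000;

      // LRU-style set of stanza source JIDs recently seen on this connection.
      private final Set<String> recentJids = Collections.newSetFromMap(
            new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
                  @Override
                  protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                        return size() > MAX_ENTRIES;
                  }
            });

      // Called for every packet received on this component connection.
      void packetReceived(String fromJid) {
            recentJids.add(fromJid);
      }

      // True if a reply addressed to this JID should go back on this connection.
      boolean isRecentJid(String toJid) {
            return toJid != null && recentJids.contains(toJid);
      }
}

The point is that this memory only exists for packets that arrived on one of the component connections; packets arriving over the cluster link never populate it, so the check above cannot help in that case.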

One possible scenario where this happens:

User 2 connects to XMPP2 and creates a room. XMPP2 uses the LB to decide which MUC to send the room-creation traffic to. Suppose the chosen MUC is MUC2, and User 2 then tells the room to invite User 1, who happens to be connected to XMPP1. The invitation goes over the cluster link between XMPP2 and XMPP1 to reach User 1. But because of this, XMPP1 has no idea which MUC connection to use when User 1 sends presence to the room to join it (no traffic for that room has ever been seen on XMPP1's MUC connections), so the LB just chooses one of the connections.

It's pure chance whether it chooses the right connection or we end up with duplicate rooms in both MUCs.

I had this setup working by chance for several days until I realized just now that the whole thing falls apart. I still need to perform some more tests, but I am now convinced that I can only connect one MUC to a Tigase cluster if I want it to work properly, at the expense of no redundancy on the MUCs.

Artur Hefczyc commented 1 decade ago

Hi Rui,

First of all, thank you for reporting the problem, and thank you for all the comments, detailed reports and for looking into it.

I have started to look at it; however, I am not sure when I will have a definite solution, assuming there is a bug in Tigase. In the meantime, may I suggest that you switch to a setup with virtual component mode for the time being?

Just to make sure I correctly understand the problem, please let me give you some explanation and suggestions. To be honest, to me it looks like a problem with the setup on your system. The problem you describe is likely to happen when both/all MUC components are not connected to both/all cluster nodes. Let's say you have 2 cluster nodes: node1 and node2, and 2 MUC components: muc1 and muc2.

To make it all work correctly, you have to make sure that muc1 is connected to both node1 and node2, AND muc2 is connected to both node1 and node2. Also, make sure the external connection is of type 'connect' on both external components (muc1 and muc2) and 'accept'/'listen' on the Tigase nodes' side (node1 and node2).

Please note: the code you attached above matters only on the external component side (muc1 and muc2), not on the cluster nodes' side.

Let's say both MUC components serve the same MUC domain: muc.example.com.

The actual MUC component is selected based on the destination BareJID address, such as room1@muc.example.com. Now, assuming you have set up the correct LB for your connections on the cluster nodes' side, which is ReceivedBareJidLB, it calculates a hash code based on the BareJID destination address. This hashcode is obviously the same regardless of which cluster node it is calculated on.

Based on this hashcode, an external component connection is selected...
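As a rough illustration of that idea (just a sketch with made-up names, not the actual ReceivedBareJidLB implementation), a hash-based selector could look like this:

import java.util.List;

// Sketch of hash-based load balancing on the destination bare JID, in the
// spirit of ReceivedBareJidLB. Hypothetical code, not Tigase source.
class BareJidConnectionSelector {

      // Picks a connection from the destination bare JID. As long as every
      // cluster node sees the same connections in the same order, the same
      // room JID always maps to the same external component connection.
      static <T> T selectConnection(String destinationBareJid, List<T> connections) {
            if (connections.isEmpty()) {
                  return null;
            }
            int idx = (destinationBareJid.hashCode() & 0x7fffffff) % connections.size();
            return connections.get(idx);
      }
}

The key property is that the mapping is deterministic, but only if every cluster node enumerates its component connections in the same order.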

And now I realized where the problem is... Ok, I think I will have a fix committed to the code repository tonight :-)

Artur Hefczyc commented 1 decade ago

Applied in changeset tigase-server|commit:54d8ff4999390e71626ed120b70c515e41ae695e.

Rui Ferrao commented 1 decade ago

Artur Hefczyc wrote:

First of all, thank you for reporting the problem, and thank you for all the comments, detailed reports and for looking into it.

Not a problem; thank you for looking into this, Artur.

Artur Hefczyc wrote:

Just to make sure I correctly understand the problem, please let me give you some explanation and suggestions. To be honest, to me it looks like a problem with the setup on your system. The problem you describe is likely to happen when both/all MUC components are not connected to both/all cluster nodes. Let's say you have 2 cluster nodes: node1 and node2, and 2 MUC components: muc1 and muc2.

To make it all work correctly, you have to make sure that muc1 is connected to both node1 and node2, AND muc2 is connected to both node1 and node2. Also, make sure the external connection is of type 'connect' on both external components (muc1 and muc2) and 'accept'/'listen' on the Tigase nodes' side (node1 and node2).

I am/was sure that everything is properly configured and all MUCs are connected.

Artur Hefczyc wrote:

Please note: the code you attached above matters only on the external component side (muc1 and muc2), not on the cluster nodes' side.

This part I did not know. I added some log statements to ComponentProtocol to understand what was happening, and I was getting those logs in both my tigase.log and muc.log files, which led me to believe this code was used on both sides. They are running on the same machine, but could log4j mix the log statements up this way?

Artur Hefczyc wrote:

Based on this hashcode, an external component connection is selected...

And now I realized where the problem is... Ok, I think I will have a fix committed to the code repository tonight :-)

Super, I will test it as soon as I can and give you feedback on my findings. Once again, thank you so much for the fast response. I can see you keep the same sort of working hours as I do... Have a nice weekend.

Rui Ferrao commented 1 decade ago

Your commit seems to have fixed the second problem, so I am not getting duplicate rooms anymore.

The first problem, however, is still present, causing the MUC to terminate the rooms.

It's very easy to reproduce: just create one room that ends up on MUC1 and another that ends up on MUC2, then wait for Ghostbuster to start pinging. The rooms residing on the MUC that pinged first will survive; the ones on the other MUC will be terminated.

If the source of the ping is room@muc.domain.org instead of just muc.domain.org, I guess that would fix it, no? I believe that is still compliant with XEP-0199; what do you think?
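Just to illustrate the idea (a sketch only; the JIDs are examples from this thread and this is not the actual Ghostbuster code), the ping would carry the room's bare JID in 'from', so the client's result would come back addressed to a JID the routing code can actually match on:

// Hypothetical sketch: composing the ping 'from' address from a room JID
// instead of the bare MUC domain. Example JIDs only, not Ghostbuster code.
class RoomPingExample {
      public static void main(String[] args) {
            String roomJid = "room1@muc.eixodigital.com";
            String userJid = "rui@user.eixodigital.com/tigase-1";
            String pingIq =
                  "<iq from='" + roomJid + "' to='" + userJid + "' type='get' id='png-1'>"
                  + "<ping xmlns='urn:xmpp:ping'/></iq>";
            // The result IQ would then be addressed to room1@muc.eixodigital.com,
            // which the routing layer can hash or match on.
            System.out.println(pingIq);
      }
}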

Rui Ferrao commented 1 decade ago

I tried a simple approach to solving the ping issue: giving the components different names (muc1 and muc2) and using that as the source address of my pings. But for some reason Tigase did not like this, and every time the MUCs got a packet it just went into a closed loop forever...

So I decided to add a new property to the MUC configuration called ping-node, so now my MUC configs have:

MUC1

...
muc/ping-every-minute[B]=true
muc/ping-node=muc1

MUC2

...
muc/ping-every-minute[B]=true
muc/ping-node=muc2

I then patched the MUCComponent/Ghostbuster so the MUC uses ping-node as the source node for the pings, and everything works like a charm!

I've attached my patch to this issue, but I am using an older revision of tigase-muc, the one used at the time of the tigase-5.1.4 release; I believe it's aef22354.

Artur Hefczyc commented 1 decade ago

Thanks Rui for your continued help and all the details.

Would you happen to have the XMPP Ping communication log as it happens on the client side? That would help me in finding a good solution.

Rui Ferrao commented 1 decade ago

Artur Hefczyc wrote:

Thanks Rui for your continued help and all the details.

Would you happen to have the XMPP Ping communication log as it happens on the client side? That would help me in finding a good solution.

Artur, I am not sure I understand what you mean by the XMPP Ping communication log on the client side.

I am using a web-based client; below are the actual stanzas sent/received on the client side in the error case:

<iq to="rui@user.eixodigital.com/tigase-1" type="get" xmlns="jabber:client" id="png-1" from="muc.eixodigital.com">
   <ping xmlns="urn:xmpp:ping"/>
</iq>

<iq to="muc.eixodigital.com" type="result" xmlns="jabber:client" id="png-1"/>

<iq to="rui@user.eixodigital.com/tigase-1" type="error" xmlns="jabber:client" id="png-1" from="muc.eixodigital.com">
   <error type="cancel" code="501">
      <feature-not-implemented xmlns="urn:ietf:params:xml:ns:xmpp-stanzas"/>
   </error>
</iq>

If the Ping result actually hits the right MUC, then we do not get the last error stanza back and Ghostbuster will not destroy the room.

Artur Hefczyc commented 1 decade ago

Hm, looks like the solution would actually be to send the ping with the full room address in the 'from' attribute, not just the MUC component address.

Bartosz, could you please check this out? Can we send the ping from the MUC component with the full room address rather than just the component domain? This would solve the problem, I suppose.

Rui Ferrao commented 1 decade ago

Artur Hefczyc wrote:

Hm, looks like the solution would actually be to send the ping with the full room address in the 'from' attribute, not just the MUC component address.

Bartosz, could you please check this out? Can we send the ping from the MUC component with the full room address rather than just the component domain? This would solve the problem, I suppose.

Don't forget that MUC pings each user only once, no matter how many rooms the user is present in. It's a MUC ping and not a Room ping, which makes perfect sense.

My first idea was to use different component names, something like:

--comp-name-1=muc1
--comp-class-1=tigase.muc.MUCComponent
--comp-name-2=muc2
--comp-class-2=tigase.muc.MUCComponent

And then compose the 'from' address using the component name, e.g. from=muc1@muc.eixodigital.com, but for some reason this causes the MUC to loop packets around forever, hence my use of a new configuration property.

Rui Ferrao commented 1 decade ago

Artur,

I have actually managed to reproduce the fault with the LB once again. Looking more closely at your fix, I can see that you now compare two ComponentConnections based on the remote address, which seems OK, but you are still storing them in an ArrayList, which is not sorted and does not guarantee a consistent order across both servers.

My quick fix, in addition to what you have already done, is to add:

Collections.sort(conns); 

after successfully inserting a new connection in ComponentProtocol.addComponentConnection.
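To illustrate the idea (a sketch with made-up names, not the actual ComponentConnection class): if the connections are comparable by their remote address and the list is sorted after every insert, both cluster nodes end up with the same ordering, so the hash-based LB picks the same connection on each node.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of keeping component connections in a consistent order
// across cluster nodes. Not the actual Tigase ComponentConnection class.
class Connection implements Comparable<Connection> {

      private final String remoteAddress;

      Connection(String remoteAddress) {
            this.remoteAddress = remoteAddress;
      }

      // Compare by remote address so every node orders the connections the same way.
      @Override
      public int compareTo(Connection other) {
            return remoteAddress.compareTo(other.remoteAddress);
      }
}

class ConnectionRegistry {

      private final List<Connection> conns = new ArrayList<>();

      // Equivalent of adding Collections.sort(conns) in addComponentConnection().
      void addComponentConnection(Connection conn) {
            conns.add(conn);
            Collections.sort(conns);
      }
}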

Artur Hefczyc wrote:

Applied in changeset tigase-server|commit:54d8ff4999390e71626ed120b70c515e41ae695e.

Rui Ferrao commented 1 decade ago

Artur, please ignore my latest comment... I was looking at an older version of the code to which I had ported your fixes...

I also realized my suggestions regarding the MUC ping issue were nonsense, and Bartosz's fix of pinging from the last-used room name solves it.

Thank you for looking into this. I am glad I have everything working, and the Easter holidays are just around the corner.

Type: Bug
Priority: Normal
Assignee:
RedmineID: 1131
Version: tigase-server-5.2.0
Issue Votes (0)
Watchers (0)
Reference: tigase/_server/server-core#198