Artur Hefczyc opened 1 decade ago
Wojtek, when we talked about this last time you mentioned that you had already committed some fixes for this, but the problem still appears to affect some users. So this might have more than one cause. Could you please add an update on this issue in the comments?
With the latest code nodes connect within ~5 seconds; however, the issue described in the thread still occurs - users that connected to each node before the cluster connection was established won't see each other's presence unless they re-broadcast presence. Ideally, after establishing the cluster connection there should be a synchronization of online users (which I believe was implemented, but is removed now; see commit:499ba42281391b12259dae7483bfdfbc34ac4a27)
As far as I remember the synchronization is not removed, it has been moved to a different place. It is no longer implemented inside the SM; it is implemented at the clustering strategy level, as a strategy command: tigase.server.cluster.strategy.cmd. At least this is how it is supposed to work - please verify that it does! However, this is a different kind of presence synchronization from what you are talking about. The cluster synchronization is for synchronizing cluster cache data, so that each node knows who is online and where each user is connected. It does not resend or rebroadcast a user's presence data to all contacts, and I think it should not do that. A correct way to handle it, in my opinion, is to make Tigase delay accepting user connections until cluster connections are established. Assuming that cluster connections are established within 5 seconds, we could implement a delay of 30 seconds in Tigase, after which it starts accepting user connections if started in cluster mode.
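The delay proposed above can be sketched as a simple startup gate; this is a minimal illustration using plain `java.util.concurrent` - the class and method names here are hypothetical, not actual Tigase APIs:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: block user-facing connection handling until the
// clustering code reports that cluster connections are up, or a fallback
// delay (e.g. 30 seconds) elapses, whichever comes first.
public class ClusterStartupGate {
    private final CountDownLatch clusterReady = new CountDownLatch(1);

    // Called by clustering code once connections to all nodes are established.
    public void onClusterConnected() {
        clusterReady.countDown();
    }

    // Called before opening user connection ports; waits up to
    // maxDelaySeconds and then proceeds regardless. Returns true if the
    // cluster became ready within the delay, false if we timed out.
    public boolean awaitClusterOrTimeout(long maxDelaySeconds) throws InterruptedException {
        return clusterReady.await(maxDelaySeconds, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws Exception {
        ClusterStartupGate gate = new ClusterStartupGate();
        gate.onClusterConnected();                          // simulate cluster up
        System.out.println(gate.awaitClusterOrTimeout(30)); // prints "true"
    }
}
```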
This is actually important stuff for 7.1.0.
Artur Hefczyc wrote:
Yes, it was moved to ACS: tigase.server.cluster.strategy.cmd.RequestSyncOnlineCmd, and it works (and I guess there's not much sense having it in the default strategy, as it doesn't utilize caching and the received sync data would be ignored in this case).
%kobit I was doing some tests and also analysing the issue, and I think that simply delaying accepting user connections won't be an ideal solution (for example, when establishing the clustering connection takes longer for some reason). I think that, with clustering mode enabled, we could use the following logic:
What do you think? Also - there is a possibly related problem: when the nodes are online and working correctly, but for some reason the cluster connection is broken, all the connections/disconnections/status changes during that time won't be visible after the cluster connection is re-established. Should we also think of handling this case, or do we consider it 'abnormal'?
Wojciech Kapcia wrote:
Yes, I think this is a good approach. By delaying "accepting user connections" I meant something like what you described above - some logic which either delays listening on user connection ports or prevents users from connecting to the node in some other way. I think the last point - loading the cluster-repo and waiting for all nodes to be connected - makes the most sense.
This is something we should take care of as well, in future versions. The first, most important step would be to make sure all cluster nodes resynchronize after reconnecting; the second step would be to make sure users are up to date with all the changes which happened in the meantime. The first step seems relatively simple but, I imagine, the second step might be quite difficult and maybe even overkill. "Normally" we require very good connectivity between cluster nodes in order to consider clustering useful; we do not guarantee that clustering will work correctly if there are connectivity issues between cluster nodes.
I've implemented delaying listening on user ports and tied opening them to the appropriate events (opening when there is a single node in the cluster, or when there are more nodes and a connection has been established to all of them). I've also decreased the initial reload time, so establishing clustering connections now takes only a couple of seconds. In addition, I've added an expiration timer to start listening for connections after roughly 2 minutes (just in case), and there are log entries informing that listening for user connections was delayed. I've tested it quite thoroughly and everything was working correctly. %andrzej.wojcik - could you review the changes in …
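The behaviour described above - open user ports when the node is alone in the cluster or when connections to all known nodes are established, with a roughly 2-minute fallback timer just in case - could be sketched like this. All class and method names here are illustrative, not the actual Tigase implementation:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: delay opening user connection ports until the
// cluster is fully connected, with an expiration timer as a safety net.
public class DelayedPortOpener {
    // Other cluster nodes loaded from the cluster repository.
    private final Set<String> knownNodes = ConcurrentHashMap.newKeySet();
    private final Set<String> connectedNodes = ConcurrentHashMap.newKeySet();
    private final AtomicBoolean portsOpen = new AtomicBoolean(false);
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();

    public DelayedPortOpener(long fallbackSeconds) {
        // Expiration timer: open ports after the fallback delay even if
        // some cluster connection never comes up.
        timer.schedule(this::openPorts, fallbackSeconds, TimeUnit.SECONDS);
    }

    // Called once the cluster repository is loaded with the other nodes.
    public void onNodesLoaded(Set<String> nodes) {
        knownNodes.addAll(nodes);
        maybeOpen();
    }

    // Called each time a cluster connection to another node is established.
    public void onNodeConnected(String node) {
        connectedNodes.add(node);
        maybeOpen();
    }

    private void maybeOpen() {
        // Single node in the cluster (no other known nodes), or
        // connections established to all of them.
        if (knownNodes.isEmpty() || connectedNodes.containsAll(knownNodes)) {
            openPorts();
        }
    }

    private void openPorts() {
        if (portsOpen.compareAndSet(false, true)) {
            System.out.println("Opening user connection ports");
            timer.shutdownNow(); // cancel the pending fallback task
        }
    }

    public boolean isOpen() {
        return portsOpen.get();
    }

    public static void main(String[] args) {
        DelayedPortOpener opener = new DelayedPortOpener(120); // ~2 min fallback
        opener.onNodesLoaded(Set.of("node2", "node3"));
        opener.onNodeConnected("node2");
        opener.onNodeConnected("node3"); // all connected -> ports open now
        System.out.println(opener.isOpen()); // prints "true"
    }
}
```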
I reviewed the changes from …
Solution: I think it would be far better if we moved the creation of the event handler and its registration to …
Andrzej Wójcik wrote:
Thank you!
OK, makes sense (not using OSGi, and not yet having familiarised myself with the kernel, made me miss those use cases). I've made the suggested changes.
OK, is this documented somewhere? What is done, what is called, and what status can we expect at different moments?
From what I remember it's possible to manage component configuration through ad-hocs (and some components don't support it, hence …). The other case is management of components through a groovy script, but is there a way to distinguish it (i.e. initialization of a single component vs initialization of the whole server - for a single component we could assume the cluster is ready and avoid delaying opening connections)? Does OSGi mode utilize the same or a similar approach? From what I've checked it could be quite tricky. We could store information about cluster availability in …
From what I gather, with TKF this should be both easier to achieve and done in a saner manner - correct? However, please note that to avoid the problem of this event being missed for some reason, I've included an expiration timer as mentioned in the previous post - it has a default time-out of 2 minutes (default delay * 60), so this may be too long and, granted, this is far from ideal.
Right now establishing a cluster connection is quite fast, and ACS synchronization is done separately - correct? I think the window would be really slim:
Correct?
OK, in this case the chances of establishing an s2s connection and missing user presence are even slimmer - usually, in the case of a cluster, an s2s connection would already be established to one of the nodes. Of course we can add a similar mechanism there.
Wojciech Kapcia wrote:
I'm not talking only about OSGi mode. We have …
The script which I mentioned above …
OSGi uses the same methods in <= 7.1.0 as …
To be honest, I'm not sure if it will be easier. We will know if a component is started, stopped or reconfigured, but there will still be no way to tell whether the cluster is already synced.
I agree - the timer will somehow deal with this situation, but then we need to mention it in the documentation.
For sure, …
If we wait only for cluster connections and not for the ACS cache sync, then it is OK with me.
Andrzej Wójcik wrote:
As mentioned earlier, there is a possibility to use an 'ugly' solution - we could have a … Any cons?
When I mentioned 'easier' I meant something along the lines of the above - more direct access to other components (and I think that, in addition to configuration handling, this was the motivation behind the Kernel)...
Will do, thanks for the comments so far.
Wojciech Kapcia wrote:
I think there will be no cons in this case - except for the direct access to the other component, but I think we can skip this con; just make sure this solution will work if we get …
Yes, this would be possible by adding a field and an annotation to that field.
The changes have been completed. I've incorporated the above comments to avoid delaying client connectivity while reloading a component if the cluster is active; I've added configuration parameters to control it, and documentation with the rationale and a description of how it works. Given that it was originally intended for 7.1.0, later pushed to 7.1.1, and that 7.1.0 hasn't been published yet, it was finally merged to …
Perfect! Thank you.
Type: Bug
Priority: Normal
Assignee:
RedmineID: 1783
Version: tigase-server-7.1.0
Spent time: 231h
All the details are on the forums: message#1591