Undelivered stanzas are processed before stream closure commands (#1343)
Closed
wojciech.kapcia@tigase.net opened 2 years ago

(from https://github.com/tigase/tigase-server/issues/184)

Describe the bug
There is a race condition between the STREAM_CLOSED/STREAM_FINISHED command stanzas and the processing of undelivered stanzas.

When the connection is closed, the tigase.server.xmppclient.StreamManagementIOProcessor#serviceStopped method is called.

tigase.server.xmppclient.StreamManagementIOProcessor#serviceStopped calls:

  • tigase.server.xmppclient.ClientConnectionManager#serviceStopped -> tigase.server.xmppclient.ClientConnectionManager#xmppStreamClosed to create and send the STREAM_CLOSED and STREAM_FINISHED stanzas
  • tigase.server.xmppclient.StreamManagementIOProcessor#sendErrorsForQueuedPackets -> tigase.server.xmppclient.ClientConnectionManager#processUndeliveredPacket to process undelivered stanzas

After the STREAM_CLOSED and STREAM_FINISHED stanzas are created, their processing is started by calling addOutPacket, whereas processing of undelivered stanzas is started by calling processOutPacket (see the sketch below).
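A minimal, self-contained Java sketch of the ordering hazard, under the assumption (from this report) that addOutPacket hands a stanza to an intermediate queue drained by a separate thread, while processOutPacket delivers directly. All names and bodies below are illustrative stand-ins, not Tigase's actual implementation:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative stand-in only: models two delivery paths into "SessionManager".
public class StanzaOrderingDemo {

    static final LinkedBlockingQueue<String> sessionManagerIn = new LinkedBlockingQueue<>();
    static final LinkedBlockingQueue<String> outQueue = new LinkedBlockingQueue<>();

    // addOutPacket(): one extra hop through an intermediate queue.
    static void addOutPacket(String stanza) {
        outQueue.add(stanza);
    }

    // processOutPacket(): handed to the destination directly.
    static void processOutPacket(String stanza) {
        sessionManagerIn.add(stanza);
    }

    public static void main(String[] args) throws InterruptedException {
        // Worker thread draining the intermediate queue (the extra hop).
        ExecutorService worker = Executors.newSingleThreadExecutor();
        worker.submit(() -> {
            try {
                while (true) {
                    sessionManagerIn.add(outQueue.take());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        addOutPacket("STREAM_CLOSED");          // queued first, but one hop behind
        processOutPacket("undelivered stanza"); // direct; can overtake the command

        Thread.sleep(200);
        // Depending on thread scheduling, the undelivered stanza is often
        // printed first -- the reordering this issue describes.
        sessionManagerIn.forEach(System.out::println);
        worker.shutdownNow();
    }
}
```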

As a result, the first undelivered stanza is sometimes processed before the STREAM_CLOSED and STREAM_FINISHED stanzas in SessionManager. That stanza includes a delivery-error element, which is handled by the Message forwarder (tigase.xmpp.impl.Message#preProcess); tigase.xmpp.impl.C2SDeliveryErrorProcessor#preProcess then filters the stanza out when it finds an XMPPResourceConnection. A sketch of this filter follows below.
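A self-contained sketch of the filtering step just described; the real logic lives in tigase.xmpp.impl.C2SDeliveryErrorProcessor#preProcess, and the types here are hypothetical stand-ins:

```java
// Hypothetical stand-ins for the real Tigase types, for illustration only.
record Stanza(boolean hasDeliveryError) {}
record XMPPResourceConnection(String jid) {}

class DeliveryErrorFilterSketch {
    // In Tigase, a pre-processor returning true means "packet handled, stop
    // processing" (hence the "Packet blocked by: message" log entry below).
    // So a redelivered stanza carrying <delivery-error/> is dropped whenever
    // a resource connection is still found, and never reaches offline storage.
    static boolean preProcess(Stanza stanza, XMPPResourceConnection conn) {
        return stanza.hasDeliveryError() && conn != null;
    }
}
```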

To Reproduce
Send messages to the receiver in bulk. The receiver disconnects before sending the <a> acknowledgment for all of the stanzas.

Impact
Processing of the undelivered stanza is blocked by the Message forwarder and never reaches offline-message storage, so the message is lost.

[2022-12-27 08:05:24:390][FINEST  ][        in_20-sess-man ]  SessionManager.processPacket()   : Packet blocked by: message, packetfrom=c2s@internal/10.10.10.10_5222_192.168.255.255_61068, to=sess-man@internal, DATA=<message from="d6ce@mydomain.com/5964a0422a5b938a" to="6711@mydomain.com" xmlns="jabber:client" type="chat" id="gPHAlLqtwVnAoAKaFJm2o6PxbAjjbj"><group id="HY0FD" xmlns="http://kontalk.org/extensions/message#group"><nick xmlns="http://jabber.org/protocol/nick">Nick-d6ce</nick><body>Message 13</body><active xmlns="http://jabber.org/protocol/chatstates"/></group><delay xmlns="urn:xmpp:delay" stamp="2022-12-27T08:05:22.599Z" from="mydomain.com"/><delivery-error stamp="1672128322599" xmlns="http://tigase.org/delivery-error"/></message>, SIZE=623, XMLNS=jabber:client, PRIORITY=NORMAL, PERMISSION=NONE, TYPE=chat

Expected behavior
We have to ensure that the STREAM_CLOSED and STREAM_FINISHED stanzas are processed before any undelivered stanzas.

Screenshots
N/A

Details (please complete the following information):

  • Tigase version: 8.1.0
  • JVM flavour and version: openjdk 11.0.7 2020-04-14
  • Operating system/distribution/version: Ubuntu 18.04.1 LTS
  • CPU: Intel(R) Xeon(R) Platinum 8259CL @ 2.50GHz
  • GNU/Linux - 1 CPU - 2 virtual CPUs - 4.06G RAM

Additional context
tigase.log

Andrzej Wójcik (Tigase) commented 2 years ago

I think this is possible only if the stanza was queued for delivery in XMPPIOService's waiting list but was not yet written to the socket; otherwise it would be in the OutQueue and would be handled during disconnection by StreamManagementIOProcessor (which would actually resend the message after the socket is closed, i.e., after the resumption timeout; see https://github.com/tigase/tigase-server/blob/09d0b30e7a7cd8056e627469cbed176d932c47df/src/main/java/tigase/server/xmppclient/StreamManagementIOProcessor.java#L535 and https://github.com/tigase/tigase-server/blob/09d0b30e7a7cd8056e627469cbed176d932c47df/src/main/java/tigase/server/xmppclient/StreamManagementIOProcessor.java#L584). In both of those cases, SM sends errors (schedules redelivery) after calling ConnectionManager::serviceStopped(service), which actually closes the connection.

On the other hand, the call to ConnectionManager::processUndeliveredPacket() in ConnectionManager::serviceStopped() is made before the service/connection is closed, because the closing is done in the subclass ClientConnectionManager::serviceStopped() after the parent class's method is called.

We could move the scheduling of redelivery so that it is called after the connection is closed (and STREAM_CLOSED and STREAM_FINISHED are sent). However, this would mean moving the call from ConnectionManager to ClientConnectionManager (and maybe other subclasses), making it non-standalone, which feels wrong.

Moreover, scheduling of unacked packets (from the StreamManagement queue) is currently done after the connection is closed and after packets from the waiting queue are resent. Since messages in the OutQueue (SM queue) are older than messages in the waiting queue, I think we should make sure they are redelivered before the waiting-queue messages. That, too, would result in a non-standalone ConnectionManager class.

I think the best way would be to reimplement this part (redelivery) to take packets from both the OutQueue and the waiting queue and process them in a single run, to make sure the ordering of packets is correct (see the sketch below). Moreover, it might be desirable to think about merging the OutQueue and the waiting queue somehow, as both contain sent (or unacked) packets, while the waiting queue is missing timestamps for those packets (it should nonetheless be processed quickly, unless the connection's IO/TCP buffer saturates).
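A sketch of what that single-run redelivery could look like, assuming a hypothetical QueuedPacket type carrying the OutQueue's timestamps and treating waiting-queue entries as the newest packets (since, as noted above, they lack timestamps):

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch only; QueuedPacket and the queue shapes are stand-ins.
public class MergedRedeliverySketch {

    record QueuedPacket(String stanza, long stampMillis) {}

    static List<QueuedPacket> mergeForRedelivery(List<QueuedPacket> outQueue,
                                                 List<String> waitingQueue) {
        long now = System.currentTimeMillis();
        return Stream.concat(
                        outQueue.stream(),
                        // Waiting-queue entries carry no timestamps; treat them
                        // as the newest packets, matching the reasoning above.
                        waitingQueue.stream().map(s -> new QueuedPacket(s, now)))
                // Oldest first, so offline storage keeps the original order.
                .sorted(Comparator.comparingLong(QueuedPacket::stampMillis))
                .collect(Collectors.toList());
    }
}
```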

I'm not sure we can cleanly solve this issue without API changes and a rather large reimplementation. Maybe minor changes would work, but the order of stored messages might then be incorrect. I also have mixed feelings about relying on offline storage in a multi-client environment and would suggest using MAM for message syncing and dropping the use of offline storage.

@wojtek What do you think? How should we proceed?

wojciech.kapcia@tigase.net commented 2 years ago

We could move the scheduling of redelivery so that it is called after the connection is closed (and STREAM_CLOSED and STREAM_FINISHED are sent). However, this would mean moving the call from ConnectionManager to ClientConnectionManager (and maybe other subclasses), making it non-standalone, which feels wrong.

ClientConnectionManager already has somewhat different processing logic, so it wouldn't be that "non-standalone"; still, looking at the code, I agree that making the change would require a bigger rework.

I think it's OK to make API changes, considering that those would only be additions (and any removals would be deprecated first). How much time do you estimate for the proposed single-queue implementation?

I'm not sure we can cleanly solve this issue without API changes and a rather large reimplementation. Maybe minor changes would work, but the order of stored messages might then be incorrect. I also have mixed feelings about relying on offline storage in a multi-client environment and would suggest using MAM for message syncing and dropping the use of offline storage.

I don't think this is necessarily a multi-client scenario. Judging from the payload, it's most likely Kontalk with its very non-standard group chat (i.e., it doesn't use MUC; instead, each client broadcasts the same message to all participants, hence the possible "bulk messages" and the likely higher CPU usage / slower queue processing). Yes, MAM may be a solution, but "very secure" deployments most likely don't use MAM and operate on volatile messages.

Andrzej Wójcik (Tigase) commented 2 years ago

How much time do you estimate for the proposed single-queue implementation?

That is a difficult question. I would prefer not to extend the "waiting queue", as it is used frequently and may contain a lot of items (also for S2S, so extending it may increase memory usage), but extending it may be necessary to get correct timestamps for messages in offline storage.

But we can try to "move" redelivery to the correct place and hope it will work without further changes (a sketch of the reordered shutdown is below). In the worst-case scenario, we would be forced to reimplement the queuing mechanism. What do you think?
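A minimal sketch of that "move", assuming the fix is simply to reorder the two calls named earlier in this thread; the method bodies are stubs, not the actual Tigase code:

```java
// Stub sketch only -- names mirror methods mentioned in this thread.
class ShutdownOrderSketch {

    void xmppStreamClosed() {
        // would enqueue STREAM_CLOSED / STREAM_FINISHED command stanzas
    }

    void sendErrorsForQueuedPackets() {
        // would schedule redelivery of undelivered stanzas
    }

    void serviceStopped() {
        // Close the stream first, so the closure commands are queued ahead
        // of any redelivered stanzas...
        xmppStreamClosed();
        // ...and only then schedule redelivery.
        sendErrorsForQueuedPackets();
    }
}
```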

In Tigase 9.0, we should (as we discussed) unify MAM storage with offline storage. Then we may not need to redeliver those messages to offline storage, as they would already be there if not acked by the client (or not fully sent), so I would prefer not to spend too much time on this.

Yes, MAM may be a solution, but "very secure" deployments most likely don't use MAM and operate on volatile messages.

Is storing messages in MAM really that different from storing messages in offline storage? Basically, the only change is that after a message is delivered, it "should" be removed. On the other hand, offline storage forces single-client usage if the client is to have a full history of messages. I think MAM with a short "expiration" would work almost the same.

Andrzej Wójcik (Tigase) commented 2 years ago

@wojtek I've applied the discussed changes; for now they are stored in a separate branch. I've also added a test case to TTS-NG to verify this fix. Please review it when you have some time, and then we will merge and release it.

wojciech.kapcia@tigase.net commented 2 years ago

I checked the changes (and created a PR: https://github.com/tigase/tigase-server/pull/188#pullrequestreview-1304619926); it looks good IMHO.

In Tigase 9.0, we should (as we discussed) unify MAM storage with offline storage. Then we may not need to redeliver those messages to offline storage, as they would already be there if not acked by the client (or not fully sent), so I would prefer not to spend too much time on this.

Agreed.

Yes, MAM may be a solution, but "very secure" deployments most likely don't use MAM and operate on volatile messages.

Is storing messages in MAM really that different from storing messages in offline storage? Basically, the only change is that after a message is delivered, it "should" be removed. On the other hand, offline storage forces single-client usage if the client is to have a full history of messages. I think MAM with a short "expiration" would work almost the same.

This is reasonable. And in the case of a "pure offline store", messages can be removed from MAM when the offline client connects (just as happens right now). But yes, having MAM by default for everything would simplify a lot of the back-and-forth and redelivery. (Feel free to create a task for 9.0 to address the change.)

wojciech.kapcia@tigase.net batch edited 4 months ago: Iterations set to tigase-server-8.4.0 (previously empty)
wojciech.kapcia@tigase.net changed state from 'In QA' to 'Closed' 4 months ago