Stream management errors queueing (#698)

Daniele Ricci opened 9 years ago

Following up discussion in #4142.

There is a case when undelivered messages are lost after a first pass through session management. Here is a scenario:

Alice sends a message to Bob's full JID
But Bob has a broken connection and he never receives the message (or it does, but the server doesn't get the ack, same thing)
Bob realizes about the broken connection (the server still does not) and reconnects
Server disconnects the old client for conflict (because Bob is using the same resource)
At this point, the message is marked with a delivery-error and stamped with time at point 5 and begins its redelivery process
Message is stored to offline storage (including the delivery-error element) because no connection is available to handle it yet
Bob sends the new available presence for the connection started at 4 and server redelivers messages from offline storage
C2SDeliveryErrorProcessor checks that the timestamp of the delivery error is 5 but the only available connection was started at 4, therefore discarding the message

At point 9, C2SDeliveryErrorProcessor returned true anyway even if no message was delivered. The message is now lost.

Is my reasoning right?

A first possible fix:

https://github.com/kontalk/tigase-server/commit/ed3750ec3994136c59e02d8c02bdfb04010138b1

A note on that commit: if what I said in my comment is right, my modification is not a complete fix anyway. My patch allows for the lost message to be redelivered again at the next connection - which I don't like it either by the way, it's a dirty and temporary workaround until I can find a better solution.

One more note on the matter. Further analysis of the problem brought me to modify also the conditions for which this special case is met, culminating in this modification:

https://github.com/kontalk/tigase-server/commit/f153b8d9d72f68b2d09e17ea4d8758dc0e4fc8de

https://github.com/kontalk/tigase-server/commit/1addf41e1636df0d39ae1adf13248caf8a11aae9

(those 2 commits could be squashed by the way, the first one was a mistake because I didn't read the getAuthTime method)

The reason for those commits is to avoid the issue my previous workaround (ed3750ec...) created: if, by a very brief moment, the message is redelivered while the user hasn't been authenticated yet, it will be sent to offline storage again (because initial presence hasn't been sent yet, therefore it will be sent on the next connection). Using the real authentication timestamp as a base for comparison makes sure the client is actually ready to receive the message (actually the right time would be after initial presence, but I guess authentication is a good enough compromise. What do you think?).

Note: I'm actually using git/release as the baseline.

Duplicate
- Messages getting lost when StreamManagement resumption is used (#685) Closed

Activities

Daniele Ricci commented 9 years ago

Sorry but I seem to not be getting notification emails by Redmine.

Anyway, before closing this, please consider the implications of my last paragraph (it was added lately). If you think also that issue is fixed, you may close it.

Thank you!
Login to comment

Type	Bug
Priority	Normal
Assignee	Wojciech Kapcia (Tigase)
RedmineID	4380

Issue Votes (0)

Watchers (0)

Reference

tigase/_server/server-core#698