"Ghostbuster" for PEP (#118)

Wojciech Kapcia (Tigase) opened 4 years ago

As per our discussion in chat - it could be convenient to have a mechanism similar to MUC's ghostbuster for PEP.

Related
- tigase.util.log.LogFormatter `errors` collection seems to retain elements (tigase/_server/server-core#1256) Open

Activities

Wojciech Kapcia (Tigase) commented 4 years ago

This came up after seeing a gigantic pep-sync stanza for a user who had hundreds of resources there...

Andrzej Wójcik (Tigase) commented 4 years ago

I've implemented the "ghostbuster" and adjusted PresenceCollectorRepository to keep additional values and have a better API (classes instead of Map<Map>). I've also adjusted synchronization so that a UserEntry is removed from the map in ServiceEntry without race conditions (multiple concurrent locks, selected by the hashCode() of bare JIDs).
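
For readers unfamiliar with the locking approach mentioned above, here is a minimal sketch of lock striping keyed on the hashCode() of a bare JID. It is not the actual PresenceCollectorRepository code: the class name `StripedPresenceStore`, the stripe count, and the use of plain strings for JIDs are all assumptions made for illustration.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical sketch; names and structure are assumptions, not Tigase code. */
class StripedPresenceStore {

    private static final int STRIPES = 32; // assumed number of locks

    private final ReentrantLock[] locks = new ReentrantLock[STRIPES];
    // bare JID -> resources currently known for that user
    private final Map<String, Set<String>> users = new ConcurrentHashMap<>();

    StripedPresenceStore() {
        for (int i = 0; i < STRIPES; i++) {
            locks[i] = new ReentrantLock();
        }
    }

    // Select a lock from the bare JID's hashCode(); floorMod keeps the index non-negative.
    private ReentrantLock lockFor(String bareJid) {
        return locks[Math.floorMod(bareJid.hashCode(), STRIPES)];
    }

    void add(String bareJid, String resource) {
        ReentrantLock lock = lockFor(bareJid);
        lock.lock();
        try {
            users.computeIfAbsent(bareJid, k -> ConcurrentHashMap.newKeySet()).add(resource);
        } finally {
            lock.unlock();
        }
    }

    void remove(String bareJid, String resource) {
        ReentrantLock lock = lockFor(bareJid);
        lock.lock();
        try {
            Set<String> resources = users.get(bareJid);
            if (resources == null) {
                return;
            }
            resources.remove(resource);
            // The emptiness check and the removal happen under the same lock that
            // add() takes for this bare JID, so a concurrent add() cannot be lost.
            if (resources.isEmpty()) {
                users.remove(bareJid);
            }
        } finally {
            lock.unlock();
        }
    }

    int userCount() {
        return users.size();
    }

    int resourceCount(String bareJid) {
        Set<String> resources = users.get(bareJid);
        return resources == null ? 0 : resources.size();
    }
}
```

Operations on different users usually take different locks and do not block each other, while all operations on one bare JID are serialized.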

I still need to improve the synchronization of lastSeen between cluster nodes, but even without that the feature will work (the ghostbuster might just act with a delay).

However, I'm not sure if we should sync lastSeen at all; having it differ between cluster nodes may have little to no impact. Only the initial "ping" after cluster nodes reconnect would be delayed by an hour, and after that it would work just as if we had synced this value.
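
To make the lastSeen reasoning concrete, here is a rough sketch of how a per-node idle check could select resources to ping. Everything here is an assumption for illustration: the class `LastSeenTracker`, its method names, and the configurable idle threshold (the one-hour figure comes from the comment above).

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch; not the actual Ghostbuster implementation. */
class LastSeenTracker {

    // full JID -> last time this node saw activity from that resource
    private final Map<String, Instant> lastSeen = new ConcurrentHashMap<>();
    private final Duration maxIdle;

    LastSeenTracker(Duration maxIdle) {
        this.maxIdle = maxIdle;
    }

    // Record activity, keeping the most recent timestamp.
    void markSeen(String fullJid, Instant when) {
        lastSeen.merge(fullJid, when, (a, b) -> a.isAfter(b) ? a : b);
    }

    // Resources idle for longer than maxIdle are candidates for a ping.
    List<String> dueForPing(Instant now) {
        Instant cutoff = now.minus(maxIdle);
        List<String> due = new ArrayList<>();
        lastSeen.forEach((jid, seen) -> {
            if (seen.isBefore(cutoff)) {
                due.add(jid);
            }
        });
        return due;
    }

    // Drop a resource, e.g. after a ping fails or unavailable presence arrives.
    void forget(String fullJid) {
        lastSeen.remove(fullJid);
    }
}
```

In this sketch, a node that only learns about a resource when it rejoins the cluster would record it with the current time, so the first ping from that node is postponed by up to `maxIdle`. That matches the one-off delay described above.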

Wojciech Kapcia (Tigase) commented 4 years ago

I think that there is a problem with addressing:

[2021-05-27 15:59:40:869] [WARNING ] [              in_1-s2s ] S2SConnectionManager.processPacket(): Packet processing exception
tigase.xmpp.PacketInvalidTypeException: The packet has already 'error' type: from=…@jabb.im/…, to=tigase.pubsub.repository.PresenceCollectorRepository$ServiceEntry@39f703a, DATA=<iq xmlns="jabber:client" from="…@jabb.im/…" type="error" id="4a714a16-8180-41a7-892b-435de599b162" to="tigase.pubsub.repository.PresenceCollectorRepository$ServiceEntry@39f703a"><ping xmlns="urn:xmpp:ping"/><error code="406" type="modify"><not-acceptable xmlns="urn:ietf:params:xml:ns:xmpp-stanzas"/><text xml:lang="en" xmlns="urn:ietf:params:xml:ns:xmpp-stanzas">S2S - Incorrect source address (39f703a) - none of any local virtual hosts or components.</text></error></iq>, SIZE=484, XMLNS=jabber:client, PRIORITY=NORMAL, PERMISSION=NONE, TYPE=error, STABLE_ID=null
	at tigase.xmpp.Authorization.getResponseMessage(Authorization.java:480)
	at tigase.server.xmppserver.S2SConnectionManager.processPacket(S2SConnectionManager.java:215)
	at tigase.server.AbstractMessageReceiver$QueueListener.run(AbstractMessageReceiver.java:1404)

Andrzej Wójcik (Tigase) commented 4 years ago

@wojtek It should be fixed now in Tigase PubSub and in Tigase ACS PubSub. The main issue was with the serialization of entries to sync items between cluster nodes but PubSub was missing a method to retrieve service JID from the item. I've added a missing method and fixed the issue in ACS.
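
As background for the addresses seen in the log: a value such as `PresenceCollectorRepository$ServiceEntry@39f703a` has the shape of Java's default `Object.toString()` output (class name, `@`, hex hash code). The sketch below shows how such a string can appear where a JID was expected when an object is converted to text and no accessor for the real address is used. The classes and methods here are hypothetical stand-ins, not the Tigase types.

```java
/** Hypothetical demo; Entry stands in for an item type without a toString() override. */
class ServiceEntryDemo {

    static final class Entry {
        private final String serviceJid;

        Entry(String serviceJid) {
            this.serviceJid = serviceJid;
        }

        String getServiceJid() {
            return serviceJid;
        }
    }

    // Problematic: falls back to Object.toString(), giving "<class name>@<hex hash>".
    static String addressFromObject(Entry entry) {
        return String.valueOf(entry);
    }

    // Intended: ask the entry for its service JID explicitly.
    static String addressFromAccessor(Entry entry) {
        return entry.getServiceJid();
    }
}
```

For an entry created with `pubsub.example.com`, `addressFromAccessor` returns that JID, while `addressFromObject` returns a string ending in `Entry@` followed by a hex hash code, which no virtual host or component will ever match.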

Wojciech Kapcia (Tigase) commented 4 years ago

Currently we have this version deployed:

root@ad4950537f90:/home/tigase/tigase-server# for jar in `ls jars/*pubsub*jar` ; do unzip -qc ${jar} META-INF/MANIFEST.MF | grep  "Implementation-Version" ; done
Implementation-Version: 3.0.0-SNAPSHOT-b153/9dad8da7
Implementation-Version: 5.0.0-SNAPSHOT-b790/9db75655

It should contain the fix (https://github.com/tigase/tigase-acs-pubsub/commit/9dad8da78429e57762ef32d844919f5440e53ca7) yet the errors continue:

[2021-05-28 09:40:14:490] [FINEST  ] [              in_5-s2s ] MessageRouter.processPacket()    : Processing packet: from=null, to=null, DATA=<iq xmlns="jabber:client" to="e…k@z….im/1716812579775850753812799012" type="get" id="8797a4d5-431c-4fab-b3b6-c77ca2595993" retryCount="15" from="tigase.pubsub.repository.PresenceCollectorRepository$ServiceEntry@358189b5" delay="1"><ping xmlns="urn:xmpp:ping"/></iq>, SIZE=271, XMLNS=jabber:client, PRIORITY=NORMAL, PERMISSION=NONE, TYPE=get, STABLE_ID=null
[2021-05-28 09:40:15:013] [WARNING ] [              in_7-s2s ] S2SConnectionManager.processPacket(): Packet processing exception
tigase.xmpp.PacketInvalidTypeException: The packet has already 'error' type: from=e…k@z….im/1716812579775850753812799012, to=tigase.pubsub.repository.PresenceCollectorRepository$ServiceEntry@358189b5, DATA=<iq from="e…k@z….im/1716812579775850753812799012" xmlns="jabber:client" to="tigase.pubsub.repository.PresenceCollectorRepository$ServiceEntry@358189b5" type="error" id="8797a4d5-431c-4fab-b3b6-c77ca2595993"><ping xmlns="urn:xmpp:ping"/><error code="406" type="modify"><not-acceptable xmlns="urn:ietf:params:xml:ns:xmpp-stanzas"/><text xmlns="urn:ietf:params:xml:ns:xmpp-stanzas" xml:lang="en">S2S - Incorrect source address (358189b5) - none of any local virtual hosts or components.</text></error></iq>, SIZE=509, XMLNS=jabber:client, PRIORITY=NORMAL, PERMISSION=NONE, TYPE=error, STABLE_ID=null
	at tigase.xmpp.Authorization.getResponseMessage(Authorization.java:480)
	at tigase.server.xmppserver.S2SConnectionManager.processPacket(S2SConnectionManager.java:215)
	at tigase.server.AbstractMessageReceiver$QueueListener.run(AbstractMessageReceiver.java:1404)

I checked the CloudWatch logs and it has entries like this:

2021-05-28 08:50:13.711 TRACE [scheduler_pool-12-thread-2-pubsub] t.p.Ghostbuster.ping(): for tigase.pubsub.repository.PresenceCollectorRepository$ServiceEntry@7a434434 sending ping to 74@jwchat.org/169967308614608423461620839002768433

and looking at the code it would mean that the actual JID stored in entry.getServiceJid() is incorrect (?!)

It seems it should be OK and fixed. Looking at the number of errors, they appear to be declining. Should we assume that incorrect values had been added to the collections, were then synchronised back during the upgrade, and will be removed as time passes, so that the issue resolves itself?

Captura de pantalla 2021-05-28 a las 13.58.43.png Captura de pantalla 2021-05-28 a las 14.12.21.png

Andrzej Wójcik (Tigase) commented 4 years ago

Yes, you are correct. Invalid values were in the cache and were propagated to new cluster nodes when they joined the cluster. Now that we no longer have old/bad nodes, we need to wait for the new nodes to clean up those entries, and everything should be back to normal.

Wojciech Kapcia (Tigase) added "Related" tigase/_server/server-core#1256 1 year ago

Type	Task
Priority	Normal
Assignee	Andrzej Wójcik (Tigase)
Version	Candidate for next minor release
Spent time	0

Reference

tigase/_server/tigase-pubsub#118