Presences to offline users optimization (#281)

Artur Hefczyc opened 1 decade ago

Problem description:

Let's say we have a user with 10k contacts in the roster. All the contacts (or almost all of them) are on the same Tigase installation, but only a fraction of them are online.

However, when the user logs in to the system, this generates very high traffic: 10k presence probes plus 10k initial presences to all contacts, even those who are offline. We can significantly reduce load on the server if we skip generating presences for all contacts we know are offline, that is, those who belong to the local service and are offline. Of course, we cannot say anything about contacts from other servers.

I have received quite a few requests recently describing high load problems. The services have a relatively small number of online users (300 - 10k) but large rosters (500 - 10k). As all users are local only and the installations are not open to other servers, we could significantly reduce load and improve the overall service with this optimization.

Suggested solution:

I do not want to force you to do it my way but here is how I attempted to solve this problem in the past:

The Presence plugin has a private method, requiresPresenceSending. The old version of the method had code like this:

boolean result = !runtime.hasCompleteJidsInfo() || !session.isLocalDomain(buddy_domain, false) || runtime.isJidOnline(buddy);

Where runtime is TigaseRuntime.

As you can see the code checks whether the system has all the required details to omit offline users, checks whether this is a local user and if the user is online. The first check was true for a single-mode installation and ACS. It was false for a default clustering strategy. Then either the main SM or ACS SM returned information whether the JID is online or not.
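The three-part check described above can be sketched in isolation. This is only an illustration, not the actual Presence plugin code: RuntimeView and DomainView are hypothetical stand-ins for TigaseRuntime and the session, the second argument of isLocalDomain is dropped, and JIDs are plain strings.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class PresenceGuardSketch {

    /** Hypothetical stand-in for the two TigaseRuntime calls named in this issue. */
    public interface RuntimeView {
        boolean hasCompleteJidsInfo();
        boolean isJidOnline(String jid);
    }

    /** Hypothetical stand-in for the session's local-domain check. */
    public interface DomainView {
        boolean isLocalDomain(String domain);
    }

    /** Returns false only when we know for sure the buddy is local and offline. */
    public static boolean requiresPresenceSending(RuntimeView runtime, DomainView session, String buddy) {
        String buddyDomain = buddy.substring(buddy.indexOf('@') + 1);
        return !runtime.hasCompleteJidsInfo()       // incomplete view (default clustering): always send
            || !session.isLocalDomain(buddyDomain)  // remote contact: state unknown, always send
            || runtime.isJidOnline(buddy);          // local contact: send only if online
    }

    /** Test helper: a runtime with a fixed set of online JIDs. */
    public static RuntimeView runtime(final boolean complete, final Set<String> online) {
        return new RuntimeView() {
            public boolean hasCompleteJidsInfo() { return complete; }
            public boolean isJidOnline(String jid) { return online.contains(jid); }
        };
    }

    public static void main(String[] args) {
        Set<String> online = new HashSet<>(Arrays.asList("alice@local.example"));
        DomainView session = d -> d.equals("local.example");
        RuntimeView standalone = runtime(true, online);   // single-mode or ACS
        RuntimeView clustered = runtime(false, online);   // default clustering strategy
        for (String buddy : Arrays.asList("alice@local.example", "bob@local.example", "carol@remote.example")) {
            System.out.println(buddy
                + " standalone=" + requiresPresenceSending(standalone, session, buddy)
                + " clustered=" + requiresPresenceSending(clustered, session, buddy));
        }
    }
}
```

The only case that returns false is a local contact that is offline while the runtime reports complete JID information; every other combination falls back to sending.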

The code has been removed as it was not tested and I did not think at the time that this optimization was really needed. However, it looks like more and more services have very big rosters and we need something like this ASAP.

Please note that in the current Presence plugin code, sending probes is not enclosed within the requiresPresenceSending check, so this needs to be adjusted as well.

You need to make sure that runtime.hasCompleteJidsInfo() returns true for a standalone installation, false for default clustering and true for ACS. It should work this way already but as this is quite old code I am not sure if all bits are still there and if the ACS works correctly for the runtime.isJidOnline(buddy) call.
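To illustrate the note about probes, here is a rough sketch of a login broadcast where both the probe and the initial presence sit behind the same guard. It is a simplification under stated assumptions: stanzas are plain labels, subscription types are ignored, and the guard is passed in as a predicate rather than taken from the real plugin.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class LoginBroadcastSketch {

    /** Builds labels for the stanzas a login would emit, skipping contacts the guard rejects. */
    public static List<String> stanzasOnLogin(List<String> roster, Predicate<String> requiresPresenceSending) {
        List<String> out = new ArrayList<>();
        for (String buddy : roster) {
            if (!requiresPresenceSending.test(buddy)) {
                continue; // known-offline local contact: skip the probe AND the initial presence
            }
            out.add("probe -> " + buddy);
            out.add("presence -> " + buddy);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> roster = Arrays.asList(
            "a@local.example", "b@local.example", "c@local.example", "d@remote.example");
        // Hypothetical state: only a@ is online locally; d@ is remote, so it is always contacted.
        Predicate<String> guard = b -> b.endsWith("@remote.example") || b.equals("a@local.example");
        System.out.println("unguarded: " + stanzasOnLogin(roster, b -> true).size());
        System.out.println("guarded:   " + stanzasOnLogin(roster, guard).size());
    }
}
```

With this four-contact roster the unguarded login emits 8 stanzas and the guarded one emits 4; the saving grows with the share of local contacts that are offline.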

Related
- Not broadcasting initial presence to offline users (#596) Closed

Activities

Wojciech Kapcia (Tigase) commented 1 decade ago
Optimizations implemented, based more or less on the suggestion to use TigaseRuntime. Introduced a new configuration option @skip-offline-sys@:

sess-man/plugins-conf/presence/skip-offline-sys=(true|false)

which uses TigaseRuntime to decide, based on the environment, whether or not to send presence (it works alongside the already present @skip-offline@, which strays somewhat from the RFC).

ACS has been slightly updated to correctly report .hasCompleteJidsInfo() (it now takes into account the state of a connecting node).
Artur Hefczyc commented 1 decade ago

Is this option set to true by default? If yes, I would like to see load test results from the automated runs in the environment prepared by Eric. We should see improvements immediately.
Wojciech Kapcia (Tigase) commented 1 decade ago

No, this option (well, both skip-offline options) is off by default. Should it be enabled by default?
Artur Hefczyc commented 1 decade ago

skip-offline should be off by default as it breaks the spec, but the skip-offline-sys option should be ON by default.
Wojciech Kapcia (Tigase) commented 1 decade ago

Changes applied: skip-offline-sys is enabled by default, skip-offline is off by default.
Artur Hefczyc commented 1 decade ago

Eric, could you please let me know when load tests are run with the current code from our repo? I am very curious to see the difference.
Eric Dziewa commented 1 decade ago

Testing results from the 8th were messed up. Testing from today ended a bit early due to a cron miscalculation. There are still two hours of useful data.

http://graph.cluster-c.xmpp-test.net/2014-02-07/ :: tigase-issue #5.2.0-SNAPSHOT-b3367

http://graph.cluster-c.xmpp-test.net/2014-02-09/ :: tigase-issue #5.2.0-SNAPSHOT-b3443
Artur Hefczyc commented 1 decade ago

I compared test results for the 7th Feb and 11th Feb. They look the same. I mean roughly the same number of presences is being processed in both cases. Could you please make sure the latest code is used for all recent tests?

Also, it looks like there is a memory issue on the Tigase cluster. Tigase runs out of memory, which causes the bad results. What are the memory settings for the VM and what are the memory settings for Tigase for these load tests?
Eric Dziewa commented 1 decade ago

Do you see the difference during iq:roster:get? That happens around 85 minutes into the test. I guess this is unrelated/not important. Tests from the 9th, 10th, and 11th used build 3443. Previously, build 3367 was used. Testing was off yesterday and is resuming today with build 3450 and OnlineUsersCachingStrategy.

The cluster VMs each have 2G allotted. The MySQL VM has 4G. The cluster memory settings are: -Xms200M -Xmx1500M -XX:PermSize=32m -XX:MaxPermSize=256m -XX:MaxDirectMemorySize=128m. Anything higher on these VMs impacts performance on t2.tigase.org. The cluster is ready for use on the new hypervisor v35.tigase.org.
Artur Hefczyc commented 1 decade ago

I looked at the latest results from the load tests and it seems the fix is working well.

Type	Task
Priority	Critical
Assignee	Eric Dziewa
RedmineID	1714
Version	tigase-server-5.2.0
Spent time	0

Reference

tigase/_server/server-core#281