-
Optimizations implemented based more or less on the suggestion to use TigaseRuntime. Introduced a new configuration option @skip-offline-sys@:
sess-man/plugins-conf/presence/skip-offline-sys=(true|false)
which takes @TigaseRuntime@ into account as the environment condition when deciding whether to send presence (and works with the already present @skip-offline@ option, which strays somewhat from the RFC). ACS has been slightly updated to correctly report @.hasCompleteJidsInfo()@ (it now takes into account the state of the connecting node).
-
Testing results from the 8th got messed up. Testing from today ended a bit early due to a cron miscalculation; there are still two hours of useful data.
http://graph.cluster-c.xmpp-test.net/2014-02-07/ :: tigase-issue #5.2.0-SNAPSHOT-b3367
http://graph.cluster-c.xmpp-test.net/2014-02-09/ :: tigase-issue #5.2.0-SNAPSHOT-b3443
-
I compared test results from the 7th Feb and the 11th Feb. They look the same; roughly the same number of presences is being processed in both cases. Could you please make sure the latest code is used for all recent tests?
Also, it looks like there is a memory issue on the Tigase cluster. Tigase runs out of memory, which causes the bad results. What are the memory settings for the VMs and what are the memory settings for Tigase for these load tests?
-
Do you see the difference during iq:roster:get? That happens around 85 minutes into the test. I guess this is unrelated/not important. Tests from the 9th, 10th, and 11th used build 3443; the build used previously was 3367. Testing was off yesterday and resumes today with build 3450 and OnlineUsersCachingStrategy.
The cluster VMs each have 2 GB allotted; the MySQL VM has 4 GB. The cluster memory settings are: -Xms200M -Xmx1500M -XX:PermSize=32m -XX:MaxPermSize=256m -XX:MaxDirectMemorySize=128m. Anything higher on these VMs impacts performance on t2.tigase.org. The cluster is ready for use on the new hypervisor v35.tigase.org.
Type | Task
Priority | Critical
Assignee |
RedmineID | 1714
Version | tigase-server-5.2.0
Spent time | 0
Problem description:
Let's say we have a user with 10k contacts in the roster. All the contacts (or almost all of them) are on the same Tigase installation. But only a fraction of them are online.
However, when the user logs in to the system, it generates very high traffic: 10k presence probes + 10k initial presences to all contacts, even those who are offline. We can significantly reduce the load on the server if we skip generating presences to all of those whom we know to be offline, that is, those who belong to the local service and are offline. Of course, we cannot say anything about contacts from other servers.
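For illustration only (the online count here is an assumed number, not taken from this ticket): with a 10k-contact roster in which, say, 500 contacts are online, a single login currently triggers roughly 10k probes + 10k initial presences = 20k outgoing stanzas; if presence to local offline contacts is skipped, that drops to about 500 + 500 = 1k stanzas, a roughly 95% reduction for that login.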
I have received quite a few requests recently describing high-load problems. The service is for a relatively small number of online users (300 - 10k) but with large rosters (500 - 10k). As all users are local only and they do not open the installation to other servers, we could significantly reduce the load and improve the overall service with this optimization.
Suggested solution:
I do not want to force you to do it my way but here is how I attempted to solve this problem in the past:
The Presence plugin has one private method: requiresPresenceSending. The old version of the method had code like this:
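Since the snippet itself is not quoted here, the following is only a rough sketch of the kind of check described below, not the old code itself. @hasCompleteJidsInfo()@ and @isJidOnline()@ are the TigaseRuntime calls named in this ticket; the import paths, the method signature, and the isLocalUser() helper are assumptions.

import tigase.sys.TigaseRuntime;
import tigase.xmpp.JID;

class PresenceSendingSketch {

    // Sketch only, not the original Tigase code.
    static boolean requiresPresenceSending(JID buddy, TigaseRuntime runtime) {
        if (runtime.hasCompleteJidsInfo()      // do we know the online state of all local JIDs?
                && isLocalUser(buddy)          // hypothetical helper: buddy belongs to this installation
                && !runtime.isJidOnline(buddy)) {
            // A local user known to be offline: no presence (or probe) needs to be sent.
            return false;
        }
        return true;
    }

    // Hypothetical placeholder: in the real plugin this would check the buddy's
    // domain against the locally hosted vhosts.
    static boolean isLocalUser(JID buddy) {
        return true;
    }
}

With the new @skip-offline-sys@ option described above, such a check would presumably only be applied when that option is enabled.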
Where runtime is TigaseRuntime.
As you can see, the code checks whether the system has all the required details to omit offline users, whether this is a local user, and whether the user is online. The first check was true for a standalone installation and for ACS; it was false for the default clustering strategy. Then either the main SM or the ACS SM returned information on whether the JID is online or not.
The code has been removed because it was not tested and I did not think at the time that this optimization was really needed. However, it looks like more and more services have very big rosters, and we need something like this ASAP.
Please note that in the current Presence plugin code, sending probes is not enclosed within a requiresPresenceSending check, so this needs to be adjusted as well.
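As a rough illustration of that adjustment, here is a pair of methods that could sit in the same sketch class as above; the method signatures, the roster collection, and the sendPresenceProbe() helper are all assumptions, not the actual plugin code:

    // Hypothetical sketch only, not the actual Presence plugin code.
    // It shows the idea of guarding probe generation with the same check.
    static void broadcastProbes(java.util.Collection<JID> rosterBuddies, TigaseRuntime runtime) {
        for (JID buddy : rosterBuddies) {
            if (!requiresPresenceSending(buddy, runtime)) {
                continue;   // skip local users we know are offline
            }
            sendPresenceProbe(buddy);   // assumed helper that builds and routes the probe
        }
    }

    // Placeholder: the real plugin would build and route a <presence type="probe"/> packet here.
    static void sendPresenceProbe(JID buddy) {
    }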
You need to make sure that runtime.hasCompleteJidsInfo() returns true for a standalone installation, false for default clustering, and true for ACS. It should work this way already, but as this is quite old code I am not sure whether all the bits are still there and whether ACS works correctly for the runtime.isJidOnline(buddy) call.