Artur Hefczyc opened 1 decade ago
|
Optimizations implemented, based more or less on the suggestion to use TigaseRuntime. Introduced a new configuration option, @skip-offline-sys@:
which takes into account ACS has been slightly updated to correctly report |
|
Is this option set to true by default? If so, I would like to see results from the load tests that run automatically in the environment prepared by Eric. We should see improvements immediately.
|
No, this option (well, both |
|
skip-offline should be off by default, as it breaks the spec, but the skip-offline-sys option should be on by default.
|
Changes applied: skip-offline-sys is enabled by default and skip-offline is off by default.
|
Eric, could you please let me know when the load tests are run with the current code from our repo? I am very curious to see the difference.
|
Test results from the 8th were messed up. Testing from today ended a bit early due to a cron miscalculation, but there are still two hours of useful data.
http://graph.cluster-c.xmpp-test.net/2014-02-07/ :: tigase-issue #5.2.0-SNAPSHOT-b3367
http://graph.cluster-c.xmpp-test.net/2014-02-09/ :: tigase-issue #5.2.0-SNAPSHOT-b3443
|
I compared the test results for the 7th Feb and the 11th Feb. They look the same: roughly the same number of presences is being processed in both cases. Could you please make sure the latest code is used for all recent tests? Also, it looks like there is a memory issue on the Tigase cluster. Tigase runs out of memory, which causes the bad results. What are the memory settings for the VM and for Tigase in these load tests?
|
Do you see the difference during iq:roster:get? That happens around 85 minutes into the test. I guess this is unrelated/not important. Tests from the 9th, 10th, and 11th used build 3443; build 3367 was used before that. Testing was off yesterday and is resuming today with build 3450 and OnlineUsersCachingStrategy. The cluster VMs each have 2G allotted. The MySQL VM has 4G. The cluster memory settings are: -Xms200M -Xmx1500M -XX:PermSize=32m -XX:MaxPermSize=256m -XX:MaxDirectMemorySize=128m. Anything higher on these VMs impacts performance on t2.tigase.org. The cluster is ready for use on the new hypervisor v35.tigase.org.
|
I looked at the latest load test results, and the fix seems to be working well.
Type: Task
Priority: Critical
Assignee:
RedmineID: 1714
Version: tigase-server-5.2.0
Spent time: 183h
Problem description:
Let's say we have a user with 10k contacts in the roster. All the contacts (or almost all of them) are on the same Tigase installation, but only a fraction of them are online.
However, when the user logs in to the system, this generates very high traffic: 10k presence probes + 10k initial presences to all contacts, even those who are offline. We can significantly reduce load on the server if we skip generating presences to all of those who we know are offline, that is, those who belong to the local service and are offline. Of course, we cannot say anything about contacts from other servers.
I have received quite a few requests recently describing high load problems. The service is for a relatively small number of online users (300 - 10k) but with large rosters (500 - 10k). As all users are local only (they do not open the installation to other servers), we could significantly reduce load and improve the overall service with this optimization.
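To put rough numbers on the saving, here is a minimal back-of-the-envelope sketch in Java. It only restates the arithmetic from the description (one probe plus one initial presence per contact). The class and method names are hypothetical, and it says nothing about how the server actually counts stanzas.

```java
// Rough estimate of presence stanzas generated when one user logs in.
// Assumption (from the description): one presence probe plus one initial
// presence is sent per roster contact.
final class LoginPresenceEstimate {

    // Every roster contact receives a probe and an initial presence.
    static long withoutSkipping(long rosterSize) {
        return 2L * rosterSize;
    }

    // Local contacts known to be offline are skipped; remote contacts are
    // always included because their state is unknown.
    static long withSkipping(long localOnlineContacts, long remoteContacts) {
        return 2L * (localOnlineContacts + remoteContacts);
    }
}
```

For example, with a 10k roster that is entirely local and an assumed 500 contacts online, `withoutSkipping(10_000)` gives 20,000 stanzas while `withSkipping(500, 0)` gives 1,000.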
Suggested solution:
I do not want to force you to do it my way but here is how I attempted to solve this problem in the past:
The Presence plugin has one private method, requiresPresenceSending. The old version of the method had code like this:
Where runtime is TigaseRuntime.
As you can see the code checks whether the system has all the required details to omit offline users, checks whether this is a local user and if the user is online. The first check was true for a single-mode installation and ACS. It was false for a default clustering strategy. Then either the main SM or ACS SM returned information whether the JID is online or not.
The code has been removed as it was not tested and I did not think at the time that this optimization was really needed. However, it looks like more and more services have very big rosters and we need something like this ASAP.
Please note that in the current Presence plugin code, sending probes is not enclosed within a requiresPresenceSending check, so this needs to be adjusted as well.
You need to make sure that runtime.hasCompleteJidsInfo() returns true for a standalone installation, false for default clustering and true for ACS. It should work this way already but as this is quite old code I am not sure if all bits are still there and if the ACS works correctly for the runtime.isJidOnline(buddy) call.
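The original snippet is not reproduced in this issue, so what follows is only a sketch of the three checks described above, not the removed code. `JidInfoSource` is a hypothetical stand-in for TigaseRuntime that exposes just the two calls named here. Plain strings replace Tigase's JID types, and the local-domain set is an assumption about how "local user" would be decided.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for the two TigaseRuntime calls named in this issue;
// the real class and its signatures may differ.
interface JidInfoSource {
    // Expected: true for standalone and ACS, false for default clustering.
    boolean hasCompleteJidsInfo();

    boolean isJidOnline(String bareJid);
}

final class OfflinePresenceFilter {
    private final JidInfoSource runtime;
    private final Set<String> localDomains; // assumed to hold lower-case domains

    OfflinePresenceFilter(JidInfoSource runtime, Set<String> localDomains) {
        this.runtime = runtime;
        this.localDomains = localDomains;
    }

    // Returns false only when the buddy is known to be a local, offline user.
    boolean requiresPresenceSending(String buddyJid) {
        if (!runtime.hasCompleteJidsInfo()) {
            return true; // online state is not fully known, so always send
        }
        if (!localDomains.contains(domainOf(buddyJid))) {
            return true; // remote contact: nothing is known about its state
        }
        return runtime.isJidOnline(bareJidOf(buddyJid));
    }

    // Strips the resource part: "user@domain/res" becomes "user@domain".
    static String bareJidOf(String jid) {
        int slash = jid.indexOf('/');
        return slash >= 0 ? jid.substring(0, slash) : jid;
    }

    // Extracts the lower-cased domain part of a JID.
    static String domainOf(String jid) {
        String bare = bareJidOf(jid);
        int at = bare.indexOf('@');
        String domain = at >= 0 ? bare.substring(at + 1) : bare;
        return domain.toLowerCase(Locale.ROOT);
    }
}
```

In this sketch the same `requiresPresenceSending` result would guard both the initial presence and the presence probe for a contact, which is the adjustment requested above for the probe-sending code.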