Slow MAM synchronization (#75)
Closed
Andrzej Wójcik (Tigase) opened 2 years ago

This issue was created after the results from siskinim-352 pointed out that slow MAM sync in SiskinIM is caused by slow responses from the server, accompanied by high CPU usage on the server side.

Andrzej Wójcik (Tigase) commented 2 years ago

After reviewing logs on our installation, the high CPU usage pointed to an issue with stanza processing in Tigase XMPP Server. I ran a profiler against a local installation and got the following result.

It is clear that most of the time during MAM message retrieval is spent in the XML parser's SimpleParser method (which is to be expected). What was not expected was to see the StringBuilder::append() method causing high CPU usage.

Looking at the code of the AbstractStringBuilder::append() method, it was clear that there are some if statements related to the Compact Strings feature, which is new and enabled by default since JDK 9. I decided to rerun the profiler with the additional Java flag -XX:-CompactStrings to disable this feature and got the following results:

Parsing XML now took 2977ms for 900 stanzas, while previously it took 2532ms for 100 stanzas. Changing this flag made XML parsing roughly 7.6 times faster.

We should consider using this flag on all our installations, but we need to be careful, as this will increase the memory used by "latin-1" strings in Tigase (they will use twice the amount of memory). With Compact Strings disabled, memory usage would be equal to running Tigase on a pre-JDK 9 environment, which seems acceptable.
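
For reference, a quick way to verify that the flag actually reached the JVM is to list the runtime input arguments via the standard management API (this is a plain JDK check, not Tigase-specific; the class name below is just for illustration):

```java
import java.lang.management.ManagementFactory;

public class JvmFlagsCheck {
    public static void main(String[] args) {
        // Prints the JVM options the process was started with;
        // "-XX:-CompactStrings" should appear here if the flag was picked up.
        ManagementFactory.getRuntimeMXBean().getInputArguments()
                .forEach(System.out::println);
    }
}
```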

Andrzej Wójcik (Tigase) commented 2 years ago

Should we change this setting by default? Or should we review our usage of StringBuilder in the XML parser and look for (or create) a replacement to speed up XML parsing? Or is usage of this flag acceptable?

wojciech.kapcia@tigase.net commented 2 years ago

I would lean more towards disabling Compact Strings (i.e. using the flag). In the future we could reevaluate it and consider our own implementation, but I would say this would require better testing (JMH with memory comparison) to have solid data backing the changes.

Changing tigase.conf would make sense, as well as adding a comment to the documentation.
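
A minimal JMH sketch of the kind of comparison mentioned above could look like the following (class name, loop counts, and sample strings are illustrative). Running it once with default flags and once with -XX:-CompactStrings passed via JMH's -jvmArgs option, plus the gc profiler (-prof gc) for allocation numbers, would give the throughput/memory data to back the decision:

```java
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class StringBuilderAppendBenchmark {

    // Illustrative sample data: one chunk that stays Latin-1 and one that forces
    // UTF-16, to exercise both paths of AbstractStringBuilder.append().
    private final char[] latin1Chunk = "<message from='a@example.com'>plain text</message>".toCharArray();
    private final char[] utf16Chunk  = "<message from='a@example.com'>zażółć gęślą jaźń</message>".toCharArray();

    @Benchmark
    public String appendLatin1() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100; i++) {
            sb.append(latin1Chunk, 0, latin1Chunk.length);
        }
        return sb.toString();
    }

    @Benchmark
    public String appendUtf16() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100; i++) {
            sb.append(utf16Chunk, 0, utf16Chunk.length);
        }
        return sb.toString();
    }
}
```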

Andrzej Wójcik (Tigase) commented 2 years ago

As the results the next evening were the same as before adding -XX:-CompactStrings, I discussed this issue with @wojtek and we concluded that we need to investigate further.

After checking the performance of Tigase during sync (in the morning in Europe when it was fast, and in the morning in the US when it was slow), there was no slowdown on the server side. Comparing network usage on the EC2 instance and in SiskinIM, it looks like when Tigase is fast to respond it peaks around 150MB (according to EC2 monitoring), but when it is slow it peaks around 96MB.

Screen from when Tigase was slow to respond

Comparing that to metrics gathered in SiskinIM: SiskinIM reports that it "only" transferred ~10MB of data, and when Tigase was fast to respond the transfer rate was around 0.6MB/s. That is odd compared to the EC2 metrics seen above (150MB?).

However, when I tried to measure the SiskinIM transfer rate today, I also got 0.5MB/s during the time of day when Tigase was slow to respond yesterday.

It is possible that the slowdown was caused by AWS issues, i.e. an increase in traffic: our instances are t3a.small, whose networking is rated as "up to 5Gbps", meaning they can burst to 5Gbps when that is possible without limiting other instances. The base networking speed for our instances is 0.128Gbps, which would be 16MB/s. However, even assuming that Siskin transferred 10MB and we had to fetch the same amount of data from AWS RDS, at 16MB/s it should still take about 2 seconds, not a few minutes.

Taking all of that into consideration, I think that:

  1. -XX:-CompactStrings did not give us any performance gain, as there were times when Tigase was as slow as without this flag
  2. Tigase had almost no load, no queues
  3. Tigase during slow responses was calling GC a lot (but JMX monitoring was connected)
  4. It is possible that the slowdown was caused by AWS issues, i.e. a traffic increase in the AWS zone
  5. Initially, slow responses were caused by excessive logging (which, as pointed out by @wojtek, is required on our installations)
  6. Long round trip times also impacted sync (ping is over 190ms in my case)

I will repeat my performance test tomorrow to confirm the results, as today (the first time it was "fast" in the evening) Wojtek was not connected (I do not think it has anything to do with it, but I want to rule this out, i.e. increased resource usage impacting Tigase performance).

wojciech.kapcia@tigase.net commented 2 years ago

> Tigase during slow responses was calling GC a lot (but JMX monitoring was connected)

I ran into the same issue, and from what I gathered at that time the "excessive" allocation is caused by RMI/JMX calls, so this should be taken into account when using JMX (VisualVM, for example). I would recommend using YourKit, which doesn't seem to suffer from the same issue.

> Initially, slow responses were caused by excessive logging (but as pointed out by @Wojciech Kapcia, required on our installations)

It's more of a "very nice to have": if an issue happens, it's easier to debug with a complete set of information, and considering these are our test installations, to which we deploy Tigase first, having more information almost always proves helpful. Still, we could adjust debugging/logging statements to limit it somewhat (DOMXMPPHandler was usually enabled together with the xmpp package, which caused a lot of excessive debugging; using logback instead of JUL offers significantly better performance, so disabling JUL's FileHandler helps as well).
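
As a side note, if only the JUL file logging turns out to be the bottleneck, detaching the FileHandler is a small operation against the standard java.util.logging API. A sketch only; on a real installation this would normally be done via the logging configuration rather than code:

```java
import java.util.logging.FileHandler;
import java.util.logging.Handler;
import java.util.logging.LogManager;
import java.util.logging.Logger;

public class DisableJulFileLogging {
    public static void main(String[] args) {
        // Remove any FileHandler attached to the root logger, leaving other handlers intact.
        Logger root = LogManager.getLogManager().getLogger("");
        for (Handler handler : root.getHandlers()) {
            if (handler instanceof FileHandler) {
                root.removeHandler(handler);
            }
        }
    }
}
```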

Andrzej Wójcik (Tigase) commented 2 years ago

I decided to take a different approach to further testing, as I was convinced that we have some kind of connectivity issue. I captured TCP traffic using Wireshark and reviewed the exchanged packets.

The first thing I observed was that Tigase responds very fast if the client is connected through the load balancer (see-other-host was not triggered, as the LB forwarded the connection to the proper backend server). When the client was redirected directly to the EC2 instance (and backend server), responses were slow.

Comparing those exchanges, I noticed that exchanges with the EC2 instance (direct connection, without forwarding by the LB) were very often marked by Wireshark with TCP Full Window (on packets sent from the client to the server). Investigating this, I found that TCP packets sent from the server have a very small window size (around 1.1K) and a scaling factor of 1, while packets sent from the client to the server have a window size of around 13.5K and a scaling factor of 64. That led me to the conclusion that the TCP receive buffer on the server side was too small, which triggered TCP flow control to back off from sending new packets until there was space for them (during message sync a lot of packets are being sent, and a lot of <a/> packets confirming delivery need to be sent from the client to the server).
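
For what it's worth, the advertised TCP window is bounded by the socket's receive buffer, which can be inspected directly on an accepted channel via the standard NIO API. The sketch below assumes a plain SocketChannel and an arbitrary test port; Tigase's own I/O classes wrap this, so it is only meant to show where the ~1.1K window would come from:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.StandardSocketOptions;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class ReceiveBufferProbe {
    public static void main(String[] args) throws IOException {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress(15222)); // arbitrary test port
            try (SocketChannel client = server.accept()) {
                // SO_RCVBUF caps how much unread data the kernel will buffer for this
                // connection, and therefore the TCP window advertised to the peer.
                int rcvBuf = client.getOption(StandardSocketOptions.SO_RCVBUF);
                int sndBuf = client.getOption(StandardSocketOptions.SO_SNDBUF);
                System.out.println("SO_RCVBUF=" + rcvBuf + " SO_SNDBUF=" + sndBuf);
            }
        }
    }
}
```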

Knowing that, I compared TCP performance (message sync) with different network buffer sizes set in Tigase XMPP Server. The reported times are the times of syncing 300 messages from the MAM archive.

| Buffer size | Fastest time | Longest time |
| --- | --- | --- |
| 2KB | 1.20s | 4.50s |
| 4KB | 0.65s | 1.25s |
| 8KB | 0.54s | 0.79s |
| 16KB | 0.50s | 0.70s |

From this table it looks like our default C2S buffer size of 2KB is a rather poor choice, as its performance is bad in cases like this one (long round-trip time but high throughput).

On the other hand, each XMPP connection holds:

  • an instance of a network buffer (inside the socket, equal to the net-buffer size)
  • an instance of a buffer into which data is read from the socket (equal to the net-buffer size)
  • an instance of a char buffer (4 times the net-buffer size)

Each increase of the network buffer size will increase the amount of memory used by Tigase XMPP Server. Taking all of that into consideration, I think we should "bump" the size of this buffer to at least 4K.

We can do that either for each connection manager or just for C2S (and WebSocket) connection managers.

Another idea: to increase the network buffer size without using too much memory, we could create a separate setting for the network/socket buffer size, independent of the buffer used to read data from the socket. That could lead to "repeated" reads, which could increase CPU usage, so I'm not sure if that is a good idea.
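
A rough sketch of that idea against plain NIO (the constants and their values are hypothetical; in Tigase this would live in the connection manager / IO layer rather than standalone code): the kernel-side receive buffer is enlarged while the application-side read buffer keeps the current net-buffer size.

```java
import java.io.IOException;
import java.net.StandardSocketOptions;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

public class BufferSizing {

    // Hypothetical settings: existing net-buffer size vs. a new, independent socket buffer size.
    private static final int NET_BUFFER_SIZE    = 2 * 1024; // buffer used to read from the socket
    private static final int SOCKET_BUFFER_SIZE = 4 * 1024; // kernel-side SO_RCVBUF

    static ByteBuffer configure(SocketChannel channel) throws IOException {
        // Enlarge only the kernel receive buffer; this is what widens the advertised TCP window.
        channel.setOption(StandardSocketOptions.SO_RCVBUF, SOCKET_BUFFER_SIZE);
        // The application-side read buffer stays at the current net-buffer size;
        // a full socket buffer is then drained with more than one read() call.
        return ByteBuffer.allocate(NET_BUFFER_SIZE);
    }
}
```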

In my opinion, we should address this issue in version 8.3.0 (at least increase the buffer size to 4KB or 8KB) and consider further changes in 9.0.0, i.e. dynamically allocating buffers so they can be freed when not used. That could increase the number of alloc and free operations, but as we already resize those buffers when larger stanzas are being sent, this may not be so critical; we can assess that in version 9.0.0 with some performance tests.

This is important as more and more devices are "mostly offline" and connect only from time to time, e.g. to fetch the history of the last week (because, for example, iOS was not waking them up for a week), and the time required for sync is time during which the user does not have up-to-date data and has to wait. Reducing this time matters: in my case, syncing the last month of messages (around 11k messages) can be reduced from 49s (after the optimizations in SiskinIM) to 19s with an 8k buffer or 22s with a 4k buffer. Those 49s are under good network conditions; under bad ones, this value slipped to over 1m30s (with the buffer size optimizations, this would have a lower impact).

@wojtek @kobit What do you think?

wojciech.kapcia@tigase.net commented 2 years ago

I'm somewhat hesitant about this. The main motivation is "syncing a lot of messages" (i.e. from over a week). Yes, in the case of iOS the client is mostly offline, but users will most likely use the application at least once a day, won't they? Or should we optimise for edge cases where someone doesn't use the application?

On the other hand, this mostly applies to "offline clients", and in that case memory allocation for the buffers should be less of an issue (compared to mostly idle, constantly connected clients). Dynamic sizing of this buffer indeed seems like a good idea (if possible): starting with a bigger buffer to give more performance and lowering it for idle connections.

Andrzej Wójcik (Tigase) commented 2 years ago

The size of this buffer has an impact not only on "offline clients" but also on clients sending large chunks of data, i.e. vcards, avatars, etc. Basically, every stanza bigger than this buffer will fill the socket buffer (one or more times), causing a TCP round trip to "notify the client that the server read the data" before the client sends the next part of the stanza, while the previous part is still held in the server's memory. Please note that resizing internal Tigase buffers has no impact on the socket buffer; AFAIR, socket buffers are not resized at all!

wojciech.kapcia@tigase.net commented 2 years ago

I'm aware of that, I'm just weighing possibilities: under normal circumstances large chunks of data (vcard/avatar) occur less often than other stanzas (presence, messages), so accommodating them would mean keeping a constantly larger buffer for packets that happen less often, and to me (with static/fixed sizing) that seems wasteful.

Andrzej Wójcik (Tigase) commented 2 years ago

Keeping data in memory while we wait for the remaining parts of the same stanza is also wasteful, as we could already be processing this data instead of holding it in memory.

I agree that keeping memory allocated for those buffers may be a waste; that is why I suggested introducing a new variable so we could increase only the socket buffer size without increasing the size of internal Tigase buffers. That would still increase memory usage, but with a smaller impact; see the table below for more details.

| net-buffer size | socket buffer size | usage per connection | increase factor | Calculation | Comment |
| --- | --- | --- | --- | --- | --- |
| 2KB | X | 12KB | 1.0 | 2 + 2 + 4*2 | |
| 4KB | X | 24KB | 2.0 | 4 + 4 + 4*4 | |
| 8KB | X | 48KB | 4.0 | 8 + 8 + 8*4 | |
| 16KB | X | 96KB | 8.0 | 16 + 16 + 16*4 | |
| 2KB | 4KB | 14KB | 1.17 | 4 + 2 + 4*2 | Suggested version with custom setting to increase just socket buffer |
| 4KB | 4KB | 24KB | 2.0 | 4 + 4 + 4*4 | Suggested version with custom setting to increase just socket buffer |
| 2KB | 8KB | 18KB | 1.5 | 8 + 2 + 4*2 | |
| 4KB | 8KB | 28KB | 2.34 | 8 + 4 + 4*4 | Suggested version with custom setting to increase just socket buffer |

Note: socket buffer is most likely direct memory and not part of the heap.

From this table, you can see that increasing the socket buffer size (without increasing the net-buffer size) would reduce the number of required RTTs for large packets sent from the client to the server, while only slightly increasing memory usage (far less than with the current implementation, which uses the net-buffer size for the socket buffer size).

Example memory usage table

| net-buffer size | socket buffer size | no. of connections | memory usage |
| --- | --- | --- | --- |
| 2kB | X | 1000 | ~12MB |
| 2kB | X | 100000 | ~1200MB |
| 4kB | X | 1000 | ~24MB |
| 4kB | X | 100000 | ~2400MB |
| 8kB | X | 1000 | ~48MB |
| 8kB | X | 100000 | ~4800MB |
| 2KB | 4KB | 1000 | ~14MB |
| 2KB | 4KB | 100000 | ~1400MB |
| 2KB | 8KB | 1000 | ~18MB |
| 2KB | 8KB | 100000 | ~1800MB |

From this table, it looks like we could increase only the socket buffer to 4KB (which would reduce the number of RTTs), and it would require 200MB more memory for 100k connections. I do not think this is a huge memory usage increase, while it would improve performance.
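
A back-of-the-envelope sketch of the arithmetic behind both tables (the 4× char-buffer factor comes from the per-connection list above; the numbers are estimates, not measurements):

```java
public class BufferMemoryEstimate {

    // Per-connection usage in KB: socket buffer + read buffer (net-buffer) + char buffer (4 × net-buffer).
    static int perConnectionKb(int netBufferKb, int socketBufferKb) {
        return socketBufferKb + netBufferKb + 4 * netBufferKb;
    }

    public static void main(String[] args) {
        // Current behaviour: socket buffer follows net-buffer size.
        System.out.println("2KB net-buffer:            " + perConnectionKb(2, 2) + " KB/conn"); // 12
        System.out.println("4KB net-buffer:            " + perConnectionKb(4, 4) + " KB/conn"); // 24
        // Proposed: only the socket buffer grows to 4KB.
        int proposed = perConnectionKb(2, 4);                                                   // 14
        System.out.println("2KB net-buffer + 4KB sock: " + proposed + " KB/conn");
        // ~1400MB for 100k connections vs. ~1200MB today, i.e. roughly 200MB more
        // (1000 KB ≈ 1 MB rounding, as in the table above).
        System.out.println("100k connections:          ~" + (proposed * 100_000 / 1000) + " MB");
    }
}
```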

Andrzej Wójcik (Tigase) commented 2 years ago

Moreover, as the socket buffer is managed by Java (or rather by the operating system on behalf of the JVM), it may allocate these buffers on demand instead of keeping them allocated all the time, which would cause an even smaller memory increase than I mentioned above.

Andrzej Wójcik (Tigase) commented 2 years ago

Also, it is worth mentioning that a blocked/filled socket buffer on the server side (used for reading data from the network) causes a slowdown in sending data to the client (at least it looks that way), which causes a buildup in the queue of packets waiting to be sent to the client, further increasing memory usage on the server side.

wojciech.kapcia@tigase.net commented 2 years ago

OK, thank you for the explanation. Indeed, it seems to make sense: it doesn't cause such a memory increase, and it seems that other parts of Tigase could be optimised here as well.

I'm pondering whether, with Loom, changing the handling of socket connections (and possibly not using NIO directly) would cause significant memory usage changes, but that's for the future.

Andrzej Wójcik (Tigase) commented 2 years ago

I've tested the changes made in server-1325, and using this feature to set the socket input buffer to 4k greatly improved MAM sync performance. Basically, it resulted in the same performance gain as setting the net-buffer size to 4K, but increased memory usage by only 17% instead of 100%.

I think we should consider setting socket-buffer-size to 4k by default, with a way to override it. The memory usage increase would not be large, but it would greatly improve I/O performance for C2S connections (BOSH, XMPP, WebSocket) for larger stanzas (bigger than 2k).

wojciech.kapcia@tigase.net commented 2 years ago

I think it makes sense to increase it to 4K (with an option to override it).
