Cluster Node Shutting Down Unexpectedly (#336)

Eric Dziewa opened 1 decade ago

Over the weekend the Tigase software on blue exited. Nothing in the logfiles seemed telling to me, so I restarted the software. This morning the same thing happened on green.

We are using Tigase 5.3.0-SNAPSHOT-b3568.

Attaching the last hour's worth of logfiles. Maybe you, the developers, will see something that I don't.

Activities

Artur Hefczyc commented 1 decade ago

There were many errors in the logs that we have to go through, as they may point to problems in the code in a few places, but nothing critical that could cause the application to exit. However, there appear to be problems at the network level.

There are several errors about broken network connectivity and "Network is unreachable". It appears that the VM lost its network connection not long before the Tigase failure.

Eric, this is something for you to investigate.

Do we have the monitoring already? I mean statistics dumping somewhere? It would be very useful to discover what was going on with the blue node.

Eric Dziewa commented 1 decade ago

The logfile had been rotated out by the time I checked. Looking deeper I found the kernel killed the java process:

Jun 17 06:52:17 green kernel: [5421285.980465] Out of memory: Kill process 4116 (java) score 939 or sacrifice child
Jun 17 06:52:17 green kernel: [5421285.980803] Killed process 4116 (java) total-vm:9852948kB, anon-rss:7648336kB, file-rss:0kB

Jun 14 06:39:18 blue kernel: [5161317.353488] Out of memory: Kill process 14728 (java) score 935 or sacrifice child
Jun 14 06:39:18 blue kernel: [5161317.353957] Killed process 14728 (java) total-vm:9785112kB, anon-rss:7619396kB, file-rss:0kB
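
For anyone scanning rotated kernel logs for the same symptom, here is a minimal sketch of how the "Killed process" lines above could be picked out and the memory figures converted to something readable. The function names (`parse_oom_kill`, `kb_to_gib`) are made up for illustration, and the pattern only covers the line format shown in this ticket; other kernel versions may print different fields.

```python
import re

# Matches the kernel's "Killed process" line in the format quoted in this ticket.
KILLED_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB"
)


def parse_oom_kill(line):
    """Return a dict describing an OOM kill, or None if the line is not one."""
    match = KILLED_RE.search(line)
    if match is None:
        return None
    return {
        "pid": int(match["pid"]),
        "name": match["name"],
        "total_vm_kb": int(match["total_vm"]),
        "anon_rss_kb": int(match["anon_rss"]),
    }


def kb_to_gib(kb):
    """Convert kB, as printed by the kernel (1 kB = 1024 bytes), to GiB."""
    return kb / (1024 * 1024)


if __name__ == "__main__":
    sample = (
        "Jun 17 06:52:17 green kernel: [5421285.980803] Killed process 4116 (java) "
        "total-vm:9852948kB, anon-rss:7648336kB, file-rss:0kB"
    )
    info = parse_oom_kill(sample)
    print(f"{info['name']} (pid {info['pid']}): "
          f"{kb_to_gib(info['anon_rss_kb']):.2f} GiB resident when killed")
```

Read this way, the java process on green held roughly 7.3 GiB of resident memory when the kernel killed it, and the blue node was in a very similar state three days earlier.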

There are statistics being uploaded. But I think you're asking about the software being used to record the load tests, aren't you? If so, then no, we are not monitoring the sure.im servers. I don't think that would be a good idea in its current form. It's fine to record then playback several hours but constant monitoring, then to generate images, and playback, would need to be done differently.

It is possible something was wrong with the network but I can't find any other indicator. Would lack of network cause the OOM, or vice-versa?

I will keep a connection open to these two machines watching for network issues. I need a little more proof before I submit a ticket at SoYouStart.
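
As an alternative to holding a session open by hand, the watching could be scripted. The following is only a rough sketch: the host, port and interval are placeholders to be supplied by whoever runs it, `is_reachable` and `watch` are hypothetical helper names, and a successful TCP connect is a fairly coarse test that will not catch every kind of network trouble.

```python
import socket
import time
from datetime import datetime


def is_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def watch(host, port, interval=60, checks=None):
    """Probe host:port every `interval` seconds.

    Prints a timestamped line for each failed probe and returns the number
    of failures. With checks=None the loop runs until interrupted.
    """
    done = 0
    failures = 0
    while checks is None or done < checks:
        if not is_reachable(host, port):
            failures += 1
            print(f"{datetime.now().isoformat()} {host}:{port} unreachable")
        done += 1
        if checks is None or done < checks:
            time.sleep(interval)
    return failures
```

Running `watch` from a machine outside the hosting network, with the output redirected to a file, would give timestamps that can be lined up against the kernel log entries.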

Artur Hefczyc commented 1 decade ago

Eric Dziewa wrote:

> The logfile had been rotated out by the time I checked. Looking deeper I found the kernel killed the java process:

OK, that explains it.

> There are statistics being uploaded. But I think you're asking about the software being used to record the load tests, aren't you?

No, I was just asking about the statistics archive, as we could look at it after the fact and see what was going on before the failure. I guess if we looked at the stats we could see the memory usage increase. Then we could look deeper into what caused it: an increase in traffic of some kind, or network problems. This could help us improve Tigase so it handles the problem better.

> If so, then no, we are not monitoring the sure.im servers. I don't think that would be a good idea in its current form. It's fine to record then playback several hours but constant monitoring, then to generate images, and playback, would need to be done differently.

Yes, this should be done differently. But there are two separate things we could/should do:

1. Have some more or less intelligent logic monitoring the installation and sending us notifications in case it detects something wrong.

2. Have a nice visualization of the system, load, traffic, etc., which we could put on our main website for people to see.

Both of the above could use the same data source. Tigase can store its metrics in a database or in text files, so the visualization logic can pull them from there and draw charts and some additional information (something like Tigase Monitor). The charts can actually be generated by a background job (cron), and the website can load plain images to keep things simple.

The monitoring logic can also be something running in the background from cron, checking on the system and sending us alerts via email, SMS, XMPP or something else.
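
A cron-driven check along those lines could look something like the sketch below. It is only an illustration under a few assumptions: it runs on Linux, it is given the PID of the java process and a memory limit in kB on the command line, and alert delivery is left to cron itself, which mails whatever a job prints when `MAILTO` is set. The helper names `read_rss_kb` and `check_rss` are invented for this example and are not part of Tigase.

```python
import sys


def read_rss_kb(status_text):
    """Extract VmRSS (resident memory, in kB) from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None


def check_rss(rss_kb, limit_kb):
    """Return an alert message if resident memory exceeds the limit, else None."""
    if rss_kb is not None and rss_kb > limit_kb:
        return f"ALERT: resident memory {rss_kb} kB exceeds limit {limit_kb} kB"
    return None


def main(argv):
    # Usage (hypothetical script name): check_rss.py <pid> <limit_kb>
    pid, limit_kb = argv[1], int(argv[2])
    with open(f"/proc/{pid}/status") as status_file:
        rss_kb = read_rss_kb(status_file.read())
    message = check_rss(rss_kb, limit_kb)
    if message:
        print(message)  # cron mails this output if MAILTO is configured
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv))
```

Setting the limit somewhat below the resident size at which the kernel killed the process on blue and green would give a warning before the OOM killer steps in. Sending the alert over SMS or XMPP instead would need a separate delivery step that is not shown here.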

> It is possible something was wrong with the network but I can't find any other indicator. Would lack of network cause the OOM, or vice-versa?

Everything is possible. We cannot rule anything out without investigating it.

> I will keep a connection open to these two machines watching for network issues. I need a little more proof before I submit a ticket at SoYouStart.

OK

Type	Bug
Priority	Normal
Assignee	Artur Hefczyc
RedmineID	2008

Reference

tigase/_server/server-core#336