Projects tigase _server server-core Issues #417
Tigase Becomes Unresponsive (#417)
Closed
Subir Jolly opened 10 years ago
Due Date
2015-02-05

At random times, Tigase has started becoming unresponsive and all the requests start queuing up. We have a clustered setup of 4 servers. When Tigase starts becoming unresponsive, the queues for Session Manager start rising and in order for lowering down the queues, we have to get our tigase servers out of rotation. The other way is to restart Tigase servers. We have just a little over than 10K users on this system. We have overridden Auth component to do custom auth against our system.

We have noticed that when Tigase starts becoming unresponsive and Session Manager queues start growing, there aren't any DB reads. Not sure if this is okay.

What do you suggest we do in this scenario?

Server configurations

Linux CentOS

80 GB RAM

24 Core Processor

Packages

java-1.7.0-oracle-1.7.0.55

mysql-connector-java-5.1.12

tigase-extras-1.1.0

tigase-issue #5.2.1

tigase-xmltools-3.4.4

tigase-utils-3.4.2

1-20-15-stats.txt

wojciech.kapcia@tigase.net commented 10 years ago

Could you check Tigase Statistics from the session manager component (either via XMPP or JMX)? Check:

  • time of sess-man/Average processing time on last 100 runs [ms]

  • processing times of plugins in sess-man/Processor: * which could indicate if any of the plugins is culprit.

Sam Wright commented 10 years ago

@sess-man/Average processing time on last 100 runs [ms]:--0

sess-man/Processor: session-open:--, Queue: 0, AvTime: 1, Runs: 467704, Lost: 0

sess-man/Processor: http://jabber.org/protocol/stats:--, Queue: 0, AvTime: 0, Runs: 0, Lost: 0

sess-man/Processor: http://jabber.org/protocol/commands:--, Queue: 0, AvTime: 0, Runs: 0, Lost: 0

sess-man/Processor: jabber:iq:version:--, Queue: 0, AvTime: 0, Runs: 0, Lost: 0

sess-man/Processor: jabber:iq:roster:--, Queue: 0, AvTime: 4, Runs: 434876, Lost: 0

sess-man/Processor: starttls:--, Queue: 0, AvTime: 0, Runs: 0, Lost: 0

sess-man/Processor: presence:--, Queue: 0, AvTime: 2, Runs: 2232561, Lost: 0

sess-man/Processor: default-handler:--, Queue: 0, AvTime: 1, Runs: 30721573, Lost: 0

sess-man/Processor: urn:ietf:params:xml:ns:xmpp-sasl:--, Queue: 10949, AvTime: 16775, Runs: 575521, Lost: 0

sess-man/Processor: urn:xmpp:ping:--, Queue: 0, AvTime: 0, Runs: 0, Lost: 0

sess-man/Processor: urn:ietf:params:xml:ns:xmpp-session:--, Queue: 0, AvTime: 1, Runs: 407769, Lost: 0

sess-man/Processor: session-close:--, Queue: 0, AvTime: 1, Runs: 460723, Lost: 0

sess-man/Processor: amp:--, Queue: 0, AvTime: 1, Runs: 2272762, Lost: 0

sess-man/Processor: disco:--, Queue: 0, AvTime: 0, Runs: 0, Lost: 0

sess-man/Processor: zlib:--, Queue: 0, AvTime: 0, Runs: 0, Lost: 0

sess-man/Processor: jabber:iq:privacy:--, Queue: 0, AvTime: 0, Runs: 0, Lost: 0

sess-man/Processor: urn:ietf:params:xml:ns:xmpp-bind:--, Queue: 0, AvTime: 1, Runs: 408281, Lost: 0

sess-man/Processor: message-carbons:--, Queue: 0, AvTime: 1, Runs: 280698, Lost: 0

sess-man/Processor: jabber:iq:private:--, Queue: 0, AvTime: 0, Runs: 0, Lost: 0

sess-man/Processor: jabber:iq:auth:--, Queue: 0, AvTime: 0, Runs: 0, Lost: 0

total/Total queues wait:--10949@

"sess-man/Processor: urn:ietf:params:xml:ns:xmpp-sasl" Seems to be the one with queues. I've attached the entire dump of JMX statistics.

wojciech.kapcia@tigase.net commented 10 years ago

Subir Jolly wrote:

We have overridden Auth component to do custom auth against our system.

Sam Wright wrote:

sess-man/Processor: urn:ietf:params:xml:ns:xmpp-sasl:--, Queue: 10949, AvTime: 16775, Runs: 575521, Lost: 0

"sess-man/Processor: urn:ietf:params:xml:ns:xmpp-sasl" Seems to be the one with queues. I've attached the entire dump of JMX statistics.

SASL plugin uses AuthRepository to perform authentication. What changes have you made exactly? Can you verify, that your implementation of AuthRepository/processing doesn't cause the slowdown in authentication that impacts SASL plugin?

Subir Jolly commented 10 years ago

Not sure if it matters but in our implementation of AuthRepository, we use "synchronized" keyword for auths. Do you think that making auth synchronous could cause this issue?

Artur Hefczyc commented 10 years ago

This is very likely. Unresponsive means the server got stuck on something, usually threads-lock on some resource. This maybe database access. This may often happen if you use synchronized.

Subir Jolly commented 10 years ago

Thanks,

I am now removing the synchronous login.

I would also like to know the exact meaning of these stats. Would it be possible to give me some brief description for each of these?

sess-man/Authentication timouts

sess-man/Closed user connections

sess-man/Open user connections

sess-man/Total user connections

sess-man/Open user sessions

sess-man/Total user sessions

Thanks,

Subir

Artur Hefczyc commented 10 years ago

Subir Jolly wrote:

Thanks,

I am now removing the synchronous login.

I would also like to know the exact meaning of these stats. Would it be possible to give me some brief description for each of these?

sess-man/Authentication timouts

Number of user logins failed due to authentication timeout. That is user opened a connection but not completed authentication within required time.

sess-man/Closed user connections

Total number of user's connection closed. During normal operation of the server users connect and disconnect all the time. So this number indicates how many connections were closed since the server startup time.

sess-man/Open user connections

A number of currently opened user connections. Please note a user may have more than one connection opened, for example, one from mobile device, another from desktop client, another from a web client. These are counted here.

sess-man/Total user connections

A total number of user connections opened since the server startup time. Some of these connections can be already closed.

sess-man/Open user sessions

A number of user currently opened users' sessions. If a user is connected/logged in from a multiple devices it still counts as a single user session. In other words this is a number of distinct users online at the moment.

sess-man/Total user sessions

A total number of users sessions opened since the server startup time. In other words number of distinct user sessions which were opened and some of them closed since the server startup time. If a user has many connections active at the same time, it still counts as a single session. However, if a user opened multiple connections, closed all of them and then opened connections again, this counts as 2 separate sessions.

Subir Jolly commented 10 years ago

Thank you so much for all this information. It cleared a lot of things for us.

I have made the changes to auth. After that we tested it in our dev environment and it seemed to be holding up fine. We have deployed it to live as well and waiting for seeing the behavior now.

Thanks again for all your help. I really appreciate your support and efforts.

Subir

issue 1 of 1
Type
Bug
Priority
Major
Assignee
RedmineID
2624
Issue Votes (0)
Watchers (0)
Reference
tigase/_server/server-core#417
Please wait...
Page is in error, reload to recover