Added interface for monitoring of Tigase XMPP Server installations (#42)
Closed
Andrzej Wójcik (Tigase) opened 1 decade ago
Due Date
2014-12-28

Add a new feature (maybe as a tab) that presents server statistics and is shown or hidden depending on the permissions of the authenticated user. It should be similar to the data presented in #2095, but it should also aggregate data over time, so that it is possible to show how values change over time.

For now it should be implemented using ad-hoc commands and polling. Once the simplified PubSub API for components (#478) is ready and data can be retrieved through it, we could switch to push for retrieving updates of statistics and other monitored data.
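As a rough illustration of the polling approach, the sketch below builds an XEP-0050 ad-hoc command request that a client could send on each polling cycle. The JID `stats@example.com`, the command node `stats`, and the helper name `build_stats_request` are assumptions made for this example only; the actual address and node depend on the server configuration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by XEP-0050 (Ad-Hoc Commands).
COMMANDS_NS = "http://jabber.org/protocol/commands"


def build_stats_request(to_jid, node, request_id):
    """Serialize an ad-hoc command 'execute' request as an <iq/> stanza.

    to_jid and node are supplied by the caller; the values used in the
    usage example below are hypothetical.
    """
    iq = ET.Element("iq", {"type": "set", "to": to_jid, "id": request_id})
    # The namespace is written as a plain xmlns attribute so that the
    # serialized stanza has no generated prefix on <command/>.
    ET.SubElement(
        iq,
        "command",
        {"xmlns": COMMANDS_NS, "node": node, "action": "execute"},
    )
    return ET.tostring(iq, encoding="unicode")


# Hypothetical usage: one request per polling cycle.
stanza = build_stats_request("stats@example.com", "stats", "poll-1")
```

With push over PubSub the client would no longer need to send such a request periodically; it would only subscribe once and handle incoming notifications.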

We should take into account that a Tigase XMPP Server installation might not be a single instance but a cluster, and we should present aggregated data from all nodes if possible.
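One possible way to aggregate per-node data is sketched below. It assumes that statistics have already been collected into a mapping from node hostname to a dictionary of numeric metrics; the function name and the metric names are hypothetical. Summing is only meaningful for counter-like metrics (connections, packets); values such as CPU usage would need averaging or per-node display instead.

```python
from collections import defaultdict


def aggregate_cluster_stats(per_node):
    """Sum each metric across all nodes.

    per_node maps a node hostname to a dict of metric name -> number.
    Metrics missing on some nodes are summed over the nodes that have them.
    """
    totals = defaultdict(float)
    for node_stats in per_node.values():
        for metric, value in node_stats.items():
            totals[metric] += value
    return dict(totals)


# Hypothetical sample data for a two-node cluster.
sample = {
    "node1.example.com": {"connections": 120, "packets": 4000},
    "node2.example.com": {"connections": 80, "packets": 2500},
}
cluster_totals = aggregate_cluster_stats(sample)
```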

Web_Application_Starter_Project_and_Support__2180__Artur_Hefczyc_2014_-public_forums_time_tracker-Tigase_Private-_Tigase_Projects_and_int_muc_tigase_org.png cluster general metrics.pdf

Andrzej Wójcik (Tigase) commented 10 years ago

I added support for collecting statistics from a single instance and presenting them on charts.

I was thinking about collecting statistics from each cluster node, but do we have any way for an XMPP client to retrieve the list of cluster nodes, so that I could send ad-hoc commands to the stats component on each node and collect the data?

Artur Hefczyc commented 10 years ago

I was thinking about collecting statistics from each cluster node, but do we have any way for an XMPP client to retrieve the list of cluster nodes, so that I could send ad-hoc commands to the stats component on each node and collect the data?

We will have one. Providing the admin with a list of cluster nodes will be one of the Monitor component's tasks.

Artur Hefczyc commented 10 years ago

Actually, I was thinking of an admin ad-hoc command in the Monitor component to provide the admin with a list of cluster nodes. However, we do have a way right now to retrieve a list of connected cluster nodes. If you run service discovery from an admin account, you get a list of all server components. One of them is "Cluster connection manager", which shows a list of subitems. Each subitem corresponds to one cluster node, and the item's "Node" field is the cluster node hostname.
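A minimal sketch of reading such a list from a service discovery (XEP-0030) items result follows. It assumes that the "Node" field mentioned above corresponds to the `node` attribute of each `<item/>`; the sample payload, JID, and hostnames are invented for illustration and are not taken from a real server response.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace defined by XEP-0030 for item discovery.
DISCO_ITEMS_NS = "http://jabber.org/protocol/disco#items"


def cluster_nodes_from_items(query_xml: str) -> List[str]:
    """Return the 'node' attribute of every <item/> in a disco#items query."""
    query = ET.fromstring(query_xml)
    items = query.findall("{%s}item" % DISCO_ITEMS_NS)
    # Skip items that carry no node attribute.
    return [item.get("node") for item in items if item.get("node")]


# Hypothetical response payload (assumption, not real server output).
sample_query = (
    '<query xmlns="http://jabber.org/protocol/disco#items">'
    '<item jid="cl-comp@example.com" node="node2.example.com"/>'
    '<item jid="cl-comp@example.com" node="node3.example.com"/>'
    "</query>"
)
remote_nodes = cluster_nodes_from_items(sample_query)
```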

Andrzej Wójcik (Tigase) commented 10 years ago

I already considered using discovery and the "Cluster connection manager" subitems, but they contain only the list of "other" nodes (the local node is not listed) and I need the full list of nodes, so this is not usable. I also considered an additional ad-hoc command for this, but I would like to make this command embedded (i.e. part of the Monitor component), so that it would be part of the Tigase XMPP Server jar and would not require additional scripts containing this command.

Artur Hefczyc commented 10 years ago

Andrzej Wójcik wrote:

I already considered using discovery and the "Cluster connection manager" subitems, but they contain only the list of "other" nodes (the local node is not listed) and I need the full list of nodes, so this is not usable.

Hm, OK. I thought you could get the local node information somehow or guess it from the general service-discovery information. I think we could actually add the local node to the list of cluster nodes provided by "Cluster connection manager", as this is only informative data which does not affect anything else (for now). But I am hesitant to do this, because the "Connection manager" name suggests that the component provides information related to "connections" only.

I also considered an additional ad-hoc command for this, but I would like to make this command embedded (i.e. part of the Monitor component), so that it would be part of the Tigase XMPP Server jar and would not require additional scripts containing this command.

Yes, the new Monitoring framework Bartek is working on will allow for creating commands either in Java or as scripts.

However, maybe it makes sense that Monitor as a component provides some useful information through service discovery as well? Maybe a list of all cluster nodes (remote and local) would be it?

Andrzej Wójcik (Tigase) commented 10 years ago

Yes, I could use a list of nodes from the Monitor component, whether I retrieve it using an ad-hoc command or through service discovery. If it makes sense to add it to service discovery then that would be great, but I suppose for service discovery we should create a proper "tree" of information, so that a subnode or nodes on a similar level of the tree would present similar data.

Artur Hefczyc commented 10 years ago

To be honest, it does not feel right to put a list of cluster nodes into the service discovery information for Monitor. It just does not fit there and it would not be natural. I think a much better place would be an admin ad-hoc command in either ClusterController or Monitor. You can actually add such a command to either of these components if you like, at least as a temporary solution, so that we would have a fully working interface for monitoring Tigase installations.

Andrzej Wójcik (Tigase) commented 10 years ago

I added gathering and presentation of statistics for each component of a running Tigase XMPP Server. It also retrieves statistics from every cluster node and presents them on a graph, so it is possible to compare traffic on each node. To make it work, the newly added script that retrieves the list of cluster nodes must be deployed on each cluster node (I deployed this script on every node of the sure.im installation).

I've just committed the changes, so a working version should be deployed on http://beta.sure.im/ tomorrow.

Artur Hefczyc commented 10 years ago

It looks very good. There are 2 things I noticed:

  1. On the Management screen buttons "Cancel" and "Confirm" do not fit well on the page. They are half hidden. I think these should be moved up a little bit. Screenshot attached.

  2. The statistics charts display is very nice, but the charts only show values above zero for the last minute/second; all older values are zero. This is fine since the tool does not pull any historic data. But I think we would need some kind of auto-refresh mechanism so that the tool would automatically pull statistics at regular intervals and draw charts. So if I leave the web client with the stats display open for a long time, I would end up with nice charts.

It would also be extremely useful to have one more item on the list, a kind of summary with the main statistics for the server. I am thinking of the statistics which are provided by default if you double-click on the statistics top-level component with the lowest stats level. It loads a set of basic and most important metrics of the server. If we had a screen with this summary and a few charts for the metrics for which charts make sense, this would be a very useful tool (with auto-refresh, of course).

Andrzej Wójcik (Tigase) commented 10 years ago

I will move the Cancel and Confirm buttons up a little bit.

Auto-refresh is already working. The web client retrieves statistics once every minute and keeps the results of the last 20 requests, so after 20 minutes we would get a full chart with 20 points, which I think would make a nice chart.
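The rolling window described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the web client's actual code: the `StatsHistory` class and its method names are hypothetical, and only the limit of 20 points comes from the comment above.

```python
from collections import deque


class StatsHistory:
    """Keep only the most recent samples of one metric for charting."""

    def __init__(self, max_points=20):
        # A deque with maxlen drops the oldest sample automatically
        # once the limit is reached.
        self._points = deque(maxlen=max_points)

    def record(self, timestamp, value):
        """Store one polled sample, e.g. once per minute."""
        self._points.append((timestamp, value))

    def series(self):
        """Return the samples in chronological order, oldest first."""
        return list(self._points)


# Simulate 25 polls, one per minute; only the last 20 remain.
history = StatsHistory()
for minute in range(25):
    history.record(minute, minute * 10)
points = history.series()
```

With one poll per minute, the chart is full after 20 minutes and then slides forward by one point on each refresh.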

%kobit could you sketch how these basic statistics should look, and list which metrics you would like to have there?

Artur Hefczyc commented 10 years ago

I attached a PDF file with a general idea of the main metrics page. This is what I would like to see when I click on the Statistics tab without any component selected. This is of course a general idea; if you think of any other metrics useful from an admin point of view, feel free to add them. %eric can you think of anything else useful from an admin point of view?

Andrzej Wójcik (Tigase) commented 10 years ago

I added graphs for the main metrics as presented in the PDF and moved the buttons up a little bit.

A version with this feature and the fix should be deployed by Jenkins at http://beta.sure.im/ tomorrow.

Artur Hefczyc commented 10 years ago

Everything looks very good. One issue I can see is that CPU usage is shown incorrectly. Current CPU usage is close to 0% on all machines, but the web tool shows between 30% and 50% depending on the server. So there must be a mistake somewhere.

Another point worth noting is that we lost a significant number of users on the service....

Andrzej Wójcik (Tigase) commented 10 years ago

Fixed the issue with CPU usage.

As for the loss of users, I suppose it is related to the change of servers and IP addresses, so some vhost owners probably moved elsewhere or forgot to update their DNS entries.

Artur Hefczyc commented 10 years ago

Yes, it looks good now, thank you.

Type
New Feature
Priority
Normal
Assignee
RedmineID
2097
Issue Votes (0)
Watchers (0)
Reference
tigase/_clients/sureim#42