Diagnosis capabilities (#6)

Due Date
2014-08-14

Andrzej Wójcik (Tigase) commented 1 decade ago

Could you explain what it should return?

The whole file?

Some parts of the file?

What about authentication to access these logs?

Artur Hefczyc commented 1 decade ago

First thing:

What is the purpose of all of this?

We need a more or less static page from the server which will show us basic info about the installed server: hostname, number of CPUs, RAM, uptime, cluster mode (true/false), list of cluster nodes, traffic, load, and basic configuration settings. This would make it easier for us to see whether the server is correctly installed and whether the OS-level settings are correct.
Some extended information which would let us tell whether the server works correctly or not, and whether it is overloaded and, if so, where.
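
To illustrate the kind of basic info listed above, here is a minimal sketch using only the standard JDK. The class name `BasicServerInfo` and the key names are hypothetical and not part of Tigase. Note that it reports JVM values (max heap, JVM uptime) rather than physical RAM or OS uptime, and it leaves out cluster and traffic data, which would have to come from the server itself.

```java
import java.lang.management.ManagementFactory;
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not a Tigase class): gathers basic host and JVM facts.
final class BasicServerInfo {

    // Returns the collected values in insertion order.
    static Map<String, String> collect() {
        Map<String, String> info = new LinkedHashMap<>();
        Runtime rt = Runtime.getRuntime();
        info.put("hostname", hostname());
        info.put("cpus", String.valueOf(rt.availableProcessors()));
        // Max heap available to the JVM, not the physical RAM of the machine.
        info.put("jvm-max-memory-mb", String.valueOf(rt.maxMemory() / (1024 * 1024)));
        // Uptime of the JVM process, not of the operating system.
        info.put("jvm-uptime-s",
                String.valueOf(ManagementFactory.getRuntimeMXBean().getUptime() / 1000));
        // May be negative on platforms where the load average is unavailable.
        info.put("system-load-avg",
                String.valueOf(ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage()));
        return info;
    }

    private static String hostname() {
        try {
            return InetAddress.getLocalHost().getHostName();
        } catch (UnknownHostException e) {
            return "unknown";
        }
    }

    // Renders the values as plain "key: value" lines.
    static String toText(Map<String, String> info) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : info.entrySet()) {
            sb.append(e.getKey()).append(": ").append(e.getValue()).append('\n');
        }
        return sb.toString();
    }
}
```

Calling `BasicServerInfo.toText(BasicServerInfo.collect())` would give a small plain-text report that could be saved or embedded in a page.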

I think that we should create some kind of a framework which would allow components to generate and publish some operational information such as:

Some basic information about the component (code version, main config settings)
Some basic stats (number of connections, list of cluster nodes, user sessions, queues, ...) - some essential information from each component
Also some kind of extended diagnostic information, e.g. number of errors; I do not know what else
Some kind of logs - but of course we do not want all logs. We have to think about what we need to expose here; probably over time we will add some logic to components to provide us with useful information which would help us diagnose the system, I mean troubleshoot. But these are not really regular logs, more like "we had this error 10k times in the last hour".
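
The "this error N times in the last hour" idea could be sketched as a sliding-window counter. This is only an illustration: `RecentErrorCounter` is a hypothetical name, timestamps are passed in explicitly so the example is deterministic, and it assumes events for a given key are recorded in non-decreasing time order.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical class: counts occurrences per error key within a sliding time window.
final class RecentErrorCounter {
    private final long windowMillis;
    private final Map<String, Deque<Long>> events = new HashMap<>();

    RecentErrorCounter(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    // Records one occurrence of the given error at the given time.
    synchronized void record(String errorKey, long nowMillis) {
        events.computeIfAbsent(errorKey, k -> new ArrayDeque<>()).addLast(nowMillis);
    }

    // Drops occurrences that fell out of the window, then returns how many remain.
    synchronized int countRecent(String errorKey, long nowMillis) {
        Deque<Long> times = events.get(errorKey);
        if (times == null) {
            return 0;
        }
        long cutoff = nowMillis - windowMillis;
        while (!times.isEmpty() && times.peekFirst() <= cutoff) {
            times.removeFirst();
        }
        return times.size();
    }
}
```

Storing every timestamp is simple but uses memory proportional to the number of recent errors; fixed time buckets would be a cheaper alternative for very noisy errors.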

So, my idea was to extend the current Tigase monitor and add some kind of "monitorable" interface which could then be implemented by a component. This interface would have an API to collect some data on a periodic basis, possibly with callback handlers to collect data in a more "event-based" way. Then the Monitor would provide the data in different ways:

It could periodically generate a static page with some basic info - REST API component could load the static page and show on request
It could provide some ad-hoc commands to receive more detailed information
It could actively send some stats over XMPP or email
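
A rough sketch of what such an interface might look like follows. All names here (`Monitorable`, `SimpleMonitor`, `DemoComponent`) are invented for illustration and are not the actual Tigase API; the sketch only shows the two collection styles described above, a periodic pull and an event callback.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Hypothetical interface a component could implement to be monitored.
interface Monitorable {
    String getMonitorName();

    // Pulled by the monitor on a periodic basis.
    Map<String, String> collectMonitorData();

    // Lets the monitor install a callback for event-based reports.
    void setEventHandler(Consumer<String> handler);
}

// Hypothetical monitor: polls registered components and records pushed events.
final class SimpleMonitor {
    private final List<Monitorable> components = new CopyOnWriteArrayList<>();
    private final List<String> events = new CopyOnWriteArrayList<>();

    void register(Monitorable component) {
        components.add(component);
        component.setEventHandler(msg -> events.add(component.getMonitorName() + ": " + msg));
    }

    // One polling pass; a real monitor would call this from a scheduled timer.
    Map<String, Map<String, String>> poll() {
        Map<String, Map<String, String>> snapshot = new LinkedHashMap<>();
        for (Monitorable c : components) {
            snapshot.put(c.getMonitorName(), c.collectMonitorData());
        }
        return snapshot;
    }

    List<String> getEvents() {
        return events;
    }
}

// Example component used only to demonstrate the interface.
final class DemoComponent implements Monitorable {
    private final String name;
    private int connections;
    private Consumer<String> handler = msg -> { };

    DemoComponent(String name) {
        this.name = name;
    }

    @Override
    public String getMonitorName() {
        return name;
    }

    @Override
    public Map<String, String> collectMonitorData() {
        Map<String, String> data = new LinkedHashMap<>();
        data.put("connections", String.valueOf(connections));
        return data;
    }

    @Override
    public void setEventHandler(Consumer<String> handler) {
        this.handler = handler;
    }

    void connectionOpened() {
        connections++;
    }

    // Pushes an event to the monitor immediately instead of waiting for a poll.
    void fail(String message) {
        handler.accept(message);
    }
}
```

The snapshot returned by `poll()` is the kind of data that could feed the static page, an ad-hoc command, or a notification.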

Wojciech Kapcia (Tigase) commented 1 decade ago

Artur Hefczyc wrote:

What is the purpose of all of this?

My impression was (from the earlier discussions) to have capabilities similar to the current monitoring (and maybe data similar to the error notification / resource consumption) but available over HTTP.

As for authentication - as with the remaining handlers - basic HTTP auth should suffice.

Andrzej Wójcik (Tigase) commented 1 decade ago

Maybe this page should be created dynamically: it would retrieve the list of nodes and present it, and for each node it would also retrieve stats (with detailed information for a particular component), so we might create some table with statistics or graphs?

If so, then we might extend the stats API to add new "important" things to report, so components could report issues to the stats API and we could retrieve them from this API?

Artur Hefczyc commented 1 decade ago

Wojciech Kapcia wrote:

Artur Hefczyc wrote:

What is the purpose of all of this?

My impression was (from the earlier discussions) to have capabilities similar to the current monitoring (and maybe data similar to the error notification / resource consumption) but available over HTTP.

Well, it could. Actually this is already available now. Using the REST API you can execute any command, including statistics retrieval.

However, I thought of something different actually. More or less a simple text or HTML file with some basic information which would allow you to tell if the system runs OK and if it is configured correctly.

However, what you say does make a lot of sense. But it should be implemented in such a way that when a user sends a correct REST API call, he gets in return a nice HTML page with all details presented clearly and easy to read. No additional calculation or interpretation needed. Then this would make a lot of sense and it would be very useful.

Actually, one step further and we could add a timer which would cause the browser to automatically refresh the page, so it would show updated data.

One step further and we can add more JS to draw some graphs, one step further and we can use Andrzej's web client to do all the work :-)

As for authentication - as with the remaining handlers - basic HTTP auth should suffice.

Yes, if we use the standard REST API we can make some data available through an ad-hoc command for admins only, or for authenticated users, etc.

Artur Hefczyc commented 1 decade ago

Andrzej Wójcik wrote:

Maybe this page should be created dynamically: it would retrieve the list of nodes and present it, and for each node it would also retrieve stats (with detailed information for a particular component), so we might create some table with statistics or graphs?

I was thinking of generating a static page in case something is not working right and you cannot connect or authenticate, or the REST API is not loaded, or something else prevents you from requesting dynamic content. Then you could simply load the file from the HDD into your web browser and see what it says. Or just send the file to someone who knows what it means and ask for help.

Then, if we create the static file anyway, then instead of requesting dynamic content we could load the static file.
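
As an illustration of the static-file idea, the sketch below renders a map of status values into a small HTML page and writes it to disk. `StaticStatusPage` is a hypothetical name, not existing Tigase code. The page includes a `<meta http-equiv="refresh">` tag as one possible way to get the automatic browser refresh suggested earlier in the thread, and the file is written via a temporary sibling file so a reader is less likely to see a half-written page.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Map;

// Hypothetical helper: renders status values as a static HTML page and saves it.
final class StaticStatusPage {

    // Builds a minimal HTML table; refreshSeconds controls the browser auto-reload.
    static String render(Map<String, String> info, int refreshSeconds) {
        StringBuilder sb = new StringBuilder();
        sb.append("<!DOCTYPE html>\n<html><head><meta charset=\"utf-8\">\n");
        sb.append("<meta http-equiv=\"refresh\" content=\"").append(refreshSeconds).append("\">\n");
        sb.append("<title>Server status</title></head><body><table>\n");
        for (Map.Entry<String, String> e : info.entrySet()) {
            sb.append("<tr><th>").append(escape(e.getKey())).append("</th><td>")
              .append(escape(e.getValue())).append("</td></tr>\n");
        }
        sb.append("</table></body></html>\n");
        return sb.toString();
    }

    // Writes to a temporary sibling file first, then moves it over the target.
    static void write(Path target, String html) throws IOException {
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        Files.write(tmp, html.getBytes(StandardCharsets.UTF_8));
        Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
    }

    // Escapes the characters that are significant in HTML text and attributes.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }
}
```

Because the result is a plain file, it can be opened directly from disk or sent to someone else even when the server's HTTP interface is not reachable.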

If so, then we might extend the stats API to add new "important" things to report, so components could report issues to the stats API and we could retrieve them from this API?

Yes, yes. Something like this. I am, however, not sure if the stats API is good for all of these. I do not want to overload it. Some of the stuff is just stats/metrics, but some is more diagnostic, error reporting and so on. So maybe a new monitoring API would be more suitable for this.

The monitoring API is supposed to be more intelligent. Instead of just collecting data, it can try to interpret it, make decisions, send notifications, etc. The basic Monitor would work on a local level and ClusteredMonitor would work on the cluster level.

Andrzej Wójcik (Tigase) commented 1 decade ago

I want to make sure I got it right: we want a static HTML file, created somewhere (i.e. in the logs directory), which will contain all information about potential issues, and which will be generated by a separate timer of the Monitor component that writes the results of its tests there?

Artur Hefczyc commented 1 decade ago

I would actually Reject this ticket. We have extended the idea presented here and described it better in tickets: #2086 and #2083. And this is our focus. Providing some logs over HTTP is just one possible function of the generic HTTP interface.

The way it should work, in my opinion, is this: if Tigase can discover some problems automatically, it should be able to collect data about the problem. Logs can be part of the whole data collection related to the problem. And this should actually be realized by some monitor ad-hoc command. The workflow could be like this:

Give me a list of all discovered issues
Give me all details for issue '1'
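
The two-step workflow above could be backed by something like the following in-memory registry. This is only a sketch under assumptions: `IssueRegistry` and its methods are hypothetical names, and how the ad-hoc command would call into it is left open.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical registry of automatically discovered issues.
final class IssueRegistry {

    private static final class Issue {
        final String summary;
        final List<String> details;

        Issue(String summary, List<String> details) {
            this.summary = summary;
            this.details = details;
        }
    }

    private final Map<Integer, Issue> issues = new LinkedHashMap<>();
    private int nextId = 1;

    // Stores a discovered issue with its collected data and returns its id.
    synchronized int report(String summary, String... details) {
        int id = nextId++;
        issues.put(id, new Issue(summary, new ArrayList<String>(Arrays.asList(details))));
        return id;
    }

    // Step 1: "give me a list of all discovered issues".
    synchronized List<String> listIssues() {
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, Issue> e : issues.entrySet()) {
            out.add(e.getKey() + ": " + e.getValue().summary);
        }
        return out;
    }

    // Step 2: "give me all details for issue N"; empty list if the id is unknown.
    synchronized List<String> detailsFor(int id) {
        Issue issue = issues.get(id);
        if (issue == null) {
            return Collections.<String>emptyList();
        }
        return new ArrayList<String>(issue.details);
    }
}
```

Log excerpts related to a problem would simply be additional detail entries attached when the issue is reported.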

What do you think? %wojtek you created the issue, so please comment.

Wojciech Kapcia (Tigase) commented 1 decade ago

Artur Hefczyc wrote:

I would actually Reject this ticket. We have extended the idea presented here and described it better in tickets: #2086 and #2083.

I concur. We now have plans for a better and more general solution.

Andrzej Wójcik (Tigase) commented 1 decade ago

I disagree with this, as I mentioned in #2103. In this issue we would implement static file generation, and in #2083 we would serve it using HTTP. If we implement the more advanced feature in #2103, then I cannot guarantee that we would always be able to access this data: if permissions based on user credentials are used, then without a working UserRepository instance we would not be able to get this data!

I think we should still create a static file, in case the whole HTTP API fails to work, as a way of getting information about CRITICAL issues.

I would rather vote for keeping this one and REJECTING #2103, as #2103 would duplicate features from #2086 but would not duplicate this issue.

Artur Hefczyc commented 1 decade ago

Andrzej:

This ticket's description is: "return selected logs related to tigase operationing".

This is not really a priority for us right now and, to be honest, I do not think we really want to send Tigase logs via the HTTP API component. It is not really useful. At least I do not see a use case making it worthwhile to implement.

It looks like we may have too many tasks describing the same or similar thing which causes confusion. So let me explain.

What we really need is (what was discussed already) this:

Serving a simple static HTML page with the server status information - this is task #2103 and #2083; we discussed all the details in comments to the ticket recently. There are 2 tasks for the same thing, because one is in the Private project and one is in the HTTP API component project.
Dynamically generated information, more detailed, about the current system state, and maybe with buttons to execute some admin commands via HTTP, retrieve some monitoring information, statistics, etc. - task #2095

Type	New Feature
Priority	Normal
Assignee	Andrzej Wójcik (Tigase)
RedmineID	2013