Eric Dziewa opened 8 years ago
|
|
Also, would it be possible to bind only to the tigase.org interface instead of all interfaces, so that the URL cannot be accessed via projects.tigase.org?
|
|
I see entries for mail.tigase.org:8080 too.
|
|
I looked into this issue, and it looks like other people have also had problems with googlebot and CLOSE_WAIT connections. They usually tried to fix it with robots.txt: for some it worked, while others banned googlebot IP addresses because robots.txt did not help. I also see that some other connections (not from googlebot) created CLOSE_WAIT connections. If they are bots, then maybe we should disallow every bot in robots.txt by default once we add support for this file? Right now we do not support serving a robots.txt file; this is something that would need to be added. Also, we bind to all IP addresses on every interface, and for now there is no way to change that. On our installation at t2.tigase.org we use the embedded HTTP server that comes with the Java JDK. We also support an alternative HTTP server (Jetty), which in my opinion is a more mature project than the HTTP server embedded in the JDK and may give us better performance and better handling of issues; for example, maybe it will be able to deal with the CLOSE_WAIT issue. I currently have my own installation running with Jetty as the HTTP server, and I do not have any connections to port 8080 in the CLOSE_WAIT state. For now, I think we could try switching to Jetty as the HTTP server for our HTTP API at t2.tigase.org and check whether it helps. In the meantime, I can work on tasks:
%kobit What do you think about this change? I know you wanted to use the HTTP server embedded in the JDK; however, I'm not sure we should use it, or even promote this HTTP server implementation, if the CLOSE_WAIT issue goes away when we use Jetty. Do you agree with this configuration change on t2.tigase.org? |
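The two gaps mentioned here (binding to every address, and no robots.txt support) can be illustrated with the JDK's `com.sun.net.httpserver.HttpServer`. This is a minimal sketch, not Tigase's HTTP API code: the class name `RobotsAwareServer` and the deny-all robots.txt policy are assumptions made for illustration.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Sketch only (hypothetical class, not Tigase code): the JDK's embedded HTTP
// server bound to a single address instead of the wildcard, serving robots.txt.
final class RobotsAwareServer {
    // Assumed policy: ask every crawler to stay away from the whole API.
    static final String ROBOTS_TXT = "User-agent: *\nDisallow: /\n";

    static HttpServer start(InetAddress bindAddress, int port) throws IOException {
        // Only connections to this address/port pair are accepted; hostnames that
        // resolve to other local IP addresses never reach this listener.
        HttpServer server = HttpServer.create(new InetSocketAddress(bindAddress, port), 0);
        server.createContext("/robots.txt", exchange -> {
            byte[] body = ROBOTS_TXT.getBytes(StandardCharsets.US_ASCII);
            exchange.getResponseHeaders().set("Content-Type", "text/plain; charset=us-ascii");
            exchange.sendResponseHeaders(200, body.length);
            // Closing the response body stream also completes the exchange.
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }
}
```

In a deployment, `bindAddress` would be the address tigase.org resolves to; this only keeps projects.tigase.org out if that name resolves to a different local IP address.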
|
I am in favor of using as few external dependencies as possible, so using the HTTP server embedded in the JDK is the preferred way. However, I am OK with deploying Jetty as a workaround for now while we work on a solution for the embedded JDK server. The CLOSE_WAIT connections are a problem not only for the HTTP server (REST API) but also for BOSH connections and maybe WebSockets. I remember reading about handling CLOSE_WAIT connections through TCP/IP connection settings: it is possible to change the CLOSE_WAIT timeout so that such connections are closed quickly and do not consume resources. I do not remember the details, though; it was quite a few years ago, when I was working with Nasza Klasa on integrating Tigase with their system. For sure, the CLOSE_WAIT timeout can be changed at the OS level. It might also be possible to change it at the JVM level.
|
|
I've lowered the timeout to 5 minutes in sysctl.conf.
I've also disallowed port 8080 connections to the projects/mail IPs.
|
|
I looked into t2 and I see that we still have old CLOSE_WAIT connections active. From what I see, I think these are still the same CLOSE_WAIT connections, so changing the setting at the OS level does not appear to be a working solution. I tried to replicate this issue and was finally able to create a CLOSE_WAIT connection myself. CLOSE_WAIT connections are connections closed by the client whose close is not handled by the server. In our case, it happens with the HTTP server embedded in the JDK when an unhandled exception occurs and asynchronous processing is used. I will probably work on a fix tomorrow, and I will also check whether Jetty is vulnerable to the same issue. If this fixes our issue, I think support for selecting the IP address the HTTP server binds to, and support for robots.txt, can be moved to a separate task, i.e. to be done in 7.2.0. %kobit %eric Do you agree? |
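The failure mode described here (an unhandled exception leaving the exchange open) can be guarded against by making sure the exchange is always closed. A hedged sketch for synchronous handlers follows; `AlwaysCloseHandler` is a hypothetical name, and this is not the fix that was committed to Tigase. With asynchronous processing, the same try/finally would have to run on whichever thread finally completes the request.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpHandler;
import java.io.IOException;

// Hypothetical wrapper: whatever the wrapped handler does, the exchange (and
// with it the server's end of the TCP connection) is closed afterwards.
final class AlwaysCloseHandler implements HttpHandler {
    private final HttpHandler delegate;

    AlwaysCloseHandler(HttpHandler delegate) {
        this.delegate = delegate;
    }

    @Override
    public void handle(HttpExchange exchange) throws IOException {
        try {
            delegate.handle(exchange);
        } catch (IOException | RuntimeException e) {
            // Real code would log e here.
            try {
                // Tell the client the request failed (no response body).
                exchange.sendResponseHeaders(500, -1);
            } catch (IOException headersAlreadySent) {
                // The response had already started; closing below is all we can do.
            }
        } finally {
            // Closing twice is harmless; never closing is what leaks connections.
            exchange.close();
        }
    }
}
```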
|
%eric I am pretty sure keepalive_time does not apply to CLOSE_WAIT and has no effect on it. Unfortunately, I do not remember which kernel settings affected this port state. However, I am thinking that maybe we could add iptables rules that reject connections from googlebot to some of our IP addresses? %andrzej.wojcik I agree with your plan. I have spent very little time reading about the problem recently, so you probably know more about the issue and possible ways to solve it. From what I have read, it looks like this could/should be solved by the application explicitly closing the socket. So, if Tigase could detect an idle connection, it could explicitly close it, resolving the issue. We have a good mechanism for detecting broken XMPP connections, but this mechanism does not monitor HTTP connections. So I am thinking: maybe having Tigase close idle HTTP connections after some timeout could be a solution? Another possible approach would be a configurable list of IPs within the HTTP server from which we refuse connections, explicitly closing any such connection that is open. |
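The idea of closing idle HTTP connections after a timeout can be sketched as a small registry that records the last activity per connection and closes anything idle for too long. `IdleConnectionReaper` and its methods are hypothetical, and the clock is injected only so the behavior can be tested deterministically.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// Hypothetical helper: tracks when each open connection was last active and
// explicitly closes the ones that have been idle for at least the timeout.
final class IdleConnectionReaper {
    private final Map<Closeable, Long> lastActivity = new ConcurrentHashMap<>();
    private final long idleTimeoutMillis;
    private final LongSupplier clock; // e.g. System::currentTimeMillis in production

    IdleConnectionReaper(long idleTimeoutMillis, LongSupplier clock) {
        this.idleTimeoutMillis = idleTimeoutMillis;
        this.clock = clock;
    }

    // Record activity (a read or write) on a connection.
    void touch(Closeable connection) {
        lastActivity.put(connection, clock.getAsLong());
    }

    // Stop tracking a connection that was closed normally.
    void forget(Closeable connection) {
        lastActivity.remove(connection);
    }

    // Close every connection idle for >= the timeout; returns how many were closed.
    int reap() {
        long now = clock.getAsLong();
        int closed = 0;
        for (Map.Entry<Closeable, Long> entry : lastActivity.entrySet()) {
            if (now - entry.getValue() >= idleTimeoutMillis) {
                Closeable connection = entry.getKey();
                // Only close if no touch() happened since we read the timestamp.
                if (lastActivity.remove(connection, entry.getValue())) {
                    try {
                        connection.close();
                    } catch (IOException ignored) {
                        // Already broken; nothing more to do.
                    }
                    closed++;
                }
            }
        }
        return closed;
    }
}
```

In a server, `touch` would be called on each read or write, `forget` when a connection closes normally, and `reap()` periodically, for example from a `ScheduledExecutorService`.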
|
I can only adjust the connection timeout using the kernel; I cannot do anything about CLOSE_WAIT. From http://stackoverflow.com/a/17856372: "CLOSE_WAIT means that the local end of the connection has received a FIN from the other end, but the OS is waiting for the program at the local end to actually close its connection. The problem is your program running on the local machine is not closing the socket. It is not a TCP tuning issue. A connection can (and quite correctly) stay in CLOSE_WAIT forever while the program holds the connection open." We cannot ban googlebot, because that would also affect indexing of tigase.net and other sites running on the t2 web server. The only way those connections go away is by restarting Tigase. |
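The quoted explanation comes down to one rule: once `read()` reports end of stream (the peer's FIN), the application must close its own end of the socket. The sketch below shows that rule with plain `java.net` sockets; `EofAwareServer` is a hypothetical name, and it handles a single connection only for brevity.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical minimal server: drains one connection until the client's FIN,
// then closes its own end so the socket never lingers in CLOSE_WAIT.
final class EofAwareServer {
    // Returns the number of bytes received before the client closed its side.
    static int serveOnce(ServerSocket server) throws IOException {
        // try-with-resources guarantees close() even if reading throws.
        try (Socket socket = server.accept(); InputStream in = socket.getInputStream()) {
            byte[] buffer = new byte[1024];
            int total = 0;
            int n;
            // read() returns -1 once the peer has sent FIN.
            while ((n = in.read(buffer)) != -1) {
                total += n;
            }
            // Leaving this block closes our end; forgetting to do so is what
            // leaves the connection in CLOSE_WAIT indefinitely.
            return total;
        }
    }
}
```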
|
%eric you are correct. We have to improve the code in Tigase if possible. I say "if possible" because we use Java's built-in HTTP server for this, and I am not sure how deeply we can customize its TCP/IP connection-handling code. I remember that with Java 1.3 there was not much that could be done in cases like this. As for the iptables rules, I thought that maybe Tigase runs on a separate IP address from tigase.net and the other websites, in which case we could ban search bots from the HTTP server running on Tigase, or maybe disallow googlebot access only to port 8080? Another possible solution would be to put Tigase's HTTP service on a different port, like 8888. Then googlebot would not try to access it. |
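Besides iptables, the configurable per-IP refusal list suggested earlier in the thread could live inside the JDK HTTP server as a `com.sun.net.httpserver.Filter`. This is a sketch under assumptions: `BlockedAddressFilter` is a hypothetical name, and because the check runs after the TCP connection has been accepted, it costs more than dropping packets with iptables.

```java
import com.sun.net.httpserver.Filter;
import com.sun.net.httpserver.HttpExchange;
import java.io.IOException;
import java.net.InetAddress;
import java.util.Set;

// Hypothetical filter: answers 403 to blocked client addresses and closes the
// exchange immediately instead of passing the request on to the handler.
final class BlockedAddressFilter extends Filter {
    private final Set<InetAddress> blocked;

    BlockedAddressFilter(Set<InetAddress> blocked) {
        this.blocked = Set.copyOf(blocked); // immutable snapshot of the config
    }

    @Override
    public void doFilter(HttpExchange exchange, Filter.Chain chain) throws IOException {
        InetAddress client = exchange.getRemoteAddress().getAddress();
        if (blocked.contains(client)) {
            exchange.sendResponseHeaders(403, -1); // Forbidden, no body
            exchange.close();                      // explicit close, nothing lingers
            return;
        }
        chain.doFilter(exchange);
    }

    @Override
    public String description() {
        return "Rejects requests from blocked client addresses";
    }
}
```

It would be attached per context, e.g. `server.createContext("/", handler).getFilters().add(new BlockedAddressFilter(blockedSet))`.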
|
I changed our code to make sure … While testing it locally, I was able to replicate a connection in CLOSE_WAIT, and after these changes it was no longer possible, so the fix should work. However, during testing I found another issue inside the embedded HTTP server: a race condition. In some rare cases, 2 threads may try to read from the same input stream, which can cause a lock. Fortunately, our request timeout takes care of this issue by closing the stream. But there is one catch: in such cases the request may result in an HTTP response with an error code, and the client will need to send the request again. It is a rare situation, but it can happen. %kobit I did whatever I could to deal with the HTTP server embedded in the JDK, including debugging the HTTP server implementation to track down the race condition leading to this lock, but I was not able to fix it; it is very hard to find the exact cause. The only solution is to make this HTTP server use only a single thread, which is not good for performance (i.e. the admin page and Web UI will load very slowly when the single-thread solution is used). Because of that, it is hard for me to say that it is usable in production: it can be used, but only for simple tasks. I also checked Jetty, and when Jetty is used I was not able to replicate a CLOSE_WAIT connection on port 8080 locally. So, in my opinion, we should use Jetty in production deployments in the future. %eric As for blocking googlebot: we can block it with iptables by blocking connections from googlebot IP addresses to our port 8080. This will not block googlebot's requests to our websites, as they are hosted on port 80. The code with my changes is ready and will be part of tomorrow's snapshot build, so it may be a good idea to deploy that snapshot build on our servers (t2 and t6). If it works, the issue is fixed; if not, I will reconfigure the t2 and t6 servers to use Jetty. |
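The single-thread workaround mentioned here maps onto `HttpServer.setExecutor`, which decides how many threads run request handlers. A minimal sketch, assuming a hypothetical `ExecutorChoice` helper; daemon threads are used only so the JVM can exit without an explicit executor shutdown.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;

// Hypothetical helper: the Executor passed to setExecutor decides whether the
// JDK server handles requests one at a time or concurrently.
final class ExecutorChoice {
    static HttpServer create(InetSocketAddress address, int threads) throws IOException {
        HttpServer server = HttpServer.create(address, 0);
        ThreadFactory daemon = task -> {
            Thread t = new Thread(task, "http-worker");
            t.setDaemon(true); // don't keep the JVM alive after server.stop()
            return t;
        };
        // One thread: no concurrent reads, but every request waits its turn.
        // A pool: better throughput, but exposed to the race described above.
        ExecutorService executor = threads <= 1
                ? Executors.newSingleThreadExecutor(daemon)
                : Executors.newFixedThreadPool(threads, daemon);
        server.setExecutor(executor); // must be called before start()
        return server;
    }
}
```

With `threads == 1`, requests are handled strictly one after another, which avoids the race at the cost of the slow admin page and Web UI described above.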
|
%andrzej.wojcik am I correct that during normal use, I mean the way we or our customers use the HTTP interface to the Tigase server, the CLOSE_WAIT problem is negligible, if it exists at all? We are only aware of it because of googlebot and maybe other search engines? If so, I think all the fixes and improvements you made should be good enough. Also, about the deadlock problem: how likely is it to happen during normal use? If it happens, what are the implications for the end user? I am OK with having the JDK embedded HTTP server as the default for test/development or light-use installations and Jetty as the default for production systems. We should have this covered in our documentation, though, %daniel . |
|
%kobit Yes, in the typical use case, when the server receives data in the expected format, I haven't seen any possibility of CLOSE_WAIT, so it is negligible. The lock may happen from time to time; it is random, and it will result in 2 threads from the HTTP server's processing pool being locked for one or two request-timeout periods (from 1 to 2 minutes, depending on the case). After that, the server will return an HTTP error and the end user will need to repeat the request. Other requests will be processed normally and will not be blocked by this lock. I will also add this information to the documentation of the HTTP project for version 7.2.0, which I'm working on. %daniel please add this information to the current version of the documentation (part of the admin documentation for Tigase XMPP Server). |
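Because a locked request ends in an HTTP error and must be repeated, clients may want to retry automatically. A hedged client-side sketch follows, assuming a hypothetical `RetryingGet` helper that retries only on 5xx responses; it is only appropriate for idempotent requests such as GET.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.Proxy;
import java.net.URL;

// Hypothetical client helper: repeats a GET while the server answers 5xx.
final class RetryingGet {
    // Returns the last status code seen (or -1 if maxAttempts < 1).
    static int getWithRetry(URL url, int maxAttempts) throws IOException {
        int status = -1;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            HttpURLConnection conn = (HttpURLConnection) url.openConnection(Proxy.NO_PROXY);
            try {
                status = conn.getResponseCode();
            } finally {
                conn.disconnect();
            }
            if (status < 500) {
                return status; // success or client error: retrying will not help
            }
            // 5xx, e.g. after a request timeout: try again.
        }
        return status; // still failing after maxAttempts
    }
}
```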
|
I've installed b4282 on t2 and t6. I will try to find an authoritative list of googlebot IPs this week. |
|
I had to revert to the previous build, b4222, because statistics cannot be uploaded: the cron job is failing.
t2 and t6, however, have no problem uploading their own statistics locally with b4282. |
|
I looked into this issue and issue #4485 and found that the rare locks I observed on my local installation are far more common on t2, t6 and other server machines. I compared this to a build without my latest changes, and it appears that the locks started when my changes to deal with CLOSE_WAIT were applied. So I reviewed my latest changes and found a few lines leading to concurrent execution of the same HTTP request. I changed them and tested on my local installation and on our remote test cluster, and now I am not able to see any locks. I believe I solved the locks issue by fixing the concurrent execution. The next (tomorrow's) snapshot build should work fine. |
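The committed change is not shown in this thread. A common way to guarantee that a request is completed only once, even when normal completion races with a timeout, is an atomic compare-and-set flag. The sketch below, with the hypothetical class `CompleteOnce`, illustrates that pattern; it is not the actual Tigase code.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

// Hypothetical guard: the first caller (normal completion or the timeout)
// delivers the result; every later call is ignored, so one request is never
// processed or answered twice.
final class CompleteOnce<T> {
    private final AtomicBoolean done = new AtomicBoolean(false);
    private final Consumer<T> sink; // e.g. writes the HTTP response

    CompleteOnce(Consumer<T> sink) {
        this.sink = sink;
    }

    // Returns true only for the single caller that actually completed the request.
    boolean complete(T result) {
        if (done.compareAndSet(false, true)) {
            sink.accept(result);
            return true;
        }
        return false;
    }
}
```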
|
Installed b4284. Statistics upload is working, everything seems fine. |
|
The IP addresses used by Googlebot change from time to time. It seems Yandex has found the URL now.
Andrzej, there are also meta tags we could use. Would that be easy to add? If so, we should use nofollow as well. |
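Since Googlebot's IP addresses change, Google documents an alternative to a static list: verify a crawler with a reverse DNS lookup, check the host name's domain, then confirm with a forward lookup. The sketch below assumes the `googlebot.com` and `google.com` suffixes; check Google's current crawler verification documentation before relying on them. `GooglebotCheck` is a hypothetical name, and since the lookups can be slow, results would need caching in practice.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper for reverse-then-forward DNS verification of Googlebot.
final class GooglebotCheck {
    // Assumed suffixes; the authoritative list is in Google's documentation.
    static boolean isGoogleCrawlerHost(String host) {
        String h = host.toLowerCase(Locale.ROOT);
        if (h.endsWith(".")) {
            h = h.substring(0, h.length() - 1); // drop a trailing root dot
        }
        return h.endsWith(".googlebot.com") || h.endsWith(".google.com");
    }

    static boolean isVerifiedGooglebot(InetAddress address) {
        // Reverse lookup; returns the IP literal if no PTR record is found.
        String host = address.getCanonicalHostName();
        if (!isGoogleCrawlerHost(host)) {
            return false;
        }
        try {
            // Forward lookup must map the name back to the same address.
            return Arrays.asList(InetAddress.getAllByName(host)).contains(address);
        } catch (UnknownHostException e) {
            return false;
        }
    }
}
```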
|
I'm not sure we can do anything more at this point. I switched t2.tigase.org to use Jetty. As for meta tags and nofollow, they will be hard to use (if possible at all), as we do not return HTML: we usually return XML in our own format or JSON (HTML is returned only in a few exceptional cases). Eric, please verify whether we still have this issue with Jetty. If so, we will need to take care of it, but on Sunday I'm starting my holiday and will be back in about a week. |
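For XML and JSON responses, where meta tags cannot be embedded, search engines such as Google also honor the `X-Robots-Tag` response header, which accepts the same directives. This option is not discussed in the thread; the sketch below, with the hypothetical `NoIndexHeaderFilter`, only shows how such a header could be added in the JDK HTTP server.

```java
import com.sun.net.httpserver.Filter;
import com.sun.net.httpserver.HttpExchange;
import java.io.IOException;

// Hypothetical filter: marks every response as not to be indexed or followed.
final class NoIndexHeaderFilter extends Filter {
    @Override
    public void doFilter(HttpExchange exchange, Filter.Chain chain) throws IOException {
        // Must be set before the handler calls sendResponseHeaders().
        exchange.getResponseHeaders().set("X-Robots-Tag", "noindex, nofollow");
        chain.doFilter(exchange);
    }

    @Override
    public String description() {
        return "Adds X-Robots-Tag: noindex, nofollow to every response";
    }
}
```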
|
I've looked through the logs since August and have not seen any googlebot on port 8080. I think it is safe to assume Jetty has solved the issue. |
Type: New Feature
Priority: Normal
Assignee: (none)
RedmineID: 4464
|
I was looking into why projects.tigase.org is slow to respond, and I see these connections. Would it be possible to add a robots.txt to disallow googlebot from indexing the site?