Error messages
socket.timeout: timed out waiting for a response from the source server
Traceback (most recent call last):
File "/opt/py/ve1/lib/python3.8/site-packages/urllib3/connectionpool.py", line 384, in _make_request
six.raise_from(e, None)
File "<string>", line 2, in raise_from
File "/opt/py/ve1/lib/python3.8/site-packages/urllib3/connectionpool.py", line 380, in _make_request
httplib_response = conn.getresponse()
File "/opt/py/lib/python3.8/http/client.py", line 1347, in getresponse
response.begin()
File "/opt/py/lib/python3.8/http/client.py", line 307, in begin
version, status, reason = self._read_status()
File "/opt/py/lib/python3.8/http/client.py", line 268, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/opt/py/lib/python3.8/socket.py", line 669, in readinto
return self._sock.recv_into(b)
socket.timeout: timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/py/ve1/lib/python3.8/site-packages/elasticsearch/connection/http_urllib3.py", line 245, in perform_request
response = self.pool.urlopen(
File "/opt/py/ve1/lib/python3.8/site-packages/urllib3/connectionpool.py", line 637, in urlopen
retries = retries.increment(method, url, error=e, _pool=self,
File "/opt/py/ve1/lib/python3.8/site-packages/urllib3/util/retry.py", line 343, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/opt/py/ve1/lib/python3.8/site-packages/urllib3/packages/six.py", line 686, in reraise
raise value
File "/opt/py/ve1/lib/python3.8/site-packages/urllib3/connectionpool.py", line 597, in urlopen
httplib_response = self._make_request(conn, method, url,
File "/opt/py/ve1/lib/python3.8/site-packages/urllib3/connectionpool.py", line 386, in _make_request
self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
File "/opt/py/ve1/lib/python3.8/site-packages/urllib3/connectionpool.py", line 306, in _raise_timeout
raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value)
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='192.168.83.118', port=19200): Read timed out. (read timeout=60)
ConnectionRefusedError: the attempt to create a connection was refused
Traceback (most recent call last):
File "/opt/py/ve1/lib/python3.8/site-packages/urllib3/connection.py", line 158, in _new_conn
conn = connection.create_connection(
File "/opt/py/ve1/lib/python3.8/site-packages/urllib3/util/connection.py", line 80, in create_connection
raise err
File "/opt/py/ve1/lib/python3.8/site-packages/urllib3/util/connection.py", line 70, in create_connection
sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/py/ve1/lib/python3.8/site-packages/elasticsearch/connection/http_urllib3.py", line 245, in perform_request
response = self.pool.urlopen(
File "/opt/py/ve1/lib/python3.8/site-packages/urllib3/connectionpool.py", line 637, in urlopen
retries = retries.increment(method, url, error=e, _pool=self,
File "/opt/py/ve1/lib/python3.8/site-packages/urllib3/util/retry.py", line 343, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/opt/py/ve1/lib/python3.8/site-packages/urllib3/packages/six.py", line 686, in reraise
raise value
File "/opt/py/ve1/lib/python3.8/site-packages/urllib3/connectionpool.py", line 597, in urlopen
httplib_response = self._make_request(conn, method, url,
File "/opt/py/ve1/lib/python3.8/site-packages/urllib3/connectionpool.py", line 354, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/opt/py/lib/python3.8/http/client.py", line 1255, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/opt/py/lib/python3.8/http/client.py", line 1301, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/opt/py/lib/python3.8/http/client.py", line 1250, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/opt/py/lib/python3.8/http/client.py", line 1010, in _send_output
self.send(msg)
File "/opt/py/lib/python3.8/http/client.py", line 950, in send
self.connect()
File "/opt/py/ve1/lib/python3.8/site-packages/urllib3/connection.py", line 181, in connect
conn = self._new_conn()
File "/opt/py/ve1/lib/python3.8/site-packages/urllib3/connection.py", line 167, in _new_conn
raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fa65a586340>: Failed to establish a new connection: [Errno 111] Connection refused
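When scanning logs for these two errors, the final exception line of each traceback is enough to tell them apart. The following is a minimal sketch, assuming the log contains the final lines in the same form as shown above; the function name `classify_error` and the category labels are made up for illustration.

```python
import re
from collections import Counter

# Patterns for the *final* exception line of each traceback shown above.
# The lower-level "socket.timeout" / "ConnectionRefusedError" lines are
# deliberately not matched, so one failure is not counted twice.
_TIMEOUT = re.compile(
    r"ReadTimeoutError: HTTPConnectionPool\(host='(?P<host>[^']+)', port=(?P<port>\d+)\)"
)
_REFUSED = re.compile(r"NewConnectionError: .*\[Errno 111\] Connection refused")


def classify_error(line: str) -> str:
    """Return 'read_timeout', 'connection_refused', or 'other' for one log line."""
    if _TIMEOUT.search(line):
        return "read_timeout"
    if _REFUSED.search(line):
        return "connection_refused"
    return "other"


def count_errors(lines) -> Counter:
    """Count how often each of the two error types occurs in an iterable of lines."""
    counts = Counter(classify_error(line) for line in lines)
    counts.pop("other", None)  # only the two known error types are of interest
    return counts
```

Calling `count_errors(open(path))` on a log file (the path is whatever your compute log is) gives a quick per-type total before looking at the monitoring data.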
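Known causes
- The main ES process on one node of the ES cluster was killed because the server ran out of memory. While ES on that node was being restarted, requests failed with "the attempt to create a connection was refused".
- The server of one node in the ES cluster ran short of memory, so that node's ES cache was cleared and ES responded to requests more slowly. Requests then failed with "timed out waiting for a response from the source server".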
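How to locate the problem
1. Check the compute logs and work out the time windows and ES cluster nodes in which these two errors occur frequently
"Connection refused" errors are usually concentrated within 1 to 2 minutes; "response timed out" errors are usually concentrated within 10 minutes.
2. Use Grafana monitoring to check whether the network connections between the ES cluster nodes were normal during the problem window
This rules out network problems between nodes. (Note that a server short on memory may also be temporarily unable to answer monitoring requests, which shows up as an increased packet-loss rate.)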
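3. Use Grafana monitoring to check whether the metrics of the ES container on each node changed significantly during the problem window
Did the ES container's RSS drop significantly? ES allocates a certain amount of memory for itself as cache and keeps it in RSS, so the RSS of ES is normally very stable. A significant drop suggests that the main ES process may have been killed.
Did the ES container's cache usage drop significantly? When a server runs short of memory, the cache of each container is reclaimed first, so a significant drop in the ES cache suggests that the server may have been severely short of memory.
Did the ES container's disk reads increase significantly? If the ES process is restarted, it has to read its data from disk again, so a significant increase in disk reads suggests that the main ES process may have been killed.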
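4. Use Grafana monitoring to check whether the ES metrics on each node changed significantly during the problem window:
Were any metrics interrupted during the problem window? If the main ES process is killed or cannot respond for any other reason, ES monitoring is interrupted, which means the ES node had a problem during the gap.
After the problem window, were the various ES caches reset to 0? If the main ES process is killed or the ES cache is cleared, the ES caches are reset to 0.
5. Use Grafana monitoring to check whether server memory on each node changed significantly during the problem window:
During the problem window, did free memory (free) and cache (cache) on the server drop significantly and then recover quickly once the window ended?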
6. Check the Linux kernel message log for processes killed by the OOM Killer
If the OOM Killer killed a non-ES process, then around the time of the kill the ES container may have been short of memory and had its cache heavily reclaimed, causing requests to time out:
Jul 20 17:30:28 node118 kernel: Out of memory: Kill process 21434 (python) score 396 or sacrifice child
Jul 20 17:30:28 node118 kernel: Killed process 21434 (python) total-vm:63743040kB, anon-rss:57695916kB, file-rss:0kB, shmem-rss:1802304kB
If the OOM Killer killed the ES process, ES on that node inevitably stops responding after the kill, and connection requests are refused:
Jul 27 16:36:08 node119 kernel: Out of memory: Kill process 29074 (elasticsearch[n) score 270 or sacrifice child
Jul 27 16:36:08 node119 kernel: Killed process 29074 (elasticsearch[n) total-vm:241520696kB, anon-rss:35248156kB, file-rss:173912kB, shmem-rss:0kB
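Step 6 can be partly automated. The sketch below parses "Killed process" lines in the format quoted above and maps each kill to the symptom this article associates with it. It assumes the kernel log format is exactly as shown (other kernel versions word these lines differently); `OomKill`, `parse_oom_kill`, and `likely_symptom` are illustrative names, not part of any tool.

```python
import re
from typing import NamedTuple, Optional

# Matches lines such as:
# "Jul 27 16:36:08 node119 kernel: Killed process 29074 (elasticsearch[n) total-vm:..., anon-rss:35248156kB, ..."
_KILLED = re.compile(
    r"^(?P<when>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (?P<host>\S+) kernel: "
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]*)\).*?anon-rss:(?P<anon_rss_kb>\d+)kB"
)


class OomKill(NamedTuple):
    when: str         # timestamp as written in the log (no year)
    host: str         # node that logged the kill
    pid: int
    name: str         # process name, truncated by the kernel
    anon_rss_kb: int  # anonymous RSS of the killed process, in kB


def parse_oom_kill(line: str) -> Optional[OomKill]:
    """Parse one kernel log line; return None if it is not a 'Killed process' line."""
    m = _KILLED.match(line)
    if m is None:
        return None
    return OomKill(m["when"], m["host"], int(m["pid"]), m["name"], int(m["anon_rss_kb"]))


def likely_symptom(kill: OomKill) -> str:
    """Map a kill to the error type described above (a heuristic, not a guarantee)."""
    # The kernel truncates the name (e.g. "elasticsearch[n"), so compare by prefix.
    if kill.name.startswith("elasticsearch"):
        return "connection_refused"
    return "read_timeout"
```

Comparing the `when` and `host` fields with the time windows and nodes found in step 1 shows whether an OOM kill lines up with a burst of errors.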
Solutions
- Reduce the memory usage of the other containers
- Configure a memory limit for the other containers
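As an illustration of the second point, the sketch below builds the commands that would cap memory on already-running containers. It assumes the containers are managed by Docker, which the article does not state, and the container names and limits in the usage example are hypothetical. The function only builds the command lists; it does not run them.

```python
from typing import Dict, List


def build_limit_commands(limits: Dict[str, str]) -> List[List[str]]:
    """Build one `docker update` command per entry of {container name: memory limit}."""
    commands = []
    for container, limit in sorted(limits.items()):
        # Setting --memory-swap to the same value as --memory means the
        # container cannot use swap beyond its memory limit.
        commands.append(
            ["docker", "update", "--memory", limit, "--memory-swap", limit, container]
        )
    return commands
```

For example, `build_limit_commands({"compute-worker": "48g"})` yields a single command that can be reviewed and then passed to `subprocess.run`. The limit should leave enough headroom on the host for the ES heap plus the page cache ES relies on.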