Java服务总在半夜挂,背后的真相竟然是...

这篇具有很好参考价值的文章主要介绍了Java服务总在半夜挂,背后的真相竟然是...。希望对大家有所帮助。如果存在错误或未考虑完全的地方,请大家不吝赐教,您也可以点击"举报违法"按钮提交疑问。

写在前面

最近有用户反馈测试环境Java服务总在凌晨00:00左右挂掉,用户反馈Java服务没有定时任务,也没有流量突增的情况,Jvm配置也合理,莫名其妙就挂了

问题排查

问题复现

为了复现该问题,写了个springboot的demo部署在测试环境,其中demo里只做了hello world功能,应用类型为web_tomcat (war包部署),基础镜像是base_tomcat/java-centos6-jdk18-60-tom8050-ngx197,镜像使用的Java版本是1.8.0_60,有了上次MySQL被kill的经验,盲猜是linux limit惹的祸,因此将打好的镜像分别部署了两批不同的机器,果不其然,新机器当晚挂掉了,老机器服务正常

看一下挂掉的limit设置

排查过程

Java进程会受到limits影响?

按理说Java进程是不会受到系统limit open files(系统最大句柄数)影响的,但是为了验证这个问题,我们将他修改为正常机器的值,由于demo是web_tomcat应用,没法修改启动脚本,因此我们通过prlimit修改java进程的limit

prlimit -p 32672 --nofile=1048576 

结果当晚00:00左右还是挂了,看来open files和java进程挂掉没关系,看dmesg也没发现什么问题

Java版本过低导致内存分配不合理?

通过寻求jdos研发组的帮助,jdos研发组的同学认为是java版本的问题,低版本可能没有限制住申请的内存大小,具体原因如下

https://blog.softwaremill.com/docker-support-in-new-java-8-finally-fd595df0ca54?gi=a0cc6736ed14

异常机器java内存情况

正常机器java内存情况

按照这个文档描述,使用docker cgroups限制内存可能会导致JVM进程被终止,原因是Java读取的还是宿主机的CPU,而不是docker cgroups限制的CPU,高版本的Java解决了这个问题,文档解决方案截图如下:

对此我们表示怀疑,因为我们的程序里设置了JVM参数

保持着试一试的心态,我们增加了一个实验组,实验组使用的Java版本是11.0.8

结果当晚实验组的Java进程还是死了,看来和Java版本也没关系

容器上存在定时任务导致的?

由于基础镜像是jdos官方提供的镜像,所以之前从来没有怀疑过是定时任务的问题,但是现在别无他法了,检查下容器的定时任务

虽然有定时任务,但是这个执行的时间点和Java挂掉的时间对不上,为此我们决定删除定时任务试试

结果当晚Java进程还是挂了,并且这次有dmesg的日志,发现Java被kill的同时crond也被kill了,被kill的原因是crond内存过高导致oom

难道还有系统级cron任务?于是查了一下/etc/crontab,发现果然还有cron任务(这是谁打的镜像!!!)

这个时间点和Java进程挂掉的时间点吻合,但是问题来了,执行的任务并没有logrotate.sh这个脚本,应该不会出现问题才对

到底是不是定时任务的问题,我们修改下cron的时间验证下,调整时间为中午11:00,验证下Java进程是否会挂,同时使用strace打印进程trace log

果然Java进程在中午11.00挂了,看来真的是cron任务导致的,让我们一起看一下strace

19:59:01 close(3)                        = 0
19:59:01 stat("/etc/pam.d", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19:59:01 open("/etc/pam.d/crond", O_RDONLY) = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=293, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 read(3, "#\n# The PAM configuration file f"..., 4096) = 293
19:59:01 open("/lib64/security/pam_access.so", O_RDONLY) = 5
19:59:01 read(5, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0000\17\0\0\0\0\0\0"..., 832) = 832
19:59:01 fstat(5, {st_mode=S_IFREG|0755, st_size=18552, ...}) = 0
19:59:01 mmap(NULL, 2113800, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 5, 0) = 0x7fd769322000
19:59:01 mprotect(0x7fd769325000, 2097152, PROT_NONE) = 0
19:59:01 mmap(0x7fd769525000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 5, 0x3000) = 0x7fd769525000
19:59:01 close(5) = 0
19:59:01 open("/etc/ld.so.cache", O_RDONLY) = 5
19:59:01 fstat(5, {st_mode=S_IFREG|0644, st_size=16203, ...}) = 0
19:59:01 mmap(NULL, 16203, PROT_READ, MAP_PRIVATE, 5, 0) = 0x7fd7707f8000
19:59:01 close(5) = 0
19:59:01 open("/lib64/libnsl.so.1", O_RDONLY) = 5
19:59:01 read(5, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0p@\0\0\0\0\0\0"..., 832) = 832
19:59:01 fstat(5, {st_mode=S_IFREG|0755, st_size=113432, ...}) = 0
19:59:01 mmap(NULL, 2198192, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 5, 0) = 0x7fd769109000
19:59:01 mprotect(0x7fd76911f000, 2093056, PROT_NONE) = 0
19:59:01 mmap(0x7fd76931e000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 5, 0x15000) = 0x7fd76931e000
19:59:01 mmap(0x7fd769320000, 6832, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fd769320000
19:59:01 close(5) = 0
19:59:01 mprotect(0x7fd76931e000, 4096, PROT_READ) = 0
19:59:01 mprotect(0x7fd769525000, 4096, PROT_READ) = 0
19:59:01 munmap(0x7fd7707f8000, 16203) = 0
19:59:01 open("/etc/pam.d/password-auth", O_RDONLY) = 5
19:59:01 fstat(5, {st_mode=S_IFREG|0644, st_size=692, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)                     = 0x7fd770803000
19:59:01 read(5, "#%PAM-1.0\n# This file is auto-ge"..., 4096) = 692
19:59:01 open("/lib64/security/pam_unix.so", O_RDONLY) = 6
19:59:01 read(6, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\240&\0\0\0\0\0\0"..., 832) = 832
19:59:01 fstat(6, {st_mode=S_IFREG|0755, st_size=51960, ...}) = 0
19:59:01 mmap(NULL, 2196352, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 6, 0) = 0x7fd768ef0000
19:59:01 mprotect(0x7fd768efc000, 2093056, PROT_NONE) = 0
19:59:01 mmap(0x7fd7690fb000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 6, 0xb000) = 0x7fd7690fb000
19:59:01 mmap(0x7fd7690fd000, 45952, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fd7690fd000
19:59:01 close(6)                       = 0
19:59:01 mprotect(0x7fd7690fb000, 4096, PROT_READ) = 0
19:59:01 read(5, "", 4096)              = 0
19:59:01 close(5) = 0
19:59:01 munmap(0x7fd770803000, 4096) = 0
19:59:01 open("/lib64/security/pam_loginuid.so", O_RDONLY) = 5
19:59:01 read(5, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\220\t\0\0\0\0\0\0"..., 832) = 832
19:59:01 fstat(5, {st_mode=S_IFREG|0755, st_size=10240, ...}) = 0
19:59:01 mmap(NULL, 2105480, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 5, 0) = 0x7fd768ced000
19:59:01 mprotect(0x7fd768cef000, 2093056, PROT_NONE) = 0
19:59:01 mmap(0x7fd768eee000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 5, 0x1000) = 0x7fd768eee000
19:59:01 close(5) = 0
19:59:01 mprotect(0x7fd768eee000, 4096, PROT_READ) = 0
19:59:01 open("/etc/pam.d/password-auth", O_RDONLY) = 5
19:59:01 fstat(5, {st_mode=S_IFREG|0644, st_size=692, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770803000
19:59:01 read(5, "#%PAM-1.0\n# This file is auto-ge"..., 4096) = 692
19:59:01 open("/lib64/security/pam_keyinit.so", O_RDONLY) = 6
19:59:01 read(6, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0`\10\0\0\0\0\0\0"..., 832) = 832
19:59:01 fstat(6, {st_mode=S_IFREG|0755, st_size=10224, ...}) = 0
19:59:01 mmap(NULL, 2105488, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 6, 0)                      = 0x7fd768aea000
19:59:01 mprotect(0x7fd768aec000, 2093056, PROT_NONE)                     = 0
19:59:01 mmap(0x7fd768ceb000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 6, 0x1000) = 0x7fd768ceb000
19:59:01 close(6) = 0
19:59:01 mprotect(0x7fd768ceb000, 4096, PROT_READ) = 0
19:59:01 open("/lib64/security/pam_limits.so", O_RDONLY) = 6
19:59:01 read(6, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\320\20\0\0\0\0\0\0"..., 832) = 832
19:59:01 fstat(6, {st_mode=S_IFREG|0755, st_size=18600, ...}) = 0
19:59:01 mmap(NULL, 2113848, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 6, 0) = 0x7fd7688e5000
19:59:01 mprotect(0x7fd7688e9000, 2093056, PROT_NONE) = 0
19:59:01 mmap(0x7fd768ae8000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 6, 0x3000) = 0x7fd768ae8000
19:59:01 close(6) = 0
19:59:01 mprotect(0x7fd768ae8000, 4096, PROT_READ) = 0
19:59:01 open("/lib64/security/pam_succeed_if.so", O_RDONLY) = 6
19:59:01 read(6, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\340\v\0\0\0\0\0\0"..., 832) = 832
19:59:01 fstat(6, {st_mode=S_IFREG|0755, st_size=14384, ...}) = 0
19:59:01 mmap(NULL, 2109624, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 6, 0) = 0x7fd7686e1000
19:59:01 mprotect(0x7fd7686e4000, 2093056, PROT_NONE) = 0
19:59:01 mmap(0x7fd7688e3000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 6, 0x2000) = 0x7fd7688e3000
19:59:01 close(6) = 0
19:59:01 mprotect(0x7fd7688e3000, 4096, PROT_READ)                       = 0
19:59:01 read(5, "", 4096) = 0
19:59:01 close(5)                     = 0
19:59:01 munmap(0x7fd770803000, 4096) = 0
19:59:01 open("/etc/pam.d/password-auth", O_RDONLY)                      = 5
19:59:01 fstat(5, {st_mode=S_IFREG|0644, st_size=692, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)                      = 0x7fd770803000
19:59:01 read(5, "#%PAM-1.0\n# This file is auto-ge"..., 4096) = 692
19:59:01 open("/lib64/security/pam_env.so", O_RDONLY) = 6
19:59:01 read(6, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\300\r\0\0\0\0\0\0"..., 832) = 832
19:59:01 fstat(6, {st_mode=S_IFREG|0755, st_size=18592, ...}) = 0
19:59:01 mmap(NULL, 2113776, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 6, 0)                       = 0x7fd7684dc000
19:59:01 mprotect(0x7fd7684e0000, 2093056, PROT_NONE) = 0
19:59:01 mmap(0x7fd7686df000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 6, 0x3000) = 0x7fd7686df000
19:59:01 close(6) = 0
19:59:01 mprotect(0x7fd7686df000, 4096, PROT_READ)                     = 0
19:59:01 open("/lib64/security/pam_deny.so", O_RDONLY) = 6
19:59:01 read(6, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0000\5\0\0\0\0\0\0"..., 832) = 832
19:59:01 fstat(6, {st_mode=S_IFREG|0755, st_size=5952, ...}) = 0
19:59:01 mmap(NULL, 2101272, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 6, 0)                       = 0x7fd7682da000
19:59:01 mprotect(0x7fd7682db000, 2093056, PROT_NONE) = 0
19:59:01 mmap(0x7fd7684da000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 6, 0)                      = 0x7fd7684da000
19:59:01 close(6) = 0
19:59:01 mprotect(0x7fd7684da000, 4096, PROT_READ) = 0
19:59:01 read(5, "", 4096) = 0
19:59:01 close(5) = 0
19:59:01 munmap(0x7fd770803000, 4096) = 0
19:59:01 read(3, "", 4096)             = 0
19:59:01 close(3) = 0
19:59:01 munmap(0x7fd770804000, 4096)                      = 0
19:59:01 open("/etc/pam.d/other", O_RDONLY)                      = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=154, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)   = 0x7fd770804000
19:59:01 read(3, "#%PAM-1.0\nauth     required     "..., 4096) = 154
19:59:01 read(3, "", 4096) = 0
19:59:01 close(3) = 0
19:59:01 munmap(0x7fd770804000, 4096) = 0
19:59:01 open("/etc/passwd", O_RDONLY|O_CLOEXEC)   = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=1057, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 read(3, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 1057
19:59:01 close(3) = 0
19:59:01 munmap(0x7fd770804000, 4096) = 0
19:59:01 uname({sys="Linux", node="host-11-159-73-176", ...}) = 0
19:59:01 open("/etc/security/access.conf", O_RDONLY) = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=4620, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 read(3, "# Login access control table.\n#\n"..., 4096) = 4096
19:59:01 read(3, " should get access from ipv4 net"..., 4096) = 524
19:59:01 read(3, "", 4096) = 0
19:59:01 close(3) = 0
19:59:01 munmap(0x7fd770804000, 4096) = 0
19:59:01 getuid() = 0
19:59:01 open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=1057, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)  = 0x7fd770804000
19:59:01 read(3, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 1057
19:59:01 close(3)                       = 0
19:59:01 munmap(0x7fd770804000, 4096) = 0
19:59:01 geteuid() = 0
19:59:01 open("/etc/shadow", O_RDONLY|O_CLOEXEC) = 3
19:59:01 fstat(3, {st_mode=S_IFREG, st_size=901, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 read(3, "root:$6$4.53VPrJ$1wxMpbsWYp4VKea"..., 4096) = 901
19:59:01 close(3) = 0
19:59:01 munmap(0x7fd770804000, 4096)                      = 0
19:59:01 socket(PF_NETLINK, SOCK_RAW, 9)                       = 3
19:59:01 fcntl(3, F_SETFD, FD_CLOEXEC)   = 0
19:59:01 readlink("/proc/self/exe", "/usr/sbin/crond", 4096) = 15
19:59:01 sendto(3, "p\0\0\0M\4\5\0\1\0\0\0\0\0\0\0op=PAM:accountin"..., 112, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12)                      = 112
19:59:01 poll([{fd=3, events=POLLIN}], 1, 500)   = 1 ([{fd=3, revents=POLLIN}])
19:59:01 recvfrom(3, "$\0\0\0\2\0\0\1\1\0\0\0\227\7\0\0\0\0\0\0p\0\0\0M\4\5\0\1\0\0\0"..., 8988, MSG_PEEK|MSG_DONTWAIT, {sa_family=AF_NETLINK, pid=0, groups=00000000}, [12]) = 36
19:59:01 recvfrom(3, "$\0\0\0\2\0\0\1\1\0\0\0\227\7\0\0\0\0\0\0p\0\0\0M\4\5\0\1\0\0\0"..., 8988, MSG_DONTWAIT, {sa_family=AF_NETLINK, pid=0, groups=00000000}, [12]) = 36
19:59:01 close(3) = 0
19:59:01 open("/etc/security/pam_env.conf", O_RDONLY) = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=2980, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 read(3, "#\n# This is the configuration fi"..., 4096) = 2980
19:59:01 read(3, "", 4096) = 0
19:59:01 close(3)                      = 0
19:59:01 munmap(0x7fd770804000, 4096)                       = 0
19:59:01 open("/etc/environment", O_RDONLY)   = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)               = 0x7fd770804000
19:59:01 read(3, "", 4096) = 0
19:59:01 close(3) = 0
19:59:01 munmap(0x7fd770804000, 4096) = 0
19:59:01 socket(PF_NETLINK, SOCK_RAW, 9) = 3
19:59:01 fcntl(3, F_SETFD, FD_CLOEXEC) = 0
19:59:01 sendto(3, "p\0\0\0O\4\5\0\2\0\0\0\0\0\0\0op=PAM:setcred a"..., 112, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12)                       = 112
19:59:01 poll([{fd=3, events=POLLIN}], 1, 500)   = 1 ([{fd=3, revents=POLLIN}])
19:59:01 recvfrom(3, "$\0\0\0\2\0\0\1\2\0\0\0\227\7\0\0\0\0\0\0p\0\0\0O\4\5\0\2\0\0\0"..., 8988, MSG_PEEK|MSG_DONTWAIT, {sa_family=AF_NETLINK, pid=0, groups=00000000}, [12]) = 36
19:59:01 recvfrom(3, "$\0\0\0\2\0\0\1\2\0\0\0\227\7\0\0\0\0\0\0p\0\0\0O\4\5\0\2\0\0\0"..., 8988, MSG_DONTWAIT, {sa_family=AF_NETLINK, pid=0, groups=00000000}, [12]) = 36
19:59:01 close(3) = 0
19:59:01 open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=1057, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 read(3, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 1057
19:59:01 close(3) = 0
19:59:01 munmap(0x7fd770804000, 4096) = 0
19:59:01 open("/proc/self/loginuid", O_WRONLY|O_TRUNC|O_NOFOLLOW)        = 3
19:59:01 write(3, "0", 1) = 1
19:59:01 close(3) = 0
19:59:01 open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=1057, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 read(3, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 1057
19:59:01 close(3) = 0
19:59:01 munmap(0x7fd770804000, 4096) = 0
19:59:01 getuid() = 0
19:59:01 getgid() = 0
19:59:01 keyctl(0, 0xfffffffd, 0, 0, 0) = 496466385
19:59:01 keyctl(0, 0xfffffffb, 0, 0, 0x30) = 785702132
19:59:01 open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=1057, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 read(3, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 1057
19:59:01 close(3) = 0
19:59:01 munmap(0x7fd770804000, 4096) = 0
19:59:01 getrlimit(RLIMIT_CPU, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 0
19:59:01 getrlimit(RLIMIT_FSIZE, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 0
19:59:01 getrlimit(RLIMIT_DATA, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 0
19:59:01 getrlimit(RLIMIT_STACK, {rlim_cur=8192*1024, rlim_max=RLIM_INFINITY}) = 0
19:59:01 getrlimit(RLIMIT_CORE, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 0
19:59:01 getrlimit(RLIMIT_RSS, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 0
19:59:01 getrlimit(RLIMIT_NPROC, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 0
19:59:01 getrlimit(RLIMIT_NOFILE, {rlim_cur=1073741816, rlim_max=1073741816}) = 0
19:59:01 getrlimit(RLIMIT_MEMLOCK, {rlim_cur=64*1024, rlim_max=64*1024}) = 0
19:59:01 getrlimit(RLIMIT_AS, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 0
19:59:01 getrlimit(RLIMIT_LOCKS, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 0
19:59:01 getrlimit(RLIMIT_SIGPENDING, {rlim_cur=883632, rlim_max=883632}) = 0
19:59:01 getrlimit(RLIMIT_MSGQUEUE, {rlim_cur=800*1024, rlim_max=800*1024}) = 0
19:59:01 getrlimit(RLIMIT_NICE, {rlim_cur=0, rlim_max=0}) = 0
19:59:01 getrlimit(RLIMIT_RTPRIO, {rlim_cur=0, rlim_max=0}) = 0
19:59:01 getpriority(PRIO_PROCESS, 0) = 20
19:59:01 open("/etc/security/limits.conf", O_RDONLY) = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=1835, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 read(3, "# /etc/security/limits.conf\n#\n#E"..., 4096) = 1835
19:59:01 read(3, "", 4096) = 0
19:59:01 close(3) = 0
19:59:01 munmap(0x7fd770804000, 4096) = 0
19:59:01 open("/etc/security/limits.d", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
19:59:01 getdents(3, /* 3 entries */, 32768) = 88
19:59:01 open("/usr/lib64/gconv/gconv-modules.cache", O_RDONLY)                       = 5
19:59:01 fstat(5, {st_mode=S_IFREG|0644, st_size=26060, ...}) = 0
19:59:01 mmap(NULL, 26060, PROT_READ, MAP_SHARED, 5, 0) = 0x7fd7707f5000
19:59:01 close(5)  = 0
19:59:01 getdents(3, /* 0 entries */, 32768) = 0
19:59:01 close(3) = 0
19:59:01 open("/etc/security/limits.d/90-nproc.conf", O_RDONLY) = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=193, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 read(3, "# Default limit for number of us"..., 4096) = 193
19:59:01 read(3, "", 4096)              = 0
19:59:01 close(3)                       = 0
19:59:01 munmap(0x7fd770804000, 4096)   = 0
19:59:01 setrlimit(RLIMIT_NPROC, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 0
19:59:01 setpriority(PRIO_PROCESS, 0, 0) = 0
19:59:01 getuid() = 0
19:59:01 open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=1057, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 read(3, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 1057
19:59:01 close(3) = 0
19:59:01 munmap(0x7fd770804000, 4096)                     = 0
19:59:01 socket(PF_NETLINK, SOCK_RAW, 9)                      = 3
19:59:01 fcntl(3, F_SETFD, FD_CLOEXEC)                      = 0
19:59:01 sendto(3, "t\0\0\0Q\4\5\0\3\0\0\0\0\0\0\0op=PAM:session_o"..., 116, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 116
19:59:01 poll([{fd=3, events=POLLIN}], 1, 500) = 1 ([{fd=3, revents=POLLIN}])
19:59:01 recvfrom(3, "$\0\0\0\2\0\0\1\3\0\0\0\227\7\0\0\0\0\0\0t\0\0\0Q\4\5\0\3\0\0\0"..., 8988, MSG_PEEK|MSG_DONTWAIT, {sa_family=AF_NETLINK, pid=0, groups=00000000}, [12]) = 36
19:59:01 recvfrom(3, "$\0\0\0\2\0\0\1\3\0\0\0\227\7\0\0\0\0\0\0t\0\0\0Q\4\5\0\3\0\0\0"..., 8988, MSG_DONTWAIT, {sa_family=AF_NETLINK, pid=0, groups=00000000}, [12]) = 36
19:59:01 close(3) = 0
19:59:01 setgid(0) = 0
19:59:01 open("/proc/sys/kernel/ngroups_max", O_RDONLY) = 3
19:59:01 read(3, "65536\n", 31)         = 6
19:59:01 close(3)                       = 0
19:59:01 socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
19:59:01 connect(3, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
19:59:01 close(3) = 0
19:59:01 socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
19:59:01 connect(3, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110)                       = -1 ENOENT (No such file or directory)
19:59:01 close(3) = 0
19:59:01 open("/etc/group", O_RDONLY|O_CLOEXEC) = 3
19:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=497, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd770804000
19:59:01 lseek(3, 0, SEEK_CUR) = 0
19:59:01 read(3, "root:x:0:\nbin:x:1:bin,daemon\ndae"..., 4096) = 497
19:59:01 read(3, "", 4096)              = 0
19:59:01 close(3)                       = 0
19:59:01 munmap(0x7fd770804000, 4096)                     = 0
19:59:01 setgroups(1, [0]) = 0
19:59:01 setreuid(0, 4294967295) = 0
19:59:01 rt_sigaction(SIGCHLD, {SIG_DFL, [CHLD], SA_RESTORER|SA_RESTART, 0x7fd76fa316a0}, {0x558826e03b80, [], SA_RESTORER|SA_RESTART, 0x7fd76fa316a0}, 8) = 0
19:59:01 pipe([3, 5])                   = 0
19:59:01 pipe([6, 7])                   = 0
19:59:01 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fd7707fca70) = 1946
19:59:01 gettid()                     = 1943
19:59:01 open("/proc/self/task/1943/attr/exec", O_RDWR) = 8
19:59:01 write(8, NULL, 0) = -1 EINVAL (Invalid argument)
19:59:01 close(8) = 0
19:59:01 close(3) = 0
19:59:01 close(7) = 0
19:59:01 close(5) = 0
19:59:01 fcntl(6, F_GETFL)                       = 0 (flags O_RDONLY)
19:59:01 fstat(6, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
19:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)                     = 0x7fd770804000
19:59:01 lseek(6, 0, SEEK_CUR)                     = -1 ESPIPE (Illegal seek)
19:59:01 read(6, "/bin/bash: ./logrotate.sh: \346\262\241\346\234"..., 4096) = 55
19:59:01 uname({sys="Linux", node="host-11-159-73-176", ...}) = 0
19:59:01 getrlimit(RLIMIT_NOFILE, {rlim_cur=1073741816, rlim_max=1073741816}) = 0
19:59:01 mmap(NULL, 4294967296, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd6682da000
19:59:01 --- SIGCHLD (Child exited) @ 0 (0) ---
19:59:06 +++ killed by SIGKILL +++


可以看到最后用 mmap 一次分配了 4G 内存,然后就被kill了。

mmap前调用了getrlimit,和上次MySQL的问题一样,都是根据系统资源限制来分配内存

为了确定就是cron导致java挂掉的元凶,我们把cron进程手动kill掉,这样就不会执行定时任务了,这次我们在验证下Java进程是否会挂掉

果不其然,Java进程并没有挂掉,看来真的是cron任务导致的

高版本CentOS是否也会出现类似问题?

按理说oom killer应该只kill掉占用内存最高的才对,Java进程占用内存又不是最高的,高版本的CentOS系统oom killer策略会不会有升级?

让我们来一起验证下高版本的CentOS系统是否有这个问题

当前镜像的CentOS版本是CentOS release 6.6 (Final),为了验证高版本的CentOS是否也有类似的问题,我们将增加两个实验组,分别升级基础镜像至CentOS release 6.10 (Final)CentOS Linux release 7.9.2009 (Core),也添加相同的cron任务

结果发现CentOS release 6.10 (Final)CentOS Linux release 7.9.2009 (Core)都没有kill掉Java进程,只kill掉了cron的子进程

结论

由于容器limit open files(系统最大句柄数)设置不合理导致cron执行任务时使容器内存飙升,存在内存溢出的风险,linux由于保护机制会kill掉占用内存高的进程,导致cron子任务进程和Java进程一起被kill(但是问题来了,这个jdos基础镜像为什么会执行一个完全不存在的shell脚本,而且还是执行两次???),高版本的CentOS系统不会kill java进程,猜测不同版本的CentOS的kill选择策略略有不同

问题分析

Cron任务执行逻辑

在Linux中,crontab工具是由croine软件包提供的,让我们一起看下cron的执行过程

其中child_process()执行了cron子进程,cron执行子进程时会有发送mail的动作

cron_popen在执行时会按照open files(系统最大句柄数)清除内存

综上,cron oom的原因找到了,是由于open files设置过大且cron任务没有标准输出,导致执行了发送mail逻辑,而清除的内存大小超出了容器本身内存的大小,导致oom。

croine 1.5.4 版本之后修复了该问题,如果想查看当前容器croine版本可执行如下命令:

rpm -q cronie

Linux内核OOM killer机制

Linux 内核有个机制叫OOM killer(Out Of Memory killer),该机制会监控那些占用内存过大,尤其是瞬间占用内存很快的进程,然后防止内存耗尽而自动把该进程杀掉。内核检测到系统内存不足、挑选并杀掉某个进程的过程可以参考内核源代码linux/mm/oom_kill.c,当系统内存不足的时候,out_of_memory()被触发,然后调用select_bad_process()选择一个”bad”进程杀掉。

以下是一些主要的进程选择策略:

  1. 内存使用情况:OOM Killer首先倾向于选择占用内存最多的进程,因为终止这些进程可以释放最多的内存。

  2. OOM分数:每个进程都有一个OOM分数,该分数是基于其内存使用情况和其他因素计算出来的。OOM Killer倾向于终止OOM分数最高的进程。

  3. 进程优先级:在选择要终止的进程时,OOM Killer通常会避免终止对系统至关重要的系统进程。这些进程通常具有较高的优先级,因此它们更不容易成为终止目标。

  4. 进程资源需求:OOM Killer还会考虑进程的资源需求。它倾向于终止那些请求较少资源的进程,以最小化影响其他进程的运行。

  5. 进程属性:某些进程可能被标记为不可终止,例如通过设置/proc/[PID]/oom_score_adj的值来调整OOM分数。这些进程通常不容易被OOM Killer终止。

注:不同版本的Linux oom killer机制可能会存在一些差异

解决方案

使用高版本稳定的CentOS系统,如果业务无法升级CentOS,则需要设置合理的limit open files数量,application_worker类型应用可以在启动脚本中手动修改limit,web_tomcat类型应用没法修改启动脚本,可以选择kill掉cron进程或删除系统cron任务,也可以手动升级cronie的版本至1.5.7-5

写在后面

open files这个坑很大,栽这个坑两次了,大家一定要检查自己服务对应容器的CentOS版本和limit设置是否合理,本次案例发生在测试环境,尚不会引起事故,如果在生产出现类似情况,后果不堪设想

由于测试环境新增的这批机器都存在这个问题,我们团队已经联系机器提供方上报了该问题,后续这批机器会由提供方统一修改系统最大句柄数,如果当前问题影响到了业务的正常使用,可以临时删除容器中/etc/crontab中的任务

参考文献

https://cloud.tencent.com/developer/article/1183262

https://github.com/cronie-crond/cronie

作者:京东零售 杨云龙

来源:京东云开发者社区 转载请注明来源文章来源地址https://www.toymoban.com/news/detail-711417.html

到了这里,关于Java服务总在半夜挂,背后的真相竟然是...的文章就介绍完了。如果您还想了解更多内容,请在右上角搜索TOY模板网以前的文章或继续浏览下面的相关文章,希望大家以后多多支持TOY模板网!

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处: 如若内容造成侵权/违法违规/事实不符,请点击违法举报进行投诉反馈,一经查实,立即删除!

领支付宝红包 赞助服务器费用

相关文章

  • 服务器查看网络端口,竟然有这么多命令

    ping: 用来测试网络连接是否正常,通过向目标IP地址或域名发送数据包并等待响应时间来检测网络是否通畅。可通过sudo apt-get install iputils-ping安装。 nslookup: 用于查询DNS域名解析,通过输入域名或IP地址获取相应的解析信息。可通过sudo apt-get install dnsutils安装。 traceroute: 用于跟

    2024年02月15日
    浏览(31)
  • Dell服务器通过前面面板修改idrac管理卡的IP地址

    在服务器面板修改idrac卡的IP地址,比较适合刚刚购买回来的服务器,毕竟是第一次连接idrac卡,和公司的网络还没有互通。那么我们进行通过idrac卡来配置IP地址。(适用于Dell R730、R720、R630、R620、R420、R530) 1、  用一根网线接交换机和服务器的idrac的网口,如图所示: 2、回到

    2024年02月08日
    浏览(43)
  • 服务器装esxi及安装windows server 2016域控与文件权限设置教程(最完整最详细,有图有真相)

    将制作好的u盘插入到服务器(详情可以查看pe启动盘制作教程) 按F12,将服务器重启 输入密码,然后按回车键重启 按F11,进入引导管理 点击One shot UEFI BOOT MENU,进入uefi启动菜单 点击Disk connected to front USB ,连接u盘 进行ESXI安装 按回车键,进行安装 按F11,接受条款 用方向键

    2024年02月12日
    浏览(31)
  • 【网络】想学TCP,这一篇就够了 —— TCP理论知识详解(基于前面手搓TCP服务端博客的补充)

    本篇侧重理论,关于TCP的实践我前面博客中有,如果你想写一个简易的TCP服务端,可以看看这一篇:【网络】网络编程——带你手搓简易TCP服务端(echo服务器)+客户端(四种版本)。本篇也是基于这一篇进行理论方面的扩展,如果想要了解的同学可以看看。 本篇篇幅较长,

    2024年02月08日
    浏览(36)
  • 世界杯直播背后的服务器(云计算体系)

    世界杯直播过程中,各大网络平台流媒体app上最大的变化毫无疑问就是零延迟。以前球迷看球是都会发现,网络直播的球赛会比电视播出的球赛延迟40s左右。如果群里有个看电视的兄弟兄弟每个进球他都能提前40秒预告给你,那么所有惊喜荡然无存。 这种情况产生,就是因为

    2023年04月08日
    浏览(26)
  • 袋鼠云代码检查服务,揭秘高质量代码背后的秘密

    质量是产品的生命线,代码检查是软件开发过程中至关重要的一环,它可以帮助我们发现并纠正潜在的错误,提高软件质量,降低维护成本。 在袋鼠云产品中也存在这个问题,由于离线数据开发人员 SQL 水平不一,导致代码书写混乱、SQL 代码运行问题较多。本文将介绍在离线

    2024年02月08日
    浏览(37)
  • 分布式消息队列RabbitMQ-Linux下服务搭建,面试完腾讯我才发现这些知识点竟然没掌握全

    vim /usr/lib/rabbitmq/lib/rabbitmq_server-3.6.5/ebin/rabbit.app 5.修改配置文件 这里面修改{loopback_users, [“guest”]}改为{loopback_users, []} {application, rabbit, %% - - erlang - - [{description, “RabbitMQ”}, {id, “RabbitMQ”}, {vsn, “3.6.5”}, {modules, [‘background_gc’,‘delegate’,‘delegate_sup’,‘dtree’,‘file_han

    2024年04月14日
    浏览(41)
  • 进入微服务阶段后的学习方法

    陌生,多,复杂。 技术陌生,技术栈多,实现复杂。 对于每一个组件: 1.知道是什么、有什么用 2.知道操作步骤(跟着讲义操作即可),包括:         配置步骤         代码实现步骤 3.组件特点,注意事项

    2024年02月12日
    浏览(25)
  • GIT上超火的阿里内部1000页Java核心笔记,啃完竟然拿到阿里P7offer!

    5.线程池底层揭秘 6.并发安全解决方案 附加内容:并发编程高级面试题 Synchronized用过吗?其原理是什么? 你刚才提到获取对象的锁,这个“锁”到底是什么?如何确定对象的锁? 什么是“可重入性”,为什么说Synchronized是可重入锁? JVM对Java的原生锁做了哪些优化 ? 为什么

    2024年04月17日
    浏览(34)
  • 探秘ArrayList源码:Java动态数组的背后实现

    读者需先对源码的成员变量阅览一遍,看个眼熟,有助于后面源码的理解 以无参数构造方法创建 ArrayList时,实际上初始化赋值的是一个空数组。此时并没有为它创建对象,当真正对数组进行添加元素操作时,才真正分配容量。即向数组中添加第一个元素时,数组容量扩为

    2024年02月16日
    浏览(29)

觉得文章有用就打赏一下文章作者

支付宝扫一扫打赏

博客赞助

微信扫一扫打赏

请作者喝杯咖啡吧~博客赞助

支付宝扫一扫领取红包,优惠每天领

二维码1

领取红包

二维码2

领红包