A Long DNS Troubleshooting and Fix
Problem discovery

2022.05.20
Users reported that Flink jobs were getting stuck for no apparent reason.

Temporary fix:
- Notify users to restart promptly so the jobs recover.
- Watch the job backlog alerts so the problem is detected early.

Checking the machines and the DNS service:
The DNS service showed no obvious anomalies. Machine load was high in the early morning, but no further cause could be confirmed.

Because the affected jobs had already been restarted, there was no effective way to continue the analysis. We therefore planned to develop a Flink Canary program that periodically scans every job in the cluster and combines metrics with jstack output so that the problem can be detected promptly.

Flink Canary planning and development

Problem investigation

2022.06.28
After Flink Canary went live, we were able to capture the stuck state while it was still happening and analyze it.

Looking at the corresponding execution stack in the jstack output, we found that the thread ultimately ends up in a native method:

"pool-21-thread-1": running
    at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
    at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929)
    at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324)
    at java.net.InetAddress.getAllByName0(InetAddress.java:1277)
    at java.net.InetAddress.getAllByName(InetAddress.java:1193)
    at java.net.InetAddress.getAllByName(InetAddress.java:1127)
    at org.apache.http.impl.conn.SystemDefaultDnsResolver.resolve(SystemDefaultDnsResolver.java:45)
    at org.apache.http.impl.conn.DefaultClientConnectionOperator.resolveHostname(DefaultClientConnectionOperator.java:263)
    at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:162)
    at org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:326)
    at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:610)
    at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:445)
    at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:835)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:118)
    at org....