The Road to Fixing Bugs: ZooKeeper Cluster Denial of Service

January 6, 2022

## Preface

ZooKeeper is the registry for Dubbo, which makes it critical infrastructure: any disturbance to the production ZK cluster puts everyone on edge. Recently I ran into a case where the production ZK leader went down and the following leader election never succeeded, so the whole ZK cluster refused service. I am writing the case up to share it (based on ZooKeeper 3.4.5).

## The scene of the bug

One morning I got a phone call saying that a physical machine hosting ZooKeeper had gone down, and that the remaining machines all reported this status:

```
sh zkServer.sh status
it is probably not running
```

I checked the monitoring: the machine that went down happened to be the ZK leader. In this 3-node ensemble, neither of the two remaining nodes managed to become leader after the leader died. Even after we urgently brought the crashed machine back up, the election still failed, and the ZK cluster as a whole refused service.

![](https://static001.geekbang.org/infoq/82/829c6385ca6959c92b81278ed9e3c417.png)

## Business impact

When Dubbo cannot connect to ZK, it keeps using its cached invocation metadata, so request traffic is not actually affected. The trouble is that applications cannot be restarted or released while ZK is refusing service; if an emergency required a restart (or release) that could not be done, the impact would be serious. Fortunately, for high availability we had built a peer data center, so we calmly switched the traffic to data center B.

![](https://static001.geekbang.org/infoq/85/856adafca1fec5530f67b9b280602cf6.png)

Dual data centers really pay off: one click and the traffic is switched. After the switch we had plenty of time to recover the cluster in data center A. While the recovery was under way, I started the analysis.

## What the logs showed

First I looked at the logs. They contained a large number of client connection errors, which I filtered out to reduce the noise.

```
grep -v 'client xxx' zookeeper.out > /tmp/1.txt
```

The first thing that showed up was the following:

### ZK-A log

```
ZK-A:
2021-06-16 03:32:35 … New election. My id=3
2021-06-16 03:32:46 … QuorumPeer] LEADING // note: the election succeeded here
2021-06-16 03:32:46 … QuorumPeer] LEADING - LEADER ELECTION TOOK - 7878
2021-06-16 03:32:48 … QuorumPeer] Reading snapshot /xxx/snapshot.xxx
2021-06-16 03:32:54 … QuorumPeer] Snapshotting xxx to /xxx/snapshot.xxx
2021-06-16 03:33:08 … Follower sid ZK-B.IP
2021-06-16 03:33:08 … Unexpected exception causing shutdown while sock still open
java.io.EOFException
	at java.io.DataInputStream.readInt
	……
	at quorum.LearnerHandler.run
2021-06-16 03:33:08 ******* GOODBYE ZK-B.IP *******
2021-06-16 03:33:27 Shutting down
```

This log looks as if the election succeeded, but communication with the other machines then failed, which caused a shutdown and a new election.

### ZK-B log

```
2021-06-16 03:32:48 New election. My id=2
2021-06-16 03:32:48 QuorumPeer] FOLLOWING
2021-06-16 03:32:48 QuorumPeer] FOLLOWING - LEADER ELECTION TOOK - 222
2021-06-16 03:33:08.833 QuorumPeer] Exception when following the leader
java.net.SocketTimeoutException: Read timed out
	at java.net.SocketInputStream.socketRead0
	……
	at org.apache.zookeeper.server.quorum.Follower.followLeader
2021-06-16 03:33:08.380 Shutting down
```

This log also shows that the election succeeded and that this node was in the FOLLOWING state. The leader simply took too long to respond, so the follower timed out and shut down.

## Sequence diagram

I drew the logs above as a sequence diagram to make the analysis easier:

![](https://static001.geekbang.org/infoq/9f/9f1f515cd140d8082019a1de077ba957.png)

The ZK-B log shows that after becoming a follower it kept waiting for the leader until the read timed out. The ZK-A log shows that after entering LEADING it did not receive the packet from the follower (ZK-B) until 33:08,803. By then ZK-B had already hit Read timed out at 33:08,301.

### First, the follower (ZK-B)

ZK-B became a follower at 03:32:48 and failed with Read timed out at 03:33:08, exactly 20s later. So I first looked in the ZooKeeper source for how long the read timeout is.

```
Learner
protected void connectToLeader(InetSocketAddress addr) {
	……
	sock = new Socket();
	// self.tickTime 2000, self.initLimit 10
	sock.setSoTimeout(self.tickTime * self.initLimit);
	……
}
```

The read timeout is derived from these settings in zoo.cfg:

```
# self.tickTime
tickTime=2000
# self.initLimit
initLimit=10
syncLimit=5
```

Clearly, after ZK-B became a follower, the leader for some reason only responded after 20s. So the next step is to analyze the leader.

### Analyzing the leader (ZK-A)

Let's first look at the leader's initialization logic:

```
QuorumPeer
	|-> log LEADING
	|-> makeLeader
		|-> new ServerSocket, listen and bind
	|-> leader.lead()
		|-> log LEADER ELECTION TOOK
		|-> loadData
			|-> loadDataBase
				|-> restore: log Reading snapshot
			|-> takeSnapshot
				|-> save: log Snapshotting
		|-> cnxAcceptor: start accepting and handling requests
```

Between the moment ZK starts listening on the port and the moment it actually handles requests, it performs the Reading snapshot and Snapshotting (write) steps. According to the log, one took a little over 6s and the other a little over 14s, which leaves a 20s window in which nothing is processed, as shown below:

![](https://static001.geekbang.org/infoq/77/7703295c4313f4900c0570cbc5b76173.png)

Because the leader only started processing data 20s after the socket began listening, the connection that ZK-B had successfully established was actually sitting in the kernel's TCP accept queue (backlog). From the kernel's point of view the three-way handshake had succeeded, so the data ZK-B sent as a follower was received normally. After 20s, when ZK-A finally started processing, it took the data ZK-B had sent 20s earlier out of the buffer, handled it, and on replying found that ZK-B had already closed the connection. The other follower (by then we had brought the crashed machine back up, so there were 3 nodes) failed for the same reason. The leader, receiving no response from the other machines, concluded that it did not hold more than 1/2 of the votes, shut down, and started a new election.

## Snapshot duration

So what made reading and writing the snapshot so slow? I checked the size of the snapshot file: close to 1 GB.

## Increasing initLimit

For this situation, simply increasing initLimit should be enough to get past the problem.

```
# zoo.cfg
# Leave tickTime alone: it drives ZK's heartbeat mechanism
tickTime=2000
# Raised straight to 100, i.e. 200s
initLimit=100
```

## Is it really exactly 20s every time?

Is it really such a coincidence that every election round got stuck right at the 20s limit? The cluster re-elected many times, so at least one round should have […]

```
	……
	// […] the 1/2 condition, report the error and return
	if (!tickSkip && !self.getQuorumVerifier().containsQuorum(syncedSet)) {
		shutdown("Only " + syncedSet.size() + " followers, need " + (self.getVotingView().size() / 2));
		return;
	}
	……
```

The essence of this error is that syncedSet, the set of followers in sync with the leader, is smaller than 1/2 of the cluster, so the leader shuts down. The code also shows that membership in syncedSet is decided by LearnerHandler.synced(). Let's keep reading:

```
LearnerHandler
	public boolean synced() {
		// isAlive here is the thread's isAlive
		return isAlive() && tickOfLastAck >= leader.self.tick - leader.self.syncLimit;
	}
```

Clearly, the follower lagged behind the leader by more than leader.self.syncLimit, that is 5 * 2 = 10s.

```
# zoo.cfg
tickTime=2000
syncLimit=5
```

So how does the tick get updated? Only after the follower has responded to the UPTODATE packet, meaning it is in sync with the leader, is the tick updated for every packet that arrives from the follower. Before that point it is not updated.

![](https://static001.geekbang.org/infoq/4b/4bfe5cdfe4eb2b57a917f27e43528770.png)

Reasoning one step further: the follower took more than 10s to process the leader's packets, so the tick was not updated in time, syncedSet fell below the required size, and the leader shut down.

![](https://static001.geekbang.org/infoq/1a/1a61fcdba4e9694d74d836efac82ca48.png)

### The follower (ZK-B), second case

With this conclusion in mind, I went through the follower (ZK-B) log (note: ZK-C behaved the same way).

```
2021-06-16 03:38:24 New election. My id = 3
2021-06-16 03:38:24 FOLLOWING
2021-06-16 03:38:24 FOLLOWING - LEADER ELECTION TOOK - 8004
2021-06-16 03:38:42 Getting a diff from the leader
2021-06-16 03:38:42 Snapshotting
2021-06-16 03:38:57 Snapshotting
2021-06-16 03:39:12 Got zxid xxx
2021-06-16 03:39:12 Exception when following the leader
java.net.SocketException: Broken pipe
```

Snapshots again. This time each snapshot takes about 15s, far beyond syncLimit. The source shows that a writePacket (the response) follows immediately after each snapshot, but the first reply is not sent while handling the UPTODATE packet, so it does not update the corresponding tick on the leader side:

```
Learner
protected void syncWithLeader(…) {
	outerLoop:
	while (self.isRunning()) {
		readPacket(qp);
		switch (qp.getType()) {
		case Leader.UPTODATE:
			if (!snapshotTaken) {
				zk.takeSnapshot();
				……
			}
			break outerLoop;
		case Leader.NEWLEADER:
			zk.takeSnapshot();
			……
			writePacket(……); // first reply: the leader does not update the tick for this one
			break;
		}
	}
	……
	writePacket(ack, true); // the leader updates the tick when it receives this
}
```

Note that the ZK-B log shows Snapshotting twice. Why twice is probably a subtle bug (the comments in the official 3.4.5 source say it was fixed, yet the log still shows it twice); I did not dig further. The complete sequence diagram looks like this:

![](https://static001.geekbang.org/infoq/5c/5c23c2e2c179be0949b0462bd3e4c900.png)

So the second case fails as well. This time the timing is not a borderline miss: the follower needs nearly 30s to finish, while syncLimit allows only 10s (2 * 5). The ZK cluster kept re-electing, alternating between these two cases, until someone intervened manually.

## Increasing syncLimit

For this situation, simply increasing syncLimit should be enough to get past the problem.

```
# zoo.cfg
# Leave tickTime alone: it drives ZK's heartbeat mechanism
tickTime=2000
# Raised straight to 50, i.e. 100s
syncLimit=50
```

## Offline reproduction

Analysis alone is not enough; the conclusion still has to be reproduced and verified by testing. Offline, we set up a ZooKeeper instance with a 1024 MB snapshot. With initLimit=10 and syncLimit=5 it showed exactly the same two failure cases as production. I then adjusted the parameters:

```
# zoo.cfg
tickTime=2000
# 200s
initLimit=100
# 100s
syncLimit=50
```

and the ZooKeeper cluster finally worked normally.

## Trying to reproduce offline with the newer 3.4.13

We also tried to reproduce the problem offline with the newer version 3.4.13. Without any parameter changes, ZooKeeper elected a leader quickly and served requests normally. I read through the source and found that takeSnapshot had been removed both from the Leader.lead() phase and from the syncWithLeader phase (when a DIFF is used). That avoids the situation where snapshot handling takes so long that the cluster cannot serve requests.

```
ZooKeeper 3.4.13

ZooKeeperServer.java
public void loadData() {
	…
	// takeSnapshot();  the takeSnapshot call on the last line was removed
}

Learner.java
protected void syncWithLeader(…) {
	boolean snapshotNeeded = true;
	if (qp.getType() == Leader.DIFF) {
		……
		snapshotNeeded = false;
	}
	……
	if (snapshotNeeded) {
		zk.takeSnapshot();
	}
	……
}
```

Upgrading to a newer version is the reliable path. This version also fixes that misleading log line.

## Why did the Dubbo ZK hold so much data?

The last question is why a Dubbo-related ZK held so much data. I used the tool that ZK itself provides,

```
org.apache.zookeeper.server.SnapshotFormatter
```

to dump the snapshot, then aggregated the output with shell (awk | uniq). Dubbo's data turned out to be only 1/4 of the total. Another 1/2 belonged to Solar's ZooKeeper usage (already migrated away, with the data left behind). The remaining 1/4 came from a distributed-lock bug in another system that kept writing nodes and never deleted them (that team has been asked to fix it). This shows how important it is to separate dubbo-zk from other ZK data: careless sharing can lead to a major incident.

## Summary

ZooKeeper is an important metadata management system, and an outage can have an impact that is hard to estimate. Thanks to the dual data center setup, we had enough time and a calm state of mind to deal with this problem. And although ZK election is complex, a patient analysis will always turn up clues that lead to a breakthrough.
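
The timeout arithmetic behind the two failure cases can be checked with a short sketch. The class `ZkTimeoutCheck` and its methods are hypothetical helpers written for this illustration, not part of ZooKeeper; the stall durations (about 20s on the leader, about 30s on the follower) are the rounded figures read from the logs above, and the syncLimit window is approximated as tickTime * syncLimit, as in the article.

```java
// Hypothetical helper, not part of ZooKeeper: it only redoes the article's timeout arithmetic.
final class ZkTimeoutCheck {

    /** Read timeout a follower sets when connecting to the leader: tickTime * initLimit. */
    static long initTimeoutMs(int tickTimeMs, int initLimit) {
        return (long) tickTimeMs * initLimit;
    }

    /** Approximate window the leader tolerates before a follower counts as out of sync. */
    static long syncTimeoutMs(int tickTimeMs, int syncLimit) {
        return (long) tickTimeMs * syncLimit;
    }

    /** True when an observed stall fits strictly inside the given timeout. */
    static boolean fits(long observedStallMs, long timeoutMs) {
        return observedStallMs < timeoutMs;
    }

    public static void main(String[] args) {
        int tickTimeMs = 2000;
        long leaderStallMs = 20_000;   // case 1: ~6s reading plus ~14s writing the snapshot
        long followerStallMs = 30_000; // case 2: two ~15s snapshots on the follower

        // Original settings: initLimit=10, syncLimit=5
        System.out.println(fits(leaderStallMs, initTimeoutMs(tickTimeMs, 10)));
        System.out.println(fits(followerStallMs, syncTimeoutMs(tickTimeMs, 5)));

        // Adjusted settings: initLimit=100, syncLimit=50
        System.out.println(fits(leaderStallMs, initTimeoutMs(tickTimeMs, 100)));
        System.out.println(fits(followerStallMs, syncTimeoutMs(tickTimeMs, 50)));
    }
}
```

Run as is, it prints `false`, `false`, `true`, `true`: both stalls exceed the original limits and both fit comfortably within the adjusted ones. The same comparison can be repeated with stall times measured on another cluster before choosing initLimit and syncLimit values.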
