
Spark HA worker issue

On Spark worker nodes connecting to the Master in an HA setup

Problem: the Spark Worker does not connect to the ALIVE Master

Machines:

  • 192.168.1.128 Master

  • 192.168.1.129 Master Worker

  • 192.168.1.130 Worker

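    At startup the state of the two Masters cannot be controlled, so it is not known in advance which one is the ALIVE Master. When a worker node connects to a Master, it checks whether that Master's state is ALIVE. If the state is STANDBY, it stops connecting to that Master and keeps looking until it finds the ALIVE Master.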
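
The selection behaviour described above can be modelled in a few lines. This is only a toy sketch of the expected logic, not Spark's actual Worker code: the `Master` class, the `find_alive_master` function, and the states assigned to each machine are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Master:
    address: str  # host:port of the master
    state: str    # "ALIVE" or "STANDBY" (assumed state names)


def find_alive_master(masters: Sequence[Master]) -> Optional[Master]:
    """Return the first master that reports ALIVE, skipping STANDBY ones."""
    for master in masters:
        if master.state == "ALIVE":
            return master
        # STANDBY: do not register here, move on to the next candidate.
    return None


# Hypothetical cluster state: 128 came up as STANDBY, 129 as ALIVE.
masters = [
    Master("192.168.1.128:7077", "STANDBY"),
    Master("192.168.1.129:7077", "ALIVE"),
]
chosen = find_alive_master(masters)
print(chosen.address if chosen else "no ALIVE master found")
```

Note that the loop can only reach the second Master if the worker knows about both addresses in the first place.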

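    The problem now is that after the Worker finds the STANDBY node, it does not go on to look for another Master. As a result the worker never registers with the cluster and shuts itself down.

    Cause: not yet determined.

    According to some posts, if Spark on YARN is configured, Spark HA has basically no effect.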

Error log

  • Terminal
    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    17/10/09 13:05:08 INFO Worker: Registered signal handlers for [TERM, HUP, INT]
    17/10/09 13:05:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    17/10/09 13:05:09 INFO SecurityManager: Changing view acls to: root
    17/10/09 13:05:09 INFO SecurityManager: Changing modify acls to: root
    17/10/09 13:05:09 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
    17/10/09 13:05:10 INFO Utils: Successfully started service 'sparkWorker' on port 39766.
    17/10/09 13:05:10 INFO Worker: Starting Spark worker 192.168.10.129:39766 with 4 cores, 4.0 GB RAM
    17/10/09 13:05:10 INFO Worker: Running Spark version 1.6.0
    17/10/09 13:05:10 INFO Worker: Spark home: /opt/dkh/spark-1.6.0-bin-hadoop2.6
    17/10/09 13:05:11 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
    17/10/09 13:05:11 INFO WorkerWebUI: Started WorkerWebUI at http://192.168.10.129:8081
    17/10/09 13:05:11 INFO Worker: Connecting to master dkm:7077...
    17/10/09 13:05:11 WARN Worker: Failed to connect to master dkm:7077
    java.io.IOException: Failed to connect to dkm/192.168.10.128:7077
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
    at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:200)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:187)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:183)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
    Caused by: java.net.ConnectException: Connection refused: dkm/192.168.10.128:7077
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
    ... 1 more
    17/10/09 13:05:24 INFO Worker: Retrying connection to master (attempt # 1)
    17/10/09 13:05:24 INFO Worker: Connecting to master dkm:7077...
    17/10/09 13:05:37 INFO Worker: Retrying connection to master (attempt # 2)
    17/10/09 13:05:37 INFO Worker: Connecting to master dkm:7077...
    17/10/09 13:05:50 INFO Worker: Retrying connection to master (attempt # 3)
    17/10/09 13:05:50 INFO Worker: Connecting to master dkm:7077...
    17/10/09 13:06:03 INFO Worker: Retrying connection to master (attempt # 4)
    17/10/09 13:06:03 INFO Worker: Connecting to master dkm:7077...
    17/10/09 13:06:16 INFO Worker: Retrying connection to master (attempt # 5)
    17/10/09 13:06:16 INFO Worker: Connecting to master dkm:7077...
    17/10/09 13:06:29 INFO Worker: Retrying connection to master (attempt # 6)
    17/10/09 13:06:29 INFO Worker: Connecting to master dkm:7077...
    17/10/09 13:07:47 INFO Worker: Retrying connection to master (attempt # 7)
    17/10/09 13:07:47 INFO Worker: Connecting to master dkm:7077...
    17/10/09 13:09:05 INFO Worker: Retrying connection to master (attempt # 8)
    17/10/09 13:09:05 INFO Worker: Connecting to master dkm:7077...
    17/10/09 13:10:23 INFO Worker: Retrying connection to master (attempt # 9)
    17/10/09 13:10:23 INFO Worker: Connecting to master dkm:7077...
    17/10/09 13:11:41 INFO Worker: Retrying connection to master (attempt # 10)
    17/10/09 13:11:41 INFO Worker: Connecting to master dkm:7077...
    17/10/09 13:12:59 INFO Worker: Retrying connection to master (attempt # 11)
    17/10/09 13:12:59 INFO Worker: Connecting to master dkm:7077...
    17/10/09 13:14:17 INFO Worker: Retrying connection to master (attempt # 12)
    17/10/09 13:14:17 INFO Worker: Connecting to master dkm:7077...
    17/10/09 13:15:35 INFO Worker: Retrying connection to master (attempt # 13)
    17/10/09 13:15:35 INFO Worker: Connecting to master dkm:7077...
    17/10/09 13:16:53 INFO Worker: Retrying connection to master (attempt # 14)
    17/10/09 13:16:53 INFO Worker: Connecting to master dkm:7077...
    17/10/09 13:18:11 INFO Worker: Retrying connection to master (attempt # 15)
    17/10/09 13:18:11 INFO Worker: Connecting to master dkm:7077...
    17/10/09 13:19:29 INFO Worker: Retrying connection to master (attempt # 16)
    17/10/09 13:19:29 INFO Worker: Connecting to master dkm:7077...
    17/10/09 13:20:47 ERROR Worker: All masters are unresponsive! Giving up.

    Given that, I simply did not start the second Master. After running start-all, the cluster works normally, but there is no second Master.
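
Going back to the error log, it can help to summarize which master addresses the worker actually tried and how the retries were spaced. The following is a minimal sketch that parses the "Connecting to master" lines copied from the log above. The `summarize` function and the `CONNECT` pattern are illustrative names, and the timestamp format is assumed to be `yy/MM/dd HH:mm:ss`.

```python
import re
from collections import Counter
from datetime import datetime

# "Connecting to master" lines copied from the error log above.
LOG = """\
17/10/09 13:05:11 INFO Worker: Connecting to master dkm:7077...
17/10/09 13:05:24 INFO Worker: Connecting to master dkm:7077...
17/10/09 13:05:37 INFO Worker: Connecting to master dkm:7077...
17/10/09 13:05:50 INFO Worker: Connecting to master dkm:7077...
17/10/09 13:06:03 INFO Worker: Connecting to master dkm:7077...
17/10/09 13:06:16 INFO Worker: Connecting to master dkm:7077...
17/10/09 13:06:29 INFO Worker: Connecting to master dkm:7077...
17/10/09 13:07:47 INFO Worker: Connecting to master dkm:7077...
17/10/09 13:09:05 INFO Worker: Connecting to master dkm:7077...
17/10/09 13:10:23 INFO Worker: Connecting to master dkm:7077...
17/10/09 13:11:41 INFO Worker: Connecting to master dkm:7077...
17/10/09 13:12:59 INFO Worker: Connecting to master dkm:7077...
17/10/09 13:14:17 INFO Worker: Connecting to master dkm:7077...
17/10/09 13:15:35 INFO Worker: Connecting to master dkm:7077...
17/10/09 13:16:53 INFO Worker: Connecting to master dkm:7077...
17/10/09 13:18:11 INFO Worker: Connecting to master dkm:7077...
17/10/09 13:19:29 INFO Worker: Connecting to master dkm:7077...
"""

# Captures the timestamp and the master address of each connection attempt.
CONNECT = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) INFO Worker: "
    r"Connecting to master (\S+?)\.\.\.$"
)


def summarize(log):
    """Return (set of master addresses tried, Counter of gaps in seconds)."""
    times, masters = [], set()
    for line in log.splitlines():
        match = CONNECT.match(line.strip())
        if match:
            times.append(datetime.strptime(match.group(1), "%y/%m/%d %H:%M:%S"))
            masters.add(match.group(2))
    # Seconds between each pair of consecutive connection attempts.
    gaps = [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]
    return masters, Counter(gaps)


masters, gaps = summarize(LOG)
print(masters)
print(gaps)
```

On this log the only address that appears is dkm:7077, with six retries spaced 13 seconds apart followed by ten spaced 78 seconds apart. The second Master never shows up in a connection attempt. One possible reading, which is an inference from the log and not a confirmed cause, is that the worker was started with a single master address. Spark standalone HA accepts a comma-separated master list of the form spark://host1:port1,host2:port2, so that is worth checking.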