ENG-12876

When a node rejoin fails due to clock skew, the cluster still thinks the node is rejoining

    Details

    • Team Backlog:
      DRAM team
    • Release Note:
      When rejoining a node to a running cluster, the system clock on the rejoining node must be within the cluster's clock skew limits, just as when starting the cluster for the first time. If it is not, the rejoin operation fails. Previously, if a rejoin failed due to clock skew, subsequent attempts to rejoin nodes would also fail, even after the clock skew had been corrected. This issue has been resolved.
    • Sprint:
      DRAM 17, DRAM 18
    • Impact:
      Stability

      Description

      Need to determine which release introduced this issue, and consider the fix for backport.

      --------

      If a node rejoin fails due to clock skew, the cluster still thinks it is rejoining and won't allow any subsequent rejoin attempts.

      Reproducer:
      1. Start a cluster.
      2. Stop one node and set its system clock 2 minutes ahead of the current time.
      3. Rejoin the node. The rejoin fails.
      4. Resync the clock and attempt the rejoin again. It fails and repeatedly reports the error messages below, with increasing retry delays:

      2017-07-18 16:13:09,754 WARN [main] JOINER: Request to join cluster mesh is rejected, retrying in 10 seconds. Only one host can rejoin at a time. Host 8 is still rejoining.
      2017-07-18 16:13:19,773 WARN [main] JOINER: Request to join cluster mesh is rejected, retrying in 29 seconds. Only one host can rejoin at a time. Host 8 is still rejoining.
      2017-07-18 16:13:48,792 WARN [main] JOINER: Request to join cluster mesh is rejected, retrying in 63 seconds. Only one host can rejoin at a time. Host 8 is still rejoining.
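
      The failure mode above can be illustrated with a minimal sketch. This is not the product's code: RejoinCoordinator, ClockSkewError, and the max_skew_seconds parameter are hypothetical names, and the 1-second limit in the usage note is an assumed value. The sketch only shows the general idea that the "one rejoin at a time" slot should be released when a rejoin fails, so that a later attempt is not rejected with "still rejoining".

```python
class ClockSkewError(Exception):
    """Raised when the joining host's clock is too far from the cluster's."""


class RejoinCoordinator:
    """Hypothetical coordinator that admits one rejoining host at a time."""

    def __init__(self, max_skew_seconds):
        # Assumed configuration value; the real limit is not stated in the ticket.
        self.max_skew_seconds = max_skew_seconds
        # Host id currently holding the rejoin slot, or None if the slot is free.
        self.rejoining_host = None

    def rejoin(self, host_id, host_clock, cluster_clock):
        """Try to admit host_id; clocks are given as seconds since the epoch."""
        if self.rejoining_host is not None:
            raise RuntimeError(
                "Only one host can rejoin at a time. "
                f"Host {self.rejoining_host} is still rejoining."
            )
        self.rejoining_host = host_id
        try:
            skew = abs(host_clock - cluster_clock)
            if skew > self.max_skew_seconds:
                raise ClockSkewError(f"clock skew of {skew}s exceeds the limit")
            return True
        finally:
            # Release the slot on success and on failure alike. Skipping this
            # on the failure path would reproduce the symptom in the ticket.
            self.rejoining_host = None
```

      With an assumed limit of 1 second, calling rejoin(8, 1120.0, 1000.0) raises ClockSkewError, and a follow-up rejoin(8, 1000.0, 1000.0) is accepted because the slot was released in the finally block.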

              People

              Assignee:
              nclark Nate Clark
              Reporter:
              bballard Ben Ballard