ENG-10423

Rejoins can start after a partition but before fault-resolution (during the timeout)

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: V6.3
    • Fix Version/s: V6.4
    • Component/s: Core
    • Labels:
    • Team Backlog:
      RAM team

      Description

      If you partition off a node, you can start a rejoin to ANY node, and the rejoin will proceed as far as it can before failing with an unhelpful error message.

      This was reproduced in an n5k4 scenario where three nodes were killed, so the two remaining nodes would shut themselves down due to partition detection. But they don't do that until the timeout expires. In the meantime, the three killed nodes can be restarted and can try to rejoin the two living nodes. The rejoin actually starts to work, until the living nodes realize the timeout is up and shut themselves down.

      Fix
      Feels like we should push rejoin requests through ZooKeeper.

      Note
      You can't prevent simultaneous failures during rejoin. Even if we check that every node we think is up is actually up before starting a rejoin, that status could change AT ANY TIME. So we still need to resolve failures that happen during (or even simultaneously with) a rejoin, but we should be able to deny (or delay) rejoin starts while node timeouts are still going through fault-resolution.
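
      The deny-or-delay idea in the note could be sketched roughly as below. This is only an illustration under assumptions, not the actual fix: `RejoinGate`, `onHostSuspected`, `onFaultResolved` and `requestRejoin` are hypothetical names that do not come from the codebase, and the sketch keeps state in memory rather than routing requests through ZooKeeper.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical sketch: refuses to start a rejoin while any node failure
 * is still waiting on its timeout or on fault-resolution.
 * All names here are illustrative assumptions.
 */
public class RejoinGate {
    public enum Decision { ACCEPT, DELAY }

    // Hosts that stopped responding and have not yet been through
    // fault-resolution (their timeout may still be running).
    private final Set<Integer> unresolvedHosts = new HashSet<>();

    /** Called when a host stops responding and its timeout clock starts. */
    public synchronized void onHostSuspected(int hostId) {
        unresolvedHosts.add(hostId);
    }

    /** Called once fault-resolution has reached a decision about the host. */
    public synchronized void onFaultResolved(int hostId) {
        unresolvedHosts.remove(hostId);
    }

    /** A rejoin may start only when no failure is awaiting resolution. */
    public synchronized Decision requestRejoin() {
        return unresolvedHosts.isEmpty() ? Decision.ACCEPT : Decision.DELAY;
    }

    public static void main(String[] args) {
        RejoinGate gate = new RejoinGate();

        // Three nodes are killed; the survivors have not timed them out yet.
        gate.onHostSuspected(2);
        gate.onHostSuspected(3);
        gate.onHostSuspected(4);
        System.out.println(gate.requestRejoin()); // DELAY

        // Fault-resolution completes for all three failures.
        gate.onFaultResolved(2);
        gate.onFaultResolved(3);
        gate.onFaultResolved(4);
        System.out.println(gate.requestRejoin()); // ACCEPT
    }
}
```

      A gate like this only covers rejoin starts. A failure that arrives after `requestRejoin()` returns `ACCEPT` still has to be resolved while the rejoin is in flight, as the note says.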

      Acceptance
      Let's try this in system tests. Probably localcluster unit tests too.

              People

              • Assignee: Ning Shi (nshi)
              • Reporter: John Hugg (jhugg)