ENG-10421

If you start two rejoins simultaneously, we give scary error messages when we should just forbid it

    Details

    • Team Backlog:
      ELASTIC team
    • Release Note:
      Previously, attempting to rejoin nodes simultaneously could result in cryptic fatal error messages. The rejoin operation has been changed to allow only one rejoin at a time. Now, attempting to rejoin two or more nodes at once results in each additional node waiting until the preceding rejoin completes before starting.

      Description

      This showed up in a 5-node, k=4 test where 3 nodes failed and were rejoined simultaneously.

      Two of the rejoining nodes tried to pull from the same source and they both failed, but with different responses.

      I believe this is a scenario we didn't anticipate, so we never coded defensively against it.

      Fix:
      The simple fix is to reject a rejoin when the cluster already has a rejoining node.

      Operationally, it would be nice to give the user the option to have nodes wait if the cluster is already busy with another rejoin. I'm not sure if this should be the default (maybe?), but it would make some scenarios more straightforward.
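
A minimal sketch of the two behaviors discussed above (reject immediately, or optionally wait), assuming a single cluster-wide gate guards the start of a rejoin. The class and method names (`RejoinGate`, `tryBegin`, `beginOrWait`, `end`) are hypothetical and are not taken from the actual codebase; a real fix would have to coordinate through cluster state rather than an in-process semaphore.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only: these names do not come from the real code.
final class RejoinGate {
    // One permit means at most one rejoin in flight.
    // The fair setting hands the permit to waiting callers in arrival order.
    private final Semaphore slot = new Semaphore(1, true);

    /** Reject mode: returns false immediately if another rejoin is in progress. */
    boolean tryBegin() {
        return slot.tryAcquire();
    }

    /**
     * Wait mode: blocks up to the timeout for the in-flight rejoin to finish.
     * Returns false on timeout or if the waiting thread is interrupted.
     */
    boolean beginOrWait(long timeout, TimeUnit unit) {
        try {
            return slot.tryAcquire(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }

    /**
     * Frees the slot. Must be called exactly once by the node that was admitted,
     * whether its rejoin succeeded or failed; an unmatched call would add a permit.
     */
    void end() {
        slot.release();
    }
}
```

Whether reject or wait should be the default is left open here, as in the ticket; the sketch only shows that both can share one gate.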

      Acceptance
      Need at least a localcluster JUnit test that reproduces the issue and then exercises the fix code.
      It would also be nice to add multiple simultaneous rejoins to the system tests, with the understanding that one of them should fail.
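
The shape of such a reproduction could look roughly like the following sketch, which uses plain threads as stand-ins for rejoining nodes and an `AtomicBoolean` as a stand-in for the cluster's "rejoin in progress" state. Everything here (`SimultaneousRejoinCheck`, `run`) is hypothetical; it is not the localcluster harness and only illustrates the assertion the test would make: of N simultaneous requests, exactly one is admitted.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a real cluster test; threads play the rejoining nodes.
final class SimultaneousRejoinCheck {
    /** Returns {admitted, rejected} after n nodes request a rejoin at the same moment. */
    static int[] run(int n) {
        AtomicBoolean rejoinInProgress = new AtomicBoolean(false);
        AtomicInteger admitted = new AtomicInteger();
        AtomicInteger rejected = new AtomicInteger();
        CountDownLatch start = new CountDownLatch(1);        // releases all nodes together
        CountDownLatch allAttempted = new CountDownLatch(n); // counts completed attempts
        Thread[] nodes = new Thread[n];
        for (int i = 0; i < n; i++) {
            nodes[i] = new Thread(() -> {
                try {
                    start.await();
                    // Only one caller can flip the flag from false to true.
                    boolean won = rejoinInProgress.compareAndSet(false, true);
                    allAttempted.countDown();
                    if (won) {
                        admitted.incrementAndGet();
                        // Hold the slot until every node has tried, so the
                        // attempts genuinely overlap with an in-flight rejoin.
                        allAttempted.await();
                        rejoinInProgress.set(false);
                    } else {
                        rejected.incrementAndGet();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            nodes[i].start();
        }
        start.countDown();
        for (Thread t : nodes) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return new int[] { admitted.get(), rejected.get() };
    }
}
```

With three simulated nodes, matching the failing test described above, the expected result is one admitted rejoin and two rejected ones.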

      Long Term
      Simultaneously rejoining more than one node would be nice to have, but would be a separate feature. Note this becomes even more nice-to-have as the cluster size goes up.

              People

              • Assignee:
                nshi Ning Shi
                Reporter:
                jhugg John Hugg