ENG-10486

Recovery planner might choose a log from a killed-minority-partition, possibly losing confirmed writes

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: V6.3
    • Fix Version/s: V6.4
    • Component/s: Core
    • Labels:
    • Team Backlog: ELASTIC team
    • Release Note:
      There were two race conditions related to K-safety and network partitions that could result in differences between the persisted data and responses to the client. In the first case, if the cluster divides into two viable segments, a write transaction being processed during the partition could be reported as successful by the minority segment before it shuts down due to network partition resolution, although the transaction is never committed by the nodes of the surviving majority segment. In the second case, again where the cluster divides into two viable segments, write transactions in-flight during the network partition can be written separately to the command logs of the two segments. On recovery, some of those write transactions may not be replayed. Both cases, found in testing, only occurred under certain conditions and in specific configurations where a network partition could result in two viable cluster segments. Both cases have been resolved.

      Description

      EDIT July 12, 2016: Read more context about this issue on our website: https://voltdb.com/jepsen-found-issues-depth#lostwrites
      ===

      1. Take a 2-node cluster with partition detection (PD) on.
      2. Cut the link between the nodes.
      3. Send 12 transactions to the non-blessed node. Send 5 different transactions to the blessed node. All 17 transactions will be held until fault resolution (but they will be command-logged).
      4. Fault resolution kicks in, kills the non-blessed node, and tells the blessed node it can release its 5 transactions to the client.
      5. Now kill the blessed node too.
      6. Recover both nodes into a two-node cluster.
      7. The recovery planner sees 7 more transactions in the log of the non-blessed node. It will restore the 12 unconfirmed transactions and lose the 5 confirmed transactions from the blessed node.
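      To make step 7 concrete, here is a minimal, self-contained Java sketch of the flawed rule. None of these names exist in the product; CommandLog and pickLongest are hypothetical. If the planner simply prefers the log with the most transactions, the killed minority's 12 unconfirmed writes beat the blessed node's 5 confirmed ones.

      {code:java}
      import java.util.Comparator;
      import java.util.List;

      public class NaiveLogSelection {
          // Hypothetical stand-in for one node's command log after step 5.
          record CommandLog(String node, int loggedTxns) {}

          // The flawed rule: the log with the most transactions wins, regardless of
          // which segment actually confirmed its writes to the client.
          static CommandLog pickLongest(List<CommandLog> logs) {
              return logs.stream()
                         .max(Comparator.comparingInt(CommandLog::loggedTxns))
                         .orElseThrow();
          }

          public static void main(String[] args) {
              List<CommandLog> logs = List.of(
                  new CommandLog("non-blessed", 12),  // held, never confirmed to the client
                  new CommandLog("blessed", 5));      // confirmed to the client, then node killed
              // Prints "non-blessed": the 5 confirmed writes on the blessed node are lost.
              System.out.println("recovery would replay the log from: " + pickLongest(logs).node());
          }
      }
      {code}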

      This is a lost-writes bug on recovery. It's not trivial to hit: in most cases there is only one surviving complete log+snapshot set, and even when there is more than one, the set from the segment that kept running will usually be longer. Still, it's possible, and that's not OK.

      Fix
      We think the recovery planner just needs to recognize partition events by reading the logs. There is already a fault log in there, so we should be able to do that. We may also want to add some information to the nodes that were killed by PD to make this easier.
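      A rough sketch of that rule, again with hypothetical types (CommandLog, pick) rather than the actual planner code: mark each log with whether its node was killed by PD (as recorded in the fault log), and never let such a log beat a log from a surviving node, no matter how many transactions it holds.

      {code:java}
      import java.util.Comparator;
      import java.util.List;

      public class PartitionAwareLogSelection {
          // Hypothetical summary of one node's recoverable state; the "killed by PD"
          // flag would come from fault-log entries written at partition resolution.
          record CommandLog(String node, int loggedTxns, boolean killedByPartitionDetection) {}

          // Prefer the longest log among nodes that survived partition resolution;
          // fall back to a killed-minority log only if no surviving log exists at all.
          static CommandLog pick(List<CommandLog> logs) {
              return logs.stream()
                         .filter(l -> !l.killedByPartitionDetection())
                         .max(Comparator.comparingInt(CommandLog::loggedTxns))
                         .orElseGet(() -> logs.stream()
                                              .max(Comparator.comparingInt(CommandLog::loggedTxns))
                                              .orElseThrow());
          }

          public static void main(String[] args) {
              List<CommandLog> logs = List.of(
                  new CommandLog("non-blessed", 12, true),   // killed by PD, writes never confirmed
                  new CommandLog("blessed", 5, false));      // survived the partition, writes confirmed
              // Prints "blessed": the 5 confirmed transactions are the ones replayed.
              System.out.println("recovery would replay the log from: " + pick(logs).node());
          }
      }
      {code}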

      Acceptance

      • We should be able to make a JUnit localcluster test to reproduce this, but we haven't tried yet.
      • Jepsen sign-off.
      • Examine the system tests that check command logs and see if it's possible to make existing system tests hit this.


            People

            • Assignee: wweiss (Walter Weiss)
            • Reporter: jhugg (John Hugg)
            • Votes: 0
            • Watchers: 1
