Engineering / ENG-10453

Possible lost writes on partition due to race between partition detection and general fault-resolution

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: V6.3
    • Fix Version/s: V6.4
    • Component/s: Core
    • Team Backlog:
      RAM team
    • Release Note:
      Improved read level consistency during network partitions.
      A potential read consistency issue was identified and resolved. Read transactions could access data modified by write transactions on the local server before all nodes confirmed those write transactions. During a node failure or network partition, it was possible for those locally completed writes to be rolled back as part of the network partition resolution. This could only happen on the off chance that a read transaction accessed data modified by an immediately preceding write that had not been committed on all copies of the partition prior to a network partition. To ensure this cannot happen, reads now run on all copies of the partition, guaranteeing consensus among the servers and complete read consistency. However, this also incrementally increases the time required to complete a read-only transaction in a K-safe cluster. If you do not need complete read consistency, you can optionally restore the old, faster read behavior by setting the read level consistency to "fast" in the deployment file (see the snippet below). See the appendix on Server Configuration Options in the VoltDB Administrator's Guide for more information.
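      The release note refers to a deployment-file setting. Assuming the V6.4 deployment schema, the setting would look something like the following; the hostcount and kfactor values are illustrative only:

      {code:xml}
      <deployment>
          <cluster hostcount="3" kfactor="1" />
          <!-- "safe" (the default) runs reads on all copies of the partition;
               "fast" restores the pre-6.4 single-copy read path. -->
          <consistency readlevel="fast" />
      </deployment>
      {code}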

      Description

      EDIT July 12, 2016: Read more context about this issue on our website: https://voltdb.com/jepsen-found-issues-depth#lostwrites
      ===

      DuplicateCounter holds completed writes at the initiator until all replicas have responded with identical answers.
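      For context, here is a minimal sketch of that pattern. It is illustrative only: the class, method, and field names are hypothetical, and this is not VoltDB's actual DuplicateCounter code.

      {code:java}
      import java.util.Arrays;
      import java.util.HashSet;
      import java.util.Set;
      import java.util.function.Consumer;

      // Hypothetical sketch of the duplicate-counter pattern described above.
      class DuplicateCounterSketch {
          private final Set<Long> expectedReplicas;  // site IDs we still need answers from
          private final Consumer<byte[]> clientCallback;
          private byte[] firstResponse;              // buffered write result

          DuplicateCounterSketch(Set<Long> replicas, Consumer<byte[]> callback) {
              this.expectedReplicas = new HashSet<>(replicas);
              this.clientCallback = callback;
          }

          // A replica reported its result for this write.
          synchronized void offerResponse(long siteId, byte[] response) {
              if (firstResponse == null) {
                  firstResponse = response;
              } else if (!Arrays.equals(firstResponse, response)) {
                  throw new IllegalStateException("replicas diverged on a write");
              }
              expectedReplicas.remove(siteId);
              releaseIfComplete();
          }

          // Fault resolution says: stop waiting for a dead site.
          synchronized void siteIsDead(long siteId) {
              expectedReplicas.remove(siteId);
              releaseIfComplete();  // this is the release the race below is about
          }

          private void releaseIfComplete() {
              if (expectedReplicas.isEmpty() && firstResponse != null) {
                  clientCallback.accept(firstResponse);  // confirm the write to the client
              }
          }
      }
      {code}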

      If a node fails, a ZK watch job tells each DuplicateCounter not to expect a response from that node, and to release the write if the missing node was the only thing blocking a response.

      However, if the node in question is in a minority partition, then partition detection (PD) is supposed to stop the minority partition from doing any more work.

      The catch is that PD is also run from a ZK watch job, so the two notifications race and it is nondeterministic which fires first. Sometimes writes are released to a client before we realize we shouldn't be sending anything to a client, because we are in a minority partition.
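      To make the race concrete, here is a hypothetical sketch of the two competing watch handlers, reusing the DuplicateCounterSketch above. In the buggy interleaving on a minority node, handler A fires before handler B.

      {code:java}
      import java.util.List;
      import java.util.Set;

      // Hypothetical sketch; both handlers fire asynchronously from ZK watches.
      class WatchRaceSketch {
          private final List<DuplicateCounterSketch> pendingCounters;
          private final int clusterSize;

          WatchRaceSketch(List<DuplicateCounterSketch> pending, int clusterSize) {
              this.pendingCounters = pending;
              this.clusterSize = clusterSize;
          }

          // Handler A: general fault resolution.
          void onSiteFailure(long deadSiteId) {
              for (DuplicateCounterSketch dc : pendingCounters) {
                  dc.siteIsDead(deadSiteId);  // may release a write to the client...
              }
          }

          // Handler B: partition detection.
          void onTopologyChange(Set<Long> survivingHosts) {
              if (survivingHosts.size() * 2 <= clusterSize) {
                  // ...but on a minority node this should have run first,
                  // before anything was sent back to clients.
                  shutDownThisNode();
              }
          }

          private void shutDownThisNode() {
              // take the partition-detection snapshot, then exit
          }
      }
      {code}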

      The net result is that a write sent to the cluster just before a partition might be confirmed on a minority partition only, then released back to the client. When the minority partition kills itself, the write is lost, even though it was confirmed to the user.

      Technically the write is preserved in the command log or the partition-detection snapshot, but it's not in the surviving cluster, and that violates ACID/CAP/etc.

      In order to hit this you need a network partition event that divides the cluster into two viable clusters (or else the k-safety checker should kill it), and you need a live connection to the client from the minority partition.

      Fix
      Move partition detection in front of dead-site notifications to DuplicateCounter instances.
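      In terms of the sketches above, the shape of the fix is roughly this: evaluate the partition-detection decision before fanning out dead-site notifications, so a minority node never releases buffered writes. Again, this is a hypothetical illustration, not the actual patch:

      {code:java}
      // Drop-in replacement for onSiteFailure in the WatchRaceSketch above.
      void onSiteFailure(long deadSiteId, Set<Long> survivingHosts) {
          // 1. Partition detection first: if we are on the losing side,
          //    shut down without releasing anything to clients.
          if (survivingHosts.size() * 2 <= clusterSize) {
              shutDownThisNode();
              return;
          }
          // 2. Only a confirmed-majority node tells the counters to stop
          //    waiting on the dead site.
          for (DuplicateCounterSketch dc : pendingCounters) {
              dc.siteIsDead(deadSiteId);
          }
      }
      {code}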

      Acceptance
      • Passing unit test that reproduces the issue before the fix.
      • Strong code review.
      • Jepsen pass.

      People

      • Assignee: jhugg (John Hugg)
      • Reporter: jhugg (John Hugg)
      • Votes: 0
      • Watchers: 1
