Engineering / ENG-10389

Possible dirty read with RO transactions in certain partition scenarios

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: V6.3
    • Fix Version/s: V6.4
    • Component/s: Core
    • Team Backlog: RAM team
    • Release Note:
      Improved read level consistency during network partitions. A potential read consistency issue was identified and resolved. Previously, read transactions could access data modified by write transactions on the local server before all nodes confirmed those write transactions. During a node failure or network partition, those locally completed writes could be rolled back as part of the partition resolution. This could only happen in the unlikely case that a read transaction accessed data modified by an immediately preceding write that had not yet been committed on all copies of the partition when the partition occurred. To ensure this cannot happen, reads now run on all copies of the partition, guaranteeing consensus among the servers and complete read consistency. This also incrementally increases the time required to complete a read-only transaction in a K-safe cluster. If you do not need complete read consistency, you can opt for faster read transactions using the old behavior by setting the read level consistency to "fast" in the deployment file. See the appendix on Server Configuration Options in the VoltDB Administrator's Guide for more information.
    • Sprint: RAM 16

      Description

      EDIT July 12, 2016: Read more context about this issue on our website: https://voltdb.com/jepsen-found-issues-depth#lostwrites
      ===

      Today, read operations can run on a single node and return responses to the client, even when they execute after a write whose results are not yet safe to return to the client.

      Note: this does not affect write transactions or any reads within RW transactions.

      Problem 1
      This can lead to a linearizability failure and a split-brain read scenario. If a write completes at node A, but a partition happens before it can be sent to node B, it's possible to keep sending reads to nodes A and B and get two different responses. You can repeat the reads and see interleaved, non-matching values for many seconds, until partition detection kicks in or the cluster recovers.
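
      Purely as an illustration (a toy model with made-up names, not VoltDB code), the sketch below shows how single-copy reads can return non-matching answers while a write has reached only one copy of the partition:

      // Toy model of Problem 1: a write lands on replica A, the partition keeps
      // it from reaching replica B, and single-node reads then disagree.
      import java.util.HashMap;
      import java.util.Map;

      public class SplitBrainReadSketch {
          public static void main(String[] args) {
              Map<String, Integer> replicaA = new HashMap<>();
              Map<String, Integer> replicaB = new HashMap<>();
              replicaA.put("x", 1);
              replicaB.put("x", 1);

              // The write completes locally on A; the partition drops the
              // replication message to B.
              replicaA.put("x", 2);

              // Single-node reads can hit either copy and see interleaved,
              // non-matching values until partition detection resolves the split.
              System.out.println("read via A: x = " + replicaA.get("x")); // prints 2
              System.out.println("read via B: x = " + replicaB.get("x")); // prints 1
          }
      }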

      Problem 2
      Reads can observe uncommitted data. If a write completes at node A, but a partition happens before it can be sent to node B, it's theoretically possible to read the new value and then have node A fail before B ever learns about it. Thus you have read a value that never committed.

      This would be pretty hard to hit, but not ridiculous. It would require a network partition that disconnected the nodes from each other but not from the client (possible), plus a near-simultaneous failure of a partition master. Technically, a partition master failure combined with unlucky network queueing could deliver a read response to the client before the master sends the corresponding writes out to its replicas.
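
      Extending the same toy model (again made-up names, not VoltDB code), the dirty read looks like this: the read observes the new value on A, and that value is then lost when A fails before B ever hears about it.

      import java.util.HashMap;
      import java.util.Map;

      public class DirtyReadSketch {
          public static void main(String[] args) {
              Map<String, Integer> replicaA = new HashMap<>();
              Map<String, Integer> replicaB = new HashMap<>();
              replicaA.put("x", 1);
              replicaB.put("x", 1);

              // Partition master A applies the write and answers a read before
              // the replication message to B is ever delivered.
              replicaA.put("x", 2);
              System.out.println("client read via A: x = " + replicaA.get("x")); // prints 2

              // A then fails. B never saw x = 2, so the surviving copy holds only
              // the last value confirmed everywhere; the read above returned a
              // value that never committed.
              replicaA = null; // node A is gone
              System.out.println("after failover to B: x = " + replicaB.get("x")); // prints 1
          }
      }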

      Fix
      Make a read wait for any writes ahead of it that are still waiting to become safe to return to the client; only once those writes are safe does the read consider itself safe.
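
      A minimal sketch of one way to implement that rule, assuming writes are applied in order per partition (this is an illustration with made-up class and method names, not the actual VoltDB scheduler):

      import java.util.ArrayDeque;
      import java.util.Deque;
      import java.util.function.IntSupplier;

      public class SafeReadSketch {
          // Transaction ids of writes applied locally but not yet confirmed on
          // all copies of the partition.
          private final Deque<Long> unackedWrites = new ArrayDeque<>();

          synchronized void onLocalWrite(long txnId) {
              unackedWrites.addLast(txnId);
          }

          synchronized void onAllReplicasAcked(long txnId) {
              // Writes are ordered per partition, so acks drain from the front.
              while (!unackedWrites.isEmpty() && unackedWrites.peekFirst() <= txnId) {
                  unackedWrites.pollFirst();
              }
              notifyAll();
          }

          synchronized int safeRead(IntSupplier readLocalValue) throws InterruptedException {
              // The read only considers itself safe once every write queued ahead
              // of it has been confirmed on all copies (for simplicity this also
              // waits out writes that arrive while the read is blocked).
              while (!unackedWrites.isEmpty()) {
                  wait();
              }
              return readLocalValue.getAsInt();
          }
      }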

      It's possible to support either the old behavior or the new one on a per-transaction or even per-cluster basis, but if the choice is exposed as an option to the user, the default should be the stricter behavior.
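
      The per-cluster knob is what the release note above describes: read level consistency is configured in the deployment file, with the strict behavior as the default. The snippet below is a hedged sketch from memory; verify the element against the Server Configuration Options appendix of the VoltDB Administrator's Guide before relying on it.

      <!-- Deployment file sketch: opt back into the faster, single-copy reads
           (the old behavior). Omitting the element keeps the stricter default. -->
      <deployment>
          <cluster hostcount="3" sitesperhost="8" kfactor="1"/>
          <consistency readlevel="fast"/>
      </deployment>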

      Acceptance

      • Requires a Jepsen pass.
      • It's not clear how to adjust txnidselfcheck2 to catch this, but it should be considered.
      • A black-box JUnit test should also be considered.
      • Any changes to the safety dance may require new unit tests.

    People

    • Assignee: jhugg (John Hugg)
    • Reporter: jhugg (John Hugg)
    • Votes: 0
    • Watchers: 1
