Sunday, February 27, 2011

Checkpoint handling enhancements in Cluster 7.1 -or- *Ding Dong* GCP Stop is dead!

One of the tasks that a data node must perform reliably is the Global Check Point (GCP): flushing the transaction redo log to disk. GCP completion must be synchronized on all data nodes in order to maintain a consistent end point for recovery. A "GCP Stop" is detected when a new checkpoint cannot commit the redo log to disk because the previous one has not finished. Since MySQL Cluster's primary target audience has long been those wanting a real-time database with deterministic performance, it was considered better to simply shut down a node that is not keeping up than to allow it to slow down the entire system.

One way to encounter a GCP Stop is with very large transactions. Committing frequently is already recommended to avoid deadlocks, and it is recommended here as well to avoid GCP Stop. However, with newer features such as ndbmtd and on-disk tablespaces, the number of transactions to be flushed to disk during one GCP goes up, and at the same time there can be more contention for disk IO from DiskData objects. The likelihood of encountering a GCP Stop increases significantly when using either of these features, and the resulting node failures can cascade until the entire cluster shuts down. As Cluster matures it is increasingly looked to as a more general-purpose database, and the GCP Stop problem has become a particularly vicious thorn in the side of many cluster users. For many (I dare say "most") users it is preferable to sacrifice some real-time performance for system stability.

Enter 7.1.10! Version 7.1.10 introduces a few enhancements that address the GCP Stop problem.

First is the TimeBetweenEpochsTimeout parameter.

  • The maximum possible value for this parameter has been increased from 32000 milliseconds to 256000 milliseconds. Increasing this value has been shown to alleviate GCP Stop in some cases (see the example after this list).
  • Setting this parameter to zero now has the effect of disabling GCP stops caused by save timeouts, commit timeouts, or both.
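Both behaviors are controlled from the data node section of config.ini. As a quick illustration (the values here are examples, not recommendations):

    [ndbd default]
    # Raise the epoch timeout toward its new maximum of 256000 ms ...
    TimeBetweenEpochsTimeout=256000

    # ... or set it to 0 to disable GCP stop detection for these timeouts
    # TimeBetweenEpochsTimeout=0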
Next is the new RedoOverCommit behavior.

From the documentation: "When the GCP save takes longer than RedoOverCommitLimit (20) seconds, more than RedoOverCommitCounter (3) times, then any pending transactions are aborted. When this happens, the API node that sent the transaction can decide how to handle the transaction either by queuing and re-trying them, or by aborting them, as determined by DefaultOperationRedoProblemAction." This explanation is a bit simplistic and somewhat misleading, so I'll expand on it from what I know of the behavior.
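For reference, these two thresholds are data node parameters. A minimal config.ini excerpt using the default values quoted above might look like this:

    [ndbd default]
    RedoOverCommitLimit=20     # seconds (default)
    RedoOverCommitCounter=3    # times (default)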

NDBD now monitors the rate at which commits are written to the RedoBuffer and the rate at which they are flushed to the redo log. Every 128 ms the disk IO rate is estimated, and the estimated IO throughput is a rolling average of that rate over 8 sec. Every 1024 ms the cluster checks how long it would take to flush the entire RedoBuffer to disk (data in RedoBuffer / estimated IO throughput). If it would take longer than RedoOverCommitLimit to flush the RedoBuffer, a counter is incremented; otherwise the counter is reset. If the counter exceeds RedoOverCommitCounter, the node is flagged as having IO problems and error code "1234: REDO log files overloaded (increase disk hardware): Temporary error: Temporary Resource error" is raised. By default it takes a total of 24 sec. of overload, *not 60 sec.*, before the error is raised.
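To make that arithmetic concrete, here is a rough Python sketch of the detection loop as I understand it. This is not the actual ndbd code; the names and structure are mine, and only the intervals and the two parameters come from the description above:

    # Illustrative sketch of the redo overload detection described above.
    # Not the real ndbd implementation; names and structure are mine.

    SAMPLE_INTERVAL_MS = 128     # disk IO rate is sampled every 128 ms
    CHECK_INTERVAL_MS  = 1024    # flush-time estimate is checked every 1024 ms
    WINDOW_MS          = 8000    # rolling average over the last 8 seconds

    REDO_OVER_COMMIT_LIMIT   = 20   # seconds (default)
    REDO_OVER_COMMIT_COUNTER = 3    # consecutive bad checks (default)

    class RedoOverloadError(Exception):
        """Stands in for NDB error 1234: REDO log files overloaded."""

    io_samples = []       # recent IO-rate samples, bytes/second
    overload_count = 0

    def record_io_sample(bytes_flushed_last_interval):
        """Called every 128 ms: keep an 8-second window of IO-rate samples."""
        io_samples.append(bytes_flushed_last_interval * 1000.0 / SAMPLE_INTERVAL_MS)
        if len(io_samples) > WINDOW_MS // SAMPLE_INTERVAL_MS:
            io_samples.pop(0)

    def check_redo_overload(bytes_in_redo_buffer):
        """Called every 1024 ms: estimate how long flushing the RedoBuffer would take."""
        global overload_count
        estimated_rate = sum(io_samples) / max(len(io_samples), 1)   # rolling average
        flush_seconds = (bytes_in_redo_buffer / estimated_rate
                         if estimated_rate > 0 else float("inf"))
        if flush_seconds > REDO_OVER_COMMIT_LIMIT:
            overload_count += 1     # still falling behind
        else:
            overload_count = 0      # caught up, reset the counter
        if overload_count > REDO_OVER_COMMIT_COUNTER:
            raise RedoOverloadError("1234: REDO log files overloaded")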

When the DefaultOperationRedoProblemAction is "ABORT". The temporary error is raised to the client connection so the decision can be made in the application weather to notify the end user of the delay or if the application will attempt the retry itself.

When the DefaultOperationRedoProblemAction is "QUEUE" ndbd will block the user session until the transaction is successfully written to the RedoBuffer. Each time an operation is to be written to the redo log and the log had been flagged with error 410, 1220 or 1234: "REDO log files overloaded" the operation will be put into a new queue of prepared transactions that wait for processing after events that are already in commit/abort state.

Whether an operation should be queued or aborted can also be controlled at the per-operation level using the NDB API. Currently mysqld only uses the global default setting; setting the problem action per transaction at the SQL level is expected in a future release.

What this ultimately means is that instead of a node (and potentially the whole cluster) being shut down due to overload, the rate of transactions being committed is throttled down to a level that the redo log flushing can sustain.