
Tuesday, June 5, 2012

Dynarr256 for DBACC -or- The death of MAX_ROWS

Back in 2006 we became aware of problems storing large numbers of rows in a single table in cluster. Johan Andersson and Yves Trudeau have each blogged about the problem and the common workaround here and here.  We've since done some cleanup to provide a proper "Table is full" error message when running into this problem.

As explained in the referenced blog posts, the problem is the result of a limitation on the size of the hash index of each partition. The hash index for each partition allows at most ~49 million records. By default an ndbd or ndbmtd node has only 1 local query handler (LQH) block and thus 1 partition per node.  With ndbmtd and MaxNoOfExecutionThreads = 4 or 8, each node has 2 or 4 LQH blocks respectively. So, the point at which you may encounter the "Table is full" error varies with the size of your configuration.  The recommended workaround was to instruct the NDBCLUSTER engine to create more partitions than the default by adding a MAX_ROWS clause to your ALTER or CREATE TABLE statements. However, the maximum value for MAX_ROWS accepted by mysqld is 4294967295, which caps the number of partitions this hint can create at or around 88.

mysql> create table t1 (a bigint unsigned primary key ) engine=NDBCLUSTER MAX_ROWS=900000000000;
Query OK, 0 rows affected, 1 warning (1.06 sec)


mysql> show warnings\G
*************************** 1. row ***************************
Level: Warning
Code: 1105
Message: Ndb might have problems storing the max amount of rows specified
1 row in set (0.00 sec)

mysql> show create table t1\G
*************************** 1. row ***************************
Table: t1
Create Table: CREATE TABLE `t1` (
`a` bigint(20) unsigned NOT NULL,
PRIMARY KEY (`a`)
) ENGINE=ndbcluster DEFAULT CHARSET=latin1 MAX_ROWS=4294967295
1 row in set (0.00 sec)


Now, it is a little-known fact that a total of 8 partitions are allowed in each LQH. These 8 slots are divided by NoOfReplicas to give the number of "primary" partitions that can be defined. The maximum number of partitions that can be defined for a given table is calculated as:

#Nodes * #LQH per node * 8 / NoOfReplicas

In a two node cluster, with NoOfReplicas=2 and MaxNoOfExecutionThreads=4 (2 LQH per node), the MAX_ROWS hint would result in a total of 16 partitions, allowing 784 million rows to be inserted.  But a larger six node cluster, for example, with NoOfReplicas=2 and MaxNoOfExecutionThreads=8 (4 LQH per node) should allow a maximum of 6*4*8/2 = 96 partitions, or ~4.7Bn rows total.  With the MAX_ROWS hint, however, the cluster would only create 88 partitions, allowing ~4.3Bn rows per table.

In those rare cases where the number of partitions created by the MAX_ROWS hint is not sufficient for the number of rows you will need to add, you can calculate the maximum number of partitions your configuration supports and explicitly specify that number using:

ALTER|CREATE TABLE... PARTITION BY KEY () PARTITIONS N;

The empty "BY KEY ()" definition defaults to using the primary key as the partitioning key. The trade off to this method is that when you scale up the cluster you could no longer use ALTER... REORGANIZE PARTITION to redistribute rows onto those newly added nodes in an online fashion.  The MAX_ROWS method adds no such limitation to REORGANIZE PARTITION.

*** CORRECTION ***

"The reports of my death are greatly exaggerated" -- Maximus R. Ows

Now, in version 7.2.6 there is no longer any limitation on the number of rows that may be stored in a given partition, and for most use cases it should no longer be necessary to use the MAX_ROWS hint or perform explicit key-based partitioning to avoid "Table is full" errors.  Due to a bug discovered at the last minute, the fix to remove the hash index size limitation, which was intended for 7.2.6, was not actually pushed.  :'-(  http://forums.mysql.com/read.php?25,544882,544882#msg-544882
 
There is still one limitation left which may require the MAX_ROWS / PARTITION BY KEY workarounds: the maximum size of the fixed-length portion of any partition is limited to 16 GB.  This limitation is being addressed and will be removed in a future version.
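
As a rough, purely illustrative calculation, assuming a hypothetical table whose fixed-size portion comes to 32 bytes per row including overhead:

16 GB / 32 bytes per row ≈ 536 million rows per partition

Whether this limit or the partition count is the first thing you hit depends entirely on how wide the fixed-size part of your rows is.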

Also, in version 7.2.5 the upper limit of MaxNoOfExecutionThreads was raised from 8 to 44, and the number of allowed LQH blocks per node was raised from 4 to 16.  See also:  http://dev.mysql.com/doc/refman/5.5/en/mysql-cluster-ndbd-definition.html#ndbparam-ndbmtd-threadconfig for how to configure more than 4 LQH.  The documentation for this change is still a bit incomplete at the moment, but I am working to get that cleared up.
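
A minimal config.ini sketch of the two alternatives; the thread counts are just an example and the ThreadConfig line in particular is my own illustration, so check the documentation linked above for the authoritative syntax and thread types:

[ndbd default]
# Option 1: raise the overall thread count and let ndbmtd divide the threads up
MaxNoOfExecutionThreads=16
# Option 2 (instead of the above): spell out the thread types explicitly,
# e.g. 8 LQH (ldm) threads and 4 TC threads
# ThreadConfig=ldm={count=8},tc={count=4}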

Sunday, February 27, 2011

Checkpoint handling enhancements in Cluster 7.1 -or- *Ding Dong* GCP Stop is dead!

One of the tasks that a data node must perform reliably is the Global Checkpoint (GCP); that is, flushing the transaction redo log to disk. The GCP completion must be synchronized on all data nodes in order to maintain a consistent end point for recovery. A "GCP Stop" is detected when a new checkpoint cannot commit the redo log to disk because the previous one has not finished. Since MySQL Cluster's primary target audience has long been those wanting a real-time database with deterministic performance, it was better to simply shut down a node that is not keeping up than to allow it to slow down the entire system.

One way to encounter the GCP Stop is with very large transactions. It is recommended to commit frequently to avoid deadlocks anyway, so it is also recommended here to avoid GCP Stop. However, with the use of new features such as ndbmtd and on-disk tablespaces, the number of transactions to be flushed to disk during one GCP goes up. At the same time there can be more contention for those disk IO resources by DiskData objects. The likelihood of encountering a GCP Stop increases significantly when using either of these features, and it can lead to cascading node failures where the entire cluster shuts down. As Cluster matures it is increasingly looked to as a more general purpose database. The GCP Stop problem has become a particularly vicious thorn in the side of many cluster users. For many (I dare say "most") users it is preferable to sacrifice real-time performance for system stability.

Enter 7.1.10! With version 7.1.10 there are a few new enhancements that address the GCP Stop problem.

First is the TimeBetweenEpochsTimeout variable.

  • The maximum possible value for this parameter has been increased from 32000 milliseconds to 256000 milliseconds. Increasing this value has been shown to alleviate the GCP Stop in some cases.
  • Setting this parameter to zero now has the effect of disabling GCP stops caused by save timeouts, commit timeouts, or both (see the config sketch just below).
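
A minimal config.ini sketch, assuming you simply want to disable GCP-stop node shutdowns rather than lengthen the timeout:

[ndbd default]
# 0 disables GCP stops caused by save and commit timeouts (7.1.10 and later)
TimeBetweenEpochsTimeout=0
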
Next is the new RedoOverCommit behavior.

From the documentation: "When the GCP save takes longer than RedoOverCommitLimit (20) seconds, more than RedoOverCommitCounter (3) times, then any pending transactions are aborted. When this happens, the API node that sent the transaction can decide how to handle the transaction either by queuing and re-trying them, or by aborting them, as determined by DefaultOperationRedoProblemAction." This explanation is a bit simplistic and somewhat misleading, so I'll expand on it from what I know of the behavior.

NDBD now monitors the rate at which commits are written to the RedoBuffer and the rate at which they are flushed to the redo log. Every 128ms the disk IO rate is estimated. The estimated IO throughput is a rolling average of the IO rate over 8 sec. Every 1024ms the cluster checks how long it would take to flush the entire RedoBuffer to disk (Data In RedoBuffer / Estimated IO throughput). If it would take longer than RedoOverCommitLimit to flush the RedoBuffer, a counter is incremented; otherwise the counter is reset. If the counter exceeds RedoOverCommitCounter, the node is flagged as having IO problems and error code "1234: REDO log files overloaded (increase disk hardware): Temporary error: Temporary Resource error" is raised. By default it takes a total of 24 sec. of overload *not 60 sec.* before the error is raised.
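
Both thresholds are data node parameters set in config.ini. A minimal sketch, showing nothing more than the defaults quoted above:

[ndbd default]
# estimated seconds of un-flushed RedoBuffer data tolerated before a check counts as overload
RedoOverCommitLimit=20
# number of consecutive over-limit checks before error 1234 is raised
RedoOverCommitCounter=3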

When the DefaultOperationRedoProblemAction is "ABORT". The temporary error is raised to the client connection so the decision can be made in the application weather to notify the end user of the delay or if the application will attempt the retry itself.

When the DefaultOperationRedoProblemAction is "QUEUE" ndbd will block the user session until the transaction is successfully written to the RedoBuffer. Each time an operation is to be written to the redo log and the log had been flagged with error 410, 1220 or 1234: "REDO log files overloaded" the operation will be put into a new queue of prepared transactions that wait for processing after events that are already in commit/abort state.

Whether an operation should be queued or aborted can also be controlled through the NDB API on a per-operation level. Currently mysqld only uses the global default setting. Setting the problem action per transaction at the SQL level is expected in a future release.

What this means, ultimately, is that instead of a node (and potentially the whole cluster) being shut down due to overload, the number of transactions being committed is throttled down to a level the redo log flushing can sustain.