Tuesday, June 5, 2012

Dynarr256 for DBACC -or- The death of MAX_ROWS

Back in 2006 we became aware of problems storing large numbers of rows in a single table in cluster. Johan Andersson and Yves Trudeau have each blogged about the problem and the common workaround. Since then we've done some cleanup to provide a more proper "Table is full" error message when running into this problem.

As explained in the referenced blog posts, the problem is the result of a limitation on the size of the hash index of each partition: the hash index for a partition can address at most ~49 million records. By default an ndbd or ndbmtd node has only 1 local query handler (LQH) block and thus 1 partition per node. With ndbmtd and MaxNoOfExecutionThreads = 4 or 8, the cluster will have 2 or 4 LQH blocks per node, respectively. So the point at which you may encounter the "Table is full" error varies with the size of your configuration. The recommended workaround was to instruct the NDBCLUSTER engine to create more partitions than the default by adding a MAX_ROWS clause to your ALTER or CREATE TABLE statements. However, the maximum value mysqld accepts for MAX_ROWS is 4294967295, which caps the number of partitions this hint can create at or around 88.

mysql> create table t1 (a bigint unsigned primary key ) engine=NDBCLUSTER MAX_ROWS=900000000000;
Query OK, 0 rows affected, 1 warning (1.06 sec)


mysql> show warnings\G
*************************** 1. row ***************************
Level: Warning
Code: 1105
Message: Ndb might have problems storing the max amount of rows specified
1 row in set (0.00 sec)

mysql> show create table t1\G
*************************** 1. row ***************************
Table: t1
Create Table: CREATE TABLE `t1` (
`a` bigint(20) unsigned NOT NULL,
PRIMARY KEY (`a`)
) ENGINE=ndbcluster DEFAULT CHARSET=latin1 MAX_ROWS=4294967295
1 row in set (0.00 sec)


Now, it is a little-known fact that there are a total of 8 partition slots allowed in each LQH. These 8 slots are divided by NoOfReplicas to give the number of "primary" partitions that can be defined. The maximum number of partitions that can be defined for a given table is calculated as:

#Nodes * #LQH per node * 8 / NoOfReplicas

In a two node cluster with NoOfReplicas=2 and MaxNoOfExecutionThreads=4 (2 LQH per node), the MAX_ROWS hint results in a total of 2*2*8/2 = 16 partitions, allowing ~784 million rows to be inserted. But a larger six node cluster, for example, with NoOfReplicas=2 and MaxNoOfExecutionThreads=8 (4 LQH per node) should allow a maximum of 6*4*8/2 = 96 partitions, or ~4.7Bn rows total. With the MAX_ROWS hint, however, cluster would only create 88 partitions, allowing ~4.3Bn rows per table.
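
You can verify how many partitions a table actually received using ndb_desc with its extra partition info option (a hedged sketch; the connection string host and the test database are illustrative):

shell> ndb_desc -c mgm_host -d test t1 -p

The FragmentCount field in the output reports the number of partitions, and -p adds per-partition detail.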

In those rare cases where the number of partitions created by the MAX_ROWS hint is not sufficient for the number of rows you will need to add, you can calculate the maximum number of partitions your configuration is capable of and explicitly specify that number using:

ALTER|CREATE TABLE... PARTITION BY KEY () PARTITIONS N;

The empty "BY KEY ()" definition defaults to using the primary key as the partitioning key. The trade off to this method is that when you scale up the cluster you could no longer use ALTER... REORGANIZE PARTITION to redistribute rows onto those newly added nodes in an online fashion.  The MAX_ROWS method adds no such limitation to REORGANIZE PARTITION.

*** CORRECTION ***

"The reports of my death are greatly exaggerated" -- Maximus R. Ows

Now, version 7.2.6 was intended to remove the limitation on the number of rows that may be stored in a given partition, after which, for most use cases, it would no longer be necessary to use the MAX_ROWS hint or explicit key-based partitioning to avoid "Table is full" errors. Unfortunately, due to a bug discovered at the last minute, the fix to remove the hash index size limitation was not actually pushed into 7.2.6.  :'-(  http://forums.mysql.com/read.php?25,544882,544882#msg-544882
 
There is still one limitation left which may require these MAX_ROWS / PARTITION BY KEY workarounds: the maximum size of the fixed-length portion of any partition is capped at 16 GB. This limitation is being addressed and will be removed in a future version.

Also, as of version 7.2.5 the upper limit of MaxNoOfExecutionThreads was raised from 8 to 44, and the number of allowed LQH blocks was raised from 4 to 16 per node. See also: http://dev.mysql.com/doc/refman/5.5/en/mysql-cluster-ndbd-definition.html#ndbparam-ndbmtd-threadconfig for how to configure more than 4 LQH. The documentation for this change is still a bit incomplete at the moment, but I am working to get that cleared up.
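
For those wanting to experiment, a hedged config.ini sketch of running 8 LQH (ldm) threads via ThreadConfig; the thread counts here are purely illustrative assumptions, so check the linked documentation for the exact grammar and limits for your version:

[ndbd default]
# 8 LDM (LQH) threads; the tc and recv counts are illustrative only.
ThreadConfig=ldm={count=8},tc={count=4},recv={count=2}

More LQH threads per node also means more partitions per node, which raises the point at which "Table is full" would be hit.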

Sunday, February 27, 2011

Checkpoint handling enhancements in Cluster 7.1 -or- *Ding Dong* GCP Stop is dead!

One of the tasks that a data node must perform reliably is the Global Check Point (GCP): flushing the transaction redo log to disk. GCP completion must be synchronized on all data nodes in order to maintain a consistent end point for recovery. A "GCP Stop" is detected when a new checkpoint cannot commit the redo log to disk because the previous one has not finished. Since MySQL Cluster's primary target audience has long been those wanting a real-time database with deterministic performance, it was considered better to simply shut down a node that is not keeping up than to allow it to slow down the entire system.

One way to encounter the GCP Stop is with very large transactions. Committing frequently is already recommended to avoid deadlocks, so it is recommended here as well to avoid the GCP Stop. However, with the use of newer features such as ndbmtd and on-disk tablespaces, the number of transactions to be flushed to disk during one GCP goes up, while DiskData objects contend for the same disk IO resources. The likelihood of encountering a GCP Stop increases significantly when using either of these features, and GCP Stops can lead to cascading node failures in which the entire cluster shuts down. As Cluster matures it is increasingly looked to as a more general purpose database, and the GCP Stop problem has become a particularly vicious thorn in the side of many cluster users. For many (I dare say "most") users it is preferable to sacrifice real-time performance for system stability.

Enter 7.1.10! Version 7.1.10 brings a few new enhancements that address the GCP Stop problem.

First is the TimeBetweenEpochsTimeout variable.

  • The maximum possible value for this parameter has been increased from 32000 milliseconds to 256000 milliseconds. Increasing this value has been shown to alleviate the GCP Stop in some cases.
  • Setting this parameter to zero now has the effect of disabling GCP stops caused by save timeouts, commit timeouts, or both (see the sketch just below this list).
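
A hedged config.ini sketch of the two options; the 64000 value is purely illustrative:

[ndbd default]
# Raise the timeout toward the new 256000 ms maximum...
TimeBetweenEpochsTimeout=64000
# ...or use 0 to disable these GCP stops entirely:
# TimeBetweenEpochsTimeout=0
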
Next is the new RedoOverCommit behavior.

From the documentation: "When the GCP save takes longer than RedoOverCommitLimit (20) seconds, more than RedoOverCommitCounter (3) times, then any pending transactions are aborted. When this happens, the API node that sent the transaction can decide how to handle the transaction either by queuing and re-trying them, or by aborting them, as determined by DefaultOperationRedoProblemAction." This explanation is a bit simplistic and somewhat misleading, so I'll expand on it from what I know of the behavior.

NDBD now monitors the rate at which commits are written to the RedoBuffer and the rate at which they are flushed to the redo log. Every 128 ms the disk IO rate is estimated, and the estimated IO throughput is kept as a rolling average of the IO rate over 8 seconds. Every 1024 ms the cluster checks how long it would take to flush the entire RedoBuffer to disk (data in RedoBuffer / estimated IO throughput). If it would take longer than RedoOverCommitLimit to flush the RedoBuffer, a counter is incremented; otherwise the counter is reset. If the counter exceeds RedoOverCommitCounter, the node is flagged as having IO problems and error code "1234: REDO log files overloaded (increase disk hardware): Temporary error: Temporary Resource error" is raised. By default it takes a total of 24 seconds of overload, *not 60 seconds* as the quoted documentation might suggest, before the error is raised.

When the DefaultOperationRedoProblemAction is "ABORT". The temporary error is raised to the client connection so the decision can be made in the application weather to notify the end user of the delay or if the application will attempt the retry itself.

When the DefaultOperationRedoProblemAction is "QUEUE" ndbd will block the user session until the transaction is successfully written to the RedoBuffer. Each time an operation is to be written to the redo log and the log had been flagged with error 410, 1220 or 1234: "REDO log files overloaded" the operation will be put into a new queue of prepared transactions that wait for processing after events that are already in commit/abort state.

Whether an operation should be queued or aborted can be controlled using the NDB API at a per-operation level; currently mysqld only uses the global default setting. Setting the problem action per transaction at the SQL level is expected in a future release.
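
To summarize the knobs involved, a hedged config.ini sketch (the 20/3 defaults come from the documentation quoted above; the section placement is my assumption, so verify against the reference manual):

[ndbd default]
# Seconds a projected RedoBuffer flush may take before a check counts as overload.
RedoOverCommitLimit=20
# Consecutive overload checks tolerated before error 1234 is raised.
RedoOverCommitCounter=3

[mysqld default]
# ABORT surfaces error 1234 to the application; QUEUE blocks and queues instead.
DefaultOperationRedoProblemAction=QUEUE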

What this means, ultimately, is that instead of a node (and potentially the whole cluster) being shut down due to overload, the number of transactions being committed is throttled down to a level that the redo log flushing can sustain.

Saturday, July 17, 2010

MySQL-Cluster RPMs: What packages to install on which system.

Oracle distributes a series of 12 RPM packages that make up the MySQL Cluster product. Lately I've been getting a lot of questions asking which of these dozen RPM packages should be installed onto which nodes.

An overview of the contents of each RPM package is provided in the documentation at http://dev.mysql.com/doc/refman/5.1/en/linux-rpm.html and http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-multi-install.html. I think this section of the documentation is a bit messy, so I'll give a short description of each here:

MySQL-Cluster-gpl-client

This package contains the standard MySQL clients and administration tools.
Contains: mysql, mysqladmin, mysqlcheck, mysqldump, etc...
This package should be installed on API nodes.

MySQL-Cluster-gpl-clusterj

This package contains the MySQL Cluster Connector for Java. (OpenJPA)
Contains: libmyjapi.so , clusterj-api.jar, clusterj.jar, clusterjpa.jar, etc...
This package should be installed on Java application (API) nodes that will connect directly to the cluster data nodes, bypassing the MySQL server for simple queries. These hosts will also use Connector/J to connect to mysqld nodes for complex queries.

MySQL-Cluster-gpl-debuginfo

This package contains debug information for the MySQL-Cluster-gpl packages.
Contains: mysqld.debug, various .c/.cpp .h/.hpp files for debugging mysql applications.
This package is optional for production and recommended for development environments that are building mysql or ndbapi applications. (ALL nodes)

MySQL-Cluster-gpl-devel

This package contains the development header files and libraries necessary to develop MySQL client applications.
Contains: mysql_config, mysql.h, mgmapi.h, NdbDictionary.hpp, etc...
This package is optional in production and required for development environments that are building mysql or ndbapi applications. (ALL nodes)

MySQL-Cluster-gpl-embedded

This package contains the MySQL server as an embedded library.
Contains: libmysqld.a, libmysqld-debug.a
This package should be installed on hosts with applications that embed libmysqld (API nodes). It is not typically used in cluster deployments.

MySQL-Cluster-gpl-extra

This package contains some extra ndbcluster storage engine tools for the advanced user.
Contains: ndb_delete_all, ndb_drop_index, ndb_drop_table, ndb_cpcd
This package is not typically required, but could optionally be installed on API nodes.

MySQL-Cluster-gpl-management

This package contains the ndbcluster storage engine management server. It is necessary to have this package installed on at least one computer in the cluster.
Contains: ndb_mgmd
This package provides the MGM node and should *not* be installed on hosts that contain MySQL-Cluster-gpl-storage.

MySQL-Cluster-gpl-server

This package includes the MySQL server binary as well as related utilities to run and administer a MySQL server.
Contains: my.cnf, myisamchk, mysql_install_db, mysqld_safe, mysqlhotcopy, mysqld, etc...
This package should be installed on API nodes.

MySQL-Cluster-gpl-shared

This package contains the shared libraries (*.so*) which certain languages and applications need to dynamically load and use MySQL.
Contains: libmysqlclient.so, libmysqlclient_r.so, libndbclient.so
This package should be installed on API nodes.

MySQL-Cluster-gpl-storage

This package contains the ndbcluster storage engine.
Contains: ndbd, ndbmtd
This package should be installed on NDBD nodes.

MySQL-Cluster-gpl-test

This package contains the MySQL regression test suite.
Contains: mysqltest_embedded, *.result, *.test
This package is optional for development environments (ALL nodes).

MySQL-Cluster-gpl-tools

This package contains the ndbcluster storage engine basic tools.
Contains: ndb_config, ndb_desc, ndb_error_reporter, ndb_mgm, ndb_restore, ndb_show_tables, ndb_size.pl, ndb_waiter
This package should be installed on all nodes.
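
Putting it all together, a hedged sketch of a typical installation per node type (file names are abbreviated with shell globs and will vary by version and platform):

# SQL / API node:
shell> rpm -ivh MySQL-Cluster-gpl-server-*.rpm MySQL-Cluster-gpl-client-*.rpm MySQL-Cluster-gpl-shared-*.rpm MySQL-Cluster-gpl-tools-*.rpm

# Data (NDBD) node:
shell> rpm -ivh MySQL-Cluster-gpl-storage-*.rpm MySQL-Cluster-gpl-tools-*.rpm

# Management (MGM) node:
shell> rpm -ivh MySQL-Cluster-gpl-management-*.rpm MySQL-Cluster-gpl-tools-*.rpm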

Friday, May 28, 2010

NDB ENGINE LIMITATIONS (rebuttal)

Anirudh Tamsekar made a post yesterday that laid out a few of the limitations of MySQL Cluster that seem to cause him the most pain. However, his assessment of the situation is quite misleading. A few of his statements are inaccurate, and more than half of the limitations he cites are outright false. Since comments on Anirudh's blog are being moderated, I give my rebuttal here, and cite sources.

· Database names, table names, and attribute names cannot be as long in NDB tables as with other table handlers. In NDB, attribute names are truncated to 31 characters, and if they are not unique after truncation, errors occur. Database names and table names can total a maximum of 122 characters

False: "Identifiers. Formerly (in MySQL 5.0 and earlier), database names, table names and attribute names could not be as long for NDB tables as tables using other storage engines, because attribute names were truncated internally. In MySQL 5.1 and later, names of MySQL Cluster databases, tables, and table columns follow the same rules regarding length as they do for any other storage engine."
-- http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-limitations-resolved.html

· NDB does not support prefix indexes; only entire fields can be indexed.

True

· A big limitation is that in MySQL 4.1 and 5.0, all cluster table rows are of fixed length. This means, for example, that if a table has one or more VARCHAR fields containing only relatively small values, more memory and disk space will be required when using the NDB storage engine than would be for the same table and data using the MyISAM engine. This issue is on the “to-fix” list for MySQL Cluster 5.1.

False: As of 5.1 (which has been GA for over a year and a half) this limitation applies only to on-disk columns.

"* Variable-length column support. The NDBCLUSTER storage engine now supports variable-length column types for in-memory tables.
Previously, for example, any Cluster table having one or more VARCHAR fields which contained only relatively small values, much more memory and disk space were required when using the NDBCLUSTER storage engine than would have been the case for the same table and data using the MyISAM engine. In other words, in the case of a VARCHAR column, such a column required the same amount of storage as a CHAR column of the same size. In MySQL 5.1, this is no longer the case for in-memory tables, where storage requirements for variable-length column types such as VARCHAR and BINARY are comparable to those for these column types when used in MyISAM tables (see Section 10.5, “Data Type Storage Requirements”). "
-- http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-limitations-resolved.html

· In NDB, the maximum number of metadata objects is limited to 20,000, including database tables, system tables, indexes, and BLOBs (binary large objects). This is a hard-coded limit that you cannot override with a configuration option.

True: However, the actual number is 20320.

· The maximum permitted size of any one row in NDB is 8KB, not including data stored in BLOB columns (which are actually stored in a separate table internally).

True: However, as of version 7.2 the variable-length part of a row supports up to 14KB, with the fixed portion still limited to 8KB.

There is actually a constant you can modify at compile time to increase the maximum row length (expressed as a number of 4-byte words).

Current values are:
./storage/ndb/include/kernel/ndb_limits.h
#define MAX_TUPLE_SIZE_IN_WORDS 3500
./storage/ndb/include/ndbapi/ndbapi_limits.h
#define NDB_MAX_TUPLE_SIZE_IN_WORDS 3500
and
./src/ndbjtie/com/mysql/ndbjtie/ndbapi/NDBAPIConst.java
NDB_MAX_TUPLE_SIZE_IN_WORDS = 3500,

Changing these values is untested and unsupported so your mileage may vary.

· The maximum number of attributes per key in NDB is 32.

True:

· Autodiscovery of databases is not supported in NDB for multiple MySQL servers accessing the same cluster in MySQL Cluster. (You have to add each database manually on each SQL node.)

False: "Autodiscovery of databases is now supported for multiple MySQL servers accessing the same MySQL Cluster. Formerly, autodiscovery in MySQL Cluster 5.1 and MySQL Cluster NDB 6.x releases required that a given mysqld was already running and connected to the cluster at the time that the database was created on a different mysqld—in other words, when a mysqld process connected to the cluster after a database named db_name was created, it was necessary to issue a CREATE DATABASE db_name or CREATE SCHEMA db_name statement on the “new” MySQL server when it first accesseed that MySQL Cluster. Beginning with MySQL Cluster NDB 6.2.16 and MySQL Cluster NDB 6.3.18, such a CREATE statement is no longer required. (Bug#39612)

This also means that online schema changes in NDB tables are now possible. That is, the result of operations such as ALTER TABLE and CREATE INDEX performed on one SQL node in the cluster are now visible to the cluster's other SQL nodes without any additional action being taken."
-- http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-limitations-resolved.html

· MySQL replication does not work correctly in NDB if updates are done on multiple MySQL servers; replication between clusters is on the feature list for MySQL 5.1.

False: Again, 5.1 has been GA for over a year and a half; there is no need to treat this as a forward-looking statement.

"* Replication with MySQL Cluster. It is now possible to use MySQL replication with Cluster databases. For details, see Section 17.6, “MySQL Cluster Replication”.

Circular Replication. Circular replication is also supported with MySQL Cluster, beginning with MySQL 5.1.18. See Section 17.6.10, “MySQL Cluster Replication: Multi-Master and Circular Replication”.

* auto_increment_increment and auto_increment_offset. The auto_increment_increment and auto_increment_offset server system variables are supported for Cluster replication beginning with MySQL 5.1.20, MySQL Cluster NDB 6.2.5, and MySQL Cluster 6.3.2."
-- http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-limitations-resolved.html

However, since an SQL node cannot know of the statements executed on another SQL node, only row-based replication is supported:
"Replication. Statement-based replication is not supported. Use --binlog-format=ROW (or --binlog-format=MIXED) when setting up cluster replication. See Section 17.6, “MySQL Cluster Replication”, for more information. "

· ALTER TABLE is not fully locking in NDB when you’re running multiple MySQL servers.

False: With the changes to support discovery of CREATE DATABASE in MySQL Cluster NDB 6.2.16 and MySQL Cluster NDB 6.3.18, a global schema lock was introduced which consistently locks tables across all SQL nodes in the cluster during DDL operations.

· All storage and management nodes within a cluster in NDB must have the same architecture. This restriction does not apply to machines simply running SQL nodes or any other clients that may be accessing the cluster.

Misleading: All data nodes and management nodes must be of the same endianness. Machines simply running "mysql" clients have no such restriction, but SQL nodes themselves do have the limitation of being the same endian type as the data nodes; i.e. a Linux x86 machine cannot be a mysqld (SQL node) front end to ndbd (data nodes) running on Solaris SPARC.
-- http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-limitations-exclusive-to-cluster.html

· It is not possible to make online schema changes in NDB, such as those accomplished using ALTER TABLE or CREATE INDEX. (However, you can import or create a table that uses a different storage engine and then convert it to NDB by using ALTER TABLE tbl_name ENGINE=NDBCLUSTER;.) ALTER TABLE works on occasions, but all it does is create a new table with the new structure and then import the data. This generally causes an error as NDB hits a limit somewhere. It is strongly recommended that you not use ALTER TABLE to make online schema changes.

False: Again with the schema changes... See above.

· Adding or removing nodes online is not possible in NDB. (The cluster must be restarted in such cases.)

False: "In MySQL Cluster NDB 7.0 (beginning with MySQL Cluster NDB 6.4.0) and later MySQL Cluster release series, it is possible to add new data nodes to a running MySQL Cluster by performing a rolling restart, so that the cluster and the data stored in it remain available to applications."
-- http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-limitations-resolved.html

Removal of data nodes online is possible for nodes that were added online; nodes that were in the cluster at initial system startup cannot be removed without a total shutdown and restore. However, online removal is not thoroughly tested, so it is not officially supported.

· The maximum number of storage nodes within an NDB cluster is 48.

True:

· The total maximum number of nodes in a cluster in MySQL Cluster is 63. This number includes all MySQL servers (that is, SQL nodes), storage nodes, and management servers.

False: "Starting with MySQL Cluster NDB 6.1.1, the total maximum number of nodes in a MySQL Cluster is 255, including all SQL nodes (MySQL Servers), API nodes (applications accessing the cluster other than MySQL servers), data nodes, and management servers. The total number of data nodes and management nodes beginning with this version is 63, of which up to 48 can be data nodes.

Note: The limitation that a data node cannot have a node ID greater than 49 continues to apply."
-- http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-limitations-resolved.html