【SQL】MySql InnoDB count(*) slow－兔窩

【SQL】MySql InnoDB count(*) slow

問題：

　　此篇文章要談的問題是 SQL 速度優化相關，問題為 MySql InnoDB 在資料量大時想取得資料總數、執行 SELECT count(*) 的時候速度慢。

　　描述這個問題我們借用一下 w3schools MySql 教學中的 Try it Yourself 功能，並使用其中名為 OrderDetails 的 table 作為範例。

SHOW CREATE TABLE OrderDetails;
CREATE TABLE `OrderDetails` (
    `OrderDetailID` int(11) NOT NULL AUTO_INCREMENT,
    `OrderID` int(11) DEFAULT NULL,
    `ProductID` int(11) DEFAULT NULL,
    `Quantity` int(11) DEFAULT NULL,
    PRIMARY KEY (`OrderDetailID`)
) ENGINE=InnoDB AUTO_INCREMENT=2156 DEFAULT CHARSET=latin1

　　EXPLAIN SELECT count(*) 顯示 type index、key PRIMARY、rows 2122、Extra Using index，就是在對 PRIMARY KEY Index 做全掃描。而問題就來自於此，當資料量越大、PRIMARY KEY Index 越大、掃描越慢。

解答：

　　首先需要先了解為什麼 SELECT count(*) 需要全掃描，MySql 官方文件的解釋如下：

InnoDB does not keep an internal count of rows in a table because concurrent transactions might “see” different numbers of rows at the same time.

Consequently, SELECT COUNT(*) statements only count rows visible to the current transaction.

來源：MySQL :: MySQL 8.0 Reference Manual :: 15.23 InnoDB Restrictions and Limitations

　　原因來自於 InnoDB 支援 transaction，每個並行中的 transaction 之操作 (刪除、新增) 會導致其看見的總數是不一的，並不存在正確又唯一的總數值。最終導致各 transaction 執行 SELECT count(*) 時不論如何都需要掃描某個 index 來自己計算該 transaction 看見的總數。

　　在不避免掃描的前提下，可以藉由為 PRIMARY KEY 或資料型態儲存容量較小的欄位新增 Secondary Index，並使用該 Secondary Index 來掃描來加速。因為 InnoDB 的 PRIMARY KEY 為 Cluster Index，Cluster Index 內會包含部分 KEY 以外 (其他欄位) 的資料，這些資料在掃描時會需要額外的 I/O、記憶體，反之 Secondary Index 則只會儲存該 KEY 以及對應的 PRIMARY KEY，容量較小取得時的資源消耗低，速度自然較快。

　　有關 Cluster Index 與 Secondary Index 的介紹可以參考這篇文章：淺談 InnoDB 的 Cluster Index 和 Secondary Index | by Genchi Lu | Medium，寫得很清楚。

　　而官方文件則建議了兩種避免掃描的取捨方案：

Processing SELECT COUNT(*) statements takes some time if index records are not entirely in the buffer pool.

For a faster count, create a counter table and let your application update it according to the inserts and deletes it does.

However, this method may not scale well in situations where thousands of concurrent transactions are initiating updates to the same counter table.

If an approximate row count is sufficient, use SHOW TABLE STATUS.

來源：MySQL :: MySQL 8.0 Reference Manual :: 12.20.1 Aggregate Function Descriptions

使用額外儲存空間自行維護總數 (對於寫入刪除頻繁的系統要小心，這個總數值會大量且同時的被 Update，造成效能瓶頸或 race condition)
使用 SHOW TABLE STATUS 或 SELECT information_schema table 等方式從 table 狀態中取得近似值。

　　使用 SELECT information_schema 來取得範例 table 的總數近似值。

延伸閱讀：

　　由於這個問題來自支援 transaction，故 PostgreSQL 也有一樣的問題存在，同時也提供近似值存取。

The reason is related to the MVCC implementation in PostgreSQL. The fact that multiple transactions can see different states of the data means that there can be no straightforward way for "COUNT(*)" to summarize data across the whole table.

PostgreSQL must walk through all rows to determine visibility. This normally results in a sequential scan reading information about every row in the table.

PostgreSQL can query "estimates" or "cached" values for a table's size (much faster). See Count estimate for details.

來源：Slow Counting - PostgreSQL wiki

　　而不支援 transaction 的 MySql MyISAM 理所當然的就沒有這個問題，正確又唯一的總數值被額外存起來。

For MyISAM tables, COUNT(*) is optimized to return very quickly if the SELECT retrieves from one table, no other columns are retrieved, and there is no WHERE clause.

This optimization only applies to MyISAM tables, because an exact row count is stored for this storage engine and can be accessed very quickly.

COUNT(1) is only subject to the same optimization if the first column is defined as NOT NULL.

來源：MySQL :: MySQL 8.0 Reference Manual :: 12.20.1 Aggregate Function Descriptions

參考資料：

[1] MySQL Tutorial

https://www.w3schools.com/mySQl/default.asp

[2] MySQL :: MySQL 8.0 Reference Manual :: 15.23 InnoDB Restrictions and Limitations

https://dev.mysql.com/doc/refman/8.0/en/innodb-restrictions-limitations.html

[3] MySQL :: MySQL 8.0 Reference Manual :: 12.20.1 Aggregate Function Descriptions

https://dev.mysql.com/doc/refman/8.0/en/aggregate-functions.html

[4] mysql - Why is count(*) slow, when explain knows the answer? - Database Administrators Stack Exchange

https://dba.stackexchange.com/questions/184685/why-is-count-slow-when-explain-knows-the-answer