Query Optimization for a 500-Million-Row Single Table | MySQL, StarRocks

Published: 2026/6/24 8:06:58

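The hardware of the test server is listed below. Besides MySQL and StarRocks, this machine also runs many other Docker services.

| Item | Value |
| --- | --- |
| CPU | AMD Ryzen™ 7 8745H w/ Radeon™ 780M Graphics × 16 |
| Memory | DDR5 5600 MT/s, 32 GB (16 GB × 2) |
| Disk performance | Timing cached reads: 64174 MB in 1.99 seconds = 32256.62 MB/sec; Timing buffered disk reads: 3562 MB in 3.00 seconds = 1186.39 MB/sec |

Initializing the database environment

MySQL:

```sql
CREATE TABLE users (
  id bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  username varchar(32) NOT NULL,
  phone char(11) NOT NULL,
  email varchar(64) NOT NULL,
  gender tinyint(3) unsigned NOT NULL DEFAULT 0 COMMENT '0 unknown, 1 male, 2 female',
  age tinyint(3) unsigned NOT NULL DEFAULT 0,
  status tinyint(3) unsigned NOT NULL DEFAULT 1 COMMENT '1 active, 2 disabled, 3 deleted',
  province_id smallint(5) unsigned NOT NULL DEFAULT 0,
  city_id mediumint(8) unsigned NOT NULL DEFAULT 0,
  register_source tinyint(3) unsigned NOT NULL DEFAULT 1 COMMENT '1 web, 2 ios, 3 android, 4 api',
  score int(10) unsigned NOT NULL DEFAULT 0,
  created_at datetime NOT NULL,
  updated_at datetime NOT NULL,
  last_login_at datetime DEFAULT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY uk_phone (phone),
  KEY idx_status_created_id (status, created_at, id),
  KEY idx_created_at_id (created_at, id),
  KEY idx_email (email),
  KEY users_created_at_index (created_at)
) ENGINE=InnoDB AUTO_INCREMENT=935300001 DEFAULT CHARSET=utf8mb4 ROW_FORMAT=DYNAMIC;
```

StarRocks:

```sql
CREATE TABLE users (
  id bigint NOT NULL COMMENT "",
  created_at datetime NOT NULL COMMENT "",
  username varchar(32) NOT NULL COMMENT "",
  phone varchar(11) NOT NULL COMMENT "",
  email varchar(64) NOT NULL COMMENT "",
  gender tinyint NOT NULL DEFAULT "0" COMMENT "",
  age tinyint NOT NULL DEFAULT "0" COMMENT "",
  status tinyint NOT NULL DEFAULT "1" COMMENT "",
  province_id smallint NOT NULL DEFAULT "0" COMMENT "",
  city_id int NOT NULL DEFAULT "0" COMMENT "",
  register_source tinyint NOT NULL DEFAULT "1" COMMENT "",
  score int NOT NULL DEFAULT "0" COMMENT "",
  updated_at datetime NOT NULL COMMENT "",
  last_login_at datetime NULL COMMENT ""
) ENGINE=OLAP
PRIMARY KEY(id, created_at)
PARTITION BY RANGE(created_at) (
  START ("2020-01-01") END ("2026-12-31") EVERY (INTERVAL 1 MONTH)
)
DISTRIBUTED BY HASH(id) BUCKETS 8
PROPERTIES (
  "replication_num" = "1"
);
```

Python script for loading the data

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime, timedelta

import pymysql

HOST = "192.168.1.1"
PORT = 3306  # or 9030 for StarRocks
USER = "root"
PASSWORD = "123456"
DATABASE = "testdata"

START_ID = 1
TOTAL_ROWS = 500_000_000
WORKERS = 8
BATCH_SIZE = 10_000
BASE_TIME = datetime(2024, 1, 1, 0, 0, 0)

INSERT_SQL = """
INSERT INTO users (
    id, username, phone, email, gender, age, status, province_id,
    city_id, register_source, score, created_at, updated_at, last_login_at
) VALUES (
    %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s
)
"""


def make_conn():
    return pymysql.connect(
        host=HOST,
        port=PORT,
        user=USER,
        password=PASSWORD,
        database=DATABASE,
        charset="utf8mb4",
        autocommit=False,
        read_timeout=300,
        write_timeout=300,
        connect_timeout=30,
    )


def build_rows(start_id: int, end_id: int):
    # Build deterministic rows for ids in [start_id, end_id).
    rows = []
    for n in range(start_id, end_id):
        created_at = BASE_TIME + timedelta(seconds=n % 31_536_000)
        updated_at = created_at
        last_login_at = created_at + timedelta(days=n % 30)
        rows.append((
            n,
            f"user_{n}",
            f"1{n:010d}",
            f"user_{n}@test.local",
            n % 3,
            18 + (n % 43),
            2 if n % 20 == 0 else 1,
            (n % 34) + 1,
            (n % 340) + 1,
            (n % 4) + 1,
            n % 100000,
            created_at.strftime("%Y-%m-%d %H:%M:%S"),
            updated_at.strftime("%Y-%m-%d %H:%M:%S"),
            last_login_at.strftime("%Y-%m-%d %H:%M:%S"),
        ))
    return rows


def worker(worker_no: int, start_id: int, end_id: int):
    # Insert ids in [start_id, end_id] in batches, committing after each batch.
    conn = make_conn()
    inserted = 0
    try:
        with conn.cursor() as cur:
            current = start_id
            while current <= end_id:
                next_id = min(current + BATCH_SIZE, end_id + 1)
                rows = build_rows(current, next_id)
                cur.executemany(INSERT_SQL, rows)
                conn.commit()
                inserted += len(rows)
                current = next_id
                if inserted % 100000 == 0 or current > end_id:
                    print(f"worker={worker_no} inserted={inserted} range={start_id}-{end_id}")
    finally:
        conn.close()


def split_ranges(start_id: int, total_rows: int, workers: int):
    # Split the id space into contiguous, inclusive ranges, one per worker.
    base = total_rows // workers
    remain = total_rows % workers
    current = start_id
    result = []
    for i in range(workers):
        size = base + (1 if i < remain else 0)
        s = current
        e = current + size - 1
        result.append((i + 1, s, e))
        current = e + 1
    return result


def main():
    ranges = split_ranges(START_ID, TOTAL_ROWS, WORKERS)
    print("ranges:", ranges)
    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        futures = [pool.submit(worker, worker_no, s, e) for worker_no, s, e in ranges]
        for future in as_completed(futures):
            future.result()
    print("done")


if __name__ == "__main__":
    main()
```

How much space do 500 million rows actually take?

In this test MySQL holds 460 million rows, because the 500-million-row load did not finish during testing. StarRocks was loaded with the planned 500 million rows. The storage footprint of each is shown below.

For MySQL:

```
8.0K    ./testdata/users.frm
136G    ./testdata/users.ibd
4.0K    ./testdata/db.opt
136G    ./testdata
```

StarRocks uses 256 GB.

The MySQL table carries quite a few indexes. The following SQL reports the size of the table's indexes and of its data:

```sql
SELECT
  TABLE_NAME AS table_name,
  CONCAT(ROUND((INDEX_LENGTH / 1024 / 1024), 2), ' MB') AS index_size,
  CONCAT(ROUND((DATA_LENGTH / 1024 / 1024), 2), ' MB') AS data_size,
  CONCAT(ROUND(((INDEX_LENGTH + DATA_LENGTH) / 1024 / 1024), 2), ' MB') AS total_size
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'testdata' AND TABLE_NAME = 'users';
```

| Metric | Value | Converted |
| --- | --- | --- |
| Index size | 79943.56 MB | ≈ 78.07 GB |
| Data size | 51887.00 MB | ≈ 50.67 GB |
| Total size | 131830.56 MB | ≈ 128.74 GB |

My machine hosts many other services, so in day-to-day operation the two databases, MySQL and StarRocks, should need less than 10 GB of memory.

Counting rows

Counting all rows of a large table with select count(*) is a common business scenario. Below is a comparison of MySQL and StarRocks; each test was repeated 3 times and averaged to reduce random error.

```sql
select count(*) from users;
```

Counting rows in MySQL is a major pitfall: it takes 1 to 2 minutes.

```
[2026-04-11 14:21:22] 1 row retrieved starting from 1 in 1 m 11 s 165 ms (execution: 1 m 11 s 153 ms, fetching: 12 ms)
```

StarRocks needs only about 260 ms.

```
[2026-04-11 14:20:29] 1 row retrieved starting from 1 in 261 ms (execution: 254 ms, fetching: 7 ms)
```

Counting rows in a business system is therefore a real nuisance. If only the total row count is needed, there are many workarounds, such as a dedicated counter table or a counter kept in Redis. In practice, though, a page that shows a count usually also needs pagination, search, and filtering, and at this data volume those counts become very slow. With large data volumes it is therefore well worth separating reads from writes: row counts, page sizing, join conditions, and similar work can go through StarRocks.

How fast is it with an index?

The table already has several indexes (see the MySQL DDL above). When the AI first generated the table for me, I wondered why some indexes contain only the column while others also include id. After some reading I found that InnoDB secondary indexes implicitly contain the primary key, so (created_at, id) and (created_at) are effectively the same. They can still differ when it comes to sorting, because index entries are stored in sorted order.

Defining (col, id) explicitly pays off in the following scenarios:

- The query contains ORDER BY col, id.
- The query paginates by col, for example WHERE col > x ORDER BY col, id LIMIT n; id keeps the page order stable.
- col has very low selectivity (many duplicate values); adding id raises the index's "selectivity" and makes index lookups more efficient.

Back to the main topic. Reading 1000 rows from MySQL by primary key:

```sql
select * from users where id in (...);
```

```
[2026-04-12 08:52:56] 64 rows retrieved starting from 1 in 113 ms (execution: 79 ms, fetching: 34 ms)
```

So when a table is very large, it is entirely feasible to run the heavy query in StarRocks, collect the resulting ids, and then fetch those rows from the MySQL business database for further processing.

For a string column such as the phone number, queries are not slow either, as long as the column is indexed.

```sql
select * from users where phone like '1000%' order by phone desc limit 100 offset 10;
select * from users where phone like '%1000%' order by phone desc limit 100 offset 10;
```

MySQL:

```
[2026-04-11 14:23:05] 100 rows retrieved starting from 1 in 48 ms (execution: 13 ms, fetching: 35 ms)
[2026-04-12 09:11:44] 100 rows retrieved starting from 1 in 375 ms (execution: 347 ms, fetching: 28 ms)
```

StarRocks:

```
[2026-04-12 09:08:09] 100 rows retrieved starting from 1 in 576 ms (execution: 546 ms, fetching: 30 ms)
[2026-04-12 09:11:36] 100 rows retrieved starting from 1 in 822 ms (execution: 789 ms, fetching: 33 ms)
```

These tests show several things.

When a string is matched by prefix (like 'xxx%'), performance is very good: on 460 million rows the execution time is only 13 ms.

Only a prefix match (like 'xxx%') can really use the index for a range scan: the B+ tree can jump straight to the first entry matching the prefix and read only the index entries inside that range. A pattern such as '%xxx%' cannot use the index this way and forces a full scan, which pushes MySQL's execution time from 13 ms to 347 ms, more than 26 times slower.

StarRocks is an OLAP engine, and its default prefix index is completely ineffective for like '%x%', so both queries are slower than in MySQL.

For strings and similar cases, if the query can be designed to use an index, query time is not a concern even at very large data volumes.

Optimizing filter queries

For the user table alone, business requirements usually call for fuzzy search on phone number, username, email, and so on. like '%xxx%' is bound to show up, and we cannot ask the product manager to change the requirement. Yet like '%xxx%' gets slower on large data in both MySQL and StarRocks, so we need an approach that supports fuzzy search over multiple dynamic fields of the order table, user table, and others, while keeping queries fast.

```sql
select * from users
where phone like '%1000%' or email like '%user_10%' or username like '%user_11%'
order by phone desc limit 100 offset 100;
```

MySQL:

```
[2026-04-12 09:31:37] 100 rows retrieved starting from 1 in 912 ms (execution: 880 ms, fetching: 32 ms)
```

StarRocks:

```
[2026-04-12 09:32:24] 100 rows retrieved starting from 1 in 1 s 588 ms (execution: 1 s 557 ms, fetching: 31 ms)
```

StarRocks/Doris support n-gram tokenization and inverted indexes. The principle is similar to ES, but it is built into the warehouse engine, which avoids the trouble of data synchronization and suits analytical scenarios. In my tests, however, where phone like '%1000%' or email like '%user_10%' or username like '%user_11%' did not use the index, and the query was still slow.

With multi-condition fuzzy search, both MySQL and StarRocks end up slow because of how their indexes work. In the end the only option was ElasticSearch, which is very strong in this area (assuming the fields are indexed with an analyzer suited to substring matching, for example an n-gram analyzer).

```json
POST /users/_search
{
  "from": 0,
  "size": 10,
  "query": {
    "bool": {
      "should": [
        { "match": { "phone": "1000" }},
        { "match": { "email": "user_10" }},
        { "match": { "username": "user_11" }}
      ],
      "minimum_should_match": 1
    }
  }
}
```

A query can use only one index

In the MySQL users table we indexed many columns, including the creation time (idx_created_at_id).

```sql
PRIMARY KEY (id),
UNIQUE KEY uk_phone (phone),
KEY idx_status_created_id (status, created_at, id),
KEY idx_created_at_id (created_at, id),
KEY idx_email (email)
```

If we narrow the range by time, will the following SQL run very fast with the help of idx_created_at_id?

```sql
select * from users
where phone like '1000%' and created_at > '2024-04-25 17:46:20'
order by phone desc limit 100 offset 10;
```

The actual results say otherwise.

MySQL:

```
[2026-04-11 11:55:19] 9 rows retrieved starting from 1 in 24 s 86 ms (execution: 24 s 66 ms, fetching: 20 ms)
```

StarRocks:

```
0 rows retrieved in 19 s 578 ms (execution: 19 s 561 ms, fetching: 17 ms)
```

But didn't I create indexes on both phone and created_at? Why is it still this slow? Start with the execution plan.

There are indeed two indexes:

- uk_phone (phone)
- idx_created_at_id (created_at, id)

The problem is that MySQL can usually pick only the one "most useful" index, and then filters the remaining conditions after looking up the rows in the table.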
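To make the single-index behavior more concrete, here is a minimal sketch in plain Python. It is a toy model, not MySQL internals: each secondary index is represented as a sorted list of (key, id) pairs, and each "plan" range-scans one index and then checks the other condition on every row it fetches. The helper names (make_rows, build_index, plan_by_phone, plan_by_created_at), the 50,000-row scale, and the prefix and cutoff values are all assumptions chosen for illustration; only the phone and created_at formulas are taken from the load script.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime, timedelta

BASE_TIME = datetime(2024, 1, 1)


def make_rows(total):
    """Scaled-down rows shaped like the load script: (id, phone, created_at)."""
    return [
        (n, f"1{n:010d}", BASE_TIME + timedelta(seconds=n % 31_536_000))
        for n in range(1, total + 1)
    ]


def build_index(rows, column):
    """Sorted (key, id) pairs, standing in for a secondary index."""
    return sorted((row[column], row[0]) for row in rows)


def plan_by_phone(rows_by_id, phone_index, prefix, created_after):
    """Range-scan the phone index, then filter created_at on each fetched row."""
    keys = [key for key, _ in phone_index]
    lo = bisect_left(keys, prefix)
    hi = bisect_left(keys, prefix + "\uffff")  # just past the last key with this prefix
    matched = [
        rid for _, rid in phone_index[lo:hi]
        if rows_by_id[rid][2] > created_after
    ]
    return hi - lo, matched  # (index entries examined, matching ids)


def plan_by_created_at(rows_by_id, created_index, prefix, created_after):
    """Range-scan the created_at index, then filter phone on each fetched row."""
    keys = [key for key, _ in created_index]
    lo = bisect_right(keys, created_after)
    matched = [
        rid for _, rid in created_index[lo:]
        if rows_by_id[rid][1].startswith(prefix)
    ]
    return len(keys) - lo, matched


rows = make_rows(50_000)
rows_by_id = {row[0]: row for row in rows}
phone_index = build_index(rows, 1)
created_index = build_index(rows, 2)

prefix = "10000019"                               # hypothetical phone prefix
cutoff = BASE_TIME + timedelta(seconds=19_990)    # hypothetical created_at bound

by_phone = plan_by_phone(rows_by_id, phone_index, prefix, cutoff)
by_created = plan_by_created_at(rows_by_id, created_index, prefix, cutoff)

print("phone index:      examined", by_phone[0], "matched", len(by_phone[1]))
print("created_at index: examined", by_created[0], "matched", len(by_created[1]))
```

Both plans return the same ids, but the number of entries each one has to examine differs widely depending on which condition is more selective. When neither condition alone is selective, as may be the case for the 24-second query above, either single-index plan still has to fetch and filter a large number of rows.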
