各位用户为了找寻关于postgreSQL 非count方法算记录数操作的资料费劲了很多周折。这里教程网为您整理了关于postgreSQL 非count方法算记录数操作的相关资料,仅供查阅,以下为您介绍关于postgreSQL 非count方法算记录数操作的详细内容
一般方法
select count(1) from table_name;
全量扫描一遍表,记录越多,查询速度越慢
新法
PostgreSQL 还真提供了一个这样的途径,那就是系统表 pg_class,这个系统表里头,存储着每个表的统计信息,其中 reltuples 就是对应的表的统计行,统计行的数据是pg有个独立进程,定期扫描不同的表,收集这些表的统计信息,保存在系统表里头。
方法如下:
? 1 2 3 4 5 6 7select
reltuples::
int
as
total
from
pg_class
where
relname =
'table_name'
and
relnamespace = (
select
oid
from
pg_namespace
where
nspname =
'schema'
);
新方法不是通用的,如果要求特精确还是使用select count(1),如果是类似分页的,且分页数量超过比较多的,也不是要求特别精准的,这就是一个好方法!
count(1) over 计算记录数
select count(1) over(), * from table_name;
补充
count 是最常用的聚集函数之一,看似简单,其实还是有坑的,如:
1、count(*):返回结果集的行数,是null也统计
2、count(1):和count(*)基本没区别,pg92之前都是扫描全表的,pg92之后增加了index only scan一般会变成扫主键索引,如果没有主键或者是表的列很多的情况下,count(1)快一些,因为不会考虑表的全部字段
3、count(field):返回数据表中指定字段值不等于null的行数
拓展:理解 PostgreSQL 的 count 函数的行为
关于 count 函数的使用一直存在争议,尤其是在 MySQL 中,作为流行度越来越高的 PostgreSQL 是否也有类似的问题呢,我们通过实践来理解一下 PostgreSQL 中 count 函数的行为。
构建测试数据库
创建测试数据库,并创建测试表。测试表中有自增 ID、创建时间、内容三个字段,自增 ID 字段是主键。
create database performance_test;
create table test_tbl (id serial primary key, created_at timestamp, content varchar(512));
生成测试数据
使用 generate_series 函数生成自增 ID,使用 now() 函数生成 created_at 列,对于 content 列,使用了 repeat(md5(random()::text), 10) 生成 10 个 32 位长度的 md5 字符串。使用下列语句,插入 1000w 条记录用于测试。
performance_test=# insert into test_tbl select generate_series(1,10000000),now(),repeat(md5(random()::text),10); INSERT 0 10000000 Time: 212184.223 ms (03:32.184)
由 count 语句引发的思考
默认情况下 PostgreSQL 不开启 SQL 执行时间的显示,所以需要手动开启一下,方便后面的测试对比。
timing on
count(*) 和 count(1) 的性能区别是经常被讨论的问题,分别使用 count(*) 和 count(1) 执行一次查询。
? 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15performance_test=#
select
count
(*)
from
test_tbl;
count
----------
10000000
(1 row)
Time
: 115090.380 ms (01:55.090)
performance_test=#
select
count
(1)
from
test_tbl;
count
----------
10000000
(1 row)
Time
: 738.502 ms
可以看到两次查询的速度差别非常大,count(1) 真的有这么大的性能提升?接下来再次运行查询语句。
? 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15performance_test=#
select
count
(*)
from
test_tbl;
count
----------
10000000
(1 row)
Time
: 657.831 ms
performance_test=#
select
count
(1)
from
test_tbl;
count
----------
10000000
(1 row)
Time
: 682.157 ms
可以看到第一次查询时候会非常的慢,后面三次速度非常快并且时间相近,这里就有两个问题出现了:
为什么第一次查询速度这么慢?
count(*) 和 count(1) 到底存不存在性能差别?
查询缓存
使用 explain 语句重新执行查询语句
explain (analyze,buffers,verbose) select count(*) from test_tbl;
可以看到如下输出:
? 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23Finalize Aggregate (cost=529273.69..529273.70
rows
=1 width=8) (actual
time
=882.569..882.570
rows
=1 loops=1)
Output
:
count
(*)
Buffers: shared hit=96
read
=476095
-> Gather (cost=529273.48..529273.69
rows
=2 width=8) (actual
time
=882.492..884.170
rows
=3 loops=1)
Output
: (
PARTIAL
count
(*))
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=96
read
=476095
->
Partial
Aggregate (cost=528273.48..528273.49
rows
=1 width=8) (actual
time
=881.014..881.014
rows
=1 loops=3)
Output
:
PARTIAL
count
(*)
Buffers: shared hit=96
read
=476095
Worker 0: actual
time
=880.319..880.319
rows
=1 loops=1
Buffers: shared hit=34
read
=158206
Worker 1: actual
time
=880.369..880.369
rows
=1 loops=1
Buffers: shared hit=29
read
=156424
-> Parallel Seq Scan
on
public
.test_tbl (cost=0.00..517856.98
rows
=4166598 width=0) (actual
time
=0.029..662.165
rows
=3333333 loops=3)
Buffers: shared hit=96
read
=476095
Worker 0: actual
time
=0.026..661.807
rows
=3323029 loops=1
Buffers: shared hit=34
read
=158206
Worker 1: actual
time
=0.030..660.197
rows
=3285513 loops=1
Buffers: shared hit=29
read
=156424
Planning
time
: 0.043 ms
Execution
time
: 884.207 ms
注意里面的 shared hit,表示命中了内存中缓存的数据,这就可以解释为什么后面的查询会比第一次快很多。接下来去掉缓存,并重启 PostgreSQL。
? 1 2 3service postgresql stop
echo 1 > /proc/sys/vm/drop_caches
service postgresql start
重新执行 SQL 语句,速度慢了很多。
? 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23Finalize Aggregate (cost=529273.69..529273.70
rows
=1 width=8) (actual
time
=50604.564..50604.564
rows
=1 loops=1)
Output
:
count
(*)
Buffers: shared
read
=476191
-> Gather (cost=529273.48..529273.69
rows
=2 width=8) (actual
time
=50604.508..50606.141
rows
=3 loops=1)
Output
: (
PARTIAL
count
(*))
Workers Planned: 2
Workers Launched: 2
Buffers: shared
read
=476191
->
Partial
Aggregate (cost=528273.48..528273.49
rows
=1 width=8) (actual
time
=50591.550..50591.551
rows
=1 loops=3)
Output
:
PARTIAL
count
(*)
Buffers: shared
read
=476191
Worker 0: actual
time
=50585.182..50585.182
rows
=1 loops=1
Buffers: shared
read
=158122
Worker 1: actual
time
=50585.181..50585.181
rows
=1 loops=1
Buffers: shared
read
=161123
-> Parallel Seq Scan
on
public
.test_tbl (cost=0.00..517856.98
rows
=4166598 width=0) (actual
time
=92.491..50369.691
rows
=3333333 loops=3)
Buffers: shared
read
=476191
Worker 0: actual
time
=122.170..50362.271
rows
=3320562 loops=1
Buffers: shared
read
=158122
Worker 1: actual
time
=14.020..50359.733
rows
=3383583 loops=1
Buffers: shared
read
=161123
Planning
time
: 11.537 ms
Execution
time
: 50606.215 ms
shared read 表示没有命中缓存,通过这个现象可以推断出,上一小节的四次查询中,第一次查询没有命中缓存,剩下三次查询都命中了缓存。
count(1) 和 count(*) 的区别
接下来探究 count(1) 和 count(*) 的区别是什么,继续思考最开始的四次查询,第一次查询使用了 count(*),第二次查询使用了 count(1) ,却依然命中了缓存,不正是说明 count(1) 和 count(*) 是一样的吗?
事实上,PostgreSQL 官方对于 is there a difference performance-wise between select count(1) and select count(*)? 问题的回复也证实了这一点:
Nope. In fact, the latter is converted to the former during parsing.[2]
既然 count(1) 在性能上没有比 count(*) 更好,那么使用 count(*) 就是更好的选择。
sequence scan 和 index scan
接下来测试一下,在不同数据量大小的情况下 count(*) 的速度,将查询语句写在 count.sql 文件中,使用 pgbench 进行测试。
pgbench -c 5 -t 20 performance_test -r -f count.sql
分别测试 200w - 1000w 数据量下的 count 语句耗时
绘制成耗时曲线
曲线的趋势在 600w - 700w 数据量之间出现了转折,200w - 600w 是线性增长,600w 之后 count 的耗时就基本相同了。使用 explain 语句分别查看 600w 和 700w 数据时的 count 语句执行。
700w:
? 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25Finalize Aggregate (cost=502185.93..502185.94
rows
=1 width=8) (actual
time
=894.361..894.361
rows
=1 loops=1)
Output
:
count
(*)
Buffers: shared hit=16344
read
=352463
-> Gather (cost=502185.72..502185.93
rows
=2 width=8) (actual
time
=894.232..899.763
rows
=3 loops=1)
Output
: (
PARTIAL
count
(*))
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=16344
read
=352463
->
Partial
Aggregate (cost=501185.72..501185.73
rows
=1 width=8) (actual
time
=889.371..889.371
rows
=1 loops=3)
Output
:
PARTIAL
count
(*)
Buffers: shared hit=16344
read
=352463
Worker 0: actual
time
=887.112..887.112
rows
=1 loops=1
Buffers: shared hit=5459
read
=118070
Worker 1: actual
time
=887.120..887.120
rows
=1 loops=1
Buffers: shared hit=5601
read
=117051
-> Parallel
Index
Only
Scan using test_tbl_pkey
on
public
.test_tbl (cost=0.43..493863.32
rows
=2928960 width=0) (actual
time
=0.112..736.376
rows
=2333333 loops=3)
Index
Cond: (test_tbl.id < 7000000)
Heap Fetches: 2328492
Buffers: shared hit=16344
read
=352463
Worker 0: actual
time
=0.107..737.180
rows
=2344479 loops=1
Buffers: shared hit=5459
read
=118070
Worker 1: actual
time
=0.133..737.960
rows
=2327028 loops=1
Buffers: shared hit=5601
read
=117051
Planning
time
: 0.165 ms
Execution
time
: 899.857 ms
600w:
? 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25Finalize Aggregate (cost=429990.94..429990.95
rows
=1 width=8) (actual
time
=765.575..765.575
rows
=1 loops=1)
Output
:
count
(*)
Buffers: shared hit=13999
read
=302112
-> Gather (cost=429990.72..429990.93
rows
=2 width=8) (actual
time
=765.557..770.889
rows
=3 loops=1)
Output
: (
PARTIAL
count
(*))
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=13999
read
=302112
->
Partial
Aggregate (cost=428990.72..428990.73
rows
=1 width=8) (actual
time
=763.821..763.821
rows
=1 loops=3)
Output
:
PARTIAL
count
(*)
Buffers: shared hit=13999
read
=302112
Worker 0: actual
time
=762.742..762.742
rows
=1 loops=1
Buffers: shared hit=4638
read
=98875
Worker 1: actual
time
=763.308..763.308
rows
=1 loops=1
Buffers: shared hit=4696
read
=101570
-> Parallel
Index
Only
Scan using test_tbl_pkey
on
public
.test_tbl (cost=0.43..422723.16
rows
=2507026 width=0) (actual
time
=0.053..632.199
rows
=2000000 loops=3)
Index
Cond: (test_tbl.id < 6000000)
Heap Fetches: 2018490
Buffers: shared hit=13999
read
=302112
Worker 0: actual
time
=0.059..633.156
rows
=1964483 loops=1
Buffers: shared hit=4638
read
=98875
Worker 1: actual
time
=0.038..634.271
rows
=2017026 loops=1
Buffers: shared hit=4696
read
=101570
Planning
time
: 0.055 ms
Execution
time
: 770.921 ms
根据以上现象推断,PostgreSQL 似乎在 count 的数据量小于数据表长度的某一比例时,才使用 index scan,通过查看官方 wiki 也可以看到相关描述:
It is important to realise that the planner is concerned with minimising the total cost of the query. With databases, the cost of I/O typically dominates. For that reason, "count(*) without any predicate" queries will only use an index-only scan if the index is significantly smaller than its table. This typically only happens when the table's row width is much wider than some indexes'.[3]
根据 Stackoverflow 上的回答,count 语句查询的数量大于表大小的 3/4 时候就会用使用全表扫描代替索引扫描[4]。
结论
不要用 count(1) 或 count(列名) 代替 count(*)
count 本身是非常耗时的
count 可能是 index scan 也可能是 sequence scan,取决于 count 数量占表大小的比例
以上为个人经验,希望能给大家一个参考,也希望大家多多支持。如有错误或未考虑完全的地方,望不吝赐教。
原文链接:https://blog.csdn.net/u011944141/article/details/52485132