Parquet, ORC, and Avro: Storage Footprint and Query Latency Compared
In HDFS, the choice of storage format makes a big difference to both storage footprint and query latency. Below is a comparison of three common formats, Parquet, ORC, and Avro, on storage size and Hive SQL query time.
The results show that Parquet and ORC clearly outperform Avro on both dimensions.
Storage format | Size on HDFS | Hive SQL query time | Layout |
---|---|---|---|
Parquet | 180 MB | 14.8 s | Columnar |
ORC | 96.7 MB | 5.6 s | Columnar |
Avro | 2.7 GB | 82.8 s | Row-oriented (serialization format) |
Test data: 20 million rows from a public test dataset.
1. Create the test tables
The source table, queryip, is stored as Avro.
Create a Parquet table:
CREATE TABLE `queryip_parquet`(
`ip` string COMMENT '',
`isp` string COMMENT '',
`country` string COMMENT '',
`province` string COMMENT '',
`city` string COMMENT '',
`deprecate` string COMMENT '',
`useragent` string COMMENT '',
`referer` string COMMENT '',
`path` string COMMENT '',
`method` string COMMENT '',
`protocol` string COMMENT '',
`timestamp` string COMMENT '')
PARTITIONED BY (
`dt` string)
STORED AS PARQUET;
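Note that DESCRIBE FORMATTED later reports Compressed: No for this table; depending on the Hive version, Parquet tables may be written uncompressed unless a codec is pinned explicitly, which would partly explain the size gap to ORC. A minimal sketch of setting one (the table name and the SNAPPY choice are illustrative, not part of the original test):
-- Hypothetical variant of the table above with an explicit Parquet codec.
CREATE TABLE `queryip_parquet_snappy` LIKE `queryip_parquet`;
ALTER TABLE `queryip_parquet_snappy` SET TBLPROPERTIES ('parquet.compression'='SNAPPY');
Then load the table from the Avro source: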
insert overwrite table queryip_parquet partition (dt)
SELECT * FROM queryip;
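Because dt is not given a static value, this is a fully dynamic-partition insert; Hive's default strict mode rejects those, so enable dynamic partitioning for the session first:
-- Allow every partition column to be resolved from the SELECT output.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;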
Create an ORC table:
CREATE TABLE `queryip_orc`(
`ip` string COMMENT '',
`isp` string COMMENT '',
`country` string COMMENT '',
`province` string COMMENT '',
`city` string COMMENT '',
`deprecate` string COMMENT '',
`useragent` string COMMENT '',
`referer` string COMMENT '',
`path` string COMMENT '',
`method` string COMMENT '',
`protocol` string COMMENT '',
`timestamp` string COMMENT '')
PARTITIONED BY (
`dt` string)
STORED AS ORC;
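ORC compresses with ZLIB by default, which is a large part of why it comes out smallest in the comparison. The codec can also be chosen explicitly; a sketch on the existing table (this affects only files written afterwards):
-- Illustrative: pin the ORC codec; valid values include NONE, ZLIB, SNAPPY.
ALTER TABLE `queryip_orc` SET TBLPROPERTIES ('orc.compress'='ZLIB');
Then load it the same way: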
insert overwrite table queryip_orc partition (dt)
SELECT * FROM queryip;
2. Inspect table metadata
Collect and view table statistics with Hive's ANALYZE and DESCRIBE FORMATTED commands.
hive> analyze table queryip_parquet partition(dt) compute statistics ;
hive> analyze table queryip_orc partition(dt) compute statistics ;
hive> describe formatted queryip_parquet;
Table Parameters:
COLUMN_STATS_ACCURATE {"BASIC_STATS":"true"}
numFiles 43
numPartitions 43
numRows 21105927
rawDataSize 274377051
totalSize 188784031
transient_lastDdlTime 1623294389
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Compressed: No
Num Buckets: 0
Bucket Columns: []
Sort Columns: []
Time taken: 0.125 seconds, Fetched: 46 row(s)
hive> describe formatted queryip_orc;
OK
Table Parameters:
COLUMN_STATS_ACCURATE {"BASIC_STATS":"true"}
numFiles 43
numPartitions 43
numRows 21184763
rawDataSize 0
totalSize 101381351
transient_lastDdlTime 1623294395
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed: No
Num Buckets: 0
Bucket Columns: []
Sort Columns: []
Time taken: 0.135 seconds, Fetched: 46 row(s)
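Beyond table-level statistics, individual files can be inspected. Hive ships an ORC dump utility, and a parquet-tools jar (if available in your distribution) does the same for Parquet; the file paths below are placeholders:
# Show stripe layout, column encodings, and compression of one ORC file
hive --orcfiledump /user/hive/warehouse/queryip_orc/dt=.../000000_0
# With parquet-tools available, dump Parquet footer metadata
hadoop jar parquet-tools.jar meta /user/hive/warehouse/queryip_parquet/dt=.../000000_0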
3. Hive SQL query latency
Parquet: 14.779 seconds
hive> SELECT country,
> province,
> city,
> count(*) AS cnt
> FROM `default`.`queryip_parquet`
> GROUP BY country,
> province,
> city
> ORDER BY cnt DESC
> LIMIT 10;
Query ID = root_20210610112412_3a1eaff7-5153-4dd0-985e-bbe23e67aaa2
Query Hive on Spark job[4] stages: [5, 6, 4]
Spark job[4] status = RUNNING
--------------------------------------------------------------------------------------
STAGES ATTEMPT STATUS TOTAL COMPLETED RUNNING PENDING FAILED
--------------------------------------------------------------------------------------
Stage-4 ........ 0 FINISHED 2 2 0 0 0
Stage-5 ........ 0 FINISHED 9 9 0 0 0
Stage-6 ........ 0 FINISHED 1 1 0 0 0
--------------------------------------------------------------------------------------
STAGES: 03/03 [==========================>>] 100% ELAPSED TIME: 14.03 s
--------------------------------------------------------------------------------------
Spark job[4] finished successfully in 14.03 second(s)
Spark Job[4] Metrics: TaskDurationTime: 18289, ExecutorCpuTime: 15458, JvmGCTime: 93, BytesRead / RecordsRead: 31766596 / 21105927, BytesReadEC: 0, ShuffleTotalBytesRead / ShuffleRecordsRead: 40566 / 1423, ShuffleBytesWritten / ShuffleRecordsWritten: 40566 / 1423
OK
Time taken: 14.779 seconds, Fetched: 10 row(s)
ORC: 5.645 seconds
hive> SELECT country,
> province,
> city,
> count(*) AS cnt
> FROM `default`.`queryip_orc`
> GROUP BY country,
> province,
> city
> ORDER BY cnt DESC
> LIMIT 10;
Query Hive on Spark job[5] stages: [9, 7, 8]
Spark job[5] status = RUNNING
--------------------------------------------------------------------------------------
STAGES ATTEMPT STATUS TOTAL COMPLETED RUNNING PENDING FAILED
--------------------------------------------------------------------------------------
Stage-7 ........ 0 FINISHED 2 2 0 0 0
Stage-8 ........ 0 FINISHED 4 4 0 0 0
Stage-9 ........ 0 FINISHED 1 1 0 0 0
--------------------------------------------------------------------------------------
STAGES: 03/03 [==========================>>] 100% ELAPSED TIME: 5.01 s
--------------------------------------------------------------------------------------
Spark job[5] finished successfully in 5.01 second(s)
Spark Job[5] Metrics: TaskDurationTime: 6666, ExecutorCpuTime: 4968, JvmGCTime: 38, BytesRead / RecordsRead: 26663580 / 20715, BytesReadEC: 0, ShuffleTotalBytesRead / ShuffleRecordsRead: 36165 / 1384, ShuffleBytesWritten / ShuffleRecordsWritten: 36165 / 1384
OK
Time taken: 5.645 seconds, Fetched: 10 row(s)
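Two details stand out in the ORC run. RecordsRead is 20,715 rather than ~21 million, most likely because the vectorized ORC reader consumes batches of 1,024 rows (20,715 × 1,024 ≈ 21.2M) and the counter tracks batches. That vectorized path is also a big part of the speedup; if it is disabled in your environment, it can be switched on per session:
-- Process rows in batches instead of one at a time (natively supported for ORC).
set hive.vectorized.execution.enabled=true;
set hive.vectorized.execution.reduce.enabled=true;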
Avro: 82.76 seconds. Note the BytesRead metric in the job output: the Avro scan reads the full dataset (~2.9 GB), while the Parquet and ORC scans above read roughly 30 MB each, since a columnar reader fetches only the three GROUP BY columns.
hive> SELECT country,
> province,
> city,
> count(*) AS cnt
> FROM `default`.`queryip`
> GROUP BY country,
> province,
> city
> ORDER BY cnt DESC
> LIMIT 10
> ;
Query ID = root_20210610112449_359c8c81-0588-45fd-ba42-2e5be5dff421
Query Hive on Spark job[6] stages: [12, 10, 11]
Spark job[6] status = RUNNING
--------------------------------------------------------------------------------------
STAGES ATTEMPT STATUS TOTAL COMPLETED RUNNING PENDING FAILED
--------------------------------------------------------------------------------------
Stage-10 ....... 0 FINISHED 12 12 0 0 0
Stage-11 ....... 0 FINISHED 86 86 0 0 0
Stage-12 ....... 0 FINISHED 1 1 0 0 0
--------------------------------------------------------------------------------------
STAGES: 03/03 [==========================>>] 100% ELAPSED TIME: 82.15 s
--------------------------------------------------------------------------------------
Spark job[6] finished successfully in 82.16 second(s)
Spark Job[6] Metrics: TaskDurationTime: 494625, ExecutorCpuTime: 369951, JvmGCTime: 10414, BytesRead / RecordsRead: 2885901995 / 21184763, BytesReadEC: 0, ShuffleTotalBytesRead / ShuffleRecordsRead: 263896 / 6562, ShuffleBytesWritten / ShuffleRecordsWritten: 263896 / 6562
OK
Time taken: 82.76 seconds, Fetched: 10 row(s)
4. Storage size comparison
In the hdfs dfs -du -s -h output below, the first column is the logical data size and the second is raw disk usage including HDFS replication (2x on this cluster, judging by the ratio).
# hdfs dfs -du -s -h /user/hive/warehouse/queryip_orc
96.7 M 193.4 M /user/hive/warehouse/queryip_orc
# hdfs dfs -du -s -h /user/hive/warehouse/queryip
2.7 G 7.0 G /user/hive/warehouse/queryip
# hdfs dfs -du -s -h /user/hive/warehouse/queryip_parquet
180.0 M 360.1 M /user/hive/warehouse/queryip_parquet
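Dropping -s gives a per-partition breakdown, handy for confirming the savings hold across all 43 dt partitions rather than being skewed by a few:
# hdfs dfs -du -h /user/hive/warehouse/queryip_orc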