1. Comparing text, ORC and Parquet query performance in Hive and Impala on 100 million rows (Part 2)



This article stores the same data in HDFS in three different file formats and compares query performance through two clients, Hive and Impala.
Prerequisites: familiarity with Hadoop, Hive, Impala, Kafka, Flink, etc., with all of these environments working. (The corresponding content will be filled in over the course of this column series; the ZooKeeper and Hadoop parts are already complete.)
This write-up has five parts: conclusions, an introduction to the three file formats, requirements, implementation steps and implementation, and conclusions found online from others together with verification.

Because the full write-up is quite long, it has been split into two articles: the first prepares the data, the second runs the query comparison. This is the second article.

This article continues "Comparing text, ORC and Parquet query performance in Hive and Impala on 100 million rows (Part 1)"; please read Part 1 first.

9. Query in Hive and Impala to verify the results (comparing HDFS query performance across the three file formats: textfile, ORC, Parquet)

-- 1. Count the total number of rows
select count(*) from t_kafkauser_orc;
select count(*) from t_kafkauser_parquet;

-- 2. Pick an arbitrary row and query by name
select * from t_kafkauser_orc where name = '';
select * from t_kafkauser_parquet where name = '';

-- 3. Query by time range
select * from t_kafkauser_orc where create_time between '' and '';
select * from t_kafkauser_parquet where create_time between '' and '';
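
The DDL for these tables is in Part 1. As a rough reminder (a minimal sketch only; the column types are inferred from the query output further down rather than copied from Part 1), the tables are assumed to hold identical rows and to differ only in their STORED AS clause:

    -- sketch: same schema, different storage format (see Part 1 for the actual DDL)
    CREATE TABLE IF NOT EXISTS t_kafkauser_orc (
      id BIGINT, userid BIGINT, name STRING, url STRING, create_time TIMESTAMP)
    STORED AS ORC;

    CREATE TABLE IF NOT EXISTS t_kafkauser_parquet (
      id BIGINT, userid BIGINT, name STRING, url STRING, create_time TIMESTAMP)
    STORED AS PARQUET;

    -- populate both from the textfile source table so all three hold the same 100 million rows
    INSERT OVERWRITE TABLE t_kafkauser_orc SELECT * FROM t_kafkauser;
    INSERT OVERWRITE TABLE t_kafkauser_parquet SELECT * FROM t_kafkauser;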

1) Count the total number of rows

The detailed query results are as follows:

sql:
    select count(*) from t_kafkauser;
    select count(name) from t_kafkauser;
    select count(id) from t_kafkauser;
    
    select count(*) from t_kafkauser_orc;
    select count(name) from t_kafkauser_orc;
    select count(id) from t_kafkauser_orc;
    
    select count(*) from t_kafkauser_parquet;
    select count(name) from t_kafkauser_parquet;
    select count(id) from t_kafkauser_parquet;
    
0. Prerequisite: verify that MapReduce runs correctly
    Verify that MR jobs can run with the following commands:
    cd /opt/cloudera/parcels/CDH-6.2.1-1.cdh6.2.1.p0.1425774/lib/hadoop-mapreduce    
    hadoop jar hadoop-mapreduce-examples-3.0.0-cdh6.2.1.jar pi 1 1

1. Hive queries
    0: jdbc:hive2://server8:10000> select count(*) from t_kafkauser;

    INFO  : Total jobs = 1

    INFO  : number of splits:31

    INFO  : Stage-Stage-1: Map: 31  Reduce: 1   Cumulative CPU: 152.83 sec   HDFS Read: 8376828994 HDFS Write: 109 HDFS EC Read: 0 SUCCESS
    INFO  : Total MapReduce CPU Time Spent: 2 minutes 32 seconds 830 msec
    INFO  : Completed executing command(queryId=hive_20230202093505_2d22ca60-dc1a-4492-b2f7-493e80a136c4); Time taken: 42.478 seconds

    +------------+
    |    _c0     |
    +------------+
    | 100000000  |
    +------------+
    1 row selected (42.725 seconds)
    
    0: jdbc:hive2://server8:10000> select count(name) from t_kafkauser;

    INFO  : Total jobs = 1

    INFO  : number of splits:31

    INFO  : Stage-Stage-1: Map: 31  Reduce: 1   Cumulative CPU: 195.78 sec   HDFS Read: 8376832781 HDFS Write: 109 HDFS EC Read: 0 SUCCESS
    INFO  : Total MapReduce CPU Time Spent: 3 minutes 15 seconds 780 msec
    INFO  : Completed executing command(queryId=hive_20230202093616_76a6e3dc-8ec4-4409-afe7-130977713eca); Time taken: 41.496 seconds

    +------------+
    |    _c0     |
    +------------+
    | 100000000  |
    +------------+
    1 row selected (41.698 seconds)
    
    0: jdbc:hive2://server8:10000> select count(id) from t_kafkauser;

    INFO  : Total jobs = 1

    INFO  : number of splits:31

    INFO  : Stage-Stage-1: Map: 31  Reduce: 1   Cumulative CPU: 203.88 sec   HDFS Read: 8376832818 HDFS Write: 109 HDFS EC Read: 0 SUCCESS
    INFO  : Total MapReduce CPU Time Spent: 3 minutes 23 seconds 880 msec
    INFO  : Completed executing command(queryId=hive_20230202093719_ba62f1d1-b866-4507-b5fa-2283e59053c6); Time taken: 40.668 seconds

    +------------+
    |    _c0     |
    +------------+
    | 100000000  |
    +------------+
    1 row selected (40.823 seconds)
    
    0: jdbc:hive2://server8:10000> select count(*) from t_kafkauser_orc;

    INFO  : Total jobs = 1

    INFO  : number of splits:2


    INFO  : Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 15.92 sec   HDFS Read: 83789 HDFS Write: 109 HDFS EC Read: 0 SUCCESS
    INFO  : Total MapReduce CPU Time Spent: 15 seconds 920 msec
    INFO  : Completed executing command(queryId=hive_20230201184450_28482688-b11e-4a18-9fa9-947cfba0589c); Time taken: 26.617 seconds

    +------------+
    |    _c0     |
    +------------+
    | 100000000  |
    +------------+
    1 row selected (27.605 seconds)
    
    0: jdbc:hive2://server8:10000> select count(name) from t_kafkauser_orc;

    INFO  : Total jobs = 1

    INFO  : number of splits:2

    INFO  : Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 17.95 sec   HDFS Read: 293956548 HDFS Write: 109 HDFS EC Read: 0 SUCCESS
    INFO  : Total MapReduce CPU Time Spent: 17 seconds 950 msec
    INFO  : Completed executing command(queryId=hive_20230202085041_e16bf186-b3b5-4d38-8c10-d60452e75dae); Time taken: 25.029 seconds

    +------------+
    |    _c0     |
    +------------+
    | 100000000  |
    +------------+
    1 row selected (25.202 seconds)
    
    0: jdbc:hive2://server8:10000> select count(id) from t_kafkauser_orc;

    INFO  : Total jobs = 1

    INFO  : number of splits:2

    INFO  : Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 15.19 sec   HDFS Read: 299434 HDFS Write: 109 HDFS EC Read: 0 SUCCESS
    INFO  : Total MapReduce CPU Time Spent: 15 seconds 190 msec
    INFO  : Completed executing command(queryId=hive_20230202085132_dbc210c6-18b7-45d3-9264-807346fb3fa0); Time taken: 25.213 seconds

    +------------+
    |    _c0     |
    +------------+
    | 100000000  |
    +------------+
    1 row selected (25.395 seconds)
    
    0: jdbc:hive2://server8:10000> select count(*) from t_kafkauser_parquet;

    INFO  : Total jobs = 1

    INFO  : number of splits:21

    INFO  : Stage-Stage-1: Map: 21  Reduce: 1   Cumulative CPU: 233.89 sec   HDFS Read: 5794677889 HDFS Write: 109 HDFS EC Read: 0 SUCCESS
    INFO  : Total MapReduce CPU Time Spent: 3 minutes 53 seconds 890 msec
    INFO  : Completed executing command(queryId=hive_20230202085230_920b24bf-4525-4412-a297-03caf7cb8a79); Time taken: 41.679 seconds

    +------------+
    |    _c0     |
    +------------+
    | 100000000  |
    +------------+
    1 row selected (41.874 seconds)
    
    0: jdbc:hive2://server8:10000> select count(name) from t_kafkauser_parquet;

    INFO  : Total jobs = 1

    INFO  : number of splits:21

    INFO  : Stage-Stage-1: Map: 21  Reduce: 1   Cumulative CPU: 169.43 sec   HDFS Read: 1789619337 HDFS Write: 109 HDFS EC Read: 0 SUCCESS
    INFO  : Total MapReduce CPU Time Spent: 2 minutes 49 seconds 430 msec
    INFO  : Completed executing command(queryId=hive_20230202085352_7b1b02b6-8507-4197-b74d-e0495e87cf22); Time taken: 36.554 seconds

    +------------+
    |    _c0     |
    +------------+
    | 100000000  |
    +------------+
    1 row selected (36.731 seconds)
    
    0: jdbc:hive2://server8:10000> select count(id) from t_kafkauser_parquet;

    INFO  : Total jobs = 1

    INFO  : number of splits:21

    INFO  : Stage-Stage-1: Map: 21  Reduce: 1   Cumulative CPU: 145.24 sec   HDFS Read: 400635630 HDFS Write: 109 HDFS EC Read: 0 SUCCESS
    INFO  : Total MapReduce CPU Time Spent: 2 minutes 25 seconds 240 msec
    INFO  : Completed executing command(queryId=hive_20230202085452_29b54852-cc94-4098-b87b-99623c9367f3); Time taken: 34.608 seconds

    +------------+
    |    _c0     |
    +------------+
    | 100000000  |
    +------------+
    1 row selected (34.796 seconds)    

2. Impala queries
    [server7:21000] test_million_data> select count(*) from t_kafkauser;
   
    +-----------+
    | count(*)  |
    +-----------+
    | 100000000 |
    +-----------+
    Fetched 1 row(s) in 1.52s
    [server7:21000] test_million_data> select count(id) from t_kafkauser;
   
    +-----------+
    | count(id) |
    +-----------+
    | 100000000 |
    +-----------+
    Fetched 1 row(s) in 1.43s
    [server7:21000] test_million_data> select count(name) from t_kafkauser;
   
    +-------------+
    | count(name) |
    +-------------+
    | 100000000   |
    +-------------+
    Fetched 1 row(s) in 1.32s
    
    [server7:21000] test_million_data> select count(*) from t_kafkauser_orc;
    
    +-----------+
    | count(*)  |
    +-----------+
    | 100000000 |
    +-----------+
    Fetched 1 row(s) in 0.21s
    [server7:21000] test_million_data> select count(id) from t_kafkauser_orc;
    +-----------+
    | count(id) |
    +-----------+
    | 100000000 |
    +-----------+
    Fetched 1 row(s) in 1.51s
    [server7:21000] test_million_data> select count(name) from t_kafkauser_orc;
   
    +-------------+
    | count(name) |
    +-------------+
    | 100000000   |
    +-------------+
    Fetched 1 row(s) in 2.72s
    
    [server7:21000] test_million_data> select count(*) from t_kafkauser_parquet;
    
    +-----------+
    | count(*)  |
    +-----------+
    | 100000000 |
    +-----------+
    Fetched 1 row(s) in 0.11s
    [server7:21000] test_million_data> select count(id) from t_kafkauser_parquet;
   
    +-----------+
    | count(id) |
    +-----------+
    | 100000000 |
    +-----------+
    Fetched 1 row(s) in 0.41s
    [server7:21000] test_million_data> select count(name) from t_kafkauser_parquet;
    
    +-------------+
    | count(name) |
    +-------------+
    | 100000000   |
    +-------------+
    Fetched 1 row(s) in 0.51s 

2) Pick an arbitrary row and query by name


1. SQL
    select * from t_kafkauser where  name = 'alan6522345_t';
    select * from t_kafkauser_orc where  name = 'alan6522345_t';
    select * from t_kafkauser_parquet  where name = 'alan6522345_t';

2. Hive queries
    0: jdbc:hive2://server8:10000> select * from t_kafkauser where name = 'alan6522345_t';

    INFO  : Total jobs = 1

    INFO  : number of splits:31

    INFO  : Stage-Stage-1: Map: 31   Cumulative CPU: 193.94 sec   HDFS Read: 8376850688 HDFS Write: 2790 HDFS EC Read: 0 SUCCESS
    INFO  : Total MapReduce CPU Time Spent: 3 minutes 13 seconds 940 msec
    INFO  : Completed executing command(queryId=hive_20230202104525_12cbc21b-fcc2-4f4b-b4c5-dbebf916f8c4); Time taken: 39.02 seconds

    +-----------------+---------------------+-------------------+------------------------------+--------------------------+
    | t_kafkauser.id  | t_kafkauser.userid  | t_kafkauser.name  |       t_kafkauser.url        | t_kafkauser.create_time  |
    +-----------------+---------------------+-------------------+------------------------------+--------------------------+
    | 14565794        | 6522345             | alan6522345_t     | https://www.win.com/6522345  | 2023-01-18 07:44:05.0    |
    +-----------------+---------------------+-------------------+------------------------------+--------------------------+
    1 row selected (39.252 seconds)
    
    0: jdbc:hive2://server8:10000> select * from t_kafkauser_orc where name = 'alan6522345_t';

    INFO  : Total jobs = 1

    INFO  : number of splits:2

    INFO  : Stage-Stage-1: Map: 2   Cumulative CPU: 8.53 sec   HDFS Read: 12458143 HDFS Write: 267 HDFS EC Read: 0 SUCCESS
    INFO  : Total MapReduce CPU Time Spent: 8 seconds 530 msec
    INFO  : Completed executing command(queryId=hive_20230202105023_3ae232fb-d3c0-4082-b154-0a7dd9793ee4); Time taken: 16.772 seconds

    +---------------------+-------------------------+-----------------------+------------------------------+------------------------------+
    | t_kafkauser_orc.id  | t_kafkauser_orc.userid  | t_kafkauser_orc.name  |     t_kafkauser_orc.url      | t_kafkauser_orc.create_time  |
    +---------------------+-------------------------+-----------------------+------------------------------+------------------------------+
    | 14565794            | 6522345                 | alan6522345_t         | https://www.win.com/6522345  | 2023-01-18 07:44:05.0        |
    +---------------------+-------------------------+-----------------------+------------------------------+------------------------------+
    1 row selected (16.978 seconds)
    
    0: jdbc:hive2://server8:10000> select * from t_kafkauser_parquet where name = 'alan6522345_t';

    INFO  : Total jobs = 1

    INFO  : number of splits:21

    INFO  : Stage-Stage-1: Map: 21   Cumulative CPU: 110.97 sec   HDFS Read: 536025572 HDFS Write: 1920 HDFS EC Read: 0 SUCCESS
    INFO  : Total MapReduce CPU Time Spent: 1 minutes 50 seconds 970 msec
    INFO  : Completed executing command(queryId=hive_20230202105140_e843cd7b-79bd-4c5b-bf7f-60a525aa1faa); Time taken: 27.3 seconds

    +-------------------------+-----------------------------+---------------------------+------------------------------+----------------------------------+
    | t_kafkauser_parquet.id  | t_kafkauser_parquet.userid  | t_kafkauser_parquet.name  |   t_kafkauser_parquet.url    | t_kafkauser_parquet.create_time  |
    +-------------------------+-----------------------------+---------------------------+------------------------------+----------------------------------+
    | 14565794                | 6522345                     | alan6522345_t             | https://www.win.com/6522345  | 2023-01-18 07:44:05.0            |
    +-------------------------+-----------------------------+---------------------------+------------------------------+----------------------------------+
    1 row selected (27.54 seconds)

3. Impala queries
    [server7:21000] test_million_data> select * from t_kafkauser where name = 'alan6522345_t';

    +----------+---------+---------------+-----------------------------+-----------------------+
    | id       | userid  | name          | url                         | create_time           |
    +----------+---------+---------------+-----------------------------+-----------------------+
    | 14565794 | 6522345 | alan6522345_t | https://www.win.com/6522345 | 2023-01-18 07:44:05.0 |
    +----------+---------+---------------+-----------------------------+-----------------------+
    Fetched 1 row(s) in 1.84s
    [server7:21000] test_million_data> select * from t_kafkauser_orc where name = 'alan6522345_t';
    Query: select * from t_kafkauser_orc where name = 'alan6522345_t'
    Query submitted at: 2023-02-02 10:54:09 (Coordinator: http://server7:25000)
    Query progress can be monitored at: http://server7:25000/query_plan?query_id=9247e834afb1281b:742462f600000000
    +----------+---------+---------------+-----------------------------+-----------------------+
    | id       | userid  | name          | url                         | create_time           |
    +----------+---------+---------------+-----------------------------+-----------------------+
    | 14565794 | 6522345 | alan6522345_t | https://www.win.com/6522345 | 2023-01-18 07:44:05.0 |
    +----------+---------+---------------+-----------------------------+-----------------------+
    Fetched 1 row(s) in 10.67s
    [server7:21000] test_million_data> select * from t_kafkauser_parquet where name = 'alan6522345_t';
    Query: select * from t_kafkauser_parquet where name = 'alan6522345_t'
    Query submitted at: 2023-02-02 10:55:21 (Coordinator: http://server7:25000)
    Query progress can be monitored at: http://server7:25000/query_plan?query_id=9740d6450f2529e5:f013a3cb00000000
    +----------+---------+---------------+-----------------------------+-----------------------+
    | id       | userid  | name          | url                         | create_time           |
    +----------+---------+---------------+-----------------------------+-----------------------+
    | 14565794 | 6522345 | alan6522345_t | https://www.win.com/6522345 | 2023-01-18 07:44:05.0 |
    +----------+---------+---------------+-----------------------------+-----------------------+
    Fetched 1 row(s) in 0.61s

3) Query with multiple conditions


1. SQL
select
  * from t_kafkauser 
    where name = 'alan6522345_t' 
    and id = '14565794' 
    and userid = '6522345' 
    or url = 'https://www.win.com/6522345';
select
  * from t_kafkauser_orc 
    where name = 'alan6522345_t' 
    and id = '14565794' 
    and userid = '6522345' 
    or url = 'https://www.win.com/6522345';
select
  * from t_kafkauser_parquet 
    where name = 'alan6522345_t' 
    and id = '14565794' 
    and userid = '6522345' 
    or url = 'https://www.win.com/6522345';
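
Note that AND binds more tightly than OR, so the statements above are evaluated as (name = ... AND id = ... AND userid = ...) OR url = ...: any row matching the url alone would also be returned. With this data set both sides of the OR point to the same single row, so the result is unchanged, but an explicitly parenthesized form (a sketch with the same intent) avoids the ambiguity:

    select *
      from t_kafkauser
     where (name = 'alan6522345_t'
       and id = '14565794'
       and userid = '6522345')
        or url = 'https://www.win.com/6522345';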

2. Hive queries
    0: jdbc:hive2://server8:10000> select * from t_kafkauser where name = 'alan6522345_t' and id = '14565794' and userid = '6522345' or url = 'https://www.win.com/6522345';

    INFO  : Total jobs = 1

    INFO  : number of splits:31

    INFO  : Stage-Stage-1: Map: 31   Cumulative CPU: 216.8 sec   HDFS Read: 8376852455 HDFS Write: 2790 HDFS EC Read: 0 SUCCESS
    INFO  : Total MapReduce CPU Time Spent: 3 minutes 36 seconds 800 msec
    INFO  : Completed executing command(queryId=hive_20230202110048_d977e61a-f378-4d42-a4b3-81722742916c); Time taken: 39.688 seconds

    +-----------------+---------------------+-------------------+------------------------------+--------------------------+
    | t_kafkauser.id  | t_kafkauser.userid  | t_kafkauser.name  |       t_kafkauser.url        | t_kafkauser.create_time  |
    +-----------------+---------------------+-------------------+------------------------------+--------------------------+
    | 14565794        | 6522345             | alan6522345_t     | https://www.win.com/6522345  | 2023-01-18 07:44:05.0    |
    +-----------------+---------------------+-------------------+------------------------------+--------------------------+
    1 row selected (39.935 seconds)
    0: jdbc:hive2://server8:10000> select * from t_kafkauser_orc where name = 'alan6522345_t' and id = '14565794' and userid = '6522345' or url = 'https://www.win.com/6522345';

    INFO  : Total jobs = 1

    INFO  : number of splits:2

    INFO  : Stage-Stage-1: Map: 2   Cumulative CPU: 10.16 sec   HDFS Read: 12458871 HDFS Write: 267 HDFS EC Read: 0 SUCCESS
    INFO  : Total MapReduce CPU Time Spent: 10 seconds 160 msec
    INFO  : Completed executing command(queryId=hive_20230202110142_76710f60-5317-4584-9381-cc26e732843e); Time taken: 16.993 seconds

    +---------------------+-------------------------+-----------------------+------------------------------+------------------------------+
    | t_kafkauser_orc.id  | t_kafkauser_orc.userid  | t_kafkauser_orc.name  |     t_kafkauser_orc.url      | t_kafkauser_orc.create_time  |
    +---------------------+-------------------------+-----------------------+------------------------------+------------------------------+
    | 14565794            | 6522345                 | alan6522345_t         | https://www.win.com/6522345  | 2023-01-18 07:44:05.0        |
    +---------------------+-------------------------+-----------------------+------------------------------+------------------------------+
    1 row selected (17.197 seconds)
    0: jdbc:hive2://server8:10000> select * from t_kafkauser_parquet where name = 'alan6522345_t' and id = '14565794' and userid = '6522345' or url = 'https://www.win.com/6522345';

    INFO  : Total jobs = 1

    INFO  : number of splits:21

    INFO  : Stage-Stage-1: Map: 21   Cumulative CPU: 107.92 sec   HDFS Read: 536028491 HDFS Write: 1920 HDFS EC Read: 0 SUCCESS
    INFO  : Total MapReduce CPU Time Spent: 1 minutes 47 seconds 920 msec
    INFO  : Completed executing command(queryId=hive_20230202110301_3b351bdc-7262-4641-9ca0-6250e8287a93); Time taken: 27.76 seconds

    +-------------------------+-----------------------------+---------------------------+------------------------------+----------------------------------+
    | t_kafkauser_parquet.id  | t_kafkauser_parquet.userid  | t_kafkauser_parquet.name  |   t_kafkauser_parquet.url    | t_kafkauser_parquet.create_time  |
    +-------------------------+-----------------------------+---------------------------+------------------------------+----------------------------------+
    | 14565794                | 6522345                     | alan6522345_t             | https://www.win.com/6522345  | 2023-01-18 07:44:05.0            |
    +-------------------------+-----------------------------+---------------------------+------------------------------+----------------------------------+
    1 row selected (27.95 seconds)


3. Impala queries
    [server7:21000] test_million_data> select * from t_kafkauser where name = 'alan6522345_t' and id = 14565794 and userid = 6522345 or url = 'https://www.win.com/6522345';

    +----------+---------+---------------+-----------------------------+-----------------------+
    | id       | userid  | name          | url                         | create_time           |
    +----------+---------+---------------+-----------------------------+-----------------------+
    | 14565794 | 6522345 | alan6522345_t | https://www.win.com/6522345 | 2023-01-18 07:44:05.0 |
    +----------+---------+---------------+-----------------------------+-----------------------+
    Fetched 1 row(s) in 1.85s
    [server7:21000] test_million_data> select * from t_kafkauser_orc where name = 'alan6522345_t' and id = 14565794 and userid = 6522345 or url = 'https://www.win.com/6522345';

    +----------+---------+---------------+-----------------------------+-----------------------+
    | id       | userid  | name          | url                         | create_time           |
    +----------+---------+---------------+-----------------------------+-----------------------+
    | 14565794 | 6522345 | alan6522345_t | https://www.win.com/6522345 | 2023-01-18 07:44:05.0 |
    +----------+---------+---------------+-----------------------------+-----------------------+
    Fetched 1 row(s) in 11.31s
    [server7:21000] test_million_data> select * from t_kafkauser_parquet where name = 'alan6522345_t' and id = 14565794 and userid = 6522345 or url = 'https://www.win.com/6522345';

    +----------+---------+---------------+-----------------------------+-----------------------+
    | id       | userid  | name          | url                         | create_time           |
    +----------+---------+---------------+-----------------------------+-----------------------+
    | 14565794 | 6522345 | alan6522345_t | https://www.win.com/6522345 | 2023-01-18 07:44:05.0 |
    +----------+---------+---------------+-----------------------------+-----------------------+
    Fetched 1 row(s) in 4.28s

4) Query by time range


1. SQL
    select count(url) from t_kafkauser where create_time BETWEEN '2023-01-18 08:01:27' and '2023-01-18 08:01:28';
    select count(url) from t_kafkauser_orc where create_time BETWEEN '2023-01-18 08:01:27' and '2023-01-18 08:01:28';
    select count(url) from t_kafkauser_parquet where create_time BETWEEN '2023-01-18 08:01:27' and '2023-01-18 08:01:28';
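
Both Hive and Impala implicitly convert the string bounds here when comparing against the TIMESTAMP column create_time (the queries below run successfully as written). If you prefer to make the conversion explicit, an equivalent sketch is:

    select count(url)
      from t_kafkauser_orc
     where create_time between cast('2023-01-18 08:01:27' as timestamp)
                            and cast('2023-01-18 08:01:28' as timestamp);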

2. Hive queries
    0: jdbc:hive2://server8:10000> select count(url) from t_kafkauser where create_time BETWEEN '2023-01-18 08:01:27' and '2023-01-18 08:01:28' ;

    INFO  : Total jobs = 1

    INFO  : number of splits:31

    INFO  : Stage-Stage-1: Map: 31  Reduce: 1   Cumulative CPU: 215.09 sec   HDFS Read: 8376853562 HDFS Write: 105 HDFS EC Read: 0 SUCCESS
    INFO  : Total MapReduce CPU Time Spent: 3 minutes 35 seconds 90 msec
    INFO  : Completed executing command(queryId=hive_20230202132555_547f89e1-aa9c-4234-aeca-69f33f5eceb8); Time taken: 41.917 seconds

    +--------+
    |  _c0   |
    +--------+
    | 95500  |
    +--------+
    1 row selected (42.106 seconds)
    
    0: jdbc:hive2://server8:10000> select count(url) from t_kafkauser_orc where create_time BETWEEN '2023-01-18 08:01:27' and '2023-01-18 08:01:28' ;

    INFO  : Total jobs = 1

    INFO  : number of splits:2

    INFO  : Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 12.87 sec   HDFS Read: 828205 HDFS Write: 105 HDFS EC Read: 0 SUCCESS
    INFO  : Total MapReduce CPU Time Spent: 12 seconds 870 msec
    INFO  : Completed executing command(queryId=hive_20230202132656_c1fbba6c-05c0-4a9e-a760-b65b6eb26f58); Time taken: 22.105 seconds

    +--------+
    |  _c0   |
    +--------+
    | 95500  |
    +--------+
    1 row selected (22.274 seconds)
    
    0: jdbc:hive2://server8:10000> select count(url) from t_kafkauser_parquet where create_time BETWEEN '2023-01-18 08:01:27' and '2023-01-18 08:01:28' ;

    INFO  : Total jobs = 1

    INFO  : number of splits:21

    INFO  : Stage-Stage-1: Map: 21  Reduce: 1   Cumulative CPU: 130.16 sec   HDFS Read: 64663344 HDFS Write: 105 HDFS EC Read: 0 SUCCESS
    INFO  : Total MapReduce CPU Time Spent: 2 minutes 10 seconds 160 msec
    INFO  : Completed executing command(queryId=hive_20230202132729_01d23207-6ddb-45f3-8d72-d505da333fe5); Time taken: 30.511 seconds

    +--------+
    |  _c0   |
    +--------+
    | 95500  |
    +--------+
    1 row selected (30.685 seconds)


3. Impala queries
    [server7:21000] test_million_data> select count(url) from t_kafkauser where create_time BETWEEN '2023-01-18 08:01:27' and '2023-01-18 08:01:28' ;
    
    +------------+
    | count(url) |
    +------------+
    | 95500      |
    +------------+
    Fetched 1 row(s) in 4.33s
    [server7:21000] test_million_data> select count(url) from t_kafkauser_orc where create_time BETWEEN '2023-01-18 08:01:27' and '2023-01-18 08:01:28' ;
    
    +------------+
    | count(url) |
    +------------+
    | 95500      |
    +------------+
    Fetched 1 row(s) in 5.93s
    [server7:21000] test_million_data> select count(url) from t_kafkauser_parquet where create_time BETWEEN '2023-01-18 08:01:27' and '2023-01-18 08:01:28' ;
   
    +------------+
    | count(url) |
    +------------+
    | 95500      |
    +------------+
    Fetched 1 row(s) in 0.31s

5) Join between two tables

1. SQL
    select a.* from t_kafkauser a join t_kafkauser_orc b on a.id = b.id and a.id = 108045695;
    select a.* from t_kafkauser_parquet a join t_kafkauser_orc b on a.id = b.id and a.id = 108045695;
    select a.* from t_kafkauser_parquet a join t_kafkauser b on a.id = b.id and a.id = 108045695;
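
In these statements the constant filter a.id = 108045695 sits in the ON clause; for an inner join this is equivalent to filtering in WHERE, so the first statement could just as well be written as the following sketch:

    select a.*
      from t_kafkauser a
      join t_kafkauser_orc b on a.id = b.id
     where a.id = 108045695;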

2. Hive queries
    0: jdbc:hive2://server8:10000> select a.* from t_kafkauser a join t_kafkauser_orc b on a.id = b.id and a.id = 108045695;

    INFO  : Total jobs = 1

    INFO  : number of splits:33

    INFO  : Stage-Stage-1: Map: 33  Reduce: 138   Cumulative CPU: 638.43 sec   HDFS Read: 8377892321 HDFS Write: 12103 HDFS EC Read: 0 SUCCESS
    INFO  : Total MapReduce CPU Time Spent: 10 minutes 38 seconds 430 msec
    INFO  : Completed executing command(queryId=hive_20230202134136_f7336a78-9201-4be8-b721-31e4c6e19674); Time taken: 166.162 seconds

    +------------+-----------+-----------------+-------------------------------+------------------------+
    |    a.id    | a.userid  |     a.name      |             a.url             |     a.create_time      |
    +------------+-----------+-----------------+-------------------------------+------------------------+
    | 108045695  | 99999999  | alan99999999_t  | https://www.win.com/99999999  | 2023-01-18 08:01:28.0  |
    +------------+-----------+-----------------+-------------------------------+------------------------+
    1 row selected (166.772 seconds)
    0: jdbc:hive2://server8:10000> select a.* from t_kafkauser_parquet a join t_kafkauser_orc b on a.id = b.id and a.id = 108045695;

    INFO  : Total jobs = 1

    INFO  : number of splits:23

    INFO  : Stage-Stage-1: Map: 23  Reduce: 100   Cumulative CPU: 440.63 sec   HDFS Read: 117456695 HDFS Write: 8797 HDFS EC Read: 0 SUCCESS
    INFO  : Total MapReduce CPU Time Spent: 7 minutes 20 seconds 630 msec
    INFO  : Completed executing command(queryId=hive_20230202135628_4ed7755b-b567-47b5-bb93-c523a0043af6); Time taken: 121.57 seconds

    +------------+-----------+-----------------+-------------------------------+------------------------+
    |    a.id    | a.userid  |     a.name      |             a.url             |     a.create_time      |
    +------------+-----------+-----------------+-------------------------------+------------------------+
    | 108045695  | 99999999  | alan99999999_t  | https://www.win.com/99999999  | 2023-01-18 08:01:28.0  |
    +------------+-----------+-----------------+-------------------------------+------------------------+
    1 row selected (121.973 seconds)
    0: jdbc:hive2://server8:10000> select a.* from t_kafkauser_parquet a join t_kafkauser b on a.id = b.id and a.id = 108045695;

    INFO  : Total jobs = 1

    INFO  : number of splits:52

    INFO  : Stage-Stage-1: Map: 52  Reduce: 212   Cumulative CPU: 1022.64 sec   HDFS Read: 8494722690 HDFS Write: 18541 HDFS EC Read: 0 SUCCESS
    INFO  : Total MapReduce CPU Time Spent: 17 minutes 2 seconds 640 msec
    INFO  : Completed executing command(queryId=hive_20230202140053_84155ad0-e645-4c5f-bbeb-feaf41dcbac6); Time taken: 242.307 seconds

    +------------+-----------+-----------------+-------------------------------+------------------------+
    |    a.id    | a.userid  |     a.name      |             a.url             |     a.create_time      |
    +------------+-----------+-----------------+-------------------------------+------------------------+
    | 108045695  | 99999999  | alan99999999_t  | https://www.win.com/99999999  | 2023-01-18 08:01:28.0  |
    +------------+-----------+-----------------+-------------------------------+------------------------+
    1 row selected (242.882 seconds)

3. Impala queries
    [server7:21000] test_million_data> select a.* from t_kafkauser a join t_kafkauser_orc b on a.id = b.id and a.id = 108045695;

    +-----------+----------+----------------+------------------------------+-----------------------+
    | id        | userid   | name           | url                          | create_time           |
    +-----------+----------+----------------+------------------------------+-----------------------+
    | 108045695 | 99999999 | alan99999999_t | https://www.win.com/99999999 | 2023-01-18 08:01:28.0 |
    +-----------+----------+----------------+------------------------------+-----------------------+
    Fetched 1 row(s) in 2.98s
    [server7:21000] test_million_data> select a.* from t_kafkauser_parquet a join t_kafkauser_orc b on a.id = b.id and a.id = 108045695;

    +-----------+----------+----------------+------------------------------+-----------------------+
    | id        | userid   | name           | url                          | create_time           |
    +-----------+----------+----------------+------------------------------+-----------------------+
    | 108045695 | 99999999 | alan99999999_t | https://www.win.com/99999999 | 2023-01-18 08:01:28.0 |
    +-----------+----------+----------------+------------------------------+-----------------------+
    Fetched 1 row(s) in 1.62s
    [server7:21000] test_million_data> select a.* from t_kafkauser_parquet a join t_kafkauser b on a.id = b.id and a.id = 108045695;

    +-----------+----------+----------------+------------------------------+-----------------------+
    | id        | userid   | name           | url                          | create_time           |
    +-----------+----------+----------------+------------------------------+-----------------------+
    | 108045695 | 99999999 | alan99999999_t | https://www.win.com/99999999 | 2023-01-18 08:01:28.0 |
    +-----------+----------+----------------+------------------------------+-----------------------+
    Fetched 1 row(s) in 2.03s

6) Summary

1. File storage
  • Text files take the most space, ORC the least, Parquet in between.
  • text: 1.92 GB per file, 4 files, close to 8 GB in total
  • orc: 208.12 MB per file, 4 files, close to 832 MB in total
  • parquet: 1.34 GB per file, 4 files, close to 5.36 GB in total
  • parquet-snappy: about 442 MB per file, 4 files, close to 1.76 GB in total (see the sketch after this summary for one way such a table can be produced)
2. Hive vs. Impala query speed
  • Overall, Hive query speed depends strongly on the file format: text is slowest, ORC fastest, Parquet in between. Impala query speed also depends strongly on the file format: text is in the middle, ORC slowest, Parquet fastest. For both Hive and Impala, the kind of query matters comparatively little. In these tests Impala was roughly one to two orders of magnitude faster than Hive.
3. Query speed across query types
  • Query speed does not depend much on the query type; the heavier queries are only somewhat slower.
4. Conclusions
  • Considering query speed alone, ORC is the first choice in Hive and Parquet is the first choice in Impala.
  • Considering both query speed and storage size, ORC is the first choice in Hive, and ORC is also the first choice in Impala (the file sizes differ by nearly 7x, while the query speeds differ by only about 4x).
  • In general, for mixed system use, Parquet is the common choice, but the characteristics of each format still have to be considered; for example, Parquet does not support index pages and is mainly used together with Impala.
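
As a follow-up to the storage figures above, the Snappy-compressed Parquet variant (t_kafkauser_parquet_snappy, which also appears in the queries of the next section) can be produced in Hive roughly as in the sketch below. How it was actually built is not shown in this article, so treat this purely as one assumed way of doing it, starting from the existing textfile table:

    -- sketch: build a Snappy-compressed Parquet copy of the textfile table
    -- (either the session setting or the table property alone is enough)
    SET parquet.compression=SNAPPY;

    CREATE TABLE IF NOT EXISTS t_kafkauser_parquet_snappy
    STORED AS PARQUET
    TBLPROPERTIES ('parquet.compression'='SNAPPY')
    AS SELECT * FROM t_kafkauser;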


V. Others' verification results and conclusions found online

(Screenshot of the referenced results and conclusions is not reproduced here.)

[server7:21000] test_million_data> select FROM_UNIXTIME(UNIX_TIMESTAMP(create_time),'yyyy-MM-dd HH:mm') dt,count(*) 
                                 > from t_kafkauser 
                                 > group by FROM_UNIXTIME(UNIX_TIMESTAMP(create_time),'yyyy-MM-dd HH:mm') 
                                 > order by FROM_UNIXTIME(UNIX_TIMESTAMP(create_time),'yyyy-MM-dd HH:mm');

+------------------+----------+
| dt               | count(*) |
+------------------+----------+
| 2023-01-18 07:42 | 929901   |
| 2023-01-18 07:43 | 5188500  |
| 2023-01-18 07:44 | 5053100  |
| 2023-01-18 07:45 | 5495900  |
| 2023-01-18 07:46 | 5280500  |
| 2023-01-18 07:47 | 5318800  |
| 2023-01-18 07:48 | 5322200  |
| 2023-01-18 07:49 | 5310600  |
| 2023-01-18 07:50 | 5523100  |
| 2023-01-18 07:51 | 5296400  |
| 2023-01-18 07:52 | 5810100  |
| 2023-01-18 07:53 | 5374300  |
| 2023-01-18 07:54 | 5350600  |
| 2023-01-18 07:55 | 5445300  |
| 2023-01-18 07:56 | 5309300  |
| 2023-01-18 07:57 | 5350400  |
| 2023-01-18 07:58 | 5351923  |
| 2023-01-18 07:59 | 5303077  |
| 2023-01-18 08:00 | 5474800  |
| 2023-01-18 08:01 | 2511199  |
+------------------+----------+
Fetched 20 row(s) in 35.12s
[server7:21000] test_million_data> select FROM_UNIXTIME(UNIX_TIMESTAMP(create_time),'yyyy-MM-dd HH:mm') dt,count(*) 
                                 > from t_kafkauser_orc 
                                 > group by FROM_UNIXTIME(UNIX_TIMESTAMP(create_time),'yyyy-MM-dd HH:mm') 
                                 > order by FROM_UNIXTIME(UNIX_TIMESTAMP(create_time),'yyyy-MM-dd HH:mm');

+------------------+----------+
| dt               | count(*) |
+------------------+----------+
| 2023-01-18 07:42 | 929901   |
| 2023-01-18 07:43 | 5188500  |
| 2023-01-18 07:44 | 5053100  |
| 2023-01-18 07:45 | 5495900  |
| 2023-01-18 07:46 | 5280500  |
| 2023-01-18 07:47 | 5318800  |
| 2023-01-18 07:48 | 5322200  |
| 2023-01-18 07:49 | 5310600  |
| 2023-01-18 07:50 | 5523100  |
| 2023-01-18 07:51 | 5296400  |
| 2023-01-18 07:52 | 5810100  |
| 2023-01-18 07:53 | 5374300  |
| 2023-01-18 07:54 | 5350600  |
| 2023-01-18 07:55 | 5445300  |
| 2023-01-18 07:56 | 5309300  |
| 2023-01-18 07:57 | 5350400  |
| 2023-01-18 07:58 | 5351923  |
| 2023-01-18 07:59 | 5303077  |
| 2023-01-18 08:00 | 5474800  |
| 2023-01-18 08:01 | 2511199  |
+------------------+----------+
Fetched 20 row(s) in 46.10s
[server7:21000] test_million_data> select FROM_UNIXTIME(UNIX_TIMESTAMP(create_time),'yyyy-MM-dd HH:mm') dt,count(*) 
                                 > from t_kafkauser_parquet 
                                 > group by FROM_UNIXTIME(UNIX_TIMESTAMP(create_time),'yyyy-MM-dd HH:mm') 
                                 > order by FROM_UNIXTIME(UNIX_TIMESTAMP(create_time),'yyyy-MM-dd HH:mm');

+------------------+----------+
| dt               | count(*) |
+------------------+----------+
| 2023-01-18 07:42 | 929901   |
| 2023-01-18 07:43 | 5188500  |
| 2023-01-18 07:44 | 5053100  |
| 2023-01-18 07:45 | 5495900  |
| 2023-01-18 07:46 | 5280500  |
| 2023-01-18 07:47 | 5318800  |
| 2023-01-18 07:48 | 5322200  |
| 2023-01-18 07:49 | 5310600  |
| 2023-01-18 07:50 | 5523100  |
| 2023-01-18 07:51 | 5296400  |
| 2023-01-18 07:52 | 5810100  |
| 2023-01-18 07:53 | 5374300  |
| 2023-01-18 07:54 | 5350600  |
| 2023-01-18 07:55 | 5445300  |
| 2023-01-18 07:56 | 5309300  |
| 2023-01-18 07:57 | 5350400  |
| 2023-01-18 07:58 | 5351923  |
| 2023-01-18 07:59 | 5303077  |
| 2023-01-18 08:00 | 5474800  |
| 2023-01-18 08:01 | 2511199  |
+------------------+----------+
Fetched 20 row(s) in 36.08s
[server7:21000] test_million_data> select FROM_UNIXTIME(UNIX_TIMESTAMP(create_time),'yyyy-MM-dd HH:mm') dt,count(*) 
                                 > from t_kafkauser_parquet_snappy 
                                 > group by FROM_UNIXTIME(UNIX_TIMESTAMP(create_time),'yyyy-MM-dd HH:mm') 
                                 > order by FROM_UNIXTIME(UNIX_TIMESTAMP(create_time),'yyyy-MM-dd HH:mm');

+------------------+----------+
| dt               | count(*) |
+------------------+----------+
| 2023-01-18 07:42 | 929901   |
| 2023-01-18 07:43 | 5188500  |
| 2023-01-18 07:44 | 5053100  |
| 2023-01-18 07:45 | 5495900  |
| 2023-01-18 07:46 | 5280500  |
| 2023-01-18 07:47 | 5318800  |
| 2023-01-18 07:48 | 5322200  |
| 2023-01-18 07:49 | 5310600  |
| 2023-01-18 07:50 | 5523100  |
| 2023-01-18 07:51 | 5296400  |
| 2023-01-18 07:52 | 5810100  |
| 2023-01-18 07:53 | 5374300  |
| 2023-01-18 07:54 | 5350600  |
| 2023-01-18 07:55 | 5445300  |
| 2023-01-18 07:56 | 5309300  |
| 2023-01-18 07:57 | 5350400  |
| 2023-01-18 07:58 | 5351923  |
| 2023-01-18 07:59 | 5303077  |
| 2023-01-18 08:00 | 5474800  |
| 2023-01-18 08:01 | 2511199  |
+------------------+----------+
Fetched 20 row(s) in 34.58s

This completes the whole comparison.
