Hive---拉链表-Toy模板网

这篇具有很好参考价值的文章主要介绍了Hive---拉链表。希望对大家有所帮助。如果存在错误或未考虑完全的地方，请大家不吝赐教，您也可以点击"举报违法"按钮提交疑问。

拉链表

定义

拉链表是一种数据模型，主要是针对数据仓库设计中表存储数据的方式而定义的，顾名思义，所谓拉链，就是记录历史。记录一个事物从开始，一直到当前状态的所有变化的信息。拉链表可以避免按每一天存储所有记录造成的海量存储问题，同时也是处理缓慢变化数据（SCD2）的一种常见方式。

用途

1.数据量比较大
2.表中的部分字段会被update,如用户的地址，产品的描述信息，订单的状态等等
3。需要查看某一个时间点或者时间段的历史快照信息，比如，查看某一个订单在历史某一个时间点的状态，比如，查看某一个用户在过去某一段时间内，更新过几次等等
4.变化的比例和频率不是很大，比如，总共有1000万的会员，每天新增和发生变化的有10万左右
5.如果对这边表每天都保留一份全量，那么每次全量中会保存很多不变的信息，对存储是极大的浪费

拉链历史表，既能满足反应数据的历史状态，又可以最大程度的节省存储

案例

全量流程

这里有张info表
Hive---拉链表

通过sqoop导出到Hive的ODS层

日期：2021-09-26
start_time:2021-09-26（抽取时间）
表名：ods_info
Hive---拉链表

增量流程

日期：2021-09-27
info表增加了一条数据并且数据进行了修改
Hive---拉链表

sqoop导出
新增数据+更新数据都要采集

select sid,sname,address,Create_time,Update_time from info where Create_time='2021-09-26' or Update_time='2021-09-26'

在ODS层再创建一张保存当天的新增和更新的数据
ods_info_new
表结构和ods_info一致

Hive---拉链表

原始拉链表数据和更新的数据进行合并处理

也就是ods_info和ods_info_new进行合并处理
Hive---拉链表

增量流程日期：2021-09-28
数据发生变化
Hive---拉链表

sqoop导出
新增数据+更新数据都要采集

select sid,sname,address,Create_time,Update_time from info where Create_time='2021-09-27' or Update_time='2021-09-27'

原表删除，建ods_info_new存放更改的数据
Hive---拉链表

9.27拉链表数据和更新的数据进行合并处理

也就是ods_info和ods_info_new进行合并处理
Hive---拉链表

这个整个过程就是拉链表的过程

合并过程

Hive---拉链表

第一步

更改原始拉链表(ods_info)的数据
select
a.sid,
a.sname,
a.address,
a.Create_time,
a.Update-time,
a.Start_time
if(a.end_time=‘9999-12-31’ and b.sid is not null,b.start_time,a.end_time) as End_time
from ods_info a
left join ods_info_new b
on a.sid=b.sid

第二步

将left join 的结果和新增更新表(ods_info_new)进行union all 合并操作保存在临时表中(ods_info_tmp)

第三步

将临时表(ods_info_tmp)覆盖到原始拉链表(ods_info)中 Insert overwrite ods_info select * from ods_info_tmp

insert into table ods_info_tmp select *
from (
         (select sid, sname, address, Create_time, Update_time, Start_time, End_time from ods_info_new)
         union all
         (
             select a.sid,
                    a.sname,
                    a.address,
                    a.Create_time,
                    a.Update_time,
                    a.Start_time,
                    if(a.end_time = '9999-12-31' and b.sid is not null, b.start_time, a.end_time) as End_time
             from ods_info a
                      left join ods_info_new b on a.sid = b.sid
         ));
insert overwrite table ods_info select * from ods_info_new;

案例二（含分区）

原始表数据
Hive---拉链表

创建外部表orders

create external table orders(
    orderid int,
    createdate string,
    modifiedtime string,
    status string
)
row format delimited fields terminated by '\t'
location '/tmp/lalian/orders';

增量分区表

create table ods_orders_inc(
    orderid int,
    createdate string,
    modifiedtime string,
    status string
)
partitioned by (day string)
row format delimited fields terminated by '\t';
insert overwrite table ods_orders_inc partition (day='2023-03-03')

历史记录表

create table dws_orders_his(
    orderid int,
    createdate string,
    modifiedtime string,
    status string,
    start_time string,
    end_time string
)
row format delimited  fields terminated by '\t';

insert overwrite table dws_orders_his
select orderid,createdate,modifiedtime,status,modifiedtime,'9999-12-31' from ods_orders_inc where day='2023-03-03';

2023-03-04凌晨

ods_orders_inc 分区表按日期分区
将orders数据导入到2023一03一03分区中

2023-03-05凌晨

原表数据发生更改
Hive---拉链表

将orders数据导入到2023-03-04分区中

insert overwrite table ods_orders_inc partition(day='2023-03-04')
select orderid, createdate, modifiedtime, status
from orders
where modifiedtime='2023-3-4';

合并操作

select tb.orderid,tb.createdate,tb.modifiedtime,tb.status,tb.start_time,tb.end_time
       from(
(select orderid, createdate, modifiedtime,status,modifiedtime as start_time,'9999-12-31' as end_time
from ods_orders_inc where day='2023-03-04')
union all
(select
       t1.orderid,
       t1.createdate,
       t1.modifiedtime,
       t1.status,
       t1.start_time,
       case when t2.orderid is not null
                     and t1.end_time >'2023-3-4' then '2023-3-4' else t1.end_time end end_time
from dws_orders_his as t1
    left join (select orderid from ods_orders_inc where day='2023-03-04') as t2 on t1.orderid=t2.orderid))tb
order by tb.orderid,tb.start_time;

Hive---拉链表
2023-03-06凌晨
原表数据发生更改

将orders数据导入到2023-03-05分区中

insert overwrite table ods_orders_inc partition (day='2023-03-05')
select orderid,createdate,modifiedtime,status from orders
where modifiedtime='2023-3-5' or (createdate='2023-3-5' and modifiedtime='2023-3-5');

合并操作

select tb1.*
from (
         (select orderid, createdate, modifiedtime, status, modifiedtime start_time, '9999-12-31' end_time
          from ods_orders_inc
          where day = '2023-03-05')
         union all
         (select t1.orderid,
                 t1.createdate,
                 t1.modifiedtime,
                 t1.status,
                 t1.start_time,
                 `if`(t2.orderid is not null and t1.end_time > '2023-03-05', '2023-03-05', t1.end_time) end_time
          from dws_orders_his t1
                   left join (select orderid, modifiedtime
                              from ods_orders_inc
                              where day = '2023-03-05') t2 on t1.orderid = t2.orderid)
     ) tb1 order by tb1.orderid,tb1.modifiedtime;

Hive---拉链表
以此类推，整个过程即为拉链表

每日的用户更新表该怎么获取，有3种方式拿到或者间接拿到每日的用户增量，因为它比较重要，所以详细说明：
我们可以监听 Mysql 库数据的变化，比如说用 Canal，最后合并每日的变化，获取到最后的一个状态。
假设我们每天都会获得一份切片数据，我们可以通过取两天切片数据的不同来作为每日更新表，这种情况下我们可以对所有的字段先进行 concat，再取 md5，这样就 ok 了。
流水表！有每日的变更流水表。通过etl工具对操作型数据库按照时间字段增量抽取到ods或者数据仓库(每天抽取前一天的数据)，形成每天的增量数据(实际中使用最多的情形)。文章来源地址https://www.toymoban.com/news/detail-496641.html

到了这里，关于Hive---拉链表的文章就介绍完了。如果您还想了解更多内容，请在右上角搜索TOY模板网以前的文章或继续浏览下面的相关文章，希望大家以后多多支持TOY模板网！