1. 多维函数说明
- 多维函数作用:是一种将多个
group by
逻辑写在一个sql语句中的便利写法。等价于将不同维度的group by
结果集进行union all
- 与union all 效率对比:Grouping Sets 版本的 SQL 不仅在表达上更加简洁,在性能上也更加高效
- Hive中多维函数有哪些?
- grouping sets()
- with cube
- with rollup
2. grouping sets()
(1) 语法
假设现在有如下3个SQL:
-- 第一个sql:以a维度计算num
select a, sum(num) from t group by a
-- 第二个sql:以b维度计算num
select b, sum(num) from t group by b
-- 第三个sql:以a, b维度计算num
select a, b, sum(num) from t group by a, b
如何把这三个SQL写到一个SQL中?
-
方式一:用
union all
select a, -- 只有a维度 null as b, -- 没有b维度 sum(num) as num from t group by a union all select null, -- 没有a维度 b, -- 只有b维度 sum(num) as num from t group by b union all select a, -- 有a维度 b, -- 也有维度 sum(num) as num from t group by a, b
-
方式二:用
grouping sets()
函数-- 将这两个合在一起的更简单的写法:使用 grouping sets select a, b, sum(num) as num from t group by a, b grouping sets (a, b, (a, b)) -- sets后面的括号指名了要取出现哪些维度 order by grouping__id -- grouping__id表示结果属于哪一个分组集合,以二进制的形式表达
- 其他有用的文档:Hive/Presto中函数grouping sets用法详解(踩坑总结,看到赚到)
- 注意:
grouping sets
会把group by
子句中没有参与聚合的维度列置null
。也就是说在结果上完全等价于方式一。
(2) grouping__id字段
select a,b,c,grouping__id from test group by a,b,c
grouping sets (
(a,b),
(a, c),
(a, b, c)
)
- 在Hive中grouping__id的数字规则:根据group by 字段 ,从低到高,group sets中出现的字段是1没出现是0。
因此,上述SQL中(a, b)
对应011
;(a,c)
对应101
- Spark中grouping__id的数字规则:根据group by 字段 ,从高到低,group sets中出现的字段是0没出现是1
因此,上述SQL中(a, b)
对应001
;(a,c)
对应010
(3) 例子
-
cookie_info.txt
文件:2018-03,2018-03-10,cookie1 2018-03,2018-03-10,cookie5 2018-03,2018-03-12,cookie7 2018-04,2018-04-12,cookie3 2018-04,2018-04-13,cookie2 2018-04,2018-04-13,cookie4 2018-04,2018-04-16,cookie4 2018-03,2018-03-10,cookie2 2018-03,2018-03-10,cookie3 2018-04,2018-04-12,cookie5 2018-04,2018-04-13,cookie6 2018-04,2018-04-15,cookie3 2018-04,2018-04-15,cookie2 2018-04,2018-04-16,cookie1
-
导入数据
CREATE TABLE cookie_info( month STRING, day STRING, cookieid STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; load data local inpath '/root/hivedata/cookie_info.txt' into table cookie_info; select * from cookie_info;
-
分别统计每一天的cookie和每月的cookie并联合在一张表中
SELECT month, day, COUNT(DISTINCT cookieid) AS nums, GROUPING__ID FROM cookie_info GROUP BY month,day GROUPING SETS (month,day) ORDER BY GROUPING__ID;
结果:
3. with cube
cube() 函数表示根据GROUP BY的维度的所有组合进行聚合。
对于cube来说,如果有n个维度,则所有组合的总个数是: 2 n 2^{n} 2n
比如cube有a,b,c 3个维度,则所有组合情况是: (a,b,c)
,(a,b)
,(a,c)
,(a)
,(b,c)
,(b)
,(c)
,()
SELECT
month,
day,
COUNT(DISTINCT cookieid) AS nums,
GROUPING__ID
FROM cookie_info
GROUP BY month,day
WITH CUBE
ORDER BY GROUPING__ID;
--等价于
select NULL, NULL, nums, GROUPING__ID from (SELECT NULL,NULL,COUNT(DISTINCT cookieid) AS nums,0 AS GROUPING__ID FROM cookie_info)
UNION ALL
select month, NULL, nums, GROUPING__ID from (SELECT month,NULL,COUNT(DISTINCT cookieid) AS nums,1 AS GROUPING__ID FROM cookie_info GROUP BY month)
UNION ALL
select NULL, day, nums, GROUPING__ID from (SELECT NULL,day,COUNT(DISTINCT cookieid) AS nums,2 AS GROUPING__ID FROM cookie_info GROUP BY day)
UNION ALL
select month, day, nums, GROUPING__ID from (SELECT month,day,COUNT(DISTINCT cookieid) AS nums,3 AS GROUPING__ID FROM cookie_info GROUP BY month,day;)
4. with rollup
rollup() 函数是cube的子集,以最左侧的维度为主。文章来源:https://www.toymoban.com/news/detail-642508.html
比如ROLLUP有a,b,c3个维度,则所有组合情况是:(a,b,c)
,(a,b)
,(a)
,()
文章来源地址https://www.toymoban.com/news/detail-642508.html
SELECT
month,
day,
COUNT(DISTINCT cookieid) AS nums,
GROUPING__ID
FROM cookie_info
GROUP BY month,day
WITH ROLLUP
ORDER BY GROUPING__ID;
-- 等价于
select month, day, nums, GROUPING__ID from (SELECT month,day,COUNT(DISTINCT cookieid) AS nums,3 AS GROUPING__ID FROM cookie_info GROUP BY month,day;)
UNION ALL
select month, NULL, nums, GROUPING__ID from (SELECT month,NULL,COUNT(DISTINCT cookieid) AS nums,1 AS GROUPING__ID FROM cookie_info GROUP BY month)
UNION ALL
select NULL, NULL, nums, GROUPING__ID from (SELECT NULL,NULL,COUNT(DISTINCT cookieid) AS nums,0 AS GROUPING__ID FROM cookie_info)
到了这里,关于【Hive】grouping sets() 函数的文章就介绍完了。如果您还想了解更多内容,请在右上角搜索TOY模板网以前的文章或继续浏览下面的相关文章,希望大家以后多多支持TOY模板网!