1. 窗口函数概述
1.1. 什么是窗口函数
- 简单理解,就是对查询的结果多出一列,这一列可以是聚合值,也可以是排序值。
- 开窗函数一般就是说的是over()函数,其窗口是由一个 OVER 子句 定义的多行记录
- 开窗函数一般分为两类,聚合开窗函数和排序开窗函数。
1.2. 窗口函数的功能
<窗口函数> over (partition by <用于分组的列名> order by <用于排序的列名> [rows between ?? and ???])
1.3. 窗口函数的种类和位置
- 专用窗口函数,包括rank, dense_rank, row_number等专用窗口函数
- 聚合函数,如sum(). avg(), count(), max(), min()等,rows between…and…
- 因为窗口函数是对where或者group by子句处理后的结果进行操作,所以窗口函数原则上只能写在select子句中
1.4. 窗口函数说明
- over():是窗口函数的核心,需要开窗的全部都需要使用这个函数
- partition by:用来对表分组,后续跟上分组的字段,会给该字段中每个不同的值开一个窗(可以添加多个字段)
- order by:是对分组后的结果进行排序,默认是按照升序(asc)排列,可以进行多个字段降序
- rows between ?? and ???:用来对窗口中已经排序完成后的上下数据进行筛选
2. 聚合窗口函数
聚合窗口函数就是在通过over开窗的情况下,使用sum、max等聚合函数对开窗之后的结果进行聚合汇总;跟普通的聚合函数不同,普通聚合函数只会生成一条结果数据,聚合窗口函数会在每一行中都新增一列,用了存储这聚合结果,这个窗口有多少条数据,那就会有多少条一模一样的结果数据。
原始数据如下表所示(只列举出了部分数据):
user_id |
cat_id |
score |
618 |
488 |
0.0320793066566 |
7325 |
5648 |
0.1794971885383 |
7325 |
5653 |
0.0195317941826 |
9549 |
472 |
0.2734451185567 |
9549 |
316 |
0.0102923143349 |
10187 |
560 |
0.09287464288 |
16093 |
299 |
0.8203353556701 |
19109 |
450 |
0.0129389094682 |
29043 |
5580 |
0.0780796261011 |
29043 |
360 |
0.0445708866509 |
30207 |
301 |
0.1283038644617 |
37469 |
318 |
0.0672950094793 |
42121 |
364 |
0.0285768131938 |
45021 |
476 |
0.0290241345941 |
48161 |
308 |
1.3997567045733 |
48161 |
530 |
0.0280696255429 |
53008 |
510 |
0.0292976912317 |
53568 |
302 |
0.6823961209357 |
58397 |
300 |
0.0356667545944 |
SQL编写和查询结果展示:
select
user_id
, third_cat_id as cat_id
, score
, max(score) over(partition by user_id) max_score_of_user
, min(score) over(partition by user_id) min_score_of_user
, count(score) over(partition by user_id) cnt_score_of_user
, avg(score) over(partition by user_id) avg_score_of_user
, sum(score) over(partition by user_id) sum_score_of_user
from yishou_recommendation_system.user_third_category_score_15day
user_id |
cat_id |
score |
max_score_of_user |
min_score_of_user |
cnt_score_of_user |
avg_score_of_user |
sum_score_of_user |
1697 |
325 |
0.24999999928 |
0.24999999928 |
0.24999999928 |
1 |
0.24999999928 |
0.24999999928 |
13638 |
401 |
0.1924408114735 |
2.3144764447547 |
0.0153345765378 |
29 |
0.30592698646857247 |
8.871882607588601 |
13638 |
318 |
0.8433564379566 |
2.3144764447547 |
0.0153345765378 |
29 |
0.30592698646857247 |
8.871882607588601 |
13638 |
295 |
0.8473076857954 |
2.3144764447547 |
0.0153345765378 |
29 |
0.30592698646857247 |
8.871882607588601 |
13638 |
299 |
0.3413452337674 |
2.3144764447547 |
0.0153345765378 |
29 |
0.30592698646857247 |
8.871882607588601 |
13638 |
556 |
0.4636178949445 |
2.3144764447547 |
0.0153345765378 |
29 |
0.30592698646857247 |
8.871882607588601 |
13638 |
472 |
0.0769230769231 |
2.3144764447547 |
0.0153345765378 |
29 |
0.30592698646857247 |
8.871882607588601 |
13638 |
314 |
0.1531243564897 |
2.3144764447547 |
0.0153345765378 |
29 |
0.30592698646857247 |
8.871882607588601 |
13638 |
363 |
0.0153345765378 |
2.3144764447547 |
0.0153345765378 |
29 |
0.30592698646857247 |
8.871882607588601 |
13638 |
431 |
0.0928746430137 |
2.3144764447547 |
0.0153345765378 |
29 |
0.30592698646857247 |
8.871882607588601 |
13638 |
303 |
0.1512020592657 |
2.3144764447547 |
0.0153345765378 |
29 |
0.30592698646857247 |
8.871882607588601 |
13638 |
316 |
0.1252590167457 |
2.3144764447547 |
0.0153345765378 |
29 |
0.30592698646857247 |
8.871882607588601 |
13638 |
490 |
0.0153345765378 |
2.3144764447547 |
0.0153345765378 |
29 |
0.30592698646857247 |
8.871882607588601 |
13638 |
301 |
0.4550003444724 |
2.3144764447547 |
0.0153345765378 |
29 |
0.30592698646857247 |
8.871882607588601 |
13638 |
488 |
0.4961183604825 |
2.3144764447547 |
0.0153345765378 |
29 |
0.30592698646857247 |
8.871882607588601 |
13638 |
300 |
0.2698680479978 |
2.3144764447547 |
0.0153345765378 |
29 |
0.30592698646857247 |
8.871882607588601 |
13638 |
361 |
0.0769230769231 |
2.3144764447547 |
0.0153345765378 |
29 |
0.30592698646857247 |
8.871882607588601 |
13638 |
493 |
0.0153345765378 |
2.3144764447547 |
0.0153345765378 |
29 |
0.30592698646857247 |
8.871882607588601 |
13638 |
521 |
0.1332110723867 |
2.3144764447547 |
0.0153345765378 |
29 |
0.30592698646857247 |
8.871882607588601 |
13638 |
561 |
0.0769230769231 |
2.3144764447547 |
0.0153345765378 |
29 |
0.30592698646857247 |
8.871882607588601 |
3. 排序窗口函数
3.1. 不同排序窗口函数的区别
- rank():如果有并列名次的行,会占用下一名次的位置,排序结果是 1,1,3
- dense_rank():如果有并列名次的行,不占用下一名次的位置,排序结果是 1,1,2
- row_number():不考虑并列名次的情况,排序结果是 1,2,3
3.2. SQL编写和查询结果展示
select
user_id
, third_cat_id as cat_id
, score
, rank() over(partition by user_id order by score) score_rank_of_user
, dense_rank() over(partition by user_id order by score) score_dense_rank_of_user
, row_number() over(partition by user_id order by score) score_row_number_of_user
from yishou_recommendation_system.user_third_category_score_15day
user_id |
cat_id |
score |
score_rank_of_user |
score_dense_rank_of_user |
score_row_number_of_user |
1697 |
325 |
0.24999999928 |
1 |
1 |
1 |
13638 |
493 |
0.0153345765378 |
1 |
1 |
1 |
13638 |
363 |
0.0153345765378 |
1 |
1 |
2 |
13638 |
5657 |
0.0153345765378 |
1 |
1 |
3 |
13638 |
490 |
0.0153345765378 |
1 |
1 |
4 |
13638 |
325 |
0.0153345765378 |
1 |
1 |
5 |
13638 |
340 |
0.0379776680745 |
6 |
2 |
6 |
13638 |
472 |
0.0769230769231 |
7 |
3 |
7 |
13638 |
361 |
0.0769230769231 |
7 |
3 |
8 |
13638 |
561 |
0.0769230769231 |
7 |
3 |
9 |
13638 |
562 |
0.0769230769231 |
7 |
3 |
10 |
13638 |
510 |
0.0928746430137 |
11 |
4 |
11 |
13638 |
431 |
0.0928746430137 |
11 |
4 |
12 |
13638 |
316 |
0.1252590167457 |
13 |
5 |
13 |
13638 |
521 |
0.1332110723867 |
14 |
6 |
14 |
13638 |
394 |
0.1401607366123 |
15 |
7 |
15 |
13638 |
303 |
0.1512020592657 |
16 |
8 |
16 |
13638 |
314 |
0.1531243564897 |
17 |
9 |
17 |
13638 |
401 |
0.1924408114735 |
18 |
10 |
18 |
13638 |
300 |
0.2698680479978 |
19 |
11 |
19 |
13638 |
299 |
0.3413452337674 |
20 |
12 |
20 |
13638 |
304 |
0.4085346499663 |
21 |
13 |
21 |
13638 |
308 |
0.4296759077854 |
22 |
14 |
22 |
13638 |
301 |
0.4550003444724 |
23 |
15 |
23 |
13638 |
556 |
0.4636178949445 |
24 |
16 |
24 |
13638 |
488 |
0.4961183604825 |
25 |
17 |
25 |
13638 |
294 |
0.4990914022087 |
26 |
18 |
26 |
13638 |
318 |
0.8433564379566 |
27 |
19 |
27 |
13638 |
295 |
0.8473076857954 |
28 |
20 |
28 |
13638 |
302 |
2.3144764447547 |
29 |
21 |
29 |
4. 排序截取数据函数
4.1. 排序截取数据函数的区别
- lag(col,n,default_val):获取往前第n行数据,col是列名,n是往上的行数,当第n行为null的时候取default_val
- lead(col,n, default_val):往后第n行数据,col是列名,n是往下的行数,当第n行为null的时候取default_val
- ntile(n):把有序分区中的行分发到指定数据的组中,各个组有编号,编号从1开始,对于每一行,NTILE返回此行所属的组的编号。
- cume_dist(),计算某个窗口或分区中某个值的累积分布。假定升序排序,则使用以下公式确定累积分布:小于等于当前值x的行数 / 窗口或partition分区内的总行数。其中,x 等于 order by 子句中指定的列的当前行中的值。
4.2. 原始数据
user_id |
record_date |
11343325 |
2022-09-04 06:17:24 |
11343325 |
2022-09-04 06:17:24 |
11343326 |
2022-09-04 06:17:25 |
11343326 |
2022-09-04 06:17:25 |
11343327 |
2022-09-04 06:17:26 |
11343327 |
2022-09-04 06:17:26 |
11343326 |
2022-09-04 06:17:27 |
11343328 |
2022-09-04 06:17:27 |
11343328 |
2022-09-04 06:17:27 |
11343329 |
2022-09-04 06:17:28 |
11343329 |
2022-09-04 06:17:28 |
4654923 |
2022-09-04 06:17:30 |
11343330 |
2022-09-04 06:17:32 |
11343330 |
2022-09-04 06:17:32 |
11343331 |
2022-09-04 06:17:35 |
11343331 |
2022-09-04 06:17:35 |
11343332 |
2022-09-04 06:17:37 |
11343332 |
2022-09-04 06:17:37 |
11343333 |
2022-09-04 06:17:37 |
11343333 |
2022-09-04 06:17:37 |
11343334 |
2022-09-04 06:17:40 |
11343334 |
2022-09-04 06:17:40 |
4.3. lag和lead函数的使用和结果
select
user_id
, record_date
, lag(record_date, 1, '1970-01-01 00:00:00') over(partition by user_id order by record_date) as start_date
, lead(record_date, 1, '9999-99-99 00:00:00') over(partition by user_id order by record_date) as end_date
from user_table
使用user_id分组,并使用record_date进行升序排序;lag函数会获取之前一行的record_date数据作为start_date字段的值,并回使用 '1970-01-01 00:00:00' 作为默认值;lead函数会获取之后一行的record_date数据作为end_date字段的值,并回使用 '9999-99-99 00:00:00' 作为默认值;结果如下所示:
user_id |
record_date |
start_date |
end_date |
1000045 |
2022-09-03 11:41:50 |
1970-01-01 00:00:00 |
9999-99-99 00:00:00 |
1000061 |
2022-09-03 15:22:21 |
1970-01-01 00:00:00 |
9999-99-99 00:00:00 |
100013 |
2022-09-03 08:14:36 |
1970-01-01 00:00:00 |
2022-09-03 14:47:43 |
100013 |
2022-09-03 14:47:43 |
2022-09-03 08:14:36 |
2022-09-03 14:48:25 |
100013 |
2022-09-03 14:48:25 |
2022-09-03 14:47:43 |
2022-09-03 14:49:11 |
100013 |
2022-09-03 14:49:11 |
2022-09-03 14:48:25 |
2022-09-03 15:08:41 |
100013 |
2022-09-03 15:08:41 |
2022-09-03 14:49:11 |
2022-09-03 15:09:27 |
100013 |
2022-09-03 15:09:27 |
2022-09-03 15:08:41 |
9999-99-99 00:00:00 |
具体赋值如下图所示:
5. 排序限制分区函数
5.1. 各排序限制分区函数说明
- rows必须跟在Order by 子句之后,对排序的结果进行限制,使用固定的行数来限制分区中的数据行数量;
- OVER():指定分析函数工作的数据窗口大小,这个数据窗口大小可能会随着行的变而变化;
- current row:当前行;
- n preceding:往前n行数据;
- n following:往后n行数据;
- unbounded preceding:该窗口数据的起点;
- unbounded following:该窗口数据的终点;
5.2. SQL编写和结果展示
select
user_id
, third_cat_id as cat_id
, score
-- 以user_id分组,按照score进行升序排序
, row_number() over(partition by user_id order by score) as row_number_id
-- 以user_id分组,按照score进行升序排序,求该窗口中第一行到当前行的和
, sum(score) over(partition by user_id order by score rows between unbounded preceding and current row) sum_1
-- 以user_id分组,按照score进行升序排序,求该窗口中之前一行到当前行的和
, sum(score) over(partition by user_id order by score rows between 1 preceding and current row) sum_2
-- 以user_id分组,按照score进行升序排序,求该窗口中当前行到之后一行的和
, sum(score) over(partition by user_id order by score rows between current row and 1 following) sum_3
-- 以user_id分组,按照score进行升序排序,求该窗口中当前行到最后一行的和
, sum(score) over(partition by user_id order by score rows between current row and unbounded following) sum_3
from yishou_recommendation_system.user_third_category_score_15day
user_id |
cat_id |
score |
row_number_id |
sum_1 |
sum_2 |
sum_3 |
sum_3 |
1697 |
325 |
0.24999999928 |
1 |
0.24999999928 |
0.24999999928 |
0.24999999928 |
0.24999999928 |
13638 |
490 |
0.0153345765378 |
1 |
0.0153345765378 |
0.0153345765378 |
0.0306691530756 |
8.8718826075886 |
13638 |
493 |
0.0153345765378 |
2 |
0.0306691530756 |
0.0306691530756 |
0.0306691530756 |
8.8565480310508 |
13638 |
363 |
0.0153345765378 |
3 |
0.0460037296134 |
0.0306691530756 |
0.0306691530756 |
8.841213454513 |
13638 |
325 |
0.0153345765378 |
4 |
0.0613383061512 |
0.0306691530756 |
0.0306691530756 |
8.8258788779752 |
13638 |
5657 |
0.0153345765378 |
5 |
0.076672882689 |
0.0306691530756 |
0.053312244612300004 |
8.8105443014374 |
13638 |
340 |
0.0379776680745 |
6 |
0.1146505507635 |
0.053312244612300004 |
0.1149007449976 |
8.7952097248996 |
13638 |
561 |
0.0769230769231 |
7 |
0.19157362768660002 |
0.1149007449976 |
0.1538461538462 |
8.7572320568251 |
13638 |
472 |
0.0769230769231 |
8 |
0.26849670460970004 |
0.1538461538462 |
0.1538461538462 |
8.680308979902 |
13638 |
562 |
0.0769230769231 |
9 |
0.34541978153280006 |
0.1538461538462 |
0.1538461538462 |
8.6033859029789 |
13638 |
361 |
0.0769230769231 |
10 |
0.4223428584559001 |
0.1538461538462 |
0.1697977199368 |
8.5264628260558 |
13638 |
431 |
0.0928746430137 |
11 |
0.5152175014696001 |
0.1697977199368 |
0.1857492860274 |
8.449539749132699 |
13638 |
510 |
0.0928746430137 |
12 |
0.6080921444833001 |
0.1857492860274 |
0.21813365975940002 |
8.356665106119 |
13638 |
316 |
0.1252590167457 |
13 |
0.733351161229 |
0.21813365975940002 |
0.2584700891324 |
8.2637904631053 |
13638 |
521 |
0.1332110723867 |
14 |
0.8665622336157 |
0.2584700891324 |
0.273371808999 |
8.1385314463596 |
13638 |
394 |
0.1401607366123 |
15 |
1.006722970228 |
0.273371808999 |
0.29136279587799996 |
8.0053203739729 |
13638 |
303 |
0.1512020592657 |
16 |
1.1579250294937 |
0.29136279587799996 |
0.3043264157554 |
7.8651596373606 |
13638 |
314 |
0.1531243564897 |
17 |
1.3110493859834 |
0.3043264157554 |
0.3455651679632 |
7.713957578094901 |
13638 |
401 |
0.1924408114735 |
18 |
1.5034901974569 |
0.3455651679632 |
0.4623088594713 |
7.5608332216052005 |
13638 |
300 |
0.2698680479978 |
19 |
1.7733582454546999 |
0.4623088594713 |
0.6112132817652001 |
7.368392410131699 |
13638 |
299 |
0.3413452337674 |
20 |
2.1147034792221 |
0.6112132817652001 |
0.7498798837337 |
7.098524362133899 |
13638 |
304 |
0.4085346499663 |
21 |
2.5232381291883996 |
0.7498798837337 |
0.8382105577517001 |
6.7571791283665 |
13638 |
308 |
0.4296759077854 |
22 |
2.9529140369738 |
0.8382105577517001 |
0.8846762522578 |
6.3486444784002 |
13638 |
301 |
0.4550003444724 |
23 |
3.4079143814462 |
0.8846762522578 |
0.9186182394169 |
5.918968570614799 |
13638 |
556 |
0.4636178949445 |
24 |
3.8715322763907 |
0.9186182394169 |
0.959736255427 |
5.4639682261424 |
13638 |
488 |
0.4961183604825 |
25 |
4.3676506368732 |
0.959736255427 |
0.9952097626912 |
5.0003503311979 |
13638 |
294 |
0.4990914022087 |
26 |
4.8667420390819 |
0.9952097626912 |
1.3424478401653 |
4.5042319707154 |
13638 |
318 |
0.8433564379566 |
27 |
5.7100984770385 |
1.3424478401653 |
1.690664123752 |
4.0051405685067 |
13638 |
295 |
0.8473076857954 |
28 |
6.5574061628338995 |
1.690664123752 |
3.1617841305501 |
3.1617841305501 |
13638 |
302 |
2.3144764447547 |
29 |
8.8718826075886 |
3.1617841305501 |
2.3144764447547 |
2.3144764447547 |
16530 |
314 |
0.0833333330933 |
1 |
0.0833333330933 |
0.0833333330933 |
0.1666666661866 |
1.6825354114854 |
16530 |
318 |
0.0833333330933 |
2 |
0.1666666661866 |
0.1666666661866 |
0.23571003949270003 |
1.5992020783921002 |
16530 |
493 |
0.1523767063994 |
3 |
0.319043372586 |
0.23571003949270003 |
0.3047534127988 |
1.5158687452988 |
16530 |
303 |
0.1523767063994 |
4 |
0.47142007898540006 |
0.3047534127988 |
0.3381259921594 |
1.3634920388994 |
16530 |
325 |
0.18574928576 |
5 |
0.6571693647454 |
0.3381259921594 |
0.37149857152 |
1.2111153325 |
16530 |
489 |
0.18574928576 |
6 |
0.8429186505054 |
0.37149857152 |
1.02536604674 |
1.02536604674 |
16530 |
324 |
0.83961676098 |
7 |
1.6825354114854 |
1.02536604674 |
0.83961676098 |
0.83961676098 |