pivot by
pivot by 是 DolphinDB 的独有功能,是对标准 SQL 语句的拓展。它将表中一列或多列的内容按照两个维度重新排列,亦可配合数据转换函数使用。与 select 子句一起使用时返回一个表,而和 exec 语句一起使用时返回一个矩阵。 参考 pivot 和 unpivot。
1.20.4 版本提供了以下两点改进:
pivot by 后面的列数可为两个以上,其中最后一列为列维度,其它各列均为行维度。
使用 pivot by 时,select 子句可使用多个指标。
例子
例1:行维度为一列(即 pivot by 之后仅有两列)
$ sym = `C`MS`MS`MS`IBM`IBM`C`C`C
$ price= 49.6 29.46 29.52 30.02 174.97 175.23 50.76 50.32 51.29
$ qty = 2200 1900 2100 3200 6800 5400 1300 2500 8800
$ timestamp = [09:34:07,09:35:42,09:36:51,09:36:59,09:35:47,09:36:26,09:34:16,09:35:26,09:36:12]
$ t2 = table(timestamp, sym, qty, price);
$ t2;
timestamp |
sym |
qty |
price |
---|---|---|---|
09:34:07 |
C |
2200 |
49.6 |
09:35:42 |
MS |
1900 |
29.46 |
09:36:51 |
MS |
2100 |
29.52 |
09:36:59 |
MS |
3200 |
30.02 |
09:35:47 |
IBM |
6800 |
174.97 |
09:36:26 |
IBM |
5400 |
175.23 |
09:34:16 |
C |
1300 |
50.76 |
09:35:26 |
C |
2500 |
50.32 |
09:36:12 |
C |
8800 |
51.29 |
$ select price from t2 pivot by timestamp, sym;
timestamp |
C |
IBM |
MS |
---|---|---|---|
09:34:07 |
49.6 |
||
09:34:16 |
50.76 |
||
09:35:26 |
50.32 |
||
09:35:42 |
29.46 |
||
09:35:47 |
174.97 |
||
09:36:12 |
51.29 |
||
09:36:26 |
175.23 |
||
09:36:51 |
29.52 |
||
09:36:59 |
30.02 |
$ a=select last(price) from t2 pivot by timestamp.minute(), sym;
$ a;
minute_timestamp |
C |
IBM |
MS |
---|---|---|---|
09:34m |
50.76 |
||
09:35m |
50.32 |
174.97 |
29.46 |
09:36m |
51.29 |
175.23 |
30.02 |
$ typestr a;
TABLE
$ b=exec count(price) from t2 pivot by timestamp.minute(), sym;
$ b;
label |
C |
IBM |
MS |
---|---|---|---|
09:34m |
2 |
||
09:35m |
1 |
1 |
1 |
09:36m |
1 |
1 |
2 |
$ typestr b;
FAST DOUBLE MATRIX
例2:计算 ETF 的内在价值。为简便起见,假设 ETF 有两个成分股:AAPL 和 FB,其成分权重为 [0.6, 0.4]。
$ symbol=take(`AAPL, 6) join take(`FB, 5)
$ time=2019.02.27T09:45:01.000000000+[146, 278, 412, 445, 496, 789, 212, 556, 598, 712, 989]
$ price=173.27 173.26 173.24 173.25 173.26 173.27 161.51 161.50 161.49 161.50 161.51
$ quotes=table(symbol, time, price)
$ weights=dict(`AAPL`FB, 0.6 0.4)
$ ETF = select symbol, time, price*weights[symbol] as price from quotes;
$ select last(price) from ETF pivot by time, symbol;
Time |
AAPL |
FB |
---|---|---|
2019.02.27T09:45:01.000000146 |
103.962 |
|
2019.02.27T09:45:01.000000212 |
64.604 |
|
2019.02.27T09:45:01.000000278 |
103.956 |
|
2019.02.27T09:45:01.000000412 |
103.944 |
|
2019.02.27T09:45:01.000000445 |
103.95 |
|
2019.02.27T09:45:01.000000496 |
103.956 |
|
2019.02.27T09:45:01.000000556 |
64.6 |
|
2019.02.27T09:45:01.000000598 |
64.596 |
|
2019.02.27T09:45:01.000000712 |
64.6 |
|
2019.02.27T09:45:01.000000789 |
103.962 |
|
2019.02.27T09:45:01.000000989 |
64.604 |
以上结果为在每个时间戳经过 ETF 权重调整的股票价格。为了计算 ETF 在每个时间戳的内在价值,需要使用 ffill 函数来填充每个股票的 NULL 值,然后将所有股票价格相加。可在一个 SQL 语句中完成。
$ select rowSum(ffill(last(price))) from ETF pivot by time, symbol;
Time |
rowSum |
---|---|
2019.02.27T09:45:01.000000146 |
103.962 |
2019.02.27T09:45:01.000000212 |
168.566 |
2019.02.27T09:45:01.000000278 |
168.56 |
2019.02.27T09:45:01.000000412 |
168.548 |
2019.02.27T09:45:01.000000445 |
168.554 |
2019.02.27T09:45:01.000000496 |
168.56 |
2019.02.27T09:45:01.000000556 |
168.556 |
2019.02.27T09:45:01.000000598 |
168.552 |
2019.02.27T09:45:01.000000712 |
168.556 |
2019.02.27T09:45:01.000000789 |
168.562 |
2019.02.27T09:45:01.000000989 |
168.566 |
例3:使用 pivot by 时某些新建列名的自动转换
1.30.14/2.00.2 版本以前,DolphinDB 中列名仅可使用中文或英文字母、数字或下划线 (_),且必须以中文或英文字母开头。
1.30.14/2.00.2 版本开始,由 pivot by, addColumn 操作产生的列名,支持包含特殊字符,或以数字开头。
详情可以参考:创建表章节。
$ date = take(2021.08.01 2021.08.02 2021.08.03, 12)
$ sym = take(["IBM N", "_MSFTN", "3_GOOGS", ""], 12).sort()
$ value = 1..12
$ t=table(date, sym, value)
$ re = select value from t pivot by date, sym
date |
NULL |
3_GOOGS |
IBM N |
_MSFTN |
---|---|---|---|---|
2021.08.01 |
1 |
4 |
7 |
10 |
2021.08.02 |
2 |
5 |
8 |
11 |
2021.08.03 |
3 |
6 |
9 |
12 |
$ select _"NULL" from re
NULL |
---|
1 |
2 |
3 |
$ select _"3_GOOGS" from re
3_GOOGS |
---|
4 |
5 |
6 |
下例,设置 removeSpecialCharInColumnName = true,股票代码 "600300" 与 "600600" 在成为列名后,系统自动将其处理为 "C600300" 与 "C600600"。
$ symbol=take(`600300, 6) join take(`600600, 5)
$ time=2019.02.27T09:45:01.000000000+[146, 278, 412, 445, 496, 789, 212, 556, 598, 712, 989]
$ price=173.27 173.26 173.24 173.25 173.26 173.27 161.51 161.50 161.49 161.50 161.51
$ quotes=table(symbol, time, price)
$ weights=dict(`600300`600600, 0.6 0.4)
$ ETF = select symbol, time, price*weights[symbol] as price from quotes;
$ select last(price) from ETF pivot by time, symbol;
time |
C600300 |
C600600 |
---|---|---|
2019.02.27T09:45:01.000000146 |
103.962 |
|
2019.02.27T09:45:01.000000212 |
64.604 |
|
2019.02.27T09:45:01.000000278 |
103.956 |
|
2019.02.27T09:45:01.000000412 |
103.944 |
|
2019.02.27T09:45:01.000000445 |
103.95 |
|
2019.02.27T09:45:01.000000496 |
103.956 |
|
2019.02.27T09:45:01.000000556 |
64.6 |
|
2019.02.27T09:45:01.000000598 |
64.596 |
|
2019.02.27T09:45:01.000000712 |
64.6 |
|
2019.02.27T09:45:01.000000789 |
103.962 |
|
2019.02.27T09:45:01.000000989 |
64.604 |
例4:行维度为多列(即 pivot by 之后多于两列)
$ date = 2020.09.21 + 0 0 0 0 1 1 1 1
$ sym = `MS`MS`GS`GS`MS`MS`GS`GS$SYMBOL
$ factorNum = 1 2 1 2 1 2 1 2
$ factorValue = 1.2 -3.4 -2.5 6.3 1.1 -3.2 -2.1 5.6
$ t = table(date, sym, factorNum, factorValue);
$ t;
date |
sym |
factorNum |
factorValue |
---|---|---|---|
2020.09.21 |
MS |
1 |
1.2 |
2020.09.21 |
MS |
2 |
-3.4 |
2020.09.21 |
GS |
1 |
-2.5 |
2020.09.21 |
GS |
2 |
6.3 |
2020.09.22 |
MS |
1 |
1.1 |
2020.09.22 |
MS |
2 |
-3.2 |
2020.09.22 |
GS |
1 |
-2.1 |
2020.09.22 |
GS |
2 |
5.6 |
$ select factorValue from t pivot by date, sym, factorNum;
date |
sym |
C1 |
C2 |
---|---|---|---|
2020.09.21 |
GS |
-2.5 |
6.3 |
2020.09.21 |
MS |
1.2 |
-3.4 |
2020.09.22 |
GS |
-2.1 |
5.6 |
2020.09.22 |
MS |
1.1 |
-3.2 |
例5:select 子句包含多个指标
$ select factorValue, factorValue>0 as factorSign from t pivot by date, sym, factorNum;
date |
sym |
factorValue_1 |
factorValue_2 |
factorSign_1 |
factorSign_2 |
---|---|---|---|---|---|
2020.09.21 |
GS |
-2.5 |
6.3 |
0 |
1 |
2020.09.21 |
MS |
1.2 |
-3.4 |
1 |
0 |
2020.09.22 |
GS |
-2.1 |
5.6 |
0 |
1 |
2020.09.22 |
MS |
1.1 |
-3.2 |
1 |
0 |
例6:若 pivot by 后面的列在原表中存在多个相同的值,pivot by 会进行去重,仅保留最后一个值的数据。
$ t = table((`C+string(1..10)) join (`C+string(1..10)) as sym, (2020.10.01+11..20) join (2020.10.01+11..20) as time, 21..40 as val)
$ select val from t pivot by time, sym;
time |
C1 |
C10 |
C2 |
C3 |
C4 |
C5 |
C6 |
C7 |
C8 |
C9 |
---|---|---|---|---|---|---|---|---|---|---|
2020.10.12 |
31 |
|||||||||
2020.10.13 |
32 |
|||||||||
2020.10.14 |
33 |
|||||||||
2020.10.15 |
34 |
|||||||||
2020.10.16 |
35 |
|||||||||
2020.10.17 |
36 |
|||||||||
2020.10.18 |
37 |
|||||||||
2020.10.19 |
38 |
|||||||||
2020.10.20 |
39 |
|||||||||
2020.10.21 |
40 |