pivot by

pivot by是DolphinDB的独有功能,是对标准SQL语句的拓展。它将表中一列或多列的内容按照两个维度重新排列,亦可配合数据转换函数使用。与select子句一起使用时返回一个表,而和exec语句一起使用时返回一个矩阵。参考 pivotunpivot

1.20.4版本提供了以下两点改进:

  • pivot by后面的列数可为两个以上,其中最后一列为列维度,其它各列均为行维度。

  • 使用pivot by时,select子句可使用多个指标。

例子

例1:行维度为一列(即pivot by之后仅有两列)

sym = `C`MS`MS`MS`IBM`IBM`C`C`C
price= 49.6 29.46 29.52 30.02 174.97 175.23 50.76 50.32 51.29
qty = 2200 1900 2100 3200 6800 5400 1300 2500 8800
timestamp = [09:34:07,09:35:42,09:36:51,09:36:59,09:35:47,09:36:26,09:34:16,09:35:26,09:36:12]
t2 = table(timestamp, sym, qty, price);

t2;

timestamp

sym

qty

price

09:34:07

C

2200

49.6

09:35:42

MS

1900

29.46

09:36:51

MS

2100

29.52

09:36:59

MS

3200

30.02

09:35:47

IBM

6800

174.97

09:36:26

IBM

5400

175.23

09:34:16

C

1300

50.76

09:35:26

C

2500

50.32

09:36:12

C

8800

51.29

select price from t2 pivot by timestamp, sym;

timestamp

C

IBM

MS

09:34:07

49.6

09:34:16

50.76

09:35:26

50.32

09:35:42

29.46

09:35:47

174.97

09:36:12

51.29

09:36:26

175.23

09:36:51

29.52

09:36:59

30.02

a=select last(price) from t2 pivot by timestamp.minute(), sym;

a;

minute_timestamp

C

IBM

MS

09:34m

50.76

09:35m

50.32

174.97

29.46

09:36m

51.29

175.23

30.02

$ typestr a;
TABLE

$ b=exec count(price) from t2 pivot by timestamp.minute(), sym;
$ b;

label

C

IBM

MS

09:34m

2

09:35m

1

1

1

09:36m

1

1

2

$ typestr b;
FAST DOUBLE MATRIX

例2:计算ETF的内在价值。为简便起见,假设ETF有两个成分股:AAPL和FB,其成分权重为[0.6, 0.4]。

symbol=take(`AAPL, 6) join take(`FB, 5)
time=2019.02.27T09:45:01.000000000+[146, 278, 412, 445, 496, 789, 212, 556, 598, 712, 989]
price=173.27 173.26 173.24 173.25 173.26 173.27 161.51 161.50 161.49 161.50 161.51
quotes=table(symbol, time, price)
weights=dict(`AAPL`FB, 0.6 0.4)
ETF = select symbol, time, price*weights[symbol] as price from quotes;

select last(price) from ETF pivot by time, symbol;

Time

AAPL

FB

2019.02.27T09:45:01.000000146

103.962

2019.02.27T09:45:01.000000212

64.604

2019.02.27T09:45:01.000000278

103.956

2019.02.27T09:45:01.000000412

103.944

2019.02.27T09:45:01.000000445

103.95

2019.02.27T09:45:01.000000496

103.956

2019.02.27T09:45:01.000000556

64.6

2019.02.27T09:45:01.000000598

64.596

2019.02.27T09:45:01.000000712

64.6

2019.02.27T09:45:01.000000789

103.962

2019.02.27T09:45:01.000000989

64.604

以上结果为在每个时间戳经过ETF权重调整的股票价格。为了计算ETF在每个时间戳的内在价值,需要使用 ffill 函数来填充每个股票的NULL值,然后将所有股票价格相加。可在一个SQL语句中完成。

select rowSum(ffill(last(price))) from ETF pivot by time, symbol;

Time

rowSum

2019.02.27T09:45:01.000000146

103.962

2019.02.27T09:45:01.000000212

168.566

2019.02.27T09:45:01.000000278

168.56

2019.02.27T09:45:01.000000412

168.548

2019.02.27T09:45:01.000000445

168.554

2019.02.27T09:45:01.000000496

168.56

2019.02.27T09:45:01.000000556

168.556

2019.02.27T09:45:01.000000598

168.552

2019.02.27T09:45:01.000000712

168.556

2019.02.27T09:45:01.000000789

168.562

2019.02.27T09:45:01.000000989

168.566

例3:使用pivot by时某些新建列名的自动转换

DolphinDB中列名仅可使用中文或英文字母、数字或下划线(_),且必须以中文或英文字母开头。使用pivot by时,若成为列名的值不符合以上要求,系统会依据以下规则自动调整列名:

  • 若数据中列名存在中文或英文字母、数字或下划线之外的字符,将其转换为下划线。

  • 若数据中列名第一个字符不是中文或英文字母,添加"c"作为该列名首字符。

1.30.14版本后,DolphinDB开始支持列名中包含特殊符号,对pivot by产生的数据表的列名不再特殊处理,包括不用删除特殊字符,也不再要求列名必须以字母开头。但需要注意:

  • 引用包含特殊符号的列名时,为了在SQL中引用不规范的列名,必须使用双引号包含该列名,并在双引号前加下划线,例如 _”IBM.N”, _”000001.SH”。

  • 包含特殊符号的列名只能通过SQL语句或者tb["_C"]的方式访问,不能通过tb.col的方式访问。

  • 当表内含有空字段,系统会将其转化为 NULL。

为了与之前版本的代码兼容,引入了配置变量 removeSpecialCharInColumnName,默认值是false,表示允许列名包含特殊字符。如果要跟以前兼容,可以将该变量配置为true。

symbol=take(`600300, 6) join take(`600600, 5)
time=2019.02.27T09:45:01.000000000+[146, 278, 412, 445, 496, 789, 212, 556, 598, 712, 989]
price=173.27 173.26 173.24 173.25 173.26 173.27 161.51 161.50 161.49 161.50 161.51
quotes=table(symbol, time, price)
weights=dict(`600300`600600, 0.6 0.4)
ETF = select symbol, time, price*weights[symbol] as price from quotes;

select last(price) from ETF pivot by time, symbol;

time

C600300

C600600

2019.02.27T09:45:01.000000146

103.962

2019.02.27T09:45:01.000000212

64.604

2019.02.27T09:45:01.000000278

103.956

2019.02.27T09:45:01.000000412

103.944

2019.02.27T09:45:01.000000445

103.95

2019.02.27T09:45:01.000000496

103.956

2019.02.27T09:45:01.000000556

64.6

2019.02.27T09:45:01.000000598

64.596

2019.02.27T09:45:01.000000712

64.6

2019.02.27T09:45:01.000000789

103.962

2019.02.27T09:45:01.000000989

64.604

上例中的股票代码"600300"与"600600"在成为列名后,系统自动将其调整为"C600300"与"C600600"。

例4:行维度为多列(即pivot by之后多于两列)

date = 2020.09.21 + 0 0 0 0 1 1 1 1
sym = `MS`MS`GS`GS`MS`MS`GS`GS$SYMBOL
factorNum = 1 2 1 2 1 2 1 2
factorValue = 1.2 -3.4 -2.5 6.3 1.1 -3.2 -2.1 5.6
t = table(date, sym, factorNum, factorValue);

t;

date

sym

factorNum

factorValue

2020.09.21

MS

1

1.2

2020.09.21

MS

2

-3.4

2020.09.21

GS

1

-2.5

2020.09.21

GS

2

6.3

2020.09.22

MS

1

1.1

2020.09.22

MS

2

-3.2

2020.09.22

GS

1

-2.1

2020.09.22

GS

2

5.6

select factorValue from t pivot by date, sym, factorNum;

date

sym

C1

C2

2020.09.21

GS

-2.5

6.3

2020.09.21

MS

1.2

-3.4

2020.09.22

GS

-2.1

5.6

2020.09.22

MS

1.1

-3.2

例5:select子句包含多个指标

select factorValue, factorValue>0 as factorSign from t pivot by date, sym, factorNum;

date

sym

factorValue_1

factorValue_2

factorSign_1

factorSign_2

2020.09.21

GS

-2.5

6.3

0

1

2020.09.21

MS

1.2

-3.4

1

0

2020.09.22

GS

-2.1

5.6

0

1

2020.09.22

MS

1.1

-3.2

1

0

例6:列名包含特殊符号

date = take(2021.08.01  2021.08.02 2021.08.03, 12)
sym = take(["IBM N", "_MSFTN", "3_GOOGS", ""], 12).sort()
value = 1..12
t=table(date, sym, value)
re = select value from t pivot by date, sym

date

NULL

3_GOOGS

IBM N

_MSFTN

2021.08.01

1

4

7

10

2021.08.02

2

5

8

11

2021.08.03

3

6

9

12

select  _"NULL" from re

NULL

1

2

3

select  _"3_GOOGS" from re

3_GOOGS

4

5

6