group by

关键字会自动加入到结果集中,用户可以不在select语句中指定该列。生成的表的顺序为:group by中select未指定的字段排列在前,select指定的字段排列在后。

例子

$ sym = `C`MS`MS`MS`IBM`IBM`C`C`C$SYMBOL
$ price= 49.6 29.46 29.52 30.02 174.97 175.23 50.76 50.32 51.29
$ qty = 2200 1900 2100 3200 6800 5400 1300 2500 8800
$ timestamp = [09:34:07,09:36:42,09:36:51,09:36:59,09:32:47,09:35:26,09:34:16,09:34:26,09:38:12]
$ t1 = table(timestamp, sym, qty, price);

$ t1;

timestamp

sym

qty

price

09:34:07

C

2200

49.6

09:36:42

MS

1900

29.46

09:36:51

MS

2100

29.52

09:36:59

MS

3200

30.02

09:32:47

IBM

6800

174.97

09:35:26

IBM

5400

175.2

09:34:16

C

1300

50.76

09:34:26

C

2500

50.32

09:38:12

C

8800

51.29

select count(sym) as counts from t1 group by sym;

sym

counts

C

4

MS

3

IBM

2

select max(price), sym from t1 group by sym, minute(timestamp)

minute_timestamp

max_price

sym

09:34m

50.76

C

09:36m

30.02

MS

09:32m

174.97

IBM

09:35m

175.23

IBM

09:38m

51.29

C

select avg(qty) from t1 group by sym;

sym

avg_qty

C

3700

MS

2400

IBM

6100

select wavg(price, qty) as vwap, sum(qty) from t1 group by sym;

sym

vwap

sum_qty

C

50.828378

14800

IBM

175.085082

12200

MS

29.726389

7200

select wsum(price, qty) as dollarVolume, sum(qty) from t1 group by minute(timestamp) as ts;

ts

dollarVolume

sum_qty

09:32m

1.189796e+006

6800

09:34m

300908

6000

09:35m

946242

5400

09:36m

214030

7200

09:38m

451352

8800

select sum(qty) from t1 group by sym, timestamp.minute() as minute;

sym

minute

sum_qty

C

09:34m

6000

C

09:38m

8800

IBM

09:32m

6800

IBM

09:35m

5400

MS

09:36m

7200

group by和order by配合使用。order by的字段必须是group by结果表中的字段。

select sum(qty) from t1 group by sym, timestamp.minute() as minute order by minute;

sym

minute

sum_qty

IBM

09:32m

6800

C

09:34m

6000

IBM

09:35m

5400

MS

09:36m

7200

C

09:38m

8800

在上述例子中,每个分组的函数结果都是标量。在其他情况下,函数可能输出向量或者字典,而不是输出标量。比如,stat 函数输出字典; ols 输出一个系数向量或一个含有参数估计如t-stat, R²的字典等。为了在多列中输出结果,需要将向量或字典输出转换成多个标量值。换而言之,需要将一个合成列转换成多个列。我们可以通过as关键字和一个代表列名的常量字符串向量来完成这项工作。

$ y=1..15
$ factor1=3.2 1.2 5.9 6.9 11.1 9.6 1.4 7.3 2.0 0.1 6.1 2.9 6.3 8.4 5.6
$ factor2=1.7 1.3 4.2 6.8 9.2 1.3 1.4 7.8 7.9 9.9 9.3 4.6 7.8 2.4 8.7
$ id=take(1 2 3, 15).sort();
$ t=table(id, y, factor1, factor2);

$ t;

id

y

factor1

factor2

1

1

3.2

1.7

1

2

1.2

1.3

1

3

5.9

4.2

1

4

6.9

6.8

1

5

11.1

9.2

2

6

9.6

1.3

2

7

1.4

1.4

2

8

7.3

7.8

2

9

2

7.9

2

10

0.1

9.9

3

11

6.1

9.3

3

12

2.9

4.6

3

13

6.3

7.8

3

14

8.4

2.4

3

15

5.6

8.7

select ols(y,(factor1,factor2),true,0) as `int`factor1`factor2 from t group by id;

id

int

factor1

factor2

1

1.063991

-0.258685

0.732795

2

6.886877

-0.148325

0.303584

3

11.833867

0.272352

-0.065526

select ols(y,(factor1,factor2),true,2).Coefficient.tstat[1:] as `t1`t2 from t group by id;

id

t1

t2

1

-0.891868

2.253451

2

-5.73315

11.433117

3

0.510866

-0.183903

若想要忽略函数的一些输出元素,可将列名留空。

select ols(y,(factor1,factor2),true,2).Coefficient.beta as ``factor1`factor2 from t group by id;

id

factor1

factor2

1

-0.258685

0.732795

2

-0.148325

0.303584

3

0.272352

-0.065526

若想要自定义输出格式,可以写一个简单的包装函数。下面例子中,输出包含了系数估计和R²:

$ def myols(y,x) {
$   r=ols(y,x,true,2)
$   return r.Coefficient.beta join r.RegressionStat.statistics[0]
$ }

$ select myols(y,(factor1,factor2)) as `int`factor1`factor2`R2 from t group by id;

id

int

factor1

factor2

R2

1

1.063991

-0.258685

0.732795

0.946056

2

6.886877

-0.148325

0.303584

0.992413

3

11.833867

0.272352

-0.065526

0.144837