process
Introduction
Provides common factor-processing operations, such as outlier removal and neutralization.
standardize
jaqs_fxdayu.research.signaldigger.process.standardize(factor_df, index_member=None)
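Brief description:
- Cross-sectional z-score standardization
Parameters:
| Field | Required | Type | Description |
|---|---|---|---|
| factor_df | Yes | pandas.DataFrame | Two-dimensional factor table indexed by date, with securities as columns |
| index_member | No | pandas.DataFrame of bool | Index membership. A two-dimensional boolean table indexed by date, with securities as columns; True means the security is an index constituent on that date. If passed, only the securities that are index constituents in each cross-section are included in the standardization sample. Defaults to None |
Returns:
The standardized factor
Example: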
import warnings
warnings.filterwarnings('ignore')
from jaqs_fxdayu.data import DataView
from jaqs_fxdayu.research.signaldigger.process import standardize
# Load the dataview dataset
dv = DataView()
dataview_folder = './data'
dv.load_dataview(dataview_folder)
# Cross-sectional z-score standardization
standardize(factor_df=dv.get_ts("pe"), index_member=dv.get_ts("index_member")).head()
Dataview loaded successfully.
| symbol | 000001.SZ | 000002.SZ | 000008.SZ | 000009.SZ | 000027.SZ | 000039.SZ | 000060.SZ | 000061.SZ | 000063.SZ | 000069.SZ | ... | 601988.SH | 601989.SH | 601992.SH | 601997.SH | 601998.SH | 603000.SH | 603160.SH | 603858.SH | 603885.SH | 603993.SH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| trade_date | |||||||||||||||||||||
| 20170502 | -0.363380 | -0.340032 | -0.106714 | 0.152518 | -0.266414 | 0.216918 | 0.086421 | 0.857408 | -0.411592 | -0.343106 | ... | -0.366394 | 0.891601 | NaN | NaN | -0.361782 | 0.677455 | NaN | NaN | -0.248940 | 0.131240 |
| 20170503 | -0.364271 | -0.341856 | -0.107757 | 0.151190 | -0.268283 | 0.219121 | 0.083804 | 0.852450 | -0.412694 | -0.344699 | ... | -0.367529 | 0.879934 | NaN | NaN | -0.363002 | 0.697502 | NaN | NaN | -0.248411 | 0.128307 |
| 20170504 | -0.364991 | -0.340861 | -0.107070 | 0.154148 | -0.267100 | 0.213994 | 0.078180 | 0.849831 | -0.412865 | -0.344161 | ... | -0.367343 | 0.871015 | NaN | NaN | -0.363119 | 0.674523 | NaN | NaN | -0.248024 | 0.118993 |
| 20170505 | -0.364277 | -0.339788 | -0.116436 | 0.142003 | -0.266276 | 0.199128 | 0.080549 | 0.857999 | -0.412033 | -0.343666 | ... | -0.365914 | 0.858166 | NaN | NaN | -0.362034 | 0.659895 | NaN | NaN | -0.243558 | 0.114178 |
| 20170508 | -0.360932 | -0.337663 | -0.121213 | 0.133428 | -0.265375 | 0.197282 | 0.087274 | 0.871560 | -0.408468 | -0.340375 | ... | -0.361849 | 0.824399 | NaN | NaN | -0.358094 | 0.662941 | NaN | NaN | -0.242522 | 0.121454 |
5 rows × 330 columns
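To make "cross-sectional z-score" concrete, here is a minimal sketch in plain pandas. It is not the library's implementation: the helper name `zscore_by_row`, the toy data, and the use of the sample standard deviation (pandas' default `ddof=1`) are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def zscore_by_row(factor_df, index_member=None):
    """Hypothetical helper: z-score each date (row) across securities (columns)."""
    if index_member is not None:
        # Securities outside the index become NaN and drop out of the sample.
        factor_df = factor_df.where(index_member)
    mean = factor_df.mean(axis=1)
    std = factor_df.std(axis=1)  # sample std (ddof=1); an assumption of this sketch
    return factor_df.sub(mean, axis=0).div(std, axis=0)

# Toy factor table: dates as index, securities as columns.
toy = pd.DataFrame(
    {"A": [1.0, 2.0], "B": [2.0, 4.0], "C": [3.0, 9.0]},
    index=[20170502, 20170503],
)
# Toy membership mask: C is not an index constituent on the second date.
member = pd.DataFrame(
    {"A": [True, True], "B": [True, True], "C": [True, False]},
    index=toy.index,
)

z = zscore_by_row(toy)                      # first row -> [-1.0, 0.0, 1.0]
z_members = zscore_by_row(toy, member)      # C is NaN on the second date
print(z)
print(z_members)
```

Each row of the result has mean 0 and standard deviation 1 over the securities that took part in the calculation, which is why excluded or missing securities show up as NaN in the library's sample output above.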
winsorize
jaqs_fxdayu.research.signaldigger.process.winsorize(factor_df, alpha=0.05, index_member=None)
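Brief description:
- Cross-sectional outlier removal
Parameters:
| Field | Required | Type | Description |
|---|---|---|---|
| factor_df | Yes | pandas.DataFrame | Two-dimensional factor table indexed by date, with securities as columns |
| alpha | No | float | Boundary for outlier removal. For example, 0.05 removes the extreme values in the 2.5% tail on each side, keeping the central 95% of the distribution. Defaults to 0.05 |
| index_member | No | pandas.DataFrame of bool | Index membership. A two-dimensional boolean table indexed by date, with securities as columns; True means the security is an index constituent on that date. If passed, only the securities that are index constituents in each cross-section are included in the outlier-removal sample. Defaults to None |
Returns:
The factor after outlier removal
Example: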
from jaqs_fxdayu.research.signaldigger.process import winsorize
winsorize(factor_df=dv.get_ts("pe"),
          alpha=0.05,
          index_member=dv.get_ts("index_member")).head()
| symbol | 000001.SZ | 000002.SZ | 000008.SZ | 000009.SZ | 000027.SZ | 000039.SZ | 000060.SZ | 000061.SZ | 000063.SZ | 000069.SZ | ... | 601988.SH | 601989.SH | 601992.SH | 601997.SH | 601998.SH | 603000.SH | 603160.SH | 603858.SH | 603885.SH | 603993.SH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| trade_date | |||||||||||||||||||||
| 20170502 | 6.7925 | 10.0821 | 42.9544 | 79.4778 | 20.4542 | 88.5511 | 70.1653 | 178.7903 | 0.0 | 9.6490 | ... | 6.3679 | 183.6078 | NaN | NaN | 7.0177 | 153.4365 | NaN | NaN | 22.9161 | 76.4800 |
| 20170503 | 6.7697 | 9.9035 | 42.6314 | 78.8332 | 20.1893 | 88.3302 | 69.4123 | 176.8719 | 0.0 | 9.5060 | ... | 6.3143 | 180.7143 | NaN | NaN | 6.9472 | 155.2097 | NaN | NaN | 22.9674 | 75.6340 |
| 20170504 | 6.6405 | 9.9876 | 42.4161 | 78.6490 | 20.2187 | 86.9501 | 68.1117 | 175.1454 | 0.0 | 9.5298 | ... | 6.3143 | 178.0838 | NaN | NaN | 6.9002 | 150.8288 | NaN | NaN | 22.8647 | 73.7727 |
| 20170505 | 6.5570 | 9.9193 | 40.5860 | 76.0703 | 20.0127 | 83.9137 | 67.6325 | 174.3781 | 0.0 | 9.3869 | ... | 6.3322 | 174.4011 | NaN | NaN | 6.8649 | 147.1780 | NaN | NaN | 23.1319 | 72.2499 |
| 20170508 | 6.5114 | 9.6988 | 39.3479 | 74.2284 | 19.6007 | 82.9752 | 67.9063 | 175.3372 | 0.0 | 9.3273 | ... | 6.3858 | 168.8771 | NaN | NaN | 6.9002 | 146.7608 | NaN | NaN | 22.7311 | 72.5883 |
5 rows × 330 columns
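The role of `alpha` can be sketched with quantile clipping in plain pandas. This is an illustration only, not the library's code: the helper name `winsorize_by_row` is hypothetical, and the choice to clip values to the `alpha / 2` and `1 - alpha / 2` quantiles (rather than drop them) is an assumption of the sketch.

```python
import numpy as np
import pandas as pd

def winsorize_by_row(factor_df, alpha=0.05):
    """Hypothetical helper: clip each date's values to its own tail quantiles."""
    lower = factor_df.quantile(alpha / 2, axis=1)       # per-date lower bound
    upper = factor_df.quantile(1 - alpha / 2, axis=1)   # per-date upper bound
    # axis=0 aligns the per-date bounds with the date index.
    return factor_df.clip(lower=lower, upper=upper, axis=0)

toy = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0, 100.0]],
    index=[20170502],
    columns=list("ABCDE"),
)
# alpha is exaggerated here so the effect is visible on five values.
clipped = winsorize_by_row(toy, alpha=0.5)
print(clipped)
```

With `alpha=0.5` the bounds are the 25% and 75% quantiles of the row (2 and 4 here), so the outlier 100 is pulled in to 4 and the value 1 is pulled up to 2.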
mad
jaqs_fxdayu.research.signaldigger.process.mad(factor_df, index_member=None)
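Brief description:
- Cross-sectional outlier removal using the MAD (median absolute deviation) method
Parameters:
| Field | Required | Type | Description |
|---|---|---|---|
| factor_df | Yes | pandas.DataFrame | Two-dimensional factor table indexed by date, with securities as columns |
| index_member | No | pandas.DataFrame of bool | Index membership. A two-dimensional boolean table indexed by date, with securities as columns; True means the security is an index constituent on that date. If passed, only the securities that are index constituents in each cross-section are included in the outlier-removal sample. Defaults to None |
Returns:
The factor after outlier removal
Example: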
from jaqs_fxdayu.research.signaldigger.process import mad
mad(factor_df=dv.get_ts("pe"),
    index_member=dv.get_ts("index_member")).head()
| symbol | 000001.SZ | 000002.SZ | 000008.SZ | 000009.SZ | 000027.SZ | 000039.SZ | 000060.SZ | 000061.SZ | 000063.SZ | 000069.SZ | ... | 601988.SH | 601989.SH | 601992.SH | 601997.SH | 601998.SH | 603000.SH | 603160.SH | 603858.SH | 603885.SH | 603993.SH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| trade_date | |||||||||||||||||||||
| 20170502 | 6.7925 | 10.0821 | 42.9544 | 79.4778 | 20.4542 | 88.5511 | 70.1653 | 91.92400 | 0.0 | 9.6490 | ... | 6.3679 | 91.92400 | NaN | NaN | 7.0177 | 91.92400 | NaN | NaN | 22.9161 | 76.4800 |
| 20170503 | 6.7697 | 9.9035 | 42.6314 | 78.8332 | 20.1893 | 88.3302 | 69.4123 | 91.87230 | 0.0 | 9.5060 | ... | 6.3143 | 91.87230 | NaN | NaN | 6.9472 | 91.87230 | NaN | NaN | 22.9674 | 75.6340 |
| 20170504 | 6.6405 | 9.9876 | 42.4161 | 78.6490 | 20.2187 | 86.9501 | 68.1117 | 92.15105 | 0.0 | 9.5298 | ... | 6.3143 | 92.15105 | NaN | NaN | 6.9002 | 92.15105 | NaN | NaN | 22.8647 | 73.7727 |
| 20170505 | 6.5570 | 9.9193 | 40.5860 | 76.0703 | 20.0127 | 83.9137 | 67.6325 | 86.81125 | 0.0 | 9.3869 | ... | 6.3322 | 86.81125 | NaN | NaN | 6.8649 | 86.81125 | NaN | NaN | 23.1319 | 72.2499 |
| 20170508 | 6.5114 | 9.6988 | 39.3479 | 74.2284 | 19.6007 | 82.9752 | 67.9063 | 86.30405 | 0.0 | 9.3273 | ... | 6.3858 | 86.30405 | NaN | NaN | 6.9002 | 86.30405 | NaN | NaN | 22.7311 | 72.5883 |
5 rows × 330 columns
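In the sample output above, large values on the same date all collapse to one common ceiling, which is what a median-based clip produces. A rough sketch of that idea follows. The helper name `mad_clip_by_row` and the multiplier `k` are hypothetical; the threshold the library actually uses is not stated in this document.

```python
import numpy as np
import pandas as pd

def mad_clip_by_row(factor_df, k=3.0):
    """Hypothetical helper: clip each date to median +/- k * MAD."""
    median = factor_df.median(axis=1)
    # MAD: median of absolute deviations from the per-date median.
    mad_value = factor_df.sub(median, axis=0).abs().median(axis=1)
    return factor_df.clip(
        lower=median - k * mad_value,
        upper=median + k * mad_value,
        axis=0,
    )

toy = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0, 100.0]],
    index=[20170502],
    columns=list("ABCDE"),
)
# median = 3, MAD = 1, so with k = 3 the bounds are [0, 6].
clipped = mad_clip_by_row(toy, k=3.0)
print(clipped)
```

Because the median and the MAD are barely affected by the outlier itself, the bounds stay tight even when a few values are extreme, unlike bounds built from the mean and standard deviation.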
rank_standardize
jaqs_fxdayu.research.signaldigger.process.rank_standardize(factor_df, index_member=None)
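Brief description:
- Rank standardization. Converts the factor into its cross-sectional rank (ascending) and scales it to the range 0 to 1. Only the ordering of the original factor is kept; its distribution is discarded
Parameters:
| Field | Required | Type | Description |
|---|---|---|---|
| factor_df | Yes | pandas.DataFrame | Two-dimensional factor table indexed by date, with securities as columns |
| index_member | No | pandas.DataFrame of bool | Index membership. A two-dimensional boolean table indexed by date, with securities as columns; True means the security is an index constituent on that date. If passed, only the securities that are index constituents in each cross-section are included in the rank-standardization sample. Defaults to None |
Returns:
The rank-standardized factor
Example: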
from jaqs_fxdayu.research.signaldigger.process import rank_standardize
rank_standardize(factor_df=dv.get_ts("pe"),
                 index_member=dv.get_ts("index_member")).head()
| symbol | 000001.SZ | 000002.SZ | 000008.SZ | 000009.SZ | 000027.SZ | 000039.SZ | 000060.SZ | 000061.SZ | 000063.SZ | 000069.SZ | ... | 601988.SH | 601989.SH | 601992.SH | 601997.SH | 601998.SH | 603000.SH | 603160.SH | 603858.SH | 603885.SH | 603993.SH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| trade_date | |||||||||||||||||||||
| 20170502 | 0.063545 | 0.117057 | 0.722408 | 0.886288 | 0.361204 | 0.90301 | 0.862876 | 0.966555 | 0.0 | 0.107023 | ... | 0.053512 | 0.969900 | NaN | NaN | 0.070234 | 0.943144 | NaN | NaN | 0.408027 | 0.876254 |
| 20170503 | 0.063545 | 0.113712 | 0.722408 | 0.886288 | 0.354515 | 0.90301 | 0.859532 | 0.966555 | 0.0 | 0.100334 | ... | 0.053512 | 0.969900 | NaN | NaN | 0.066890 | 0.939799 | NaN | NaN | 0.408027 | 0.872910 |
| 20170504 | 0.063545 | 0.113712 | 0.725753 | 0.886288 | 0.357860 | 0.90301 | 0.852843 | 0.963211 | 0.0 | 0.100334 | ... | 0.053512 | 0.969900 | NaN | NaN | 0.066890 | 0.939799 | NaN | NaN | 0.408027 | 0.872910 |
| 20170505 | 0.063545 | 0.113712 | 0.712375 | 0.882943 | 0.351171 | 0.90301 | 0.859532 | 0.963211 | 0.0 | 0.100334 | ... | 0.053512 | 0.966555 | NaN | NaN | 0.070234 | 0.943144 | NaN | NaN | 0.424749 | 0.872910 |
| 20170508 | 0.060201 | 0.103679 | 0.719064 | 0.882943 | 0.331104 | 0.90301 | 0.862876 | 0.969900 | 0.0 | 0.096990 | ... | 0.053512 | 0.963211 | NaN | NaN | 0.070234 | 0.946488 | NaN | NaN | 0.421405 | 0.879599 |
5 rows × 330 columns
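The ranking step described above can be sketched in a few lines of pandas. This is an illustration, not the library's code: the helper name `rank_scale_by_row` is hypothetical, and the scaling `(rank - 1) / (n - 1)`, which maps the smallest value to 0 and the largest to 1, is an assumption (it is at least consistent with the 0.0 entries in the sample output).

```python
import numpy as np
import pandas as pd

def rank_scale_by_row(factor_df):
    """Hypothetical helper: ascending rank per date, scaled to [0, 1]."""
    ranks = factor_df.rank(axis=1)        # NaN stays NaN; ties get the average rank
    n = factor_df.notna().sum(axis=1)     # number of ranked securities per date
    return ranks.sub(1).div(n - 1, axis=0)

toy = pd.DataFrame(
    [[10.0, 30.0, 20.0, np.nan]],
    index=[20170502],
    columns=list("ABCD"),
)
scaled = rank_scale_by_row(toy)  # A -> 0.0, B -> 1.0, C -> 0.5, D -> NaN
print(scaled)
```

Only the order of the inputs matters: replacing 30.0 with 3000.0 in the toy table would give exactly the same result.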
get_disturbed_factor
jaqs_fxdayu.research.signaldigger.process.get_disturbed_factor(factor_df)
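Brief description:
- Adds a tiny perturbation term to the factor values so that tied values can be told apart when grouping by quantile
Parameters:
| Field | Required | Type | Description |
|---|---|---|---|
| factor_df | Yes | pandas.DataFrame | Two-dimensional factor table indexed by date, with securities as columns |
Returns:
The factor with the perturbation term added
Example: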
from jaqs_fxdayu.research.signaldigger.process import get_disturbed_factor
get_disturbed_factor(factor_df = dv.get_ts("pe")).head()
| symbol | 000001.SZ | 000002.SZ | 000008.SZ | 000009.SZ | 000027.SZ | 000039.SZ | 000060.SZ | 000061.SZ | 000063.SZ | 000069.SZ | ... | 601988.SH | 601989.SH | 601992.SH | 601997.SH | 601998.SH | 603000.SH | 603160.SH | 603858.SH | 603885.SH | 603993.SH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| trade_date | |||||||||||||||||||||
| 20170502 | 6.7925 | 10.0821 | 42.9544 | 79.4778 | 20.4542 | 88.5511 | 70.1653 | 178.7903 | 3.437688e-10 | 9.6490 | ... | 6.3679 | 183.6078 | 32.7886 | 9.8565 | 7.0177 | 153.4365 | 50.8349 | 31.0157 | 22.9161 | 76.4800 |
| 20170503 | 6.7697 | 9.9035 | 42.6314 | 78.8332 | 20.1893 | 88.3302 | 69.4123 | 176.8719 | 4.412786e-10 | 9.5060 | ... | 6.3143 | 180.7143 | 30.2450 | 9.8817 | 6.9472 | 155.2097 | 50.7259 | 31.0311 | 22.9674 | 75.6340 |
| 20170504 | 6.6405 | 9.9876 | 42.4161 | 78.6490 | 20.2187 | 86.9501 | 68.1117 | 175.1454 | 4.559244e-10 | 9.5298 | ... | 6.3143 | 178.0838 | 31.4771 | 9.8188 | 6.9002 | 150.8288 | 50.3727 | 30.6805 | 22.8647 | 73.7727 |
| 20170505 | 6.5570 | 9.9193 | 40.5860 | 76.0703 | 20.0127 | 83.9137 | 67.6325 | 174.3781 | 6.587699e-10 | 9.3869 | ... | 6.3322 | 174.4011 | 30.8809 | 9.5609 | 6.8649 | 147.1780 | 49.3963 | 30.2527 | 23.1319 | 72.2499 |
| 20170508 | 6.5114 | 9.6988 | 39.3479 | 74.2284 | 19.6007 | 82.9752 | 67.9063 | 175.3372 | 6.254412e-10 | 9.3273 | ... | 6.3858 | 168.8771 | 27.9399 | 9.3282 | 6.9002 | 146.7608 | 50.3779 | 29.5167 | 22.7311 | 72.5883 |
5 rows × 330 columns
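Why a perturbation helps with quantile grouping can be shown on a toy row with many tied values. The sketch below is an assumption-laden illustration: the helper name `add_tiny_noise`, the noise scale `1e-9`, and the uniform noise distribution are all hypothetical and not taken from the library.

```python
import numpy as np
import pandas as pd

def add_tiny_noise(factor_df, scale=1e-9, seed=0):
    """Hypothetical helper: add small positive uniform noise to break ties."""
    rng = np.random.default_rng(seed)
    noise = rng.random(factor_df.shape) * scale
    return factor_df + noise

# Four tied zeros: quantile bin edges computed on this row would coincide.
toy = pd.DataFrame(
    [[0.0, 0.0, 0.0, 0.0, 5.0, 7.0]],
    index=[20170502],
    columns=list("ABCDEF"),
)
disturbed = add_tiny_noise(toy)

# After the perturbation every value is distinct, so three equal-sized
# quantile buckets can be formed.
buckets = pd.qcut(disturbed.iloc[0], 3, labels=False)
print(buckets)
```

On the untouched toy row, `pd.qcut` with its default `duplicates="raise"` would reject the duplicate bin edges; after the perturbation the values differ by amounts far too small to change any analysis but large enough to order them.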
neutralize
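jaqs_fxdayu.research.signaldigger.process.neutralize(factor_df, group, float_mv=None, index_member=None)
Brief description:
- Neutralizes the factor with respect to industry and market capitalization
Parameters:
| Field | Required | Type | Description |
|---|---|---|---|
| factor_df | Yes | pandas.DataFrame | The factor. Two-dimensional table indexed by date, with securities as columns |
| group | Yes | pandas.DataFrame | Industry classification (any other grouping scheme also works). Two-dimensional table indexed by date, with securities as columns, giving the group each security belongs to in each period |
| float_mv | No | pandas.DataFrame | Float (tradable) market capitalization. Two-dimensional table indexed by date, with securities as columns. Defaults to None, in which case no market-cap neutralization is performed |
| index_member | No | pandas.DataFrame of bool | Index membership. A two-dimensional boolean table indexed by date, with securities as columns; True means the security is an index constituent on that date. If passed, only the securities that are index constituents in each cross-section are included in the neutralization sample. Defaults to None |
Returns:
The factor after industry and market-cap neutralization
Example: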
from jaqs_fxdayu.research.signaldigger.process import neutralize
neutralize(factor_df=dv.get_ts("pe"),
           group=dv.get_ts("sw1")).head()
| symbol | 000001.SZ | 000002.SZ | 000008.SZ | 000009.SZ | 000027.SZ | 000039.SZ | 000060.SZ | 000061.SZ | 000063.SZ | 000069.SZ | ... | 601988.SH | 601989.SH | 601992.SH | 601997.SH | 601998.SH | 603000.SH | 603160.SH | 603858.SH | 603885.SH | 603993.SH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| trade_date | |||||||||||||||||||||
| 20170502 | -2.662629 | -7.782230 | -27.98201 | -38.33350 | -4.083109 | 17.61469 | -83.013425 | 107.385838 | -168.217857 | -8.215330 | ... | -3.087229 | 26.912500 | 9.87725 | 0.401371 | -2.437429 | 108.584833 | 3.357346 | -55.150405 | -6.266428 | -76.698725 |
| 20170503 | -2.682662 | -7.829960 | -28.76077 | -39.50720 | -3.544909 | 16.93803 | -83.589442 | 105.819463 | -168.313357 | -8.227460 | ... | -3.138062 | 24.588600 | 8.60545 | 0.429338 | -2.505162 | 110.440158 | 3.330400 | -55.949523 | -6.084489 | -77.367742 |
| 20170504 | -2.815043 | -7.733890 | -28.67189 | -38.74790 | -4.016945 | 15.86211 | -82.429800 | 104.271487 | -168.140586 | -8.191690 | ... | -3.141243 | 26.910367 | 9.39235 | 0.363257 | -2.555343 | 106.489883 | 3.025662 | -55.572859 | -6.116911 | -76.768800 |
| 20170505 | -2.762233 | -7.653145 | -28.89854 | -37.39795 | -3.835882 | 14.42916 | -82.012883 | 103.912075 | -167.959957 | -8.185545 | ... | -2.987033 | 27.853900 | 9.18745 | 0.241667 | -2.454333 | 103.302950 | 2.387592 | -55.251945 | -5.618411 | -77.395483 |
| 20170508 | -2.564538 | -7.591140 | -29.02696 | -36.10530 | -3.576855 | 14.60034 | -80.613975 | 104.639975 | -167.807429 | -7.962640 | ... | -2.690138 | 30.356133 | 7.68590 | 0.252262 | -2.175738 | 102.756587 | 4.286408 | -54.397259 | -5.803628 | -75.931975 |
5 rows × 330 columns
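Neutralization is commonly done by regressing the factor on group dummies (and optionally a size variable) within each cross-section and keeping the residuals. The sketch below shows that idea for a single date. It is not the library's implementation: the helper name `neutralize_one_date` is hypothetical, using the logarithm of `float_mv` as the size regressor is an assumption, and NaN handling is left out for brevity.

```python
import numpy as np
import pandas as pd

def neutralize_one_date(factor, group, float_mv=None):
    """Hypothetical helper: OLS residuals for one cross-section.

    factor, group and float_mv are pandas Series sharing the same index.
    """
    X = pd.get_dummies(group).astype(float)   # one dummy column per group
    if float_mv is not None:
        X["log_mv"] = np.log(float_mv)        # assumption: log market cap
    beta, *_ = np.linalg.lstsq(X.values, factor.values, rcond=None)
    return factor - X.values @ beta           # residual = neutralized factor

symbols = ["A", "B", "C", "D"]
factor = pd.Series([1.0, 3.0, 10.0, 20.0], index=symbols)
group = pd.Series(["bank", "bank", "tech", "tech"], index=symbols)
mv = pd.Series([1.0, 2.0, 3.0, 4.0], index=symbols)

resid = neutralize_one_date(factor, group)         # -> [-1.0, 1.0, -5.0, 5.0]
resid_mv = neutralize_one_date(factor, group, mv)  # also orthogonal to log(mv)
print(resid)
print(resid_mv)
```

With group dummies as the only regressors, the residual is simply each value minus its group mean, so the neutralized factor averages to zero inside every group. Adding the size regressor additionally removes any linear relationship with log market cap.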