Commit 86f08e4

chenditc and you-n-g authored
Qlib data doc (microsoft#1207)
* Explain data crawler structure
* Add documentation for data and feature
* Update scripts/data_collector/yahoo/README.md
  Co-authored-by: you-n-g <[email protected]>
* Remove some confusing wording
* Add third party data source
* Fix command typo
* Update commands
  Co-authored-by: you-n-g <[email protected]>
1 parent 8199822 commit 86f08e4

5 files changed: +97 -2 lines changed

docs/component/data.rst

+1 -1
@@ -51,7 +51,7 @@ Also, ``Qlib`` provides a high-frequency dataset. Users can run a high-frequency
 
 Qlib Format Dataset
 -------------------
-``Qlib`` has provided an off-the-shelf dataset in `.bin` format, users could use the script ``scripts/get_data.py`` to download the China-Stock dataset as follows.
+``Qlib`` has provided an off-the-shelf dataset in `.bin` format, users could use the script ``scripts/get_data.py`` to download the China-Stock dataset as follows. Users can also use numpy to load the `.bin` files to validate the data.
 The price volume data look different from the actual dealing price because they are **adjusted** (`adjusted price <https://www.investopedia.com/terms/a/adjusted_closing_price.asp>`_). And then you may find that the adjusted price may be different from different data sources. This is because different data sources may vary in the way of adjusting prices. Qlib normalizes the price on the first trading day of each stock to 1 when adjusting them.
 Users can leverage `$factor` to get the original trading price (e.g. `$close / $factor` to get the original close price).
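The added sentence mentions validating a `.bin` file by loading it. Below is a minimal sketch of what such a check might look like. It assumes (this is not stated in the diff) that a feature file is a flat array of little-endian float32 values whose first element is the start index into the trading calendar. It uses only the standard library so that it runs anywhere; with numpy the read would be `np.fromfile(path, dtype="<f")`. The file names and sample values are made up for illustration.

```python
import struct
import tempfile
from pathlib import Path


def write_bin(path, start_index, values):
    """Write a hypothetical feature file: start index, then the values, as float32."""
    payload = [float(start_index), *values]
    path.write_bytes(struct.pack("<%df" % len(payload), *payload))


def read_bin(path):
    """Return (start_index, values) from a file written in the layout above."""
    raw = path.read_bytes()
    floats = struct.unpack("<%df" % (len(raw) // 4), raw)
    return int(floats[0]), list(floats[1:])


with tempfile.TemporaryDirectory() as tmp:
    close_path = Path(tmp) / "close.day.bin"    # illustrative file name
    factor_path = Path(tmp) / "factor.day.bin"  # illustrative file name
    write_bin(close_path, 3, [1.0, 1.5, 2.0])    # adjusted close (made up)
    write_bin(factor_path, 3, [0.5, 0.5, 0.25])  # adjustment factor (made up)

    start, adj_close = read_bin(close_path)
    _, factor = read_bin(factor_path)

# Original price = adjusted price / factor, as described for `$close / $factor`.
original_close = [c / f for c, f in zip(adj_close, factor)]
print(start, original_close)
```

Comparing `original_close` against quotes from another source is one way to sanity-check a downloaded dataset.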

qlib/contrib/data/handler.py

+48
@@ -259,113 +259,161 @@ def parse_config_to_fields(config):
             def use(x):
                 return x not in exclude and (include is None or x in include)
 
+            # Some factor ref: https://guorn.com/static/upload/file/3/134065454575605.pdf
             if use("ROC"):
+                # https://www.investopedia.com/terms/r/rateofchange.asp
+                # Rate of change, the price change in the past d days, divided by latest close price to remove unit
                 fields += ["Ref($close, %d)/$close" % d for d in windows]
                 names += ["ROC%d" % d for d in windows]
             if use("MA"):
+                # https://www.investopedia.com/ask/answers/071414/whats-difference-between-moving-average-and-weighted-moving-average.asp
+                # Simple Moving Average, the simple moving average in the past d days, divided by latest close price to remove unit
                 fields += ["Mean($close, %d)/$close" % d for d in windows]
                 names += ["MA%d" % d for d in windows]
             if use("STD"):
+                # The standard deviation of close price for the past d days, divided by latest close price to remove unit
                 fields += ["Std($close, %d)/$close" % d for d in windows]
                 names += ["STD%d" % d for d in windows]
             if use("BETA"):
+                # The rate of close price change in the past d days, divided by latest close price to remove unit
+                # For example, if the price increases by 10 dollars per day over the past d days, then Slope will be 10.
                 fields += ["Slope($close, %d)/$close" % d for d in windows]
                 names += ["BETA%d" % d for d in windows]
             if use("RSQR"):
+                # The R-square value of linear regression for the past d days, represents the trend linearity for the past d days.
                 fields += ["Rsquare($close, %d)" % d for d in windows]
                 names += ["RSQR%d" % d for d in windows]
             if use("RESI"):
+                # The residual of linear regression for the past d days, represents the trend linearity for the past d days.
                 fields += ["Resi($close, %d)/$close" % d for d in windows]
                 names += ["RESI%d" % d for d in windows]
             if use("MAX"):
+                # The max price for past d days, divided by latest close price to remove unit
                 fields += ["Max($high, %d)/$close" % d for d in windows]
                 names += ["MAX%d" % d for d in windows]
             if use("LOW"):
+                # The low price for past d days, divided by latest close price to remove unit
                 fields += ["Min($low, %d)/$close" % d for d in windows]
                 names += ["MIN%d" % d for d in windows]
             if use("QTLU"):
+                # The 80% quantile of past d days' close price, divided by latest close price to remove unit
+                # Used with MIN and MAX
                 fields += ["Quantile($close, %d, 0.8)/$close" % d for d in windows]
                 names += ["QTLU%d" % d for d in windows]
             if use("QTLD"):
+                # The 20% quantile of past d days' close price, divided by latest close price to remove unit
                 fields += ["Quantile($close, %d, 0.2)/$close" % d for d in windows]
                 names += ["QTLD%d" % d for d in windows]
             if use("RANK"):
+                # Get the percentile of current close price in past d days' close price.
+                # Represents the current price level compared to the past d days, adds additional information to moving average.
                 fields += ["Rank($close, %d)" % d for d in windows]
                 names += ["RANK%d" % d for d in windows]
             if use("RSV"):
+                # Represents the price position between upper and lower resistance price for past d days.
                 fields += ["($close-Min($low, %d))/(Max($high, %d)-Min($low, %d)+1e-12)" % (d, d, d) for d in windows]
                 names += ["RSV%d" % d for d in windows]
             if use("IMAX"):
+                # The number of days between current date and previous highest price date.
+                # Part of Aroon Indicator https://www.investopedia.com/terms/a/aroon.asp
+                # The indicator measures the time between highs and the time between lows over a time period.
+                # The idea is that strong uptrends will regularly see new highs, and strong downtrends will regularly see new lows.
                 fields += ["IdxMax($high, %d)/%d" % (d, d) for d in windows]
                 names += ["IMAX%d" % d for d in windows]
             if use("IMIN"):
+                # The number of days between current date and previous lowest price date.
+                # Part of Aroon Indicator https://www.investopedia.com/terms/a/aroon.asp
+                # The indicator measures the time between highs and the time between lows over a time period.
+                # The idea is that strong uptrends will regularly see new highs, and strong downtrends will regularly see new lows.
                 fields += ["IdxMin($low, %d)/%d" % (d, d) for d in windows]
                 names += ["IMIN%d" % d for d in windows]
             if use("IMXD"):
+                # The number of days by which the previous lowest-price date occurs after the previous highest-price date.
+                # A large value suggests downward momentum.
                 fields += ["(IdxMax($high, %d)-IdxMin($low, %d))/%d" % (d, d, d) for d in windows]
                 names += ["IMXD%d" % d for d in windows]
             if use("CORR"):
+                # The correlation between absolute close price and log scaled trading volume
                 fields += ["Corr($close, Log($volume+1), %d)" % d for d in windows]
                 names += ["CORR%d" % d for d in windows]
             if use("CORD"):
+                # The correlation between price change ratio and volume change ratio
                 fields += ["Corr($close/Ref($close,1), Log($volume/Ref($volume, 1)+1), %d)" % d for d in windows]
                 names += ["CORD%d" % d for d in windows]
             if use("CNTP"):
+                # The percentage of days in the past d days on which the price went up.
                 fields += ["Mean($close>Ref($close, 1), %d)" % d for d in windows]
                 names += ["CNTP%d" % d for d in windows]
             if use("CNTN"):
+                # The percentage of days in the past d days on which the price went down.
                 fields += ["Mean($close<Ref($close, 1), %d)" % d for d in windows]
                 names += ["CNTN%d" % d for d in windows]
             if use("CNTD"):
+                # The difference between the percentage of up days and the percentage of down days in the past d days
                 fields += ["Mean($close>Ref($close, 1), %d)-Mean($close<Ref($close, 1), %d)" % (d, d) for d in windows]
                 names += ["CNTD%d" % d for d in windows]
             if use("SUMP"):
+                # The total gain / the absolute total price changed
+                # Similar to RSI indicator. https://www.investopedia.com/terms/r/rsi.asp
                 fields += [
                     "Sum(Greater($close-Ref($close, 1), 0), %d)/(Sum(Abs($close-Ref($close, 1)), %d)+1e-12)" % (d, d)
                     for d in windows
                 ]
                 names += ["SUMP%d" % d for d in windows]
             if use("SUMN"):
+                # The total loss / the absolute total price changed
+                # Can be derived from SUMP by SUMN = 1 - SUMP
+                # Similar to RSI indicator. https://www.investopedia.com/terms/r/rsi.asp
                 fields += [
                     "Sum(Greater(Ref($close, 1)-$close, 0), %d)/(Sum(Abs($close-Ref($close, 1)), %d)+1e-12)" % (d, d)
                     for d in windows
                 ]
                 names += ["SUMN%d" % d for d in windows]
             if use("SUMD"):
+                # The diff ratio between total gain and total loss
+                # Similar to RSI indicator. https://www.investopedia.com/terms/r/rsi.asp
                 fields += [
                     "(Sum(Greater($close-Ref($close, 1), 0), %d)-Sum(Greater(Ref($close, 1)-$close, 0), %d))"
                     "/(Sum(Abs($close-Ref($close, 1)), %d)+1e-12)" % (d, d, d)
                     for d in windows
                 ]
                 names += ["SUMD%d" % d for d in windows]
             if use("VMA"):
+                # Simple Volume Moving average: https://www.barchart.com/education/technical-indicators/volume_moving_average
                 fields += ["Mean($volume, %d)/($volume+1e-12)" % d for d in windows]
                 names += ["VMA%d" % d for d in windows]
             if use("VSTD"):
+                # The standard deviation for volume in past d days.
                 fields += ["Std($volume, %d)/($volume+1e-12)" % d for d in windows]
                 names += ["VSTD%d" % d for d in windows]
             if use("WVMA"):
+                # The volume weighted price change volatility
                 fields += [
                     "Std(Abs($close/Ref($close, 1)-1)*$volume, %d)/(Mean(Abs($close/Ref($close, 1)-1)*$volume, %d)+1e-12)"
                     % (d, d)
                     for d in windows
                 ]
                 names += ["WVMA%d" % d for d in windows]
             if use("VSUMP"):
+                # The total volume increase / the absolute total volume changed
                 fields += [
                     "Sum(Greater($volume-Ref($volume, 1), 0), %d)/(Sum(Abs($volume-Ref($volume, 1)), %d)+1e-12)"
                     % (d, d)
                     for d in windows
                 ]
                 names += ["VSUMP%d" % d for d in windows]
             if use("VSUMN"):
+                # The total volume decrease / the absolute total volume changed
+                # Can be derived from VSUMP by VSUMN = 1 - VSUMP
                 fields += [
                     "Sum(Greater(Ref($volume, 1)-$volume, 0), %d)/(Sum(Abs($volume-Ref($volume, 1)), %d)+1e-12)"
                     % (d, d)
                     for d in windows
                 ]
                 names += ["VSUMN%d" % d for d in windows]
             if use("VSUMD"):
+                # The diff ratio between total volume increase and total volume decrease
+                # RSI indicator for volume
                 fields += [
                     "(Sum(Greater($volume-Ref($volume, 1), 0), %d)-Sum(Greater(Ref($volume, 1)-$volume, 0), %d))"
                     "/(Sum(Abs($volume-Ref($volume, 1)), %d)+1e-12)" % (d, d, d)

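To make the new comments concrete, here is a small sketch in plain Python. It does not use Qlib's expression engine: `expand` and `sump_sumn_sumd` are hypothetical helpers written for this illustration, and the price series is made up. The first helper mirrors how one template yields one field per window; the second evaluates the SUMP, SUMN and SUMD formulas by hand so the relation `SUMN = 1 - SUMP` noted in the comments can be checked.

```python
def expand(template, windows):
    """Expand one expression template per window, as the handler does with `%d`."""
    return [template % d for d in windows]


def sump_sumn_sumd(close, d):
    """Compute SUMP, SUMN and SUMD over the last d daily changes of `close`."""
    diffs = [b - a for a, b in zip(close[:-1], close[1:])][-d:]
    gain = sum(max(x, 0.0) for x in diffs)      # Sum(Greater($close-Ref($close, 1), 0), d)
    loss = sum(max(-x, 0.0) for x in diffs)     # Sum(Greater(Ref($close, 1)-$close, 0), d)
    total = sum(abs(x) for x in diffs) + 1e-12  # Sum(Abs($close-Ref($close, 1)), d)+1e-12
    return gain / total, loss / total, (gain - loss) / total


windows = [5, 10]
fields = expand("Ref($close, %d)/$close", windows)
names = expand("ROC%d", windows)

close = [10.0, 11.0, 10.5, 12.0, 11.0, 13.0]  # made-up prices
sump, sumn, sumd = sump_sumn_sumd(close, 5)
print(fields, names)
print(round(sump, 4), round(sumn, 4), round(sumd, 4))
```

The identity only holds when the price actually moved within the window; for a flat series both ratios are 0 because of the `1e-12` term in the denominator.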
scripts/README.md

+7
@@ -67,3 +67,10 @@ from qlib.constant import REG_CN
 provider_uri = "~/.qlib/qlib_data/cn_data"  # target_dir
 qlib.init(provider_uri=provider_uri, region=REG_CN)
 ```
+
+## Use Crowd Sourced Data
+There is also a [crowd sourced version of qlib data](data_collector/crowd_source/README.md): https://github.com/chenditc/investment_data/releases
+```bash
+wget https://github.com/chenditc/investment_data/releases/download/20220720/qlib_bin.tar.gz
+tar -zxvf qlib_bin.tar.gz -C ~/.qlib/qlib_data/cn_data --strip-components=2
+```
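For readers who prefer to script the extraction, here is a rough Python counterpart of the `tar ... --strip-components=2` call above, using only the standard library. It is a sketch: `extract_stripped` is a hypothetical helper, and the archive layout built in the demo (`qlib_bin/cn_data/...`) is an assumption, since the diff does not show how the release archive is organised.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def extract_stripped(archive, target_dir, strip=2):
    """Extract regular files, dropping the first `strip` path components."""
    target_dir = Path(target_dir)
    with tarfile.open(archive, "r:gz") as tar:
        for member in tar.getmembers():
            parts = Path(member.name).parts[strip:]
            if not member.isfile() or not parts or ".." in parts:
                continue  # skip directories, too-shallow paths and unsafe names
            dest = target_dir.joinpath(*parts)
            dest.parent.mkdir(parents=True, exist_ok=True)
            dest.write_bytes(tar.extractfile(member).read())


with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny stand-in archive; the real one comes from the release page.
    archive = Path(tmp) / "qlib_bin.tar.gz"
    data = b"2008-01-02\n"
    with tarfile.open(archive, "w:gz") as tar:
        info = tarfile.TarInfo("qlib_bin/cn_data/calendars/day.txt")  # hypothetical layout
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

    target = Path(tmp) / "cn_data"
    extract_stripped(archive, target)
    extracted = sorted(p.relative_to(target).as_posix() for p in target.rglob("*") if p.is_file())

print(extracted)
```

With `strip=2`, the two leading directories of each member are dropped, so files land directly under the target directory that `provider_uri` points to.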
scripts/data_collector/crowd_source/README.md
@@ -0,0 +1,32 @@
+# Crowd Source Data
+
+## Initiative
+Public data sources such as Yahoo are flawed: they might miss data for delisted stocks, and they might contain wrong data. This can introduce survivorship bias into our training process.
+
+The crowd sourced data is introduced to merge data from multiple data sources and cross-validate them against each other, so that:
+1. We will have a more complete history record.
+2. We can identify anomalous data and apply corrections when necessary.
+
+## Related Repo
+The raw data is hosted on dolthub repo: https://www.dolthub.com/repositories/chenditc/investment_data
+
+The processing script and sql are hosted on github repo: https://github.com/chenditc/investment_data
+
+The packaged docker runtime is hosted on dockerhub: https://hub.docker.com/repository/docker/chenditc/investment_data
+
+## How to use it in qlib
+### Option 1: Download release bin data
+Users can download data in qlib bin format and use it directly: https://github.com/chenditc/investment_data/releases/tag/20220720
+```bash
+wget https://github.com/chenditc/investment_data/releases/download/20220720/qlib_bin.tar.gz
+tar -zxvf qlib_bin.tar.gz -C ~/.qlib/qlib_data/cn_data --strip-components=2
+```
+
+### Option 2: Generate qlib data from dolthub
+Dolthub data is updated daily, so users who want up-to-date data can dump the qlib bin using docker:
+```
+docker run -v /<some output directory>:/output -it --rm chenditc/investment_data bash dump_qlib_bin.sh && cp ./qlib_bin.tar.gz /output/
+```
+
+## FAQ and other info
+See: https://github.com/chenditc/investment_data/blob/main/README.md
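The cross-validation idea from the Initiative section can be pictured with a short sketch. This is not the project's actual pipeline, which lives in the linked repositories. It is one possible approach, assumed here for illustration: take the median quote per date across sources and flag any source that deviates from it by more than a tolerance. The function name, the tolerance and all numbers are hypothetical.

```python
from statistics import median


def cross_validate(sources, tolerance=0.01):
    """Merge per-date prices from several sources and flag disagreements.

    `sources` maps a source name to {date: close}. For each date the merged
    value is the median across sources; a source is flagged when it deviates
    from that median by more than `tolerance` (relative).
    """
    merged, anomalies = {}, []
    for date in sorted({d for prices in sources.values() for d in prices}):
        quotes = {name: prices[date] for name, prices in sources.items() if date in prices}
        mid = median(quotes.values())
        merged[date] = mid
        anomalies += [(date, name) for name, p in quotes.items() if abs(p - mid) > tolerance * abs(mid)]
    return merged, anomalies


sources = {  # made-up numbers for illustration
    "source_a": {"2022-07-18": 10.0, "2022-07-19": 10.5},
    "source_b": {"2022-07-18": 10.0, "2022-07-19": 10.5, "2022-07-20": 11.0},
    "source_c": {"2022-07-18": 10.0, "2022-07-19": 12.0, "2022-07-20": 11.0},
}
merged, anomalies = cross_validate(sources)
print(merged)
print(anomalies)
```

The merged record covers a date that `source_a` lacks, and the outlier quote from `source_c` is reported rather than silently averaged in, which matches the two goals listed in that section.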

scripts/data_collector/yahoo/README.md

+9 -1
@@ -36,7 +36,7 @@ pip install -r requirements.txt
 - `target_dir`: save dir, by default *~/.qlib/qlib_data/cn_data*
 - `version`: dataset version, value from [`v1`, `v2`], by default `v1`
 - `v2` end date is *2021-06*, `v1` end date is *2020-09*
-- user can append data to `v2`: [automatic update of daily frequency data](#automatic-update-of-daily-frequency-datafrom-yahoo-finance)
+- If users want to incrementally update data, they need to use the yahoo collector to [collect data from scratch](#collector-yahoofinance-data-to-qlib).
 - **the [benchmarks](https://github.com/microsoft/qlib/tree/main/examples/benchmarks) for qlib use `v1`**, *due to the unstable access to historical data by YahooFinance, there are some differences between `v2` and `v1`*
 - `interval`: `1d` or `1min`, by default `1d`
 - `region`: `cn` or `us` or `in`, by default `cn`
@@ -62,6 +62,8 @@ pip install -r requirements.txt
 > collector *YahooFinance* data and *dump* into `qlib` format.
 > If the above ready-made data can't meet users' requirements, users can follow this section to crawl the latest data and convert it to qlib-data.
 1. download data to csv: `python scripts/data_collector/yahoo/collector.py download_data`
+
+This will download the raw data, such as the high, low, open, close and adjclose prices, from yahoo to a local directory, one file per symbol.
 
 - parameters:
 - `source_dir`: save the directory
@@ -99,6 +101,10 @@ pip install -r requirements.txt
 ```
 2. normalize data: `python scripts/data_collector/yahoo/collector.py normalize_data`
 
+This will:
+1. Normalize high, low, close, open price using adjclose.
+2. Normalize the high, low, close, open price so that the first valid trading date's close price is 1.
+
 - parameters:
 - `source_dir`: csv directory
 - `normalize_dir`: result directory
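The two normalization steps listed above can be illustrated with plain Python. This is a sketch under assumptions, not the collector's code: it takes the adjustment factor to be `adjclose / close` and applies it to every price field, then rescales by the first adjusted close, treating the first row as the first valid trading date. The sample rows are made up.

```python
rows = [  # made-up raw rows for one symbol; the first row predates a 2-for-1 split
    {"open": 20.0, "high": 22.0, "low": 19.0, "close": 20.0, "adjclose": 10.0},
    {"open": 10.0, "high": 12.0, "low": 10.0, "close": 11.0, "adjclose": 11.0},
]
PRICE_FIELDS = ("open", "high", "low", "close")


def normalize(rows):
    """Adjust prices with adjclose, then scale so the first close equals 1."""
    adjusted = []
    for row in rows:
        factor = row["adjclose"] / row["close"]  # step 1: adjustment factor (assumed form)
        adjusted.append({k: row[k] * factor for k in PRICE_FIELDS})
    first_close = adjusted[0]["close"]           # step 2: first valid close
    return [{k: v / first_close for k, v in row.items()} for row in adjusted]


normalized = normalize(rows)
print(normalized[0]["close"], normalized[1]["close"])
```

After both steps the series starts at 1 and later values read as growth relative to the first trading day, which is why adjusted prices from different sources can differ in level.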
@@ -136,6 +142,8 @@ pip install -r requirements.txt
 ```
 3. dump data: `python scripts/dump_bin.py dump_all`
 
+This will convert the normalized csv files in the `feature` directory into numpy arrays and store the normalized data as one file per column, with one directory per symbol.
+
 - parameters:
 - `csv_path`: stock data path or directory, **normalize result(normalize_dir)**
 - `qlib_dir`: qlib(dump) data directory
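The dump step described above can be pictured with a small standard-library sketch. It is not `dump_bin.py`: the helper name, the directory and file names (`features/<symbol>/<column>.day.bin`), and the float32 layout with a leading calendar start index are assumptions made for illustration, and the csv rows are made up.

```python
import csv
import io
import struct
import tempfile
from pathlib import Path

# A made-up normalized csv for one symbol.
NORMALIZED_CSV = "date,open,close\n2020-01-02,1.0,1.25\n2020-01-03,1.25,1.5\n"


def dump_symbol(symbol, csv_text, out_dir, start_index=0):
    """Store each numeric column of one symbol's csv as its own binary file."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    symbol_dir = Path(out_dir) / "features" / symbol.lower()  # one directory per symbol
    symbol_dir.mkdir(parents=True, exist_ok=True)
    for column in rows[0]:
        if column == "date":
            continue
        # Assumed layout: calendar start index first, then the column values.
        payload = [float(start_index)] + [float(row[column]) for row in rows]
        packed = struct.pack(f"<{len(payload)}f", *payload)  # little-endian float32
        (symbol_dir / f"{column}.day.bin").write_bytes(packed)  # one file per column
    return symbol_dir


with tempfile.TemporaryDirectory() as tmp:
    symbol_dir = dump_symbol("SH600000", NORMALIZED_CSV, tmp)
    written = sorted(p.name for p in symbol_dir.iterdir())
    close = list(struct.unpack("<3f", (symbol_dir / "close.day.bin").read_bytes()))

print(written, close)
```

Keeping each column in its own flat file means a single field for a single symbol can be read without parsing the rest of the data.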
