Qlib data doc (microsoft#1207)

chenditc · you-n-g · web-flow · commit 86f08e47e8ab · 2022-07-22T09:24:58.000+08:00
* Explain data crawler structure

* Add documentation for data and feature

* Update scripts/data_collector/yahoo/README.md

Co-authored-by: you-n-g <you-n-g@users.noreply.github.com>

* Remove some confusing wording

* Add third party data source

* Fix command typo

* Update commands

Co-authored-by: you-n-g <you-n-g@users.noreply.github.com>
diff --git a/docs/component/data.rst b/docs/component/data.rst
@@ -51,7 +51,7 @@ Also, ``Qlib`` provides a high-frequency dataset. Users can run a high-frequency
 
 Qlib Format Dataset
 -------------------
-``Qlib`` has provided an off-the-shelf dataset in `.bin` format, users could use the script ``scripts/get_data.py`` to download the China-Stock dataset as follows.
+``Qlib`` provides an off-the-shelf dataset in `.bin` format. Users can download the China-Stock dataset with the script ``scripts/get_data.py`` as follows, and can also load a `.bin` file with numpy to validate the data.
 The price volume data look different from the actual dealling price because of they are **adjusted** (`adjusted price <https://www.investopedia.com/terms/a/adjusted_closing_price.asp>`_).  And then you may find that the adjusted price may be different from different data sources. This is because different data sources may vary in the way of adjusting prices. Qlib normalize the price on first trading day of each stock to 1 when adjusting them.
 Users can leverage `$factor` to get the original trading price (e.g. `$close / $factor` to get the original close price).
 
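Note on validating a `.bin` file with numpy: the sketch below is one way to do it, not the project's own loader. It assumes each file is a flat array of little-endian float32 values whose first element is the start offset into the trading calendar; that layout is an assumption and should be checked against `scripts/dump_bin.py`. The helper names `read_qlib_bin` and `original_price` are hypothetical. The second helper applies the `$close / $factor` relation described above.

```python
import numpy as np


def read_qlib_bin(path):
    """Read one feature file and return (start_index, values).

    Assumed layout: little-endian float32, where the first element is the
    offset of the first value in the trading calendar.
    """
    raw = np.fromfile(path, dtype="<f4")
    if raw.size == 0:
        raise ValueError(f"{path} is empty")
    start_index = int(raw[0])
    values = raw[1:]
    return start_index, values


def original_price(adjusted, factor):
    """Recover the traded price from an adjusted price, i.e. $close / $factor."""
    return np.asarray(adjusted, dtype=float) / np.asarray(factor, dtype=float)
```

A typical check would read the close and factor files of one symbol from the downloaded data directory, divide one by the other, and compare a few values against a trusted quote source.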
diff --git a/qlib/contrib/data/handler.py b/qlib/contrib/data/handler.py
@@ -259,113 +259,161 @@ def parse_config_to_fields(config):
             def use(x):
                 return x not in exclude and (include is None or x in include)
 
+            # Some factor ref: https://guorn.com/static/upload/file/3/134065454575605.pdf
             if use("ROC"):
+                # https://www.investopedia.com/terms/r/rateofchange.asp
+                # Rate of change: the close price d days ago divided by the latest close price to remove the unit
                 fields += ["Ref($close, %d)/$close" % d for d in windows]
                 names += ["ROC%d" % d for d in windows]
             if use("MA"):
+                # https://www.investopedia.com/ask/answers/071414/whats-difference-between-moving-average-and-weighted-moving-average.asp
+                # Simple moving average of the close price over the past d days, divided by the latest close price to remove the unit
                 fields += ["Mean($close, %d)/$close" % d for d in windows]
                 names += ["MA%d" % d for d in windows]
             if use("STD"):
+                # The standard deviation of the close price over the past d days, divided by the latest close price to remove the unit
                 fields += ["Std($close, %d)/$close" % d for d in windows]
                 names += ["STD%d" % d for d in windows]
             if use("BETA"):
+                # The rate of change of the close price over the past d days, divided by the latest close price to remove the unit
+                # For example, if the price increases by 10 dollars per day over the past d days, Slope is 10.
                 fields += ["Slope($close, %d)/$close" % d for d in windows]
                 names += ["BETA%d" % d for d in windows]
             if use("RSQR"):
+                # The R-square value of the linear regression over the past d days, representing the trend linearity over that window.
                 fields += ["Rsquare($close, %d)" % d for d in windows]
                 names += ["RSQR%d" % d for d in windows]
             if use("RESI"):
+                # The residual of the linear regression over the past d days, representing the trend linearity over that window.
                 fields += ["Resi($close, %d)/$close" % d for d in windows]
                 names += ["RESI%d" % d for d in windows]
             if use("MAX"):
+                # The highest price over the past d days, divided by the latest close price to remove the unit
                 fields += ["Max($high, %d)/$close" % d for d in windows]
                 names += ["MAX%d" % d for d in windows]
             if use("LOW"):
+                # The lowest price over the past d days, divided by the latest close price to remove the unit
                 fields += ["Min($low, %d)/$close" % d for d in windows]
                 names += ["MIN%d" % d for d in windows]
             if use("QTLU"):
+                # The 80% quantile of the close price over the past d days, divided by the latest close price to remove the unit
+                # Used with MIN and MAX
                 fields += ["Quantile($close, %d, 0.8)/$close" % d for d in windows]
                 names += ["QTLU%d" % d for d in windows]
             if use("QTLD"):
+                # The 20% quantile of the close price over the past d days, divided by the latest close price to remove the unit
                 fields += ["Quantile($close, %d, 0.2)/$close" % d for d in windows]
                 names += ["QTLD%d" % d for d in windows]
             if use("RANK"):
+                # The percentile of the current close price among the close prices of the past d days.
+                # Represents the current price level relative to the past d days, adding information beyond the moving average.
                 fields += ["Rank($close, %d)" % d for d in windows]
                 names += ["RANK%d" % d for d in windows]
             if use("RSV"):
+                # Represents the position of the price between the upper and lower resistance prices over the past d days.
                 fields += ["($close-Min($low, %d))/(Max($high, %d)-Min($low, %d)+1e-12)" % (d, d, d) for d in windows]
                 names += ["RSV%d" % d for d in windows]
             if use("IMAX"):
+                # The number of days between the current date and the date of the previous highest price.
+                # Part of Aroon Indicator https://www.investopedia.com/terms/a/aroon.asp
+                # The indicator measures the time between highs and the time between lows over a time period. 
+                # The idea is that strong uptrends will regularly see new highs, and strong downtrends will regularly see new lows.
                 fields += ["IdxMax($high, %d)/%d" % (d, d) for d in windows]
                 names += ["IMAX%d" % d for d in windows]
             if use("IMIN"):
+                # The number of days between the current date and the date of the previous lowest price.
+                # Part of Aroon Indicator https://www.investopedia.com/terms/a/aroon.asp
+                # The indicator measures the time between highs and the time between lows over a time period. 
+                # The idea is that strong uptrends will regularly see new highs, and strong downtrends will regularly see new lows.
                 fields += ["IdxMin($low, %d)/%d" % (d, d) for d in windows]
                 names += ["IMIN%d" % d for d in windows]
             if use("IMXD"):
+                # The time gap between the previous highest-price date and the previous lowest-price date.
+                # A large value suggests downward momentum.
                 fields += ["(IdxMax($high, %d)-IdxMin($low, %d))/%d" % (d, d, d) for d in windows]
                 names += ["IMXD%d" % d for d in windows]
             if use("CORR"):
+                # The correlation between the absolute close price and the log-scaled trading volume
                 fields += ["Corr($close, Log($volume+1), %d)" % d for d in windows]
                 names += ["CORR%d" % d for d in windows]
             if use("CORD"):
+                # The correlation between price change ratio and volume change ratio
                 fields += ["Corr($close/Ref($close,1), Log($volume/Ref($volume, 1)+1), %d)" % d for d in windows]
                 names += ["CORD%d" % d for d in windows]
             if use("CNTP"):
+                # The percentage of days in the past d days on which the price went up.
                 fields += ["Mean($close>Ref($close, 1), %d)" % d for d in windows]
                 names += ["CNTP%d" % d for d in windows]
             if use("CNTN"):
+                # The percentage of days in the past d days on which the price went down.
                 fields += ["Mean($close<Ref($close, 1), %d)" % d for d in windows]
                 names += ["CNTN%d" % d for d in windows]
             if use("CNTD"):
+                # The difference between the percentage of up days and the percentage of down days in the past d days
                 fields += ["Mean($close>Ref($close, 1), %d)-Mean($close<Ref($close, 1), %d)" % (d, d) for d in windows]
                 names += ["CNTD%d" % d for d in windows]
             if use("SUMP"):
+                # The total gain / the total absolute price change
+                # Similar to RSI indicator. https://www.investopedia.com/terms/r/rsi.asp
                 fields += [
                     "Sum(Greater($close-Ref($close, 1), 0), %d)/(Sum(Abs($close-Ref($close, 1)), %d)+1e-12)" % (d, d)
                     for d in windows
                 ]
                 names += ["SUMP%d" % d for d in windows]
             if use("SUMN"):
+                # The total loss / the total absolute price change
+                # Can be derived from SUMP by SUMN = 1 - SUMP
+                # Similar to RSI indicator. https://www.investopedia.com/terms/r/rsi.asp
                 fields += [
                     "Sum(Greater(Ref($close, 1)-$close, 0), %d)/(Sum(Abs($close-Ref($close, 1)), %d)+1e-12)" % (d, d)
                     for d in windows
                 ]
                 names += ["SUMN%d" % d for d in windows]
             if use("SUMD"):
+                # The difference between the total gain and the total loss, divided by the total absolute price change
+                # Similar to RSI indicator. https://www.investopedia.com/terms/r/rsi.asp
                 fields += [
                     "(Sum(Greater($close-Ref($close, 1), 0), %d)-Sum(Greater(Ref($close, 1)-$close, 0), %d))"
                     "/(Sum(Abs($close-Ref($close, 1)), %d)+1e-12)" % (d, d, d)
                     for d in windows
                 ]
                 names += ["SUMD%d" % d for d in windows]
             if use("VMA"):
+                # Simple Volume Moving average: https://www.barchart.com/education/technical-indicators/volume_moving_average
                 fields += ["Mean($volume, %d)/($volume+1e-12)" % d for d in windows]
                 names += ["VMA%d" % d for d in windows]
             if use("VSTD"):
+                # The standard deviation for volume in past d days.
                 fields += ["Std($volume, %d)/($volume+1e-12)" % d for d in windows]
                 names += ["VSTD%d" % d for d in windows]
             if use("WVMA"):
+                # The volume weighted price change volatility
                 fields += [
                     "Std(Abs($close/Ref($close, 1)-1)*$volume, %d)/(Mean(Abs($close/Ref($close, 1)-1)*$volume, %d)+1e-12)"
                     % (d, d)
                     for d in windows
                 ]
                 names += ["WVMA%d" % d for d in windows]
             if use("VSUMP"):
+                # The total volume increase / the total absolute volume change
                 fields += [
                     "Sum(Greater($volume-Ref($volume, 1), 0), %d)/(Sum(Abs($volume-Ref($volume, 1)), %d)+1e-12)"
                     % (d, d)
                     for d in windows
                 ]
                 names += ["VSUMP%d" % d for d in windows]
             if use("VSUMN"):
+                # The total volume decrease / the total absolute volume change
+                # Can be derived from VSUMP by VSUMN = 1 - VSUMP
                 fields += [
                     "Sum(Greater(Ref($volume, 1)-$volume, 0), %d)/(Sum(Abs($volume-Ref($volume, 1)), %d)+1e-12)"
                     % (d, d)
                     for d in windows
                 ]
                 names += ["VSUMN%d" % d for d in windows]
             if use("VSUMD"):
+                # The difference between the total volume increase and the total volume decrease, divided by the total absolute volume change
+                # RSI indicator for volume
                 fields += [
                     "(Sum(Greater($volume-Ref($volume, 1), 0), %d)-Sum(Greater(Ref($volume, 1)-$volume, 0), %d))"
                     "/(Sum(Abs($volume-Ref($volume, 1)), %d)+1e-12)" % (d, d, d)
diff --git a/scripts/README.md b/scripts/README.md
@@ -67,3 +67,10 @@ from qlib.constant import REG_CN
 provider_uri = "~/.qlib/qlib_data/cn_data"  # target_dir
 qlib.init(provider_uri=provider_uri, region=REG_CN)
 ```
+
+## Use Crowd Sourced Data
+There is also a [crowd sourced version of qlib data](data_collector/crowd_source/README.md): https://github.com/chenditc/investment_data/releases
+```bash
+wget https://github.com/chenditc/investment_data/releases/download/20220720/qlib_bin.tar.gz
+tar -zxvf qlib_bin.tar.gz -C ~/.qlib/qlib_data/cn_data --strip-components=2
+```
diff --git a/scripts/data_collector/crowd_source/README.md b/scripts/data_collector/crowd_source/README.md
@@ -0,0 +1,32 @@
+# Crowd Source Data
+
+## Initiative
+Public data sources such as Yahoo are flawed: they may be missing data for delisted stocks, and some of the data they do have may be wrong. This can introduce survivorship bias into our training process.
+
+The crowd sourced data merges data from multiple data sources and cross-validates them against each other, so that:
+1. We will have a more complete history record.
+2. We can identify anomalous data and apply corrections when necessary.
+
+## Related Repo
+The raw data is hosted in a DoltHub repo: https://www.dolthub.com/repositories/chenditc/investment_data
+
+The processing scripts and SQL are hosted in a GitHub repo: https://github.com/chenditc/investment_data
+
+The packaged docker runtime is hosted on Docker Hub: https://hub.docker.com/repository/docker/chenditc/investment_data
+
+## How to use it in qlib
+### Option 1: Download release bin data
+Users can download the data in qlib bin format and use it directly: https://github.com/chenditc/investment_data/releases/tag/20220720
+```bash
+wget https://github.com/chenditc/investment_data/releases/download/20220720/qlib_bin.tar.gz
+tar -zxvf qlib_bin.tar.gz -C ~/.qlib/qlib_data/cn_data --strip-components=2
+```
+
+### Option 2: Generate qlib data from dolthub
+The DoltHub data is updated daily, so users who want up-to-date data can dump the qlib bin using docker:
+```
+docker run -v /<some output directory>:/output -it --rm chenditc/investment_data bash -c "bash dump_qlib_bin.sh && cp ./qlib_bin.tar.gz /output/"
+```
+
+## FAQ and other info
+See: https://github.com/chenditc/investment_data/blob/main/README.md
diff --git a/scripts/data_collector/yahoo/README.md b/scripts/data_collector/yahoo/README.md
@@ -36,7 +36,7 @@ pip install -r requirements.txt
     - `target_dir`: save dir, by default *~/.qlib/qlib_data/cn_data*
     - `version`: dataset version, value from [`v1`, `v2`], by default `v1`
       - `v2` end date is *2021-06*, `v1` end date is *2020-09*
-      - user can append data to `v2`: [automatic update of daily frequency data](#automatic-update-of-daily-frequency-datafrom-yahoo-finance)
+      - If users want to incrementally update the data, they need to use the yahoo collector to [collect data from scratch](#collector-yahoofinance-data-to-qlib).
       - **the [benchmarks](https://github.com/microsoft/qlib/tree/main/examples/benchmarks) for qlib use `v1`**, *due to the unstable access to historical data by YahooFinance, there are some differences between `v2` and `v1`*
     - `interval`: `1d` or `1min`, by default `1d`
     - `region`: `cn` or `us` or `in`, by default `cn`
@@ -62,6 +62,8 @@ pip install -r requirements.txt
 > collector *YahooFinance* data and *dump* into `qlib` format.
 > If the above ready-made data can't meet users' requirements,  users can follow this section to crawl the latest data and convert it to qlib-data.
   1. download data to csv: `python scripts/data_collector/yahoo/collector.py download_data`
+     
+     This will download raw data such as the high, low, open, close, and adjclose prices from Yahoo to a local directory, one file per symbol.
 
      - parameters:
           - `source_dir`: save the directory
@@ -99,6 +101,10 @@ pip install -r requirements.txt
           ```
   2. normalize data: `python scripts/data_collector/yahoo/collector.py normalize_data`
      
+     This will:
+     1. Normalize the high, low, close, and open prices using adjclose.
+     2. Rescale the high, low, close, and open prices so that the close price on the first valid trading date is 1.
+
      - parameters:
           - `source_dir`: csv directory
           - `normalize_dir`: result directory
@@ -136,6 +142,8 @@ pip install -r requirements.txt
         ```
   3. dump data: `python scripts/dump_bin.py dump_all`
     
+     This will convert the normalized csv files to numpy arrays in the `feature` directory, storing one directory per symbol and one file per column.
+    
      - parameters:
        - `csv_path`: stock data path or directory, **normalize result(normalize_dir)**
        - `qlib_dir`: qlib(dump) data director
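
Note on the normalize step: the two stages described above (adjust with adjclose, then rescale so that the first close is 1) can be sketched as follows. This is a minimal illustration under stated assumptions, not the collector's code: the adjustment factor is taken to be `adjclose / close`, the first row is assumed to be the first valid trading date, and the helper name `normalize_rows` is hypothetical.

```python
PRICE_FIELDS = ("open", "high", "low", "close")


def normalize_rows(rows):
    """Normalize a list of daily rows (dicts, oldest first).

    Each row needs open, high, low, close, and adjclose, with a nonzero close.
    Returns new rows holding the normalized prices plus a `factor` column.
    """
    if not rows:
        return []
    adjusted = []
    for row in rows:
        # Stage 1: adjust every price with the factor implied by adjclose.
        factor = row["adjclose"] / row["close"]
        out = {name: row[name] * factor for name in PRICE_FIELDS}
        out["factor"] = factor
        adjusted.append(out)
    # Stage 2: rescale so that the first close equals 1.
    first_close = adjusted[0]["close"]
    return [{key: value / first_close for key, value in row.items()} for row in adjusted]
```

Because the factor is rescaled together with the prices, dividing a normalized close by its factor gives back the original close, which matches the `$close / $factor` relation in the data documentation.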