docs/component/data.rst
+1 -1
@@ -51,7 +51,7 @@ Also, ``Qlib`` provides a high-frequency dataset. Users can run a high-frequency
 Qlib Format Dataset
 -------------------
-``Qlib`` has provided an off-the-shelf dataset in `.bin` format, users could use the script ``scripts/get_data.py`` to download the China-Stock dataset as follows.
+``Qlib`` has provided an off-the-shelf dataset in `.bin` format, users could use the script ``scripts/get_data.py`` to download the China-Stock dataset as follows. Users can also use numpy to load a `.bin` file to validate the data.
 The price volume data look different from the actual dealing price because they are **adjusted** (`adjusted price <https://www.investopedia.com/terms/a/adjusted_closing_price.asp>`_). You may also find that the adjusted price differs between data sources. This is because different data sources may vary in the way of adjusting prices. Qlib normalizes the price on the first trading day of each stock to 1 when adjusting them.
 Users can leverage `$factor` to get the original trading price (e.g. `$close / $factor` to get the original close price).
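The numpy check and the `$close / $factor` relation mentioned above can be sketched as follows. This is a minimal illustration, not Qlib's own loader: it assumes each `.bin` file holds little-endian float32 values whose first element is the index of the first trading day in the calendar, and the file names `close.day.bin` and `factor.day.bin` and the symbol `sh600000` are hypothetical. The data is synthetic so the snippet runs without a downloaded dataset.

```python
import tempfile
from pathlib import Path

import numpy as np


def load_bin(path):
    """Read one feature file.

    Assumed layout: little-endian float32 values, where the first value is
    the calendar index of the first trading day and the remaining values are
    the feature for consecutive trading days.
    """
    raw = np.fromfile(str(path), dtype="<f4")
    start_index = int(raw[0])
    return start_index, raw[1:]


# Build a tiny synthetic dataset instead of touching real data.
with tempfile.TemporaryDirectory() as tmp:
    symbol_dir = Path(tmp) / "sh600000"  # hypothetical symbol directory
    symbol_dir.mkdir()
    # Adjusted close and adjustment factor for three days, starting at calendar index 5.
    np.array([5, 1.0, 1.1, 1.21], dtype="<f4").tofile(str(symbol_dir / "close.day.bin"))
    np.array([5, 0.1, 0.1, 0.1], dtype="<f4").tofile(str(symbol_dir / "factor.day.bin"))

    close_start, close = load_bin(symbol_dir / "close.day.bin")
    factor_start, factor = load_bin(symbol_dir / "factor.day.bin")

# Both series must start on the same calendar day before dividing element-wise.
assert close_start == factor_start
original_close = close / factor  # adjusted close / factor = original close
```

With real data, the loaded values can be compared against an independent source after dividing by the factor, which is the kind of validation the added sentence suggests.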
There is also a [crowd sourced version of qlib data](data_collector/crowd_source/README.md): https://github.com/chenditc/investment_data/releases
Public data sources such as Yahoo are flawed: they may lack data for delisted stocks, and some of the data they do have may be wrong. This can introduce survivorship bias into our training process.
+
+The crowd sourced data is introduced to merge data from multiple data sources and cross validate them against each other, so that:
+1. We will have a more complete history record.
+2. We can identify anomalous data and apply corrections when necessary.
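The idea of merging several sources and cross validating them can be sketched as below. This is an illustrative assumption, not the method used by the investment_data repository: the function name `cross_validate`, the median consensus rule, and the 1% tolerance are all hypothetical choices made for the example.

```python
from statistics import median


def cross_validate(prices_by_source, tolerance=0.01):
    """Merge per-date prices from several sources.

    prices_by_source maps a source name to {date: close price}.
    Returns (merged, anomalies): merged holds the median price per date, and
    anomalies lists (date, source, price) entries that deviate from the
    median by more than `tolerance` (relative difference).
    """
    all_dates = sorted({d for prices in prices_by_source.values() for d in prices})
    merged, anomalies = {}, []
    for date in all_dates:
        # Collect every source that has a value for this date.
        observed = {s: p[date] for s, p in prices_by_source.items() if date in p}
        consensus = median(observed.values())
        merged[date] = consensus
        for source, price in observed.items():
            if abs(price - consensus) > tolerance * abs(consensus):
                anomalies.append((date, source, price))
    return merged, anomalies


sources = {
    "source_a": {"2020-01-02": 10.0, "2020-01-03": 10.5},  # misses the last day
    "source_b": {"2020-01-02": 10.0, "2020-01-03": 10.5, "2020-01-06": 10.7},
    "source_c": {"2020-01-02": 10.0, "2020-01-03": 15.0, "2020-01-06": 10.7},
}
merged, anomalies = cross_validate(sources)
```

Here the merged record covers all three dates even though `source_a` misses one, and the outlying value from `source_c` is flagged for review rather than silently kept.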
+
+## Related Repo
+The raw data is hosted on dolthub repo: https://www.dolthub.com/repositories/chenditc/investment_data
+
+The processing script and sql are hosted on github repo: https://github.com/chenditc/investment_data
+
+The packaged docker runtime is hosted on dockerhub: https://hub.docker.com/repository/docker/chenditc/investment_data
+
+## How to use it in qlib
+### Option 1: Download release bin data
+Users can download the data in qlib bin format and use it directly: https://github.com/chenditc/investment_data/releases/tag/20220720
scripts/data_collector/yahoo/README.md
+9 -1
@@ -36,7 +36,7 @@ pip install -r requirements.txt
 - `target_dir`: save dir, by default *~/.qlib/qlib_data/cn_data*
 - `version`: dataset version, value from [`v1`, `v2`], by default `v1`
 - `v2` end date is *2021-06*, `v1` end date is *2020-09*
-- user can append data to `v2`: [automatic update of daily frequency data](#automatic-update-of-daily-frequency-datafrom-yahoo-finance)
+- If users want to incrementally update data, they need to use yahoo collector to [collect data from scratch](#collector-yahoofinance-data-to-qlib).
 - **the [benchmarks](https://github.com/microsoft/qlib/tree/main/examples/benchmarks) for qlib use `v1`**, *due to the unstable access to historical data by YahooFinance, there are some differences between `v2` and `v1`*
 - `interval`: `1d` or `1min`, by default `1d`
 - `region`: `cn` or `us` or `in`, by default `cn`
@@ -62,6 +62,8 @@ pip install -r requirements.txt
 > collect *YahooFinance* data and *dump* into `qlib` format.
 > If the above ready-made data can't meet users' requirements, users can follow this section to crawl the latest data and convert it to qlib-data.
 1. download data to csv: `python scripts/data_collector/yahoo/collector.py download_data`
+
+This will download raw data such as the high, low, open, close, and adjclose prices from Yahoo to a local directory, one file per symbol.
This will convert the normalized csv files in the `feature` directory to numpy arrays and store the normalized data with one file per column and one directory per symbol.
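The layout described above (one directory per symbol, one file per column) can be sketched as follows. This is not the project's dump script: the helper `dump_symbol`, the `<field>.<freq>.bin` file naming, the float32 encoding with a leading calendar index, and the sample symbol are assumptions made for illustration, and the rows are assumed to cover consecutive calendar days.

```python
import csv
import io
import tempfile
from pathlib import Path

import numpy as np


def dump_symbol(csv_text, symbol, calendar, out_dir, freq="day"):
    """Write each column of a normalized csv to its own binary file.

    Assumed encoding: little-endian float32, with the calendar index of the
    first row stored before the column values.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    start_index = calendar.index(rows[0]["date"])
    symbol_dir = Path(out_dir) / symbol.lower()  # one directory per symbol
    symbol_dir.mkdir(parents=True, exist_ok=True)
    fields = [name for name in rows[0] if name != "date"]
    for field in fields:  # one file per column
        values = [float(row[field]) for row in rows]
        data = np.array([start_index, *values], dtype="<f4")
        data.tofile(str(symbol_dir / f"{field}.{freq}.bin"))
    return symbol_dir


calendar = ["2020-01-02", "2020-01-03", "2020-01-06"]
normalized_csv = "date,open,close\n2020-01-03,1.0,1.1\n2020-01-06,1.1,1.2\n"

with tempfile.TemporaryDirectory() as tmp:
    symbol_dir = dump_symbol(normalized_csv, "SH600000", calendar, tmp)
    written = sorted(p.name for p in symbol_dir.iterdir())
    close = np.fromfile(str(symbol_dir / "close.day.bin"), dtype="<f4")
```

Storing the start index alongside the values lets a reader align each column with the trading calendar without keeping the dates in every file.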
+
 - parameters:
 - `csv_path`: stock data path or directory, **normalize result(normalize_dir)**