> **Please pay ATTENTION** that the data is collected from Yahoo Finance and might not be perfect. We recommend that users prepare their own data if they have a high-quality dataset. For more information, users can refer to the related documents.

> **NOTE**: Yahoo! Finance has blocked access from China. Please change your network if you want to use the Yahoo data crawler.
## Examples of abnormal data

We have considered STOCK PRICE ADJUSTMENT, but some price series still look very abnormal.
```bash
pip install -r requirements.txt
```
## qlib-data

The ready-made `qlib-data` from YahooFinance has already been dumped and can be used directly in `qlib`. It is not updated regularly; if users want the latest data, please follow these steps to download it.
- get data: `python scripts/get_data.py qlib_data`
- parameters:
  - `target_dir`: save dir, by default *~/.qlib/qlib_data/cn_data*
  - `version`: dataset version, value from [`v1`, `v2`], by default `v1`
    - `v2` end date is *2021-06*, `v1` end date is *2020-09*
    - If users want to incrementally update data, they need to use the yahoo collector to collect data from scratch.
    - the benchmarks for qlib use `v1`; due to YahooFinance's unstable access to historical data, there are some differences between `v2` and `v1`
  - `interval`: `1d` or `1min`, by default `1d`
  - `region`: `cn` or `us` or `in`, by default `cn`
  - `delete_old`: delete existing data from `target_dir` (*features, calendars, instruments, dataset_cache, features_cache*), value from [`True`, `False`], by default `True`
  - `exists_skip`: if `target_dir` data already exists, skip `get_data`, value from [`True`, `False`], by default `False`
- examples:
  ```bash
  # cn 1d
  python scripts/get_data.py qlib_data --target_dir ~/.qlib/qlib_data/cn_data --region cn
  # cn 1min
  python scripts/get_data.py qlib_data --target_dir ~/.qlib/qlib_data/cn_data_1min --region cn --interval 1min
  # us 1d
  python scripts/get_data.py qlib_data --target_dir ~/.qlib/qlib_data/us_data --region us --interval 1d
  # us 1min
  python scripts/get_data.py qlib_data --target_dir ~/.qlib/qlib_data/us_data_1min --region us --interval 1min
  # in 1d
  python scripts/get_data.py qlib_data --target_dir ~/.qlib/qlib_data/in_data --region in --interval 1d
  # in 1min
  python scripts/get_data.py qlib_data --target_dir ~/.qlib/qlib_data/in_data_1min --region in --interval 1min
  ```
## Collect YahooFinance data and dump it into `qlib` format

If the above ready-made data can't meet users' requirements, users can follow this section to crawl the latest data and convert it to qlib-data.
- download data to csv: `python scripts/data_collector/yahoo/collector.py download_data`
  - This will download the raw data, such as the high, low, open, close and adjclose prices, from Yahoo to a local directory, one file per symbol.
- parameters:
  - `source_dir`: save directory
  - `interval`: `1d` or `1min`, by default `1d`
    - due to the limitation of the YahooFinance API, only the last month's data is available in `1min`
  - `region`: `CN` or `US` or `IN` or `BR`, by default `CN`
  - `delay`: `time.sleep(delay)`, by default 0.5
  - `start`: start datetime, by default *"2000-01-01"*; closed interval (including start)
  - `end`: end datetime, by default `pd.Timestamp(datetime.datetime.now() + pd.Timedelta(days=1))`; open interval (excluding end)
  - `max_workers`: number of concurrent symbols; it is not recommended to change this parameter, in order to maintain the integrity of the symbol data, by default 1
  - `check_data_length`: check the number of rows per symbol, by default `None`
    - if `len(symbol_df) < check_data_length`, the symbol will be re-fetched, with the number of re-fetches coming from the `max_collector_count` parameter
  - `max_collector_count`: number of retries for "failed" symbols, by default 2
- examples:
  ```bash
  # cn 1d data
  python collector.py download_data --source_dir ~/.qlib/stock_data/source/cn_data --start 2020-01-01 --end 2020-12-31 --delay 1 --interval 1d --region CN
  # cn 1min data
  python collector.py download_data --source_dir ~/.qlib/stock_data/source/cn_data_1min --delay 1 --interval 1min --region CN
  # us 1d data
  python collector.py download_data --source_dir ~/.qlib/stock_data/source/us_data --start 2020-01-01 --end 2020-12-31 --delay 1 --interval 1d --region US
  # us 1min data
  python collector.py download_data --source_dir ~/.qlib/stock_data/source/us_data_1min --delay 1 --interval 1min --region US
  # in 1d data
  python collector.py download_data --source_dir ~/.qlib/stock_data/source/in_data --start 2020-01-01 --end 2020-12-31 --delay 1 --interval 1d --region IN
  # in 1min data
  python collector.py download_data --source_dir ~/.qlib/stock_data/source/in_data_1min --delay 1 --interval 1min --region IN
  # br 1d data
  python collector.py download_data --source_dir ~/.qlib/stock_data/source/br_data --start 2003-01-03 --end 2022-03-01 --delay 1 --interval 1d --region BR
  # br 1min data
  python collector.py download_data --source_dir ~/.qlib/stock_data/source/br_data_1min --delay 1 --interval 1min --region BR
  ```
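The re-fetch behaviour described by `check_data_length` and `max_collector_count` can be sketched roughly as follows. This is a simplified illustration of the retry policy, not the collector's actual code; `fetch_symbol` is a hypothetical stand-in for the Yahoo request:

```python
import time

def collect(symbol, fetch_symbol, delay=0.5, check_data_length=None, max_collector_count=2):
    """Sketch: re-fetch a symbol until it has enough rows
    or the retry budget (max_collector_count) is exhausted."""
    df = None
    for _ in range(max_collector_count):
        df = fetch_symbol(symbol)  # hypothetical download call
        time.sleep(delay)          # the --delay parameter
        if check_data_length is None or len(df) >= check_data_length:
            break                  # enough rows, accept the result
    return df

# toy fetcher that returns more rows on each call
calls = []
def fake_fetch(symbol):
    calls.append(symbol)
    return list(range(10 * len(calls)))  # 10 rows, then 20 rows, ...

rows = collect("SH600000", fake_fetch, delay=0, check_data_length=15)
# the second attempt returns 20 rows, which satisfies check_data_length
```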
- normalize data: `python scripts/data_collector/yahoo/collector.py normalize_data`
  - This will:
    1. Normalize the high, low, close and open prices using adjclose.
    2. Rescale the high, low, close and open prices so that the first valid trading date's close price is 1.
- parameters:
  - `source_dir`: csv directory
  - `normalize_dir`: result directory
  - `max_workers`: number of concurrent workers, by default 1
  - `interval`: `1d` or `1min`, by default `1d`
    - if `interval == 1min`, `qlib_data_1d_dir` cannot be `None`
  - `region`: `CN` or `US` or `IN`, by default `CN`
  - `date_field_name`: column name identifying time in csv files, by default `date`
  - `symbol_field_name`: column name identifying symbol in csv files, by default `symbol`
  - `end_date`: if not `None`, normalize the last date saved (including end_date); if `None`, this parameter is ignored; by default `None`
  - `qlib_data_1d_dir`: qlib directory (1d data)
    - if `interval == 1min`, `qlib_data_1d_dir` cannot be `None`: normalizing 1min data requires 1d data. `qlib_data_1d` can be obtained like this:
      ```bash
      python scripts/get_data.py qlib_data --target_dir <qlib_data_1d_dir> --interval 1d
      python scripts/data_collector/yahoo/collector.py update_data_to_bin --qlib_data_1d_dir <qlib_data_1d_dir> --end_date <end_date>
      ```
      or download 1d data from YahooFinance.
- examples:
  ```bash
  # normalize 1d cn
  python collector.py normalize_data --source_dir ~/.qlib/stock_data/source/cn_data --normalize_dir ~/.qlib/stock_data/source/cn_1d_nor --region CN --interval 1d
  # normalize 1min cn
  python collector.py normalize_data --qlib_data_1d_dir ~/.qlib/qlib_data/cn_data --source_dir ~/.qlib/stock_data/source/cn_data_1min --normalize_dir ~/.qlib/stock_data/source/cn_1min_nor --region CN --interval 1min
  # normalize 1d br
  python scripts/data_collector/yahoo/collector.py normalize_data --source_dir ~/.qlib/stock_data/source/br_data --normalize_dir ~/.qlib/stock_data/source/br_1d_nor --region BR --interval 1d
  # normalize 1min br
  python collector.py normalize_data --qlib_data_1d_dir ~/.qlib/qlib_data/br_data --source_dir ~/.qlib/stock_data/source/br_data_1min --normalize_dir ~/.qlib/stock_data/source/br_1min_nor --region BR --interval 1min
  ```
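The two normalization steps above can be sketched in pandas. This is a simplified illustration of the idea (adjust by adjclose, then rescale so the first valid close is 1), not the collector's actual implementation:

```python
import pandas as pd

def normalize(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch: adjust OHLC prices by adjclose, then rescale so that
    the first valid trading date's close price is 1."""
    df = df.copy()
    price_cols = ["open", "high", "low", "close"]
    # step 1: apply the adjustment factor implied by adjclose
    factor = df["adjclose"] / df["close"]
    for col in price_cols:
        df[col] = df[col] * factor
    # step 2: rescale so the first valid close is 1
    first_close = df["close"].loc[df["close"].first_valid_index()]
    for col in price_cols:
        df[col] = df[col] / first_close
    return df

raw = pd.DataFrame({
    "open":     [10.0, 11.0],
    "high":     [10.5, 11.5],
    "low":      [9.5, 10.5],
    "close":    [10.0, 11.0],
    "adjclose": [5.0, 5.5],   # e.g. a 2:1 split adjustment
})
out = normalize(raw)
# out["close"].iloc[0] == 1.0 after normalization
```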
- dump data: `python scripts/dump_bin.py dump_all`
  - This will convert the normalized csv in the `feature` directory to numpy arrays, storing the normalized data with one file per column and one directory per symbol.
- parameters:
  - `csv_path`: stock data path or directory, the normalize result (*normalize_dir*)
  - `qlib_dir`: qlib (dump) data directory
  - `freq`: transaction frequency, by default `day`
    - `freq_map = {"1d": "day", "1min": "1min"}`
  - `max_workers`: number of threads, by default 16
  - `include_fields`: dump fields, by default `""`
  - `exclude_fields`: fields not dumped, by default `""`
    - `dump_fields = include_fields if include_fields else set(symbol_df.columns) - set(exclude_fields) if exclude_fields else symbol_df.columns`
  - `symbol_field_name`: column name identifying symbol in csv files, by default `symbol`
  - `date_field_name`: column name identifying time in csv files, by default `date`
- examples:
  ```bash
  # dump 1d cn
  python dump_bin.py dump_all --csv_path ~/.qlib/stock_data/source/cn_1d_nor --qlib_dir ~/.qlib/qlib_data/cn_data --freq day --exclude_fields date,symbol
  # dump 1min cn
  python dump_bin.py dump_all --csv_path ~/.qlib/stock_data/source/cn_1min_nor --qlib_dir ~/.qlib/qlib_data/cn_data_1min --freq 1min --exclude_fields date,symbol
  ```
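The `include_fields`/`exclude_fields` resolution quoted in the parameter list behaves like this. A small standalone illustration of the chained conditional, with a toy column list (the function name here is hypothetical, not dump_bin's API):

```python
def resolve_dump_fields(columns, include_fields=(), exclude_fields=()):
    """Sketch of the field-selection rule: include_fields wins if given;
    otherwise exclude_fields is subtracted; otherwise all columns are dumped."""
    return (
        set(include_fields)
        if include_fields
        else set(columns) - set(exclude_fields)
        if exclude_fields
        else set(columns)
    )

cols = ["date", "symbol", "open", "close"]
fields = resolve_dump_fields(cols, exclude_fields=["date", "symbol"])
# fields == {"open", "close"} — date and symbol are not dumped as features
```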
## Automatic update of data to the "qlib" directory each trading day (Linux)

> It is recommended that users update the data manually once (`--trading_date 2021-05-25`) and then set it to update automatically.

> **NOTE**: Users can't incrementally update data based on the offline data provided by Qlib (some fields are removed to reduce the data size). Users should use the yahoo collector to download Yahoo data from scratch and then incrementally update it.

- use crontab: `crontab -e`
- set up timed tasks:
  ```
  * * * * 1-5 python <script path> update_data_to_bin --qlib_data_1d_dir <user data dir>
  ```
  - **script path**: *scripts/data_collector/yahoo/collector.py*
- Manual update of data: `python scripts/data_collector/yahoo/collector.py update_data_to_bin --qlib_data_1d_dir <user data dir> --end_date <end date>`
  - `end_date`: end of trading day (not included)
  - `check_data_length`: check the number of rows per symbol, by default `None`
    - if `len(symbol_df) < check_data_length`, the symbol will be re-fetched, with the number of re-fetches coming from the `max_collector_count` parameter
- `scripts/data_collector/yahoo/collector.py update_data_to_bin` parameters:
  - `source_dir`: the directory where the raw data collected from the Internet is saved, by default *Path(__file__).parent/source*
  - `normalize_dir`: directory for normalized data, by default *Path(__file__).parent/normalize*
  - `qlib_data_1d_dir`: the qlib data to be updated for yahoo, usually from: download qlib data
  - `end_date`: end datetime, by default `pd.Timestamp(trading_date + pd.Timedelta(days=1))`; open interval (excluding end)
  - `region`: region, value from [`"CN"`, `"US"`], by default `"CN"`
  - `interval`: interval, by default `"1d"` (currently only 1d data is supported)
  - `exists_skip`: skip if data already exists, by default `False`
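The `start`/`end` conventions used throughout this document (closed start, open end) amount to a half-open date filter. Assuming a pandas DataFrame with a `date` column, the selection looks like this (illustration only, not the collector's code):

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2021-05-24", periods=4, freq="D"),
    "close": [1.0, 1.1, 1.2, 1.3],
})

start, end = pd.Timestamp("2021-05-24"), pd.Timestamp("2021-05-27")
# closed interval on start (included), open interval on end (excluded)
window = df[(df["date"] >= start) & (df["date"] < end)]
# keeps the rows for 2021-05-24, 2021-05-25 and 2021-05-26
```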
## Using data

```python
import qlib
from qlib.data import D

# 1d data cn
# freq=day (freq defaults to day)
qlib.init(provider_uri="~/.qlib/qlib_data/cn_data", region="cn")
df = D.features(D.instruments("all"), ["$close"], freq="day")

# 1min data cn
# freq=1min
qlib.init(provider_uri="~/.qlib/qlib_data/cn_data_1min", region="cn")
inst = D.list_instruments(D.instruments("all"), freq="1min", as_list=True)
# get 100 symbols
df = D.features(inst[:100], ["$close"], freq="1min")
# get all symbol data
# df = D.features(D.instruments("all"), ["$close"], freq="1min")

# 1d data us
qlib.init(provider_uri="~/.qlib/qlib_data/us_data", region="us")
df = D.features(D.instruments("all"), ["$close"], freq="day")

# 1min data us
qlib.init(provider_uri="~/.qlib/qlib_data/us_data_1min", region="us")
inst = D.list_instruments(D.instruments("all"), freq="1min", as_list=True)
# get 100 symbols
df = D.features(inst[:100], ["$close"], freq="1min")
# get all symbol data
# df = D.features(D.instruments("all"), ["$close"], freq="1min")
```