This repository contains the dataset splits used in Pitfalls of Static Language Modelling (Lazaridou, Kuncoro, Gribovskaya et al., 2021).
We provide splits of two public datasets used in the paper: WMT News Crawl and arXiv abstracts.
Each subset is stored on Google Cloud Storage as a gzipped text file that lists the publication dates (in YYYYMMDD format) and IDs of the documents in the subset.
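
As a rough illustration of reading one of these subset files, the sketch below loads a gzipped split into a dictionary. The function name `load_split` is hypothetical, and the line layout (one document per line, date first, then the ID, separated by whitespace) is an assumption that should be checked against the actual files.

```python
import gzip
from datetime import datetime


def load_split(path):
    """Return {doc_id: publication_date} for one gzipped split file.

    Assumption: one document per line, with the YYYYMMDD date first and the
    document ID second, separated by whitespace.
    """
    doc_dates = {}
    with gzip.open(path, 'rt', encoding='utf-8') as split_file:
        for line in split_file:
            fields = line.split()
            if len(fields) != 2:
                continue  # skip blank or unexpected lines
            date_str, doc_id = fields
            doc_dates[doc_id] = datetime.strptime(date_str, '%Y%m%d').date()
    return doc_dates
```

If the real files order or delimit the fields differently, only the unpacking line needs to change.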
arXiv abstracts and publication dates were obtained through arXiv's OAI-PMH service on January 2, 2021. We used the value of the `created` field as each article's publication date. The arXiv dataset can also be downloaded from Kaggle.
We downloaded the document-split versions of the English and German WMT News Crawl datasets. Because the dataset does not provide document IDs, we used the SHA256 hash of each article's Base64-encoded unsplit text as its ID, i.e.:
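```python
import gzip
import hashlib


def read_documents(path='news-docs.2007.en.filtered.gz'):
    # Each line holds three tab-separated fields: date, sentence-split text, unsplit text.
    with gzip.open(path, 'rb') as gz_file:
        for line in gz_file:
            date, sentence_split_text, unsplit_text = line.decode('utf-8').strip().split('\t')
            # The document ID is the SHA256 hash of the (Base64-encoded) unsplit text.
            docid = hashlib.sha256(unsplit_text.encode('utf-8')).hexdigest()
            yield docid, (date, sentence_split_text, unsplit_text)
```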
We trained models on sentence-split article texts. Some articles appear multiple times in the dataset with different publication dates; in those cases we used the article's earliest publication date.
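
The earliest-date rule can be sketched as follows. This is a minimal illustration, not the code used for the paper: the helper name `keep_earliest` is hypothetical, and it assumes dates are YYYYMMDD strings, so plain string comparison matches chronological order.

```python
def keep_earliest(records):
    """Map each document ID to its record with the earliest date.

    `records` is any iterable of
    (docid, (date, sentence_split_text, unsplit_text)) tuples.
    Assumption: dates are YYYYMMDD strings, so string order equals
    chronological order.
    """
    earliest = {}
    for docid, record in records:
        date = record[0]
        # Keep the first record seen, or replace it if this one is older.
        if docid not in earliest or date < earliest[docid][0]:
            earliest[docid] = record
    return earliest
```

Holding one record per ID in memory keeps the sketch simple; for the full News Crawl data, storing only the date per ID may be preferable.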
Experiments | Dataset | Splits |
---|---|---|
Sections 3-5 | WMT | control: train, validation; time-stratified: train, validation; test |
Sections 3-5 | arXiv | control: train, validation; time-stratified: train, validation; test |
Appendix B: The effect of outdated models persists beyond the 2018/2019 test period | WMT | test period 2017/2018: control: train, validation; time-stratified: train, validation; test<br>test period 2016/2017: control: train, validation; time-stratified: train, validation; test<br>test period 2015/2016: control: train, validation; time-stratified: train, validation; test<br>test period 2014/2015: control: train, validation; time-stratified: train, validation; test<br>test period 2013/2014: control: train, validation; time-stratified: train, validation; test |
Appendix C: The effect of outdated models persists beyond the two-year gap | WMT | test: same as the one for Sections 3-5<br>validation: same as the one for the time-stratified setup for Sections 3-5<br>train until: 2017-09-30, 2017-03-31, 2016-09-30, 2016-03-31, 2015-09-30, 2015-03-31, 2014-09-30, 2014-03-31, 2013-09-30, 2013-03-31, 2012-09-30 |
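
To make the "train until" cutoffs concrete, here is a small sketch of selecting documents by cutoff date from a mapping of document IDs to publication dates. The function name `train_until` is hypothetical, and treating the cutoff date as inclusive is an assumption rather than something the table states.

```python
from datetime import date


def train_until(doc_dates, cutoff):
    """Return the IDs of documents published on or before `cutoff`.

    `doc_dates` maps document IDs to YYYYMMDD strings; `cutoff` is an ISO
    date such as '2017-09-30'. Assumption: the cutoff is inclusive.
    """
    # Convert the ISO cutoff to YYYYMMDD so it compares directly with the dates.
    cutoff_key = date.fromisoformat(cutoff).strftime('%Y%m%d')
    return {doc_id for doc_id, published in doc_dates.items()
            if published <= cutoff_key}
```

For example, calling `train_until(doc_dates, '2017-09-30')` keeps only the documents dated 30 September 2017 or earlier.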