Pitfalls of Static Language Modelling

This repository contains the dataset splits used in Pitfalls of Static Language Modelling (Lazaridou, Kuncoro, Gribovskaya et al., 2021).

Datasets

We provide splits of two public datasets used in the paper: WMT News Crawl and arXiv abstracts.

Each subset is stored on Google Cloud Storage as a gzipped text file listing the publication dates (in YYYYMMDD format) and IDs of the documents contained in the subset.
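
A minimal sketch of loading one of these split files is shown below. The exact line layout is not specified here, so the sketch assumes one document per line with two whitespace-separated fields, and it treats whichever field is eight digits long as the date. The names `parse_split_lines` and `read_split` are hypothetical helpers, not part of this repository.

```python
import gzip


def _is_date(field):
  """True if the field looks like a YYYYMMDD date (eight digits)."""
  return len(field) == 8 and field.isdigit()


def parse_split_lines(lines):
  """Returns {docid: date} from an iterable of text lines.

  Assumption: each line holds exactly two whitespace-separated fields,
  a YYYYMMDD date and a document ID, in either order.
  """
  split = {}
  for line in lines:
    fields = line.split()
    if len(fields) != 2:
      continue  # Skip blank or unexpected lines.
    if _is_date(fields[0]):
      date, docid = fields
    elif _is_date(fields[1]):
      docid, date = fields
    else:
      continue  # Neither field looks like a date.
    split[docid] = date
  return split


def read_split(path):
  """Reads a gzipped split file from a local path (hypothetical helper)."""
  with gzip.open(path, 'rt', encoding='utf-8') as split_file:
    return parse_split_lines(split_file)
```

It is worth inspecting the first few lines of a downloaded file before relying on this layout.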

arXiv

arXiv abstracts and publication dates were obtained through arXiv's OAI-PMH service on January 2, 2021. We used the value in the created field as the article's publication date. The arXiv dataset can also be downloaded from Kaggle.
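
As a rough illustration of using the created field, the sketch below extracts the ID and creation date from a single arXiv metadata record and converts the date to YYYYMMDD. It assumes the record uses arXiv's own OAI-PMH metadata format with the namespace shown in the code; the sample record is made up for illustration and `publication_date` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Assumed namespace of the arXiv metadata format served over OAI-PMH.
ARXIV_NS = {'arxiv': 'http://arxiv.org/OAI/arXiv/'}

# Hypothetical record, for illustration only (not taken from the dataset).
SAMPLE_RECORD = """
<arXiv xmlns="http://arxiv.org/OAI/arXiv/">
  <id>0000.00000</id>
  <created>2020-05-17</created>
  <updated>2020-06-01</updated>
</arXiv>
"""


def publication_date(record_xml):
  """Returns (arxiv_id, date), with the date taken from `created` as YYYYMMDD."""
  root = ET.fromstring(record_xml.strip())
  arxiv_id = root.findtext('arxiv:id', namespaces=ARXIV_NS)
  created = root.findtext('arxiv:created', namespaces=ARXIV_NS)
  # `created` is assumed to be formatted as YYYY-MM-DD.
  return arxiv_id, created.replace('-', '')
```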

WMT

We downloaded the document-split versions of the English and German WMT News Crawl datasets. Because the dataset does not provide document IDs, we used the SHA256 hash of each article's Base64-encoded unsplit text as its ID, i.e.:

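import gzip
import hashlib


def read_wmt_documents(path='news-docs.2007.en.filtered.gz'):
  """Yields (docid, (date, sentence_split_text, unsplit_text)) for each article."""
  with gzip.open(path, 'rb') as gz_file:
    for line in gz_file:
      # Each line holds three tab-separated fields.
      date, sentence_split_text, unsplit_text = line.decode('utf-8').strip().split('\t')
      # The ID is the SHA256 hex digest of the still Base64-encoded unsplit text.
      docid = hashlib.sha256(unsplit_text.encode('utf-8')).hexdigest()
      yield docid, (date, sentence_split_text, unsplit_text)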

We trained models on the sentence-split article texts. Some articles appear multiple times in the dataset with different publication dates; in those cases we used the article's earliest publication date.
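
One way to apply the earliest-date rule is sketched below. It takes (docid, (date, sentence_split_text, unsplit_text)) tuples like those produced by the snippet above and keeps, for each ID, the entry with the smallest date. It assumes dates are YYYYMMDD strings, so that comparing them as text matches chronological order; `earliest_dates` is a hypothetical helper and not necessarily how the splits were actually built.

```python
def earliest_dates(documents):
  """Maps each docid to its article tuple with the earliest publication date.

  `documents` is an iterable of
  (docid, (date, sentence_split_text, unsplit_text)) tuples.
  Assumption: dates are YYYYMMDD strings, so string order is date order.
  """
  earliest = {}
  for docid, article in documents:
    date = article[0]
    # Keep the first occurrence, or replace it if this copy is older.
    if docid not in earliest or date < earliest[docid][0]:
      earliest[docid] = article
  return earliest
```

Because the ID is a hash of the article text, identical articles collapse to a single key regardless of how many times they were crawled.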

Splits used in experiments

| Experiments | Dataset | Splits |
| --- | --- | --- |
| Sections 3-5 | WMT | control: train, validation; time-stratified: train, validation; test |
| Sections 3-5 | arXiv | control: train, validation; time-stratified: train, validation; test |
| Appendix B: The effect of outdated models persists beyond the 2018/2019 test period | WMT | test period 2017/2018: control: train, validation; time-stratified: train, validation; test |
| Appendix B (continued) | WMT | test period 2016/2017: control: train, validation; time-stratified: train, validation; test |
| Appendix B (continued) | WMT | test period 2015/2016: control: train, validation; time-stratified: train, validation; test |
| Appendix B (continued) | WMT | test period 2014/2015: control: train, validation; time-stratified: train, validation; test |
| Appendix B (continued) | WMT | test period 2013/2014: control: train, validation; time-stratified: train, validation; test |
| Appendix C: The effect of outdated models persists beyond the two-year gap | WMT | test: same as the one for Sections 3-5; validation: same as the one for the time-stratified setup for Sections 3-5; train until: 2017-09-30, 2017-03-31, 2016-09-30, 2016-03-31, 2015-09-30, 2015-03-31, 2014-09-30, 2014-03-31, 2013-09-30, 2013-03-31, 2012-09-30 |