Note: this repository was archived by the owner on Mar 19, 2024 and is now read-only.

---
id: crawl-vectors
title: Word vectors for 157 languages
---

We distribute pre-trained word vectors for 157 languages, trained on Common Crawl and Wikipedia using fastText. These models were trained using CBOW with position-weights, in dimension 300, with character n-grams of length 5, a window of size 5, and 10 negative samples. We also distribute three new word analogy datasets, for French, Hindi and Polish.

Download directly with command line or from Python

To download from the command line or from Python code, you must first install the Python package as described here.

$ ./download_model.py en     # English
Downloading https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz
 (19.78%) [=========>                                         ]

Once the download is finished, use the model as usual:

$ ./fasttext nn cc.en.300.bin 10
Query word?

Or, from Python:

>>> import fasttext.util
>>> fasttext.util.download_model('en', if_exists='ignore')  # English
>>> ft = fasttext.load_model('cc.en.300.bin')

Adapt the dimension

The pre-trained word vectors we distribute have dimension 300. If you need a smaller size, you can use our dimension reducer. To use this feature, you must have installed the Python package as described here.

For example, to get vectors of dimension 100:

$ ./reduce_model.py cc.en.300.bin 100
Loading model
Reducing matrix dimensions
Saving model
cc.en.100.bin saved

Then you can use the cc.en.100.bin model file as usual. The same reduction can also be done in memory from Python:

>>> import fasttext
>>> import fasttext.util
>>> ft = fasttext.load_model('cc.en.300.bin')
>>> ft.get_dimension()
300
>>> fasttext.util.reduce_model(ft, 100)
>>> ft.get_dimension()
100

Then you can use the ft model object as usual:

>>> ft.get_word_vector('hello').shape
(100,)
>>> ft.get_nearest_neighbors('hello')
[(0.775576114654541, u'heyyyy'), (0.7686290144920349, u'hellow'), (0.7663413286209106, u'hello-'), (0.7579624056816101, u'heyyyyy'), (0.7495524287223816, u'hullo'), (0.7473770380020142, u'.hello'), (0.7407292127609253, u'Hiiiii'), (0.7402616739273071, u'hellooo'), (0.7399682402610779, u'hello.'), (0.7396857738494873, u'Heyyyyy')]

or save it for later use:

>>> ft.save_model('cc.en.100.bin')
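The reduction performed by fasttext.util.reduce_model is a PCA-style projection of the embedding matrix. The following NumPy sketch illustrates the idea only; it is not the library's exact implementation, and the matrix here is random toy data:

```python
import numpy as np

def reduce_dimensions(vectors, new_dim):
    """Project an (n_words, old_dim) embedding matrix down to new_dim
    dimensions via PCA: center, then keep the top principal components."""
    centered = vectors - vectors.mean(axis=0)
    # The right singular vectors of the centered matrix are the principal axes.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:new_dim].T

# Toy example: 1000 "words" with 300-dimensional vectors, reduced to 100.
embeddings = np.random.default_rng(0).normal(size=(1000, 300))
reduced = reduce_dimensions(embeddings, 100)
print(reduced.shape)  # (1000, 100)
```

For the distributed models, prefer fasttext.util.reduce_model, which also keeps the subword vectors consistent with the reduced word vectors.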

Format

The word vectors are available in both binary and text formats.

Using the binary models, vectors for out-of-vocabulary words can be obtained with

$ ./fasttext print-word-vectors wiki.it.300.bin < oov_words.txt

where the file oov_words.txt contains out-of-vocabulary words.
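Out-of-vocabulary vectors are possible because fastText represents each word as a bag of character n-grams and sums their vectors, so any string gets a vector. A minimal sketch of the n-gram extraction step (the '<' and '>' boundary symbols follow fastText's convention; these models use n-grams of length 5):

```python
def char_ngrams(word, n=5):
    """Extract character n-grams from a word, padding it with the
    boundary symbols '<' and '>' as fastText does."""
    padded = '<' + word + '>'
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams('hello'))  # ['<hell', 'hello', 'ello>']
```

In the actual model, each of these n-grams has its own learned vector, and the word vector is obtained by summing them, which is what print-word-vectors does for unseen words.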

In the text format, each line contains a word followed by its vector. Values are space-separated, and words are sorted by frequency in descending order. These text models can easily be loaded in Python using the following code:

import io

def load_vectors(fname):
    # The first line of the text format is "<n_words> <dim>".
    fin = io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore')
    n, d = map(int, fin.readline().split())
    data = {}
    for line in fin:
        tokens = line.rstrip().split(' ')
        # Store a list of floats; a bare map() would be a one-shot iterator in Python 3.
        data[tokens[0]] = [float(x) for x in tokens[1:]]
    return data
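As a quick check, vectors loaded this way can be compared with cosine similarity. The snippet below writes a tiny file in the same text format (a "<n_words> <dim>" header, then one word and its values per line) and queries it; the words and values are made up for illustration, and the loader is repeated here so the example is self-contained:

```python
import io
import math

def load_vectors(fname):
    # Text-format loader (as above), storing lists of floats.
    fin = io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore')
    n, d = map(int, fin.readline().split())
    data = {}
    for line in fin:
        tokens = line.rstrip().split(' ')
        data[tokens[0]] = [float(x) for x in tokens[1:]]
    return data

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Write a toy model in the text format.
with io.open('toy.vec', 'w', encoding='utf-8') as f:
    f.write('3 2\n')
    f.write('cat 1.0 0.0\n')
    f.write('dog 0.9 0.1\n')
    f.write('car 0.0 1.0\n')

vecs = load_vectors('toy.vec')
print(round(cosine(vecs['cat'], vecs['dog']), 3))  # ~0.994: similar directions
print(round(cosine(vecs['cat'], vecs['car']), 3))  # 0.0: orthogonal vectors
```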

Tokenization

We used the Stanford word segmenter for Chinese, MeCab for Japanese and UETsegmenter for Vietnamese. For languages using the Latin, Cyrillic, Hebrew or Greek scripts, we used the tokenizer from the Europarl preprocessing tools. For the remaining languages, we used the ICU tokenizer.

More information about the training of these models can be found in the article Learning Word Vectors for 157 Languages.

License

The word vectors are distributed under the Creative Commons Attribution-ShareAlike 3.0 License.

References

If you use these word vectors, please cite the following paper:

E. Grave*, P. Bojanowski*, P. Gupta, A. Joulin, T. Mikolov, Learning Word Vectors for 157 Languages

@inproceedings{grave2018learning,
  title={Learning Word Vectors for 157 Languages},
  author={Grave, Edouard and Bojanowski, Piotr and Gupta, Prakhar and Joulin, Armand and Mikolov, Tomas},
  booktitle={Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018)},
  year={2018}
}

Evaluation datasets

The analogy evaluation datasets described in the paper are available here: French, Hindi, Polish.

Models

The models can be downloaded from:

Afrikaans: bin, text Albanian: bin, text Alemannic: bin, text
Amharic: bin, text Arabic: bin, text Aragonese: bin, text
Armenian: bin, text Assamese: bin, text Asturian: bin, text
Azerbaijani: bin, text Bashkir: bin, text Basque: bin, text
Bavarian: bin, text Belarusian: bin, text Bengali: bin, text
Bihari: bin, text Bishnupriya Manipuri: bin, text Bosnian: bin, text
Breton: bin, text Bulgarian: bin, text Burmese: bin, text
Catalan: bin, text Cebuano: bin, text Central Bicolano: bin, text
Chechen: bin, text Chinese: bin, text Chuvash: bin, text
Corsican: bin, text Croatian: bin, text Czech: bin, text
Danish: bin, text Divehi: bin, text Dutch: bin, text
Eastern Punjabi: bin, text Egyptian Arabic: bin, text Emilian-Romagnol: bin, text
English: bin, text Erzya: bin, text Esperanto: bin, text
Estonian: bin, text Fiji Hindi: bin, text Finnish: bin, text
French: bin, text Galician: bin, text Georgian: bin, text
German: bin, text Goan Konkani: bin, text Greek: bin, text
Gujarati: bin, text Haitian: bin, text Hebrew: bin, text
Hill Mari: bin, text Hindi: bin, text Hungarian: bin, text
Icelandic: bin, text Ido: bin, text Ilokano: bin, text
Indonesian: bin, text Interlingua: bin, text Irish: bin, text
Italian: bin, text Japanese: bin, text Javanese: bin, text
Kannada: bin, text Kapampangan: bin, text Kazakh: bin, text
Khmer: bin, text Kirghiz: bin, text Korean: bin, text
Kurdish (Kurmanji): bin, text Kurdish (Sorani): bin, text Latin: bin, text
Latvian: bin, text Limburgish: bin, text Lithuanian: bin, text
Lombard: bin, text Low Saxon: bin, text Luxembourgish: bin, text
Macedonian: bin, text Maithili: bin, text Malagasy: bin, text
Malay: bin, text Malayalam: bin, text Maltese: bin, text
Manx: bin, text Marathi: bin, text Mazandarani: bin, text
Meadow Mari: bin, text Minangkabau: bin, text Mingrelian: bin, text
Mirandese: bin, text Mongolian: bin, text Nahuatl: bin, text
Neapolitan: bin, text Nepali: bin, text Newar: bin, text
North Frisian: bin, text Northern Sotho: bin, text Norwegian (Bokmål): bin, text
Norwegian (Nynorsk): bin, text Occitan: bin, text Oriya: bin, text
Ossetian: bin, text Palatinate German: bin, text Pashto: bin, text
Persian: bin, text Piedmontese: bin, text Polish: bin, text
Portuguese: bin, text Quechua: bin, text Romanian: bin, text
Romansh: bin, text Russian: bin, text Sakha: bin, text
Sanskrit: bin, text Sardinian: bin, text Scots: bin, text
Scottish Gaelic: bin, text Serbian: bin, text Serbo-Croatian: bin, text
Sicilian: bin, text Sindhi: bin, text Sinhalese: bin, text
Slovak: bin, text Slovenian: bin, text Somali: bin, text
Southern Azerbaijani: bin, text Spanish: bin, text Sundanese: bin, text
Swahili: bin, text Swedish: bin, text Tagalog: bin, text
Tajik: bin, text Tamil: bin, text Tatar: bin, text
Telugu: bin, text Thai: bin, text Tibetan: bin, text
Turkish: bin, text Turkmen: bin, text Ukrainian: bin, text
Upper Sorbian: bin, text Urdu: bin, text Uyghur: bin, text
Uzbek: bin, text Venetian: bin, text Vietnamese: bin, text
Volapük: bin, text Walloon: bin, text Waray: bin, text
Welsh: bin, text West Flemish: bin, text West Frisian: bin, text
Western Punjabi: bin, text Yiddish: bin, text Yoruba: bin, text
Zazaki: bin, text Zeelandic: bin, text