You can find all the models on Hugging Face. This repo contains config files for:
- bg_news_sm
- bg_news_lg
- bg_news_trf
All Bulgarian pipelines contain a tokenizer, trainable lemmatizer, POS tagger, dependency parser, morphologizer and NER components. The dataset for the NER pipeline is currently shared only privately (I'm still figuring out how to share it publicly without copyright issues), so if you need it, just email me.
You can install all the models directly from Hugging Face with pip:
Small Bulgarian model:
pip install https://huggingface.co/sakelariev/bg_news_sm/resolve/main/bg_news_sm-3.5.4-py3-none-any.whl
Large Bulgarian model:
pip install https://huggingface.co/sakelariev/bg_news_lg/resolve/main/bg_news_lg-3.5.4-py3-none-any.whl
Transformer-based Bulgarian model:
pip install https://huggingface.co/sakelariev/bg_news_trf/resolve/main/bg_news_trf-3.5.4-py3-none-any.whl
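The three install URLs above share one pattern, so a small script can generate the pip commands. This is only a sketch: `wheel_url` is a hypothetical helper (not part of this repo or of spaCy), and it assumes every model keeps the same user name, wheel naming scheme and version (3.5.4) shown above.

```python
def wheel_url(model, version="3.5.4", user="sakelariev"):
    """Build the Hugging Face wheel URL for a model, following the URLs listed above."""
    wheel = f"{model}-{version}-py3-none-any.whl"
    return f"https://huggingface.co/{user}/{model}/resolve/main/{wheel}"


# Print one pip command per model in this repo.
for name in ("bg_news_sm", "bg_news_lg", "bg_news_trf"):
    print(f"pip install {wheel_url(name)}")
```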
After installing a model via pip, you can use it directly by loading it into spaCy:
import spacy
nlp = spacy.load("bg_news_sm")
doc = nlp("През 1843 г. Петко Славейков става учител в Търново.")
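Once a text is processed, the annotations from the components listed earlier (lemmatizer, POS tagger, morphologizer, parser, NER) can be read off the `Doc`. The sketch below uses standard spaCy attributes (`lemma_`, `pos_`, `morph`, `dep_`, `head`, `ents`, `label_`); the helper names `token_rows`, `entity_rows` and `annotate` are made up for illustration, and `annotate` assumes spaCy and the `bg_news_sm` wheel are already installed.

```python
def token_rows(doc):
    """Return (text, lemma, POS, morphology, dependency label, head text) for each token."""
    return [
        (t.text, t.lemma_, t.pos_, str(t.morph), t.dep_, t.head.text)
        for t in doc
    ]


def entity_rows(doc):
    """Return (entity text, entity label) for each named entity found by the NER component."""
    return [(ent.text, ent.label_) for ent in doc.ents]


def annotate(text, model_name="bg_news_sm"):
    """Load a model, process the text, and return token and entity rows."""
    import spacy  # local import: requires spaCy and the model wheel to be installed

    nlp = spacy.load(model_name)
    doc = nlp(text)
    return token_rows(doc), entity_rows(doc)
```

For example, `tokens, entities = annotate("През 1843 г. Петко Славейков става учител в Търново.")` returns one row per token and one row per recognized entity; the exact labels depend on the model you load.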
Coming soon
CC-BY-NC-SA-3.0
Sadly, the Bulgarian Treebank, which was used to train the main pipeline components (everything except the NER), was released under this non-commercial/share-alike license, which prevents me from releasing the models under any other license terms (my intention was to release them under the MIT license). As a result, none of the models can be used for commercial purposes.
I'm planning to release more Bulgarian models for spaCy, so I consider this repo an ongoing project. Some of the models I have already started working on:
- bg_web_sm, bg_web_lg, bg_web_trf: improved NER models (right now the models are trained only on a news text corpus, with only 3 labels). These models are going to be trained on a mixed corpus of web data: news, legal, fiction and conversation data. I'm also planning to add more labels: PERSON, GPE, LOC, ORG, LANGUAGE, NAT_REL_POL, DATETIME, PERIOD, QUANTITY, MONEY, ORDINAL, FACILITY, WORK_OF_ART, EVENT
- 2.0 versions of bg_news_sm, bg_news_lg and bg_news_trf, trained on new data and with new labels: PERSON, GPE, LOC, ORG, LANGUAGE, NAT_REL_POL, DATETIME, PERIOD, QUANTITY, MONEY, ORDINAL, FACILITY, WORK_OF_ART, EVENT
- a coreference resolution spaCy model for Bulgarian