Bulgarian spaCy core models (BGspaCy)

You can find all the models on Hugging Face. This repo contains the config files for:

bg_news_sm
bg_news_lg
bg_news_trf

All Bulgarian pipelines contain a tokenizer, trainable lemmatizer, POS tagger, dependency parser, morphologizer and NER component. The dataset for the NER pipeline is currently shared only privately (I'm still figuring out how to share it publicly without any copyright issues), so if you need it, just email me.

Installation

You can download all the models from Hugging Face:

Small Bulgarian model:

pip install https://huggingface.co/sakelariev/bg_news_sm/resolve/main/bg_news_sm-3.5.4-py3-none-any.whl

Large Bulgarian model:

pip install https://huggingface.co/sakelariev/bg_news_lg/resolve/main/bg_news_lg-3.5.4-py3-none-any.whl

Transformer-based Bulgarian model:

pip install https://huggingface.co/sakelariev/bg_news_trf/resolve/main/bg_news_trf-3.5.4-py3-none-any.whl
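The three commands above follow one URL pattern, so if you install the models from a script, a small helper can build the command for you. This is only a sketch: `wheel_url` and `MODELS` are hypothetical names that are not part of this repo, and 3.5.4 is the only version the commands above confirm, so wheels for other versions may not exist.

```python
# Hypothetical helper: rebuilds the wheel URLs used in the pip commands above.
BASE_URL = "https://huggingface.co/sakelariev"
MODELS = ("bg_news_sm", "bg_news_lg", "bg_news_trf")


def wheel_url(model: str, version: str = "3.5.4") -> str:
    """Return the wheel URL for one of the Bulgarian models.

    Assumption: every model follows the same naming pattern as the
    three install commands in this README.
    """
    if model not in MODELS:
        raise ValueError(f"unknown model {model!r}, expected one of {MODELS}")
    return f"{BASE_URL}/{model}/resolve/main/{model}-{version}-py3-none-any.whl"


if __name__ == "__main__":
    # Print one ready-to-run pip command per model.
    for name in MODELS:
        print(f"pip install {wheel_url(name)}")
```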

Usage

After installing a model via pip, you can use it directly by loading it into spaCy:

import spacy
nlp = spacy.load("bg_news_sm")

doc = nlp("През 1843 г. Петко Славейков става учител в Търново.")
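Once you have a `doc`, the annotations from the components listed at the top (lemmatizer, POS tagger, parser, morphologizer, NER) are available through the standard spaCy token and entity attributes. The sketch below shows one way to read them. `token_rows`, `entity_rows` and `main` are hypothetical helper names, and the script assumes `bg_news_sm` is installed; if it is not, it prints a hint instead of failing.

```python
SENTENCE = "През 1843 г. Петко Славейков става учител в Търново."


def token_rows(doc):
    """Return (text, lemma, POS, dependency, morphology) for each token."""
    return [(t.text, t.lemma_, t.pos_, t.dep_, str(t.morph)) for t in doc]


def entity_rows(doc):
    """Return (text, label) for each entity found by the NER component."""
    return [(ent.text, ent.label_) for ent in doc.ents]


def main():
    try:
        import spacy

        nlp = spacy.load("bg_news_sm")
    except (ImportError, OSError):
        # ImportError: spaCy is missing. OSError: the model package is missing.
        print("Install spaCy and bg_news_sm first (see Installation).")
        return

    doc = nlp(SENTENCE)
    print(nlp.pipe_names)  # names of the components in the loaded pipeline
    for row in token_rows(doc):
        print(*row, sep="\t")
    for text, label in entity_rows(doc):
        print(text, label)


if __name__ == "__main__":
    main()
```

The same code should work with `bg_news_lg` or `bg_news_trf` by changing the name passed to `spacy.load`.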

Tutorials

Coming soon

License

CC-BY-NC-SA-3.0

Sadly, the Bulgarian Treebank, which was used to train the main pipeline components (pretty much everything except the NER), was released under this non-commercial/share-alike license, which prevents me from releasing the models under any other license terms (my intention was to release them under the MIT license). As a result, none of the models can be used for commercial purposes.

Future work

I'm planning to release more Bulgarian models for spaCy, so I consider this repo an ongoing project. Some of the models I have already started working on:

bg_web_sm, bg_web_lg, bg_web_trf – improved NER models (right now the models are trained only on a news text corpus and for only 3 labels). These models are going to be trained on a mixed corpus of web data: news, legal, fiction and conversation data. I'm also planning to add more labels: PERSON, GPE, LOC, ORG, LANGUAGE, NAT_REL_POL, DATETIME, PERIOD, QUANTITY, MONEY, ORDINAL, FACILITY, WORK_OF_ART, EVENT
bg_news_sm, bg_news_lg, bg_news_trf 2.0 versions, trained on new data and with new labels: PERSON, GPE, LOC, ORG, LANGUAGE, NAT_REL_POL, DATETIME, PERIOD, QUANTITY, MONEY, ORDINAL, FACILITY, WORK_OF_ART, EVENT
coreference resolution spaCy model for Bulgarian
