In T338292: Add sentence segmenter feature, we added a sentence segmentation system to MinT. It works as follows:
- Use a global list of sentence terminator characters (sourced from Unicode) to find candidate sentence boundaries.
- Make sure those boundaries do not fall right after an abbreviation. This requires an abbreviation detection system, which is language specific.
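The two steps above can be sketched roughly as follows. This is a minimal illustration, not the MinT or sentencex implementation: the `TERMINATORS` string is a small hand-picked subset rather than the full Unicode-derived list, and the `ABBREVIATIONS` table and the `segment` function are hypothetical names made up for this example.

```python
import re

# Hypothetical, heavily truncated terminator set; the real list is sourced from Unicode.
TERMINATORS = ".!?։۔।。"

# Hypothetical per-language abbreviation lists (lowercase, without the trailing dot).
ABBREVIATIONS = {
    "en": {"dr", "mr", "mrs", "prof", "etc", "vs"},
    "de": {"z.b", "bzw", "usw", "nr", "dr"},
}

# A boundary candidate: a run of terminators followed by whitespace or end of text.
BOUNDARY = re.compile("[" + re.escape(TERMINATORS) + r"]+(?=\s|$)")


def segment(language, text):
    """Split text into sentences, skipping boundaries that follow an abbreviation."""
    abbreviations = ABBREVIATIONS.get(language, set())
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        # Last whitespace-delimited token before the terminator.
        tokens = text[start:match.start()].split()
        last_token = tokens[-1].lower() if tokens else ""
        # A single dot after a known abbreviation is not a sentence boundary.
        if match.group() == "." and last_token in abbreviations:
            continue
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    # Keep any trailing text that has no terminator.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(segment("en", "Dr. Smith arrived late. He apologised!"))
print(segment("de", "Obst, z.B. Äpfel, ist gesund."))
```

Running the German sentence with `"en"` instead of `"de"` splits it after `z.B.`, which is why the abbreviation list has to be maintained per language.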
The current code base has abbreviation detection logic for en and ml only. It needs to be expanded to more languages, at least to the top 10 source languages we see in Content Translation.
Finding the most commonly used abbreviations in a language is not difficult. For example, see https://en.wikipedia.org/wiki/List_of_German_abbreviations
The wiki-nlp-tools library also has a collection of abbreviations.
Result
Detailed blog post: sentencex: Empowering NLP with Multilingual Sentence Extraction
sentencex Python library
- Source code: GitHub repository. Please refer to the documentation for usage examples.
- Python package: sentencex
- Demo: https://wikimedia.github.io/sentencex/
sentencex JavaScript library
- Source code: GitHub repository. Please refer to the documentation for usage examples.
- NPM package: sentencex
- Demo: https://wikimedia.github.io/sentencex-js