Maniphest T343781

Expand sentence segmentation system
Closed, Resolved · Public
Authored By

	santhosh
	Aug 8 2023, 7:58 AM

Referenced Files

None

Description

In T338292 (Add sentence segmenter feature), we added a sentence segmentation system to MinT. It works as follows:

Use a global list of sentence terminator characters (sourced from Unicode) to find candidate sentence boundaries.
Make sure those boundaries do not end with abbreviations. This requires an abbreviation detection system, which is language-specific.
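
The two steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the MinT implementation: the terminator string is a tiny subset of what Unicode defines, the abbreviation set is a hypothetical sample, and the names `TERMINATORS`, `EN_ABBREVIATIONS` and `split_sentences` are made up for this example. The sketch also assumes a terminator is followed by whitespace or the end of the text.

```python
import re

# Assumption: a small illustrative subset of sentence terminators
# (full stop, exclamation mark, question mark, Devanagari danda).
TERMINATORS = ".!?\u0964"

# Assumption: a hypothetical sample of English abbreviations, lower-cased
# and stored without the trailing full stop.
EN_ABBREVIATIONS = frozenset({"dr", "mr", "mrs", "prof", "etc"})

# A candidate boundary is a run of terminators followed by whitespace
# or by the end of the text.
BOUNDARY = re.compile("[" + re.escape(TERMINATORS) + r"]+(?=\s|$)")


def split_sentences(text, abbreviations=frozenset()):
    """Split text at terminators, skipping boundaries that follow an abbreviation."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        # Step 2: look at the word just before the terminator.
        words = text[start:match.start()].split()
        last_word = words[-1].lower() if words else ""
        if match.group() == "." and last_word in abbreviations:
            continue  # the full stop belongs to an abbreviation
        sentence = text[start:match.end()].strip()
        if sentence:
            sentences.append(sentence)
        start = match.end()
    # Keep any trailing text that has no terminator.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(split_sentences("Dr. Smith arrived. He was late!", EN_ABBREVIATIONS))
```

Without the abbreviation set, the same input is split after "Dr.", which is why the second step has to be language-aware.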

The current code base has abbreviation detection logic for en and ml only. It needs to be expanded to more languages, at least to the top 10 source languages we see in Content Translation.

Finding the most commonly used abbreviations in a language is not difficult. For example, see https://en.wikipedia.org/wiki/List_of_German_abbreviations
The wiki-nlp-tools library also has a collection.
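
One way to organize such per-language lists is a lookup keyed by language code. The sketch below is an assumption about how this could look, not code from MinT or wiki-nlp-tools; the `ABBREVIATIONS` table and `is_abbreviation` helper are hypothetical, and the entries are a handful of well-known English and German abbreviations chosen for illustration.

```python
# Assumption: abbreviations are stored lower-cased, without the trailing full stop.
ABBREVIATIONS = {
    "en": frozenset({"dr", "mr", "mrs", "prof", "etc"}),
    "de": frozenset({"bzw", "ca", "dr", "usw", "z.b"}),
}


def is_abbreviation(language, token):
    """Return True if token is a known abbreviation for the given language code."""
    normalized = token.rstrip(".").lower()
    # Languages without a list simply have no known abbreviations.
    return normalized in ABBREVIATIONS.get(language, frozenset())


print(is_abbreviation("de", "usw."))  # German "und so weiter"
print(is_abbreviation("en", "usw."))  # not an English abbreviation
```

Adding a language then only means adding one more entry to the table, which keeps the boundary-finding step language-independent.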

Result

Detailed blog post: sentencex: Empowering NLP with Multilingual Sentence Extraction

sentencex Python library

Source code: GitHub repository. Please refer to the documentation for usage examples.
Python package: sentencex
Demo: https://wikimedia.github.io/sentencex/

sentencex JS library

Source code: GitHub repository. Please refer to the documentation for usage examples.
NPM package: sentencex
Demo: https://wikimedia.github.io/sentencex-js

Related Objects

Mentioned In: T347389: Integrate improved sentence segmentation algorithm in CXServer
Mentioned Here: T338292: Add sentence segmenter feature

Event Timeline

santhosh created this task. Aug 8 2023, 7:58 AM

Restricted Application added a subscriber: Aklapper. Aug 8 2023, 7:58 AM

Pginer-WMF triaged this task as Medium priority. Sep 7 2023, 9:39 AM

Pginer-WMF added a project: Language-Team (Language-2023-July-September).

Pginer-WMF moved this task from Backlog to Infrastructure on the MinT board. Sep 25 2023, 11:27 AM

Pginer-WMF mentioned this in T347389: Integrate improved sentence segmentation algorithm in CXServer. Sep 26 2023, 11:08 AM

Pginer-WMF edited projects, added Language-Team (Language-2023-October-December); removed Language-Team (Language-2023-July-September). Sep 29 2023, 1:46 PM

We now have a library for this, in both JS and Python.

Detailed blog post:

https://phabricator.wikimedia.org/phame/post/view/308/sentencex_empowering_nlp_with_multilingual_sentence_extraction/

Pginer-WMF closed this task as Resolved. Oct 11 2023, 7:42 AM

Pginer-WMF updated the task description. Oct 13 2023, 9:50 AM

Pginer-WMF updated the task description.

Pginer-WMF updated the task description.