"Jieba" Chinese word segmentation: built to be the best Python Chinese word segmentation component.

Four word segmentation modes are supported:
- Precise mode tries to cut the sentence as accurately as possible and is suitable for text analysis.
- Full mode scans out every word that can be formed from the sentence. It is very fast, but it cannot resolve ambiguity.
- Search engine mode builds on precise mode by splitting long words again to improve recall, which suits word segmentation for search engines.
- Paddle mode uses the PaddlePaddle deep learning framework to train a sequence labeling network (bidirectional GRU) for word segmentation. It also supports part-of-speech tagging. To use paddle mode, install paddlepaddle-tiny: pip install paddlepaddle-tiny==1.6.1. Paddle mode requires jieba v0.40 or above; for earlier versions, upgrade jieba first: pip install jieba --upgrade.
Features
- Although jieba can recognize new words on its own, adding them yourself ensures higher accuracy
- Developers can specify a custom dictionary to include words that are not in jieba's built-in dictionary
- Dictionaries can be modified dynamically in the program
- Keyword extraction based on the TextRank algorithm
- The Inverse Document Frequency (IDF) corpus used for keyword extraction can be switched to a custom corpus by specifying its path
- Dynamic programming is used to find the maximum-probability path, that is, the most likely segmentation based on word frequency
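To make the maximum-probability path idea more concrete, here is a minimal, self-contained sketch of the technique. It is not jieba's actual code: the FREQ table, its counts, and the helper names build_dag and best_path are hypothetical and exist only for this illustration. The sketch lists every dictionary word that can start at each position, then works from the end of the sentence backwards to pick the split with the highest total log-probability.

```python
import math

# Hypothetical toy word counts; a real dictionary is far larger.
FREQ = {"研究": 5, "研究生": 3, "生命": 4, "命": 1, "生": 1, "的": 10, "起源": 3}
TOTAL = sum(FREQ.values())


def build_dag(sentence):
    """Map each start index to the end indices of dictionary words starting there."""
    n = len(sentence)
    dag = {}
    for i in range(n):
        ends = [j for j in range(i, n) if sentence[i:j + 1] in FREQ]
        # Fall back to a single character when no dictionary word starts here.
        dag[i] = ends or [i]
    return dag


def best_path(sentence):
    """Return the segmentation with the highest total log-probability."""
    n = len(sentence)
    dag = build_dag(sentence)
    log_total = math.log(TOTAL)
    # route[i] = (best log-probability of sentence[i:], end index of its first word)
    route = {n: (0.0, 0)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            # Unknown single characters are treated as having count 1.
            (math.log(FREQ.get(sentence[i:j + 1], 1)) - log_total + route[j + 1][0], j)
            for j in dag[i]
        )
    # Follow the stored end indices from left to right to recover the words.
    words, i = [], 0
    while i < n:
        j = route[i][1]
        words.append(sentence[i:j + 1])
        i = j + 1
    return words


print(best_path("研究生命的起源"))  # ['研究', '生命', '的', '起源']
```

With these toy counts, the split 研究 / 生命 scores higher than 研究生 / 命, so the first is chosen; changing the counts changes the result, which is why adding or adjusting dictionary entries affects segmentation.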