Most popular methods for learning word representations consider only the external context of words and ignore their internal structure. These techniques are limited for learning effective representations of Chinese words, whose internal structure can also be semantically important. This project generates and compares Chinese word representations that exploit different subword and subcharacter components, including characters, graphical components, and Wubi codes.
- Training corpus: Chinese Wikipedia dump
- Evaluation data: Chinese word similarity tasks `wordsim-240` and `wordsim-296`
- Dictionaries: `subchar_dict/graphical_dict.p` and `subchar_dict/wubi_dict.p` contain dictionaries that map Chinese characters to graphical components and Wubi codes, respectively. These two files can be used to convert characters into subcharacter components or to train joint word embeddings (JWE).
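The pickled dictionaries can be inspected directly; below is a minimal sketch, assuming each entry maps a single character to a component string (the exact key/value format and the example value are assumptions, not verified repo behavior):

```python
import pickle

# Load one of the character-to-subcharacter mappings shipped with the repo.
with open("subchar_dict/wubi_dict.p", "rb") as f:
    wubi_dict = pickle.load(f)

# Assumed format: each Chinese character maps to its Wubi code string,
# e.g. wubi_dict["木"] -> "ssss" (value is illustrative only).
print(wubi_dict.get("木"))
```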
Run `pip install -r requirements.txt` to install the dependencies.
After downloading the latest Chinese Wikipedia dump from the link above, use `preprocess_wiki.py` to:
- Convert traditional Chinese to simplified Chinese
- Remove non-Chinese characters, including punctuation and spaces
- Split sentences into words separated by spaces
- Convert the XML dump into a txt file
Parameters:
```
input    # Wiki dump XML file path
output   # Preprocessed txt file path
```
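For reference, the cleaning steps could look roughly like the sketch below, assuming `opencc` for traditional-to-simplified conversion and `jieba` for word segmentation; the actual tools used by `preprocess_wiki.py` may differ, and the XML-to-txt extraction (e.g. with gensim's `WikiCorpus`) is omitted here:

```python
import re
import jieba               # word segmentation (assumed tool)
from opencc import OpenCC  # traditional -> simplified conversion (assumed tool)

cc = OpenCC("t2s")

def preprocess_line(line):
    """Clean one line of extracted Wikipedia text."""
    simplified = cc.convert(line)
    # Keep only CJK Unified Ideographs; this also removes
    # punctuation, digits, Latin letters, and spaces.
    chinese_only = re.sub(r"[^\u4e00-\u9fff]", "", simplified)
    # Split the sentence into words separated by single spaces.
    return " ".join(jieba.cut(chinese_only))
```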
Use `convert_subchar.py` to convert tokenized text into subcharacter components (graphical components or Wubi codes) while keeping the delimiters between words.
Parameters:
```
input     # Tokenized txt file path
output    # Subcharacter output txt file path
subchar   # Subcharacter type {radical, wubi}
```
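A sketch of the conversion logic, assuming the pickled dictionaries map single characters to component strings; the fallback for characters without an entry is an assumption, not necessarily what the script does:

```python
import pickle

# Map each character of every word to its subcharacter components,
# preserving the spaces that delimit words.
with open("subchar_dict/wubi_dict.p", "rb") as f:
    subchar_dict = pickle.load(f)

def convert_line(line):
    converted_words = []
    for word in line.strip().split(" "):
        # Concatenate the components of each character in the word;
        # fall back to the character itself if it has no entry.
        converted_words.append("".join(subchar_dict.get(ch, ch) for ch in word))
    return " ".join(converted_words)
```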
Use `python train.py` to train continuous bag-of-words (CBOW) or skipgram models using fastText.
Parameters:
```
input        # Training txt file path
model_path   # Trained model path
model        # Model type {cbow, skipgram}
dim          # Size of the word vectors
ws           # Size of the context window
epoch        # Number of epochs
minn         # Minimum length of subword ngram
maxn         # Maximum length of subword ngram
```
Note: Depending on whether the training corpus contains characters, graphical components, or Wubi codes, word length may vary significantly, and different `minn` and `maxn` values should be chosen accordingly. When `maxn = 0`, no subword information will be considered.
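The same training setup can be sketched with the official `fasttext` Python package; the file names and hyperparameter values below are illustrative, not the repo's defaults:

```python
import fasttext

# Train a skipgram model on a subcharacter corpus (hypothetical path).
# With graphical components or Wubi codes, word length changes, so the
# n-gram range may need adjusting (see the note above).
model = fasttext.train_unsupervised(
    "wiki_wubi.txt",
    model="skipgram",
    dim=300,
    ws=5,
    epoch=5,
    minn=1,
    maxn=4,
)
model.save_model("skipgram_wubi.bin")
```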
For jointly learning word vectors, character vectors, and subcharacter vectors, see JWE.
Use `eval.py` to evaluate the trained CBOW and skipgram models on word similarity tasks. The script first computes the cosine similarity between each pair of words, then computes the Spearman correlation coefficient between the cosine similarities and the human-labeled scores as the final evaluation metric for model performance.
Parameters:
```
input        # Evaluation data file path
model_path   # Trained model path
output       # Predicted score output txt file path
subword      # Subword type {character, graphical, wubi}
```
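The metric itself is straightforward to reproduce with `numpy` and `scipy`; a minimal sketch, assuming the word pairs and human scores have already been parsed from the evaluation file and a fastText model was saved by `train.py` (paths are hypothetical):

```python
import numpy as np
import fasttext
from scipy.stats import spearmanr

model = fasttext.load_model("skipgram_wubi.bin")  # hypothetical path

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def evaluate(pairs):
    """pairs: iterable of (word1, word2, human_score) tuples,
    e.g. parsed from wordsim-240."""
    predicted, gold = [], []
    for w1, w2, score in pairs:
        predicted.append(cosine(model.get_word_vector(w1),
                                model.get_word_vector(w2)))
        gold.append(score)
    # Spearman correlation between model similarities and human scores.
    return spearmanr(predicted, gold).correlation
```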