Could I use this model to train on Chinese text? #46
There is something like this.
I don't know if the model would work fine with a language other than English. That being said, how do you input your text in Chinese? If you load it from disk by yourself, make sure that:
file_name = 'chinese_text.txt'
with open(file_name, 'r', encoding='utf8') as f:
    data = f.read()
@woctezuma Thanks for your answer. The details are in this [Colaboratory notebook](https://colab.research.google.com/drive/1VLG8e7YSEwypxU-noRNhsv5dW4NfTGce). I can only upload my text, and I can't find anywhere to change the encoding. I used the following code in the Colab notebook, but it doesn't seem to help. I don't know what I should do.
The simplest solution to your issue would be to convert your text file to UTF-8. Indeed, based on this StackOverflow answer, it seems that the two encodings use different byte sequences, so the same bytes are interpreted as different characters.
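A minimal conversion sketch in Python, assuming the source file is in a legacy Chinese encoding (GBK is used here only as an example default; substitute whatever encoding the corpus was actually saved in):

```python
def reencode_to_utf8(src_path, dst_path, src_encoding='gbk'):
    """Re-save a text file as UTF-8.

    'gbk' is only an assumed default source encoding; pass the
    encoding the file was actually written in.
    """
    # Decode the bytes using the original encoding...
    with open(src_path, 'r', encoding=src_encoding) as f:
        text = f.read()
    # ...and write them back out as UTF-8.
    with open(dst_path, 'w', encoding='utf8') as f:
        f.write(text)
```

After converting, the loading snippet above (`open(file_name, 'r', encoding='utf8')`) should read the corpus without mangling characters.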
Thanks so much. It seems to work!
I can't guarantee the quality of fine-tuning in non-English languages, especially CJK.
I tried it; the quality seems OK, but the generated text is somewhat incoherent.
Could you share your code for training on Chinese? I ran into some problems when training on Chinese text.
I didn't change the code; I just changed the encoding of my corpus.
Thanks for your reply. Did you change the encoding to UTF-8?
Yes!
When I try to fine-tune with a Chinese corpus, an exception is raised:
I'm sorry about the problem you've encountered. I didn't run into it myself, so I don't know what caused it.
I fed a Chinese text into this model and trained it, but the generated output is garbled. I think it may be an encoding error. How could I fix the problem?
Thanks