Could I use this model to train a Chinese text? #46

Open
LudoArt opened this issue May 12, 2019 · 14 comments

LudoArt commented May 12, 2019

I fed a Chinese text into this model and trained it, but the generated output is garbled. I think it may be an encoding error. How can I fix this problem?
Thx

LudoArt (Author) commented May 12, 2019

The output looks something like this:
ǫϡڻڣѡңĪ̫ӵڸڼסڸݶ򣬼ڸԪ̺˴ݳֲֲ […]

[several more paragraphs of the same mojibake trimmed]
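For reference, this kind of mojibake is what appears when bytes written in one encoding are decoded as another. A tiny sketch that reproduces the effect (the encodings here are chosen only for illustration, not a diagnosis of this exact output):

# Encode Chinese text as gb18030 bytes, then decode them as latin-1:
# every byte maps to some unrelated character, producing gibberish.
raw = '你好'.encode('gb18030')   # b'\xc4\xe3\xba\xc3'
print(raw.decode('latin-1'))     # 'Äãºã' -- mojibake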

woctezuma (Contributor) commented May 12, 2019

I don't know whether the model works well with a language other than English.

That being said, how do you load your Chinese text?

If you load it from disk yourself, make sure that:

  1. the file is encoded in UTF-8,
  2. you load it like this:

file_name = 'chinese_text.txt'

# Pass the encoding explicitly; the platform default may not be UTF-8.
with open(file_name, 'r', encoding='utf8') as f:
    data = f.read()

LudoArt (Author) commented May 12, 2019

> I don't know whether the model works well with a language other than English. That being said, how do you load your Chinese text? If you load it from disk yourself, make sure the file is encoded in UTF-8 and you open it with encoding='utf8'.

@woctezuma Thx for your answer.

The details are in this [Colaboratory notebook](https://colab.research.google.com/drive/1VLG8e7YSEwypxU-noRNhsv5dW4NfTGce); I can only upload my text and cannot find anywhere to change the encoding.
The file is not encoded in UTF-8; some of its characters would not display correctly that way, so I need to use "gb18030".

I used the following code in the Colab notebook, but it does not seem to help:
!export PYTHONIOENCODING=gb18030

I do not know what I should do.
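As an aside, !export in a Colab cell runs in a throwaway subshell, so the variable never reaches the notebook kernel; and PYTHONIOENCODING only affects Python's standard streams, not the encoding open() uses for files. A minimal sketch of setting an environment variable that actually persists, in case it is useful:

import os

# Unlike `!export`, which runs in a separate shell that exits immediately,
# this persists for the notebook process and any subprocesses it launches.
# PYTHONIOENCODING is read at interpreter startup and only affects
# stdin/stdout/stderr, so it still would not change how open() reads files.
os.environ['PYTHONIOENCODING'] = 'gb18030'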

woctezuma (Contributor) commented May 12, 2019

The simplest solution to your issue would be to convert your text file from gb18030 to UTF-8 on your desktop computer and then upload the UTF-8 version. Notepad++ or Visual Studio Code can convert between encodings.

Indeed, based on this StackOverflow answer, the two encodings use different byte sequences for the same characters, so the bytes get interpreted as different characters:

> That is, all Unicode characters can be encoded in GB18030, but they will be encoded with different byte sequences than would be generated with UTF-8 or UTF-16. Handling the GB18030 encoding doesn't require any more special techniques than are required for any other non-Unicode encoding.
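If you would rather stay inside Colab, the same conversion takes only a few lines of Python. A minimal sketch (the input file name is a placeholder for whatever was uploaded):

# Read the corpus with its real encoding, then write it back out as UTF-8.
with open('chinese_text_gb18030.txt', 'r', encoding='gb18030') as f:
    data = f.read()

with open('chinese_text.txt', 'w', encoding='utf8') as f:
    f.write(data)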

LudoArt (Author) commented May 12, 2019

> The simplest solution to your issue would be to convert your text file from gb18030 to UTF-8 on your desktop computer and then upload the UTF-8 version.

Thx so much.

It seems to work!

minimaxir (Owner) commented

I can't guarantee the quality of fine-tuning in non-English languages, especially CJK.

LudoArt (Author) commented May 16, 2019

> I can't guarantee the quality of fine-tuning in non-English languages, especially CJK.

I tried it; the output quality seems OK, but the meaning across the generated text seems somewhat incoherent.

dickkky commented May 20, 2019

Could you share your code for training on Chinese? I ran into some problems when training on Chinese text.

LudoArt (Author) commented May 21, 2019

> Could you share your code for training on Chinese? I ran into some problems when training on Chinese text.

I didn't change the code; I just changed the encoding of my corpus.
If you have any problems, tell me and I will try my best to help you.
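For completeness, a rough sketch of how fine-tuning is typically invoked with gpt-2-simple, pointed at the converted corpus (not LudoArt's exact code; the file name and step count are illustrative):

import gpt_2_simple as gpt2

gpt2.download_gpt2(model_name='117M')

sess = gpt2.start_tf_sess()
# 'chinese_text.txt' is the UTF-8 corpus from earlier in the thread;
# 1000 steps is illustrative, not a recommendation.
gpt2.finetune(sess, 'chinese_text.txt', model_name='117M', steps=1000)
gpt2.generate(sess)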

dickkky commented May 21, 2019

Thx for your reply. Did you change the encoding to UTF-8?

LudoArt (Author) commented May 21, 2019

> Thx for your reply. Did you change the encoding to UTF-8?

Yes!

Berrywrq commented

> Thx for your reply. Did you change the encoding to UTF-8?
>
> Yes!

When I try to fine-tune on a Chinese corpus, an exception is raised:

File "D:\VSCodeProjects\1907\gpt-2-simple-master\gpt_2_simple\src\encoder.py", line 73, in bpe
    j = word.index(first, i)
ValueError: tuple.index(x): x not in tuple

Can you give me some suggestions? Thx!

Berrywrq commented

> I can't guarantee the quality of fine-tuning in non-English languages, especially CJK.
>
> I tried it; the output quality seems OK, but the meaning across the generated text seems somewhat incoherent.
Since encoder.json only defines key-value pairs for English text units, how can the model generate Chinese characters? And how can the model map these Chinese characters to indices with just an English encoder.json when training on a Chinese corpus? That is the part I can't understand; please give me some tips, thanks. @minimaxir @LudoArt
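For background (not stated in this thread, but true of GPT-2's tokenizer): the encoder is a byte-level BPE, so any UTF-8 text can be represented; Chinese just tokenizes into byte pieces rather than whole characters. A minimal sketch:

# GPT-2's BPE operates on UTF-8 bytes (mapped to printable stand-ins in
# encoder.json), not on whole characters, so Chinese text never falls
# outside the vocabulary -- it just becomes longer byte sequences.
text = '你好'
utf8_bytes = text.encode('utf8')
print(list(utf8_bytes))   # [228, 189, 160, 229, 165, 189]
# Each character costs ~3 bytes, and the learned merges are tuned for
# English, which is one reason generation quality suffers for CJK.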

LudoArt (Author) commented Jul 17, 2019

> When I try to fine-tune on a Chinese corpus, an exception is raised:
>
> ValueError: tuple.index(x): x not in tuple (from gpt_2_simple/src/encoder.py, line 73, in bpe)
>
> Can you give me some suggestions? Thx!

I am sorry about the problem you have encountered. I did not run into it myself, so I do not know what caused it.
