Could I use this model to train a Chinese text? #46

Open
LudoArt opened this issue May 12, 2019 · 14 comments

LudoArt commented May 12, 2019

I fed a Chinese text into this model and trained it, but the generated output is garbled. I think it may be an encoding error. How can I fix this problem?
Thx

LudoArt (Author) commented May 12, 2019

The output looks something like this:
ǫϡڻڣѡңĪ̫ӵڸڼסڸݶ򣬼ڸԪ̺˴ݳֲֲ […]

[several more paragraphs of the same mojibake trimmed]
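For reference, this kind of mojibake is what appears when bytes written in one encoding are decoded as another. A tiny sketch that reproduces the effect (the encodings here are chosen only for illustration, not a diagnosis of this exact output):

# Encode Chinese text as gb18030 bytes, then decode them as latin-1:
# every byte maps to some unrelated character, producing gibberish.
raw = '你好'.encode('gb18030')   # b'\xc4\xe3\xba\xc3'
print(raw.decode('latin-1'))     # 'Äãºã' -- mojibake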

woctezuma (Contributor) commented May 12, 2019

I don't know whether the model works well with a language other than English.

That being said, how do you load your Chinese text?

If you load it from disk yourself, make sure that:

  1. the file is encoded in UTF-8,
  2. you load it like this:

file_name = 'chinese_text.txt'

# Pass the encoding explicitly; the platform default may not be UTF-8.
with open(file_name, 'r', encoding='utf8') as f:
    data = f.read()

LudoArt (Author) commented May 12, 2019

> I don't know whether the model works well with a language other than English. That being said, how do you load your Chinese text? If you load it from disk yourself, make sure the file is encoded in UTF-8 and you open it with encoding='utf8'.

@woctezuma Thx for your answer.

The details are in this [Colaboratory notebook](https://colab.research.google.com/drive/1VLG8e7YSEwypxU-noRNhsv5dW4NfTGce); I can only upload my text and cannot find anywhere to change the encoding.
The file is not encoded in UTF-8; some of its characters would not display correctly that way, so I need to use "gb18030".

I used the following code in the Colab notebook, but it does not seem to help:
!export PYTHONIOENCODING=gb18030

I do not know what I should do.
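As an aside, !export in a Colab cell runs in a throwaway subshell, so the variable never reaches the notebook kernel; and PYTHONIOENCODING only affects Python's standard streams, not the encoding open() uses for files. A minimal sketch of setting an environment variable that actually persists, in case it is useful:

import os

# Unlike `!export`, which runs in a separate shell that exits immediately,
# this persists for the notebook process and any subprocesses it launches.
# PYTHONIOENCODING is read at interpreter startup and only affects
# stdin/stdout/stderr, so it still would not change how open() reads files.
os.environ['PYTHONIOENCODING'] = 'gb18030'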

woctezuma (Contributor) commented May 12, 2019

The simplest solution to your issue would be to convert your text file from gb18030 to UTF-8 on your desktop computer and then upload the UTF-8 version. Notepad++ or Visual Studio Code can convert between encodings.

Indeed, based on this StackOverflow answer, the two encodings use different byte sequences for the same characters, so the bytes get interpreted as different characters:

> That is, all Unicode characters can be encoded in GB18030, but they will be encoded with different byte sequences than would be generated with UTF-8 or UTF-16. Handling the GB18030 encoding doesn't require any more special techniques than are required for any other non-Unicode encoding.
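If you would rather stay inside Colab, the same conversion takes only a few lines of Python. A minimal sketch (the input file name is a placeholder for whatever was uploaded):

# Read the corpus with its real encoding, then write it back out as UTF-8.
with open('chinese_text_gb18030.txt', 'r', encoding='gb18030') as f:
    data = f.read()

with open('chinese_text.txt', 'w', encoding='utf8') as f:
    f.write(data)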

LudoArt (Author) commented May 12, 2019

> The simplest solution to your issue would be to convert your text file from gb18030 to UTF-8 on your desktop computer and then upload the UTF-8 version.

Thx so much.

It seems to work!

minimaxir (Owner) commented

I can't guarantee the quality of fine-tuning in non-English languages, especially CJK.

LudoArt (Author) commented May 16, 2019

> I can't guarantee the quality of fine-tuning in non-English languages, especially CJK.

I tried it; the output quality seems OK, but the meaning across the generated text seems somewhat incoherent.

dickkky commented May 20, 2019

Could you share your code for training on Chinese? I ran into some problems when training on Chinese text.

LudoArt (Author) commented May 21, 2019

> Could you share your code for training on Chinese? I ran into some problems when training on Chinese text.

I didn't change the code; I just changed the encoding of my corpus.
If you have any problems, tell me and I will try my best to help you.
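For completeness, a rough sketch of how fine-tuning is typically invoked with gpt-2-simple, pointed at the converted corpus (not LudoArt's exact code; the file name and step count are illustrative):

import gpt_2_simple as gpt2

gpt2.download_gpt2(model_name='117M')

sess = gpt2.start_tf_sess()
# 'chinese_text.txt' is the UTF-8 corpus from earlier in the thread;
# 1000 steps is illustrative, not a recommendation.
gpt2.finetune(sess, 'chinese_text.txt', model_name='117M', steps=1000)
gpt2.generate(sess)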

dickkky commented May 21, 2019

Thx for your reply. Did you change the encoding to UTF-8?

LudoArt (Author) commented May 21, 2019

> Thx for your reply. Did you change the encoding to UTF-8?

Yes!

Berrywrq commented

> Thx for your reply. Did you change the encoding to UTF-8?
>
> Yes!

When I try to fine-tune on a Chinese corpus, an exception is raised:

File "D:\VSCodeProjects\1907\gpt-2-simple-master\gpt_2_simple\src\encoder.py", line 73, in bpe
    j = word.index(first, i)
ValueError: tuple.index(x): x not in tuple

Can you give me some suggestions? Thx!

Berrywrq commented

> I can't guarantee the quality of fine-tuning in non-English languages, especially CJK.
>
> I tried it; the output quality seems OK, but the meaning across the generated text seems somewhat incoherent.
Since encoder.json only defines key-value pairs for English text units, how can the model generate Chinese characters? And how can the model map these Chinese characters to indices with just an English encoder.json when training on a Chinese corpus? That is the part I can't understand; please give me some tips, thanks. @minimaxir @LudoArt
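For background (not stated in this thread, but true of GPT-2's tokenizer): the encoder is a byte-level BPE, so any UTF-8 text can be represented; Chinese just tokenizes into byte pieces rather than whole characters. A minimal sketch:

# GPT-2's BPE operates on UTF-8 bytes (mapped to printable stand-ins in
# encoder.json), not on whole characters, so Chinese text never falls
# outside the vocabulary -- it just becomes longer byte sequences.
text = '你好'
utf8_bytes = text.encode('utf8')
print(list(utf8_bytes))   # [228, 189, 160, 229, 165, 189]
# Each character costs ~3 bytes, and the learned merges are tuned for
# English, which is one reason generation quality suffers for CJK.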

LudoArt (Author) commented Jul 17, 2019

> When I try to fine-tune on a Chinese corpus, an exception is raised:
>
> ValueError: tuple.index(x): x not in tuple (from gpt_2_simple/src/encoder.py, line 73, in bpe)
>
> Can you give me some suggestions? Thx!

I am sorry about the problem you have encountered. I did not run into it myself, so I do not know what caused it.
