
text generation quality for Chinese #95

Closed · chiangandy opened this issue Aug 6, 2019 · 8 comments

chiangandy commented Aug 6, 2019

I used Colab to try to fine-tune on a Chinese novel, but the result is not actually readable, as shown below:
======== SAMPLE 1 ========
是将吹雾挖出的将成功探测而出。令得她不过来了。
在这句她却便是张了什么处。可地云岚宗的基地本就有猜测的唶回纳,若是可见开一些他们纗地图下的威风。有什么那些自抗成形功探也是被实亀约不运的失踪,这个家伙?”
“按一东西。
“以后?”
第一千两百四纳乎容其收获
双翼下午床 偂静以及云岚宗家族时,现在穿过尮层地死死一死的一位完全自人。若是被这位似乎么好。不过这些层地曘众而速的缘故。先前云岚宗家族与家伙破碎,也知道。”
“按这些年边按一东西。”
双危得枯落双成一些纸藏。现在云岚宗家族。则是有着更是珋地的纳戒。一名一名视线成功探而来。似此如同一股落地被月地位置身给在落地墓墓吼。将三色山峰。都是在她身处的族人而出。他们。能够如何丧门两个家族事。我没有丝毫。比较给云岚宗家族身族成功压渐了过来二人。”
落地最后对此刻低低的落地。这些人吼力地双更驰在云岚宗这般种有些做完的同局一段时间。就在山脉交手吸了一圈。他们仅仅是将会从丝毫地毒间。那家伙。拥有会难以过足有山脉路。想必地实力。可怕的毒间不会速助引。”
心中现在也算是连脸色。纳戒的一道道人影击杀着自指大会回底独血之人。一个了。能够击

My training parameters were:
gpt2.finetune(sess, dataset="train.txt", model_name='345M', steps=1000, restore_from='fresh', print_every=20, sample_every=200, save_every=500)
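
For reference, a minimal sketch of the Colab setup this call assumes, following the gpt-2-simple README (train.txt is the dataset from this thread):

```python
import gpt_2_simple as gpt2

gpt2.download_gpt2(model_name='345M')   # fetch the base 345M model once
sess = gpt2.start_tf_sess()             # the `sess` passed to finetune()
gpt2.finetune(sess, dataset="train.txt", model_name='345M',
              steps=1000, restore_from='fresh',
              print_every=20, sample_every=200, save_every=500)
```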

Since GPT-2 is supposed to be very powerful for text generation, I just want to make sure whether this quality is normal, or whether there is still something I have not figured out yet.

Thanks

woctezuma (Contributor) commented Aug 6, 2019

Related: #46 (in the sense that some people have tried fine-tuning on Chinese texts before).

chiangandy (Author) commented Aug 7, 2019

Yes, I have checked issue #46. That topic is about an encoding issue, but my project can produce Chinese characters successfully with no encoding problems; the issue is that the generated text makes no sense.

chiangandy (Author) commented Aug 13, 2019

Finally, I got an almost perfect result with gpt-2-simple; a sample of the output is below:
    =====
    劇情講述了在國際廠意外流產,一個相愛的戰友身份之間的一段不穩定相愛的故事。年輕的腥風血雨中,蜀山崑崙的西遊水牛和三個小他戀人溫柔善良的幻想少女麥金。三個兒女情切扛起家鄉的西遊水牛(郭虎儀飾),在家鄉一個女人名聲藉助紮根據地。他在北京間步步成長,不僅在暗流涌動巧取得了人生的倫理,倖存的父親屬於自己和她的兩個新姊妹。麥金剛一起逃離,因此被屬於自己和身邊的女人打聽,麥金為救身邊的女人,為人食現麵綢社老闆在墓道上看不見,因此憑藉自己的一切都大膽服從到他的標準。
    =====

Some key points are important, as below (a code sketch of this workflow follows the list):

  1. Strongly recommend setting <|startoftext|> and <|endoftext|> around each sentence of the training material. The package appends these automatically for .csv files, but not for .txt files, so the developer needs to add them manually. Use <|startoftext|> as the prefix and <|endoftext|> as the truncate string when generating text.

  2. Since Chinese is more complicated, many more training steps are needed; in my case I set it to 10,000 steps.

  3. Strongly recommend training the model incrementally; don't run 10,000 steps in one go, or the Colab session will time out. You can use copy_checkpoint_to_gdrive to store the intermediate model in Google Drive.
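
A minimal sketch of the workflow behind points 1–3, using gpt-2-simple's Colab helpers (the run name, chunk size, and generation settings are illustrative, not from this thread):

```python
import gpt_2_simple as gpt2

gpt2.mount_gdrive()            # mount Google Drive for checkpoint storage
sess = gpt2.start_tf_sess()

# Point 3: train in smaller chunks instead of 10,000 steps at once,
# stashing the checkpoint in Drive before the Colab session times out.
gpt2.finetune(sess, dataset="train.txt", model_name='345M',
              steps=2000, restore_from='fresh', run_name='run1',
              save_every=500)
gpt2.copy_checkpoint_to_gdrive(run_name='run1')

# In a later Colab session, restore the checkpoint and continue:
#   gpt2.copy_checkpoint_from_gdrive(run_name='run1')
#   sess = gpt2.start_tf_sess()
#   gpt2.finetune(sess, dataset="train.txt", model_name='345M',
#                 steps=2000, restore_from='latest', run_name='run1')

# Point 1: generate with <|startoftext|> as the prefix and
# <|endoftext|> as the truncate string.
gpt2.generate(sess, run_name='run1',
              prefix="<|startoftext|>",
              truncate="<|endoftext|>",
              include_prefix=False)
```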

mohataher commented
@chiangandy, did you use the pretrained model to generate Chinese text, or did you train it from scratch?

Sorry to bring this up again. I'm trying to do the same thing for Arabic.

chiangandy (Author) commented

@mohataher The pre-trained model is based on English, which is different from my project's target (Traditional Chinese), so I used my own data to train the model. The result is not bad, but it can still be optimized further.

If you want to train for Arabic, I am not sure whether a pre-trained Arabic model exists; maybe you can search on Google. If there is none, I suggest training a new model rather than starting from the pre-trained one.

Andy

drizzt00s commented

@chiangandy would you mind adding me on WeChat? I have some problems with training on Chinese. Thanks.

chiangandy (Author) commented Sep 3, 2020 via email
