
text generation quality for Chinese #95

Closed · chiangandy opened this issue Aug 6, 2019 · 8 comments

chiangandy commented Aug 6, 2019

I used Colab to try to fine-tune on a Chinese novel, but the result is not actually readable, as shown below:
======== SAMPLE 1 ========
是将吹雾挖出的将成功探测而出。令得她不过来了。
在这句她却便是张了什么处。可地云岚宗的基地本就有猜测的唶回纳,若是可见开一些他们纗地图下的威风。有什么那些自抗成形功探也是被实亀约不运的失踪,这个家伙?”
“按一东西。
“以后?”
第一千两百四纳乎容其收获
双翼下午床 偂静以及云岚宗家族时,现在穿过尮层地死死一死的一位完全自人。若是被这位似乎么好。不过这些层地曘众而速的缘故。先前云岚宗家族与家伙破碎,也知道。”
“按这些年边按一东西。”
双危得枯落双成一些纸藏。现在云岚宗家族。则是有着更是珋地的纳戒。一名一名视线成功探而来。似此如同一股落地被月地位置身给在落地墓墓吼。将三色山峰。都是在她身处的族人而出。他们。能够如何丧门两个家族事。我没有丝毫。比较给云岚宗家族身族成功压渐了过来二人。”
落地最后对此刻低低的落地。这些人吼力地双更驰在云岚宗这般种有些做完的同局一段时间。就在山脉交手吸了一圈。他们仅仅是将会从丝毫地毒间。那家伙。拥有会难以过足有山脉路。想必地实力。可怕的毒间不会速助引。”
心中现在也算是连脸色。纳戒的一道道人影击杀着自指大会回底独血之人。一个了。能够击

My training parameters were:
gpt2.finetune(sess, dataset="train.txt", model_name='345M', steps=1000, restore_from='fresh', print_every=20, sample_every=200, save_every=500)
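
For reference, a minimal sketch of the Colab setup this call assumes, following the gpt-2-simple README (train.txt is the dataset from this thread):

```python
import gpt_2_simple as gpt2

gpt2.download_gpt2(model_name='345M')   # fetch the base 345M model once
sess = gpt2.start_tf_sess()             # the `sess` passed to finetune()
gpt2.finetune(sess, dataset="train.txt", model_name='345M',
              steps=1000, restore_from='fresh',
              print_every=20, sample_every=200, save_every=500)
```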

Since GPT-2 is supposed to be very powerful for text generation, I just want to make sure whether this quality is normal, or whether there is still something I have not figured out yet.

Thanks

woctezuma (Contributor) commented Aug 6, 2019

Related: #46 (in the sense that some people have tried fine-tuning on Chinese texts before).

chiangandy (Author) commented Aug 7, 2019

Yes, I have checked issue #46. That topic is about an encoding issue, but my project can produce Chinese characters successfully with no encoding problems; the issue is that the generated text makes no sense.

chiangandy (Author) commented Aug 13, 2019

Finally, I got an almost perfect result with gpt-2-simple; a sample of the output is below:
    =====
    劇情講述了在國際廠意外流產,一個相愛的戰友身份之間的一段不穩定相愛的故事。年輕的腥風血雨中,蜀山崑崙的西遊水牛和三個小他戀人溫柔善良的幻想少女麥金。三個兒女情切扛起家鄉的西遊水牛(郭虎儀飾),在家鄉一個女人名聲藉助紮根據地。他在北京間步步成長,不僅在暗流涌動巧取得了人生的倫理,倖存的父親屬於自己和她的兩個新姊妹。麥金剛一起逃離,因此被屬於自己和身邊的女人打聽,麥金為救身邊的女人,為人食現麵綢社老闆在墓道上看不見,因此憑藉自己的一切都大膽服從到他的標準。
    =====

Some key points are important, as below (a code sketch of this workflow follows the list):

  1. Strongly recommend setting <|startoftext|> and <|endoftext|> around each sentence of the training material. The package appends these automatically for .csv files, but not for .txt files, so the developer needs to add them manually. Use <|startoftext|> as the prefix and <|endoftext|> as the truncate string when generating text.

  2. Since Chinese is more complicated, many more training steps are needed; in my case I set it to 10,000 steps.

  3. Strongly recommend training the model incrementally; don't run 10,000 steps in one go, or the Colab session will time out. You can use copy_checkpoint_to_gdrive to store the intermediate model in Google Drive.
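
A minimal sketch of the workflow behind points 1–3, using gpt-2-simple's Colab helpers (the run name, chunk size, and generation settings are illustrative, not from this thread):

```python
import gpt_2_simple as gpt2

gpt2.mount_gdrive()            # mount Google Drive for checkpoint storage
sess = gpt2.start_tf_sess()

# Point 3: train in smaller chunks instead of 10,000 steps at once,
# stashing the checkpoint in Drive before the Colab session times out.
gpt2.finetune(sess, dataset="train.txt", model_name='345M',
              steps=2000, restore_from='fresh', run_name='run1',
              save_every=500)
gpt2.copy_checkpoint_to_gdrive(run_name='run1')

# In a later Colab session, restore the checkpoint and continue:
#   gpt2.copy_checkpoint_from_gdrive(run_name='run1')
#   sess = gpt2.start_tf_sess()
#   gpt2.finetune(sess, dataset="train.txt", model_name='345M',
#                 steps=2000, restore_from='latest', run_name='run1')

# Point 1: generate with <|startoftext|> as the prefix and
# <|endoftext|> as the truncate string.
gpt2.generate(sess, run_name='run1',
              prefix="<|startoftext|>",
              truncate="<|endoftext|>",
              include_prefix=False)
```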

mohataher commented
@chiangandy, did you use the pretrained model to generate Chinese text, or did you train it from scratch?

Sorry to bring this up again. I'm trying to do the same thing for Arabic.

chiangandy (Author) commented

@mohataher The pre-trained model is based on English, which is different from my project's target (Traditional Chinese), so I used my own data to train the model. The result is not bad, but it can still be optimized further.

If you want to train for Arabic, I am not sure whether a pre-trained Arabic model exists; maybe you can search on Google. If there is none, I suggest training a new model rather than starting from the pre-trained one.

Andy

drizzt00s commented

@chiangandy would you mind adding me on WeChat? I have some problems with training on Chinese. Thanks.

chiangandy (Author) commented Sep 3, 2020 via email
