
Prefix and suffix and their appearance in generated samples #40

Open · ZheMann opened this issue May 10, 2019 · 4 comments

@ZheMann

ZheMann commented May 10, 2019

I have a small dataset (~2 MB) consisting of columns written by a journalist over recent years. Each column is prepended with '<|startoftext|>' and appended with '<|endoftext|>'. I have two questions:

  1. When generating samples by executing gpt2.generate(sess, length=310, temperature=0.7, prefix="<|startoftext|>", include_prefix=False, truncate="<|endoftext|>", nsamples=5, batch_size=5)

the prefix and suffix still appear in the middle of the generated texts. Am I doing something wrong, or is this normal?

  2. Can someone explain briefly why I should prepend and append each separate text within a large file? I mean, assume I have three columns. What would be the difference in the model's behavior if I prepended and appended each column with these tokens versus separating the columns with, e.g., a blank line?

Although I have spent quite some time on GPT-2 by now, I still find this part hard to grasp, so any help is greatly appreciated.
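
For context, here is a minimal end-to-end sketch of the workflow described in this question, assuming the gpt-2-simple API; the file name columns.txt, the model name, and the step count are placeholders rather than values from this thread.

```python
# Minimal sketch (placeholders, not values from this thread): fine-tune on the
# delimiter-wrapped dataset, then sample with the settings quoted above.
import gpt_2_simple as gpt2

gpt2.download_gpt2(model_name="117M")  # small GPT-2 model

sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              "columns.txt",           # each column wrapped in <|startoftext|> / <|endoftext|>
              model_name="117M",
              steps=1000)

gpt2.generate(sess,
              length=310,
              temperature=0.7,
              prefix="<|startoftext|>",
              include_prefix=False,
              truncate="<|endoftext|>",
              nsamples=5,
              batch_size=5)
```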

@woctezuma
Contributor

It would have been better to open an issue for each question.

Anyway:

  1. I have the same issues:
  • I do not see any difference when include_prefix is set to True or to False,
  • several occurrences of <|startoftext|> and <|endoftext|> can appear inside what should be a single sample, despite setting truncate to <|endoftext|>.
  2. GPT-2 was trained on "8 million documents for a total of 40 GB of text". The tokens <|startoftext|> and <|endoftext|> allow the model to distinguish one document from another. This can be useful if you want to generate a fake Wikipedia article: you don't want to have to manually find where the model starts switching from one subject to another. Moreover, a document can itself contain blank lines, so separating documents with blank lines could be misinterpreted by the GPT-2 model. Without these tokens there is no notion of a document; the training data is just one big wall of text. (A sketch of such a delimiter-wrapped training file is shown below.)
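
To illustrate that second point, here is a small sketch, not taken from the thread, of how each document could be wrapped in the delimiters when assembling the training file; the directory name columns/ and the output file name columns.txt are hypothetical.

```python
# Hypothetical sketch: build one training file in which every document
# (e.g. each column) is wrapped in <|startoftext|> / <|endoftext|>.
from pathlib import Path

documents = sorted(Path("columns").glob("*.txt"))  # hypothetical folder of column files

with open("columns.txt", "w", encoding="utf-8") as out:
    for doc in documents:
        text = doc.read_text(encoding="utf-8").strip()
        out.write("<|startoftext|>\n" + text + "\n<|endoftext|>\n")
```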

@ZheMann
Author

ZheMann commented May 12, 2019

@woctezuma thank you for the quick reply and for creating a new issue for question 1.

I partially understand your answer to question 2, in the sense that these tags are used to distinguish between documents within a large file. However:

  1. Do these tags have an influence on fine-tuning the model, or are they ignored? I'm curious whether these tags are also used during fine-tuning or only while generating samples.
  2. You wrote:

This can be useful if you want to generate a fake Wikipedia article: you don't want to have to manually find where the model starts switching from one subject to another.

I do not completely understand this. With the settings prefix="<|startoftext|>", include_prefix=False and truncate="<|endoftext|>", we want the prefix and suffix not to appear in our samples, right? So how does this save us from having to manually find where the model starts switching to a new subject?

Thanks again for your time and I hope you do not mind answering these questions.

@woctezuma
Contributor

woctezuma commented May 12, 2019

Caveat: this is all my understanding of the program; I might be wrong.

  1. These delimiters should have an influence on the fine-tuning of the model. Indeed, if GPT-2 works like the other text-generation models I have seen, the model is given words and tries to predict the following words. Each time the <|startoftext|> delimiter occurs, the model forgets what came before the delimiter and predicts the following words without that earlier context.

    Say you have articles about countries, and you concatenate an article about France and then one about Germany. Without the delimiters, the model would learn to predict sentences about Germany based on the end of the article about France and the beginning of the article about Germany. With the delimiters in between the articles, the model would predict sentences about Germany based only on the article about Germany. It is a way to ensure that the model learns relevant information and does not mix training documents.

  2. Delimiters appear in the generated text; they are just removed afterwards with include_prefix=False and truncate="<|endoftext|>". When you use truncate, the generated text is cut at the first occurrence of <|endoftext|>, so you are supposed to get a clean generated text without manually looking for the cut. If you train the model with Wikipedia articles, then you should get a single generated Wikipedia article. If you opt not to use delimiters, then you generate a text of the required length, but you do not know where the start or the end of the generated Wikipedia article is supposed to be; you might have generated the middle part of an article. You have less control over what you generate. (A rough sketch of this post-processing is shown below.)
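
A rough sketch of that post-processing, based only on the behavior described above and not on gpt-2-simple's actual source; the helper name postprocess is made up for illustration.

```python
# Illustrative only -- this mimics the described behavior of truncate and
# include_prefix; it is not gpt-2-simple's actual implementation.
def postprocess(raw_sample,
                prefix="<|startoftext|>",
                truncate="<|endoftext|>",
                include_prefix=False):
    text = raw_sample.split(truncate)[0]          # cut at the first end-of-text token
    if not include_prefix and text.startswith(prefix):
        text = text[len(prefix):]                 # drop the leading start-of-text token
    return text.strip()

# A raw sample spanning two "documents" is cut down to the first one.
raw = "<|startoftext|>First article...<|endoftext|><|startoftext|>Second article..."
print(postprocess(raw))  # -> "First article..."
```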

@ZheMann
Author

ZheMann commented May 12, 2019

@woctezuma Thank you so much for your explanation, the answers are very clear! :)
