
Prefix and suffix and their appearance in generated samples #40

Open · ZheMann opened this issue May 10, 2019 · 4 comments

@ZheMann

ZheMann commented May 10, 2019

I have a small dataset (~2 MB) consisting of columns written by a journalist over recent years. Each column is prepended with '<|startoftext|>' and appended with '<|endoftext|>'. I have two questions:

  1. When generating samples by executing gpt2.generate(sess, length=310, temperature=0.7, prefix="<|startoftext|>", include_prefix=False, truncate="<|endoftext|>", nsamples=5, batch_size=5)

the prefix and suffix still appear in the middle of the generated texts. Am I doing something wrong, or is this normal?

  2. Can someone explain briefly why I should prepend and append each separate text within a large file? I mean, assume I have three columns. What would be the difference in the model's behavior if I prepended and appended each column with these tokens versus separating the columns with, e.g., a blank line?

Although I have spent quite some time on GPT-2 by now, I still find this part hard to grasp, so any help is greatly appreciated.
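
For context, here is a minimal end-to-end sketch of the workflow described in this question, assuming the gpt-2-simple API; the file name columns.txt, the model name, and the step count are placeholders rather than values from this thread.

```python
# Minimal sketch (placeholders, not values from this thread): fine-tune on the
# delimiter-wrapped dataset, then sample with the settings quoted above.
import gpt_2_simple as gpt2

gpt2.download_gpt2(model_name="117M")  # small GPT-2 model

sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              "columns.txt",           # each column wrapped in <|startoftext|> / <|endoftext|>
              model_name="117M",
              steps=1000)

gpt2.generate(sess,
              length=310,
              temperature=0.7,
              prefix="<|startoftext|>",
              include_prefix=False,
              truncate="<|endoftext|>",
              nsamples=5,
              batch_size=5)
```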

@woctezuma
Contributor

It would have been better to open an issue for each question.

Anyway:

  1. I have the same issues:
  • I do not see any difference when include_prefix is set to True or to False,
  • several occurrences of <|startoftext|> and <|endoftext|> can appear inside what should be a single sample, despite setting truncate to <|endoftext|>.
  2. GPT-2 was trained on "8 million documents for a total of 40 GB of text". The tokens <|startoftext|> and <|endoftext|> allow the model to distinguish one document from another. This can be useful if you want to generate a fake Wikipedia article: you don't want to have to manually find where the model starts switching from one subject to another. Moreover, a document can itself contain blank lines, so separating documents with blank lines could be misinterpreted by the GPT-2 model. Without these tokens there is no notion of a document; the training data is just one big wall of text. (A sketch of such a delimiter-wrapped training file is shown below.)
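
To illustrate that second point, here is a small sketch, not taken from the thread, of how each document could be wrapped in the delimiters when assembling the training file; the directory name columns/ and the output file name columns.txt are hypothetical.

```python
# Hypothetical sketch: build one training file in which every document
# (e.g. each column) is wrapped in <|startoftext|> / <|endoftext|>.
from pathlib import Path

documents = sorted(Path("columns").glob("*.txt"))  # hypothetical folder of column files

with open("columns.txt", "w", encoding="utf-8") as out:
    for doc in documents:
        text = doc.read_text(encoding="utf-8").strip()
        out.write("<|startoftext|>\n" + text + "\n<|endoftext|>\n")
```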

@ZheMann
Author

ZheMann commented May 12, 2019

@woctezuma thank you for the quick reply and for creating a new issue for question 1.

I partially understand your answer to question 2, in the sense that these tags are used to distinguish between documents within a large file. However:

  1. Do these tags have an influence on fine-tuning the model, or are they ignored? I'm curious whether these tags are also used during fine-tuning or only while generating samples.
  2. You wrote:

This can be useful if you want to generate a fake Wikipedia article: you don't want to have to manually find where the model starts switching from one subject to another.

I do not completely understand this. With the settings prefix="<|startoftext|>", include_prefix=False and truncate="<|endoftext|>", we want the prefix and suffix not to appear in our samples, right? So how does this save us from having to manually find where the model starts switching to a new subject?

Thanks again for your time and I hope you do not mind answering these questions.

@woctezuma
Contributor

woctezuma commented May 12, 2019

Caveat: this is all my understanding of the program; I might be wrong.

  1. These delimiters should have an influence on the fine-tuning of the model. Indeed, if GPT-2 works like the other text-generation models I have seen, the model is given words and tries to predict the following words. Each time the <|startoftext|> delimiter occurs, the model forgets what came before the delimiter and predicts the following words without that earlier context.

    Say you have articles about countries, and you concatenate an article about France and then one about Germany. Without the delimiters, the model would learn to predict sentences about Germany based on the end of the article about France and the beginning of the article about Germany. With the delimiters in between the articles, the model would predict sentences about Germany based only on the article about Germany. It is a way to ensure that the model learns relevant information and does not mix training documents.

  2. Delimiters appear in the generated text; they are just removed afterwards with include_prefix=False and truncate="<|endoftext|>". When you use truncate, the generated text is cut at the first occurrence of <|endoftext|>, so you are supposed to get a clean generated text without manually looking for the cut. If you train the model with Wikipedia articles, then you should get a single generated Wikipedia article. If you opt not to use delimiters, then you generate a text of the required length, but you do not know where the start or the end of the generated Wikipedia article is supposed to be; you might have generated the middle part of an article. You have less control over what you generate. (A rough sketch of this post-processing is shown below.)
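
A rough sketch of that post-processing, based only on the behavior described above and not on gpt-2-simple's actual source; the helper name postprocess is made up for illustration.

```python
# Illustrative only -- this mimics the described behavior of truncate and
# include_prefix; it is not gpt-2-simple's actual implementation.
def postprocess(raw_sample,
                prefix="<|startoftext|>",
                truncate="<|endoftext|>",
                include_prefix=False):
    text = raw_sample.split(truncate)[0]          # cut at the first end-of-text token
    if not include_prefix and text.startswith(prefix):
        text = text[len(prefix):]                 # drop the leading start-of-text token
    return text.strip()

# A raw sample spanning two "documents" is cut down to the first one.
raw = "<|startoftext|>First article...<|endoftext|><|startoftext|>Second article..."
print(postprocess(raw))  # -> "First article..."
```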

@ZheMann
Author

ZheMann commented May 12, 2019

@woctezuma Thank you so much for your explanation, the answers are very clear! :)
