why is <|startoftext|> not inserted when reading in plain text files? #55

Open

fzhang612 opened this issue May 20, 2019 · 3 comments

@fzhang612
I am having some difficulties understanding the logic of load_dataset(). When reading a CSV file, every line is padded with start_token and end_token. However, for plain text files, only end_token is used. Is there any particular reason to omit start_token for plain text files?

Thanks
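
For reference, here is roughly what the CSV branch does (a minimal sketch of the behavior described above, with a hypothetical data.csv; the real logic lives in gpt_2_simple's load_dataset()):

```python
import csv

start_token = "<|startoftext|>"
end_token = "<|endoftext|>"

raw_text = ''
with open('data.csv', 'r', encoding='utf8') as fp:  # hypothetical input file
    reader = csv.reader(fp)
    for row in reader:
        # Each CSV row is wrapped in both tokens...
        raw_text += start_token + row[0] + end_token + "\n"
# ...whereas plain text files only ever get end_token appended.
```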

@minimaxir
Owner

That is the way it was implemented in the original dataset the GPT-2 model was trained on. If you do not provide a prefix, it will start generating as if it were immediately after an <|endoftext|> token.

The model can infer the "beginning" of a text from what immediately follows <|endoftext|>, but IMO it's cleaner to make that explicit, especially when training on more document-oriented datasets like CSVs.
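
For example, when the tokens are explicit, sampling can be anchored and truncated on them (a minimal sketch using gpt_2_simple's generate(), assuming a model already finetuned into checkpoint/run1):

```python
import gpt_2_simple as gpt2

sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess)  # loads the finetuned model from checkpoint/run1

# Start each sample as a new document and cut it off at the end token.
gpt2.generate(sess,
              prefix="<|startoftext|>",
              truncate="<|endoftext|>")
```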

@fzhang612
Author

Thanks, that makes sense.

A follow-up question: currently, end_token is not appended when the length of the text exceeds combine. Should an end_token be appended in that case as well?
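
For context, this is roughly the plain-text branch in question (a simplified, hypothetical reconstruction, not the exact source):

```python
import numpy as np

def load_plain_text(enc, paths, combine):
    # Simplified sketch of load_dataset()'s plain-text path.
    token_chunks = []
    raw_text = ''
    for path in paths:
        with open(path, 'r', encoding='utf8') as fp:
            raw_text += fp.read()
        if len(raw_text) >= combine:
            # Long texts are encoded as-is: no end_token is appended here,
            # which is the asymmetry this question is asking about.
            token_chunks.append(np.stack(enc.encode(raw_text)))
            raw_text = ''
        else:
            # Shorter texts accumulate, separated only by end_token.
            raw_text += '<|endoftext|>'
    if raw_text:
        token_chunks.append(np.stack(enc.encode(raw_text)))
    return token_chunks
```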

@woctezuma
Contributor

The initial question was related to #49.
