I am having some difficulty understanding the logic of load_dataset(). When the input is a CSV file, every line is padded with both start_token and end_token. However, for a plain text file, only end_token is used. Is there any particular reason to omit start_token for plain text files?
Thanks
fzhang612 changed the title from "why is <|startoftext|> not inserted where read in plain text files?" to "why is <|startoftext|> not inserted when read in plain text files?" on May 20, 2019.
That is the way the original dataset the GPT-2 model was trained on was constructed. If you do not provide a prefix, the model will start generating as if it were just after an <|endoftext|> token.
The model can infer the "beginning" of text from what immediately follows <|endoftext|>, but IMO it's cleaner to mark it explicitly, especially when training on more document-oriented datasets like CSVs.
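To make the difference concrete, here is a minimal sketch of the two wrapping behaviors described in the question. This is a hypothetical illustration, not the library's actual load_dataset() implementation; the function names and token constants are assumptions for the example.

```python
# Hypothetical sketch of the token-wrapping behavior discussed above;
# NOT the library's actual load_dataset() code.
import csv
import io

START_TOKEN = "<|startoftext|>"
END_TOKEN = "<|endoftext|>"

def wrap_csv(text):
    # CSV rows are treated as separate documents: each row gets
    # BOTH a start token and an end token.
    reader = csv.reader(io.StringIO(text))
    return "".join(
        f"{START_TOKEN}{row[0]}{END_TOKEN}\n" for row in reader if row
    )

def wrap_plain_text(text):
    # Plain text is treated as one continuous stream: only an
    # end-of-text delimiter is appended, with no explicit start token.
    return text + END_TOKEN
```

With this sketch, `wrap_csv("hello\nworld\n")` wraps each row in both delimiters, while `wrap_plain_text` leaves the start of the stream unmarked, matching the asymmetry the question points out.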