why is <|startoftext|> not inserted when reading in plain text files? #55

Open

fzhang612 opened this issue May 20, 2019 · 3 comments

@fzhang612
I am having some difficulties understanding the logic of load_dataset(). When reading a CSV file, every line is padded with start_token and end_token. However, for plain text files, only end_token is used. Is there any particular reason to omit start_token for plain text files?

Thanks
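
For reference, here is roughly what the CSV branch does (a minimal sketch of the behavior described above, with a hypothetical data.csv; the real logic lives in gpt_2_simple's load_dataset()):

```python
import csv

start_token = "<|startoftext|>"
end_token = "<|endoftext|>"

raw_text = ''
with open('data.csv', 'r', encoding='utf8') as fp:  # hypothetical input file
    reader = csv.reader(fp)
    for row in reader:
        # Each CSV row is wrapped in both tokens...
        raw_text += start_token + row[0] + end_token + "\n"
# ...whereas plain text files only ever get end_token appended.
```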

@minimaxir
Owner

That is the way it was implemented in the original dataset the GPT-2 model was trained on. If you do not provide a prefix, it will start generating as if it were immediately after an <|endoftext|> token.

The model can infer the "beginning" of a text from what immediately follows <|endoftext|>, but IMO it's cleaner to make that explicit, especially when training on more document-oriented datasets like CSVs.
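
For example, when the tokens are explicit, sampling can be anchored and truncated on them (a minimal sketch using gpt_2_simple's generate(), assuming a model already finetuned into checkpoint/run1):

```python
import gpt_2_simple as gpt2

sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess)  # loads the finetuned model from checkpoint/run1

# Start each sample as a new document and cut it off at the end token.
gpt2.generate(sess,
              prefix="<|startoftext|>",
              truncate="<|endoftext|>")
```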

@fzhang612
Author

Thanks, that makes sense.

A follow-up question: currently, end_token is not appended when the length of the text exceeds combine. Should an end_token be appended in that case as well?
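
For context, this is roughly the plain-text branch in question (a simplified, hypothetical reconstruction, not the exact source):

```python
import numpy as np

def load_plain_text(enc, paths, combine):
    # Simplified sketch of load_dataset()'s plain-text path.
    token_chunks = []
    raw_text = ''
    for path in paths:
        with open(path, 'r', encoding='utf8') as fp:
            raw_text += fp.read()
        if len(raw_text) >= combine:
            # Long texts are encoded as-is: no end_token is appended here,
            # which is the asymmetry this question is asking about.
            token_chunks.append(np.stack(enc.encode(raw_text)))
            raw_text = ''
        else:
            # Shorter texts accumulate, separated only by end_token.
            raw_text += '<|endoftext|>'
    if raw_text:
        token_chunks.append(np.stack(enc.encode(raw_text)))
    return token_chunks
```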

@woctezuma
Contributor

The initial question was related to #49.
