jangirshubham/pdf_to_information

pdf_to_information

Extract information such as author names and mentioned keywords from PDFs.

'input_pdfs' is the input folder for the PDFs to process.

'images' is the folder where images are saved when using OCR.

'main_func.py' is the main file to run (it turns string objects into extracted information).

'utils/data_prep.py' is the data prep file (PDF to text).

'utils/data_prep_image.py' is the image data prep file (PDF to text, using the images contained in the PDF).

JSON files (load/dump) are also used to pass data between modules.
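As an illustration of this kind of hand-off, here is a minimal sketch of one module dumping extracted text to JSON and another loading it back. The function names `dump_texts` and `load_texts`, the file name `extracted_text.json`, and the `{filename: text}` layout are assumptions made for the example; they are not taken from this repository.

```python
import json
from pathlib import Path


def dump_texts(texts, path):
    """Write a {pdf filename: extracted text} mapping to a JSON file.

    Hypothetical helper: the mapping layout is an assumption.
    """
    with Path(path).open("w", encoding="utf-8") as fh:
        # ensure_ascii=False keeps non-ASCII author names readable in the file.
        json.dump(texts, fh, ensure_ascii=False, indent=2)


def load_texts(path):
    """Read the mapping back in the module that consumes it."""
    with Path(path).open("r", encoding="utf-8") as fh:
        return json.load(fh)


if __name__ == "__main__":
    # Hypothetical intermediate file shared by the data prep and main steps.
    dump_texts({"paper1.pdf": "Title\nA. Author, Some Institute"}, "extracted_text.json")
    print(load_texts("extracted_text.json"))
```

Keeping the intermediate text on disk means the PDF-to-text step (which can be slow, especially with OCR) does not have to be re-run every time the extraction logic changes.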

The output is a '.csv' file containing the filename, author names, institute, and companies mentioned.
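A minimal sketch of writing such a file with Python's standard `csv` module follows. The column names, the `write_rows` helper, and the choice to join multiple values with "; " are assumptions for illustration; the repository's actual column headers and separators may differ.

```python
import csv
import io

# Assumed column names, based on the fields listed above.
FIELDNAMES = ["filename", "authors", "institute", "companies"]


def write_rows(rows, fh):
    """Write one CSV row per PDF to an open text file handle.

    Each row is a dict whose "authors" and "companies" values are lists
    of strings; they are joined with "; " so each fits in a single cell.
    """
    writer = csv.DictWriter(fh, fieldnames=FIELDNAMES)
    writer.writeheader()
    for row in rows:
        writer.writerow({
            "filename": row["filename"],
            "authors": "; ".join(row["authors"]),
            "institute": row["institute"],
            "companies": "; ".join(row["companies"]),
        })


if __name__ == "__main__":
    buffer = io.StringIO()
    write_rows(
        [{
            "filename": "paper1.pdf",
            "authors": ["A. Author", "B. Writer"],
            "institute": "Some Institute",
            "companies": ["ExampleCorp"],
        }],
        buffer,
    )
    print(buffer.getvalue())
```

When writing to a real file instead of a `StringIO` buffer, open it with `newline=""` as the `csv` module documentation recommends, so that line endings are not doubled on Windows.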
