Documentation PDFparser

Project Name: PDF Parser Automation

Description: Automated pipeline for downloading, parsing, and staging PDF files
with configurable workflows.

Author: Eva Iablocova


Repository: [Link]

Contents
Document control
Revision History
Project Overview
Architecture
Installation
Usage
Project location
PDF_parser_sqls – folder with SQL scripts
How to run Import
1 variant
2 variant
How to run Export
1 variant
2 variant
Be careful
The result of Export
Configuration
There are 3 configuration files:
[Link]
config_last_dates_in_db.json
ETL_control
Scenarios for Using Configuration Files
Slack
Slack Notification Triggers: Five Scenarios for Sending Messages
Adding new Slack recipient
Logging System
dbo.File_loaded
[Link]
dbo.Logs_Export

Document control
Revision History
Version | Author        | Revision for issue                                                             | Date
1.0     | Eva Iablocova | Initial draft                                                                  | 21.07.2025
1.1     | Eva Iablocova | Added Export part, Logging System. Updated Slack part.                         | 27.08.2025
1.2     | Eva Iablocova | Added a second option for running Export. One more type of message in Slack.   | 04.09.2025

Project Overview

This project automates the process of downloading, parsing, and staging PDF files based on date
changes and configuration settings. It identifies which files need to be processed, downloads them,
parses their contents, loads the results into a database, and manages file archiving. The workflow is
controlled by command-line arguments and configuration files, with logging at each step for traceability.
The main technologies used are Python and SQL.

Architecture

Pic. 1 - Data Flow

The project is structured as a modular pipeline for processing PDF files.

Pic.2. PDF parser System

Downloader Module: Loads and compares dates. Downloads changed PDFs.

Parser Module: Parses and cleans files.

Staging Module: Loads files into the database.

Database: Stores data, includes stored procedures and jobs.

Export Module: Exports files to CSV, taking the ETL_control table into account. Sends this CSV file by
email.

Installation

1. Clone the repository to your local machine:

git clone <repository_url>

2. Run 0_Deployment_script.sql – a one-time script that creates the database structure and constraints.
3. Install Python 3.8 or newer.
4. Install required dependencies:
pip install -r requirements
5. Configure the project:
Edit [Link] to set paths for download_dir, parsed_data_dir, and other settings.
Ensure all required scripts (1_load_dates_from_site.py, 3_1_parser.py, etc.) are present in
the project directory.
6. Run the entry point script:

python 0_1_entry_point.py [download|parse|stage]

Use download to start from downloading files.

Use parse to start from parsing downloaded files.

Use stage to stage parsed data.

7. Check logs for progress and errors.

Usage
Project location

Pic.3 - Project location

PDF_parser_sqls – folder with SQL scripts

Pic.4 - PDF_parser_sqls

0_Deployment_script.sql – a one-time script that creates the database structure and constraints

1_Merge.sql – the first step in SQL job

2_Validation.sql – the second step in SQL job

3_agregate_message.sql – the third step in SQL job

4_call_sp_to_check_logs.sql - the fourth step in SQL job

How to run Import
1 variant
1. Choose the operation mode: download, parse, or stage (download is the default).

2. Run the entry point script with the desired mode:

python 0_1_entry_point.py [download|parse|stage]

download: Downloads new or changed PDF files based on date comparison.

parse: Parses downloaded files (all files in “files_to_parse” folder), cleans data, loads results into the
database, and archives processed files.

stage: Stages parsed data (all files in “parsed_files” folder) for further processing and updates
configuration.

3. Monitor progress and errors in the log messages generated during each step.

4. Check the output directories (download_dir, parsed_data_dir) for processed files.
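
The mode selection described above can be sketched as follows. This is a minimal illustration, not the project's actual entry point: it assumes that the chosen mode is the starting step and that every later step runs after it, and the names STEPS, steps_to_run, and parse_mode are hypothetical.

```python
import argparse
from typing import List

# Assumed step order; the real script may chain its steps differently.
STEPS = ["download", "parse", "stage"]


def steps_to_run(mode: str) -> List[str]:
    """Return the chosen step followed by every later step."""
    return STEPS[STEPS.index(mode):]


def parse_mode(argv: List[str]) -> str:
    """Read the optional mode argument; 'download' is used when it is omitted."""
    parser = argparse.ArgumentParser(description="PDF parser entry point (sketch)")
    parser.add_argument("mode", nargs="?", default="download", choices=STEPS)
    return parser.parse_args(argv).mode


def main(argv: List[str]) -> None:
    """Print the steps that would run for the given command-line arguments."""
    for step in steps_to_run(parse_mode(argv)):
        print(f"running step: {step}")
```

Calling main(["parse"]) under these assumptions would run the parse step and then the stage step, while main([]) would start from download.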

2 variant
1. Open the scheduler.
2. Find the “Daily pdf parser” job and run it.

Pic. 5 – Windows job “Daily pdf parser”

How to run Export
The Export module uses Python to orchestrate the export process:

1) Check for a collision with the import process
2) Export the data into a CSV file
3) Send the email
4) Send the Slack message
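
The four steps listed above can be sketched as a small orchestrator. This is a hedged illustration only: the function run_export and its four callable parameters are hypothetical, and skipping the export while an import is running is an assumption about how a collision is handled.

```python
from typing import Callable, List


def run_export(
    import_is_running: Callable[[], bool],
    export_csv: Callable[[], str],
    send_email: Callable[[str], None],
    send_slack: Callable[[str], None],
) -> List[str]:
    """Run the export steps in order and return the names of the completed steps."""
    completed = ["collision_check"]
    if import_is_running():
        # Assumption: the export is skipped while an import is in progress.
        return completed

    csv_path = export_csv()  # step 2: write the CSV file and return its path
    completed.append("export_csv")

    send_email(csv_path)  # step 3: mail the CSV file to the recipients
    completed.append("send_email")

    send_slack(f"Export finished: {csv_path}")  # step 4: notify Slack
    completed.append("send_slack")
    return completed
```

Passing the steps in as callables keeps the ordering logic separate from the database, mail, and Slack code, so the sequence can be checked with simple stand-in functions.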

1 variant
To run Export through the command line interface (CLI), use the following command in terminal:

python 9_Export.py

Make sure you are in the directory where 9_Export.py is located, or provide the full path to the script.

2 variant
1. Open the scheduler.
2. Find the “ExportPDFparser” job and run it.

Pic. 6 – Windows job “ExportPDFparser”

Be careful
The export is linked to the “ETL_control” table in the database. To specify the export period,
change the dates in the “ETL_control” table.

The result of Export
The export file is written to the “export_files” folder.

Pic. 7 – Folder with export files

The recipients will receive a CSV file at the specified email addresses.

Pic. 8 – Export email
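
As an illustration of the CSV and email steps, the sketch below builds a CSV string and attaches it to an email using only the Python standard library. It is not the project's export code: the function names, subject line, and body text are hypothetical, and the addresses and file name are supplied by the caller.

```python
import csv
import io
from email.message import EmailMessage
from typing import Iterable, List, Sequence


def rows_to_csv(header: Sequence[str], rows: Iterable[Sequence[object]]) -> str:
    """Serialize a header and data rows into CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def build_export_email(
    sender: str, recipients: List[str], csv_text: str, filename: str
) -> EmailMessage:
    """Create an email that carries the CSV text as an attachment."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = ", ".join(recipients)
    message["Subject"] = "PDF parser export"  # hypothetical subject line
    message.set_content("The export file is attached.")
    # Attach the CSV text as a text/csv part with the given file name.
    message.add_attachment(csv_text, subtype="csv", filename=filename)
    return message
```

A message built this way could then be handed to smtplib's send_message on a connection to whichever mail server the project uses.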


Configuration
There are 3 configuration files:

Pic.9 – Configuration JSON files

[Link]
[Link] – contains all settings for parsing files

Edit the [Link] file to set the following parameters:

download_dir: Directory for downloaded PDF files.

parsed_data_dir: Directory for parsed data files.

config_last_dates_in_db: Stores the last processed dates for each file.

today_file: Path to the file containing today's dates.


file_configs: List of dictionaries specifying keywords and parsing settings for each file type.

Pic. 10 – part of [Link] file
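
A loader for this file could check that the parameters listed above are present before the pipeline starts. The sketch below is an assumption-laden example: the helper names are hypothetical, and treating all five parameters as required is a guess rather than documented behavior.

```python
import json
from pathlib import Path
from typing import Any, Dict

# The parameter names come from the list above; treating all of them as
# required is an assumption.
REQUIRED_KEYS = (
    "download_dir",
    "parsed_data_dir",
    "config_last_dates_in_db",
    "today_file",
    "file_configs",
)


def validate_config(config: Dict[str, Any]) -> Dict[str, Any]:
    """Raise an error if a parameter is missing or file_configs is not a list."""
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError("missing config keys: " + ", ".join(missing))
    if not isinstance(config["file_configs"], list):
        raise TypeError("file_configs must be a list of dictionaries")
    return config


def load_config(path: str) -> Dict[str, Any]:
    """Read a JSON configuration file and validate it."""
    return validate_config(json.loads(Path(path).read_text(encoding="utf-8")))
```

Failing early on a missing key gives a clearer error than a failure halfway through a download or parse step.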

config_last_dates_in_db.json
config_last_dates_in_db.json – contains last updated dates for each file.

Pic. 11 – part of config_last_dates_in_db.json file
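
The date comparison that decides which files to download can be sketched as below. The structure is assumed, not confirmed by the source: each file name is taken to map to a dictionary of dates such as “start_time” and “end_time”, and the function name files_to_download is hypothetical.

```python
from typing import Dict, List


def files_to_download(
    site_dates: Dict[str, Dict[str, str]],
    last_dates: Dict[str, Dict[str, str]],
) -> List[str]:
    """Return the files whose dates on the site differ from the stored dates.

    A file that has no stored entry is always returned, so an empty
    last_dates dictionary selects every file.
    """
    changed = []
    for name, dates in site_dates.items():
        if last_dates.get(name) != dates:
            changed.append(name)
    return changed
```

Under these assumptions, editing one stored date makes only that file stand out, and an empty stored dictionary makes every file stand out.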

Pic.12 – ETL_control table

ETL_control

ETL_control table – contains the start date and time of the previous export and the start date and time of
the last export.

Pic.13 – ETL_control table

Scenarios for Using Configuration Files


- To download one or several files, change either “start_time” or “end_time” for
these files and run the project (or the Windows job “Daily pdf parser”).

Pic. 14 – Before and after changes for a few files

- For a full run, replace the entire contents of “config_last_dates_in_db.json” with
“{}”.

Pic. 15 – Before and after changes for all files

- To export from a specific date, set [Date_last_exported] for
“Export_table_finish_date” in the ETL_control table.

Pic.16 – The export will start from this date and include all data up to the current date
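
The last scenario amounts to a single UPDATE on the ETL_control table. The sketch below demonstrates it against an in-memory SQLite database as a stand-in for the project's database. Only ETL_control, Date_last_exported, and the value “Export_table_finish_date” come from this document; the key column name Control_name, the date format, and the function name are assumptions.

```python
import sqlite3


def set_export_start(connection: sqlite3.Connection, start_date: str) -> int:
    """Set the date the next export starts from; return the number of rows updated."""
    cursor = connection.execute(
        # Control_name is a hypothetical column name for the row key.
        "UPDATE ETL_control SET Date_last_exported = ? WHERE Control_name = ?",
        (start_date, "Export_table_finish_date"),
    )
    connection.commit()
    return cursor.rowcount


# Stand-in table so the sketch runs on its own; the real table may differ.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE ETL_control (Control_name TEXT PRIMARY KEY, Date_last_exported TEXT)"
)
connection.execute(
    "INSERT INTO ETL_control VALUES ('Export_table_finish_date', '2025-08-01 00:00:00')"
)
updated = set_export_start(connection, "2025-07-01 00:00:00")
```

Checking the returned row count is a cheap way to confirm that the expected row was actually changed before running the export.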

Slack
Slack Notification Triggers: Five Scenarios for Sending Messages
1) When the task completed successfully and files were downloaded

2) When the task completed successfully and no files were downloaded, because the dates had not
changed on the site

3) When the task completed with an error

4) When data was exported

5) When a report about potential truncation in rows is generated
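
A message for any of these scenarios can be posted to a Slack incoming webhook as a JSON body with a “text” field. The sketch below is a minimal example using the standard library; the scenario keys, the message texts, and the function names are hypothetical, and the webhook URL is supplied by the caller.

```python
import json
import urllib.request

# Hypothetical message texts, one per scenario listed above.
MESSAGES = {
    "downloaded": "Task completed successfully: new files were downloaded.",
    "no_changes": "Task completed successfully: no dates changed on the site.",
    "error": "Task completed with an error.",
    "exported": "Data was exported.",
    "truncation": "Report: potential truncation in rows.",
}


def build_slack_request(webhook_url: str, text: str) -> urllib.request.Request:
    """Build the POST request for a Slack incoming webhook."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_slack_message(webhook_url: str, text: str) -> int:
    """Send the message and return the HTTP status code."""
    request = build_slack_request(webhook_url, text)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Building the request separately from sending it makes the payload easy to inspect without contacting Slack.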

Adding new Slack recipient


Slack configuration settings are stored in a database table.

Pic. 17 – SlackConfigSetting_SID table

To add a new Slack recipient:

1) Execute this procedure with valid parameters.

Pic. 18 – add_present_values for Slack messages recipient

2) Change the “Recipient” in the SQL job “validate_and_insert_data”.

Pic. 19 – Open “validate_and_insert_data” SQL job

Pic. 20 – Open “send_agregate_message” step

Pic. 21 – Change the “Recipient” at the end of the script

3) Add the email address and Slack webhook to [Link].

Pic.22 – Recipients in [Link]

Logging System
The database has 3 tables that log all actions.

Pic.23 – Logging tables in Database

dbo.File_loaded

Pic.24 – dbo.File_loaded table

Table dbo.File_loaded contains all files loaded into the database, along with their DatePublished from the
PDF file.

File_loaded_SID is present in each table to indicate which file each row belongs to.

[Link]

Pic.25 – [Link] table

Table [Link] contains a log of each step in the parsing process execution.

dbo.Logs_Export

Pic.26 – dbo.Logs_Export table

Table dbo.Logs_Export contains the export period, the number of rows exported, and the export status.

