GTN (Gas Transmission Northwest LLC) Pipeline Data – 12871
⚠️ If the resource you are scraping requires you to agree to any Terms & Conditions,
please do not proceed and notify your contract manager immediately. Under no
circumstances should you create a false account or fake identity.
Description:
Please write a scraper tool that enters a Post Date (from) and Post Date (to) range covering the 90 days
up to the day the scrape is run (i.e. from 90 days ago through today). The query is limited to 90 days.
• Parameters Setup
o Enter the From Date
o Enter the End Date
o Click the Retrieve button; the grid of data is then displayed
• Click a row in the table to open the detail page for that contract.
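The 90-day window described above can be computed as in the following minimal sketch. The function name post_date_range and the constant MAX_QUERY_DAYS are illustrative, not part of the job specification. Whether the site counts the limit inclusively is not stated here, so the window may need to be shortened by one day, and the date format the form expects is also an assumption to verify against the page.

```python
from datetime import date, timedelta

MAX_QUERY_DAYS = 90  # query limit stated in the description


def post_date_range(today=None):
    """Return (from_date, to_date) spanning the 90 days up to `today`."""
    to_date = today or date.today()
    from_date = to_date - timedelta(days=MAX_QUERY_DAYS)
    return from_date, to_date


# Example with a fixed run date so the result is reproducible.
from_date, to_date = post_date_range(date(2023, 2, 23))
print(from_date.isoformat(), to_date.isoformat())
```

Passing the run date explicitly keeps the function testable; in the scraper itself it would be called with no argument.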
Scraping Description
There should be one output dataset:
• Summary Dataset
o Scrape all data displayed in the summary grid (blue section in the first illustration). You
can either
▪ Scrape the data manually from the HTML, or
▪ Use the [Download] button on the top right, which gives you a CSV
o Add an additional link column to the dataset.
▪ For each row, the link is the HTML link pointing to the detail page.
o Add an additional scrape_time column to indicate the scrape time
The desired schema is listed below
Root URL:
http://tcplus.com/GTN/ContractRouteRate/Interruptible
Job Frequency:
Realtime (every minute)
Output Columns:
File One: summary.csv
Column Name            Original Columns  Type      Example
scrape_time            -                 datetime
post_datetime          Post Date / Time  datetime  20230223 12:10:45
k_holder_name          K Holder Name     str       Castleton Commodities Merchant Trading L.P.
k_holder               K Holder          str       118638852
svc_req_k              Svc Req K         str       20578
rate_sch               Rate Sch          str       PAL
it_qty_k               IT Qty – K        int       30000
k_stat                 K Stat            str       N
disc_beg_date          Disc Beg Date     datetime  20230224
disc_end_date          Disc End Date     datetime  20230224
receipt_loc            Loc               str       370672
receipt_loc_name       Loc Name          str       MALIN MC
receipt_loc_qti_desc   Loc/QTI Desc      str       Rec Qty
delivery_loc           Loc               int       0
delivery_loc_name      Loc Name          str       MALIN MC
delivery_loc_qti_desc  Loc/QTI Desc      str
loc_ind                Loc Ind           str       I
rate_chgd              Rate Chgd         float     0.2
max_trf_rate           Max Trf Rate      float     0.204356
ngtd_rate_ind          Ngtd Rate Ind     str       N
rate_id_desc           Rate ID Desc      str       Loan Chrg-Bal
affil                  Affil             str       None
terms_notes            Terms/Notes       str       N
link                   -                 str       -
• Please note that the original table uses the same column names (Loc, Loc Name, Loc/QTI Desc)
for the receipt and delivery locations. Treat the first three as receipt and the last three as
delivery.
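The receipt/delivery rule above could be applied while renaming headers, as in this sketch. The helper names are illustrative. It assumes the scraped headers match the "Original Columns" text after whitespace is collapsed and an en dash is treated as a hyphen, which should be checked against the live page.

```python
from collections import Counter

# Original header -> output name, for headers that appear once in the grid.
UNIQUE_HEADERS = {
    "Post Date / Time": "post_datetime",
    "K Holder Name": "k_holder_name",
    "K Holder": "k_holder",
    "Svc Req K": "svc_req_k",
    "Rate Sch": "rate_sch",
    "IT Qty - K": "it_qty_k",
    "K Stat": "k_stat",
    "Disc Beg Date": "disc_beg_date",
    "Disc End Date": "disc_end_date",
    "Loc Ind": "loc_ind",
    "Rate Chgd": "rate_chgd",
    "Max Trf Rate": "max_trf_rate",
    "Ngtd Rate Ind": "ngtd_rate_ind",
    "Rate ID Desc": "rate_id_desc",
    "Affil": "affil",
    "Terms/Notes": "terms_notes",
}
# Headers that appear twice: first occurrence is receipt, second is delivery.
REPEATED_HEADERS = {"Loc": "loc", "Loc Name": "loc_name", "Loc/QTI Desc": "loc_qti_desc"}


def normalise(header):
    """Collapse whitespace and map an en dash to a plain hyphen."""
    return " ".join(header.replace("\u2013", "-").split())


def rename_headers(headers):
    """Map the grid's headers to the output column names, in order."""
    seen = Counter()
    renamed = []
    for raw in headers:
        header = normalise(raw)
        if header in REPEATED_HEADERS:
            seen[header] += 1
            if seen[header] > 2:
                raise ValueError(f"unexpected third occurrence of {header!r}")
            prefix = "receipt" if seen[header] == 1 else "delivery"
            renamed.append(f"{prefix}_{REPEATED_HEADERS[header]}")
        else:
            # A KeyError here flags a header that is not in the schema.
            renamed.append(UNIQUE_HEADERS[header])
    return renamed


sample = ["K Stat", "Loc", "Loc Name", "Loc/QTI Desc", "Loc", "Loc Name", "Loc/QTI Desc", "Loc Ind"]
print(rename_headers(sample))
```

Counting occurrences rather than relying on fixed positions means the mapping still works if the site reorders the non-location columns.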
Timeline:
You may complete this job any time and submit any required files to the linked GitHub repository
within one week of accepting the job.
Please submit your code here: https://github.com/international-data-repository-cpd/scrape-12871
Submission Files:
Sample.csv for sample data
A requirements.txt
scrape/ - containing all of the source code
Main file: scrape.py, which will be run with an output $filename.
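A minimal sketch of how scrape.py might accept the output filename is shown below. The function names and the use of argparse are illustrative choices, not requirements of the job. In the real script, main() would normally be called under the usual `if __name__ == "__main__":` guard so that it reads the filename from the command line.

```python
import argparse


def parse_args(argv=None):
    """Parse the single positional argument: the output CSV path."""
    parser = argparse.ArgumentParser(description="GTN interruptible contract scraper")
    parser.add_argument("filename", help="path of the output CSV")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # The scrape and the CSV export would run here, writing to args.filename.
    return args.filename


# Equivalent to running: python scrape.py summary.csv
print(main(["summary.csv"]))
```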
Job Schema/Output Format:
You should save the output csv using these settings from a pandas DataFrame:
encoding="utf-8",
line_terminator="\n",
quotechar='"',
quoting=csv.QUOTE_ALL,
index=False
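The settings above can be passed to DataFrame.to_csv as sketched here. One caveat: pandas 1.5 renamed the line_terminator keyword to lineterminator and pandas 2.0 removed the old spelling, so this sketch tries the new name first and falls back to the old one. The function name save_summary and the sample frame are illustrative only.

```python
import csv
import io

import pandas as pd


def save_summary(df, path_or_buf):
    """Write df as CSV with the settings required by the job schema."""
    options = dict(
        encoding="utf-8",
        quotechar='"',
        quoting=csv.QUOTE_ALL,
        index=False,
    )
    try:
        # Keyword name in pandas 1.5 and later.
        df.to_csv(path_or_buf, lineterminator="\n", **options)
    except TypeError:
        # Older pandas releases only know the original spelling.
        df.to_csv(path_or_buf, line_terminator="\n", **options)


# Illustrative frame written to an in-memory buffer instead of a file.
buf = io.StringIO()
save_summary(pd.DataFrame({"k_holder": ["118638852"], "it_qty_k": [30000]}), buf)
print(buf.getvalue())
```

With QUOTE_ALL every field is quoted, including the header row and numeric values.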
Runtime Environment:
Your code will be copied from the root to /usr/src/scrape
You should feel free to modify the requirements as you need. However, you must keep the
awscli dependency.
You may also upload additional binaries into the repository root and reference them
there.
Please do not change the Dockerfile or shell scripts in the repository as this will cause
automated test failure.
python scrape.py $filename
Page access limitations (max requests / day):
10% of website traffic max
If you encounter a captcha during your scrape job, please contact the job poster before continuing.