
Additional functionality to core crawler#47

Merged
MikeMeliz merged 11 commits into master from resolve-merge-conflict-imp-crawler
Dec 29, 2025

MikeMeliz (Owner) commented Dec 29, 2025

Description

This PR introduces:

  • Regex-based link catching, which fulfills a long-standing TODO
    • Custom patterns can be supplied in a separate file called regex_patterns.txt
  • Better support for images, with the option of a separate output for them
  • Better support for scripts, with the option of a separate output for them
  • Two new arguments, in preparation for visualization:
    • --json : generates a JSON file with the output, alongside the txt output
    • --xml : generates an XML file with the output, alongside the txt output
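
For reviewers who want a feel for the regex-based link catching, here is a minimal sketch rather than the code in this PR. The function names (`load_patterns`, `extract_links`), the default pattern, and the file format (one pattern per line, `#` for comments) are all assumptions made for illustration; only the file name regex_patterns.txt comes from the description above.

```python
import re
from pathlib import Path

# Hypothetical fallback pattern, used when no pattern file is present.
DEFAULT_PATTERNS = [r"https?://[^\s\"'<>]+"]


def load_patterns(path="regex_patterns.txt"):
    """Compile patterns from a file, assuming one regex per line.

    Blank lines and lines starting with '#' are skipped (an assumed
    convention, not confirmed by the PR).
    """
    file = Path(path)
    raw = list(DEFAULT_PATTERNS)
    if file.is_file():
        lines = file.read_text(encoding="utf-8").splitlines()
        custom = [ln.strip() for ln in lines]
        custom = [ln for ln in custom if ln and not ln.startswith("#")]
        if custom:
            raw = custom
    return [re.compile(p) for p in raw]


def extract_links(text, patterns):
    """Return unique matches in order of first appearance."""
    seen = set()
    found = []
    for pattern in patterns:
        for match in pattern.finditer(text):
            link = match.group(0)
            if link not in seen:
                seen.add(link)
                found.append(link)
    return found
```

Calling `extract_links(page_source, load_patterns())` on fetched page text would then yield links that plain anchor-tag parsing can miss, such as URLs embedded in inline scripts or text nodes.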

Motivation and Context

Works through a long-standing TODO list and closes gaps with other solutions.
Takes a minimal approach to addressing #9.

How Has This Been Tested?

python3 torcrawl.py -w -u www.google.com -c -d 1 -p 1 -j
python3 torcrawl.py -w -u www.google.com -c -d 1 -p 1 -x
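
The two runs above exercise the JSON and XML outputs. As a rough illustration of what serializing crawl results into both formats can look like, here is a sketch using only the standard library. The result layout (a dict with `links`, `images`, and `scripts` lists), the element names, and the helper names `to_json` and `to_xml` are hypothetical and do not necessarily match the schema this PR writes.

```python
import json
import xml.etree.ElementTree as ET


def to_json(results):
    """Serialize a {category: [urls]} mapping to a JSON string."""
    return json.dumps(results, indent=2)


def to_xml(results):
    """Serialize the same mapping to XML, one <item> per URL.

    The <crawl> root and <item> children are assumed names.
    """
    root = ET.Element("crawl")
    for category, items in results.items():
        group = ET.SubElement(root, category)
        for item in items:
            ET.SubElement(group, "item").text = item
    return ET.tostring(root, encoding="unicode")


# Example data; the URLs are placeholders.
results = {
    "links": ["http://example.com/about"],
    "images": ["http://example.com/logo.png"],
    "scripts": ["http://example.com/app.js"],
}
json_text = to_json(results)
xml_text = to_xml(results)
```

Keeping both serializers on the same in-memory structure means the txt, JSON, and XML outputs cannot drift apart in content.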

========================================== tests coverage ===========================================
__________________________ coverage: platform darwin, python 3.9.6-final-0 __________________________

Coverage XML written to file coverage.xml
======================================== 22 passed in 1.22s =========================================
  py: OK (1.44=setup[0.03]+cmd[1.41] seconds)
  congratulations :) (1.45 seconds)

Screenshots (if appropriate):

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.

MikeMeliz self-assigned this Dec 29, 2025

MikeMeliz marked this pull request as ready for review December 29, 2025 14:17
MikeMeliz merged commit b9a71a0 into master Dec 29, 2025 (9 checks passed)
MikeMeliz deleted the resolve-merge-conflict-imp-crawler branch December 29, 2025 14:20