File name mismatch when using the Extract option #26

@Alessi0X

Describe the bug
With the extract option (-e), the file name the extractor reads does not match the file name the crawler writes: torcrawl.py expects a file called links.txt, but the crawler writes one named <date>_links.txt.

To Reproduce
To reproduce the problem, run one of the examples from the homepage (after minor modifications):
python3 torcrawl.py -v -u http://www.github.com/ -c -d 2 -p 0 -e -w
The output will be:

## Your IP: A.B.C.D.
## URL: http://www.github.com/
## Folder created: www.github.com
## Crawler started from http://www.github.com/ with 2 depth crawl, and 0 second(s) delay.
## Step 1 completed with: 40 result(s)
## Step 2 completed with: 857 result(s)
## File created on /Users/user/TorCrawl.py/www.github.com/links.txt
Error: [Errno 2] No such file or directory: 'www.github.com/links.txt'
## Can't open: www.github.com/links.txt
Traceback (most recent call last):
  File "/Users/user/TorCrawl.py/torcrawl.py", line 210, in <module>
    main()
  File "/Users/user/TorCrawl.py/torcrawl.py", line 199, in main
    extractor(
  File "/Users/user/TorCrawl.py/modules/extractor.py", line 206, in extractor
    cinex(input_file, out_path, selection_yara)
  File "/Users/user/TorCrawl.py/modules/extractor.py", line 72, in cinex
    for line in file:
TypeError: 'type' object is not iterable

Indeed, the newly created www.github.com folder contains a file called 20240626_links.txt rather than links.txt.

Expected behavior
The extractor should read the links file that the crawler just wrote, and the TypeError should not appear.
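
Independently of the file name, the output above shows that the failed open is reported ("Can't open") and execution then continues into the loop that raises the TypeError. A minimal sketch of a more defensive read follows. It is not the code in modules/extractor.py: `read_links` is a hypothetical helper, and returning an empty list on failure is an assumption about the desired behavior.

```python
import os
import tempfile


def read_links(input_file):
    """Return the non-empty lines of input_file, or [] if it cannot be opened.

    Hypothetical helper for illustration; not part of TorCrawl.py.
    """
    try:
        with open(input_file, "r", encoding="utf-8") as handle:
            # Strip whitespace and skip blank lines.
            return [line.strip() for line in handle if line.strip()]
    except OSError as err:
        # Report the problem and stop here, so the caller never iterates
        # over something that is not an open file.
        print(f"## Can't open: {input_file} ({err})")
        return []


# Usage: a missing file yields an empty list instead of a TypeError.
missing = read_links(os.path.join("www.github.com", "links.txt"))

# Usage: an existing file yields its links.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "20240626_links.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("http://www.github.com/\n\nhttp://www.github.com/about\n")
    found = read_links(path)
```

With this shape, the caller can check for an empty result and skip extraction with a clear message.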

Desktop (please complete the following information):

  • OS: macOS 14.5
  • Python Version: 3.12.4

Fix
The fix is straightforward. In torcrawl.py, the block

        if args.extract:
            input_file = out_path + "/links.txt"
            extractor(
                website, args.crawl, output_file, input_file, out_path, selection_yara
            )

should be replaced with

        if args.extract:
            input_file = out_path + "/" + now + "_links.txt"
            extractor(
                website, args.crawl, output_file, input_file, out_path, selection_yara
            )
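
To keep the writer and the reader from drifting apart again, the file name could be built in one place. The sketch below assumes that `now` is a YYYYMMDD string (inferred from 20240626_links.txt); `links_path` is a hypothetical helper, not an existing function in TorCrawl.py.

```python
import os
from datetime import datetime


def links_path(out_path, now=None):
    """Return the path of the dated links file inside out_path.

    Hypothetical helper for illustration; not part of TorCrawl.py.
    """
    # Assumption: the date prefix has the form YYYYMMDD, e.g. 20240626.
    if now is None:
        now = datetime.now().strftime("%Y%m%d")
    return os.path.join(out_path, now + "_links.txt")


# Usage: both the crawler (when writing) and torcrawl.py (when setting
# input_file for the extractor) would call the same function.
input_file = links_path("www.github.com", "20240626")
print(input_file)
```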
