
Web Crawler

The Web Crawler is a Python-based tool that automatically spiders a web site. It also looks for directory indexing and, when it finds directories with indexing enabled, crawls them again to list all the files they contain. An option lets you download the files it finds, so the crawler can be used with FOCA or other software to extract metadata from those files.
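
As an illustration of the directory indexing idea described above, here is a minimal sketch in Python. It is not the crawler's actual code: it assumes Apache-style listing pages whose title starts with "Index of /", and the function names `looks_like_directory_index` and `list_index_entries` are hypothetical.

```python
import re

# Assumption: an indexed directory is served as an Apache-style listing
# whose <title> starts with "Index of /". Other servers may differ.
_INDEX_TITLE = re.compile(r"<title>\s*Index of /", re.IGNORECASE)
# Quoted href values; values starting with '?' or '#' (sort links,
# fragments) cannot match because the first character is excluded.
_HREF = re.compile(r'href\s*=\s*["\']([^"\'#?]+)["\']', re.IGNORECASE)


def looks_like_directory_index(html):
    """Return True if the page looks like an auto-generated listing."""
    return bool(_INDEX_TITLE.search(html))


def list_index_entries(html):
    """Split the entries of a listing page into (subdirectories, files)."""
    dirs, files = [], []
    for target in _HREF.findall(html):
        # Skip the parent-directory link and any absolute link.
        if target.startswith(("/", "../")) or "://" in target:
            continue
        # In a listing, subdirectories end with a slash.
        (dirs if target.endswith("/") else files).append(target)
    return dirs, files
```

A crawler built this way would call `list_index_entries` on every page that passes `looks_like_directory_index`, then crawl each subdirectory again and record each file.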

The current stable version is 0.4. Its main features are:
  • Crawls HTTP and HTTPS web sites.
  • Crawls HTTP and HTTPS web sites on non-standard ports.
  • Uses regular expressions to find 'href' and 'src' HTML attributes, as well as links in the page content.
  • Identifies relative links.
  • Identifies domain related emails.
  • Identifies directory indexing.
  • Detects references to URLs like 'file:', 'feed=', 'mailto:', 'javascript:' and others.
  • Lets you press CTRL-C to stop the current crawler stage and continue working.
  • Identifies file extensions (zip, swf, sql, rar, etc.)
  • Downloads files to a directory:
    • Downloads every important file (images, documents, compressed files, etc.).
    • Or downloads only the specified file types.
    • Or downloads a predefined set of files (such as 'document' files: .doc, .xls, .pdf, .odt, .gnumeric, etc.).
  • Limits the maximum number of links to crawl. The default is 5000 URLs.
  • Follows redirections made with HTML, with JavaScript 'location', and with HTTP response codes.
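
Several of the features above (regular expressions over 'href' and 'src', relative links, skipped 'mailto:' and 'javascript:' references, and the link limit) can be sketched together in a few lines of Python. This is only an illustration under stated assumptions, not the crawler's own implementation: the pattern handles quoted attribute values only, the sketch keeps only links on the same host as the starting page, and the names `extract_links` and `MAX_LINKS` are hypothetical.

```python
import re
from urllib.parse import urljoin, urlparse

# Assumption: only quoted href/src attribute values are matched.
_LINK = re.compile(r'(?:href|src)\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)
# References that are detected but not crawled.
_SKIPPED_PREFIXES = ("mailto:", "javascript:", "file:")
MAX_LINKS = 5000  # default limit mentioned in the feature list


def extract_links(base_url, html, max_links=MAX_LINKS):
    """Return absolute, same-host URLs found in html, in order, no duplicates."""
    host = urlparse(base_url).netloc
    found = []
    for raw in _LINK.findall(html):
        raw = raw.strip()
        if raw.lower().startswith(_SKIPPED_PREFIXES):
            continue
        # urljoin resolves relative links against the page URL.
        absolute = urljoin(base_url, raw)
        # Assumption: the crawl stays on the host (and port) it started on.
        if urlparse(absolute).netloc != host:
            continue
        if absolute not in found:
            found.append(absolute)
        if len(found) >= max_links:
            break
    return found
```

Comparing `netloc` rather than the bare host name keeps a site on an uncommon port (for example `example.com:8443`) separate from the same host on the default port.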
Note: This crawler can be used with the Domain Analyzer Security Tool (see Domain Analyzer).


To install it, just copy the Python file to the /usr/bin directory. The crawler does not need to run as root.


See the Archives section or the Attachments below. It is also available at SourceForge!


Please report bugs to [mateslab at gmail dot com]


If you have any questions, please send us an email! You can find our addresses in the Python files.


Veronica Valeros,
May 18, 2011, 2:54 PM