Content Discovery

INFORMATION GATHERING

Content here does not refer to what is visible on the website; we are referring to content such as files, pictures, pages, admin panels, databases, videos, or any other piece of information that is not always intended for public access.

There are three main ways that this sort of content can be discovered:

  1. Manually

  2. Automated

  3. OSINT (Open Source INTelligence)

MANUAL DISCOVERY

ROBOTS.TXT

Robots.txt is a file that websites use to tell search engine crawlers which pages are allowed to show up in search results and which pages should not be crawled or indexed.

Finding this file is as simple as adding /robots.txt to the end of the root web address.

Try it with some websites to see what pages are blocked from appearing in search results.

You can probably imagine that if a developer wants to hide certain pages, those pages may contain information that could compromise the website's security if accessed, or may otherwise hold sensitive or private data. Knowing which pages are listed in the robots.txt file is a great way to identify promising targets if your goal is to access hidden pages.
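
As a quick illustration (example.com and the paths shown are hypothetical), robots.txt is just a plain-text list of crawler rules, and you can fetch it with curl:

curl https://example.com/robots.txt

A typical response might look something like:

User-agent: *
Disallow: /admin/
Disallow: /backups/
Sitemap: https://example.com/sitemap.xml

Each Disallow line names a path the owner does not want crawled, which makes those paths natural starting points for manual discovery.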

SITEMAP.XML

Unlike robots.txt, the sitemap.xml file provides a list of every page that the developer wants to be listed on a search engine.

This gives you a fairly comprehensive map of the site, which can include areas that are hard to find just by browsing or navigating, as well as old pages that are no longer in use or linked anywhere but are still listed in the sitemap.
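
For reference, a minimal sitemap.xml (the example.com URL is a placeholder) lists each page in its own <url> entry:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/old-login.html</loc>
    <lastmod>2019-04-01</lastmod>
  </url>
</urlset>

As with robots.txt, you can usually retrieve it by appending /sitemap.xml to the root web address.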

FAVICON HUNTING

A favicon is the small icon displayed in a browser's address bar or tab, used for branding.

Favicons usually correspond to a company's logo, but when a framework is used to create a website, the framework's default favicon is often left over if it isn't manually removed or changed. If this is the case, then just by looking at the favicon we can work out what framework was used to build the site, using the following method:

A. DOWNLOAD THE FAVICON

If you suspect that a site is using a leftover favicon from a framework, first we will need to download the favicon image of the site.

There are typically two ways that developers specify the favicon for a website:

1) They can give a link to the favicon in the source code: <link rel="icon" type="image/png" href="/somepath/favicon.png" />

2) They can serve the favicon from a URL path relative to the server root, something like: http://www.website.com/favicon.ico

Most browsers look for the favicon at the path /favicon.ico, so it's always smart to check there first. Just type the web address followed by /favicon.ico and you should land on a page that displays nothing but the site's favicon. Save the file.

If this method doesn't work, check the source code. Look for syntax similar to the example above, keeping in mind that not all favicons are applied with exactly the same markup. Once you find the favicon image link in the source code, you can download it.
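
A minimal sketch of both approaches using curl (example.com and the output file name are placeholders):

# Try the default path first
curl -o favicon.ico https://example.com/favicon.ico

# If that fails, pull the page source and look for a <link rel="icon" ...> tag
curl -s https://example.com/ | grep -i "icon"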

B. FIND THE FAVICON'S HASH VALUE

The OWASP Foundation maintains a website (linked in the next step) with a comprehensive list of framework favicons and their hash values. Since there's no reliable way to search for a matching image directly, we will obtain the favicon's hash value and compare it to the hashes on that website.

There are many ways to obtain a hash value, but one of the easiest is by entering the following command in a Linux terminal: md5sum [path to the file, including its file name and extension] -- NOTE: You can also drag the file into the terminal window instead of typing the full path.

Example from Geeks for Geeks:

Input : md5sum /home/mandeep/test/test.cpp

Output : c6779ec2960296ed9a04f08d67f64422 /home/mandeep/test/test.cpp

If you're not using Linux, however, there are other ways to obtain the md5 hash value.
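
If you'd rather not save the file separately, you can also download the favicon and hash it in one step, since md5sum reads from standard input when no file name is given. A sketch with a placeholder URL:

curl -s https://example.com/favicon.ico | md5sum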

C. MATCH THE FAVICON

The OWASP Foundation has put together a website, mentioned in the previous section, that will allow us to match the hash value of the favicon we obtained to their list. This will help us find the website's associated framework or service.

Navigate to https://wiki.owasp.org/index.php/OWASP_favicon_database and see if any of the hash values listed match the one you obtained. If so, then you've just found the name of the framework or service associated with the site's favicon and you can use this information to then guide your search.

For example, if the site's favicon hash is 6cec5a9c106d45e458fc680f70df91b0, you can see in this list that the favicon comes from an obsolete version of WordPress, a popular platform for creating websites. You can then take advantage of any security holes in that outdated version. Even if the framework isn't obsolete or compromised, just knowing what it is gives you extra information that you can use to further your investigation. Remember, the first stage of pen testing is info gathering. You never know what will be valuable later on.

HTTP HEADERS

Anytime a request is made to a web server, the server replies with various HTTP headers. These headers contain a lot of useful information such as the type and version of the web server that's hosting the site, the scripting language in use, and other types of data.

You can use this site, https://hackertarget.com/http-header-check/, to check for headers. You can also use the dev tools on a browser and check the networking tab.

If you're on Linux, you can simply run the curl command. The syntax is as follows:

curl --head http://website.com

Or, you can run a verbose option to see the HTTP request headers as well as other information:

curl -v http://google.com/
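
The exact headers vary from server to server, but a hypothetical response might look something like the following, where the Server and X-Powered-By lines reveal the web server and scripting language in use:

HTTP/1.1 200 OK
Server: Apache/2.4.41 (Ubuntu)
X-Powered-By: PHP/7.4.3
Content-Type: text/html; charset=UTF-8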


AUTOMATED DISCOVERY

Automated discovery methods are those which use pre-made tools to discover content automatically instead of having to do it by hand.

Many of these tools use what is known as a wordlist: a text file with a long list of commonly used words or terms for whatever purpose it was created. Password wordlists, for example, contain common passwords that attackers feed into brute-force tools during a dictionary attack, and rainbow tables extend the idea by pairing passwords with their precomputed hashes to speed up cracking.

For the purposes of web application content discovery, we are more interested in wordlists that contain common terms for desirable web content, such as common directory URLs.

You can check out https://wordlists.assetnote.io/ for downloadable wordlists that update monthly.
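
To give an idea of the format, a directory-discovery wordlist is just one candidate path per line; the tools below append each entry to the base URL and check whether it exists. A few typical entries:

admin
backup
config
login
uploads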

When it comes to automation, Linux really shines: many of these services and tools can be installed and updated directly from the terminal and then run from that same terminal. For these reasons, this section will focus on using these tools on Linux.

FFUF

From Codingo.io: "FFUF, or “Fuzz Faster you Fool” is an open source web fuzzing tool, intended for discovering elements and content within web applications, or web servers. Often when you visit a website you will be presented with the content that the owner of the website wants to serve you with, this could be hosted at a page such as index.php. Within security, often the challenges in a website that need to be corrected exist outside of that. For example, the owner of the website may have content hosted at admin.php, that you both want to know about, and test. FFUF is a tool for uncovering those items, for your perusal."

FFUF is open source, and can be found at: https://github.com/ffuf/ffuf

FFUF is a command-line driven application, which means that, like most of these tools, you won't get a GUI; instead you will need to know certain commands and parameters to run your searches.

For detailed information on how to use FFUF check out the comprehensive guide at: https://codingo.io/tools/ffuf/bounty/2020/09/17/everything-you-need-to-know-about-ffuf.html
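
As a minimal sketch (the target URL and wordlist path are placeholders), FFUF substitutes each wordlist entry for the FUZZ keyword in the URL and reports which requests get a response:

ffuf -w /path/to/wordlist.txt -u https://example.com/FUZZ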


RECONKY

"Reconky is a script written in bash to automate the task of recon and information gathering. This Bash script allows you to collect some information that will help you identify what to do next and where to look for the required target." source

Here are some of its main features:

  • Gathers subdomains with assetfinder and Sublist3r

  • Runs a second (duplex) check on subdomains using amass

  • Enumerates subdomains on a target domain through a dictionary attack using Knockpy

  • Searches for live domains using httprobe

  • Checks for possible subdomain takeovers

  • Scans for open ports using nmap

  • Pulls and assembles all possible parameters found in wayback_url data

  • Pulls and compiles json/js/php/aspx files from the wayback output

  • Runs EyeWitness against all the compiled (alive) domains

Install instructions: https://www.geeksforgeeks.org/reconky-content-discovery-tool/

GitHub: https://github.com/ShivamRai2003/Reconky-Automated_Bash_Script


DIRB

"DIRB is a Web Content Scanner. It looks for existing (and/or hidden) Web Objects. It basically works by launching a dictionary based attack against a web server and analyzing the responses.

DIRB comes with a set of preconfigured attack wordlists for easy usage but you can use your custom wordlists. Also DIRB sometimes can be used as a classic CGI scanner, but remember that it is a content scanner not a vulnerability scanner." - from https://www.kali.org/tools/dirb/

For installation instructions and usage, check out the Kali dirb page as well as this detailed how-to.
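
A minimal sketch of a typical run (the target URL is a placeholder, and the wordlist path shown is where Kali typically installs DIRB's bundled common.txt); if you omit the wordlist, DIRB should fall back to its default list:

dirb https://example.com /usr/share/dirb/wordlists/common.txt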


GOBUSTER

Gobuster is a tool for brute-forcing URIs including directories and files as well as DNS subdomains.

Check out the Kali Gobuster page as well as the Gobuster GitHub repository.
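
A minimal sketch of both modes (the URLs, domain, and wordlist paths are placeholders):

gobuster dir -u https://example.com -w /path/to/wordlist.txt

gobuster dns -d example.com -w /path/to/subdomains.txt

The dir mode brute-forces directories and files on the web server, while the dns mode brute-forces subdomains of the target domain.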


BURP SUITE

Another powerful tool is Burp Suite, a professional toolkit used for scalable, automated web vulnerability scanning.

The Professional and Enterprise editions are paid toolkits used by professionals and companies who want a hefty toolset that does a lot of the heavy lifting for you; a free Community Edition with a reduced feature set is also available.

If you're interested in hearing more about Burp Suite and other web app pen testing tools, check out this article.

OSINT

GOOGLE HACKING/DORKING

Most people never realize how much more powerful and precise a Google search can be once you know some "dorking."

These techniques focus on using advanced operators in the Google search engine to find security holes that a regular search would never reveal.

There are many basic and advanced operators, which can be found at https://en.wikipedia.org/wiki/Google_hacking, but let's take a closer look at some specific uses.

  • SITE - ex: site:wikipedia.org (searches only the specified website, ignoring results from other sites)

  • INURL - ex: inurl:admin (this will return only the results that have the specified word in the URL)

  • FILETYPE - ex: filetype:pdf (this will return only results that match the specified file extension)

  • INTITLE - ex: intitle:admin (this will return only results that contain the specified word in the title)
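
These operators can also be combined in a single query. For example (the domain is a placeholder), the following search returns only PDF files indexed on the target site whose titles contain the word "confidential":

site:example.com filetype:pdf intitle:confidential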

WAPPALYZER

Wappalyzer is a handy online tool and browser extension that can identify what technologies are associated with a particular website. It can easily reveal things like frameworks, version numbers, Content Management Systems (CMS), payment processors, and more.

Check it out at https://www.wappalyzer.com/


WAYBACK MACHINE

The Wayback Machine is a historical repository of websites dating back to the late 1990s. Think of it as a time machine for the web. If you enter a website's address at https://archive.org/web/ you can browse the historical versions of that site that were saved each time the service crawled the page and archived its content.

This tool can help you discover older pages that may not otherwise be visible.
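
You can also pull a site's capture history programmatically through the Wayback Machine's public CDX API; a sketch with a placeholder domain (the query parameters shown are common ones for listing each archived URL once):

curl "https://web.archive.org/cdx/search/cdx?url=example.com/*&fl=original&collapse=urlkey"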

GITHUB

GitHub is a hosted service for Git, a version control system used to manage repositories of code. Developers use GitHub and similar services to track, pull, and publish changes to large collaborative coding projects, and individual repositories can be made either public or private.

Searching for a company name or website on GitHub can often reveal repositories belonging to your target. Depending on the privacy configuration and security of a repository, you may be able to access source code, passwords, or other important content.
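
A couple of hypothetical query sketches you could type into the GitHub search bar (the organization name and domain are placeholders); the org: qualifier limits results to a single organization's repositories:

org:example-corp password

"example.com" api_key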

S3 BUCKETS

S3 buckets are part of a cloud storage service offered by Amazon AWS. These buckets allow users to save content such as files and websites to the cloud. The bucket's owner can set access permissions to make this content private or public, and in some cases the content can even be writable.

To err is human, so careless owners sometimes overlook the security settings of these buckets and leave them publicly readable, or even writable. Some companies have had important information exposed by hackers exploiting misconfigured buckets.

The syntax to access a bucket is: http(s)://{name}.s3.amazonaws.com

The {name} is set by the bucket's owner, and the bucket may be reachable over either the HTTP or HTTPS protocol, depending on its settings.
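
A quick way to check whether a bucket permits public access; a sketch with a placeholder bucket name (a public bucket with listing enabled returns an XML index of its contents, and the AWS CLI's --no-sign-request flag sends the request without credentials):

curl https://somebucket.s3.amazonaws.com/

aws s3 ls s3://somebucket --no-sign-request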