What it’s all about?
This article is about scraping hidden content and hyperlinks on websites: links that usually go unnoticed by the human eye while browsing, but that can potentially expose confidential data such as database credentials or private images. I will show you how to create a simple web crawler, using basic bash commands, that can scrape an entire website.
If you want to know more about crawling, you can check here.
I am going to use GNU’s wget to write a one-line spider/bot that can download an entire website to your local box (if you are on a Debian-based Linux system, install wget by running ‘sudo apt-get install wget’).
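On other platforms the package name is the same but the package manager differs; for example (assuming you are on one of these systems):

sudo dnf install wget   # Fedora/RHEL
brew install wget       # macOS with Homebrew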
wget is a nice tool for downloading resources from the internet. The basic usage is wget url (e.g. wget http://test.com/). The power of wget is that it can download sites recursively, meaning you also get all pages (and images and other data) linked from the front page:
wget -r http://www.test.com
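A fully recursive download can get large quickly. If you only want to go a few levels deep, wget’s -l option caps the recursion depth, and -k rewrites the links afterwards so the local copy browses correctly; a minimal variant (the depth of 2 is just an example):

wget -r -l 2 -k http://www.test.com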
However, many sites do not want you to download their entire site. To prevent this, they check how the client identifies itself. Many sites refuse the connection or send a blank page if they detect you are not using a web browser. You might get a message like:
Sorry, but the download manager you are using to view this site is not supported. We do not support use of such download managers as flashget, go!zilla, or getright
Wget has a very handy -U option for sites like this. Use -U My-browser to tell the site you are using some commonly accepted browser:
wget -r -p -U Mozilla http://www.test.com
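A bare ‘Mozilla’ token may not satisfy stricter checks; passing a full User-Agent string is more convincing. The string below is just an example, any realistic browser identifier works:

wget -r -p -U "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0" http://www.test.com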
The most important command line options are --limit-rate= and --wait=. You should add --wait=20 to pause 20 seconds between retrievals; this makes sure you are not manually added to a blacklist. --limit-rate defaults to bytes; add K to set KB/s.
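wget can also randomize the pause with --random-wait, which varies each delay around the --wait value so the request pattern looks less robotic; for example:

wget -r --wait=20 --random-wait --limit-rate=20K http://www.test.com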
A website owner will probably get upset if you attempt to download their entire site using a plain wget http://foo.com command. However, the owner will probably not even notice you if you limit the download transfer rate and pause between fetching files.
So the final version of our mini-crawler would be:
wget --wait=20 --limit-rate=20K -r -p -U Mozilla http://www.test.com
Running the above command downloads the entire website to your machine, which means you have pretty much all of the target site’s source; if you are lucky, you may even find data-source credentials in it. Here is how my mini-crawler created a directory named www.test.com (the target website) and dumped all of the site’s source into it.
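Once the mirror is on disk, ordinary shell tools can surface the hidden material this article opened with. A minimal sketch; the directory name matches the target above, and the credential patterns are assumptions you should adapt:

# list every hyperlink found in the downloaded pages, including ones never rendered on screen
grep -rhoE 'href="[^"]*"' www.test.com/ | sort -u

# hunt for credential-like strings left behind in the source
grep -rniE 'password|passwd|db_user|secret' www.test.com/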
Possible Improvements: The above mini-crawler can completely download a typical site built on WordPress (which reportedly powers roughly 22% of active websites in the USA). You could take the crawler to the next level by adding loop detection, timeouts and more, as sketched below. Ciao!
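As a starting point for those improvements, wget already exposes a few of them as flags; a sketch (the specific values are arbitrary):

wget --wait=20 --random-wait --limit-rate=20K --timeout=30 --tries=3 -l 5 --no-parent -r -p -U Mozilla http://www.test.com

Here --timeout and --tries guard against hung connections, -l caps the recursion depth (a crude form of loop control), and --no-parent keeps the crawl from wandering above the starting directory.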