Nutch Tutorial on Ubuntu (10 easy steps)

posted Mar 27, 2012, 9:18 PM by Swapnil Kulkarni   [ updated Jun 23, 2012, 3:05 PM ]

Follow these 10 steps to set up Nutch and crawl your site to create your own Web DB

Step 1:
Download latest binaries from here:

Step 2:
Make required directories
sudo mkdir /usr/local/nutch
sudo mkdir /usr/local/nutch/framework
sudo mkdir /usr/local/nutch/dist

Step 3:
Copy to dist
sudo cp apache-nutch-1.4-bin.tar.gz /usr/local/nutch/dist/

Step 4:
Extract to framework
sudo tar -xvzf apache-nutch-1.4-bin.tar.gz -C /usr/local/nutch/framework/

Step 5:
Make executable
sudo chmod +x /usr/local/nutch/framework/apache-nutch-1.4-bin/runtime/local/bin/nutch

Step 6:
Make seed URL file
sudo mkdir -p /usr/local/nutch/framework/apache-nutch-1.4-bin/runtime/local/bin/urls
sudo gedit /usr/local/nutch/framework/apache-nutch-1.4-bin/runtime/local/bin/urls/nutch.txt

Add your seed URLs to nutch.txt, one per line
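
If you prefer the shell to an editor, the seed file can be written directly. This is only a minimal sketch: http://www.example.com/ is a placeholder rather than the site from this post, and SEED_DIR is a made-up variable that defaults to a scratch directory so the snippet can be tried without sudo. Point it at the urls directory from Step 6 for real use.

```shell
#!/bin/sh
# SEED_DIR is an assumption for this sketch; set it to the urls directory
# created in Step 6 (writing there needs root) to use it for real.
SEED_DIR="${SEED_DIR:-$(mktemp -d)}"
mkdir -p "$SEED_DIR"

# One URL per line; the address below is a placeholder.
printf '%s\n' 'http://www.example.com/' > "$SEED_DIR/nutch.txt"

# Show what was written.
cat "$SEED_DIR/nutch.txt"
```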

Step 7:
Add Agent
sudo gedit /usr/local/nutch/framework/apache-nutch-1.4-bin/runtime/local/conf/nutch-site.xml

Add this inside the <configuration> element
<property>
<name>http.agent.name</name>
<value>My Spider</value>
</property>

Step 8:
Edit regex
sudo gedit /usr/local/nutch/framework/apache-nutch-1.4-bin/runtime/local/conf/regex-urlfilter.txt


Replace this
# accept anything else

By this
# accept anything else

Then add

Step 9:
Set up JDK & set JAVA_HOME
sudo add-apt-repository ppa:ferramroberto/java
sudo apt-get update
sudo apt-get install sun-java6-jdk sun-java6-jre sun-java6-plugin sun-java6-fonts
export JAVA_HOME=/usr
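
Setting JAVA_HOME to /usr works when the java binary is reachable as /usr/bin/java, since the nutch launcher script looks for bin/java under JAVA_HOME. If you would rather point JAVA_HOME at the actual JDK/JRE directory, here is a minimal sketch. It assumes GNU readlink (standard on Ubuntu), and java_home_from is a helper name invented for this example.

```shell
#!/bin/sh
# java_home_from PATH: strip the trailing /bin/java from a java binary path.
# (Hypothetical helper, not part of Nutch or the JDK.)
java_home_from() {
  printf '%s\n' "${1%/bin/java}"
}

# If java is installed, resolve its symlinks to the real binary
# and export the directory two levels up as JAVA_HOME.
if command -v java >/dev/null 2>&1; then
  JAVA_HOME=$(java_home_from "$(readlink -f "$(command -v java)")")
  export JAVA_HOME
  echo "JAVA_HOME=$JAVA_HOME"
fi
```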

Step 10:
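Start Crawling!
Run the crawl from the bin directory, because the urls seed directory from Step 6 lives there
cd /usr/local/nutch/framework/apache-nutch-1.4-bin/runtime/local/bin
/usr/local/nutch/framework/apache-nutch-1.4-bin/runtime/local/bin/nutch crawl urls -dir crawl -depth 10 -topN 1000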

What Facebook and Google are hiding from the world!

posted Dec 30, 2011, 2:20 AM by Swapnil Kulkarni   [ updated Dec 30, 2011, 2:20 AM ]

Eli Pariser of the progressive organization MoveOn says the Internet is hiding things from us, and we don't even know it. In this TED Talk he calls out Facebook, Google and other corporations that are transforming the Internet to suit their corporate interests.
- Courtesy : TEDx


Google MapsGL

posted Dec 30, 2011, 2:13 AM by Swapnil Kulkarni   [ updated Dec 30, 2011, 2:14 AM ]

Using WebGL technology, Google MapsGL is an experimental project which enables a richer maps experience with immersive 3D buildings, smoother transitions between imagery, fluid transitions into Streetview and more, directly in your browser and all without a plugin.
- Courtesy : Google


Amazon Silk - Revolutionary Cloud-Accelerated Web Browser

posted Oct 1, 2011, 12:42 AM by Swapnil Kulkarni   [ updated Oct 1, 2011, 12:47 AM ]

Amazon Silk is a web browser developed by Amazon for Kindle Fire. It uses a split architecture whereby some of the processing is performed on Amazon's servers to improve webpage loading performance. The frontend is based on the WebKit browser engine.
- Courtesy : Amazon & Wikipedia

Amazon Silk

How Google makes improvements to its search algorithm

posted Sep 13, 2011, 6:29 PM by Swapnil Kulkarni   [ updated Sep 20, 2011, 5:51 PM ]

Here's a short video that gives you a sense of the work that goes into changes and improvements to Google's search engine. While an improvement to the algorithm may start with a creative idea, it always goes through a process of rigorous scientific testing.
- Courtesy : Google

Google - Search Algorithm

Google - Flight Search

posted Sep 13, 2011, 6:25 PM by Swapnil Kulkarni   [ updated Sep 20, 2011, 5:36 PM ]

Flight search is a feature that helps you explore air travel options for a number of cities, and plan your trip with just a few clicks of the mouse.
- Courtesy : Google


RockMelt - Your Browser. Re-Imagined.

posted Sep 10, 2011, 4:17 PM by Swapnil Kulkarni   [ updated Sep 20, 2011, 5:36 PM ]

RockMelt is re-imagining your online experience by creating a new web browser that makes it easy to stay in touch with friends, search online, and get updates from your favorite websites.
Try it for yourself!
- Courtesy : RockMelt


Intel's new 3D Transistor

posted Jun 24, 2011, 3:48 AM by Swapnil Kulkarni   [ updated Sep 20, 2011, 5:37 PM ]

In "Mark Bohr Gets Small: 22nm", Mark Bohr explains Intel's new 3D transistor. Enjoy!
- Courtesy : Intel


Khan Academy - Taking education to the next level using simple technology

posted Jun 12, 2011, 10:18 AM by Swapnil Kulkarni   [ updated Jun 23, 2012, 3:06 PM ]

A free world-class education for anyone anywhere...

The Khan Academy is an organization on a mission. They're a not-for-profit with the goal of changing education for the better by providing a free world-class education to anyone anywhere.

All of the site's resources are available to anyone. It doesn't matter if you are a student, teacher, home-schooler, principal, adult returning to the classroom after 20 years, or a friendly alien just trying to get a leg up in earthly biology. The Khan Academy's materials and resources are available to you completely free of charge.
- Courtesy : Khan Academy

Salman Khan with Bill Gates at TED 2011

Google TV

posted Jun 12, 2011, 10:04 AM by Swapnil Kulkarni   [ updated Sep 20, 2011, 5:38 PM ]

Google TV is a Smart TV / Connected TV platform from Google. It was announced on May 20, 2010, at the Google I/O event and was co-developed by Google, Intel, Sony and Logitech. Google TV integrates Google's Android operating system and the Linux version of the Google Chrome browser to create an interactive television overlay on top of existing internet television and WebTV sites, adding a 10-foot user interface.
- Courtesy : Google & Wikipedia


