Tech-Talk

Nutch Tutorial on Ubuntu (10 easy steps)

posted Mar 27, 2012, 9:18 PM by Swapnil Kulkarni   [ updated Jun 23, 2012, 3:05 PM ]

Follow these 10 steps to set up Nutch and crawl your site to create your own Web DB

If you have any questions, drop me an email at mail.swapnilk@gmail.com

Have fun!!

Step 1:
Download the latest binaries from here (the commands below use apache-nutch-1.4-bin.tar.gz):
http://www.apache.org/dyn/closer.cgi/nutch/

Step 2:
Make required directories
sudo mkdir /usr/local/nutch
sudo mkdir /usr/local/nutch/framework
sudo mkdir /usr/local/nutch/dist


Step 3:
Copy to dist
sudo cp apache-nutch-1.4-bin.tar.gz /usr/local/nutch/dist/

Step 4:
Unpack
sudo tar -xvzf apache-nutch-1.4-bin.tar.gz -C /usr/local/nutch/framework/

Step 5:
Make executable
sudo chmod +x /usr/local/nutch/framework/apache-nutch-1.4-bin/runtime/local/bin/nutch

Step 6:
Make seed url file
sudo mkdir -p /usr/local/nutch/framework/apache-nutch-1.4-bin/runtime/local/bin/urls
sudo gedit /usr/local/nutch/framework/apache-nutch-1.4-bin/runtime/local/bin/urls/nutch


Add the following line to the file you just opened (urls/nutch)
http://www.usc.edu/
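
If you would rather not open an editor, the seed file can also be written straight from the shell. This is only a minimal sketch: SEED_DIR is an illustrative variable (not something Nutch defines) that defaults to a local ./urls directory. Point it at the urls directory created above, and add sudo if that location is owned by root.

```shell
# SEED_DIR is a hypothetical variable used only for this sketch.
SEED_DIR="${SEED_DIR:-./urls}"
mkdir -p "$SEED_DIR"

# Write one seed URL per line into the seed file.
printf '%s\n' 'http://www.usc.edu/' > "$SEED_DIR/nutch"

# Show what was written.
cat "$SEED_DIR/nutch"
```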

Step 7:
Add Agent
sudo gedit /usr/local/nutch/framework/apache-nutch-1.4-bin/runtime/local/conf/nutch-site.xml

Add this inside the <configuration> element
<property>
<name>http.agent.name</name>
<value>My Spider</value>
</property>
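
For reference, here is roughly what the whole file could look like once the property is in place. This is a sketch under assumptions: CONF_FILE is an illustrative variable that defaults to a scratch file in the current directory, and the heredoc overwrites whatever is at that path, so only aim it at the real conf/nutch-site.xml if you have no other settings in there.

```shell
# CONF_FILE is a hypothetical variable; the real file lives in runtime/local/conf/.
CONF_FILE="${CONF_FILE:-./nutch-site.xml}"

# Write a minimal configuration containing only the agent name property.
cat > "$CONF_FILE" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>My Spider</value>
  </property>
</configuration>
EOF

# Confirm the property made it into the file.
grep 'http.agent.name' "$CONF_FILE"
```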


Step 8:
Edit regex
sudo gedit /usr/local/nutch/framework/apache-nutch-1.4-bin/runtime/local/conf/regex-urlfilter.txt

Replace

# accept anything else
+.


With this
# accept anything else
#+.


Then add
+^http://([a-z0-9]*\.)*www.usc.edu/
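
Before crawling, it can help to try the filter on a few sample URLs. The sketch below assumes the pattern is written with no embedded spaces, and it uses grep -E as a stand-in for Nutch's own Java regex matching (the two should agree for a pattern this simple, but that is an assumption). check_url is a made-up helper name, and the leading + from the filter file is left out because it only marks the rule as "accept".

```shell
# The leading '+' in regex-urlfilter.txt means "accept"; it is not part of the regex.
PATTERN='^http://([a-z0-9]*\.)*www.usc.edu/'

# check_url is a hypothetical helper: prints accept/reject for one URL.
check_url() {
  if printf '%s\n' "$1" | grep -Eq "$PATTERN"; then
    echo "accept $1"
  else
    echo "reject $1"
  fi
}

check_url 'http://www.usc.edu/'
check_url 'http://www.usc.edu/admission/'
check_url 'http://www.example.com/'
```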

Step 9:
Setup JDK & set JAVA_HOME
sudo add-apt-repository ppa:ferramroberto/java
sudo apt-get update
sudo apt-get install sun-java6-jdk
sudo apt-get install sun-java6-jdk sun-java6-jre sun-java6-plugin sun-java6-fonts
export JAVA_HOME=/usr
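
A quick sanity check for JAVA_HOME: as far as I know, the bin/nutch script starts Java from $JAVA_HOME/bin/java, so that file should exist and be executable. The helper below is a hypothetical name of my own, not part of Nutch or the JDK, and it falls back to /usr when JAVA_HOME is not set.

```shell
# check_java_home is a hypothetical helper: reports whether <dir>/bin/java is executable.
check_java_home() {
  if [ -x "$1/bin/java" ]; then
    echo "ok: $1/bin/java"
  else
    echo "missing: $1/bin/java"
  fi
}

# Check the current setting, falling back to /usr as used in this step.
check_java_home "${JAVA_HOME:-/usr}"
```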


Step 10:
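Start Crawling! Run this from the bin directory, because the urls path in the command is relative to it
cd /usr/local/nutch/framework/apache-nutch-1.4-bin/runtime/local/bin
/usr/local/nutch/framework/apache-nutch-1.4-bin/runtime/local/bin/nutch crawl urls -dir crawl -depth 10 -topN 1000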
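
To get a feel for how big this crawl can get: assuming -depth is the number of crawl rounds and -topN caps the pages fetched in each round, the settings above give a rough upper bound on the total pages fetched. The variable names here are just for illustration.

```shell
# Values taken from the crawl command above.
DEPTH=10
TOPN=1000

# Upper bound, assuming every round fetches the full topN pages.
MAX_PAGES=$((DEPTH * TOPN))
echo "at most $MAX_PAGES pages"
```

For a first test run you may want smaller values for both, then scale up once the crawl works.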

What Facebook and Google are hiding from world!

posted Dec 30, 2011, 2:20 AM by Swapnil Kulkarni   [ updated Dec 30, 2011, 2:20 AM ]

Eli Pariser of the progressive organization MoveOn says the Internet is hiding things from us, and we don't even know it. In this TED Talk he calls out Facebook, Google and other corporations that are transforming the Internet to suit their corporate interests.
- Courtesy : TEDx





Google MapsGL

posted Dec 30, 2011, 2:13 AM by Swapnil Kulkarni   [ updated Dec 30, 2011, 2:14 AM ]

Using WebGL technology, Google MapsGL is an experimental project which enables a richer maps experience with immersive 3D buildings, smoother transitions between imagery, fluid transitions into Street View and more, directly in your browser and all without a plugin.
- Courtesy : Google





Amazon Silk - Revolutionary Cloud-Accelerated Web Browser

posted Oct 1, 2011, 12:42 AM by Swapnil Kulkarni   [ updated Oct 1, 2011, 12:47 AM ]

Amazon Silk is a web browser developed by Amazon for Kindle Fire. It uses a split architecture whereby some of the processing is performed on Amazon's servers to improve webpage loading performance. The frontend is based on the WebKit browser engine.
- Courtesy : Amazon & Wikipedia


Amazon Silk


How Google makes improvements to its search algorithm

posted Sep 13, 2011, 6:29 PM by Swapnil Kulkarni   [ updated Sep 20, 2011, 5:51 PM ]

Here's a short video that gives you a sense of the work that goes into changes and improvements to Google's search engine. While an improvement to the algorithm may start with a creative idea, it always goes through a process of rigorous scientific testing.
- Courtesy : Google


Google - Search Algorithm


Google - Flight Search

posted Sep 13, 2011, 6:25 PM by Swapnil Kulkarni   [ updated Sep 20, 2011, 5:36 PM ]

Flight search is a feature that helps you explore air travel options for a number of cities, and plan your trip with just a few clicks of the mouse.
- Courtesy : Google




RockMelt - Your Browser. Re-Imagined.

posted Sep 10, 2011, 4:17 PM by Swapnil Kulkarni   [ updated Sep 20, 2011, 5:36 PM ]

RockMelt is re-imagining your online experience by creating a new web browser that makes it easy to stay in touch with friends, search online, and get updates from your favorite websites.
Try it for yourself! http://www.rockmelt.com
- Courtesy : Rockmelt


Rockmelt


Intel's new 3D Transistor

posted Jun 24, 2011, 3:48 AM by Swapnil Kulkarni   [ updated Sep 20, 2011, 5:37 PM ]

Mark Bohr gets small to explain 22nm and Intel's new 3D Transistor. Enjoy!
- Courtesy : Intel




Khan Academy - Taking education to next level using simple technology

posted Jun 12, 2011, 10:18 AM by Swapnil Kulkarni   [ updated Jun 23, 2012, 3:06 PM ]

A free world-class education for anyone anywhere...

The Khan Academy is an organization on a mission. They're a not-for-profit with the goal of changing education for the better by providing a free world-class education to anyone anywhere.

All of the site's resources are available to anyone. It doesn't matter if you are a student, teacher, home-schooler, principal, adult returning to the classroom after 20 years, or a friendly alien just trying to get a leg up in earthly biology. The Khan Academy's materials and resources are available to you completely free of charge.
- Courtesy : Khan Academy



Salman Khan with Bill Gates at TED 2011


Google TV

posted Jun 12, 2011, 10:04 AM by Swapnil Kulkarni   [ updated Sep 20, 2011, 5:38 PM ]

Google TV is a Smart TV / Connected TV platform from Google. It was announced on May 20, 2010, at Google’s Google I/O event and was co-developed by Google, Intel, Sony and Logitech. Google TV integrates Google’s Android operating system and the Linux version of the Google Chrome browser to create an interactive television overlay on top of existing internet television and WebTV sites, adding a 10-foot user interface.
- Courtesy : Google & Wikipedia



