How to install Nutch and Solr on Ubuntu 10.04

posted Apr 21, 2011, 10:17 AM by Swapnil Kulkarni   [ updated Jun 23, 2012, 3:13 PM ]
This HOW-TO consists of the following:

  • Installing Solr
  • Installing Nutch
  • Configuring Solr
  • Configuring Nutch
  • Crawling your site
  • Indexing your crawl DB with Solr
  • Searching the crawled content in Solr

Prerequisites

I'll assume that you have an Ubuntu 10.04 server installed and that you are logged in as root while working through this guide:
sudo su -

Installing Solr

Luckily, Solr 1.4 is available in APT!
apt-get install solr-common solr-tomcat tomcat6
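To check that the packaged Solr deployed into Tomcat, you can fetch the admin page (a quick sanity check, assuming the solr-tomcat package mounts the webapp at /solr and Tomcat listens on port 8080, its defaults):
curl -s http://localhost:8080/solr/admin/ | head

If Tomcat is running, this should print the beginning of the Solr admin HTML page.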

Next, install the Tomcat manager application; it will come in handy later:
sudo apt-get install tomcat6-admin

Edit /var/lib/tomcat6/conf/tomcat-users.xml and change this:
<tomcat-users>
<!--
<role rolename="tomcat"/>
<role rolename="role1"/>
<user username="tomcat" password="tomcat" roles="tomcat"/>
<user username="both" password="tomcat" roles="tomcat,role1"/>
<user username="role1" password="tomcat" roles="role1"/>
-->
</tomcat-users>


To this (note that the new role and user lines must sit outside the <!-- --> comment block, or Tomcat will ignore them):
<tomcat-users>
<!--
<role rolename="tomcat"/>
<role rolename="role1"/>
<user username="tomcat" password="tomcat" roles="tomcat"/>
<user username="both" password="tomcat" roles="tomcat,role1"/>
<user username="role1" password="tomcat" roles="role1"/>
-->
<role rolename="manager"/>
<user username="tomcat" password="tomcat" roles="tomcat,manager"/>
</tomcat-users>


Now, restart Tomcat:
sudo service tomcat6 restart


You can now access the Tomcat manager at
http://localhost:8080/manager/html
Username: tomcat
Password: tomcat
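If you prefer the command line, you can verify the credentials with curl (a quick check, assuming curl is installed):
curl -s -u tomcat:tomcat http://localhost:8080/manager/html | head

A successful login prints the manager HTML; a 401 error means the tomcat-users.xml change did not take effect.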


Installing Nutch

Change to a working directory, then download and unpack Nutch:
cd /tmp
wget http://mirrorservice.nomedia.no/apache.org/nutch/apache-nutch-1.1-bin.tar.gz
cd /usr/share
tar zxf /tmp/apache-nutch-1.1-bin.tar.gz
ln -s apache-nutch-1.1-bin nutch
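To confirm the unpack worked, run the Nutch launcher with no arguments; it should print a usage summary listing the available commands (this assumes a Java runtime is already installed and on your PATH):
cd /usr/share/nutch
bin/nutch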


Configuring Solr

For the sake of simplicity, we are going to use the example Solr configuration as a base.

Back up the original file:
mv /etc/solr/conf/schema.xml /etc/solr/conf/schema.xml.orig

And replace the Solr schema with the one provided by Nutch:
cp /usr/share/nutch/conf/schema.xml /etc/solr/conf/schema.xml

Now, we need to configure Solr to create snippets for search results.

Edit /etc/solr/conf/schema.xml and change the following line:
<field name="content" type="text" stored="false" indexed="true"/>
To this:
<field name="content" type="text" stored="true" indexed="true"/>

Create a new dismax request handler to enable relevancy tweaking.
Back up the original file:
cp /etc/solr/conf/solrconfig.xml /etc/solr/conf/solrconfig.xml.orig

Add the following fragment to /etc/solr/conf/solrconfig.xml:
<requestHandler name="/nutch" class="solr.SearchHandler" >
    <lst name="defaults">
        <str name="defType">dismax</str>
        <str name="echoParams">explicit</str>
        <str name="tie">0.01</str>
        <str name="qf">
            content^0.5 anchor^1.0 title^1.2
        </str>
        <str name="pf">
            content^0.5 anchor^1.5 title^1.2 site^1.5
        </str>
        <str name="fl">
            url
        </str>
        <str name="mm">
            2&lt;-1 5&lt;-2 6&lt;90%
        </str>
        <str name="ps">100</str>
        <str name="hl">true</str>
        <str name="q.alt">*:*</str>
        <str name="hl.fl">title url content</str>
        <str name="f.title.hl.fragsize">0</str>
        <str name="f.title.hl.alternateField">title</str>
        <str name="f.url.hl.fragsize">0</str>
        <str name="f.url.hl.alternateField">url</str>
        <str name="f.content.hl.fragmenter">regex</str>
    </lst>
</requestHandler>

Now, restart Tomcat:
service tomcat6 restart
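Once Tomcat is back up, you can exercise the new handler directly with curl (the index is still empty at this point, so expect zero results; Solr 1.4 should expose request handlers whose names start with a slash under the matching URL path):
curl "http://127.0.0.1:8080/solr/nutch?q=test"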

Configuring Nutch

Go into the nutch directory and do all the work from there:
cd /usr/share/nutch

Edit conf/nutch-site.xml and add the following between the <configuration> tags:
 <property>
    <name>http.robots.agents</name>
    <value>nutch-solr-integration-test,*</value>
    <description></description>
  </property>
  <property>
    <name>http.agent.name</name>
    <value>nutch-solr-integration-test</value>
    <description>Viterbi Bot</description>
  </property>
  <property>
    <name>http.agent.description</name>
    <value>Viterbi Web Crawler using Nutch 1.0</value>
    <description></description>
  </property>
  <property>
    <name>http.agent.url</name>
    <value>http://viterbi.usc.edu/</value>
    <description></description>
  </property>
  <property>
    <name>http.agent.email</name>
    <value>YOUR EMAIL ADDRESS HERE</value>
    <description></description>
  </property>
  <property>
    <name>http.agent.version</name>
    <value></value>
    <description></description>
  </property>
  <property>
    <name>generate.max.per.host</name>
    <value>100</value>
  </property>


You need to ensure that the crawler does not leave your domain; otherwise you would end up crawling the entire Internet.

Insert your domain into conf/regex-urlfilter.txt:

# allow urls in viterbi.usc.edu domain
+^http://([a-zA-Z0-9\-]*\.)*viterbi\.usc\.edu/([a-zA-Z0-9\-]*\/)*
 
# deny anything else
-.

Important: Make sure that you also disable the default accept-all rule. Change this:
# accept anything else
+.

To this:
# accept anything else
#+.
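As a rough sanity check of the pattern, you can test it against a sample URL with grep (Nutch uses Java regular expressions, but grep -E is close enough for a smoke test; the leading + is Nutch's include marker, not part of the regex):
echo "http://viterbi.usc.edu/about/" | grep -E '^http://([a-zA-Z0-9-]*\.)*viterbi\.usc\.edu/'

If the URL is echoed back, the pattern matches; no output means the crawler would filter it out.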

Now, we need to instruct the crawler where to start crawling, so create a seed list:
mkdir urls
echo "http://viterbi.usc.edu/" > urls/seed.txt


Important: You can add multiple seed URLs here, one per line; just make sure you make the corresponding changes in regex-urlfilter.txt, as discussed above.
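For example, a two-site crawl might look like this (example.com is a placeholder; substitute your own second domain):
echo "http://viterbi.usc.edu/" > urls/seed.txt
echo "http://www.example.com/" >> urls/seed.txt

# corresponding extra rule in conf/regex-urlfilter.txt
+^http://([a-zA-Z0-9\-]*\.)*example\.com/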

Crawling your site

Let's start crawling!

Start by injecting the seed URL(s) into the Nutch crawldb:
bin/nutch inject crawl/crawldb urls

Next, generate a fetch list:
bin/nutch generate crawl/crawldb crawl/segments

The above command generates a new segment directory under /usr/share/nutch/crawl/segments that contains the URLs to be fetched. All of the following commands take the latest segment directory as their main parameter, so we'll store it in an environment variable:
export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`

Launch the crawler!
bin/nutch fetch $SEGMENT -noParsing

And parse the fetched content:
bin/nutch parse $SEGMENT

Now update the crawl database so that in future crawls Nutch knows which pages have already been fetched and only fetches new and changed pages:
bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
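To inspect what the crawl database now contains, Nutch provides a readdb command (the exact statistics printed vary slightly between versions):
bin/nutch readdb crawl/crawldb -stats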

Create a link database:
bin/nutch invertlinks crawl/linkdb -dir crawl/segments

Important: Each pass through the generate/fetch/parse/updatedb steps crawls one level deeper, so repeat them to increase your crawl depth (see the sketch below).
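A minimal sketch of several crawl rounds, assuming you run it from /usr/share/nutch after the initial inject:
# repeat the generate/fetch/parse/updatedb cycle three times for more depth
for i in 1 2 3; do
  bin/nutch generate crawl/crawldb crawl/segments
  SEGMENT=crawl/segments/`ls -tr crawl/segments | tail -1`
  bin/nutch fetch $SEGMENT -noParsing
  bin/nutch parse $SEGMENT
  bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
done
# rebuild the link database over all segments when done
bin/nutch invertlinks crawl/linkdb -dir crawl/segments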

Indexing your crawl DB with Solr

bin/nutch solrindex http://127.0.0.1:8080/solr/ crawl/crawldb crawl/linkdb crawl/segments/*

Searching the crawled content in Solr


Now the indexed content is available through Solr. You can execute searches from the Solr admin UI at
http://127.0.0.1:8080/solr/admin

or directly with a URL like:
http://127.0.0.1:8080/solr/select/?q=usc&version=2.2&start=0&rows=10&indent=on&wt=json

