Nutch uses Hadoop for scalability, but Hadoop has an OS-level dependency: its DF class (org.apache.hadoop.fs.DF) shells out to the Unix df command, which does not exist on Windows. To run Nutch on Windows, I wrote an OS-independent implementation of that class's doDF() method. Now I can happily run Nutch using Ant!!
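For context, here is a simplified sketch of the kind of thing the stock class does (an assumed illustration, not the actual Hadoop source): it execs df and reads its output, so on Windows, where no df binary exists, the exec call throws an IOException.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

// Assumed illustration of the Unix-only approach: run 'df -k' and read
// its output. On Windows there is no 'df' binary, so exec() fails.
public class DfSketch {
    public static void main(String[] args) throws IOException {
        Process p = Runtime.getRuntime().exec(new String[] { "df", "-k", "." });
        BufferedReader in =
            new BufferedReader(new InputStreamReader(p.getInputStream()));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line); // header row, then per-filesystem stats
        }
    }
}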
build.xml
<project name="nutch" default="crawl">

  <!-- Everything Nutch needs on the classpath: conf ahead of the jars in
       lib, so the patched DF.class shadows the one in the Hadoop jar. -->
  <path id="nutch.classpath">
    <pathelement location="${java.home}/jre/lib/tools.jar"/>
    <pathelement location="conf"/>
    <pathelement location="plugins"/>
    <fileset dir="lib">
      <include name="**/*.jar"/>
    </fileset>
  </path>

  <!-- Crawl the seed URLs in the 'urls' directory, 3 levels deep,
       fetching at most the top 5 URLs per round. -->
  <target name="crawl">
    <java classname="org.apache.nutch.crawl.Crawl">
      <arg value="urls"/>
      <arg value="-dir"/>
      <arg value="crawl"/>
      <arg value="-depth"/>
      <arg value="3"/>
      <arg value="-topN"/>
      <arg value="5"/>
      <classpath>
        <path refid="nutch.classpath"/>
      </classpath>
    </java>
  </target>

  <!-- Push the crawled segments to a local Solr instance. -->
  <target name="index">
    <java classname="org.apache.nutch.indexer.solr.SolrIndexer">
      <arg value="http://localhost:8983/solr/"/>
      <arg value="crawl/crawldb"/>
      <arg value="crawl/segments/*"/>
      <classpath>
        <path refid="nutch.classpath"/>
      </classpath>
    </java>
  </target>

  <!-- Run Crawl with no arguments to print its usage message. -->
  <target name="usage">
    <java classname="org.apache.nutch.crawl.Crawl" failonerror="false">
      <classpath>
        <path refid="nutch.classpath"/>
      </classpath>
    </java>
  </target>
</project>
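With this build.xml in the Nutch home directory (and a urls directory holding the seed list), the targets run as usual:

ant crawl
ant index

Since crawl is the project's default target, a plain ant does the same thing as ant crawl.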
org/apache/hadoop/fs/DF.java
// Get a copy of DF.java from svn and replace the following method. Compile it and place the resulting DF.class under conf/org/apache/hadoop/fs/; since conf comes before the lib jars on the classpath above, this copy shadows the one inside the Hadoop jar.
// Platform-independent replacement for the Unix-specific version, which
// shelled out to 'df' and parsed its output. java.io.File (JDK 6+)
// reports the same figures directly.
private void doDF() throws IOException {
  // refresh at most once every dfInterval milliseconds
  if (lastDF + dfInterval > System.currentTimeMillis())
    return;

  File file = new File(dirPath);
  // 'filesystem' and 'mount' came from parsing df's output; java.io.File
  // offers no portable equivalent, so they are left unset.
  this.capacity = file.getTotalSpace();
  this.available = file.getFreeSpace();
  this.used = capacity - available;
  // multiply before dividing: used/capacity is long division and would
  // truncate to 0 on any disk that is not completely full
  this.percentUsed = (int) (used * 100.0 / capacity);
  this.lastDF = System.currentTimeMillis();
}
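To sanity-check the java.io.File calls on a given machine, a small standalone class (hypothetical, not part of Nutch or Hadoop) can print the same figures the patched method computes:

import java.io.File;

// Hypothetical standalone check: print the disk-space figures that the
// patched doDF() would compute for a given path (default: current dir).
public class DiskSpaceCheck {
    public static void main(String[] args) {
        File file = new File(args.length > 0 ? args[0] : ".");
        long capacity = file.getTotalSpace();
        long available = file.getFreeSpace();
        long used = capacity - available;
        System.out.println("capacity:  " + capacity + " bytes");
        System.out.println("available: " + available + " bytes");
        // multiply before dividing so long division does not truncate to 0
        System.out.println("percent used: " + (int) (used * 100.0 / capacity));
    }
}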
Comments/suggestions: send an email to tamariya@rediffmail.com.