Below, we describe how to Japanize NutchWAX-0.12.9.
Get NutchWAX-0.12.9 from https://webarchive.jira.com/wiki/display/search/NutchWAX. The code is based on Nutch-1.0-dev, and the tarball includes the Nutch-1.0-dev source code.
$ tar xzvf nutchwax-0.12.9.tar.gz
Get nutchwax-0.12.9-ja-JapaneseAnalyzer-2010-08-25.patch and apply it:
$ cd nutchwax-0.12.9
$ patch -p1 < ../nutchwax-0.12.9-ja-JapaneseAnalyzer-2010-08-25.patch
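Before modifying the source tree, it can help to confirm that the patch applies cleanly. The following is a minimal sketch, assuming GNU patch (which provides `--dry-run`); the function name `apply_patch_checked` is hypothetical and not part of NutchWAX.

```shell
#!/bin/sh
# Hypothetical helper: apply a -p1 patch only if a dry run succeeds.
# Assumes GNU patch, which provides --dry-run.
apply_patch_checked() {
    src_dir=$1      # source tree, e.g. nutchwax-0.12.9
    patch_file=$2   # path to the patch file
    # Make the patch path absolute before changing directory.
    case $patch_file in
        /*) ;;
        *) patch_file=$PWD/$patch_file ;;
    esac
    (
        cd "$src_dir" || exit 1
        # First pass changes nothing; it only checks that every hunk fits.
        patch -p1 --dry-run < "$patch_file" >/dev/null || exit 1
        # Second pass actually modifies the tree.
        patch -p1 < "$patch_file"
    )
}
```

Run from the directory that contains the unpacked tarball, this would be called as `apply_patch_checked nutchwax-0.12.9 nutchwax-0.12.9-ja-JapaneseAnalyzer-2010-08-25.patch`.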
Save the existing jar, rebuild, and copy the plugins and the job file:
$ mkdir tmp
$ cp build/nutch-1.0-dev.jar tmp
$ ant clean
$ ant
$ cp -R build/plugins/analysis-ja/ plugins/
$ cp build/plugins/language-identifier/* plugins/language-identifier
$ cp build/plugins/parse-html/* plugins/parse-html/
$ cp build/nutch-1.0.job ./
$ ant jar
$ ant war
Copy the org/archive classes from the saved jar into the rebuilt jar manually:
$ cd tmp
$ jar xvf nutch-1.0-dev.jar
$ cd ../build
$ jar xvf nutch-1.0-dev.jar
$ cp -R ../tmp/org/archive org/archive
$ rm nutch-1.0-dev.jar
$ jar cvf nutch-1.0-dev.jar org
NutchWAX is a web archive extension for Nutch that indexes ARC/WARC files with Lucene. NutchWAX does not use the Nutch crawler. Because cached pages are served by Open Source Wayback, NutchWAX does not present any cached pages.
The manifest lists the names of the ARC files, each followed by your collection name:
IAH-20100819073302-00000-dahlia.arc.gz mycollection
IAH-20100819073445-00001-dahlia.arc.gz mycollection
IAH-20100819073445-00002-dahlia.arc.gz mycollection
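With more than a few archives, the manifest can be generated rather than typed. The following is a minimal sketch; `make_manifest` is a hypothetical helper, and it writes bare file names followed by the collection name, matching the listing above.

```shell
#!/bin/sh
# Hypothetical helper: print one "<file name> <collection>" line per archive.
# Arguments: collection name, directory to scan, file suffix (default .arc.gz).
make_manifest() {
    collection=$1
    dir=$2
    suffix=${3:-.arc.gz}
    for path in "$dir"/*"$suffix"; do
        # Skip the unexpanded pattern when nothing in the directory matches.
        [ -e "$path" ] || continue
        printf '%s %s\n' "$(basename "$path")" "$collection"
    done
}
```

For example, `make_manifest mycollection . > manifest` writes a manifest for the ARC files in the current directory. Passing `.warc.gz` as the third argument selects WARC files instead.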
We use arc-indexer from Open Source Wayback to create a CDX file for each ARC file:
$ /opt/wayback-1.2.1/bin/arc-indexer IAH-20100819073302-00000-dahlia.arc.gz > 1.cdx
$ /opt/wayback-1.2.1/bin/arc-indexer IAH-20100819073445-00001-dahlia.arc.gz > 2.cdx
$ /opt/wayback-1.2.1/bin/arc-indexer IAH-20100819073445-00002-dahlia.arc.gz > 3.cdx
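The per-file commands above can be wrapped in a loop. This is a sketch only: `index_arcs` and the `ARC_INDEXER` variable are hypothetical names, and the indexer path is the one used in the commands above.

```shell
#!/bin/sh
# Path taken from the commands above; override ARC_INDEXER if yours differs.
ARC_INDEXER=${ARC_INDEXER:-/opt/wayback-1.2.1/bin/arc-indexer}

# Hypothetical helper: write 1.cdx, 2.cdx, ... for the ARC files given
# as arguments, one CDX file per ARC file, in argument order.
index_arcs() {
    n=1
    for arc in "$@"; do
        # Stop at the first failure so a broken archive is noticed.
        "$ARC_INDEXER" "$arc" > "$n.cdx" || return 1
        n=$((n + 1))
    done
}
```

For example, `index_arcs IAH-*.arc.gz` indexes every matching ARC file in the current directory.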
$ bin/nutchwax import -e all.dup manifest
$ mkdir crawl
$ mv segments crawl
$ bin/nutch updatedb crawl/crawldb `ls -d crawl/segments/2* | tail -1`
$ bin/nutch invertlinks crawl/linkdb -dir crawl/segments
$ bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb `ls -d crawl/segments/2* | tail -1`
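The backquoted `ls -d crawl/segments/2* | tail -1` picks the last segment directory in sorted order. Nutch names segments by creation timestamp, so that is the newest one. The sketch below computes it in one place; `latest_segment` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the newest segment directory under the given
# segments directory (default crawl/segments). Relies on segment names
# being timestamps, so the last name in sorted order is the most recent.
latest_segment() {
    segments_dir=${1:-crawl/segments}
    # Prints nothing if no segment directory exists yet.
    ls -d "$segments_dir"/2* 2>/dev/null | tail -1
}
```

For example, `segment=$(latest_segment)` can be followed by `bin/nutch updatedb crawl/crawldb "$segment"`, and the same variable can be reused for the index step.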
The procedure for WARC files is the same as for ARC files; the only difference is the manifest file, which lists the names of the WARC files with your collection name:
IAH-20100819073302-00000-dahlia.warc.gz mycollection
IAH-20100819073445-00001-dahlia.warc.gz mycollection
IAH-20100819073445-00002-dahlia.warc.gz mycollection