
NutchWAX-0.12.9 Chinese Localization

This page describes how to adapt NutchWAX-0.12.9 for Chinese.
  • First, Nutch-1.0 itself can be adapted for Chinese, as described at the following link.
           We use the Paoding Chinese Analyzer.
  • NutchWAX extraction
Get NutchWAX-0.12.9 from https://webarchive.jira.com/wiki/display/search/NutchWAX
The code is based on Nutch-1.0-dev, and the tarball includes the Nutch-1.0-dev source code.

$ tar xzvf nutchwax-0.12.9.tar.gz

  • patch
Get nutchwax-0.12.9-zh-ChineseAnalyzer-2011-03-01.patch (attached to this page) and apply it from the extracted directory:

$ cd nutchwax-0.12.9
$ patch -p1 < ../nutchwax-0.12.9-zh-ChineseAnalyzer-2011-03-01.patch

  • rebuild (the original nutch-1.0-dev.jar is saved in tmp first because it is needed in the next step)
$ mkdir tmp
$ cp build/nutch-1.0-dev.jar tmp
$ ant clean
$ ant
$ cp -R build/plugins/analysis-zh/ plugins/
$ cp build/plugins/language-identifier/* plugins/language-identifier
$ cp build/plugins/parse-html/* plugins/parse-html/
$ cp build/nutch-1.0.job ./
$ ant jar
$ ant war

  • Unfortunately, the classes of the web archive extension (./org/archive) are not carried over into the rebuilt jar automatically.
Copy them manually from the jar saved in tmp:

$ cd tmp
$ jar xvf nutch-1.0-dev.jar
$ cd ../build
$ jar xvf nutch-1.0-dev.jar
$ cp -R ../tmp/org/archive org/
$ rm nutch-1.0-dev.jar
$ jar cvf nutch-1.0-dev.jar org
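
Before moving on, it may be worth confirming that the repackaged jar really contains the web archive classes. The following is only a minimal sketch: it assumes the JDK `jar` tool is on the PATH, and the helper name `has_archive_classes` is hypothetical, not part of NutchWAX.

```shell
# Hypothetical helper (not part of NutchWAX): reads a jar listing on stdin
# and succeeds only if at least one entry lives under org/archive/.
has_archive_classes() {
    grep -q '^org/archive/'
}

# Usage, from the build directory:
#   jar tf nutch-1.0-dev.jar | has_archive_classes && echo "org/archive present"
```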

  • conf/nutch-site.xml: set the following properties
    • http.agent.name
    • http.agent.description
    • http.agent.url
    • http.agent.email
    • searcher.dir
      • /opt/nutch-1.0/crawl
    • plugin.includes
      • make sure it includes "analysis-zh", "language-identifier", and "parse-html"
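
The plugin.includes value is a regular expression over plugin names, so a plugin can be enabled without its name appearing literally (for example through a group such as parse-(text|html)). The sketch below checks the three plugins against that expression. It assumes that the `<name>` and `<value>` elements each sit on a single line and that the expression uses only syntax `grep -E` understands; the helper name `missing_plugins` is hypothetical.

```shell
# Hypothetical helper: print each required plugin that is NOT matched by
# the plugin.includes expression in the given nutch-site.xml.
# Assumes <name> and <value> are each written on one line.
missing_plugins() {
    value=$(awk '/<name>plugin\.includes<\/name>/ {found=1}
                 found && /<value>/ {print; exit}' "$1" |
            sed 's/.*<value>\(.*\)<\/value>.*/\1/')
    for p in analysis-zh language-identifier parse-html; do
        # -x: the whole plugin name must match the expression
        printf '%s\n' "$p" | grep -Eqx -- "$value" || echo "$p"
    done
}

# Usage: missing_plugins conf/nutch-site.xml
```

No output means all three plugins are covered by the expression.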
  • import files from ARC
 NutchWAX is a web archive extension that indexes ARC/WARC files with Lucene.
NutchWAX does not use the Nutch crawler.
Because cached pages are served by Open Source Wayback, NutchWAX does not serve cached pages itself.
  • create the "manifest" file
List each ARC file name followed by your collection name, one per line:

IAH-20100819073302-00000-dahlia.arc.gz mycollection
IAH-20100819073445-00001-dahlia.arc.gz mycollection
IAH-20100819073445-00002-dahlia.arc.gz mycollection
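
With many ARC files the manifest can be generated rather than typed. This is a minimal sketch that assumes the ARC files sit in the current directory; the helper name `make_manifest` is hypothetical, and the optional second argument (file suffix) is an addition for illustration.

```shell
# Hypothetical helper: print one "<file name> <collection>" line for every
# file in the current directory with the given suffix (default: arc.gz).
make_manifest() {
    collection=$1
    suffix=${2:-arc.gz}
    for f in *."$suffix"; do
        [ -e "$f" ] || continue   # skip the unexpanded pattern when nothing matches
        printf '%s %s\n' "$f" "$collection"
    done
}

# Usage:
#   make_manifest mycollection > manifest            # ARC files
#   make_manifest mycollection warc.gz > manifest    # WARC files
```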


  • create "all.dup", the list of duplicated pages
We use arc-indexer from Open Source Wayback to build one CDX file per ARC file, then merge and de-duplicate them:

$ /opt/wayback-1.2.1/bin/arc-indexer IAH-20100819073302-00000-dahlia.arc.gz > 1.cdx
$ /opt/wayback-1.2.1/bin/arc-indexer IAH-20100819073445-00001-dahlia.arc.gz > 2.cdx
$ /opt/wayback-1.2.1/bin/arc-indexer IAH-20100819073445-00002-dahlia.arc.gz > 3.cdx
$ sort -m ?.cdx > all.cdx
$ ../dedup-cdx all.cdx > all.dup
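
Note that `sort -m` only merges inputs that are already sorted. If it is unclear whether every CDX file is sorted, a full sort is the safer choice. The sketch below is an assumption-laden alternative to the merge step, not a replacement prescribed by NutchWAX or Wayback; the helper name `merge_cdx` is hypothetical.

```shell
# Hypothetical helper: combine the given CDX files into one sorted stream.
# A plain sort, unlike "sort -m", does not require sorted inputs.
# LC_ALL=C forces byte-order comparison, so the result does not depend
# on the locale of the machine.
merge_cdx() {
    LC_ALL=C sort "$@"
}

# Usage: merge_cdx ?.cdx > all.cdx
```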

  • import/indexing
$ bin/nutchwax import -e all.dup manifest
$ mkdir crawl
$ mv segments crawl
$ bin/nutch updatedb crawl/crawldb `ls -d crawl/segments/2* | tail -1`
$ bin/nutch invertlinks crawl/linkdb -dir crawl/segments
$ bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb `ls -d crawl/segments/2* | tail -1`
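
The updatedb and index commands both look up the newest segment with the same backquoted expression. As a sketch, the lookup can be done once and reused; the function names `latest_segment` and `index_latest` are hypothetical, and the `bin/nutch` calls are the ones from the steps above.

```shell
# Hypothetical helper: print the newest segment directory under <crawl dir>/segments.
# Segment names are timestamps starting with the year (2...), so the
# lexicographically last entry is the most recent one.
latest_segment() {
    ls -d "$1"/segments/2* | tail -1
}

# Sketch of the indexing steps with the segment resolved once.
# Only defined here; call it from the NutchWAX directory after the import step.
index_latest() {
    seg=$(latest_segment crawl)
    [ -n "$seg" ] || return 1
    bin/nutch updatedb crawl/crawldb "$seg" &&
    bin/nutch invertlinks crawl/linkdb -dir crawl/segments &&
    bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb "$seg"
}
```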

  • import files from WARC
 The procedure is the same as for ARC files.
 The only difference is the manifest file.
  • create the "manifest" file
List each WARC file name followed by your collection name, one per line:

IAH-20100819073302-00000-dahlia.warc.gz mycollection
IAH-20100819073445-00001-dahlia.warc.gz mycollection
IAH-20100819073445-00002-dahlia.warc.gz mycollection

  • deployment of nutch-1.0-dev.war
    • download and extract apache-tomcat-5.5.27, then edit the following file so that the Connector sets URIEncoding to UTF-8
      • /opt/apache-tomcat-5.5.27/conf/server.xml
<Connector port="8080" .. URIEncoding="UTF-8">
  • deploy the .war file (start Tomcat once so that it unpacks the .war, then stop it before editing the config files)
$ cp build/nutch-1.0-dev.war /opt/apache-tomcat-5.5.27/webapps
$ cd /opt/apache-tomcat-5.5.27
$ bin/catalina.sh start
$ bin/catalina.sh stop
  • modify the config files
    • webapps/nutch-1.0-dev/WEB-INF/classes/nutch-site.xml
      • check that "searcher.dir" points to your crawl directory
    • webapps/nutch-1.0-dev/ja/include/header.html
      • some files generated by XSLT may contain garbled characters; check and correct them
  • re-deploy
Start Tomcat again and open the Chinese search page:

$ bin/catalina.sh start
http://localhost:8080/nutch-1.0-dev/zh
Attachment: nutchwax-0.12.9-zh-ChineseAnalyzer-2011-03-01.patch (10834k), Masayuki Asahara, 2011/09/07 19:09