This blog post briefly outlines how to install PyLucene on Fedora and gives some examples of how to use its analyzers to process text.
To begin with, you need to install Apache Ant and Apache Ivy.
    sudo dnf install ant ivy

Next, download PyLucene from one of the mirrors found here and extract the archive.
    tar -xzvf pylucene-6.5.0-src.tar.gz

To install PyLucene, we'll mostly follow the instructions found here. The first step is to install JCC.
    cd /path/to/pylucene/jcc

Make sure the Java location in jcc/setup.py is correct:
    'linux': '/usr/lib/jvm/java-8-oracle'  # change this
    'linux': '/usr/lib/jvm/java-1.8.0'     # mine
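If you're not sure which path to use on your machine, listing the installed JVMs or resolving the java binary is one way to find it (these are generic Fedora commands; the exact directory names will differ on your system):

    ls /usr/lib/jvm/           # list the installed JVM directories
    readlink -f /usr/bin/java  # resolve the symlink to the active JVM's binary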
Install JCC.

    python setup.py build
    python setup.py install
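As a quick smoke test (nothing more than checking the module is importable), you can try importing JCC; if the command exits without an error, the install worked:

    python -c "import jcc"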
Now we'll start the PyLucene installation process.

    cd ..

PyLucene assumes an incorrect name for the Python 3 development library, so I had to create a symbolic link to make compiling work (I brought this to the PyLucene team's attention and it should be fixed in a future release).
    ln -s /path/to/.pyenv/versions/3.5.2/lib/libpython3.5m.so.1.0 /path/to/.pyenv/versions/3.5.2/lib/libpython3.5.so

Next, edit the Makefile so that everything points to the proper locations:
    # Linux (Debian Jessie 64-bit, Python 3.4.2, Oracle Java 1.8)
    # Be sure to also set JDK['linux'] in jcc's setup.py to the JAVA_HOME value
    # used below for ANT (and rebuild jcc after changing it).
    PREFIX_PYTHON=/home/path/to/.pyenv/versions/3.5.2
    ANT=JAVA_HOME=/usr/lib/jvm/java-1.8.0 /usr/bin/ant
    PYTHON=$(PREFIX_PYTHON)/bin/python3
    JCC=$(PYTHON) -m jcc --shared
    NUM_FILES=8

Finally, make and install PyLucene.
    make
    make test
    make install
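To confirm everything went through, you can check that the lucene module loads and reports the version you built (PyLucene exposes the Lucene version string as lucene.VERSION):

    python -c "import lucene; print(lucene.VERSION)"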
PyLucene is designed to mimic the Lucene Java API. The following examples demonstrate how to use several different analyzers to process text.

    import lucene
    from java.io import StringReader
    from org.apache.lucene.analysis.ja import JapaneseAnalyzer
    from org.apache.lucene.analysis.standard import StandardAnalyzer, StandardTokenizer
    from org.apache.lucene.analysis.tokenattributes import CharTermAttribute

    lucene.initVM(vmargs=['-Djava.awt.headless=true'])

    # Basic tokenizer example.
    test = "This is how we do it."
    tokenizer = StandardTokenizer()
    tokenizer.setReader(StringReader(test))
    charTermAttrib = tokenizer.getAttribute(CharTermAttribute.class_)
    tokenizer.reset()
    tokens = []
    while tokenizer.incrementToken():
        tokens.append(charTermAttrib.toString())
    print(tokens)

    # StandardAnalyzer example.
    analyzer = StandardAnalyzer()
    stream = analyzer.tokenStream("", StringReader(test))
    stream.reset()
    tokens = []
    while stream.incrementToken():
        tokens.append(stream.getAttribute(CharTermAttribute.class_).toString())
    print(tokens)

    # JapaneseAnalyzer example.
    analyzer = JapaneseAnalyzer()
    test = "寿司が食べたい。"
    stream = analyzer.tokenStream("", StringReader(test))
    stream.reset()
    tokens = []
    while stream.incrementToken():
        tokens.append(stream.getAttribute(CharTermAttribute.class_).toString())
    print(tokens)
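Since the same reset/increment/collect loop appears in every example, it can be factored into a small helper. This is a sketch of my own (tokenize is not part of the PyLucene API) that reuses the imports above; it also calls end() and close(), which the TokenStream contract expects and which the inline examples skip:

    def tokenize(analyzer, text):
        """Return the list of tokens the analyzer produces for text."""
        # The field name passed to tokenStream is unused here, so "" is fine.
        stream = analyzer.tokenStream("", StringReader(text))
        charTermAttrib = stream.addAttribute(CharTermAttribute.class_)
        stream.reset()  # must be called before the first incrementToken()
        tokens = []
        while stream.incrementToken():
            tokens.append(charTermAttrib.toString())
        stream.end()    # perform end-of-stream processing
        stream.close()  # release the resources held by the stream
        return tokens

    print(tokenize(StandardAnalyzer(), "This is how we do it."))
    print(tokenize(JapaneseAnalyzer(), "寿司が食べたい。"))

Compared with the raw StandardTokenizer, the analyzers do more than split on word boundaries: StandardAnalyzer lowercases tokens and drops English stop words, and JapaneseAnalyzer runs Kuromoji morphological analysis, which is what lets it segment Japanese text that contains no spaces at all.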