This blog post briefly outlines how to install PyLucene on Fedora and gives some examples of how to use its analyzers to process text.
To begin with, you need to install Apache Ant and Apache Ivy.
    sudo dnf install ant ivy

Next, download PyLucene from one of the mirrors found here and extract the archive.
    tar -xzvf pylucene-6.5.0-src.tar.gz

To install PyLucene, we'll mostly follow the instructions found here. The first step is to install JCC.
    cd /path/to/pylucene/jcc

Make sure the Java location in jcc/setup.py is correct:
    'linux': '/usr/lib/jvm/java-8-oracle'  # change this
    'linux': '/usr/lib/jvm/java-1.8.0'     # mine
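If you're not sure which path to use on your machine, listing the installed JVMs or resolving the java binary is one way to find it (these are generic Fedora commands; the exact directory names will differ on your system):

    ls /usr/lib/jvm/           # list the installed JVM directories
    readlink -f /usr/bin/java  # resolve the symlink to the active JVM's binary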
Install JCC.

    python setup.py build
    python setup.py install
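As a quick smoke test (nothing more than checking the module is importable), you can try importing JCC; if the command exits without an error, the install worked:

    python -c "import jcc"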
Now we'll start the PyLucene installation process.

    cd ..

PyLucene assumes an incorrect name for the Python 3 development library, so I had to create a symbolic link to make compiling work (I brought this to the PyLucene team's attention and it should be fixed in a future release).
    ln -s /path/to/.pyenv/versions/3.5.2/lib/libpython3.5m.so.1.0 /path/to/.pyenv/versions/3.5.2/lib/libpython3.5.so

Next, edit the Makefile so that everything points to the proper locations:
    # Linux (Debian Jessie 64-bit, Python 3.4.2, Oracle Java 1.8)
    # Be sure to also set JDK['linux'] in jcc's setup.py to the JAVA_HOME value
    # used below for ANT (and rebuild jcc after changing it).
    PREFIX_PYTHON=/home/path/to/.pyenv/versions/3.5.2
    ANT=JAVA_HOME=/usr/lib/jvm/java-1.8.0 /usr/bin/ant
    PYTHON=$(PREFIX_PYTHON)/bin/python3
    JCC=$(PYTHON) -m jcc --shared
    NUM_FILES=8

Finally, make and install PyLucene.
    make
    make test
    make install
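To confirm everything went through, you can check that the lucene module loads and reports the version you built (PyLucene exposes the Lucene version string as lucene.VERSION):

    python -c "import lucene; print(lucene.VERSION)"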
PyLucene is designed to mimic the Lucene Java API. The following examples demonstrate how to use several different analyzers to process text.

    import lucene
    from java.io import StringReader
    from org.apache.lucene.analysis.ja import JapaneseAnalyzer
    from org.apache.lucene.analysis.standard import StandardAnalyzer, StandardTokenizer
    from org.apache.lucene.analysis.tokenattributes import CharTermAttribute

    lucene.initVM(vmargs=['-Djava.awt.headless=true'])

    # Basic tokenizer example.
    test = "This is how we do it."
    tokenizer = StandardTokenizer()
    tokenizer.setReader(StringReader(test))
    charTermAttrib = tokenizer.getAttribute(CharTermAttribute.class_)
    tokenizer.reset()
    tokens = []
    while tokenizer.incrementToken():
        tokens.append(charTermAttrib.toString())
    print(tokens)

    # StandardAnalyzer example.
    analyzer = StandardAnalyzer()
    stream = analyzer.tokenStream("", StringReader(test))
    stream.reset()
    tokens = []
    while stream.incrementToken():
        tokens.append(stream.getAttribute(CharTermAttribute.class_).toString())
    print(tokens)

    # JapaneseAnalyzer example.
    analyzer = JapaneseAnalyzer()
    test = "寿司が食べたい。"
    stream = analyzer.tokenStream("", StringReader(test))
    stream.reset()
    tokens = []
    while stream.incrementToken():
        tokens.append(stream.getAttribute(CharTermAttribute.class_).toString())
    print(tokens)
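Since the same reset/increment/collect loop appears in every example, it can be factored into a small helper. This is a sketch of my own (tokenize is not part of the PyLucene API) that reuses the imports above; it also calls end() and close(), which the TokenStream contract expects and which the inline examples skip:

    def tokenize(analyzer, text):
        """Return the list of tokens the analyzer produces for text."""
        # The field name passed to tokenStream is unused here, so "" is fine.
        stream = analyzer.tokenStream("", StringReader(text))
        charTermAttrib = stream.addAttribute(CharTermAttribute.class_)
        stream.reset()  # must be called before the first incrementToken()
        tokens = []
        while stream.incrementToken():
            tokens.append(charTermAttrib.toString())
        stream.end()    # perform end-of-stream processing
        stream.close()  # release the resources held by the stream
        return tokens

    print(tokenize(StandardAnalyzer(), "This is how we do it."))
    print(tokenize(JapaneseAnalyzer(), "寿司が食べたい。"))

Compared with the raw StandardTokenizer, the analyzers do more than split on word boundaries: StandardAnalyzer lowercases tokens and drops English stop words, and JapaneseAnalyzer runs Kuromoji morphological analysis, which is what lets it segment Japanese text that contains no spaces at all.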