Simple N-Gram fulltext plugin for MySQL


 

Simple N-Gram (bi-gram) FULLTEXT parser plugin for MySQL 5.1+

NOTE: There is now another bi-gram parser project, the "MySQL full-text parser plugin collection", which may be somewhat better than this parser. (Better still may be the "Tritonn project", which embeds the "senna" full-text engine into MySQL.)

 

MySQL provides fulltext index search for text fields, but the index is word-based, so it cannot be used for languages without word delimiters, such as Japanese or Chinese. It also cannot find characters in the middle of a word (e.g. searching for 'in' will not match the word 'ping').

Starting with MySQL 5.1, MySQL supports a plugin interface that allows server components (such as the fulltext search parser) to be changed without restarting or recompiling the server.

This n-gram parser uses that plugin interface to implement a simple n-gram (bi-gram) fulltext index parser that can index languages without word delimiters.

N-gram parser

N-gram (bi-gram) indexing is a simple algorithm: it takes every sequence of 2 consecutive characters from the text and uses each one as an index entry. This plugin uses MySQL's internal multi-byte character functions to extract the 2-character sequences, so it should work with every encoding MySQL supports, including utf-8.
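To make the idea concrete, here is a minimal standalone sketch of bi-gram splitting in C. It is not the plugin's code: the names char_len, append_token and bigram_tokens are hypothetical, the input is assumed to be UTF-8 (the plugin itself relies on MySQL's multi-byte functions instead), whitespace is treated like any other character, and emitting a lone character for one-character input is an assumption based on the 1.0.1 changelog.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Byte length of the UTF-8 character at s, clamped to the bytes remaining. */
static size_t char_len(const char *s, size_t remaining)
{
    unsigned char c = (unsigned char)*s;
    size_t n = 1;                          /* ASCII, or an invalid lead byte */
    if ((c & 0xE0) == 0xC0) n = 2;
    else if ((c & 0xF0) == 0xE0) n = 3;
    else if ((c & 0xF8) == 0xF0) n = 4;
    return n < remaining ? n : remaining;
}

/* Append one token (preceded by '|' if not the first) to out.
   Returns 0 if the token does not fit, 1 otherwise. */
static int append_token(char *out, size_t out_size, size_t *used,
                        const char *tok, size_t tok_len)
{
    size_t sep = (*used > 0) ? 1 : 0;
    if (*used + sep + tok_len + 1 > out_size) return 0;
    if (sep) out[(*used)++] = '|';
    memcpy(out + *used, tok, tok_len);
    *used += tok_len;
    out[*used] = '\0';
    return 1;
}

/* Write the overlapping 2-character tokens of text into out, separated
   by '|'. Returns the number of tokens written. */
static size_t bigram_tokens(const char *text, char *out, size_t out_size)
{
    size_t len = strlen(text), pos = 0, used = 0, count = 0;

    assert(out != NULL && out_size > 0);
    out[0] = '\0';

    while (pos < len) {
        size_t first = char_len(text + pos, len - pos);
        size_t next = pos + first;
        size_t tok_len;

        if (next < len)
            tok_len = first + char_len(text + next, len - next);
        else if (pos == 0)
            tok_len = first;   /* one-character input: emit it alone (assumption) */
        else
            break;             /* last character is already in the previous token */

        if (!append_token(out, out_size, &used, text + pos, tok_len))
            break;             /* output buffer is full */
        count++;
        pos = next;            /* slide the window forward by one character */
    }
    return count;
}
```

Calling bigram_tokens("ping", buf, sizeof buf) fills buf with "pi|in|ng". Because 'in' is one of the tokens, a bi-gram index can match it inside 'ping', which a word-based index cannot.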

The plugin itself is also very simple. Basically, I changed just 1 function (bi_gram_parser_parse, in bi_gram_plugin.c) from the example fulltext parser that comes with the MySQL source archive.

Changes from 1.0

-- Version 1.0.1
 * Added missing files. (ChangeLog, COPYING, etc.)
 * Changed the license from GPL to LGPL.
 * Fixed the build to honor ./configure parameters.
 * Fixed a bug: searching or indexing with just one character now returns the correct result.

Compile

1) Get the source code archive: bi_gram-src-1.0.1.tar.gz

2) ./configure --prefix=/usr (or /usr/local)

3) make (the MySQL header files must be installed.)

4) make install (the plugin will be installed in /usr/lib/mysql/)

NOTE: With newer versions of the MySQL source, you may see an 'In_C_you_should_use_my_bool_instead' error when compiling. If so, replace 'bool' with 'my_bool'.

NOTE 2: If you use an i386 computer, a precompiled binary, bi_gramlib.so, is available. Download it and put it into /usr/lib/mysql (or the equivalent directory on your system).

Install

1) If you downloaded the i386 binary, copy it to /usr/lib/mysql (or the equivalent directory on your system). You do not need to do this if you ran 'make install' when compiling.

2) Modify /etc/my.cnf to add these lines:

[mysqld]
ft_min_word_len=1

3) Restart the MySQL server (because my.cnf was changed).

4) Connect to the server with the 'mysql' command, then type:

INSTALL PLUGIN bi_gram SONAME 'bi_gramlib.so';

(You can type 'SHOW PLUGINS' to check that the plugin was installed.)


Usage

1) Create a fulltext index with "WITH PARSER bi_gram" (using 'create table', 'create index', etc.):

CREATE TABLE t (c VARCHAR(255), FULLTEXT (c) WITH PARSER bi_gram);

or

CREATE FULLTEXT INDEX c ON t(c) WITH PARSER bi_gram;

2) Run a fulltext search using the 'match - against' syntax:

SELECT MATCH(c) AGAINST('case' IN BOOLEAN MODE) FROM t;

NOTE: "IN BOOLEAN MODE" is required.

Known bug

A search fails when the word is terminated with a newline. (According to the output of the ftdump tool, the bi-gram parser itself parses such a word successfully, so the failure seems to happen inside MySQL. I'm investigating further.)
