How to install and run Carrot2 Document Clustering Workbench with Solr Document Source (on Ubuntu 10.04)

posted May 3, 2011, 5:57 PM by Swapnil Kulkarni   [ updated Jun 23, 2012, 3:10 PM ]
What Carrot2 is and what it is not
Carrot2 is a library and a set of supporting applications you can use to build a search results clustering engine. Such an engine organizes your search results into topics, fully automatically and without external knowledge such as taxonomies or preclassified content.

Carrot2 contains two document clustering algorithms designed specifically for search results clustering: Suffix Tree Clustering and Lingo. Carrot2 also contains components for fetching search results from several search engines, such as Yahoo!, MSN Live and Google, and it supports other document sources such as a Lucene, Solr or Google Desktop index.

Carrot2 is not a search engine itself: it has neither a crawler nor an indexer. There are a number of open source projects you can use to crawl (Nutch), index and search (Lucene, Solr) your content, which can then be queried and clustered by Carrot2.

For more details, please visit the Carrot2 API Manual.


Trying Carrot2 clustering with Solr Index on Carrot2 Document Clustering Workbench

Carrot2 Document Clustering Workbench is a standalone GUI application you can use to experiment with Carrot2 clustering on data from common search engines or your own data.

You can use Carrot2 Document Clustering Workbench to:

  • Quickly test Carrot2 clustering with your own data.
  • Fine tune Carrot2 clustering algorithms' settings to work best with your specific data.
  • Run simple performance benchmarks using different settings to predict maximum clustering throughput on a single machine.
For more details, please visit Carrot2 Document Clustering Workbench.

Installing and running Carrot2 Document Clustering Workbench

Important:
I assume that you have followed my previous tutorial, How to install Nutch and Solr on Ubuntu 10.04, and that you can successfully get a Solr response in XML format!
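
If you want to double-check that prerequisite, here is a minimal Python sketch that parses a Solr XML response and reports whether the title, content and url fields used later in this post are present. The helper names (parse_solr_docs, missing_fields) and the sample response are made up for illustration; a real response from your index will have more fields and may differ in detail.

```python
import xml.etree.ElementTree as ET

# Field names that the Workbench settings in this post rely on.
REQUIRED_FIELDS = ("title", "content", "url")


def parse_solr_docs(xml_text):
    """Return one {field name: text} dict per <doc> in a Solr XML response."""
    root = ET.fromstring(xml_text)
    docs = []
    for doc in root.iter("doc"):
        fields = {}
        for field in doc:
            name = field.get("name")
            if name is None:
                continue
            if field.tag == "arr":
                # Multi-valued field: join the individual values.
                fields[name] = " ".join((value.text or "") for value in field)
            else:
                fields[name] = field.text or ""
        docs.append(fields)
    return docs


def missing_fields(docs, required=REQUIRED_FIELDS):
    """Return the required fields that no document has a non-empty value for."""
    present = set()
    for doc in docs:
        present.update(name for name, value in doc.items() if value)
    return sorted(set(required) - present)


# Hypothetical sample response, shaped like standard Solr XML output.
SAMPLE = """<response>
  <lst name="responseHeader"><int name="status">0</int></lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="title">USC Home</str>
      <str name="content">University of Southern California</str>
      <str name="url">http://www.usc.edu/</str>
    </doc>
  </result>
</response>"""

docs = parse_solr_docs(SAMPLE)
print("documents:", len(docs))
print("missing fields:", missing_fields(docs))
```

To check a live index instead of the sample, you could read the response with urllib.request.urlopen on your Solr select URL and pass the result to parse_solr_docs.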

To run Carrot2 Document Clustering Workbench:
  1. Download and install the Java Runtime Environment (version 1.6.0 or newer) if you have not already done so.
  2. Download the Carrot2 Document Clustering Workbench Linux binaries and extract the archive to a location on your local disk.
  3. Run carrot2-workbench (Linux).
    Now, in the Search view of Carrot2 Document Clustering Workbench, choose the following settings:
Source: Solr
Algorithm: Lingo (or other as per your choice)

Basic
Query: USC (or any query of your choice)
Results: 100 (or more if you want)

Medium

Summary Field Name: content
Title Field Name: title
URL Field Name: url

Advanced

Service URL: http://127.0.0.1:8080/solr/select/

    Then click Process to view the results in the internal browser. You can play with the Attributes view to tune the clustering. Enjoy :)
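
To get a feel for what these settings mean on the Solr side, here is a rough Python sketch that turns them into a Solr select URL you can paste into a browser. This is only an approximation under my own assumptions: the function name build_solr_url is made up, and the exact request the Workbench sends is not shown in this post. The parameters q, rows, fl and wt are standard Solr query parameters.

```python
from urllib.parse import urlencode


def build_solr_url(service_url, query, results,
                   title_field, summary_field, url_field):
    """Build a Solr select URL that roughly mirrors the Workbench settings."""
    params = {
        "q": query,        # Query
        "rows": results,   # Results
        # Only ask for the three fields the Workbench is told to use.
        "fl": ",".join([title_field, summary_field, url_field]),
        "wt": "xml",       # XML response format
    }
    return service_url.rstrip("?") + "?" + urlencode(params)


# Values taken from the settings listed in this post.
url = build_solr_url(
    service_url="http://127.0.0.1:8080/solr/select/",
    query="USC",
    results=100,
    title_field="title",
    summary_field="content",
    url_field="url",
)
print(url)
```

If opening the printed URL returns documents with title, content and url values, the same field names should work in the Workbench.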


