This project was developed as the main project for the final semester of a postgraduate (PG) program.
About:
This blog search engine is named Blog Xplore (Blog Explore). It offers full-text search and can also display results as clusters, grouped by the category extracted through content analysis of each post.
Search by text, blog type, post title, blog URL, author, and phrase control, with keyword results filterable by archive (year-wise)
Sorting of results based on relevance
Display of results as clusters (categories)
Trend Tool support, which shows the user keywords related to their search criteria
Spy support, which lets users see what others are searching for
User logs that record what users search for and display their recent searches
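The logging behind the Spy and recent-search features can be sketched roughly as follows. This is a minimal in-memory illustration, not the project's actual code: the SearchLog class, its method names, the size limit, and the case-insensitive normalization of phrases are all assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical in-memory search log; a real deployment would persist this data.
class SearchLog {
    private final Deque<String> recent = new ArrayDeque<>();
    private final Map<String, Integer> counts = new HashMap<>();
    private final int recentLimit;

    SearchLog(int recentLimit) {
        this.recentLimit = recentLimit;
    }

    // Records one search phrase, normalized to trimmed lower case (an assumption).
    void record(String phrase) {
        String key = phrase.trim().toLowerCase(Locale.ROOT);
        if (key.isEmpty()) {
            return;
        }
        recent.remove(key);          // a repeated phrase moves back to the front
        recent.addFirst(key);
        if (recent.size() > recentLimit) {
            recent.removeLast();     // drop the oldest phrase
        }
        counts.merge(key, 1, Integer::sum);
    }

    // Most recent phrases first.
    List<String> recentSearches() {
        return new ArrayList<>(recent);
    }

    // The n most frequent phrases; ties are broken alphabetically.
    List<String> topSearches(int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entries) {
            if (top.size() == n) {
                break;
            }
            top.add(e.getKey());
        }
        return top;
    }
}
```

A single shared SearchLog instance would serve the Spy view (what everyone is searching), while one instance per user would serve the personal recent-search list.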
Technologies:
Front-End: HTML, CSS, JavaScript, AJAX
Back-End: Java, JavaServer Pages (JSP)
Libraries: Jsoup, Apache Lucene
Tools:
Development Tool: NetBeans
Web Server: Apache Tomcat
Technical Implementation:
Crawler (depth-first search, limited to 3 levels)
Parser (match-finding technique using Jsoup)
Content Analysis for Clustering (Yahoo Content Analyzer API)
Storage & Index (Apache Lucene: posts stored as Documents and searched through an inverted index)
Display of results (Clustered & Traditional views)
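The crawler step can be illustrated with a depth-limited depth-first search. The sketch below is an assumption-laden stand-in, not the project's crawler: it walks an in-memory link graph instead of fetching pages with Jsoup, the DepthLimitedCrawler class and its method names are hypothetical, and it counts the seed page as level 1, which the description above does not specify.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: depth-limited DFS over an in-memory link graph.
// A real crawler would fetch each URL and extract its links instead.
class DepthLimitedCrawler {
    static final int MAX_DEPTH = 3; // "3 levels", counting the seed as level 1 (assumption)

    // Returns pages in the order DFS first reaches them.
    static List<String> crawl(String seed, Map<String, List<String>> links, int maxDepth) {
        List<String> order = new ArrayList<>();
        visit(seed, 1, maxDepth, links, new HashSet<>(), order);
        return order;
    }

    private static void visit(String url, int depth, int maxDepth,
                              Map<String, List<String>> links,
                              Set<String> seen, List<String> order) {
        // Stop below the depth limit, and never visit the same page twice.
        // Note: a page is expanded only at the depth where it is first reached.
        if (depth > maxDepth || !seen.add(url)) {
            return;
        }
        order.add(url);
        for (String next : links.getOrDefault(url, List.of())) {
            visit(next, depth + 1, maxDepth, links, seen, order);
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
                "blog", List.of("post1", "post2"),
                "post1", List.of("comment1"),
                "comment1", List.of("profile"),
                "post2", List.of());
        // "profile" sits at level 4 and is skipped.
        System.out.println(crawl("blog", links, MAX_DEPTH)); // prints [blog, post1, comment1, post2]
    }
}
```

Depth-first order follows one chain of links to the limit before backtracking, so the seen set is what prevents cyclic links between posts from causing repeated visits.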
User-side Features:
Search input is textual
Search results include a traditional listing and a clustered listing
Users can pick results from the clustered (category) listing or the traditional (summary) listing
Each selected result includes the author name, post date, keywords, comments, and post summary
Searches can be restricted to post title, post content, or comments
Searches can be filtered by year
Pagination is provided when results exceed the page limit
Top search phrases and recent search phrases are displayed to users
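The relationship between the traditional and clustered listings can be sketched as a grouping step over an already ranked result list. This is only an illustration under stated assumptions: the ClusteredView class and its Result type are hypothetical, and the category is taken as given (in the project it came from content analysis of the post).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: turn a relevance-ranked list into category clusters.
class ClusteredView {

    // Minimal stand-in for a search hit; field names are assumptions.
    static final class Result {
        final String title;
        final String category;

        Result(String title, String category) {
            this.title = title;
            this.category = category;
        }
    }

    // Groups results by category. LinkedHashMap keeps clusters in the order
    // their best-ranked result appears, and each cluster keeps relevance order.
    static Map<String, List<Result>> cluster(List<Result> ranked) {
        Map<String, List<Result>> clusters = new LinkedHashMap<>();
        for (Result r : ranked) {
            clusters.computeIfAbsent(r.category, k -> new ArrayList<>()).add(r);
        }
        return clusters;
    }

    public static void main(String[] args) {
        List<Result> ranked = List.of(
                new Result("Lucene tips", "Technology"),
                new Result("Pasta basics", "Food"),
                new Result("Jsoup intro", "Technology"));
        cluster(ranked).forEach((category, hits) ->
                System.out.println(category + ": " + hits.size() + " result(s)"));
    }
}
```

The ranked input list is itself the traditional view, so both listings can be produced from a single search without querying the index twice.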