Data Science and NLP

Generative Adversarial Network based Language Identification for Closely Related Same Language Family

Discriminating between similar languages is one of the main challenges in automatic language identification. In this paper, we address this problem by proposing a generative adversarial network based language identification method for sentences from closely related languages of the same language family. The proposed method uses dual-reward feedback learning and comprises a generator that produces sentences close to real ones, a discriminator that determines how similar the generated sentences are to the training data, and a classifier that predicts the correct label. We evaluate the proposed model on pairs of languages and on the overall test data of an Indo-Aryan languages dataset. The effectiveness of our method is demonstrated in comparison with existing state-of-the-art methods.

[CoCoNet 2020] (under publication)

Fake Review Detection using Hybrid Ensemble Learning

Opinion spam on online restaurant review sites is a major problem, as reviews influence users' choice of whether to visit a restaurant. In this paper, we address the problem of distinguishing genuine from fake reviews in online restaurant reviews. We propose a fake review detection technique comprising data preprocessing, detection and ensemble learning, which learns the reviews and their features to filter out the fake reviews. First, we preprocess the data to obtain refined reviews and employ two independent classifiers for detection, one using deep machine learning and one using feature-based machine learning techniques.

These classifiers tackle the problem from two angles: the deep machine learning model learns the word distributions, while the feature-based machine learning model extracts the relevant features from the reviews. Finally, a hybrid ensemble model is built from the two classifiers to detect genuine and fake reviews. Experimental analysis on Yelp datasets shows that the proposed approach outperforms existing state-of-the-art methods.
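The abstract does not say how the two classifiers are combined, so the following is only a minimal sketch of one common option, weighted soft voting over each model's estimated probability that a review is fake. The two stub models, their keyword rules, the weights and the threshold are all hypothetical stand-ins, not the models from the paper.

```python
from typing import Callable, List

# A classifier is any function mapping review text to P(fake) in [0, 1].
Classifier = Callable[[str], float]


def ensemble_predict(review: str,
                     classifiers: List[Classifier],
                     weights: List[float],
                     threshold: float = 0.5) -> str:
    """Weighted soft voting: average P(fake) across models, then threshold."""
    if not classifiers or len(classifiers) != len(weights):
        raise ValueError("need one weight per classifier")
    total = sum(weights)
    score = sum(w * clf(review) for clf, w in zip(classifiers, weights)) / total
    return "fake" if score >= threshold else "genuine"


def deep_model_stub(review: str) -> float:
    """Hypothetical stand-in for the word-distribution model."""
    hype = {"best", "amazing", "perfect", "awesome"}
    hits = sum(1 for w in review.lower().split() if w.strip(".,!?") in hype)
    return min(1.0, hits / 3)


def feature_model_stub(review: str) -> float:
    """Hypothetical stand-in for the feature-based model (review length only)."""
    return 0.8 if len(review.split()) < 6 else 0.2


models = [deep_model_stub, feature_model_stub]
label_short = ensemble_predict("Best amazing perfect!", models, [0.5, 0.5])
label_long = ensemble_predict(
    "The soup was warm and the staff explained every dish on the menu",
    models, [0.5, 0.5])
```

Averaging probabilities rather than hard labels lets a confident model outweigh an uncertain one. In practice the weights would be tuned on held-out data.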

[CoCoNet 2020] (under publication)

Sentiment based Food Classification for Restaurant Business

In this paper, we address the problem of classifying restaurant reviews into different meal courses based on review sentiment. Based on our survey, we observe that no existing work focuses on classifying food items into meal courses and listing the top n food items under each course using restaurant reviews. Most works in the literature address issues such as predicting restaurant ratings or devising restaurant business strategies by considering various attributes of the restaurant. Thus, in this paper, we propose a sentiment based food classification framework consisting of two tasks, namely, sentiment classification and four-course meal classification. In sentiment classification, we classify the reviews into positive and negative categories based on their sentiment. In four-course meal classification, we categorize the reviews into four courses, namely, soups and salads, appetizers, main course and desserts. We list the top n food items liked by most of the customers in each of these courses. To select a suitable classification technique, we analyze the learning curves in terms of the bias-variance trade-off. We observe that the support vector machine (SVM) classifier outperforms the other techniques. The performance analysis of the proposed framework is carried out on the standard Yelp dataset. The top 5 food items obtained in each category are soups and salads: salad, corn, coffee, cocktail, tea; appetizers: burger, pizza, bread, cheese, frie; main course: steak, meal, chicken, meat, beef; and desserts: doughnut, cream, cake, chocolate, flan.
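The step of listing the top n food items per course can be sketched as below. This is an illustration under stated assumptions, not the paper's pipeline. The sentiment labels are taken as already predicted, and a small hypothetical lexicon (`COURSE_LEXICON`) stands in for the four-course classifier.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical lexicon standing in for the four-course meal classifier.
COURSE_LEXICON = {
    "soups and salads": {"salad", "soup"},
    "appetizers": {"burger", "pizza", "bread"},
    "main course": {"steak", "chicken", "beef"},
    "desserts": {"cake", "chocolate", "flan"},
}


def top_items(reviews: Iterable[Tuple[str, str]], n: int = 5) -> Dict[str, List[str]]:
    """Rank food items per course by the number of positive reviews mentioning them.

    `reviews` holds (text, sentiment) pairs, where sentiment is assumed to be
    the output of a separate sentiment classifier.
    """
    counts = defaultdict(Counter)
    for text, sentiment in reviews:
        if sentiment != "positive":
            continue
        # A set, so each review counts an item at most once.
        tokens = {tok.strip(".,!?").lower() for tok in text.split()}
        for course, items in COURSE_LEXICON.items():
            counts[course].update(tokens & items)
    return {course: [item for item, _ in counts[course].most_common(n)]
            for course in COURSE_LEXICON}


sample_reviews = [
    ("The steak was great and the cake too", "positive"),
    ("Loved the steak!", "positive"),
    ("Chicken was fine, steak better", "positive"),
    ("Terrible pizza", "negative"),
    ("Nice chicken and chocolate cake.", "positive"),
]
ranking = top_items(sample_reviews, n=2)
```

Counting each item once per review approximates "liked by most of the customers" rather than rewarding a single review that repeats a dish name.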

[ICACCI 2018]

Restaurant setup business analysis using yelp dataset

In this paper, we address the issues associated with setting up a new restaurant business. To strategize a new restaurant business, we propose a restaurant business framework comprising the three most important tasks: identifying the high-frequency attributes, the most crowded day and the location of the restaurant. First, we identify the features/attributes of restaurants in which customers are most interested, so that those facilities and services can be provided to increase profit. Next, we identify the day of the week on which restaurants are most heavily crowded, so that the best recipes and offers can be made available on that day. Finally, since location has a profound effect on the success of a restaurant business, we treat location as the most important factor and examine the nearby restaurants and their facilities before setting up a new restaurant. The performance analysis of the proposed framework was carried out on the standard Yelp dataset. We found credit card to be the most preferred attribute, Monday to be the most crowded day and Divey to be the most desired ambience among customers. We also demonstrate how a new restaurant can be set up by identifying the nearest restaurants and their services.
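Finding the most crowded day amounts to counting visits per weekday. The sketch below assumes check-ins are available as `YYYY-MM-DD HH:MM:SS` timestamp strings. That format and the sample data are assumptions for illustration, and the layout of the Yelp check-in data used in the paper may differ.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable, Tuple

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def most_crowded_day(checkins: Iterable[str]) -> Tuple[str, Counter]:
    """Return the weekday with the most check-ins and the per-day counts."""
    counts = Counter(
        DAYS[datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").weekday()]
        for ts in checkins
    )
    if not counts:
        raise ValueError("no check-ins given")
    day, _ = counts.most_common(1)[0]
    return day, counts


# Made-up sample timestamps, not taken from the Yelp dataset.
sample_checkins = [
    "2017-01-02 12:30:00",  # Monday
    "2017-01-09 19:05:00",  # Monday
    "2017-01-03 13:00:00",  # Tuesday
    "2017-01-07 20:15:00",  # Saturday
]
busiest, per_day = most_crowded_day(sample_checkins)
```

Indexing a fixed list of day names with `weekday()` keeps the result independent of the system locale.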

[ICACCI 2017]

Vehicular traffic analysis from social media data

In this paper, we address the problem of vehicular traffic congestion in densely populated cities. To this end, we propose a framework for an optimal vehicular traffic solution using live social media data. Typically, approaches to traffic congestion in the literature rely on dedicated traffic sensors and satellite information, which are quite expensive. However, many urban commuters post updates about traffic on social media in the form of tweets or Facebook posts. Given the copious amount of data on traffic problems available on social media sites, we collect historical traffic posts from specific cities and build a sentiment classifier to monitor commuters' emotions round the clock. This knowledge is used to analyze and predict traffic patterns in a given location. We also identify the probable cause of traffic congestion in a particular area by analyzing the collected historical data. Through our work, we present an uncensored and economical alternative to traditional methods for monitoring traffic congestion.

[ICACCI 2016]

Disaster Analysis through tweets

Social networks offer a wealth of information on people’s behavior, trends, opinions and emotions during human-affecting events such as natural disasters. During a disaster, social media provides a plethora of information, including the nature of the disaster, affected people’s emotions and relief efforts. In this paper, we propose a natural-disaster analysis interface that relies solely on tweets generated by Twitter users during a natural disaster. We collect streaming tweets relating to disasters and build a sentiment classifier to categorize users’ emotions during disasters according to their level of distress. Various analysis techniques are applied to the collected tweets, and the results are presented as detailed graphical analyses showing users’ emotions during a disaster, the frequency distribution of various disasters and the geographical distribution of disasters. We observe that our analysis of social media data provides a viable, economical, uncensored and real-time alternative to traditional methods for analyzing disasters and the affected population’s perception of a natural disaster.

[ICACCI 2015]

Classification of facebook news feeds and sentiment analysis

As recently seen in Google's Gmail, the messages in the inbox are classified into primary, social and promotions, which makes it easy for users to pick out the messages they are looking for from the bulk of messages. Similarly, a user's wall on Facebook is usually flooded with a huge amount of data, which makes it hard for users to find the important news feeds among the rest. We therefore focus on the classification of Facebook news feeds. In this paper, we attempt to classify users' news feeds into various categories using classifiers to provide a better representation of the data on a user's wall. News feeds collected from Facebook are dynamically classified into classes such as friends posts and liked pages posts. Posts or updates from pages liked by the user are grouped as liked pages posts. Posts from friends are tagged as friends posts; those regarding events occurring in their lives are tagged as life event posts, and the rest as entertainment posts. This helps users find “important news feeds” among “live news feeds”. Sentiments are important as they depict the opinions and expressions of the user. Hence, detecting users' sentiments from the life event posts also becomes an essential task. We therefore also propose a system that automatically detects sentiments in the life event posts and categorizes them into happy, neutral and bad feelings posts. This paper applies classification methods from the literature to our dataset with the objective of evaluating methods for automatic news feed classification and sentiment analysis, which in future can give a Facebook page a well-organized and more appealing look.
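The category hierarchy described above can be sketched as a two-level routing function. Everything inside the helpers is a hypothetical placeholder. The paper uses trained classifiers, whereas the keyword rules, the `source` field and the word lists here exist only to make the control flow concrete.

```python
from typing import Callable, Dict

# Hypothetical keyword lists standing in for trained classifiers.
LIFE_WORDS = {"married", "graduated", "born", "engaged"}
HAPPY_WORDS = {"happy", "excited", "thrilled"}
BAD_WORDS = {"sad", "lost", "sorry"}


def _tokens(text: str) -> set:
    return {w.strip(".,!?").lower() for w in text.split()}


def keyword_life_event(text: str) -> bool:
    """Placeholder for the life event vs. entertainment classifier."""
    return bool(_tokens(text) & LIFE_WORDS)


def keyword_sentiment(text: str) -> str:
    """Placeholder for the sentiment classifier on life event posts."""
    toks = _tokens(text)
    if toks & HAPPY_WORDS:
        return "happy"
    if toks & BAD_WORDS:
        return "bad"
    return "neutral"


def classify_post(post: Dict[str, str],
                  is_life_event: Callable[[str], bool] = keyword_life_event,
                  sentiment_of: Callable[[str], str] = keyword_sentiment) -> Dict[str, str]:
    """Route a post: liked page vs. friend, then life event vs. entertainment."""
    if post["source"] == "page":
        return {"category": "liked pages post"}
    if is_life_event(post["text"]):
        return {"category": "life event post",
                "sentiment": sentiment_of(post["text"])}
    return {"category": "entertainment post"}


life = classify_post({"source": "friend", "text": "Just graduated, so happy!"})
page = classify_post({"source": "page", "text": "New menu out now"})
fun = classify_post({"source": "friend", "text": "Check out this funny video"})
```

Passing the two classifiers in as functions means trained models can replace the keyword placeholders without changing the routing logic.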

[ICACCI 2014]