
Resources for NLP, Sentiment Analysis, and Deep Learning


This resource page collects several types of material for gaining a better understanding of Natural Language Processing (NLP), Sentiment Analysis, and Deep Learning techniques:

Resource Lists

  • KDnuggets Tutorial List: Deep Learning is a very hot Machine Learning technique that has been achieving remarkable results recently. This list collects free resources for learning and using Deep Learning.

  • deeplearning.net Deep Learning Tutorials: The tutorials presented here will introduce you to some of the most important deep learning algorithms and will also show you how to run them using Theano. Theano is a python library that makes writing deep learning models easy, and gives the option of training them on a GPU.

Overviews & Blogs

  • Understanding LSTM Networks, August 2015, Christopher Olah: Long Short Term Memory networks – usually just called “LSTMs” – are a special kind of RNN, capable of learning long-term dependencies.
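The gating equations that Olah's post walks through can be sketched in a few lines of NumPy. This is a rough illustration only; the sizes and random weights below are arbitrary toy values, not anything from the post:

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step: four gates computed from input x and previous hidden state."""
    z = W @ x + U @ h_prev + b          # stacked pre-activations for all four gates
    H = h_prev.shape[0]
    f = 1 / (1 + np.exp(-z[0:H]))       # forget gate
    i = 1 / (1 + np.exp(-z[H:2*H]))     # input gate
    o = 1 / (1 + np.exp(-z[2*H:3*H]))   # output gate
    g = np.tanh(z[3*H:4*H])             # candidate cell update
    c = f * c_prev + i * g              # cell state carries long-term memory
    h = o * np.tanh(c)                  # hidden state exposed to the next layer
    return h, c

rng = np.random.default_rng(0)
D, H = 3, 4                             # input and hidden sizes (toy values)
W = rng.normal(size=(4 * H, D)) * 0.1
U = rng.normal(size=(4 * H, H)) * 0.1
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for t in range(5):                      # unroll over a short random input sequence
    h, c = lstm_step(rng.normal(size=D), h, c, W, U, b)
print(h.shape, c.shape)                 # (4,) (4,)
```

The multiplicative forget/input gates are what let the cell state `c` preserve information across many steps, which is the long-term-dependency point the post makes.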

  • A Few Useful Things to Know about Machine Learning, 2012, Pedro Domingos, University of Washington CS Department: This article summarizes twelve key lessons that machine learning researchers and practitioners have learned. These include pitfalls to avoid, important issues to focus on, and answers to common questions.

  • Understanding Convolution in Deep Learning: Convolution is probably the most important concept in deep learning right now. It was convolution and convolutional nets that catapulted deep learning to the forefront of almost any machine learning task there is. But what makes convolution so powerful? How does it work? In this blog post I will explain convolution and relate it to other concepts that will help you to understand convolution thoroughly.

  • Conv Nets: A Modular Perspective: In the last few years, deep neural networks have led to breakthrough results on a variety of pattern recognition problems, such as computer vision and voice recognition. One of the essential components leading to these results has been a special kind of neural network called a convolutional neural network.
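As a rough illustration of what the sliding-window operation in these posts computes, here is a naive "valid" 2-D convolution in NumPy (strictly speaking this is cross-correlation, which is what most deep learning libraries implement under the name convolution):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 'valid' 2-D convolution: slide the kernel and take windowed dot products."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)  # a smooth intensity ramp
edge = np.array([[1.0, -1.0]])                    # horizontal edge-detector kernel
print(conv2d_valid(image, edge))                  # all -1.0: the ramp has uniform gradient
```

The same small kernel is reused at every position, which is the weight-sharing idea that makes convolutional nets so parameter-efficient.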

  • [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/): May 21, 2015 - Together with this post I am also releasing code on GitHub that allows you to train character-level language models based on multi-layer LSTMs. You give it a large chunk of text and it will learn to generate text like it, one character at a time. You can also use it to reproduce my experiments below.
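A multi-layer LSTM is beyond a snippet, but the character-level modelling idea (learn which characters tend to follow which, then sample one character at a time) can be shown with the simplest possible stand-in, a bigram model. Everything below is our toy sketch, not Karpathy's code:

```python
import random
from collections import Counter, defaultdict

def train_char_bigrams(text):
    """Count, for each character, which characters tend to follow it."""
    model = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        model[a][b] += 1
    return model

def generate(model, start, length, seed=0):
    """Sample one character at a time, conditioned on the previous character."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length):
        nxt = model.get(out[-1])
        if not nxt:
            break                       # dead end: no observed successor
        chars, weights = zip(*nxt.items())
        out.append(rng.choices(chars, weights=weights)[0])
    return "".join(out)

corpus = "the theory of the thing is that the model learns the text"
model = train_char_bigrams(corpus)
print(generate(model, "t", 30))
```

An LSTM replaces the one-character history with a learned hidden state, so it can condition on much longer context, but the generate-one-character-at-a-time loop is the same.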

Tutorials

  • [Machine Learning with Torch7](http://code.madbits.com/wiki/doku.php): This wiki provides multiple tutorials, with the overall objective of teaching you how to do machine learning with Torch7. Torch7 provides a Matlab-like environment for state-of-the-art machine learning algorithms. It is easy to use and provides a very efficient implementation, thanks to an easy and fast scripting language (LuaJIT) and an underlying C implementation.

    • Tutorial 1: Setup / Basics / Getting Started
    • Tutorial 2: Supervised Learning
    • Tutorial 3: Unsupervised Learning
    • Tutorial 4: Graphical Models
    • Tutorial 5: Creating New Modules
    • Tutorial 6: Using CUDA
  • DIY Deep Learning for Vision: a Hands-On Tutorial with Caffe: This is a hands-on tutorial intended to present state-of-the-art deep learning models and equip vision researchers with the tools and know-how to incorporate deep learning into their work.

  • Stanford: Unsupervised Feature Learning and Deep Learning: This tutorial will teach you the main ideas of Unsupervised Feature Learning and Deep Learning. By working through it, you will also get to implement several feature learning/deep learning algorithms, see them work for yourself, and learn how to apply/adapt these ideas to new problems.

  • [Kaggle: Word2Vec Tutorial for Movie Reviews](https://www.kaggle.com/c/word2vec-nlp-tutorial): This tutorial will help you get started with Word2Vec for natural language processing. It has two goals:

    • Basic Natural Language Processing: Part 1 of this tutorial is intended for beginners and covers basic natural language processing techniques, which are needed for later parts of the tutorial.
    • Deep Learning for Text Understanding: In Parts 2 and 3, we delve into how to train a model using Word2Vec and how to use the resulting word vectors for sentiment analysis.
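One simple way word vectors get turned into review-level features for sentiment analysis is vector averaging. Here is a toy sketch with hand-made 3-dimensional "vectors" standing in for real Word2Vec output (the Kaggle tutorial trains real vectors; these numbers are purely illustrative):

```python
import numpy as np

# Toy 3-d "word vectors" standing in for trained Word2Vec output (illustrative only).
vectors = {
    "great":    np.array([ 0.9, 0.1, 0.2]),
    "terrible": np.array([-0.8, 0.0, 0.3]),
    "movie":    np.array([ 0.0, 0.7, 0.1]),
}

def review_features(review, vectors):
    """Average the vectors of known words, giving one fixed-size
    feature vector per review regardless of review length."""
    words = [vectors[w] for w in review.lower().split() if w in vectors]
    if not words:
        return np.zeros(3)              # no known words: all-zero features
    return np.mean(words, axis=0)

print(review_features("Great movie", vectors))   # mean of the two word vectors
```

The resulting fixed-size vectors can be fed to any standard classifier (the tutorial uses a random forest) to predict review sentiment.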

Books

  • Deep Learning in Neural Networks: An Overview (2014): In recent years, deep artificial neural networks (including recurrent ones) have won numerous contests in pattern recognition and machine learning. This historical survey compactly summarises relevant work, much of it from the previous millennium. Shallow and deep learners are distinguished by the depth of their credit assignment paths, which are chains of possibly learnable, causal links between actions and effects. I review deep supervised learning (also recapitulating the history of backpropagation), unsupervised learning, reinforcement learning & evolutionary computation, and indirect search for short programs encoding deep and large networks.

  • Deep Machine Learning — A New Frontier in Artificial Intelligence Research (2010): This article provides an overview of the mainstream deep learning approaches and research directions proposed over the past decade. It is important to emphasize that each approach has strengths and weaknesses, depending on the application and context in which it is being used. Thus, this article presents a summary on the current state of the deep machine learning field and some perspective into how it may evolve.

  • Neural Networks and Deep Learning (2014): Neural Networks and Deep Learning is a free online book. The book will teach you about:

    • Neural networks, a beautiful biologically-inspired programming paradigm which enables a computer to learn from observational data
    • Deep learning, a powerful set of techniques for learning in neural networks
  • Deep Learning (2015, in draft): Note: the book has been reorganized into three parts, with some chapters moved around or split accordingly:

    • Applied math and machine learning basics (can be skipped by readers with the appropriate background)
    • Modern practical deep networks (currently used in industry, mostly supervised learning)
    • Deep learning research (looking forward, mostly unsupervised learning)
  • A Primer on Neural Network Models for Natural Language Processing: A pamphlet/primer by Yoav Goldberg.

    • Introduces MLPs (including convolutional neural nets) and recurrent/recursive architectures for document classification as well as structured output prediction

Classes

  • Stanford CS231n: Convolutional Neural Networks for Visual Recognition: During the 10-week course, students will learn to implement, train and debug their own neural networks and gain a detailed understanding of cutting-edge research in computer vision. We will focus on teaching how to set up the problem of image recognition, the learning algorithms (e.g. backpropagation), practical engineering tricks for training and fine-tuning the networks and guide the students through hands-on assignments and a final course project.

  • Stanford CS224d: Deep Learning for Natural Language Processing (homepage, syllabus) (2015): The course provides a deep excursion into cutting-edge research in deep learning applied to NLP. On the model side we will cover word vector representations, window-based neural networks, recurrent neural networks, long short-term memory models, recursive neural networks, and convolutional neural networks, as well as some very novel models involving a memory component.

  • Oxford Machine Learning: Topics include: Linear prediction, Maximum likelihood, Regularizers, Optimisation, Logistic regression, Back-propagation and layer-wise design of neural nets, Neural networks and deep learning with Torch, Convolutional neural networks, Max-margin learning and siamese networks, Recurrent neural networks and LSTMs, Handwriting with recurrent neural networks, Variational autoencoders and image generation, Reinforcement learning with direct policy search, and Reinforcement learning with action-value functions.
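Several of the early Oxford topics (linear prediction, logistic regression, optimisation) fit in one small sketch. Below is a minimal batch-gradient-descent logistic regression on toy data; this is our illustration, not course code:

```python
import numpy as np

def train_logreg(X, y, lr=0.5, steps=2000):
    """Logistic regression fit by batch gradient descent on the cross-entropy loss."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(X @ w + b)))   # predicted probabilities
        grad_w = X.T @ (p - y) / len(y)      # gradient of mean cross-entropy w.r.t. w
        grad_b = np.mean(p - y)              # gradient w.r.t. bias
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Linearly separable toy data: label is 1 only when both features are 1 (logical AND).
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 0., 0., 1.])
w, b = train_logreg(X, y)
preds = (1 / (1 + np.exp(-(X @ w + b))) > 0.5).astype(float)
print(preds)                                  # [0. 0. 0. 1.]
```

The same gradient-descent loop, applied layer by layer via the chain rule, is back-propagation, the next topic on the Oxford list.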

  • statistics.com: 3-wk Sentiment Analysis Course: This online course, “Sentiment Analysis,” is designed to give you an introduction to the algorithms, techniques and software used in sentiment analysis. Their use will be illustrated by reference to existing applications, particularly product reviews and opinion mining. The course will try to make clear both the capabilities and the limitations of these applications. For real-world applications, sentiment analysis draws heavily on work in computational linguistics and text-mining. At the completion of the course, a student will have a good idea of the field of sentiment analysis, the current state-of-the-art and the issues and problems that are likely to be the focus of future systems. Course Program:

    • WEEK 1: Introduction and Subjectivity Analysis
    • WEEK 2: Sentiment Extraction
    • WEEK 3: Opinion Retrieval and Spam
  • Coursera: Natural Language Processing: This course covers a broad range of topics in natural language processing, including word and sentence tokenization, text classification and sentiment analysis, spelling correction, information extraction, parsing, meaning extraction, and question answering. We will also introduce the underlying theory from probability, statistics, and machine learning that is crucial for the field, and cover fundamental algorithms like n-gram language modeling, naive Bayes and maxent classifiers, sequence models like Hidden Markov Models, probabilistic dependency and constituency parsing, and vector-space models of meaning.

    • Week 1 - Basic Text Processing; Course Introduction; Edit Distance
    • Week 2 - Language Modeling; Spelling Correction
    • Week 3 - Sentiment Analysis; Text Classification
    • Week 4 - Discriminative classifiers: Maximum Entropy classifiers; Named entity recognition and Maximum Entropy Sequence Models; Relation Extraction
    • Week 5 - Advanced Maximum Entropy Models; Instructor Chat; Parsing Introduction; POS Tagging
    • Week 6 - Dependency Parsing; Lexicalized Parsing; Probabilistic Parsing
    • Week 7 - Information Retrieval; Ranked Information Retrieval
    • Week 8 - Instructor Chat II; Question Answering; Semantics; Summarization
  • Coursera: Neural Networks for Machine Learning: Neural networks use learning algorithms that are inspired by our understanding of how the brain learns, but they are evaluated by how well they work for practical applications such as speech recognition, object recognition, image retrieval, and the ability to recommend products that a user will like. As computers become more powerful, neural networks are gradually taking over from simpler machine learning methods. They are already at the heart of a new generation of speech recognition devices, and they are beginning to outperform earlier systems for recognizing objects in images. The course will explain the new learning procedures that are responsible for these advances, including effective new procedures for learning multiple layers of non-linear features, and give you the skills and understanding required to apply these procedures in many other domains.
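The naive Bayes classifier covered in the NLP course above is simple enough to sketch in full. Here is a minimal multinomial naive Bayes with add-one smoothing on toy reviews (our illustration, not course material):

```python
import math
from collections import Counter

def train_nb(docs):
    """Count word occurrences per class and documents per class."""
    counts = {"pos": Counter(), "neg": Counter()}
    n_docs = Counter()
    for label, text in docs:
        counts[label].update(text.lower().split())
        n_docs[label] += 1
    vocab = set(counts["pos"]) | set(counts["neg"])
    return counts, n_docs, vocab

def classify(text, counts, n_docs, vocab):
    """Pick the class maximizing log prior + sum of smoothed log likelihoods."""
    scores = {}
    total = sum(n_docs.values())
    for label in counts:
        score = math.log(n_docs[label] / total)           # log prior
        denom = sum(counts[label].values()) + len(vocab)  # add-one smoothed denominator
        for w in text.lower().split():
            score += math.log((counts[label][w] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

docs = [("pos", "great fun great acting"),
        ("pos", "loved the film"),
        ("neg", "boring plot"),
        ("neg", "terrible boring acting")]
model = train_nb(docs)
print(classify("great film", *model))   # pos
```

Add-one smoothing keeps unseen words from zeroing out a class's probability, one of the practical points the course emphasizes.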

Benchmark Test Suite

| Frameworks | Datasets | Data Models | Architectures | Optimizers | Fixed Parameters | Optional Enhancements |
| --- | --- | --- | --- | --- | --- | --- |
| Keras | IMDB | One-hot Words | LSTM | RMSProp | Epochs=10 | Autoencoder pre-training |
| Neon | Amazon | Embedded Words | CNN-generic | ADAM | Batch Size=128 | |
| Theano | Sentiment140 | One-hot Chars | CNN-crepe | | Training Size=20,000 | |
| | Sina Weibo | | | | Test Size=5,000 | |
| Lab41 | 140 | | | | | |
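The "One-hot Words" and "One-hot Chars" data models in the table are just index-based encodings. A minimal sketch of the character version (alphabet and input here are toy values):

```python
import numpy as np

def one_hot_chars(text, alphabet):
    """Encode each character as a one-hot row vector over a fixed alphabet."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    out = np.zeros((len(text), len(alphabet)))
    for pos, ch in enumerate(text):
        if ch in index:                 # characters outside the alphabet stay all-zero
            out[pos, index[ch]] = 1.0
    return out

x = one_hot_chars("bad", "abcd")
print(x)                                # one row per character, one 1.0 per row
```

Embedded words, by contrast, replace each sparse one-hot row with a learned dense vector, which is what the "Embedded Words" configuration in the table refers to.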

Chinese Datasets

Chinese Language Corpora for Sentiment Analysis

Microblogs

Open Weiboscope

This dataset comes from researchers at the Journalism and Media Center of the University of Hong Kong.

  • 226 million posts on Sina Weibo (Twitter-like microblogging service)
  • (zipped) CSV format
  • Collected in 2012 from feeds of users having > 1000 followers
  • Not tagged for sentiment
  • Released for public use, citation required, no specific licensing terms

NLPIR Weibo Content (zh)

From China's NLP and Information Retrieval sharing platform (run by the Big Data Search and Mining Lab at the Beijing Institute of Technology).

  • 230,000 posts from Sina Weibo (2011)
  • XML format
  • metadata: user ID, time of posting, etc.
  • Released for public non-commercial use, citation required, no specific licensing terms

Microblog PCU

From researchers at Xi'an Jiaotong University, shared via UC Irvine's machine learning repository.

  • About 50,000 posts from Sina Weibo.
  • Has more user metadata, apparently including full following-follower information.
  • Subject to the UCI Machine Learning Repository's usage/citation guidelines (https://archive.ics.uci.edu/ml/citation_policy.html)

NLPIR 5 million Weibo (zh)

From researchers at BIT.

  • 5 million Sina Weibo posts
  • SQL format
  • Use limited to research and teaching; commercial usage prohibited.
  • Slow connection to server; I have not yet successfully completed a download.

Medium-length documents

Surprisingly, it's harder to find publicly available corpora of medium-length texts in Chinese that aren't just news articles or other formal written genres. There are citations for corpora of product reviews and short documents, but accessing them has proved difficult.

Ren-CECps

Small corpus of blog posts with annotations of emotion and sentiment at document, paragraph, and sentence levels. Constructed by Changqin Quan (Hefei University of Technology) and Fuji Ren (Tokushima University).

  • 1,500 blog posts (11k paragraphs, 35k sentences)
  • annotated for 3-way polarity, real-valued scores on 8 emotion categories
  • Has been publicly released [http://a1-www.is.tokushima-u.ac.jp/member/ren/Ren-CECps1.0/Ren-CECps1.0.html], but is not currently accessible through that link. Fuji Ren can be contacted (ren@is.tokushima-u.ac.jp) for a license agreement.

ChnSentiCorp

Small corpus of product reviews, maintained by Tan Songbo (Chinese Academy of Sciences, tansongbo@software.ict.ac.cn).

  • 6,000 reviews of hotels, computers, and books.
  • Includes ratings as sentiment polarity labels
  • Not currently accessible
  • Limited to academic use

Mandarin Chinese News Text (LDC)

  • 250 million Chinese character corpus (hundreds of thousands of documents)
  • News text from People's Daily, Xinhua newswire, China Radio International
  • $500.00 for non-members

GALE Phase 1 Chinese Blog Parallel Text (LDC)

  • 277 blog posts in Chinese with English translations
  • $1500.00 for non-members

Sogou News (zh)

Anacode Chinese NLP API Web Data

Articles and user-generated content scraped from major Chinese domains, including texts and relevant metadata (date, author, source, etc.). Maintained and regularly updated by Anacode GmbH.

  • More than 10 industries (automotive, health, cosmetics etc.)
  • Data in JSON format.
  • Free access for most of the datasets
  • Additional semantic information on datasets based on Anacode's NLP analysis.

Conferences

2015


| Conference | Dates | Location | Overview |
| --- | --- | --- | --- |
| MAY 2015 | | | |
| MLConf | May 1 | Seattle | MLconf gathers communities to discuss recent research and the application of algorithms, tools, and platforms to solve the hard problems of organizing and analyzing massive, noisy data sets. |
| ICLR 2015 | May 7-9 | San Diego | Despite the importance of representation learning to machine learning and to application areas such as vision, speech, audio, and NLP, there was no venue for researchers who share a common interest in the topic. The goal of the International Conference on Learning Representations is to help fill this void. |
| ICANN | May 14-15 | Amsterdam | ICANN 2015, the 17th International Conference on Artificial Neural Networks, aims to bring together leading academic scientists, researchers, and scholars to exchange and share their experiences and results on all aspects of artificial neural networks. |
| Deep Learning Summit | May 26-27 | Boston | Explores the impact of image and speech recognition as a disruptive trend in business and industry, and how multiple levels of representation and abstraction can help make sense of data such as images, sound, and text. |
| JUN 2015 | | | |
| SemEval | June 4-5 | Denver | The International Workshop on Semantic Evaluation is an ongoing series of evaluations of computational semantic analysis systems; it evolved from the Senseval word sense evaluation series. The evaluations are intended to explore the nature of meaning in language. |
| IWANN | June 10-12 | Spain | This biennial meeting seeks to provide a discussion forum for scientists, engineers, educators, and students on the latest ideas and realizations in the foundations, theory, models, and applications of hybrid systems inspired by nature (neural networks, fuzzy logic, and evolutionary systems), as well as in related emerging areas. |
| JUL 2015 | | | |
| IJCNN | July 12-17 | Ireland | The International Joint Conference on Neural Networks is the premier international conference in the area of neural networks. |
| AUG 2015 | | | |
| KDD | Aug 10-13 | Sydney | KDD 2015 is a premier conference that brings together researchers and practitioners from data mining, knowledge discovery, data analytics, and big data. |
| SEP 2015 | | | |
| MLConf | Sep 18 | Atlanta | |
| GigaOm Structure Intel | Sep 22-23 | SF | Advances in artificial intelligence are enabling many of the promises made during the early days of big data. |
| Deep Learning Summit | Sep 24-25 | London | Explores the impact of image and speech recognition as a disruptive trend in business and industry, and how multiple levels of representation and abstraction can help make sense of data such as images, sound, and text. |
| OCT 2015 | | | |
| NOV 2015 | | | |
| LT-Accelerate | Nov 24-25 | Brussels | |
| DEC 2015 | | | |
| NIPS 2015 | Dec 7-12 | Montreal | |