Yong-Gang Cao (曹勇刚)'s archives


moving to http://sites.google.com/site/chiefadminofficer/

Yong-Gang Cao, PhD
Research Associate
College of Health Science
University of Wisconsin-Milwaukee
One page Resume

CV available upon request


Categorized Collections

  • Machine Learning
    • Algorithms/Models
      • ~ Conditional Random Field
        • Conditional random fields (CRFs) are a probabilistic framework for labeling and segmenting structured data, such as sequences, trees and lattices.

      • ~ Maximum Entropy
        • The principle of maximum entropy is a method for analyzing available qualitative information in order to determine a unique epistemic probability distribution. It states that the least biased distribution that encodes certain given information is that which maximizes the information entropy.

      • ~ Logistic Regression
        • Logistic regression is a model used to predict the probability that an event occurs.
      • ~ Support Vector Machine
        • Support vector machines (SVMs) are a set of related supervised learning methods used for classification and regression. They belong to a family of generalized linear classifiers. They can also be considered a special case of Tikhonov regularization. A special property of SVMs is that they simultaneously minimize the empirical classification error and maximize the geometric margin; hence they are also known as maximum margin classifiers.

      • ~ Naive Bayes
      • AdaBoost
      • Decision Tree
      • k-means
      • k-Nearest Neighbors
      • Hidden Markov Model
      • Association Rule
      • ~ IR Language Model
    • Tools
      • ~ Mallet
        • MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.

      • ~ Weka
        • Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.

  • NLP
    • Tools
      • ~ OpenNLP
        • OpenNLP is an organizational center for open source projects related to natural language processing. Its primary role is to encourage and facilitate the collaboration of researchers and developers on such projects.

      • ~ GATE
        • GATE is...
          • the Eclipse of Natural Language Engineering, the Lucene of Information Extraction, a leading toolkit for Text Mining
          • used worldwide by thousands of scientists, companies, teachers and students
          • comprised of an architecture, a free open source framework (or SDK) and graphical development environment
          • used for all sorts of language processing tasks, including Information Extraction in many languages
          • funded by the EPSRC, BBSRC, AHRC, the EU and commercial users
          • 100% Java reference implementation of ISO TC37/SC4 and used with XCES in the ANC
          • 10 years old in 2005, used in many research projects and compatible with IBM's UIMA
          • based on MVC, mobile code, continuous integration, and test-driven development, with code hosted on SourceForge

    • Data
      • ~ WordNet
        • WordNet® is a large lexical database of English, developed under the direction of George A. Miller. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with the browser. WordNet is also freely and publicly available for download. WordNet's structure makes it a useful tool for computational linguistics and natural language processing.

  • IR
    • Tools
      • ~ Lucene
        • The Apache Lucene project develops open-source search software, including:
          • Lucene Java, our flagship sub-project, provides Java-based indexing and search technology.
          • Nutch builds on Lucene Java to provide web search application software.
          • Lucy is a loose C port of Lucene Java, with Perl and Ruby bindings.
          • Solr is a high performance search server built using Lucene Java, with XML/HTTP and JSON/Python/Ruby APIs, hit highlighting, faceted search, caching, replication, and a web admin interface.
          • Lucene.Net is a source code, class-per-class, API-per-API and algorithmic port of the Lucene Java search engine to the C# and .NET platform utilizing Microsoft .NET Framework. Lucene.Net is currently under incubation.
          • Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries. Tika is currently under incubation.
          • Mahout is a new Lucene subproject with the goal of creating a suite of scalable machine learning libraries.

      • ~ Lemur toolkits
        • The Lemur Toolkit is an open-source toolkit designed to facilitate research in language modeling and information retrieval. Lemur supports a wide range of industrial and research language applications such as ad-hoc retrieval, site-search, and text mining.

      • Smart
      • Okapi
      • Inquery
      • ~ Swish-e
      • ~ Compass
      • ~ MG
        • The retrieval system that accompanies the book "Managing Gigabytes"
      • ~ Xapian Code Library
      • Carrot2
    • Data
      • ~ TREC
        • TREC text and web collections with gold-standard relevance judgments
  • Development
    • Language
    • IDE
    • Libraries
      • ~ Math
        • Commons Math is a library of lightweight, self-contained mathematics and statistics components addressing the most common problems not available in the Java programming language or Commons Lang.

    • Framework
      • ~ Seam
        • JBoss Seam is a powerful new application framework for building next generation Web 2.0 applications by unifying and integrating technologies such as Asynchronous JavaScript and XML (AJAX), Java Server Faces (JSF), Enterprise Java Beans (EJB3), Java Portlets and Business Process Management (BPM).

  • BioInformatics
    • Data
      • ~ UMLS
        • The NLM Unified Medical Language System (UMLS) project develops and distributes multi-purpose, electronic "Knowledge Sources" and associated lexical tools for system developers. Researchers will find the UMLS products useful in investigating knowledge representation and retrieval questions.

    • Image/Figure Search
  • Systems
    • Search Engine
    • Thesaurus
    • Question Answering
      • ~ START
        • START, the world's first Web-based question answering system, has been on-line and continuously operating since December, 1993.

      • ~ ask Hermes
        • The project I am currently working on. It answers ad-hoc clinical questions. You can enter words, a phrase, or a full question. Answers are automatically organized and ranked in a structured way.

  • Books
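
A small worked example for the Maximum Entropy entry under Machine Learning above. It is only a sketch of the principle, not code from any of the tools listed here: given a die whose average roll is known, it finds the least biased distribution over the six faces that matches that average. The function names (maxent_die, entropy) and the bisection bounds are assumptions made for this illustration.

```python
import math


def maxent_die(target_mean, faces=(1, 2, 3, 4, 5, 6), tol=1e-12):
    """Maximum-entropy distribution over `faces` with a fixed mean.

    Under a mean constraint the maximizer has the exponential form
    p_i proportional to exp(lam * face_i). The mean grows with lam,
    so lam can be found by bisection. target_mean must lie strictly
    between the smallest and largest face.
    """
    def dist(lam):
        weights = [math.exp(lam * f) for f in faces]
        total = sum(weights)
        return [w / total for w in weights]

    def mean(p):
        return sum(f * pi for f, pi in zip(faces, p))

    lo, hi = -50.0, 50.0  # assumed search bounds for lam
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if mean(dist(mid)) < target_mean:
            lo = mid
        else:
            hi = mid
    return dist((lo + hi) / 2.0)


def entropy(p):
    """Shannon entropy in nats; zero-probability outcomes contribute 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


if __name__ == "__main__":
    fair = maxent_die(3.5)    # no bias needed: comes out uniform
    loaded = maxent_die(4.5)  # tilted toward the high faces
    print([round(x, 4) for x in fair])
    print([round(x, 4) for x in loaded])
    # Another distribution with the same mean 4.5 has lower entropy.
    print(entropy(loaded) > entropy([0, 0, 0, 0.5, 0.5, 0]))
```

With a target mean of 3.5 the constraint carries no extra information, so the result is the uniform distribution. With 4.5 the probabilities tilt toward the high faces, but only as much as the constraint requires.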