2012.07.01 - 2017.06.30 (5 years). This research was supported by the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (No. 2012M3C4A7033344)
The goal of this project is the development of enabling software technologies for big data mining. Through this project, we will research data mining techniques for big data in natural sciences and social networks. We will also develop personalized service technologies based on unstructured big data analysis and customer behavior models. Furthermore, we will produce well-trained software engineers who are experts in big data mining.
2011.05.01 - 2014.04.30 (3 years). This work was supported by the Brain Korea 21 Project in 2010 and the Mid-career Researcher Program through an NRF grant funded by the MEST (No. KRF-2011-0016029).
Combining the highly profitable information search industry with the mobile computing paradigm, the mobile information search industry has been growing rapidly despite the global economic recession. Thus, developing mobile search technology will have a positive impact on the economy. This project aims at advancing technologies in the areas of mobile search and mining, low-power-consumption utility mining, and mining for mobile online advertising.
2009.05.01 - 2012.02.28 (3 years). This work was supported by the Brain Korea 21 Project and the Mid-career Researcher Program through an NRF grant funded by the MEST (No. KRF-2009-0080667).
PubMed MEDLINE, a database of biomedical and life science journal articles, is one of the most important information sources for medical doctors and bio-researchers. Finding the right information in MEDLINE is nontrivial because it is not easy to express the intended relevance using the current PubMed query interface, and its query processor focuses on fast matching rather than accurate relevance ranking. This project develops techniques for building a user-friendly MEDLINE search engine.
2011.02.01 - 2012.01.31 (1 year). This work was supported by Samsung Electronics.
Existing recommendation systems (e.g., those from the Netflix competition) focus on accurate prediction of purchases, as the systems are evaluated on prediction accuracy. However, such systems tend to recommend popular items. Recommending popular items might not effectively influence users' purchase decisions, since users likely already know the items and have already decided whether to purchase them, e.g., a recommendation to watch Star Wars or Titanic. An effective recommender must suggest unexpected or novel items that can surprise users and influence their purchase decisions. This project develops such an effective recommendation system for digital TV customers.
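As a toy illustration of this idea (hypothetical item names and numbers, not the project's actual method), one simple way to surface novel items is to discount each item's predicted preference score by the log of its popularity, so widely known blockbusters rank below lesser-known items the user is predicted to enjoy:

```python
import math

def novelty_rerank(predictions, popularity, alpha=1.0):
    """Rank items by predicted score discounted by popularity.

    predictions: {item: predicted preference score}
    popularity:  {item: e.g., total view count}
    alpha:       strength of the novelty discount (illustrative knob)
    """
    def adjusted(item):
        # log1p keeps the discount finite for unseen items (count 0)
        return predictions[item] - alpha * math.log1p(popularity[item])
    return sorted(predictions, key=adjusted, reverse=True)

# Hypothetical example: the indie film wins despite a lower raw score,
# because the blockbuster's huge popularity is discounted away.
preds = {"Star Wars": 4.8, "indie_film": 4.2}
pops = {"Star Wars": 100000, "indie_film": 50}
ranking = novelty_rerank(preds, pops)
```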
This work is supported by MSRA (Microsoft Research Asia).
Feature weighting for ranking has not been researched as extensively as feature weighting for classification. This project develops various feature weighting methods for ranking by leveraging existing methods for classification. The developed methods are applied to the Live Search query log data to identify the key features that determine users' click-through behaviors. They will also be used to build the feature selection component of RefMed, a relevance-feedback PubMed search engine.
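The pairwise reduction that lets a classification learner produce feature weights for ranking can be sketched as follows (a minimal illustration with hypothetical toy data and a simple perceptron, not the project's implementation): every pair of documents with different relevance yields a feature-difference vector as a classification example, and the magnitudes of the learned linear weights are read as per-feature importance.

```python
def pairwise_examples(docs):
    """docs: list of (feature_vector, relevance). Yields (diff, label)."""
    for i, (xi, yi) in enumerate(docs):
        for xj, yj in docs[i + 1:]:
            if yi == yj:
                continue  # equally relevant pairs carry no preference
            diff = [a - b for a, b in zip(xi, xj)]
            yield diff, 1 if yi > yj else -1

def train_perceptron(examples, epochs=50, lr=0.1):
    """Plain perceptron on the pairwise examples (stand-in classifier)."""
    examples = list(examples)
    w = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        for x, y in examples:
            score = sum(wi * xi for wi, xi in zip(w, x))
            if y * score <= 0:  # misclassified preference pair
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
    return w

def feature_weights(docs):
    """Absolute linear weights serve as feature importance for ranking."""
    w = train_perceptron(pairwise_examples(docs))
    return [abs(wi) for wi in w]

# Toy data: feature 0 drives relevance, feature 1 is noise.
docs = [([3.0, 0.2], 2), ([2.0, 0.9], 1), ([1.0, 0.1], 0), ([2.5, 0.5], 2)]
weights = feature_weights(docs)
```

On this toy data the weight for the relevance-driving feature dominates the noise feature's weight, which is the signal a feature selection component would use.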
With Prof. Jaewook Lee (PI), 2008.07.01 - 2010.06.30 (2 years). This work was supported by the Brain Korea 21 Project and the Korea Research Foundation Grant funded by the Korean Government (KRF-2008-314-D00483).
Support vector machines (SVMs) have been promising methods for classification, regression, and ranking due to their solid mathematical foundations, which include two desirable properties: margin maximization and nonlinear classification using kernels. However, despite these prominent properties, SVMs are usually not chosen for large-scale data mining problems because their training complexity is highly dependent on the data set size. Unlike traditional pattern recognition and machine learning, real-world data mining applications often involve a huge number of data records that do not fit in main memory, and multiple scans of the data set are often too expensive. Through this project, we developed techniques for approximately training SVMs in one scan of the database.
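One way such a one-scan approximation can work is sketched below (an assumption-laden toy version in Python, not the algorithm actually developed in this project): data is read chunk by chunk, a linear SVM is re-fit by hinge-loss subgradient descent on each chunk plus the support-vector candidates retained so far, and points far outside the margin are discarded, so memory stays bounded by the candidate set rather than the full data set.

```python
def fit_linear_svm(points, lam=0.01, epochs=200):
    """Subgradient descent on the hinge loss for a linear SVM.

    points: list of (x, y) with y in {-1, +1}. Returns weight vector w.
    """
    w = [0.0] * len(points[0][0])
    for t in range(1, epochs + 1):
        lr = 1.0 / (lam * t)  # decaying learning rate
        for x, y in points:
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            if margin < 1:  # inside margin: regularize and step toward y*x
                w = [(1 - lr * lam) * wi + lr * y * xi
                     for wi, xi in zip(w, x)]
            else:           # outside margin: regularization shrink only
                w = [(1 - lr * lam) * wi for wi in w]
    return w

def one_scan_svm(chunks):
    """Approximate SVM training in a single pass over disk-resident chunks."""
    candidates, w = [], None
    for chunk in chunks:  # each chunk is read from disk exactly once
        working = candidates + chunk
        w = fit_linear_svm(working)
        # keep only margin-area points as support-vector candidates
        candidates = [(x, y) for x, y in working
                      if y * sum(wi * xi for wi, xi in zip(w, x)) <= 1.0]
    return w
```

Discarding confidently classified points is what bounds memory; the quality of the approximation depends on how well the retained margin-area points summarize the decision boundary.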
Single-Class Classification (SCC) seeks to distinguish one class of data from the universal set of multiple classes. We call the target class positive and the complement set of samples negative. In SCC problems, it is assumed that a reasonable sample of the negative data is not available. Since it is not natural to collect "non-interesting" objects (i.e., negative data) to train the concept of "interesting" objects (i.e., positive data), SCC problems are prevalent in the real world, where positive and unlabeled data are widely available but negative data are hard or expensive to acquire. We developed SCC algorithms that compute the boundary functions of the target class from positive and unlabeled data, without labeled negative data. The basic idea is to exploit the natural "gap" between positive and negative data by incrementally labeling negative data from the unlabeled data using the margin maximization property of SVMs. Our SCC algorithms build classification functions very close to those of an SVM trained on fully labeled data, as long as the positive data is not severely under-sampled.
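The incremental negative-labeling idea can be sketched as follows (a toy stand-in using a perceptron instead of an SVM; the function names and the distance-based seeding heuristic are illustrative assumptions, not the project's algorithm): seed a small negative set with the unlabeled points least similar to the positives, train a linear classifier, then repeatedly move unlabeled points that the current classifier rejects into the negative set until no more are added.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def train_linear(pos, neg, epochs=100, lr=0.1):
    """Perceptron on positive (+1) vs. currently labeled negative (-1)."""
    data = [(x, 1) for x in pos] + [(x, -1) for x in neg]
    w = [0.0] * len(pos[0])
    for _ in range(epochs):
        for x, y in data:
            if y * dot(w, x) <= 0:  # misclassified: nudge toward y*x
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
    return w

def scc_from_positive_unlabeled(pos, unlabeled):
    """Learn a boundary from positive and unlabeled data only."""
    # Seed: the unlabeled points farthest from the positive centroid
    # are taken as initial "strong negatives".
    centroid = [sum(c) / len(pos) for c in zip(*pos)]
    dist = lambda x: sum((a - b) ** 2 for a, b in zip(x, centroid))
    ranked = sorted(unlabeled, key=dist, reverse=True)
    k = max(1, len(ranked) // 4)
    neg, rest = ranked[:k], ranked[k:]
    # Iterate: pull classifier-rejected unlabeled points into neg,
    # retrain, and stop when the negative set no longer grows.
    while True:
        w = train_linear(pos, neg)
        new_neg = [x for x in rest if dot(w, x) <= 0]
        if not new_neg:
            return w
        neg += new_neg
        rest = [x for x in rest if dot(w, x) > 0]
```

Each iteration tightens the boundary toward the positive class; convergence is guaranteed because the finite negative set only grows.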
Data mining has many applications in the real world. Classification is an important sub-class of problems found in a wide variety of situations. Fraud detection is one of the biggest and most important classification problems. Take the case of identifying fraudulent credit card transactions: banks collect transactional information for credit card customers, and due to the growing threat of identity theft, credit card loss, etc., identifying fraudulent transactions can lead to annual savings of billions of dollars. Deciding whether a particular transaction is legitimate or fraudulent is a classification problem. Another completely different, though equally important, problem is in healthcare (e.g., diagnosis of disease). Such problems abound. Currently, classifiers are run locally or over data collected at one central location (i.e., in a data warehouse). The accuracy of a classifier usually improves with more training data, and data collected from different sites is especially useful, since it provides a better estimate of the population than data collected at a single site. However, privacy and security concerns restrict the free sharing of data; there are both legal and commercial reasons not to share it. For example, HIPAA rules require that medical data not be released without appropriate anonymization, and European Community legal restrictions apply to the disclosure of any individual data. In commercial terms, data is often a valuable business asset; for example, complete manufacturing processes are trade secrets (although individual techniques may be commonly known). Thus, it is increasingly important to enable privacy-preserving distributed mining of information.
Support Vector Machine (SVM) classification is one of the most widely used classification methodologies in data mining and machine learning. SVMs have proven effective in many real-world applications. This project develops secure SVM classification solutions for distributed data.
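One standard building block for this kind of system is a secure-sum protocol, sketched below (a hedged illustration of the general technique, not necessarily this project's protocol): each site holds a private local value, such as its contribution to a global gradient or kernel sum, adds a random mask before passing a running total around the ring, and later removes its own mask, so no site ever sees another site's raw value while the final total is exact.

```python
import random

def secure_sum(local_values, modulus=2 ** 61 - 1):
    """Sum private per-site values without revealing any single value.

    Simulates the ring protocol in one process: the masked running
    total stands in for the message each site passes to the next.
    """
    # Round 1: each site adds its private value plus a fresh random mask,
    # so every intermediate total a site sees is uniformly random.
    masks = [random.randrange(modulus) for _ in local_values]
    running = 0
    for value, mask in zip(local_values, masks):
        running = (running + value + mask) % modulus
    # Round 2: each site subtracts its own mask; the masks cancel and
    # only the true total remains.
    for mask in masks:
        running = (running - mask) % modulus
    return running
```

In a secure SVM setting, such a primitive lets distributed sites jointly compute aggregate quantities needed for training while each site's raw records stay local (assuming values are nonnegative and smaller than the modulus).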
These solutions are based on our SVM Java implementation: