posted Oct 11, 2011 9:58 PM by Hwanjo Yu
This work is supported by MSRA (Microsoft Research Asia).
Feature weighting has not been studied as extensively for ranking as it has for classification. This project develops feature weighting methods for ranking by adapting existing methods for classification. The methods are applied to Live Search query log data to identify the key features that determine users’ click-through behavior. They will also be used to build the feature selection component of RefMed, a relevance feedback search engine for PubMed.
- "Efficient Feature Weighting Method for Ranking" by H Yu, J Oh, WS Han, ACM CIKM 2009 (14.5% accepted)
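One common way to reuse a classification method for ranking is the pairwise transform: each pair of items with different relevance becomes a difference vector labeled by which item should rank higher, and the weights of a linear classifier trained on those differences can be read as feature weights. The sketch below illustrates that idea with a plain perceptron. It is an assumption made for illustration, not the method of the CIKM 2009 paper; the function names and the toy data are hypothetical.

```python
from itertools import combinations

def pairwise_differences(items):
    """items: list of (feature_vector, relevance). Returns (difference, label) pairs."""
    pairs = []
    for (xa, ra), (xb, rb) in combinations(items, 2):
        if ra == rb:
            continue  # ties carry no ordering information
        diff = [a - b for a, b in zip(xa, xb)]
        pairs.append((diff, 1 if ra > rb else -1))
    return pairs

def perceptron_weights(pairs, n_features, epochs=50, lr=0.1):
    """Train a linear classifier on the difference vectors (hypothetical stand-in)."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for diff, label in pairs:
            score = sum(wi * di for wi, di in zip(w, diff))
            if label * score <= 0:  # misordered pair: nudge the weights
                w = [wi + lr * label * di for wi, di in zip(w, diff)]
    return w

# Toy data (made up): feature 0 tracks relevance, feature 1 is mostly noise.
docs = [([3.0, 0.5], 3), ([2.0, 0.4], 2), ([1.0, 0.6], 1), ([0.0, 0.5], 0)]
weights = perceptron_weights(pairwise_differences(docs), n_features=2)

# Rank features by the magnitude of their learned weight.
ranked_features = sorted(range(2), key=lambda i: abs(weights[i]), reverse=True)
```

Reading feature importance from weight magnitude only makes sense when the features are on comparable scales, so in practice the inputs would be normalized first.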
posted Aug 8, 2011 4:26 AM by Hwanjo Yu [updated Aug 8, 2011 4:33 AM]
2008.07.01 - 2011.06.30 (3 years). This work is supported by the Brain Korea 21 Project and the Korea Research Foundation Grant funded by the Korean Government (KRF-2008-331-D00528).
Most online data retrieval systems are built on relational database management systems (RDBMS), which support fast processing of Boolean queries but offer little support for relevance or preference ranking. Unified support for Boolean and ranking constraints in a single query is essential for user-friendly data retrieval. This project develops foundational techniques for data retrieval systems in which users express ranking constraints intuitively and the system processes the queries efficiently.
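To make the idea concrete, a query that mixes both kinds of constraint can be pictured as a Boolean filter followed by a top-k ranking over a user-supplied scoring function. The sketch below is a naive in-memory illustration only: the `top_k_query` helper, the hotel records, and the scoring function are all hypothetical, and it does not show how the project processes such queries efficiently inside a database.

```python
import heapq

def top_k_query(rows, predicate, score, k):
    """Return the k highest-scoring rows that satisfy the Boolean predicate."""
    candidates = (row for row in rows if predicate(row))
    return heapq.nlargest(k, candidates, key=score)

# Made-up records for illustration.
hotels = [
    {"name": "A", "city": "Seoul", "price": 120, "rating": 4.5},
    {"name": "B", "city": "Seoul", "price": 80, "rating": 3.9},
    {"name": "C", "city": "Busan", "price": 60, "rating": 4.8},
    {"name": "D", "city": "Seoul", "price": 200, "rating": 4.9},
]

result = top_k_query(
    hotels,
    # Boolean constraint: hard filter, a row either qualifies or it does not.
    predicate=lambda h: h["city"] == "Seoul" and h["price"] <= 150,
    # Ranking constraint: soft preference trading rating against price.
    score=lambda h: h["rating"] - 0.01 * h["price"],
    k=2,
)
```

This version scores every qualifying row, which is exactly the cost a real system would try to avoid by using indexes or early termination.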
posted Sep 4, 2010 5:51 PM by Hwanjo Yu [updated Aug 8, 2011 4:34 AM]
With Prof. Jaewook Lee (PI), 2008.07.01 - 2010.06.30 (2 years). This work was supported by the Brain Korea 21 Project and the Korea Research Foundation Grant funded by the Korean Government (KRF-2008-314-D00483).
posted Sep 11, 2009 9:43 PM by Taehoon Kim [updated Sep 4, 2010 6:29 PM by Hwanjo Yu]
Support vector machines (SVMs) are promising methods for classification, regression, and ranking analysis because of their solid mathematical foundations, which provide two desirable properties: margin maximization and nonlinear classification using kernels. Despite these properties, SVMs are usually not chosen for large-scale data mining problems because their training complexity depends heavily on the data set size. Unlike traditional pattern recognition and machine learning, real-world data mining applications often involve a huge number of data records that do not fit in main memory, and multiple scans of the data set are often too expensive. Through this project, we developed techniques for approximately training SVMs in one scan of the database.
Representative Publications
- H Yu, J Yang, J Han & X Li, "Making SVMs Scalable to Large Data Sets using Hierarchical Cluster Indexing", Data Mining and Knowledge Discovery, Springer, 11(3): 295-321, 2005. (DAMI'05) (SCI)
- H Yu, J Yang & J Han, "Classifying Large Data Sets Using SVM with Hierarchical Clusters", Proc. of ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 2003. (KDD'03 full paper, 13% accepted, received student scholarship award)
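One way to picture training from a single scan is to fold each record into a compact per-cluster summary (a count and a per-dimension sum) as it streams by, and then hand only the cluster centroids to the SVM trainer. The sketch below shows just that summarization step for a flat set of clusters with a fixed distance threshold. The class name, function name, and threshold are assumptions for illustration, and the hierarchical indexing described in the papers above is not reproduced here.

```python
import math

class ClusterSummary:
    """Additive summary of a cluster: point count and per-dimension sum."""

    def __init__(self, point):
        self.n = 1
        self.linear_sum = list(point)

    def centroid(self):
        return [s / self.n for s in self.linear_sum]

    def add(self, point):
        self.n += 1
        self.linear_sum = [s + v for s, v in zip(self.linear_sum, point)]

def summarize_in_one_scan(stream, threshold):
    """Fold each point into the nearest summary within `threshold`, else open a new one."""
    summaries = []
    for point in stream:  # each record is read exactly once
        best, best_dist = None, threshold
        for s in summaries:
            d = math.dist(point, s.centroid())
            if d <= best_dist:
                best, best_dist = s, d
        if best is None:
            summaries.append(ClusterSummary(point))
        else:
            best.add(point)
    return summaries

# Made-up records forming two well-separated groups.
records = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (0.1, 0.3)]
summaries = summarize_in_one_scan(iter(records), threshold=1.0)
centroids = [s.centroid() for s in summaries]
```

In a classification setting one would keep separate summaries per class so that each centroid carries a label, and the memory footprint depends on the number of summaries rather than the number of records.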
posted Sep 11, 2009 9:37 PM by Taehoon Kim [updated Sep 4, 2010 6:18 PM by Hwanjo Yu]
Single-Class Classification (SCC) seeks to distinguish one class of data from a universal set of multiple classes. We call the target class positive and the complement set of samples negative. SCC problems assume that a reasonable sample of the negative data is not available. Because it is unnatural to collect "non-interesting" objects (i.e., negative data) in order to learn the concept of "interesting" objects (i.e., positive data), SCC problems are prevalent in the real world, where positive and unlabeled data are widely available but negative data are hard or expensive to acquire. We developed SCC algorithms that compute the boundary functions of the target class from positive and unlabeled data, without labeled negative data. The basic idea is to exploit the natural "gap" between positive and negative data by incrementally labeling negative data from the unlabeled data using the margin maximization property of SVMs. Our SCC algorithms build classification functions very close to those of an SVM trained on fully labeled data when the positive data are not heavily under-sampled.
Representative Publications
- H. Yu, "Single-Class Classification with Mapping Convergence", Machine Learning, Springer, 61:49-69, 2005. (ML'05)
- H. Yu, J. Han & K. C.-C. Chang, "PEBL: Web Page Classification without Negative Examples", IEEE Transactions on Knowledge and Data Engineering, Special Issue on Mining and Searching the Web, IEEE Computer Society, 16(1): 70-81, 2004. (TKDE'04 Special Issue, 11% accepted)
- H. Yu, "SVMC: Single-Class Classification With Support Vector Machines", Proc. of Int. Joint Conf. on Artificial Intelligence, 2003. (IJCAI'03 full paper, 20% accepted, received student scholarship award)
- H. Yu, J. Han & K. C.-C. Chang, "PEBL: Positive Example Based Learning for Web Page Classification Using SVM", Proc. of ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 2002. (KDD'02 full paper, 14% accepted, received student scholarship award)
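The incremental labeling idea can be illustrated on one-dimensional toy data, where the maximum-margin boundary between two separable sets is simply the midpoint between their closest points. The sketch below starts from a single "strong negative" (the unlabeled point farthest below the positives), repeatedly recomputes that boundary, and moves unlabeled points that fall on the negative side into the negative set until nothing changes. It is a simplified stand-in under these assumptions, not the published algorithms listed above; all names and data are hypothetical, and positives are assumed to lie to the right of negatives.

```python
def max_margin_threshold(negatives, positives):
    """1-D maximum-margin boundary for separable data: midpoint of the gap."""
    return (max(negatives) + min(positives)) / 2.0

def converge_boundary(positives, unlabeled):
    """Grow the negative set from unlabeled data until the boundary stops moving."""
    remaining = sorted(unlabeled)
    # Initial strong negative: the unlabeled point farthest below the positives.
    negatives = [remaining.pop(0)]
    while True:
        threshold = max_margin_threshold(negatives, positives)
        newly_negative = [u for u in remaining if u < threshold]
        if not newly_negative:
            # No unlabeled point lies on the negative side: the boundary
            # has settled in the gap between the two classes.
            return threshold, negatives, remaining
        negatives.extend(newly_negative)
        remaining = [u for u in remaining if u >= threshold]

# Made-up data: the unlabeled set mixes hidden negatives and hidden positives.
positives = [8.0, 9.0, 10.0]
unlabeled = [0.0, 1.0, 2.0, 3.0, 8.5, 9.5]
threshold, negatives, remaining = converge_boundary(positives, unlabeled)
```

On this data the boundary moves from 4.0 to 5.5 and then stops, because no unlabeled point sits between the last negative and the first positive. If the unlabeled data had no such gap, the boundary would keep advancing toward the positives, which is why the gap matters.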
posted Sep 11, 2009 9:21 PM by Taehoon Kim [updated Sep 4, 2010 6:19 PM by Hwanjo Yu]
Data mining has many real-world applications, and classification is an important class of problems that arises in a wide variety of situations. Fraud detection is one of the largest and most important classification problems. Consider identifying fraudulent credit card transactions: banks collect transactional information on credit card customers, and given the growing threat of identity theft, credit card loss, and similar risks, identifying fraudulent transactions can save billions of dollars annually. Deciding whether a particular transaction is genuine or fraudulent is a classification problem. A completely different but equally important problem arises in healthcare (e.g., diagnosis of disease), and many such problems abound.
Currently, classifiers are run locally or over data collected at one central location (i.e., in a data warehouse). The accuracy of a classifier usually improves with more training data. Data collected from different sites is especially useful, since it provides a better estimate of the population than data collected at a single site. However, privacy and security concerns restrict the free sharing of data, and there are both legal and commercial reasons not to share it. For example, HIPAA laws require that medical data not be released without appropriate anonymization. Similar constraints arise in many applications; European Community legal restrictions apply to the disclosure of any individual data. In commercial terms, data is often a valuable business asset. For example, complete manufacturing processes are trade secrets (although individual techniques may be commonly known). Thus, it is increasingly important to enable privacy-preserving distributed mining of information.
Support Vector Machine (SVM) classification is one of the most widely used classification methods in data mining and machine learning, and SVMs have proven effective in many real-world applications.
This project develops secure SVM classification solutions for distributed data. These solutions are based on our SVM Java implementation:
Representative Publications
- J. Vaidya, H. Yu & X. Jiang, "Privacy-Preserving SVM Classification", Knowledge and Information Systems, Springer, 2008.
- H. Yu, J. Vaidya & X. Jiang, "Privacy-Preserving SVM Classification on Vertically Partitioned Data", Lecture Notes in Artificial Intelligence - Pacific-Asia Conf. on Knowledge Discovery and Data Mining, Springer, To appear. (LNAI - PAKDD'06, 20% accepted 2006)
- H. Yu, X. Jiang & J. Vaidya, "Privacy-Preserving SVM using Nonlinear Kernels on Horizontally Partitioned Data", Proc. of ACM SAC Data Mining Track, 2006. (ACM SAC'06 full paper, 30% accepted)
- H. Yu & J. Vaidya, "Secure Matrix Addition", The University of Iowa Technical Report: UIOWA CS TR04-04, November, 2004
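For vertically partitioned data and a linear kernel, the global Gram matrix is the sum of the Gram matrices that each party computes on its own feature columns, so the parties only need a way to add matrices without revealing their individual terms. The sketch below simulates one simple ring-based masking scheme in a single process: the first party starts with a random mask, each party adds its local matrix to the running total, and the first party removes the mask at the end. This is an illustrative assumption, not the protocol from the papers or the technical report listed above; it ignores collusion between parties and uses small non-negative integer features so that the modular arithmetic is exact.

```python
import secrets

MODULUS = 2 ** 61 - 1  # assumed to exceed any true entry of the summed matrix

def local_gram(rows):
    """Linear-kernel Gram matrix over one party's feature columns."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in rows] for r1 in rows]

def add_mod(m1, m2):
    """Entry-wise matrix addition modulo MODULUS."""
    return [[(a + b) % MODULUS for a, b in zip(r1, r2)] for r1, r2 in zip(m1, m2)]

def secure_matrix_sum(local_matrices):
    """Simulate ring-based masked addition of square matrices (hypothetical helper)."""
    n = len(local_matrices[0])
    # The initiating party draws a random mask known only to itself.
    mask = [[secrets.randbelow(MODULUS) for _ in range(n)] for _ in range(n)]
    running = mask
    for matrix in local_matrices:
        # Each party sees only a masked running total, never another party's matrix.
        running = add_mod(running, matrix)
    # The initiating party removes its mask, revealing only the overall sum.
    return [[(r - m) % MODULUS for r, m in zip(rr, mr)] for rr, mr in zip(running, mask)]

# Three records (made up); party A holds features 0-1, party B holds feature 2.
party_a = [[1, 2], [0, 1], [3, 1]]
party_b = [[4], [2], [0]]
global_gram = secure_matrix_sum([local_gram(party_a), local_gram(party_b)])
```

The result equals the Gram matrix of the full three-feature records, which is all a linear SVM trainer needs, yet neither party has handed over its raw columns.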
posted Sep 11, 2009 9:07 PM by Taehoon Kim [updated Sep 4, 2010 6:23 PM by Hwanjo Yu]
Despite the popularity of SVMs in the data mining and machine learning communities, applying them to real-world classification problems often runs into another obstacle: the difficulty of understanding and interpreting the results. For example, physicians may want to use SVM classification for early diagnosis of diabetic patients. However, if the classification model produces a diagnosis without explaining why or how, physicians may not appreciate or trust the result. As another example, pharmacologists may be given an SVM model that accurately distinguishes active drugs from non-active drugs for a symptom, but the model may not be useful to them if it does not explain which components of the drug play the key roles.
Through this project, we developed techniques that discover discriminative feature combinations using SVM models. Our methods effectively capture the feature combinations on a drug activity dataset. We also developed Localized Radial Basis Function (L-RBF) kernels to visualize discriminative features for nonlinear SVM models. Our system captures and visualizes important factors for a disease, which presents valuable information to physicians.
Representative Publications
- B. Cho, H. Yu, J. Lee, Y. Chee, I. Kim, & S. Kim, "Nonlinear Support Vector Machine Visualization for Risk Factor Analysis using Nomograms and Localized Radial Basis Function Kernels", IEEE Transactions on Information Technology in Biomedicine, IEEE Computer Society, 12(2):247-256, 2008.03.
- B. Cho, H. Yu, K. Kim, T. Kim, I. Kim, & S. Kim, "Application of irregular and unbalanced data to predict diabetic nephropathy using visualization and feature selection methods", Artificial Intelligence in Medicine, Elsevier, 962:1-17, 2008.01.
- H. Yu, J. Yang, W. Wang & J. Han, "Discovering Compact and Highly Discriminative Features or Feature Combinations of Drug Activities Using Support Vector Machines", Proc. of IEEE Computer Society Bioinformatics Conf., 2003. (CSB'03 full paper, 18.5% accepted)
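Per-feature visualization of a nonlinear SVM becomes possible when the kernel decomposes into a sum of one-dimensional terms, because the decision function then splits into one contribution per feature that can be plotted on its own axis. The sketch below assumes such an additive kernel, a sum of one-dimensional RBF terms. Whether this matches the exact L-RBF definition in the paper cited above is not confirmed here, and the support vectors, coefficients, bias, and gamma are made-up values.

```python
import math

def feature_contributions(x, support_vectors, coefs, gamma):
    """Split the decision value into one term per feature.

    Assumes f(x) = sum_k sum_j coef_j * exp(-gamma * (x_k - sv_jk)^2) + bias,
    i.e. a kernel that is a sum of one-dimensional RBF terms.
    """
    return [
        sum(c * math.exp(-gamma * (xk - sv[k]) ** 2)
            for sv, c in zip(support_vectors, coefs))
        for k, xk in enumerate(x)
    ]

def decision_value(x, support_vectors, coefs, bias, gamma):
    """Evaluate the same model the usual way, one support vector at a time."""
    total = 0.0
    for sv, c in zip(support_vectors, coefs):
        kernel = sum(math.exp(-gamma * (xk - sk) ** 2) for xk, sk in zip(x, sv))
        total += c * kernel
    return total + bias

# Hypothetical trained model: each coefficient stands for alpha_j * y_j.
support_vectors = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
coefs = [0.5, -0.8, 0.3]
bias, gamma = -0.1, 0.5

x = [1.5, 0.5]
contributions = feature_contributions(x, support_vectors, coefs, gamma)
score = decision_value(x, support_vectors, coefs, bias, gamma)
```

Because the contributions plus the bias add up to the decision value, each feature's effect can be read off independently, which is the property a nomogram-style display relies on. A standard RBF kernel does not decompose this way, since its exponent couples all features.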