Jianfei Wu's Homepage


 
 

Computer Science Department
North Dakota State University
Tel. +1(701)540-2007
jianfei.wu@ndsu.edu


      New: I am now at Microsoft.
      I received a Ph.D. in Computer Science from North Dakota State University (NDSU) in 2011, advised by Dr. Anne Denton. My research interests include data mining, machine learning, pattern recognition, parallel computing, parallel and distributed simulation, information retrieval, and image processing. I am currently very interested in mining stock market data.
                                                                                                

Research

2011

(1) TAE Record & Offline Simulation
This is a recorder and offline simulator for test automation runs. It records a test automation run at the application, source code, and database levels, enabling software engineers to analyze the run efficiently.



2010

(1) Data Mining General Framework (DMGF)
I am building a general data mining system that will include classification, regression, clustering, frequent pattern mining, association rule mining, and so on. The goal of the system is to make data mining tasks easier: after users specify a data mining task and structure the data set, DMGF does the rest of the work. The system also uses parallel computing techniques, which greatly accelerate the data mining process. Currently, classification, regression, and clustering have been integrated into DMGF. The system has been tested in the "2010 UC San Diego Data Mining Contest", a classification task, and the "ICMLA 2010 Challenges", a clustering task. It achieved very good results in both.

(2) 2010 UC San Diego Data Mining Contest
The goal is to identify potential new customers for an online retailer from a population of consumers. The contest attracted 141 teams from all over the world. DMGF placed 3rd on the first data set and 6th on the second. In this contest, DMGF used 4 ordinary desktop computers, each of which built models for the data sets separately. One computer was responsible for communicating with the other 3 and for ensembling the results from all computers. I developed a small application that lets this computer submit the ensemble results to the contest website automatically, so the work was done by DMGF with little or no human intervention. My next step in optimizing classification in DMGF is to do more research on how to avoid (or, more precisely, mitigate) overfitting the ensembles to the validation sets.
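
The ensembling step can be illustrated with a minimal sketch. It is not DMGF's actual code: it assumes each machine returns one score per customer and that the coordinating machine combines them by simple averaging. The function name `average_ensemble` and the sample scores are hypothetical.

```python
from statistics import mean


def average_ensemble(worker_scores):
    """Average per-example scores from several workers.

    Simple averaging is an assumed combination rule; the source does not
    say how DMGF actually merges the models' outputs.
    """
    if not worker_scores:
        raise ValueError("need scores from at least one worker")
    n_examples = len(worker_scores[0])
    if any(len(scores) != n_examples for scores in worker_scores):
        raise ValueError("all workers must score the same examples")
    # zip(*...) groups the scores that different workers gave to one example
    return [mean(column) for column in zip(*worker_scores)]


# Hypothetical scores from 4 machines for 3 customers
scores = [
    [0.9, 0.2, 0.4],
    [0.7, 0.4, 0.6],
    [0.8, 0.0, 0.4],
    [0.8, 0.2, 0.6],
]
ensemble = average_ensemble(scores)
```

Averaging tends to reduce the variance of individual models, but it does not by itself address the overfitting to validation sets mentioned above.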

(3) ICMLA 2010 Challenges
This is a speaker clustering challenge with two tasks: one is to cluster speech segments coming from two speakers, and the other is to cluster speech segments coming from an unknown number of speakers. DMGF was used in the first task.
Algorithm Details:
DMGF first constructs affinity matrices between each pair of speech segments, using a modified version of the Point Distribution algorithm, which was initially developed for mining Vector-Item patterns. The subsequent clustering procedure is based on fitting a Gaussian mixture model on multiple random projection matrices. The final class label of each unit is decided by a vote over the results from these random projection matrices. (The paper is to appear in ICMLA 2010.)
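
The project-cluster-vote idea can be sketched as follows for the two-speaker case. This is a simplified illustration under several assumptions, not the published method: it skips the affinity-matrix step and takes plain feature vectors as input, projects onto a single random direction each time, fits a 2-component 1-D Gaussian mixture by EM, and resolves label switching by forcing the first point to label 0. All function names are hypothetical.

```python
import math
import random


def fit_gmm_1d(xs, iters=50):
    """Fit a 2-component 1-D Gaussian mixture with EM; return hard labels."""
    lo, hi = min(xs), max(xs)
    mu = [lo, hi]                                   # start means at the extremes
    var = [max((hi - lo) ** 2 / 4, 1e-6)] * 2
    weight = [0.5, 0.5]
    resp = [[0.5, 0.5] for _ in xs]
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        for i, x in enumerate(xs):
            p = [weight[k] * math.exp(-(x - mu[k]) ** 2 / (2 * var[k]))
                 / math.sqrt(2 * math.pi * var[k]) for k in range(2)]
            total = sum(p) or 1e-300
            resp[i] = [pk / total for pk in p]
        # M-step: re-estimate mean, variance, and weight of each component
        for k in range(2):
            nk = sum(r[k] for r in resp) or 1e-12
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2
                             for r, x in zip(resp, xs)) / nk, 1e-6)
            weight[k] = nk / len(xs)
    return [0 if r[0] >= r[1] else 1 for r in resp]


def cluster_by_projection_vote(points, n_projections=21, seed=0):
    """Cluster into two groups by majority vote over random 1-D projections."""
    rng = random.Random(seed)
    dim = len(points[0])
    votes = [0] * len(points)
    for _ in range(n_projections):
        direction = [rng.gauss(0, 1) for _ in range(dim)]
        projected = [sum(d * v for d, v in zip(direction, p)) for p in points]
        labels = fit_gmm_1d(projected)
        # Assumed alignment rule: relabel so that the first point is always 0,
        # otherwise votes from different projections are not comparable.
        if labels[0] == 1:
            labels = [1 - label for label in labels]
        for i, label in enumerate(labels):
            votes[i] += label
    # An odd number of projections avoids ties
    return [1 if 2 * v > n_projections else 0 for v in votes]


# Hypothetical data: two well-separated groups of 3-D points
data_rng = random.Random(1)
blob_a = [[data_rng.gauss(0, 0.3) for _ in range(3)] for _ in range(10)]
blob_b = [[data_rng.gauss(10, 0.3) for _ in range(3)] for _ in range(10)]
labels = cluster_by_projection_vote(blob_a + blob_b)
```

A single projection can separate the groups poorly when its direction happens to be nearly orthogonal to the gap between them; voting across many projections is what makes the final labels stable.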


(4) Active Learning Challenge

Initially I took on this challenge as a course project for my 'Bioinformatics Data Mining' class. I ranked 2nd on the F data set and 3rd on the C data set.

2009

(1) 2009 INFORMS Data Mining Contest

I ranked 8th in task 1 and 11th in task 2. My AUC values were 0.8883 and 0.9100 for task 1 and task 2, respectively.




(2) Identifying Important Regions in the Design Space of Combinatorial Experiments

The algorithm identifies designs that show interesting outcomes for multiple responses. The importance of a design region is measured through a linear model, which incorporates the effect size and confidence intervals of the region. The algorithm also detects patterns in subsets of responses. It provides an efficient and effective way for domain experts to analyze the relationship between factors and multiple responses simultaneously.



(3) KDD Cup 2009
I spent quite limited time on this data mining contest. Luckily, I finished in a good position: 27th in the slow challenge and 30th in the fast challenge. 453 teams, including several top companies and research institutes such as IBM Research, took part in this year's KDD Cup. A description of my algorithm is available here.



(4) Mining for Core Patterns in Stock Market Data
The core patterns within a sector are representative groups of stocks for the sector when it shows coherent behavior. In comparison with clustering algorithms, the core patterns are shown to be more stable as the stock price evolves. The proposed algorithm has only one free parameter, for which an empirical choice is provided based on mathematical derivation. (The paper was accepted by the ICDM-09 Workshop on Mining Multiple Information Sources.)

2008

(1)  Contact Angle Analysis

This project aims to provide a handy tool for coating scientists to analyze contact angle and surface energy, as well as hysteresis.


(2) ICDM 08 Contest
I took part in the ICDM 08 data mining contest and ranked 3rd on the unlabeled data set. I gained a lot of data mining experience from this contest. My code and paper are available here. The algorithm uses a genetic algorithm (GA) to tune the parameters of a libSVM classifier.
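
The idea of tuning classifier parameters with a GA can be sketched as follows. This is a minimal illustration, not the contest code. It assumes the genes are log2(C) and log2(gamma) (the source does not say which parameters were tuned), and it uses a toy fitness function in place of the cross-validated accuracy that a real run would obtain by training the SVM. The names `tune_with_ga` and `toy_fitness` are hypothetical.

```python
import random


def tune_with_ga(fitness, bounds, pop_size=20, generations=40,
                 mutation_sd=0.5, seed=0):
    """Maximise `fitness` over real-valued genes with a simple GA."""
    rng = random.Random(seed)

    def clip(gene, bound):
        return min(max(gene, bound[0]), bound[1])

    # Random initial population inside the search bounds
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]      # truncation selection
        children = [ranked[0]]                 # elitism: keep the best as-is
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Uniform crossover followed by Gaussian mutation
            child = [rng.choice(pair) for pair in zip(a, b)]
            child = [clip(g + rng.gauss(0, mutation_sd), bd)
                     for g, bd in zip(child, bounds)]
            children.append(child)
        pop = children
    return max(pop, key=fitness)


def toy_fitness(genes):
    """Stand-in for cross-validated accuracy; peaks at log2(C)=3, log2(gamma)=-5."""
    log2_c, log2_gamma = genes
    return -((log2_c - 3) ** 2 + (log2_gamma + 5) ** 2)


best = tune_with_ga(toy_fitness, [(-5.0, 15.0), (-15.0, 3.0)])
best_c, best_gamma = 2 ** best[0], 2 ** best[1]
```

Searching in log2 space is a common choice for SVM parameters because useful values of C and gamma span several orders of magnitude.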

(3) Soccer Robot Simulation

This was a class project for my "Parallel and Distributed Simulation" course. The simulation provides a platform on which various soccer robot strategy algorithms can compete. The software runs in the libsynk environment. The source code and class paper are available here.



(4) Relating Gene Expression Data on Two-Component Systems to Functional Annotations in Escherichia coli
We have developed an algorithm that relates patterns of gene expression in a set of microarray experiments to functional groups in one step. The effectiveness of the algorithm is demonstrated as part of a study of regulation by two-component systems in Escherichia coli. The significance of the relationships between expression data and functional annotations is evaluated based on density histograms that are constructed using product similarity among expression vectors. We present a biological analysis of three of the resulting functional groups of proteins and develop hypotheses for further biological studies.  http://www.biomedcentral.com/1471-2105/9/294


2007

(1) Mining Vector-Item Patterns for Annotating Protein Domains
An algorithm is introduced for finding patterns involving item and vector data. http://www.cse.fau.edu/~xqzhu/mmis/MMIS07_Proceedings.pdf




(2) Fingerprint Identification algorithm


This was a side project that I worked on a long time ago. The algorithm provides two modes: 1:1 and 1:N. It can be connected directly to a URU2000 scanner. Download the application from here.


                                                                                                                                                                                                               