Summary of Haiyan Wang's research



Summary of Haiyan Wang's Peer Reviewed Journal Publications on Methodology



  • (1) Inference based on original observations for hypothesis testing. Wang and Akritas [2011] address the asymptotic theory for hypothesis testing in high dimensional analysis of variance when the distributions are completely unspecified. Wang and Akritas [2010a] and Wang, Higgins and Blasi [2010] provide inference for testing several effects in nested heteroscedastic functional data that include a large number of repeated measurements observed within a subject or stratum. We build our theory on novel models in which the random effects are not assumed to be uncorrelated or normal. The models leave the covariance structure unspecified and apply to both discrete and continuous data. The asymptotic theory of the test statistics is driven by a large number of factor levels or a large number of measurements per subject and the assumption of nonstationary α-mixing on the error term. Both weak and long range dependence are considered. Wang, Tolos and Wang [2010] present a test of independence between the response variable, which can be discrete or continuous, and a continuous covariate after adjusting for heteroscedastic treatment effects. This work was extended to the theory of lack-of-fit in heteroscedastic constant and nonlinear regression by one of my current Ph.D. students. The results led to two manuscripts under review for publication [Gharaibeh, Sahtout and Wang 2014; Gharaibeh and Wang 2014]. These results were also a first step toward current research on nonlinear variable selection in additive models for high dimensional data, which was studied in detail in the dissertation of one of my Ph.D. students. In Bathke et al. [2010], we derive asymptotic procedures as well as finite approximations for the analysis of data arising from series of randomized complete block designs with a large number of factor levels. The publication by Zhang et al. [2011], which resulted from the dissertation work of my former student Ke Zhang, provides a robust nonparametric approach to compare the expressions of longitudinally measured sets of genes under multiple treatments or experimental conditions.


  • (2) Rank based inference for high dimensional data. Such inference tests hypotheses specified in terms of distribution functions. Wang and Akritas [2004, 2010b] provide rank tests for the nonparametric main factor effects and interactions in two-way and multi-way high-dimensional analysis of variance when the cell distributions are completely unspecified, the sample size may be small and the number of factor levels may be large. Wang and Akritas [2009] consider rank based inferences for testing hypotheses in a fully nonparametric marginal model for heteroscedastic functional data. The asymptotic distribution of the rank statistics is obtained by showing their asymptotic equivalence to corresponding expressions based on the asymptotic rank transform. Compared with test procedures based on the original observations, the proposed rank procedures are free of moment conditions, converge to their limiting distribution faster, and have better power when the underlying distributions are heavy tailed or skewed.
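
As a rough illustration of why rank statistics require no moment conditions, the sketch below computes mid-ranks and the classical Kruskal-Wallis statistic for a one-way layout. It is a textbook rank test, not the procedures of Wang and Akritas; the function names `midranks` and `kruskal_wallis_h` and the toy data are assumptions made for this example.

```python
from itertools import chain

def midranks(values):
    """Return 1-based mid-ranks; tied observations share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of observations tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def kruskal_wallis_h(groups):
    """Classical Kruskal-Wallis H statistic (without tie correction)."""
    pooled = list(chain.from_iterable(groups))
    n = len(pooled)
    ranks = midranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)

# Hypothetical data: three factor levels, three observations each.
groups = [[1.2, 0.8, 1.9], [2.5, 3.1, 2.2], [4.0, 3.6, 5.3]]
# Same data with one extreme value, mimicking a heavy-tailed observation.
heavy = [[1.2, 0.8, 1.9], [2.5, 3.1, 2.2], [4.0, 3.6, 5300.0]]

h_plain = kruskal_wallis_h(groups)
h_heavy = kruskal_wallis_h(heavy)
print(h_plain, h_heavy)
```

Replacing 5.3 by 5300.0 leaves every rank unchanged, so the two statistics coincide, whereas a statistic built on means and variances of the original observations would change substantially.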


  • (3) Rank-test-based clustering for high dimensional data. The idea of using rank tests in clustering for high dimensional data was developed in Wang et al. [2008] for agglomerative clustering of functional data and in von Borries and Wang [2009] for partition clustering of independent low sample size data. We define clusters through the unknown high dimensional multivariate distributions and provide test-based clustering algorithms that are invariant to monotone transformations of the data. These test-based clustering methods can use the information from thousands of variables to effectively detect unknown patterns or clusters in high dimensional independent or functional data.
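
The invariance property can be illustrated with a small sketch. It is not the test-based algorithm of the cited papers: it replaces the rank test with a threshold on a rank-based effect size, and the names `relative_effect`, `dissimilarity`, `group_items`, the threshold, and the data are all assumptions made for this example.

```python
import math

def relative_effect(x, y):
    """Estimate P(X < Y) + 0.5 * P(X = Y) by comparing all pairs."""
    wins = sum((a < b) + 0.5 * (a == b) for a in x for b in y)
    return wins / (len(x) * len(y))

def dissimilarity(x, y):
    """0 when neither sample tends to be larger, 0.5 when fully separated."""
    return abs(relative_effect(x, y) - 0.5)

def group_items(items, threshold):
    """Single-linkage grouping: link items whose dissimilarity is below threshold."""
    labels = list(range(len(items)))
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if dissimilarity(items[i], items[j]) < threshold:
                old, new = labels[j], labels[i]
                labels = [new if lab == old else lab for lab in labels]
    return labels

# Hypothetical data: four items, each observed four times.
items = [
    [0.10, 0.40, 0.30, 0.20],
    [0.15, 0.35, 0.25, 0.45],
    [2.00, 2.60, 2.20, 2.40],
    [2.10, 2.50, 2.30, 2.70],
]
# A strictly increasing transformation of every observation.
transformed = [[math.exp(v) for v in item] for item in items]

labels_raw = group_items(items, threshold=0.25)
labels_exp = group_items(transformed, threshold=0.25)
print(labels_raw, labels_exp)
```

Because the dissimilarity depends only on the ordering of the observations, the grouping of the exponentiated data matches the grouping of the raw data.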


  • (4) Digital image quality assessment and image pixel classification and segmentation. Currently popular image similarity measures (mean squared error, signal to noise ratio, the structural similarity measure and its variants) do not take into account possible nonlinear dependence between the source image and the image being compared. In Wang, Maldonado and Silwal [2011] and Silwal, Wang and Maldonado [2013], we propose nonparametric rank-test-based similarity measures in the frequency and wavelet domains, respectively. Applications of the methods to a variety of altered images showed superior performance. Ghimire and Wang [2011] introduce, implement and assess the idea of using combined evidence from multiple hypothesis testing and minimum distance to carry out image pixel classification and image segmentation. Extensive experiments show that our test-based segmentation has excellent edge detection and texture preservation properties for both gray scale and color images.
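
To make the point about nonlinear dependence concrete, the sketch below compares a linear measure (Pearson correlation) with a rank-based one (Spearman correlation) on a gamma-altered copy of a toy intensity ramp. It works on raw pixel values rather than in the frequency or wavelet domain, so it is not the measure proposed in the cited papers; the helper names, the gamma value of 2.2, and the ramp are assumptions made for this example.

```python
def midranks(values):
    """Return 1-based mid-ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation: measures linear dependence only."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the mid-ranks."""
    return pearson(midranks(x), midranks(y))

# Hypothetical flattened gray scale "image": an intensity ramp from 0 to 255.
source = list(range(0, 256, 17))
# Gamma alteration: a nonlinear but strictly increasing change of intensity.
altered = [255 * (p / 255) ** 2.2 for p in source]

print("Pearson :", pearson(source, altered))
print("Spearman:", spearman(source, altered))
```

The rank-based measure treats the altered image as perfectly dependent on the source, while the linear measure reports a weaker relationship.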


  • (5) Variable selection. Part of my research concerns methods for variable selection and modeling in high dimensional data, which are now commonly seen in biochemistry, proteomics, and genomics studies. Popular examples are cancer classification through molecular information from genomics data, prediction of gene annotation splice sites based on genomic sequence data, and quantitative structure-activity relationship modeling for drug design. Unlike most of the literature, which focuses on screening individual variables or pairs of variables without considering possible interactions among variables, in Zhang et al. [2012] we introduce a new computational method for classification of cancer tissue samples based on gene expression data. The method takes potential variable interactions into account. During the variable selection process, the set of variables to be kept in the model is recursively refined and repeatedly updated according to the effect of a given variable on the contributions of other variables in reference to their usefulness in cancer classification. The variables selected from each data set lead to significantly improved leave-one-out classification accuracy across 10 data sets for multiple classifiers. In Qian et al. [2012], Zhou et al. [2012] and Li et al. [2012], we extract and summarize data-specific genomic sequence information to form new variables and achieve high accuracy in classification with the support vector machine. Wang et al. [2013] provide a Chi-square-statistic-based Top Scoring Genes (TSG) classifier that performs informative gene selection in both binary and multi-class cancer classification. It overcomes the limitation of top scoring pairs (TSP) family classifiers, which select only gene pairs. Extensive comparison and validation on 9 binary and 10 multi-class gene expression datasets involving human cancers show that our method has clear advantages: it outperforms TSP family classifiers by a large margin in most of the 19 datasets. In addition to improved accuracy, our classifier shares all the advantages of the TSP family classifiers: it is easy to interpret, is invariant to monotone transformations, often selects a small number of informative genes that allow follow-up studies, and is resistant to sampling variations owing to its within-sample operations. In Xie et al. [2013] and Dai et al. [2014], we provide pipelines for prediction of multidimensional time series and quantitative structure-activity relationship analysis of peptides. The work introduces a high dimensional semivariogram for near-neighbor sample selection in the corresponding data settings, BMSF for feature selection, and weighted SVR for validation and prediction. Comparisons with published results suggest that our methods have much improved prediction accuracy.
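
For readers unfamiliar with the TSP family mentioned above, the following is a minimal sketch of the classical top scoring pair rule for two classes. It is not the Chi-square-based TSG classifier of Wang et al. [2013]; the function names `tsp_fit` and `tsp_predict` and the three-gene data set are assumptions made for this example.

```python
import math
from itertools import combinations

def tsp_fit(samples, labels):
    """Find the gene pair (i, j) whose ordering differs most between classes 0 and 1."""
    n_genes = len(samples[0])
    best = None
    for i, j in combinations(range(n_genes), 2):
        freq = []
        for c in (0, 1):
            rows = [s for s, lab in zip(samples, labels) if lab == c]
            # Within-sample comparison: how often is gene i below gene j in class c?
            freq.append(sum(r[i] < r[j] for r in rows) / len(rows))
        score = abs(freq[0] - freq[1])
        if best is None or score > best[0]:
            # The flag records whether "gene i < gene j" is more typical of class 0.
            best = (score, i, j, freq[0] > freq[1])
    return best

def tsp_predict(model, sample):
    """Classify one sample by the ordering of the selected gene pair."""
    _, i, j, less_means_class0 = model
    return 0 if (sample[i] < sample[j]) == less_means_class0 else 1

# Hypothetical expression values for three genes in six samples.
samples = [
    [1.0, 5.0, 3.0], [2.0, 6.0, 1.5], [0.5, 4.0, 3.5],   # class 0
    [5.0, 1.0, 3.2], [6.5, 3.0, 2.8], [4.0, 0.5, 3.1],   # class 1
]
labels = [0, 0, 0, 1, 1, 1]

model = tsp_fit(samples, labels)
new_sample = [10.0, 50.0, 30.0]
pred_raw = tsp_predict(model, new_sample)
# A monotone transformation (here the logarithm) leaves the prediction unchanged.
pred_log = tsp_predict(model, [math.log(v) for v in new_sample])
print(model, pred_raw, pred_log)
```

Because the rule only compares two expression values within the same sample, the prediction is unaffected by monotone transformations such as taking logarithms, which is the invariance property referred to above.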