Welcome to Ai He's homepage

email: aihe@usc.edu
I am a second-year Master's student at the University of Southern California, majoring in Computer Science.
I am also a student worker at the Information Sciences Institute, USC, working on machine learning and NLP research.


Seeking a position as a programmer or engineer, with a special interest in Natural Language Processing, Machine Learning, or Data Mining


Master of Science, Computer Science
University of Southern California, Los Angeles, CA
Aug 2012 - May 2014 (expected)

Bachelor of Science, Computer Science
Beihang University, Beijing, P.R.C
Sept 2008 - Jun 2012


Visiting Graduate Student, Division of Biomedical Informatics, Department of Medicine, University of California, San Diego
La Jolla, CA, July 2013 - present

Student Researcher, Information Sciences Institute, University of Southern California
Los Angeles, CA, Nov 2012 - Oct 2013

Web Developer, Department of Computer Science, Tsinghua University
Beijing, P.R.C, Jun 2011 - Jun 2012
  • Worked on the Chinese Review Miner System
  • Took charge of all data about digital products crawled from ZOL and Buy360 (Chinese Amazon)
  • Managed all data about restaurants and hotels crawled from DianPing (Chinese Yelp)
  • Constructed and maintained the databases, the indices, and the web search back-end


Distance metric learning from high-dimensional data (thousands of dimensions or more) with hundreds or thousands of classes is intractable, yet in NLP and IR high dimensionality is usually required to represent data points, for example when modeling semantic similarity. This paper presents algorithms to scale up the learning of a Mahalanobis distance metric from a large data graph in a high-dimensional space. Our novel contributions include a random projection that reduces dimensionality and a new objective function that regularizes intra-class and inter-class distances to handle a large number of classes. We show that the new objective function is convex and can be efficiently optimized by a stochastic-batch subgradient descent method. We applied our algorithm to two different domains: semantic similarity of documents collected from the Web, and phenotype descriptions in genomic data. Experiments show that our algorithm can handle high-dimensional big data and outperforms competing approximations in both domains.
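To make two of the ingredients named in the abstract concrete, here is a minimal sketch, not the paper's implementation: a Gaussian random projection that reduces dimensionality, followed by a squared Mahalanobis distance computed in the projected space. The learning step (the regularized objective and the stochastic-batch subgradient descent) is not shown. All function names, the dimensions, and the choice of an identity matrix for `L` are assumptions made only for illustration.

```python
import math
import random


def random_projection_matrix(d_in, d_out, seed=0):
    """Gaussian random matrix scaled by 1/sqrt(d_out) (hypothetical helper)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(d_out)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d_in)]
            for _ in range(d_out)]


def matvec(A, x):
    """Multiply matrix A (list of rows) by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def mahalanobis_sq(L, x, y):
    """Squared distance (x - y)^T M (x - y) with M = L^T L.

    Parameterizing M through L keeps it positive semidefinite.
    """
    diff = [a - b for a, b in zip(x, y)]
    z = matvec(L, diff)
    return sum(v * v for v in z)


# Toy data: two random points in a 1000-dimensional space (assumed sizes).
d_in, d_out = 1000, 20
R = random_projection_matrix(d_in, d_out, seed=42)
data_rng = random.Random(1)
x = [data_rng.random() for _ in range(d_in)]
y = [data_rng.random() for _ in range(d_in)]

# Project both points down to d_out dimensions.
px, py = matvec(R, x), matvec(R, y)

# Placeholder metric: with L = identity, the Mahalanobis distance reduces to
# the squared Euclidean distance. A learned L would replace this.
L = [[1.0 if i == j else 0.0 for j in range(d_out)] for i in range(d_out)]
dist = mahalanobis_sq(L, px, py)
```

The point of projecting first is that a metric over the projected space needs only a d_out × d_out matrix rather than a d_in × d_in one, which is what makes learning it feasible at this scale.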


  • Yelp Dataset Challenge, Yelp Inc, 2013
  • Third Chinese Opinion Analysis Evaluation (COAE2011), Professional Committee of Information Retrieval, 2011
  • "Feng Ru Cup" Science and Technology Competition, Beihang University, 2011