Exploiting structure in customized models

This figure exemplifies why structure matters in data. In the left panel, there is one level of structure: soccer players. Both the features x and the response y describe individual players. A model f can be trained to predict player ability y. (The best player wins the MVP cup.) In the right panel, there are two levels of structure: players and teams. The features x continue to describe players, yet the response y describes team performance. A team is more than the sum of its players, so model f must effectively aggregate the players to generate a team prediction. This figure also motivates ranking, a modeling framework that is often overshadowed by classification and regression.

Source: My poster at the 2011 NYAS Machine Learning Symposium.
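The two-level setting in the right panel can be sketched in a few lines. This is a minimal illustration, not the model from the poster: it assumes a linear player-level score and a pluggable aggregation function, and the names `player_score`, `team_score`, `rank_teams`, the weights and the toy teams are all hypothetical.

```python
from statistics import mean

def player_score(features, weights):
    """Score one player as a weighted sum of that player's features (assumed linear f)."""
    return sum(w * x for w, x in zip(weights, features))

def team_score(players, weights, aggregate=mean):
    """Aggregate player-level scores into a single team-level prediction."""
    return aggregate(player_score(p, weights) for p in players)

def rank_teams(teams, weights, aggregate=mean):
    """Return team names ordered from highest to lowest predicted score."""
    return sorted(
        teams,
        key=lambda name: team_score(teams[name], weights, aggregate),
        reverse=True,
    )

# Toy data (illustrative only): each team is a list of player feature vectors.
weights = [1.0, 0.5]
teams = {
    "A": [[4.0, 0.0], [0.0, 0.0]],  # one star player, one weak player
    "B": [[2.0, 2.0], [2.0, 2.0]],  # two balanced players
}

by_mean = rank_teams(teams, weights)                 # ['B', 'A']
by_best = rank_teams(teams, weights, aggregate=max)  # ['A', 'B']
```

The choice of aggregation changes the ranking: averaging favors the balanced team, while taking the best player favors the team with a star. That is one way to see why the team-level response cannot be read directly off individual players.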

Interpreting macro- and micro-clustering models

These figures plot the scores of the three leading principal components of a sports medicine dataset. In the upper left panel, four clusters categorize the samples into four treatment modalities. Metadata help interpret these modalities: the table lists the sports, complaints and providers associated with each cluster. The lower right panel shows how k-means further divides clusters II, III and IV into subclusters, providing specialized insight into packets of highly similar samples. Awareness of these clusters and subclusters can assist in planning a sports medicine clinic and training providers.

Source: Adapted from the paper Advanced Treatment Monitoring... that I co-authored with J. Siedlik et al.
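The macro/micro idea can be sketched as a second round of k-means inside selected clusters. This is a rough illustration under stated assumptions, not the pipeline from the paper: it assumes the points are already principal component scores, uses k-means at both levels (the source only says k-means produced the subclusters), and uses a deterministic farthest-point initialization for reproducibility. The function names and toy points are hypothetical.

```python
import math

def kmeans(points, k, iterations=100):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    # Deterministic farthest-point initialization (an assumption, chosen for reproducibility).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster empties
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def subcluster(points, macro_labels, k_sub, split=None):
    """Rerun k-means inside each macro cluster listed in `split` (default: all).

    Returns one (macro, micro) label pair per point; clusters that are not
    split keep a single micro label of 0.
    """
    result = [None] * len(points)
    for macro in sorted(set(macro_labels)):
        idx = [i for i, lab in enumerate(macro_labels) if lab == macro]
        if split is None or macro in split:
            micro = kmeans([points[i] for i in idx], min(k_sub, len(idx)))
        else:
            micro = [0] * len(idx)
        for i, m in zip(idx, micro):
            result[i] = (macro, m)
    return result

# Toy 2-D scores (illustrative only): two well-separated groups, each with two subgroups.
points = [(0, 0), (0, 1), (5, 0), (5, 1), (20, 0), (20, 1), (25, 0), (25, 1)]
macro = kmeans(points, 2)
nested = subcluster(points, macro, 2)
only_second = subcluster(points, macro, 2, split={1})
```

The `split` argument mirrors the situation in the figure, where only some macro clusters are subdivided further.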

Improving visualizations

In this article, Business Insider published a widely circulated map of the United States displaying each state's biggest export trading partner. It shows that 35 states primarily export to Canada. I expanded the map to include the Canadian provinces and territories, and found that all 10 provinces primarily export to the United States. My map goes a step further in highlighting the economic interdependency between the two countries as the future of NAFTA is decided.

In this feature, The Economist created a map of India equating the population of each state to that of a country (according to the 2011 Census of India). I regenerated their map to include all union territories, designing a legend whose bins are closer to logarithmically spaced, selecting a colorblind-friendly palette, and avoiding country duplication. We see that:

  • The population of Canada is squeezed into the small state of Kerala.
  • The population of the United States (or of Brazil and Mexico combined) fits in the two most populous states (Uttar Pradesh and Maharashtra).
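A logarithmically spaced legend, as mentioned above, can be sketched as follows. This is only an illustration of the general idea, not the binning used for the regenerated map: the range of 10 thousand to 1 billion, the five bins, and the function names `log_bin_edges` and `bin_index` are all assumptions.

```python
import bisect
import math

def log_bin_edges(lowest, highest, n_bins):
    """Return n_bins + 1 legend edges spaced evenly on a log10 scale."""
    lo, hi = math.log10(lowest), math.log10(highest)
    step = (hi - lo) / n_bins
    return [10 ** (lo + i * step) for i in range(n_bins + 1)]

def bin_index(value, edges):
    """Return the index of the legend bin containing value, clamped to the valid range."""
    idx = bisect.bisect_right(edges, value) - 1
    return max(0, min(idx, len(edges) - 2))

# Hypothetical legend: five bins covering 10 thousand to 1 billion people.
edges = log_bin_edges(1e4, 1e9, 5)

# Round, made-up populations used only to exercise the binning.
small = bin_index(64_000, edges)         # falls in the 10 thousand to 100 thousand bin
medium = bin_index(33_000_000, edges)    # falls in the 10 million to 100 million bin
large = bin_index(200_000_000, edges)    # falls in the 100 million to 1 billion bin
```

Log spacing gives each order of magnitude its own color, so small territories and very large states remain distinguishable on the same map.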

Nonlinear curve fitting

This figure explains the role of each of the four parameters in the Hill equation, a sigmoidal curve used in many applications, including dose-response relationships in pharmaceutical modeling.

Source: My doctoral dissertation.
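For readers unfamiliar with the equation, here is a minimal sketch of one common four-parameter form. The parameterization and the names `bottom`, `top`, `ec50` and `n` are assumptions; the dissertation may use different symbols or an equivalent rearrangement.

```python
def hill(x, bottom, top, ec50, n):
    """Four-parameter Hill equation for a positive dose x (one common form, assumed here).

    bottom: response as the dose approaches zero (lower asymptote)
    top:    response at saturating dose (upper asymptote)
    ec50:   dose giving a response halfway between bottom and top
    n:      Hill coefficient, controlling the steepness of the sigmoid
    """
    return bottom + (top - bottom) / (1.0 + (ec50 / x) ** n)

# At x == ec50 the response is exactly halfway between the two asymptotes.
midpoint = hill(10.0, bottom=0.0, top=100.0, ec50=10.0, n=2.0)  # 50.0

# A larger Hill coefficient makes the transition around ec50 steeper.
shallow = hill(20.0, bottom=0.0, top=100.0, ec50=10.0, n=1.0)
steep = hill(20.0, bottom=0.0, top=100.0, ec50=10.0, n=4.0)
```

Each parameter has a distinct geometric role: two set the asymptotes, one positions the curve along the dose axis, and one sets its slope.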

Defining ontologies that describe data quality

This figure summarizes how a dose-response experiment is assigned to a category that describes data quality, based on criteria derived from the datapoints.

Source: My doctoral dissertation.

Customized experimental design

This flow chart describes a customized experimental design for accurately fitting curves to hundreds of thousands of experimental samples of widely varying data quality.

Source: My doctoral dissertation.