Introduction to Functional Data Analysis

One area of statistics that I have been involved in extensively is Functional Data Analysis, or FDA for short. Based on my conversations with collaborators and interviewers over the past few years, FDA still seems to be an emerging field that has not attracted much attention. This is not surprising given its relatively short history (the name "Functional Data Analysis" was not coined until the 1990s). Some people assume that FDA is closely related to Functional Analysis in mathematics, but that is not really the case. While theoretical research on some FDA methods does require solid mathematical knowledge, FDA is a very general field focused on the analysis of functional data.

Functional Data

What is "functional data"? Wikipedia defines it as "data providing information about curves, surfaces or anything else varying over a continuum", which is quite a broad concept. In my experience, the field has no strict definition of functional data, but in general the data should have the following properties:

One famous example that is closely related to classical statistics is time series data, since the data are collected over time. However, FDA methods typically rely on different assumptions from traditional time series models and can therefore be more flexible and interpretable in many real-world applications. So the next time you get some time series data, it is worth trying some FDA methods to see how they perform.

Another example is the longitudinal data collected in numerous studies, which in most cases can be viewed as sparse functional data. Of course, the sparsity and irregularity make applying FDA to longitudinal data a bit challenging, but there has been research in this direction in recent years.
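
To make "sparse and irregular" concrete, here is a minimal Python sketch of how such data might be stored. The subjects, visit times, and measurements are made-up toy values, and the function name binned_mean is hypothetical. The crude binned average only stands in for the smoothing of pooled observations that sparse FDA methods typically rely on; it is not any package's method.

```python
from bisect import bisect_right

# Hypothetical toy data: subject id -> list of (time in years, measurement).
# Each subject has a different number of visits at different times,
# so the data cannot be stored as a simple subjects-by-time matrix.
sparse_data = {
    "s1": [(0.0, 5.1), (1.2, 5.8), (3.5, 6.9)],
    "s2": [(0.4, 4.7), (2.1, 6.0)],
    "s3": [(0.1, 5.0), (0.9, 5.5), (2.8, 6.4), (3.9, 7.2)],
}


def binned_mean(data, edges):
    """Pool observations from all subjects and average them within time bins.

    Bin k covers the half-open interval [edges[k], edges[k + 1]).
    Bins with no observations are reported as None.
    """
    sums = [0.0] * (len(edges) - 1)
    counts = [0] * (len(edges) - 1)
    for visits in data.values():
        for t, y in visits:
            k = bisect_right(edges, t) - 1  # index of the bin containing t
            if 0 <= k < len(sums):
                sums[k] += y
                counts[k] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


edges = [0.0, 1.0, 2.0, 3.0, 4.0]  # yearly bins, chosen for illustration
means = binned_mean(sparse_data, edges)
```

No single subject is observed in every bin, yet pooling across subjects still gives a rough picture of the average trajectory. That borrowing of strength across subjects is the basic idea behind treating longitudinal data as sparse functional data.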

The NHANES data set introduced in another post is a good example of dense functional data accompanied by many other nonfunctional variables. The physical activity data, collected by the accelerometers in wearable devices, are released as minute-level activity counts (AC), resulting in 1440 observations per day per study participant. You can check out the NHANES tutorial here.
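
As a rough illustration of what dense functional data look like in practice, the Python sketch below builds a small participant-by-minute array with 1440 values per row and computes a pointwise mean curve. The data are simulated, not NHANES records, and the function names and the shape of the simulated activity curve are assumptions made purely for illustration.

```python
import math
import random

MINUTES_PER_DAY = 1440  # one activity count per minute, as described above


def simulate_day(rng):
    """Simulate one participant-day of minute-level activity counts (toy data)."""
    peak = rng.uniform(10 * 60, 16 * 60)  # hypothetical minute of peak activity
    curve = []
    for minute in range(MINUTES_PER_DAY):
        # Smooth daytime bump plus noise, truncated at zero; purely illustrative.
        level = 100.0 * math.exp(-((minute - peak) / 240.0) ** 2)
        curve.append(max(0.0, level + rng.gauss(0.0, 5.0)))
    return curve


def mean_function(curves):
    """Pointwise mean across curves: one value per minute of the day."""
    n = len(curves)
    return [sum(c[t] for c in curves) / n for t in range(MINUTES_PER_DAY)]


def hourly_totals(curve):
    """Sum minute-level values into 24 hourly bins."""
    return [sum(curve[h * 60:(h + 1) * 60]) for h in range(24)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Rows are participant-days, columns are minutes: a 5 x 1440 array.
curves = [simulate_day(rng) for _ in range(5)]
mu = mean_function(curves)
hourly = hourly_totals(mu)
```

Because every row shares the same 1440-point grid, summaries such as the mean curve reduce to simple column-wise operations. That shared grid is what makes dense functional data comparatively easy to work with.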

Software

In many cases, people are hesitant to try new methods and fall back on older approaches because software is lacking. This has become a problem in many modern research studies. In some cases, it is very difficult to reproduce an analysis on a new large-scale, high-dimensional dataset even with the code. Fortunately, there is some excellent software that facilitates the use of FDA methods, including the refund and mgcv R packages.

I just want to highlight some functions in refund that I have found useful: