# Functional data analysis

This article, Analysis of Samples of Curves (= Functional Data Analysis), was adapted from an original article by Hans-Georg Müller, which appeared in StatProb: The Encyclopedia Sponsored by Statistics and Probability Societies. The original article (StatProb source: http://statprob.com/encyclopedia/FunctionalDataAnalysis2.html) is copyrighted by the author(s); the article has been donated to Encyclopedia of Mathematics, and its further issues are under the Creative Commons Attribution Share-Alike License. All pages from StatProb are contained in the Category StatProb.

2010 Mathematics Subject Classification: Primary: 62G05 Secondary: 62M09

$\def\cov{ {\rm cov}}$ $\def\var{ {\rm var}}$ $\def\ci{\cite}$ $\def\cp{\citep}$ $\def\eps{\varepsilon}$

$\def\T{\mathcal{T}}$ $\def\mt{\mathcal{T}}$ $\def\xk{A_k}$ $\def\xik{A_{ik}}$ $\def\hxk{\hat{A}_k}$ $\def\hxik{\hat{A}_{ik}}$ $\def\tX{\tilde{X}}$ $\def\tY{\tilde{Y}}$ $\def\tij{t_{ij}}$ $\def\Yij{Y_{ij}}$ $\def\Xij{X_{ij}}$ $\def\pk{\phi_k}$

Functional Data Analysis

Hans-Georg Müller

Department of Statistics University of California, Davis

One Shields Ave., Davis, CA 95616, USA.

e-mail: mueller@wald.ucdavis.edu

KEY WORDS: Autocovariance Operator, Clustering, Covariance Surface, Eigenfunction, Infinite-dimensional Data, Karhunen-Loève Representation, Longitudinal Data, Nonparametrics, Panel Data, Principal Component, Registration, Regression, Smoothing, Square Integrable Function, Stochastic Process, Time Course, Tracking, Warping.

1. Overview

Functional data analysis (FDA) refers to the statistical analysis of data samples consisting of random functions or surfaces, where each function is viewed as one sample element. Typically, the random functions contained in the sample are considered to be independent and to correspond to smooth realizations of an underlying stochastic process. FDA methodology then provides a statistical approach to the analysis of repeatedly observed stochastic processes or data generated by such processes. FDA differs from time series approaches, as the sampling design is very flexible, stationarity of the underlying process is not needed, and autoregressive-moving average models or similar time regression models play no role, except where the elements of such models are functions themselves.

FDA also differs from multivariate analysis, the area of statistics that deals with finite-dimensional random vectors, as functional data are inherently infinite-dimensional and smoothness often is a central assumption. Smoothness has no meaning for multivariate data analysis, which in contrast to FDA is permutation invariant. Even sparsely and irregularly observed longitudinal data can be analyzed with FDA methodology, which makes FDA useful for longitudinal or otherwise sparsely sampled data. It is also a key methodology for the analysis of time course, image and tracking data.

The approaches and models of FDA are essentially nonparametric, allowing for flexible modeling. The statistical tools of FDA include smoothing, e.g., based on series expansions, penalized splines, or local polynomial smoothing, and functional principal component analysis. A distinction between smoothing methods and FDA is that smoothing is typically used in situations where one wishes to obtain an estimate for one non-random object (where objects here are functions or surfaces) from noisy observations, while FDA aims at the analysis of a sample of random objects, which may be assumed to be completely observed without noise or to be sparsely observed with noise; many scenarios of interest fall in between these extremes.

An important special situation arises when the underlying random processes generating the data are Gaussian processes, an assumption that is often invoked to justify linear procedures and to simplify methodology and theory. Functional data are ubiquitous and may for example involve samples of density functions \cp{knei:01}, hazard functions, or behavioral tracking data. Application areas that have been emphasized in the statistical literature include growth curves \cp{rao:58, mull:84:2}, econometrics and e-commerce \cp{rams:02:2,jank:06:2}, evolutionary biology \cp{kirk:89,izem:05}, and genetics and genomics \cp{opge:06,mull:08:3}. FDA also applies to panel data as considered in economics and other social sciences.

2. Methodology

Key FDA methods include functional principal component analysis \cp{cast:86,rice:91}, warping and curve registration \cp{gerv:04} and functional regression \cp{rams:91}. Theoretical foundations and asymptotic methods of FDA are closely tied to perturbation theory of linear operators in Hilbert space \cp{daux:82,bosq:00,mas:03}; a reproducing kernel Hilbert space approach has also been proposed \cp{euba:08}, as well as Bayesian approaches \cp{tele:08}. Finite sample implementations typically require addressing ill-posed problems by employing suitable regularization, which is often implemented by penalized least squares or penalized likelihood and by truncated series expansions. A broad overview of methods and applied aspects of FDA can be found in the textbook \ci{rams:05} and some additional reviews are in \ci{rice:04,zhao:04,mull:08:7}.

The basic statistical methodologies of ANOVA, regression, correlation, classification and clustering that are available for scalar and vector data have spurred analogous developments for functional data. An additional aspect is that the time axis itself may be subject to random distortions and adequate functional models sometimes need to reflect such time-warping (also referred to as alignment or registration).

Another issue is that often the random trajectories are not directly observed. Instead, for each sample function one has available measurements on a time grid that may range from very dense to extremely sparse. Sparse and randomly distributed measurement times are frequently encountered in longitudinal studies. Additional contamination of the measurements of the trajectory levels by errors is also common. These situations require careful modeling of the relationship between the recorded observations and the assumed underlying functional trajectories \cp{rice:01, jame:03, mull:05:4}.

Initial analysis of functional data includes exploratory plotting of the observed functions in a "spaghetti plot" to obtain an initial idea of functional shapes, to check for outliers and to identify potential "landmarks". Preprocessing may include outlier removal and registration to adjust for time-warping \cp{gass:95,gerv:04, mull:04:4, jame:07,knei:08}.

3. Functional Principal Components

Basic objects in FDA are the mean function $\mu$ and the covariance function $G$. For square integrable random functions $X(t)$, $$\mu(t)=E(X(t)), \quad G(s,t)=\cov\left\{X(s),X(t)\right\},\quad s,t \in \T,$$ with auto-covariance operator $(A f)(t) = \int_{\T}\, f(s) G(s,t)\, ds.$ This linear operator of Hilbert-Schmidt type has orthonormal eigenfunctions $\pk,\, k=1,2,\ldots,$ with associated ordered eigenvalues $\lambda_{1} \ge \lambda_{2} \ge \ldots$, such that $A\, \pk = \lambda_k \, \pk.$ The foundation for functional principal component analysis is the Karhunen-Loève representation of random functions \cp{karh:46,gren:50,gikh:69} $$X(t)=\mu(t)+\sum\limits_{k=1}^{\infty} \xk\,\pk(t),$$ where $\xk=\int_{\T} (X(t)-\mu(t))\pk(t)\,dt$ are uncorrelated centered random variables with $\var(\xk)=\lambda_{k}$, referred to as functional principal components (FPCs).
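To make the eigen-representation concrete, the following is a minimal sketch of functional principal component analysis in the simplest setting: curves observed without noise on a common equispaced grid, with integrals approximated by Riemann sums. The function name `fpca_dense`, the simulated two-component process and all parameter values are illustrative assumptions, not part of the original article or of any particular software package.

```python
import numpy as np

def fpca_dense(curves, grid, n_components):
    """Sketch of FPCA for noise-free curves on a common equispaced grid.

    curves: array of shape (n_curves, n_grid); grid: array of shape (n_grid,).
    Returns the mean function, eigenvalues, eigenfunctions and FPC scores.
    """
    dt = grid[1] - grid[0]                       # equispaced grid is assumed
    mu = curves.mean(axis=0)                     # estimate of mu(t)
    centered = curves - mu
    G = centered.T @ centered / (curves.shape[0] - 1)   # estimate of G(s, t)
    # Discretized operator: (A f)(t) = int f(s) G(s, t) ds  ~  (G * dt) @ f
    evals, evecs = np.linalg.eigh(G * dt)
    order = np.argsort(evals)[::-1][:n_components]      # largest eigenvalues first
    lam = evals[order]
    phi = evecs[:, order].T / np.sqrt(dt)        # rescale so that int phi_k^2 dt = 1
    scores = centered @ phi.T * dt               # A_ik = int (X_i - mu) phi_k dt
    return mu, lam, phi, scores

# Simulated example (assumed process with two components).
rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 101)
true_phi = np.vstack([np.sqrt(2) * np.sin(np.pi * grid),
                      np.sqrt(2) * np.sin(2 * np.pi * grid)])
A = rng.normal(size=(200, 2)) * np.sqrt([4.0, 1.0])   # var(A_k) = lambda_k
curves = grid + A @ true_phi                          # mean function mu(t) = t

mu, lam, phi, scores = fpca_dense(curves, grid, n_components=2)
reconstruction = mu + scores @ phi               # truncated Karhunen-Loeve expansion
```

In this sketch the sample variance of each column of `scores` equals the corresponding entry of `lam`, mirroring $\var(\xk)=\lambda_k$. Sparse or noisy designs require smoothing steps that are not shown here.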

Estimation of eigenfunctions, eigenvalues and of FPCs is a core objective of FDA. Various smoothing-based methods and applications for various sampling designs have been considered \cp{jone:92, silv:96, stan:98, card:00:1,jame:00, paul:09:1}. Estimators employing smoothing methods (local least squares or splines) have been developed for various sampling schemes (sparse, dense, with errors) to obtain a data-based version of the eigen-representation, where one regularizes by truncating at a finite number $K$ of included components. The idea is to borrow strength from the entire sample of functions, rather than estimating each function separately. The functional data are then represented by the subject-specific vectors of score estimates $\hxk,\, k=1,\ldots, K$, which can be used to represent individual trajectories and for subsequent statistical analysis \cp{mull:05:4}.
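The idea of borrowing strength can be illustrated for the mean function, typically the first quantity to be estimated. The sketch below pools the measurements of all subjects and applies a local linear (local least squares) smoother with an Epanechnikov kernel. The function name `local_linear_mean`, the kernel, the fixed bandwidth and the simulated sparse design are assumptions made for illustration; in practice the bandwidth would be chosen from the data.

```python
import numpy as np

def local_linear_mean(t_obs, y_obs, t_out, bandwidth):
    """Local linear estimate of the mean function from pooled observations.

    t_obs, y_obs: 1-D arrays pooling (t_ij, Y_ij) over all subjects i.
    t_out: points at which the mean function is estimated.
    """
    est = np.empty(len(t_out))
    for m, t0 in enumerate(t_out):
        d = t_obs - t0
        u = d / bandwidth
        # Epanechnikov kernel weights; zero outside the window |u| <= 1
        w = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)
        X = np.column_stack([np.ones_like(d), d])    # local intercept and slope
        XtW = X.T * w
        beta = np.linalg.solve(XtW @ X, XtW @ y_obs)  # weighted least squares
        est[m] = beta[0]                              # fitted value at t0
    return est

# Simulated sparse design (assumed): 3 to 6 noisy measurements per subject.
rng = np.random.default_rng(1)
t_list, y_list = [], []
for _ in range(100):
    n_i = rng.integers(3, 7)
    t_i = rng.uniform(0.0, 1.0, size=n_i)
    a_i = rng.normal(scale=0.5)                       # subject-specific random effect
    y_i = (np.sin(2 * np.pi * t_i) + a_i * np.cos(np.pi * t_i)
           + rng.normal(scale=0.2, size=n_i))
    t_list.append(t_i)
    y_list.append(y_i)
t_pooled = np.concatenate(t_list)
y_pooled = np.concatenate(y_list)

t_out = np.linspace(0.0, 1.0, 21)
mu_hat = local_linear_mean(t_pooled, y_pooled, t_out, bandwidth=0.15)
```

No single subject contributes enough points to estimate a curve, yet the pooled sample supports a smooth mean estimate. The same pooling principle can be applied to products of centered observations to estimate the covariance surface, which is not shown in this sketch.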

More adequate representations of functional data are sometimes obtained by fitting pre-specified fixed basis functions with random coefficients. In particular, B-splines \cp{sy:97}, P-splines \cp{yao:06} and wavelets \cp{morr:06} have been successfully applied. A general relation between mixed linear models and fitting functional models with basis expansion coefficients can be used to advantage for modeling and implementation of these approaches. In the theoretical analysis, one may distinguish between an essentially multivariate analysis, which results from assuming that the number of series terms is actually finite, leading to parametric rates of convergence, and an essentially functional approach. In the latter, the number of components is assumed to increase with sample size and this leads to "functional" rates of convergence that depend on the properties of underlying processes, such as decay and spacing of the eigenvalues of the autocovariance operator.

4. Functional Regression and Related Models

Functional regression models may include one or several functions among the predictors, responses, or both. For pairs $(X,Y)$ with centered random predictor functions $X$ and scalar responses $Y$, the linear model is $$E(Y|X)=\int_{\T} (X(s)-\mu(s))\beta(s)\,ds.$$ The regression parameter function $\beta$ can be represented in a suitable basis, for example the eigenbasis, with coefficient estimates determined by least squares or similar criteria. The functional linear model has been thoroughly studied, including optimal rates of convergence \cp{card:03:1,card:03:2,mull:05:5,cai:06,hall:07:1,li:07,mas:09}.
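One common way to fit the functional linear model is to expand $\beta$ in the estimated eigenbasis and regress the responses on the leading FPC scores by least squares. The following minimal sketch does this for noise-free curves on a common equispaced grid. The function name `functional_linear_fit`, the added intercept (which allows for uncentered responses), the number of components and the simulated data are assumptions for illustration, not the procedure of any specific reference cited above.

```python
import numpy as np

def functional_linear_fit(curves, y, grid, n_components):
    """Sketch: estimate beta(s) by least squares on the leading FPC scores.

    curves: (n, n_grid) predictor curves on an equispaced grid; y: (n,) responses.
    Returns an intercept and the estimated beta evaluated on the grid.
    """
    dt = grid[1] - grid[0]
    centered = curves - curves.mean(axis=0)
    G = centered.T @ centered / (len(curves) - 1)     # covariance surface estimate
    evals, evecs = np.linalg.eigh(G * dt)
    order = np.argsort(evals)[::-1][:n_components]
    phi = evecs[:, order].T / np.sqrt(dt)             # orthonormal eigenfunctions
    scores = centered @ phi.T * dt                    # FPC scores
    # Intercept is an assumption of this sketch, absorbing uncentered responses.
    design = np.column_stack([np.ones(len(y)), scores])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    beta = coef[1:] @ phi                             # beta(s) = sum_k b_k phi_k(s)
    return coef[0], beta

# Simulated example (assumed): beta lies in the span of two basis functions.
rng = np.random.default_rng(2)
grid = np.linspace(0.0, 1.0, 101)
dt = grid[1] - grid[0]
basis = np.vstack([np.sqrt(2) * np.sin(np.pi * grid),
                   np.sqrt(2) * np.sin(2 * np.pi * grid)])
scores_true = rng.normal(size=(300, 2)) * np.array([2.0, 1.0])
curves = np.cos(grid) + scores_true @ basis
beta_true = 1.5 * basis[0] - 0.5 * basis[1]
signal = (curves - np.cos(grid)) @ beta_true * dt     # int (X - mu) beta ds
y = signal + rng.normal(scale=0.1, size=300)

intercept_hat, beta_hat = functional_linear_fit(curves, y, grid, n_components=2)
```

Truncating at `n_components` acts as the regularization: including components with very small eigenvalues makes the coefficient estimates unstable, which reflects the ill-posed nature of the problem.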

The class of useful functional regression models is large, due to the infinite-dimensional nature of the functional predictors. The case where the response is functional \cp{rams:91} also is of interest. Flexible extensions of the functional linear model include, for example, nonparametric approaches \cp{ferr:06}, where unfavorable small ball probabilities and the non-existence of a density in general random function space impose limits on convergence \cp{mull:09:5}, and multiple index models \cp{jame:05}. Another extension is the functional additive model \cp{mull:08:2}. For functional predictors $X=\mu + \sum_{k=1}^\infty A_k \pk$ and scalar responses $Y$, this model is given by $$E(Y|X)=\sum_{k=1}^\infty f_k(A_k)$$ for smooth functions $f_k$ with $E(f_k(A_k))=0$.

Another variant of the functional linear model, which is also applicable for classification purposes, is the generalized functional linear model $E(Y|X)=g\{\mu + \int_{\T} \, X(s)\beta(s)\,ds\}$ with link function $g$ \cp{jame:02,esca:04, card:05:1, mull:05:1}. The link function (and an additional variance function if applicable) is adapted to the (often discrete) distribution of $Y$; the components of the model can be estimated by quasi-likelihood. Besides discriminant analysis via the binomial functional generalized linear model, various other methods have been studied for functional clustering and discriminant analysis \cp{jame:03,chio:07,chio:08}.

Of practical relevance are extensions towards polynomial functional regression models \cp{mull:10:1}, hierarchical functional models \cp{crai:09:1}, models with varying domains, models with more than one predictor function, and functional (autoregressive) time series models, among others. In addition to the functional trajectories themselves, derivatives are of interest to study the dynamics of the underlying processes \cp{rams:05}. Software for functional data analysis evolves rapidly and is available from various sources. Freely available software includes, for example, the fda package (R and MATLAB), at http://www.psych.mcgill.ca/misc/fda/software.html, and the PACE package (MATLAB), at http://anson.ucdavis.edu/~mueller/data/pace.html.

Acknowledgments

Research supported in part by NSF Grant DMS-0806199. Based on an article from Lovric, Miodrag (2011), International Encyclopedia of Statistical Science. Heidelberg: Springer Science+Business Media, LLC.

How to Cite This Entry:
Functional data analysis. Encyclopedia of Mathematics. URL: http://encyclopediaofmath.org/index.php?title=Functional_data_analysis&oldid=37743