%%% Title of object: Nonparametric regression using kernel and spline methods
%%% Canonical Name: NonparametricRegressionUsingKernelAndSplineMethods3
%%% Type: Topic
%%% Created on: 2010-08-28 01:55:12
%%% Modified on: 2011-05-11 00:43:13
%%% Creator: jopsomer
%%% Modifier: misha123
%%% Author: misha123
%%% Author: jopsomer
%%%
%%% Classification: msc:62G08
%%% Keywords: smoothing
%%% Preamble:
\documentclass[10pt]{article}
% this is the default PlanetMath preamble. as your knowledge
% of TeX increases, you will probably want to edit this, but
% it should be fine as is for beginners.
% almost certainly you want these
\usepackage{amssymb}
\usepackage{amsmath}
\usepackage{amsfonts}
% used for TeXing text within eps files
%\usepackage{psfrag}
% need this for including graphics (\includegraphics)
\usepackage{graphicx}
% for neatly defining theorems and propositions
%\usepackage{amsthm}
% making logically defined graphics
%\usepackage{xypic}
% there are many more packages, add them here as you need them
\usepackage{epsfig}
\usepackage{chicago}
% define commands here
\def\R{\mbox{\rlap{I}\hskip .03in R}}
\def\Z{\mbox{\rlap{Z}\hskip .03in Z}}
\newcommand{\law}{\stackrel{{\cal L}}{\to}}
\newcommand{\prob}{\stackrel{p}{\to}}
\newcommand{\bm}[1]{{\mbox{\boldmath $#1$}}}
\def\bn#1{{\bf #1}}
\def\qed{\par {\hfill \rule{3mm}{3mm}}}
\newcommand{\E}{{\rm E}}
\newcommand{\Var}{{\rm Var}}
\newcommand{\eac}[1]{{\em et al.}~\cite{#1}}
\newtheorem{theorem}{Theorem}
\newtheorem{lemma}{Lemma}
\newtheorem{propos}{Proposition}
\newtheorem{assume}{\rlap{A}\hskip .06in }
\newtheorem{cor}{Corollary}
%%% Content:
\begin{document}
\title{Nonparametric regression using kernel and spline methods}
\author{Jean D. Opsomer\thanks{Department of Statistics, Colorado State University, Fort Collins, CO, USA. Email: jopsomer@stat.colostate.edu.}
\and F. Jay Breidt \thanks{Department of Statistics, Colorado State University, Fort Collins, CO, USA. Email: jbreidt@stat.colostate.edu}
}
\maketitle
\section{The statistical model}
\label{model}
When applying nonparametric regression methods, the researcher is interested in estimating the relationship
between one dependent variable, $Y$, and one or several covariates,
$X_1,\ldots,X_q$. We discuss here the situation with one covariate, $X$ (the case with multiple covariates is addressed in the references provided below). The relationship between $X$ and $Y$ can be expressed as the conditional expectation
\begin{displaymath}
\E(Y|X=x) = f(x).
\end{displaymath} Unlike in parametric regression, the shape of the function $f(\cdot)$ is
not restricted to belong to a specific parametric family such as polynomials.
This representation for the mean function is the key difference between parametric and
nonparametric regression, and the remaining aspects of the statistical model for
$(X,Y)$ are similar between both regression approaches. In particular, the random variable
$Y$ is often assumed to have a constant (conditional) variance,
$\Var(Y|X)=\sigma^2$, with $\sigma^2$ unknown. The constant variance and other common regression model assumptions, such as independence, can be relaxed
just as in parametric regression.
\section{Kernel methods}
Suppose that we have a dataset available with observations $(x_1,y_1),\ldots,(x_n,y_n)$. A simple kernel-based estimator of $f(x)$ is the {\em Nadaraya-Watson kernel regression} estimator, defined as
\begin{equation}
\hat{f}_h(x) = \frac{\sum_{i=1}^n K_h(x_i-x) y_i}{\sum_{i=1}^n K_h(x_i-x)},
\label{nadwat}
\end{equation}
with $K_h(\cdot)=K(\cdot/h)/h$ for some kernel function $K(\cdot)$ and bandwidth parameter $h>0$.
The function $K(\cdot)$ is usually a symmetric probability density and examples of commonly used kernel functions are the Gaussian kernel
$K(t)=(\sqrt{2\pi})^{-1} \exp(-t^2/2)$ and the {\em Epanechnikov} kernel $K(t) = \max\{\frac{3}{4} (1-t^2),0\}$.
Generally, the researcher is not interested in estimating the value of $f(\cdot)$ at a
single location $x$, but in estimating the curve over a range of values, say for all $x
\in [a_x,b_x]$. In principle, kernel regression requires computing (\ref{nadwat}) for any
value of interest. In practice, $\hat{f}_h(x)$ is calculated on a sufficiently fine grid of $x$-values and the curve is obtained by interpolation.

We used the subscript $h$ in $\hat{f}_h(x)$ in (\ref{nadwat}) to emphasize that the bandwidth $h$ is the main determinant of the shape of the estimated regression, as demonstrated in Figure \ref{kernregex}. When $h$ is small relative to the range of the data, the resulting fit can be highly variable and look ``wiggly.'' A larger $h$ produces a smoother, less variable fit, but makes the estimator less responsive to local features in the data and can introduce bias. Selecting a bandwidth that balances the variance against the potential bias is therefore a crucial decision for researchers applying nonparametric regression to their data. Data-driven bandwidth selection methods are available in the literature, including in the references provided below.
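
As a concrete illustration, (\ref{nadwat}) can be computed in a few lines. The following Python sketch is our own (the function name and signature are not from the original article); it evaluates $\hat{f}_h$ at a grid of points using either the Gaussian or the Epanechnikov kernel defined above.

```python
import numpy as np

def nadaraya_watson(x_grid, x, y, h, kernel="gaussian"):
    """Evaluate the Nadaraya-Watson estimator f_hat_h at each grid point.

    Illustrative helper: (x, y) are the data, h > 0 the bandwidth,
    and K_h(u) = K(u / h) / h.
    """
    t = (np.asarray(x, dtype=float)[None, :]
         - np.asarray(x_grid, dtype=float)[:, None]) / h
    if kernel == "gaussian":
        K = np.exp(-0.5 * t ** 2) / np.sqrt(2.0 * np.pi)
    else:  # Epanechnikov
        K = np.maximum(0.75 * (1.0 - t ** 2), 0.0)
    W = K / h  # K_h(x_i - x); the 1/h factor cancels in the ratio
    return (W @ np.asarray(y, dtype=float)) / W.sum(axis=1)
```

As described above, one evaluates this on a sufficiently fine grid of $x$-values and interpolates to obtain the fitted curve.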
\begin{figure}[ht]
\centering
\includegraphics[scale=0.35]{bank_swallow.eps}
\caption{\small Dates (Julian days) of first sightings of bank swallows in Cayuga Lake
basin, with three kernel regressions using bandwidth values $h$ calculated as the range of
years multiplied by 0.05 ($--$), 0.2 (--) and 0.4 ($-\cdot$).}
\label{kernregex}
\end{figure}

A class of kernel-based estimators that generalizes the Nadaraya-Watson estimator in (\ref{nadwat}) is referred to as {\em local polynomial regression} estimators. At each location $x$, the estimator $\hat{f}_h(x)$ is obtained as the estimated intercept, $\hat{\beta}_0$, in the weighted least squares fit of a polynomial of degree $p$,
\begin{displaymath}
\min_{\bm{\beta}} \sum_{i=1}^n \left(y_i - \beta_0 - \beta_1 (x_i-x) - \cdots - \beta_p (x_i-x)^p\right)^2 K_h(x_i-x).
\end{displaymath}
This estimator can be written explicitly in matrix notation as
\begin{equation}
\hat{f}_h(x) = (1,0,\ldots,0) \left(\bm{X}_x^T \bm{W}_x \bm{X}_x
\right)^{-1} \bm{X}_x^T \bm{W}_x
\bm{Y},
\label{lprdef}
\end{equation}
where $\bm{Y}=(y_1,\ldots,y_n)^T$, $\bm{W}_x =
\mbox{diag}\{K_h(x_1-x),\ldots,K_h(x_n-x)\}$ and
\begin{displaymath}
\bm{X}_x = \left[ \begin{array}{cccc}
1 & x_1-x & \cdots & (x_1-x)^p \\
\vdots & \vdots & & \vdots \\
1 & x_n-x & \cdots & (x_n-x)^p
\end{array} \right].
\end{displaymath}

It should be noted that the Nadaraya-Watson estimator (\ref{nadwat}) is a special case of the local polynomial regression estimator with $p=0$. In practice, the local linear ($p=1$) and local quadratic ($p=2$) estimators are frequently used.
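
To make the matrix formula concrete, the following Python sketch (our own illustration, with no claim of numerical efficiency) is a direct transcription of (\ref{lprdef}): the estimate at $x$ is the first entry of the weighted least squares solution.

```python
import numpy as np

def local_poly(x0, x, y, h, p=1):
    """Local polynomial regression estimate of f(x0) of degree p with a
    Gaussian kernel; returns the estimated intercept beta_0_hat."""
    x = np.asarray(x, dtype=float)
    d = x - x0
    # Columns 1, (x_i - x0), ..., (x_i - x0)^p of the design matrix X_x
    X = np.vander(d, N=p + 1, increasing=True)
    # Kernel weights K_h(x_i - x0) on the diagonal of W_x
    w = np.exp(-0.5 * (d / h) ** 2) / (h * np.sqrt(2.0 * np.pi))
    XtW = X.T * w  # X^T W without forming the diagonal matrix
    beta = np.linalg.solve(XtW @ X, XtW @ np.asarray(y, dtype=float))
    return beta[0]
```

For $p=0$ this reduces to the Nadaraya-Watson estimator.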

An extensive literature on kernel regression and local polynomial regression exists, and their theoretical properties are well understood. Both kernel regression and local polynomial regression estimators are biased but consistent estimators of the unknown mean function, when that function is continuous and sufficiently smooth. For further information on these methods, we refer the reader to the monographs by \citeN{wan95} and \citeN{fan96}.
\section{Spline methods}
\label{spline}
In the previous section, the unknown mean function was assumed to be {\em locally} well
approximated by a polynomial, which led to local polynomial regression. An alternative
approach is to represent the fit as a {\em piecewise} polynomial, with the pieces
connecting at points called {\em knots}. Once the knots are selected,
such an estimator can be computed globally in a manner similar to that for a
parametrically specified mean function, as will be explained below. A fitted mean
function represented by a piecewise continuous curve only rarely provides a satisfactory
fit, however, so that usually the function and at least its first derivative are
constrained to be continuous everywhere, with only the second or higher derivatives allowed to be
discontinuous at the knots. For historical reasons, these constrained piecewise
polynomials are referred to as {\em splines}, leading to the name {\em spline regression}
or {\em spline smoothing} for this type of nonparametric regression.

Consider the following simple type of polynomial spline of degree $p$:
\begin{equation}
\beta_0 + \beta_1 x + \cdots + \beta_p x^p + \sum_{k=1}^K \beta_{p+k} (x-\kappa_k)^p_+,
\label{regspline}
\end{equation}
where $p\geq 1$, $\kappa_1,\ldots,\kappa_K$ are the knots and
$(\cdot)^p_+ = \max\{(\cdot)^p,0\}$. Clearly, (\ref{regspline}) has continuous derivatives up to order $p-1$, but the $p$th derivative can be discontinuous at the
knots. Model (\ref{regspline}) is constructed as a linear combination of {\em basis functions} $1,
x, \ldots, x^p, (x-\kappa_1)^p_+,\ldots, %%\linebreak[1]
(x-\kappa_K)^p_+$. This basis is
referred to as the {\em truncated power basis}. Another popular set of basis functions is the so-called
{\em B-splines}. Unlike the truncated power basis functions, B-splines have compact support and are numerically more stable, but they span the same function space. In what follows, we write $\psi_j(x)$, $j=1,\ldots,J$, for a generic set of basis
functions used in fitting regression splines, and replace (\ref{regspline}) by
$\beta_1 \psi_1(x) + \cdots +\beta_J \psi_J(x)$.

For fixed knots, a regression spline is linear in the unknown parameters $\bm{\beta} = (\beta_1,\ldots,\beta_{J})^T$ and can be fitted parametrically using least squares techniques. Under the homoskedastic model described in Section \ref{model}, the {\em regression spline} estimator for $f(x)$ is
obtained by solving
\begin{equation}
\hat{\bm{\beta}} = \arg \min_{\bm{\beta}} \sum_{i=1}^n \left(y_i - \sum_{j=1}^J \beta_j \psi_j(x_i)\right)^2
\label{Bspline}
\end{equation}
and setting $\hat{f}(x) = \sum_{j=1}^J \hat{\beta}_j \psi_j(x)$. Since deviations from the parametric shape can only occur at the knots, the amount of
smoothing is determined by the degree of the basis and the location and number of knots. In practice, the degree is fixed (with $p=1, 2$ or 3 as common choices) and the knot locations are usually chosen to be equally-spaced over the range of the data or placed at
regularly spaced data quantiles. Hence, the number of knots $K$ is the only
remaining smoothing parameter for the spline regression estimator. As $K$ (and therefore $J$) is chosen to be larger, increasingly flexible estimators for $f(\cdot)$ are produced. This reduces the potential bias due to approximating the unknown mean function by a spline function, but increases the variability of the estimators.
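
Since for fixed knots the minimization in (\ref{Bspline}) is an ordinary least squares problem, the fit is straightforward to compute. The following Python sketch (our own illustration) uses the truncated power basis of (\ref{regspline}); all function names are hypothetical.

```python
import numpy as np

def truncated_power_basis(x, knots, p=3):
    """Basis matrix with columns 1, x, ..., x^p, (x - kappa_k)_+^p."""
    x = np.asarray(x, dtype=float)
    poly = np.vander(x, N=p + 1, increasing=True)
    trunc = np.maximum(
        x[:, None] - np.asarray(knots, dtype=float)[None, :], 0.0) ** p
    return np.hstack([poly, trunc])

def fit_regression_spline(x, y, knots, p=3):
    """Ordinary least squares fit of the regression spline coefficients."""
    Psi = truncated_power_basis(x, knots, p)
    beta, *_ = np.linalg.lstsq(Psi, np.asarray(y, dtype=float), rcond=None)
    return beta

def predict_spline(x_new, beta, knots, p=3):
    """Evaluate f_hat(x) = sum_j beta_j_hat psi_j(x)."""
    return truncated_power_basis(x_new, knots, p) @ beta
```

In practice a B-spline basis is usually preferred, for the numerical stability reasons noted above; the least squares step is unchanged.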

The {\em smoothing spline} estimator is an important extension of the regression spline estimator. The smoothing spline estimator of $f(\cdot)$, for data generated by the
statistical model described in Section \ref{model}, is defined as the minimizer of
\begin{equation}
\sum_{i=1}^n (y_i - f(x_i))^2 + \lambda
\int_{a_x}^{b_x} (f^{(p)}(t))^2 dt,
\label{sspline}
\end{equation}
over the set of all functions $f(\cdot)$ with continuous
$(p-1)$th derivative and square-integrable $p$th derivative, where
$\lambda >0$ is a constant determining the degree of smoothness of the estimator. Larger values of $\lambda$ correspond to smoother fits. The
choice $p=2$ leads to the popular {\em cubic smoothing splines}. While not immediately obvious
from the definition, the function minimizing (\ref{sspline}) is exactly equal to a special
type of regression spline with knots at each of the observation points
$x_1,\ldots,x_n$ (assuming each of the locations $x_i$ is unique).
%For further information on smoothing splines, see the entry by Wahba (same volume).
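
The trade-off in (\ref{sspline}) can be illustrated with a discrete analogue, the Whittaker smoother, which replaces the integrated squared $p$th derivative by a sum of squared $p$th-order differences of the fitted values. The Python sketch below is our own illustration and not part of the original article.

```python
import numpy as np

def whittaker_smoother(y, lam, d=2):
    """Minimize ||y - f||^2 + lam * ||D_d f||^2 over f, where D_d takes
    d-th order differences of f; the d = 2 case mimics the behavior of
    cubic smoothing splines for equally spaced x-values."""
    y = np.asarray(y, dtype=float)
    n = y.size
    D = np.diff(np.eye(n), n=d, axis=0)  # d-th difference matrix, (n - d) x n
    # Closed-form ridge-type solution of the penalized criterion
    return np.linalg.solve(np.eye(n) + lam * D.T @ D, y)
```

As $\lambda \to \infty$ the fit is forced toward a polynomial of degree $d-1$, mirroring the limiting behavior of the smoothing spline.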

Traditional regression spline fitting as in (\ref{Bspline}) is usually done using a
relatively small number of knots. By construction, smoothing splines use a large number of knots (typically, $n$ knots), but the smoothness of the function is controlled by a penalty term and the smoothing
parameter $\lambda$. The {\em penalized spline} estimator represents a compromise between these two approaches. It uses a moderate number of knots and puts a penalty on the coefficients of the basis functions. Specifically, a simple type of penalized spline estimator for $f(\cdot)$ is obtained by solving
\begin{equation}
\hat{\bm{\beta}} = \arg \min_{\bm{\beta}} \sum_{i=1}^n \left(y_i - \sum_{j=1}^J \beta_j \psi_j(x_i)\right)^2 + \lambda
\sum_{j=1}^J
\beta_j^2
\label{Pspline}
\end{equation}
and setting $\hat{f}_{\lambda}(x) = \sum_{j=1}^J \hat{\beta}_j \psi_j(x)$ as for regression splines. Penalized splines combine the advantage of a parametric fitting method, as for regression
splines, with the flexible adjustment of the degree of smoothness as in smoothing splines. Both the basis function and the exact form of the penalization of the coefficients can be varied to accommodate a large range of regression settings.
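
The minimization in (\ref{Pspline}) is a ridge regression in the spline basis, with closed-form solution $\hat{\bm{\beta}} = (\bm{\Psi}^T\bm{\Psi} + \lambda \bm{I})^{-1}\bm{\Psi}^T\bm{Y}$. The self-contained Python sketch below is our own illustration; it penalizes all coefficients equally, as in the simple criterion above, whereas in practice often only the knot coefficients are penalized.

```python
import numpy as np

def penalized_spline_fit(x, y, knots, p=3, lam=1.0):
    """Ridge-penalized least squares in the truncated power basis:
    minimizes sum_i (y_i - Psi_i beta)^2 + lam * sum_j beta_j^2."""
    x = np.asarray(x, dtype=float)
    Psi = np.hstack([
        np.vander(x, N=p + 1, increasing=True),  # 1, x, ..., x^p
        np.maximum(
            x[:, None] - np.asarray(knots, dtype=float)[None, :], 0.0) ** p,
    ])
    J = Psi.shape[1]
    beta = np.linalg.solve(Psi.T @ Psi + lam * np.eye(J),
                           Psi.T @ np.asarray(y, dtype=float))
    return Psi, beta
```

Increasing $\lambda$ shrinks the coefficients toward zero and thus produces a smoother fit, exactly the role played by the bandwidth in kernel methods.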

Spline-based regression methods are extensively described in the statistical literature. While the theoretical properties of (unpenalized) regression splines and smoothing splines are well established, results for penalized regression splines have only recently become available. The monographs by \citeN{wah90}, \citeN{eub99} and \shortciteN{rup03} are good sources of information on spline-based methods.
\section*{Acknowledgements}
Based on an article from Lovric, Miodrag (2011), International
Encyclopedia of Statistical Science. Heidelberg: Springer Science
+Business Media, LLC.
\begin{thebibliography}{}
\bibitem[\protect\citeauthoryear{Eubank}{Eubank}{1999}]{eub99}
Eubank, R.~L. (1999).
\newblock {\em Nonparametric Regression and Spline Smoothing\/} (2nd ed.).
\newblock New York: Marcel Dekker.
\bibitem[\protect\citeauthoryear{Fan and Gijbels}{Fan and
Gijbels}{1996}]{fan96}
Fan, J. and I.~Gijbels (1996).
\newblock {\em Local Polynomial Modelling and its Applications}.
\newblock London: Chapman \& Hall.
\bibitem[\protect\citeauthoryear{Ruppert, Wand, and Carroll}{Ruppert
et~al.}{2003}]{rup03}
Ruppert, D., M.~P. Wand, and R.~J. Carroll (2003).
\newblock {\em Semiparametric Regression}.
\newblock Cambridge, UK: Cambridge University Press.
\bibitem[\protect\citeauthoryear{Wahba}{Wahba}{1990}]{wah90}
Wahba, G. (1990).
\newblock {\em Spline Models for Observational Data}.
\newblock Philadelphia: Society for Industrial and Applied Mathematics (SIAM).
\bibitem[\protect\citeauthoryear{Wand and Jones}{Wand and Jones}{1995}]{wan95}
Wand, M.~P. and M.~C. Jones (1995).
\newblock {\em Kernel Smoothing}.
\newblock London: Chapman \& Hall.
\end{thebibliography}
\end{document}