%%% Title of object: Statistical Approaches to Protecting Confidentiality in Public Use Data
%%% Canonical Name: StatisticalApproachesToProtectingConfidentialityInPublicUseData
%%% Type: Topic
%%% Created on: 2010-08-23 22:08:58
%%% Modified on: 2010-08-24 14:02:30
%%% Creator: jerryreiter
%%% Modifier: jkimmel
%%% Author: jerryreiter
%%%
%%% Classification: msc:62P99, msc:62D05
%%% Keywords: Confidentiality, Imputation, Public, Survey
%%% Synonyms: Statistical Approaches to Protecting Confidentiality in Public Use Data=Anonymization
%%% Statistical Approaches to Protecting Confidentiality in Public Use Data=De-identification
%%% Statistical Approaches to Protecting Confidentiality in Public Use Data=Disclosure
%%% Statistical Approaches to Protecting Confidentiality in Public Use Data=Privacy
%%% Preamble:
\documentclass[10pt]{article}
% this is the default PlanetMath preamble. as your knowledge
% of TeX increases, you will probably want to edit this, but
% it should be fine as is for beginners.
% almost certainly you want these
\usepackage{amssymb}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{natbib}
% used for TeXing text within eps files
%\usepackage{psfrag}
% need this for including graphics (\includegraphics)
%\usepackage{graphicx}
% for neatly defining theorems and propositions
%\usepackage{amsthm}
% making logically defined graphics
%\usepackage{xypic}
% there are many more packages, add them here as you need them
% define commands here
%%% Content:
\begin{document}
Many national statistical agencies, survey organizations, and
researchers---henceforth all called agencies---collect data that they
intend to share with others. Wide dissemination of data facilitates advances in science
and public policy, enables students to develop skills at data analysis,
and helps ordinary citizens learn about their communities.
Often, however, agencies cannot release data as collected, because
doing so could reveal data subjects' identities or values of
sensitive attributes. Failure to protect confidentiality can have
serious consequences for agencies, since they may be violating
laws or institutional rules enacted to protect confidentiality. Additionally,
when confidentiality is compromised, the agencies may lose the trust
of the public, so that potential respondents are less willing to give
accurate answers, or even to participate, in future studies
\citep{reiterchance}.

At first glance, sharing safe data with others seems a straightforward task:
simply strip unique identifiers like names, tax identification numbers,
and exact addresses before releasing data. However, these actions
alone may not suffice when quasi-identifiers, such as demographic
variables, employment/education histories, or establishment sizes,
remain on the file. These quasi-identifiers can be used to match units in the
released data to other databases. For example, \citet{sweeney:1997}
showed that 97\% of the records in a medical database for Cambridge,
MA, could be identified using only birth date and 9-digit ZIP code by
linking them to a publicly available voter registration list.

Agencies therefore further limit what they release,
typically by altering the collected data \citep{willenborg:waal:2001}. Common
strategies include those listed below. Most public use data sets
released by national statistical agencies have undergone at least one
of these methods of statistical disclosure limitation.
\vspace{12pt}
\noindent {\bf Aggregation.} Aggregation reduces disclosure risks by turning atypical records---which
generally are most at risk---into typical records. For example,
there may be only one person with a particular combination of
demographic characteristics in a city, but many
people with those characteristics in a state.
Releasing data for this
person with geography at the city level might have a high disclosure risk,
whereas releasing the data at the state level might not.
Unfortunately, aggregation makes analysis at finer levels difficult and often
impossible, and it creates problems of ecological inference.
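The mechanics can be sketched in a few lines of Python. The records and the city-to-state mapping below are invented for illustration; real agencies coarsen geography (or other quasi-identifiers) on files with millions of records.

```python
from collections import Counter

# Toy records with a quasi-identifier pair (age, sex) plus fine geography.
records = [
    {"age": 34, "sex": "F", "city": "Cambridge"},
    {"age": 34, "sex": "F", "city": "Boston"},
    {"age": 71, "sex": "M", "city": "Springfield"},
]
city_to_state = {"Cambridge": "MA", "Boston": "MA", "Springfield": "MA"}

def aggregate_geography(recs, mapping):
    """Replace the fine-grained 'city' field with a coarser 'state' field."""
    out = []
    for r in recs:
        coarse = dict(r)
        coarse["state"] = mapping[coarse.pop("city")]
        out.append(coarse)
    return out

released = aggregate_geography(records, city_to_state)
# The (34, F) combination is unique at the city level but shared by two
# records at the state level, so the aggregated release is less risky.
state_counts = Counter((r["age"], r["sex"], r["state"]) for r in released)
```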
\vspace{12pt}
\noindent {\bf Top coding.} Agencies can report sensitive values exactly only
when they are above or below certain thresholds, for example reporting
all incomes above \$200,000 as ``\$200,000 or more.'' Monetary
variables and ages are frequently reported
with top codes, and sometimes with bottom codes as well. Top or
bottom coding by definition eliminates detailed inferences about
the distribution beyond the thresholds. Truncating the tails also
degrades estimates of whole-data quantities, such as means and variances.
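A minimal sketch of top coding, with invented income values and the \$200,000 threshold from the text; a real release would label the censored values as ``\$200,000 or more'' rather than report them as an exact number.

```python
def top_code(values, threshold):
    """Censor values above `threshold` at the threshold itself; in a real
    public use file these cells would carry a 'threshold or more' label."""
    return [min(v, threshold) for v in values]

incomes = [52_000, 180_000, 245_000, 1_300_000]
released = top_code(incomes, 200_000)
# released == [52000, 180000, 200000, 200000]
```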
\vspace{12pt}
\noindent {\bf Suppression.} Agencies can delete sensitive
values from the released data. They might suppress
entire variables or just at-risk data values. Suppression of
particular data values generally creates data that are not missing at
random, which are difficult to analyze properly.
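A sketch of cell suppression on invented records: the flagged sensitive cells are blanked, so the released file contains missing values exactly where the at-risk data were.

```python
def suppress(records, at_risk_rows, fields):
    """Blank the listed sensitive fields for rows flagged as at risk;
    the released file then has missing values in those cells."""
    out = [dict(r) for r in records]  # copy so the originals are untouched
    for i in at_risk_rows:
        for f in fields:
            out[i][f] = None
    return out

data = [{"age": 34, "income": 52_000}, {"age": 99, "income": 1_300_000}]
released = suppress(data, at_risk_rows=[1], fields=["income"])
# released[1]["income"] is None; the missingness depends on the (sensitive)
# value itself, which is why such data are not missing at random.
```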
\vspace{12pt}
\noindent {\bf Data swapping.} Agencies can swap data values for selected
records---for example, switch values of age, race, and sex for at-risk
records with those for other records---to discourage users from
matching, since matches may be based on incorrect data
\citep{dalenius}. Swapping is used extensively by government
agencies. It is generally presumed that swapping fractions are low---agencies
do not reveal the rates to the public---because swapping at
high levels destroys relationships involving the swapped and
unswapped variables.
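The following sketch swaps quasi-identifier fields between randomly paired records; the records, the fields chosen, and the swap fraction are all invented for illustration (agencies do not reveal theirs).

```python
import random

def swap_fields(records, fields, frac, rng):
    """Swap the listed quasi-identifier fields between randomly paired
    records; `frac` is the fraction of records drawn into swaps."""
    recs = [dict(r) for r in records]
    n_pairs = int(len(recs) * frac) // 2
    idx = rng.sample(range(len(recs)), 2 * n_pairs)
    for a, b in zip(idx[0::2], idx[1::2]):
        for f in fields:
            recs[a][f], recs[b][f] = recs[b][f], recs[a][f]
    return recs

data = [{"age": 30 + i, "sex": "F" if i % 2 else "M", "income": 1_000 * i}
        for i in range(10)]
released = swap_fields(data, fields=["age", "sex"], frac=0.4,
                       rng=random.Random(7))
# Marginal distributions of age and sex are preserved exactly, but the
# link between (age, sex) and income is broken for the swapped records.
```

Note that unswapped fields (here, income) stay attached to their original records, which is why high swap rates distort relationships between swapped and unswapped variables.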
\vspace{12pt}
\noindent {\bf Adding random noise.} Agencies can protect numerical
data by adding some randomly selected amount to the observed values, for
example a random draw from a normal distribution with mean
equal to zero \citep{fuller:1993}. This reduces the chances of accurate
matching on the perturbed data and distorts the values of sensitive
variables. The degree of confidentiality protection depends on the
nature of the noise distribution; for example, using a large variance
provides greater protection. However, adding noise with large variance
introduces measurement error that inflates the spread of marginal
distributions and attenuates regression coefficients \citep{winkler:2003}.
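In its simplest form, the scheme just adds an independent mean-zero normal draw to each value, as in this sketch (the incomes and the noise standard deviation are invented):

```python
import random

def add_gaussian_noise(values, sd, rng):
    """Add independent mean-zero normal noise with standard deviation `sd`
    to each value; larger `sd` gives more protection and more distortion."""
    return [v + rng.gauss(0.0, sd) for v in values]

rng = random.Random(42)
incomes = [52_000.0, 61_500.0, 245_000.0]
released = add_gaussian_noise(incomes, sd=5_000.0, rng=rng)
# Every released value differs from its original; on average the noise
# cancels, but each marginal variance is inflated by sd**2.
```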
\vspace{12pt}
\noindent {\bf Synthetic data.} The basic idea of synthetic data is to replace
original data values at high risk of disclosure with values simulated from probability
distributions \citep{rubin:1993}. These distributions are specified to reproduce as many
of the relationships in the original data as possible. Synthetic data approaches
come in two flavors: partial and full synthesis \citep{reiter:raghu:07}.
Partially synthetic data comprise the units originally surveyed with
some subset of collected values replaced with simulated values. For
example, the agency might simulate sensitive or identifying variables for units in the sample with rare
combinations of demographic characteristics; or, the agency might replace all data
for selected sensitive variables. Fully synthetic
data comprise an entirely simulated data set; the originally sampled units are
not on the file. In both types, the agency generates and releases
multiple versions of the data (as in multiple imputation for
missing data). Synthetic data can provide valid inferences for
analyses that are in accord with the synthesis models, but they may
not give good results for other analyses.
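A toy partially synthetic release, under strong invented assumptions: income depends linearly on age, and income is the only sensitive variable. The agency fits the model to the original data, then replaces every income with a draw from the fitted model, releasing several versions as in multiple imputation.

```python
import random
import statistics

def synthesize(ages, incomes, rng):
    """Fit income ~ age by least squares, then replace every income with a
    draw from the fitted model: a partially synthetic release in which the
    sampled units stay on the file but the sensitive variable is simulated."""
    mean_a = statistics.fmean(ages)
    mean_y = statistics.fmean(incomes)
    sxx = sum((a - mean_a) ** 2 for a in ages)
    sxy = sum((a - mean_a) * (y - mean_y) for a, y in zip(ages, incomes))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_a
    resid_sd = statistics.pstdev(
        [y - (intercept + slope * a) for a, y in zip(ages, incomes)])
    return [intercept + slope * a + rng.gauss(0.0, resid_sd) for a in ages]

ages = [25, 40, 55]
incomes = [30_000, 45_000, 60_000]
# The agency generates and releases multiple versions of the data.
synthetic_sets = [synthesize(ages, incomes, random.Random(seed))
                  for seed in range(3)]
```

Analyses that match the synthesis model (here, a linear regression of income on age) remain valid across the released versions; analyses the model does not capture need not.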
\vspace{12pt}
Statisticians play an important role in determining agencies' data
sharing strategies. First, they measure the risks of disclosures of
confidential information in the data, both before and after application of data
protection methods. Assessing disclosure risks is a challenging task
involving modeling of data snoopers' behavior and resources; see
\citet{reiter05} and \citet{skelam} for examples.
Second, they advise agencies on which protection methods to apply and with
what level of intensity. Generally, increasing the amount of data
alteration decreases the risks of disclosures; but, it also decreases
the accuracy of inferences obtained from the released data, since
these methods distort relationships among the variables. Statisticians
quantify the disclosure risks and data quality of competing protection
methods to select ones with acceptable properties. Third, they
develop new approaches to sharing confidential data. Currently, for
example, there do not exist statistical approaches for safe and
useful sharing of network and relational data, remote sensing
data, and genomic data. As complex new data types become readily
available,
there will be an increased need for statisticians to develop new protection
methods that facilitate data sharing.
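One crude risk screen, sketched below on invented records, is the rate of sample uniques on a set of quasi-identifiers: a record that is unique in the sample may be unique in the population and hence matchable. Serious measures, such as those of \citet{reiter05} and \citet{skelam}, instead model population rather than sample frequencies.

```python
from collections import Counter

def sample_unique_rate(records, quasi_ids):
    """Fraction of records that are unique on the given quasi-identifiers:
    a crude proxy for re-identification risk."""
    keys = [tuple(r[q] for q in quasi_ids) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(keys)

data = [
    {"age": 34, "sex": "F", "zip": "02138"},
    {"age": 34, "sex": "F", "zip": "02139"},
    {"age": 71, "sex": "M", "zip": "02138"},
]
risk_fine = sample_unique_rate(data, ["age", "sex", "zip"])  # every record unique
risk_coarse = sample_unique_rate(data, ["age", "sex"])       # one record in three
```

Comparing such a measure before and after applying a protection method, alongside measures of data quality, is how competing methods are weighed against each other.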
\vspace{12pt}
\noindent Reprinted with permission from Lovric, Miodrag (2011), \emph{International Encyclopedia of Statistical Science}. Heidelberg: Springer Science + Business Media, LLC.
\begin{thebibliography}{}
\bibitem[Dalenius and Reiss(1982)]{dalenius}
Dalenius, T. and Reiss, S.~P. (1982). Data-swapping: {A} technique for disclosure control. \emph{Journal of Statistical Planning and Inference} \textbf{6}, 73--85.
\bibitem[Elamir and Skinner(2006)]{skelam}
Elamir, E. and Skinner, C.~J. (2006).
Record level measures of disclosure risk for survey microdata.
\emph{Journal of Official Statistics} \textbf{22}, 525--539.
\bibitem[Fuller(1993)]{fuller:1993}
Fuller, W.~A. (1993).
Masking procedures for microdata disclosure limitation.
\emph{Journal of Official Statistics} \textbf{9}, 383--406.
\bibitem[Reiter(2004)]{reiterchance}
Reiter, J.~P. (2004). New approaches to data dissemination: {A} glimpse into the future
(?). \emph{Chance} \textbf{17}, 3, 12--16.
\bibitem[Reiter(2005)]{reiter05}
Reiter, J.~P. (2005).
Estimating identification risks in microdata.
\emph{Journal of the American Statistical Association} \textbf{100},
1103--1113.
\bibitem[Reiter and Raghunathan(2007)]{reiter:raghu:07}
Reiter, J.~P. and Raghunathan, T.~E. (2007).
The multiple adaptations of multiple imputation.
\emph{Journal of the American Statistical Association} \textbf{102},
1462--1471.
\bibitem[Rubin(1993)]{rubin:1993}
Rubin, D.~B. (1993).
Discussion: Statistical disclosure limitation.
\emph{Journal of Official Statistics} \textbf{9}, 462--468.
\bibitem[Sweeney(1997)]{sweeney:1997}
Sweeney, L. (1997).
Computational disclosure control for medical microdata: the {D}atafly
system.
In \emph{Proceedings of an International Workshop and Exposition},
442--453.
\bibitem[Willenborg and {de Waal}(2001)]{willenborg:waal:2001}
Willenborg, L. and {de Waal}, T. (2001). \emph{Elements of Statistical Disclosure Control}.
New York: Springer-Verlag.
\bibitem[Yancey \emph{et~al.}(2002)Yancey, Winkler, and Creecy]{winkler:2003}
Yancey, W.~E., Winkler, W.~E., and Creecy, R.~H. (2002).
Disclosure risk assessment in perturbative microdata protection.
In J.~Domingo-Ferrer, ed., \emph{Inference Control in Statistical
Databases}, 135--152. Berlin: Springer-Verlag.
\end{thebibliography}
\end{document}