Ejaz Ahmed, Brock (Mathematics and Statistics)
Emanuel Ben-David (Columbia)
On the minimum number of observations that guarantees the existence
of MLE for Gaussian graphical models
In this talk I will discuss the conditions under which the existence
of the MLE of the covariance parameter in Gaussian graphical models
is guaranteed. These conditions are given in terms of an upper
bound and a lower bound on the number of observations that ensures
the existence of the MLE. These bounds are in general tighter
than the best known bounds found in Buhl [1993] and Lauritzen
[1996]. In fact, for many graphs, such as grids, the new bounds
are tight enough to exactly determine the minimum number of needed
observations.
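As a point of reference for the bounds discussed above, here is a minimal Python sketch (not from the talk) of the classical bounds they improve upon: the maximum clique size of the graph as a lower bound on the number of observations, and the number of vertices as a crude upper bound. The grid-graph example and the use of networkx are illustrative assumptions.

```python
# Classical (pre-existing) bounds on the sample size n needed for MLE
# existence in a Gaussian graphical model; the talk's bounds are tighter.
import networkx as nx

def classical_bounds(G):
    """Return (lower, upper) bounds on the number of observations n.

    Lower bound: the maximum clique size (a.s. necessary, since every
    clique-marginal sample covariance must be positive definite).
    Upper bound: the number of vertices (if n >= |V|, the full sample
    covariance is a.s. positive definite, so the MLE exists).
    """
    omega = max(len(c) for c in nx.find_cliques(G))  # maximum clique size
    return omega, G.number_of_nodes()

if __name__ == "__main__":
    grid = nx.grid_2d_graph(4, 4)        # a 4x4 grid graph, as mentioned in the abstract
    print(classical_bounds(grid))        # (2, 16): a very wide classical gap
```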
Joseph Beyene, McMaster (Biostatistics)
Constrained Dirichlet-Multinomial Mixture Models for Human Microbiome
Analysis
Human microbiomes are microscopic populations of organisms which
live in or on a single human being and may impact the health of
the host. Studying these populations has become more prevalent
because of Next-Generation Sequencing, which can approximately
characterize a microbiome. The discrete, skewed nature of the
data makes understanding microbiomes difficult, creating a need
for new statistical methods. We propose an evolutionary algorithm
that takes a proposed Dirichlet-Multinomial model and determines
which parameters can be constrained to be equal, thereby creating
a more robust model. This choice of constraint is particularly
interesting because of its implications in characterizing relative
abundances throughout a microbiome. We will illustrate the algorithm
on simulated and real microbiome data and discuss ongoing challenges.
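To make the model class concrete, the following hedged Python sketch evaluates a Dirichlet-Multinomial log-likelihood in which some concentration parameters are constrained to be equal via a grouping vector; the data, grouping, and parameter values are hypothetical, and the evolutionary search over constraints is not shown.

```python
import numpy as np
from scipy.special import gammaln

def dm_loglik(x, alpha):
    """Log-likelihood of one count vector x under a Dirichlet-Multinomial(alpha)."""
    n, A = x.sum(), alpha.sum()
    return (gammaln(n + 1) - gammaln(x + 1).sum()        # multinomial coefficient
            + gammaln(A) - gammaln(n + A)
            + (gammaln(x + alpha) - gammaln(alpha)).sum())

def constrained_loglik(X, theta, groups):
    """Tie parameters: taxa in the same group share one concentration value."""
    alpha = theta[groups]                                 # expand group values to taxa
    return sum(dm_loglik(x, alpha) for x in X)

# Hypothetical example: 5 taxa; taxa 1-2 and taxa 3-4 share parameters.
X = np.array([[10, 3, 2, 0, 1], [8, 5, 1, 1, 2]])
groups = np.array([0, 1, 1, 2, 2])                        # 3 free parameters instead of 5
theta = np.array([2.0, 0.5, 0.3])
print(constrained_loglik(X, theta, groups))
```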
Peter Bubenik, Cleveland State (Mathematics)
Statistical topological data analysis using persistence landscapes
(slides)
In this talk I will define a topological summary for data that
I call the persistence landscape. Since this summary lies in a
vector space, it is easy to calculate averages of such summaries,
and distances between them. Viewed as a random variable with values
in a Banach space, this summary obeys a Strong Law of Large Numbers
and a Central Limit Theorem. I will show how a number of standard
statistical tests can be used for statistical inference using
this summary.
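A minimal sketch of the summary just described, assuming only numpy: it evaluates persistence landscapes on a grid from (birth, death) pairs and then averages two of them; the example diagrams are made up for illustration.

```python
import numpy as np

def landscape(diagram, grid, k_max=3):
    """Persistence landscape: lambda_k(t) is the k-th largest value of
    min(t - b, d - t)_+ over the points (b, d) of the diagram."""
    b = diagram[:, 0][:, None]
    d = diagram[:, 1][:, None]
    tents = np.clip(np.minimum(grid - b, d - grid), 0, None)   # one "tent" per point
    tents = -np.sort(-tents, axis=0)                           # sort descending at each t
    out = np.zeros((k_max, grid.size))
    out[:min(k_max, tents.shape[0])] = tents[:k_max]
    return out

grid = np.linspace(0.0, 3.0, 301)
L1 = landscape(np.array([[0.0, 2.0], [0.5, 1.5]]), grid)
L2 = landscape(np.array([[0.2, 2.5]]), grid)
average = (L1 + L2) / 2            # averages live in the same function space
distance = np.abs(L1 - L2).max()   # sup-norm distance between the two landscapes
print(average.shape, distance)
```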
David Dunson, Duke University (Statistics)
Robust and scalable Bayes via the median posterior
Bayesian methods hold great promise for big data sets, but this
promise has not been fully realized due to the lack of scalable
computational methods. Usual MCMC and SMC algorithms bog down
as the size of the data and number of parameters increase. For
massive data sets, it has become routine to rely on penalized
optimization approaches implemented on distributed computing systems.
The most popular scalable approximation algorithms rely on variational
Bayes, which lacks theoretical guarantees and badly underestimates
posterior covariance. Another problem with Bayesian inference
is the lack of robustness; data contamination and corruption are
particularly common in large data applications and cannot easily
be dealt with using traditional methods. We propose to solve both
the robustness and the scalability problems using a new alternative
to exact Bayesian inference we refer to as the median posterior.
Data are divided into subsets and stored on different computers
prior to analysis. For each subset, we obtain a stochastic approximation
to the full data posterior, and run MCMC to generate samples from
this approximation. The median posterior is defined as the geometric
median of the subset-specific approximations, and can be rapidly
approximated. We show several strong theoretical results for the
median posterior, including general theorems on concentration
rates and robustness. The methods are illustrated through a simulation
and application to nonparametric modeling of contingency table
data from social surveys.
Joint work with Stas Minsker, Lizhen Lin and Sanvesh Srivastava.
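As a simplified illustration of the idea (not the measure-valued construction in the talk), the sketch below computes the geometric median of hypothetical subset-posterior summaries via Weiszfeld's algorithm, showing its robustness when a few subsets are corrupted.

```python
import numpy as np

def geometric_median(points, n_iter=100, eps=1e-9):
    """Weiszfeld's algorithm: the geometric median minimizes the sum of
    Euclidean distances to the given points (here, one summary per subset)."""
    m = points.mean(axis=0)
    for _ in range(n_iter):
        dist = np.linalg.norm(points - m, axis=1)
        w = 1.0 / np.maximum(dist, eps)
        m = (w[:, None] * points).sum(axis=0) / w.sum()
    return m

# Hypothetical subset-posterior means: 10 well-behaved subsets plus 2 corrupted ones.
rng = np.random.default_rng(0)
good = rng.normal(loc=[1.0, -2.0], scale=0.1, size=(10, 2))
bad = rng.normal(loc=[10.0, 10.0], scale=0.1, size=(2, 2))
subset_summaries = np.vstack([good, bad])
print("mean:  ", subset_summaries.mean(axis=0))       # dragged toward the corrupted subsets
print("median:", geometric_median(subset_summaries))  # stays near (1, -2)
```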
Subhashis Ghosal, North Carolina State University (Statistics)
Bayesian estimation of sparse precision matrices (slides)
We consider the problem of estimating the sparsity structure
of the precision matrix for a multivariate Gaussian distribution,
especially for dimension p exceeding the sample size n. Gaussian
graphical models serve as an important tool in representing the
sparsity structure through the presence or absence of the edges
in the underlying graph. Some novel methods for Bayesian analysis
of graphical models have recently been explored using a Bayesian
analog of the graphical LASSO algorithm with suitable priors.
In this talk, we use priors which put point mass on the
zero elements of the precision matrix along with absolutely continuous
priors on the non-zero elements, and hence the resulting posterior
distribution can be used for graphical structure learning. The
posterior distribution of the different graphical models is intractable
and we propose a fast computational method for approximating the
posterior probabilities of the graphical structures using a Laplace
approximation with the graphical LASSO solution as the
posterior mode. We also theoretically assess the quality of the
Laplace approximation. We study the asymptotic behavior of the
posterior distribution of sparse precision matrices and show that
it converges at the oracle rate with respect to the Frobenius
norm on matrices. The proposed Bayesian method is studied by extensive
simulation experiments and is found to be extremely fast and to give
very sensible results. The method is applied to a real dataset
on stocks and is able to find sensible relations between different
stock types.
This talk is based on joint work with Sayantan Banerjee.
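The sketch below illustrates only the graphical LASSO step that the method uses as a posterior mode, via scikit-learn's GraphicalLasso; the point-mass priors and the Laplace approximation of posterior model probabilities are not reproduced, and the simulated data and tuning parameter are illustrative assumptions.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

# Simulate Gaussian data whose true precision matrix is sparse (tridiagonal).
rng = np.random.default_rng(1)
p, n = 20, 50
Omega = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))
X = rng.multivariate_normal(np.zeros(p), np.linalg.inv(Omega), size=n)

# The graphical LASSO estimate; in the talk's framework this plays the role
# of a posterior mode around which a Laplace approximation is formed.
fit = GraphicalLasso(alpha=0.1).fit(X)
edges = (np.abs(fit.precision_) > 1e-6) & ~np.eye(p, dtype=bool)
print("estimated number of edges:", int(edges.sum() // 2))
```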
Elizabeth Gross, NCSU
Goodness-of-fit testing for log-linear network models
Social networks and other large sparse data sets pose significant
challenges for statistical inference, as many standard statistical
methods for testing model/data fit are not applicable in such
settings. Algebraic statistics offers an approach to goodness-of-fit
testing that relies on the theory of Markov bases and is intimately
connected with the geometry of the model as described by its fibers.
Most current practices require the computation of the entire basis,
which is infeasible in many practical settings. In this talk,
we present a dynamic approach to explore the fiber of a model,
which bypasses this issue, and is based on the combinatorics of
hypergraphs arising from the toric algebra structure of log-linear
models.
We demonstrate the approach on the Holland-Leinhardt p1 model
for random directed graphs that allows for reciprocated edges.
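For readers new to Markov bases, the hedged sketch below runs a Markov-basis (Metropolis) walk on the fiber of the simplest log-linear model, two-way independence, using the classical +1/-1 moves on 2x2 minors; the dynamic, hypergraph-based exploration for the p1 model presented in the talk is not shown.

```python
import numpy as np
from scipy.special import gammaln

def fiber_walk(table, n_steps=20000, rng=None):
    """Metropolis walk over the fiber (tables with the same row/column sums)
    using the basic +1/-1 moves on 2x2 minors, a Markov basis for the two-way
    independence model.  Target: P(x) proportional to 1 / prod(x_ij!)."""
    rng = rng or np.random.default_rng(0)
    T = table.astype(int).copy()
    r, c = T.shape
    expected = np.outer(T.sum(1), T.sum(0)) / T.sum()
    stats = []
    for _ in range(n_steps):
        i, j = rng.choice(r, 2, replace=False)
        k, l = rng.choice(c, 2, replace=False)
        move = rng.choice([-1, 1])
        new = T.copy()
        new[i, k] += move; new[j, l] += move
        new[i, l] -= move; new[j, k] -= move
        if new.min() >= 0:                                  # stay inside the fiber
            log_ratio = gammaln(T + 1).sum() - gammaln(new + 1).sum()
            if np.log(rng.random()) < log_ratio:
                T = new
        stats.append(((T - expected) ** 2 / expected).sum())
    return np.array(stats)

observed = np.array([[12, 3], [5, 10]])
expected = np.outer(observed.sum(1), observed.sum(0)) / observed.sum()
obs_stat = ((observed - expected) ** 2 / expected).sum()
walk = fiber_walk(observed)
print("Markov-basis exact-style p-value:", np.mean(walk >= obs_stat))
```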
Giseon Heo, University of Alberta (Dentistry, Statistics)
Beyond Mode Hunting (slides)
The scale space has been studied in the context of blurring in
computer vision, smooth curve estimation in statistics, and persistent
feature detection in computational topology. We review the background
of these three approaches and discuss how persistent homology can be
useful in high dimensions.
Stephan Huckemann, Goettingen (Stochastics)
Circular Scale Spaces and Mode Persistence for Measuring Early
Stem Cell Differentiation (slides)
We generalize the SiZer of Chaudhuri and Marron (1999, 2000)
for the detection of shape parameters of densities on the real
line to the case of circular data. It turns out that only the
wrapped Gaussian kernel gives a symmetric, strongly Lipschitz
semi-group satisfying "circular" causality, i.e. not
introducing possibly artificial modes with increasing levels of
smoothing. Based on this, we provide an asymptotic theory to
infer the persistence of shape features. The resulting circular
mode persistence diagram is applied to the analysis of early mechanically
induced differentiation in adult human stem cells from their
actin-myosin filament structure. In consequence, the circular SiZer
based on the wrapped Gaussian kernel (WiZer) allows one to discriminate,
at a controlled error level, between three different micro-environments
impacting early stem cell differentiation. Joint work with Kwang-Rae
Kim, Axel Munk, Florian Rehfeld, Max Sommerfeld, Joachim Weickert
and Carina Wollnik.
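The sketch below illustrates only the wrapped Gaussian kernel smoothing underlying the circular SiZer; the significance analysis and the mode persistence diagram are not reproduced, and the simulated angles and bandwidths are illustrative assumptions.

```python
import numpy as np

def wrapped_gaussian_kde(theta, data, h, n_wrap=5):
    """Kernel density estimate on the circle using the wrapped Gaussian kernel:
    each Gaussian bump is wrapped around [0, 2*pi) by summing over shifts 2*pi*k."""
    theta = np.asarray(theta)[:, None]                  # evaluation angles
    data = np.asarray(data)[None, :]                    # observed angles
    dens = np.zeros(theta.shape[0])
    for k in range(-n_wrap, n_wrap + 1):
        diff = theta - data + 2 * np.pi * k
        dens += np.exp(-0.5 * (diff / h) ** 2).sum(axis=1)
    return dens / (data.size * h * np.sqrt(2 * np.pi))

# Illustrative circular data with two clusters; varying the bandwidth changes
# how many modes the smoothed density shows (the scale-space idea behind SiZer).
rng = np.random.default_rng(2)
angles = np.concatenate([rng.normal(1.0, 0.2, 100), rng.normal(4.0, 0.3, 100)]) % (2 * np.pi)
grid = np.linspace(0, 2 * np.pi, 360, endpoint=False)
for h in (0.1, 1.0):
    f = wrapped_gaussian_kde(grid, angles, h)
    n_modes = np.sum((f > np.roll(f, 1)) & (f > np.roll(f, -1)))
    print(f"bandwidth {h}: {n_modes} mode(s) detected")
```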
Georges Michailidis, Michigan (Statistics)
Estimation in High-Dimensional Vector Autoregressive Models (slides)
Vector Autoregression (VAR) is a widely used method for learning
complex interrelationships among the components of multiple time
series. Over the years it has gained popularity in the fields
of control theory, statistics, economics, finance, genetics and
neuroscience. We consider the problem of estimating stable VAR
models in a high-dimensional setting, where both the number of
time series and the VAR order are allowed to grow with sample
size. In addition to the "curse of dimensionality" introduced
by a quadratically growing dimension of the parameter space, VAR
estimation poses considerable challenges due to the temporal and
cross-sectional dependence in the data. Under a sparsity assumption
on the model transition matrices, we establish estimation and
prediction consistency of $\ell_1$-penalized least squares and
likelihood-based methods. Exploiting spectral properties of stationary
VAR processes, we develop novel theoretical techniques that provide
deeper insight into the effect of dependence on the convergence
rates of the estimates. We study the impact of error correlations
on the estimation problem and develop fast, parallelizable algorithms
for penalized likelihood based VAR estimates.
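As a simplified illustration of the $\ell_1$-penalized least-squares approach, the sketch below fits each VAR equation separately with scikit-learn's Lasso on a lagged design matrix; the likelihood-based, parallelizable algorithms from the talk are not shown, and the simulated VAR(1) example and penalty level are assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

def fit_sparse_var(X, p_lag=2, alpha=0.05):
    """l1-penalized least-squares estimate of a VAR(p_lag) model, one Lasso
    regression per component (rows of the stacked transition matrix)."""
    T, k = X.shape
    # Design matrix of lagged values: row t contains [X_{t-1}, ..., X_{t-p_lag}].
    Z = np.hstack([X[p_lag - lag: T - lag] for lag in range(1, p_lag + 1)])
    Y = X[p_lag:]
    B = np.zeros((k, k * p_lag))
    for j in range(k):
        B[j] = Lasso(alpha=alpha, fit_intercept=False).fit(Z, Y[:, j]).coef_
    return B                                            # shape (k, k * p_lag)

# Illustrative stable VAR(1) with a sparse (bidiagonal) transition matrix.
rng = np.random.default_rng(3)
k, T = 10, 300
A = np.diag(np.full(k, 0.5)) + np.diag(np.full(k - 1, 0.3), 1)
X = np.zeros((T, k))
for t in range(1, T):
    X[t] = X[t - 1] @ A.T + rng.normal(scale=0.5, size=k)
B = fit_sparse_var(X, p_lag=1, alpha=0.05)
print("nonzero coefficients:", int((np.abs(B) > 1e-8).sum()), "of", B.size)
```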
Washington Mio, FSU (Mathematics)
On Genetic Determinants of Facial Shape Variation (slides)
Mapping genetic determinants of phenotypic variation is a major
challenge in biology and medicine. The problem arises in contexts
such as investigation of development, inheritance and evolution
of phenotypic traits, and studies of the role of genetics in diseases.
Shape is a ubiquitous trait whose biological relevance spans multiple
scales from organelles to cells through organs and tissues
to entire organisms. Accurate and biologically interpretable shape
quantification enables investigation of fundamental questions
about the genetic underpinnings of normal and pathological morphological
variation. In this talk, I will discuss an ongoing collaborative
genome-wide association study of human facial shape variation
with an emphasis on the morphometric aspects of the study, which
uses geometric and topological methods to model facial shape.
Sayan Mukherjee, Duke (Statistical Sciences)
Victor Patrangenaru, FSU (Statistics)
Data Analysis on Manifolds (slides)
While seeking answers to the fundamental question of what data
analysis should be about, it is useful to return to the basic
notion of variability, which separates Statistics from all other
sciences; one soon realizes that there are two inescapable theoretical
ideas in data analysis. Firstly, variability within or between
samples can be quantified only in terms of a distance on the sample
space, telling how far observed sample points are from each other.
Secondly, this distance, as a function of the two data points it
separates, must have some continuity property in order to make any
consistency statement possible, justifying why the larger the sample,
the closer the sample variability measure is to its population
counterpart. In addition, since an asymptotic theory based on random
observations is necessary to estimate the population variance from
a large sample, such a theory can be formulated only under the
additional assumption that some power of the squared distance
function is differentiable.
In summary, data analysis imposes some sort of differentiable
structure on the sample space, which must consequently be either
a manifold or have some manifold-related structure, no matter
what the nature of the objects is. However, the overwhelming majority
of Statistics users specialize in understanding the nature of the
objects themselves, having little or no exposure to the basics of
the geometry and topology of manifolds needed to develop appropriate
methodology for data analysis. At the same time, theoretical
mathematicians with a reasonable knowledge of manifolds might be
unfamiliar with nonparametric multivariate statistics, while
computational graduate students and data analysts working with
nonlinear data are sometimes asking for a sound statistical
methodology, or for some sort of multidimensional differential
geometry or topology toolkit, that may help them design fast
algorithms for data analysis. To answer such demands, we structure
our presentation as follows. Firstly, we introduce the basics of
the three "pillars of data analysis": (i) nontrivial examples of
data, (ii) nonparametric multivariate statistics and (iii) geometry
and topology of manifolds. Secondly, we develop a general methodology
based on (i) and (ii), and "translate" this methodology in the
context of certain manifolds arising in statistics. Finally, we
apply this methodology to concrete examples of data analysis.
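To make the distance-based viewpoint concrete, here is a small numpy sketch (a standard illustration, not the specific methodology of the talk) computing the intrinsic (Fréchet) sample mean on the unit sphere by iteratively averaging log-mapped data in the tangent space.

```python
import numpy as np

def log_map(p, q):
    """Log map on the unit sphere: the tangent vector at p pointing to q."""
    cos_t = np.clip(p @ q, -1.0, 1.0)
    theta = np.arccos(cos_t)
    if theta < 1e-12:
        return np.zeros_like(p)
    v = q - cos_t * p
    return theta * v / np.linalg.norm(v)

def exp_map(p, v):
    """Exp map on the unit sphere: follow the geodesic from p in direction v."""
    t = np.linalg.norm(v)
    if t < 1e-12:
        return p
    return np.cos(t) * p + np.sin(t) * v / t

def frechet_mean(points, n_iter=50):
    """Intrinsic sample mean: minimizes the sum of squared geodesic distances."""
    m = points[0] / np.linalg.norm(points[0])
    for _ in range(n_iter):
        grad = np.mean([log_map(m, q) for q in points], axis=0)
        m = exp_map(m, grad)
    return m

# Illustrative data clustered around the north pole of S^2.
rng = np.random.default_rng(4)
pts = rng.normal(size=(100, 3)) * 0.2 + np.array([0.0, 0.0, 1.0])
pts /= np.linalg.norm(pts, axis=1, keepdims=True)
print("Frechet mean:", frechet_mean(pts))
```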
Thanh Mai Pham Ngoc, Paris-Orsay (Mathematics)
Goodness-of-fit test for Noisy Directional Data (abstract
image)
We consider spherical data $X_i$ corrupted by a random rotation
$\varepsilon_i \in \mathrm{SO(3)}$ so that only the sample $Z_i
= \varepsilon_iX_i, i = 1,\ldots,N$ is observed. We define a nonparametric
test procedure to distinguish $H_0:$``the density $f$ of $X_i$
is the uniform density $f_0$ on the sphere'' and $H_1:$ ``$\parallel
f-f_0\parallel^2_2 \geq \mathcal C\psi_N$ and $f$ is in a Sobolev
space with smoothness $s$''. For a noise density $f_\varepsilon$
with smoothness index $\nu$, we show that an adaptive procedure
(i.e. $s$ is not assumed to be known) cannot have a faster rate
of separation than $\psi^{ad}_N(s) = (N/\sqrt{\log\log(N)})^{-2s/(2s+2\nu+1)}$
and we provide a procedure which reaches this rate. We also deal
with the case of super smooth noise. We illustrate the theory
by implementing our test procedure for various kinds of noise
on $\mathrm{SO(3)}$ and by comparing it to other procedures. Finally,
we apply our test to real data in astrophysics and paleomagnetism.
Bala Rajaratnam, Stanford (Statistics)
Methods for Robust High Dimensional Graphical Model Selection
(slides)
Learning high dimensional correlation and partial correlation
graphical network models is a topic of contemporary interest.
A popular approach is to use L1 regularization methods to induce
sparsity in the inverse covariance estimator, leading to sparse
partial covariance/correlation graphs. Such approaches can be
grouped into two classes: (1) regularized likelihood methods and
(2) regularized regression-based, or pseudo-likelihood, methods.
Regression-based methods have the distinct advantage that they
do not explicitly assume Gaussianity. One major gap in the area
is that none of the popular methods proposed for solving
regression-based objective functions have provable convergence guarantees,
and hence it is not clear if these methods lead to estimators
which are always computable. It is also not clear if the resulting
estimators actually yield correct partial correlation/partial
covariance graphs. To this end, we propose a new regression-based
graphical model selection method that is both tractable and has
provable convergence guarantees. In addition we also demonstrate
that our approach yields estimators that have good large sample
and finite sample properties. The methodology is successfully
illustrated on both real and simulated data with a view to applications
to big data problems. We also present a novel unifying framework
that places various pseudo-likelihood graphical model selection
methods as special cases of a more general formulation, leading
to important insights. (Joint work with S. Oh and K. Khare).
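As background, the sketch below implements the classical regression-based (neighborhood selection) approach that class (2) above generalizes, regressing each variable on the rest with the Lasso; the new method with provable convergence guarantees is not reproduced here, and the simulated chain graph and penalty level are assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

def neighborhood_selection(X, alpha=0.1, rule="and"):
    """Regression-based graph selection: regress each variable on all others
    with the Lasso and declare an edge where the coefficients are nonzero
    (the "and" rule requires both directions to agree)."""
    n, p = X.shape
    support = np.zeros((p, p), dtype=bool)
    for j in range(p):
        others = np.delete(np.arange(p), j)
        coef = Lasso(alpha=alpha, fit_intercept=False).fit(X[:, others], X[:, j]).coef_
        support[j, others] = np.abs(coef) > 1e-8
    return support & support.T if rule == "and" else support | support.T

# Illustrative Gaussian data whose true conditional-independence graph is a chain.
rng = np.random.default_rng(5)
p, n = 15, 200
Omega = np.eye(p) + 0.45 * (np.eye(p, k=1) + np.eye(p, k=-1))
X = rng.multivariate_normal(np.zeros(p), np.linalg.inv(Omega), size=n)
X = (X - X.mean(0)) / X.std(0)
edges = neighborhood_selection(X, alpha=0.1)
print("selected edges:", int(edges.sum() // 2))
```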
Elena Villa, Università degli Studi di Milano (Mathematics)
Different Kinds of Estimators of the Mean Density of Random Closed
Sets: Theoretical Results and Numerical Experiments (slides)
Many real phenomena may be modelled as random closed sets in
$R^d$, of different Hausdorff dimensions. Of particular interest
are cases in which their Hausdorff dimension, say $n$, is strictly
less than $d$, such as fiber processes, boundaries of germ-grain
models, and $n$-facets of random tessellations. The mean density,
say $L_{Q_n}$, of a random closed set $Q_n$ in $R^d$ with Hausdorff
dimension $n$ is defined to be the density of the measure
$E[H^n(Q_n \cap \cdot\,)]$ on $R^d$, whenever it is absolutely continuous
with respect to $H^d$. A crucial problem is the pointwise estimation of
$L_{Q_n}$. In this talk we present three different kinds of estimators
of $L_{Q_n}(x)$; the first one will follow as a natural consequence
of the Besicovitch derivation theorem; the second one will follow
as a generalization to the $n$-dimensional case of the classical
kernel density estimator of random vectors; the last one will
follow by a local approximation of $L_{Q_n}$ based on a stochastic
version of the $n$-dimensional Minkowski content of $Q_n$.
We will study the unbiasedness and consistency properties, and
identify optimal bandwidths for all proposed estimators, under
sufficient regularity conditions. Finally, we will provide a set
of simulations via typical examples of lower dimensional random
sets.
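A hedged Monte Carlo sketch of the Minkowski-content idea behind the third estimator, in its simplest global form: for a one-dimensional set $Q$ in $R^2$, the area of its $r$-neighbourhood divided by $2r$ approximates its length. The local, pointwise version and the bandwidth theory developed in the talk are not reproduced; the segment pattern, radius, and window are illustrative assumptions.

```python
import numpy as np

def dist_to_segment(pts, a, b):
    """Euclidean distance from each point in pts to the segment [a, b]."""
    ab, ap = b - a, pts - a
    t = np.clip((ap @ ab) / (ab @ ab), 0.0, 1.0)
    return np.linalg.norm(ap - t[:, None] * ab, axis=1)

def minkowski_length(segments, r=0.02, n_mc=200_000, rng=None):
    """Estimate the total length of a union of segments in the unit square via
    the Minkowski content: area of the r-neighbourhood divided by 2r."""
    rng = rng or np.random.default_rng(6)
    pts = rng.random((n_mc, 2))                       # uniform points in [0, 1]^2
    d = np.min([dist_to_segment(pts, a, b) for a, b in segments], axis=0)
    area_r_neighbourhood = np.mean(d <= r)            # window has unit area
    return area_r_neighbourhood / (2 * r)

segments = [(np.array([0.1, 0.5]), np.array([0.9, 0.5])),
            (np.array([0.5, 0.2]), np.array([0.5, 0.8]))]
true_length = 0.8 + 0.6
print("estimate:", minkowski_length(segments), "true:", true_length)
```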