An Exploratory Data Analysis (EDA) of the Paths of Moving Animals
This work presents an EDA of the trajectories of deer and elk moving about in the Starkey Experimental Forest and Range in eastern Oregon. The animals' movements may be affected by habitat variables and by the behavior of the other animals. A model based on stochastic differential equations is developed in successive stages. Equations of motion are set down, motivated by corresponding equations of physics. Functional parameters appearing in the equations are estimated nonparametrically, and plots of vector fields of animal movements are prepared. Residuals are used to look for interactions amongst the movements of the animals. There are exploratory analyses of various sorts. Statistical inferences are based on Fourier transforms of the data, which are unequally spaced. The material is motivated by quotes from the writings of John Tukey. The work is joint with researchers at the US Forest Service.
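As a rough illustration of the kind of model the abstract describes, the sketch below simulates a planar path from a stochastic differential equation of the form dr = -grad H(r) dt + sigma dB using Euler-Maruyama steps. The quadratic potential H, the attraction point, and all parameter values are assumptions chosen for illustration; they are not taken from the talk, where the corresponding functions are estimated nonparametrically from the data.

```python
import math
import random

def grad_potential(x, y, center=(0.0, 0.0), k=1.0):
    """Gradient of the hypothetical potential H(r) = 0.5 * k * |r - center|^2."""
    return k * (x - center[0]), k * (y - center[1])

def simulate_path(start, n_steps, dt, sigma, seed=0):
    """Euler-Maruyama discretization of dr = -grad H(r) dt + sigma dB."""
    rng = random.Random(seed)
    x, y = start
    path = [(x, y)]
    for _ in range(n_steps):
        gx, gy = grad_potential(x, y)
        # Drift pulls the animal toward the attraction point;
        # the Gaussian increments play the role of Brownian motion.
        x += -gx * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        y += -gy * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        path.append((x, y))
    return path

# Illustrative values only: start away from the attraction point and let the path drift in.
path = simulate_path(start=(5.0, -3.0), n_steps=2000, dt=0.01, sigma=0.5)
```

Setting sigma to zero removes the random term, which gives a quick check that the drift alone moves the path toward the attraction point.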

Sir David R. Cox
Oxford, Nuffield College and Department of Statistics

Graphical models for the interpretation of data: some recent developments
Graphical representations of statistical relationships are generalizations of Sewall Wright's path analysis. They are used in a number of contexts; in this paper the emphasis is on the analysis of empirical data. Several different types of graph are needed. The relation between graphical representations and matrices is used to show the consequences of certain manipulations of the graphs. The relation with discussions of statistical causality is outlined.

Augustine Kong
deCode Genetics

A High Resolution Recombination Map of the Human Genome
Recombination is a mechanism of DNA mixing from one generation to the next and serves an important role in human evolution. Results based on new data, six times the sample size previously available, will be presented. Emphasis will be on the interpretation and implications of the data, although we will also touch on various statistical challenges.

Michael A. Newton
University of Wisconsin--Madison

A statistical approach to modeling genomic aberrations in cancer cells
I will discuss a modeling strategy for genomic-aberration data which allows us to infer combinations of aberrations that together increase the chance that a precancerous cell will have a descendant tumor lineage. The likelihood component involves a network of pathway structures, and MCMC is used to sample from the space of these oncogenic networks. I illustrate the methodology with chromosome-based comparative genomic hybridizations from several recent studies, and I will draw some comparisons with the oncogenic-tree methods of R. Desper and colleagues.

Daryl Pregibon, with Corinna Cortes & Chris Volinsky
AT&T Shannon Labs

Graph Mining: Discovery in Large Networks
Large financial and telecommunication networks provide a rich source of problems for the data mining community. The problems are inherently quite distinct from traditional data mining in that the data records, representing transactions between pairs of entities, are not independent. Indeed, it is often the linkages between entities that are of primary interest. A second factor, network dynamics, induces further challenges as new nodes and edges are introduced through time while old edges and nodes disappear.

We discuss our approach to representing and mining large sparse graphs. Several applications in telecommunications fraud detection are used to illustrate the benefits of our approach.

James Robins
Harvard University

Optimal Treatment Regimes
We discuss a new approach to estimating the optimal treatment regime or strategy from longitudinal observational data. This approach is based on so-called G-estimation of optimal regime structural nested mean models. It is an extension of the novel approach recently developed by Susan Murphy.

Elizabeth Thompson
University of Washington

Monte Carlo estimation of multipoint linkage lod scores
The computation of multipoint linkage log-likelihoods is an important tool in localizing genes for human traits, particularly on extended pedigrees where observations may be sparse. Many real data analysis problems are beyond the scope of exact likelihood computation, and Markov chain Monte Carlo (MCMC) provides an alternative approach. In any MCMC estimation procedure there are two major issues: the mixing properties of the MCMC samplers, and the Monte Carlo variance of estimators. To improve the sampling process we have adopted a variety of tools from MCMC methodology, including block-Gibbs updates, integrated proposals, and Metropolis-Hastings restarts using sequential imputation proposals. Additionally, we use a pseudo-Bayesian approach retrieving a likelihood estimate from realizations from a posterior distribution. Our estimators use Rao-Blackwellized versions of the usual simple count estimators. Together, these tools provide accurate and effective computation, as illustrated by several real-data examples. This research is joint work with Andrew George.
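The difference between a simple count estimator and its Rao-Blackwellized version can be seen in a toy setting that has nothing to do with pedigrees. In the sketch below, which is an assumed illustration rather than the authors' method, Z is uniform on (0, 1) and X given Z is Bernoulli(Z). The count estimator averages the sampled X values, while the Rao-Blackwellized estimator averages the conditional expectation E[X | Z] = Z. Independent draws stand in for MCMC realizations to keep the example short.

```python
import random

def estimate_both(n, rng):
    """Return (count estimate, Rao-Blackwellized estimate) of P(X = 1) from n draws."""
    count_sum = 0
    rb_sum = 0.0
    for _ in range(n):
        z = rng.random()                   # latent variable Z ~ Uniform(0, 1)
        x = 1 if rng.random() < z else 0   # X | Z ~ Bernoulli(Z)
        count_sum += x                     # simple count estimator uses X itself
        rb_sum += z                        # Rao-Blackwellized version uses E[X | Z] = Z
    return count_sum / n, rb_sum / n

def variance(values):
    """Sample variance of a list of numbers."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

rng = random.Random(1)
pairs = [estimate_both(100, rng) for _ in range(200)]
var_count = variance([c for c, _ in pairs])
var_rb = variance([r for _, r in pairs])
print(var_count, var_rb)
```

Both estimators target the same quantity, P(X = 1) = 0.5. In this toy model the theoretical variances are 0.25/n for the count estimator and (1/12)/n for the Rao-Blackwellized one, so the second printed value should be the smaller of the two.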

Rob Tibshirani
Stanford University

Least angle regression, forward stagewise and the lasso
We discuss "Least Angle Regression" ("LARS"), a new model selection algorithm. This is a useful and less greedy version of traditional forward selection methods. Three main properties of LARS are derived. (1) A simple modification of the LARS algorithm implements the Lasso, an attractive version of Ordinary Least Squares that constrains the sum of the absolute regression coefficients; the LARS modification calculates all possible Lasso estimates for a given problem in an order of magnitude less computer time than previous methods. (2) A different LARS modification efficiently implements Forward Stagewise linear regression, another promising new model selection method; this connection explains the similar numerical results previously observed for the Lasso and Stagewise, and helps understand the properties of both methods, which are seen as constrained versions of the simpler LARS algorithm. (3) A simple approximation for the degrees of freedom of a LARS estimate is available, from which we derive a Cp estimate of prediction error; this allows a principled choice among the range of possible LARS estimates.
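For readers unfamiliar with Forward Stagewise regression, the following is a minimal sketch of the plain incremental version, not the LARS-based implementation described in the talk. At each step it finds the predictor most correlated with the current residual and nudges that coefficient by a small amount. The step size `eps`, the step count, and the tiny orthogonal data set are assumptions chosen for illustration.

```python
def forward_stagewise(X, y, eps=0.01, n_steps=1000, tol=1e-9):
    """Incremental forward stagewise regression on a list-of-rows design matrix X."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)
    for _ in range(n_steps):
        # Inner product of each predictor with the current residual.
        corr = [sum(X[i][j] * resid[i] for i in range(n)) for j in range(p)]
        j = max(range(p), key=lambda k: abs(corr[k]))
        if abs(corr[j]) < tol:
            break  # residual is (numerically) orthogonal to every predictor
        delta = eps if corr[j] > 0 else -eps
        beta[j] += delta
        for i in range(n):
            resid[i] -= delta * X[i][j]
    return beta

# Hypothetical data: three mutually orthogonal predictors, y = 2*x1 - 1*x2 + 0*x3.
X = [[1, 1, 1],
     [-1, 1, -1],
     [1, -1, -1],
     [-1, -1, 1]]
y = [1, -3, 3, -1]
beta = forward_stagewise(X, y)
print(beta)
```

On this data the coefficients should end close to 2, -1 and 0, to within roughly the step size. Taking many tiny steps is what makes the plain procedure slow, and is the cost that the LARS modification in the abstract is said to avoid.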

LARS and its variants are computationally efficient: the paper describes a publicly available algorithm that requires only the same order of magnitude of computational effort as Ordinary Least Squares applied to the full set of covariates.

This is joint work with Bradley Efron, Trevor Hastie and Iain Johnstone.

SCIENTIFIC PROGRAMS AND ACTIVITIES

Workshop in Honour of David F. Andrews - May 23 and 24, 2002
Short Course in Microarray Data Analysis - May 25, 2002

Speaker Abstracts
