Marti Anderson. Some solutions to the Behrens-Fisher problem for multivariate ecological data.
The Behrens-Fisher problem (BFP) is one of the oldest puzzles in statistics. The essence of the problem is how to compare means (or multivariate centroids) validly between two or more populations when their variances (or multivariate dispersions) differ. This is especially irksome in ecology whenever the variables are counts of species' abundances, because any differences in means will be accompanied by differences in variances as well. Some solutions to the BFP do exist for the univariate case, but they assume variables are normally distributed, whereas species counts tend to be long-tailed and overdispersed, with many zeros. The issue is exacerbated for multivariate ecological data (counts of many species in a community), which often also have more variables (species) than samples. This means that none of the existing attempts to solve the multivariate BFP (all of which also assume normality) can be used in practice.
In this talk, I will outline, compare and contrast some potential solutions to the multivariate BFP that rely on some rather clever permutation, bootstrap or Monte Carlo (re-)sampling methods. While the permutation approach tends to be mildly liberal, the bootstrap approach, even with reasonable empirical bias-corrections, tends to be overly conservative. But will the Monte Carlo approach come to the rescue with a more exact test? And at what cost in terms of additional underlying assumptions?
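As a rough illustration of the simplest of these strategies, the Python sketch below runs a naive permutation test for a difference in multivariate centroids between two groups of count data with equal means but unequal dispersion; the test statistic, data, and all names are illustrative assumptions, not the procedures developed in the talk. Exchangeability, which naive permutation relies on, is exactly what unequal dispersions undermine, which is why such a test can be mildly liberal.

```python
import numpy as np

rng = np.random.default_rng(1)

def centroid_distance(x, y):
    """Euclidean distance between the multivariate centroids of two samples."""
    return np.linalg.norm(x.mean(axis=0) - y.mean(axis=0))

def permutation_test(x, y, n_perm=9999):
    """Naive permutation test: shuffle group labels and recompute the statistic."""
    observed = centroid_distance(x, y)
    pooled = np.vstack([x, y])
    n_x = len(x)
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        count += centroid_distance(pooled[perm[:n_x]], pooled[perm[n_x:]]) >= observed
    return (count + 1) / (n_perm + 1)

# Toy data: both groups have mean 3 per species, but group 2 is overdispersed.
x = rng.poisson(3, size=(20, 5))
y = rng.negative_binomial(1, 0.25, size=(10, 5))
print(permutation_test(x, y))
```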
Mark Beaumont. Statistical inference for complicated models in ecology and evolutionary biology.
Monte Carlo simulation has long been a widely used tool in biology, traditionally for predictive modelling, as an adjunct to more analytical approaches, and latterly for statistical inference. The predictive and inferential aspects have typically been considered separately, often by different specialisms. One reason for this is that, while it is generally easy to formulate stochastic data-generating simulations of arbitrary complexity, it may be practically impossible to obtain a likelihood function for the same problem. However, it is now becoming widely appreciated that the predictive and inferential aspects can be viewed, under the Bayesian paradigm, as simply parts of the same whole, depending on which parts of the model are regarded as fixed and which as random. A useful tool that has underpinned this conceptual change in a very practical way is the method of approximate Bayesian computation (ABC). In this talk I will outline the basic ABC approach and how it has evolved. Using examples in ecology and population genetics, I will illustrate how it can be used for Bayesian model choice, posterior predictive modelling, and prior- and posterior-model checking. I will also discuss the known problems of the approach. Finally, I will describe the use of these methods for agent-based models in ecology.
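As a minimal illustration of the basic ABC idea (a generic toy, not any analysis from the talk; the model, prior, summaries, and tolerance are all assumptions), the Python sketch below uses rejection ABC to approximate the posterior of a Poisson rate: draw parameters from the prior, simulate data, and keep the draws whose summary statistics fall within a tolerance of the observed summaries.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(theta, n=50):
    """Toy data-generating model: Poisson counts with rate theta."""
    return rng.poisson(theta, size=n)

def summary(data):
    """Summary statistics used to compare simulated and observed data."""
    return np.array([data.mean(), data.var()])

observed = simulate(4.0)          # "observed" data from a known rate
s_obs = summary(observed)

# Rejection ABC: sample from the prior, simulate, keep draws whose summaries
# land within the tolerance of the observed summaries.
n_draws, tolerance = 50_000, 1.0
prior_draws = rng.uniform(0.0, 20.0, size=n_draws)   # flat prior on the rate
accepted = [
    theta for theta in prior_draws
    if np.linalg.norm(summary(simulate(theta)) - s_obs) < tolerance
]
print(len(accepted), np.mean(accepted))   # approximate posterior mean, near 4
```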
Ben Bolker. Statistical machismo vs common sense: when are new methods worthwhile?
Statisticians' bread and butter, the work that excites us and brings academic rewards, is developing novel methods. Applying existing methods to new problems and new data sets, no matter how exciting the scientific results or useful the management conclusions, doesn't have the same intellectual thrill. A recent blog post by Brian McGill accused ecologists of "statistical machismo", using unwarrantedly fancy statistical methods for swank; I will explore the costs and benefits of new, complex statistics from the statistician's point of view. When are new methods really useful, and when do they just enable statistical machismo? What are the tradeoffs between robustness, ease of use, transparency, and correctness? Is providing easy-to-use software doing users a favour? How often do our new methods solve problems that ecologists really need solved?
Nick Gotelli. The Well-Tempered Assemblage: Reducing Bias in the Estimation of Species Rank Abundance Distributions.
Contributing Authors: Nicholas J. Gotelli, Anne Chao, Robert K. Colwell, Robin L. Chazdon, T. C. Hsieh
Most plant and animal assemblages are characterized by a few common species and many uncommon or rare species. Understanding the mechanisms shaping the species abundance distribution (SAD) has long been a major research focus in ecology. Beginning with seminal work by R.A. Fisher in the 1940s, ecologists have fit simple statistical models such as the geometric series, log normal, and exponential series to species abundance data. This distribution-fitting approach is based on the use of the simple “plug-in” estimator \hat{p}_i = n_i/N, where \hat{p}_i is the estimated relative frequency of species i, n_i is the number of individuals observed of species i, and N is the number of individuals in the sample. However, with incomplete sampling and undetected species, \hat{p}_i is a biased estimator of the true relative frequency of the species in the sample, and the degree of bias increases with the relative rarity of each species. Using the concept of sample coverage and the theory of frequency estimation by I.J. Good and A. Turing, we estimated the true SAD based on a random sample of individuals. We separately estimated relative frequencies for the set of species detected in the sample and for the set of species undetected in the sample. We then combined the two parts to obtain an estimated SAD in which the relative frequency for each species has been tuned or adjusted to minimize the bias inherent in the traditional plug-in estimator. To examine the performance of the tuned estimators, we created artificial data sets by randomly sampling from common statistical distributions, and by randomly sampling from large empirical distributions based on very thorough field censuses. With sufficient sample size or sample coverage, the tuned estimators closely matched the true SADs. These more accurate estimators of relative frequency should aid ecologists in understanding and modeling SADs.
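As a simplified numerical illustration of the coverage idea (not the authors' tuned estimator, whose adjustment is more refined), the Python sketch below uses the simple Good-Turing sample coverage estimate C = 1 - f1/N, where f1 is the number of singleton species, to shrink the plug-in frequencies of detected species; the remaining mass 1 - C is attributed collectively to undetected species.

```python
import numpy as np

def coverage_adjusted_frequencies(counts):
    """Simplified Good-Turing-style adjustment of plug-in relative frequencies.

    counts: number of individuals observed for each detected species (n_i).
    Returns the plug-in estimates n_i / N, coverage-shrunk estimates for the
    detected species, and the probability mass assigned to undetected species.
    """
    counts = np.asarray(counts)
    N = counts.sum()
    f1 = np.sum(counts == 1)            # number of singletons
    coverage = 1.0 - f1 / N             # simple Good-Turing sample coverage
    plug_in = counts / N                # plug-in estimator n_i / N
    adjusted = coverage * plug_in       # detected species shrunk downward
    return plug_in, adjusted, 1.0 - coverage

# Toy sample: a few common species and many singletons.
sample = np.array([50, 30, 10, 5, 3] + [1] * 12)
plug_in, adjusted, undetected_mass = coverage_adjusted_frequencies(sample)
print(plug_in.sum(), adjusted.sum(), undetected_mass)
```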
Jean-Dominique Lebreton. The interplay of relevance and generalization in Biostatistics.
In the development of mathematical methods, two contrasting, nearly contradictory logics are at work. This is particularly the case in Biomathematics, including Biostatistics and Statistical Ecology, which will be the main focus of my reflections and illustrations.
The logic of relevance stems from a full acceptance of biological questions, and then attempts to develop tools that closely fit those questions. After an initial tool is proposed, such as the probit model to analyse dose-response relationships, one generally sees a proliferation of particularizations and variants, published one at a time and often named after their author or given some exotic name. Many fields can serve as illustrations of this proliferation process: predator-prey models, capture-recapture methods, descriptive multivariate analysis (“data analysis”), etc. After an initial success, this proliferation of methods (and software) is often a source of confusion for users, with little help from a poor nomenclature. Another clear risk of the logic of relevance is that of developing "ad hoc", statistically suboptimal approaches to a particular question, which may become the dominant practice for years within a particular scientific community.
The logic of generalization comes from pure mathematics, and is based on the idea that, by generalizing an existing mathematical object, you will unavoidably visit new, unexplored territories, discover unexpected links, and make valuable encounters. The use of the duality diagram as a prospective tool in descriptive multivariate analysis, and of generalized linear models as a common frame for a variety of discrete data models, are obvious examples of the logic of generalization. Set against the advantage of unifying existing approaches and opening new avenues of development, the clear risk is that of using a sledgehammer to crack a nut or, worse, a hammer to drive a screw. Using a fancy mixed logistic model to estimate survival from data on marked individuals, without accounting for the incomplete detection that is the key feature of such data, would be an example of such a mismatch.
One can easily deduce from such premises that statistical ecology, and biomathematics in general, could not have survived, developed, and been efficient and useful with only one of these two logics at work. I will show and illustrate how these two logics fit together in successive phases of development, each one needing an accumulation of material from the other to be fully effective. In a multidisciplinary endeavour such as statistical ecology, these reflections must also encompass the development of software and shared databases. I will go on to discuss the strategies of research and knowledge transfer that can be envisaged in such a framework.
Perry de Valpine. Bayesians, frequentists, and pragmatists: the interaction of methods and software.
Ecologists take pride in statistical pragmatism, doing "whatever works." I will attempt to unpack this sense of pragmatism specifically for hierarchical statistical models and look at how it involves a swirl of principles and practices. Moreover, one's sense of pragmatism depends on available software, yet the statistical literature abounds with ideas for numerical methods for hierarchical model estimation, prediction, and diagnostics that are not readily available in software. One simple goal in principle would be to critically evaluate Bayesian results from a frequentist perspective, but this is rarely done because it is not practical. Some complex algorithms for some hierarchical models are available, while some simple algorithms are not. Therefore, the future of pragmatism is tied to the future of software. I will present progress on a new software package that allows flexible programming of algorithms that operate on shared model structures. This means that rather than tying the specification of a model structure to a particular algorithm, such as one flavor of MCMC provided as a "black box", one can specify a model structure and run a variety of algorithms on it. Since algorithms can be concisely programmed, the system is naturally extensible and provides a way to disseminate new methods.
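As a toy illustration of the design principle described above, separating the specification of a model structure from the algorithms that operate on it (this sketch is not the package in question, and all names are hypothetical), the Python code below defines a single model object and runs two different algorithms on it: a random-walk Metropolis sampler and a MAP optimizer.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(2)

class Model:
    """A shared model structure: a normal mean with a normal prior.

    Any algorithm that needs a log density can operate on the same object."""
    def __init__(self, data):
        self.data = np.asarray(data)

    def log_posterior(self, mu):
        prior = norm.logpdf(mu, loc=0.0, scale=10.0)
        likelihood = norm.logpdf(self.data, loc=mu, scale=1.0).sum()
        return prior + likelihood

def metropolis(model, n_iter=5000, step=0.5):
    """Algorithm 1: random-walk Metropolis on the shared model."""
    mu, samples = 0.0, []
    for _ in range(n_iter):
        proposal = mu + step * rng.standard_normal()
        if np.log(rng.uniform()) < model.log_posterior(proposal) - model.log_posterior(mu):
            mu = proposal
        samples.append(mu)
    return np.array(samples)

def posterior_mode(model):
    """Algorithm 2: MAP estimation on the very same model object."""
    return minimize(lambda mu: -model.log_posterior(mu[0]), x0=[0.0]).x[0]

model = Model(rng.normal(3.0, 1.0, size=25))
print(metropolis(model)[1000:].mean(), posterior_mode(model))
```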
Christopher K. Wikle. Ecological Prediction with High-Frequency “Big Data” Covariates.
Time-frequency analysis has become a fundamental component of many scientific inquiries. Due to the improvements in technology, the amount of high-frequency signals that are collected for ecological and other scientific processes is increasing at a dramatic rate. Incorporating such information into traditional models is complicated by the inherent differences in temporal scales between the response and the predictors. Salient features of high-dimensional time-dependent outcomes and/or predictors may be difficult to discern through scientific or statistical examination in the time domain. Such features often become more pronounced and possibly more interpretable when considered from a time-frequency perspective. Critically, such time-frequency based representations can be considered analogous to spatial image processes, which can be effectively represented by common reduced-rank methods to deal with the inherent dependence between time-frequency “pixels.” When combined with efficient variable selection approaches, such representations can improve prediction, classification, and interpretation of spatial and temporal responses on different scales of resolution than the high-frequency covariates.
In order to facilitate the use of these data in ecological prediction and inference, we present a class of nonlinear multivariate time-frequency functional models that can identify important features of each signal as well as the interaction of signals corresponding to response variables. The proposed approach uses various methods to estimate time-frequency “images”, together with rank reduction and stochastic search variable selection, to effectively reduce the dimensionality and to identify important time-frequency (and, hence, time-domain) features. The methods are demonstrated through various ecological and environmental examples, such as predicting phenotypic selection from insect communication signals, and predicting spawning success of shovelnose sturgeon on the Lower Missouri River from high-frequency data storage tag information.
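The Python sketch below illustrates the general workflow on simulated data (all names and data are assumptions, and ordinary least squares on a handful of components stands in for the stochastic search variable selection described above): each high-frequency signal is converted to a time-frequency “image” via a spectrogram, the collection of images is reduced in rank by a singular value decomposition, and the resulting scores are used to predict a scalar response.

```python
import numpy as np
from scipy.signal import spectrogram

rng = np.random.default_rng(3)

# Simulate n high-frequency signals plus a scalar response that depends on
# how much power each signal carries in a 100 Hz band.
n, fs, length = 100, 1000, 2000
t = np.arange(length) / fs
band_strength = rng.uniform(0.0, 2.0, size=n)
signals = rng.standard_normal((n, length)) + band_strength[:, None] * np.sin(2 * np.pi * 100 * t)
response = 3.0 * band_strength + rng.standard_normal(n)

# Time-frequency "images": one spectrogram per signal, flattened to a vector.
images = np.array([spectrogram(sig, fs=fs, nperseg=128)[2].ravel() for sig in signals])

# Rank reduction via SVD (a stand-in for reduced-rank spatial-image methods),
# followed by ordinary least squares on the leading "eigen-image" scores.
images -= images.mean(axis=0)
u, s, vt = np.linalg.svd(images, full_matrices=False)
scores = u[:, :5] * s[:5]
X = np.column_stack([np.ones(n), scores])
beta, *_ = np.linalg.lstsq(X, response, rcond=None)
print(np.corrcoef(X @ beta, response)[0, 1])   # in-sample fit of the sketch
```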
Simon Wood. Statistical methods for non-linear ecological dynamic models.
Highly non-linear, process-based dynamic models are commonly used to describe pest insect populations and disease dynamics, but present challenges when used for statistical purposes. Naive application of standard Bayesian or frequentist methods fails as the dynamics of the system approach the chaotic regime: likelihoods, or the target distributions for Bayesian simulation, become highly multimodal or completely irregular. At the same time, the data from which to estimate such models are usually quite limited, so efficient use of information is at a premium. Two main lines of attack are controlled information reduction, such as ABC or synthetic likelihood, and working directly with the system state variables, for example via filtering. This talk compares and contrasts these strategies.
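As a minimal illustration of the synthetic likelihood idea (a sketch under simplified assumptions, with summary statistics chosen for brevity rather than the richer sets used in practice), the Python code below evaluates a Gaussian synthetic log-likelihood for a stochastic Ricker model, observed through Poisson sampling, at a few trial values of the growth-rate parameter.

```python
import numpy as np

rng = np.random.default_rng(4)

def ricker(r, sigma, phi, n_steps=50, n0=1.0):
    """Simulate a stochastic Ricker map with Poisson observations."""
    n, obs = n0, []
    for _ in range(n_steps):
        n = r * n * np.exp(-n + sigma * rng.standard_normal())
        obs.append(rng.poisson(phi * n))
    return np.array(obs)

def summaries(y):
    """Toy summary statistics: mean, number of zeros, lag-1 autocovariance."""
    return np.array([y.mean(), (y == 0).sum(), np.cov(y[:-1], y[1:])[0, 1]])

def synthetic_log_lik(theta, s_obs, n_rep=200):
    """Gaussian synthetic log-likelihood of the observed summaries."""
    r, sigma, phi = theta
    S = np.array([summaries(ricker(r, sigma, phi)) for _ in range(n_rep)])
    mu = S.mean(axis=0)
    cov = np.cov(S, rowvar=False) + 1e-8 * np.eye(len(mu))   # small ridge
    diff = s_obs - mu
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (diff @ np.linalg.solve(cov, diff) + logdet)

y_obs = ricker(r=44.7, sigma=0.3, phi=10.0)       # "observed" series
s_obs = summaries(y_obs)
for r in (20.0, 44.7, 80.0):                      # coarse profile over r
    print(r, synthetic_log_lik((r, 0.3, 10.0), s_obs))
```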