Statistical Methods for
Tissue Array Images: Algorithmic Scoring, Data Contamination, and Blessings
of Dimensionality:
The tissue microarray
(TMA) technology provides an efficient way to evaluate large numbers of immunohistochemically-stained tissue images and has been
successfully used in many applications in clinical outcome analysis,
biomarker validations, and cancer research. Central to the use of TMAs is
their scoring, which currently relies mainly on manual evaluation because
the staining patterns are difficult to quantify: they are
highly heterogeneous and not "localized" in position, size or shape.
In response to concerns about the subjectivity and variability of manual TMA
evaluation, we develop an automatic scoring algorithm, TACOMA. TACOMA
effectively captures the statistical regularity in TMA images through statistics
on the transitions between gray levels, and a few "representative"
image patches serve as prior information that allows TACOMA to focus on
biologically relevant features and score in a way similar to
pathologists. Experiments show that TACOMA rivals pathologists in terms of
both accuracy and reproducibility. Moreover, it is easily interpretable in
that it reveals salient pixels in an image that are most relevant to scoring.
Related work addresses challenges in the training of TACOMA,
namely label noise in the scoring of TMA images and small training
samples. In particular, we establish an asymptotic bound on the impact of data
contamination on classification and give insights into why several
ingredients of co-training, a highly successful machine learning method, work.
Donghui Yan,
Ph.D., WalmartLabs~January 22, 2014
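
The abstract does not say how the gray-level transition statistics are computed. One common way to summarize such transitions is a gray-level co-occurrence matrix, which counts how often a pixel with level i has a neighbor with level j at a fixed offset. The sketch below is only an illustration under that assumption. The function name `glcm` and its parameters `levels` and `offset` are hypothetical and are not taken from TACOMA.

```python
import numpy as np

def glcm(image, levels, offset=(0, 1)):
    """Count gray-level transitions between each pixel and its neighbor.

    Assumptions (illustrative only): `image` holds integer gray levels in
    [0, levels), and `offset` = (row shift, column shift) is non-negative.
    """
    img = np.asarray(image)
    dr, dc = offset
    rows, cols = img.shape
    # Source pixels and the neighbors displaced by (dr, dc).
    src = img[:rows - dr, :cols - dc]
    dst = img[dr:, dc:]
    counts = np.zeros((levels, levels), dtype=np.int64)
    # Accumulate one count per (source level, neighbor level) pair.
    np.add.at(counts, (src.ravel(), dst.ravel()), 1)
    return counts

# Tiny 3x3 image with three gray levels, horizontal neighbor offset.
example = [[0, 0, 1],
           [1, 2, 2],
           [0, 1, 2]]
print(glcm(example, levels=3))
```

Normalizing the counts by their total would give empirical transition frequencies, which could then serve as input features for a classifier.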
|
Another look at statistical
testing and integrative analysis in a big data era:
15 years after the introduction of microarrays, and 12 years
after the draft sequencing of the human genome, basic statistical issues of
multiple testing remain important for discovery-based and translational
science. However, at the extreme testing thresholds required for many -omics platforms, standard testing approaches can have
highly inflated false positive rates, leading to false discoveries.
Another problem, not always recognized by practitioners, is that standard
approaches to analyses of “pathways” can also lead to numerous
false discoveries. Permutation analysis provides a rigorous framework
for testing, but is computationally intensive and cumbersome. In this
talk, I will introduce the basic rationale for multiple testing and
permutation analysis when dealing with high-dimensional data. For
testing individual features in -omics
platforms, I will describe the Moment-Corrected Correlation (MCC) approach to
perform extremely fast and accurate testing, with careful control of false
positives. For testing pathways, or other defined grouped sets of
features, I will introduce safeExpress, a new
software package for performing highly rigorous pathway testing.
Finally, I will describe several additional projects and software packages
that are incorporating the ideas from these two approaches, and are being
applied to methylation, genotyping, and RNA-Seq
datasets.
Yihui Zhou, Ph.D., North Carolina State University~January
30, 2014
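
To make the cost of permutation analysis concrete, here is a minimal sketch of a plain permutation test for the correlation between one feature and a phenotype. It is not the MCC approach and not safeExpress. The function name `permutation_pvalue` and its parameters are hypothetical, chosen for illustration.

```python
import numpy as np

def permutation_pvalue(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the Pearson correlation of x and y.

    Illustrative sketch: the labels in y are shuffled n_perm times and the
    observed |r| is compared with the permuted values.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = abs(np.corrcoef(x, y)[0, 1])
    exceed = 0
    for _ in range(n_perm):
        r = abs(np.corrcoef(x, rng.permutation(y))[0, 1])
        if r >= observed:
            exceed += 1
    # The +1 terms count the observed arrangement itself, so p is never 0.
    return (exceed + 1) / (n_perm + 1)

x = np.arange(20)
print(permutation_pvalue(x, 2 * x + 1, n_perm=500))
```

The smallest p-value this procedure can report is 1 / (n_perm + 1). Reaching the extreme thresholds used after multiple-testing correction therefore requires a very large number of permutations per feature, which is the computational burden that faster approximations aim to avoid.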
|
Ensemble Subsampling for
Imbalanced Multivariate Two-Sample Tests
In the past decade, imbalanced data have drawn increasing
attention in the machine learning community. Such data commonly arise in many
fields such as biomedical science, financial economics, fraud detection,
marketing, and text mining. The imbalance refers to a large difference
between the sample sizes of data from two underlying distributions or from
two classes in the setting of classification. We tackle the challenges of
imbalanced learning in the setting of the long-standing statistical problem
of multivariate two-sample tests. Some existing nonparametric two-sample
tests for equality of multivariate distributions perform unsatisfactorily
when the two sample sizes are imbalanced. In particular, the power of these
tests tends to diminish with increasingly imbalanced sample sizes. We propose
a new testing procedure to solve this problem. The proposed test is based on
a nearest neighbor method and employs a novel ensemble subsampling scheme to
handle the data imbalance. We demonstrate the strong power of the testing
procedure through a simulation study and a real data example, and provide an
asymptotic analysis of the procedure.
Lisha Chen, Ph.D., Department of Statistics, Yale University~February 6, 2014
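
As background for the nearest neighbor idea, the sketch below computes a classical nearest-neighbor two-sample statistic: the fraction of k nearest neighbors in the pooled sample that come from the same sample as the point itself. It does not implement the ensemble subsampling scheme proposed in the talk, whose details the abstract does not give. The function name `nn_same_sample_fraction` is hypothetical.

```python
import numpy as np

def nn_same_sample_fraction(x, y, k=3):
    """Fraction of k-nearest-neighbor pairs whose members share a sample.

    Illustrative sketch: x and y are (n, d) and (m, d) arrays. Larger values
    suggest that the two samples occupy different regions of the space.
    """
    pooled = np.vstack([x, y]).astype(float)
    labels = np.array([0] * len(x) + [1] * len(y))
    # Pairwise squared Euclidean distances in the pooled sample.
    diff = pooled[:, None, :] - pooled[None, :, :]
    dist = (diff ** 2).sum(axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]
    same = labels[neighbors] == labels[:, None]
    return same.mean()

rng = np.random.default_rng(0)
small = rng.normal(0.0, 1.0, size=(10, 2))
large = rng.normal(0.5, 1.0, size=(200, 2))
print(nn_same_sample_fraction(small, large, k=3))
```

In practice the statistic would be calibrated, for example by permuting the sample labels. With strongly imbalanced sizes, as in the usage lines above, most neighbors of any point come from the larger sample, which illustrates why imbalance needs special treatment.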
|
Computational Methods for Function-on-Function
Regression:
Statistical
inference and software implementation for functional regression models are
presented. New methods for analyzing functional data are discussed,
where the dependence of functional responses on functional predictors is of
interest. Penalized regression and the mixed model representation are used to
provide a framework that allows the inclusion of multiple functional
predictors and scalar covariates. Computational procedures in R are developed to facilitate estimation of confidence
sets for the model parameters describing the statistical associations
involving functional data. Results from simulation studies show good
numerical performance across several functional data sampling
designs. Applications to a study on human brain tract profiles are discussed.
Andrada Ivanescu, PhD, East Carolina
University~March 7, 2014
|
Infinite Parameter Estimates in Polytomous
Regression:
This talk presents a method for inference in a
multinomial regression model in the presence of likelihood monotonicity. The
multinomial regression problem is first translated into a conditional binary
regression problem. Existing techniques then reduce this conditional
binary regression problem to one with fewer observations and fewer
covariates, such that probabilities for the canonical sufficient statistic of
interest, conditional on the remaining sufficient statistics, are identical to
those in the original problem. The reduced conditional binary regression
problem is then translated back to the multinomial regression setting. The
resulting reduced multinomial regression problem does not exhibit monotonicity
of its likelihood, and so conventional asymptotic techniques can be used.
John Kolassa,
PhD, Rutgers University~April 10, 2014
|
Central Limit Theorem Approximations for the Number of
Runs in Markov-Dependent Sequences
The study of success runs is a
problem with a long history in probability and has applications in many areas, such
as quality control, reliability theory, start-up demonstration tests, and the
analysis of DNA data in biology. In this talk we first present the
general framework and definitions for various types of success runs (overlapping,
non-overlapping, exact, etc.) for Markov-dependent binary or multitrials.
Then we establish a multivariate Central Limit Theorem for the number of
these types of runs and obtain its covariance matrix by means of the
recurrent potential matrix in terms of the stationary distribution and the
mean transition times in the chain. Finally we discuss applications in
reliability theory and molecular biology.
George Mytalas,
PhD, Department of Mathematical Sciences, New Jersey Institute of Technology~April 24, 2014
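
The abstract does not state its exact definitions of the run types. Under the usual conventions, a non-overlapping count restarts after each completed run of length k, while an overlapping count includes every window of k consecutive successes. The sketch below counts both in a binary sequence under those assumed conventions. The function name `count_runs` is hypothetical.

```python
def count_runs(seq, k):
    """Count success runs of length k in a binary sequence (1 = success).

    Returns (non_overlapping, overlapping) under the assumed conventions:
    non-overlapping counting restarts after each completed run of length k;
    overlapping counting includes every window of k consecutive successes.
    """
    non_overlapping = 0
    overlapping = 0
    streak = 0     # current number of consecutive successes
    streak_no = 0  # same, but reset after each completed non-overlapping run
    for s in seq:
        if s == 1:
            streak += 1
            streak_no += 1
            if streak >= k:
                overlapping += 1
            if streak_no == k:
                non_overlapping += 1
                streak_no = 0
        else:
            streak = 0
            streak_no = 0
    return non_overlapping, overlapping

print(count_runs([1, 1, 1, 1, 0, 1, 1], k=2))  # (3, 4)
```

For Markov-dependent trials, the input sequence would be generated from a transition matrix rather than from independent draws, but the counting rules stay the same.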
|
GLM fitting under the generalized inverse sampling
scheme for cancer incidence data
A generalized
linear model for a multi-way contingency table for several independent
populations that follow extended negative multinomial distributions is
introduced. This model represents an extension of the negative multinomial
log-linear model. The parameters of the new model are estimated by the
quasi-likelihood method and the corresponding score function, which gives a
closed-form estimate of the regression parameters. The goodness-of-fit test
for the model is also discussed. An application of the log-linear model under
the generalized inverse sampling scheme to cancer incidence
data demonstrates the use of the
model.
Sunil Kumar Dhar, PhD, Center for Applied Mathematics and
Statistics, Department of Mathematical Sciences, New Jersey Institute of Technology~May 1, 2014
|