Statistics Seminar Series

Department of Mathematical Sciences
and
Center for Applied Mathematics and Statistics

New Jersey Institute of Technology

Spring 2014

 

All seminars are 4:00 - 5:00 p.m. on Thursdays, in Cullimore Hall Room 611 (Math Conference Room) unless noted otherwise. If you have any questions about a particular seminar, please contact the person hosting the speaker.

 

Date

Speaker and Title

Host

Wednesday
January 22, 2014
11:30 am

Donghui Yan,  Ph.D., WalmartLabs

 

Statistical Methods for Tissue Array Images: Algorithmic Scoring, Data Contamination, and Blessings of Dimensionality
(abstract)

Ji Meng Loh

Thursday

January 30, 2014 11:30 am


Yihui Zhou, Ph.D., North Carolina State University

Another look at statistical testing and integrative analysis in a big data era

(abstract)

Zoi-Heleni Michalopoulou

Thursday
February 6, 2014
4:00PM


         

Lisha Chen, Ph.D., Yale University

Ensemble Subsampling for Imbalanced Multivariate Two-Sample Tests

(abstract)

Sunil Dhar

Thursday
March 7, 2014

4:00PM

 

Andrada Ivanescu, PhD, East Carolina University

Computational Methods for Function-on-Function Regression

(abstract)

Sunil Dhar

Thursday

April 10, 2014

4:00PM


John Kolassa, PhD, Rutgers University

Infinite Parameter Estimates in Polytomous Regression

(abstract)


Antai Wang

Thursday
April 17, 2014
4:00PM

                         

Zhiying Qiu, MS, Department of Mathematical Sciences

New Jersey Institute of Technology

TBA

(abstract)

Wenge Guo

Thursday
April 24, 2014
4:00PM


George Mytalas, PhD, Department of Mathematical Sciences

New Jersey Institute of Technology

Central Limit Theorem Approximations for the Number of Runs in Markov-Dependent Sequences

(abstract)

Antai Wang

Thursday

May 1, 2014
4:00PM

 

Sunil Kumar Dhar, PhD

Center for Applied Mathematics and Statistics

Department of Mathematical Sciences, New Jersey Institute of Technology

GLM fitting under the generalized inverse sampling scheme for a cancer incidence data

(abstract)

Antai Wang

 

 

 

 

ABSTRACTS

 

Statistical Methods for Tissue Array Images: Algorithmic Scoring, Data Contamination, and Blessings of Dimensionality: 

 

The tissue microarray (TMA) technology provides an efficient way to evaluate large numbers of immunohistochemically stained tissue images and has been successfully used in many applications in clinical outcome analysis, biomarker validation, and cancer research. Central to the use of TMAs is their scoring, which currently relies mainly on manual evaluation because the staining patterns are difficult to quantify: they are highly heterogeneous and not "localized" in position, size or shape. In response to concerns about the subjectivity and variability of manual TMA evaluation, we develop an automatic scoring algorithm, TACOMA. TACOMA captures the statistical regularity in TMA images through statistics on the transitions of gray levels, and a few "representative" image patches serve as prior information that allows TACOMA to focus on biologically relevant features and score in a way similar to pathologists. Experiments show that TACOMA rivals pathologists in terms of both accuracy and reproducibility. Moreover, it is easily interpretable in that it reveals the salient pixels in an image that are most relevant to scoring. Related work addressing challenges in the training of TACOMA will also be presented, namely label noise in the scoring of TMA images and issues with small training samples. In particular, we establish an asymptotic bound on the impact of data contamination on classification and give insights into the success of several ingredients of a highly successful machine learning method, co-training.

 

Donghui Yan, Ph.D., WalmartLabs~January 22, 2014

 

Another look at statistical testing and integrative analysis in a big data era:


Fifteen years after the introduction of microarrays, and 12 years after the draft sequencing of the human genome, basic statistical issues of multiple testing remain important for discovery-based and translational science. However, at the extreme testing thresholds required for many -omics platforms, standard testing approaches can have highly inflated false positive rates, leading to false discoveries. Another problem, not always recognized by practitioners, is that standard approaches to analyses of “pathways” can also lead to numerous false discoveries. Permutation analysis provides a rigorous framework for testing, but is computationally intensive and cumbersome. In this talk, I will introduce the basic rationale for multiple testing and permutation analysis when dealing with high-dimensional data. For testing individual features in ‘omics platforms, I will describe the Moment-Corrected Correlation (MCC) approach to perform extremely fast and accurate testing, with careful control of false positives. For testing pathways, or other defined grouped sets of features, I will introduce safeExpress, a new software package for performing highly rigorous pathway testing. Finally, I will describe several additional projects and software packages that incorporate the ideas from these two approaches and are being applied to methylation, genotyping, and RNA-Seq datasets.

 

Yihui Zhou, Ph.D., North Carolina State University~January 30, 2014

 

Ensemble Subsampling for Imbalanced Multivariate Two-Sample Tests:

 

In the past decade, imbalanced data have drawn increasing attention in the machine learning community. Such data commonly arise in many fields such as biomedical science, financial economics, fraud detection, marketing, and text mining. The imbalance refers to a large difference between the sample sizes of data from two underlying distributions or from two classes in the setting of classification. We tackle the challenges of imbalanced learning in the setting of the long-standing statistical problem of multivariate two-sample tests. Some existing nonparametric two-sample tests for equality of multivariate distributions perform unsatisfactorily when the two sample sizes are imbalanced. In particular, the power of these tests tends to diminish as the sample sizes become increasingly imbalanced. We propose a new testing procedure to solve this problem. The proposed test is based on a nearest neighbor method and employs a novel ensemble subsampling scheme to treat the imbalance of the data. We demonstrate the strong power of the testing procedure through a simulation study and a real data example, and provide an asymptotic analysis of the procedure.

 

Lisha Chen, Ph.D., Department of Statistics, Yale University~February 6, 2014

 

 

Computational Methods for Function-on-Function Regression:

 

Statistical inference and software implementation for functional regression models are presented. New methods for analyzing functional data are discussed, where the dependence of functional responses on functional predictors is of interest. Penalized regression and the mixed model representation are used to provide a framework that allows the inclusion of multiple functional predictors and scalar covariates. Computational procedures in R are developed to facilitate estimation of confidence sets for the model parameters describing the statistical associations involving functional data. Results from simulation studies show good numerical performance of the implementations across several functional data sampling designs. Applications to a study on human brain tract profiles are discussed.

 

Andrada Ivanescu, PhD, East Carolina University~March 7, 2014

 

 

Infinite Parameter Estimates in Polytomous Regression:

 

This talk presents a method for inference in a multinomial regression model in the presence of likelihood monotonicity. The multinomial regression problem is translated into a conditional binary regression problem, and existing techniques are used to reduce this conditional binary regression problem to one with fewer observations and fewer covariates, such that probabilities for the canonical sufficient statistic of interest, conditional on the remaining sufficient statistics, are identical. This reduced conditional binary regression problem is then translated back to the multinomial regression setting. The resulting reduced multinomial regression problem does not exhibit monotonicity of its likelihood, and so conventional asymptotic techniques can be used.
 

John Kolassa, PhD, Rutgers University~April 10, 2014

 

 

Central Limit Theorem Approximations for the Number of Runs in Markov-Dependent Sequences

 

The study of success runs has a long history in probability and has applications in many areas, such as quality control, reliability theory, start-up demonstration tests, and the analysis of DNA data in biology. In this talk we first present the general framework and definitions for various types of success runs (overlapping, non-overlapping, exact, etc.) for Markov-dependent binary or multi-state trials. Then we establish a multivariate Central Limit Theorem for the number of runs of these types and obtain its covariance matrix by means of the recurrent potential matrix, in terms of the stationary distribution and the mean transition times in the chain. Finally, we discuss applications in reliability theory and molecular biology.

 

George Mytalas, PhD, Department of Mathematical Sciences, New Jersey Institute of Technology~April 24, 2014

 

 

GLM fitting under the generalized inverse sampling scheme for a cancer incidence data

The generalized linear model for a multi-way contingency table for several independent populations that follow the extended negative multinomial distributions is introduced. This model represents an extension of the negative multinomial log-linear model. The parameters of the new model are estimated by the quasi-likelihood method and the corresponding score function, which gives a closed-form estimate of the regression parameters. The goodness-of-fit test for the model is also discussed. An application of the log-linear model under the generalized inverse sampling scheme, represented by cancer incidence data, is given as an example to demonstrate the use of the model.

 

Sunil Kumar Dhar, PhD, Center for Applied Mathematics and Statistics, Department of Mathematical Sciences, New Jersey Institute of Technology~May 1, 2014