Program Overview

7/9/2024
8:50-9:00    Opening Remarks
9:00-10:00   Keynote 1 by Ian McKeague
10:00-10:15  Break
10:15-12:00  Parallel Invited Sessions
12:00-1:00   Lunch
1:00-2:45    Parallel Invited Sessions
2:45-3:00    Break
3:00-4:45    Parallel Invited Sessions
4:45-5:00    Break
5:00-6:00    Parallel Contributed Sessions
6:00-8:30    Banquet/Junior Research Award

7/10/2024
9:00-10:45   Parallel Invited Sessions
10:45-11:00  Break
11:00-12:45  Parallel Invited Sessions
12:45-1:45   Lunch
1:45-3:30    Parallel Invited Sessions
3:30-3:45    Break
3:45-4:45    Parallel Contributed Sessions
4:45-5:00    Break
5:00-6:00    Keynote 2 by Markus Pauly
6:00-6:10    Closing Remarks

Tentative Program


Invited Sessions

Judy Wang (Organizer)
Daniela Castro-Camilo
Sebastian Engelke
Tiandong Wang
Chen Zhou
A Bayesian multivariate extreme value mixture model
Daniela Castro-Camilo
Lecturer in Statistics, School of Mathematics and Statistics, University of Glasgow, Glasgow G12 8QQ
Daniela.CastroCamilo@glasgow.ac.uk
Impact assessment of natural hazards requires the consideration of both extreme and non-extreme events. Extensive research has been conducted on the joint modelling of bulk and tail in univariate settings; however, the corresponding body of research in the context of multivariate analysis is comparatively scant. This study extends the univariate joint modelling of bulk and tail to the multivariate framework. Specifically, it pertains to cases where multivariate observations exhibit extremity in at least one component.
We propose a multivariate extreme value mixture model that assumes a parametric model to capture the bulk of the distribution, which is in the max-domain of attraction of a multivariate extreme value distribution. The multivariate tail is described by the asymptotically justified multivariate generalized Pareto distribution. Bayesian inference based on multivariate random-walk Metropolis-Hastings and the automated factor slice sampler allows us to easily incorporate uncertainty from the threshold selection. The performance of our model is tested using different simulation scenarios, and the applicability of our model is illustrated using temperature records in the UK that show the need to accurately describe the joint tail behaviour.
Machine learning beyond the data range: an extreme value perspective
Sebastian Engelke
Associate Professor, Research Center for Statistics, University of Geneva
Machine learning methods perform well in prediction tasks within the range of the training data. These methods typically break down when interest is in (1) prediction in areas of the predictor space with few or no training observations; or (2) prediction of quantiles of the response that go beyond the observed records. Extreme value theory provides the mathematical foundation for extrapolation beyond the range of the training data, both in the dimension of the predictor space and the response variable. In this talk we present recent methodology that combines this extrapolation theory with flexible machine learning methods to tackle the out-of-distribution generalization problem (1) and the extreme quantile regression problem (2). We show the practical importance of prediction beyond the training observations in environmental and climate applications, where domain shifts in the predictor space occur naturally due to climate change and risk assessment for extreme quantiles is required.
Testing for Strong vs. Full Dependence
Tiandong Wang, Ph.D.
Shanghai Center for Mathematical Sciences, Fudan University
td_wang@fudan.edu.cn
Preferential attachment models of network growth are bivariate heavy-tailed models for in- and out-degree whose limit measures either concentrate on a ray of positive slope from the origin or spread over the whole positive quadrant, depending on whether or not the model includes reciprocity. Concentration on the ray is called full dependence. If there were a reliable way to distinguish full dependence from not-full dependence, we would have guidance about which model to choose. This motivates investigating tests that distinguish between (i) full dependence; (ii) strong dependence (the limit measure concentrates on a proper subcone of the positive quadrant); and (iii) concentration on the entire positive quadrant. We give two test statistics and discuss their asymptotically normal behavior under full and not-full dependence.
This is a joint work with Prof. Sidney Resnick at Cornell University.
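For readers less familiar with the terminology, here is an illustrative recap in my own notation (not necessarily the speakers' formulation): assume the in/out-degree pair (I, O) is multivariate regularly varying with normalizing constants b_n and limit measure \nu, i.e.
\[
  n\,\mathbb{P}\!\left(\left(\tfrac{I}{b_n},\tfrac{O}{b_n}\right)\in\cdot\right)
  \;\xrightarrow{\;v\;}\; \nu(\cdot)
  \quad \text{vaguely on } [0,\infty]^2 \setminus \{\mathbf{0}\}.
\]
Then (i) full dependence means \nu concentrates on a single ray \{(x, cx) : x > 0\} with c > 0; (ii) strong dependence means \nu concentrates on a proper subcone of the positive quadrant; and (iii) otherwise \nu spreads over the whole positive quadrant.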
Tail copula estimation for heteroscedastic extremes
Chen Zhou
Erasmus University Rotterdam
zhou@ese.eur.nl
Consider independent multivariate random vectors which follow the same copula, but where each marginal distribution is allowed to be non-stationary. This non-stationarity is for each marginal governed by a scedasis function (see Einmahl et al. (2016)) that is the same for all marginals. We establish the asymptotic normality of the usual rank-based estimator of the stable tail dependence function, or, when specialized to bivariate random vectors, the corresponding estimator of the tail copula. Remarkably, the heteroscedastic marginals do not affect the limiting process. Next, under a bivariate setup, we develop nonparametric tests for testing whether the scedasis functions are the same for both marginals. Detailed simulations show the good performance of the estimator for the tail dependence coefficient as well as that of the new tests. In particular, novel asymptotic confidence intervals for the tail dependence coefficient are presented and their good finite-sample behavior is shown. Finally an application to the S&P500 and Dow Jones indices reveals that their scedasis functions are about equal and that they exhibit strong tail dependence.
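As a hedged sketch of the scedasis setup referenced above (following Einmahl et al. (2016) only loosely, with my notation), each margin shares a common baseline tail F while a scedasis function c governs how the frequency of extremes varies across observations:
\[
  \mathbb{P}\big(X_i^{(n)} > x\big) \;\sim\; c\!\left(\tfrac{i}{n}\right)\{1 - F(x)\},
  \qquad x \to \infty, \qquad \int_0^1 c(s)\,ds = 1.
\]
The talk's result is that the usual rank-based estimator of the stable tail dependence function (or tail copula) retains the same Gaussian limit as in the homoscedastic case.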
Lexin Li (Organizer)
Hernando Ombao
Jaroslaw Harezlak
Haoda Fu
Lexin Li
Overview of Functional Dependence in Brain Networks
Hernando Ombao
Statistics Program, King Abdullah University of Science and Technology
hernando.ombao@kaust.edu.sa
Brain activity is complex. A full understanding of brain activity requires careful study of its multi-scale spatial-temporal organization (from neurons to regions of interest, and from transient events to long-term temporal dynamics). Motivated by these challenges, we will explore some characterizations of dependence between components of a brain network. This is potentially interesting because alterations in functional brain connectivity are associated with mental and neurological diseases. In this talk, we provide an overview of functional dependence measures. We present a general framework for exploring dependence through the oscillatory activities derived from each component of the time series. The talk will draw connections between this framework and some of the classical notions of spectral dependence such as coherence, partial coherence, and dual-frequency coherence. Moreover, this framework provides a starting point for exploring potential non-linear cross-frequency interactions. These interactions include the impact of the phase of one oscillatory activity in one component on the amplitude of another oscillation. The proposed approach captures lead-lag relationships and hence can be used as a general framework for spectral causality. Under this framework, we will also present some recent work on inference using spectral mutual information and entropy measures. This is joint work with Marco Pinto (UC Irvine), Paolo Redondo (KAUST) and Raphael Huser (KAUST).
Novel penalized regression method applied to study the association of brain functional connectivity and alcohol drinking
Jaroslaw (Jarek) Harezlak, Ph.D.
Department of Epidemiology and Biostatistics, School of Public Health-Bloomington, Indiana University, Bloomington, IN
harezlak@iu.edu
The intricate associations between brain functional connectivity and clinical outcomes are difficult to estimate. Common approaches used do not account for the interrelated connectivity patterns in the functional connectivity (FC) matrix, which can jointly and/or synergistically affect the outcomes. In our application of a novel penalized regression approach called SpINNEr (Sparsity Inducing Nuclear Norm Estimator), we identify brain FC patterns that predict drinking outcomes. Results dynamically summarized in the R shiny app indicate that this scalar-on-matrix regression framework via the SpINNEr approach uncovers numerous reproducible FC associations with alcohol consumption.
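A rough sketch of the scalar-on-matrix penalized regression that SpINNEr-type estimators use (illustrative notation only; the exact penalties and algorithm are the authors'):
\[
  \widehat{B} \;=\; \arg\min_{B}\;\sum_{i=1}^{n}\Big(y_i - \beta_0 - \langle A_i, B\rangle\Big)^{2}
  \;+\; \lambda_{N}\,\|B\|_{*} \;+\; \lambda_{L}\,\|B\|_{1},
\]
where A_i is subject i's functional connectivity matrix, \langle A_i, B\rangle = \mathrm{tr}(A_i^{\top} B), the nuclear norm \|B\|_{*} promotes low-rank structure, and the entrywise \ell_1 norm \|B\|_{1} promotes sparsity in the selected connections.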
LLM Is Not All You Need. Generative AI on Smooth Manifolds
Haoda Fu, Ph.D.
Associate Vice President, AI/Machine Learning, AADS, Eli Lilly
fu_haoda@lilly.com
Generative AI is a rapidly evolving technology that has garnered significant interest lately. In this presentation, we'll discuss the latest approaches, organizing them within a cohesive framework using stochastic differential equations to understand complex, high-dimensional data distributions. We'll highlight the necessity of studying generative models beyond Euclidean spaces, considering smooth manifolds essential in areas like robotics and medical imagery, and for leveraging symmetries in the de novo design of molecular structures. Our team's recent advancements in this blossoming field, ripe with opportunities for academic and industrial collaborations, will also be showcased.
Kernel Ordinary Differential Equations
Lexin Li
Professor, Department of Biostatistics and Epidemiology & Helen Wills Neuroscience Institute, University of California, Berkeley
lexinli@berkeley.edu
Ordinary differential equations (ODEs) are widely used in modeling biological and physical processes in science. In this talk, we propose a new reproducing kernel-based approach for estimation and inference of ODEs given noisy observations. We do not assume the functional forms in the ODE to be known or restrict them to be linear or additive, and we allow pairwise interactions. We perform sparse estimation to select individual functionals and construct confidence intervals for the estimated signal trajectories. We establish the estimation optimality and selection consistency of kernel ODE under both the low-dimensional and high-dimensional settings, where the number of unknown functionals can be smaller or larger than the sample size. Our proposal builds upon the smoothing spline analysis of variance (SS-ANOVA) framework, but tackles several important problems that are not yet fully addressed, and thus extends the scope of existing SS-ANOVA as well. We demonstrate the efficacy of our method through numerous ODE examples.
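To make the modeling target concrete, here is a minimal sketch of an ODE system with additive main effects and pairwise interactions of the kind described above (my notation, not necessarily the authors'):
\[
  \frac{d x_j(t)}{dt} \;=\; F_j\big(x_1(t),\dots,x_p(t)\big)
  \;=\; b_j + \sum_{k=1}^{p} f_{jk}\big(x_k(t)\big) + \sum_{k<l} f_{jkl}\big(x_k(t), x_l(t)\big),
  \qquad j = 1,\dots,p,
\]
with each component function lying in a reproducing kernel Hilbert space; sparse estimation then selects which f_{jk} and f_{jkl} are nonzero, and confidence intervals are built for the recovered trajectories.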
Xinyuan Song (Organizer)
Xingqiu Zhao
Liming Xiang
Jun Ma
Xinyuan Song
Deep Nonparametric Inference for Conditional Hazard Function
Xingqiu Zhao
The Hong Kong Polytechnic University
xingqiu.zhao@polyu.edu.hk
We propose a novel deep learning approach to nonparametric statistical inference for the conditional hazard function of survival time with right-censored data. We use a deep neural network (DNN) to approximate the logarithm of a conditional hazard function given covariates and obtain a DNN likelihood-based estimator of the conditional hazard function. Such an estimation approach grants model flexibility and hence relaxes structural and functional assumptions on conditional hazard or survival functions. We establish the consistency, convergence rate, and functional asymptotic normality of the proposed estimator. Subsequently, we develop new one-sample tests for goodness-of-fit evaluation and two-sample tests for treatment comparison. Both simulation studies and real application analysis show superior performances of the proposed estimators and tests in comparison with existing methods.
Multiple Imputation for Flexible Modelling of Interval-censored Data with Covariates Subject to Missingness and Detection Limits
Liming Xiang
Nanyang Tech University
LMXiang@ntu.edu.sg
Interval-censored failure time data arise frequently in biomedical studies when a failure time is not observed exactly but is only known to lie in an interval obtained from a sequence of examination times. The presence of covariates subject to missingness and detection limits poses challenges for regression analysis of interval-censored data and necessitates an effective statistical method. We propose a novel multiple imputation approach via rejection sampling for analysis of such data under semiparametric transformation models. Our proposal alleviates the strong dependence of the usual imputation methods on the choice of imputation models and yields consistent and asymptotically normal estimators of the regression parameters. Simulation studies demonstrate that the proposed approach is flexible and leads to more efficient estimation than complete-case analysis and augmented inverse probability weighting analysis in various practical situations. Finally, we apply the proposed approach to an Alzheimer’s disease data set that motivates this study.
Joint modelling of longitudinal covariates and partly-interval censored survival data - a penalized likelihood approach
Jun Ma
Macquarie University
jun.ma@mq.edu.au
This talk will focus on a joint modelling of longitudinal covariates and partly interval censored time-to-event data. Longitudinal time-varying covariates play a crucial role in achieving accurate dynamic predictions using a survival regression model. However, these covariates are often measured at limited time points and may contain measurement errors. Moreover, they are usually specific to each individual. On the other hand, the event times of interest are often interval-censored. Accounting for all these factors is essential when constructing a survival model. We will present a new approach for joint modelling of the longitudinal time-varying covariates and the time-to-event Cox model, where the latter is subject to interval censoring. We will develop a novel maximum penalized likelihood approach for estimation of all the model parameters including the random effects. A profile likelihood is used to obtain the covariance matrix of the estimated parameters.
Bayesian tree-based heterogeneous mediation analysis with a time-to-event outcome
Xinyuan Song
Chinese University of Hong Kong
xysong@sta.cuhk.edu.hk
Mediation analysis aims at quantifying and explaining the underlying causal mechanism between an exposure and an outcome of interest. In the context of survival analysis, mediation models have been widely used to achieve causal interpretation for the direct and indirect effects on the survival of interest. Although heterogeneity in treatment effect is drawing increasing attention in biomedical studies, none of the existing methods have accommodated the presence of heterogeneous causal pathways pointing to a time-to-event outcome. In this study, we consider a heterogeneous mediation analysis for survival data based on a Bayesian tree-based Cox proportional hazards model with shared topologies. Under the potential outcomes framework, individual-specific conditional direct and indirect effects are derived on the scale of the logarithm of hazards, survival probability, and restricted mean survival time. A Bayesian approach with efficient sampling strategies is developed to estimate the conditional causal effects through the Monte Carlo implementation of the mediation formula. Simulation studies show the satisfactory performance of the proposed method. The proposed model is then applied to an HIV dataset extracted from the ACTG175 study to demonstrate its usage in detecting heterogeneous causal pathways.
Lan Luo (Organizer)
Emily Hector
Lan Luo
Ling Zhou
Liangyuan Hu
Turning the data-integration dial: efficient inference from different data sources
Emily Hector
North Carolina State University
A fundamental aspect of statistics is the integration of data from different sources. Classically, Fisher and others were focused on how to integrate homogeneous sets of data. More recently, the question of if data sets from different sources should be integrated is becoming more relevant. The current literature treats this as a yes/no question: integrate or don't. Here we take a different approach, motivated by information-sharing principles coming from the shrinkage estimation literature. In particular, we deviate from the binary, yes/no perspective and propose a dial parameter that controls the extent to which two data sources are integrated. How far this dial parameter should be turned is shown to depend on the informativeness of the different data sources as measured by Fisher information. This more-nuanced data integration framework leads to relatively simple parameter estimates and valid tests/confidence intervals. We demonstrate both theoretically and empirically that setting the dial parameter according to our recommendation leads to more efficient estimation compared to other binary data integration schemes.
Efficient quantile covariate adjusted response adaptive experiments
Lan Luo
Assistant Professor, Department of Biostatistics and Epidemiology, Rutgers School of Public Health
ll1118@sph.rutgers.edu
In program evaluation studies, understanding the heterogeneous distributional impacts of a program beyond the average effect is crucial. Quantile treatment effect (QTE) provides a natural measure to capture such heterogeneity. While much of the existing work for estimating QTE has focused on analyzing observational data based on untestable causal assumptions, little work has gone into designing randomized experiments specifically for estimating QTE. In this talk, we propose two covariate adjusted response adaptive design strategies–fully adaptive designs and multi-stage designs–to efficiently estimate the QTE. We demonstrate that the QTE estimator obtained from our designs attains the optimal variance lower bound from a semiparametric theory perspective, which does not impose any parametric assumptions on underlying data distributions. Moreover, we show that using continuous covariates in multi-stage designs can improve the precision of the estimated QTE compared to the classical fully adaptive setting. We illustrate the finite-sample performance of our designs through Monte Carlo experiments and one synthetic case study on charitable giving. Our proposed designs offer a new approach to conducting randomized experiments to estimate QTE, which can have important implications for policy and program evaluation.
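For reference, the quantile treatment effect targeted by these designs is the usual potential-outcomes contrast (standard notation):
\[
  \Delta(\tau) \;=\; Q_{Y(1)}(\tau) - Q_{Y(0)}(\tau), \qquad \tau \in (0,1),
\]
where Q_{Y(a)}(\tau) is the \tau-th quantile of the potential outcome under arm a \in \{0,1\}; the proposed designs adaptively allocate treatment and adjust for covariates so that the resulting estimator of \Delta(\tau) attains the semiparametric variance lower bound.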
High-dimensional subgroup learning for multiple mixed outcome
Ling Zhou
Southwestern University of Finance and Economics
zhouling@swufe.edu.cn
In survey research, it is interesting to infer the grouped association patterns between risk factors and questionnaire responses, where the grouping is shared across multiple response variables that jointly capture one’s underlying status. In particular, based on a survey study named the China Health and Retirement Survey (CHRS), our aim is to identify the important risk factors that are simultaneously associated with the health and well-being of senior adults. While earlier studies have pointed to several known risk factors, heterogeneity in the outcome-risk factor association exists, motivating us to analyze this data through the lens of subgroup analysis. We devise a subgroup analysis procedure that models multiple mixed outcomes that describe one’s general health and well-being, while tackling additional challenges that have arisen in our data analysis, including high-dimensionality, collinearity, and weak signals in covariates. Computationally, we propose an efficient algorithm that alternately updates a set of estimating equations and likelihood functions. Theoretically, we establish the asymptotic consistency and normality of the proposed estimators. The validity of our proposal is corroborated by simulation experiments. An application of the proposed method to the CHRS data identifies caring for grandchildren as a new risk factor for poor physical and mental health.
Estimating the causal effect of multiple intermittent treatments on censored survival outcomes
Liangyuan Hu
Rutgers University
lh707@sph.rutgers.edu
To draw real-world evidence about the comparative effectiveness of multiple time-varying treatments on patient survival, we develop a joint marginal structural survival model and a novel weighting strategy to account for time-varying confounding and censoring. Our methods formulate complex longitudinal treatments with multiple start/stop switches as the recurrent events with discontinuous intervals of treatment eligibility. We derive the weights in continuous time to handle a complex longitudinal dataset without the need to discretize or artificially align the measurement times. We further use machine learning models designed for censored survival data with time-varying covariates and the kernel function estimator of the baseline intensity to efficiently estimate the continuous-time weights. Our simulations demonstrate that the proposed methods provide better bias reduction and nominal coverage probability when analyzing observational longitudinal survival data with irregularly spaced time intervals, compared to conventional methods that require aligned measurement time points. We apply the proposed methods to a large-scale COVID-19 dataset to estimate the causal effects of several COVID-19 treatments on the composite of in-hospital mortality and ICU admission.
Ying Wei (Organizer)
Shuang Wang
Tian Gu
Yanyuan Ma
Ying Wei
PartIES: a disease subtyping framework with Partition-level Integration using diffusion-Enhanced Similarities from Multi-omics Data
Yuqi Miao1, Huang Xu2, Shuang Wang1*
1. Department of Biostatistics, Columbia University, New York New York USA
2. Department of Statistics, University of Science and Technology of China, Hefei, Anhui, P.R. China
sw2206@cumc.columbia.edu
Integrating multi-omics data helps identify disease subtypes. Many similarity-based clustering methods have been developed for disease subtyping using multi-omics data; most of them focus on extracting common clustering structures across multiple types of omics data and thus are not meant to preserve specific clustering structures within each omics data type. Moreover, the clustering performance of similarity-based methods is known to be affected by how accurate the similarity measures are. In this paper, we propose PartIES, a Partition-level Integration using diffusion-Enhanced Similarities, to perform disease subtyping using multi-omics data. We use diffusion to reduce noise in the similarity/kernel matrices, partition individual diffusion-enhanced similarity matrices before integration, and learn the integrative similarity structure adaptively at the partition level. Simulation studies showed that the diffusion step enhances clustering accuracy, and that PartIES outperforms other competing methods, especially when omics data types provide different clustering structures. Using mRNA, lncRNA, and miRNA expression data, DNA methylation data, and mutation data from The Cancer Genome Atlas project (TCGA), PartIES identified subtypes in bladder urothelial carcinoma (BLCA) and thyroid carcinoma (THCA) that most significantly differentiate patient survival among all methods. To provide the biological meaning of the identified cancer subtypes, we further mapped subtype-differentiated omics features to the protein-protein interaction (PPI) network to identify important interacting cancer genes and compared the activities of cancer-related pathways across subtypes.
A Robust Angle-based Transfer Learning
Tian Gu
Department of Biostatistics, Columbia University Mailman School of Public Health
Email: tg2880@cumc.columbia.edu
Transfer learning aims to improve the performance of a target model by leveraging data from related source populations, which is especially helpful in cases with insufficient target data. In this paper, we study the problem of how to train a high-dimensional ridge regression model using limited target data and existing regression models trained in heterogeneous source populations. We consider a practical setting where only the parameter estimates of the fitted source models are accessible, instead of the individual-level source data. Under the setting with only one source model, we propose a novel flexible angle-based transfer learning (angleTL) method, which leverages the concordance between the source and the target model parameters. We show that angleTL unifies several benchmark methods by construction, including the target-only model trained using target data alone, the source model fitted on source data, and the distance-based transfer learning method that incorporates the source parameter estimates and the target data under a distance-based similarity constraint. We also provide algorithms to effectively incorporate multiple source models, accounting for the fact that some source models may be more helpful than others. Our high-dimensional asymptotic analysis provides interpretations and insights regarding when a source model can be helpful to the target model, and demonstrates the superiority of angleTL over other benchmark methods. We perform extensive simulation studies to validate our theoretical conclusions and show the feasibility of applying angleTL to transfer existing genetic risk prediction models across multiple biobanks.
Doubly Flexible Estimation under Label Shift
Yanyuan Ma
Pennsylvania State University
yanyuanma@gmail.com
In studies ranging from clinical medicine to policy research, complete data are usually available from a population P, but the quantity of interest is often sought for a related but different population Q which only has partial data. In this paper, we consider the setting where both the outcome Y and the covariate X are available from P whereas only X is available from Q, under the so-called label shift assumption, i.e., the conditional distribution of X given Y remains the same across the two populations. To estimate the parameter of interest in population Q via leveraging the information from population P, the following three ingredients are essential: (a) the common conditional distribution of X given Y, (b) the regression model of Y given X in population P, and (c) the density ratio of the outcome Y between the two populations. We propose an estimation procedure that only needs some standard nonparametric regression technique to approximate the conditional expectations with respect to (a), while by no means needing an estimate or model for (b) or (c); i.e., it is doubly flexible to possible model misspecifications of both (b) and (c). This is conceptually different from the well-known doubly robust estimation in that double robustness allows at most one model to be misspecified, whereas our proposal here can allow both (b) and (c) to be misspecified. This is of particular interest in our setting because estimating (c) is difficult, if not impossible, by virtue of the absence of the Y-data in population Q. Furthermore, even though the estimation of (b) is sometimes off-the-shelf, it can face the curse of dimensionality or computational challenges. We develop the large sample theory for the proposed estimator, and examine its finite-sample performance through simulation studies as well as an application to the MIMIC-III database.
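For orientation, the label shift assumption and the three ingredients can be written compactly as follows (standard notation, not necessarily the speaker's):
\[
  \text{(a)}\;\; p_{P}(x \mid y) = p_{Q}(x \mid y), \qquad
  \text{(b)}\;\; m(x) = \mathbb{E}_{P}(Y \mid X = x), \qquad
  \text{(c)}\;\; r(y) = \frac{q_{Y}(y)}{p_{Y}(y)},
\]
where (a) is the label shift assumption shared by the two populations; the proposed procedure only requires nonparametric regression for conditional expectations related to (a), while the working models for both (b) and (c) are allowed to be misspecified.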
A Double Projection Approach for Safe and Efficient Semi-Supervised Data-Fusion
Ying Wei
Columbia University
yw2148@cumc.columbia.edu
Advances in data collection and transmission technologies have made larger amounts of data readily available. However, data centers differ in their data collection capabilities, and some data are inevitably missing. Many previous approaches to handling missing information have focused solely on either missing predictors or missing responses. In this paper, we consider both types of missingness and incorporate more information by projecting score functions into subsets, thus proposing algorithms with guaranteed efficiency relative to the complete-case analysis. By generalizing the algorithm of this paper, it is promising to be able to handle more complex missing data structures in the future. This is joint work with Molei Liu, Yiming Li and Sean Yang.
Peter Song (Organizer)
Annie Qu
Michael Elliott
Ji Zhu
Jian Kang
Optimal Individualized Treatment Rule for Combination Treatments under Budget Constraints
Annie Qu
University of California, Irvine
aqu2@uci.edu
The individualized treatment rule (ITR), which recommends an optimal treatment based on individual characteristics, has drawn considerable interest from many areas such as precision medicine, personalized education, and personalized marketing. Existing ITR estimation methods mainly recommend a single treatment chosen from two or more options. However, a combination of multiple treatments could be more powerful in various areas. In this paper, we propose a novel Double Encoder Model (DEM) to estimate the individualized treatment rule for combination treatments. The proposed double encoder model is a nonparametric model which not only flexibly incorporates complex treatment effects and interaction effects among treatments, but also improves estimation efficiency via its parameter-sharing feature. In addition, we tailor the estimated ITR to budget constraints through a multi-choice knapsack formulation, which enhances our proposed method under restricted-resource scenarios. In theory, we provide the value reduction bound with or without budget constraints, and an improved convergence rate with respect to the number of treatments under the DEM. Our simulation studies show that the proposed method outperforms existing ITR estimation methods in various settings. We also demonstrate the superior performance of the proposed method in PDX data, where it recommends optimal combination treatments to shrink the tumor size of colorectal cancer.
Using variability in longitudinally-measured variables as a predictor of health outcomes
Michael Elliott
mrelliot@umich.edu
Longitudinal data have become a major part of the landscape for clinical and epidemiological research. While variance is typically understood as a nuisance – the “noise” in “signal-to-noise” – there is increasing evidence that underlying variability in subject-level measures over time may also be important in predicting future health outcomes of interest. However, most statistical methods development has been focused on the use of mean trends obtained from longitudinal data; approaches that incorporate subject-level variability are far less common, and consequently such information is rarely used. I will provide a review of several methods developed to incorporate variability of predictors in a range of statistical modeling settings, with a deeper dive on a specific application where we consider how a woman’s mean and variability trends of multivariate hormonal measures during the menopausal transition can impact measures of post-menopausal health outcomes.
A Latent Space Model for Hypergraphs with Diversity and Heterogeneous Popularity
Ji Zhu
Susan A. Murphy Collegiate Professor, Associate Chair for Graduate Programs, Department of Statistics, University of Michigan, Ann Arbor
jizhu@umich.edu
While relations among individuals make up an important part of data of scientific and business interest, existing statistical modeling of relational data has mainly focused on dyadic relations, i.e., those between two individuals. This work addresses the less studied, though commonly encountered, polyadic relations that can involve more than two individuals. In particular, we propose a new latent space model for hypergraphs using determinantal point processes, which is driven by the diversity within hyperedges and each node's popularity. This model mechanism is in contrast to existing hypergraph models, which are predominantly driven by similarity rather than diversity. Additionally, the proposed model accommodates broad types of hypergraphs, with no restriction on the cardinality and multiplicity of hyperedges. Consistency and asymptotic normality of the maximum likelihood estimates of the model parameters have been established. The proof is challenging, owing to the special configuration of the parameter space. Simulation studies and an application to the What's Cooking data show the effectiveness of the proposed model.
Bayesian methods for brain-computer interfaces
Jian Kang
University of Michigan
jiankang@umich.edu
A brain-computer interface (BCI) is a system that translates brain activity into commands to operate technology. BCIs help people with disabilities use technology for communication. A common design for an electroencephalogram (EEG) BCI relies on the classification of the P300 event-related potential (ERP), which is a response elicited by the rare occurrence of target stimuli among common non-target stimuli. Existing studies have focused on constructing the ERP classifiers, but few provide insights into the underlying mechanism of the neural activity. In this talk, I will discuss several new Bayesian methods for analyzing brain signals from BCI systems based on Gaussian Processes (GP). Our proposed methods can make statistical inferences about the spatial-temporal differences and dependence of the neural activity in response to external stimuli, which provides statistical evidence of P300 ERP responses and helps design user-specific profiles for efficient BCIs. Our inference results demonstrate the importance of ERPs from several brain regions for P300 speller performance. The robustness of our analysis is justified by cross-participant comparisons and extensive simulations.
Gen Li (Organizer)
Hongzhe Li
Huilin Li
Zhigang Li
Gen Li
Transfer Learning with Random Coefficient Ridge Regression for Microbiome Applications
Hongzhe Li
University of Pennsylvania
hongzhe@pennmedicine.upenn.edu
Ridge regression with random coefficients provides an important alternative to fixed-coefficient regression in the high-dimensional setting when the effects are expected to be small but not zero. Such models are particularly appropriate for microbiome-based prediction. This paper considers estimation and prediction for random coefficient ridge regression in the setting of transfer learning, where, in addition to observations from the target model, source samples from different but possibly related regression models are available. The informativeness of a source model for the target model can be quantified by the correlation between the regression coefficients. This paper proposes two estimators of the regression coefficients of the target model as weighted sums of the ridge estimates from the target and source models, where the weights can be determined by minimizing the limiting estimation risk or prediction risk. Using random matrix theory, the limiting values of the optimal weights are derived under the setting where p/n→γ, with p the number of predictors and n the sample size, which leads to an explicit expression for the estimation and prediction risks. We present results for several microbiome-based disease prediction problems, including IBD and colon cancer.
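A schematic version of the weighted combination described above (illustrative notation only; the exact form of the optimal weights is derived in the paper):
\[
  \widehat{\beta}(\omega) \;=\; \omega_{0}\,\widehat{\beta}^{\,\mathrm{ridge}}_{\mathrm{target}}(\lambda_0)
  \;+\; \sum_{k=1}^{K}\omega_{k}\,\widehat{\beta}^{\,\mathrm{ridge}}_{\mathrm{source},k}(\lambda_k),
  \qquad
  \widehat{\omega} \;=\; \arg\min_{\omega}\; \mathcal{R}_{\infty}(\omega),
\]
where \mathcal{R}_{\infty} is the limiting estimation or prediction risk obtained as p/n → γ under random matrix theory.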
Joint Modeling of Longitudinal Microbiome Data and Survival Outcome
Huilin Li
New York University
Huilin.Li@nyulangone.org
Recently, more and more longitudinal microbiome studies have been conducted to identify candidate microbes as biomarkers for disease prognosis. We propose a novel joint modeling framework, JointMM, for longitudinal microbiome and time-to-event data to investigate the effect of dynamic changes in the microbiome abundance profile on disease onset. JointMM comprises two sub-models: a zero-inflated scaled-Beta mixed-effects regression sub-model aimed at depicting the temporal structure of microbial abundances among subjects, and a survival sub-model that characterizes the occurrence of disease and its relationship with changes in microbiome abundance. JointMM is specifically designed to handle zero-inflated and highly skewed longitudinal microbiome abundance data and offers better interpretability, in that it can examine whether the temporal microbial presence/absence pattern and/or the abundance dynamics alter the time to disease onset. Comprehensive simulations and real data analyses demonstrate the statistical efficiency of JointMM compared with competing methods.
Estimating equations with inverse probability weighting for microbiome analysis
Zhigang Li
University of Florida
zhigang.li@ufl.edu
Human microbiome data is collected in many research studies to investigate the role of the microbiome in association with diseases or conditions. Sequencing technologies, including 16S rRNA sequencing and metagenome shotgun sequencing, are commonly used for quantifying microbiome data. However, it remains challenging to appropriately analyze microbiome data due to its unique features such as zero-inflated structure and compositional structure. We develop a novel approach to analyze microbiome data for differential abundance analyses or regression analyses. This approach employs inverse probability weighting techniques to account for the mixture of true and false zeros. Generalized estimating equations (GEE) are used to account for the complicated inter-taxa correlation structure. The method does not require imputing zeros with a positive value for the data analysis. It performs well in comparison with existing methods in the simulation study. An application of the new approach to a real data set is also presented.
Analysis of Microbiome Differential Abundance by Pooling Tobit Models
Gen Li, PhD
Associate Professor, Department of Biostatistics, School of Public Health, University of Michigan, Ann Arbor
ligen@umich.edu
Differential abundance analysis identifies microbiome taxa whose abundances differ between two or more conditions. Compositionality and sparsity of metagenomics sequencing data pose statistical challenges. We propose ADAPT (Analysis of Microbiome Differential Abundance by Pooling Tobit Models) as a solution. Count ratios between taxa satisfy subcompositional coherence. Zero counts can be regarded as left-censored at one. The tobit model is suitable for modeling left-censored count ratios. ADAPT first fits tobit models to relative abundances of individual taxa. It then selects a subset of non-differentially abundant taxa as the reference taxa set based on the estimated effect sizes and the distribution of p-values. Finally, tobit models for the count ratios between individual taxa and the reference taxa set reveal differentially abundant taxa. Simulation studies show that ADAPT has higher power than alternative methods while controlling false discovery rates. Application of ADAPT to early childhood dental caries data reveals differentially abundant oral bacteria species and functional genes between the saliva samples of children with and without dental caries.
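A hedged sketch of the tobit step described above (my notation; the exact censoring point and parameterization are as specified in the ADAPT paper):
\[
  y_{ij}^{*} = x_i^{\top}\beta_j + \varepsilon_{ij}, \qquad \varepsilon_{ij} \sim N(0,\sigma_j^2),
  \qquad
  y_{ij} = \max\big(y_{ij}^{*},\, d_{ij}\big),
\]
where y_{ij} is the observed log count ratio between taxon j and the reference taxa set in sample i, and d_{ij} is the left-censoring point implied by treating a zero count as left-censored at one; censored observations contribute \Phi\{(d_{ij} - x_i^{\top}\beta_j)/\sigma_j\} to the likelihood.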
Shuangge Ma (Organizer)
Hao Mei
Yuan Huang
Shuangge Ma
Ai-Ling Hour
Clinical Human Disease Networks with Healthcare Administrative Claims Data
Hao Mei
Renmin University of China
hao.mei@ruc.edu.cn
Clinical treatment outcomes are the quality and cost targets that healthcare providers aim to improve. Most existing outcome analysis focuses on a single disease or on all diseases combined, which ignores the complex interconnections among diseases. Motivated by the success of molecular and phenotypic human disease networks (HDNs), we develop clinical HDNs that describe the interconnections among diseases in terms of multiple clinical treatment outcomes. In this framework, one node represents one disease, and two nodes are linked with an edge if their outcomes are conditionally dependent. Along this direction, we also develop a time-dependent clinical HDN that investigates temporal variation in disease interconnections from a clinical point of view. Our data experiments validate the performance of the proposed models in identifying correct edges. Analyzing key network properties, such as connectivity, modules/hubs, and temporal variation, using healthcare administrative claims data, we obtain findings that are not only biomedically sensible but also uncover information that has been little or not at all investigated in the literature. Overall, clinical HDNs can provide additional insight into diseases’ properties and their interconnections and assist more efficient disease management and healthcare resource allocation.
TBD
Yuan Huang
Yale University
yuan.huang@yale.edu
Traditional mediation analysis methods have been limited to handling only a small number of mediators, posing significant challenges when dealing with high-dimensional mediators. These challenges are further compounded by the intricate relationships introduced by confounding variables. To effectively address these issues, we introduce an approach called DP2LM (Deep neural network-based Penalized Partially Linear Mediation). This approach incorporates deep neural network techniques to account for nonlinear effects in confounders and utilizes the penalized partially linear model to accommodate high dimensionality. Unlike most existing works that concentrate on mediator selection, our method prioritizes estimation and inference on mediation effects. We will present its performance under simulation studies and its application to real-world data. Additionally, we will discuss some important considerations and potential limitations when utilizing this approach.
Heterogeneous Network Analysis of Disease Clinical Treatment Measures via Mining Electronic Medical Record Data
Shuangge Ma
Department of Biostatistics, Yale School of Public Health
Shuangge.ma@yale.edu
The analysis of clinical treatment measures has been extensively conducted and can facilitate more effective resource management and planning and also assist a better understanding of diseases. Most of the existing analyses have focused on a single disease or a large number of diseases combined. Partly motivated by the successes of gene-centric and phenotypic human disease network (HDN) research, there has been growing interest in the network analysis of clinical treatment measures. However, the existing studies have been limited by a lack of attention to heterogeneity and relevant covariates, ineffectiveness of methods, and low data quality. In this study, our goal is to mine the Taiwan National Health Insurance Research Database (NHIRD), a large population-level electronic medical record (EMR) database, and construct HDNs for the number of outpatient visits and medical cost. Significantly advancing from the existing literature, the proposed analysis accommodates heterogeneity and the effects of covariates (for example, demographics). Additionally, the proposed method effectively accommodates the zero-inflated nature of the data, the Poisson distribution, high-dimensionality, and network sparsity. Computational and theoretical properties are carefully examined. Simulation demonstrates the competitive performance of the proposed approach. In the analysis of the NHIRD data, two and five subject groups are identified for outpatient visits and medical cost, respectively. The identified interconnections, hubs, and network modules are found to have sound implications.
Few-shot learning of Tabular Medical Records with Large Language Models
Kai-Yuan Hsiao1,2, Wei-Shan Chang1,2, Ai-Ling Hour3*, Ben-Chang Shia1,2
1 Artificial Intelligence Development Center, Fu Jen Catholic University, New Taipei City, Taiwan
2 Graduate Institute of Business Administration, College of Management, Fu Jen Catholic University, New Taipei City, Taiwan
3 Department of Life Science, Fu-Jen Catholic University, New Taipei City, Taiwan
The medical field is replete with abundant tabular data encompassing everything from patient records to results of clinical trials. Traditional approaches to the analysis of such data often involve complex statistical techniques, feature engineering, and conventional machine learning methods. However, the integration of large language models (LLMs) offers a transformative method for this analysis. This study examines the application of LLMs for zero-shot and few-shot classification of medical tabular data, employing techniques for generating additional feature information and natural language serialization of tabular data. In this study, the methodology encompasses the serialization of medical tabular data into natural language strings, complete with descriptions of the specific analytical tasks. For instance, tables detailing patient symptoms and diagnoses are transformed into formats readily interpretable by Large Language Models (LLMs). Where only a few examples are present, the generative capacity of LLMs is harnessed to bolster their comprehension of medical contexts. Iterative generation of additional, semantically meaningful features is based on the medical background of the dataset, with an investigation into the efficacy of the generated features. In the experimental process, besides testing large language models for downstream tasks, comparisons with traditional machine learning methods are made, with efforts to interpret the results, ensuring that medical professionals can understand and trust the findings.
Cheng Zheng (Organizer)
Ying Zhang
Cheng Zheng
Ping Ma
Danping Liu
Semiparametric Inference for Misclassified Semi-Competing Risks Data under Gamma-Frailty Conditional Markov Model
Ying Zhang*, Ruiqian Wu, and Giorgos Bakoyannis
*Department of Biostatistics, University of Nebraska Medical Center
ying.zhang@unmc.edu
There has been increasing interest in semi-competing risks data modeling to jointly study disease progression and death for the illness-death problem. Identification of risk factors for the benchmark events will provide insight to detect the high-risk group according to personal-level characteristics, which is critical to develop a personalized prevention strategy to delay the progression from illness to death. However, in many applications, event ascertainment is incomplete, resulting in event misclassification that complicates the statistical inference with semi-competing risks data. In this work, we consider a Gamma frailty conditional Markov model to study the misclassified semi-competing risk data and propose a two-stage semiparametric maximum pseudo-likelihood estimation approach equipped with a pseudo-EM algorithm to make unbiased statistical inference. Extensive simulation studies show the proposed method is numerically stable and performs well even with a large amount of event misclassification. The method is applied to a multi-center HIV cohort study in East Africa to measure the impact of interruption of lifelong antiretroviral therapy (ART) on HIV mortality at the personal level.
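For orientation, a generic gamma-frailty conditional Markov illness-death model of the kind referenced above (illustrative notation, not the authors' exact specification) has transition intensities
\[
  \lambda_{01}(t \mid Z,\gamma) = \gamma\,\lambda_{01,0}(t)\,e^{\beta_{01}^{\top} Z},\quad
  \lambda_{02}(t \mid Z,\gamma) = \gamma\,\lambda_{02,0}(t)\,e^{\beta_{02}^{\top} Z},\quad
  \lambda_{12}(t \mid Z,\gamma) = \gamma\,\lambda_{12,0}(t)\,e^{\beta_{12}^{\top} Z},
\]
where states 0, 1, 2 denote healthy, illness, and death, and \gamma is a shared gamma frailty inducing dependence between illness and death; the added complication addressed in the talk is that the intermediate (illness) event may be misclassified, which motivates the two-stage pseudo-likelihood estimation and pseudo-EM algorithm.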
Investigating Multiple Causal Mechanisms with Multiple Mediators and Estimating Direct and Indirect Effects: A Joint Modeling Approach for Recurrent and Terminal Events
Cheng Zheng, PhD
Associate Professor, Department of Biostatistics, University of Nebraska Medical Center
cheng.zheng@unmc.edu
Understanding the diverse causal mechanisms between primary exposure and outcomes has garnered significant interest in the social and medical fields. In the context of HIV patients, over 20 distinct opportunistic infections (OIs) present complex effects on the health trajectory and associated mortality. It is crucial to differentiate among these OIs to devise tailored strategies to enhance patients' survival and quality of life. However, existing statistical frameworks for studying causal mechanisms have limitations, either focusing on single mediators or lacking the ability to handle unmeasured confounding, especially for the survival outcomes. In this work, we propose a novel joint modeling approach that considers multiple recurrent events as mediators and survival endpoints as outcomes, relaxing the assumption of “sequential ignorability” by utilizing the shared random effect to handle unmeasured confounders. We assume the multiple mediators are not causally related to each other given observed covariates and the shared frailty. Simulation studies demonstrate good finite sample performance of our methods in estimating both model parameters and multiple mediation effects. We apply our approach to an AIDS study and evaluate the mediation effects of different types of OIs. We find that distinct pathways through the two treatments and CD4 counts impact overall survival via different types of recurrent opportunistic infections.
Analyzing CITE-seq Data via a Quantum Algorithm
Ping Ma
UGA Distinguished Research Professor, Department of Statistics, University of Georgia
pingma@uga.edu
With the rapid development of quantum computers, researchers have shown quantum advantages in physics-oriented problems. Quantum algorithms tackling computational biology problems are still lacking. In this talk, I will demonstrate the quantum advantage of analyzing CITE-seq data. CITE-seq, a single-cell technology, enables researchers to simultaneously measure expressions of RNA and surface protein detected by antibody-derived tags (ADTs) in the same cells. CITE-seq data hold tremendous potential for elucidating RNA-ADT co-expression networks and identifying cell types effectively. However, both tasks are challenging since the best subset of ADTs needs to be identified from enormous candidate subsets. To surmount the challenge, I will present a quantum algorithm for analyzing CITE-seq data.
Dynamic Risk Prediction for Cervical Precancer Screening with Continuous and Binary Longitudinal Biomarkers
Danping Liu
National Cancer Institute
danping.liu@nih.gov
Dynamic risk prediction that incorporates longitudinal measurements of biomarkers is useful in identifying high-risk patients for better clinical management. Our work is motivated by the prediction of cervical precancers. Currently, Pap cytology is used to identify HPV+ women at high-risk of cervical precancer, but cytology lacks accuracy and reproducibility. HPV DNA methylation is closely linked to the carcinogenic process and shows promise of improved risk stratification. We are interested in developing a dynamic risk model that uses all longitudinal biomarker information to improve precancer risk estimation. We propose a joint model to link both the continuous methylation biomarker and binary cytology biomarker to the time to precancer outcome using shared random effects. The model uses a discretization of the time scale to allow for closed-form likelihood expressions, thereby avoiding high-dimensional integration of the random effects. The method handles an interval-censored time-to-event outcome due to intermittent clinical visits, incorporates sampling weights to deal with stratified sampling data, and can provide immediate and 5-year risk estimates that may inform clinical decision-making.
Qi Long (Organizer)
Ying Guo
Suprateek Kundu
Ming Wang
Ziyi Li
A Regularized Blind Source Separation Framework for Unveiling Hidden Sources of Brain Functional Connectome
Ying Guo
Emory University
yguo2@emory.edu
Brain connectomics has become increasingly important in neuroimaging studies to advance understanding of neural circuits and their association with neurodevelopment, mental illnesses, and aging. These analyses often face major challenges, including the high dimensionality of brain networks, unknown latent sources underlying the observed connectivity, and the large number of brain connections leading to spurious findings. In this talk, we will introduce a novel regularized blind source separation (BSS) framework for reliable mapping of neural circuits underlying static and dynamic brain functional connectome. The proposed LOCUS methods achieve more efficient and reliable source separation for connectivity matrices using low-rank factorization, a novel angle-based sparsity regularization, and a temporal smoothness regularization. We develop a highly efficient iterative Node-Rotation algorithm that solves the non-convex optimization problem for learning LOCUS models. Simulation studies demonstrate that the proposed methods have consistently improved accuracy in retrieving latent connectivity traits. Application of LOCUS methods to the Philadelphia Neurodevelopmental Cohort (PNC) neuroimaging study generates considerably more reproducible findings in revealing underlying neural circuits and their association with demographic and clinical phenotypes, uncovers dynamic expression profiles of the circuits and the synchronization between them, and generates insights on gender differences in the neurodevelopment of brain circuits.
Flexible Bayesian Product Mixture Models for Vector Autoregressions
Suprateek Kundu, PhD
Associate Professor, Department of Biostatistics, The University of Texas at MD Anderson Cancer Center
SKundu2@mdanderson.org
Bayesian non-parametric methods based on Dirichlet process mixtures have seen tremendous success in various domains and are appealing in being able to borrow information by clustering samples that share identical parameters. However, such methods can face hurdles in heterogeneous settings where objects are expected to cluster only along a subset of axes or where clusters of samples share only a subset of identical parameters. We overcome such limitations by developing a novel class of products of Dirichlet process location-scale mixtures that enable independent clustering at multiple scales, resulting in varying levels of information sharing across samples. First, we develop the approach for independent multivariate data. Subsequently, we generalize it to multivariate time-series data under the framework of multi-subject vector autoregressive (VAR) models, which is our primary focus and goes beyond parametric single-subject VAR models. We establish posterior consistency and develop efficient posterior computation for implementation. Extensive numerical studies involving VAR models show distinct advantages over competing methods in terms of estimation, clustering, and feature selection accuracy. Our resting-state fMRI analysis from the Human Connectome Project reveals biologically interpretable connectivity differences between distinct intelligence groups, while another air pollution application illustrates superior forecasting accuracy compared to alternative methods.
Enhancing Primary Outcome Analysis by Leveraging Information from Secondary Outcomes
Ming Wang, PhD
Associate Professor, Director of the MS program in Biostatistics, Department of Population and Quantitative Health Sciences, Case Western Reserve University School of Medicine
mxw827@case.edu
Many observational studies and clinical trials collect various secondary outcomes that may be highly correlated with the primary endpoint. Typically, these secondary outcomes are analyzed separately from the primary analysis. However, leveraging secondary outcome data can significantly enhance the precision of primary outcome estimates. In this work, we will introduce recently developed methods that demonstrate how primary outcome estimation efficiency can be improved by incorporating information from secondary outcomes. We will explore scenarios involving single and multiple secondary outcomes, motivated from real-world clinical applications. The proposed methods will employ empirical likelihood-based weighting adaptive to different types of secondary outcomes. This innovative framework remains robust against model misspecifications in secondary data and can be flexibly extended to address various complex secondary outcomes. Both theoretical and simulation studies showcase efficiency gain. Finally, we will apply these methods to assess risk factors for cardiovascular diseases in the Atherosclerosis Risk in Communities (ARIC) study.
Accommodating time-varying heterogeneity in risk estimation under the Cox model: a transfer learning approach
Ziyi Li (presenter), Yu Shen, Jing Ning
MD Anderson Cancer Center
ZLi16@mdanderson.org
Transfer learning has attracted increasing attention in recent years for adaptively borrowing information across different data cohorts in various settings. Cancer registries have been widely used in clinical research because of their easy accessibility and large sample size. Our method is motivated by the question of how to utilize cancer registry data as a complement to improve the estimation precision of individual risks of death for inflammatory breast cancer (IBC) patients at The University of Texas MD Anderson Cancer Center. When transferring information for risk estimation based on the cancer registries (i.e., source cohort) to a single cancer center (i.e., target cohort), time-varying population heterogeneity needs to be appropriately acknowledged. However, there is no literature on how to adaptively transfer knowledge on risk estimation with time-to-event data from the source cohort to the target cohort while adjusting for time-varying differences in event risks between the two sources. Our goal is to address this statistical challenge by developing a transfer learning approach under the Cox proportional hazards model. To allow data-adaptive levels of information borrowing, we impose Lasso penalties on the discrepancies in regression coefficients and baseline hazard functions between the two cohorts, which are jointly solved in the proposed transfer learning algorithm. As shown in the extensive simulation studies, the proposed method yields more precise individualized risk estimation than using the target cohort alone. Meanwhile, our method demonstrates satisfactory robustness against cohort differences compared with the method that directly combines the target and source data in the Cox model. We develop a more accurate risk estimation model for the MD Anderson IBC cohort given various treatment and baseline covariates, while adaptively borrowing information from the National Cancer Database to improve risk assessment.
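A schematic of the penalized criterion described above (my notation; the exact formulation, including how the baseline-hazard discrepancy is parameterized and penalized, is in the paper):
\[
  \min_{\beta_{T},\,\Lambda_{T}}\;
  -\,\ell_{T}\big(\beta_{T},\Lambda_{T}\big)
  \;+\; \lambda_{1}\,\big\|\beta_{T}-\widehat{\beta}_{S}\big\|_{1}
  \;+\; \lambda_{2}\,\mathrm{Pen}\big(\Lambda_{T}-\widehat{\Lambda}_{S}\big),
\]
where (\widehat{\beta}_{S},\widehat{\Lambda}_{S}) are estimates from the source (registry) cohort, \ell_{T} is the target-cohort Cox log-likelihood, and the Lasso-type penalties shrink the target regression coefficients and baseline hazard toward their source counterparts, with data-adaptive tuning controlling how much information is borrowed.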
Menggang Yu (Organizer)
Jeremy Taylor
Lu Tian
Menggang Yu
Ruth Pfeiffer
James-Stein approach for improving prediction of linear regression models by integrating external information from heterogeneous populations
Jeremy M G Taylor, Peisong Han, Haoyue Li
University of Michigan
jmgt@umich.edu
We consider the setting where (i) an internal study builds a linear regression model for prediction based on individual-level data, (ii) some external studies have fitted similar linear regression models that use only subsets of the covariates and provide coefficient estimates for the reduced models without individual-level data, and (iii) there is heterogeneity across these study populations. The goal is to integrate the external model summary information into fitting the internal model to improve prediction accuracy. We adapt the James-Stein shrinkage method to propose estimators that have guaranteed improvement in the prediction mean squared error after information integration, regardless of the degree of study population heterogeneity. We conduct comprehensive simulation studies to investigate the numerical performance of the proposed estimators. We also apply the method to enhance a prediction model for patella bone lead level in terms of blood lead level and other covariates by integrating summary information from published literature.
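As a rough illustration of the shrinkage idea in this abstract (not the authors' estimator, which is constructed to guarantee prediction mean squared error improvement), the following Python sketch shrinks an internal OLS fit toward a hypothetical externally informed coefficient vector using a positive-part James-Stein weight; all data and the external target are simulated assumptions.

# Generic James-Stein-type shrinkage of an internal coefficient estimate toward
# an externally informed target. This only illustrates the shrinkage idea; the
# paper's estimator is built differently and guarantees prediction MSE improvement.
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 5
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, 0.5, -0.5, 0.2, 0.0])
y = X @ beta_true + rng.normal(scale=1.0, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_int = XtX_inv @ X.T @ y                      # internal OLS fit
sigma2 = np.sum((y - X @ beta_int) ** 2) / (n - p)

beta_ext = np.array([0.9, 0.6, -0.4, 0.0, 0.0])   # hypothetical externally informed target

# positive-part James-Stein weight: shrink toward the external target more
# aggressively when the internal estimate is close to the target relative to
# its sampling variability (Mahalanobis distance below).
diff = beta_int - beta_ext
denom = diff @ np.linalg.solve(XtX_inv * sigma2, diff)
w = max(0.0, 1.0 - (p - 2) / denom)
beta_js = beta_ext + w * diff
print(np.round(beta_js, 2))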
Adaptive Prediction Strategy with Individualized Variable Selection
Lu Tian
Stanford University
lutian@stanford.edu
Today, physicians have access to a wide array of tests for diagnosing and prognosticating medical conditions. Ideally, they would apply a high-quality prediction model, utilizing all relevant features as input, to facilitate appropriate decision-making regarding treatment selection or risk assessment. However, not all features used in these prediction models are readily available to patients and physicians without incurring some costs. In practice, predictors are typically gathered as needed in a sequential manner, while the physician continually evaluates information dynamically. This process continues until sufficient information is acquired, and the physician gains reasonable confidence in making a decision. Importantly, the prospective information to collect may differ for each patient and depend on the predictor values already known. In this paper, we present a novel dynamic prediction rule designed to determine the optimal order of acquiring prediction features in predicting a clinical outcome of interest. The objective is to maximize prediction accuracy while minimizing the cost associated with measuring prediction features for individual subjects. To achieve this, we employ reinforcement learning, where the agent must decide on the best action at each step: either making a clinical decision with available information or continuing to collect new predictors based on the current state of knowledge. To evaluate the efficacy of the proposed dynamic prediction strategy, extensive simulation studies have been conducted. Additionally, we provide two real data examples to illustrate the practical application of our method.
Entropy Balancing for Causal Generalization with Target Sample Summary Information
Menggang Yu, PhD
Department of Biostatistics, University of Michigan
meyu@biostat.wisc.edu
In this talk, we focus on estimating the average treatment effect (ATE) of a target population when individual-level data from a source population and summary-level data (e.g., first or second moments of certain covariates) from the target population are available. In the presence of heterogeneous treatment effect, the ATE of the target population can be different from that of the source population when distributions of treatment effect modifiers are dissimilar in these two populations, a phenomenon also known as covariate shift. Many methods have been developed to adjust for covariate shift, but most require individual covariates from a representative target sample. We develop a weighting approach based on summary-level information from the target sample to adjust for possible covariate shift in effect modifiers. In particular, weights of the treated and control groups within a source sample are calibrated by the summary-level information of the target sample. Our approach also seeks additional covariate balance between the treated and control groups in the source sample.
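A minimal sketch of the calibration idea described above: reweight the source sample so that its covariate means match summary-level moments reported for the target sample. The data, dimensions, and optimizer choice below are illustrative assumptions, not the authors' implementation (which additionally balances the treated and control groups within the source sample).

# Entropy-balancing-style weights: reweight a source sample so its covariate
# means match target-population summary moments. All inputs are simulated.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X_source = rng.normal(size=(500, 3))            # effect modifiers in the source sample
target_means = np.array([0.3, -0.2, 0.1])       # summary-level moments from the target sample

def dual(lam):
    # Convex dual of the entropy-balancing problem: minimizing it yields weights
    # w_i proportional to exp(X_i @ lam) whose weighted mean equals target_means.
    return np.log(np.exp(X_source @ lam).sum()) - lam @ target_means

res = minimize(dual, x0=np.zeros(3), method="BFGS")
w = np.exp(X_source @ res.x)
w /= w.sum()                                    # normalized calibration weights

print(np.allclose(w @ X_source, target_means, atol=1e-4))  # weighted means match the target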
Accommodating population differences in model validation
Ruth Pfeiffer, Ph.D.
Biostatistics Branch, National Cancer Institute, NIH, HHS, Bethesda, MD 20892-7244
pfeiffer@mail.nih.gov
Validation of risk prediction models in independent data provides a rigorous assessment of model performance. However, several differences between the populations that gave rise to the training and the validation data can lead to seemingly poor performance of a risk model. We formalize the notions of “similarity” of the training and validation data and define reproducibility and transportability. We address the impact of differences in the predictor distributions and in outcome verification on model calibration, accuracy, and discrimination. When individual-level data from both the training and validation data sets are available, we propose and study weighted versions of the validation metrics that adjust for differences in the predictor distributions and in outcome verification to provide a more comprehensive assessment of model performance. We give conditions on the model and the training and validation populations that ensure a model’s reproducibility or transportability and show how to check them. We discuss approaches to recalibrate a model. As an illustration, we develop and validate a prostate cancer risk model using data from two large North American prostate cancer prevention trials, the SELECT and PLCO trials. This is joint work with Yiyao Chen, Mitchell H. Gail, and Donna P. Ankerst.
Yichuan Zhao (Organizer)
Gang Li
Amita Manatunga
Yi Li
Yichuan Zhao
A Semiparametric Bayesian Instrumental Variable Analysis Method for Partly Interval-Censored Time-to-Event Outcome
Gang Li
UCLA
This paper develops a semiparametric Bayesian instrumental variable (IV) analysis method for estimating the causal effect of an endogenous variable when dealing with unobserved confounders and measurement errors with partly interval-censored time-to-event data, where event times are observed exactly for some subjects but left-censored, right-censored, or interval-censored for others. Our method is based on a two-stage Dirichlet process mixture instrumental variable (DPMIV) model which simultaneously models the first-stage random error term for the exposure variable and the second-stage random error term for the time-to-event outcome using a Gaussian mixture of the Dirichlet process (DPM) model. The DPM model can be broadly understood as a mixture model with an unspecified number of Gaussian components, which relaxes the normal error assumptions and allows the number of mixture components to be determined by the data. We develop an MCMC algorithm for the DPMIV model tailored for partly interval-censored data and conduct extensive simulations to assess the performance of our DPMIV method in comparison with some existing methods. Our simulations revealed that our proposed method is robust under different error distributions and can have far superior performance over some competing methods under a variety of scenarios. We further demonstrate the effectiveness of our approach on the UK Biobank study to investigate the causal effect of systolic blood pressure (SBP) on time-to-development of cardiovascular disease (CVD) from the diagnosis of diabetes mellitus (DM).
Noninvasive Monitoring for Anemia in Very Low Birth Weight Infants Using Smartphone Images
Amita Manatunga, Emily Wu, Limin Peng
Department of Biostatistics and Bioinformatics, Emory University, Atlanta, Georgia
amanatu@emory.edu
Monitoring for anemia is an important part of clinical care for very low birth weight (VLBW) infants in the neonatal intensive care unit. Collecting fingernail images with a smartphone has been explored as a noninvasive alternative to standard repeated invasive blood draws, which may paradoxically contribute to anemia through accumulated blood loss. In this talk, we will present a novel preliminary analysis of a longitudinal study of VLBW infants from three Atlanta-area hospitals to evaluate the prognostic value of the fingernail imaging data, which take the form of three 51 × 51 matrices of RGB values, for predicting an infant's need for red blood cell transfusion. Our analyses effectively utilize the repeated measurements of the fingernail imaging data and, moreover, properly account for the potential dependency between the timing of taking fingernail images and the underlying risk of requiring blood transfusion. Survival analysis techniques are also employed to handle censoring in the observation of transfusions. We will conclude the talk with a careful interpretation of our analysis results.
CeCNN: Copula-Enhanced Convolutional Neural Networks in Joint Prediction of Refraction Error and Axial Length Based on Ultrawide Field Fundus Images
Catherine Chunling Liu
Hong Kong Polytechnic University
catherine.chunling.liu@polyu.edu.hk
Ultra-widefield (UWF) fundus images are replacing traditional fundus images in screening, detection, prediction, and treatment of complications related to myopia because their much broader visual range is advantageous for highly myopic eyes. Spherical equivalent (SE) is extensively used as the main myopia outcome measure, and axial length (AL) has drawn increasing interest as an important ocular component for assessing myopia. Cutting-edge studies show that SE and AL are strongly correlated. Using the joint information from SE and AL is potentially better than using either separately. In the deep learning community, though there is research on multiple-response tasks with a 3D image biomarker, dependence among responses is only sporadically taken into consideration. Motivated by the idea that information extracted from the data by statistical methods can improve the prediction accuracy of deep learning models, we formulate a class of bivariate response regression models with a higher-order tensor biomarker, for the bivariate tasks of regression-classification and regression-regression. Specifically, we propose a copula-enhanced convolutional neural network (CeCNN) architecture that incorporates the dependence between responses through a Gaussian copula (with parameters estimated from a warm-up CNN) and uses the induced copula-likelihood loss with the backbone CNNs. We establish the statistical framework and algorithms for the aforementioned two bivariate tasks. We show that the CeCNN has better prediction accuracy after adding the dependency information to the backbone models. The modeling and the proposed CeCNN algorithm are applicable beyond the UWF scenario and can be effective with other backbones beyond ResNet and LeNet.
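For the regression-regression task, a Gaussian copula with Gaussian margins amounts to a bivariate normal likelihood, so a copula-likelihood loss can be sketched as below. The function name, the fixed correlation, and the toy data are assumptions for illustration only; in the CeCNN workflow the copula parameter would be estimated from a warm-up fit.

# Sketch of a Gaussian copula-likelihood loss for a bivariate regression task
# (both responses continuous with Gaussian margins, so the loss reduces to a
# bivariate normal negative log-likelihood). Not the CeCNN implementation.
import numpy as np

def copula_nll(y1, y2, mu1, mu2, sigma1, sigma2, rho):
    """Negative log-likelihood coupling two regression heads through rho."""
    z1 = (y1 - mu1) / sigma1
    z2 = (y2 - mu2) / sigma2
    quad = (z1**2 - 2 * rho * z1 * z2 + z2**2) / (1 - rho**2)
    log_det = np.log(sigma1) + np.log(sigma2) + 0.5 * np.log(1 - rho**2)
    return np.mean(0.5 * quad + log_det + np.log(2 * np.pi))

# toy usage: rho would come from a warm-up fit in a copula-enhanced workflow
rng = np.random.default_rng(2)
y1 = rng.normal(size=100)
y2 = 0.8 * y1 + 0.6 * rng.normal(size=100)
print(copula_nll(y1, y2, np.zeros(100), np.zeros(100), 1.0, 1.0, 0.8))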
Multi-task Learning for Gaussian Graphical Regressions with High Dimensional Covariates
Yi Li, PhD
M. Anthony Schork Collegiate Professor, Professor of Biostatistics, Professor of Global Public Health, School of Public Health, University of Michigan, Ann Arbor
yili@umich.edu
Gaussian graphical regression is a powerful approach for regressing the precision matrix of a Gaussian graphical model on covariates, which permits the response variables and covariates to outnumber the sample size. However, traditional approaches of fitting the model via separate node-wise lasso regressions overlook the network-induced structure among these regressions, leading to high error rates, particularly when the number of nodes is large. To address this issue, we propose a multi-task learning estimator for fitting Gaussian graphical regression models, which incorporates a cross-task group sparsity penalty and a within-task element-wise sparsity penalty to govern the sparsity of active covariates and their effects on the graph, respectively. We also develop an efficient augmented Lagrangian algorithm for computation, which solves subproblems with a semi-smooth Newton method. We further prove that our multi-task learning estimator has considerably lower error rates than the separate node-wise regression estimates, as the cross-task penalty enables borrowing information across tasks. To address the main challenge of entangled tasks in a complicated correlation structure, we establish a new tail probability bound for dependent heavy-tailed (e.g., sub-exponential) variables with an arbitrary dependence structure, which is a useful theoretical result in its own right. We examine the utility of our method through simulations and an application to a gene co-expression network study with brain cancer patients.
Weighted empirical likelihood inference for the difference between the areas under two correlated ROC curves with right-censored data
Yichuan Zhao
Department of Mathematics and Statistics, Georgia State University
yichuan@gsu.edu
In this article, building upon Chrzanowski’s (2014) method, we propose a two-sample weighted empirical likelihood approach to compare the areas under two correlated ROC curves. A normal approximation method is derived. We define a weighted empirical likelihood ratio and demonstrate that the resulting statistic follows a scaled chi-square distribution. Additionally, to improve the accuracy of confidence intervals for small sample sizes, we employ a calibration method known as the adjusted empirical likelihood. Extensive simulations demonstrate the excellent finite-sample performance of the proposed weighted empirical likelihood method. To further illustrate the practical applicability of our methodology, we provide a real-world example showcasing its effectiveness.
Zhezhen Jin (Organizer)
Xiaonan Xue
Yongzhao Shao
Shanshan Ding
Lihui Zhao
Joint modeling of sleep variables that are objectively measured by wrist actigraphy
Xiaonan Xue
Albert Einstein College of Medicine
xiaonan.xue@einsteinmed.edu
Recently developed actigraphy devices have made it possible to monitor sleep continuously and objectively over multiple nights. Sleep variables captured by wrist actigraphy devices include sleep onset, sleep end, total sleep time, wake time after sleep onset, number of awakenings, etc. Currently available statistical methods to analyze such actigraphy data have limitations. First, averages over multiple nights are used to summarize sleep activities, ignoring variability over multiple nights from the same subject. Second, sleep variables are often analyzed independently. However, sleep variables tend to be correlated with each other. For example, how long a subject sleeps at night can be correlated with how long and how frequently he/she wakes up during that night. It is important to understand these inter-relationships. We therefore propose a joint mixed-effects model for total sleep time, number of awakenings, and wake time. We develop an estimating procedure based upon a sequence of generalized linear mixed-effects models, which can be implemented using existing software. The use of these models not only avoids the computational intensity and instability that may occur when directly applying a numerical algorithm to a complicated joint likelihood function, but also provides additional insights into sleep activities. We demonstrate in simulation studies that the proposed estimating procedure performs well in estimating both fixed- and random-effects parameters. We applied the proposed model to data from the Women's Interagency HIV Sleep Study to examine the association of employment status and age with overall sleep quality assessed by several actigraphy-measured sleep variables.
Assessing heterogeneous effects of biomarkers on multi-outcomes in a competing-risk survival analysis
Yongzhao Shao, PhD
New York University Grossman School of Medicine
Yongzhao.Shao@nyulangone.org
There is often a need to evaluate heterogeneous effects on competing survival events due to different causes, increasingly so in competing-risk survival analysis of late-onset Alzheimer’s disease and cancers. In such problems, it is of interest to identify factors that have different effects among the multi-outcomes of competing survival events. We propose a semi-competing risk regression for multi-center multi-outcome data and develop a computing algorithm. Simulation studies are used to demonstrate the effectiveness of such a method under various practical scenarios. We will discuss applications to mortality analysis of multi-outcomes in a multi-center late-onset Alzheimer's disease study to understand the heterogeneous effects of the APOE ε4 allele in the context of competing-risk survival analysis of multi-outcomes.
Nonconvex-regularized integrative sufficient dimension reduction for multi-source data
Shanshan Ding
Associate Professor of Statistics, University of Delaware
sding@udel.edu
As advances in high-throughput technology significantly expand data availability, integrative analysis of multiple data sources has become an increasingly important tool for biomedical studies. An integrative and nonconvex-regularized sufficient dimension reduction method is proposed to achieve simultaneous dimension reduction and variable selection for multi-source data analysis in high dimensions. The proposed method aims to extract sufficient information in a supervised fashion, and the asymptotic results establish a new theory for integrative sufficient dimension reduction and allow the number of predictors in each data source to increase exponentially fast with sample size. The promising performance of the integrative estimator and the efficiency of the numerical algorithms are demonstrated through simulation and real data examples.
On the Dynamic Risk Prediction with Time-Varying Risk Factors
Lihui Zhao
Northwestern University
lihui.zhao@northwestern.edu
Risk prediction plays a central role in clinical prevention strategies by aiding decision making for lifestyle modification and by matching the intensity of therapy to the absolute risk of a given patient. Most prediction models are developed based on risk factors measured at a single time point. Since risk factors like blood pressure are regularly collected in clinical practice, and electronic medical records are making longitudinal data on these risk factors available to clinicians, dynamic risk prediction on a real-time basis using the longitudinal history of risk factors will likely improve the precision of personalized risk prediction. We will present statistical methods to build dynamic risk prediction models using repeatedly measured risk factor levels. Real data analysis will be used for illustration.
Yanqing Sun (Organizer)
Wenqing He
Yanqing Sun
Chiung-Yu Huang
Grace Yi
Parametric and semiparametric estimation methods for survival data under a flexible class of models
Wenqing He
Professor, Department of Statistical and Actuarial Sciences, University of Western Ontario
whe@stats.uwo.ca
In survival analysis, accelerated failure time models are useful in modeling the relationship between failure times and the associated covariates, where covariate effects are assumed to appear in a linear form in the model. Such an assumption of covariate effects is, however, quite restrictive for many practical problems. To accommodate flexible nonlinear relationships between covariates and transformed failure times, we propose partially linear single-index models. We develop two inference methods that handle the unknown nonlinear function in the model from different perspectives. The first approach is weakly parametric and approximates the nonlinear function globally, whereas the second is a semiparametric quasi-likelihood approach that focuses on picking up local features. We establish the asymptotic properties of the proposed methods. A real example is used to illustrate the usage of the proposed methods, and simulation studies are conducted to assess the performance of the proposed methods for a broad variety of situations.
Regression analysis of semiparametric Cox-Aalen transformation models with partly interval-censored data
Yanqing Sun
University of North Carolina at Charlotte, USA
yasun@charlotte.edu
Partly interval-censored data, comprising exact and interval-censored observations, are prevalent in biomedical, clinical, and epidemiological studies. This paper studies a flexible class of semiparametric Cox-Aalen transformation models for partly interval-censored data. The model offers greater flexibility and has the potential to enhance statistical power. It extends the semiparametric transformation models by allowing potentially time-dependent covariates to work additively on the baseline hazard and extends the Cox-Aalen model through a transformation function. We construct a set of estimating equations and propose an Expectation-Solving (ES) algorithm to facilitate efficient computation. The variance estimators are computed using the weighted bootstrap via the ES algorithm. The proposed ES algorithm is an extension of the Expectation-Maximization (EM) algorithm that can handle general estimating equations beyond those derived from the log-likelihood. The proposed estimators are shown to be consistent and asymptotically normal based on empirical process theory. Simulation studies show that the proposed methods work well. The proposed method is applied to analyze data from a randomized HIV/AIDS trial.
Improved semiparametric estimation of the proportional rate model with recurrent event data
Chiung-Yu Huang
Professor, Department of Epidemiology & Biostatistics
Director, Department of Surgery Biostatistics Research Core
University of California, San Francisco
E-mail: ChiungYu.Huang@ucsf.edu
Owing to its robustness properties, marginal interpretations, and ease of implementation, the pseudo-partial likelihood method proposed in the seminal papers of Pepe and Cai and Lin et al. has become the default approach for analyzing recurrent event data with Cox-type proportional rate models. However, the construction of the pseudo-partial score function ignores the dependency among recurrent events and thus can be inefficient. An attempt to investigate the asymptotic efficiency of weighted pseudo-partial likelihood estimation found that the optimal weight function involves the unknown variance–covariance process of the recurrent event process and may not have a closed-form expression. Thus, instead of deriving the optimal weights, we propose to combine a system of pre-specified weighted pseudo-partial score equations via the generalized method of moments and empirical likelihood estimation. We show that a substantial efficiency gain can be easily achieved without imposing additional model assumptions. More importantly, the proposed estimation procedures can be implemented with existing software. Theoretical and numerical analyses show that the empirical likelihood estimator is more appealing than the generalized method of moments estimator when the sample size is sufficiently large. An analysis of readmission risk in colorectal cancer patients is presented to illustrate the proposed methodology.
Estimation and Variable Selection under the Function-on-Scalar Linear Model with Covariate Measurement Error
Grace Yi
University of Western Ontario
gyi5@uwo.ca
Function-on-scalar linear regression has been widely used to model the relationship between a functional response and multiple scalar covariates. Its utility is, however, challenged by the presence of measurement error, a ubiquitous feature in applications. Naively applying the usual function-on-scalar linear regression to error-contaminated data often yields biased inference results. Further, the estimation of model parameters is complicated by the presence of inactive variables, especially when handling data with a large dimension. Parsimonious and interpretable function-on-scalar linear regression models that can handle error-contaminated functional data are therefore in high demand. In this paper, we study this important problem and investigate measurement error effects. We propose a debiased loss function combined with a sparsity-inducing penalty function to simultaneously estimate functional coefficients and select salient predictors. An efficient computing algorithm is developed with tuning parameters determined by data-driven methods. Under mild conditions, the asymptotic properties of the proposed estimator are rigorously established, including estimation consistency, selection consistency, and limiting distributions. The finite sample performance of the proposed method is assessed through extensive simulation studies, and the usage of the proposed method is illustrated by a real data application.
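To fix ideas, here is a scalar-response analogue of a measurement-error-corrected ("debiased") lasso loss fitted by proximal gradient descent. The paper's setting is functional and its penalty differs, so everything below (simulated data, a known error covariance Sigma_u, tuning values) is a simplified assumption rather than the proposed estimator.

# Scalar-response sketch of a measurement-error-corrected lasso: the naive
# squared loss on error-prone covariates W is corrected by subtracting
# beta' Sigma_u beta before adding an l1 penalty.
import numpy as np

rng = np.random.default_rng(3)
n, p = 400, 10
X = rng.normal(size=(n, p))                      # true covariates (unobserved)
beta_true = np.zeros(p); beta_true[:3] = [1.5, -1.0, 0.5]
y = X @ beta_true + rng.normal(scale=0.5, size=n)

sigma_u = 0.3
W = X + rng.normal(scale=sigma_u, size=(n, p))   # error-contaminated covariates
Sigma_u = sigma_u**2 * np.eye(p)                 # measurement-error covariance, assumed known here

lam, beta = 0.05, np.zeros(p)
H = W.T @ W / n - Sigma_u                        # bias-corrected Gram matrix
step = 1.0 / np.linalg.norm(H, 2)
for _ in range(500):
    grad = H @ beta - W.T @ y / n                # gradient of the corrected quadratic loss
    beta = beta - step * grad
    beta = np.sign(beta) * np.maximum(np.abs(beta) - step * lam, 0.0)  # soft-threshold (l1 prox)

print(np.round(beta, 2))                         # close to the sparse beta_true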
Tony Sun (Organizer)
Jian Huang
Chengchun Shi
Yifan Cui
Qixian Zhong
Conditional Generative Learning
Jian Huang
Department of Applied Mathematics, The Hong Kong Polytechnic University
j.huang@polyu.edu.hk
Conditional distribution is a fundamental quantity in statistics and machine learning that provides a full description of the relationship between a response variable and a predictor. In this talk, we present and compare two generative approaches for learning a conditional distribution: (a) a generative adversarial approach that learns a conditional distribution by estimating a conditional generator, and (b) a stochastic interpolation approach that learns a drift function and a score function and then uses differential equations for conditional sample generation. We conduct numerical experiments to validate the proposed methods and use several benchmark datasets to illustrate their applications in conditional sample generation, prediction, image reconstruction, and protein sequence generation.
Fiducial inference in survival analysis
Yifan Cui
Zhejiang University
Censored data, where the event time is partially observed, are challenging for survival probability estimation. In this paper, we introduce a novel nonparametric fiducial approach to interval-censored data, including right-censored, current status, case II censored, and mixed case censored data. The proposed approach, which leverages a simple Gibbs sampler, has the useful property of being "one size fits all", i.e., it automatically adapts to all types of non-informative censoring mechanisms. As shown in extensive simulations, the proposed fiducial confidence intervals significantly outperform existing methods in terms of both coverage and length. In addition, the proposed fiducial point estimator has much smaller estimation errors than the nonparametric maximum likelihood estimator.
Hypothesis Testing for the Deep Cox Model
Qixian Zhong (Xiamen University), Jonas Mueller (Cleanlab), Jane-Ling Wang (UC Davis)
qxzhong@xmu.edu.cn
Deep learning has become enormously popular in the analysis of complex data, including event time measurements with censoring. To date, deep survival methods have mainly focused on prediction. Such methods are scarcely used in matters of statistical inference such as hypothesis testing. Due to their black-box nature, deep-learned outcomes lack interpretability, which limits their use for decision-making in biomedical applications. This paper provides estimation and inference methods for the nonparametric Cox model, a flexible family of models with a nonparametric link function that avoids model misspecification. Here we assume the nonparametric link function is modeled via a deep neural network. To perform statistical inference, we split the data into an estimation set and a set for statistical inference. This inference procedure enables us to propose a new significance test to examine the association of certain covariates with event times. We establish convergence rates of the neural network estimator, and show that deep learning can overcome the curse of dimensionality in nonparametric regression by learning to exploit low-dimensional structures underlying the data. In addition, we show that our test statistic converges to a normal distribution under the null hypothesis and establish its consistency, in terms of the Type II error, under the alternative hypothesis. Numerical studies demonstrate the usefulness of the proposed test.
Yufeng Liu (Organizer)
Eric Chi
Ali Shojaie
Zhengyuan Zhu
Yuying Xie
Sparse Single Index Models for Multivariate Responses
Eric Chi, Ph.D.
Associate Professor
echi@rice.edu
Joint models are popular for analyzing data with multivariate responses. We propose a sparse multivariate single index model, where responses and predictors are linked by unspecified smooth functions and multiple matrix level penalties are employed to select predictors and induce low-rank structures across responses. An alternating direction method of multipliers (ADMM) based algorithm is proposed for model estimation. We demonstrate the effectiveness of the proposed model in simulation studies and an application to a genetic association study.
Learning causal effects of multiple covariates on multiple outcomes in high dimensions
Ali Shojaie
Professor of Biostatistics and Statistics (adjunct), Associate Chair, Department of Biostatistics, University of Washington
ashojaie@uw.edu
We consider the problem of learning causal effects of multiple covariates on multiple outcomes. The problem is cast as a special instance of learning directed acyclic graphs (DAGs) from partial or set orderings. We show that unlike the simpler problem of learning DAGs from full causal orderings, DAG learning from partial orderings is computationally NP-hard. Building on recent developments for learning DAGs in high dimensions, we propose an efficient algorithm that learns the (direct) causal effects of covariates on outcomes by leveraging the partial ordering, and we illustrate the advantages of the proposed algorithm over general-purpose DAG learning algorithms.
Maximizing Benefits under Harm Constraints: A Generalized Linear Contextual Bandit Approach
Zhengyuan Zhu
Iowa State University
zhuz@iastate.edu
In many contextual sequential decision-making scenarios, such as dose-finding clinical trials for new drugs or personalized news article recommendation systems in social media, each action can simultaneously carry both benefits and potential harm. This could manifest as efficacy versus side effects in clinical trials, or increased user engagement versus the risk of radicalization and psychological distress in news recommendation. These multifaceted situations can be modeled using the multi-armed bandit (MAB) framework. Given the intricate balance of positive and negative outcomes in these contexts, there is a compelling need to develop methods which can maximize benefits while limiting harm within the MAB framework. This paper addresses this gap by proposing a novel generalized linear contextual MAB model that balances the objectives of optimizing reward potential while limiting harm, and by developing an ε-greedy-based policy that achieves sublinear regret. Extensive experimental results are presented to support our theoretical analyses and validate the effectiveness of the proposed model and policy.
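A toy sketch of the kind of policy described above: an ε-greedy rule over per-arm logistic reward and harm models, where arms whose estimated harm probability exceeds a cap are screened out before the greedy choice. The parameter values and the screening rule are illustrative assumptions and do not reproduce the paper's algorithm or its regret guarantee.

# Epsilon-greedy linear-logistic contextual bandit with a simple harm screen.
import numpy as np

rng = np.random.default_rng(4)
d, K, T = 5, 3, 5000
eps, harm_cap, lr = 0.1, 0.3, 0.05
theta_r_true = rng.normal(size=(K, d))           # true reward parameters per arm
theta_h_true = rng.normal(size=(K, d))           # true harm parameters per arm
theta_r = np.zeros((K, d)); theta_h = np.zeros((K, d))

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
total_reward = 0.0
for t in range(T):
    x = rng.normal(size=d)
    if rng.random() < eps:                       # exploration step
        a = rng.integers(K)
    else:                                        # greedy among arms estimated to be safe enough
        est_r, est_h = sigmoid(theta_r @ x), sigmoid(theta_h @ x)
        safe = np.where(est_h <= harm_cap)[0]
        a = safe[np.argmax(est_r[safe])] if safe.size else np.argmin(est_h)
    reward = rng.random() < sigmoid(theta_r_true[a] @ x)
    harm = rng.random() < sigmoid(theta_h_true[a] @ x)
    total_reward += reward
    # online logistic-regression updates for the chosen arm only
    theta_r[a] += lr * (reward - sigmoid(theta_r[a] @ x)) * x
    theta_h[a] += lr * (harm - sigmoid(theta_h[a] @ x)) * x

print("average reward:", total_reward / T)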
Clustering and visualization of single-cell RNA-seq data using path metrics
Yuying Xie
Michigan State University
xyy@msu.edu
Recent advances in single-cell technologies have enabled high-resolution characterization of tissue and cancer compositions. Although numerous tools for dimension reduction and clustering are available for single-cell data analyses, these methods often fail to simultaneously preserve local cluster structure and global data geometry. To address these challenges, we developed a novel analysis framework, Single-Cell Path Metrics Profiling (scPMP), using power-weighted path metrics, which measure distances between cells in a data-driven way. Unlike Euclidean distance and other commonly used distance metrics, path metrics are density sensitive and respect the underlying data geometry. By combining path metrics with multidimensional scaling, a low-dimensional embedding of the data is obtained which preserves both the global data geometry and cluster structure. We evaluate the method for both clustering quality and geometric fidelity, and it outperforms current scRNA-seq clustering algorithms on a wide range of benchmark data sets.
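A rough sketch of the path-metric idea on synthetic data: raise pairwise Euclidean distances to a power, take shortest paths through the data, then embed with MDS and cluster with k-means. This simplification (dense graph, fixed power, toy Gaussian clusters) is for illustration and is not the released scPMP pipeline.

# Power-weighted path-metric clustering on simulated clusters.
import numpy as np
from scipy.spatial import distance_matrix
from scipy.sparse.csgraph import shortest_path
from sklearn.manifold import MDS
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (0.0, 4.0, 8.0)])

p = 2.0                                           # power weighting; p > 1 emphasizes density
D_path = shortest_path(distance_matrix(X, X) ** p, directed=False)
D_path = D_path ** (1.0 / p)                      # back to the original distance scale

embedding = MDS(n_components=2, dissimilarity="precomputed",
                random_state=0).fit_transform(D_path)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embedding)
print(np.bincount(labels))                        # roughly 100 points per recovered cluster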
Virginie Rondeau (Organizer)
Pedro Miranda Afonso
Cécile Proust-Lima
Denis Rustand
Manel Rakez
A fast approach to analyzing large datasets with joint models for longitudinal and time-to-event outcomes
Pedro Miranda Afonso
PhD Candidate, Department of Epidemiology, Erasmus MC
p.mirandaafonso@erasmusmc.nl
The joint modeling of longitudinal and time-to-event outcomes has become a popular tool in follow-up studies. However, fitting Bayesian joint models to large datasets, such as patient registries, can require extended computing times. To speed up sampling, we divided a patient registry dataset into subsamples, analyzed them in parallel, and combined the resulting Markov chain Monte Carlo draws into a consensus distribution. We used a simulation study to investigate how different consensus strategies perform with joint models. In particular, we compared grouping all draws together with using equal- and precision-weighted averages. We considered scenarios reflecting different sample sizes, numbers of data splits, and processor characteristics. Parallelization of the sampling process substantially decreased the time required to run the model. We found that the weighted-average consensus distributions for large sample sizes were nearly identical to the target posterior distribution. The proposed algorithm has been made available in an R package for joint models, JMbayes2. This work was motivated by the clinical interest in investigating the association between ppFEV1, a commonly measured marker of lung function, and the risk of lung transplant or death, using data from the US Cystic Fibrosis Foundation Patient Registry (35,153 individuals with 372,366 years of cumulative follow-up). Splitting the registry into five subsamples resulted in an 85% decrease in computing time, from 9.22 to 1.39 hours. Splitting the data and finding a consensus distribution by precision-weighted averaging proved to be a computationally efficient and robust approach to handling large datasets under the joint modeling framework.
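A minimal sketch of the consensus step: draws from shard-level subposteriors are combined by precision-weighted averaging. The shard "posteriors" below are simulated normal draws purely to show the mechanics; in the work above the draws come from joint models fitted with JMbayes2.

# Consensus Monte Carlo by precision-weighted averaging of subposterior draws.
import numpy as np

rng = np.random.default_rng(6)
n_shards, n_draws = 5, 2000

# pretend each shard produced MCMC draws for a 2-dimensional parameter
shard_draws = [rng.multivariate_normal(mean=[0.5 + 0.05 * s, -1.0],
                                       cov=np.diag([0.04 + 0.01 * s, 0.09]),
                                       size=n_draws)
               for s in range(n_shards)]

# precision weights: inverse of each shard's empirical posterior covariance
weights = [np.linalg.inv(np.cov(d, rowvar=False)) for d in shard_draws]
total_precision_inv = np.linalg.inv(sum(weights))

# combine draw-by-draw into consensus draws
consensus = np.stack([
    total_precision_inv @ sum(w @ d[i] for w, d in zip(weights, shard_draws))
    for i in range(n_draws)
])
print(consensus.mean(axis=0), consensus.std(axis=0))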
Analysis of multivariate longitudinal and survival data: what about random forests?
Cécile Proust-Lima, Anthony Devaux, Corentin Segalas, Robin Genuer
Inserm, Univ. Bordeaux, Bordeaux Population Health Research Center, Bordeaux, France
cecile.proust-lima@inserm.fr
Health studies usually involve the collection of variables repeatedly measured over time. This includes exposures (e.g., treatment, blood pressure, nutrition), markers of progression (e.g., brain volumes, blood tests, cognitive functioning, tumor size) and times to clinical endpoints (e.g., death, diagnosis). Joint models for longitudinal and survival data are now widely used in biostatistics to analyze such longitudinal data and address a variety of etiological and predictive questions. However, they quickly encounter numerical limitations as the number of repeated variables substantially increases, making it challenging for them to address the in-depth medical research questions raised by the complex longitudinal information collected nowadays. In this talk, we tackle the challenge of the prediction of clinical endpoints from repeated measures of a large number of markers using another paradigm: random survival forests. Random survival forests constitute a flexible method of prediction that can handle high-dimensional predictors and capture complex relationships with the outcome to predict. However, they are limited to time-invariant predictors. We show how random survival forests (possibly with competing causes of events) can be extended to time-varying noisy predictors by incorporating a modeling step into the recursive tree building procedure. The performance of the methodology, implemented in the DynForest R package, is assessed in simulations both in a small-dimensional context (in comparison with joint models) and in a large-dimensional context (in comparison with a regression calibration method that ignores the informative missingness mechanism in the time-varying covariates). The methodology is also illustrated in dementia research to (i) predict the individual probability of dementia using multi-modal repeated information (e.g., cognition, brain structure), and (ii) quantify the relative importance of each type of marker.
Efficient Inference for Joint Models of Multivariate Longitudinal and Survival Data Using INLAjoint
Denis Rustand
Post-Doctoral fellow, Statistics Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology
denis.rustand@kaust.edu.sa
Clinical research often requires the simultaneous study of longitudinal and survival data. Joint models, which can combine these two types of data, are essential tools in this context. A joint model involves multiple regression submodels (one for each longitudinal/survival outcome) usually linked together through correlated or shared random effects. This makes their estimation process rather complex, time-consuming, and sometimes even unfeasible, especially when dealing with many outcomes. In this context, we introduce INLAjoint, a user-friendly and flexible R package designed to leverage the Integrated Nested Laplace Approximation (INLA) method from the INLA R package, renowned for its computational efficiency and speed (Rue et al., 2009). INLAjoint can handle various model formulations and simplifies the application of INLA to fit joint models, ensuring fast and accurate parameter estimation. Our simulation studies show that INLA reduces the computation time substantially compared to alternative strategies such as Bayesian inference via Markov chain Monte Carlo, without compromising accuracy (Rustand et al., 2023). A key application of joint models is the dynamic prediction of the risk of an event, such as death or disease progression, based on changes in the longitudinal outcome(s) over time. INLAjoint allows for the estimation of dynamic risk predictions and can incorporate changes in the longitudinal outcome(s) to update future risk predictions. This makes INLAjoint a valuable tool for analyzing complex health data.
Evolution of breast density over time and its impact on breast cancer diagnosis during screening
Rakez M.*, Guillaumin J., Chick A., Fillard P., Amadeo B., Rondeau V.
BIOSTAT Team, Bordeaux Population Health Center, ISPED, Centre INSERM, U1219, Bordeaux, France
manel.rakez@u-bordeaux.fr
Breast cancer (BC) is the leading cause of cancer death in women worldwide. Mammography-based screening programs reduce BC mortality by promoting earlier detection. Mammographic sensitivity depends on breast density (BD), which changes over time, affecting the risk of a BC diagnosis. Women with high BD are more likely to develop BC, and their mammographic sensitivity is reduced. Thus, to better understand the impact of temporal BD changes on the risk of BC diagnosis during screening, we propose a new methodology to predict BC risk that accounts for deep-learning assessment of sequential BD. From the sequential and complete mammography exams of 131,209 women participating in the BC screening program, the percent density (PD), a quantitative estimate of a woman’s BD at each visit, is obtained using MammoDL. This segmentation model comprises two successive modified U-Nets, the first identifying the breast from the entire mammogram and the second delineating the dense tissue region within the breast. A ResNet-34 replaces the U-Net encoder to alleviate training challenges. In addition, the model is fine-tuned to extend its use to processed images from the GE and Hologic vendors. Then, a joint model for a linear biomarker and a time-to-event outcome is implemented using the consensus Monte Carlo algorithm. First, the temporal trajectory of the PD is described using a linear mixed-effects model, adjusted for factors affecting BD, such as age. This sub-model is flexible in dealing with irregular intervals between screening visits and outcome-dependent drop-out. Second, the individual and dynamic prediction of BC diagnosis is estimated conditionally on the biomarker’s intermediate longitudinal measurements and is defined over the screening period. This probability is derived for each woman and is dynamically updated as PD measurements accumulate. We propose a reproducible method to estimate the temporal evolution of BD and its impact on BC diagnosis. The segmentation model gives a quantitative estimate of BD at each screening visit. The joint model uses the biomarker’s repeated measurements to dynamically update the BC diagnosis prediction throughout the screening period.
Jessica Barrett (Organizer)
Hein Putter
Dimitris Rizopoulos
Danilo Alvares
Liang Li
Dynamic prediction with many biomarkers: combining landmarking 2.0 with multivariate Functional Principal Component Analysis
Hein Putter
Department of Biomedical Data Sciences, Leiden University Medical Center
H.Putter@lumc.nl
Predicting patient survival based on longitudinal biomarker measurements poses a common statistical challenge. In many cases, the volume of longitudinal data exceeds what can be practically managed with a joint model. For this reason, numerous methods employ a multi-step landmarking approach, where the longitudinal data up to a specific landmark time is summarized and used in a subsequent landmark model. In our previous research, we utilized multivariate Functional Principal Component Analysis (mFPCA) to summarize the available longitudinal data and used a proportional hazards landmark model for prediction. We demonstrated the effectiveness of a "strict" landmarking approach, where only the information preceding the landmark is utilized. In this presentation, we explore the advancements in landmarking 2.0 to further improve on this approach. Our approach involves using the mFPCA results up to the landmark to forecast the progression of the longitudinal biomarkers from the landmark time onward until the prediction horizon. We then fit a time-dependent Cox model, incorporating these predictable time-dependent covariates as the foundation for a landmark model. We demonstrate the utility of this method for dynamic prediction, assess its performance through simulation studies, and illustrate it in real data.
Optimizing Dynamic Predictions from Joint Models Using Super Learning
Dimitris Rizopoulos
Erasmus University Rotterdam
d.rizopoulos@erasmusmc.nl
Joint models for longitudinal and time-to-event data are often employed to calculate dynamic individualized predictions used in numerous applications of precision medicine. Two components of joint models that influence the accuracy of these predictions are the shape of the longitudinal trajectories and the functional form linking the longitudinal outcome history to the hazard of the event. Finding a single well-specified model that produces accurate predictions for all subjects and follow-up times can be challenging, especially when considering multiple longitudinal outcomes. In this work, we use the concept of super learning and avoid selecting a single model. In particular, we specify a weighted combination of the dynamic predictions calculated from a library of joint models with different specifications. The weights are selected to optimize a predictive accuracy metric using V-fold cross-validation. We use as predictive accuracy measures the expected quadratic prediction error and the expected predictive cross-entropy. In a simulation study, we found that the super learning approach produces results similar to the Oracle model, which performed best in the test datasets. All proposed methodology is implemented in the freely available package JMbayes2.
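The weight-optimization step can be sketched as a constrained least-squares problem: given cross-validated predicted event probabilities from several candidate models, find convex weights minimizing a Brier-type criterion. The prediction matrix and outcome below are simulated stand-ins for the cross-validated joint-model predictions used in the work above.

# Super-learning weights on the simplex, chosen to minimize a Brier score.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
n, M = 500, 4
true_p = rng.uniform(0.05, 0.9, size=n)
y = rng.binomial(1, true_p)
# columns: cross-validated predictions from M hypothetical model specifications
preds = np.column_stack([np.clip(true_p + rng.normal(0, s, n), 0.01, 0.99)
                         for s in (0.05, 0.10, 0.20, 0.30)])

def brier(w):
    return np.mean((y - preds @ w) ** 2)

cons = ({"type": "eq", "fun": lambda w: w.sum() - 1.0},)
res = minimize(brier, x0=np.full(M, 1.0 / M), bounds=[(0.0, 1.0)] * M,
               constraints=cons, method="SLSQP")
print(np.round(res.x, 3))   # weights concentrate on the less noisy specifications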
A two-stage approach for Bayesian joint modelling of competing risks and multiple longitudinal outcomes
Danilo Alvares
MRC Biostatistics Unit, University of Cambridge, UK
danilo.alvares@mrc-bsu.cam.ac.uk
Recent trends in personalised healthcare have motivated great interest in the dynamic prediction of survival and other clinically important events by using baseline characteristics and the evolving history of disease progression. Some of the methodological developments were motivated by case studies in multiple myeloma (a type of bone marrow cancer), where progression is assessed by several biomarker trajectories, and patients may experience multiple regimen changes over time. To understand the dynamic interplay between biomarkers and their connections to the survival process, a two-stage Bayesian joint model is developed for competing risks and multiple longitudinal outcomes. The proposal is applied to an observational study from the US nationwide Flatiron health electronic health record (EHR)-derived de-identified database, where patients diagnosed with multiple myeloma from January 2015 to February 2022 were selected. The data is split into training and test sets in order to assess the performance of the proposal in making dynamic predictions of times to events of interest (time to next line of therapy or time to death) using baseline variables and longitudinally measured biomarkers available up to the time of prediction. Residuals validated the robustness of the model, and the calibration supported its good predictive accuracy.
Backward Joint Model for the Dynamic Prediction of Multivariate Longitudinal and Survival Outcomes
Liang Li
MD Anderson Cancer Center
LLi15@mdanderson.org
Joint modeling is an important approach to dynamic prediction of clinical outcomes using longitudinally measured predictors, such as biomarkers. We consider the situation where the predictors include baseline covariates and the longitudinal trajectories of many correlated biomarkers, measured asynchronously at irregularly spaced time points. The outcomes of predictive interest include both the terminal clinical event with or without competing risks, and the future longitudinal biomarker trajectories if the terminal or competing risk events do not occur. We propose a novel backward joint model (BJM) to solve this problem. The BJM can be flexibly specified to optimize the prediction accuracy. Its likelihood-based estimation algorithm is robust, fast, and stable regardless of the dimension of longitudinal biomarkers. We illustrate the BJM methodology with simulations and a real dataset from the African American Study of Kidney Disease and Hypertension.
Wolfgang Trutschnig & Sebastian Fuchs (Organizers)
Damjana Kokol Bukovsek
Nik Stopar
Jonathan Ansari
Patrick Langthaler
Exact upper bound for bivariate copulas with a given diagonal section
Damjana Kokol Bukovsek
University of Ljubljana
damjana.kokol.bukovsek@ef.uni-lj.si
For any bivariate copula or quasi-copula C: I^2→I, its diagonal section δ_C: I→I, defined by δ_C(x) = C(x,x), is increasing, 2-Lipschitz, and satisfies δ_C(1) = 1 and δ_C(x) ≤ x for all x∈I. Conversely, given any function δ: I→I that is increasing, 2-Lipschitz, and satisfies δ(1) = 1 and δ(x) ≤ x for all x∈I, there exists a copula such that δ is its diagonal section. Given a function δ with these properties, the exact (pointwise) lower bound of all bivariate copulas with diagonal section δ is known; it is the Bertino copula. The same function is also the lower bound for quasi-copulas. Furthermore, the exact upper bound of all bivariate quasi-copulas with diagonal section δ is also known. The main goal of this talk is to answer the question of the exact upper bound for bivariate copulas with a given diagonal section δ by giving an explicit formula for this bound. We achieve this by constructing a new copula with prescribed diagonal section, which attains the bound on the entire upper-left triangle of the unit square. We also answer the question for which diagonal sections this exact bound is a copula. As an application of our main result, we determine the maximal asymmetry of bivariate copulas with a given diagonal section and construct a copula that attains it. This is joint work with Blaž Mojškerc (University of Ljubljana) and Nik Stopar (University of Ljubljana).
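For reference, the known pointwise lower bound mentioned above, the Bertino copula, is commonly written as follows (the talk's new upper bound is its own contribution and is not reproduced here); in LaTeX notation:

B_\delta(u,v) \;=\; \min(u,v) \;-\; \min_{t \in [\min(u,v),\, \max(u,v)]} \bigl(t - \delta(t)\bigr).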
Infima and Suprema of Multivariate Cumulative Distribution Functions
Nik Stopar
University of Ljubljana, Slovenia
nik.stopar@fgg.uni-lj.si
A multivariate probability box is a set of cumulative distribution functions bounded point-wise by two standardized functions. It can be used to model imprecision in the knowledge about the true joint distribution function of a random vector. A probability box is coherent if its bounding functions are equal to the point-wise infimum and supremum of the distribution functions contained in the box. This is the main motivation for investigating infima and suprema of sets of multivariate distribution functions. A coherent probability box can be constructed by composing several univariate (marginal) probability boxes with a coherent imprecise copula (i.e., a coherent box of copulas bounded by two quasi-copulas). In this talk we discuss the question whether any coherent probability box can be obtained in such a way. If we are only interested in multivariate distributions with fixed marginals (i.e., each marginal probability box contains a single distribution function), then the answer is positive as a consequence of Sklar’s theorem. On the other hand, the answer is negative for more general probability boxes if we insist on the standard way of representing a multivariate cumulative distribution function with a copula. Nevertheless, we show that with a slightly modified representation a positive answer can be achieved under a mild condition on the probability boxes. In particular, we demonstrate how the point-wise infimum and supremum of a family of multivariate distribution functions can be represented with copulas that correspond to the members of the family.
A model-free multi-output variable selection
Dr. Jonathan Ansari
Postdoc, Lead of IDA Lab Team Applied Statistics, Department for Artificial Intelligence and Human Interfaces, Hellbrunnerstrasse 34, 5020 Salzburg, Austria
E-Mail: jonathan.ansari@plus.ac.at
As a direct extension of Azadkia & Chatterjee's rank correlation T to a set of q outcome variables, the novel measure T^q, introduced and investigated in Ansari & Fuchs, quantifies the scale-invariant extent of functional dependence of a multi-output vector Y = (Y_1, ..., Y_q) on p input variables X = (X_1, ..., X_p) and fulfils all the desired characteristics of a measure of predictability, namely 0 ≤ T^q(Y|X) ≤ 1, T^q(Y|X) = 0 if and only if Y and X are independent, and T^q(Y|X) = 1 if and only if Y is perfectly dependent on X. Based on various useful properties of T^q(Y|X), a model-free and dependence-based feature ranking and forward feature selection of data with multiple output variables is presented, thus facilitating the selection of the most relevant explanatory variables.
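In the simplest case q = p = 1 (and no ties), the building block behind T^q is Chatterjee's rank correlation, which can be computed as in the short sketch below; the data are simulated and the function name is ours, not from the talk.

# Chatterjee's xi_n for a single (X, Y) pair without ties:
# xi_n = 1 - 3 * sum |r_{i+1} - r_i| / (n^2 - 1), with Y-ranks taken after sorting by X.
import numpy as np

def chatterjee_xi(x, y):
    order = np.argsort(x)
    r = np.argsort(np.argsort(y[order])) + 1      # ranks of Y in the X-sorted sample
    n = len(x)
    return 1.0 - 3.0 * np.abs(np.diff(r)).sum() / (n**2 - 1)

rng = np.random.default_rng(8)
x = rng.normal(size=2000)
print(round(chatterjee_xi(x, x**2), 2))                   # near 1: Y is a function of X
print(round(chatterjee_xi(x, rng.normal(size=2000)), 2))  # near 0: independence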
Quantifying and estimating dependence via sensitivity of conditional distributions
Jonathan Ansari (1,a), Patrick B. Langthaler (1,2,b), Sebastian Fuchs (1,c), and Wolfgang Trutschnig (1,d)
1 Department of Artificial Intelligence and Human Interfaces, University of Salzburg, Austria
2 Department of Neurology, Christian Doppler Klinik, Paracelsus Medical University, Salzburg, Austria
a jonathan.ansari@plus.ac.at
b patrickbenjamin.langthaler@stud.plus.ac.at
c sebastian.fuchs@plus.ac.at
d wolfgang@trutschnig.net
Recently established directed dependence measures for pairs (X, Y) of random variables build upon the natural idea of comparing the conditional distributions of Y given X = x with the marginal distribution of Y. They assign pairs (X, Y) values in [0, 1]; the value is 0 if and only if X and Y are independent, and it is 1 exclusively when Y is a function of X. We show that comparing randomly drawn conditional distributions with each other instead, or, equivalently, analyzing how sensitive the conditional distribution of Y given X = x is to x, opens the door to constructing novel families of dependence measures Λφ induced by general convex functions φ: R → R, containing, e.g., Chatterjee's coefficient of correlation as a special case. After establishing additional useful properties of Λφ, we focus on continuous (X, Y), translate Λφ to the copula setting, consider the L^p version, and establish an estimator which is strongly consistent in full generality. A real data example and a simulation study illustrate the chosen approach and the performance of the estimator. Complementing the aforementioned results, we show how a slight modification of the construction underlying Λφ can be used to define new measures of explainability generalizing the fraction of explained variance.
Georg Zimmermann (Organizer)
Geert Molenberghs
Frank Konietschke
Somnath Datta
Kelly van Lancker
A Broad Framework for Likelihood Alternatives, in View of Small, Very Large, and Variable-Size Studies with Multivariate and/or Repeated Measures
Geert Molenberghs
Interuniversity Institute for Biostatistics and statistical Bioinformatics, (1) Hasselt University, Diepenbeek, Belgium, (2) KU Leuven, Belgium
geert.molenberghs@uhasselt.be
We consider a number of data settings where the use of standard maximum likelihood or other estimation methods is complicated for a number of reasons: data structures are complex, there are very large data streams, or conversely very small trials (as in orphan diseases), or there are non-standard design features (sequential trials, missing data, clustered data with variable cluster size, etc.). Specific challenges arise when data are multivariate and/or longitudinal. The use of alternatives to maximum likelihood is explored, with particular emphasis on pseudo-likelihood, split-sample methods, and even closed-form estimators in settings where one would not expect them. Specific attention is devoted to the computational feasibility of the proposed methods. We pay particular attention to the existence of closed forms in our modified procedures. All settings are illustrated using real-life examples.
Statistical Planning and Evaluation of Translational Trials
Frank Konietschke
Institute of Biometry and Clinical Epidemiology, Charité Universitätsmedizin Berlin
Frank.Konietschke@charite.de
Any trial should start with careful planning, in particular with regard to sample size and power considerations. The planning phase of an experiment is key, since errors in the statistical planning can have severe consequences for both the results and the conclusions drawn from the data. In translational research (preclinical and early clinical), false conclusions strongly affect subsequent trials and thus mistakes proliferate, a rather unethical outcome. In statistical practice, most studies are planned based on t-tests and Wald-type statistics (including ANOVA), which make rather strict distributional assumptions. Sample sizes are typically small, and if the planning assumptions are not met, the trials are either underpowered or too large, lead to wrong conclusions, waste resources, and might even be misleading. On the other hand, nonparametric ranking methods (such as the Wilcoxon-Mann-Whitney test, the Brunner-Munzel test, multiple contrast tests, and their generalizations) are excellent alternatives to such parametric approaches. However, sample size formulas as well as detailed power analyses are yet to be implemented for broad classes of such tests. In this talk, we discuss statistical planning and evaluation methods for translational trials. Real data sets illustrate the methods.
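As a small illustration of simulation-based planning for such rank tests, the sketch below estimates the power of a Wilcoxon-Mann-Whitney comparison under an assumed skewed alternative; the sample size, effect, and distributions are made-up placeholders to be replaced by design-specific values.

# Simulation-based power estimate for a Wilcoxon-Mann-Whitney two-sample test.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(9)
n_per_group, n_sim, alpha = 20, 2000, 0.05

rejections = 0
for _ in range(n_sim):
    control = rng.lognormal(mean=0.0, sigma=1.0, size=n_per_group)   # skewed outcome
    treated = rng.lognormal(mean=0.8, sigma=1.0, size=n_per_group)   # shifted on the log scale
    if mannwhitneyu(control, treated, alternative="two-sided").pvalue < alpha:
        rejections += 1

print("estimated power:", rejections / n_sim)     # increase n_per_group until power is adequate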
Specialized Statistical Analyses of Iowa Fluoride Study Data
Somnath Datta
University of Florida
somnath.datta@ufl.edu
We present both Bayesian and frequentist analyses for longitudinal data that are clustered and non-continuous (more specifically, count and ordinal) and exhibit zero-inflation patterns. The goal is to undertake a comprehensive and unified statistical examination of the total accumulation of dental caries and fluorosis data obtained from Iowa Fluoride Study participants. More specifically, we fit longitudinal statistical models to caries and fluorosis scores obtained at ages five, nine, thirteen, seventeen, and twenty-three for the participants in this cohort study of Iowa children. The ultimate goal is to study the time-varying (in particular, long-term) and joint effects of various risk and protective factors for dental caries and fluorosis outcomes.
Ensuring valid inference for Cox hazard ratios after variable selection
Kelly Van Lancker
Ghent University
Kelly.VanLancker@ugent.be
The problem of how to best select variables for confounding adjustment forms one of the key challenges in the evaluation of exposure effects in observational studies, and has been the subject of vigorous recent activity in causal inference. A major drawback of routine procedures is that there is no finite sample size at which they are guaranteed to deliver exposure effect estimators and associated confidence intervals with adequate performance. In this work, we will consider this problem when inferring conditional causal hazard ratios from observational studies under the assumption of no unmeasured confounding. The major complication that we face with survival data is that the key confounding variables may not be those that explain the censoring mechanism. In this presentation, we overcome this problem using a novel and simple procedure that can be implemented using off-the-shelf software for penalized Cox regression. In particular, we will propose tests of the null hypothesis that the exposure has no effect on the considered survival endpoint, which are uniformly valid under standard sparsity conditions. Simulation results show that the proposed methods yield valid inferences even when covariates are high-dimensional.
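As a rough illustration of the off-the-shelf building block referred to above (penalized Cox regression; this is only a sketch, not the authors' complete, uniformly valid testing procedure), using the lifelines package in Python:

    import pandas as pd
    from lifelines import CoxPHFitter

    # Hypothetical data: one row per subject with columns 'time', 'event',
    # 'exposure', and high-dimensional covariates X1, ..., Xp.
    df = pd.read_csv("observational_survival_data.csv")  # placeholder path

    # Lasso-type penalized Cox fit, shown here purely to illustrate penalized
    # confounder selection; the talk's procedure builds a valid test on top.
    cph = CoxPHFitter(penalizer=0.1, l1_ratio=1.0)
    cph.fit(df, duration_col="time", event_col="event")

    # Crude cutoff for 'selected' covariates (this elastic-net implementation
    # does not guarantee exact zeros).
    selected = cph.params_[cph.params_.abs() > 0.01].index.tolist()
    print("covariates retained by the penalized fit:", selected)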
Simon Hirländer (Organizer)
Danyu Lin
Richard Cook
Limin Peng
Lei Liu
Evaluating Treatment Efficacy in Hospitalized Covid-19 Patients
Danyu Lin, Ph.D.
Department of Biostatistics, University of North Carolina
lin@bios.unc.edu
The clinical status of a COVID-19 patient is typically rated on a 7- or 8-point ordinal scale, ranging from resumption of normal activities to death, and the clinical status of a hospitalized COVID-19 patient may improve or deteriorate by different levels over the course of a clinical trial. The efficacy endpoints that have been used in clinical trials of hospitalized COVID-19 patients are the time to a specific change in clinical status or the clinical status on a particular day. For example, in the Adaptive COVID-19 Treatment Trials (ACTTs) and the Accelerating COVID-19 Therapeutic Interventions and Vaccines (ACTIV)-1 trial, the primary endpoints were time to recovery, and the secondary endpoints included 28-day mortality and clinical status at day 15 or day 28. However, these endpoints do not fully represent important clinical outcomes or make efficient use of available data. In this talk, I will present several methods that comprehensively characterize the treatment effects on the entire clinical course of a hospitalized COVID-19 patient and illustrate the advantages of these methods with the ACTT-1, ACTT-2, ACTT-3, and ACTIV-1 data.
A New Joint Model for Recurrent and Terminal Events
Jianchu Chen and Richard Cook*
University of Waterloo
Recurrent events arise in many chronic diseases and offer a meaningful basis for assessing treatment effects in clinical trials. Complications arise when the chronic disease has a non-negligible mortality rate, since the recurrent event process is then terminated by death. We introduce a new copula-based framework for joint modeling of the recurrent and terminal event processes which, unlike common joint models with correlated or shared random effects, yields an estimate of the hazard ratio based on the standard Cox model. The treatment effect on the recurrent event process is expressed conditionally on a frailty, and a copula function links this frailty with the terminal event time. Complete intensity functions are derived for the two processes to gain insight into the properties of the model. Semiparametric models are fitted by an expectation-maximization algorithm based on simultaneous or two-stage estimation. Adaptations to deal with intermittent observation of the recurrent event process are also developed.
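A schematic version of such a copula-linked specification (our sketch of the general construction, not the authors' exact model) is: given a frailty \(U_i\), the recurrent event intensity is \(\lambda_i(t \mid U_i) = U_i\,\rho_0(t)\exp(\beta^\top Z_i)\); the terminal event time \(D_i\) follows a standard Cox model with hazard \(h_0(t)\exp(\gamma^\top Z_i)\); and a copula \(C_\theta\) joins the distribution functions of \(U_i\) and \(D_i\), so that \(\theta\) captures the association between the two processes while \(\gamma\) retains its marginal hazard-ratio interpretation.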
Dynamic regression of longitudinal trajectory features
Limin Peng
Emory University
lpeng@emory.edu
Chronic disease studies often collect data on biological and clinical markers at follow-up visits to monitor disease progression. Viewing such longitudinal measurements as governed by latent continuous trajectories, we develop a new dynamic regression framework to investigate the heterogeneity pattern of certain features of the latent individual trajectory that may carry substantive information on disease risk or status. Employing the strategy of multi-level modeling, we formulate the latent individual trajectory feature of interest through a flexible pseudo B-spline model with subject-specific random parameters, and then link it with the observed covariates through quantile regression, avoiding the restrictive parametric distributional assumptions that are typically required by standard multi-level longitudinal models. We propose an estimation procedure by adapting the principle of conditional score and develop an efficient algorithm for its implementation. Our proposals yield estimators with desirable asymptotic properties as well as good finite-sample performance, as confirmed by extensive simulation studies. An application of the proposed method to a cohort of participants with mild cognitive impairment (MCI) in the Uniform Data Set (UDS) provides useful insights into the complex heterogeneous presentations of cognitive decline in MCI patients.
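Schematically (a sketch of the two-level idea with illustrative notation, not the exact model of the talk), the observed markers and the trajectory feature can be written as
\[
Y_{ij} \;=\; \sum_{k=1}^{K} b_{ik}\,B_k(t_{ij}) + \epsilon_{ij}, \qquad \theta_i \;=\; g(b_{i1},\dots,b_{iK}),
\]
where the \(B_k\) are spline basis functions, \(\theta_i\) is the latent trajectory feature of interest (e.g., a rate of decline), and its conditional quantiles are linked to covariates via \(Q_{\theta_i}(\tau \mid X_i) = X_i^\top\beta(\tau)\), without imposing a parametric distribution on the subject-specific coefficients.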
Deep Learning Models to Predict Primary Open-Angle Glaucoma
Lei Liu
Washington University in St. Louis
Lei.liu@wustl.edu
Glaucoma is a major cause of blindness and vision impairment worldwide, and visual field (VF) tests are essential for monitoring conversion to glaucoma. While previous studies have primarily focused on using VF data at a single time point for glaucoma prediction, there has been limited exploration of longitudinal trajectories. Additionally, many deep learning techniques treat time-to-glaucoma prediction as a binary classification problem (glaucoma yes/no), resulting in the misclassification of some censored subjects into the non-glaucoma category and decreased power. To tackle these challenges, we propose and implement several deep-learning approaches that naturally incorporate temporal and spatial information from longitudinal visual field data to predict time to glaucoma. When evaluated on the Ocular Hypertension Treatment Study (OHTS) dataset, our proposed CNN-LSTM emerged as the top-performing model among all those examined.
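A minimal CNN-LSTM sketch for sequences of visual-field maps (a generic architecture with assumed input dimensions, not the study's actual model):

    import torch
    import torch.nn as nn

    class CNNLSTM(nn.Module):
        """Encode each VF map with a small CNN, model the visit sequence
        with an LSTM, and output one risk score per subject."""
        def __init__(self, hidden=64):
            super().__init__()
            self.cnn = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
            )
            self.lstm = nn.LSTM(input_size=32, hidden_size=hidden, batch_first=True)
            self.head = nn.Linear(hidden, 1)

        def forward(self, x):                                # x: (batch, visits, 1, H, W)
            b, v = x.shape[:2]
            feats = self.cnn(x.flatten(0, 1)).flatten(1)     # (batch*visits, 32)
            out, _ = self.lstm(feats.view(b, v, -1))         # (batch, visits, hidden)
            return self.head(out[:, -1])                     # score from the last visit

    scores = CNNLSTM()(torch.randn(4, 5, 1, 8, 9))  # 4 subjects, 5 visits, 8x9 VF grid

Survival-aware training (handling censored subjects rather than a binary label) would replace the final layer and loss with, for example, a discrete-time hazard or Cox-type objective.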
Zhezhen Jin (Organizer)
Xuewen Lu
Yingwei Peng
Yuping Wang
Nicolas Dietrich
Variable Selection in Joint Frailty Model of Recurrent and Terminal Events with Diverging Number of Covariates
Xuewen Lu
University of Calgary
xlu@ucalgary.ca
In many biomedical applications, recurrent event data are subject to an informative terminal event, for example, death. Joint modeling of recurrent and terminal events has attracted considerable research interest; however, very little work has been done on simultaneous estimation and variable selection for joint frailty proportional hazards models, and theoretical justification is lacking when the dimension of the covariates diverges with the sample size. To fill this gap, we propose a broken adaptive ridge (BAR) regression procedure that combines the strengths of quadratic regularization and adaptive weighted bridge shrinkage. We establish the oracle property of the BAR regression. In a simulation study, the results indicate that the BAR regression outperforms existing variable selection methods. Finally, the proposed method is applied to a real dataset for illustration.
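For orientation, the broken adaptive ridge estimator is usually defined as the limit of iteratively reweighted ridge fits (a schematic form, not necessarily the authors' exact notation):
\[
\hat\beta^{(k+1)} \;=\; \arg\min_{\beta}\Big\{-2\,\ell_n(\beta) \;+\; \lambda_n \sum_{j=1}^{p_n} \frac{\beta_j^2}{\big(\hat\beta_j^{(k)}\big)^2}\Big\}, \qquad \hat\beta_{\mathrm{BAR}} \;=\; \lim_{k\to\infty}\hat\beta^{(k)},
\]
where \(\ell_n\) denotes the (joint frailty) log-likelihood; each step remains a ridge-type problem, while the adaptive weights drive small coefficients to zero, mimicking \(L_0\)-type selection even with \(p_n\) diverging.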
Joint Analysis of Longitudinal Ordinal Categorical Item Response Data and Survival Times with Cure Fraction
Yingwei Peng, Professor
Department of Public Health Sciences, Department of Mathematics and Statistics (cross-appointed), Queen's University, Kingston, ON, K7L 3N6, Canada
pengp@queensu.ca
For longitudinal ordinal categorical item response data that may not be observable after a subject develops a terminal event, some statistical models were proposed for the joint analysis of the longitudinal item responses and times to the development of a terminal event. All of these models used an accelerated failure time or Cox proportional hazards model for the survival times, which may not be suitable when some of the subjects are considered cured and will, therefore, never develop an event. In this talk, I will present a new joint model that uses a promotion time cure model for survival times. Statistical estimation procedures are developed for the inference of the parameters in the model. The proposed model and inference procedures are assessed through a simulation study and application to data from a randomized clinical trial for patients with early breast cancer. This is joint work with Ming Chi, Xiaogang Wang, Hui Song, and Dongsheng Tu.
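For reference, the promotion time cure model used for the survival component takes the standard form
\[
S_{\mathrm{pop}}(t \mid x) \;=\; \exp\{-\theta(x)\,F(t)\}, \qquad \theta(x) = \exp(x^\top\gamma),
\]
where \(F\) is a proper distribution function, so that the cure fraction \(\lim_{t\to\infty} S_{\mathrm{pop}}(t\mid x) = \exp\{-\theta(x)\}\) is strictly positive; an accelerated failure time or standard Cox model cannot accommodate such a plateau. (This is background on the model class; the talk's joint specification with the longitudinal item responses is more elaborate.)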
Hierarchical variable clustering based on the predictive strength
Yuping Wang
Paris-Lodron-University Salzburg
yuping.wang@plus.ac.at
A rank-invariant clustering of variables is introduced that is based on the predictive strength between groups of variables, i.e., two groups are assigned a high similarity if the variables in the first group contain high predictive information about the behavior of the variables in the other group and/or vice versa. The method presented here is model-free, dependence-based and does not require any distributional assumptions. Various general invariance and continuity properties are investigated, with special attention to those that are beneficial for the agglomerative hierarchical clustering procedure. A fully non-parametric estimator is considered whose excellent performance is demonstrated in several simulation studies and by means of real-data examples.
Revisiting the Williamson transform in the context of multivariate Archimedean copulas
Nicolas Dietrich
Paris-Lodron-University Salzburg
nicolaspascal.dietrich@plus.ac.at
Motivated by a recently established result stating that, within the family of bivariate Archimedean copulas, standard pointwise convergence implies the generally stronger weak conditional convergence, i.e., convergence of almost all conditional distributions, this result is extended to the class of multivariate Archimedean copulas. Working with the fact that generators of Archimedean copulas are d-monotone functions, pointwise convergence within the family of multivariate Archimedean copulas is characterized in terms of convergence of the corresponding generators, derivatives of the generators, marginal copulas, and marginal densities. Furthermore, weak conditional convergence is a consequence of any of the aforementioned properties. Utilizing the fact that generators of Archimedean copulas can be represented via Williamson transforms of one-dimensional probability measures, it is established that weak convergence of the probability measures is equivalent to uniform convergence of the Archimedean copulas. Using Markov kernels, it is shown that Archimedean copulas inherit absolute continuity, singularity, and discreteness from the aforementioned probability measures, leading to the surprising result that absolutely continuous, singular, as well as discrete copulas are dense in the class of Archimedean copulas with respect to the uniform metric.
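For background, the Williamson d-transform referred to here maps a distribution function \(F\) on \((0,\infty)\) to an Archimedean generator via
\[
(\mathfrak{W}_d F)(x) \;=\; \int_{(x,\infty)} \Big(1-\frac{x}{t}\Big)^{d-1} dF(t), \quad x>0, \qquad (\mathfrak{W}_d F)(0)=1,
\]
and a function \(\psi\) generates a d-variate Archimedean copula precisely when it is such a transform; the equivalence described in the talk then relates weak convergence of the underlying probability measures to uniform convergence of the associated copulas. (Stated here as standard background, in notation chosen for this program rather than the speaker's.)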
Gang Li (Organizer)
Xiaowu Dai
Douglas Schaubel
Ying Lu
Yuhua Zhu
Kernel ordinary differential equations
Xiaowu Dai
UCLA, United States
dai@stat.ucla.edu
The ordinary differential equation (ODE) is widely used for modelling biological and physical processes in science. A new reproducing kernel-based approach is proposed for the estimation and inference of an ODE given noisy observations. The functional forms in the ODE are not assumed to be known, nor restricted to be linear or additive, and pairwise interactions are allowed. Sparse estimation is performed to select individual functionals, and confidence intervals are constructed for the estimated signal trajectories. The estimation optimality and selection consistency of kernel ODE are established under both the low-dimensional and high-dimensional settings, where the number of unknown functionals can be smaller or larger than the sample size. The proposal builds upon the smoothing spline analysis of variance (SS-ANOVA) framework, but tackles several important problems that are not yet fully addressed, and thus extends the scope of existing SS-ANOVA methodology as well.
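Schematically, the kernel ODE setup can be written (in illustrative notation) as
\[
\frac{dx_j(t)}{dt} \;=\; F_j\big(x_1(t),\dots,x_p(t)\big), \qquad F_j(x) \;=\; b_j + \sum_{k} f_{jk}(x_k) + \sum_{k<l} f_{jkl}(x_k, x_l),
\]
with each component function lying in a reproducing kernel Hilbert space from the SS-ANOVA decomposition; sparsity penalties on the component norms determine which main effects and pairwise interactions enter each equation, based on trajectories observed with noise.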
Dynamic Risk Assessment by Landmark Modeling of the Restricted Mean Survival Time
Douglas Schaubel
Professor, Biostatistics Division, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, U.S.A.
Dynamic risk assessment is an important tool for healthcare providers to inform treatment selection, with the aim of optimizing patient outcomes and/or avoiding over-treatment. The risk of adverse events is regularly assessed based on changes in particular biomarkers or vital signs in order to evaluate a patient’s medical status and, perhaps, the urgency of treatment receipt. Landmark analysis is a useful dynamic prediction approach that obviates the need to jointly model the time-dependent biomarkers and the time-to-event outcome. The majority of landmark methods for survival analysis use a hazard regression model (typically a Cox model) to quantify the association between the longitudinal predictors and the outcome, and most assume independent censoring. To broaden the scope of landmark analysis, we propose landmark methods that directly model the restricted mean survival time (RMST) and allow for dependent censoring. Advantages of RMST models include avoiding the assumption that a hazard model is correctly specified at every time point, since RMST models target a single restriction time rather than an entire process. Moreover, many investigators prefer the area under the survival curve to the hazard rate as a clinical endpoint because of its interpretability. Asymptotic properties of the proposed estimators are derived, and a comprehensive simulation study demonstrates their good finite-sample performance. The proposed methods are illustrated using national registry data on a cohort of end-stage liver disease patients. This is joint work with Yuan Zhang.
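In schematic terms (our sketch, not the authors' exact specification), the landmark RMST regression targets
\[
\mu_s\big(\tau \mid \bar Z(s)\big) \;=\; E\big[\min(T, s+\tau) - s \,\big|\, T > s,\ \bar Z(s)\big] \;=\; g^{-1}\big\{\beta_s^\top \tilde Z(s)\big\},
\]
the expected survival time, restricted to \(\tau\) units beyond each landmark time \(s\), among subjects still at risk at \(s\), modeled directly as a function of the covariate history \(\bar Z(s)\); the handling of dependent censoring is part of the proposed estimation procedure.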
Using the Desirability of Outcome Ranking (DOOR) Approach to Construct Multicomponent Endpoints
Ying Lu, Ph.D.
Professor, Department of Biomedical Data Science, Stanford University School of Medicine
ylu1@stanford.edu
Complex disorders affect multiple symptom domains measured by multiple outcomes. Successful treatments may affect one or several domains, and which ones may vary among patients. Multiple component (MC) endpoints that integrate outcomes across multiple domains help evaluate the totality of treatment benefits. In this talk, we present a general approach to constructing an MC endpoint from multiple domains according to their relative ranking in an evaluation system. The ranking of outcome variables can be prespecified in a protocol (as in a shared decision-making (SDM) trial), vary by treatment approach (as in traditional Chinese medicine trials), or vary by patient preference (as with the Patient-Ranked Order of Function (PROOF) score for amyotrophic lateral sclerosis (ALS) trials). Using the desirability of outcome ranking (DOOR) approach, we construct Mann-Whitney U-statistics to estimate the probability that a treated participant has a more desirable outcome than a control participant. This approach offers flexibility in the number of domains integrated, is independent of measurement units, and improves the clinical relevance of the efficacy assessment as well as statistical power. We demonstrate the approach using results from the ENHANCE-AF trial (NCT04096781), which evaluated a novel SDM pathway for patients considering anticoagulation for stroke prevention, and follow-up data from the development of the PROOF score for predicting ALS patient survival. We will discuss challenges in using this approach and strategies to address them. The presentation is based on collaborations with Professors Lu Tian, Paul Wang, and Randall Stafford at Stanford University and Professors Ruben van Eijk and Leonard van den Berg at University Medical Center Utrecht, the Netherlands.
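As a minimal sketch of the DOOR-type estimand (illustrative code, not the trials' analyses), the probability that a treated participant has a more desirable outcome than a control participant, with ties split evenly, is the scaled Mann-Whitney U-statistic:

    import numpy as np

    def door_probability(treated, control):
        """Estimate P(treated more desirable than control) + 0.5 * P(tie)
        from desirability ranks, i.e., the Mann-Whitney U-statistic
        divided by n_treated * n_control."""
        t = np.asarray(treated, dtype=float)[:, None]   # column vector
        c = np.asarray(control, dtype=float)[None, :]   # row vector
        wins = (t > c).sum()
        ties = (t == c).sum()
        return (wins + 0.5 * ties) / (t.size * c.size)

    # Hypothetical desirability ranks (higher = more desirable outcome)
    print(door_probability([4, 3, 5, 2, 4], [2, 3, 1, 4, 2]))  # 0.78

A value above 0.5 favors treatment regardless of the measurement units of the component domains, which is what makes the construction attractive for multicomponent endpoints.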
Continuous-in-time Reinforcement Learning
Yuhua Zhu
UCLA
yuhuazhu.math@gmail.com
When the data are discrete in time, how can we solve continuous-in-time reinforcement learning problems? Given the prevalence of continuous-time dynamics in various real-world applications, our objective is to solve the optimal control problem in the presence of unknown dynamics. First, we show that the Bellman equation serves as a first-order approximation to continuous-time problems. Then, we derive higher-order equations based on partial differential equations (PDEs). To efficiently solve continuous-time reinforcement learning problems with discrete-in-time data, we further propose algorithms for solving the PDE-based Bellman equations.
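To make the first-order-approximation statement concrete (a standard heuristic, not the talk's derivation): for a controlled diffusion \(dX_t = b(X_t,a)\,dt + \sigma(X_t,a)\,dW_t\) with discount rate \(\beta\), the discrete-time Bellman equation over a step \(\Delta t\),
\[
V(x) \;=\; \max_a \big\{ r(x,a)\,\Delta t + e^{-\beta\Delta t} E[\,V(X_{t+\Delta t}) \mid X_t=x, a\,] \big\},
\]
expands via an Ito/Taylor argument, as \(\Delta t \to 0\), into the Hamilton-Jacobi-Bellman PDE
\[
\beta V(x) \;=\; \max_a \Big\{ r(x,a) + b(x,a)^\top \nabla V(x) + \tfrac{1}{2}\operatorname{tr}\big(\sigma\sigma^\top(x,a)\,\nabla^2 V(x)\big) \Big\},
\]
so the Bellman equation is accurate only to first order in \(\Delta t\); the higher-order, PDE-based equations mentioned in the abstract refine this approximation for data observed at discrete times.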
Lei Liu (Organizer)
Jianwen Cai
Donglin Zeng
Yuanjia Wang
Haibo Zhou
Feature screening for case-cohort studies with failure time outcome
Jianwen Cai
University of North Carolina, Chapel Hill
The case-cohort design has been demonstrated to be an economical and effective approach in large cohort studies when the measurement of some covariates for all individuals is expensive. Various methods have been proposed for case-cohort data when the dimension of the covariates is smaller than the sample size. However, limited work has been done for high-dimensional case-cohort data, which are frequently collected in large epidemiological studies. We propose a variable screening method for ultrahigh-dimensional case-cohort data under the framework of the proportional hazards model, which allows the covariate dimension to increase with the sample size at an exponential rate. Our procedure enjoys the sure screening property and ranking consistency under mild regularity conditions. We further extend the method to an iterative version to handle scenarios in which some covariates are jointly important but are marginally unrelated or only weakly correlated with the response. The finite-sample performance of the proposed procedure is evaluated via both simulation studies and an application to real data from a breast cancer study.
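In the spirit of sure independence screening for the Cox model (a schematic description with illustrative notation, not necessarily the authors' exact marginal utility), each covariate is ranked by a marginal association measure and only a top-ranked set is retained:
\[
\hat\beta_j^{M} \;=\; \arg\max_{\beta}\ \ell_j^{w}(\beta), \qquad \hat{\mathcal{M}} \;=\; \big\{ j : |\hat\beta_j^{M}| \text{ is among the } d_n \text{ largest} \big\},
\]
where \(\ell_j^{w}\) is the case-cohort weighted partial likelihood using covariate \(j\) alone and \(d_n\) is a user-chosen cutoff (commonly of order \(n/\log n\)); the sure screening property then guarantees that \(\hat{\mathcal{M}}\) contains all truly important covariates with probability tending to one even when \(p\) grows exponentially with \(n\).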
Fusing Individualized Treatment Rules Using Auxiliary Outcomes
Donglin Zeng
Professor, Department of Biostatistics, University of Michigan
dzeng@umich.edu
An individualized treatment rule (ITR) is a decision rule that recommends treatments for patients based on their covariates. In practice, the optimal ITR that maximizes its associated value function is also expected to cause little harm to other non-primary outcomes. Hence, one goal is to learn the ITR that not only maximizes the value function for the primary outcome but also approximates the optimal rule for the other auxiliary outcomes as closely as possible. In this work, we propose a fusion penalty to encourage ITRs based on the primary outcome and auxiliary outcomes to yield similar recommendations. We then optimize a surrogate loss function using empirical data for estimation. We derive the non-asymptotic properties for the proposed method and show that the agreement rate between the estimated ITRs for primary and auxiliary outcomes converges faster to the true agreement rate as compared to methods without using auxiliary outcomes. Finally, simulation studies and a real data example are used to demonstrate the finite-sample performance of the proposed method.
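Schematically (our sketch of the fusion idea, in notation chosen for illustration), decision functions \(f_1\) for the primary outcome and \(f_2\) for an auxiliary outcome are estimated jointly as
\[
\min_{f_1, f_2}\ \frac{1}{n}\sum_{i=1}^{n}\Big[ w_{1i}\,\phi\big(A_i f_1(X_i)\big) + w_{2i}\,\phi\big(A_i f_2(X_i)\big)\Big] \;+\; \lambda\,\mathcal{P}(f_1, f_2),
\]
where \(\phi\) is a convex surrogate loss, the weights \(w_{ki}\) are built from the respective outcomes as in outcome-weighted learning, and the fusion penalty \(\mathcal{P}\) (e.g., a norm of \(f_1 - f_2\)) pushes the two rules toward issuing the same recommendation, which is what drives the faster convergence of the agreement rate.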
Mixed-Response State-Space Model for Analyzing Multi-Dimensional Digital Phenotypes
Yuanjia Wang
Columbia University
yw2016@cumc.columbia.edu
Digital technologies (e.g., mobile phones) can be used to obtain objective, frequent, and real-world digital phenotypes from individuals. However, modeling these data poses substantial challenges since observational data are subject to confounding and various sources of variability. For example, signals of patients’ underlying health status and treatment effects are mixed with variation due to the living environment and measurement noise. Digital phenotype data thus show extensive variability between and within patients as well as across different health domains (e.g., motor, cognitive, and speech). Motivated by a mobile health study of Parkinson’s disease (PD), we develop a mixed-response state-space (MRSS) model to jointly capture multi-dimensional, multi-modal digital phenotypes and their measurement processes through a finite number of latent state time series. These latent states reflect the dynamic health status and personalized time-varying treatment effects and can be used to adjust for informative measurements. For computation, we use the Kalman filter for Gaussian phenotypes and importance sampling with a Laplace approximation for non-Gaussian phenotypes. We conduct comprehensive simulation studies and demonstrate the advantage of the MRSS model in a mobile health study that remotely collects real-time digital phenotypes from PD patients.
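As a minimal sketch of the Gaussian-phenotype building block (a textbook Kalman filter step with illustrative dimensions, not the full MRSS algorithm):

    import numpy as np

    def kalman_step(m, P, y, A, Q, C, R):
        """One predict/update step for a linear-Gaussian state-space model:
        state x_t = A x_{t-1} + N(0, Q); observation y_t = C x_t + N(0, R)."""
        # Predict the latent health state forward one visit
        m_pred = A @ m
        P_pred = A @ P @ A.T + Q
        # Update with the Gaussian phenotypes observed at this visit
        S = C @ P_pred @ C.T + R                 # innovation covariance
        K = P_pred @ C.T @ np.linalg.inv(S)      # Kalman gain
        m_new = m_pred + K @ (y - C @ m_pred)
        P_new = P_pred - K @ C @ P_pred
        return m_new, P_new

    # Toy example: 2-dimensional latent state, 3 Gaussian phenotypes per visit
    rng = np.random.default_rng(0)
    m, P = np.zeros(2), np.eye(2)
    A, Q = 0.9 * np.eye(2), 0.1 * np.eye(2)
    C, R = rng.normal(size=(3, 2)), 0.5 * np.eye(3)
    m, P = kalman_step(m, P, rng.normal(size=3), A, Q, C, R)

Non-Gaussian phenotypes (counts, ordinal scores) do not admit these closed-form updates, which is where the importance sampling and Laplace approximation mentioned above come in.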
Semiparametric regression analysis of case-cohort studies with multiple interval-censored disease outcomes
Haibo Zhou
UNC at Chapel Hill
zhou@bios.unc.edu
Interval-censored failure time data commonly arise in epidemiological and biomedical studies where the occurrence of an event or a disease is determined via periodic examinations. In this work, we formulate the case-cohort design with multiple interval-censored disease outcomes and also generalize it to non-rare diseases where only a portion of diseased subjects are sampled. We develop a marginal sieve weighted likelihood approach, which assumes that the failure times marginally follow the proportional hazards model. We consider two types of weights to account for the sampling bias, and adopt a sieve method with Bernstein polynomials to handle the unknown baseline functions. We employ a weighted bootstrap procedure to obtain a variance estimate that is robust to the dependence structure between failure times. The proposed method is examined via simulation studies and illustrated with a dataset on incident diabetes and hypertension from the Atherosclerosis Risk in Communities (ARIC) study.
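For context, a Bernstein-polynomial sieve for an unknown baseline cumulative hazard on an interval \([l, u]\) takes the standard form (sketched here rather than quoted from the paper)
\[
\Lambda_{0k}(t) \;=\; \sum_{j=0}^{m} \phi_{kj}\, B_j\Big(\frac{t-l}{u-l}\Big), \qquad B_j(s) = \binom{m}{j} s^{j}(1-s)^{m-j},
\]
with monotonicity imposed through the ordering \(0 \le \phi_{k0} \le \cdots \le \phi_{km}\); the sieve dimension \(m\) grows slowly with the sample size, and the weighted likelihood is maximized jointly over the regression parameters and the coefficients \(\{\phi_{kj}\}\).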