Colleen Chan

Google Scholar  •  LinkedIn

24 Hillhouse Ave,
New Haven, CT


I am a Ph.D. student in the Department of Statistics and Data Science at Yale University, advised by Jas Sekhon. I am broadly interested in causal inference and machine learning, with specific interests in developing machine learning methods for heterogeneous treatment effect estimation and transportability. My research is motivated by problems in the social sciences, medicine, and public health. Before Yale, I completed my undergraduate work at UC San Diego in 2018.


Honest Random Forests for Heterogeneous Treatment Effect Estimation with Covariate Shift.

Colleen E. Chan, Theo F. Saarinen, Jasjeet S. Sekhon.

In many real-world applications, machine learning algorithms are used for prediction in environments whose covariate distribution differs from that of the data the algorithm was trained on. When this is the case, the algorithm must be robust to the distributional shift of the covariates to avoid harmful results down the line. For example, the Epic Sepsis model, a proprietary sepsis prediction model, had considerably worse than reported accuracy when deployed in the hundreds of hospitals not included in the training data. In any transfer learning problem, the change in the data generating process can manifest itself in two ways: a shift in the distribution of the covariates, and a change in the outcome regression function itself. We study an approach for training machine learning algorithms that aims to solve the first of these issues. Specifically, we introduce a version of the random forest algorithm that uses a novel sample splitting procedure during training to induce robustness to covariate shift at prediction time. In contrast to a standard random sample split, we restrict the predictions of the forest to only those trees not trained on any observations from a given group. This form of prediction ensures that the forest can make robust predictions for the various experimental groups even in the presence of distributional shifts. Out of sample, this method relies on the assumption that the distributional shifts within the training groups are similar to the distributional shifts within the groups out of sample. We also provide a proof for identification of the conditional average treatment effect (CATE) function under restrictions on the model shift between groups. We develop simulations that demonstrate that our method outperforms existing meta-analytic approaches. Finally, we apply our method to get-out-the-vote data.
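
The group-restricted prediction rule described above can be illustrated with a minimal Python sketch. It assumes each fitted tree records which groups contributed observations to its training sample. The `Tree` container, the `predict_for_group` helper, and the constant toy predictors are hypothetical names chosen for illustration; this is not the authors' implementation and it omits the training-time sample splitting entirely.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Sequence


@dataclass(frozen=True)
class Tree:
    """Hypothetical stand-in for one fitted tree in the forest."""
    train_groups: FrozenSet[str]                 # groups whose observations the tree saw
    predict: Callable[[Sequence[float]], float]  # maps covariates to a prediction


def predict_for_group(forest: List[Tree], x: Sequence[float], group: str) -> float:
    """Average only the trees that saw no observations from `group`."""
    eligible = [t for t in forest if group not in t.train_groups]
    if not eligible:
        raise ValueError(f"every tree was trained on group {group!r}")
    return sum(t.predict(x) for t in eligible) / len(eligible)


# Toy forest: constant predictors make the averaging easy to follow.
forest = [
    Tree(frozenset({"A", "B"}), lambda x: 1.0),
    Tree(frozenset({"B", "C"}), lambda x: 2.0),
    Tree(frozenset({"A", "C"}), lambda x: 4.0),
]

# Only the second tree never saw group "A", so it alone is used.
pred_a = predict_for_group(forest, [0.5], "A")
# A group absent from training ("D") uses all three trees.
pred_d = predict_for_group(forest, [0.5], "D")
```

In this toy setup a prediction for group "A" comes from a single tree, while a previously unseen group draws on the whole forest.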

Nonparametric Estimation of the Potential Impact Fraction and the Population Attributable Fraction with Individual-level and Aggregate Data.

Colleen E. Chan, Rodrigo Zepeda-Tello, Dalia Camacho García Formentí, Frederick Cudhea, Rafael Meza, Eliane Rodrigues, Donna Spiegelman, Tonatiuh Barrientos Gutierrez, Xin Zhou.


Young Statistician Showcase Presentation at IBC 2022.

The estimation of the potential impact fraction (including the population attributable fraction) with continuous exposure data frequently relies on strong distributional assumptions. However, these assumptions are often violated if the underlying exposure distribution is unknown or if the same distribution is assumed across time or space. Nonparametric methods to estimate the potential impact fraction are available for cohort data, but no alternatives exist for cross-sectional data. In this article, we discuss the impact of distributional assumptions in the estimation of the potential impact fraction, showing that under an infinite set of possibilities, distributional violations lead to biased estimates. We propose nonparametric methods to estimate the potential impact fraction for aggregated data (mean and standard deviation) or individual-level data (e.g., observations from a cross-sectional population survey), and develop simulation scenarios to compare their performance against standard parametric procedures. We also present an R package to implement these methods.
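
For individual-level data, one way to picture a distribution-free estimate is the standard plug-in form of the potential impact fraction, which replaces expectations with sample averages: PIF = 1 - mean(RR(g(X))) / mean(RR(X)), where g is the counterfactual change in exposure. The sketch below is a minimal Python illustration under that assumption. The function name `empirical_pif`, the exponential relative-risk function, and the toy sample are hypothetical, and the code is not taken from the article or from the R package.

```python
import math
from typing import Callable, Optional, Sequence


def empirical_pif(
    exposures: Sequence[float],
    relative_risk: Callable[[float], float],
    counterfactual: Callable[[float], float],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Plug-in PIF: 1 - (weighted RR under the counterfactual / weighted RR as observed)."""
    if weights is None:
        weights = [1.0] * len(exposures)  # equal survey weights by default
    observed = sum(w * relative_risk(x) for w, x in zip(weights, exposures))
    shifted = sum(
        w * relative_risk(counterfactual(x)) for w, x in zip(weights, exposures)
    )
    return 1.0 - shifted / observed


# Illustrative inputs only: the relative risk doubles per unit of exposure.
rr = lambda x: math.exp(math.log(2.0) * x)
sample = [0.0, 1.0, 2.0]

# The PAF is the PIF when the counterfactual removes the exposure entirely.
paf = empirical_pif(sample, rr, lambda x: 0.0)
# A milder counterfactual: each exposure reduced by one unit, floored at zero.
pif = empirical_pif(sample, rr, lambda x: max(x - 1.0, 0.0))
```

With these toy values the observed relative risks are 1, 2, and 4, so removing the exposure gives a PAF of 4/7 and the one-unit reduction gives a PIF of 3/7. No exposure distribution is assumed at any point.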

Improving Estimation of Total Effects in Meta-Analysis.

Colleen E. Chan, Tonatiuh Barrientos Gutierrez, Rodrigo Zepeda-Tello, Dalia Camacho García Formentí, Rodrigo Barran, Rosana Torres Alvarez, Dalia Stern-Solodkin, Donna Spiegelman.

The International Journal of Biostatistics, revise and resubmit.

Meta-analyses summarize evidence about an association across multiple sources of information, increasing statistical power and exploring sources of heterogeneity. Yet, meta-analyses may neglect the complex causal structure behind an association, failing to distinguish between total and direct effects. When summarizing the effect of an exposure on an outcome in the presence of a mediator, it is useful for the total effect to be reported. However, when the total effect is unavailable, some meta-analyses include the direct effect in place of the total effect, biasing the summary of the association towards the null. We develop methods to obtain point and interval estimates of the mediation proportion and total effect in this setting, filling an important methodological gap in existing evaluation approaches. In addition to reducing bias, by leveraging a summary mediation proportion whose estimator is developed here, our method is able to include a wider range of studies in the meta-analysis, thus providing more efficient estimates. The methodology is illustrated by a meta-analysis of sugar-sweetened beverage (SSB) consumption in relation to the incidence of type 2 diabetes, where the estimated summary total effect increased by about one-third when our new method was applied.
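
The link between a direct effect and a total effect can be made concrete with the usual difference-method definition of the mediation proportion on the log scale, MP = (total - direct) / total, which rearranges to total = direct / (1 - MP). The Python sketch below applies only that identity. The function name `total_from_direct` and the numbers are hypothetical, and the sketch does not reproduce the estimators or interval procedures developed in the paper.

```python
import math


def total_from_direct(direct_log_rr: float, mediation_proportion: float) -> float:
    """Recover a total effect (log scale) from a direct effect.

    Uses MP = (total - direct) / total, so total = direct / (1 - MP).
    """
    if not 0.0 <= mediation_proportion < 1.0:
        raise ValueError("mediation proportion must lie in [0, 1)")
    return direct_log_rr / (1.0 - mediation_proportion)


# Illustrative values only: a study reports just the direct effect,
# and a summary mediation proportion is available from other studies.
direct = 0.15   # log relative risk adjusted for the mediator
mp = 0.25       # assumed summary mediation proportion
total = total_from_direct(direct, mp)
total_rr = math.exp(total)  # back-transform to the relative-risk scale
```

Under these assumed inputs the total log relative risk is 0.2, larger than the direct effect of 0.15, which shows how substituting a direct effect for a total effect pulls a summary towards the null.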

Decomposing Signals from Dynamical Systems using Shadow Manifold Interpolation.

Erin S. George, Colleen E. Chan, Gal Dimand, Ryan M. Chakmak, Claudia Falcon, Robert Martin, Dan Eckhardt.

SIAM Journal on Applied Dynamical Systems (2021); 20(4), 2236-2260.

Outstanding Poster Award at JMM 2019.

Traditional methods in signal analysis fail to meaningfully decompose composite signals with chaotic dynamics. Signals from chaotic systems are inherently broadband, so techniques such as the Fourier transform will be unable to separate a time series if it contains two chaotic signals of interest. We present how an algorithm called shadow manifold interpolation (SMI), inspired by recent advances in applied dynamical systems theory, can succeed in this task. SMI takes two causally related signals and reconstructs one from the other, producing a reconstruction that only captures the most prominent shared dynamics between the two signals. From this we produce a decomposition of the signal into different signals representing separate dynamics. Furthermore, we demonstrate the effectiveness of SMI at decomposing signals in a variety of test cases. We consider three ways in which two signals may be causally related. Through testing the algorithm on simulated composite dynamical systems, we demonstrate that SMI succeeds in separating out the constituent systems in two schemes, failing only when the two signals share multiple decoupled dynamics. The main limitation of this algorithm is that a second causally related reference signal is needed in order to decompose a given signal. This reference signal must share exactly one dynamic of interest with the given signal.

Food Insecurity in Medical Students: Preliminary Data From Yale School of Medicine.

Amanda G. Zhou, Michael R. Mercier, Colleen E. Chan, June Criscione, Nancy Angoff, Laura R. Ment.

Academic Medicine (2021); 96(6), 774-776.

Food insecurity, a lack of money or resources that limits access to adequate food, has been well described in college students, but little is known about this problem in medical students. Among undergraduates, food insecurity creates disparities in academic success and physical and mental health, including depressive symptoms. It also disproportionately affects underrepresented minority students. Because the consequences of food insecurity have important effects on student wellbeing, it is important for medical schools to investigate food insecurity in their students and use this information to develop interventions and guide school policy. Administration at Yale School of Medicine (YSM) is dedicated to improving student wellness and supporting its students’ needs and thus is interested in understanding food insecurity in its student population. In this article, the authors describe a pilot study conducted in April 2019 that assessed the prevalence and predictors of food insecurity at YSM. This pilot study demonstrated that rates of food insecurity are higher at YSM than in the general population and that certain groups of students are more food insecure than others. In particular, male gender and being underrepresented in medicine were independent predictors of food insecurity. The authors then describe areas for further investigation as well as potential programs and other interventions for medical schools facing food insecurity. They conclude by reviewing the adverse implications of a high prevalence of food insecurity in medical students and by encouraging other medical schools to recognize and investigate this important issue.


  • pifpaf: An R package for nonparametric estimation of the potential impact fraction and population attributable fraction
  • metamediate: An R package for mediation proportion and total effect estimation in meta-analysis


I have enjoyed helping teach the following courses.

At Yale:

  • S&DS 230/530: Data Exploration and Analysis (×2)
  • S&DS 238/538: Probability and Statistics
  • S&DS 242/542: Theory of Statistics
  • S&DS 317/517: Applied Machine Learning and Causal Inference
  • S&DS 365/565: Data Mining and Machine Learning
  • S&DS 617: Applied Machine Learning and Causal Inference Research Seminar
  • S&DS 627: Statistical Consulting (×7)

At UC San Diego:

  • ECON 5: Data Analytics for the Social Sciences
  • ECON 100A: Intermediate Microeconomics I
  • MATH 11: Calculus-Based Probability and Statistics (×3)
  • MATH 20E: Vector Calculus


  • AI Resident, Google X, Fall 2021
  • Data Scientist Intern, Amazon, Summer 2020
  • Quant Specialist Ph.D. Intern, Federal Reserve Bank of Chicago, Summer 2019
  • Researcher, UCLA IPAM RIPS Program, Summer 2018
  • Researcher, UCLA Computational and Applied Math REU, Summer 2017
  • Research Assistant, UCSD Rady School of Management, Fall 2017-Spring 2018

Last updated: July 2022.