Colleen Chan

colleen.chan@yale.edu

219 Prospect St,
Kline Tower Room 1205,
New Haven, CT 06511


Overview

I am a Ph.D. student in the Department of Statistics and Data Science at Yale University, advised by Jas Sekhon. I am broadly interested in generative AI, machine learning, and causal inference with application areas in medicine, public health, and the social sciences. Before Yale, I completed my undergraduate work at UC San Diego in 2018.

I will be joining Netflix as a research scientist starting July 2024.

Research

Journal Publications and Preprints

Nonparametric Estimation of the Potential Impact Fraction and the Population Attributable Fraction with Individual-level and Aggregate Data.

Colleen E. Chan, Rodrigo Zepeda-Tello, Dalia Camacho García Formentí, Frederick Cudhea, Rafael Meza, Eliane Rodrigues, Donna Spiegelman, Tonatiuh Barrientos Gutierrez, Xin Zhou.

arXiv:2207.03597.

Young Statistician Showcase Presentation at IBC 2022.

@article{chan2022nonparametric,
title={Nonparametric Estimation of the Potential Impact Fraction and Population Attributable Fraction with Individual-Level and Aggregated Data},
author={Chan, Colleen E and Zepeda-Tello, Rodrigo and Camacho-Garc{\'\i}a-Forment{\'\i}, Dalia and Cudhea, Frederick and Meza, Rafael and Rodrigues, Eliane and Spiegelman, Donna and Barrientos-Gutierrez, Tonatiuh and Zhou, Xin},
journal={arXiv preprint arXiv:2207.03597},
year={2022}
}
The estimation of the potential impact fraction (including the population attributable fraction) with continuous exposure data frequently relies on strong distributional assumptions. However, these assumptions are often violated if the underlying exposure distribution is unknown or if the same distribution is assumed across time or space. Nonparametric methods to estimate the potential impact fraction are available for cohort data, but no alternatives exist for cross-sectional data. In this article, we discuss the impact of distributional assumptions in the estimation of the population impact fraction, showing that under an infinite set of possibilities, distributional violations lead to biased estimates. We propose nonparametric methods to estimate the potential impact fraction for aggregated (mean and standard deviation) or individual data (e.g. observations from a cross-sectional population survey), and develop simulation scenarios to compare their performance against standard parametric procedures. We also present an R package to implement these methods.
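As a rough illustration of the individual-level case described above, the potential impact fraction can be written as PIF = 1 - E[RR(g(X))] / E[RR(X)], where X is the exposure, RR is a relative risk function, and g is a counterfactual shift of the exposure. The Python sketch below is a plain plug-in version of that definition using sample means. It is not taken from the paper or its R package, and the function names, the exponential relative risk, and the example exposures are all hypothetical.

```python
import math


def empirical_pif(exposures, relative_risk, counterfactual):
    """Plug-in estimate of PIF = 1 - E[RR(g(X))] / E[RR(X)].

    exposures      -- individual-level exposure values (e.g. from a survey)
    relative_risk  -- function mapping an exposure to its relative risk
    counterfactual -- function g mapping an observed exposure to its
                      counterfactual value
    """
    if not exposures:
        raise ValueError("exposures must be non-empty")
    n = len(exposures)
    # Sample mean of the relative risk under the observed exposures.
    observed = sum(relative_risk(x) for x in exposures) / n
    # Sample mean of the relative risk under the counterfactual exposures.
    shifted = sum(relative_risk(counterfactual(x)) for x in exposures) / n
    return 1.0 - shifted / observed


# Hypothetical inputs: an exponential relative risk and four exposure values.
beta = 0.1
rr = lambda x: math.exp(beta * x)
exposures = [0.0, 1.0, 2.0, 3.0]

# Halving every exposure gives a potential impact fraction.
pif = empirical_pif(exposures, rr, lambda x: 0.5 * x)
# Removing the exposure entirely gives the population attributable fraction.
paf = empirical_pif(exposures, rr, lambda x: 0.0)
```

Because this sketch averages over the observed exposures directly, it makes no assumption about the shape of the exposure distribution. Survey weights and the aggregated-data case (mean and standard deviation only) are deliberately left out.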

Improving Estimation of Total Effects in Meta-Analysis.

Colleen E. Chan, Tonatiuh Barrientos Gutierrez, Rodrigo Zepeda-Tello, Dalia Camacho García Formentí, Rodrigo Barran, Rosana Torres Alvarez, Dalia Stern-Solodkin, Donna Spiegelman.

The International Journal of Biostatistics, R&R.

Meta-analyses summarize evidence about an association across multiple sources of information, increasing statistical power and exploring sources of heterogeneity. Yet, meta-analyses may neglect the complex causal structure behind an association, failing to distinguish between total and direct effects. When summarizing the effect of an exposure on an outcome in the presence of a mediator, it is useful that the total effect is provided. However, when the total effect is unavailable, some meta-analyses include the direct effect in place of the total effect, biasing the summary of the association towards the null. We develop methods to estimate point and interval estimates of the mediation proportion and total effect in this setting, filling an important methodological gap in existing evaluation approaches. In addition to reducing bias, by leveraging a summary mediation proportion whose estimator is developed here, our method is able to include a wider range of studies in the meta-analysis, thus providing more efficient estimates. The methodology is illustrated by a meta-analysis of sugar-sweetened beverage (SSB) consumption in relation to the incidence of type 2 diabetes, where the estimated summary total effect increased by about 1/3 when our new method was applied.

Decomposing Signals from Dynamical Systems using Shadow Manifold Interpolation.

Erin S. George, Colleen E. Chan, Gal Dimand, Ryan M. Chakmak, Claudia Falcon, Robert Martin, Dan Eckhardt.

SIAM Journal on Applied Dynamical Systems (2021); 20(4), 2236-2260.

Outstanding Poster Award at JMM 2019.

Traditional methods in signal analysis fail to meaningfully decompose composite signals with chaotic dynamics. Signals from chaotic systems are inherently broadband, so techniques such as the Fourier transform will be unable to separate a time series if it contains two chaotic signals of interest. We present how an algorithm called shadow manifold interpolation (SMI), inspired by recent advances in applied dynamical systems theory, can succeed in this task. SMI takes two causally-related signals and reconstructs one from the other, producing a reconstruction that only captures the most prominent shared dynamics between the two signals. From this we produce a decomposition of the signal into different signals representing separate dynamics. Furthermore, we demonstrate the effectiveness of SMI at decomposing signals in a variety of test cases. We consider three ways in which two signals may be causally related. Through testing the algorithm on simulated composite dynamical systems, we demonstrate that SMI succeeds in separating out the constituent systems in two schemes, failing only when the two signals share multiple decoupled dynamics. The main limitation of this algorithm is that a second causally-related reference signal is needed in order to decompose a given signal. This reference signal must share exactly one dynamic of interest with the given signal.
@article{george2021decomposing,
title={Decomposing Signals from Dynamical Systems Using Shadow Manifold Interpolation},
author={George, Erin and Chan, Colleen E and Dimand, Gal and Chakmak, Ryan M and Falcon, Claudia and Eckhardt, Daniel and Martin, Robert},
journal={SIAM Journal on Applied Dynamical Systems},
volume={20},
number={4},
pages={2236--2260},
year={2021},
publisher={SIAM}
}

Food Insecurity in Medical Students: Preliminary Data.

Amanda G. Zhou, Michael R. Mercier, Colleen Chan, June Criscione, Nancy Angoff, Laura R. Ment.

Academic Medicine (2021); 96(6), 774-776.

Food insecurity, lack of money or resources that limits access to adequate food, has been well described in college students but little is known about this problem in medical students. Among undergraduates, food insecurity creates disparities in academic success and physical and mental health, including depressive symptoms. It also disproportionately affects underrepresented minority students. Because the consequences of food insecurity have important effects on student wellbeing, it is important for medical schools to investigate food insecurity in their students and use this information to develop interventions and guide school policy. Administration at Yale School of Medicine (YSM) is dedicated to improving student wellness and supporting its students’ needs and thus is interested in understanding food insecurity in its student population. In this article, the authors describe a pilot study conducted in April 2019 that assessed the prevalence and predictors of food insecurity at YSM. This pilot study demonstrated that there are higher rates of food insecurity at YSM than in the general population and that certain groups of students are more food insecure than others. In particular, male gender and being underrepresented in medicine were independent predictors of food insecurity. The authors then describe areas for further investigation as well as potential programs and other interventions for medical schools facing food insecurity. They conclude by reviewing the adverse implications of a high prevalence of food insecurity in medical students and by encouraging other medical schools to recognize and investigate this important issue.
@article{zhou2021food,
title={Food Insecurity in Medical Students: Preliminary Data From Yale School of Medicine},
author={Zhou, Amanda G and Mercier, Michael R and Chan, Colleen and Criscione, June and Angoff, Nancy and Ment, Laura R},
journal={Academic Medicine},
volume={96},
number={6},
pages={774--776},
year={2021},
publisher={LWW}
}

Peer-Reviewed Conference Papers

Assessing the Usability of GutGPT: A Simulation Study of an AI Clinical Decision Support System for Gastrointestinal Bleeding Risk.

Colleen Chan, Kisung You, Sunny Chung, Mauro Giuffrè, Theo Saarinen, Niroop Rajashekar, Yuan Pu, Yeo Eun Shin, Loren Laine, Ambrose Wong, René Kizilcec, Jasjeet Sekhon, Dennis Shung.

Machine Learning for Health (ML4H) – Findings track, 2023.

Applications of large language models (LLMs) like ChatGPT have potential to enhance clinical decision support through conversational interfaces. However, challenges of human-algorithmic interaction and clinician trust are poorly understood. GutGPT, a LLM for gastrointestinal (GI) bleeding risk prediction and management guidance, was deployed in clinical simulation scenarios alongside the electronic health record (EHR) with emergency medicine physicians, internal medicine physicians, and medical students to evaluate its effect on physician acceptance and trust in AI clinical decision support systems (AI-CDSS). GutGPT provides risk predictions from a validated machine learning model and evidence-based answers by querying extracted clinical guidelines. Participants were randomized to GutGPT and an interactive dashboard, or the interactive dashboard and a search engine. Surveys and educational assessments taken before and after measured technology acceptance and content mastery. Preliminary results showed mixed effects on acceptance after using GutGPT compared to the dashboard or search engine but appeared to improve content mastery based on simulation performance. Overall, this study demonstrates LLMs like GutGPT could enhance effective AI-CDSS if implemented optimally and paired with interactive interfaces.
@article{chan2023assessing,
title={Assessing the Usability of GutGPT: A Simulation Study of an AI Clinical Decision Support System for Gastrointestinal Bleeding Risk},
author={Colleen Chan and Kisung You and Sunny Chung and Mauro Giuffrè and Theo Saarinen and Niroop Rajashekar and Yuan Pu and Yeo Eun Shin and Loren Laine and Ambrose Wong and René Kizilcec and Jasjeet Sekhon and Dennis Shung},
journal={arXiv preprint arXiv:2312.10072},
year={2023}
}

Human-Algorithmic Interaction Using a Large Language Model-Augmented Artificial Intelligence Clinical Decision Support System.

Niroop Rajashekar, Yeo Eun Shin, Yuan Pu, Sunny Chung, Kisung You, Mauro Giuffrè, Colleen Chan, Theo Saarinen, Allen Hsiao, Jasjeet Sekhon, Ambrose Wong, Leigh Evans, René Kizilcec, Loren Laine, Terika McCall, Dennis Shung.

Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (CHI ’24).

Integration of artificial intelligence (AI) into clinical decision support systems (CDSS) poses a socio-technological challenge that is impacted by usability, trust, and human-computer interaction (HCI). AI-CDSS interventions have shown limited benefit in clinical outcomes, which may be due to insufficient understanding of how health-care providers interact with AI systems. Large language models (LLMs) have the potential to enhance AI-CDSS, but haven't been studied in either simulated or real-world clinical scenarios. We present findings from a randomized controlled trial deploying AI-CDSS for the management of upper gastrointestinal bleeding (UGIB) with and without an LLM interface within realistic clinical simulations for physician and medical student participants. We find evidence that LLM augmentation improves ease-of-use, that LLM-generated responses with citations improve trust, and HCI varies based on clinical expertise. Qualitative themes from interviews suggest the perception of LLM-augmented AI-CDSS as a team-member used to confirm initial clinical intuitions and help evaluate borderline decisions.
@inproceedings{rajashekar2024human,
title={Human-Algorithmic Interaction Using a Large Language Model-Augmented Artificial Intelligence Clinical Decision Support System},
author={Niroop Channa Rajashekar and Yeo Eun Shin and Yuan Pu and Sunny Chung and Kisung You and Mauro Giuffre and Colleen E. Chan and Theo Saarinen and Allen Hsiao and Jasjeet Sekhon and Ambrose H. Wong and Leigh V. Evans and Rene F. Kizilcec and Loren Laine and Terika McCall and Dennis Shung},
booktitle={Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems},
year={2024},
pages={20},
address={New York, NY, USA},
publisher={ACM},
doi={10.1145/3613904.3642024},
location={Honolulu, HI},
}

Working Papers

Horvitz-Thompson Estimators of Spillover Effects under Hypothetical Bernoulli Treatment Allocations in Two-Stage Randomized Experiments.

Colleen Chan, Shinpei Nakamura Sakai, Laura Forastiere.

In many applications, the no-interference assumption in causal inference is often violated because individuals interact with one another. Two-stage randomized experiments are useful designs for estimating the causal effects of a treatment in the presence of interference. In this design, clusters are assigned a treatment saturation level in the first stage, and each unit within a cluster is randomized to treatment or control according to the assigned saturation level in the second stage. Previous two-stage designs have been proposed under complete randomization in both stages, and simple difference-in-means estimators have been developed under the partial interference assumption. However, complete randomization in the second stage only allows the estimation of causal effects under the treatment saturations of the first stage. We propose instead a Bernoulli assignment in the second stage and weighted estimators of direct and spillover effects, combining information from all clusters. One clear advantage of using Bernoulli assignment is that it allows researchers to estimate causal effects under hypothetical treatment allocations. We derive cluster weights achieving the optimal bias-variance trade-off for our estimator. We develop simulation studies to analyze the finite-sample performance of our proposed estimators. Finally, we illustrate our methodology with a data-inspired information campaign to prevent anemia in India.
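The two-stage design described above can be sketched in a few lines of Python: each cluster first receives a treatment saturation, and each unit in the cluster is then treated independently with that probability (the Bernoulli second stage). This is only an illustrative sketch, not the authors' implementation. The function name, the cluster labels, and the saturation values are hypothetical, and the first stage is simplified to an independent uniform draw per cluster rather than a complete randomization of clusters across saturations.

```python
import random


def two_stage_bernoulli_assignment(cluster_sizes, saturations, seed=0):
    """Draw one two-stage treatment assignment.

    cluster_sizes -- dict mapping a cluster label to its number of units
    saturations   -- candidate treatment saturations (probabilities in [0, 1])
    Returns a dict mapping each cluster label to (saturation, treatments),
    where treatments is a list of 0/1 indicators, one per unit.
    """
    rng = random.Random(seed)
    assignment = {}
    for cluster, size in cluster_sizes.items():
        # Stage 1 (simplified assumption): pick a saturation for the cluster.
        saturation = rng.choice(saturations)
        # Stage 2: independent Bernoulli(saturation) draw for every unit.
        treatments = [1 if rng.random() < saturation else 0 for _ in range(size)]
        assignment[cluster] = (saturation, treatments)
    return assignment


# Hypothetical example: three villages and three saturation levels.
example = two_stage_bernoulli_assignment(
    {"village_a": 5, "village_b": 8, "village_c": 3},
    saturations=[0.2, 0.5, 0.8],
)
```

Under a Bernoulli second stage the realized share of treated units in a cluster varies around its saturation instead of being fixed, so a range of treated shares is observed across clusters. The weighted estimators themselves are not reproduced here.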

Honest Random Forests for Heterogeneous Treatment Effect Estimation with Covariate Shift.

Colleen Chan, Theo Saarinen, Jasjeet Sekhon.

In many real-world applications, machine learning algorithms are used for prediction in environments that do not share the covariate distribution of the data that the algorithm was trained on. When this is the case, it is important for the algorithm to be robust to the distributional shift of the covariates to avoid harmful results down the line. For example, the Epic Sepsis model, a proprietary sepsis prediction model, had considerably worse accuracy than reported when deployed in the hundreds of hospitals not included in the training data. In any transfer learning problem, the change in the data generating process can manifest itself in two ways: a shift in the distribution of the covariates, and a change in the outcome regression function itself. We study an approach for training machine learning algorithms that aims to solve the first of these issues. Specifically, we introduce a version of the random forest algorithm that makes use of a novel sample splitting procedure during training in order to induce robustness to covariate shift at prediction time. We also provide a proof for identification of the CATE function under restrictions on the model shift between groups. In contrast to a standard random sample split, we restrict the predictions of the forest to only trees not trained on any observations from a given group. This form of prediction ensures that the forest can make robust predictions for the various experimental groups even in the presence of distributional shifts. Out of sample, this method relies on the assumption that the distributional shifts within the training groups are similar to the distributional shifts within the groups out of sample. We develop simulations that demonstrate that our method outperforms existing meta-analytic approaches. Finally, we apply our method to get-out-the-vote data.

Teaching

I have enjoyed helping teach the following courses.

At Yale:

  • S&DS 230/530: Data Exploration and Analysis (×2)
  • S&DS 238/538: Probability and Statistics
  • S&DS 242/542: Theory of Statistics
  • S&DS 317/517: Applied Machine Learning and Causal Inference
  • S&DS 365/565: Data Mining and Machine Learning
  • S&DS 617: Applied Machine Learning and Causal Inference Research Seminar (×3)
  • S&DS 627: Statistical Consulting (×7)

At UC San Diego:

  • ECON 5: Data Analytics for the Social Sciences
  • ECON 100A: Intermediate Microeconomics I
  • MATH 11: Calculus-Based Probability and Statistics (×3)
  • MATH 20E: Vector Calculus

Past Employment

Last updated: March 2024.