Traditional techniques face challenges in analyzing and fully utilizing multivariate time-series (MTS) electronic health records (EHRs). Our research focuses on designing Transformer-based and recurrent deep models for precise disease risk prediction from MTS data. Additionally, our approaches aim to address the high missing rates in real-world MTS EHRs, paving the way toward intelligent medicine.
We are also building foundation models for MTS EHRs on top of successful large language models (LLMs), leveraging the latest techniques such as reprogramming.
Furthermore, we investigate how to enhance novel interpretability mechanisms, such as Testing with Concept Activation Vectors (TCAV), Attention-Aware Layer-wise Relevance Propagation, and DecompX, enabling their broader application in explaining biomedical deep learning models.
Irregularly and asynchronously sampled multivariate time-series (MTS) data are often riddled with missing values. Most existing methods embed features according to timestamps, which requires imputing the missing values. However, imputed values can differ drastically from the true values, and predictions based on imputation can therefore be inaccurate. To address this issue, we propose a novel concept, “each value as a token (EVAT),” which treats each feature value as an independent token and thereby bypasses imputation altogether. To realize EVAT, we propose scalable numerical embedding, which learns to embed each feature value by automatically discovering the relationships among features. We integrate the proposed embedding method with the Transformer encoder, yielding the Scalable nUMerical eMbeddIng Transformer (SUMMIT), which produces accurate predictions given MTS with missing values. We conduct experiments on three distinct electronic health record (EHR) datasets with high missing rates. The experimental results verify SUMMIT's efficacy: it attains superior performance to other models that require imputation. (paper link)
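The EVAT idea above can be sketched in a few lines: each observed (timestamp, feature, value) triplet becomes one token, and missing entries simply produce no token, so no imputation is ever needed. The feature names, dimensions, and the value-scaling embedding below are illustrative stand-ins; SUMMIT learns its numerical embedding rather than fixing it as done here.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                                          # illustrative embedding dimension
FEATURES = ["ALT", "AST", "bilirubin"]         # hypothetical lab features
feat_emb = {f: rng.normal(size=D) for f in FEATURES}   # learnable in practice

def evat_tokens(records):
    """Turn observed (time, feature, value) triplets into tokens,
    skipping missing values entirely (no imputation needed)."""
    tokens = []
    for t, feat, val in records:
        if val is None:                        # missing measurement -> no token
            continue
        # toy numerical embedding: scale the feature direction by the value,
        # then append the timestamp; SUMMIT learns this mapping instead
        tokens.append(np.append(feat_emb[feat] * val, t))
    return np.stack(tokens)

# a patient visit sequence where AST is missing at t=1
records = [(0, "ALT", 42.0), (1, "AST", None), (2, "bilirubin", 1.1)]
toks = evat_tokens(records)
print(toks.shape)  # (2, 9): only observed values become tokens
```

The resulting variable-length token set can then be fed to a standard Transformer encoder with a classification head.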
Electronic Health Records (EHRs) are a cornerstone of modern healthcare analytics, offering rich datasets for various disease analyses through advanced deep-learning algorithms. However, the pervasive issue of missing values in EHRs significantly hampers the development and performance of these models, and addressing it is crucial for enhancing clinical decision-making and patient care. Existing methods for handling missing data, ranging from simple imputation to more sophisticated approaches, often fall short of capturing the temporal dynamics inherent in EHRs. To bridge this gap, we introduce the Deep Stochastic Time-series Imputation (Deep STI) algorithm, an end-to-end deep learning model that seamlessly integrates a sequence-to-sequence generative network with a prediction network. Deep STI leverages the observed time-series data in EHRs, learning to infer missing values from the temporal context with high accuracy. We evaluated Deep STI on liver cancer data from the National Taiwan University Hospital (NTUH), Taiwan. Our results show that Deep STI achieved better 5-year hepatocellular carcinoma prediction (19.21% in the area under the precision-recall curve) than extreme gradient boosting (18.15%) and the Transformer (18.09%). An ablation study also illustrates the efficacy of our generative architecture compared to regular imputation. This approach not only promises to improve the reliability of disease analysis in the presence of incomplete data but also sets a new standard for utilizing EHRs in predictive healthcare. Our work aims to advance the field of healthcare analytics and open new avenues for research in deep learning applications to EHRs. (paper link: TBA)
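The generative-plus-predictive structure described above can be illustrated with a minimal sketch: a stochastic module draws several plausible completions of the missing entries from the temporal context, and a prediction head averages the risk over those draws. The interpolation-plus-noise generator and the logistic head below are toy stand-ins for what Deep STI learns end to end; all names and parameters here are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)

def stochastic_impute(series, n_samples=5, sigma=0.1):
    """Toy stand-in for the generative network: sample several plausible
    completions of the gaps from temporal context (interpolation + noise;
    the real model learns this conditional distribution)."""
    x = np.asarray(series, dtype=float)
    obs = ~np.isnan(x)
    idx = np.arange(len(x))
    base = np.interp(idx, idx[obs], x[obs])    # temporal context
    draws = base + sigma * rng.normal(size=(n_samples, len(x)))
    draws[:, obs] = x[obs]                     # keep observed values fixed
    return draws

def predict_risk(draws, w):
    """Toy prediction head: average sigmoid risk over sampled completions."""
    logits = draws @ w
    return float(np.mean(1.0 / (1.0 + np.exp(-logits))))

series = [0.2, np.nan, 0.6, np.nan, 1.0]       # a lab trajectory with gaps
draws = stochastic_impute(series)
risk = predict_risk(draws, w=np.full(5, 0.5))
```

Averaging over multiple stochastic completions, rather than committing to a single imputed trajectory, is what distinguishes this design from regular imputation pipelines.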
Stroke is a leading cause of mortality and long-term disability worldwide, and accurate stroke risk prediction is crucial for its early detection and prevention. Using deep learning to exploit patients’ time-series electronic health records (EHRs) has been shown to be a promising and efficient solution for such prediction. Although time-series data can be more informative than a single cross-section in time, real-world time-series EHRs usually have a significantly high missing rate due to irregular patient visits. This can undermine the benefits of sequential data unless a proper deep-learning model design is adopted. Furthermore, deep models have long been challenged on their interpretability, which is especially crucial for medical applications. In this study, we propose an extreme design based on the concept of recurrent independent mechanisms (RIM), termed extreme RIM (X-RIM). With no need for imputation, X-RIM utilizes each input feature’s temporal records through independent recurrent modules. Experiments on real-world data from the National Taiwan University Hospital showed that, in terms of the area under the precision-recall curve (AUPRC), the area under the receiver-operating characteristic curve (AUROC), and the Youden Index, X-RIM (AUPRC: 0.210; AUROC: 0.764; Youden: 0.373) outperformed the classic risk score CHA2DS2-VASc (AUPRC: 0.103; AUROC: 0.650; Youden: 0.223) and other benchmarks in stroke risk prediction. Additional experiments also indicate that individual feature contributions to a prediction can be evaluated intuitively under X-RIM’s independent structure, enhancing interpretability. (paper link)
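The per-feature independent structure can be sketched as follows: each feature gets its own recurrent module that consumes only that feature's observed records (skipping gaps instead of imputing them), and the prediction combines the modules' final states, so each feature's contribution is directly readable. The scalar recurrence and additive head below are a deliberately minimal stand-in for X-RIM's actual modules; all names and weights are illustrative.

```python
import numpy as np

def feature_rnn(values, w=0.5, u=0.9):
    """One independent recurrent module: consumes only this feature's
    observed records, skipping missing visits instead of imputing."""
    h = 0.0
    for v in values:
        if v is None:                   # feature not measured at this visit
            continue
        h = np.tanh(u * h + w * v)      # toy scalar recurrence
    return h

def x_rim_style_predict(patient):
    """Toy head: one module per feature; summing the final states makes
    per-feature contributions to the logit directly inspectable."""
    states = {f: feature_rnn(vals) for f, vals in patient.items()}
    logit = sum(states.values())
    return 1.0 / (1.0 + np.exp(-logit)), states

# irregular visits: each feature has its own observation pattern
patient = {"sbp": [1.2, None, 1.5], "glucose": [None, 0.3, None]}
risk, contribs = x_rim_style_predict(patient)
```

Because the modules never mix features before the final combination, `contribs` gives an intuitive per-feature attribution of the kind the abstract describes.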
Development and Clinical Validation of a Personalized Liver Cancer Risk Prediction Model
Ministry of Health and Welfare