Single-cell RNA sequencing (scRNA-seq) has transformed the study of cellular diversity by measuring gene expression in individual cells. As scRNA-seq datasets grow rapidly in number and size, integrating and analyzing them has become a significant challenge. Deep learning methods offer a flexible and effective approach to single-cell data integration. However, the optimal design of loss functions and benchmarking strategies for these methods remains an open question.
In this study, we present a comprehensive benchmarking framework for deep learning-based single-cell data integration methods. We developed 16 deep learning methods within a unified variational autoencoder framework, each built from a different combination of loss functions so that the contribution of each combination to data integration can be evaluated. The framework systematically assesses how each loss function configuration affects batch correction and biological conservation.
A key insight from our study is the importance of assessing intra-cell-type biological conservation, which existing methods handle poorly. We introduced a correlation-based loss function, the Correlation Mean Squared Error (Corr-MSE) loss, designed specifically to maintain the intra-cell-type biological variation that is often lost during integration. Our results demonstrate that this loss function improves the preservation of biological variation, particularly in complex single-cell datasets.
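The exact formulation of the Corr-MSE loss is not given here, so the following is only a minimal sketch of one plausible reading: compare the cell-by-cell Pearson correlation matrix of a reference representation with that of the integrated latent representation, and take the mean squared error between the two. The function names (`pearson`, `correlation_matrix`, `corr_mse`), the choice of reference representation, and the toy values are all assumptions made for illustration. The sketch uses only the standard library; a trainable version would need a differentiable tensor implementation.

```python
from math import sqrt


def pearson(u, v):
    """Pearson correlation between two equal-length vectors (0.0 if either is constant)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    du = [a - mean_u for a in u]
    dv = [b - mean_v for b in v]
    norm = sqrt(sum(a * a for a in du) * sum(b * b for b in dv))
    return sum(a * b for a, b in zip(du, dv)) / norm if norm > 0 else 0.0


def correlation_matrix(cells):
    """Cell-by-cell correlation matrix; each row of `cells` is one cell's feature vector."""
    return [[pearson(r, s) for s in cells] for r in cells]


def corr_mse(reference, latent):
    """Assumed Corr-MSE: MSE between the cell-cell correlation matrices of two representations.

    `reference` and `latent` must list the same cells in the same order,
    but may have different numbers of features per cell.
    """
    c_ref = correlation_matrix(reference)
    c_lat = correlation_matrix(latent)
    n = len(reference)
    squared_error = sum(
        (c_ref[i][j] - c_lat[i][j]) ** 2 for i in range(n) for j in range(n)
    )
    return squared_error / (n * n)


# Toy data (made up): three cells of one cell type, four features each.
reference = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.5, 8.0], [4.0, 3.0, 1.0, 0.5]]
# A latent space that rescales and shifts every cell keeps the correlation structure.
preserved = [[2.0 * x + 1.0 for x in row] for row in reference]
# A two-dimensional latent space that flips the relationship between the first two cells.
distorted = [[1.0, 2.0], [2.0, 1.0], [1.0, 3.0]]

loss_preserved = corr_mse(reference, preserved)
loss_distorted = corr_mse(reference, distorted)
```

Under this reading, the loss stays near zero when the relationships among cells of the same type survive integration and grows when they are rearranged, which is the behavior the paragraph above attributes to a correlation-based term.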
To address the limitations of existing benchmarking metrics, we developed scIB-E, an extended version of the scIB metrics that covers three categories: batch correction, inter-cell-type biological conservation, and intra-cell-type biological conservation. By capturing both inter- and intra-cell-type biological variation, scIB-E provides a more complete evaluation of single-cell data integration.
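How the three scIB-E categories are combined into a summary is not stated here. As a hedged illustration only, the sketch below averages the metrics within each category and then takes a weighted mean across categories. The function name `scib_e_summary`, the category keys, the default equal weights, and the example scores are all hypothetical and are not taken from the study.

```python
CATEGORIES = ("batch_correction", "inter_cell_type", "intra_cell_type")


def scib_e_summary(metrics, weights=None):
    """Summarize per-category scores and a weighted overall score.

    `metrics` maps each category name to a list of metric values in [0, 1].
    `weights` maps each category name to a non-negative weight; equal
    weighting is used when omitted (an assumption, not the study's scheme).
    """
    if weights is None:
        weights = {category: 1.0 for category in CATEGORIES}
    # Mean of the individual metrics inside each category.
    per_category = {
        category: sum(metrics[category]) / len(metrics[category])
        for category in CATEGORIES
    }
    # Weighted mean across the three categories.
    total_weight = sum(weights[category] for category in CATEGORIES)
    overall = (
        sum(weights[category] * per_category[category] for category in CATEGORIES)
        / total_weight
    )
    return per_category, overall


# Made-up scores for a single integration method, for illustration only.
example_metrics = {
    "batch_correction": [0.8, 0.6],
    "inter_cell_type": [0.9, 0.7],
    "intra_cell_type": [0.5, 0.3],
}
per_category, overall = scib_e_summary(example_metrics)
```

Reporting the per-category means alongside the overall score keeps a weak intra-cell-type result visible even when batch correction and inter-cell-type conservation are strong.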
Our findings highlight the potential of deep learning methods for single-cell data integration, and the refined framework and benchmarking metrics offer deeper insight into the integration process. These advances can guide the development of deep learning methods for integrating increasingly complex multimodal and spatiotemporal single-cell data, and thereby improve our understanding of biological processes at the single-cell level.