Data integration


10X single-cell (spatial transcriptomics) data integration and batch correction with LIGER

Defining cell types requires integrating diverse single-cell measurements from multiple experiments and biological contexts (no need to elaborate on this; the era of publishing a paper from a single sample is long gone). To flexibly model single-cell datasets, we developed LIGER, an algorithm that delineates shared and dataset-specific features of cell identity. We applied it to four diverse and challenging analyses of human and mouse brain cells. (1) First, we defined region-specific and sexually dimorphic gene expression in the mouse bed nucleus of the stria terminalis (morphological methods are used here as supporting evidence to judge how good the integration is). (2) Second, we analyzed expression in the human substantia nigra, comparing cell states in specific donors and relating cell types to those in the mouse (a check of integration across species). (3) Third, we integrated in situ and single-cell expression data to spatially locate fine subtypes of cells present in the mouse frontal cortex (a check that combines in situ and single-cell data). Finally, we jointly defined mouse cortical cell types using single-cell RNA-seq and DNA methylation profiles (DNA methylation is not our focus today), revealing putative mechanisms of cell-type-specific epigenomic regulation (epigenetic regulation). Integrative analyses using LIGER promise to accelerate investigations of cell-type definition, gene regulation, and disease states (let's wait and see).

The function of the mammalian brain is dependent upon the coordinated activity of highly specialized cell types. (This first sentence already matters: it stresses the importance of where cells sit in space, which is also why 10X spatial transcriptomics has now been launched.) Single-cell technologies have provided an unprecedented opportunity to systematically identify these cellular specializations, across multiple regions, in the context of perturbations, and in related species (every time I read this, I think how perfect it would be if spatial transcriptomics also reached single-cell resolution). Furthermore, new technologies can now measure DNA methylation (methylation results are also very important and worth studying in depth; the leading figure in this area is Fuchou Tang (汤富酬)), chromatin accessibility (that is, ATAC), and in situ expression (in situ hybridization), in thousands to millions of cells. (The sheer size of single-cell datasets is itself a big problem at the moment; the COVID-19 study from Zemin Zhang's group reached a staggering cell count on the order of a million.) Each of these experimental contexts and measurement modalities provides a different glimpse into cellular identity.

Integrative computational tools that can flexibly combine individual single-cell datasets into a unified, shared analysis offer many exciting biological opportunities (the case for integrative analysis). The major challenge of integrative analysis lies in reconciling the immense heterogeneity observed across individual datasets. (This is no longer only about inter-individual heterogeneity in immunology; many studies now have to deal with batches.) However, in many kinds of analysis, both dataset similarities and differences are biologically important, such as when we seek to compare and contrast scRNA-seq data from healthy and disease-affected individuals.

To address these challenges, we developed a new computational method called LIGER (linked inference of genomic experimental relationships). We show here that LIGER enables the identification of shared cell types across individuals, species, and multiple modalities (gene expression, epigenetic, or spatial data), as well as dataset-specific features, offering a unified analysis of heterogeneous single-cell datasets. (Here we only care about removing differences between samples; the cross-species part is worth a look for background.)

LIGER takes as input multiple single-cell datasets, which may be scRNA-seq experiments from different individuals, time points, or species, or measurements from different molecular modalities, such as single-cell epigenome data or spatial gene expression data (individuals, species, technologies). LIGER then employs integrative non-negative matrix factorization (iNMF) (I am not sure how familiar everyone is with non-negative matrix factorization) to learn a low-dimensional space in which each cell is defined by one set of dataset-specific factors, or metagenes, and another set of shared metagenes.

We assessed the performance of LIGER through the use of two metrics: alignment and agreement (read "metrics" here as evaluation indicators). Alignment measures the uniformity of mixing for two or more samples in the aligned latent space. This metric should be high when datasets share underlying cell types, and low when datasets do not share cognate populations (let's keep this note in mind for now). The second metric, agreement, quantifies the similarity of each cell's neighborhood when a dataset is analyzed separately versus jointly with other datasets (it quantifies similarity). High agreement indicates that cell-type relationships are preserved with minimal distortion in the joint analysis.

We calculated alignment and agreement metrics using published datasets, comparing the LIGER analyses to those generated by the Seurat package (a comparison against Seurat). We first ran our analyses on a pair of scRNA-seq datasets from human blood cells that show primarily technical differences (batch effects introduced by the technology), and should thus yield a high degree of alignment. Indeed, LIGER and Seurat show similarly high alignment statistics, and LIGER's joint clusters match the published cluster assignments for the individual datasets.

An important application of integrative single-cell analysis in neuroscience is to quantify cell-type variation across different brain regions and different members of the same species. To examine LIGER's performance in these tasks, we analyzed the bed nucleus of the stria terminalis (BNST), a subcortical region composed of multiple subnuclei, implicated in social, stress-related, and reward behaviors. To date, scRNA-seq has not been performed on BNST, providing an opportunity to clarify how cell types are shared between this structure and datasets generated from related tissues. We isolated, sequenced, and analyzed 204,737 nuclei enriched for the BNST region. Initial clustering identified 106,728 neurons, of which 70.2% were localized to BNST by examination of marker expression in the Allen Brain Atlas (ABA). Clustering analysis revealed 41 transcriptionally distinct populations of BNST-localized neurons (plain clustering analysis).

This part involves factor analysis. I am not sure how much of it you have done yourselves, so we will cover it in the next post. Do pay attention to the differential analysis here, though. If you already know the principle behind how the differential genes are ranked and which software does it, congratulations, go and try it out.

Here datasets from different technologies are integrated; let's take a brief look. To perform online iNMF, we need to install the latest Liger package from GitHub. Please see the instructions below. We first create a Liger object by passing the filenames of HDF5 files containing the raw count data.

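Before running online iNMF it may help to recall what is being optimized. The block below writes out the iNMF objective as I understand it from the LIGER paper; the symbols are my own labels rather than anything defined in this post. E_i is the normalized cells-by-genes expression matrix of dataset i, H_i holds the cell factor loadings, W holds the shared metagenes, V_i holds the dataset-specific metagenes, N is the number of datasets, and λ is a tuning parameter that penalizes the dataset-specific part.

```latex
\min_{W,\; V_i,\; H_i \,\ge\, 0}\;
\sum_{i=1}^{N} \left\lVert E_i - H_i \,(W + V_i) \right\rVert_F^2
\;+\; \lambda \sum_{i=1}^{N} \left\lVert H_i V_i \right\rVert_F^2
```

A larger λ pushes more of the signal into the shared metagenes W, while a smaller λ lets each dataset keep more of its own structure in V_i. The online variant optimizes the same kind of objective but updates the factors from mini-batches of cells instead of the full matrices.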
The data can be downloaded here. Liger assumes by default that the HDF5 files are formatted by the 10X CellRanger pipeline. Large datasets are often generated over multiple 10X runs (for example, multiple biological replicates). In such cases it may be necessary to merge the HDF5 files from each run into a single HDF5 file. We provide the mergeH5 function for this purpose (see below for details).

We then perform the normalization, gene selection, and gene scaling in an online fashion, reading the data from disk in small batches.

Now we can use online iNMF to factorize the data, again using only mini-batches that we read from the HDF5 files on demand (default mini-batch size = 5000). A sufficient number of iterations is crucial for obtaining a good factorization. If the mini-batch size is set close to the size of the whole dataset (i.e. an epoch contains only one iteration), max.epochs needs to be increased accordingly to get more iterations.

After performing the factorization, we can perform quantile normalization to align the datasets. We can also visualize the cell factor loadings in two dimensions using t-SNE or UMAP.

Let's first evaluate the quality of data alignment. The alignment score ranges from 0 (no alignment) to 1 (perfect alignment).

With HDF5 files as input, we need to sample the raw, normalized, or scaled data from the full dataset on disk and load them into memory. Some plotting functions and downstream analyses are designed to operate on a subset of cells sampled from the full dataset. This enables rapid analysis using limited memory. The readSubset function allows either uniform random sampling or sampling balanced by cluster. Here we extract the normalized count data of 5000 sampled cells. Using the sampled data stored in memory, we can now compare clusters or datasets (within each cluster) to identify differentially expressed genes.

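A common choice for this kind of two-group comparison is the Wilcoxon rank-sum test. The sketch below is a minimal, standard-library Python version of that test for one gene, using the normal approximation with a tie correction and no continuity correction. It is only an illustration of the statistic, not the Liger package's implementation; the function name `rank_sum_test`, the gene names, and the expression values are all made up for the example.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum (Mann-Whitney U) test, normal approximation.

    x, y: non-empty lists of expression values for one gene in two groups.
    Returns (U statistic of x, two-sided p-value).
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    # Pool both groups, remembering which group each value came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])

    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        # Find the run of tied values starting at position i.
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        n_from_x = sum(1 for k in range(i, j) if pooled[k][1] == 0)
        rank_sum_x += avg_rank * n_from_x
        t = j - i
        tie_term += t ** 3 - t
        i = j

    u_x = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:  # every value is identical, so there is no evidence of a shift
        return u_x, 1.0
    z = (u_x - mean_u) / math.sqrt(var_u)
    return u_x, math.erfc(abs(z) / math.sqrt(2.0))


if __name__ == "__main__":
    # Made-up values for two hypothetical genes: (dataset 1 cells, dataset 2 cells).
    toy = {
        "geneA": ([0.1, 0.3, 0.2, 0.4, 0.2], [2.1, 1.8, 2.5, 1.9, 2.2]),
        "geneB": ([1.0, 1.2, 0.9, 1.1, 1.3], [1.1, 0.8, 1.2, 1.0, 1.4]),
    }
    # Rank genes by p-value, smallest first, as a differential expression table would.
    for gene in sorted(toy, key=lambda g: rank_sum_test(*toy[g])[1]):
        u, p = rank_sum_test(*toy[gene])
        print(gene, u, round(p, 4))
```

Because the test only uses ranks, it needs nothing more than the sampled values held in memory, which is why running it on a subset keeps memory use fixed. In a real analysis the p-values would also need multiple-testing correction across genes.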
The runWilcoxon function performs differential expression analysis by running an in-memory Wilcoxon rank-sum test on this subset. Thus, users can still analyze large datasets with a fixed amount of memory. Here we show the top 10 genes in cluster 1 whose expression levels differ significantly between the two datasets.

We can show UMAP coordinates of sampled cells by their loadings on each factor (Factor 1 as an example). Underneath, the plot displays the most highly loading shared and dataset-specific genes, with the size of the marker indicating the magnitude of the loading.

We can generate plots of dimensional reduction coordinates colored by expression of a specified gene. The first two UMAP dimensions and the gene ISG15 (identified by the Wilcoxon test in the previous step) are used as an example here. Furthermore, we can make violin plots of the expression of a specified gene for each dataset (ISG15 as an example).

The online algorithm can be applied to datasets loaded in memory as well. The same analysis is performed on the PBMCs, shown below.

If you have the resources, give it a try. How you put the software to use depends on your own needs.

Life is good, and better with you.
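As a closing supplement, here is a small sketch of what the alignment score measures. It follows the formula as I understand it from the published definition: take each cell's k nearest neighbors in the joint space, average the number of neighbors that come from the same dataset (x̄), and compute 1 − (x̄ − k/N) / (k − k/N) for N datasets. The function name `alignment_score`, the brute-force neighbor search, the assumption of equal-sized datasets, and the clipping to the 0-to-1 range are my own simplifications, not the Liger package's code.

```python
import math


def alignment_score(points, labels, k):
    """Toy alignment score for cells embedded in a joint latent space.

    points: list of coordinate tuples, one per cell.
    labels: dataset label for each cell (datasets assumed equal in size).
    k: number of nearest neighbors to inspect per cell.
    """
    n_cells = len(points)
    n_datasets = len(set(labels))
    same_total = 0
    for i in range(n_cells):
        # Brute-force distances from cell i to every other cell.
        dists = sorted(
            (math.dist(points[i], points[j]), j)
            for j in range(n_cells)
            if j != i
        )
        # Count neighbors that come from the same dataset as cell i.
        same_total += sum(1 for _, j in dists[:k] if labels[j] == labels[i])
    x_bar = same_total / n_cells
    expected = k / n_datasets  # same-dataset neighbors expected under perfect mixing
    score = 1.0 - (x_bar - expected) / (k - expected)
    # Clipping is a choice made for this sketch so toy inputs stay within 0..1.
    return max(0.0, min(1.0, score))


if __name__ == "__main__":
    # Two datasets that sit far apart in the latent space: no mixing at all.
    separated = [(0, 0), (0, 1), (1, 0), (1, 1),
                 (100, 100), (100, 101), (101, 100), (101, 101)]
    print(alignment_score(separated, ["a"] * 4 + ["b"] * 4, k=3))

    # Two datasets interleaved along a line: neighbors mostly come from the other dataset.
    interleaved = [(float(i),) for i in range(8)]
    print(alignment_score(interleaved, ["a", "b"] * 4, k=2))
```

When every neighbor comes from the cell's own dataset, x̄ equals k and the score is 0; when neighbors are drawn from all datasets in equal proportion, x̄ equals k/N and the score is 1. This matches the reading above that alignment should be high only when the datasets really share cell types.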