1-SpatialData简介及安装相关分析库
库的论文发表在nature methods:
Marconato, L., Palla, G., Yamauchi, K.A. et al. SpatialData: an open and universal data framework for spatial omics. Nat Methods (2024). https:///10.1038/s41592-024-02212-x
Github链接:
https://github.com/scverse/spatialdata
#请安装相关的库#最好新建一个环境进行分析,避免包冲突#pip install squidpy -i https://pypi.tuna./simple#pip install spatialdata -i https://pypi.tuna./simple#pip install geosketch -i https://pypi.tuna./simple#pip install spatialdata_plot -i https://pypi.tuna./simple#pip install spatialdata-io -i https://pypi.tuna./simple
import scanpy as scimport numpy as npimport pandas as pdimport squidpy as sqimport matplotlib.pyplot as pltimport spatialdata as spdimport spatialdata_plot as spltimport spatialdata_io as soimport geosketch as sketchimport jsonimport gcfrom PIL import Image
cd P5_CRC├── binned_outputs│ ├── square_002um│ ├── square_008um│ └── square_016um└── feature_slice.h5
cd P5_NAT.├── binned_outputs│ ├── square_002um│ │ ├── filtered_feature_bc_matrix│ │ ├── filtered_feature_bc_matrix.h5│ │ ├── raw_feature_bc_matrix│ │ ├── raw_feature_bc_matrix.h5│ │ ├── raw_probe_bc_matrix.h5│ │ └── spatial│ ├── square_008um│ │ ├── analysis│ │ ├── cloupe.cloupe│ │ ├── filtered_feature_bc_matrix│ │ ├── filtered_feature_bc_matrix.h5│ │ ├── raw_feature_bc_matrix│ │ ├── raw_feature_bc_matrix.h5│ │ └── spatial│ └── square_016um│ ├── analysis│ ├── filtered_feature_bc_matrix│ ├── filtered_feature_bc_matrix.h5│ ├── raw_feature_bc_matrix│ ├── raw_feature_bc_matrix.h5│ └── spatial└── feature_slice.h5
2-数据读取
首先我们单独读取一个样本演示读入函数用法,以及spatial object的保存与再次读入,并探究一下SpatialData数据格式:数据的读取使用so.visium_hd函数。函数使用简单,提供数据路径及分辨率即可,有点像seurat读取visium HD的赶脚!
sdata_crc = so.visium_hd( path="./P5_CRC/",#理想情况下是spaceranger outs文件夹路径,这里就是包含了binned_outputs和feature_slice.h5文件路径 dataset_id = 'P5_CRC',#sample识别ID,多样本可用于区分sample filtered_counts_file=True,#使用filtered_feature_bc_matrix.h5,过滤后的表达矩阵 bin_size='008',#选择square-bin的分辨率,一般有2,8,16三个分辨率数据。如果加载多个分辨率数据,传入一个列表['008','016'] fullres_image_file=None,#全分辨率图像的路径,这里选择不加载,数据太大 load_all_images=True,#如果设置为False,则只加载hires/lowres image,不加载cytassist_image var_names_make_unique=True #确保基因名称唯一性)sdata_crc
SpatialData object├── Images│ ├── 'P5_CRC_cytassist_image': DataArray[cyx] (3, 3000, 3199)│ ├── 'P5_CRC_hires_image': DataArray[cyx] (3, 5298, 6000)│ └── 'P5_CRC_lowres_image': DataArray[cyx] (3, 530, 600)├── Shapes│ └── 'P5_CRC_square_008um': GeoDataFrame shape: (541968, 1) (2D shapes)└── Tables └── 'square_008um': AnnData (541968, 18085)with coordinate systems: ▸ 'P5_CRC', with elements: P5_CRC_cytassist_image (Images), P5_CRC_hires_image (Images), P5_CRC_lowres_image (Images), P5_CRC_square_008um (Shapes) ▸ 'P5_CRC_downscaled_hires', with elements: P5_CRC_hires_image (Images), P5_CRC_square_008um (Shapes) ▸ 'P5_CRC_downscaled_lowres', with elements: P5_CRC_lowres_image (Images), P5_CRC_square_008um (Shapes)
多样本数据处理:
#字典传入不同sample信息#这里字典的key最好设置为样本名称samples = {"P5_CRC":["./P5_CRC/"],"P5_NAT":["./P5_NAT/"]}sdatas = []#循环读取for key, inputs in samples.items():sdata = so.visium_hd(path=inputs[0],dataset_id = key,filtered_counts_file=True,bin_size='008',fullres_image_file=None,load_all_images=True,var_names_make_unique=True)for table in sdata.tables.values():table.var_names_make_unique()table.obs["sample"] = keysdatas.append(sdata)
合并数据:这里的合并不是真正意义上的合并(不是图片合并哦),可以说是拼接不同样本SpatialData对象!
comb_sdata = spd.concatenate(sdatas, concatenate_tables=True)#concatenate_tables为True表示anndata也合并comb_sdata.write("comb_sdata", overwrite=True)#做好保存数据3-数据质控及降维聚类
这一部分内容就是相当于进行了scRNA的步骤【还是一样的,数据质控标准需要结合自己实际的数据】数据的质控当然是针对表达矩阵的数据,需要使用scanpy,和其流程一样的处理!去处一些低质量的bins。
首先提取adata,上面的数据结构介绍中以及知晓,表达矩阵以及metadata储存在table。
Tables
└── 'square_008um': AnnData (977741, 18085)
adata = comb_sdata["square_008um"] adata
AnnData object with n_obs × n_vars = 977741 × 18085 obs: 'in_tissue', 'array_row', 'array_col', 'location_id', 'region', 'sample' uns: 'spatialdata_attrs' obsm: 'spatial'
#计算线粒体基因比例adata.var["mt"] = adata.var_names.str.startswith(("MT-"))sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], inplace=True, percent_top=None)#plot QC前sc.pl.violin(adata=adata, keys=["total_counts"], stripplot=False, inner="box",groupby="sample",)sc.pl.violin(adata=adata, keys=["n_genes_by_counts"], groupby="sample", stripplot=False, inner="box")sc.pl.violin(adata=adata, keys=["pct_counts_mt"], groupby="sample", stripplot=False, inner="box")#过滤基因细胞,线粒体基因比例不做过滤了,仁者见仁,智者见智,也需要结合生物学意义#有些地方认为,空转线粒体基因比例高低可能具有生物学意义,不建议过滤。sc.pp.filter_genes(adata, min_cells=50)sc.pp.filter_cells(adata, min_counts=50)sc.pp.filter_cells(adata, max_counts=3000)
#plot QC后sc.pl.violin(adata=adata, keys=["total_counts"], stripplot=False, inner="box",groupby="sample",)sc.pl.violin(adata=adata, keys=["n_genes_by_counts"], groupby="sample", stripplot=False, inner="box")
#标准化sc.pp.normalize_total(adata, target_sum = None)sc.pp.log1p(adata)sc.tl.pca(adata)# Elbow plotsc.pl.pca_variance_ratio(adata, log=True,n_pcs=50)
sc.pp.neighbors(adata, n_neighbors=30, use_rep="X_pca",metric="correlation")sc.tl.leiden(adata, flavor="igraph", key_added="clusters", resolution=0.5,random_state=0)sc.tl.umap(adata,min_dist=0.5, spread=2, random_state=0)
# Plot UMAPsc.pl.umap(adata, color=["clusters"], title="UMAP by Clusters")sc.pl.umap(adata, color=["sample"], title="UMAP by Sample")
4-空间组织可视化聚类
这里可能会有一个关心的问题,我们做降维聚类的时候是单独提取了adata出来进行的,那么这个结果需要返回spatialdata obj吗?其实您关注一下就会知道,结果是直接返回的,之前的comb_sdata中tabels已经发生了改变!所以不需要额外的操作!
comb_sdata
SpatialData object, with associated Zarr store: /home/data_analysis/10X_space/CRC_visiumHD/comb_sdata├── Images│ ├── 'P5_CRC_cytassist_image': DataArray[cyx] (3, 3000, 3199)│ ├── 'P5_CRC_hires_image': DataArray[cyx] (3, 5298, 6000)│ ├── 'P5_CRC_lowres_image': DataArray[cyx] (3, 530, 600)│ ├── 'P5_NAT_cytassist_image': DataArray[cyx] (3, 3004, 3200)│ ├── 'P5_NAT_hires_image': DataArray[cyx] (3, 4153, 6000)│ └── 'P5_NAT_lowres_image': DataArray[cyx] (3, 415, 600)├── Shapes│ ├── 'P5_CRC_square_008um': GeoDataFrame shape: (541968, 1) (2D shapes)│ └── 'P5_NAT_square_008um': GeoDataFrame shape: (435773, 1) (2D shapes)└── Tables └── 'square_008um': AnnData (858806, 17860)with coordinate systems: ▸ 'P5_CRC', with elements: P5_CRC_cytassist_image (Images), P5_CRC_hires_image (Images), P5_CRC_lowres_image (Images), P5_CRC_square_008um (Shapes) ▸ 'P5_CRC_downscaled_hires', with elements: P5_CRC_hires_image (Images), P5_CRC_square_008um (Shapes) ▸ 'P5_CRC_downscaled_lowres', with elements: P5_CRC_lowres_image (Images), P5_CRC_square_008um (Shapes) ▸ 'P5_NAT', with elements: P5_NAT_cytassist_image (Images), P5_NAT_hires_image (Images), P5_NAT_lowres_image (Images), P5_NAT_square_008um (Shapes) ▸ 'P5_NAT_downscaled_hires', with elements: P5_NAT_hires_image (Images), P5_NAT_square_008um (Shapes) ▸ 'P5_NAT_downscaled_lowres', with elements: P5_NAT_lowres_image (Images), P5_NAT_square_008um (Shapes)
comb_sdata.pl.render_images("P5_CRC_hires_image").pl.render_shapes("P5_CRC_square_008um", color="clusters").pl.show(coordinate_systems="P5_CRC_downscaled_lowres", title='P5_CRC')comb_sdata.pl.render_images("P5_NAT_hires_image").pl.render_shapes("P5_NAT_square_008um", color="clusters").pl.show(coordinate_systems="P5_NAT_downscaled_lowres", title='P5_NAT')spd.bounding_box_query( comb_sdata, min_coordinate=[3750, 3750], max_coordinate=[5000, 4250], axes=("x", "y"), target_coordinate_system='P5_CRC_downscaled_hires').pl.render_images("P5_CRC_hires_image").pl.show(coordinate_systems="P5_CRC_downscaled_hires", title='P5_CRC')spd.bounding_box_query( comb_sdata, min_coordinate=[3750, 3750], max_coordinate=[5000, 4250], axes=("x", "y"), target_coordinate_system='P5_CRC_downscaled_hires').pl.render_images("P5_CRC_hires_image").pl.render_shapes("P5_CRC_square_008um", color="clusters").pl.show(coordinate_systems="P5_CRC_downscaled_hires", title='P5_CRC')5-marker genes & annotation
当然不能直接按照scRNA那样进行注释,不过cluster marker可以粗略的看看组织结构celltype分布。更精细的需要进行细胞分割或者反卷积!
#cluster marker genessc.tl.rank_genes_groups(adata = adata, groupby="clusters", method="wilcoxon")
# cannonical markers for annotationmarker_genes = { "Fibroblasts": ["COL1A1", "MMP2", 'VIM'], "Goblet cells": ["FCGBP", "MUC2", "CLCA1"], "Enterocyte":["EPCAM","KRT8","KRT20","FABP2"], "Plasma B cells":['JCHAIN','IGKC','IGHG1'], "B cell":['MS4A1','CD74','CD19','CD22'], "Smooth muscle":['TAGLN','DES','MYH11'], "Tumor":["CEACAM6",'REG1B',"REG1A"]}# Plot dotplot for initial cluster assessmentsc.pl.dotplot(adata = adata, var_names = marker_genes, groupby="clusters", standard_scale="var")comb_sdata.pl.render_images("P5_CRC_hires_image").pl.render_shapes("P5_CRC_square_008um",color="CEACAM6").pl.show(coordinate_systems="P5_CRC_downscaled_lowres", title='P5_CRC')