前面的笔记:
-
(IF=50.0/Q1)顶刊python代码学习(一):数据背景,环境配置与数据加载 -
(IF=50.0/Q1)顶刊python代码学习(二):细胞类型初始注释 -
(IF=50.0/Q1)顶刊python代码学习(三):细胞类型TSNE绘制 -
代码下载的地方:https://github.com/dpeerlab/lung-development-cancer-progression -
Note:
下面一些准备包括库的加载和与函数以及颜色定义代码会比较长,「代码里面给了一个 ipynb 的文件,直接在上面运行即可,不需要复制这里面的。」我们这里简单做一些代码解读。
今天的学习内容:
对应文章中的附图3,提取细胞TNSE坐标以及相关的细胞类型标签数据,绘制TNSE图,细胞类型注释基因dotplot图,热图。
Extended Data Fig. 3: Phenotyping myeloid, epithelial and stromal cell types
加载数据
见稿子:(IF=50.0/Q1)顶刊python代码学习(一):数据背景,环境配置与数据加载
注意前面有一些自定义的函数需要运行一下,读取H5文件:
# SPECIFY SAVE DIRECTORIES
# optional key word to append to directory name
tag = '_Res'
now = time.strftime("%x")
now = str.replace(now,'/','_')
# SPECIFY OUTPUT STEMS FOR FIGURES/PATHWAY ANALYSIS
FIG_output_stem = "./figures/" + FN.replace('.h5','{}_'.format(tag)) + now +'/'
print(FIG_output_stem)
# CREATE DIRECTORIES IF THEY DO NOT EXIST
# CREATE FIGURE DIRECTORY IF IT DOES NOT EXIST
d = os.path.dirname(FIG_output_stem)
ifnot os.path.exists(d):
os.makedirs(d)
# PATH TO H5 WITH PROCESSED DATAFRAMES
#'/ifs/e63data/massaguelab/ashley/data/sc_RNAseq/h5/processed/Project_AL_H2087_LCC/manuscript/'
DATA_PATH = './data/'
FN = 'PATIENT_LUNG_ADENOCARCINOMA_ANNOTATED.h5'
h5_data = H5(DATA_PATH + FN)
h5_data.ls()
这个数据结构很特别,全是各种坐标单独保存的。
加载数据:今天加载的是细胞亚群 「Myeloid/Epithelium/Stroma」
这里我将代码具体化了,前面那种 exec 的函数写法感觉很抽象。
# SPECIFY DATA SUBSET
subset_type = 'MYELOID_OTHER_ALL'
# LOAD DATA
# RAW COUNTS (Cells x Genes)
# exec('DF_{} = h5_data.load('/DF_{}')'.format(subset_type,subset_type))
DF_MYELOID_OTHER_ALL = h5_data.load('/DF_MYELOID_OTHER_ALL')
# MEDIAN LIBRARY-SIZE NORMALIZED COUNTS (Cells x Genes)
# exec('NDF_{} = h5_data.load('/NDF_{}')'.format(subset_type,subset_type))
NDF_MYELOID_OTHER_ALL = h5_data.load('/NDF_MYELOID_OTHER_ALL')
# MAGIC IMPUTED AND NORMALIZED DATA (Cells x Genes)
# exec('INDF_{} = h5_data.load('/INDF_{}')'.format(subset_type,subset_type))
INDF_MYELOID_OTHER_ALL = h5_data.load('/INDF_MYELOID_OTHER_ALL')
# CELLS X DIMENSIONS (including principal components, diffusion components, ect.)
# exec('DIMENSIONS_{} = h5_data.load('/DIMENSIONS_{}')'.format(subset_type,subset_type))
DIMENSIONS_MYELOID_OTHER_ALL = h5_data.load('/DIMENSIONS_MYELOID_OTHER_ALL')
# RANKED DIFFERENTIALLY EXPRESSED GENES BY MAST (EACH CELL TYPE VS. ALL OTHER CELLS)
# exec('GENE_RANK_MAST_{} = h5_data.load('/GENE_RANK_MAST_{}')'.format(subset_type,subset_type))
GENE_RANK_MAST_MYELOID_OTHER_ALL = h5_data.load('/GENE_RANK_MAST_MYELOID_OTHER_ALL')
数据说明:
-
「DF_MYELOID_OTHER_ALL」:count 矩阵 (cells x genes)
-
「NDF_MYELOID_OTHER_ALL」:标准化后的矩阵 (cells x genes)
-
「INDF_MYELOID_OTHER_ALL」:scale矩阵:imputed and library-size normalized dataframe (cells x genes)
-
「DIMENSIONS_MYELOID_OTHER_ALL」:pca等降维坐标(cells x dimensions)
-
「GENE_RANK_MAST_MYELOID_OTHER_ALL」:每个亚群的差异结果 genes x cell group (DEG in cell group compared to all other cells)
读取细胞降维坐标如PCA,TSNE:
从两个字典
INDF_和DIMENSIONS_中提取数据,并将提取的数据分别赋值给变量x和y
dimtype = 'IPC_TSNE' #基于INDF数据得到的降维坐标TSNE
meta = 'CELL_TYPE'
exec('QUERY = INDF_{}'.format(subset_type))
exec('DIM = DIMENSIONS_{}'.format(subset_type))
x = DIM['{}0'.format(dimtype)]
y = DIM['{}1'.format(dimtype)]
# 查看DIM数据
DIM.head()
表格后面几列:取的其中的 IPC_TSNE0 和 IPC_TSNE1这两列坐标。
绘制样本颜色的TNSE图:
# XY Parameters
plt.figure(figsize = (15,5))
gs1 = gridspec.GridSpec(1,3)
gs1.update(wspace=0.1, hspace=0.1) # set the spacing between axes.
dot_size = 2
xmax = np.abs(x).max()
ymax = np.abs(y).max()
axis_min = -ceil(ceil(max(xmax,ymax)/10)*10)
axis_max = ceil(ceil(max(xmax,ymax)/10)*10)
# SAMPLE ID: LEGEND
ax1 = plt.subplot(gs1[0])
seqc.plot.scatter.categorical(x, y, c=QUERY.index.get_level_values('Legend'),
#cmap=CM_SAMPLES, # 颜色设置去掉,前面有个颜色模块找不到
legend_kwargs={'ncol': 1, 'borderaxespad':0.,
'bbox_to_anchor':(1, 0)},
s=dot_size, ax=ax1);
sns.despine()
plt.xlim(axis_min,axis_max)
plt.ylim(axis_min,axis_max)
plt.axis('off')
sns.despine()
Phenograph算法聚类结果
通过Phenograph算法对细胞或其他生物实体进行聚类分析得到的聚类结果。Phenograph是一种基于图论的聚类算法,常用于单细胞数据分析中,用于识别细胞的不同亚群或表型。
# PHENOGRAPH CLUSTERS
ax2 = plt.subplot(gs1[1])
numeric_labels = [int(label[:-14]) for label in QUERY.index.get_level_values('PHENOGRAPH_CLASS')]
g = seqc.plot.scatter.categorical(x, y, c=numeric_labels,
# cmap= CM_MYELOID_OTHER_ALL_PHENOGRAPH_CLASS,
legend_kwargs={'ncol': 3, 'borderaxespad':0.,
'bbox_to_anchor':(0.75, 0)},
s=dot_size,randomize=True, ax = ax2);
ax2.tick_params(labelsize=14)
plt.xlim(axis_min,axis_max)
plt.ylim(axis_min,axis_max)
sns.despine()
ax2.axis('off')
细胞注释结果:
# CELL TYPE ASSIGNMENTS
ax3 = plt.subplot(gs1[2])
g = seqc.plot.scatter.categorical(x, y, c=QUERY.index.get_level_values('CELL_TYPE'),
# cmap= CM_CELL_TYPE_MYELOID_OTHER_ALL,
legend_kwargs={'ncol': 1, 'borderaxespad':0.,
'bbox_to_anchor':(1, 0)},
s=dot_size,randomize=True, ax = ax3);
h,l=g.get_legend_handles_labels()
ax3.tick_params(labelsize=14)
plt.xlim(axis_min,axis_max)
plt.ylim(axis_min,axis_max)
sns.despine()
ax3.axis('off')
三个图放在一起并保存:
需要上面的代码都放在一起运行:
dimtype = 'IPC_TSNE' # TSNE COMPUTED ON PRINCIPAL COMPONENTS ON THE INDF
meta = 'CELL_TYPE'
exec('QUERY = INDF_{}'.format(subset_type))
exec('DIM = DIMENSIONS_{}'.format(subset_type))
x = DIM['{}0'.format(dimtype)]
y = DIM['{}1'.format(dimtype)]
# XY Parameters
plt.figure(figsize = (15,5))
gs1 = gridspec.GridSpec(1,3)
gs1.update(wspace=0.1, hspace=0.1) # set the spacing between axes.
dot_size = 2
xmax = np.abs(x).max()
ymax = np.abs(y).max()
axis_min = -ceil(ceil(max(xmax,ymax)/10)*10)
axis_max = ceil(ceil(max(xmax,ymax)/10)*10)
# SAMPLE ID: LEGEND
ax1 = plt.subplot(gs1[0])
seqc.plot.scatter.categorical(x, y, c=QUERY.index.get_level_values('Legend'),
# cmap=CM_SAMPLES,
legend_kwargs={'ncol': 1, 'borderaxespad':0.,
'bbox_to_anchor':(1, 0)},
s=dot_size, ax=ax1);
sns.despine()
plt.xlim(axis_min,axis_max)
plt.ylim(axis_min,axis_max)
plt.axis('off')
sns.despine()
# PHENOGRAPH CLUSTERS
ax2 = plt.subplot(gs1[1])
numeric_labels = [int(label[:-14]) for label in QUERY.index.get_level_values('PHENOGRAPH_CLASS')]
g = seqc.plot.scatter.categorical(x, y, c=numeric_labels,
# cmap= CM_MYELOID_OTHER_ALL_PHENOGRAPH_CLASS,
legend_kwargs={'ncol': 3, 'borderaxespad':0.,
'bbox_to_anchor':(0.75, 0)},
s=dot_size,randomize=True, ax = ax2);
ax2.tick_params(labelsize=14)
plt.xlim(axis_min,axis_max)
plt.ylim(axis_min,axis_max)
sns.despine()
ax2.axis('off')
# CELL TYPE ASSIGNMENTS
ax3 = plt.subplot(gs1[2])
g = seqc.plot.scatter.categorical(x, y, c=QUERY.index.get_level_values('CELL_TYPE'),
# cmap= CM_CELL_TYPE_MYELOID_OTHER_ALL,
legend_kwargs={'ncol': 1, 'borderaxespad':0.,
'bbox_to_anchor':(1, 0)},
s=dot_size,randomize=True, ax = ax3);
h,l=g.get_legend_handles_labels()
ax3.tick_params(labelsize=14)
plt.xlim(axis_min,axis_max)
plt.ylim(axis_min,axis_max)
sns.despine()
ax3.axis('off')
# SAVE FIGURE
figure_label = '_{}_LEGEND_PHENOGRAPH_CELLTYPE_{}'.format(subset_type,dimtype)
fn = FIG_output_stem + FN.replace(".h5", "") + figure_label
plt.savefig(fn + '.png', bbox_inches='tight',dpi=400)
print(fn)
细胞注释marker基因dotplot图
1、先提取数据以及制定marker基因:
这段代码里面可以学习到一些亚群的marker基因,以及绘图惭怍。

# 指定分组参数及关联的配色方案
meta='CELL_TYPE'
FLATUI_PLOT = FLATUI_CELL_TYPE_MYELOID_OTHER_ALL
# 使用的 NDF 数据
exec('QUERY = NDF_{}'.format(subset_type))
# 按类别(有序)对分组的细胞进行排序
categories = ['EPITHELIAL','PROLIFERATING MESENCHYMAL PROGENITOR','FIBROBLAST','PERICYTE','ENDOTHELIAL',
'MONOCYTE', 'NEUTROPHIL','MDSC','MACROPHAGE',
'MICROGLIA/MACROPHAGE','DENDRITIC', 'DENDRITIC (ACTIVATED)','MAST']
# marker基因
marker_genes_dict = {
'EPITHELIAL': ['CDH1','SOX9','SOX2'],
'STROMA': ['VIM', 'ACTA2','DES','PDGFRA','PDGFRB','CDH5','PECAM1','CD34'],
'MONOCYTE': ['FCGR3A','ITGAX','CD14'],
'DENDRITIC':['CD40','CD80','HLA-DOA', 'HLA-DOB','HLA-DQA1','HLA-DQA2','HLA-DQB1'],
'MHC2': ['HLA-DPA1','HLA-DPB1', 'HLA-DQA1','HLA-DQA2','HLA-DQB1', 'HLA-DRA','HLA-DRB1'],
'MACROPHAGE': ['LAMP2','LGALS3','MRC1','TFRC','FCGR2A','ITGAM','CD163'],
'NEUTROPHIL': ['LY6G6D','S100A9','S100A8','S100A12','PILRA','LILRB2','ISG15'],
'MDSC':['CYBB','CSF1R','STAT1','CD14','FCGR1A','FCGR1B','CD33','CCR1'],
'MAST':['KIT'] #,'ENPP3'
}
#指定绘图基因的顺序
var_group_labels = ['EPITHELIAL','STROMA','MONOCYTE','NEUTROPHIL','MDSC','MACROPHAGE','DENDRITIC','MAST']
2、将分类的基因字典转换为带有位置注释的标记基因
# 将分类的基因字典转换为带有位置注释的标记基因
tmp = pd.DataFrame()
for gene_group in var_group_labels:
genes = marker_genes_dict[gene_group]
genes = [gene for gene in genes if gene in QUERY.columns]
labels = [gene_group] * len(genes)
current = pd.DataFrame({'labels':labels,'genes':genes})
tmp = tmp.append(current)
tmp.index = np.arange(len(tmp))
marker_genes = tmp['genes'].values
var_group_positions = []
for gene_group in var_group_labels:
cpositions = list(tmp[tmp['labels'] ==gene_group].index)
var_group_positions.append((cpositions[0],cpositions[len(cpositions)-1]))
注意运行上面的代码时汇报个错:
AttributeError: ‘DataFrame’ object has no attribute ‘append’
原因:
报错信息表明
DataFrame对象没有append方法。这是因为从Pandas 1.4.0版本开始,append方法已经被弃用,推荐使用pd.concat来替代。
修改后的代码:
# 将分类的基因字典转换为带有位置注释的标记基因
tmp = pd.DataFrame()
for gene_group in var_group_labels:
genes = marker_genes_dict[gene_group]
genes = [gene for gene in genes if gene in QUERY.columns]
labels = [gene_group] * len(genes)
current = pd.DataFrame({'labels':labels,'genes':genes})
# tmp = tmp.append(current)
tmp = pd.concat([tmp, current], ignore_index=True)
tmp.index = np.arange(len(tmp))
marker_genes = tmp['genes'].values
var_group_positions = []
for gene_group in var_group_labels:
cpositions = list(tmp[tmp['labels'] ==gene_group].index)
var_group_positions.append((cpositions[0],cpositions[len(cpositions)-1]))
3、dotplot绘图
这里也会报错:
# 按类别标记的基因的点图
ax = dotplot(QUERY[marker_genes], marker_genes, categories, groupby=meta, use_raw=None, log=False,
expression_cutoff=0, mean_only_expressed=True, color_map='Reds', dot_max=0.75,
dot_min=0.10, figsize=None, dendrogram=False, gene_symbols=None,
var_group_positions=var_group_positions, standard_scale = 'var', smallest_dot=0,
var_group_labels=var_group_labels, var_group_rotation=90, layer=None, show=None,
save=None)
# SAVE FIGURE
figure_label = '_{}_dotPlot_markerGenes_by{}'.format(subset_type,meta)
fn = FIG_output_stem + FN.replace(".h5", "") + figure_label
plt.savefig(fn + '.png', bbox_inches='tight',dpi=fig_dpi)
print(fn)
错误:
AttributeError: ‘DataFrame’ object has no attribute ‘ix’
原因:
AttributeError: 'DataFrame' object has no attribute 'ix'表明代码中可能使用了ix属性,但ix在 Pandas 0.20.0 版本之后已经被弃用。在 Pandas 的新版本中,推荐使用loc或iloc来替代ix。
这里需要去修改前面自定义的函数dotplot,将其中的ix改为loc就可以成功绘图了。
重新运行上面的代码,气泡图结果如下:
细胞类型聚类的热图
上面那些基因的热图:按细胞类型聚类的热图,显示了区分标记的平均估算表达值,并通过z分数进行了标准化。行根据注释的细胞类型进行了着色。
提取热图数据:
# 指定分组参数及关联的配色方案
meta='CELL_TYPE'
FLATUI_PLOT = FLATUI_CELL_TYPE_MYELOID_OTHER_ALL
# 行索引
genes = marker_genes
exec('QUERY = INDF_{}'.format(subset_type))
# 将十六进制颜色代码转换为RGB值
# colors = np.zeros((len(FLATUI_PLOT),3))
# for ii,hexcolor in enumerate(FLATUI_PLOT):
# colors[ii,:] = tuple(hex(hexcolor).rgb)
# colors = np.divide(colors,255)
# 选择独特的差异表达基因(DE genes)。
genes,ind = np.unique(genes, return_index=True)
# 构建热图数据
genes = [g for g in genes if g in list(QUERY.columns)] # Detected genes
heatmap_data = pd.DataFrame(data = zscore(QUERY[genes].values,axis=0),columns = genes, index = QUERY.index).groupby(level= meta, axis=0).median() #CLUSTERS
yticks = heatmap_data.index
xticks = heatmap_data.columns
# 按<meta>对行进行着色
ind = list(heatmap_data.index)
cix = (np.linspace(0,shape(colors)[0],len(np.unique(ind)))).astype(int)
if cix[len(cix)-1]==shape(colors)[0]:
cix[len(cix)-1]=shape(colors)[0]-1
lut = dict(zip(np.unique(ind), colors[cix,:]))
row_colors = pd.Series(ind,index = ind).map(lut)
# LINKAGE
method = 'average'
metric = 'correlation'
linkage = hc.linkage(heatmap_data, method=method, metric = metric)
row_linkage = deepcopy(linkage)
linkage = hc.linkage(heatmap_data.T, method=method, metric = metric)
col_linkage = deepcopy(linkage)
# 根据链接方法重新排序热图(可选,仍然较慢)。
r1 = hc.leaves_list(row_linkage)
cl = hc.leaves_list(col_linkage)
mat = heatmap_data.iloc[r1,cl]
行为细胞类型,列为对应的基因。
热图绘制:
# 查看聚类标记的热图和树状图
fig = plt.figure(figsize=(12,5))
plt.rcParams["axes.grid"] = False
import matplotlib.patches as patches
# ADD ROW COLOR INDEX (CELL OF ORIGIN)
ax1 = fig.add_axes([0.080,0.1,0.02,0.6])
x = 0
y = 0
for c in row_colors[list(mat.index)]:
pos = (x, y / len(mat.index))
ax1.add_patch(patches.Rectangle(pos, 1, 1 / len(mat.index), color=c))
if y >= len(r1)-1:
x += 1
y = 0
else:
y += 1
plt.axis('off')
# ADD MATRIX WITH GENE NAMES
axmatrix = fig.add_axes([0.1,0.1,0.7,0.6])
im = axmatrix.matshow(mat, aspect='auto', origin='lower', cmap=plt.cm.RdBu_r,vmin=-3,vmax=3)
labels = list(mat.columns)
axmatrix.xaxis.set_ticks_position('bottom')
axmatrix.set_yticklabels([]);
xtick = plt.xticks(range(len(labels)), labels, rotation = 90, fontsize = 14)
# ADD DENDROGRAM
sch.set_link_color_palette(['#808080', '#808080', '#808080', '#808080','#808080','#808080','#808080'])
ax2 = fig.add_axes([0,0.1,0.08,0.6])
Z1 = sch.dendrogram(row_linkage, orientation='left',above_threshold_color='#808080')
ax2.set_xticks([])
ax2.set_yticks([])
plt.axis('off')
# ADD COLORBAR
axcolor = fig.add_axes([0.81,0.1,0.01,0.6])
cbar = plt.colorbar(im, cax=axcolor)
cbar.ax.get_yaxis().labelpad = 20
cbar.ax.tick_params(labelsize=10)
cbar.ax.set_ylabel('Median Expression per {}'.format(meta), rotation=270, fontsize = 10)
# SAVE FIGURE
figure_label = '_{}_matrixPlotMedian_markerGenes_by{}'.format(subset_type,meta)
fn = FIG_output_stem + FN.replace(".h5", "") + figure_label
plt.savefig(fn + '.png', bbox_inches='tight',dpi=fig_dpi)
print(fn)
小技巧:
在我的python不是很精通的情况下,我也是今年才学会一点点。上面的很多报错我都是复制运行的代码,同时放上报错的关键信息,扔给kimi检查,这样很快就能定位到错误的问题。基本上99%的问题都能成功得到解决。推荐大家人工智能使用起来!!!
今天分享到这~
(未完待续)