（IF=50.0/Q1）顶刊python代码学习（四）：髓系、上皮、基质细胞类型注释

前面的笔记：

（IF=50.0/Q1）顶刊python代码学习（一）：数据背景，环境配置与数据加载
（IF=50.0/Q1）顶刊python代码学习（二）：细胞类型初始注释
（IF=50.0/Q1）顶刊python代码学习（三）：细胞类型TSNE绘制
代码下载的地方：https://github.com/dpeerlab/lung-development-cancer-progression
Note：

下面一些准备包括库的加载和与函数以及颜色定义代码会比较长，「代码里面给了一个 ipynb 的文件，直接在上面运行即可，不需要复制这里面的。」我们这里简单做一些代码解读。

今天的学习内容：

对应文章中的附图3，提取细胞TNSE坐标以及相关的细胞类型标签数据，绘制TNSE图，细胞类型注释基因dotplot图，热图。

Extended Data Fig. 3: Phenotyping myeloid, epithelial and stromal cell types

加载数据

见稿子：（IF=50.0/Q1）顶刊python代码学习（一）：数据背景，环境配置与数据加载

注意前面有一些自定义的函数需要运行一下，读取H5文件：

# SPECIFY SAVE DIRECTORIES
# optional key word to append to directory name
tag = '_Res'
now = time.strftime("%x") 
now = str.replace(now,'/','_')

# SPECIFY OUTPUT STEMS FOR FIGURES/PATHWAY ANALYSIS
FIG_output_stem = "./figures/" + FN.replace('.h5','{}_'.format(tag)) + now +'/'
print(FIG_output_stem)

# CREATE DIRECTORIES IF THEY DO NOT EXIST
# CREATE FIGURE DIRECTORY IF IT DOES NOT EXIST     
d = os.path.dirname(FIG_output_stem)
ifnot os.path.exists(d):
        os.makedirs(d)
        
# PATH TO H5 WITH PROCESSED DATAFRAMES
#'/ifs/e63data/massaguelab/ashley/data/sc_RNAseq/h5/processed/Project_AL_H2087_LCC/manuscript/'
DATA_PATH = './data/'
FN = 'PATIENT_LUNG_ADENOCARCINOMA_ANNOTATED.h5'

h5_data = H5(DATA_PATH + FN)
h5_data.ls()

这个数据结构很特别，全是各种坐标单独保存的。

加载数据：今天加载的是细胞亚群「Myeloid/Epithelium/Stroma」

这里我将代码具体化了，前面那种 exec 的函数写法感觉很抽象。

# SPECIFY DATA SUBSET
subset_type = 'MYELOID_OTHER_ALL'

# LOAD DATA
# RAW COUNTS (Cells x Genes)
# exec('DF_{} = h5_data.load('/DF_{}')'.format(subset_type,subset_type))
DF_MYELOID_OTHER_ALL = h5_data.load('/DF_MYELOID_OTHER_ALL')

# MEDIAN LIBRARY-SIZE NORMALIZED COUNTS (Cells x Genes)
# exec('NDF_{} = h5_data.load('/NDF_{}')'.format(subset_type,subset_type))
NDF_MYELOID_OTHER_ALL = h5_data.load('/NDF_MYELOID_OTHER_ALL')

# MAGIC IMPUTED AND NORMALIZED DATA (Cells x Genes)
# exec('INDF_{} = h5_data.load('/INDF_{}')'.format(subset_type,subset_type))
INDF_MYELOID_OTHER_ALL = h5_data.load('/INDF_MYELOID_OTHER_ALL')

# CELLS X DIMENSIONS (including principal components, diffusion components, ect.)
# exec('DIMENSIONS_{} = h5_data.load('/DIMENSIONS_{}')'.format(subset_type,subset_type))
DIMENSIONS_MYELOID_OTHER_ALL = h5_data.load('/DIMENSIONS_MYELOID_OTHER_ALL')

# RANKED DIFFERENTIALLY EXPRESSED GENES BY MAST (EACH CELL TYPE VS. ALL OTHER CELLS)
# exec('GENE_RANK_MAST_{} = h5_data.load('/GENE_RANK_MAST_{}')'.format(subset_type,subset_type))
GENE_RANK_MAST_MYELOID_OTHER_ALL = h5_data.load('/GENE_RANK_MAST_MYELOID_OTHER_ALL')

数据说明：

「DF_MYELOID_OTHER_ALL」：count 矩阵 (cells x genes)
「NDF_MYELOID_OTHER_ALL」：标准化后的矩阵 (cells x genes)
「INDF_MYELOID_OTHER_ALL」：scale矩阵：imputed and library-size normalized dataframe (cells x genes)
「DIMENSIONS_MYELOID_OTHER_ALL」：pca等降维坐标（cells x dimensions）
「GENE_RANK_MAST_MYELOID_OTHER_ALL」：每个亚群的差异结果 genes x cell group (DEG in cell group compared to all other cells)

读取细胞降维坐标如PCA，TSNE：

从两个字典INDF_和DIMENSIONS_中提取数据，并将提取的数据分别赋值给变量x和y

dimtype = 'IPC_TSNE' #基于INDF数据得到的降维坐标TSNE
meta = 'CELL_TYPE'
exec('QUERY = INDF_{}'.format(subset_type))
exec('DIM = DIMENSIONS_{}'.format(subset_type))
x = DIM['{}0'.format(dimtype)]
y = DIM['{}1'.format(dimtype)]
# 查看DIM数据
DIM.head()

表格后面几列：取的其中的 IPC_TSNE0 和 IPC_TSNE1这两列坐标。

绘制样本颜色的TNSE图：

# XY Parameters
plt.figure(figsize = (15,5))
gs1 = gridspec.GridSpec(1,3)
gs1.update(wspace=0.1, hspace=0.1) # set the spacing between axes. 
dot_size = 2

xmax = np.abs(x).max()
ymax = np.abs(y).max()
axis_min = -ceil(ceil(max(xmax,ymax)/10)*10)
axis_max = ceil(ceil(max(xmax,ymax)/10)*10)

# SAMPLE ID: LEGEND
ax1 = plt.subplot(gs1[0])
seqc.plot.scatter.categorical(x, y, c=QUERY.index.get_level_values('Legend'), 
                              #cmap=CM_SAMPLES, # 颜色设置去掉，前面有个颜色模块找不到
                              legend_kwargs={'ncol': 1, 'borderaxespad':0.,
                                            'bbox_to_anchor':(1, 0)}, 
                              s=dot_size, ax=ax1);
sns.despine()
plt.xlim(axis_min,axis_max)
plt.ylim(axis_min,axis_max)
plt.axis('off')
sns.despine()

Phenograph算法聚类结果

通过Phenograph算法对细胞或其他生物实体进行聚类分析得到的聚类结果。Phenograph是一种基于图论的聚类算法，常用于单细胞数据分析中，用于识别细胞的不同亚群或表型。

# PHENOGRAPH CLUSTERS
ax2 = plt.subplot(gs1[1])
numeric_labels = [int(label[:-14]) for label in QUERY.index.get_level_values('PHENOGRAPH_CLASS')]
g = seqc.plot.scatter.categorical(x, y, c=numeric_labels, 
                                  # cmap= CM_MYELOID_OTHER_ALL_PHENOGRAPH_CLASS, 
                                  legend_kwargs={'ncol': 3, 'borderaxespad':0.,
                                            'bbox_to_anchor':(0.75, 0)},
                                  s=dot_size,randomize=True, ax = ax2);
ax2.tick_params(labelsize=14)
plt.xlim(axis_min,axis_max)
plt.ylim(axis_min,axis_max)
sns.despine()
ax2.axis('off')

细胞注释结果：

# CELL TYPE ASSIGNMENTS
ax3 = plt.subplot(gs1[2])
g = seqc.plot.scatter.categorical(x, y, c=QUERY.index.get_level_values('CELL_TYPE'), 
                                 # cmap= CM_CELL_TYPE_MYELOID_OTHER_ALL, 
                                  legend_kwargs={'ncol': 1, 'borderaxespad':0.,
                                            'bbox_to_anchor':(1, 0)},
                                  s=dot_size,randomize=True, ax = ax3);
h,l=g.get_legend_handles_labels()
ax3.tick_params(labelsize=14)
plt.xlim(axis_min,axis_max)
plt.ylim(axis_min,axis_max)
sns.despine()
ax3.axis('off')

三个图放在一起并保存：

需要上面的代码都放在一起运行：

dimtype = 'IPC_TSNE' # TSNE COMPUTED ON PRINCIPAL COMPONENTS ON THE INDF
meta = 'CELL_TYPE'
exec('QUERY = INDF_{}'.format(subset_type))
exec('DIM = DIMENSIONS_{}'.format(subset_type))
x = DIM['{}0'.format(dimtype)]
y = DIM['{}1'.format(dimtype)]

# XY Parameters
plt.figure(figsize = (15,5))
gs1 = gridspec.GridSpec(1,3)
gs1.update(wspace=0.1, hspace=0.1) # set the spacing between axes. 
dot_size = 2

xmax = np.abs(x).max()
ymax = np.abs(y).max()
axis_min = -ceil(ceil(max(xmax,ymax)/10)*10)
axis_max = ceil(ceil(max(xmax,ymax)/10)*10)

# SAMPLE ID: LEGEND
ax1 = plt.subplot(gs1[0])
seqc.plot.scatter.categorical(x, y, c=QUERY.index.get_level_values('Legend'), 
                              # cmap=CM_SAMPLES,
                              legend_kwargs={'ncol': 1, 'borderaxespad':0.,
                                            'bbox_to_anchor':(1, 0)}, 
                              s=dot_size, ax=ax1);
sns.despine()
plt.xlim(axis_min,axis_max)
plt.ylim(axis_min,axis_max)
plt.axis('off')
sns.despine()

# PHENOGRAPH CLUSTERS
ax2 = plt.subplot(gs1[1])
numeric_labels = [int(label[:-14]) for label in QUERY.index.get_level_values('PHENOGRAPH_CLASS')]
g = seqc.plot.scatter.categorical(x, y, c=numeric_labels, 
                                  # cmap= CM_MYELOID_OTHER_ALL_PHENOGRAPH_CLASS, 
                                  legend_kwargs={'ncol': 3, 'borderaxespad':0.,
                                            'bbox_to_anchor':(0.75, 0)},
                                  s=dot_size,randomize=True, ax = ax2);
ax2.tick_params(labelsize=14)
plt.xlim(axis_min,axis_max)
plt.ylim(axis_min,axis_max)
sns.despine()
ax2.axis('off')

# CELL TYPE ASSIGNMENTS
ax3 = plt.subplot(gs1[2])
g = seqc.plot.scatter.categorical(x, y, c=QUERY.index.get_level_values('CELL_TYPE'), 
                                  # cmap= CM_CELL_TYPE_MYELOID_OTHER_ALL, 
                                  legend_kwargs={'ncol': 1, 'borderaxespad':0.,
                                            'bbox_to_anchor':(1, 0)},
                                  s=dot_size,randomize=True, ax = ax3);
h,l=g.get_legend_handles_labels()
ax3.tick_params(labelsize=14)
plt.xlim(axis_min,axis_max)
plt.ylim(axis_min,axis_max)
sns.despine()
ax3.axis('off')

# SAVE FIGURE
figure_label = '_{}_LEGEND_PHENOGRAPH_CELLTYPE_{}'.format(subset_type,dimtype)
fn = FIG_output_stem + FN.replace(".h5", "") + figure_label 
plt.savefig(fn + '.png', bbox_inches='tight',dpi=400)
print(fn)

细胞注释marker基因dotplot图

1、先提取数据以及制定marker基因：

这段代码里面可以学习到一些亚群的marker基因，以及绘图惭怍。

（IF=50.0/Q1）顶刊python代码学习（四）：髓系、上皮、基质细胞类型注释

# 指定分组参数及关联的配色方案
meta='CELL_TYPE'
FLATUI_PLOT = FLATUI_CELL_TYPE_MYELOID_OTHER_ALL

# 使用的 NDF 数据
exec('QUERY = NDF_{}'.format(subset_type))
     
# 按类别（有序）对分组的细胞进行排序
categories = ['EPITHELIAL','PROLIFERATING MESENCHYMAL PROGENITOR','FIBROBLAST','PERICYTE','ENDOTHELIAL',
              'MONOCYTE', 'NEUTROPHIL','MDSC','MACROPHAGE',
              'MICROGLIA/MACROPHAGE','DENDRITIC', 'DENDRITIC (ACTIVATED)','MAST']

# marker基因
marker_genes_dict = {
                     'EPITHELIAL': ['CDH1','SOX9','SOX2'],
                     'STROMA': ['VIM', 'ACTA2','DES','PDGFRA','PDGFRB','CDH5','PECAM1','CD34'],
                     'MONOCYTE': ['FCGR3A','ITGAX','CD14'],
                     'DENDRITIC':['CD40','CD80','HLA-DOA', 'HLA-DOB','HLA-DQA1','HLA-DQA2','HLA-DQB1'],
                     'MHC2': ['HLA-DPA1','HLA-DPB1', 'HLA-DQA1','HLA-DQA2','HLA-DQB1', 'HLA-DRA','HLA-DRB1'],
                     'MACROPHAGE': ['LAMP2','LGALS3','MRC1','TFRC','FCGR2A','ITGAM','CD163'],
                     'NEUTROPHIL': ['LY6G6D','S100A9','S100A8','S100A12','PILRA','LILRB2','ISG15'],
                     'MDSC':['CYBB','CSF1R','STAT1','CD14','FCGR1A','FCGR1B','CD33','CCR1'],
                     'MAST':['KIT'] #,'ENPP3'
                    }
#指定绘图基因的顺序
var_group_labels = ['EPITHELIAL','STROMA','MONOCYTE','NEUTROPHIL','MDSC','MACROPHAGE','DENDRITIC','MAST']

2、将分类的基因字典转换为带有位置注释的标记基因

# 将分类的基因字典转换为带有位置注释的标记基因
tmp = pd.DataFrame()
for gene_group in var_group_labels:
    genes = marker_genes_dict[gene_group]
    genes = [gene for gene in genes if gene in QUERY.columns]
    labels = [gene_group] * len(genes)
    current = pd.DataFrame({'labels':labels,'genes':genes})
    tmp = tmp.append(current)
tmp.index = np.arange(len(tmp))
marker_genes = tmp['genes'].values

var_group_positions = []
for gene_group in var_group_labels:
    cpositions = list(tmp[tmp['labels'] ==gene_group].index)
    var_group_positions.append((cpositions[0],cpositions[len(cpositions)-1]))

注意运行上面的代码时汇报个错：

AttributeError: ‘DataFrame’ object has no attribute ‘append’

原因：

报错信息表明DataFrame对象没有append方法。这是因为从Pandas 1.4.0版本开始，append方法已经被弃用，推荐使用pd.concat来替代。

修改后的代码：

# 将分类的基因字典转换为带有位置注释的标记基因
tmp = pd.DataFrame()
for gene_group in var_group_labels:
    genes = marker_genes_dict[gene_group]
    genes = [gene for gene in genes if gene in QUERY.columns]
    labels = [gene_group] * len(genes)
    current = pd.DataFrame({'labels':labels,'genes':genes})
    # tmp = tmp.append(current)
    tmp = pd.concat([tmp, current], ignore_index=True)
tmp.index = np.arange(len(tmp))
marker_genes = tmp['genes'].values

var_group_positions = []
for gene_group in var_group_labels:
    cpositions = list(tmp[tmp['labels'] ==gene_group].index)
    var_group_positions.append((cpositions[0],cpositions[len(cpositions)-1]))

3、dotplot绘图

这里也会报错：

# 按类别标记的基因的点图
ax = dotplot(QUERY[marker_genes], marker_genes, categories, groupby=meta, use_raw=None, log=False,
            expression_cutoff=0, mean_only_expressed=True, color_map='Reds', dot_max=0.75,
            dot_min=0.10, figsize=None, dendrogram=False, gene_symbols=None,
            var_group_positions=var_group_positions, standard_scale = 'var', smallest_dot=0,
            var_group_labels=var_group_labels, var_group_rotation=90, layer=None, show=None,
            save=None)

# SAVE FIGURE
figure_label = '_{}_dotPlot_markerGenes_by{}'.format(subset_type,meta) 
fn = FIG_output_stem + FN.replace(".h5", "") + figure_label 
plt.savefig(fn + '.png', bbox_inches='tight',dpi=fig_dpi)
print(fn)

错误：

AttributeError: ‘DataFrame’ object has no attribute ‘ix’

原因：

AttributeError: 'DataFrame' object has no attribute 'ix' 表明代码中可能使用了 ix 属性，但 ix 在 Pandas 0.20.0 版本之后已经被弃用。在 Pandas 的新版本中，推荐使用 loc 或 iloc 来替代 ix。

这里需要去修改前面自定义的函数dotplot，将其中的ix改为loc就可以成功绘图了。

重新运行上面的代码，气泡图结果如下：

细胞类型聚类的热图

上面那些基因的热图：按细胞类型聚类的热图，显示了区分标记的平均估算表达值，并通过z分数进行了标准化。行根据注释的细胞类型进行了着色。

提取热图数据：

# 指定分组参数及关联的配色方案
meta='CELL_TYPE'
FLATUI_PLOT = FLATUI_CELL_TYPE_MYELOID_OTHER_ALL 

# 行索引
genes = marker_genes
exec('QUERY = INDF_{}'.format(subset_type))

# 将十六进制颜色代码转换为RGB值
# colors = np.zeros((len(FLATUI_PLOT),3))
# for ii,hexcolor in enumerate(FLATUI_PLOT):
#     colors[ii,:] = tuple(hex(hexcolor).rgb)
# colors = np.divide(colors,255)

# 选择独特的差异表达基因（DE genes）。
genes,ind = np.unique(genes, return_index=True)

# 构建热图数据
genes = [g for g in genes if g in list(QUERY.columns)] # Detected genes
heatmap_data = pd.DataFrame(data = zscore(QUERY[genes].values,axis=0),columns = genes, index = QUERY.index).groupby(level= meta, axis=0).median() #CLUSTERS

yticks = heatmap_data.index
xticks = heatmap_data.columns

# 按<meta>对行进行着色
ind = list(heatmap_data.index)
cix = (np.linspace(0,shape(colors)[0],len(np.unique(ind)))).astype(int)
if cix[len(cix)-1]==shape(colors)[0]:
    cix[len(cix)-1]=shape(colors)[0]-1
lut = dict(zip(np.unique(ind), colors[cix,:]))
row_colors = pd.Series(ind,index = ind).map(lut)

# LINKAGE 
method = 'average'
metric = 'correlation'
linkage = hc.linkage(heatmap_data, method=method, metric = metric)
row_linkage = deepcopy(linkage)
linkage = hc.linkage(heatmap_data.T, method=method, metric = metric)
col_linkage = deepcopy(linkage)

# 根据链接方法重新排序热图（可选，仍然较慢）。
r1 = hc.leaves_list(row_linkage)
cl = hc.leaves_list(col_linkage)
mat = heatmap_data.iloc[r1,cl]

行为细胞类型，列为对应的基因。

热图绘制：

# 查看聚类标记的热图和树状图
fig = plt.figure(figsize=(12,5))
plt.rcParams["axes.grid"] = False
import matplotlib.patches as patches

# ADD ROW COLOR INDEX (CELL OF ORIGIN)
ax1 = fig.add_axes([0.080,0.1,0.02,0.6])
x = 0
y = 0
for c in row_colors[list(mat.index)]:
    pos = (x, y / len(mat.index))
    ax1.add_patch(patches.Rectangle(pos, 1, 1 / len(mat.index), color=c))
    if y >= len(r1)-1:
        x += 1
        y = 0
    else:
        y += 1
plt.axis('off')

# ADD MATRIX WITH GENE NAMES
axmatrix = fig.add_axes([0.1,0.1,0.7,0.6])
im = axmatrix.matshow(mat, aspect='auto', origin='lower', cmap=plt.cm.RdBu_r,vmin=-3,vmax=3)
labels = list(mat.columns)
axmatrix.xaxis.set_ticks_position('bottom')
axmatrix.set_yticklabels([]);
xtick = plt.xticks(range(len(labels)), labels, rotation = 90, fontsize = 14)

# ADD DENDROGRAM
sch.set_link_color_palette(['#808080', '#808080', '#808080', '#808080','#808080','#808080','#808080'])
ax2 = fig.add_axes([0,0.1,0.08,0.6]) 
Z1 = sch.dendrogram(row_linkage, orientation='left',above_threshold_color='#808080')
ax2.set_xticks([])
ax2.set_yticks([])
plt.axis('off')

# ADD COLORBAR
axcolor = fig.add_axes([0.81,0.1,0.01,0.6])
cbar = plt.colorbar(im, cax=axcolor)
cbar.ax.get_yaxis().labelpad = 20
cbar.ax.tick_params(labelsize=10)
cbar.ax.set_ylabel('Median Expression per {}'.format(meta), rotation=270, fontsize = 10)

# SAVE FIGURE
figure_label = '_{}_matrixPlotMedian_markerGenes_by{}'.format(subset_type,meta) 
fn = FIG_output_stem + FN.replace(".h5", "") + figure_label 
plt.savefig(fn + '.png', bbox_inches='tight',dpi=fig_dpi)
print(fn)

小技巧：

在我的python不是很精通的情况下，我也是今年才学会一点点。上面的很多报错我都是复制运行的代码，同时放上报错的关键信息，扔给kimi检查，这样很快就能定位到错误的问题。基本上99%的问题都能成功得到解决。推荐大家人工智能使用起来！！！

今天分享到这~

（未完待续）

微精选

（IF=50.0/Q1）顶刊python代码学习（四）：髓系、上皮、基质细胞类型注释

今天的学习内容：

加载数据

注意前面有一些自定义的函数需要运行一下，读取H5文件：

加载数据：今天加载的是细胞亚群「Myeloid/Epithelium/Stroma」

读取细胞降维坐标如PCA，TSNE：

绘制样本颜色的TNSE图：

Phenograph算法聚类结果

细胞注释结果：

三个图放在一起并保存：

细胞注释marker基因dotplot图

1、先提取数据以及制定marker基因：

2、将分类的基因字典转换为带有位置注释的标记基因

3、dotplot绘图

细胞类型聚类的热图

提取热图数据：

热图绘制：

小技巧：

最近文章

微精选

（IF=50.0/Q1）顶刊python代码学习（四）：髓系、上皮、基质细胞类型注释

今天的学习内容：

加载数据

注意前面有一些自定义的函数需要运行一下，读取H5文件：

加载数据：今天加载的是细胞亚群 「Myeloid/Epithelium/Stroma」

读取细胞降维坐标如PCA，TSNE：

绘制样本颜色的TNSE图：

Phenograph算法聚类结果

细胞注释结果：

三个图放在一起并保存：

细胞注释marker基因dotplot图

1、先提取数据以及制定marker基因：

2、将分类的基因字典转换为带有位置注释的标记基因

3、dotplot绘图

细胞类型聚类的热图

提取热图数据：

热图绘制：

小技巧：

加载数据：今天加载的是细胞亚群「Myeloid/Epithelium/Stroma」