凑着8天假期,居然给自己放了一个超长的假期,自动续了点。
富集分析以及可视化已经写的天花乱坠了,没有啥太大的空间了,我们号内检索富集可见。有小伙伴因为很多分析,包括基础分析转入python,所以对于富集这种基础内容也希望python有对应内容,所以这里介绍一个功能全面的python库—GSEApy,满足富集分析相关的分析及可视化需求。GSEApy是一个基于Python/Rust实现的GSEA(基因集富集分析)工具,并同时封装了Enrichr功能。它可用于RNA-seq、ChIP-seq 以及微阵列(Microarray)数据的富集分析,提供了便捷的GO富集分析功能,并且还有相当可以的可视化功能。GSEApy 提供多个子命令模块,包括:gsea、prerank、ssgsea、gsva、replot、enrichr 以及 biomart。
1、安装及导入
#安装包pip install gseapy#Collecting gseapy Downloading gseapy-1.1.10-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (600 kB) #━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 601.0/601.0 kB 31.8 kB/s eta 0:00:0000:0100:01
import pandas as pdimport gseapy as gpimport matplotlib.pyplot as plt
from gseapy import Biomartbm = Biomart()
#查看有哪些物种数据库:常见的物种都有。#datasets = bm.get_datasets(mart='ENSEMBL_MART_ENSEMBL')#datasets# Dataset Description# 0 falbicollis_gene_ensembl Collared flycatcher genes (FicAlb1.5)# 1 ptaltaica_gene_ensembl Tiger genes (PanTig1.0)# 2 omelastigma_gene_ensembl Indian medaka genes (Om_v0.7.RACA)# 3 lcrocea_gene_ensembl Large yellow croaker genes (L_crocea_2.0)# 4 strutta_gene_ensembl Brown trout genes (fSalTru1.1)# ... ... ...# 208 pcatodon_gene_ensembl Sperm whale genes (ASM283717v2)# 209 ppaniscus_gene_ensembl Bonobo genes (panpan1.1)# 210 shabroptila_gene_ensembl Kakapo genes (bStrHab1_v1.p)# 211 bsplendens_gene_ensembl Siamese fighting fish genes (fBetSpl5.2)# 212 lchalumnae_gene_ensembl Coelacanth genes (LatCha1)
#ID转化,这里的操作可以理解为筛选过滤#查询人类基因注释,需要的物种dataset参数设置参照bm.get_datasets(mart='ENSEMBL_MART_ENSEMBL')bm_human = bm.query(dataset='hsapiens_gene_ensembl',attributes=['ensembl_gene_id', 'hgnc_symbol','entrezgene_id'])
#需要转化的ID提供list,然后与上面的基因注释进行映射ID_list = ['ENSG00000141510', 'ENSG00000171862','ENSG00000125285','ENSG00000182968']df = pd.DataFrame(bm_human)mapping = df.set_index('ensembl_gene_id').loc[ID_list]mapping# hgnc_symbol entrezgene_id# ensembl_gene_id # ENSG00000141510 TP53 7157# ENSG00000171862 PTEN 5728# ENSG00000125285 SOX21 11166# ENSG00000182968 SOX1 6656
#人和小鼠的转化,操作也是相当的简单了bm = Biomart()# note the dataset and attribute names are different#小鼠➡人类m2h = bm.query(dataset='mmusculus_gene_ensembl', # 小鼠数据集 attributes=['ensembl_gene_id', # 小鼠基因ID 'external_gene_name', # 小鼠基因symbol 'hsapiens_homolog_ensembl_gene',# 对应的人类同源基因ID 'hsapiens_homolog_associated_gene_name'])# 对应的人类同源基因name#人类➡小鼠h2m = bm.query(dataset='hsapiens_gene_ensembl', attributes=['ensembl_gene_id','external_gene_name', 'mmusculus_homolog_ensembl_gene', 'mmusculus_homolog_associated_gene_name'])
#查看富集数据库,organism选择什么,就会展示什么物种,这个是后续富集gene sets的支持human = gp.get_library_name(organism='Human')human# # 'ARCHS4_Cell-lines',# 'ARCHS4_IDG_Coexp',# 'ARCHS4_Kinases_Coexp',# 'ARCHS4_TFs_Coexp',# 'ARCHS4_Tissues',# 'Achilles_fitness_decrease',# 'Achilles_fitness_increase',# 'Aging_Perturbations_from_GEO_down',# 'Aging_Perturbations_from_GEO_up',# 'Allen_Brain_Atlas_10x_scRNA_2021',# 'Allen_Brain_Atlas_down',# 'Allen_Brain_Atlas_up',# 'Azimuth_2023',# 'Azimuth_Cell_Types_2021',
#read datadiff=pd.read_csv('./diff.csv')diff.head()
#筛选差异显著基因diff_sig = diff[(diff['p_val_adj'] < 0.05) & (abs(diff['avg_log2FC']) > 0.3)]diff_sig# gene p_val avg_log2FC p_val_adj# 384 GABRG3 4.640000e-07 0.718301 0.022700# 391 SORBS2 2.120000e-07 0.387130 0.010380# 398 LENG8 9.880000e-08 0.330179 0.004829# 399 TRPM3 9.860000e-08 0.614009 0.004818# 401 NCOA1 8.020000e-08 0.313686 0.003921# ... ... ... ... ...# 1402 HSPA1A 0.000000e+00 -0.775858 0.000000# 1403 HSPA1B 0.000000e+00 -0.837887 0.000000# 1404 SRSF3 0.000000e+00 -0.806969 0.000000# 1405 SRSF2 0.000000e+00 -0.480263 0.000000# 1406 SRSF1 0.000000e+00 -0.876165 0.000000
#这里我们演示GO富集分析,富集分析需要提供基因listgene_list = diff_sig['gene'].squeeze().str.strip().to_list()# if you are only intrested in dataframe that enrichr returned, please set outdir=None#默认cutoff是p小于0.05,所以不再设置enr = gp.enrichr(gene_list=gene_list, # gene形式需要与数据库保持一致 gene_sets=['GO_Biological_Process_2025','KEGG_2021_Human'],#需要分析的基因集,可以同时分析GO、KEGG两个gene set organism='human', #物种 outdir=None)
#可视化from gseapy import barplot, dotplot
# categorical scatterplotax = dotplot(enr.results,#分析结果 column="Adjusted P-value",#按照adj pvalue展示结果 x='Gene_set', # set x axis, so you could do a multi-sample/library comparsion size=10,#点大小 top_term=5,#展示的top5terms figsize=(3,5),#图长宽 title = "Enrichment",#标题 xticklabels_rot=45, # rotate xtick labels show_ring=False, # set to False to revmove outer ring marker='o',#plot图形形式,参照https:///stable/api/markers_api.html )
# categorical barplotax = barplot(enr.results, column="Adjusted P-value", group='Gene_set', # set group, so you could do a multi-sample/library comparsion size=10, top_term=5, figsize=(3,5), color = {'GO_Biological_Process_2025': 'salmon','KEGG_2021_Human':'darkblue'} )
4-GSEA分析【Prerank】
Prerank是GSEA分析的一种应用,但与标准的GSEA方法不同,Prerank直接使用已经排序好的基因列表进行分析,而不需要原始的表达矩阵。也就是我们常在R中进行的分析,将基因按照logFC等事先进行排序,然后进行分析。
#对差异基因数据表进行排序,得到rank文件import pandas as pddf = diff.sort_values(by='avg_log2FC', ascending=False) #降序排列df = df[['gene', 'avg_log2FC']].copy()df = df.set_index('gene')df.head()
# # run prerankpre_res = gp.prerank(rnk=df,#rank files gene_sets='KEGG_2021_Human',#需要分析的基因集 threads=4,#线程数,可以加快分析 min_size=5, max_size=1000, permutation_num=1000, # reduce number to speed up testing outdir=None, # don't write to disk seed=6, verbose=True, # see what's going on behind the scenes )
#经典可视化#可以直接plot需要的terms,名称需要正确和结果文件里面的terms一致pre_res.plot(terms='Estrogen signaling pathway')
#还可以使用函数gseaplot对图进行更细致的调整from gseapy import gseaplotgseaplot(term='Estrogen signaling pathway',#需要plot的terms名称 rank_metric=pre_res.ranking, #rank文件,在分析结果中有储存 ofname='GSEA_plot.pdf', #结果直接保存为pdf color='red', #设置曲线颜色 figsize = (4,5),#设置图形尺寸,长高 **pre_res.results['Estrogen signaling pathway'])
#也可以同时显示多条tremspre_res.plot(terms=pre_res.res2d['Term'][0:5], show_ranking=True, # whether to show the second yaxis figsize=(3,4) )
#气泡图展示结果from gseapy import dotplotax = dotplot(pre_res.res2d, column="FDR q-val", cmap=plt.cm.viridis, size=6, # adjust dot size top_term=15,#需要展示的top terms数目 figsize=(3,5), cutoff=0.25, show_ring=False)
