A fast and efficient DataFrame object for data manipulation with integrated indexing;
Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;
Flexible reshaping and pivoting of data sets;
Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
Columns can be inserted and deleted from data structures for size mutability;
Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;
High performance merging and joining of data sets;
Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;
Time series functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging. You can even create domain-specific time offsets and join time series without losing data;
Highly optimized for performance, with critical code paths written in Cython or C.
Python with pandas is in use in a wide variety of academic and commercial domains, including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more.
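Two of the features listed above, automatic label-based alignment and the group-by engine, can be illustrated with a minimal sketch. All names and values here are made up for the example.

```python
import pandas as pd

# Two Series with partially overlapping labels (made-up values).
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["y", "z"])

# Addition aligns on labels; "x" has no partner in b, so the result is NaN there.
total = a + b

# Split-apply-combine: split rows by "group", take the mean of "value" per group.
df = pd.DataFrame({"group": ["g1", "g1", "g2"], "value": [1.0, 3.0, 5.0]})
group_mean = df.groupby("group")["value"].mean()

print(total)
print(group_mean)
```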
# To read multiple files, define a function to simplify the operation
def readExpr_1(tsvFileL, typeL=['TPM', 'FPKM']):
    '''
    tsvFileL: list of files to read
    resultD: a dictionary to save the data matrices
             {'TPM': [mat1, mat2, ...],
              'FPKM': [mat1, mat2, ...]}
    typeL: list of names of the columns to be extracted
    '''
    resultD = {}
    for _type in typeL:
        resultD[_type] = []
    for tsvFile in tsvFileL:
        expr = pd.read_table(tsvFile, header=0, index_col=0)
        # Use the file name without its 4-character extension as the sample name;
        # this choice is quite arbitrary
        name = os.path.split(tsvFile)[-1][:-4]
        for _type in typeL:  # add _ to type to avoid overriding the Python built-in `type`
            expr_type = expr.loc[:, [_type]]
            expr_type.columns = [name]
            resultD[_type].append(expr_type)
    return resultD
#-----------------------------------------------------
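The per-file step inside the function above can be tried in isolation. The following is a minimal sketch on a made-up two-gene table; the file name `sampleA.tsv` and its contents are hypothetical, and `pd.read_csv(..., sep="\t")` is used here as an equivalent of `pd.read_table`.

```python
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write a tiny, made-up expression table to a temporary .tsv file.
    tsv_path = os.path.join(tmp_dir, "sampleA.tsv")  # hypothetical file name
    with open(tsv_path, "w") as handle:
        handle.write("gene_id\tTPM\tFPKM\n")
        handle.write("gene1\t2.5\t1.0\n")
        handle.write("gene2\t0.0\t0.0\n")

    # First row is the header, first column becomes the row index.
    expr = pd.read_csv(tsv_path, sep="\t", header=0, index_col=0)

    # Drop the directory and the 4-character ".tsv" suffix to get the sample name.
    name = os.path.split(tsv_path)[-1][:-4]

# Selecting with a list keeps a one-column DataFrame rather than a Series.
tpm = expr.loc[:, ["TPM"]].copy()
tpm.columns = [name]
print(tpm)
```

Renaming the single column to the sample name is what later lets the per-sample matrices sit side by side without column name clashes.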
# Read multiple files and merge the matrices; define a function to simplify the operation
def concatExpr(tsvFileL, typeL=['TPM', 'FPKM']):
    '''
    tsvFileL: list of files to read
    resultD: a dictionary to save the data matrices
             {'TPM': [mat1, mat2, ...],
              'FPKM': [mat1, mat2, ...]}
    typeL: list of names of the columns to be extracted
    '''
    resultD = {}
    for _type in typeL:
        resultD[_type] = []
    for tsvFile in tsvFileL:
        expr = pd.read_table(tsvFile, header=0, index_col=0)
        # Use the file name without its 4-character extension as the sample name;
        # this choice is quite arbitrary
        name = os.path.split(tsvFile)[-1][:-4]
        for _type in typeL:  # add _ to type to avoid overriding the Python built-in `type`
            expr_type = expr.loc[:, [_type]]
            expr_type.columns = [name]
            resultD[_type].append(expr_type)
    #-------------------------------------------
    mergeD = {}
    for _type in typeL:
        mergeM = pd.concat(resultD[_type], axis=1)
        mergeM = mergeM.fillna(0)  # Substitute all NA with 0
        mergeM = mergeM.loc[(mergeM > 0).any(axis=1)]  # Delete all-zero rows
        mergeD[_type] = mergeM
    return mergeD
#-----------------------------------------------------
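The merge step (column-wise `pd.concat`, `fillna(0)`, then dropping rows with no positive value) can be checked on a small made-up example. The gene and sample names below are hypothetical.

```python
import pandas as pd

# Made-up single-column matrices, one per sample, covering different genes.
s1 = pd.DataFrame({"s1": [1.0, 0.0]}, index=["geneA", "geneB"])
s2 = pd.DataFrame({"s2": [0.0, 4.0]}, index=["geneB", "geneC"])

# Column-wise concatenation aligns rows on the index; genes missing
# from one sample become NaN (outer join is the default).
merged = pd.concat([s1, s2], axis=1)

# Treat missing genes as not expressed.
merged = merged.fillna(0)

# Keep only rows with at least one value above zero; geneB is dropped.
merged = merged.loc[(merged > 0).any(axis=1)]
print(merged)
```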
# Suppose we only want the columns whose names start with `Biosample`
#meta_colL = ['Biosample term id', 'Biosample term name']
# Extract columns matching a specific pattern
# Both approaches work well; filter is simpler
#metaM.loc[:,metaM.columns.str.contains(r'^Biosample')]
metaM = metaM.filter(regex=("^Biosample"))
metaM
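A minimal sketch, on a made-up one-row table, showing that the two column-selection approaches give the same result. The `File format` column is a hypothetical non-matching column added for the illustration.

```python
import pandas as pd

meta_demo = pd.DataFrame(
    {
        "Biosample term name": ["mesangial cell"],
        "Biosample type": ["primary cell"],
        "File format": ["tsv"],  # hypothetical column that should be filtered out
    },
    index=["ENCFF673KYR"],
)

# Approach 1: DataFrame.filter with a regular expression on column names.
by_filter = meta_demo.filter(regex=r"^Biosample")

# Approach 2: boolean mask built from the column names.
by_mask = meta_demo.loc[:, meta_demo.columns.str.contains(r"^Biosample")]

print(by_filter.columns.tolist())
```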
HDF5 is a data model, library, and file format for storing and managing data. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data. HDF5 is portable and is extensible, allowing applications to evolve in their use of HDF5. The HDF5 Technology suite includes tools and applications for managing, manipulating, viewing, and analyzing data in the HDF5 format. https://support.hdfgroup.org/HDF5/
/MPATHB/soft/anacond/lib/python2.7/site-packages/IPython/core/interactiveshell.py:3035: PerformanceWarning:
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block0_values] [items->['Biosample term id', 'Biosample term name', 'Biosample type', 'Biosample life stage', 'Biosample sex', 'Biosample organism', 'Biosample Age']]

  exec(code_obj, self.user_global_ns, self.user_ns)
meta_type = ["Biosample term name", "Biosample type", "Biosample life stage", "Biosample sex"]
meta = meta[meta_type]
meta
File accession  Biosample term name                Biosample type  Biosample life stage  Biosample sex
ENCFF673KYR     mesangial cell                     primary cell    unknown, fetal        unknown, female
ENCFF262OBL     pulmonary artery endothelial cell  primary cell    adult                 male
ENCFF060LPA     pulmonary artery endothelial cell  primary cell    adult                 male
ENCFF289HGQ     fibroblast of villous mesenchyme   primary cell    newborn               male, female
Modify the matrix to remove the `unknown, ` string (only to make the display cleaner).
meta.loc['ENCFF673KYR', "Biosample life stage"] = "fetal"
# Much faster for single cells; `.at` replaces `set_value`, which was removed in pandas 1.0
meta.at['ENCFF673KYR', 'Biosample sex'] = 'female'
meta.at['ENCFF289HGQ', 'Biosample sex'] = 'female'
meta
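The same kind of cleanup can be sketched on a two-row copy of the table shown earlier. This is an illustration only: it uses `.at`, the label-based scalar accessor, for a single cell, and `str.replace` as one possible way to strip the `unknown, ` prefix from a whole column at once. The name `meta_demo` is hypothetical.

```python
import pandas as pd

# Two rows taken from the metadata table shown earlier.
meta_demo = pd.DataFrame(
    {
        "Biosample life stage": ["unknown, fetal", "adult"],
        "Biosample sex": ["unknown, female", "male"],
    },
    index=["ENCFF673KYR", "ENCFF262OBL"],
)

# Set one cell by row label and column label.
meta_demo.at["ENCFF673KYR", "Biosample sex"] = "female"

# Remove the literal prefix from every value in a column.
meta_demo["Biosample life stage"] = meta_demo["Biosample life stage"].str.replace(
    "unknown, ", "", regex=False
)
print(meta_demo)
```

Editing single cells is fine for a handful of values; the column-wide replacement scales better when many rows carry the same prefix.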
# R code for reading hdf5
> h5ls('test.hdf5')
      group          name       otype   dclass        dim
0         /          FPKM   H5I_GROUP
1     /FPKM         axis0 H5I_DATASET   STRING          3
2     /FPKM         axis1 H5I_DATASET   STRING      25135
3     /FPKM  block0_items H5I_DATASET   STRING          3
4     /FPKM block0_values H5I_DATASET    FLOAT    x 25135
5         /           TPM   H5I_GROUP
6      /TPM         axis0 H5I_DATASET   STRING          3
7      /TPM         axis1 H5I_DATASET   STRING      24025
8      /TPM  block0_items H5I_DATASET   STRING          3
9      /TPM block0_values H5I_DATASET    FLOAT    x 24025
10        /       ens2syn   H5I_GROUP
11 /ens2syn         axis0 H5I_DATASET   STRING          1
12 /ens2syn         axis1 H5I_DATASET   STRING      60725
13 /ens2syn  block0_items H5I_DATASET   STRING          1
14 /ens2syn block0_values H5I_DATASET     VLEN          1
15        /          meta   H5I_GROUP
16    /meta         axis0 H5I_DATASET   STRING         47
17    /meta         axis1 H5I_DATASET   STRING          3
18    /meta  block0_items H5I_DATASET   STRING         19
19    /meta block0_values H5I_DATASET    FLOAT        x 3
20    /meta  block1_items H5I_DATASET   STRING          2
21    /meta block1_values H5I_DATASET  INTEGER        x 3
22    /meta  block2_items H5I_DATASET   STRING         26
23    /meta block2_values H5I_DATASET     VLEN          1
> TPM = h5read("test.hdf5", "/TPM")
> str(TPM)
List of 4
 $ axis0        : chr [1:3(1d)] "ENCFF673KYR" "ENCFF805ZGF" "ENCFF581ZEU"
 $ axis1        : chr [1:24025(1d)] "ENSG00000000003.14" "ENSG00000000005.5" "ENSG00000000419.12" "ENSG00000000457.13" ...
 $ block0_items : chr [1:3(1d)] "ENCFF673KYR" "ENCFF805ZGF" "ENCFF581ZEU"
 $ block0_values: num [1:3, 1:24025] 2.42 1.64 5.69 0 0 0.11 1.8 3.82 6.38 0.38 ...
> d <- TPM$block0_values
> rownames(d) <- TPM$axis1
Error in `rownames<-`(`*tmp*`, value = c("ENSG00000000003.14", "ENSG00000000005.5", :
  length of 'dimnames' [1] not equal to array extent
> d <- as.data.frame(TPM$block0_values)
> rownames(d) <- TPM$axis1
Error in `row.names<-.data.frame`(`*tmp*`, value = value) :
  invalid 'row.names' length
> dims(d)
Error: could not find function "dims"
> dim(d)
[1]     3 24025
> d <- t(as.data.frame(TPM$block0_values))
> dim(d)
[1] 24025     3
> rownames(d) <- TPM$axis1
> colnames(d) <- TPM$axis0
> hed(d)
Error: could not find function "hed"
> head(d)
                   ENCFF673KYR ENCFF805ZGF ENCFF581ZEU
ENSG00000000003.14        2.42        1.64        5.69
ENSG00000000005.5         0.00        0.00        0.11
ENSG00000000419.12        1.80        3.82        6.38
ENSG00000000457.13        0.38        0.57        1.17
ENSG00000000460.16        0.16        0.31        0.14
ENSG00000000938.12        0.00        0.03        0.00
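The R session shows that the values arrive as a samples-by-genes array (3 x 24025), so the matrix has to be transposed before the gene IDs can be attached as row names. A minimal Python sketch of the same reassembly follows, using only the six genes printed by `head(d)`. It builds the arrays by hand and does not read the HDF5 file; the variable names mirror the dataset names but are otherwise an assumption of this example.

```python
import numpy as np
import pandas as pd

# Sample IDs (axis0) and the first six gene IDs (axis1) from the R output.
axis0 = ["ENCFF673KYR", "ENCFF805ZGF", "ENCFF581ZEU"]
axis1 = [
    "ENSG00000000003.14",
    "ENSG00000000005.5",
    "ENSG00000000419.12",
    "ENSG00000000457.13",
    "ENSG00000000460.16",
    "ENSG00000000938.12",
]

# Values laid out as samples x genes, the orientation seen in the R session.
block0_values = np.array(
    [
        [2.42, 0.00, 1.80, 0.38, 0.16, 0.00],
        [1.64, 0.00, 3.82, 0.57, 0.31, 0.03],
        [5.69, 0.11, 6.38, 1.17, 0.14, 0.00],
    ]
)

# Transpose to genes x samples, then attach row and column labels.
d = pd.DataFrame(block0_values.T, index=axis1, columns=axis0)
print(d)
```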