Anyone with a sequence to classify faces a bewildering array of database choice. You’ve got your classic NCBI quiver; nt, nr, refseq, cdd, etc. Then you’ve got uniprot, seed, pfam, and numerous others. Some of these have obvious advantages or disadvantages. The NCBI nr database for example, is comprehensive but cumbersome. Choosing between others can at times feel arbitrary.
One database that I use a lot is pfam. I like it because, in addition to being directly supported by the excellent protein classifier HMMER3.0, it is well curated and has helpful Wikipedia-like articles for some of the protein families. Classification is based on the presence of conserved domains which is also a little easier to wrap your head around than the implications of sequence similarity alone. The downside to pfam is that the presence of a conserved domain doesn’t always clue you in to the function of an unknown protein sequence. There are also many thousands of protein domains. Sometimes you can guess at the function from the name. Aldedh for example, is the aldehyde dehydrogenase family. Other names are more obtuse. Abhydrolase_5 is a family containing the alpha/beta hydrolase fold domain, but unless you’re a biochemist the implications of that are probably a mystery (it was to me, fortunately there are the wiki articles). What’s needed is a broader level of classification for these protein families.
NCBI used to maintain a really great database called COG (Clusters of Orthologous Genes). Proteins were grouped into COGs which were given descriptive names like Glyceraldehyde-3-phosphate dehydrogenase/erythrose-4-phosphate dehydrogenase. Even better the COGs were assigned letters based on their perceived physiological function. Some might argue that this is too coarse a classification, and they might be right, but it is still helpful when evaluating the ecological relevance of a protein. Here are the COG codes, taken from the fun.txt file at the COG ftp site:
INFORMATION STORAGE AND PROCESSING
[J] Translation, ribosomal structure and biogenesis
[A] RNA processing and modification
[K] Transcription
[L] Replication, recombination and repair
[B] Chromatin structure and dynamics
CELLULAR PROCESSES AND SIGNALING
[D] Cell cycle control, cell division, chromosome partitioning
[Y] Nuclear structure
[V] Defense mechanisms
[T] Signal transduction mechanisms
[M] Cell wall/membrane/envelope biogenesis
[N] Cell motility
[Z] Cytoskeleton
[W] Extracellular structures
[U] Intracellular trafficking, secretion, and vesicular transport
[O] Posttranslational modification, protein turnover, chaperones
METABOLISM
[C] Energy production and conversion
[G] Carbohydrate transport and metabolism
[E] Amino acid transport and metabolism
[F] Nucleotide transport and metabolism
[H] Coenzyme transport and metabolism
[I] Lipid transport and metabolism
[P] Inorganic ion transport and metabolism
[Q] Secondary metabolites biosynthesis, transport and catabolism
POORLY CHARACTERIZED
[R] General function prediction only
[S] Function unknown
I decided to try and map the pfam database to the COG database so that I could have my cake and eat it too. To do that I ran blastp on the Pfam-A.fasta file against a protein blast database created from cog0303.fa and used the following script to parse the tab-delimited blast output and various other files to create the mapping. This was a reasonably large blast job and I ran into a computer issue part way through. As a result the job didn’t complete. When I have the chance to re-run I’ll post the completed pfam to cog table here. In the meantime here’s the script for anyone who would like to create their own pfam to cog map. It is assumed that all database files are in the working directory.
## first step was to blast pfam-A against cog0303.fa ## blastp -db COG/cog0303 -query Pfam-A.fasta -out Pfam-A_cog.txt -outfmt 6 -num_threads 8 -evalue 1e-5 ## if you want to conduct a new mapping, for example from a new blast file, remove the old pickle first import gzip import cPickle cog_cats = {} cogs_seqs = {} cog_names = {} pfam_seqs = {} pfam_cog = {} import os if 'pfam_cog_dict.p' not in os.listdir('.'): ## map cog name to cog category print 'mapping cog name to cog category' with open('cogs.csv', 'r') as cog_file: for line in cog_file: line = line.rstrip() line = line.split(',') cog_cats[line[0]] = line[1] cog_names[line[0]] = line[2] ## map cog sequence to cog name print 'mapping cog sequence to cog name' with open('domdat.csv', 'r') as cogs_seqs_file: for line in cogs_seqs_file: line = line.rstrip() line = line.split(',') cogs_seqs[line[0]] = line[6] ## map pfam sequence to pfam print 'mapping pfam sequence to pfam' with open('Pfam-A.fasta', 'r') as pfam_seqs_file: for line in pfam_seqs_file: if line.startswith('>'): line = line.strip('>') line = line.rstrip() line = line.split() pfam_seqs[line[0]] = line[2] ## map blast print 'mapping blast' with gzip.open('Pfam-A_cog.txt.gz', 'rb') as blast: for line in blast: line = line.rstrip() line = line.split() pfam_seq = line[0] cog_seq = line[1] pfam = pfam_seqs[pfam_seq] cog = cogs_seqs[cog_seq] try: cog_name = cog_names[cog] except KeyError: cog_name = 'NULL' try: cog_cat = cog_cats[cog] except KeyError: cog_cat = 'NULL' try: temp = pfam_cog[pfam] temp_cog = temp[0] temp_cog_cat = temp[1] temp_cog.add(cog) temp_cog_cat.add(cog_cat) temp_cog_name = temp[2] temp_cog_name.add(cog_name) pfam_cog[pfam] = temp_cog, temp_cog_cat, temp_cog_name except KeyError: temp_cog = set() temp_cog_cat = set() temp_cog.add(cog) temp_cog_cat.add(cog_cat) temp_cog_name = set() temp_cog_name.add(cog_name) pfam_cog[pfam] = temp_cog, temp_cog_cat, temp_cog_name print pfam, cog, cog_cat, cog_name cPickle.dump(pfam_cog, open('pfam_cog_dict.p', 'wb')) else: print 'using existing pfam to cog map' pfam_cog = cPickle.load(open('pfam_cog_dict.p', 'rb')) with open('pfam_cog_cat_map.txt', 'w') as output: for pfam in pfam_cog.keys(): temp = pfam_cog[pfam] pfam_temp = pfam.split(';') pfam_code = pfam_temp[0] pfam_code = pfam_code.split('.') pfam_code = pfam_code[0] for cog_cat in list(temp[1]): print >> output, pfam_code+'\t'+pfam_temp[1]+'\t'+cog_cat
This script produces a pickle with the mapping in case you want to format the output differently later without having to redo everything, and a tab-delimited file with the COG categories that each pfam belongs to:
PF07792 Afi1 NULL PF05346 DUF747 NULL PF12695 Abhydrolase_5 E PF12695 Abhydrolase_5 I PF12695 Abhydrolase_5 FGR PF12695 Abhydrolase_5 Q PF12695 Abhydrolase_5 R PF12695 Abhydrolase_5 U PF12695 Abhydrolase_5 NULL PF12644 DUF3782 S
Note that each pfam belongs to many COGS (maybe too many cogs to make this useful in many cases). Hopefully it will help clarify the role of some of these pfams however!