Converting pfams to cogs

Anyone with a sequence to classify faces a bewildering array of database choice.  You’ve got your classic NCBI quiver; nt, nr, refseq, cdd, etc.  Then you’ve got uniprot, seed, pfam, and numerous others.  Some of these have obvious advantages or disadvantages.  The NCBI nr database for example, is comprehensive but cumbersome.  Choosing between others can at times feel arbitrary.

One database that I use a lot is pfam.  I like it because, in addition to being directly supported by the excellent protein classifier HMMER3.0, it is well curated and has helpful Wikipedia-like articles for some of the protein families.  Classification is based on the presence of conserved domains which is also a little easier to wrap your head around than the implications of sequence similarity alone.  The downside to pfam is that the presence of a conserved domain doesn’t always clue you in to the function of an unknown protein sequence.  There are also many thousands of protein domains. Sometimes you can guess at the function from the name.  Aldedh for example, is the aldehyde dehydrogenase family. Other names are more obtuse.  Abhydrolase_5 is a family containing the alpha/beta hydrolase fold domain, but unless you’re a biochemist the implications of that are probably a mystery (it was to me, fortunately there are the wiki articles).  What’s needed is a broader level of classification for these protein families.

NCBI used to maintain a really great database called COG (Clusters of Orthologous Genes).  Proteins were grouped into COGs which were given descriptive names like Glyceraldehyde-3-phosphate dehydrogenase/erythrose-4-phosphate dehydrogenase.  Even better the COGs were assigned letters based on their perceived physiological function.  Some might argue that this is too coarse a classification, and they might be right, but it is still helpful when evaluating the ecological relevance of a protein.  Here are the COG codes, taken from the fun.txt file at the COG ftp site:

INFORMATION STORAGE AND PROCESSING
[J] Translation, ribosomal structure and biogenesis
[A] RNA processing and modification
[K] Transcription
[L] Replication, recombination and repair
[B] Chromatin structure and dynamics

CELLULAR PROCESSES AND SIGNALING
[D] Cell cycle control, cell division, chromosome partitioning
[Y] Nuclear structure
[V] Defense mechanisms
[T] Signal transduction mechanisms
[M] Cell wall/membrane/envelope biogenesis
[N] Cell motility
[Z] Cytoskeleton
[W] Extracellular structures
[U] Intracellular trafficking, secretion, and vesicular transport
[O] Posttranslational modification, protein turnover, chaperones

METABOLISM
[C] Energy production and conversion
[G] Carbohydrate transport and metabolism
[E] Amino acid transport and metabolism
[F] Nucleotide transport and metabolism
[H] Coenzyme transport and metabolism
[I] Lipid transport and metabolism
[P] Inorganic ion transport and metabolism
[Q] Secondary metabolites biosynthesis, transport and catabolism

POORLY CHARACTERIZED
[R] General function prediction only
[S] Function unknown

I decided to try and map the pfam database to the COG database so that I could have my cake and eat it too.  To do that I ran blastp on the Pfam-A.fasta file against a protein blast database created from cog0303.fa and used the following script to parse the tab-delimited blast output and various other files to create the mapping.  This was a reasonably large blast job and I ran into a computer issue part way through.  As a result the job didn’t complete.  When I have the chance to re-run I’ll post the completed pfam to cog table here. In the meantime here’s the script for anyone who would like to create their own pfam to cog map.  It is assumed that all database files are in the working directory.

## first step was to blast pfam-A against cog0303.fa
## blastp -db COG/cog0303 -query Pfam-A.fasta -out Pfam-A_cog.txt -outfmt 6 -num_threads 8 -evalue 1e-5
## if you want to conduct a new mapping, for example from a new blast file, remove the old pickle first

import gzip
import cPickle

cog_cats = {}
cogs_seqs = {}
cog_names = {}
pfam_seqs = {}
pfam_cog = {}
import os

if 'pfam_cog_dict.p' not in os.listdir('.'):
     ## map cog name to cog category
     print 'mapping cog name to cog category'
     with open('cogs.csv', 'r') as cog_file:
     for line in cog_file:
          line = line.rstrip()
          line = line.split(',')
          cog_cats[line[0]] = line[1]
          cog_names[line[0]] = line[2]

     ## map cog sequence to cog name
     print 'mapping cog sequence to cog name'
     with open('domdat.csv', 'r') as cogs_seqs_file:
          for line in cogs_seqs_file:
          line = line.rstrip()
          line = line.split(',')
          cogs_seqs[line[0]] = line[6]

     ## map pfam sequence to pfam
     print 'mapping pfam sequence to pfam'
     with open('Pfam-A.fasta', 'r') as pfam_seqs_file:
          for line in pfam_seqs_file:
          if line.startswith('>'):
          line = line.strip('>')
          line = line.rstrip()
          line = line.split()
          pfam_seqs[line[0]] = line[2]

     ## map blast
     print 'mapping blast'
     with gzip.open('Pfam-A_cog.txt.gz', 'rb') as blast:
          for line in blast:
          line = line.rstrip()
          line = line.split()
          pfam_seq = line[0]
          cog_seq = line[1]
          pfam = pfam_seqs[pfam_seq]
          cog = cogs_seqs[cog_seq]
          try:
               cog_name = cog_names[cog]
          except KeyError:
               cog_name = 'NULL'
          try:
               cog_cat = cog_cats[cog]
          except KeyError:
               cog_cat = 'NULL'
          try:
               temp = pfam_cog[pfam]
               temp_cog = temp[0]
               temp_cog_cat = temp[1]
               temp_cog.add(cog)
               temp_cog_cat.add(cog_cat)
               temp_cog_name = temp[2]
               temp_cog_name.add(cog_name)
               pfam_cog[pfam] = temp_cog, temp_cog_cat, temp_cog_name
          except KeyError:
               temp_cog = set()
               temp_cog_cat = set()
               temp_cog.add(cog)
               temp_cog_cat.add(cog_cat)
               temp_cog_name = set()
               temp_cog_name.add(cog_name)
               pfam_cog[pfam] = temp_cog, temp_cog_cat, temp_cog_name
          print pfam, cog, cog_cat, cog_name

     cPickle.dump(pfam_cog, open('pfam_cog_dict.p', 'wb'))

else:
     print 'using existing pfam to cog map'
     pfam_cog = cPickle.load(open('pfam_cog_dict.p', 'rb'))

     with open('pfam_cog_cat_map.txt', 'w') as output:
          for pfam in pfam_cog.keys():
          temp = pfam_cog[pfam]
          pfam_temp = pfam.split(';')
          pfam_code = pfam_temp[0]
          pfam_code = pfam_code.split('.')
          pfam_code = pfam_code[0]
          for cog_cat in list(temp[1]):
               print >> output, pfam_code+'\t'+pfam_temp[1]+'\t'+cog_cat

This script produces a pickle with the mapping in case you want to format the output differently later without having to redo everything, and a tab-delimited file with the COG categories that each pfam belongs to:

PF07792     Afi1     NULL
PF05346     DUF747     NULL
PF12695     Abhydrolase_5     E
PF12695     Abhydrolase_5     I
PF12695     Abhydrolase_5     FGR
PF12695     Abhydrolase_5     Q
PF12695     Abhydrolase_5     R
PF12695     Abhydrolase_5     U
PF12695     Abhydrolase_5     NULL
PF12644     DUF3782     S

Note that each pfam belongs to many COGS (maybe too many cogs to make this useful in many cases).  Hopefully it will help clarify the role of some of these pfams however!

18798 Total Views 4 Views Today
This entry was posted in Research. Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *

WordPress Anti Spam by WP-SpamShield