Yesterday I went to an inspiring informal seminar by Bailin Hao from Fudan University in Beijing. Bailin presented work by his group to establish a new method for determining microbial phylogeny. Inferring phylogeny is currently a massive headache in the field of microbiology. Traditionally phylogeny was determined by aligning a marker gene (typically the 16S rRNA gene) from a set of bacteria, and inferring relatedness within the set from differences in these alignments. These differences are the result of insertions, deletions, and base changes accumulated over time as the gene evolved independently for each analyzed strain.
Although this approach has served the study of microbial evolution and ecology well it has some serious shortcomings. First, phylogenetic inference from gene alignment is hard. A lot of work has gone into developing sophisticated statistical models that predict how likely a predicted mutation will occur (and thus when gaps should appear in an alignment), but the best model can’t really capture what’s occurring in nature. No alignment is perfect, and even imperfect alignments take a lot of time and computer power to calculate. Second, genomes don’t always evolve in a predictable vertical fashion, with a daughter genome acquiring all the genes of the parent genome and perhaps a beneficial mutation or two. Single celled organisms often pick up genes they encounter in the environment. If beneficial those genes can find a permanent home in the genome. There is no guarantee that the source organism of a “horizontally transferred” gene is closely related to the recipient, thus a phylogenetic inference based on a horizontally transferred gene will be incorrect with respect to the whole genome. Third is the problem of what taxonomy means anyway? If a bacterial genome is (hypothetically) 30 % horizontally acquired genes, and has a radically different phenotype than its closest relatives as a result of these transfers, what is the value of determining traditional taxonomy?
Enter compositional vector based phylogeny. This method is just one of a new generation of sequence comparison approaches that use kmers to determine similarity. A kmer is a word (of a length set by the investigator) created from a sliding window moving along the sequence in question. For example the sequence AGCTCCGGC would have the 4 letter kmers AGCT, GCTC, CTCC, TCCG, CCGG, and CGGC. The more closely related two sequences are, the more kmers they will have in common. The kmers in common can be counted without aligning the two sequences, thus eliminating one of the most time and computer intensive parts of sequence comparison.
In the compositional vector approach to phylogeny a complete microbial proteome is converted to a vector of all possible amino acid kmers of a specified size. For a five letter kmer the possible kmers are AAAAA through VVVVV, with 3,200,000 possible words (20 ^ 5). Tallying the number of occurrences of each possible kmer in the proteome produces a vector. Converting two or more vectors to a distance matrix provides a measure of phylogenetic distance.
What really struck me was the relative simplicity of this method. In reality there are some important additional steps, namely the vectors need to be normalized to account for neutral evolution, but any investigator can use this method without sophisticated tools. Python’s probably not the best language for this, but the concept can be explored with a simple Python script. The script below will calculate a matrix of compositional vectors for some number of proteomes, WITHOUT the important normalization step. It’s not fast, but it can calculate the compositional vectors for four proteomes (in the example obtained from UniProt) in a few minutes time.
import cPickle import os ## if you haven't already done so generate all possible ## five letter words if '5mers.set' not in os.listdir('.'): aa = ['A','R','N','D','C','E','Q','G','H','I',\ 'L','K','M','F','P','S','T','W','Y','V'] bins = set() temp = [0,0,0,0,0] for residue_1 in aa: temp[0] = residue_1 for residue_2 in aa: temp[1] = residue_2 for residue_3 in aa: temp[2] = residue_3 for residue_4 in aa: temp[3] = residue_4 for residue_5 in aa: temp[4] = residue_5 temp_string = ''.join(temp) print temp_string bins.add(temp_string) fivemers = open('5mers.set', 'wb') cPickle.dump(bins, fivemers) fivemers.close() else: fivemers = open('5mers.set', 'rb') bins = cPickle.load(fivemers) fivemers.close() ## set up a matrix to hold kmer tally vec_matrix = [] for b in sorted(bins): l = [b] vec_matrix.append(l) ## concantenate proteins in multifasta to single string for seq in ['34H_proteome.fasta', \ 'agrobacterium_tumefaciens.fasta', \ 'sinorhizobium_medicae.fasta', \ 'sinorhizobium_meliloti.fasta']: with open(seq, 'r') as fasta: query = '' for line in fasta: if line.startswith('>') == False: line = line.rstrip('\n') query = query+line ## search proteome for all 5 letter kmers, and tally occurrence found_bins = {} used_bins = set() for i in range(0,len(query)): kmer = query[i:i+5] print i if kmer not in used_bins: found_bins[kmer] = 1 used_bins.add(kmer) else: found_bins[kmer] = found_bins[kmer] + 1 ## fill out dictionary with 0 values for unrepresented kmers for fivemer in bins: if fivemer not in used_bins: found_bins[fivemer] = 0 for i,r in enumerate(sorted(bins)): vec_matrix[i].append(found_bins[r]) ## print matrix as tab delimited text file output = open('phylogeny_vector_output.txt', 'w') for row in vec_matrix: for i,e in enumerate(row): if i+1 != len(row): print >> output, str(e)+'\t', else: print >> output, str(e) output.close()
Here’s what the output looks like:
AAAAA 12 90 125 133 AAAAC 0 1 6 3 AAAAD 2 12 21 28 AAAAE 5 26 53 59 AAAAF 3 21 25 30 AAAAG 8 42 62 69 AAAAH 0 6 9 11 AAAAI 8 30 44 46 AAAAK 2 27 20 26 AAAAL 16 50 73 75
The whole matrix is ~ 50 mb, for anything larger than a few proteomes it would be advisable to compress the output (cool trick – I only recently learned that R will read in a .gz file with no additional prompting, this has saved me a ton of storage space of late). To look at the relatedness of the strains you could use R or any other data analysis platform. For this case a dendrogram derived from a distance matrix of the original data should illustrate compositional relatedness. Again, I stress that this is an overly-simplistic interpretation of the actual method. In R…
d <- t(as.matrix(read.table('phylogeny_vector_output.txt', row.names = 1))) d_dist <- dist(d, "euclidean") d_clust <- hclust(d_dist) plot(as.dendrogram(d_clust))
Voilà, a dendrogram of relatedness! V2-5 correspond to the order of the original composition matrix, or Colwellia psychrerythraea, Agrobacterium tumefaciens, Sinorhizobium medicae, and Sinorhizobium meliloti. As we would expect Colwellia, a Gammaproteobacteria, clusters separately from the other strains, all Alphaproteobacteria. The two Sinorhizobium strains cluster closest. For more about this approach, and details on the full method, check out Li, Xu, and Hao, 2010 in the Journal of Biotechnology. The Hao lab has an excellent website in beta where all available genomes are represented on an interactive tree: http://tlife.fudan.edu.cn/cvtree/.
Dendrogram of Euclidean distance based on complete proteome compositional vectors for Colwellia psychrerythraea (V2), Agrobacterium tumefaciens (V3), Sinorhizobium medicae (V4), and Sinorhizobium meliloti (V5).
I received a great suggestion for simplifying this code, using the itertools module:
import itertools
aa=list(‘ACDEFGHIKLMNPQRSTVWY’)
nmer_length = 5
aa_nmer = list(itertools.repeat(aa,nmer_length))
nmers = list(itertools.product(*aa_nmer))
print nmers