A couple of articles back I posted a down-and-dirty method for calculating compositional vectors for a whole genome. I'm using compositional vectors in one of my projects, so I reworked the code to make the analysis more rigorous: implementing the suggestion to use Python's itertools module, shifting the alphabet to amino acid space, limiting the composition to products of open reading frames, and applying the correction from Hao and Qi, 2004. The correction relies on the calculation of the k - 1 and k - 2 length compositional vectors, which correspond to the k1 and k2 sections in the following block. The k section is the desired length-k vector, and k0 is the normalized vector.
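For reference, here is the correction as it is implemented in the k0 section of the code further down (my notation, not taken verbatim from the paper): if q(w) is the observed count of a word w in the proteome and L is the sequence length used by the code, the expected count of a k-mer is

$$ q_0(a_1 \ldots a_k) = \frac{q(a_1 \ldots a_{k-1})\, q(a_2 \ldots a_k)}{q(a_2 \ldots a_{k-1})} \cdot \frac{(L-k+1)(L-k+3)}{(L-k+2)^2} $$

and each component of the normalized vector is

$$ a(a_1 \ldots a_k) = \frac{q(a_1 \ldots a_k) - q_0(a_1 \ldots a_k)}{q_0(a_1 \ldots a_k)} $$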
#### load some modules we will need ####

import os
import subprocess
import re
import gzip
import cPickle
import itertools
from Bio import SeqIO
from joblib import Parallel, delayed

#### generate all possible kmers ####

## pro is the amino acid alphabet used to build the words;
## here I use the 20 standard residues
pro = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L',
       'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y']

k = 5
k1 = k - 1
k2 = k - 2

## k
pros = list(itertools.repeat(pro, k))
bins = list(itertools.product(*pros))
bins = [''.join(b) for b in bins]

nmers = open(str(k)+'mers_pro.set', 'wb')
cPickle.dump(bins, nmers)
nmers.close()

## k1
pros = list(itertools.repeat(pro, k1))
k1_bins = list(itertools.product(*pros))
k1_bins = [''.join(b) for b in k1_bins]

k1_nmers = open(str(k1)+'mers_pro.set', 'wb')
cPickle.dump(k1_bins, k1_nmers)
k1_nmers.close()

## k2
pros = list(itertools.repeat(pro, k2))
k2_bins = list(itertools.product(*pros))
k2_bins = [''.join(b) for b in k2_bins]

k2_nmers = open(str(k2)+'mers_pro.set', 'wb')
cPickle.dump(k2_bins, k2_nmers)
k2_nmers.close()
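Keep in mind that in amino acid space the number of possible words grows quickly: for k = 5 there are 20^5 = 3,200,000 possible 5-mers (the 3.2 million columns mentioned at the end of this post), 20^4 = 160,000 4-mers, and 20^3 = 8,000 3-mers.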
Once all possible words have been identified for k, k1, and k2, I count their occurrences in the proteome and then normalize the counts. The code doesn't show where "name" comes from; in my case I've already looped over the files in the directory and generated a list of file prefixes for the proteomes I want to analyze (code not shown). I append '_pro.fasta' to each prefix to get back to the actual file name.
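For completeness, building that list of prefixes could look something like the sketch below; only the '_pro.fasta' suffix comes from the post, the rest is just one way to do it.

#### build the list of proteome name prefixes (one possible approach) ####
names = []
for f in os.listdir('.'):
    if f.endswith('_pro.fasta'):
        ## strip the suffix to recover the prefix used as "name" below
        names.append(f[:-len('_pro.fasta')])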
#### find normalized kmers in proteome ####

def calc_vector(name, bins, k1_bins, k2_bins):
    ## count the observed k, k-1, and k-2 length words across all sequences
    ## in the proteome, then normalize the k-mer counts against the expected
    ## counts (the k0 section)
    k1_found_bins = {}
    k1_used_bins = set()
    k2_found_bins = {}
    k2_used_bins = set()
    found_bins = {}
    used_bins = set()
    seqs = name+'_pro.fasta'

    for record in SeqIO.parse(seqs, 'fasta'):
        query = str(record.seq)

        ## k1 and k2
        for i in range(0, len(query)):
            kmer = query[i:i+k1]
            print name, 'k1', i
            if kmer not in k1_used_bins:
                k1_found_bins[kmer] = 1
                k1_used_bins.add(kmer)
            else:
                k1_found_bins[kmer] = k1_found_bins[kmer] + 1

        for i in range(0, len(query)):
            kmer = query[i:i+k2]
            print name, 'k2', i
            if kmer not in k2_used_bins:
                k2_found_bins[kmer] = 1
                k2_used_bins.add(kmer)
            else:
                k2_found_bins[kmer] = k2_found_bins[kmer] + 1

        ## k
        for i in range(0, len(query)):
            kmer = query[i:i+k]
            print name, 'k', i
            if kmer not in used_bins:
                found_bins[kmer] = 1
                used_bins.add(kmer)
            else:
                found_bins[kmer] = found_bins[kmer] + 1

    ## k0
    norm_bins = {}
    for kmer in found_bins.keys():
        if len(kmer) == k:
            kmer_1 = kmer[0:-1]
            kmer_2 = kmer[1:]
            kmer_3 = kmer[1:-1]
            bigL = len(query)
            ## expected count of the k-mer from its two (k-1)-mers and its
            ## central (k-2)-mer, with the length correction
            kmer_0 = ((k1_found_bins[kmer_1] * k1_found_bins[kmer_2]) / float(k2_found_bins[kmer_3])) * \
                     (((bigL - k + 1) * (bigL - k + 3)) / float((bigL - k + 2) ** 2))
            kmer_norm = (found_bins[kmer] - kmer_0) / kmer_0
            norm_bins[kmer] = kmer_norm
            print name, kmer, kmer_norm

    ## fill out dictionary with 0 values for unrepresented kmers
    for nmer in bins:
        if nmer not in used_bins:
            norm_bins[nmer] = 0

    with gzip.open(name+'_'+str(k)+'mer_bins.txt.gz', 'wb') as bins_out:
        for each in sorted(norm_bins.keys()):
            print >> bins_out, each+'\t'+str(norm_bins[each])
It takes a little bit of time to calculate the normalized vector for each proteome you wish to analyze. One way to greatly speed things up is to analyze the proteomes in parallel. I'm using the Parallel function from the joblib module for this. Getting it to work right wasn't straightforward, and the rather kludgy method of printing each compositional vector to a file in the above function was my workaround for not being able to build a shared dictionary of vectors inside a parallelized loop. In the call below I'm generating a compositional vector for each proteome specified in names, starting a new process for each proteome up to the number of CPUs available on the system (n_jobs = -1).
#### run the function in parallel on input files ####

Parallel(n_jobs = -1, verbose = 5)(
    delayed(calc_vector)(name, bins, k1_bins, k2_bins) for name in names)
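As an aside, Parallel does collect whatever the delayed function returns, so an alternative to the print-to-file workaround would be to have calc_vector return its norm_bins dictionary and gather the results directly. A minimal sketch, assuming calc_vector is modified to end with "return norm_bins":

#### alternative: collect the vectors directly from Parallel ####
results = Parallel(n_jobs = -1, verbose = 5)(
    delayed(calc_vector)(name, bins, k1_bins, k2_bins) for name in names)
## joblib returns results in the same order as the input iterable
final_bins = dict(zip(names, results))

One advantage of the print-to-file route, though, is that partial results survive on disk if a long run dies partway through.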
The final piece of the puzzle is to bring the individual vectors together into a single matrix that can be converted to a distance matrix (or whatever else you wish to do with it). There are a number of ways to do this. I chose to generate a dictionary of all vectors, with the "name" as key. The bins are in order (or should be!) in the list that makes up each dictionary value. Calculating the vectors potentially took a long time, so I save the dictionary as a pickle so that I can access it anytime in the future without having to re-calculate anything. Note that the matrix is transposed; if you write it to a text file, be sure to change the orientation. R and Matlab (I believe) will not read a matrix with 3.2 million columns (that many rows, and many more, are okay)!
#### bring the vectors together and create some permanent ####
#### output ####

final_bins = {}

for f in os.listdir('.'):
    if f.endswith('bins.txt.gz'):
        name = re.split('_'+str(k)+'mer_bins', f)
        name = name[0]
        print name
        with gzip.open(f, 'rb') as bin_file:
            temp = []
            for line in bin_file:
                line = line.rstrip('\n')
                line = line.split('\t')
                bin_id = line[0]
                bin_num = line[1]
                temp.append(bin_num)
            final_bins[name] = temp

cPickle.dump(final_bins, open(str(k)+'mer_normalized_phylogeny_vector_output.p', 'wb'))
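To give an idea of what you might do next, here is a sketch of reading the pickle back in, writing the matrix with the orientation flipped (proteomes as rows, kmers as columns, so R or Matlab can read it), and building a simple pairwise Euclidean distance matrix. The numpy dependency, the output file names, and the choice of Euclidean distance are my own additions here, not part of the workflow above.

#### example follow-up: reorient the matrix and build a distance matrix ####
import numpy as np

final_bins = cPickle.load(open(str(k)+'mer_normalized_phylogeny_vector_output.p', 'rb'))

names = sorted(final_bins.keys())
## rows = proteomes, columns = kmers (the flipped orientation)
matrix = np.array([[float(x) for x in final_bins[n]] for n in names])

with open(str(k)+'mer_vector_matrix.txt', 'w') as matrix_out:
    for n, row in zip(names, matrix):
        matrix_out.write(n+'\t'+'\t'.join(map(str, row))+'\n')

## simple pairwise Euclidean distances between the compositional vectors
dist = np.zeros((len(names), len(names)))
for i in range(len(names)):
    for j in range(len(names)):
        dist[i, j] = np.linalg.norm(matrix[i] - matrix[j])

with open(str(k)+'mer_distance_matrix.txt', 'w') as dist_out:
    for n, row in zip(names, dist):
        dist_out.write(n+'\t'+'\t'.join(map(str, row))+'\n')

Euclidean distance is just a placeholder; if I recall correctly the Hao and Qi papers use a correlation-style distance between these vectors, so swap in whatever metric suits your analysis.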