In a couple chapters of my dissertation I made use of protein flexibility values calculated with the ProtParam module in Biopython (Bio.SeqUtils.ProtParam). The flexibility function makes use of amino acid β-factors as reported by Vihinen et al. (1994). We were called out in a recent review for (among other things) not using more up-to-date β-factors, particularly those reported in Smith et al. (2003). I’m sure there’s some way to wrench open the hood and modify the ProtParam function so that it uses the newer β-factors, but that doesn’t sound like much fun. Instead I generated a new function that takes any protein sequence (as a string) and returns a list of flexibilities, smoothed with a 9 residue window. I knocked this up quick, so test before you use 🙂
def flexcalc(prot): b_factor = {'A': '0.717', 'C': '0.668', 'E': '0.963', 'D': '0.921', 'G': '0.843', 'F': '0.599', 'I': '0.632', 'H': '0.754', 'K': '0.912', 'M': '0.685', 'L': '0.681', 'N': '0.851', 'Q': '0.849', 'P': '0.85', 'S': '0.84', 'R': '0.814', 'T': '0.758', 'W': '0.626', 'V': '0.619', 'Y': '0.615'} b_factor_rr = {'A': '0.556', 'C': '0.607', 'E': '0.805', 'D': '0.726', 'G': '0.651', 'F': '0.465', 'I': '0.51', 'H': '0.597', 'K': '0.863', 'M': '0.575', 'L': '0.504', 'N': '0.735', 'Q': '0.729', 'P': '0.752', 'S': '0.698', 'R': '0.676', 'T': '0.648', 'W': '0.577', 'V': '0.503', 'Y': '0.461'} b_factor_rf = {'A': '0.704', 'C': '0.671', 'E': '0.911', 'D': '0.889', 'G': '0.811', 'F': '0.582', 'I': '0.617', 'H': '0.734', 'K': '0.862', 'M': '0.641', 'L': '0.65', 'N': '0.848', 'Q': '0.817', 'P': '0.866', 'S': '0.846', 'R': '0.807', 'T': '0.742', 'W': '0.609', 'V': '0.603', 'Y': '0.567'} b_factor_ff = {'A': '0.847', 'C': '0.67', 'E': '1.11', 'D': '1.055', 'G': '0.967', 'F': '0.653', 'I': '0.686', 'H': '0.894', 'K': '1.016', 'M': '0.74', 'L': '0.788', 'N': '0.901', 'Q': '1.007', 'P': '0.857', 'S': '0.914', 'R': '0.942', 'T': '0.862', 'W': '0.656', 'V': '0.707', 'Y': '0.741'} rigid = set(['A', 'C', 'F', 'I', 'H', 'M', 'L', 'W', 'V', 'Y']) flex = set(['E', 'D', 'G', 'K', 'N', 'Q', 'P', 'S', 'R', 'T']) ## get beta factors for each residue, according to neighbors b_factors = [] for r in range(0, len(prot)): b_query = False ## just making sure something doesn't go awry and an old value is used query = prot[r] if r == 0: b_query = b_factor[query] elif r == len(prot) - 1: b_query = b_factor[query] else: if prot[r - 1] in rigid: if prot[r + 1] in rigid: b_query = b_factor_rr[query] elif prot[r + 1] in flex: b_query = b_factor_rf[query] elif prot[r - 1] in flex: if prot[r + 1] in rigid: b_query = b_factor_rf[query] elif prot[r + 1] in flex: b_query = b_factor_ff[query] if b_query == False: print(r, 'bad b-factor!') exit() else: b_factors.append(b_query) ## average over 9 aa window b_factors_win = [] for b in range(0, len(b_factors)): win = b_factors[b:b + 9] if len(win) == 9: b_win = sum(map(float, win)) / 9.0 b_factors_win.append(b_win) return(b_factors_win)
Naturally we’d like to see how this compares with ProtParm. Here’s a quick comparison using an aminopeptidase from the psychrophile Colwellia psychrerythraea 34H:
prot = 'MKHFSKLCFLLSTFAVSIAPVTWAHEGATHQHANVSKLTDAYTYANYDQVKATHVYLDLNVDFDKKSLSG'\ 'FAELSLDWFTDNKAPLILDTRDLVIHRVMAKNSQGQWVKVNYDLAKRDDVLGSKLTINTPLNAKKVRVYY'\ 'NSTEKATGLQWLSAEQTAGKEKPFLFSQNQAIHARSWIPIQDTPSVRVTYTARITTDKDLLAVMSANNEP'\ 'GTERDGDYFFSMPQAIPPYLIAIGVGDLEFKAMSHQTGIYAESYILDAAVAEFDDTQAMIDKAEQMYGKY'\ 'RWGRYDLLMLPPSFPFGGMENPRLSFITPTVVAGDKSLVNLIAHELAHSWSGNLVTNESWRDLWLNEGFT'\ 'SYVENRIMEAVFGTDRAVMEQALGAQDLNAEILELDASDTQLYIDLKGRDPDDAFSGVPYVKGQLFLMYL'\ 'EEKFGRERFDAFVLEYFDSHAFQSLGTDNFVKYLKANLTDKYPNIVSDNEINEWIFKAGLPSYAPQPTSN'\ 'AFKVIDKQINQLVTDELTLEQLPTAQWTLHEWLHFINNLPVDLDHQRMVNLDKAFDLTNSSNAEIAHAWY'\ 'LLSVRADYKEVYPAMAKYLKSIGRRKLIVPLYKELAKNAESKAWAVEVYKQARPGYHGLAQGTVDGVLK' test1 = flexcalc(prot) from Bio.SeqUtils.ProtParam import ProteinAnalysis as PA test_seq = PA(prot) test2 = PA.flexibility(test_seq) import matplotlib.pyplot as plt plt.scatter(test1[1:], test2)
Which produces the following. ProtParam values on the y-axis and our values on the x-axis. Note that the scales are different, they are arbitrary. I haven’t figured out how to fit a linear model in Python yet (on the to do list), but the fit looks pretty good. Plenty of scatter but the methods are in general agreement (which is good – we already have one paper published using the old method!). Time to go back and redo a lot of work from the last two years…
Very nice and helpful work. I have a few suggestions to make it even better. The way you smooth the b_factors searches the next 8 amino acids while it should look on the next 4 and the previous 4 amino acids. Also your if statement (if len(win) == 9) prevents the return of all values. I suggest the following code for the smoothing:
## average over 9 aa window
b_factors_win = []
it = 0
for b in range(0, len(b_factors)):
if b in range(0,4):
win = b_factors[b – it:b + 5]
b_win = sum(map(float, win)) / len(win)
b_factors_win.append(b_win)
it += 1
else:
win = b_factors[b – 4:b + 5]
b_win = sum(map(float, win)) / len(win)
b_factors_win.append(b_win)
Cool, thanks Alexios!
Any chance you have published anywhere so I can properly reference you?
Thanks for inquiring… unfortunately we have not included this in a publication yet (on the to do list!). Feel free to use the code with a mention in acknowledgements.