I’m happy to report that I have a new paper out this week in Frontiers in Microbiology titled Identification of Microbial Dark Matter in Antarctic Environments. I thought that it would be interesting to see how well different Antarctic environments are represented by the available completed genomes (not very was my initial guess), got a little bored at the ISME meeting this summer, and had a go at it.
My approach was to find as many Antarctic 16S rRNA gene sequence datasets as I could on the NCBI SRA (Illumina MiSeq only), reanalyze them using consistent QC and denoising (dada2), and apply our paprica pipeline to see how well the environmental 16S rRNA sequence reads match the full-length 16S rRNA gene sequences in a recent build of the paprica database.
First things first, however. It was interesting to see 1) how poorly distributed the available Illumina libraries were around the Antarctic continent, and 2) just how many bad, incomplete, and incorrect submissions exist in SRA. 90 % of the effort on this project was invested in culling my list of projects, tracking down incorrect lat/longs, dealing with sequence files that weren’t demultiplexed, etc. The demultiplexing issue is particularly irritating as I suspect it results purely from laziness. Of course the errors extend to some of my own data, and I was chagrined to see that the accession number in our 2017 paper on microbial transport in the McMurdo Sound region is incorrect. Clearly we can all do better.
In the end I had 1,810 libraries that I felt good about, and that could be loosely grouped into the environments shown in the figure above. To get a rough idea of how well each library was represented by genomes in the paprica database I used the map ratio value calculated within paprica by Guppy. The map ratio is the fraction of bases in a query read that match the reference read within the region of alignment. This is a pretty unrefined way to assess sequence similarity, but it’s fast and easy to interpret. My analysis looked at the map ratio value for 1) individual unique reads, 2) samples, and 3) environments. One way to think about #1 is represented by the figure below:
What these plots tell us is that most unique reads were reasonably well represented by the 16S rRNA genes associated with complete genomes (> 80 % map ratio, which is still pretty distant genetically speaking!). However, there are quite a lot of reasonably abundant reads with much lower map ratios. (Looking at this now it seems painfully obvious that I should have used relative abundance. Oh well.)
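For anyone who wants something concrete to poke at, here is a toy Python sketch of the map ratio idea and of rolling it up from unique reads to samples. To be clear, this is not how Guppy or paprica compute it internally. The function names, the gapped-string input format, and the choice of an abundance-weighted mean are all my own assumptions for illustration.

```python
from collections import defaultdict


def map_ratio(query_aln: str, ref_aln: str) -> float:
    """Toy map ratio: fraction of aligned query bases identical to the reference.

    Assumes (hypothetically) that both inputs are rows of the same pairwise
    alignment, equal in length, with '-' marking gaps.
    """
    if len(query_aln) != len(ref_aln):
        raise ValueError("aligned sequences must have equal length")
    # Only consider columns where the query actually has a base.
    pairs = [(q, r) for q, r in zip(query_aln.upper(), ref_aln.upper()) if q != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for q, r in pairs if q == r)
    return matches / len(pairs)


def weighted_mean_map_ratio(reads):
    """Abundance-weighted mean map ratio per sample.

    reads: iterable of (sample, count, map_ratio) tuples, one per unique read.
    This weighting is an illustrative choice, not the method used in the paper.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for sample, count, ratio in reads:
        totals[sample] += count * ratio
        counts[sample] += count
    return {s: totals[s] / counts[s] for s in counts if counts[s] > 0}


if __name__ == "__main__":
    # One mismatch among five aligned query bases.
    print(map_ratio("ACGT-A", "ACGAAA"))
    # Made-up counts and ratios for two made-up samples.
    print(weighted_mean_map_ratio([("s1", 3, 1.0), ("s1", 1, 0.6), ("s2", 2, 0.5)]))
```

Weighting by read count means a single abundant, poorly matched read drags the sample value down more than a rare one does, which is roughly the intuition behind wanting relative abundance in the plots.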
I didn’t make an effort to track down all the completed genomes associated with Antarctic strains (if that’s even possible), but there is a known deficit of psychrophile genomes. Given that Antarctica tends to be chilly I’ll hazard a guess that there aren’t many complete bacterial or archaeal genomes from Antarctic isolates or metagenomes. Given the novelty of many Antarctic environments, and the number of microbiologists who work in Antarctica, I’m a little surprised by this. I’m also kind of excited, however, thinking about how we might solve this for the future…