Gene Content Similarity and Dissimilarity
Last year, Travis Mavrich, one of the grad students in the Hatfull lab, was the lead author on a paper in Nature Microbiology that explored the extent to which horizontal gene transfer and mosaicism function in different populations of phages. During his research, Travis often used a metric called GCD, or Gene Content Dissimilarity, to measure the genetic distance between two genomes. GCS, or Gene Content Similarity, is simply its complement (1 minus GCD): a measure of the proportion of genes two genomes share.
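To make the relationship between the two metrics concrete, here is a minimal Python sketch. It assumes each genome is represented as a set of pham identifiers, and it assumes GCS is the average of the shared fraction measured from each genome's side; that averaging step is an assumption on my part rather than something spelled out above, and the function names and pham numbers are made up for illustration.

```python
def gene_content_similarity(phams_a, phams_b):
    """Return GCS as a fraction between 0 and 1.

    Assumption: GCS is the mean of (shared / total in A) and
    (shared / total in B), where each genome is a set of pham IDs.
    """
    a, b = set(phams_a), set(phams_b)
    if not a or not b:
        raise ValueError("each genome needs at least one pham")
    shared = len(a & b)
    return (shared / len(a) + shared / len(b)) / 2


def gene_content_dissimilarity(phams_a, phams_b):
    """GCD is simply 1 minus GCS."""
    return 1 - gene_content_similarity(phams_a, phams_b)


# Hypothetical pham IDs for two small genomes that share two of four phams.
genome_a = {1, 2, 3, 4}
genome_b = {3, 4, 5, 6}
gcs = gene_content_similarity(genome_a, genome_b)
gcd = gene_content_dissimilarity(genome_a, genome_b)
```

With these made-up genomes, half of each genome's phams are shared, so GCS and GCD both come out to 0.5.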
A New Way to Cluster
Grouping phages into clusters based on similarity in their genomes has always been fraught with ambiguities. Because clusters are designed to be groupings of convenience rather than hierarchical taxonomies that imply evolutionary history, writing precise rules about what should and should not belong in the same cluster has always been a bit, well, dubious. From the original paper describing clusters:
"The primary criterion we have chosen for placing two genomes in the same cluster is that they show evident sequence similarity in a dotplot that spans more than 50% of the smaller of the two genomes."
But this is not a precise measurement, and it relies solely on nucleotide sequence similarity rather than gene content. And sometimes it felt like too strict a cutoff. Astute students of Actinobacteriophages will have noticed, for example, that Cluster A1 phages should probably not belong in Cluster A by this criterion; they're much less than 50% like other Cluster A phages at the nucleotide level!
As more phage genomes were sequenced, more complex relationships were discovered, and we found ourselves sometimes wanting to include two genomes that only had ~40% nucleotide similarity (or less!) in the same cluster. At the same time, Travis' work had shown that there was a little gap in GCS values across the collection of phages right around 35% shared phams. So we decided to change our primary clustering criterion, and now phages that share 35% or more gene content are grouped into the same cluster.
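As a rough illustration of the new criterion, the sketch below checks whether two phages clear the 35% cutoff. It assumes genomes are sets of pham identifiers and that GCS averages the shared fraction from each genome's side (my assumption, not a description of the actual PhagesDB code); the function names and pham numbers are hypothetical.

```python
CLUSTER_THRESHOLD = 0.35  # the 35% shared gene content cutoff, as a fraction


def shared_gene_content(phams_a, phams_b):
    """Assumed GCS: mean of the shared-pham fraction in each genome."""
    a, b = set(phams_a), set(phams_b)
    if not a or not b:
        raise ValueError("each genome needs at least one pham")
    shared = len(a & b)
    return (shared / len(a) + shared / len(b)) / 2


def meets_cluster_criterion(phams_a, phams_b, threshold=CLUSTER_THRESHOLD):
    """True if the two genomes share at least `threshold` gene content."""
    return shared_gene_content(phams_a, phams_b) >= threshold


# Hypothetical genomes with ten phams each.
phage_x = set(range(1, 11))                       # phams 1-10
phage_y = {1, 2, 3, 4, 11, 12, 13, 14, 15, 16}    # shares 4 of 10 with X
phage_z = {1, 2, 3, 20, 21, 22, 23, 24, 25, 26}   # shares 3 of 10 with X

x_and_y = meets_cluster_criterion(phage_x, phage_y)  # 40% shared
x_and_z = meets_cluster_criterion(phage_x, phage_z)  # 30% shared
```

Here X and Y would land in the same cluster, while X and Z fall just short. Real clustering decisions involve many genomes at once, so treat this as a pairwise check only.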
New PhagesDB Tools
Because we're shifting towards looking at gene content instead of just nucleotide similarity, it was important to have a reasonably efficient way to calculate GCS and GCD. We've therefore created a few new tools that we're making public so you can also explore how the gene content of your phage compares to others. They can be found at the URL below, or in the Phages dropdown menu under Gene Content. I won't say too much about their details; hopefully they're fairly self-evident.
Gene content pages: http://phagesdb.org/genecontent/