Phamerator Database Building Update
Charlie Bowman | Oct 7, 2014

Summary


We are deploying a major update to the way that phams are built in Phamerator. It greatly speeds up database construction and improves client performance for downloads and updates. Users will see minor differences in the phams themselves, but should mostly notice much shorter times for inclusion of new genomes in draft databases and an overall more responsive experience.


Background


In the original description of Phamerator (Cresawn et al., 2011), we created phamilies (phams) from pairwise amino acid sequence alignments using CLUSTAL and BLASTP, assembling related proteins into phams if the pairwise values were above threshold values. As the number of sequenced genomes (and genes) has grown, the computational time needed for these calculations has grown substantially. For the most recent database, with ~65,000 genes, this means that 65,000 x 65,000, or ~4,225,000,000 (4.225e9), alignments and BLAST comparisons must be performed to build a new database. Incremental building approaches decrease this number to around 6.5e7 (the approximate number of years since the proposed Cretaceous-Paleogene extinction!), but this is still computationally demanding. Supercomputing resources that divide the calculations among hundreds of CPU cores can reduce the build time to less than two weeks, but this is still very limiting, especially if something goes awry halfway through! Moreover, the size of the database tables is now approaching physical limitations.
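The scaling argument above can be checked with a few lines of Python. This is only a sketch of the arithmetic quoted in the text. The function name is ours, and the reading of 6.5e7 as roughly 1,000 new genes compared against all 65,000 is an assumption made for illustration, not something stated in the original description.

```python
def all_vs_all(n_genes):
    """Number of ordered pairwise comparisons, counted as n x n as in the text."""
    return n_genes * n_genes


genes = 65_000
full_build = all_vs_all(genes)

# ASSUMPTION (illustrative only): an incremental build compares about
# 1,000 newly added genes against every gene already in the database.
new_genes = 1_000
incremental_build = new_genes * genes

print(f"{full_build:,}")                 # 4,225,000,000
print(f"{incremental_build:,}")          # 65,000,000
print(full_build // incremental_build)   # 65
```

Even under this assumption the incremental build saves only a factor of about 65, because the work still grows with the total number of genes in the database.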


Solutions


Phage hunters are not the only ones who have to address the challenge of alignment-based similarity searching in large datasets. A common resolution is to use non-alignment methods such as CD-HIT (Li and Godzik, 2006), UCLUST (Edgar, 2010), and kClust (Hauser et al., 2013), which cluster large protein databases relatively quickly. Among these, UCLUST is not recommended for clustering below 50% identity (the current threshold we use is 32.5% identity), whereas kClust can achieve higher sensitivities at lower cutoffs than CD-HIT (Hauser et al., 2013). In general terms, kClust works by dividing protein sequences into small units (k-mers; 4-6 aa) and then comparing k-mer profiles.
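To give a feel for what comparing k-mer profiles means, here is a deliberately simplified Python sketch. It is not kClust's scoring scheme: it treats a profile as the set of exact k-mers in a sequence and compares two profiles with a Jaccard index, both of which are our assumptions for illustration. The function names and example sequences are hypothetical.

```python
def kmers(seq, k=4):
    """Return the set of overlapping k-mers in a protein sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_similarity(a, b, k=4):
    """Jaccard index of two exact-match k-mer sets (a toy stand-in, not kClust's score)."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0  # sequences shorter than k have no k-mers to compare
    return len(ka & kb) / len(ka | kb)


print(sorted(kmers("MKTAY")))                   # ['KTAY', 'MKTA']
print(kmer_similarity("MKTAYIAK", "MKTAYIAK"))  # 1.0
print(kmer_similarity("MKTAYIAK", "GGGGGGGG"))  # 0.0
```

The appeal of this family of methods is that no alignment is computed: sequences with no k-mers in common can be skipped cheaply, which is what makes clustering of large protein sets tractable.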
We have therefore incorporated the kClust clustering pipeline into Phamerator. The new Phameration process loosely follows the kClust_iter pipeline and involves two steps. First, genes are grouped based on 75% predicted amino acid conservation, multiple sequence alignments are created, and hidden Markov models are generated. Second, consensus sequences of these groups are used in a second round of clustering at 30% predicted amino acid conservation.
There are numerous variable parameters in each of these steps, and we have evaluated many of them to generate databases that are logical, with low incidences of both false-positive and false-negative groupings. In general, the phams closely reflect those determined using the previous method, although the pham numbers are different.
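The two-step idea can be sketched in Python as follows. This is a toy illustration under loud assumptions, not the Phamerator or kClust_iter implementation: the similarity measure is a naive position-by-position identity, clustering is a simple greedy pass, and the first member of each group stands in for the consensus sequence that the real pipeline derives from alignments and hidden Markov models. All function names are hypothetical.

```python
def identity(a, b):
    """Naive ungapped identity: matching positions over the longer length (toy measure)."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(seqs, threshold):
    """Add each sequence to the first cluster whose representative it matches."""
    clusters = []  # each cluster is a list; element 0 is its representative
    for s in seqs:
        for c in clusters:
            if identity(s, c[0]) >= threshold:
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters


def two_round_clustering(seqs, first=0.75, second=0.30):
    """Round 1 groups at a strict cutoff; round 2 merges groups via their representatives."""
    round1 = greedy_cluster(seqs, first)
    # ASSUMPTION: the first member stands in for the group's consensus sequence.
    members = {c[0]: c for c in round1}
    round2 = greedy_cluster(list(members), second)
    return [[s for rep in group for s in members[rep]] for group in round2]


seqs = ["AAAAAAAAAA", "AAAAAAAAAC", "AAAACCCCCC", "GGGGGGGGGG"]
print(two_round_clustering(seqs))
# [['AAAAAAAAAA', 'AAAAAAAAAC', 'AAAACCCCCC'], ['GGGGGGGGGG']]
```

In this example the first two sequences group at the 75% cutoff, the third joins them only in the looser 30% round, and the fourth stays on its own. The second round compares one representative per group rather than every gene, so it involves far fewer comparisons than an all-against-all search.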


What you will see as a user


Database creation time for the most recent version of the Mycobacteriophage_Draft database (690 genomes) is now reduced from approximately two weeks to under two hours. The changes have been implemented on the Phamerator servers and do not affect the virtual machines themselves. You therefore don’t need to do anything to access the new databases; they will appear once you open Phamerator and allow the updates to be applied. Because overall database sizes have decreased by approximately 80%, you should see faster updates and shorter loading times. Users who create their own databases should see the new scripts rolling out in the next few weeks, along with instructions for their use.


What you will see as a database builder


We will be releasing the new script as part of the Phamerator distribution within the next two weeks. The process of adding genomes to and removing genomes from the databases will be unchanged. The change is in the phamily building process: you will no longer be required to run any BLAST or ClustalW alignments. A few changes will be needed before you can use the new phamily building script:


  • kClust will need to be installed

  • The HH-suite of tools will need to be installed

  • Some database constraints will need to be changed

A document will be released along with the back-end updates that details these processes.


Some differences


The kClust method avoids the creation of the ultra-large phams that were emerging in the prior Phamerator databases. These are now divided into many smaller phams, which generally make sense with regard to their actual similarities. You can still examine the sequence similarities among pham members using the ‘Pham’ function in Phamerator. You will also see that the pham numbers are quite different from what they were. While we believe the new groupings to be sound, please report anything that seems out of place. In particular, we are interested in any instances of proteins that you think ought to be in the same pham but aren’t.


References


Cresawn, S.G., Bogel, M., Day, N., Jacobs-Sera, D., Hendrix, R.W., Hatfull, G.F., 2011. Phamerator: a bioinformatic tool for comparative bacteriophage genomics. BMC Bioinformatics 12, 395.
Edgar, R.C., 2010. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26, 2460-2461.
Hauser, M., Mayer, C.E., Söding, J., 2013. kClust: fast and sensitive clustering of large protein sequence databases. BMC Bioinformatics 14, 248.
Li, W., Godzik, A., 2006. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658-1659.