CDS 102 - 410 /note=Start 10 at 102 is called by both Glimmer and GeneMark, and is supported by Starterator. Atlantica does not have the most annotated start (11); start 10 is found in 11 of 52 (21.2%) of genes in pham and called 100% of the time when present. BLASTP @PhagesDB is consistent with this call. Next start @168 cuts off coding potential. Function is unclear: there is one good HHPred hit to terminase small, but this is a PF hit. Phages in the same subcluster have called terminase, DNA binding protein, or hypothetical. We think the putative terminase on the right-hand side of the genome is a better call for terminase small, and the available evidence gives little to no direct support for a functional call here. CDS 370 - 1992 /note=Glimmer, GeneMark, and Starterator all call the same start (98 @370; in 3% of genes in pham; called 53.5% of time when present; 13 MAs). Atlantica does not have the most annotated start (171). Other valid start (82 @349) is farther into preceding ORF; found in 2.3% of genes in pham; called 35.3% of time when present; 7 MAs. From this, start 98 @370 is best supported. BLASTP is consistent with this start (many 1:1 hits); hits tend to have called terminase large subunit function. HHPred gives 100% probability support for a large terminase subunit. CDS 1989 - 2318 /note=Glimmer start 2010; GeneMark start 1989. Start 8 @1989 is -4 gap (lazy ribosome); in 14 of 51 (27.5%) of genes in pham; 5 MAs; called 35.7% of the time when present. Start 13 @2010 is most annotated start; in 40 of 51 (78.4%) of genes in pham; 15 MAs; called 85.0% of time when present. We selected start 8 based on preference for lazy ribosome and because this start is well conserved within AS3. Other members of the AS3 subcluster called this an RNA binding protein; high-probability HHPred hits are to ribonucleoproteins. CDS 2335 - 4254 /note=Glimmer & GeneMark start 26 @2335 (in 61 of 236 (25.8%) of genes in pham; 36 MAs; called 100% of the time when present). 
Atlantica does not have the most annotated start. Portal protein domain identified with 100% probability (residues 33-400ish). Probable fusion to MuF-like domain at C terminus. CDS 6510 - 6692 /note=Start was agreed upon by Glimmer and GeneMark. The selected start (Starterator start 7) is the most annotated start and is well conserved in cluster; this pham appears to be confined to cluster AS. Other proteins in pham are annotated as hypothetical; these hits dominate PhagesDB BLASTP. HHPred hits are marginal (best 77% probability), short (~20 residues), and mostly to DUFs; no known functional motifs could be confirmed. CDS 6714 - 7301 /note=Glimmer and GeneMark agree on the start. NCBI BLAST hits suggest head-to-tail adapter, and HHPred also provides evidence for this function. Synteny also suggests this. We are confident this is a gene and has a function. CDS 7312 - 7650 /note=Glimmer & GeneMark agree on start 7312. Anything before ~7,400 doesn't cut off significant coding potential, but there is a lesser peak extending to ~7200 (which would overlap substantially into ORF of previous gene). Atlantica does not have the most annotated start (58). Auto-annotated start (Starterator 57) is found in 53 of 381 (13.9%) of genes in pham; called 100% of the time when present. Other acceptable starts per GM have no Starterator support. HHPred indicates high-probability hits to head-to-tail stopper/head completion/head-to-tail joiner. BLASTP @NCBI is consistent. Functions list indicates that the hit to Head completion protein gp16 of B. subtilis (7Z4W_1) supports call of head-to-tail stopper. Calling this function. CDS 7655 - 7909 /note=Glimmer and GeneMark agree. Starterator proposed no additional start and supports start 7655. NCBI and PhagesDB BLAST provide no specific function for this section of the sequence, but full-length or almost full-length coverage suggests this gene has homologs and is probably a gene with unknown function. Synteny is not helpful. 
CDS 7902 - 8342 /note=Auto-annotation calls start 7902. Starterator agrees. PhagesDB BLAST and NCBI BLAST both indicate tail terminator. HHPred has a full-length hit to tail completion protein (gp_17). CDS 8378 - 8959 /note=Glimmer, GeneMark, and Starterator agreed on start 8378, which includes all coding potential in this area. This start was called 95.5% of the time it was auto-annotated. HHPred has a 99.92% probability hit to the crystal structure for "major tail protein bacteriophage" and BLASTP top hits were all to "major tail protein". CDS 9060 - 9413 /note=GeneMark and Glimmer agree on start (9060). This is the most annotated start (9; found in 118 of 141 (83.7%) of genes in pham; called 99.2% of time when present). Larger gap, but no evidence for ORF in this space. HHPred has one good hit (to a PF) for tail assembly. BLAST indicates hits to tail assembly chaperones. CDS join(9060..9386,9386..9784) /note=FRAMESHIFT POSITION 9386. Conserved sequence and same annotated -1 frameshift as in RedFox. Called based on synteny and BLASTP. CDS 9788 - 12076 /note=While the Glimmer and GeneMark starts are both supported by GeneMark coding potential, start 9788 (Glimmer) is best supported by Starterator (most conserved start). This START is found in 45 of 49 (91.8%) of genes in this pham. It is auto-called 97.% of time when present, and it is manually added 26 out of 29 times when not auto-called. Additionally, this choice is associated with the most optimal spacing (gap 3) and with the best-scoring RBS. Most BLASTP hits are full length (1:1 starting alignment) or nearly so (alignments starting at residues 2-3), indicating that this START is not removing a meaningful portion of the resulting protein. There are many members of this pham; all annotated members are tape measure protein. BLASTP results include many phage proteins annotated as tape measure protein. The best matches in HHPred (>90% probability and coverage) are tape measure proteins. 
CDS 12076 - 12921 /note=This start was called by GeneMark and Glimmer, and in Starterator it is called in 46/47 non-draft annotations. Minor tail protein, as suggested by synteny with other AS3 phages. It is after the tape measure protein. There were also BLASTP hits to other minor tail proteins. CDS 12931 - 14175 /note=Glimmer, GeneMark, Starterator agreement. Start 23 @12931 is found in 106 of 125 (85.8%) of genes in pham. Manually added 62 of 75 times, called 99.1% of the time when present. HHPred hits to tail protein. NCBI and PhagesDB BLAST both result in minor tail protein. CDS 14180 - 15139 /note=Start 13 @14180, agreed upon by Glimmer and GeneMark, as well as Starterator (most annotated start; in 100% of genes in pham; called 100% of the time when present). Also LORF. HHPred hit to crystal structure 3QR8_A contains baseplate domain, consistent with annotation as tail fiber component. CDS 15149 - 15772 /note=There is Glimmer and GeneMark disagreement, but Starterator supports 15149 (start 5) as most annotated start (found in 100% of genes in pham; called 95.9% of time when present). 15179 (start 6) is suggested by GeneMark but not by Starterator (found in 38 of 49 (77.6%) of genes in pham; called 2.6% of time when present). Earlier starts cut off coding potential and are not acceptable. BLAST hits suggest minor tail protein, presumably called based on synteny. HHPred hits are to phage receptor/baseplate protein, supporting the call as a structural tail protein. CDS 15772 - 16194 /note=Start agreement GeneMark/Glimmer for start 15772. Chosen START is most annotated start, well conserved among cluster AS phage. BLASTP top hits were either membrane proteins or hypothetical proteins. No good HHPred hits, but four identified transmembrane domains according to Pham conserved domains map. DeepTMHMM graph showed very high probability of location of the protein being in the membrane. 
Foldseek showed top results also being membrane proteins (including holin, but 4 TM domains is atypical for known phage-associated holins) - the TM product encoded directly downstream of endolysin is probably a better call for holin, assuming we can only call this function once. Prior observation in Gordonia phage that "All of the four TM proteins are always found in lysis cassettes with at least two other TM proteins." (https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0276603) holds here as well - this is the first of three TM proteins in the putative lysis cassette, and the only one with four TMD. CDS 16204 - 16494 /note=Glimmer/GeneMark agreement. Most annotated start (Starterator 35) is not present in this genome; recommended start (40 @16494) is found in 59 of 226 (26.1%) of genes in pham (well conserved in AS), and is called 94.9% of time when present. Also LORF. BLASTP runs do not provide any function (hypothetical protein). Since the gene is relatively short, BLAST scores are low but alignment coverage is quite high. HHPred indicates hits to crystal structures of FtsB and FtsL (cell inner membrane proteins, bacterial cell division membrane protein complex) across most of the product, as well as to other membrane-associated protein-associating proteins. TMHMM does indicate one (1) TM domain toward the N terminus but within the body of the protein (residues 5-23), consistent with HHPred hits to membrane-tethered proteins; region is annotated as TM rather than as signal, indicating the protein is not likely to be tagged for secretion. SOSUI confirms one TM domain across residues 3-25 (PELLTAILGAGGLAAIVPKLIDG). We therefore choose to call this as "membrane protein". CDS 16561 - 17622 /note=Start was agreed upon by Glimmer and GeneMark. Does not cut off coding potential. This is the most annotated start (16; called in 24/25 non-draft genomes; found in 45 of 48 (93.8%) of genes in pham; called 100% of the time when present). 
All later starts cut off coding potential. BLASTP matched it to an endolysin, and related phages have also called this function. HHPred hits to peptidoglycan amidase support molecular activity. Unless we identify a lysin B elsewhere, this will be endolysin. CDS 17622 - 17945 /note=Start was agreed upon by Glimmer and GeneMark. Chosen start is LORF and -1 gap; RBS also good. Atlantica does not have the most annotated start (15); selected start is 30, chosen 88.9% of the time when present; well conserved in AS2/AS3. DeepTMHMM indicates two transmembrane domains (so presumably a type II holin; slightly longer protein than typical for this class). Only RedFox (among related phage) has made this call. Proximity to endolysin supports this product as a potential holin. No good HHPred hits. We are calling holin. CDS complement (18078 - 18209) /note=18209 (lazy ribosome start) is called by Glimmer and GeneMark; later starts cut off coding potential, and 18281 extends unusually far into previous coding region. Starterator agrees - this is most annotated start, called 100% of the time when present. PhagesDB BLAST is consistent with Starterator. NCBI BLAST returns only SEA-PHAGES phage proteins. HHPred has full-length hits for HTH domain of 8B4H_C and 8Q4D_D (putative transposases), DNA binding 5CLV_J (TrfB transcriptional repressor protein; helix-turn-helix), 8QA9_C (transcriptional regulator) crystal structures. Visual inspection of alignments indicates HTH-consistent predicted secondary structure, but GYM2.0 does not find HTH motif within this sequence. CDS complement (18206 - 18520) /note=Glimmer and GeneMark disagreement. 18520 could be the start as the other suggested start introduces too much overlap (>50 bp), whereas start 18520 is lazy ribosome (generally preferred). 
Starterator: start 18 @18520 was definitively called on phage DanHam62, on the same track as Atlantica; this is the most annotated start, called in 17/21 non-draft genes in pham, present in 31 of 39 (79.5%) of genes in pham, called 96.8% of time when present. Start 16 @18592 is present in 36 of 39 (92.3%) of genes in pham, manual annotations 1/21, called 2.8% of the time when present. BLAST hits do not indicate a function. Gene product seems to belong to the pleckstrin homology (PH)-like domain family. Per literature, "The PH domain consists of a seven-stranded β-sandwich, which forms a pair of perpendicular β-sheets capped by a C-terminal amphipathic α-helix" (https://www.sciencedirect.com/science/article/pii/S0022283609013576?via%3Dihub) - AlphaFold indicates that this structure is predicted for the monomer. CDS complement (18517 - 18732) /note=GYM2 finds an N-terminal HTH motif starting at position 23 (score 58) with motif LSRPEVAERIGVKPDTLNRYKL. Many 100% homologous hits in PhagesDB, most annotated as HTH domain protein. HHPred likewise shows multiple hits to HTH domain. CDS complement (18785 - 19504) /note=GeneMark and Glimmer agreed on start 19504 (start 8; in 50 of 52 (96.2%) of genes in pham; called 98.0% of time when present). Start 19591 also acceptable from coding potential; this start is much less conserved and is never called. As 19504 is better conserved and has the better RBS, and as the farther start does not meaningfully diminish the gap, we will retain the selected start. HHPred hits are of low-marginal probability, and to specific regions only. Hit to conserved DUF2059 bacterial domain in (https://www.rcsb.org/structure/6F03) - domain is of unknown function. GYM2 finds an HTH domain with motif QVRQDLADELGTAGWTATDHTR starting at position 166 (score 39) - this is a region with frequent hits in HHPred and Foldseek, to phage-associated proteins with no annotated function. 
A second N terminal region (residues 16-90 approx) is also highly conserved and seems to be separable per Foldseek and HHPred; again, function is unknown. CDS complement (19926 - 20165) /note=Glimmer does not call this gene. Start 12 @20165 is the most annotated start, but with very few members in pham; allows lazy ribosome. BLASTP returns only SEA-PHAGES phage products. HHPred hits are poor - not informative. This gene is weird but we are having a hard time quantifying why. There is a peak of coding potential here (kinda - host-trained GM is more convincing than self-trained), but it... just doesn't look like anything? Since there is (maybe) coding potential, keep it I guess? Not happy about it though. CDS complement (20162 - 20521) /note=Glimmer/GeneMark disagreement. Per original Starterator, auto-annotated start 14 @20326 is found in 5 of 6 (83.3%) of genes in pham, selected 100% of the time when present; start 6 @20521 is also in 5 of 6 genes in pham but is never selected. In our maps, all starts before 20521 cut off coding potential; this start is also lazy ribosome and longest ORF, closing a gap between genes. HHPred hits are poor, and no conserved regions found - cannot annotate a specific function. CDS complement (20518 - 20871) /note=GeneMark predicted another start (Starterator start 138 @20838; found in 18 of 638 (2.8%) of genes in pham and called 33.3% of time when present; 1 MA). Most annotated start (117) is not present. Start 106 @20871 has 14 MAs, in 26 of 638 (4.1%) of genes in pham, called 73.1% of time when present. This is also lazy ribosome start and therefore preferred on principle. HHPred gives the same function that is shown in PhagesDB BLAST and NCBI BLAST. The domain ranges from query 1 to 66. Domain: HNH endonuclease. CDS complement (20868 - 21209) /note=Glimmer, GeneMark, and Starterator all agree on the start 21209. GYM2 also concludes that this region has one HTH domain. BLAST results show high homology with AS3 HTH proteins. 
CDS complement (21302 - 21526) /note=Although Glimmer did not call a start, GeneMark and Starterator agreed on start 21526, which includes all coding potential. This start is the most annotated start and was called 100% of the time when present. HHPred had no high-probability matches and BLASTP showed a few 100% coverage matches, all with unknown function. CDS complement (21529 - 21972) /note=GeneMark and Glimmer agreed on the start. Atlantica does have the most annotated start; selected start (7 @21972) is most annotated; appears in 63% of genes in pham; is selected 100% of the time when present; is LORF; is not lazy ribosome. BLASTP @PhagesDB indicated hypothetical protein. HHPred did not produce high-probability hits. Given hypothetical protein hits in PhagesDB and NCBI and only low-probability hits in HHPred, we selected hypothetical protein as an appropriate function. CDS complement (22032 - 22211) /note=Start at 22211, good RBS, aligns well with area of high coding potential, called by Glimmer and GeneMark. Start is called in 77.5% of genes in pham. Some moderate-probability hits (85-88%) to DNA binding proteins in HHPred, all ~32 amino acids long (~50% of target). C-term hits are lower probability, all DUF. Not enough evidence to call a function. (Note: Consider the total length of the product - 32 residues is more than half the protein!). AlphaFold indicates three nice confident alpha-helices, consistent with the body of a HU protein but lacking the protruding β-sheet arms responsible for DNA binding. CDS complement (22359 - 22619) /note=GeneMark and Glimmer agreed on the start. Atlantica does have the most annotated start; selected start (7 @22619) is most annotated; appears in 100% of genes in pham; is selected 100% of the time when present; is not LORF; is not lazy ribosome. For lack of hits in BLASTP, NCBI, and HHPred, we selected hypothetical protein. 
This ORF is well conserved with synteny at the end of the tandem-repeat-flanked sequence, but there is very little information on what this could be. NCBI BLASTN indicates hits within Methylobacterium "amino acid adenylation domain-containing proteins" (but only 4 hits, and these are much larger proteins). AlphaFold produced something with structure (a beta sheet region and two alpha helices), but not great confidence. GYM 2.0 indicated no HTH motifs. CDS complement (22721 - 23914) /note=Glimmer and GeneMark call the same start, and Starterator agrees. BLAST hits are to tyrosine integrase and integrase. HHPred hits are mainly to integrase. Conserved domain: Integrase Domain (XerC Superfamily). CDS complement (23907 - 24284) /note=Glimmer and GeneMark disagreed on start. 24,284-24,350 OK for coding potential. Auto-called start (8 @24284) is most annotated start (called in 19 of the 29 non-draft genes in the pham; found in 38 of 51 (74.5%) of genes in pham; called 94.7% of time when present). Other starts not as well conserved, rarely annotated. Gap corresponds to bi-directional promoter. HHPred hits strongly indicate transcriptional regulator. Phage hits are to immunity regulator/cI-type. Hits to P22/C1/cro-type have ~50% coverage. Near full-length hits to immunity regulator 7TZ1_A (Mycobacterium phage TipsytheTRex), to Rep from the temperate Salmonella phage SPC32H (structure 5D4Z), and to pLS20 conjugation repressor Rco (8BNY_B - shares p53-like tetramerization domain with Rep). For the last, HTH cro/c1 lambda repressor type domain is residues 13-69 (https://www.uniprot.org/uniprotkb/E9RIY8/entry). Tetramerization domain is later, in C term (~130-160, https://journals.iucr.org/d/issues/2023/03/00/jb5053/index.html). Tetramerization would imply cI... but tetramerization domain may not be present? Alignment to 8BNY_B stops at ~119... Does have a hit to cro (2OVG_A Phage lambda Cro) N-term (~10-40; HTH motif region 16-35) at 79% probability. 
Consider https://www.ebi.ac.uk/interpro/entry/InterPro/IPR010982/. In AlphaFold - dimerizes beautifully using C-term region (slightly uncertain α-helix). Tetramer also looks great, consistent with known cI regulatory mechanism. Predicted binding is right around 24,400 (24,390-24,420ish?) CDS 24504 - 24773 /note=Both Glimmer and GeneMark called start 24504, and Starterator agrees (most annotated start). GYM 2.0 HTH domain finder identified 2 HTH motifs within the sequence. Related phages called an HTH DNA binding protein. Look at the HHPred hits. Phage lambda has cI and cro repressors, and this one has hits to P22 (cro-like) and cro... https://en.wikipedia.org/wiki/Cro_repressor_family. Does have a 97% probability hit to 7JVT_C Repressor protein CI - but only to the HTH domain (22-77), not to the cI C-term domain (97-196). Similar to P03034 RPC1_LAMBD Repressor protein cI - only over HTH (1-80); cI is 237 residues; HTH is 30-49. Shares hit to 7TZ1_A Immunity repressor of Mycobacterium phage TipsytheTRex (target 22-99); same to N-terminal domain of P22 (2R1J_L Repressor protein C2). Consider https://www.ebi.ac.uk/interpro/structure/PDB/1cop/ CDS 24774 - 25004 /note=Glimmer, GeneMark, and Starterator agreement on the start. BLAST and HHPred suggest HTH DNA binding protein function. It is also part of the helix-turn-helix_17 superfamily. NOTE: if this is excisionase, we should see the unusual winged HTH domain. CDS 25001 - 26014 /note=Calling start @25001, called by Glimmer and Starterator (start 41, called 100% of the time when present). BLASTP from PhagesDB and NCBI suggests that similar proteins have been called as RecE-like exonucleases, and pham 215809 has many calls for an exonuclease. HHPred found a 99.94% probability hit with 5YET_B, from positions 40-333 on the target sequence, which included a YqaJ-like viral recombinase domain, which matched up to positions 18-337 of the query sequence. 
CDS 26015 - 26812 /note=100% probability HHPred hit over entire protein to a crystal structure verified Rec-T DNA annealing protein. 26015 is the most annotated start on Starterator, called 100% of the time. Also aligns well with area of high coding potential and good RBS score. CDS 26809 - 27408 /note=Glimmer didn't call this start, but Starterator and GeneMark mark this as a proper gene. HHPred and BLAST suggest DNA methyltransferase. It also belongs to the Dcm superfamily conserved domain, a family of site-specific DNA-cytosine methyltransferases involved in DNA methylation. Interesting. There is no coding potential in the last ~half of this gene (rf1), and coding potential drops off right around the start of the next gene (in rf2). I wonder whether there is a missed STOP? There are full-length homologs in other phage within cluster, which does make this less likely, as well as apparent full-length hits in HHPred (well - full length of query, but targets tend to be 300 residues vs the 200 residues in this annotated protein). Confirmed most of the C-specific MTase domains (), but domain VIII is incomplete https://pmc.ncbi.nlm.nih.gov/articles/PMC317633/ CDS 27170 - 27967 /note=Glimmer and Starterator suggest start 27170 (Starterator start 27, in 19% of genes in pham, called 37.5% of the time when present). However, checking the Phamerator graph reveals that AS3 phages often overlap in this region, and start 27575 (Starterator start 148, not discussed in Starterator summary, but highly conserved) does not overlap in this region. Atlantica does not have the most annotated START. Therefore, we are calling 27170 the proper start in this region. BLAST suggests DNA methyltransferase in many other phages. HHPred also suggests DNA methyltransferase. PhagesDB BLASTP indicates a mix of full-length 1:1 matches and matches truncated by ~60 residues at N terminus, probably representing the weird START choice here. 
Coding potential shows a big gap between ~STOP of previous gene (27408) and start 27575. HHPred hits are all to C terminus (residues 200-end). If we choose start 27575 (which obviously changes the pham) and BLAST this sequence, we still see many full-length 1:1 hits within cluster in PhagesDB. HHPred hits are now to ~full length of query but C-term of targets. NCBI BLASTP is largely the same, with hits to C term of bacterial DNA cytosine methyltransferases. Looking closer at active sites - the C term contains something not unlike block X (elsewhere called homology region 4) of this MTase family, with highly conserved "GN" core (notably - missing in previous gp), but no other conserved sites. This appears to be the missing region from the previous ORF - frameshift? (https://pmc.ncbi.nlm.nih.gov/articles/PMC317633/pdf/nar00124-0054.pdf, https://www.sciencedirect.com/science/article/pii/0022283689904804) CDS 27964 - 28344 /note=Several reasonable starts, with coding potential ending ~28050. Start 27985 (start 20) is preferred by Glimmer; found in 23% of genes in pham, selected 66.7% of the time when present. Start 27964 (start 13) is preferred by GeneMark and is lazy ribosome -4, as well as being the most annotated start in pham (42.3% of genes, selected 63.6% of the time when present). PhagesDB BLAST hits reflect this ambiguity. On general principle, the -4 start is preferred. Lots of good high-probability HHPred hits to various HTH transcriptional regulators. Alignments clearly include HTH motif. CDS 28341 - 28811 /note=Glimmer/GeneMark disagreement. Atlantica does not have the most annotated start (46). First two starts do not cut off coding potential. Of those: Start 26 @28341 has 28 MAs, is in 52 of 330 (15.8%) of genes in pham, is called 82.7% of time when present. Is LORF and lazy ribosome. Start 43 @28377 is in 53 of 330 (16.1%) members, called 15.1% of time when present. Evidence favors start 26. 
BLAST and HHPred suggest RusA-like resolvase, and the gene belongs in the RusA superfamily. CDS 28808 - 29587 /note=Glimmer/GeneMark disagreement. Starterator suggests this start (7 @28808; most annotated start, called in 28/29 non-draft genomes, in 100% of genes in pham; called 94.2% of the time when present); also LORF and lazy ribosome start. GeneMark-suggested start 9 @28823 is also conserved (100% of genes in pham) but is called 5.8% of the time w/no manual annotations. GYM 2.0 found 2 probable helix-turn-helix DNA binding domain motifs. BLAST and HHPred also provide evidence for an HTH DNA binding domain. Multiple high-probability N-terminal hits in HHPred, including RepA-like and DnaD-like initiators (can see this in function frequency as well) and lambda O-like (same activity). RepA-like should contain 2 winged HTH (C-terminal is less conserved in literature, consistent with lack of hits to this region) connected by a long (variable-length) linker domain; we do see this in AlphaFold, with an extended disordered region connecting the HTH domains. See https://www.pnas.org/doi/full/10.1073/pnas.1406065111 CDS 29584 - 30987 /note=Both Glimmer and GeneMark called 29584, and Starterator agrees. BLASTP has hits to both DNA methylase and DNA methyltransferase; only the latter is a recognized function. Also a hit to YhdJ superfamily, which is a DNA methyltransferase domain. Other related phages also called a DNA methylase/methyltransferase, and this gene is in the middle of a bunch of DNA/RNA binding proteins. CDS 30984 - 31448 /note=Glimmer and GeneMark agreed on this start (start 9 @30990). Starterator says that this gene has the most annotated start in the phamily (start 8 @30984, in 100% of genes in pham). As this start is also lazy ribosome, it will be preferred on principle. HHPred has high-probability N-terminal hits to replisome organizer/helicase loader (e.g. 1NO1_A, lambda P). 
AlphaFold is consistent, predicting an alpha-helical core, disordered linker, and small alpha-helix C-term; N term is the only substantial ordered region, so we should not expect structural homology hits beyond this. No helicase downstream, but if this is a lambda-like system, helicase will be from host & not encoded here. See https://pmc.ncbi.nlm.nih.gov/articles/PMC10128896/ CDS 31445 - 32035 /note=Glimmer/GeneMark disagreement. Start 13 @31,451 is the most annotated START and is very well conserved. No STARTS before 31,600 cut off coding potential in host GM; self GM indicates a second peak 31,450-31,600, in which case first two starts here are fine & later starts are not. Start 8 @31,445 is well conserved in AS3 phage, but not globally. On general principle, lazy ribosome START (start 8) is preferred, even though this is not the most annotated START. HHPred hits to zinc ribbon domain have a gap in the second CXXC repeat - not clear that the DNA-binding structure is conserved. Found hit to SHS2 (strand-helix-strand-strand) domain in 4QIW_P - this is a general domain w/many potential uses (https://pubmed.ncbi.nlm.nih.gov/15281131/) CDS 32157 - 32417 /note=GeneMark and Glimmer agreed on the start. Atlantica does have the most annotated start; selected start (6 @32157) is the most annotated; appears in 98.1% of genes in pham; is selected 84.3% of the time when present; is LORF; is not lazy ribosome. BLASTP @PhagesDB indicated no direct function (hypothetical protein); BLASTP @NCBI indicated no direct function (hypothetical protein). HHPred did not produce high-probability hits. CDS 32482 - 32637 /note=GeneMark and Glimmer agreed on the start. Atlantica does not have the most annotated start; selected start at position 32482 is not most annotated; appears in 88.5% of genes in pham; is selected 67.4% of the time when present; is LORF; is not lazy ribosome. BLASTP @PhagesDB indicated function unknown. 
HHPred did not produce high-probability hits, but Phamerator map indicates two transmembrane domains in protein core; also found in DeepTMHMM. For these reasons we selected membrane protein. CDS 32634 - 32906 /note=GeneMark and Glimmer agreed on the start. Atlantica does have the most annotated start; selected start at position 32634 is most annotated; appears in 100% of genes in pham; is selected 86.7% of the time when present; is LORF; is lazy ribosome. BLASTP @PhagesDB and @NCBI indicated unknown protein. HHPred did not produce high-probability hits. For these reasons we selected hypothetical protein. CDS 32903 - 33148 /note=Calling start @32903, which is called by Glimmer and agreed upon by Starterator. This is also the most conserved start within this pham (found in 100% of the genes in the pham, called 97.9% of the time when present). There are no hits on BLASTP (PhagesDB, NCBI) regarding function. HHPred suggests a slight possibility of a metal binding site, but hits are at best 70% probability. CDS 33145 - 33516 /note=Glimmer start 33121; GeneMark start 33145. Atlantica does not have the most annotated start. Start 59 @33121 has 1 MA; in 15 of 409 (3.7%) of genes in pham; called 60% of the time when present. Start 68 @33145 has 6 MAs; is -4 gap (lazy ribosome); found in 4 of 409 (1.0%) of genes in pham; called 50.0% of time when present. No strong support for preferring either Glimmer or GeneMark start; chose -4 start (lazy ribosome) on general principles. Functional call supported by multiple 90%+ probability hits to nucleoside deoxyribosyltransferase domain. CDS 33506 - 34015 /note=Start agreed on by Starterator, Glimmer, and GeneMark. BLASTP had good hits to SSB protein on PhagesDB and NCBI. HHPred has identified an ssDNA binding protein domain. CDS 34170 - 34904 /note=Glimmer, GeneMark, and Starterator all call this start. BLAST and HHPred call for endonuclease. 
However, PhagesDB and other phages in the pham call NucT-like nuclease, and we are going with this call. CDS 34904 - 35263 /note=Both Glimmer and GeneMark called start 34904, and Starterator agreed. Some related phage have called it as a terminase small subunit; BLASTP has decent hits to a terminase small subunit; HHPred has high-probability, 75%+ length hits to terminase small. As we have terminase large on the left arm of the genome, we can call small here. CDS 35260 - 35601 /note=Start 12 @35260 is called by Glimmer & GeneMark and is the most annotated start; also both LORF and lazy ribosome (-4 gap). Selected start at position 35,260 is most annotated; appears in 68.8% of genes in pham; is selected 100% of the time when present. BLASTP @PhagesDB indicated conserved starts, unknown function. BLASTP @NCBI returned some bacterial hits, but always indicated unknown protein. HHPred did not produce high-probability hits. For these reasons we selected hypothetical protein. CDS 35604 - 35768 /note=Glimmer/GeneMark agreement on start 35604 (start 18; 33.3% of genes in pham, selected 62.5% of the time when present, well conserved in cluster); does not cut off coding potential. Does not have most annotated start (19). PhagesDB BLAST supports the auto-annotated START. HHPred indicates high-probability hits to zinc finger transcription factors, but aligned regions are quite short (~14-18 residues) and cover the first C-C of the zinc-finger stabilizing motif (CKHCG). This should be paired with an H-H motif farther toward the C terminus of the protein (per C2H2 consensus CX2CX3FX5LX2HX3H) - most of the motif is actually there, but I find only one His, with a tryptophan (W) where the first H should be. (This is still a cyclic AA with an indole nitrogen in what might be the right place - will it work? Plugged the protein into AlphaFold as a monomer with ion Zn2+, and it does actually seem to be held correctly - maybe this thing does bind an ion. 
This is, however, atypical.) CDS 35765 - 35887 /note=Glimmer, GeneMark, and Starterator called the same start at 35765. Related phages do not have a function listed, nor does BlastP, and HHPred has no good hits. AlphaFold predicted a structure with a single long helix (weird?). CDS 36066 - 36281 /note=Tandem start - picked second start. Glimmer/GeneMark agree; this is most annotated start. No high-probability hits from HHPred, and few hits in BLASTP. CDS 36278 - 36595 /note=Neither the -4 nor the 20-gap start cuts off coding potential, but 36302 (start 2 in Phamerator, 20 gap) is less well conserved & appears to be auto-annotated but not manually annotated. No good HHPred hits; no functions suggested in BLAST. CDS 36592 - 36807 /note=-4 start allows lazy ribosome, 100% called in cluster. Good full-length PhagesDB BLAST hits to other proteins within cluster; hits outside of cluster are to N-terminal half (to residue ~35). No NCBI BLASTP hits to bacteria w/stringent search. HHPred hits are very marginal, but also mostly to N-term; probabilities are better when run on HHPred independently (still max 77% probability). From these hits this looks like half of a Rho terminator (specifically C term half after the fold, from hits to crystal structures 1SG5_A and 6JIE_B). This is the half that associates with RNAP in E. coli Rho, but we appear to be missing the 1` N terminal domain. CDS 36807 - 37019 /note=GeneMark and Glimmer agree on start 36807. Atlantica does have the most annotated start; selected start at position 36807 is most annotated; appears in 68.3% of genes in pham; is selected 78.6% of the time when present; is not LORF; it is lazy ribosome (-1 bp overlap). BLASTP @NCBI also indicated conserved starts and stops and hypothetical protein listings for similar phages. BLASTP @PhagesDB indicated conserved starts and stops with all phages indicating "function unknown". HHPred produced only weak to moderate hits (<80% probability), no crystal structures.
Given the lack of functional hits, we listed this as a hypothetical protein. CDS 37019 - 37210 /note=Selected start is the only one not to cut off coding potential. Glimmer/GeneMark agreement, LORF, most annotated start. GYM2 finds an HTH motif (LRLLERARDLQEHEASVSAAFR, starting at residue 17). Folded protein (per AlphaFold) does not, however, resemble an HTH motif. CDS 37207 - 37425 /note=Start 37207 gives the longest gene. Glimmer and GeneMark agreement. Starterator: start 12 @37207 is found in 94% of the pham, and called 97% of the time when present. Atlantica has the most annotated start. HHpred does not give reasonable results, and BLAST gives hypothetical protein. CDS 37538 - 37645 /note=Only GeneMark called start 37538. Atlantica does have the most annotated start in a pham of 9 phages, only 2 of which are not drafts; is LORF; is not lazy ribosome. BLASTP @PhagesDB indicated no good hits except to the two in AS3 phages Juno and KHumphrey. No good hits on HHpred or BlastP to indicate a possible function or even a possible protein, but there is coding potential - retain. CDS 37635 - 37859 /note=Even though GeneMark did not call this gene, the report did show high coding potential with a short dip in the middle. Does not have the most annotated start. Selected start (56 @37635) is in 21 of 46 ( 45.7% ) of genes in pham, called 57.1% of the time when present. Remaining viable start per coding potential (26 @37563) has much lower prevalence and is rarely annotated. HHPred has many high-probability hits (>98%) to HTH DNA binding domain viral proteins, and BLAST confirms this function in similar phages. GYM 2.0 also confirms that this gene has an HTH DNA binding motif. The domain may be winged HTH - many good hits to excise, but we don't need two (probably?). Best HHPred hit is to AlpA-type regulatory protein (antiterminator, interacts with RNAP). Also good hits to cox, terminase small, and MerR-family TFs. CDS 38103 - 38390 /note=Glimmer called 38103.
Atlantica does not have the most annotated start; selected start 86 @38103 appears in 7.3% of genes in pham; is selected 100% of the time when present; is LORF; is not lazy ribosome. BLASTP @PhagesDB indicated that most of the decent hits were to an HNH endonuclease. Alternate starts cut off significant coding potential. HHPred produced high-probability hits (97-98%) to HNH endonuclease (4OGE_A), and BLASTP @NCBI indicated good hits to HNH endonuclease. A 5-methylcytosine-specific restriction endonuclease McrA domain was identified by CDD. Therefore we decided to call HNH endonuclease function. (Is this endonuclease being targeted by the previous gene product (winged HTH)?)