VirMiner - About

Highlights - VirMiner

VirMiner provides a comprehensive analysis pipeline with several highlights: (1) raw read processing, on-site metagenome assembly and gene prediction; (2) comprehensive functional annotation, including Pfam, KEGG orthology (KO), phage orthologous groups (POG), viral protein families and viral hallmarks; (3) a highly sensitive random forest (RF) predictive model for phage contig identification, which performs particularly well in identifying highly abundant phage contigs; (4) phage-host relationship prediction and CRISPR-site recognition; (5) statistical comparisons among different sample groups. Here we mainly introduce the two parts in which we made improvements.

Our updated POG database (uPOGs)

uPOGs was built in June 2016 using the same method as Kristensen et al., but integrating more recently released and published phage genomes to obtain more POGs. Compared with pVOGs, another recently updated database (Grazziotin et al., 2016), uPOGs adds information needed for further analysis: virus-specific POGs and taxon signature POGs.

(1) POG construction. We collected 4,078 phage genomes from the NCBI nucleotide database. Using the standard COG-building method, 357,460 proteins were predicted from those phage genomes, and 321,294 of them were clustered into 16,710 POGs.

Table 1. Statistics for POG construction from uPOGs, pVOGs and POG 2012 version.

	uPOGs	pVOGs (Grazziotin et al., 2016)	POGs 2012 (Kristensen et al., 2012)
Genomes	4,078	2,993	1,027
Proteins	357,460	295,653	97,731
POGs	16,710	9,518	4,542

(2) Virus-specific POGs identification. We further identified virus-specific POGs, which help to distinguish phage from other components of microbial genomes, such as prokaryotic genomes, based on the virus quotient (VQ) from Kristensen et al. The closer the VQ is to 1, the more virus-specific the POG is. The distributions of VQ values in POG 2012 and in our updated POG database were comparable. In total, 11,978 virus-specific POGs (VQ >= 0.8, see Table 2) are provided, more than in the previous database.

Table 2. Virus-specific POGs for uPOGs and POG 2012.

VQ	POGs 2012 (Kristensen et al., 2012)	uPOGs
1	62% (2,816)	62.08% (10,374)
0.9 - 1	9% (409)	5.01% (838)
>= 0.8	>75% (>3,407)	71.68% (11,978)
<= 0.1	4% (182)	9.46% (1,581)
<= 0.2	7% (318)	12.34% (2,062)
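
The virus-specific selection can be illustrated with a minimal sketch. It assumes the cut-off is VQ >= 0.8, which is inferred from Table 2 (the 11,978 POGs in that row match the reported number of virus-specific POGs). The function name and the example VQ values are hypothetical and are not part of VirMiner.

```python
def summarize_vq(vq_by_pog, threshold=0.8):
    """Return the ids of virus-specific POGs and their share of all POGs.

    vq_by_pog: dict mapping a POG id to its virus quotient (0 to 1).
    threshold: assumed cut-off, inferred from Table 2 (VQ >= 0.8).
    """
    specific = sorted(pog for pog, vq in vq_by_pog.items() if vq >= threshold)
    share = len(specific) / len(vq_by_pog) if vq_by_pog else 0.0
    return specific, share


# Hypothetical VQ values, for illustration only.
example = {"POG0001": 1.0, "POG0002": 0.93, "POG0003": 0.42, "POG0004": 0.05}
specific, share = summarize_vq(example)
print(specific, f"{share:.2%}")

# The uPOGs share reported in Table 2 follows from the counts in Tables 1 and 2.
print(f"{11978 / 16710:.2%}")
```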

(3) Taxon signature POGs identification. Taxon signature POGs can be used to specifically identify the presence of particular taxon groups. Using the criteria of 100% precision, VQ greater than 85%, recall greater than 85% and presence in a single copy per genome, 640 taxon-specific POGs for 32 taxon groups were identified. You can download them from here. Compared to previous reports (106 POGs for 40 taxa, including 5 unclassified taxa), more taxon marker genes were identified, while the number of taxon groups slightly decreased (32 vs 40).
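
As a rough sketch of how the four criteria combine, the check below tests one POG against one taxon group. It assumes the usual definitions, which the text does not spell out: precision is the fraction of genomes carrying the POG that belong to the taxon, and recall is the fraction of the taxon's genomes that carry the POG. The function name, argument layout and example data are hypothetical.

```python
def is_taxon_signature(copies_by_genome, taxon_by_genome, taxon, vq,
                       min_recall=0.85, min_vq=0.85):
    """Check one POG against the four taxon-signature criteria.

    copies_by_genome: genome id -> copies of this POG in that genome.
    taxon_by_genome:  genome id -> taxon group, for every genome considered.
    """
    carriers = {g for g, n in copies_by_genome.items() if n > 0}
    members = {g for g, t in taxon_by_genome.items() if t == taxon}
    if not carriers or not members:
        return False
    hits = carriers & members
    precision = len(hits) / len(carriers)   # assumed definition
    recall = len(hits) / len(members)       # assumed definition
    single_copy = all(copies_by_genome[g] == 1 for g in carriers)
    return (precision == 1.0 and recall > min_recall
            and vq > min_vq and single_copy)


# Hypothetical data: 10 genomes in taxon "A", 5 in taxon "B".
taxa = {f"A{i}": "A" for i in range(10)}
taxa.update({f"B{i}": "B" for i in range(5)})

# POG present once in 9 of the 10 "A" genomes and nowhere else.
copies = {f"A{i}": 1 for i in range(9)}
print(is_taxon_signature(copies, taxa, "A", vq=1.0))

# The same POG also found in one "B" genome: precision drops below 100%.
copies_leaky = dict(copies, B0=1)
print(is_taxon_signature(copies_leaky, taxa, "A", vq=1.0))
```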

Pre-built random forest model for predicting phage contigs

Metagenomic data and matched phageomic data were used to build the model for phage contig identification. Metagenomic phage contigs were evaluated using metagenomic sequencing of DNA from the phage library.

(1) Labeling metagenomic contigs into different categories. We defined three categories for metagenomic contigs: phage contigs, ambiguous contigs, and confident non-phage contigs. The most important part is to define true positive phage contigs among the metagenomic contigs by retrieving metagenomic phage contigs from the matched phageomic data. The metagenomic contigs were first mapped against the contigs from the phageomic data for the same sample using megablast (E value < 1e-5, identity > 98%). A metagenomic contig is marked as a phage contig if it is present in the phageomic data and meets at least one of the following criteria: (i) >80% coverage (aligned length / the length of the query or the subject contig, whichever is shorter); (ii) aligned length >10kb. A contig is defined as ambiguous if it meets one of the following criteria: (i) 40-80% coverage; (ii) aligned length between 4kb and 10kb. All remaining contigs were defined as confident non-phage contigs.
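
The labeling rules above can be sketched as a small function. This is a minimal illustration, not the VirMiner implementation: it assumes the hit has already passed the megablast filters (E value < 1e-5, identity > 98%), that one alignment per contig is considered, and that the boundaries of the ambiguous ranges are inclusive. The function name and the example lengths are hypothetical.

```python
def label_contig(aligned_len, query_len, subject_len):
    """Label a metagenomic contig from its alignment to a phageomic contig.

    aligned_len: aligned length in bp (0 if the contig has no qualifying hit).
    query_len, subject_len: lengths in bp of the two aligned contigs.
    """
    # Coverage is measured against the shorter of the two contigs.
    coverage = aligned_len / min(query_len, subject_len)
    if coverage > 0.80 or aligned_len > 10_000:
        return "phage"
    # At this point coverage <= 0.80 and aligned_len <= 10 kb.
    if coverage >= 0.40 or aligned_len >= 4_000:
        return "ambiguous"
    return "non-phage"


print(label_contig(9_000, 10_000, 50_000))      # 90% coverage
print(label_contig(12_000, 100_000, 200_000))   # 12 kb aligned
print(label_contig(5_000, 100_000, 200_000))    # 5 kb aligned, low coverage
print(label_contig(500, 5_000, 8_000))          # 10% coverage, 0.5 kb aligned
```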

(2) Constructing random forest classifier model. First, several metrics were selected as features for the random forest model: (i) average depth (the number of reads mapped to a contig divided by contig length), (ii) the number of predicted genes, (iii) the number of genes mapped to the updated POG database, (iv) the number of genes mapped to viral protein families defined by Paez-Espino et al., (v) the percentage of genes annotated to viral protein families (the number of predicted genes annotated to viral protein families divided by the total number of predicted genes for this contig), (vi) the number of genes mapped to KO, (vii) the percentage of genes annotated to KO, (viii) the number of genes mapped to Pfam, (ix) the percentage of genes annotated to Pfam, and (x) the number of viral hallmark genes defined in Roux et al. Because only a very small fraction of contigs could be defined as phage in the previous step, which may cause an imbalanced-data problem, the training set was composed of all the phage contigs, all the ambiguous (suspicious) contigs and 3,000 randomly selected confident non-phage contigs with abundance in the top 50th percentile. The random forest algorithm was implemented with the R package randomForest.
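
The feature derivation and the down-sampling of non-phage contigs can be sketched as follows. The model itself was built in R with randomForest; this Python fragment only illustrates the data preparation and is not the authors' code. It assumes that "abundance" means average depth, that the 50th percentile is taken over the confident non-phage contigs, and that percentages are stored as fractions. All function and field names are hypothetical.

```python
import random
from statistics import median


def contig_features(reads_mapped, contig_len, n_genes,
                    n_pog, n_vpf, n_ko, n_pfam, n_hallmark):
    """Build the ten features (i)-(x) for one contig."""
    def pct(n):
        # Fraction of predicted genes; 0 when the contig has no genes.
        return n / n_genes if n_genes else 0.0

    return {
        "avg_depth": reads_mapped / contig_len,
        "n_genes": n_genes,
        "n_pog": n_pog,
        "n_vpf": n_vpf,
        "pct_vpf": pct(n_vpf),
        "n_ko": n_ko,
        "pct_ko": pct(n_ko),
        "n_pfam": n_pfam,
        "pct_pfam": pct(n_pfam),
        "n_hallmark": n_hallmark,
    }


def build_training_set(contigs, n_negative=3000, seed=0):
    """Keep all phage and ambiguous contigs, down-sample non-phage ones.

    contigs: list of dicts with at least 'label' and 'avg_depth'.
    """
    phage = [c for c in contigs if c["label"] == "phage"]
    ambiguous = [c for c in contigs if c["label"] == "ambiguous"]
    negatives = [c for c in contigs if c["label"] == "non-phage"]
    abundant = []
    if negatives:
        # Assumed reading of "top 50th percentile": depth at or above median.
        cutoff = median(c["avg_depth"] for c in negatives)
        abundant = [c for c in negatives if c["avg_depth"] >= cutoff]
    rng = random.Random(seed)
    sampled = rng.sample(abundant, min(n_negative, len(abundant)))
    return phage + ambiguous + sampled


features = contig_features(500, 10_000, 10, 6, 4, 2, 5, 1)
print(features["avg_depth"], features["pct_vpf"])

toy = ([{"label": "phage", "avg_depth": 3.0}] * 2
       + [{"label": "ambiguous", "avg_depth": 2.0}]
       + [{"label": "non-phage", "avg_depth": float(d)} for d in range(1, 11)])
training = build_training_set(toy, n_negative=3)
print(len(training))
```

The resulting rows could then be passed to any random forest implementation; the choice of library is independent of this preparation step.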

(3) Evaluating the predictive performance of the constructed random forest model. Two measurements were used to evaluate the predictive performance. One was the count of contigs predicted correctly; the other was the cumulative ratio (over total reads in the sample) of reads represented by correctly predicted contigs. When evaluating the predictive performance, the putative ambiguous contigs and confident non-phage contigs were both considered as non-phage contigs.
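
The two measurements can be sketched as below. This is an illustrative reading, not the authors' evaluation script: it assumes that "predicted correctly" is counted per class (phage by default), with ambiguous contigs collapsed into the non-phage class as described above. The function name, field names and example numbers are hypothetical.

```python
def evaluate(contigs, total_reads, target="phage"):
    """Return (count, read ratio) for correctly predicted contigs of one class.

    contigs: list of dicts with 'true' (phage / ambiguous / non-phage),
             'pred' (phage / non-phage) and 'reads' (reads mapped to contig).
    total_reads: total number of reads in the sample.
    """
    def collapse(label):
        # Ambiguous contigs are treated as non-phage for evaluation.
        return "phage" if label == "phage" else "non-phage"

    correct = [c for c in contigs
               if collapse(c["true"]) == target and c["pred"] == target]
    read_ratio = sum(c["reads"] for c in correct) / total_reads
    return len(correct), read_ratio


# Hypothetical predictions for a sample with 10,000 reads.
sample = [
    {"true": "phage", "pred": "phage", "reads": 4000},
    {"true": "phage", "pred": "non-phage", "reads": 100},
    {"true": "ambiguous", "pred": "phage", "reads": 500},
    {"true": "non-phage", "pred": "non-phage", "reads": 3000},
]
print(evaluate(sample, 10_000))
print(evaluate(sample, 10_000, target="non-phage"))
```

Reporting the read ratio next to the contig count shows whether the correctly predicted contigs are the abundant ones, which a count alone would not reveal.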