The format of input data

  1. Metagenomic raw reads. VirMiner requires users to upload raw reads of microbial metagenomic data in pair-end FASTQ format (See HERE for an example). The name of pair-end FASTQ file should be ended with "_1.fastq" or "_2.fastq". PLEASE AVOID SPACES IN THE NAMES OF FASTQ FILES.
  2. Metadata file (optional). VirMiner supports two-group comparison. If the metagenomic samples uploaded cover different conditions and you would like to make inter-group comparison, you could provide metadata file specifying the conditions (“Control” or “Treatment”). It should be a tab-delimited file named "filenames_comparison.txt": the first column refers to the names of samples that you upload; the second column indicates condition for the samples. Here's an example of metadata file.

Data preprocessing

Data preprocessing includes these 3 steps:

  1. Quality control. Before data analysis, we need to ensure that the data is of highly quality. For this step, adaptor region, low quality bases/reads and PCR-duplicated reads are removed. Here’s an example of quality report.
  2. Genome assembly. The high-quality short reads are assembled into longer contigs using IDBA_UD.
  3. Gene prediction. HMM model from MetaGeneMark is used for gene prediction.

Characterizing metagenomic contigs to identify phage contigs

Only contigs >5kb were used for downstream analysis. We characterized each contig using these metrics:

  1. Average depth (the number of reads mapped to a contig divided by contig length).
  2. The number of predicted genes.
  3. The number of predicted genes annotated to POG developed by Kristensen et al.
  4. The number of predicted genes annotated to our updated POG database.
  5. The number of predicted genes annotated to viral protein families defined in Paez-Espino et al.
  6. The percentage of genes annotated to viral protein families (the number of predicted genes annotated to viral protein families divided by total number of predicted genes for the contig).
  7. The number of predicted genes annotated to KO.
  8. The percentage of genes annotated to KO.
  9. The number of predicted genes annotated to Pfam.
  10. The percentage of genes annotated to Pfam.
  11. The number of viral hallmark genes defined in Roux et al.

Identifying phage contigs using pre-built random forest model

All of the above metrics are used as features for constructing the random forest model. We have built the random forest model, which used for predicting phage contigs in metagenomic data. For the details of random forest model, you could read here.


Taxonomy affiliation for identified phage contigs

We employ the Ribosomal Database Project (RDP) classifier, a Naive Bayes classifier, to classify the identified phage contigs into different taxonomic levels. Originally it was developed for classifying bacterial rRNA sequences into new bacterial taxonomy. The classifier was developed based on the frequency of all possible eight-base subsequence in training set with known taxonomy information. Therefore it is also suitable to search for the taxonomy affiliation of identified phage contigs in our case. The sequences of 3,319 available phage genomes downloaded from NCBI were used to train the classifier. The taxonomy information was extracted according to the NCBI GenBank taxonomy. The identified phage contigs are assigned to different taxonomy groups using the trained model. The taxonomy assignments with estimated confidence level ≤0.5 will be discarded.


Microbial diversity

For the metagenomic dataset, we calculate the number of operational taxonomic units (OTUs) that are further grouped at both genus and species level for each sample based on the taxonomy affiliation of identified phage contigs assigned by RDP classifier. The microbial diversity indices including Shannon index, Simpson index and Pielou evenness index are calculated using the R package vegan at both genus and species level. Wilcoxon rank-sum test is employed for inter-group comparisons. You could see an example of microbial diveristy results visualization here.


Phage host prediction

The phage host relationships are predicted using CRISPR-spacer based method. This step includes three main parts: firstly, all the assembled contigs are mapped to prokaryotic genomes downloaded from NCBI by blastn (E value < 1e-5). Afterwards, the CRISPR recognition tool (CRT) is employed to identify spacers from bacterial contigs. All identified spacers are searched against identified phage contigs using Blastn-short function from the Blast+ package (evalue 1e-10, max_target_seqs 1, identity 95%), following Paez-Espino et al..


Downstream analysis

  1. Visualization of the abundance of different functional groups for the identified phage contigs. For these identified phage contigs, we calculate the abundance of different functional groups: KO and Pfam. After that, these information could be used for visualization at different level.
  2. Phage-host interaction network. One bacterial species could be the host for many phage contigs. Based on this, we could draw the phage-host interaction network. From this network, you may know the panorama in your sampling environment.
  3. Inter-group comparision based on metadata information (optional). The differential abundance (DA) analysis would be performed between "Control" and "Treatment" groups specified in the metadata file. Wilcoxon rank-sum test is used to detect differentially abundant KO or Pfam categories.

The main output files

  1. Quality control
    • The raw reads count of uploaded metagenomic samples.
    • The clean reads count of metagenomic samples after quality control.
    • The detail statistics of quality control.
    • The uploaded metagenomic data after quality control in FASTQ format.
    • The uploaded metagenomic data after quality control in FASTA format.
  2. Genome assembly
    • The read length of metagenomic raw reads, which is used for the parameter of genome assmebly.
    • The assembled contigs in FASTA format.
    • The list of all the contig names.
    • The list of the contigs longer than 5kb.
  3. Gene prediction
    • The predicted gene in GFF format, which showed information of the start and end of predicted genes in contigs.
    • The coding DNA sequences of predicted genes.
    • The protein sequences of predicted genes.
    • The abundance level of each predicted gene.
    • The number of predicted genes on each contig.
  4. Functional annotation
    • Genes annotated to KO groups.
    • Genes annotated to Pfam groups.
    • Genes annotated to viral protein families.
    • Genes identified as viral hallmark genes.
    • The number and the percentage of predicted genes annotated to KO groups on each contig.
    • The number and the percentage of predicted genes annotated to Pfam groups on each contig.
    • The number and the percentage of predicted genes annotated to viral protein families on each contig.
    • The number of identified viral hallmark genes on each contig.
  5. POGs 2012 version annotation
    • Genes annotated to general POGs.
    • Genes annotated to POGs with high VQ (VQ >0.9) that could be considered as virus-specific.
  6. Our updated POGs annotation
    • Genes annotated to general POGs.
    • Genes annotated to POGs with high VQ (VQ >0.8) that could be considered as virus-specific.
  7. Average depth and relative abundance
    • The mapped reads count for each contig.
    • The average depth for each contig.
  8. Phage contig identification
    • The metrics table including functional information like KO, pfam, viral hallmark, viral protein families etc. and other metrics characterizing each contig such as average depth, which is used for phage contigs identification.
    • The extracted of all the above metrics for predicted phage contigs.
    • The sequence of predicted phage contigs in FASTA format.
  9. Phage host relationship prediction
    • The sequence of identified spacers in FASTA format.
    • The alignment result of mapping spacers to identified phage contigs by blastn.
    • The final predicted phage host relationships.
  10. Tables and Figures

    Based on Pfam groups:

    • The table showing the abundance level of Pfam abundance for identified phage contigs across all samples.
    • The heatmap showing the distribution of Pfam abundances for identified phage contigs across all samples.
    • The clustering of all samples using Multidimensional Scaling based on the abundances of all Pfam groups for identified phage contigs.

    Differential abundance analysis (optional):

    • The list of significant pfam groups. The significant level depend on the value that users provided, otherwise by default 0.05.
    • The fold change (treatment vs control) of each pfam groups.
    • Ranking fold-change, extracting top 50 and bottom 50 pfam groups as the functional groups of the interest.

    Based on KO groups:

    • The table showing the abundance level of KO abundance for identified phage contigs across all samples.
    • The heatmap showing the distribution of KO abundances for identified phage contigs across all samples.
    • The clustering of all samples using Multidimensional Scaling based on the abundances of all KO groups for identified phage contigs.
    • The table showing the abundance level of KO abundance for identified phage contigs across all samples.
    • The barplot showing the relative abundance of KEGG pathway classes for identified phage contigs in each sample.
    • The heatmap showing the distribution of KEGG pathway class abundances for identified phage contigs across all samples.
    • The table showing the abundance level of KEGG module for identified phage contigs in each sample.
    • The heatmap showing the distribution of KEGG module abundances for identified phage contigs across all samples.

    Differential abundance analysis (optional):

    • The list of significant KO groups. The significant level depend on the value that users provided, otherwise by default 0.05.
    • The fold change (treatment vs control) of each KO groups.
    • Ranking fold-change, extracting top 50 and bottom 50 KO groups as the functional groups of the interest.

    Microbial diversity:

    • The table showing the calculated diversity indices including Shannon index, Simpson index and Pielou eveness index for each sample at genus level.
    • The table showing the calculated diversity indices including Shannon index, Simpson index and Pielou eveness index for each sample at species level.
    • The figure of boxplot showing genus-level microbial diversity of phage communities.
    • The figure of boxplot showing species-level microbial diversity of phage communities.

    Phage host interaction network:

    • The attributes of nodes used to generate network.
    • The attributes of edges used to generate network.
    • The table of phage host interaction network.

Example data

Our example data used to present the output of VirMiner was from Raymond et al. containing 24 healthy individuals. Eighteen individuals out of the 24 were treated for 7 days with cefprozil, and 6 individuals unexposed to antibiotics were referred to as controls. Fecal samples were obtained at various time points: before treatment (Exposed 0 and Control 0), at the end of treatment (Exposed 7 and Control 7) and 90 days after the end of treatment (Exposed 90 and Control 90). The metagenomic raw reads were downloaded from the European Nucleotide Archive database under accession number PRJEB8094. Please be reminded to that VirMiner currently only supports two-group comparison (that is, input metagenomic samples are only allowed to cover two conditions: control and treatment). If you want to get the result from the comparison among three or more conditions, you have to input every two compared conditions every time and combine those results manually. Here we just input a subset of metagenomic raw reads extracted from the two conditions: Exposed 0 and Exposed 7, as an example. You can check all the results generated by VirMiner on the "example data results display" webpage.