Usage
Creating the variant catalog
There are three main ways to obtain a catalog of SNP loci in MALVIRUS: building it from a reference genome and a set of assemblies, uploading it through the user interface, and downloading the precomputed catalogs to your machine.
Building the catalog from a population of genomes
We first show how to build the catalog from a reference genome and a set of genomic sequences. The reference genome must be in FASTA format, whereas the set of genomic sequences must be in multi-FASTA format.
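As a quick sanity check before uploading, the difference between the two inputs can be sketched in Python: a FASTA reference is expected to hold one record, a multi-FASTA population several. This is only an illustrative sketch; the function name and the in-memory example sequences are hypothetical and are not part of MALVIRUS.

```python
import io

def count_fasta_records(handle):
    """Count records in a FASTA stream by counting header lines (those starting with '>')."""
    return sum(1 for line in handle if line.startswith(">"))

# Hypothetical in-memory examples standing in for real files on disk.
reference = io.StringIO(">ref\nACGTACGT\n")
population = io.StringIO(">s1\nACGTACGA\n>s2\nACGAACGT\n")

n_ref = count_fasta_records(reference)
n_pop = count_fasta_records(population)
print(f"reference records: {n_ref}, population records: {n_pop}")
```

With real files, the same function can be applied to a handle returned by open().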
To build the catalog, head to the Reference VCFs tab at the top of the MALVIRUS homepage and click the Build a reference VCF from genomes button, which redirects you to a submission form. First, fill in the Alias and Description fields as you please (Alias will be the name of the catalog, so choose a meaningful term). Then, in the Reference genome field, either select one of the reference genomes provided by MALVIRUS (preloaded references) or select Custom reference and upload your reference genome sequence in FASTA format in the Reference genomic sequence field that subsequently appears.
If available, you can also upload the annotation of the reference genome in either GTF or GFF format in the Gene Annotation field. If you upload this information, variant calls that use this catalog will be able to annotate variants with gene names. If you use a preloaded reference, variant calls will also be annotated using SnpEff.
Finally, upload the multi-FASTA of the set of genomic sequences in the Population genomic sequences field.
Once every field is filled in correctly, click the Submit button at the bottom of the page to start building the catalog. You will then be presented with the status page of the job; click the Refresh button at the bottom to update the information shown.
The job aligns the genomic sequences of the population to the reference genome using mafft and converts the resulting multiple sequence alignment (MSA) to a VCF using snp-sites. This job should take less than 10 minutes to complete.
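To give an intuition of what the alignment-to-VCF step produces, the following Python sketch scans an already aligned set of sequences and reports the columns where at least one sample differs from the reference. It is a deliberately simplified illustration, not the implementation of snp-sites: the function name and the toy sequences are hypothetical, and gaps and N characters are simply ignored here, which may differ from how the real tool treats them.

```python
def snp_columns(reference, aligned):
    """Return (1-based position, ref base, sorted alt bases) for each variable column.

    Assumes every sequence in `aligned` has the same length as `reference`.
    Gaps ('-') and unknown bases ('N') are ignored (a simplifying assumption).
    """
    sites = []
    for i, ref_base in enumerate(reference):
        if ref_base in ("-", "N"):
            continue
        alts = sorted({seq[i] for seq in aligned} - {ref_base, "-", "N"})
        if alts:
            sites.append((i + 1, ref_base, alts))
    return sites

# Toy alignment (hypothetical data).
reference = "ACGTACGT"
aligned = ["ACGTACGA", "ACGAACGT", "ACG-ACGT"]

sites = snp_columns(reference, aligned)
for pos, ref, alts in sites:
    print(pos, ref, ",".join(alts))
```

Each reported column corresponds, loosely, to one SNP locus of the catalog: a position, a reference allele, and the alternate alleles observed in the population.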
Clicking again on the Reference VCFs tab at the top will present the list of catalogs and the status of the jobs creating them. Once the status is Completed, the catalog can be used to call variants.
Uploading the catalog using a precomputed VCF
If the set of known variants is already available as a VCF file, you can skip computing the catalog and directly upload the reference genome and the VCF instead.
To do so, head to the Reference VCFs tab at the top of the MALVIRUS website and click the Upload a new reference VCF button, which redirects you to the upload form.
Fill in the Alias and Description fields as you please. Provide the reference genome as explained in the section above and, finally, upload the VCF file of the variants in the Reference VCF field.
Finally, click the Submit button at the bottom to upload and create the new catalog.
Using a precomputed catalog
MALVIRUS ships with precomputed catalogs for SARS-CoV-2. These catalogs of population SNP loci are based on the data in GenBank and are updated frequently. You should find at least one precomputed catalog in the Reference VCFs tab; to spot them, look for entries with status “Precomputed” (the alias of these catalogs usually starts with “NCBI-”).
To update the set of precomputed catalogs, click the “Download new precomputed VCFs” button in the Reference VCFs tab. The application will download the catalogs in the background and the list of catalogs will update.
Genotype calling
The main goal of MALVIRUS is to genotype an individual directly from a sequencing dataset.
To do so, head to the Variant calls tab at the top of the MALVIRUS homepage and click the Perform a new variant call button.
You will be presented with a submission form. First, set the Alias and Description fields to something meaningful. Alias will be the name of the job you submit. Then, upload the sequencing data in either FASTA or FASTQ format (optionally gzipped) in the Sample sequences field and, in the Reference VCF field, choose the catalog to use while genotyping the data.
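If you are unsure which format a sample file is in, a small Python sketch such as the following can guess it from the first bytes: gzip streams start with the bytes 1f 8b, FASTA records start with '>', and FASTQ records start with '@'. The function name and the example data are hypothetical; the sketch decompresses the whole input in memory, which is fine for an illustration but wasteful for large files.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream

def sniff_format(data: bytes) -> str:
    """Guess whether raw file bytes hold FASTA or FASTQ records (optionally gzipped)."""
    if data[:2] == GZIP_MAGIC:
        data = gzip.decompress(data)
    first = data.lstrip()[:1]
    if first == b">":
        return "FASTA"
    if first == b"@":
        return "FASTQ"
    return "unknown"

# Hypothetical in-memory samples standing in for real files.
plain_fasta = b">read1\nACGTACGT\n"
gzipped_fastq = gzip.compress(b"@read1\nACGT\n+\nIIII\n")

print(sniff_format(plain_fasta), sniff_format(gzipped_fastq))
```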
If no reference VCF is available, MALVIRUS asks you to first create a variant catalog; head to the Creating the variant catalog section of this document and follow the instructions.
After uploading the sample and selecting the catalog, you can submit the job by clicking the Submit button.
Note that you can tune the parameters of the analysis by setting them in the Advanced parameters box; the default parameters of the tool are tuned to work with high-coverage virological data (coverage higher than 100x).
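To judge whether a dataset falls in the high-coverage regime mentioned above, average coverage can be roughly estimated as the total number of sequenced bases divided by the genome length. The sketch below applies this formula to a hypothetical sample (30,000 reads of 150 bp) and to a genome of 29,903 bp, the length of the SARS-CoV-2 reference NC_045512.2; the function name and the read counts are illustrative assumptions.

```python
def estimate_coverage(read_lengths, genome_length):
    """Average coverage = total sequenced bases / genome length."""
    return sum(read_lengths) / genome_length

# Hypothetical sample: 30,000 reads of 150 bp each.
read_lengths = [150] * 30_000
genome_length = 29_903  # length of the SARS-CoV-2 reference NC_045512.2

coverage = estimate_coverage(read_lengths, genome_length)
print(f"estimated coverage: {coverage:.0f}x")
if coverage <= 100:
    print("below 100x: consider reviewing the Advanced parameters")
```

This estimate assumes reads are spread evenly over the genome; real coverage is usually uneven, so treat the number as a rough guide only.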
This job will first extract the k-mers in the sample using KMC and will then use the k-mers to call the variants using MALVA.
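For readers unfamiliar with k-mers, the following Python sketch shows what extracting and counting them means: every substring of length k of every read is tallied. This is a toy illustration only; KMC is a dedicated, far more efficient tool, and the function name and the reads below are hypothetical.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every substring of length k across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

# Two short hypothetical reads.
reads = ["ACGTAC", "CGTACG"]
counts = count_kmers(reads, k=4)
for kmer, n in sorted(counts.items()):
    print(kmer, n)
```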
It is possible to track the status of the job by heading to the Variant calls tab. Look for your job in the table and then look at the Status column. Once the status changes from Running to Completed you can access the output of MALVIRUS.
If the status changes to Failed, something went wrong; the log is linked in the status page of the job.
Retrieving the results
The output of MALVIRUS is a single VCF file that describes the genotype of each known variant. You can access it by heading to the Variant calls tab and searching for your job in the list displayed there. Click on the alias of the job you want to analyze and you’ll be presented with a table reporting various information about the job.
If the status is Completed, you can access and download the output VCF from the Output files row. By clicking the name of the output file, you’ll download the VCF file, whereas by clicking the Show in tabular form button you’ll be redirected to another page that describes the VCF and highlights the differences between the reference genome and the strain under analysis. On this page, by default only the loci where an alternate allele was detected in the strain are shown; to see all the variants, uncheck the Show only loci with alt. allele filter. If the variant catalog was built on a preloaded reference, you can also view the (summarized) effect of each variant in the Effect column. The full effects predicted by SnpEff for that variant can be accessed by clicking on the summarized effect. Moreover, the last column is color coded based on the quality of the call. For convenience, at the top of this page you can find two links to download the output in VCF or as a spreadsheet.
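The downloaded VCF can also be inspected programmatically. The sketch below mimics the default filter of the tabular view by keeping only the records whose genotype contains a non-reference allele. It assumes a single-sample VCF whose FORMAT column includes a GT field; the function name and the example records are hypothetical and do not come from real MALVIRUS output.

```python
def loci_with_alt(vcf_text):
    """Return (chrom, pos, ref, alt, genotype) for records carrying an alternate allele.

    Assumes one sample column and a GT entry in the FORMAT column.
    """
    hits = []
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
        fmt_keys = fields[8].split(":")
        sample_values = fields[9].split(":")
        genotype = sample_values[fmt_keys.index("GT")]
        alleles = genotype.replace("|", "/").split("/")
        # "0" is the reference allele and "." a missing call.
        if any(allele not in ("0", ".") for allele in alleles):
            hits.append((chrom, pos, ref, alt, genotype))
    return hits

# Hypothetical two-record VCF, for illustration only.
example_vcf = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE",
    "ref\t241\t.\tC\tT\t.\t.\t.\tGT\t1",
    "ref\t3037\t.\tC\tT\t.\t.\t.\tGT\t0",
])

alt_loci = loci_with_alt(example_vcf)
for record in alt_loci:
    print(*record)
```

For anything beyond a quick look, a dedicated VCF library is a safer choice than hand-written parsing.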