Variant Calling using UseGalaxy Server

14 Apr

Written By Nirmal Raj Rajendran

Import Data

Open the UseGalaxy platform (https://usegalaxy.org/) and register for a free account. We can then either import our own research data or use secondary data from public genomic archives such as ENA and NCBI-SRA.

For secondary data, we can browse from within the platform itself. In the tool panel on the left-hand side (LHS), there is a “Get Data” option. Click on it.
A list of data sources appears, from which we can fetch sequence reads, either single-end or paired-end.
The reads must be in FastQ format. Here, I am going to select the third option, which fetches the required reads from NCBI-SRA.

A new dialogue box now pops up, where we need to enter the SRR accession number. We can search separately for the required sequencing run and its accession number on NCBI SRA:
https://www.ncbi.nlm.nih.gov/sra
Once you have entered the required information, click on Execute. The FastQ file will appear in the history panel on the right-hand side.
Note that the processing time depends largely on the load on the Galaxy server and on your internet connectivity, since the jobs run on the server rather than on your own system.

If we are selecting paired-end reads, two separate files appear, and we need to perform a quality check on each FastQ file individually.
We can also click on the eye icon to view the sequence read data.
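
To make sense of what the eye icon shows, it helps to know that each read in a FastQ file occupies four lines: a name line starting with "@", the base sequence, a "+" separator, and a quality string of the same length as the sequence. The following is only a minimal Python sketch of reading that layout; the two reads in EXAMPLE_FASTQ are made-up illustrations, not data from any real SRA run.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per four-line FASTQ entry."""
    while True:
        header = handle.readline()
        if header.strip() == "":
            return  # end of file (or trailing blank line)
        header = header.strip()
        sequence = handle.readline().strip()
        separator = handle.readline().strip()
        quality = handle.readline().strip()
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Hypothetical two-read example, for illustration only.
EXAMPLE_FASTQ = "@read1\nACGTACGT\n+\nIIIIHHHH\n@read2\nTTGCA\n+\nFFFF#\n"

records = list(read_fastq(io.StringIO(EXAMPLE_FASTQ)))
for record in records:
    print(record.name, len(record.sequence), "bases")
```

For a downloaded file, the same function can be given an open file handle instead of the in-memory string.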

Quality Check

In the LHS, under the Tools section, type in “FastQC”.
FastQC is a tool used to check the quality of reads; from its report we can determine whether the reads are good enough for further analysis.
Click on the option “FastQC Read quality reports”.

In the first box, select the FastQ file to be evaluated.
If we have an adapter or contaminant list, we can add it in the boxes below.
Leave all other options at their default settings. Go to the end of the page and click on the Execute button.

The FastQC report displays roughly ten analysis modules (the exact number depends on the FastQC version), of which per base sequence quality, per base sequence content, and per sequence GC content are the most important ones.
These three modules need to be marked green (pass) before we proceed further.
Here, the per base sequence content is marked red (fail), so we have to improve the quality of the reads with another tool, Trimmomatic.
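
For context on what the per base sequence quality module summarises, here is a rough Python sketch that decodes quality characters into Phred scores and averages them at each read position. It assumes the common Phred+33 encoding (each character's ASCII code minus 33), and the three quality strings are invented for illustration. FastQC itself draws box plots rather than plain means, so this only approximates the idea.

```python
from typing import Iterable, List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert a FASTQ quality string to Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality]


def mean_quality_per_position(quality_strings: Iterable[str]) -> List[float]:
    """Average the Phred score at each position across all reads."""
    totals: List[int] = []
    counts: List[int] = []
    for quality in quality_strings:
        for index, score in enumerate(phred_scores(quality)):
            if index == len(totals):
                # First read long enough to reach this position.
                totals.append(0)
                counts.append(0)
            totals[index] += score
            counts[index] += 1
    return [total / count for total, count in zip(totals, counts)]


# Hypothetical quality strings of different lengths.
example_qualities = ["IIII", "II#", "5I"]
means = mean_quality_per_position(example_qualities)
for position, mean in enumerate(means, start=1):
    print(f"position {position}: mean Phred {mean:.1f}")
```

A Phred score of 20 corresponds to an estimated 1 in 100 chance that the base call is wrong, and 30 to 1 in 1000, which is why low values at particular positions are a warning sign.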

Click on the per base sequence content module (information about the A, T, G, and C content at each position).
Here, we can see that at the start of the reads the proportions of A, T, G, and C differ widely. If we trim that portion, we may end up with good-quality reads.
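
The idea behind the per base sequence content plot can be sketched in a few lines of Python: for every position in the reads, count how often each base occurs and express it as a percentage. The four reads below are made up so that the first position is heavily biased, similar in spirit to the problem described above. This is an illustration of the concept, not FastQC's own code.

```python
from collections import Counter
from typing import Dict, List


def base_content_per_position(sequences: List[str]) -> List[Dict[str, float]]:
    """Return the percentage of A, C, G and T at each read position."""
    counters: List[Counter] = []
    for sequence in sequences:
        for index, base in enumerate(sequence.upper()):
            if index == len(counters):
                counters.append(Counter())
            counters[index][base] += 1
    content = []
    for counter in counters:
        total = sum(counter.values())  # includes N or other symbols, if any
        content.append({base: 100.0 * counter[base] / total for base in "ACGT"})
    return content


# Hypothetical reads: every read starts with G, so position 1 is biased.
example_reads = ["GACGT", "GTCGA", "GAGCT", "GCTAC"]
content = base_content_per_position(example_reads)
for position, percentages in enumerate(content, start=1):
    print(position, percentages)
```

In an unbiased library the four percentages stay roughly parallel along the read, so a position where one base dominates stands out immediately.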

Trimming Reads

Search for the tool Trimmomatic under the Tools tab, and select the second option from the list.

In the first box, select whether we are using single-end or paired-end reads.
Although we have taken paired-end reads, we are going to trim only a single file here; hence single-end has been selected.
If we select paired-end, we can trim both files simultaneously.

As we scroll down, we can select the trimming operation to be carried out.
In the FastQC report, the problem existed only at the start of the reads; hence we have selected “cut bases off the start of a read”. Other options are also available, depending on one's purpose.
Once selected, click on Execute.
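
What this trimming operation does to a single read can be sketched as follows. Trimmomatic calls this step HEADCROP; the Python function below is only an illustration of the effect, not Trimmomatic's implementation, and the read and the choice of 3 bases are hypothetical. The key point is that the sequence and its quality string must be cut by the same amount so that they stay aligned.

```python
from typing import Tuple


def head_crop(sequence: str, quality: str, n_bases: int) -> Tuple[str, str]:
    """Remove the first n_bases from a read and from its quality string."""
    if n_bases < 0:
        raise ValueError("n_bases must not be negative")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    return sequence[n_bases:], quality[n_bases:]


# Hypothetical read whose first three bases we want to discard.
trimmed_sequence, trimmed_quality = head_crop("GACGTACGT", "#5IIIIIII", 3)
print(trimmed_sequence)
print(trimmed_quality)
```

The number of bases to remove is a judgment call based on where the FastQC plot settles down.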

The trimmed FastQ file will be available in the history panel on the right-hand side.
We can perform the quality check again to make sure the problems have been rectified.
If not, re-sequencing may have to be carried out.
For further analysis, we have to use this trimmed file.
Similarly, the same process must be applied to the other FastQ file, as we have taken paired-end reads.
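
When the two files of a pair are trimmed separately, it is worth confirming that they still list the same reads in the same order, because paired-end mappers generally expect the mates to line up. The following is a minimal sketch of such a check in Python. It assumes that mates share a name apart from an optional "/1" or "/2" suffix and that anything after the first space is a comment; the read names are hypothetical.

```python
import re
from typing import Iterable

_MATE_SUFFIX = re.compile(r"/[12]$")


def pair_key(read_name: str) -> str:
    """Reduce a read name to the part shared by both mates (assumed convention)."""
    first_token = read_name.split()[0]
    return _MATE_SUFFIX.sub("", first_token)


def first_mismatch(forward_names: Iterable[str], reverse_names: Iterable[str]) -> int:
    """Return the index of the first out-of-sync pair, or -1 if all pairs match."""
    forward = list(forward_names)
    reverse = list(reverse_names)
    for index, (fwd, rev) in enumerate(zip(forward, reverse)):
        if pair_key(fwd) != pair_key(rev):
            return index
    if len(forward) != len(reverse):
        return min(len(forward), len(reverse))  # one file has extra reads
    return -1


# Hypothetical read names from a forward and a reverse file.
in_sync = first_mismatch(["run.1/1", "run.2/1"], ["run.1/2", "run.2/2"])
out_of_sync = first_mismatch(["a/1", "b/1", "c/1"], ["a/2", "c/2"])
print(in_sync, out_of_sync)
```

Cutting a fixed number of bases from every read does not remove any reads, so the files should stay in sync; the check matters more when a trimming step discards reads.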

Mapping of Reads

Search for “bwa” in the Tools tab.
Two tools show up, namely Map with BWA and Map with BWA-MEM. The former is used if the reads are shorter than 100 bp, and the latter when the reads are longer than 100 bp.
We will select BWA-MEM.

Select the reference genome against which the reads are to be mapped.
Below that, select whether the reads are paired-end or single-end.
Select the trimmed FastQ files in the subsequent boxes.

Select the analysis mode based on the sequencing platform used.
Click on Execute.
The mapped file will appear in BAM format in the history panel on the right-hand side, from where we can view it.

The BAM file lists, for each of these short reads, the chromosome it maps to, its size, and the corresponding sequence.
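
BAM is the compressed binary form of the SAM text format, in which each alignment is one tab-separated line with eleven mandatory fields. The Python sketch below pulls out the fields mentioned above from one such line. The example line is invented for illustration, and a real BAM file would first have to be converted to SAM text (or read with a dedicated library) before a function like this could be applied.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    read_name: str
    flag: int
    reference: str       # chromosome or contig the read maps to
    position: int        # 1-based leftmost mapping position
    mapping_quality: int
    cigar: str
    sequence: str


def parse_sam_line(line: str) -> Alignment:
    """Parse the mandatory fields of one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return Alignment(
        read_name=fields[0],
        flag=int(fields[1]),
        reference=fields[2],
        position=int(fields[3]),
        mapping_quality=int(fields[4]),
        cigar=fields[5],
        sequence=fields[9],
    )


def is_unmapped(alignment: Alignment) -> bool:
    """Bit 0x4 of the flag marks a read that could not be mapped."""
    return bool(alignment.flag & 0x4)


# Hypothetical alignment line, for illustration only.
EXAMPLE_LINE = "read1\t99\tchr1\t10468\t60\t8M\t=\t10600\t140\tACGTACGT\tIIIIHHHH"
alignment = parse_sam_line(EXAMPLE_LINE)
print(alignment.reference, alignment.position, len(alignment.sequence))
```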

Variant Calling

Search for the tool FreeBayes and select it from the options.
Select the BAM file as the input for variant calling.
Again, select the appropriate reference genome.
Set the parameter selection level to simple diploid calling.
Click on Execute at the bottom.

The output is saved in VCF format.
In it, we can see information such as the SNP variation between the reference genome and the sample, along with the position of each variant on the chromosome.
The same VCF file can then be used for annotation (of functional regions) on another server, namely wANNOVAR.
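
To illustrate how the SNP information is laid out in a VCF file, here is a small Python sketch that reads the tab-separated data lines, skips the header lines beginning with "#", and keeps only single-base substitutions. The two variant records in EXAMPLE_VCF are invented for illustration and are not FreeBayes output.

```python
from typing import List, NamedTuple


class Variant(NamedTuple):
    chromosome: str
    position: int
    reference: str
    alternates: List[str]


def parse_vcf_line(line: str) -> Variant:
    """Parse CHROM, POS, REF and ALT from one VCF data line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a VCF data line needs at least 8 fields")
    chromosome, position, _identifier, reference, alternate = fields[:5]
    return Variant(chromosome, int(position), reference, alternate.split(","))


def is_snp(variant: Variant) -> bool:
    """True when REF and every ALT allele are single nucleotides."""
    return len(variant.reference) == 1 and all(
        len(allele) == 1 and allele.upper() in "ACGT"
        for allele in variant.alternates
    )


# Hypothetical VCF content: one SNP and one deletion.
EXAMPLE_VCF = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t10468\t.\tA\tG\t52.3\t.\tDP=14\n"
    "chr1\t20511\t.\tCT\tC\t31.0\t.\tDP=9\n"
)

variants = [
    parse_vcf_line(line)
    for line in EXAMPLE_VCF.splitlines()
    if line and not line.startswith("#")
]
snps = [variant for variant in variants if is_snp(variant)]
for snp in snps:
    print(snp.chromosome, snp.position, snp.reference, ">", ",".join(snp.alternates))
```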

See you again in another plantbreederspot content!!!!

Nirmal Raj Rajendran
