Learn Galaxy

Because G-OnRamp is based on the Galaxy platform, the first step in learning to use G-OnRamp is to become familiar with the basics of Galaxy. The “Overview of Galaxy” presentation in the Learning Materials section covers the necessary background. The two screencasts from Galaxy linked here provide an introduction to getting data and comparing genomic features. If you want to learn more about Galaxy, visit the Learn Galaxy page on the Galaxy Wiki.

 

Workflows

Big picture

We have developed a comprehensive Galaxy workflow that produces multiple complementary datasets to facilitate the annotation of any eukaryotic genome. The entire workflow is shown below.

Figure 1: G-OnRamp workflow for UCSC

 

Figure 2: G-OnRamp workflow for JBrowse

Sub-workflows

The G-OnRamp workflow is divided into four sub-workflows: homology, repeat regions, RNA-Seq, and gene predictions. These sub-workflows will produce the input datasets for the Hub Archive Creator, which will create the UCSC Genome Browser Assembly Hub.


Homology

The genome assembly (in FASTA format) is the input dataset for the NCBI BLAST+ tool makeblastdb, which creates a nucleotide database for BLAST searches. The NCBI BLAST+ tblastn tool then searches this nucleotide database, translated in all six reading frames, using a collection of protein sequences from an informant species as queries. The blastXmlToPsl and pslToBigPsl tools convert the tblastn search results to the BigPsl format required by the Hub Archive Creator.
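Outside Galaxy, the same chain of tools can be run from the command line. The Python sketch below only assembles the four commands in order; it does not run them. The file names and the database name are hypothetical placeholders, and the options shown (`-dbtype nucl`, and `-outfmt 5` for XML output) are the minimum needed to connect the steps, not necessarily the parameters that G-OnRamp uses.

```python
def homology_commands(genome_fasta, proteins_fasta, db_name="genome_db"):
    """Return the homology sub-workflow as a list of argument lists.

    All file names and db_name are hypothetical placeholders.
    """
    xml_out = "tblastn.xml"
    psl_out = "tblastn.psl"
    bigpsl_in = "tblastn.bigPslInput"
    return [
        # 1. Build a nucleotide BLAST database from the genome assembly.
        ["makeblastdb", "-in", genome_fasta, "-dbtype", "nucl", "-out", db_name],
        # 2. Search the database with protein queries; -outfmt 5 requests XML.
        ["tblastn", "-query", proteins_fasta, "-db", db_name,
         "-outfmt", "5", "-out", xml_out],
        # 3. Convert the BLAST XML report to PSL.
        ["blastXmlToPsl", xml_out, psl_out],
        # 4. Convert PSL to the tab-separated text form of bigPsl.
        ["pslToBigPsl", psl_out, bigpsl_in],
    ]


if __name__ == "__main__":
    for argv in homology_commands("assembly.fa", "informant_proteins.fa"):
        print(" ".join(argv))
```

Once the tools are installed, each list can be passed to `subprocess.run(argv, check=True)`. In the standard UCSC command-line procedure, the text output of pslToBigPsl is then sorted and converted with bedToBigBed to obtain the binary bigPsl file.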

BLAT alignment

RNA GenBank records are the input dataset for the gbToFasta tool, which converts the RNA records to FASTA format. The genome assembly and the RNA records (both in FASTA format) are the input datasets for the UCSC BLAT alignment tool, which aligns a collection of RNA query sequences from an informant species against the genome assembly. The UCSC pslCDnaFilter tool filters the alignments using comparative criteria (keeping the near-best-in-genome alignments for each cDNA) and non-comparative criteria (based only on the quality of an individual alignment). The UCSC pslCheck tool validates the PSL output. The UCSC pslPosTarget tool flips PSL strands so that the target strand is positive and implicit. The pslToBigPsl tool converts the BLAT search results to the BigPsl format required by the Hub Archive Creator.
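To make "the quality of an individual alignment" concrete, the sketch below reads one record in the standard 21-column PSL format and computes two simple per-alignment measures: identity and query coverage. These are common definitions chosen for illustration; they are not necessarily the formulas or thresholds that pslCDnaFilter applies, and the names `PslQuality` and `psl_quality` are hypothetical.

```python
from typing import NamedTuple


class PslQuality(NamedTuple):
    query: str
    target: str
    identity: float  # fraction of aligned bases that match
    coverage: float  # fraction of the query that is aligned


def psl_quality(line: str) -> PslQuality:
    """Compute simple quality measures from one 21-column PSL record."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 21:
        raise ValueError(f"expected 21 PSL columns, got {len(f)}")
    # Columns 1-3: matches, misMatches, repMatches.
    matches, mismatches, rep_matches = int(f[0]), int(f[1]), int(f[2])
    # Columns 10, 11, 14: qName, qSize, tName.
    q_name, q_size, t_name = f[9], int(f[10]), f[13]
    aligned = matches + mismatches + rep_matches
    identity = (matches + rep_matches) / aligned if aligned else 0.0
    coverage = aligned / q_size if q_size else 0.0
    return PslQuality(q_name, t_name, identity, coverage)


if __name__ == "__main__":
    # Made-up record: a 200-base RNA with 100 bases aligned in two blocks.
    record = "\t".join([
        "90", "10", "0", "0", "0", "0", "1", "500", "+",
        "mRNA1", "200", "0", "100",
        "contig1", "10000", "1000", "1600",
        "2", "50,50,", "0,50,", "1000,1550,",
    ])
    print(psl_quality(record))
```

A non-comparative filter keeps or discards each record from values like these alone, whereas a comparative filter also needs the other alignments of the same query in order to decide which ones are near-best.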

Repeat regions

TrfBig partitions the genome assembly into smaller chunks and then runs Tandem Repeats Finder (TRF) on each chunk to identify tandem repeats within each genomic region. Note that the output of TRF is in BED4+12 format.
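The partitioning step can be pictured with a short Python sketch. It splits a sequence of a given length into consecutive chunks, with an optional overlap between neighbouring chunks. The function name, the chunk size, and the overlap are hypothetical; the values TrfBig actually uses, and whether it overlaps chunks at all, are not stated here.

```python
def chunk_ranges(seq_length, chunk_size, overlap=0):
    """Return 0-based, half-open (start, end) ranges that cover a sequence."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    ranges = []
    start = 0
    while start < seq_length:
        end = min(start + chunk_size, seq_length)
        ranges.append((start, end))
        if end == seq_length:
            break
        # Step back by the overlap so neighbouring chunks share some bases.
        start = end - overlap
    return ranges


if __name__ == "__main__":
    print(chunk_ranges(10, 4))             # no overlap
    print(chunk_ranges(10, 4, overlap=1))  # one shared base per boundary
```

Any repeat found inside a chunk is reported in chunk-local coordinates, so the chunk's start position has to be added to map it back onto the full assembly.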


RNA-Seq

RNA-Seq reads are mapped against the genome assembly by HISAT2, and StringTie assembles the mapped RNA-Seq reads into potential transcripts. The “junctions extract” subprogram in Regtools reports the locations of putative introns based on the spliced RNA-Seq reads in the BAM file. The RNA-Seq read coverage track is created by the “Convert BAM to BigWig” tool.
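The way spliced reads reveal putative introns can be shown with a small Python sketch. In the SAM/BAM format, a spliced alignment records each skipped stretch of the reference as an `N` operation in its CIGAR string, so walking along the CIGAR gives the intron coordinates. This shows only the core idea and is not the Regtools implementation; the function name is hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that advance the position on the reference (SAM specification).
CONSUMES_REFERENCE = set("MDN=X")


def introns_from_cigar(ref_start, cigar):
    """Return 0-based, half-open intron intervals implied by one alignment.

    ref_start is the 0-based leftmost reference position of the alignment.
    """
    introns = []
    pos = ref_start
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":  # skipped region: a putative intron in a spliced read
            introns.append((pos, pos + length))
        if op in CONSUMES_REFERENCE:
            pos += length
    return introns


if __name__ == "__main__":
    # A read aligned at position 100 with one 200-base gap between two blocks.
    print(introns_from_cigar(100, "50M200N50M"))
```

A full junction extractor would also count how many reads support each intron and discard weakly supported ones.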


Gene predictions

Gene models from three gene predictors (Augustus, GlimmerHMM, and SNAP) are produced using species-specific parameters when they are available. The gene prediction results are converted into the bigGenePred format by the Hub Archive Creator.


Tools we use

Below is a glossary of the tools that we use in the Homology, RNA-Seq, Repeat Regions, and Gene Predictions sub-workflows:

Homology

NCBI BLAST+ makeblastdb: creates BLAST database from one or more FASTA files and/or BLAST databases.

NCBI BLAST+ tblastn: searches a translated nucleotide database using a protein query. Note that one should use the makeblastdb tool to convert the genome assembly into a BLAST database prior to performing a tblastn search.

blastXmlToPsl: converts BLAST output in XML format to the PSL format.

pslToBigPsl: transforms a file in PSL format to the BigPsl format.

BLAT alignment

gbToFasta: converts RNA records to FASTA format.

UCSC BLAT alignment tool: aligns a collection of RNA query sequences from an informant species against the genome assembly.

UCSC pslCDnaFilter: filters cDNA alignments using comparative criteria (keeping the near-best-in-genome alignments for each cDNA) and non-comparative criteria (based only on the quality of an individual alignment).

UCSC pslCheck: validates the PSL output.

UCSC pslPosTarget: flips PSL strands so that the target strand is positive and implicit.

pslToBigPsl: transforms a file in PSL format to the BigPsl format.


RNA-Seq

HISAT2: a fast and sensitive spliced alignment program for mapping RNA-Seq reads. See the HISAT2 website for more information.

StringTie: a fast and highly efficient assembler of RNA-Seq alignments into potential transcripts. See the StringTie website for more information.

Convert BAM to BigWig: calculates the alignment coverage from a BAM alignment file and converts the result into a BigWig file.

Regtools: extracts splice junctions from an RNA-Seq BAM file. For more information, check the link: https://regtools.readthedocs.io/en/latest/
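The coverage calculation named in the Convert BAM to BigWig entry above can be illustrated with a short Python sketch. It takes the aligned blocks of a set of reads on one sequence and collapses them into runs of constant read depth, which is the kind of interval data a coverage track stores. The function name and the input representation are hypothetical; the Galaxy tool reads a BAM file and writes a binary BigWig file, which this sketch does not attempt.

```python
def coverage_intervals(blocks, seq_length):
    """Collapse aligned blocks into (start, end, depth) runs of equal depth.

    blocks: iterable of 0-based, half-open (start, end) aligned segments,
    each lying within [0, seq_length]. Runs with depth 0 are omitted.
    """
    # Difference array: +1 where a block starts, -1 where it ends.
    diff = [0] * (seq_length + 1)
    for start, end in blocks:
        diff[start] += 1
        diff[end] -= 1
    runs = []
    depth = 0
    run_start = 0
    for pos in range(seq_length + 1):
        new_depth = depth + diff[pos]
        if new_depth != depth:
            if depth > 0:
                runs.append((run_start, pos, depth))
            run_start = pos
            depth = new_depth
    return runs


if __name__ == "__main__":
    # Two overlapping reads on a 10-base sequence.
    print(coverage_intervals([(0, 5), (3, 8)], 10))
```

For spliced RNA-Seq reads, each aligned block on either side of an intron would be passed as a separate segment, so that the intron itself does not count as covered.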


Repeat regions

TrfBig: partitions a genome assembly into smaller chunks and then uses Tandem Repeats Finder (TRF) to identify tandem repeats within each chunk.


Gene predictors

Augustus: a gene prediction program for eukaryotes written by Mario Stanke and Oliver Keller. For more information, check the link: http://bioinf.uni-greifswald.de/augustus/

Multi_Fasta_GlimmerHmm: a gene finder based on a Generalized Hidden Markov Model (GHMM). For more information, check the link: https://ccb.jhu.edu/software/glimmerhmm/

SNAP: a general-purpose gene-finding program suitable for both eukaryotic and prokaryotic genomes. SNAP is an acronym for Semi-HMM-based Nucleic Acid Parser. For more information, check the link: http://korflab.ucdavis.edu/software.html


Create UCSC Genome Browser assembly hubs

Hub Archive Creator: this Galaxy tool converts a genome assembly and the results produced by different bioinformatics tools into an Assembly Hub so that the assembly and its evidence tracks can be visualized on the UCSC Genome Browser. For more information, check the links below:

https://github.com/goeckslab/hub-archive-creator

http://genome.ucsc.edu/goldenPath/help/hubQuickStartAssembly.html

Create JBrowse assembly hubs

JBrowse Archive Creator: this Galaxy tool converts a genome assembly and the results produced by different bioinformatics tools into an Assembly Hub so that the assembly and its evidence tracks can be visualized in JBrowse. For more information, check the links below:

https://github.com/Yating-L/jbrowse-archive-creator

http://jbrowse.org


Documentation

For detailed G-OnRamp tutorials, see G-OnRamp Documentation.