Genomic Mapping: RNA-Seq

Samples that pass QC are finally ready to be mapped and analyzed. The researchers will be consulted on what genome version they would like to map to, although the default would be to use the latest version available. On special request, we can try to use an older genome or even a custom one.

Align reads

Reads are mapped using TopHat2, which is a fast splice junction aware mapper for RNA-Seq. It aligns RNA-Seq reads to a genome using the ultra high-throughput short read aligner Bowtie2, and then analyzes the mapping results to identify splice junctions between exons. Tophat2 takes a GTF file to guide its transcriptome assembly. In fact, illumina and UMD have produced bundles for various organisms which among others, contain the genomes and also the required transcriptome files. To learn more about the software, please view this publication.

Mark duplicates

Duplicates are marked but not removed from the mapped data using PICARD and reads with quality less than 10 are filtered out. This leaves us with an alignment file with analysis ready data.

Raw counts

The marked, filtered, and sorted alignment files are now ready to be used to generate transcript counts, also called read summarization. These counts are generated using featureCounts, which is highly efficient and fast. We generate counts for the illumina generated transcriptome GTF file which is not classified into separate files by default and a well classified GENCODE annotated GTF file. The difference between the two is that the illumina-generated GTF will have one output file with all types of genes (coding, non-coding, known, novel, etc.), while the GENCODE GTF-generated output will be a number of files classified according to know/novel statuses, protein coding/non-coding for each of genes and transcripts. The following table contains counts and classes for the two GTFs. The igenomes GTF annotation is explained in this NCBI link while the GENCODE GTF format is explained here (mouse, human).

GENCODE classification schema (Note: not all ID's pertain to both Human and Mouse. The following is based on Human annotations)

Gene Long non-coding RNA "processed_transcript", "lincRNA" , "3prime_overlapping_ncrna" , "antisense" , "non_coding" , "sense_intronic" , "sense_overlapping" 799 13071
Small non-coding RNA "snRNA" , "snoRNA" , "rRNA" , "Mt_tRNA" , "Mt_rRNA" , "misc_RNA" , "miRNA" 5398 3615
Protein-coding "protein_coding" 19456 889
Immunoglobulin/T-cell receptor "IG_C_gene", "gene_type", IG_D_gene", "gene_type", "IG_J_gene", "gene_type", "IG_V_gene", " gene_type", "TR_C_gene", " gene_type", "TR_D_gene", "gene_type", "TR_J_gene", " gene_type", "TR_V_gene" 386 -
Transcript Long non-coding RNA "processed_transcript", "lincRNA" , "3prime_overlapping_ncrna" , "antisense" , "non_coding" , "sense_intronic" , "sense_overlapping" 46887 3842
Protein-coding "protein_coding" 53653 28161
Nonsense mediated decay transcripts "nonsense_mediated_decay" 13052 -

igenomes classification schema

Gene - 25259
Transcript RNA and protein products that are generated by the eukaryotic annotation pipeline. These records use accession prefixes XM_, XR_, and XP_ 345 XM, 1129 XR
RNA and protein products that are mainly derived from GenBank cDNA and EST data and are supported by the RefSeq eukaryotic curation group. These records use accession prefixes NM_, NR_, and NP_ 33556 NM, 7048 NR

Wiggle Generation

The wig/bigWig format is a very efficient way to visualize dense continuous data such as transcript abundance. These files are compatible with IGV and the UCSC genome browser. We provide these files for each of the forward and reverse strands of the RNA-Seq data for stranded experiments, or one forward track for unstranded experiments. We use HOMER for the purposes of generating bigWig files, one file for unstranded RNA-Seq data, or two files for stranded RNA-Seq data (positive and negative strands).