Data are generated by the sequencing core using paired-end (PE) 100x100 sequencing, i.e. 100 bp read from each end of a fragment, which is important for achieving good mapping statistics. For each Whole Exome sample sequenced, we also aim to provide at least 50x mean coverage, with 95–98% of targeted bases covered at 20x or more. This is the depth of coverage needed for a statistically sound analysis if substantive variations are found.
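As a rough illustration of how these two coverage targets can be checked, the sketch below computes the mean depth and the percentage of bases at or above 20x from a list of per-base depths. The function names and the input format (one integer depth per targeted base) are assumptions made for this example and are not taken from the core's own pipeline.

```python
def coverage_summary(depths, min_depth=20):
    """Return (mean depth, percent of bases covered at >= min_depth).

    `depths` is assumed to hold one read depth per targeted base.
    """
    if not depths:
        raise ValueError("depths must contain at least one position")
    n = len(depths)
    mean_depth = sum(depths) / n
    pct_covered = 100.0 * sum(1 for d in depths if d >= min_depth) / n
    return mean_depth, pct_covered


def meets_coverage_targets(depths, mean_target=50.0, pct_target=95.0, min_depth=20):
    """Check the targets stated above: >= 50x mean and >= 95% of bases at >= 20x."""
    mean_depth, pct_covered = coverage_summary(depths, min_depth)
    return mean_depth >= mean_target and pct_covered >= pct_target
```

For example, a toy region with 95 bases at 60x and 5 bases at 10x has a mean depth of 57.5x with 95% of bases at 20x or more, so it would pass both checks.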
The choice of kit pertains only to targeted sequencing, i.e. Exome Capture. Numerous kits are available, each with its own protocol and capture size. The kit that we use predominantly is the SureSelect All Exon V4, which has a capture region of 51 Mb defined by the CCDS, RefSeq, and GENCODE databases.
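The capture size, read length, and coverage target together determine roughly how much sequencing a sample needs. The sketch below works through that arithmetic. The `on_target_fraction` parameter is a hypothetical knob added for illustration: real runs lose reads to off-target capture and duplicates, and the source gives no figure for that, so the default of 1.0 is an idealized lower bound.

```python
import math


def read_pairs_needed(target_bp, mean_coverage, read_length, on_target_fraction=1.0):
    """Estimate the read pairs needed to reach a mean coverage over a capture region.

    Assumes every on-target base of both reads in a pair contributes to coverage.
    `on_target_fraction` is an illustrative assumption, not a measured value.
    """
    if not 0 < on_target_fraction <= 1:
        raise ValueError("on_target_fraction must be in (0, 1]")
    bases_needed = target_bp * mean_coverage / on_target_fraction
    return math.ceil(bases_needed / (2 * read_length))


# 51 Mb capture region, 50x mean coverage, PE 100x100 reads.
ideal_pairs = read_pairs_needed(51_000_000, 50, 100)
```

Under these idealized assumptions, `ideal_pairs` is 12,750,000 read pairs; halving the on-target fraction doubles the requirement.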
Multiple samples can be pooled into a lane using sample-specific barcode/index sequences. Illumina sequencing machines have a dedicated read cycle that reads these barcode sequences, which are usually 6 bp long. The data can then be accurately separated into their respective samples via demultiplexing. We use CASAVA to demultiplex a typical Illumina sequencing run. During demultiplexing, the adapters can be masked or left intact, as the researcher chooses. Two FASTQ files are produced for each sample, one for each read.
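To make the idea of demultiplexing concrete, here is a minimal sketch of assigning reads to samples by their 6 bp index sequence. It is not CASAVA's implementation. The one-mismatch tolerance, the `Undetermined` bin name, the sample names, and the barcode-to-sample mapping are all assumptions chosen for illustration.

```python
def hamming(a, b):
    """Count mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def assign_sample(index_read, barcodes, max_mismatches=1):
    """Return the sample whose barcode uniquely matches, or None otherwise."""
    hits = [
        sample
        for barcode, sample in barcodes.items()
        if len(barcode) == len(index_read)
        and hamming(barcode, index_read) <= max_mismatches
    ]
    # Ambiguous or missing matches are left unassigned.
    return hits[0] if len(hits) == 1 else None


def demultiplex(reads, barcodes, max_mismatches=1):
    """Group read IDs by sample; `reads` yields (read_id, index_read) pairs."""
    bins = {sample: [] for sample in barcodes.values()}
    bins["Undetermined"] = []
    for read_id, index_read in reads:
        sample = assign_sample(index_read, barcodes, max_mismatches)
        bins[sample if sample is not None else "Undetermined"].append(read_id)
    return bins


# Illustrative 6 bp barcodes and hypothetical sample names.
barcodes = {"ATCACG": "sample_A", "CGATGT": "sample_B"}
reads = [("r1", "ATCACG"), ("r2", "ATCACT"), ("r3", "CGATGT"), ("r4", "GGGGGG")]
bins = demultiplex(reads, barcodes)
```

Here `r2` differs from the first barcode at one position and is still assigned to `sample_A`, while `r4` matches neither barcode closely enough and falls into the `Undetermined` bin.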
Sequence Quality Assessment
At the Eugene McDermott Center for Human Growth and Development, we pay close attention to the quality of the data generated by the sequencers. The first check is to make sure we have enough sequence data for the analysis. Samples that do not meet the required number of reads are resequenced. Other parameters include % Passing Filter, Mean Quality Score, and % of >=Q30 Bases, to name a few. A detailed summary of the demultiplexing stats used for initial quality assessment is available online.
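Two of the parameters named above, Mean Quality Score and % of >=Q30 Bases, can be derived directly from the quality strings in a FASTQ file. The sketch below shows one way to compute them. It assumes Phred+33 encoding (quality = ASCII code minus 33); older Illumina pipelines used a different offset, so the `offset` parameter should be checked against the actual data. The function name is hypothetical.

```python
def quality_stats(quality_strings, offset=33, threshold=30):
    """Return (mean quality score, percent of bases with quality >= threshold).

    `quality_strings` is an iterable of FASTQ quality lines; Phred+33
    encoding is assumed unless a different `offset` is given.
    """
    scores = [ord(ch) - offset for qual in quality_strings for ch in qual]
    if not scores:
        raise ValueError("no quality values provided")
    mean_quality = sum(scores) / len(scores)
    pct_at_threshold = 100.0 * sum(1 for s in scores if s >= threshold) / len(scores)
    return mean_quality, pct_at_threshold


# Toy input: "I" encodes Q40 and "5" encodes Q20 under Phred+33.
mean_q, pct_q30 = quality_stats(["IIII", "5555"])
```

For the toy input, half of the bases are at Q40 and half at Q20, giving a mean quality of 30 and 50% of bases at Q30 or above.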