Next-generation sequencing (NGS) is an emerging technology for determining DNA and RNA sequences for the whole genome or specific regions of interest at a much lower cost than traditional Sanger sequencing. Combined with other techniques such as RNA extraction (RNA-Seq), enrichment for the exome (Exome-Seq) or other genomic regions of interest, chromatin immunoprecipitation (ChIP-Seq), and bisulfite conversion (BS-Seq), NGS can provide rich information about genetic variants, transcriptome dynamics, transcription factor binding profiles, and epigenetic modifications. The applications of NGS are expanding rapidly, which calls for efficient and creative methods of data storage, analysis, and visualization.

We are actively involved in data analysis for a broad range of NGS applications and have mature analysis pipelines for RNA-Seq data, detection of rare variants, and ChIP-Seq data. We routinely use in-house programs, alongside multiple commercial and open-source tools, for each step of NGS data analysis, from base calling and sequence alignment to downstream statistical analysis, to suit various experimental designs. We are also devoted to developing novel and useful statistical tools for NGS data analysis. We carefully examine possible sources of abnormalities in data processing and look for ways to overcome the biases inherent in NGS data analysis.

This course is aimed at experimental or bench-based researchers working in the molecular life sciences who have little or no previous experience in NGS analysis. Undergraduate-level knowledge of a subject related to the life sciences would be an advantage.
Course Objectives
- To gain a general introduction to working in a Linux environment and running command-line tools
- To understand NGS technology, algorithms, and data formats
- To use bioinformatics tools for handling sequencing data and visualization
- To perform downstream analyses for studying gene expression and genetic variations
Course Outcomes
After completing the certificate course, students will be able to:
- Build a strong foundation in scripting languages and command-line usage for NGS data analysis
- Compare and apply appropriate short-read aligners for genome mapping
- Perform variant calling analysis and variant annotation
- Design and evaluate algorithms for genomic data visualization
Contents
Module: 1 Understanding Scripting Languages (6 hours)
- Basics of scripting languages
- Unix & High-Performance Computing
- Basic Linux command-line usage and its advantages
- Basics of R and Bioconductor Programming
- Data Manipulation in R
- Usage of important bioinformatics toolkits
- Programming for data visualization and interpretation
- Case studies: Example data
Module: 2 Advancement of genome sequencing (6 hours)
- Introduction to sequencing technologies from a data analyst's view
- Comparison of sequencing platforms
- Sequencing library preparation
- Sequence file formats
- Sequencing data analysis pipeline development
- Evaluation of sequencing platforms and report generation
- Case studies: Example Data
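
As an illustration of the sequence file formats covered in this module, the following is a minimal Python sketch of reading FASTQ records and decoding their quality strings. It assumes the common four-lines-per-record layout and the Phred+33 quality encoding; the function names (`read_fastq`, `phred_scores`, `mean_quality`) and the two example reads are hypothetical and are not part of the course material.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, quality) tuples from a FASTQ text stream.

    Assumes the simple layout of exactly four lines per record.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield header[1:], seq, qual


def phred_scores(qual, offset=33):
    """Convert a quality string to integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in qual]


def mean_quality(qual, offset=33):
    """Average Phred score of one read; 0.0 for an empty quality string."""
    scores = phred_scores(qual, offset)
    return sum(scores) / len(scores) if scores else 0.0


# Two made-up reads, used only to demonstrate the functions above.
example = "@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!5I\n"
records = list(read_fastq(io.StringIO(example)))

for read_id, seq, qual in records:
    print(read_id, seq, phred_scores(qual), mean_quality(qual))
```

In practice a dedicated parser (for example, the one in Biopython) would usually be preferred; the sketch is only meant to show what the four lines of a FASTQ record contain.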
Module: 3 RNA-Seq/ Transcriptome-Seq Data Analysis (12 hours)
- Introduction to RNA/Transcriptome Sequencing
- Sequence data resources and Raw sequence files (FASTQ format)
- Preprocessing of raw reads: quality control (FastQC), adapter clipping, quality trimming
- Introduction to read mapping (Alignment methods, Mapping heuristics)
- Understand split-read mapping (TopHat, STAR)
- Mapping output (SAM/BAM format)
- Mapping statistics, Visualization of mapped reads (IGV, UCSC)
- Understand the Tuxedo Suite (Cufflinks, Cuffcompare, Cuffmerge, Cuffdiff, etc.)
- Understand the statistics behind DESeq2 and DIEGO
- Quantify exons/genes/transcripts
- Predict differential splicing using DIEGO
- Predict differential gene expression using DESeq2
- Predict differential isoform expression using Cuffdiff
- Create extensive diagnostic graphics with R
- Apply your new skills by working on challenging exercises
- Case Studies: Real Data
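
To make the mapping output and mapping statistics topics above more concrete, here is a small Python sketch that decodes the FLAG column of SAM records and computes a simple mapping rate. The bit values follow the SAM specification; the helper names (`decode_flag`, `mapping_summary`) and the example alignment lines are invented for illustration. Counting only primary alignments is one possible convention, assumed here, and tools such as SAMtools report more detailed statistics.

```python
# FLAG bits as defined in the SAM specification.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def mapping_summary(sam_lines):
    """Count primary records and how many of them are mapped.

    Secondary and supplementary alignments are skipped so that each
    read (or mate) is counted once; this is an assumption of the sketch.
    """
    total = mapped = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        flag = int(line.rstrip("\n").split("\t")[1])
        if flag & 0x100 or flag & 0x800:
            continue
        total += 1
        if not flag & 0x4:
            mapped += 1
    rate = mapped / total if total else 0.0
    return {"total": total, "mapped": mapped, "rate": rate}


# Made-up SAM lines for demonstration only.
example_sam = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t99\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r3\t16\tchr1\t300\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t256\tchr2\t50\t0\t4M\t*\t0\t0\tACGT\tIIII",
]

print(decode_flag(99))
print(mapping_summary(example_sam))
```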
Module: 4 Whole Exome Sequencing Data Analysis (12 hours)
- Understanding Exome Sequencing and Data Generation
- Sequence Alignment
- Alignment of reads to a reference genome (BWA/Bowtie)
- Understanding Mapping Output (SAM/BAM, SAMtools & Bedtools)
- Variant detection using GATK & SAMtools
- Visualization of variation with IGV
- Complete annotation and variant effect prediction (SnpEff, dbSNP, etc.)
- Predict the effects of coding non-synonymous variants on protein function using the SIFT algorithm
- Case studies: Real data
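
As a rough illustration of what can be done with variant-calling output, the sketch below reads the REF and ALT columns of VCF records, classifies single-nucleotide variants as transitions or transversions, and reports the transition/transversion (Ti/Tv) ratio, a commonly used quality indicator for a call set. The function names (`classify_variant`, `titv_summary`) and the example records are hypothetical; only the standard tab-separated VCF column order is assumed.

```python
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def classify_variant(ref, alt):
    """Label one REF/ALT pair as 'transition', 'transversion', or 'other'."""
    ref, alt = ref.upper(), alt.upper()
    is_snv = (
        len(ref) == 1 and len(alt) == 1
        and ref in "ACGT" and alt in "ACGT"
        and ref != alt
    )
    if not is_snv:
        return "other"  # indels, symbolic alleles, etc.
    return "transition" if (ref, alt) in TRANSITIONS else "transversion"


def titv_summary(vcf_lines):
    """Count variant classes over VCF data lines and compute Ti/Tv."""
    counts = {"transition": 0, "transversion": 0, "other": 0}
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # meta-information or column header
        fields = line.rstrip("\n").split("\t")
        ref, alts = fields[3], fields[4]
        for alt in alts.split(","):  # multi-allelic sites list several ALTs
            counts[classify_variant(ref, alt)] += 1
    tv = counts["transversion"]
    counts["ti_tv"] = counts["transition"] / tv if tv else None
    return counts


# Made-up VCF records for demonstration only.
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "chr1\t200\t.\tC\tT\t50\tPASS\t.",
    "chr1\t300\t.\tA\tC\t50\tPASS\t.",
    "chr1\t400\t.\tAT\tA\t50\tPASS\t.",
    "chr2\t500\t.\tG\tA,T\t50\tPASS\t.",
]

print(titv_summary(example_vcf))
```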
Textbooks & Supporting Literature
- Shanrong Zhao, Kirk Watrous, Chi Zhang and Baohong Zhang (2016), Cloud Computing for Next-Generation Sequencing Data Analysis.
- Robert Gentleman, “R Programming for Bioinformatics”, CRC Press, Taylor and Francis Group, USA, 2008.
- Tiago Antao (2015), Bioinformatics with Python Cookbook, Packt Publishing
- Ju Han Kim (2019), Genome Data Analysis, Springer Singapore
- Ali Masoudi-Nejad, Zahra Narimani, Nazanin Hosseinkhan; “Next Generation Sequencing and Sequence Assembly: Methodologies and Algorithms”, Springer; 2013.
- Stuart M. Brown, “Next-Generation DNA Sequencing Informatics”, Cold Spring Harbor Laboratory Press, 2013.
- Y. M. Kwon and S. C. Ricke; High Throughput Next Generation Sequencing: Methods and Applications; Humana Press; 2011.
- S. Knudsen; Guide to Analysis of DNA Microarray data; Wiley, 2004, 2nd edition.
- B. R. Korf and M. B. Irons; Human Genetics and Genomics; Wiley, 2013, 4th edition.
Online Material
https://usegalaxy.org/
Industry Collaboration
Scientific Bio-Minds