Abstract: Genome-wide analyses reveal that more than 90% of multi-exonic human genes produce at least two transcripts through alternative splicing (AS). Various bioinformatics methods are available to analyze AS from RNAseq data. Most methods start by mapping the reads to an annotated reference genome, but some start with a de novo assembly of the reads. In this paper, we present a systematic comparison of a mapping-first approach (FaRLine) and an assembly-first approach (KisSplice). We applied these methods to two independent RNAseq datasets and found that the predictions of the two pipelines overlapped (70% of exon skipping events were common), but with noticeable differences. The assembly-first approach found more novel variants, including novel unannotated exons and splice sites. It also predicted AS in recently duplicated genes. The mapping-first approach found more lowly expressed splicing variants, and splice variants overlapping repeats. This work demonstrates that annotating AS with a single approach misses a large number of candidates, many of which are differentially regulated across conditions and can be validated experimentally. We therefore advocate for the combined use of both mapping-first and assembly-first approaches for the annotation and differential analysis of AS from RNAseq datasets.

Running and combining the output of KisSplice and FaRLine

KisSplice on SKNSH data

The data

We downloaded a total of 959M reads from http://genome.crg.es/encode_RNA_dashboard/hg19/. They correspond to long polyA+ RNAs generated by the Gingeras lab, and are also accessible with the following accession numbers (ENCSR000CPN - SRA: SRR315315, SRR315316 and ENCSR000CTT - SRA: SRR534309, SRR534310). For cell lines treated with retinoic acid, the reads were 76nt long, while they were 100nt long for the untreated cells. Hence we trimmed all reads to 76nt.
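The page does not say which tool was used for the trimming. As a minimal sketch of what trimming to 76nt amounts to, the function below (trim_fastq is a hypothetical name, and the file names in the usage comment are placeholders) cuts the sequence and quality lines of a FASTQ stream to a fixed length, assuming standard 4-line FASTQ records.

```shell
#!/usr/bin/env bash
# Trim every read of a FASTQ stream to at most LEN bases (default 76).
# Assumption: 4-line FASTQ records (header, sequence, '+', qualities).
trim_fastq() {
    local len="${1:-76}"
    # Lines 2 and 4 of each record carry the sequence and the qualities;
    # both are cut to the same length so the record stays consistent.
    awk -v len="$len" '
        NR % 4 == 2 || NR % 4 == 0 { print substr($0, 1, len); next }
        { print }
    '
}

# Hypothetical usage on a gzipped file:
# gzip -dc reads.fastq.gz | trim_fastq 76 | gzip > reads_trimmed.fastq.gz
```

Reads already shorter than the requested length pass through unchanged, so the same call can be applied to both the 100nt and the 76nt libraries.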

Running KisSplice

The KisSplice version used in the paper was Version 2.4.0-p1, which can be downloaded here: KisSplice v2.4.0-p1
To replicate the results obtained in the paper, run KisSplice as:

bin/kissplice -r wgEncodeCshlLongRnaSeqSknshCellPapFastqRd1Rep3_trimmed.fastq.gz -r wgEncodeCshlLongRnaSeqSknshCellPapFastqRd2Rep3_trimmed.fastq.gz -r wgEncodeCshlLongRnaSeqSknshCellPapFastqRd1Rep4_trimmed.fastq.gz -r wgEncodeCshlLongRnaSeqSknshCellPapFastqRd2Rep4_trimmed.fastq.gz -r wgEncodeCshlLongRnaSeqSknshraCellPapFastqRd1Rep1.fastq.gz -r wgEncodeCshlLongRnaSeqSknshraCellPapFastqRd2Rep1.fastq.gz -r wgEncodeCshlLongRnaSeqSknshraCellPapFastqRd1Rep2.fastq.gz -r wgEncodeCshlLongRnaSeqSknshraCellPapFastqRd2Rep2.fastq.gz --mismatches 2 --counts 2 --min_overlap 5 -C 0.02 --experimental

The file containing the bubbles corresponding to alternative splicing events can be found in results/*_coherents_type_1.fa, which from this point on will be referenced as <KisSpliceASFile>. Alternatively, you can download this file here: results_..._coherents_type_1.fa.
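As a quick sanity check on <KisSpliceASFile>, the number of candidate events can be estimated from the number of FASTA records. This is only a sketch: count_bubbles is a hypothetical helper, and it assumes that each bubble is written as exactly two consecutive records (one per path), which should be verified against your own output.

```shell
#!/usr/bin/env bash
# Estimate the number of bubbles in a KisSplice type_1 FASTA file.
# Assumption: each bubble is stored as two FASTA records (its two paths).
count_bubbles() {
    local fasta="$1" n
    [ -r "$fasta" ] || { echo "cannot read: $fasta" >&2; return 1; }
    # grep -c exits with status 1 when there is no match, hence '|| true'
    n=$(grep -c '^>' "$fasta" || true)
    echo $(( n / 2 ))
}

# Hypothetical usage:
# count_bubbles results/my_run_coherents_type_1.fa
```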

Aligning to the reference genome

This step takes around 30 GB of RAM.

STAR was used to align the alternative splicing events found by KisSplice back to the reference genome.
The reference genome used was hg19/GRCh37, which can be found here: Homo_sapiens.GRCh37.75.dna.primary_assembly.fa
The GTF file used was Ensembl75, which can be found here: Homo_sapiens.GRCh37.75.gtf

To build the STAR index of the genome in a given directory, do:

STAR --runMode genomeGenerate --genomeDir <directory_for_STAR_index> --genomeFastaFiles Homo_sapiens.GRCh37.75.dna.primary_assembly.fa --sjdbGTFfile Homo_sapiens.GRCh37.75.gtf

Alternatively, you can get the index already built here: STAR index

To align the alternative splicing events to the reference genome, do:

STARlong --genomeDir <directory_for_STAR_index> --readFilesIn <KisSpliceASFile> --outSAMunmapped Within --outFileNamePrefix SKNSHC0.02_type1_mapped2GRCh37Ensembl75

The output will be a SAM file SKNSHC0.02_type1_mapped2GRCh37Ensembl75Aligned.out.sam. Alternatively, you can get the alignment already done here: SKNSHC0.02_type1_mapped2GRCh37Ensembl75Aligned.out.sam
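Because --outSAMunmapped Within keeps unaligned sequences in the SAM file, the share of events that STAR could not place on the genome can be read directly from the FLAG column. The sketch below (sam_mapping_summary is a hypothetical name) counts SAM records with and without the 0x4 "unmapped" bit. It counts records, not events, so a sequence with several reported alignments is counted more than once.

```shell
#!/usr/bin/env bash
# Count mapped and unmapped records in a SAM file using the FLAG field.
sam_mapping_summary() {
    awk -F '\t' '
        /^@/ { next }                              # skip SAM header lines
        int($2 / 4) % 2 == 1 { unmapped++; next }  # FLAG bit 0x4: unmapped
        { mapped++ }
        END { printf "mapped=%d unmapped=%d\n", mapped, unmapped }
    ' "$1"
}

# Usage with the file produced above:
# sam_mapping_summary SKNSHC0.02_type1_mapped2GRCh37Ensembl75Aligned.out.sam
```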

Running KisSplice2refgenome

To run KisSplice2refgenome, do:

kissplice2refgenome-1.0.0/kissplice2refgenome -a Homo_sapiens.GRCh37.75.gtf --counts 2 SKNSHC0.02_type1_mapped2GRCh37Ensembl75Aligned.out.sam

The output will be a file events.txt. Alternatively, you can get KisSplice2refgenome output here: events.txt

Running kissDE

To run kissDE, do:

#!/usr/bin/Rscript
library(kissDE)

k2rg_file <- 'events.txt'
myCounts <- kissplice2counts(k2rg_file, counts = 2, pairedEnd = TRUE, exonicReads = FALSE, k2rg = TRUE, keep=c("ES"), remove=c("MULTI"))
myConditions <- c("Sknsh", "Sknsh", "SknshRA", "SknshRA")
diffSplicing <- diffExpressedVariants(myCounts, myConditions, pvalue=1, output="kissplice_v2.4.0p1_k2rg_kissDE_sknsh.tsv")

The output will be a file kissplice_v2.4.0p1_k2rg_kissDE_sknsh.tsv. Alternatively, you can get kissDE output here: kissplice_v2.4.0p1_k2rg_kissDE_sknsh.tsv.

KisSplice on MCF7 data

The data

This data comes from a breast cancer cell line, MCF7. Two conditions were sequenced in duplicate. In the first condition, the cell line was transfected with a siRNA targeting two RNA helicases (DDX5 and DDX17), inducing their depletion. The second condition was a control transfection with a siRNA targeting the firefly luciferase gene (GL2). The sequencing was done with the paired-end Illumina technology.
The data can be downloaded here: MCF7 dataset.

Running KisSplice

The KisSplice version used in the paper was Version 2.4.0-p1, which can be downloaded here: KisSplice v2.4.0-p1
To replicate the results obtained in the paper, run KisSplice as:

bin/kissplice  -r siCTL_N1_GGCTAC_R1_trim_right_cutadapt_match_trim_to_100_match.fastq.gz -r siCTL_N1_GGCTAC_R2_trim_right_cutadapt_match_trim_to_100_match.fastq.gz -r siCTL_N2_CTTGTA_R1_trim_right_cutadapt_match_trim_to_100_match.fastq.gz -r siCTL_N2_CTTGTA_R2_trim_right_cutadapt_match_trim_to_100_match.fastq.gz -r siDDX5_17_N1_AGTCAA_R1_trim_right_cutadapt_match_trim_to_100_match.fastq.gz -r siDDX5_17_N1_AGTCAA_R2_trim_right_cutadapt_match_trim_to_100_match.fastq.gz -r siDDX5_17_N2_AGTTCC_R1_trim_right_cutadapt_match_trim_to_100_match.fastq.gz -r siDDX5_17_N2_AGTTCC_R2_trim_right_cutadapt_match_trim_to_100_match.fastq.gz --mismatches 2 --counts 2 --min_overlap 5 -C 0.02 --experimental

As for the SKNSH dataset, the file containing the bubbles corresponding to alternative splicing events can be found in results/*_coherents_type_1.fa, which from this point on will be referenced as <KisSpliceASFile>.

Aligning to the reference genome

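As for the SKNSH dataset, STAR was used to align the alternative splicing events found by KisSplice back to the hg19/GRCh37 reference genome, with the Ensembl75 GTF file. To build the STAR index of the genome in a given directory, do:

STAR --runMode genomeGenerate --genomeDir <directory_for_STAR_index> --genomeFastaFiles Homo_sapiens.GRCh37.75.dna.primary_assembly.fa --sjdbGTFfile Homo_sapiens.GRCh37.75.gtf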

Alternatively, you can get the index already built here: STAR index

To align the alternative splicing events to the reference genome, do:

STARlong --genomeDir <directory_for_STAR_index> --readFilesIn <KisSpliceASFile> --outSAMunmapped Within --outFileNamePrefix MCF7C0.02_type1_mapped2GRCh37Ensembl75

The output will be a SAM file MCF7C0.02_type1_mapped2GRCh37Ensembl75Aligned.out.sam. Alternatively, you can get the alignment already done here: MCF7C0.02_type1_mapped2GRCh37Ensembl75Aligned.out.sam

Running KisSplice2refgenome

To run KisSplice2refgenome, do:

kissplice2refgenome-1.0.0/kissplice2refgenome -a Homo_sapiens.GRCh37.75.gtf --counts 2 MCF7C0.02_type1_mapped2GRCh37Ensembl75Aligned.out.sam

The output will be a file events.txt. Alternatively, you can get KisSplice2refgenome output here: events.txt

Running kissDE

To run kissDE, do:

#!/usr/bin/Rscript
library(kissDE)

k2rg_file <- 'events.txt'
myCounts <- kissplice2counts(k2rg_file, counts = 2, pairedEnd = TRUE, exonicReads = FALSE, k2rg = TRUE, keep=c("ES"), remove=c("MULTI"))
myConditions <- c("siCTL", "siCTL", "siDDX5_17", "siDDX5_17")
diffSplicing <- diffExpressedVariants(myCounts, myConditions, pvalue=1, output="kissplice_v2.4.0p1_k2rg_kissDE_mcf7.tsv")

The output will be a file kissplice_v2.4.0p1_k2rg_kissDE_mcf7.tsv. Alternatively, you can get kissDE output here: kissplice_v2.4.0p1_k2rg_kissDE_mcf7.tsv.

FaRLine on SKNSH data

The data

Running FaRLine

Downloading

The FaRLine version used in the paper can be downloaded here: FaRLine
We recommend running FaRLine on Ubuntu 14.04 LTS or later.

Installing

Extract Farline.tgz in the directory of your choice:

tar xvfz Farline.tgz

This will create a FARLINE directory containing several files and subdirectories:

[yourpath]/FARLINE/scripts                               => perl and bash scripts, including the main Farline_pipeline.sh script
[yourpath]/FARLINE/Lib                                   => perl (.pm) and R (.R) libraries 
[yourpath]/FARLINE/DB                                    => tables needed by Farline
[yourpath]/FARLINE/Ref                                   => genomic references
[yourpath]/FARLINE/Farline_SKNSH.conf                    => config file used to launch the SKNSH job, needs to be modified.
[yourpath]/FARLINE/Farline_MCF7.conf                     => config file used to launch the MCF7 job, needs to be modified.
[yourpath]/FARLINE/kissDE_1.0_clara_pseudo_count.tar.gz  => must be added to R (see below)

FaRLine needs the following dependencies to work correctly:

TopHat 2.1.0
Samtools 0.1.19
Perl

libbio-samtools-perl 1.41
libexcel-template-perl 0.34
libmodern-perl-perl 1.20150127

r-base 3.2.3
r-bioc-biobase 2.30

DSS 2.10 (needs Biobase 2.30, bsseq 1.6.0, splines, methods)
KissDE (needs DESeq 1.22.1, aod 1.3, xtable 1.8-2, DSS 2.10, glmnet 2.0)
DESeq (needs XML 3.98-1.5)

You can easily install all dependencies with the following two scripts:

#Bash script
sudo apt-get update
sudo apt-get install tophat
sudo apt-get install samtools
sudo apt-get install r-bioc-biobase
sudo apt-get install libbio-samtools-perl
sudo apt-get install libexcel-template-perl
sudo apt-get install libmodern-perl-perl
sudo apt-get install libxml2-dev

#R script (execute as SU)
# If https:// URLs are not supported, try http:// instead
source("https://bioconductor.org/biocLite.R")

biocLite("Biobase")
biocLite("bsseq")
biocLite("DSS")
biocLite("glmnet")
biocLite("aod")
biocLite("xtable")
biocLite("DESeq")
install.packages ("[yourpath]/FARLINE/kissDE_1.0_clara_pseudo_count.tar.gz", repos = NULL, type="source")
q()

Your system should now be ready to run Farline properly.
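To confirm that the command-line dependencies listed above are actually reachable before launching a run, a small check such as the following can be used. It is a sketch rather than part of FaRLine: check_tools is a hypothetical helper, and it only tests that each executable is found on the PATH, not that its version matches the one listed.

```shell
#!/usr/bin/env bash
# Report which of the given executables cannot be found on the PATH.
# Returns 0 if all are present, 1 otherwise.
check_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Hypothetical usage with the executables named on this page:
# check_tools tophat samtools perl Rscript
```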

Configuring your run

FaRLine needs a configuration file in order to run. For the SKNSH dataset, you can use the file [yourpath]/FARLINE/Farline_SKNSH.conf as a model.
You will still need to set some run and install parameters in the configuration file:

## RUN GENERAL INFORMATIONS :

run_name="SKNSH"                                             ## Run name: a directory with this name will be created in $output_dir
output_dir="/home/yourname/RUN"                              ## Output Directory : where to put the results (! the directory must exist)
pathfastq="/home/yourname/rawdata/SKNSH/fastq"               ## default: $output_dir/$run_name/fastq  path to look for fastq
pathtophat=""                                                ## default: output_dir/$run_name/tophat path to look for Step1 (TopHat) results                   
pathresults=""                                               ## default: output_dir/$run_name path to look for Step2 results (annotations/ and stats*/)

## INSTALL GENERAL INFORMATIONS :
FARLINE_DIR="/home/yourname/FARLINE";                        ## Where to look for FARLINE directory
tophat="/usr/bin/tophat";                                    ## Where to look for tophat exe.   Try "which tophat" if you don't know
samtools='/usr/local/bin/samtools';                          ## Where to look for samtools exe. Try "which samtools" if you don't know
LibraryR='/usr/lib/R/library';                               ## Where to look for R Library

Other parameters are specific for the run and can be ignored.
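Before launching, it can help to check that the paths set in the configuration file exist, since output_dir in particular must be created beforehand. The sketch below assumes that the configuration file can be read as plain shell variable assignments, which is what the excerpt above looks like but is not stated on this page. check_farline_conf is a hypothetical helper, not part of FaRLine.

```shell
#!/usr/bin/env bash
# Check a few mandatory settings of a FaRLine configuration file.
# Assumption: the file consists of shell-style assignments (name="value";).
check_farline_conf() {
    local conf="$1"
    (
        # Source in a subshell so the variables do not leak into the caller
        . "$conf" || exit 1
        [ -n "$run_name" ] \
            || { echo "run_name is empty" >&2; exit 1; }
        [ -d "$output_dir" ] \
            || { echo "output_dir does not exist: $output_dir" >&2; exit 1; }
        [ -d "$FARLINE_DIR" ] \
            || { echo "FARLINE_DIR does not exist: $FARLINE_DIR" >&2; exit 1; }
    )
}

# Hypothetical usage:
# check_farline_conf /home/yourname/FARLINE/Farline_SKNSH.conf
```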

Launching FaRLine

In order to launch FaRLine, use the following command-line:

[yourpath]/FARLINE/scripts/Farline_pipeline.sh [your_config_file].conf

You can also add [yourpath]/FARLINE/scripts to your PATH with the following command (or add this line to your .bashrc and source it):

export PATH=$PATH:[yourpath]/FARLINE/scripts

If you do so, the command will be:

Farline_pipeline.sh [your_config_file].conf

To display all available options please use the command:

Farline_pipeline.sh

Launching FaRLine on SKNSH dataset

To replicate the results obtained in the paper, FaRLine can be run in three steps:

Step1: Mapping (needs the fastq files).
Step2: FaRLine computations (needs the Step1 results).
Step3: Stats (needs the Step2 results).

Any step can be run independently if the results from the previous step are available.

Running the 3 steps
To launch all 3 steps, it is essential to define the variable pathfastq in the configuration file, and then run:

Farline_pipeline.sh Farline_SKNSH.conf

Running Step2 and Step3 only
If you wish to skip Step1, you can directly launch Step2 and Step3.
To do so, you need the output of Step1, which can be downloaded here: SKNSH bam files.
You also need to set the variable pathtophat in the configuration file so that it points to the mapping results from Step1.
Then run:

Farline_pipeline.sh Farline_SKNSH.conf -skip 1

Running Step3 only
If you wish to skip Step1 and Step2, you can directly launch Step3.
To do so, you need the output of Step2, which can be downloaded here: SKNSH Step 2 output. Extract the files in this archive.
You also need to set the variable pathresults in the configuration file so that it points to the output of Step2.
Then run:

Farline_pipeline.sh Farline_SKNSH.conf -skip 1,2
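
The three ways of launching the pipeline described above differ only in the -skip argument. As an illustration, the mapping from "first step to run" to that argument can be written as a small helper. farline_skip_args is a hypothetical name and not part of FaRLine. It only covers the three cases documented on this page.

```shell
#!/usr/bin/env bash
# Print the -skip argument needed to start FaRLine at a given step.
farline_skip_args() {
    case "$1" in
        1) echo "" ;;            # run Step1, Step2 and Step3
        2) echo "-skip 1" ;;     # Step1 results must already exist
        3) echo "-skip 1,2" ;;   # Step1 and Step2 results must already exist
        *) echo "start step must be 1, 2 or 3" >&2; return 1 ;;
    esac
}

# Hypothetical usage (word splitting of the result is intended here):
# Farline_pipeline.sh Farline_SKNSH.conf $(farline_skip_args 2)
```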

Main output file
After running all three steps, the main output can be found in the file named exon_skipping_stats_recap_file_SKNSH.xls.
Alternatively, you can download the main output file here: exon_skipping_stats_recap_file_SKNSH.xls.

Troubleshooting

If your Ubuntu version is older than 14.04, or if you have difficulty installing the required dependencies, you may upgrade Ubuntu with these commands (use at your own risk: upgrading your Ubuntu version may cause other installed software to stop working):

sudo apt-get update
sudo apt-get upgrade
sudo apt-get dist-upgrade
sudo apt-get install update-manager-core
sudo do-release-upgrade

FaRLine on MCF7 data

The data

Running FaRLine

Downloading

The FaRLine version used in the paper can be downloaded here: FaRLine
We recommend running FaRLine on Ubuntu 14.04 LTS or later.

Installing

Extract Farline.tgz in the directory of your choice:

tar xvfz Farline.tgz

This will create a FARLINE directory containing several files and subdirectories:

[yourpath]/FARLINE/scripts                               => perl and bash scripts, including the main Farline_pipeline.sh script
[yourpath]/FARLINE/Lib                                   => perl (.pm) and R (.R) libraries 
[yourpath]/FARLINE/DB                                    => tables needed by Farline
[yourpath]/FARLINE/Ref                                   => genomic references
[yourpath]/FARLINE/Farline_SKNSH.conf                    => config file used to launch the SKNSH job, needs to be modified.
[yourpath]/FARLINE/Farline_MCF7.conf                     => config file used to launch the MCF7 job, needs to be modified.
[yourpath]/FARLINE/kissDE_1.0_clara_pseudo_count.tar.gz  => must be added to R (see below)

FaRLine needs the following dependencies to work correctly:

TopHat 2.1.0
Samtools 0.1.19
Perl

libbio-samtools-perl 1.41
libexcel-template-perl 0.34
libmodern-perl-perl 1.20150127

r-base 3.2.3
r-bioc-biobase 2.30

DSS 2.10 (needs Biobase 2.30, bsseq 1.6.0, splines, methods)
KissDE (needs DESeq 1.22.1, aod 1.3, xtable 1.8-2, DSS 2.10, glmnet 2.0)
DESeq (needs XML 3.98-1.5)

You can easily install all dependencies with the following two scripts:

#Bash script
sudo apt-get update
sudo apt-get install tophat
sudo apt-get install samtools
sudo apt-get install r-bioc-biobase
sudo apt-get install libbio-samtools-perl
sudo apt-get install libexcel-template-perl
sudo apt-get install libmodern-perl-perl
sudo apt-get install libxml2-dev

#R script (execute as SU)
# If https:// URLs are not supported, try http:// instead
source("https://bioconductor.org/biocLite.R")

biocLite("Biobase")
biocLite("bsseq")
biocLite("DSS")
biocLite("glmnet")
biocLite("aod")
biocLite("xtable")
biocLite("DESeq")
install.packages ("[yourpath]/FARLINE/kissDE_1.0_clara_pseudo_count.tar.gz", repos = NULL, type="source")
q()

Your system should now be ready to run Farline properly.

Configuring your run

FaRLine needs a configuration file in order to run. For the MCF7 dataset, you can use the file [yourpath]/FARLINE/Farline_MCF7.conf as a model.
You will still need to set some run and install parameters in the configuration file:

## RUN GENERAL INFORMATIONS :

run_name="MCF7"                                             ## Run name: a directory with this name will be created in $output_dir
output_dir="/home/yourname/RUN"                              ## Output Directory : where to put the results (! the directory must exist)
pathfastq="/home/yourname/rawdata/MCF7/fastq"               ## default: $output_dir/$run_name/fastq  path to look for fastq
pathtophat=""                                                ## default: output_dir/$run_name/tophat path to look for Step1 (TopHat) results                   
pathresults=""                                               ## default: output_dir/$run_name path to look for Step2 results (annotations/ and stats*/)

## INSTALL GENERAL INFORMATIONS :
FARLINE_DIR="/home/yourname/FARLINE";                        ## Where to look for FARLINE directory
tophat="/usr/bin/tophat";                                    ## Where to look for tophat exe.   Try "which tophat" if you don't know
samtools='/usr/local/bin/samtools';                          ## Where to look for samtools exe. Try "which samtools" if you don't know
LibraryR='/usr/lib/R/library';                               ## Where to look for R Library

Other parameters are specific for the run and can be ignored.

Launching FaRLine

In order to launch FaRLine, use the following command-line:

[yourpath]/FARLINE/scripts/Farline_pipeline.sh [your_config_file].conf

You can also add [yourpath]/FARLINE/scripts to your PATH with the following command (or add this line to your .bashrc and source it):

export PATH=$PATH:[yourpath]/FARLINE/scripts

If you do so, the command will be:

Farline_pipeline.sh [your_config_file].conf

To display all available options please use the command:

Farline_pipeline.sh

Launching FaRLine on MCF7 dataset

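To replicate the results obtained in the paper, FaRLine can be run in three steps:

Step1: Mapping (needs the fastq files).
Step2: FaRLine computations (needs the Step1 results).
Step3: Stats (needs the Step2 results).

Any step can be run independently if the results from the previous step are available.

Running the 3 steps
To launch all 3 steps, it is essential to define the variable pathfastq in the configuration file, and then run:

Farline_pipeline.sh Farline_MCF7.conf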

Running Step2 and Step3 only
If you wish to skip Step1, you can directly launch Step2 and Step3.
To do so, you need the output of Step1, which can be downloaded here: MCF7 bam files.
You also need to set the variable pathtophat in the configuration file so that it points to the mapping results from Step1.
Then run:

Farline_pipeline.sh Farline_MCF7.conf -skip 1

Running Step3 only
If you wish to skip Step1 and Step2, you can directly launch Step3.
To do so, you need the output of Step2, which can be downloaded here: MCF7 Step 2 output. Extract the files in this archive.
You also need to set the variable pathresults in the configuration file so that it points to the output of Step2.
Then run:

Farline_pipeline.sh Farline_MCF7.conf -skip 1,2

Main output file
After running all three steps, the main output can be found in the file named exon_skipping_stats_recap_file_MCF7.xls.
Alternatively, you can download the main output file here: exon_skipping_stats_recap_file_MCF7.xls.

Troubleshooting

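If your Ubuntu version is older than 14.04, or if you have difficulty installing the required dependencies, you may upgrade Ubuntu with these commands (use at your own risk: upgrading your Ubuntu version may cause other installed software to stop working):

sudo apt-get update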
sudo apt-get upgrade
sudo apt-get dist-upgrade
sudo apt-get install update-manager-core
sudo do-release-upgrade

Complementarity of assembly-first and mapping-first approaches for alternative splicing annotation and differential analysis from RNAseq data.

Clara Benoit-Pilven et al.

Support webpage
