TWAS

The pipeline

With fasta/fastq input from an RNA-seq experiment, SNPs are found by KisSplice without using a reference. KisSplice provides only a local context around the SNP, but a reference transcriptome can be built from the RNAseq data using a full-lenth transcriptome assembler like Trinity. Then SNPs predicted by KisSplice can be positionned along the transcripts (with BLAT). Some SNPs that do not map on the transcripts of Trinity, called orphan SNPs, are harder to study but can still be of interest. We propose a method, KisSplice2RefTranscriptome, to predict a functional impact for the positioned SNPs, and intersect these results with condition-specific SNPs. Overall, starting from RNAseq data only, we obtain a list of condition-specific SNPs stratified by functional impact.

KisSplice2RefTranscriptome

KisSplice2RefTranscriptome web page
KisSplice2RefTranscriptome version 1.2.0 (2016-29-03)
User's guide

Running the programs

Our method enables to find SNPs from RNA-seq data. In order to assess if the SNPs we find are correct, and if the list we output is exhaustive, we chose to test our method on RNA-seq data from the Geuvadis project. Indeed, the individuals whose transcriptome was sequenced in this project were already included in the 1000 genome project. Hence, their SNPs have been already well annotated. We downloaded fastq files from SRA and selected 10 Toscans and 10 Central Europeans. We sampled 10M reads for each individual and concatenated the fastq files in pools of 5 individuals.
The files used in our analysis are available here

Running KisSplice

Here is the command in order to run KisSplice on this dataset :

kissplice -s 1 -k 41 --experimental
-r cond1_replicat1_R1.fastq -r cond1_replicat1_R2.fastq -r cond1_replicat2_R1.fastq -r cond1_replicat2_R2.fastq
-r cond2_replicat1_R1.fastq -r cond2_replicat1_R2.fastq -r cond2_replicat2_R1.fastq -r cond2_replicat2_R2.fastq

The file containing the bubbles corresponding to the SNPs can be found here : results_cond1_cond2_k41_coherents_type_0.fa

Trinity ORFs file

Here is the command in order to run Trinity on the human dataset:

trinity --seqType fq --JM 100G
--left cond1_replicat1_R1.fastq, cond1_replicat2_R1.fastq, cond2_replicat1_R1.fastq, cond2_replicat2_R1.fastq
--right cond1_replicat1_R2.fastq, cond1_replicat2_R2.fastq, cond2_replicat1_R2.fastq, cond2_replicat2_R2.fastq
--CPU 10 --bflyCPU 10

The assembled transcripts can be found in the Trinity.fasta file. You will then need to run Transdecoder (included in the Trinity package). Transdecoder will predict the ORFs of the transcripts that will be used for the prediction of the functional impact of SNPs. Here is the command in order to run Transdecoder on the human dataset :

Transdecoder.LongOrfs -t Trinity.fasta
TransDecoder.Predict -t Trinity.fasta

Transdecoder will output several finals files, KisSplice2RefTranscriptome only needs the .bed file : Trinity.fasta.transdecoder.bed

Running BLAT

KisSplice2RefTranscriptome needs the bubbles to be positioned on the reference transcriptome. For now, only psl formated files are supported by K2rt. You can run BLAT to get this file. Here is the command in order to run BLAT on the human dataset :

blat --minIdentity=80 Trinity.fasta results_cond1_cond2_k41_coherents_type_0.fa out_blat_SNP_on_trinity_ID_80.psl

The BLAT output can be downloaded here : out_blat_SNP_on_trinity_ID_80.psl

Running KissDE (optional)

You can intersect KisSplice2RefTranscriptome results with condition-specific SNPs by giving to K2rt the output of kissDE.

#!/usr/bin/Rscript
library(kissDE)
snp<-kissplice2counts("results_cond1_cond2_k41_coherents_type_0.fa", pairedEnd=TRUE )
human_conditions<-c("c1","c1","c2", "c2")
res<-diffExpressedVariants(snp, human_conditions, pvalue=1)
write.table(res$finalTable, file="kissDE_output", sep="\t", quote=FALSE)

Here is the file kissDE_output

Running KisSplice2RefTranscriptome

kissplice2reftranscriptome
-b Trinity.fasta.transdecoder.bed
-k result_coherent_type0.fa
-t out_blat_SNP_on_trinity_ID_80.psl
-s kissDE_output
-o mainOutput.tsv
-l lowQueryCoverageOutput.tsv

The several output files can be found here :
mainOutput.tsv
lowQueryCoverageOutput.tsv

SNP calling from RNA-seq data without a reference genome: identification, quantification, differential analysis and impact on the protein sequence

A first step in Transcriptome-Wide Association Studies (TWAS)

Abstract