SNP calling from RNA-seq data without a reference genome: identification, quantification, differential analysis and impact on the protein sequence

A first step in Transcriptome-Wide Association Studies (TWAS)

Abstract

SNPs (Single Nucleotide Polymorphisms) are genetic markers whose precise identification is a prerequisite for association studies. Methods to identify them are currently well developed for model species, but rely on the availability of a (good) reference genome, and therefore cannot be applied to non-model species. They are also mostly tailored for whole genome (re-)sequencing experiments, whereas in many cases, transcriptome sequencing can be used as a cheaper alternative which already enables to identify SNPs located in transcribed regions. In this paper, we propose a method that identifies, quantifies and annotates SNPs without any reference genome, using RNA-seq data only. Individuals can be pooled prior to sequencing, if not enough material is available from one individual. Using pooled human RNA-seq data, we clarify the precision and recall of our method and discuss them with respect to other methods which use a reference genome or an assembled transcriptome. We then validate experimentally the predictions of our method using RNA-seq data from two non-model species. The method can be used for any species to annotate SNPs and predict their impact on the protein sequence. We further enable to test for the association of the identified SNPs with a phenotype of interest.

The pipeline


With fasta/fastq input from an RNA-seq experiment, SNPs are found by KisSplice without using a reference. KisSplice provides only a local context around the SNP, but a reference transcriptome can be built from the RNAseq data using a full-lenth transcriptome assembler like Trinity. Then SNPs predicted by KisSplice can be positionned along the transcripts (with BLAT). Some SNPs that do not map on the transcripts of Trinity, called orphan SNPs, are harder to study but can still be of interest. We propose a method, KisSplice2RefTranscriptome, to predict a functional impact for the positioned SNPs, and intersect these results with condition-specific SNPs. Overall, starting from RNAseq data only, we obtain a list of condition-specific SNPs stratified by functional impact.

KisSplice2RefTranscriptome

KisSplice2RefTranscriptome web page
KisSplice2RefTranscriptome version 1.2.0 (2016-29-03)
User's guide

Running the programs

Our method enables to find SNPs from RNA-seq data. In order to assess if the SNPs we find are correct, and if the list we output is exhaustive, we chose to test our method on RNA-seq data from the Geuvadis project. Indeed, the individuals whose transcriptome was sequenced in this project were already included in the 1000 genome project. Hence, their SNPs have been already well annotated. We downloaded fastq files from SRA and selected 10 Toscans and 10 Central Europeans. We sampled 10M reads for each individual and concatenated the fastq files in pools of 5 individuals.
The files used in our analysis are available here

Running KisSplice

Here is the command in order to run KisSplice on this dataset :

kissplice -s 1 -k 41 --experimental
-r cond1_replicat1_R1.fastq -r cond1_replicat1_R2.fastq -r cond1_replicat2_R1.fastq -r cond1_replicat2_R2.fastq
-r cond2_replicat1_R1.fastq -r cond2_replicat1_R2.fastq -r cond2_replicat2_R1.fastq -r cond2_replicat2_R2.fastq

The file containing the bubbles corresponding to the SNPs can be found here : results_cond1_cond2_k41_coherents_type_0.fa

Trinity ORFs file

Here is the command in order to run Trinity on the human dataset:

trinity --seqType fq --JM 100G
--left cond1_replicat1_R1.fastq, cond1_replicat2_R1.fastq, cond2_replicat1_R1.fastq, cond2_replicat2_R1.fastq
--right cond1_replicat1_R2.fastq, cond1_replicat2_R2.fastq, cond2_replicat1_R2.fastq, cond2_replicat2_R2.fastq
--CPU 10 --bflyCPU 10

The assembled transcripts can be found in the Trinity.fasta file. You will then need to run Transdecoder (included in the Trinity package). Transdecoder will predict the ORFs of the transcripts that will be used for the prediction of the functional impact of SNPs. Here is the command in order to run Transdecoder on the human dataset :

Transdecoder.LongOrfs -t Trinity.fasta
TransDecoder.Predict -t Trinity.fasta

Transdecoder will output several finals files, KisSplice2RefTranscriptome only needs the .bed file : Trinity.fasta.transdecoder.bed

Running BLAT

KisSplice2RefTranscriptome needs the bubbles to be positioned on the reference transcriptome. For now, only psl formated files are supported by K2rt. You can run BLAT to get this file. Here is the command in order to run BLAT on the human dataset :

blat   --minIdentity=80    Trinity.fasta    results_cond1_cond2_k41_coherents_type_0.fa    out_blat_SNP_on_trinity_ID_80.psl

The BLAT output can be downloaded here : out_blat_SNP_on_trinity_ID_80.psl

Running KissDE (optional)

You can intersect KisSplice2RefTranscriptome results with condition-specific SNPs by giving to K2rt the output of kissDE.

#!/usr/bin/Rscript
library(kissDE)
snp<-kissplice2counts("results_cond1_cond2_k41_coherents_type_0.fa", pairedEnd=TRUE )
human_conditions<-c("c1","c1","c2", "c2")
res<-diffExpressedVariants(snp, human_conditions, pvalue=1)
write.table(res$finalTable, file="kissDE_output", sep="\t", quote=FALSE)

Here is the file kissDE_output

Running KisSplice2RefTranscriptome


kissplice2reftranscriptome
-b Trinity.fasta.transdecoder.bed
-k result_coherent_type0.fa
-t out_blat_SNP_on_trinity_ID_80.psl
-s kissDE_output
-o mainOutput.tsv
-l lowQueryCoverageOutput.tsv


The several output files can be found here :
mainOutput.tsv
lowQueryCoverageOutput.tsv