Frequently asked questions
- What is a bubble and what is the difference with a BCC?
- Can sequencing errors create bubbles?
- How can I differentiate a SNP from a sequencing error ?
- How can I differentiate a SNP from an inexact repeat ?
- How can I differentiate a SNP from RNA editing?
- How can I differentiate an AS from an indel ?
- Why are there incoherent bubbles ? Aren't all bubbles created by the reads ?
- What are the "type 4" events ?
- What are the counts given by KisSplice?
- Can paralog genes create bubbles?
- Why are there duplicated events in the output?
- My results have changed between v 2.1.0 and v 2.2.0. Why ?
- I used to find more SNPs than I do with new versions. Is it normal ?
- Question not in the list? Ask your own!
What is a bubble and what is the difference with a BCC?
- What is a bubble ?
A bubble corresponds to a variation in the sequence. Formally, it is a pair of disjoint paths with the same source and target nodes in the de Bruijn graph (DBG) derived from the reads. Depending on the length of the paths of the bubble, the variation will be a SNP, an indel or an AS event. Repeats may also generate bubbles, but in this case, the paths will not have the same length and will have high sequence similarity.
Here is an example of an alternative splicing event or an indel. The two paths have the same sequences on the borders (a and b) but differ in the center (junction ab for the shorter path; junction aS, exon S and junction Sb for the longer path).
KisSplice outputs the two paths of the bubble (here ab and aSb). Every output is pairwise. The sequences are written to a FASTA file, with the identifiers:
X is the bcc identifier, Y is the bubble identifier, Z is the type of event detected, L is the length of the sequence. Z can be 0 for SNPs and sequencing errors, 1 for alternative splicing events, 2 for inexact tandem repeats, 3 for short indels.
- What is a BCC ?
A BCC corresponds to a set of overlapping variations. In graph theory, a BCC is a BiConnected Component (maximal subgraph such that there are two paths between any two nodes). In KisSplice, a BCC is therefore a biconnected component of the de Bruijn graph derived from the reads. Each BCC contains at least one variation (SNP, indel, AS event, etc.). It may contain more than one variation if they overlap, for instance when a SNP is embedded in a skipped exon.
When a gene contains several variations that are more than k nt apart, it will be split among several BCCs. Hence, a BCC does not correspond to a gene, but to a gene fragment.
If two genes share k nt, they will fall in the same connected component, but not necessarily in the same biconnected component. If they share at least 2 k-mers, then the region in between will form a BCC whose gene affiliation is unclear. The genes are said to be co-assembled. This is the case for instance for (recent) paralogous genes.
Can sequencing errors create bubbles?
Yes, they can. If two reads differ by one nucleotide, it could create a bubble and be classified as type 0. In practice, at least two reads are required (-c option) to call a bubble.
How can I differentiate a SNP from a sequencing error ?
In KisSplice, two filters are meant to remove sequencing errors. The first is an absolute filter (-c option) and removes each k-mer seen fewer than c times. The second is relative (-C option) and removes, for each k-mer, every outgoing (resp. incoming) edge covered by less than C% of the sum of coverage of its outgoing (resp. incoming) edges.
By default, c=2 and C=5%. If the goal is to identify infrequent variants, these parameters can be lowered, but replicates are then recommended. The rationale is that the same sequencing error should not appear at the same location in multiple replicates.
We chose not to use base quality scores because some PCR artifacts can generate high-quality sequencing errors.
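To make the two filters concrete, here is a minimal Python sketch of the idea. It is an illustration only, not KisSplice's actual implementation: the function names and the dictionary-based representation of k-mer counts and edge coverages are assumptions made for this example.

```python
def absolute_filter(kmer_counts, c=2):
    """Keep only k-mers seen at least c times (the idea behind -c)."""
    return {kmer: n for kmer, n in kmer_counts.items() if n >= c}


def relative_filter(edge_coverage, C=0.05):
    """For the outgoing (or incoming) edges of one k-mer, drop every edge
    whose coverage is below a fraction C of the summed coverage of those
    edges (the idea behind -C, with C given as a fraction, not a percent)."""
    total = sum(edge_coverage.values())
    return {edge: cov for edge, cov in edge_coverage.items() if cov >= C * total}


# Hypothetical counts: a k-mer seen once is removed with c=2.
kept_kmers = absolute_filter({"ACGTA": 1, "CGTAG": 12}, c=2)

# Hypothetical branching node: one edge carries 97 reads, the other 3.
# With C=5%, the threshold is 5 reads, so the weak edge is removed.
kept_edges = relative_filter({"A": 97, "C": 3}, C=0.05)
```

Lowering c or C in this sketch keeps rarer k-mers and edges, which mirrors the trade-off described above: more infrequent variants, but also more sequencing errors.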
How can I differentiate a SNP from an inexact repeat ?
This is very difficult, especially if the copies of the repeats diverged recently.
An inexact repeat consisting of two identical regions of 2k+1 nt which diverged only in their central position will be strictly identical to a SNP in the DBG.
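The following Python sketch illustrates why the two situations are indistinguishable at the k-mer level. The sequences and the small value k=5 are made up for readability, and reverse complements are ignored, which is a simplification.

```python
def kmers(seq, k):
    """Return the set of all k-mers of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


k = 5
# Two hypothetical sequences of length 2k+1 = 11 that differ only at the
# central position. They could be two alleles of a SNP or two repeat copies.
copy_a = "ACGTAGCTTGA"
copy_b = "ACGTACCTTGA"

shared = kmers(copy_a, k) & kmers(copy_b, k)   # source and target of the bubble
only_a = kmers(copy_a, k) - kmers(copy_b, k)   # k-mers of the first path
only_b = kmers(copy_b, k) - kmers(copy_a, k)   # k-mers of the second path

print(sorted(shared), len(only_a), len(only_b))
```

The two sequences share exactly two k-mers (the first and the last one), and each has k k-mers of its own. The graph therefore contains the same bubble whether the sequences come from one locus or from two copies of a repeat.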
In practice, many inexact repeats correspond to larger divergence levels (divergent sites are located less than 41nt apart) and therefore will lead to type0b bubbles, or to no bubble at all (if the two copies of the repeat do not share at least 2 k-mers). Filtering out type0b bubbles is a simple and efficient strategy to get rid of most inexact repeats... at the expense of missing true SNPs located in a cluster.
Another useful idea is to rely on the differential analysis. If you have several biological conditions, and if one variant is enriched in one condition and the other variant is enriched in the other condition, this is an indication that either it is a real SNP, or it is an inexact repeat where one copy is condition-specific (which itself is also interesting).
How can I differentiate a SNP from RNA editing?
You cannot if you have only RNA-seq data.
How can I differentiate an AS from an indel ?
If there is a reference genome, it is possible to re-align the paths and check the real nature of the event. By default, KisSplice does a classification based on the length of the variable part. If the variable part has length 1, 2, 4 or 5, then the event is classified as an indel. Otherwise, it is classified as an alternative splicing event.
The motivation for this choice is based on the two following observations :
- indels tend to be short
- AS events tend to fall in CDS, not UTRs, hence tend to have a variable region whose size is a multiple of 3
The figure illustrates these observations on an RNAseq dataset from human tissues (brain and liver). This dataset is the one used in the original KisSplice paper. Clearly, when the variable part has size exactly 3, it is difficult to assess if the variation is an AS event or an indel.
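The length-based rule described above can be written as a short Python function. This is only a sketch of the stated rule; the function name and the returned labels are assumptions made for illustration, not KisSplice identifiers.

```python
def classify_by_length(variable_length):
    """Default length-based rule: a variable part of 1, 2, 4 or 5 nt is
    called an indel, anything else an alternative splicing (AS) event."""
    if variable_length in (1, 2, 4, 5):
        return "indel"
    return "AS event"


for length in (1, 3, 4, 6, 120):
    print(length, classify_by_length(length))
```

Note that a variable part of exactly 3 nt is labelled as an AS event by this rule, which is precisely the ambiguous case mentioned above.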
Why are there incoherent bubbles ? Aren't all bubbles created by the reads?
A bubble is incoherent if there is no tiling of reads covering each path of the bubble. Most incoherent bubbles are artifacts of the DBG. They are due to the fact that the DBG loses the information that two k-mers are in the same read. In practice, some incoherent bubbles can correspond to true variations in poorly expressed regions.
What are the "type 4" events ?
They are bubbles which cannot be classified as type 1, 2, or 3 because they do not satisfy the criteria. For instance, they may correspond to cases where a SNP is close to the alternative splice site, hence generating a path of length larger than 2k-2.
What are the counts given by KisSplice?
The counts correspond to the number of reads that align to each path. If a read aligns to the border of a path with fewer than k nt, it is discarded.
Can paralog genes create bubbles ?
Yes they can, if they did not diverge too much. In this case, they fall in the category of inexact repeats, which indeed may generate bubbles which resemble SNPs.
In this case, focusing on condition specific bubbles (one member of the gene family enriched in one dataset, and the other enriched in the other dataset) is still relevant.
It is worth noting that paralog genes often differ by more than one position. If these positions are more than k nt apart, this will generate several bubbles. Otherwise, this may generate a single bubble, but with longer paths. In the case where the copies do not share at least two k-mers (i.e. the copies diverged too much), no bubble is reported.
In practice, long bubbles of type 0 are often good indicators of paralogs. Of course, they may also correspond to clusters of true SNPs. Again, focusing on condition specific bubbles is one way to focus on the most interesting bubbles.
In the case where at least one member of a gene family is alternatively spliced, it will generate a bubble. If the copies of the gene family did not diverge much, it may be impossible to decide which member of the gene family contributed to which alternative transcript. In this case, the gene family can be thought of as "collectively spliced".
Why are there duplicated events in the output?
Looking at KisSplice outputs, some events may appear as duplicates. It may very well be that they differ only by one nt. This happens when a SNP/mutation is coupled with other events (such as AS events).
For example, in the case of an intron retention, the intron may accumulate many mutations. This will create many bubbles which will all be reported by KisSplice.
My results have changed between v 2.1.0 and v 2.2.0. Why ?
Repeated regions (transposable elements...) in RNA-seq data are an issue in assembly and in transcriptome reconstruction in general. In versions older than v 2.2.0, for some datasets, enumerating all bubbles took a very long time in the highly complex regions that repeats create in the graph. A significant number of bubbles could be "trapped" in these regions, which could reach half of the graph size.
Since v 2.2.0, a new algorithm deals with this issue by enumerating not all bubbles, but only bubbles with a restricted number of edges, called "branches". We proved (see Navigating in a sea of repeats in RNA-seq without drowning) that by selecting and enumerating only bubbles with a low number of "branches", we avoid repeated regions in the graph, since highly branching nodes in the DBG are prone to be repeats. As a consequence, we retrieve events that were previously surrounded by repeats. On two simulated datasets, we have been able to show that this new enumeration increases KisSplice sensitivity (we are able to find more events) and specificity (we have slightly fewer false positives).
I used to find more SNPs than I do with new versions. Is it normal ?
First, you need to check whether the k value is the same between your run with an old version and your run with a 2.x.z version, as the default k has been increased from 25 to 41. This increase brings a gain in specificity, but you may find fewer events in return. Secondly, 2.x.z versions deal with 'N' nucleotides in the data differently from older versions. Older versions replaced N's with G's, which led to an artificially high number of putative SNPs, while new versions ignore N's (every k-mer overlapping an N is discarded). Depending on the amount of 'N' in your dataset, this can have a strong impact, because new versions produce fewer false positives.
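The difference in how N's are handled can be illustrated with a small Python sketch. Both functions are illustrative assumptions written for this FAQ, not code taken from KisSplice, and the read and the value k=3 are made up.

```python
def kmers_discarding_n(read, k):
    """New behaviour (sketch): every k-mer overlapping an 'N' is discarded."""
    return [read[i:i + k] for i in range(len(read) - k + 1)
            if "N" not in read[i:i + k]]


def kmers_n_as_g(read, k):
    """Old behaviour (sketch): each 'N' is replaced with 'G' before
    extracting k-mers."""
    replaced = read.replace("N", "G")
    return [replaced[i:i + k] for i in range(len(replaced) - k + 1)]


read = "ACGTNACGT"
print(kmers_discarding_n(read, 3))  # only k-mers made of A, C, G, T
print(kmers_n_as_g(read, 3))        # also contains k-mers with an invented G
```

In the second list, k-mers such as GTG come from a base that was never actually read. When another read carries a real base at that position, the invented G can create a spurious SNP-like bubble.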