When the crystal ball is cloudy: calling sequence data correctly

Here's a monkey wrench of a paper (O'Rawe et al.), just published in Genome Medicine.  We're all being sold on the idea that knowing our whole genome sequence is going to make us much healthier. The DNA sequencer cum crystal ball will tell us what we're likely to be in for, and this will give us plenty of lead time to prevent it -- by running, lowering our cholesterol intake, losing weight, or whatever -- or to prepare for it.

But this rests, first and foremost among many other assumptions, on the data being read correctly, with no false positives or negatives.  And here's the clincher: O'Rawe et al. compared five different software packages that read and interpret DNA sequence data, and they report low concordance among the results.  Discrepancies have been found before, but not when comparing reads of the same raw data.

This group sequenced the whole exomes of 15 individuals in 4 families, and fed the raw data through 5 sequence analysis pipelines.  They also sequenced one whole genome.  Sequencing was done at 20-154X coverage (120X on average), meaning each nucleotide was read at least 20 times and usually many more, and at least 80% of the target sequence was obtained.
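(As a rough sketch of what the coverage arithmetic means, using made-up reads rather than the study's data: per-base depth is simply the number of aligned reads overlapping each position.)

from collections import Counter

# Hypothetical aligned reads on one contig, as (start, end) half-open intervals.
reads = [(100, 108), (102, 110), (104, 112), (105, 113)]

# Depth at a position = number of reads that overlap it.
depth = Counter()
for start, end in reads:
    for pos in range(start, end):
        depth[pos] += 1

# "20X coverage" means depth >= 20 at a position; the study averaged 120X.
print(depth[105])  # 4 reads cover position 105 in this toy example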

They found that the 5 programs agreed on single nucleotide variants (SNVs) only about 60% of the time.  That is, 40% of the time a given SNV was called by fewer than 5 of the programs.  Each of the pipelines detects variants that the others do not, and these aren't necessarily all false positives.
This disagreement is likely the result of many factors including alignment methods, post alignment data processing, parameterization efficacy of alignment and variant calling algorithms, and the underlying models utilized by the variant calling algorithm(s).
That is, each step along the way potentially introduces errors.  Indel (insertion/deletion, a segment of DNA one or more nucleotides in length that is inserted or deleted) concordance rates were even lower, at 26% among three indel-calling programs.  (The paper goes into much more detail about specific pipelines and error rates.)  Using family data can help reduce inaccuracies when it makes it possible to determine which calls just cannot be correct.  But otherwise, with current methods, reducing false positives means increasing false negatives, and vice versa.
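To make the concordance arithmetic concrete, here is a minimal sketch of how agreement among callers can be measured; the pipeline names and call sets below are hypothetical, not the paper's:

# Each pipeline's output reduced to a set of (chromosome, position, alt allele) calls.
pipelines = {
    "pipeline_A": {("chr1", 10177, "C"), ("chr1", 10352, "A"), ("chr2", 45895, "T")},
    "pipeline_B": {("chr1", 10177, "C"), ("chr1", 10352, "A")},
    "pipeline_C": {("chr1", 10177, "C"), ("chr1", 10352, "A"), ("chr2", 45895, "T")},
}

# Union: every site called by at least one pipeline.
all_calls = set().union(*pipelines.values())

# Concordant: sites called by every pipeline.
concordant = set.intersection(*pipelines.values())

# O'Rawe et al.'s ~60% SNV figure is this kind of ratio, computed over
# five pipelines and real exome data rather than these toy sets.
print(f"Concordance: {len(concordant) / len(all_calls):.0%}")  # 67% here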

The authors write,
In the realm of biomedical research, every variant call is a hypothesis to be tested in light of the overall research design. Missing even a single variant can mean the difference between discovering a disease causing mutation or not. For this reason, our data suggest that using a single bioinformatics pipeline for discovering disease related variation is not always sufficient.
This somewhat understates the problem.  Seriously testing a SNP (single nucleotide polymorphism) to see whether it has an effect on disease risk is no joke, especially when such effects are typically very small in any case, and biased upwards in GWAS-type data.  What do you do?  Put that single change into a lab mouse or rat and see if it is more likely to develop slightly higher blood pressure in old age?  Or a slightly higher risk of some sort of cancer (again, to model the human case, it should be at older ages)?  Which mouse strain would you use?  And if humans are to be used for validation, how would you do it?

The questions are serious because miscalls by sequencers go both ways.  A sequencer can miss a SNV call, so you don't identify one of the variants that you really want to be checking.  Or, it can give you a false positive, and lead you farther astray.  And if you must choose between hundreds of variants across the genome, with comparable estimated effects, you are already in a bit of a bind even if they are all perfectly called!

No technology, or medical test, will be correct 100 percent of the time, and sequencing technologies are likely to get better, not worse (though if MS Windows is any guide, that's not guaranteed!).  But when disease risk estimates depend on accurate DNA sequence, it is obvious that we are way premature in proclaiming findings so loudly and demanding that so much effort and resources be poured into doing more of the same.  Again, focused studies of problems that are more important, clearer, and less vulnerable to these kinds of errors are where the effort should be going.

And, some subtle manipulation, too?
By the way, the standard term for a single nucleotide variation in a population is SNP (single nucleotide polymorphism).  Now, some authors use SNV (single nucleotide variant) instead, which does two things.  First, it rhetorically equates 'variant' with causal variant--that is, tacitly, subtly, or surreptitiously planting in your mind that they are onto something causal.  And second, it tacitly, subtly, or surreptitiously suggests that one of the two alleles is the 'good' or 'normal' (i.e., health-associated) variant.  This perpetuates 'wild type' thinking--see our earlier post 'walk on the wild-type side'.

These are ways in which the community of researchers inadvertently or intentionally (you decide which) cooks the books in your mind, in journalists' minds, and even in their own, entrenching a de facto genetic-causation worldview into everybody's thinking.  That's good for business, of course.
