Wednesday, December 3, 2014

The Folly of Using Small Segments as Proof in Genealogical Research

Responsible genealogists adhere to high standards of proof in their research, in the evidence that they present and in the conclusions they reach. I strongly believe that genetic genealogists should as well. When we make claims that are not supported by sound science, then we undermine the credibility of our field.

Experience has demonstrated to me that there is great folly in claiming small segments can be used as proof (yes, even as supporting evidence) in genealogical research. When I use the term "small segments" in this article, I am referring to unphased "matching" segments under 5 centiMorgans, and I am addressing their use in matching, not admixture. A few genetic genealogists have argued that there are certain instances when small segments are not only helpful in our genealogical research, but reliable. I strongly disagree.

One of the many problems with utilizing small segments is that, in general, people tend to see evidence that supports their theories and reject evidence that does not. Because the nature of small segments is so random, as I will demonstrate, it is possible that an individual will see patterns where none exist in reality, such as in a cluster of tiny, meaningless "matching" segments. This also holds true for admixture analysis.

Blaine Bettinger already wrote a great blog post explaining the work that has already been done on this issue along with some of his own comparisons, so I am going to concentrate on the multi-generational data to which I have access. Angie Bush has kindly allowed me access to her family's extensive data while she is unable to collaborate on this post since she is on a genealogy cruise. (Thanks, Angie!)

All of these examples are the first ones I looked at, so they are randomly chosen and not selected with bias. There is a huge amount of analysis that can still be performed on this data set. Since Gedmatch was down when I wrote this, I concentrated on Family Tree DNA data. When I am able to access Gedmatch again, I will add to my analysis.

First let's look at this simple chart of my data compared to James, a confirmed paternal fourth cousin, and then my father's data compared to that same cousin. As you can see, both my father and I have one substantial matching segment with James on Chromosome 4 (in purple). Some would argue that because we have one longer matching segment, that this makes the matching small segments reported more valid and thus can be more responsibly attributed to our known common ancestor.

Notice the segments highlighted in red in my chart. Those are all segments that were reported to be matching between me and James that do not show up as matches with my father. So, right off the bat, we can eliminate eight segments of what some might claim is supporting evidence of the known relationship with James.  That is 66.6% of the segments under 5 cM, which is in line with what was found in the 23andMe study.
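For readers who keep their segment lists in a spreadsheet, the parent check described above can be sketched in a few lines of Python. This is only an illustration: the `Segment` fields, the function names and every position and cM value below are made up for the example and are not my actual data or any company's export format.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    chromosome: int
    start: int   # start position in base pairs (hypothetical values)
    end: int     # end position in base pairs
    cm: float    # reported length in centiMorgans


def overlaps(a: Segment, b: Segment) -> bool:
    """True if two segments share any stretch of the same chromosome."""
    return a.chromosome == b.chromosome and a.start < b.end and b.start < a.end


def unsupported_by_parent(child_segments, parent_segments, max_cm=5.0):
    """Small child segments that overlap nothing in the parent's comparison."""
    return [
        s for s in child_segments
        if s.cm < max_cm and not any(overlaps(s, p) for p in parent_segments)
    ]


# Hypothetical comparisons against the same cousin.
child_vs_cousin = [
    Segment(2, 10_000_000, 12_000_000, 1.8),
    Segment(4, 50_000_000, 75_000_000, 22.0),
    Segment(7, 30_000_000, 31_500_000, 1.2),
    Segment(9, 5_000_000, 7_000_000, 2.4),
]
father_vs_cousin = [
    Segment(2, 10_000_000, 12_000_000, 1.8),
    Segment(4, 50_500_000, 75_000_000, 21.5),
]

flagged = unsupported_by_parent(child_vs_cousin, father_vs_cousin)
small = [s for s in child_vs_cousin if s.cm < 5.0]
print(f"{len(flagged)} of {len(small)} small segments are absent from the father's comparison")
```

Any segment flagged this way cannot be attributed to the relationship through the tested parent, whatever its true origin.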


Since I have no reason to believe that I inherited those segments from my mother, they are likely pseudo-segments. Pseudo-segments are spliced together by jumping between alleles from mom and dad, impersonating a matching stretch of DNA where one does not exist. The inability to distinguish these from authentic matching segments is a limitation of our current technology. Could they have actually come from my mother, you might be asking? My mother does not match James at the Family Tree DNA thresholds and I can't check Gedmatch to be sure, but there are no known common origins between them. (I am checking with James to see if he is willing to allow me to make that comparison for my next post.) Regardless, this analysis clearly disproves that the red segments are a result of the known paternal relationship. As such, there should be no argument to the conclusion that the majority of the small segments in this randomly chosen example cannot function as supporting evidence of the primary relationship in any way.
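To make the idea of a pseudo-segment concrete, here is a toy model in Python. It assumes a simplified matching rule in which two unphased genotypes "match" at a SNP whenever they share at least one allele; real matching algorithms are more involved, and the haplotypes below are invented purely for illustration.

```python
def half_identical(genotype_a, genotype_b):
    """Unphased rule (simplified assumption): a SNP matches if the two
    genotypes have at least one allele in common."""
    return bool(set(genotype_a) & set(genotype_b))


def longest_run(flags):
    """Length of the longest unbroken run of True values."""
    best = current = 0
    for flag in flags:
        current = current + 1 if flag else 0
        best = max(best, current)
    return best


# Invented haplotypes over eight SNPs, alleles written as A or B.
child_paternal = "AABBAABB"
child_maternal = "BBAABBAA"
cousin_hap1 = "ABABABAB"
cousin_hap2 = "ABABABAB"

# Unphased data: we only see the pair of alleles at each SNP.
child_genotypes = list(zip(child_paternal, child_maternal))
cousin_genotypes = list(zip(cousin_hap1, cousin_hap2))

unphased_run = longest_run(
    half_identical(c, k) for c, k in zip(child_genotypes, cousin_genotypes)
)

# Phased data: compare one real chromosome copy against another.
phased_run = max(
    longest_run(a == b for a, b in zip(child_hap, cousin_hap))
    for child_hap in (child_paternal, child_maternal)
    for cousin_hap in (cousin_hap1, cousin_hap2)
)

print("longest unphased run:", unphased_run)
print("longest phased run:", phased_run)
```

With these made-up haplotypes the unphased comparison "matches" across all eight SNPs, while no single chromosome copy matches for more than two in a row. The apparent segment exists only because the comparison is free to hop between the child's maternal and paternal alleles.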

Next, look at the green segments. In this case, it appears that I inherited those from my father, but if you look closely, they are actually longer for me than for my father. This means that they are, at least partially, false positives or pseudo-segments. Incidentally, the one substantial matching segment we have in common (purple) is also reported to be a bit longer for me than for my father, which illustrates that it is questionable to rely too heavily on what appear to be exact assignments. In my list of matching segments, only the pink segments on chromosomes 2 and 3 are left as potentially fully IBD (identical by descent) segments. Some will say that the fact that they persist from parent to child makes them more reliable indicators of a genealogical relationship. Perhaps, but there is no proof that the pink segments weren't originally pseudo-segments interpreted as a match by the technology in my father's data and then passed to me through recombination of his two chromosomes. Does that sound far-fetched? Well, let's see by looking at multi-generational data.
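The "longer in the child than in the parent" problem can also be measured rather than eyeballed. The following Python sketch computes how much of a child's reported segment lies outside everything the parent shares with the same cousin. The function name and the positions are hypothetical, and lengths are in base pairs because converting to cM would require a genetic map.

```python
def uncovered_length(child_segment, parent_segments):
    """Base pairs of the child's (start, end) segment that no parent
    segment on the same chromosome covers."""
    start, end = child_segment
    covered = 0
    cursor = start  # everything left of the cursor is already accounted for
    for p_start, p_end in sorted(parent_segments):
        lo = max(p_start, cursor)
        hi = min(p_end, end)
        if hi > lo:
            covered += hi - lo
            cursor = hi
    return (end - start) - covered


# Hypothetical positions on one chromosome, compared to the same cousin.
child_green = (20_000_000, 23_000_000)
father_green = [(20_400_000, 22_700_000)]

excess = uncovered_length(child_green, father_green)
print(f"{excess:,} bp of the child's segment cannot have come through the father")
```

Whatever portion is left over after this subtraction is, by definition, not explained by inheritance from that parent.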

Please bear with me because this is going to take a while. This chart is the matching DNA between Brynne and a known Bush cousin from her mother's father's father's branch of the family. The common ancestors are Frederick Bush and Martha White, so you can see that the expected path of inheritance for matching DNA between Brynne and this cousin is:
Brynne >> Angie >> Grandpa >> Great Grandpa


Here we are looking at the threshold set at 5 cM. Brynne's data compared to the Bush cousin is on the left and the comparison of her mother Angie to this same cousin is on the right.


This is her grandfather's (left) and great grandfather's (right) DNA compared to the same cousin.


These are nicely consistent with all of Brynne's matching segments being inherited from her great grandfather, as would be expected.

Now, let's look at the same comparisons with the threshold lowered to 1 centiMorgan.

Brynne and Angie:



Grandfather and great grandfather:
                   
             

As you can see things got very messy at this level. We have all kinds of problems and inconsistencies with the data now. Let's look at just a few.

Chromosome 11:

As you can see, Brynne has three small segments (under 5 cM) in common with her known Bush cousin on Chromosome 11. One is lost as we move to her mother Angie's comparison, but two persist. So, if the theory is correct that a small segment persisting over two generations is more likely to be identical by descent (IBD) or attributable to the known common ancestor, then the two remaining ones should be IBD. However, look what happens: another is lost when we move the next generation back in time toward the common ancestor with the known cousin, and then finally all three have disappeared by the time we get to the great grandfather. This is the opposite of what we should be seeing. Could these last two segments be attributable to another common ancestor on Brynne's grandmother's and great grandmother's branches of her tree? Possibly, but if so, that still doesn't support the claim that small segments help to prove the primary relationship responsible for the large matching segments. In fact, it refutes it because it demonstrates that even in families with no known pedigree collapse, such as this one, there still may be small segments inherited from distant common ancestors.

We saw other problems too. In some cases, like on Chromosomes 3 and 6, segments disappear at one generation and seemingly reappear at the next. That tells us one of two things: that coincidences happen and/or that the technology is not reliably picking up these small segments. Neither scenario instills confidence in genealogical conclusions based on small segment analysis.

Chromosome 3: Grandpa was "skipped" and the segment was almost three times larger in the most recent generation which is opposite of what we would expect to see if it was identical by descent.

Chromosome 6: Mom was "skipped". Notice the high number of SNPs (again many more in the most recent generation), which makes it seem less likely that it was simply missed by the technology.

Taken at face value, these examples would lend credence to the myth that DNA can skip a generation, which we all know to be untrue.

Most importantly, in this entire comparison, NOT ONE of Brynne's small segments shared with her known Bush cousin persisted consistently through all four generations on the path back to the known common ancestor.
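If you have multi-generational data of your own, the persistence test is easy to automate. The Python sketch below walks one chromosome's segments back along the expected path of inheritance and counts how many consecutive generations each of the youngest person's segments survives. The positions (in megabases) are invented to mirror the Chromosome 11 pattern described above; they are not the family's actual values.

```python
# Expected path of inheritance, youngest generation first.
PATH = ["Brynne", "Angie", "Grandpa", "Great Grandpa"]

# Hypothetical small segments shared with the cousin on one chromosome,
# as (start_mb, end_mb) pairs for each person on the path.
chr11 = {
    "Brynne": [(10.0, 11.2), (40.5, 42.0), (95.0, 96.8)],
    "Angie": [(40.5, 41.9), (95.0, 96.5)],
    "Grandpa": [(95.1, 96.5)],
    "Great Grandpa": [],
}


def overlaps(a, b):
    """True if two (start, end) intervals share any stretch."""
    return a[0] < b[1] and b[0] < a[1]


def generations_persisted(segment, shared, path):
    """Count consecutive people on the path, starting with the youngest,
    who have a segment overlapping the given one."""
    count = 0
    for person in path:
        if not any(overlaps(segment, other) for other in shared[person]):
            break
        count += 1
    return count


depths = [generations_persisted(s, chr11, PATH) for s in chr11["Brynne"]]
fully_persistent = [d for d in depths if d == len(PATH)]

print("generations each segment persisted:", depths)
print("segments present in all four generations:", len(fully_persistent))
```

A segment that is genuinely inherited from the common ancestor should reach the full length of the path. In this made-up example, as in the real comparison, none does.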

When going through this data, I saw so many examples that fly in the face of the belief that small segments can, in any way, be reliable indicators of a genealogical relationship that I couldn't even begin to cover them all here. Since Gedmatch was down while I was writing this, I was unable to do some of the comparisons I had planned, so perhaps I will do that at a later time.

In the meantime, since I read a lot of comments over the last few days that people feel comfortable mapping small segments to their known ancestors using comparisons of their close relatives, I decided to see if that, at least, could stand up to analysis. Let's look at Brynne compared to her maternal grandparents.


We can see her DNA mapped to her grandfather in orange and her grandmother in blue. It is quite clean at the 5 cM threshold on the left, with almost no overlap, as we would expect. However, when you drop the threshold to 1 cM, you can start to see issues on the right. Look at Chromosome 1, for example. There are three small segments from the grandparents that are directly in opposition to the obvious inheritance pattern. You can also see it on chromosomes 3, 5, 6, 10, 12 and 14. If you only had one of the grandparents tested, you would map those small segments to the wrong grandparent and, thus, be "barking up the wrong" branch of the tree.

Brynne's DNA mapped to her maternal grandparents

Let's look more closely at Brynne's Chromosome 14 and the inheritance from her maternal grandparents through to her great grandfather Bush.


The pink in the image below is the comparison with her mother, Angie. Of course, they share across the length of the chromosome. Then, you can see, in green, the DNA she shares with her maternal grandfather and, in blue, the DNA she shares with her great grandfather from the same line. It appears that she has one long segment from her grandfather and then one small one that she inherited from her great grandfather through her grandfather. You would feel pretty safe mapping that small blue/green segment to her great grandfather, right? There is only one problem...the orange is the DNA she inherited from her maternal grandmother! That small segment falls right where the DNA she inherited from her mother came from her maternal grandmother, not her grandfather! She couldn't have inherited DNA from both her maternal grandmother and her maternal grandfather on that spot, so the small segment must be a false positive even though it persisted over multiple generations.



You can see similar problems on Chromosomes 1, 5 and 6.

Remember she can't inherit DNA in the same spot from both grandma and grandpa.
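That rule can be turned into a simple consistency check. The Python sketch below flags any small segment attributed to one maternal grandparent that sits entirely inside a large segment from the other maternal grandparent. The names, the 5 cM cutoff as a parameter, and all positions are illustrative assumptions, not the real Chromosome 14 values.

```python
from typing import NamedTuple


class Seg(NamedTuple):
    start_mb: float
    end_mb: float
    cm: float


def contained_in(inner: Seg, outer: Seg) -> bool:
    """True if inner lies entirely within outer."""
    return outer.start_mb <= inner.start_mb and inner.end_mb <= outer.end_mb


def conflicting_small_segments(own, other, small_cm=5.0):
    """Small segments from one grandparent that fall inside a large
    segment already mapped to the other grandparent."""
    return [
        s for s in own
        if s.cm < small_cm
        and any(o.cm >= small_cm and contained_in(s, o) for o in other)
    ]


# Hypothetical maternal-side mapping for one chromosome.
grandfather = [Seg(20.0, 60.0, 45.0), Seg(88.0, 90.0, 2.1)]
grandmother = [Seg(60.0, 105.0, 50.0)]

false_from_grandfather = conflicting_small_segments(grandfather, grandmother)
false_from_grandmother = conflicting_small_segments(grandmother, grandfather)

print("grandfather segments in grandmother territory:", false_from_grandfather)
print("grandmother segments in grandfather territory:", false_from_grandmother)
```

Note that this check only works when both grandparents on the same side have tested; with one grandparent alone, the conflict is invisible.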


Pink = mother, green = grandfather, blue = great grandfather (father of grandfather), orange = grandmother.

All three of these chromosomes show small segments that fall in sections inherited from the opposite side of the family, proving they are false positives. Look at the colorful pile-up on Chromosome 6. Some of these segments are almost 5 cM!

There is so much more to say about the use of small segments in genealogical research and a huge amount of data to explore, but I will stop here for today. I think that these few examples should give any genetic genealogist who believes that small segments can, in any way, support genealogical theories serious pause for thought.

In a later article, we will examine the assertion that small segments can prove useful as "population specific" guides and whether there is any support for the recent ancient genome comparison analysis. The fact that these segments are not consistently inherited certainly calls that type of analysis into question as well.

I encourage those of you with access to multi-generational data to perform a similar analysis and let us know what you find. The more data, the better!

[Note: In the future, I believe that we will be able to utilize smaller segments in our research and even assign them to specific ancestors through chromosome mapping, but this will only be possible when technology has advanced considerably and we are using higher resolution autosomal DNA testing and much improved phasing engines. The exception is Tim Janzen who is attempting to do so now through highly technical and advanced work. He is phasing his data through testing and comparison of large numbers of known relatives, many more than the vast majority of genealogists will ever test. To my knowledge, he has never claimed to have used small segments to break down any genealogical brick walls or to have proven anything in that regard, even as supporting evidence.]

15 comments:

  1. I think the message is clear, until we have something other than what amounts to guilt by association to include the very small segments, we just plain shouldn't. This is, however, a difficult concept to get across to those new to genetic genealogy. Hopefully this blog post and the recent one by Blaine Bettinger will be helpful in that respect.

  2. great sleuthing, and you may have mentioned this in the post, (i tend to skim and then look at the regions in the genome) the regions you highlight on chr3 and chr6 have highly repetitive sequences found all over the genome or found repeated many times in that region. for example the chr6 region contains the infamous HLA/MHC loci. these are interesting regions for SNP analysis and can lead to many inconsistencies as they are often under positive selection for resistance to certain pathogens. this can lead to inheritance patterns that don't always follow a "random" assumption. i'm sure Tim is compiling a list of particularly troublesome regions similar to Thomas's regions on the Y.

  3. So glad to see this article, and very nicely done! Looking forward to the second part of your post. Another way I have found to demonstrate the danger of using small segments is to pick two people at GEDmatch you know are not closely related (or just two people at random) and run a one to one comparison using 100 for SNP threshold and 1 for minimum cM threshold. There will be LOTS of small segments indicated as being shared between the two people.

  4. Thank you CeCe for an excellent and illuminating article.

  5. Great article to help me and possibly others understand more about atDNA. Actually the first I have read to provide such an analysis. Looking at our DNA results, many of us get confused and frustrated by it all and are often looking for just a silver bullet (quick & simple), but we may forget DNA is just one tool for the genealogist. You still need to use all the other tools we learn in genealogy, and DNA can help support some of your basic/advanced genealogy research. Many will never be able to trace to a common ancestor for the so-called cousin matches we see. I like to focus just on possible close relationships like 2nd cousins, not trying to go all the way back to 5th cousins. Many have gaps in their genealogy and have not or cannot trace every single ancestral descendant to present-day folks. With such gaps in our genealogy, it could be impossible to determine a possible common ancestor for a very distant alleged cousin. What percentage of family genealogists can actually take their own line back more than 5 generations and bring forward every one of the family members from that 5th generation? I know I can’t.
    Keep it coming CeCe, I look forward to future articles regarding the subject.

  6. Thanks, CeCe.

    In the example of Chromosome #14, if the grandfather and greatgrandfather were related to the grandmother, say 4 or 500 years ago, would that show the same result? I get confused between identical by State, Chance, Descent or really old DNA that may or may not represent a common ancestor.

  7. Thanks for sharing your knowledge. Please confirm if the ethnic ancestry (admixture) painting of small segments is folly. How small (in cM or % of a chromosome) can a "Native American" or "Sub-Saharan African" segment of an otherwise "Northern European" be before it is likely noise (or folly)?

  8. Would the new DNA test from TRIBECODE.com, with its next-generation sequencing technology, verify small segment matches?

  9. Nicely done, CeCe. Especially the explanation of pseudo-segments.

    I wouldn't be surprised to see that if you broke down the alleles in the "missing generation" examples (chr 3 & 6), you would find that the missing ancestor has the alleles that fill in the blank from one parent and is missing the alleles from the other parent, thus creating the illusion of a break in the line.

    As for your statement in your second paragraph ("A few genetic genealogists have argued that there are certain instances when small segments are not only helpful in our genealogical research, but reliable. I strongly disagree."), I stand by the position that I outlined in my latest blog and repeated in the comments on Blaine's recent piece. Using the small segments from the bottom up can work.

    If my sisters and I, my aunt and my uncle all have a matching small segment, I can assign it to my father and to one (or a combination) of my grandparents. But I wouldn't take it any higher without significant help from the cousins.

  10. Thanks for the great article. I'm looking forward to your follow-up. I haven't taken more than a cursory interest in gedmatch and the like since someone told me we were related and we were the nth cousins and descendants of Marie Antoinette and were therefore royalty. Clearly an overreach on their part and one which I couldn't swallow. Tools like gedmatch and others can be helpful, but they must be used intelligently. Just as one shouldn't use a cannon for a fly swatter, one shouldn't use these types of tools without understanding their strengths, limitations and purpose. Clearly, further articles and discussion about genetic tools and their use should continue, just as this science continues to evolve.

  11. Great article, very good to see DNA across so many generations and to truly identify how small segments simply cannot be relied upon until (as you say) the technology gets better.

    A question on a slightly different but related topic. This definitely seems to back up people suggesting at least using 7cM for matches. I've seen the various charts that usually get to about 99% likelihood of relation at about 10cM etc. How does that relate to admixture? For example is something 7 cM - 10 cM in admixture (across multiple data sets) pretty likely to be correct? Or would you recommend the threshold being higher for admixture?

  12. There is another genetic genealogist, Roberta Estes, who says small segments can be used as proof:
    http://dna-explained.com/2013/11/17/proving-men-whose-y-lines-dont-match-are-related/

    Personally, I think you are correct that small segments shouldn't be used, and that doing so is like trying to fit a square peg into a round hole.

  13. You make a compelling case for not using small segments! I've been one to use them as supporting evidence, or to tease out additional information if present with a large segment match. Perhaps it was nothing more than "folly" as you say!

    However, one thing I'm having a problem squaring in the argument against small segments is the probability involved with matching even as little as a couple hundred SNPs in a row. If the value of each SNP comes from a random event (I know it's not totally random, but for all intents and purposes I'm assuming it is), and the probability of matching hundreds of random events in a row is negligible, then randomness has to be discounted as a reason for these matches. Even when 700,000 SNPs are tested across 400,000+ people (in Ancestry's case) the odds remain very negligible of having matches of even this size if my math is correct (again, assuming the assigned value at each SNP location is a mutually exclusive random event).

    So the question is if not randomness, then what? Could population matches account for more of it than is thought? Is there less randomness in inheriting segments than is thought? Do our deep colonial roots account for a lot of the noise? Not sure.

    Kevin

  14. Very interesting article. I notice that some of the segments you compare have exact start and end points. Regardless of whether the segment is small or larger (say, over 7 cM), does this fact imply anything about the potential for relationships between individuals? Or do certain chromosomes simply rearrange themselves in an exact order so that large numbers of people have certain sequences in their genetic makeup that cannot be accounted for by descent from a common ancestor?

    Richard Cochran

  15. When we find a matching segment with a child generation person having more cMs than an overlapping segment with a parent generation person the first thing we think about is that there may be a second line of descent from the child's opposite parent to the primary person. In other words a secondary common ancestor (SCA) or what you refer to as pedigree collapse, in addition to the MRCA? This should be seen in endogamous populations. My question has always been: Is there a way to measure the effects of endogamy and thus be able to filter it out or flag it in reporting matches? If so it would help reduce false positives.
