Wednesday, December 3, 2014

The Folly of Using Small Segments as Proof in Genealogical Research

Responsible genealogists adhere to high standards of proof in their research, in the evidence that they present and in the conclusions they reach. I strongly believe that genetic genealogists should as well. When we make claims that are not supported by sound science, then we undermine the credibility of our field.

Experience has demonstrated to me that there is great folly in claiming small segments can be used as proof (yes, even supporting proof) in genealogical research. When I use the term "small segments" in this article, I am referring to unphased "matching" segments under 5 centiMorgans, and I am addressing their use in matching, not admixture. A few genetic genealogists have argued that there are certain instances when small segments are not only helpful in our genealogical research, but reliable. I strongly disagree.

One of the many problems with utilizing small segments is that, in general, people tend to see evidence that supports their theories and reject evidence that does not. Because the nature of small segments is so random, as I will demonstrate, it is possible that an individual will see patterns where none exist in reality, such as in a cluster of tiny, meaningless "matching" segments. This also holds true for admixture analysis.

Blaine Bettinger has already written a great blog post explaining the work that has been done on this issue, along with some of his own comparisons, so I am going to concentrate on the multi-generational data to which I have access. Angie Bush has kindly allowed me access to her family's extensive data, although she is unable to collaborate on this post because she is on a genealogy cruise. (Thanks, Angie!)

All of these examples are the first ones I looked at, so they are randomly chosen and not selected with bias. There is a huge amount of analysis that can still be performed on this data set. Since Gedmatch was down when I wrote this, I concentrated on Family Tree DNA data. When I am able to access Gedmatch again, I will add to my analysis.

First, let's look at this simple chart of my data compared to James, a confirmed paternal fourth cousin, and then my father's data compared to that same cousin. As you can see, both my father and I have one substantial matching segment with James on Chromosome 4 (in purple). Some would argue that having one longer matching segment makes the reported small matching segments more valid, and thus that they can more responsibly be attributed to our known common ancestor.

Notice the segments highlighted in red in my chart. Those are all segments that were reported to be matching between me and James that do not show up as matches with my father. So, right off the bat, we can eliminate eight segments of what some might claim is supporting evidence of the known relationship with James. That is about two-thirds (roughly 67%) of the segments under 5 cM, which is in line with what was found in the 23andMe study.


Since I have no reason to believe that I inherited those segments from my mother, they are likely pseudo-segments. Pseudo-segments are spliced together by jumping between alleles from mom and dad, impersonating a matching stretch of DNA where one does not exist. The inability to distinguish these from authentic matching segments is a limitation of our current technology. Could they have actually come from my mother, you might be asking? My mother does not match James at the Family Tree DNA thresholds and I can't check Gedmatch to be sure, but there are no known common origins between them. (I am checking with James to see if he is willing to allow me to make that comparison for my next post.) Regardless, this analysis clearly disproves that the red segments are a result of the known paternal relationship. As such, there should be no argument to the conclusion that the majority of the small segments in this randomly chosen example cannot function as supporting evidence of the primary relationship in any way.
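For readers who keep their segment lists in a spreadsheet, the parent check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Segment fields, the function names and the toy numbers are hypothetical, and are neither Family Tree DNA's export format nor the real comparison with James. Note that a segment absent from one tested parent could still come from the other parent, so the sketch only shows which segments cannot be credited to the tested parent's side.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    chrom: int    # chromosome number
    start: int    # base-pair position where the reported match begins
    end: int      # base-pair position where the reported match ends
    cm: float     # reported length in centiMorgans


def overlaps(a: Segment, b: Segment) -> bool:
    """True when two segments share any stretch of the same chromosome."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def unsupported_by_parent(child, parent, max_cm=5.0):
    """Small child segments with no overlapping segment in the parent's comparison."""
    return [s for s in child
            if s.cm < max_cm and not any(overlaps(s, p) for p in parent)]


# Hypothetical toy data, invented for illustration only.
child_vs_cousin = [
    Segment(4, 10_000_000, 32_000_000, 18.2),   # the one substantial segment
    Segment(2, 5_000_000, 6_500_000, 2.1),
    Segment(7, 40_000_000, 41_200_000, 1.4),
    Segment(11, 80_000_000, 81_000_000, 1.1),
]
parent_vs_cousin = [
    Segment(4, 10_500_000, 32_000_000, 17.6),
    Segment(2, 5_000_000, 6_500_000, 2.1),
]

small = [s for s in child_vs_cousin if s.cm < 5.0]
rejected = unsupported_by_parent(child_vs_cousin, parent_vs_cousin)
print(f"{len(rejected)} of {len(small)} small segments are not shared by the parent")
```

The 5 cM cutoff is passed as a parameter so that the same check can be repeated at other thresholds.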

Next, look at the green segments. In this case, it appears that I inherited those from my father, but if you look closely, they are actually longer for me than for my father. This means that they are, at least partially, false positives or pseudo-segments. Incidentally, the one substantial matching segment we have in common (purple) is also reported to be a bit longer for me than for my father, which illustrates that it is questionable to rely too heavily on what appear to be exact assignments. In my list of matching segments, only the pink segments on chromosomes 2 and 3 are left as potentially fully IBD segments. Some will say that the fact that they persist from parent to child makes them more reliable indicators of a genealogical relationship. Perhaps, but there is no proof that the pink segments weren't originally pseudo-segments interpreted as a match by the technology in my father's data and then passed to me through recombination of his two chromosomes. Does that sound far-fetched? Well, let's see by looking at multi-generational data.

Please bear with me because this is going to take a while. This chart is the matching DNA between Brynne and a known Bush cousin from her mother's father's father's branch of the family. The common ancestors are Frederick Bush and Martha White, so you can see that the expected path of inheritance for matching DNA between Brynne and this cousin is:
Brynne >> Angie >> Grandpa >> Great Grandpa


Here we are looking at the threshold set at 5 cM. Brynne's data compared to the Bush cousin is on the left and the comparison of her mother Angie to this same cousin is on the right.


This is her grandfather's (left) and great grandfather's (right) DNA compared to the same cousin.


These are nicely consistent with all of Brynne's matching segments being inherited from her great grandfather, as would be expected.

Now, let's look at the same comparisons with the threshold lowered to 1 centiMorgan.

Brynne and Angie:



Grandfather and great grandfather:
                   
             

As you can see, things got very messy at this level. We now have all kinds of problems and inconsistencies in the data. Let's look at just a few.

Chromosome 11:

As you can see, Brynne has three small segments (under 5 cM) in common with her known Bush cousin on Chromosome 11. One is lost as we move to her mother Angie's comparison, but two persist. So, if the theory is correct that a small segment persisting over two generations is more likely to be identical by descent or attributable to the known common ancestor, then the two remaining ones should be IBD. However, look what happens - another is lost when we move the next generation back in time toward the common ancestor with the known cousin, and then finally all three have disappeared by the time we get to the great grandfather. This is the opposite of what we should be seeing. Could these last two segments be attributable to another common ancestor on Brynne's grandmother's and great grandmother's branches of her tree? Possibly, but if so, that still doesn't support the claim that small segments help to prove the primary relationship responsible for the large matching segments. In fact, it refutes it because it demonstrates that even in families with no known pedigree collapse, such as this one, there still may be small segments inherited from distant common ancestors.

We saw other problems too. In some cases, like on Chromosomes 3 and 6, segments disappear at one generation and seemingly reappear at the next. That tells us one of two things - that coincidences happen and/or that the technology is not picking up these small segments consistently. Neither scenario instills confidence in genealogical conclusions based on small segment analysis.

Chromosome 3: Grandpa was "skipped" and the segment was almost three times larger in the most recent generation, which is the opposite of what we would expect to see if it were identical by descent.

Chromosome 6: Mom was "skipped". Notice the high number of SNPs (again many more in the most recent generation), which makes it seem less likely that it was simply missed by the technology.

Taken at face value, these examples would lend credence to the myth that DNA can skip a generation, which we all know to be untrue.

Most importantly, in this entire comparison, NOT ONE of Brynne's small segments shared with her known Bush cousin persisted consistently through all four generations on the path back to the known common ancestor.
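The generation-by-generation check used above can be sketched as follows. This is a minimal illustration, not the tool used for the charts: the tuple layout (chromosome, start, end), the function names and every position in the toy data are hypothetical. For each of the youngest person's small segments, it records whether an overlapping segment appears in each older generation on the expected path, then labels the pattern.

```python
def has_overlap(seg, others):
    """True if seg overlaps any segment in others on the same chromosome."""
    chrom, start, end = seg
    return any(c == chrom and s < end and start < e for c, s, e in others)


def presence_pattern(seg, generations):
    """Whether each older generation (nearest first) shows an overlapping segment."""
    return [has_overlap(seg, g) for g in generations]


def classify(pattern):
    """Label a presence pattern along the path back to the common ancestor."""
    if all(pattern):
        return "persists through every generation"
    # Absent in a nearer generation but present in an older one: a "skip".
    if any(later and not earlier for earlier, later in zip(pattern, pattern[1:])):
        return "skips a generation"
    return "drops out before the common ancestor"


# Hypothetical small segments compared to one cousin; not the family's real data.
child = [(11, 5_000_000, 6_000_000),
         (11, 30_000_000, 31_000_000),
         (3, 12_000_000, 13_500_000)]
mother = [(11, 5_000_000, 6_000_000), (11, 30_000_000, 31_000_000)]
grandfather = [(11, 5_000_000, 5_900_000), (3, 12_200_000, 12_700_000)]
great_grandfather = []
path = [mother, grandfather, great_grandfather]

results = [classify(presence_pattern(seg, path)) for seg in child]
for seg, label in zip(child, results):
    print(seg, "->", label)
```

A segment that is truly identical by descent through this line should fall in the first category, so anything labeled as skipping or dropping out deserves suspicion.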

When going through this data, I saw so many examples that fly in the face of the belief that small segments can, in any way, be reliable indicators of a genealogical relationship that I couldn't even begin to cover them all here. Since Gedmatch was down while I was writing this, I was unable to do some of the comparisons I had planned, so perhaps I will do that at a later time.

In the meantime, since I read a lot of comments over the last few days that people feel comfortable mapping small segments to their known ancestors using comparisons of their close relatives, I decided to see if that, at least, could stand up to analysis. Let's look at Brynne compared to her maternal grandparents.


We can see her DNA mapped to her grandfather in orange and her grandmother in blue. It is quite clean at the 5 cM threshold on the left, with almost no overlap, as we would expect. However, when you drop the threshold to 1 cM, you can start to see issues on the right. Look at Chromosome 1, for example. There are three small segments from the grandparents that are directly in opposition to the obvious inheritance pattern. You can also see it on chromosomes 3, 5, 6, 10, 12 and 14 (click image to enlarge). If you only had one of the grandparents tested, you would map those small segments to the wrong grandparent and, thus, be "barking up the wrong" branch of the tree.

Brynne's DNA mapped to her maternal grandparents

Let's look more closely at Brynne's Chromosome 14 and the inheritance from her maternal grandparents through to her great grandfather Bush.


The pink in the image below is the comparison with her mother, Angie. Of course, they share across the length of the chromosome. Then, you can see, in green, the DNA she shares with her maternal grandfather and, in blue, the DNA she shares with her great grandfather from the same line. It appears that she has one long segment from her grandfather and then one small one that she inherited from her great grandfather through her grandfather. You would feel pretty safe mapping that small blue/green segment to her great grandfather, right? There is only one problem...the orange is the DNA she inherited from her maternal grandmother! That small segment falls right where the DNA she inherited from her mother came from her maternal grandmother, not her grandfather! She couldn't have inherited DNA from both her maternal grandmother and her maternal grandfather on that spot, so the small segment must be a false positive even though it persisted over multiple generations.



You can see similar problems on Chromosomes 1, 5 and 6.

Remember she can't inherit DNA in the same spot from both grandma and grandpa.


Pink = mother, green = grandfather, blue = great grandfather (father of grandfather), orange = grandmother.

All three of these chromosomes show small segments that fall in sections inherited from the opposite side of the family, proving they are false positives. Look at the colorful pile-up on Chromosome 6. Some of these segments are almost 5 cM!
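The rule that a spot on the maternal copy cannot come from both maternal grandparents lends itself to a simple check. The sketch below is an illustration under stated assumptions: the function name, the (chromosome, start, end) tuples and all positions are hypothetical, not values read from the charts. It flags only small segments that lie entirely inside a region already mapped to the opposite grandparent, which is deliberately conservative because a segment straddling a crossover point could be partly genuine.

```python
def conflicting_segments(small_segments, opposite_regions):
    """Small segments lying entirely inside a region mapped to the opposite grandparent."""
    flagged = []
    for chrom, start, end in small_segments:
        for r_chrom, r_start, r_end in opposite_regions:
            if chrom == r_chrom and r_start <= start and end <= r_end:
                flagged.append((chrom, start, end))
                break  # one conflict is enough to flag the segment
    return flagged


# Hypothetical data: a long stretch mapped to the maternal grandmother, and
# two small segments apparently shared with the grandfather's line.
grandmother_regions = [(14, 60_000_000, 107_000_000)]
grandfather_line_small = [
    (14, 75_000_000, 76_200_000),   # sits inside the grandmother's stretch
    (14, 30_000_000, 31_000_000),   # outside it, so not contradicted here
]

flagged = conflicting_segments(grandfather_line_small, grandmother_regions)
print(flagged)
```

This check is only possible when both grandparents (or enough other relatives) have been tested, which is exactly why mapping small segments from a single tested grandparent is risky.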

There is so much more to say about the use of small segments in genealogical research and a huge amount of data to explore, but I will stop here for today. I think that these few examples should give any genetic genealogist who believes that small segments can, in any way, support genealogical theories serious pause for thought.

In a later article, we will examine the assertion that small segments can prove useful as "population specific" guides and whether there is any support for the recent ancient genome comparison analysis. The fact that these segments are not consistently inherited certainly calls that type of analysis into question as well.

I encourage those of you with access to multi-generational data to perform a similar analysis and let us know what you find. The more data, the better!

[Note: In the future, I believe that we will be able to utilize smaller segments in our research and even assign them to specific ancestors through chromosome mapping, but this will only be possible when technology has advanced considerably and we are using higher resolution autosomal DNA testing and much improved phasing engines. The exception is Tim Janzen, who is attempting to do so now through highly technical and advanced work. He is phasing his data through testing and comparison of large numbers of known relatives, many more than the vast majority of genealogists will ever test. To my knowledge, he has never claimed to have used small segments to break down any genealogical brick walls or to have proven anything in that regard, even as supporting evidence.]