AncestryDNA's Matching Threshold
First, mostly due to the large number of matches, it has been widely speculated that AncestryDNA is allowing for a much lower threshold than either 23andMe or Family Tree DNA, reporting matches based on as little as two cM. In reality, Ken tells me that AncestryDNA has been using a 5 Mb cutoff [Mb = mega base pairs = 1,000,000 base pairs] for reporting matches in their lowest category - "very low confidence". He explains how they came to this decision and what AncestryDNA sees as the benefits to their customers:
AncestryDNA, we believe, is the only service that phases the genotyping data and has validated the matching algorithm with large pedigrees. That leads to two important differences. First, it allowed us to test various segment cutoffs from 5-10 Mb* with and without a proprietary filter that preferentially removes incorrect matches. We've initially selected the 5 Mb cutoff with the filter as providing the best balance between false negative (true matches that we fail to call a match) and false positives (false matches that we call true matches). Second, it allowed us to make a better cousinship prediction. For example, our data suggest that most relationships that are theoretically predicted to be third cousins are really fourth cousins or deeper. Therefore, a fourth cousin match at AncestryDNA, we believe, is a third cousin match at other services.
Ken's assertion that AncestryDNA is using a more conservative prediction calculation does appear to be in agreement with what I and many of my colleagues have observed. Time will tell if it is indeed more accurate. The filter aimed at reducing the number of IBS (Identical By State) matches sounds like a promising addition. When we have the ability to examine the raw data we should be able to reach conclusions about how effective the filter is at fulfilling its purpose.
Mb vs cM - What does it mean to us?
As you may know, the centimorgan, rather than mega base pairs, is used by Family Tree DNA's Family Finder, and primarily by 23andMe as well, as the unit of measurement for the length of matching autosomal DNA segments. So, how does this 5 Mb threshold compare to the 5.5 cM* threshold used by Family Tree DNA (*edited from 7.7 cM after I was sent this) and the 7 cM threshold used by 23andMe in their Relative Finder feature? The National Institutes of Health website tells us that in human genetics, "one centimorgan is equivalent, on average, to one million base pairs" or 1 Mb. Genome.gov agrees, "Generally, one centimorgan equals about 1 million base pairs." However, in reviewing my Ancestry Finder download at 23andMe, which lists the length of segments both in Mb and in cM, I came to the conclusion that, unfortunately, it isn't that simple - at least for our purposes. In some cases, the numerical value in Mb was larger than in cM for the same segment, but in other cases it was smaller. I copied portions of my Ancestry Finder download to demonstrate examples. If you have tested at 23andMe, take a look at your own file to get a feel for the comparison.
The first chart shows the respective values when the Mb value was 11 and the second when the cM value was 11:
The number of base pairs that a centimorgan corresponds to varies so widely because the two units measure different things. When the distance along a chromosome is measured in mega base pairs (Mb), the value strictly reflects how many millions of base pairs there are in a matching segment. When centimorgans (cM) are used to express that distance, what is being measured is the frequency, or chance, of recombination expected within the segment. Some portions of the genome are expected to recombine more often than others, so sometimes a segment of 1 Mb has a relatively good chance of remaining intact and sometimes it does not.
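To make the Mb versus cM distinction concrete, here is a minimal Python sketch. It assumes a made-up recombination map: the regions, the rates in `TOY_MAP`, and the function name `segment_cm` are all invented for illustration and are not taken from any company's algorithm or from a real genetic map. It only shows why two segments of the same physical length can have very different genetic lengths.

```python
from typing import List, Tuple

# Hypothetical recombination map for one chromosome:
# (start in Mb, end in Mb, recombination rate in cM per Mb).
# These numbers are invented for illustration only.
TOY_MAP: List[Tuple[float, float, float]] = [
    (0.0, 10.0, 2.0),   # region that recombines often
    (10.0, 30.0, 0.5),  # region that recombines rarely
    (30.0, 40.0, 1.0),  # region at the "1 cM per Mb" average
]

def segment_cm(start_mb: float, end_mb: float,
               rate_map: List[Tuple[float, float, float]] = TOY_MAP) -> float:
    """Return the genetic length (cM) of the segment [start_mb, end_mb]."""
    total = 0.0
    for lo, hi, rate in rate_map:
        # Physical overlap (in Mb) between the segment and this map region.
        overlap = min(end_mb, hi) - max(start_mb, lo)
        if overlap > 0:
            total += overlap * rate
    return total

# Two segments with the same physical length (5 Mb each).
often = segment_cm(2.0, 7.0)     # lies entirely in the 2.0 cM/Mb region
rarely = segment_cm(15.0, 20.0)  # lies entirely in the 0.5 cM/Mb region
print(f"5 Mb where recombination is frequent: {often:.1f} cM")
print(f"5 Mb where recombination is rare:     {rarely:.1f} cM")
```

Under these invented rates, both segments would meet a 5 Mb cutoff, but only the first would meet a 5.5 cM or 7 cM cutoff. That is one way the same pair of people could be reported as a match under one unit and not under the other.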
Different Predictions and More Matches
This difference between AncestryDNA's way of calculating the length of segments and that of the other two companies may explain, in part, why some of us are seeing the same matches at AncestryDNA that we have at the other two companies, but with very different relationship predictions. The fact that AncestryDNA is using a phasing engine before running the matching algorithms will also account for some of the reported discrepancies. When asked why AncestryDNA is, on average, returning more matches than the other two companies, Ken offered one possible explanation. He said that it may be a result of the AncestryDNA database containing primarily customers with deep roots in the United States and, in many cases, descending from large Colonial New England families.
Adding International Customers to the Database
This discussion prompted me to inquire as to when AncestryDNA plans to offer the test to international customers. Ken said that it is certainly "on the radar", but they do not have an estimate of when this will happen yet. He explained some of the reasons for this:
1. Demand is still high within the United States and they are "processing samples as fast as we can right now".
2. The privacy laws in Europe are, in some cases, different from those in the US. Therefore, this will take additional time to address.
3. They will need to work out logistical issues.
He emphasized that Ancestry.com is a large company, which necessitates significant forethought and planning before taking action.
Uploadable Raw Data
During our conversation, Ken also addressed the questions surrounding the format the raw data will be presented in as well as the much-hoped-for matching segment data.
"We will be providing raw data download in early 2013. We have not made any formal decision on segment data. We understand that it is important to some of our customers and are taking it into serious consideration."
When I asked him whether the raw data would be formatted in such a way that will be compatible with uploading to third-party sites such as Gedmatch.com, he assured me that it would. Fortunately, this puts to rest all of our speculation that the "related security enhancements" Ken referred to in his keynote address at the Consumer Genetics Conference last month would interfere with the data's usability. When I inquired further about the future availability of segment data, he said that he cannot promise anything in that regard, but was open to discussing what presentation formats of that data might be acceptable to genetic genealogists.
Admixture and Reasonable Assumptions
Some have also interpreted Ken's statement (reported by Esquire) that some customers are using their own knowledge to make reasonable assumptions that are leading to incorrect conclusions, to mean that AncestryDNA is using an altogether different method of determining our matches than the other two companies offering autosomal DNA tests for genealogy. He explained to me that what he was actually referring to was that many customers are assuming that, because autosomal DNA matching is only applicable to relatively recent ancestry, the admixture results also reflect that time period and should match what we know of our ancestral origins from our family trees. He emphasized that AncestryDNA's "Genetic Ethnicity" feature (like any admixture tool) is not looking at the large segments used for relative matching, but rather is examining much smaller blocks and single markers that are ancestrally informative. Therefore, some of this admixture is very old - offering a glimpse much further back in time than our known family trees. He offered reassurance to those who feel that this portion of the test is not yet as accurate as it should be (me included):
AncestryDNA is data-driven. Our team of scientists are constantly analyzing the data looking for ways to improve the ethnicity and matching prediction algorithms. The science, and hence the customer experience, is only going to improve with time.
At some point, the Sorenson data will likely be incorporated into the AncestryDNA test, which should improve the admixture predictions tremendously.
Even the CEO is working with his matches!
In closing, it was very nice to hear that "everyone from the CEO down is working with their matches" at Ancestry.com. This should lead to a management team that is educated about what we as genetic genealogists are trying to accomplish and how best to do it. As a result, I look forward to improved tools and results at AncestryDNA in the future.
I want to thank Ken Chahine and Stephen Baloglu for their recent efforts to shed some light on aspects of the AncestryDNA test and clear up some of the confusion. As I told both of them, the more transparency that AncestryDNA can offer to the genetic genealogy community, the more satisfied we will all be with the product. According to Ken, it is likely that more details will be revealed soon. That is a very good thing because as I was writing this, I thought of many more questions for him!