Saturday, December 7, 2013

23andMe Releases a Sample of Their New V4 File: First Look and Analysis

23andMe has released a sample file from their new v4 chip. This is a 100% custom chip with hand-selected SNPs. In a departure from the other two companies offering autosomal DNA testing to the genealogy community, they are now using the Illumina iSelect chip as the foundation instead of the Illumina OmniExpress chip. On the new v4 chip, the total number of autosomal DNA and X DNA SNPs has decreased substantially, while the total number of mtDNA SNPs and Y-SNPs has increased.

Since I heard about this new chip, I have been concerned about the impact it will have on the genetic genealogy community and, particularly, on the compatibility with 3rd party uploads and tools. If 23andMe customers are unable to take advantage of the extremely beneficial opportunity to upload their data into Family Tree DNA's Family Finder database and to make use of the wonderful tools at GEDmatch, that would be a huge loss for all of us. It would also be a shame if the usefulness of the Y-SNPs and mtDNA SNPs tested by 23andMe is reduced for our community. To try to determine if this will be the case, Dr. Tim Janzen has helped me to analyze the new file overall and Larry Vick has specifically analyzed the Y-SNPs.

The v4 chip currently has just over 602,000 total SNPs versus 967,000 total SNPs on the v3 chip. For the autosomal and X-SNPs, this chip is not as robust as the chip previously used by 23andMe or the chips used by Family Tree DNA and AncestryDNA. This is of great concern, but 23andMe has stated that they plan to impute a large number of SNP allele values from our results (see comments at http://blog.23andme.com/news/23andmes-new-custom-chip/), so hopefully Family Tree DNA and citizen scientists will be able to do the same to extract the most utility and compatibility from this platform. I am continually surprised and amazed by the resourcefulness of our community, so I remain optimistic.

This change in platforms was intended to help the company ramp up their processing capacity in conjunction with their massive marketing campaign to acquire one million customers and beyond. (They can run 24 samples on each v4 chip at once instead of 8 on each of the v3 chips.) Especially in light of 23andMe's recent decision to provide only ancestry-related interpretation and raw data, the v4 chip does not appear to be a beneficial development. In fact, it may result in additional loss of sales. Currently, the genetic genealogy and the DNAAdoption communities generally recommend first testing at 23andMe and then transferring the raw data into FTDNA's Family Finder database in order to be in two databases at a reduced price. If it turns out that FTDNA is unable to continue to accept 23andMe transfers, these recommendations will likely change. Loss of compatibility with GEDmatch would also have a very detrimental effect on the utility for those using the data for genealogical and admixture research.

The following analysis is in depth and intended for those who want the specific details of the changes. Instructions to download the v4 file will follow the analysis.
 
First let's look at the Y-SNPs.

Larry Vick tells us:
I compared my Y-SNP file I downloaded on 14 Aug 2012 to the Y-SNP file I downloaded today to see if my current file has any changes. There weren't any.  I then compared the (v4) file for Greg MENDEL that I downloaded today to a v3 file I downloaded for a friend (CRL) on 21 Mar 2013.

There are 2,329 Y-SNPs in the Greg MENDEL v4 file. CRL had 1,766 Y-SNPs in his v3 file. So the v4 file has 563 more SNPs than CRL's v3 file. Looking at the SNPs, there were 446 in CRL's v3 file that were not in the MENDEL v4 file. There were 1,009 in the MENDEL v4 file that weren't in the CRL v3 file. Of the 446 in the CRL v3 file that weren't in the MENDEL v4 file, 314 were no calls.

Of the 1,009 in the MENDEL v4 file that weren't in CRL's v3 file, 295 were no calls. When I compared the 1,009 in the MENDEL v4 file to my file I downloaded today, 494 were not in my file (although they could have been in past downloads but were removed prior to today's download). I have SNPs from v1, v2, and v3. Of the 515 that were in my file, 99 were no calls.

I compared the 494 that were not in my file to Adriano's file in the Y-Chromosome Comparison Project, and 268 were in his file. All but one of those 268 had i prefixes for the reference sequence number.

I then compared the 226 that weren't in Adriano's file to the ISOGG Y-SNP Compendium by position number (build 37), and 200 were in the ISOGG file. I created a list of those 200 with the first note field. The 26 that weren't in the ISOGG file included the one SNP with an rs number (rs5603911). The MENDEL file had 46 no calls for the 200 SNPs in the ISOGG file. All but one of the 226 that weren't in Adriano's file had i prefix reference sequence numbers.
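
For readers who want to repeat this kind of file-to-file comparison, here is a minimal Python sketch. It assumes the raw data is tab-separated with `#` comment lines and four columns (identifier, chromosome, position, genotype), and that a no-call is written as dashes; the function names and the tiny sample data are made up for illustration and are not the method Larry used.

```python
import io

def read_snps(handle, chromosome="Y"):
    """Return {identifier: genotype} for one chromosome of a raw data file."""
    snps = {}
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        ident, chrom, _position, genotype = line.split()[:4]
        if chrom == chromosome:
            snps[ident] = genotype
    return snps

def is_no_call(genotype):
    """Assumption: no-calls are written as '-' or '--'."""
    return genotype in ("-", "--")

def compare(old, new):
    """Count SNPs unique to each file and how many of those are no-calls."""
    only_old = old.keys() - new.keys()
    only_new = new.keys() - old.keys()
    return {
        "shared": len(old.keys() & new.keys()),
        "only_old": len(only_old),
        "only_old_no_calls": sum(is_no_call(old[i]) for i in only_old),
        "only_new": len(only_new),
        "only_new_no_calls": sum(is_no_call(new[i]) for i in only_new),
    }

# Made-up stand-ins for a v3 and a v4 download
v3_text = "# rsid\tchromosome\tposition\tgenotype\nrs1\tY\t100\tA\nrs2\tY\t200\t--\nrs3\t1\t300\tAG\n"
v4_text = "# rsid\tchromosome\tposition\tgenotype\nrs1\tY\t100\tA\ni500\tY\t400\t--\ni501\tY\t500\tC\n"

summary = compare(read_snps(io.StringIO(v3_text)), read_snps(io.StringIO(v4_text)))
print(summary)
```

This sketch matches on the identifier only. A comparison against a list keyed by position, such as the build 37 positions mentioned above, would need the position column as the key instead.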


Now let's look at the overall composition of the new v4 chip as compared to the other platforms.

Tim reviewed the v4 file and compared it to v2, v3, and FF files. The following are some general statistics on which he based his analysis.

atDNA SNPs
v2: 
561,846 atDNA SNPs in 23andMe v2 data in a 2009 download.
515 i atDNA SNPs in 23andMe v2 data in a 2009 download.
556,787 atDNA SNPs in 23andMe v2 data in a fresh download.
758 i atDNA SNPs in 23andMe v2 data in a fresh download.

v3:
930,381 atDNA SNPs in 23andMe v3 data in a fresh download.
7,455 i atDNA SNPs in 23andMe v3 data in a fresh download.

v4:
577,382 atDNA SNPs in 23andMe v4 data.
41,855 i atDNA SNPs in 23andMe v4 data.

Family Finder:
708,092 atDNA SNPs in Family Finder data in general, but a fresh download only had 707,269 SNPs in it.


Y-SNPs
v2:
1880 Y SNPs in 23andMe v2 data in a fresh download. Of these 213 are i SNPs.

v3:
1766 Y SNPs in 23andMe v3 data in a fresh download. Of these 232 are i SNPs.

v4:
2329 Y SNPs in 23andMe v4 data. Of these 526 are i SNPs.


X-SNPs
v2:
13,876 X SNPs in 23andMe v2 data in a 2009 download. Of these 19 are i SNPs.
13,828 X SNPs in 23andMe v2 data in a fresh download. Of these 96 are i SNPs.

v3:
26,007 X SNPs in 23andMe v3 data in a fresh download. Of these 1006 are i SNPs.

v4:
19,487 X SNPs in 23andMe v4 data in a fresh download. Of these 4227 are i SNPs.

Family Finder:
18,022 X SNPs in Family Finder build 37 data in a fresh download.


mtDNA
v2:
2019 mtDNA SNPs in 23andMe v2 data in a fresh download. Of these 1572 are i SNPs.

v3:
2459 mtDNA SNPs in 23andMe v3 data in a fresh download. Of these 2016 are i SNPs.

v4:
3154 mtDNA SNPs in 23andMe v4 data. Of these 2681 are i SNPs.
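
Counts like these can be reproduced by tallying each record by chromosome type and checking whether its identifier starts with "i". The sketch below is one way to do that in Python; the chromosome labels (`1` to `22`, `X`, `Y`, `MT`) and the sample records are assumptions for illustration, not Tim's actual procedure.

```python
from collections import Counter

AUTOSOMES = {str(n) for n in range(1, 23)}

def category(chrom):
    """Map a chromosome label to the groupings used in the statistics above."""
    if chrom in AUTOSOMES:
        return "atDNA"
    if chrom in ("X", "Y"):
        return chrom
    if chrom in ("MT", "M"):
        return "mtDNA"
    return "other"

def tally(records):
    """records: iterable of (identifier, chromosome) pairs.

    Returns two Counters: all SNPs per category, and i SNPs per category.
    """
    totals, i_totals = Counter(), Counter()
    for ident, chrom in records:
        cat = category(chrom)
        totals[cat] += 1
        if ident.startswith("i"):
            i_totals[cat] += 1
    return totals, i_totals

# Made-up sample records
records = [("rs1", "1"), ("i2", "22"), ("rs3", "X"), ("i4", "Y"), ("i5", "MT"), ("rs6", "MT")]
totals, i_totals = tally(records)
print(dict(totals), dict(i_totals))
```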


Here are his comparisons between the various platforms.

atDNA:
  • 453,854 atDNA SNPs in 23andMe v4 data are also found in 23andMe v2 data in a 2009 download. Of these SNPs, 419 are i SNPs.
  • 453,357 atDNA SNPs in 23andMe v4 data are also found in 23andMe v2 data in a fresh download. Of these SNPs, 546 are i SNPs.
  • 509,630 atDNA SNPs in 23andMe v4 data are also found in 23andMe v3 data in a fresh download. Of these SNPs, 6,153 are i SNPs.
  • 304,864 atDNA SNPs in 23andMe v4 data that have rs numbers are also found in Family Finder in a fresh download.
  • I then took the 41,855 i atDNA SNPs in 23andMe v4 data and checked for matching positions in the Family Finder data. I found 2,556 i atDNA SNPs in 23andMe v4 data that had matching positions in the Family Finder data.
  • Assuming that all of those matching positions correspond to the same SNP in the Family Finder data, there are a maximum of 307,420 atDNA SNPs in 23andMe v4 data that are also found in Family Finder in a fresh download.
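
The two-step matching in the last three bullets (rs numbers first, then position for the i SNPs) can be sketched in Python as follows. The tuple layout, the function name, and the sample data are hypothetical, and position matching only makes sense if both files use the same genome build.

```python
def overlap(v4, ff):
    """v4, ff: lists of (identifier, chromosome, position) tuples.

    Returns (rs_matches, i_position_matches, upper_bound).
    """
    ff_ids = {ident for ident, _, _ in ff}
    ff_positions = {(chrom, pos) for _, chrom, pos in ff}
    # Step 1: SNPs with rs numbers are matched on the identifier
    rs_matches = sum(
        1 for ident, _, _ in v4 if ident.startswith("rs") and ident in ff_ids
    )
    # Step 2: i SNPs have no rs number, so fall back to chromosome and position
    i_matches = sum(
        1 for ident, chrom, pos in v4
        if ident.startswith("i") and (chrom, pos) in ff_positions
    )
    # Upper bound: assumes every position match really is the same SNP
    return rs_matches, i_matches, rs_matches + i_matches

# Made-up sample data
v4 = [("rs10", "1", 1000), ("rs11", "1", 2000), ("i900", "2", 3000), ("i901", "2", 4000)]
ff = [("rs10", "1", 1000), ("rs77", "2", 3000), ("rs12", "3", 5000)]
result = overlap(v4, ff)
print(result)

# The same sum with the figures from the post
post_upper_bound = 304_864 + 2_556
```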

Y-SNPs:
  • 979 Y DNA SNPs in 23andMe v4 data that are also found in 23andMe v2 data in a fresh download. 
  • 1320 Y DNA SNPs in 23andMe v4 data that are also found in 23andMe v3 data in a fresh download.
  • This means that there are 1009 Y SNPs found on the v4 chip that aren’t found on the v3 chip
  • There are 563 more Y SNPs in v4 data than in v3 data. 

X-SNPs:
  • 11,070 X SNPs in 23andMe v4 data that are also found in 23andMe v2 data in a fresh download. 
  • 11,009 X SNPs in 23andMe v4 data that are also found in 23andMe v2 data in a 2009 download. 
  • 14,437 X SNPs in 23andMe v4 data that are also found in 23andMe v3 data in a fresh download. 
  • 7,513 X DNA SNPs in 23andMe v4 data that are also found in Family Finder in a fresh download.

mtDNA:
  • 1698 mtDNA SNPs in 23andMe v4 data that are also found in 23andMe v2 data in a fresh download. 
  • 2208 mtDNA SNPs in 23andMe v4 data that are also found in 23andMe v3 data in a fresh download. 
  • This means that there are 946 mtDNA SNPs found on the v4 chip that aren’t found on the v3 chip
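
The "not found on the v3 chip" figures in the Y-SNP and mtDNA lists follow from simple subtraction, and they agree with Larry's counts earlier in the post. A short Python check using only numbers quoted above:

```python
def only_in(total, shared):
    """SNPs present in one file but not in the shared set."""
    return total - shared

y_v4, y_v3, y_shared = 2329, 1766, 1320
mt_v4, mt_shared = 3154, 2208

y_only_v4 = only_in(y_v4, y_shared)     # Y-SNPs on v4 but not v3
y_only_v3 = only_in(y_v3, y_shared)     # Y-SNPs on v3 but not v4
y_net_gain = y_v4 - y_v3                # net change in Y-SNP count
mt_only_v4 = only_in(mt_v4, mt_shared)  # mtDNA SNPs on v4 but not v3

print(y_only_v4, y_only_v3, y_net_gain, mt_only_v4)
```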

Tim's preliminary conclusions
The fact that there are only 307,420 atDNA SNPs in 23andMe v4 data that are also found in Family Finder is highly concerning. The specificity of matches when comparing v4 data to FF or AncestryDNA data will be significantly reduced in projects such as my Mennonite autosomal project. atDNA SNP coverage for the overlapping SNPs between v4 data and FF or AncestryDNA data will only be about 44 SNPs per cM. I don’t know if FTDNA and GEDmatch will be able to allow imports of 23andMe v4 data. The fact that there are about 130,000 more atDNA SNPs in a Family Finder dataset than in v4 and the fact that v4 data won’t be readily uploadable to GEDmatch are forcing me to rethink 23andMe as my primary testing lab for distant relatives.
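
The density and difference figures in these conclusions are straightforward ratios and subtractions. The post does not say which genetic map length was used for the per-cM figure, so the Python sketch below treats the map length as a parameter and only shows what length the quoted 44 SNPs per cM would imply; the function names are made up for illustration.

```python
def snps_per_cm(snp_count, map_length_cm):
    """Average SNP density over a genetic map of the given length."""
    return snp_count / map_length_cm

def implied_map_length(snp_count, density):
    """Map length (in cM) that would give the stated density."""
    return snp_count / density

shared_v4_ff = 307_420  # maximum v4 / Family Finder overlap from the post
implied_cm = implied_map_length(shared_v4_ff, 44)

ff_minus_v4 = 708_092 - 577_382  # Family Finder atDNA SNPs minus v4 atDNA SNPs

print(round(implied_cm), ff_minus_v4)
```

With a different map length, the same function gives a different density, so the 44 figure should be read together with whatever map Tim had in mind.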


Third Parties and Download
My sincere hope is that Family Tree DNA and GEDmatch are able to adjust their systems to work with this new data from 23andMe. As soon as I hear anything, I will be sure to report it.

If you are interested in analyzing the file yourself and you have a 23andMe account, these are the directions for downloading the sample v4 file from 23andMe:

Enable the Mendel family in your account here:
https://www.23andme.com/user/edit/examples/

Then select Greg Mendel's raw data file from the drop-down list here:
https://www.23andme.com/you/download/

This file will probably still change slightly as they complete their validation process, but it should be pretty close to what we will start to see for new customers at 23andMe in the coming weeks.

Please let me know what you think. I am especially interested to hear analysis from the citizen scientists and the creators of our community's third party features. 

12 comments:

  1. This certainly sounds disheartening; not the good news to say the least, but expected knowing at some point 23andMe would upgrade to a V4 platform.

    Personally I doubt FTDNA will allow a V4 transfer since they do not allow a V2 transfer now and only 307,420 SNPs are in-common between the two files according to this post.

    I remember the days when V2 files compared to FTDNA files in gedmatch produced some interesting results and caused too many assumptions of matching, later to be dispelled in a like for like comparison when V3 files arrived.

    It is not impossible to compare 307K SNPs in two files and find IBD segmentation yet it is certainly not as ideal as using over double that number. 307K is a little too low in my opinion but then I have not tried to delve deeply into using that low of a SNP count in two files.

    This will be an opportunity for 3rd party developers to advance their tools used to export 23andMe data and map matching segment information on chromosome maps.

  2. More bad news? In light of the recent developments with the FDA and 23andme, and now reading this, this may have a huge impact on the DNAadoption community if transfers from 23andme to FTDNA or uploads to Gedmatch are no longer doable.

    We at DNAadoption.com have previously recommended testing first at 23andme, then upload the raw data to FTDNA and Gedmatch, in order to find the most matches in the 3 databases for those seeking unknown ancestors.

    Until we have more answers from FTDNA and Gedmatch on upload capabilities, we may have to adjust the recommendations by DNAadoption.com as to the best route to take for testing.

  3. Compatibility isn't the only concern. SNP density is a large concern. How many of the new SNPs will 23andMe be using for Relative Finder/DNA R and how many are intended for scattershot medical research? Low densities mean that segments have to be longer to meet the SNP thresholds. 23andMe may have thought the cap on matches and rapidly growing database would make the density restrictions on segment size moot because everyone's smaller segment matches would never show up anyway. Also, Countries of Ancestry continues to be of less use.

    There are serious issues at all three companies but I continue to believe that AncestryDNA will be the best long term for adoptees. They need to convert to cM. The longer they wait the worse it will be when they eventually have to make the move.

    Replies
    1. KS Rose, AncestryDNA has major issues of its own. You only get full benefit if you are a paying member. My friend dropped his sub after he could barely access it for over 3 months. He lost a large amount of functionality after the sub ended. Requiring a paid sub to get full use is not a good ploy on Ancestry's part. About half of his matches don't have a tree or only have a private tree. He learned from others it is a waste of time to contact them. Fewer than 100 are solid matches in the 3rd - 4th cousin range; the rest are 5th+ cousins with a 50% or lower confidence level of being a true match. Have a problem with AncestryDNA? Good luck getting customer support's help.

      I agree with the parts about SNP density. They are testing 1/3 to 2/5 fewer SNPs and it has to affect the match rate in a major way. We were going to test with all three companies, but now are holding off from 23andme until they come out with a v5 chip or address the issue in a better manner. We do not care about the medical results since Promethease, Live Wello and several other places offer better health results than 23andme's old ones.

  4. I'm looking for the best place to have DNA tests done for myself and 2 family members. I was leaning toward 23andMe, but have been reading about the concerns with their new v4 chip. I am somewhat interested in possible matches for myself and planned to upload my data from "23" to FTDNA, but recently read on the FTDNA website that they can't accept data from the v4 chip. My adopted son is wanting to know his ethnicity, and '23' seemed to be the best place for that. But with the v4 chip I'm wondering if the profile for his recent ethnicity will be as good as what they were getting with the v3. I wonder if I should just put this on hold until things become clearer regarding the v4 chip. However my son is turning 40 this year and all of his life he has wondered what 'he is'. He was born in Vietnam during the war years and came to us on the 1975 babylift with absolutely no background info. He doesn't look typically Vietnamese, yet it's unclear what his other ethnicity/ethnicities are. I've quipped that he is "everyman". I would appreciate any further info or advice regarding using 23andMe now that their testing has changed. I'm also wondering if anyone who previously tested with them has tested again with the v4 and if so, if the results were the same.

  5. I just tried the transfer to FTDNA with the new V4 chip, and it's not compatible. I'm hoping to get a refund from FTDNA because of this but haven't heard back from them yet.

  6. randers, I hope you get your money refunded. I've heard from a 23andMe rep. who verified that their v4 chip data isn't compatible with FTDNA. She explained that her company is using the v4 because they believe it is the best testing mechanism for their customers. (I wonder if it will be used by FTDNA eventually.)

  7. FTDNA refunded my money (less the credit card company cut, which ended up being $1). It took about 10 days to receive the refund.

  8. I'm a little late getting my two cents in on this issue. I was analyzed at all three services, 23andMe, FTDNA and Ancestry.com as well as Geno 2.0. I have already transferred my Geno2.0 and 23andMe data to FTDNA and am very happy with the results. I don't intend to get analyzed on the V4 chip. It would cost $69 to do the transfer and a complete Family Finder analysis is only $30 more. It seems a no brainer...just pay the $99. The only downside is that you won't have the 23andMe database available. On the other hand at 23andMe the response to share genomes is so low that you don't lose that much by going over to FTDNA for the full monte.

  9. I'm a little late to this blog but I have had my DNA analyzed by 23andMe, FTDNA, Ancestry.com and Geno 2.0 and I have transferred my Geno 2.0 and 23andMe data to FTDNA and enjoy the better tools and easier contact means at FTDNA. In addition DNAgedcom.com ICW tool is invaluable in sorting our cousin triangulation but works only with FTDNA downloads (including v3 results uploaded to them) to date. True, 23andMe has the largest database but the sharing response is so low that it's not that much of an advantage. However, that may change with the influx of new users who are orientated toward the genealogy aspect of DNA.
    I don't intend to move to the V4 chip anytime soon so I will keep your blog under surveillance. Thanks for your analysis of the new data sets from 23andMe. For $30 more you can have FTDNA do a fresh analysis and still have the V4 data at 23andMe.

  10. What are i SNPs, is it Haplogroup I SNP?
