Thursday, September 20, 2012

Let's All Start Using Terminal SNP Labels Instead of Y Haplogroup Subclade Names, Okay?


Is it just me or have the subclade names for Y-DNA just gotten out of control? I work with DNA all day long and I can't even keep up with all of the changes, so I have decided to start using the Terminal SNP labels exclusively. May I gently suggest that you do so also?

I frequently receive emails from otherwise well-informed people asking what their Y-DNA haplogroup subclade means, and it isn't their fault they are confused. You see, if they try to Google it, they are often unable to find information. If they try to locate academic papers on it, they are usually unsuccessful. Why is this? Well, the subclade name that they are given by their testing company may not be the same name that another testing company uses, or even the same as it was when they were first assigned it...and, quite likely, it isn't the same as the one on the most up-to-date tree at the International Society of Genetic Genealogy.

I have to admit that when R1b1b2 was changed to R1b1a2, I just started saying "R1b...whatever" when referring to it. Isn't it easier to just remember the defining SNP name R-M269?

For example, if you are R-L21+, then according to Family Tree DNA's Haplotree, you are R1b1a2a1a1b4, the ISOGG 2011 Haplogroup Tree's name for it. At 23andMe, you are R1b1b2a1a2f, in agreement with the 2010 ISOGG Haplogroup Tree. If you tested in 2008, you might still think you are R1b1b2a1b6. On ISOGG's 2011 Haplogroup Tree, L21+ was R1b1a2a1a1b4, but on ISOGG's 2012 Haplotree, you are R1b1a2a1a1b3. Apparently, R1b1a2a1a1b4 now refers to L238/S182! I mean, really, how can anyone keep track? (Ah, for the days of the simple little R tree.) I don't know how our ISOGG Haplogroup Tree Committee* does it anymore! Apparently, the academics are getting tired of it too, and it's just going to get worse when results from Geno 2.0 start rolling in with LOTS of new SNPs and subclades being defined.

Take a look at the history of R-M222, the "Ui Niall Subclade", on just the ISOGG SNP Tree:
2007 = R1b1c7
2008 = R1b1b2a1b6b
2009 = R1b1b2a1a2f2
2010 = R1b1b2a1a2f2
2011 = R1b1a2a1a1b4b
2012 = R1b1a2a1a1b3a1a1
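
To make the contrast concrete, here is a minimal Python sketch built only from the list above. The function names are my own invention for illustration and do not come from any testing company's or ISOGG's software. It counts how many different longhand names M222 has carried, next to the one short label that never changed.

```python
# ISOGG longhand names for the M222 subclade, keyed by tree year
# (values copied from the list above).
M222_NAMES_BY_YEAR = {
    2007: "R1b1c7",
    2008: "R1b1b2a1b6b",
    2009: "R1b1b2a1a2f2",
    2010: "R1b1b2a1a2f2",
    2011: "R1b1a2a1a1b4b",
    2012: "R1b1a2a1a1b3a1a1",
}


def terminal_snp_label(haplogroup, snp):
    """Build the short label, e.g. haplogroup 'R' and SNP 'M222' give 'R-M222'."""
    return f"{haplogroup}-{snp}"


def distinct_longhand_names(names_by_year):
    """Return the distinct longhand names in the order they first appeared."""
    seen = []
    for year in sorted(names_by_year):
        name = names_by_year[year]
        if name not in seen:
            seen.append(name)
    return seen


label = terminal_snp_label("R", "M222")
history = distinct_longhand_names(M222_NAMES_BY_YEAR)
print(f"{label} has had {len(history)} different longhand names since 2007")
```

Only 2009 and 2010 share a name; every other year forced people to relearn what they were called.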

Reportedly, Geno 2.0 will define at least three new subclades beneath M222, but I hear it may be more. Do you think those subclade names might get even longer?

The R Haplogroup Tree is definitely the worst, but the problem is starting to affect other haplogroups too. At FTDNA, my dad is I2b1. Same at 23andMe. Sounds simple, right? Not anymore! The subclade name was recently changed to I2a2a on the ISOGG 2012 tree. I am so confused! This was one subclade name that I felt very comfortable with. I think I will just learn to call it I-M223 from now on. (I'll just ignore that his brother recently tested Z2062+, which isn't even on any of the trees yet!)

Actually, there is some rhyme and reason to these discrepancies, so let me share it for those of you who have no idea what I am writing about. FTDNA last updated their Y haplogroup tree in 2010 and 23andMe in 2011.* So, they are going by the subclade names that were recognized at those times. In contrast, the ISOGG Haplogroup Tree has been updated over 60 times JUST THIS YEAR! Every time a new SNP is discovered that is upstream of a known SNP (which is happening faster and faster all the time), it has to be inserted into the tree, thus changing the naming pattern of the subclades downstream of it. This is why it is so much simpler to just learn the Terminal SNP label.
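
For the programmers among you, here is a toy Python sketch of why an upstream insertion renames everything below it. It is a deliberately simplified model, not ISOGG's actual procedure: I am assuming that longhand names are built by walking down the tree and appending alternating numbers and letters, and the SNP names (snpTop, snpA, snpB, snpC, snpNew) are hypothetical placeholders, not real markers.

```python
def longhand_names(tree, root_snp, root_letter):
    """Assign toy longhand names by walking down from the root.

    tree maps each SNP to the ordered list of SNPs directly below it.
    Levels alternate between numbers (1, 2, ...) and letters (a, b, ...).
    """
    names = {root_snp: root_letter}
    pending = [(root_snp, 1)]  # (parent SNP, depth of its children)
    while pending:
        parent, depth = pending.pop()
        for position, child in enumerate(tree.get(parent, [])):
            if depth % 2 == 1:
                suffix = str(position + 1)         # odd levels: 1, 2, 3...
            else:
                suffix = chr(ord("a") + position)  # even levels: a, b, c...
            names[child] = names[parent] + suffix
            pending.append((child, depth + 1))
    return names


# Hypothetical SNP names; this is not a real branch of the tree.
before = {"snpTop": ["snpA"], "snpA": ["snpB", "snpC"]}
# A newly discovered SNP is inserted upstream of snpA.
after = {"snpTop": ["snpNew"], "snpNew": ["snpA"], "snpA": ["snpB", "snpC"]}

names_before = longhand_names(before, "snpTop", "R")
names_after = longhand_names(after, "snpTop", "R")

for snp in ("snpA", "snpB", "snpC"):
    print(f"R-{snp}: {names_before[snp]} -> {names_after[snp]}")
```

In this toy tree, the man who tested positive for snpC goes from "R1b" to "R1a2" without a single base of his DNA changing, while his short label R-snpC stays put.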

The ISOGG Haplogroup Tree is a tremendous resource that anyone who is doing Y-DNA research should be utilizing. It helps to keep things straight by giving the various names of the SNPs that are being used by different companies and labs. When two or more SNPs are identical, meaning that they are at the same place on the Y-haplogroup tree with the same mutation, ISOGG shows the names in a series separated by "/". For example, let's look at M173/P241/Page29. M173 comes from Peter Underhill's lab at Stanford, P241 comes from Michael Hammer's lab at the University of Arizona, and Page29 comes from the Page lab at the Whitehead Institute for Biomedical Research. They appear in academic publications with these names, and ISOGG lets you know that they are identical SNPs. That way, if you are Googling or looking for academic papers about your SNP, you know to try those other names too.

Going back to my first example of R-L21+, Wikipedia states, "R1b1a2a1a1b4 (R-L21) is defined by the presence of the marker L21, also referred to as M529 and S145." The label L21 comes from Thomas Krahn's FTDNA lab in Houston, M529 comes from Peter Underhill's lab at Stanford, and S145 comes from Jim Wilson's lab at the University of Edinburgh. ISOGG shows the SNP as L21/M529/S145. The bottom line is that if you test positive for L21, M529 or S145 at any company, your assignment is in an identical place on the Y-DNA tree, so the significant factor is the SNP name, not the subclade name.
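
If you keep your own notes or spreadsheets, the "/" convention is easy to put to work. The following is a minimal Python sketch, assuming you have copied the synonym series by hand from the ISOGG tree; the two series shown are the ones discussed above, and the function names are hypothetical, made up for this example.

```python
# Synonym series written the way ISOGG shows them, separated by "/".
SYNONYM_SERIES = [
    "M173/P241/Page29",
    "L21/M529/S145",
]


def build_synonym_index(series_list):
    """Map every SNP name (upper-cased) to the full list of its equivalents."""
    index = {}
    for series in series_list:
        names = series.split("/")
        for name in names:
            index[name.upper()] = names
    return index


def equivalent_snps(snp, index):
    """Return all known names for a SNP; a trailing '+' result marker is ignored."""
    key = snp.strip().rstrip("+").upper()
    return index.get(key, [snp])


index = build_synonym_index(SYNONYM_SERIES)
print(equivalent_snps("S145+", index))
```

Someone who tested S145+ would get back all three names, each of which is worth trying as a search term.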

Of course, with so many new SNPs being discovered and assigned to the tree, there will likely be a certain amount of continuing confusion among those of us doing Y-DNA research for the time being, but I hope you will all consider joining me in taking the next step forward in the evolution of Y-DNA research in genetic genealogy and stop trying to remember those mind-bending subclade names! And, while we're at it, let's give our ISOGG Y-Haplotree Committee a well-deserved virtual pat on the back too!



*History:
The Y Chromosome Consortium (YCC) is a cooperative association of geneticists, led by Dr. Michael Hammer, who first published the paper "A Nomenclature System for the Tree of Human Y-Chromosomal Binary Haplogroups" in 2002, introducing the modern haplogroup nomenclature of Y-DNA. The tree was subsequently revised in 2003 by Mark A. Jobling and Chris Tyler-Smith in another paper, "The human Y chromosome: an evolutionary marker comes of age". Next, Family Tree DNA created the 2005 Y-Chromosome Phylogenetic Tree, which was the first online tree and was only available to their customers. Soon thereafter, ISOGG created the first public online tree in 2006. Tatiana Karafet of Dr. Hammer's lab (and others) published a paper further refining the Y chromosome tree in 2008, "New binary polymorphisms reshape and increase resolution of the human Y chromosomal haplogroup tree". As a result, both FTDNA and ISOGG updated their trees at that time. Then in 2010, FTDNA came out with a YCC-sanctioned tree which was distributed at the FTDNA conference and, as a result, ISOGG promptly did a major update to stay in alignment with the YCC. Since then, no updates have come from the YCC. Undaunted, the ISOGG Y Haplogroup Tree Committee has continued to add information as it becomes available from various sources, and its tree is now the most up-to-date source of this information. In November 2011 at the FTDNA Project Administrator's Conference, Spencer Wells of National Geographic, Michael Hammer of the University of Arizona, Thomas Krahn and Bennett Greenspan of FTDNA and Alice Fairhurst of ISOGG reportedly agreed to all stay in alignment with the most current Y-DNA nomenclature to the best of their abilities. As always, there is new research that has not yet become public. As it is released, ISOGG will again align its tree with the most current information and will continue to add updates as they become available.
With the upcoming launch of Geno 2.0, the ISOGG Committee will have their work cut out for them! Current ISOGG members who work with the tree and deserve our great appreciation are: Coordinator: Alice Fairhurst. Design team: Tanmoy Bhattacharya, Tom Hutchison, Richard Kenyon, Doug McDonald. Content experts: Abdulaziz Ali, Whit Athey, Ray H. Banks, Katherine Hope Borges, Aaron R. Brown, Phil Goff, Gareth Henson, Tim Janzen, Bob May, Eugene Matyushonok, Lawrence Mayka, Charles Moore, Ana Oquendo Pabon, Marja Pirttivaara, David Reynolds, Bonnie Schrack, Vince Tilroe, Aaron Salles Torres, Steve Trangsrud, Ann Turner and David Wilson.

24 comments:

  1. Yes to simplification. BUT, has anyone tried to devise a scheme based on information I care about as a genealogist? Which lab identified the SNP is not information of significance outside a small scientific group.

    Replies
    1. JDR - To me the greatest usefulness of identifying the names of the SNPs from different labs is to help those who are looking for more information (like papers) on the SNPs that *might* be relevant to their family history research and/or may have tested at a lab that uses a less common identifying SNP name. For instance, if someone tests at Jim Wilson's lab "Britain's DNA", they may not realize that their S145+ SNP is equivalent to the better known L21+. Since we know a lot about L21's origins, this information could be helpful to their genealogy.
      As far as genealogical usefulness on a broader scale, hopefully, we are getting close to the time when a lot more of these SNPs will start to have genealogical relevance. Thanks for your insightful comment!

  2. I agree, CeCe. The R tree is the absolute worst and, true, other trees are also affected. In 4 years, I've gone from I1a to I1 to I1* to I1 to I1f to I1f1a to I1a3a1a. But during the last two years, the terminal SNP has changed as well, and with the National Genographic test, expect more new SNPs.

    I was terminal I-M253, then I-Z58, then I-Z62 and now I-Z140. It is easier to remember this way. An even easier way would be to name the haplogroups, but with so many cropping up, that may be difficult to do as well.

    Ken Nordtvedt had already determined sub-clades in the I1 family based on common STRs. He called many of these AS (for Anglo-Saxon) with a number. Most of these lined up nicely with the newly discovered SNPs - so that was a successful endeavor.

    One advantage of the current nomenclature is that you can at least trace the ancestry of the haplogroup.

    While simplicity is sought and needed, it will require some forethought on how to simplify this whole mess.

    Jim

  3. Thank you for bringing this to the front. "Apparently, the academics are getting tired of it too ..." I can confirm this. At the DNA in Forensics 2012 meeting this topic was discussed, and Chris Tyler-Smith in his talk also made the proposal to use abbreviated names:
    R1b-M269
    I don't know if he uses this form in his new paper, but I like it more than R-M269, because if you are not an expert on a haplogroup you can more easily recognize a label like J2a-Z2324 than J-Z2324.
    For deep branches it would also be good to have information about the main SNP divider, for example: R1b-U152-Z2371, R1b-L21-Z2431. Just some thoughts of mine...
    Chris Tyler-Smith also said that his thousands of new SNPs from 1k Genome and other data will receive just an rs-number, so many people will end up with a terminal SNP like rs16122246. The haplogroup name could then be R1b1a2a1a1a3b2b1a1b2 (current ISOGG), R-rs16122246, R1b-rs16122246 or R1b-M405/S21/U106-rs16122246. So the situation would not be much better even using the shortest form, R-rs16122246, for a personal terminal SNP.
    I think we will need more discussion about what the best solution is. We can try to give the rs-numbers a shorter name, but with thousands of them that will help only for important branches, and only until new deep sequencing results bring massive new updates. One idea of mine is this: once you can say that you have statistically, with high coverage, found the 999 most informative (near) terminal haplogroups for Europe, meaning that they all have roughly the same population size, you can number the defining SNPs with a system from 1-999. You will end up with, for example, ~40% R1b, ~25% I1, etc., and they all have a short SNP name and hopefully are not far from the terminal SNPs.
    But we obviously need a solution before we reach that situation (close to all significant Y-SNPs being discovered).

    Replies
    1. Chris, These are all terrific points you make. I think Tyler-Smith's proposal is probably the solution that will get the best response overall, but I like your 1-999 idea too. You're right, we do need to discuss this subject more before reaching a conclusion. Thanks for the very interesting comment.

    2. CeCe,
      At the DNA in Forensics conference in Innsbruck, Dr. Chris Tyler-Smith also made the point that if we continue using the current naming convention when we get to full genome sequencing and a much expanded Phylogenetic Tree, we will be obliged to use names with up to 38 characters. Of course this is unsustainable and unwieldy. I routinely use terminal SNPs and they are understood across ISOGG, 23andMe, FTDNA and Academia.
      They would also like to use better toolsets to manage the Phylogenetic Tree. Importing ISOGG into Excel is not an option.
      My suggestion is the following. We have been using GEDCOM for 20 years now in the Genetic Genealogy community and it works. It is designed to manage ancestry trees, and there is an abundance of software (free and commercial) for handling this and millions of trees managed in databases such as Ancestry, Geni, MyHeritage etc. Could we extend the GEDCOM standard to support the requirements of the Phylogenetic Tree and Academia?
      One of the benefits to the Genetic Genealogy Community will be the ability to link your Clan Structure to your Terminal SNP and

  4. The timing of this article couldn't be better. I spent hours today trying to understand the HG nomenclature and why companies use different classifications, why the classification change, how often, etc.

    Obviously I am a newbie.

    Could someone be so kind as to tell me what my 'terminal' is? 23&Me has me as R1b1b2a1a. Would I say my terminal is L52, P311, L11, P310?

    The ISOGG chart seems to stop at R1b1b (M335)

    Thank you!

    Replies
    1. Good questions. It looks to me like those SNPs are all synonymous, meaning they have not split from each other (so far), so technically any/all of them are your terminal SNP. Your 2012 ISOGG designation is R1b1a2a1a1. The major subclade previously called R1b1b2 changed to R1b1a2, which is why you couldn't find it on the tree. It is quicker to search for one of your SNPs on the page rather than your ever-changing subclade designation. I'm not sure which SNP I would use as your terminal SNP if you were to adopt the terminal SNP label. I would probably use the one of those four that is best known. Does anyone have an opinion on that? L11?

  5. 1. I agree with Chris Tyler-Smith on this format: R1b-L371. The idea is to keep it simple, consistent, informative, consumer friendly for newbies you are trying to connect with. Follow the principles of SEO - Search Engine Optimization.

    2. Where possible, embed a hyperlink "within" R1b-L371 such as: http://tech.groups.yahoo.com/group/RL371/ or if beside it use a shortened URL such as http://goo.gl/X70AE

    3. In the hyperlink, include other names for the SNP such as R1b-S300 and any earlier names.

    4. Include in the hyperlink, a country of origin such as Wales.

    5. Include in the hyperlink, the age of the SNP such as 1000 YBP and note if it is a Terminal SNP.

    6. Include in the hyperlink, some confirmed surnames related to the SNP such as Griffith, Pugh, etc.

    7. In the hyperlink, include the Y-STR Signature for R1b-L371 such as R-17-14-10 or DYS448=17; DYS456=14; DYS450=10

    8. In the hyperlink, include a URL for a SNP Predictor such as: http://www.rcasey.net/DNA/R_L21/SNP_Predictor/index.php

    9. In the hyperlink, tell newbies how to go about getting DNA tested. I am referring all newbies to the Geno 2.0 Test

    10. In the hyperlink, include correlated Autosomal SNPs; Autosomal AIMs or Chromosome regions which may relate to the Y-DNA SNP in question.

  6. I belong to a haplogroup that is uncommon in Europeans, so presently I don't have the problem, but I have for years been using the terminal or subclade SNP name of my choice. The nomenclature of both mitochondrial and Y chromosome haplogroups was doomed from the start by lack of planning and preparation, i.e., using letters, and the same letters in both mitochondrial and Y chromosome haplogroups. I don't have any sympathy at all for R1b men folks; you created your own demons by being so common in Europe, and by the self-absorbed obsession with the most insignificant minutiae of that haplogroup. Some haplogroups have been entirely neglected, haplogroup C for example, and all haplogroups that are not common in Europe.

    My irritation is that the phylogenetic process is so slow: SNPs are found and placed on hold, and little effort is made to work out where they fit in the haplogroup tree. ISOGG is years behind. Whole DNA scanning of the Y chromosome would help, instead of the piecemeal approach used by genetic researchers and DNA companies.

  7. Persons like Ponto suffer from Phobias.

    He says as much at http://opensnp.org/users/403 So, he has Phobias and they appear to be Genetically related to his "Uncommon Haplogroup(s)" Somebody needs to phase his 23andme data and see if it is from his Paternal Ancestors or his Maternal Ancestors. Perhaps we can also isolate the SNP in question and name that AIM, "G-Ponto". The "G" like the "M", "L", "S", etc in SNP names is for the person or lab who made that discovery.

    Apparently one of Ponto's Phobias is SNPophobia, a fear and loathing of more "Common Haplogroups". I am adding this newly discovered phobia to a list at http://phobialist.com/

    His SNPophobia can be shown in Ponto's own words.

    1. " I don't have any sympathy at all for R1b men folks; you created your own demons by being so common in Europe, and the self absorbed obsession with the most insignificant minutiae of that haplogroup."

    2. "African Americans are a mongrel group of people, not a race." "Think about it. What has the Pig done to Jews and Muslims? Graves said that the Pig was once their God pushed out by Yahweh/Jehovah/Allah." http://dienekes.blogspot.com/2009/08/john-hawks-on-anne-wojcicki-on-race.html

    So, when you see the name "Ponto" on blogger posts, think SNPophobia.

  8. Looks like FTDNA listened to you. This is what we can now read in the Y-haplotree section:

    "Long time customers of Family Tree DNA have seen the YCC-tree of Homo Sapiens evolve over the past several years as new SNPs have been discovered. Sometimes these new SNPs cause a substantial change in the “longhand” explanation of your terminal Haplogroup. Because of this confusion, we introduced a shorthand version a few years ago that lists the branch of the tree and your terminal SNP, i.e. J-L147, in lieu of J1c3d. Therefore, in the very near term, Family Tree DNA will discontinue showing the current “longhand” on the tree and we will focus all of our discussions around your terminal defining SNP.

    This changes no science - it just provides an easier and less confusing way for us all to communicate.

    Bennett Greenspan, Family Tree DNA
    Dr. Michael Hammer, University of Arizona"

    Replies
    1. Hi Itai,
      I don't think they listened to *me*, but they definitely must agree that the issue I addressed here is getting out of hand. Thanks for the heads-up.
      CeCe

  9. So, as a newbie, how do I determine my terminal SNP from an upstream haplogroup? I took Ancestry's DNA test years ago and have I1 and my marker values.

  10. I think using short terminal based SNP labels is a good thing for conversations. The old long labels are cumbersome and hard to read and remember.

    However, and I admit I'm biased towards R1b, I request that you consider using "R1b" instead of just "R" in conversations. For myself that'd mean I'm R1b-L21 instead of just R-L21. I know FTDNA will just use the single letter but it is a bit misleading.

    R1a and R1b are two large groups of heavily tested people. They, along with R2, are no more closely related than probably 20,000 years ago. They probably should have had separate single letters in the first place, but by happenstance of SNP discovery or early research, they got lumped together.

    R1b and R1a are common de facto standard terms in literature, forums and on the internet. Try googling R1b. Next try googling R. If we want to help newbies out, they do much more of their homework if they quickly figure out whether they are in R1b or R1a.

    Here is another problem caused by the single letter R. In a spreadsheet or on an analysis report, if you sort by haplogroup, R-L176.1 and R-L176.2 will come out right next to each other. A novice could easily expect them to be closely related. They are not: L176.1 is in R1a and L176.2 is in R1b, and the two are no more closely related than 20,000 years.

    BTW, we still need the long phylogenetically intelligent labels. It's just they are needed for data analysis, sorting, searching, totalling, etc. I agree that they are a bane for common conversation.

    Replies
    1. @MikeW - After all the very intelligent discussion on this subject since I wrote this post, I have come to agree with the view that using the major subclade rather than just the haplogroup is a superior idea, especially with Haplogroup R (i.e. R1b-L21 as you suggest). Thank you for your well-presented argument in favor of this solution.
      CeCe

    2. "Here is another problem caused by the single letter R. In a spreadsheet or on an analysis report, if you sort by haplogroup, R-L176.1 and R-L176.2 will come out right next to each other. A novice could easily expect them to be closely related. They are not: L176.1 is in R1a and L176.2 is in R1b, and the two are no more closely related than 20,000 years."

      If this is a concern, are you sure "R1b" is enough? Won't R1b-L1 and R1b-L2 likewise come out right next to each other? Wouldn't a novice be inclined to be fooled into thinking they are closely related, when in fact the former is in U106 and the latter in U152? Similarly, won't R1b-L20, R1b-L21, and R1b-L23 all come out next to each other? Clearly, "R1b" in the shorthand prefix does not offer enough protection to the novice.

      Obviously, we need an even longer prefix! But how long should it be?

    3. Anonymous, "R1b" is a de facto standard just as "R1a" is. Many papers have used this and it is not likely to go away. There has not been a consistent standard for the letters after "R1b". I agree that the long phylogenetic labels are gobbledygook and a bane to conversations.

      However, "R1b" is short, easy to remember and a de facto standard. I didn't arbitrarily pick those three letters. They've been out in the literature for a long time. I'm just looking for ways to make it easier for newbies to do a little self learning. Undoubtedly, that doesn't solve all of the problems though.

  11. Hello again CeCe,

    I was rereading your article, and I noticed the current sentence: "FTDNA last updated their Y haplogroup tree in 2010 and 23andMe in 2011."

    I am quite surprised by this, since 23andMe still categorizes me as G2a5 (L31). This category comes straight from the ISOGG Y-tree of 2008. However, as early as 2009, we can see in the new ISOGG tree that L31 was discovered to be not an independent clade after all but a synonym of P15 (G2a). As neither L223 nor L91 is successfully tested on 23andMe V3, I should be correctly categorized as simply G2a, and certainly not as G2a5, which is not only outdated but erroneous and very confusing to people there. The same happens with G2a4 (L32), also a category in ISOGG 2008, found to be a synonym of L30 (G2a3 in FTDNA, G2a1c on the current ISOGG tree).

    So I don't know if some parts of 23andme haplogroup designation were updated in 2011, but as far as the haplogroup G is concerned, the designations date from 2008... :(

  12. CeCe, this blog is very good. I have been following this topic for several months now on the World Family and Yahoo forums and I wish to add my two cents' worth. I am R1b>DF13 and have several SNPs in the works, so the terminal SNP could change tomorrow. The nomenclature is easy to update, but I think R-DF13 is too brief and R1b1b2... unnecessary. What I am worried about most is: will FTDNA keep listing the negative SNPs we have tested? I think these are important as well. The positives tell you where your family has been; the negatives tell you where your family probably has not been, at least not long enough to pick up new mutations. An oversimplification, yes, but I find the negatives almost as interesting. Plus it helps in going through other researchers' work and keeping track of this very interesting field. crw

  13. M222+?, + what I wonder for me so that's why I am doing it. But 2012's R1b1a2a1a1b3a1a1 +? is something I will never be able to remember. Hopefully being plain old H which is pretty boring will also be improved upon for me.

  14. Right Aussie, you get my point. I am waiting for the results of SNP Z255, and if it turns out negative then I am a DF13*. Then wait again. I am doing Geno 2.0 and maybe... I have done the Y-111 and am getting somewhere with the haplotype, but still have a long way to go with the haplogroup.
    I will try to find out more in Houston. CRW

  15. Some kind of conversion chart or online Y Haplogroup Subclade to terminal SNP Label generator might be helpful.
