Sunday, November 3, 2013

First Look at the Full Genomes Y-Sequencing Results from Itaï Perez

(**Warning - Advanced Content**)

Guest blogger Itaï Perez reviews his Full Genomes results for my readers:

For those wondering what the results from Full Genomes look like, here’s a first look.  

After a long wait while my kit was sequenced and analysed, I finally received an email from Full Genomes with an attached rar archive containing 9 files.

Almost all of these files are formatted plain text and can easily be converted to an Excel table (which I did).
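For readers who would rather script the conversion than do it by hand, here is a minimal Python sketch. It assumes the files are tab-delimited and that their description lines start with "#", as in the headers quoted later in this post. The function name `read_fgc_table` and the sample text are made up for illustration; they are not part of the Full Genomes delivery.

```python
import csv
import io

def read_fgc_table(text):
    """Split a report into its '#' description lines and its tab-delimited rows."""
    comments, data_lines = [], []
    for line in text.splitlines():
        if line.startswith("#"):
            comments.append(line[1:].strip())   # keep the description without the '#'
        elif line.strip():
            data_lines.append(line)             # skip blank lines, keep data rows
    rows = list(csv.reader(io.StringIO("\n".join(data_lines)), delimiter="\t"))
    return comments, rows

# Made-up sample text, not real results
sample = "#FGC mtDNA report\n#Variants with respect to RSRS\n123\tA\tG\n456\tC\tT\n"
comments, rows = read_fgc_table(sample)
```

The rows can then be written out with `csv.writer` or loaded into a spreadsheet.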

Here’s the description of the 9 files, as much as I understand it, one by one:

File #1 -  PrivateSNPs

This one is easy to understand. It is a list of all the private SNPs discovered in my sequencing. Here is the description found at the beginning of the file (which I removed when converting to Excel).

#based on 20131001 variantCompare analysis using PGP083013.filt.pyfilt.1kGfilt.vcf and ALL.1kG.samplelist.redo.sorted.paths.20130812.curated_pm.filt2.called.pyfilterCG2k.vcf reference files

And here is the file itself:

The columns are SNP name, position, ancestral base and derived base. The positions of these new SNPs have been removed from each image in order to give the Full Genomes team time to register and name them.

File #2 -   yknot

This is the only file that is not a table. This text file contains a tree that follows my positive SNPs from Y-Adam down to my most recent SNP, as defined in the ISOGG Y-tree.


File #3 - variantCompare

This file is more complex. Here is the description at the beginning of the file:

#FGC report: Analysis of Called Variants
#this report analyzes variants called as differing from the GRCh37 reference sequence
#for best viewing, open with tab-delimiting in a spreadsheet viewer
#reliability flag key: no flag: over 99% likely genuine; *: over 95% likely genuine; **: about 40% likely genuine; ***: about 10% likely genuine
#it is strongly suggested that results analysis be restricted to variants with zero or one asterisks
#citations for reference data include:
#            1000 Genomes Project: An integrated map of genetic variation from 1,092 human genomes, McVean et al., Nature 491, 56-65 (01 November 2012) doi: 10.11038/nature11632
#            Personal Genome Project: Ball, Madeleine P., et al. A public resource facilitating clinical use of genomes. Proceedings of the National Academy of Sciences 109.30 (2012): 11920-11927.
#            A High-Coverage Genome Sequence from an Archaic Denisovan Individual, Meyer et al. Science 338, 222-226 doi:10.1126/science.1224344
#            R. Drmanac, et. al. Science 327(5961), 78. [DOI: 10.1126/science.1181498]

GRCh37 is the Genome Reference Consortium human genome (build 37). I guess it is a reference genome similar to CRS or RSRS for mtDNA. This table lists all the variants (SNPs and INDELs) that differ from this reference. The fields are position, base change, rsID, SNP name, reliability and a list of the reference genomes which share this change. There are four successive sections: shared SNPs, private SNPs, shared INDELs and private INDELs.
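The header's advice to keep only variants with zero or one asterisks can be applied with a small filter. This is only a sketch: it assumes the reliability field holds nothing but the asterisks described in the flag key, and the names `FLAG_MEANING`, `describe_flag` and `keep_variant` are hypothetical.

```python
# Meanings copied from the reliability flag key in the file header
FLAG_MEANING = {
    0: "over 99% likely genuine",
    1: "over 95% likely genuine",
    2: "about 40% likely genuine",
    3: "about 10% likely genuine",
}

def describe_flag(flag):
    """Translate a reliability flag ('', '*', '**' or '***') into its meaning."""
    return FLAG_MEANING.get(flag.strip().count("*"), "unknown flag")

def keep_variant(flag):
    """Follow the header's suggestion: keep variants with zero or one asterisks."""
    return flag.strip().count("*") <= 1

# Made-up flags, for illustration only
flags = ["", "*", "**", "***"]
kept = [f for f in flags if keep_variant(f)]
```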

Here’s how the file looks:

I also received a small manual describing this file and how to use it:

File #4 - strcall203.lobystr203report

This table contains the list of all STRs.

Here’s the description in the file:

#FGC Y-STR report generated based on lobSTR pre-v2.0.3 (sourceforge git revision 34534b) processing
#lobSTR citation: Gymrek M, Golan D, Rosset S, & Erlich Y. lobSTR: A short tandem repeat profiler for personal genomes. Genome Research. 2012 April 22.
#Repeat counts reported according to lobSTR standards; conversion required in certain cases to produce results based on other counting standards
#chrY coordinates based on hg19 / b37 reference sequence
#Marker conversions to FTDNA standards for DYS448, DYS449, DYS607, DYS576, DYS511, DYS640, and DYS485 are provisional
#Marker results known to be unreliable include: DYS413a/b, DYS490, DYS572, DYS726, DYS534, DYS446, and DYS487
#default lobSTR database has been augmented with results for DYS540, DYS712, DYS593, DYS715, DYS513, DYS561, DYS497, DYS510, DYF385.1, and DYF385.2, which should be treated as provisional
#Only two copies of DYS464 and DYF371 are called here; fully-spanning read details can provide insight into additional copies
#DYF371 includes DYS425
#NR = not reported / no reads
#NA = not available
#call confidence: 1 corresponds to highest confidence, 0 corresponds to lowest confidence; results with call confidence below 0.2 should be considered very speculative
#conflict flags: ? = conflicting fully-spanning reads; * = conflicting partially-spanning reads; % = het result in diploid calling for marker not recognized as multicopy; & = not called in diploid calling
#read details: Format is [repeat count]|[number of reads supporting given repeat count], with different counts separated by ';'. In the case of multicopy markers like DYS464, the fully-spanning read details can be used to determine repeat counts for additional copies

And here is what the file looks like:

Now this gets very technical and I don’t understand everything, but from what I can figure out, first we have the STR name and the estimated result, and then follows information explaining how this result was found and how sure the program is of it.
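To make the "read details" and "call confidence" fields a little more concrete, here is a short Python sketch based only on the format described in the header: `[repeat count]|[number of reads]`, with different counts separated by `;`. The function names and the sample string are my own assumptions, and the repeat count is kept as text because I am not sure it is always a whole number.

```python
def parse_read_details(details):
    """Parse 'count|reads;count|reads' into a list of (repeat_count, reads) pairs."""
    details = details.strip()
    if details in ("", "NR", "NA"):      # NR = not reported, NA = not available
        return []
    pairs = []
    for item in details.split(";"):
        count, reads = item.split("|")
        pairs.append((count.strip(), int(reads)))
    return pairs

def best_supported(pairs):
    """Return the repeat count backed by the most reads, or None if there are none."""
    if not pairs:
        return None
    return max(pairs, key=lambda pair: pair[1])[0]

def is_very_speculative(call_confidence):
    """The header calls results with confidence below 0.2 very speculative."""
    return call_confidence < 0.2

# Made-up read details, for illustration only
pairs = parse_read_details("13|5;14|1")
```

For a multicopy marker such as DYS464, the header suggests looking at all the pairs rather than only the best-supported one.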

File #5 - strcall203.lobystr203report ftdna

This table also lists the STRs, but in a much simpler form. You simply have the name and the results, and the STRs are in the order they are found at Family Tree DNA.

The description in the file is:

#Marker conversions to FTDNA standards for DYS448, DYS449, DYS607, DYS576, DYS511, DYS640, and DYS485 are provisional
#see main Y-STR report for further information regarding reliability, etc.

And here’s the table:

File #6 - mttype.RSRS.MT

This table gives the mtDNA results in RSRS format. It gives for each SNP the position and the ancestral and derived result.

Here’s the description:

#FGC mtDNA report
#Variants with respect to RSRS

And here is the file:

File #7 - mttype.rCRS.MT

This one has exactly the same layout, but lists the variants with respect to rCRS instead of RSRS.

#FGC mtDNA report
#Variants with respect to rCRS

File #8 - haplogroupCompare

This table lists my SNPs and compares them to some reference results from my haplogroup or close to it. It is quite similar to the variantCompare file. The fields are position, base change, rsID, SNP name, reliability and the reference results mine are compared to. There are two successive sections: shared SNPs and private SNPs.

Here’s the description:

#FGC report: Detailed Analysis of Called SNPs
#refer to Analysis of Called Variants for citations and other details
#in the reporting below, it is assumed that the reference allele is ancestral ("-") and the sample allele is derived ("+"); "x"=ambiguous and "?"=no-read/no-call
#note that this report uses a different, simplified variant calling approach from that used in the Analysis of Called Variants report, so results may differ, especially for less-reliable variants
Haplogroups in the neighborhood of G-L91 being considered; includes: G-L91;G-L166;G-M286
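The symbols explained in this header ("+", "-", "x" and "?") are easy to tally once the table is loaded. Here is a hedged Python sketch: the mapping comes from the header text, but the function name `summarize_calls` and the sample calls are invented for illustration, and I am assuming each comparison cell holds exactly one of these symbols.

```python
from collections import Counter

# Meanings taken from the file header
CALL_MEANING = {
    "+": "derived",
    "-": "ancestral",
    "x": "ambiguous",
    "?": "no-read/no-call",
}

def summarize_calls(calls):
    """Count how many calls fall into each category; unexpected symbols count as 'unknown'."""
    return dict(Counter(CALL_MEANING.get(c.strip(), "unknown") for c in calls))

# Made-up calls for one SNP across several reference samples
summary = summarize_calls(["+", "+", "-", "?"])
```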

And here’s the file: 

File #9 -  gtype

This one is also a bit complex. It lists the Y-SNPs and seems to detail how the results were determined.

Here’s what it looks like:

This ends the description of these nine analysis files. Note that I am still waiting for access to my results on the website and to my raw sequencing file. If you are interested, I’ll write another article to show them to you then.

Thanks Itaï! 

These tools were developed by Dr. Greg Magoon under the supervision of Justin Loe. Justin tells us "these are not final versions and will be upgraded to a more user-friendly presentation by specialists in user-interfaces."

BGI provided the sequencing services and developed the Y chromosome chip.

If you have any questions, please post them below and I will try to get them answered. I'm sure we will be seeing a lot more regarding the Full Genomes test soon...

1 comment:

  1. Itaï Perez, thanks for the information, and yes, I am interested what will be available on the website and possibilities to sequence the raw file. This kind of information helps us to see if we get value for money.