World Families Forums - TMRCA calculations

Author Topic: TMRCA calculations  (Read 8733 times)
razyn
Old Hand
****
Offline Offline

Posts: 406


« Reply #50 on: May 03, 2012, 01:53:58 PM »

The mutation rates that Ken used in Generations111T are from Marko Heinila; I have so far not been able to find his mutation rates online.

http://dl.dropbox.com/u/50201824/jsphylosvg_trees.html

Scroll to the bottom and click the link about mutation rates... I think.  These are specific to 111 marker haplotypes.  This site was linked on the MolGen forum.
Logged

R1b Z196*
Jdean
Old Hand
****
Offline Offline

Posts: 678


« Reply #51 on: May 03, 2012, 02:23:49 PM »

The mutation rates that Ken used in Generations111T are from Marko Heinila; I have so far not been able to find his mutation rates online.

http://dl.dropbox.com/u/50201824/jsphylosvg_trees.html

Scroll to the bottom and click the link about mutation rates... I think.  These are specific to 111 marker haplotypes.  This site was linked on the MolGen forum.

Very interesting. So these are the mutation rates Ken didn't use?
Logged

Y-DNA R-DF49*
MtDNA J1c2e
Kit No. 117897
Ysearch 3BMC9

Mark Jost
Old Hand
****
Offline Offline

Posts: 707


« Reply #52 on: May 03, 2012, 11:15:02 PM »

The rates in the linked file are the ones shown in row 14 of KenN's Gen111T sheet.

The new rate sum of 111 markers is 0.290653.

Repost of the text in the link:

"A set of mutation rate estimates based on about 4,000 haplotypes at 111 STR level.

Estimation method considers haplotype pairs as random draws from a model distribution of observed STR matches. Overall scaling is according to YHRD data. Ballantyne et al. data and results from a large 1-67 dataset was used for cross-validation. Error in overall scale is about 5%, and in relative rates from about 20% to 100% from fast to slow loci.

Rate sums are 0.13, 0.17, 0.29 for 1-37, 1-67, 1-111, respectively, with about 7% error.

(Mutation rates have strong dependence on the repeat number, and accurate general results do not exist. Errors are then considered relative to FTDNA public dataset allele distribution average.)"
--------------------------------------------------------------------------------
http://dl.dropbox.com/u/50201824/jsphylosvg_trees.html


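Here is the rate sum used for each number of markers. The second column is the marker count times the average per-marker rate (0.290653 / 111 = 0.002618); the third is the sum by MarkoH in FTDNA order.
Quote
Markers   N x average rate   Sum by MarkoH (FTDNA order)
1         0.002618
12        0.031422           0.024239
25        0.065462           0.060527
37        0.096884           0.132304
67        0.175439           0.172844
111       0.290653           0.290653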
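
For anyone who wants to reproduce the numbers quoted here, this is a minimal Python sketch of the arithmetic, assuming the second column is simply the marker count times the average per-marker rate (0.290653 / 111). The function name is made up for illustration; the rate sums are the ones quoted in this post.

```python
# Sum of the 111 per-marker rates, as quoted in the post.
RATE_SUM_111 = 0.290653

# Cumulative sums in FTDNA marker order, as quoted ("Sum by MarkoH").
FTDNA_ORDER_SUMS = {12: 0.024239, 25: 0.060527, 37: 0.132304,
                    67: 0.172844, 111: 0.290653}


def average_rate_sum(n_markers, total=RATE_SUM_111, total_markers=111):
    """Rate sum for n markers if every marker mutated at the average rate."""
    return n_markers * total / total_markers


if __name__ == "__main__":
    print("Markers  N x average  FTDNA order")
    for n, ftdna_sum in FTDNA_ORDER_SUMS.items():
        print(f"{n:>7}  {average_rate_sum(n):.6f}     {ftdna_sum:.6f}")
```

The two columns differ for the smaller panels because the markers in each FTDNA panel are not average: the per-marker rates vary, so the actual sum depends on which markers are in the panel.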

MJost
Logged

148326
Pos: Z245 L459 L21 DF13**
Neg: DF23 L513 L96 L144 Z255 Z253 DF21 DF41 (Z254 P66 P314.2 M37 M222  L563 L526 L226 L195 L193 L192.1 L159.2 L130 DF63 DF5 DF49)
WTYNeg: L555 L371 (L9/L10 L370 L302/L319.1 L554 L564 L577 P69 L626 L627 L643 L679)
Mark Jost
Old Hand
****
Offline Offline

Posts: 707


« Reply #53 on: May 03, 2012, 11:24:08 PM »

Ok, here it is

Quote

YrsPerGen*   Count   AGE   Generations   YBP   Founder   Generations   YBP
30   N=8   GA coal=   16.1   481.8   GA=   19.5   585.9


2124   R1b1a2   Gregor founder of the clan
2909   R1b1a2   Peter McGregor c 1860
108707   R1b1a2a1a1b4   Robert McGregor m1815, Irvine, Scotland
120820   R1b1a2a1a1b4   daniel mcfarland, 1655-1738
131269   James Henry McGregor b. @ 1769 d. 1826   
133637   R1b1a2   John Dubh of Drumnacharrie and Stronfearnan
190435   R1b1a2a1a1b4   David Stewart, Greenbrier & Nicholas Cos VA 1780-1
191228   R1b1a2   Alexander McGregor bo 1700s 'Lanark'
MJost

Using Marko H's other set of mutation rates shown in KenN's sheet, which differs only minutely, here is your alternate TMRCA. Just about half a generation added.
Quote
YrsPerGen*   Count   AGE   Generations   YBP   Founder   Generations   YBP
30   N=8   GA coal=   16.4   493.5   GA=   20.0   600.1

Logged

ironroad41
Old Hand
****
Offline Offline

Posts: 219


« Reply #54 on: May 04, 2012, 08:14:48 AM »

Thanks, Mark, for all your good work! I don't really think STRs can provide much better than +/-50 to 100 years here. My only comment to date re: the new rates is that I would say 388 is too fast for R1b. The data set may be biased by I and J entries, where 388 has a higher modal and rate.

It's obvious to me that you cherry-picked a little in your selection of 8 haplotypes. Was it because these all had values for all 111 DYS loci? I wonder what you would get if you used only 67 markers and applied them to as many entries as have that number measured. That would be more representative of a "random" set of entries.
« Last Edit: May 04, 2012, 08:17:30 AM by ironroad41 » Logged
Mark Jost
Old Hand
****
Offline Offline

Posts: 707


« Reply #55 on: May 04, 2012, 08:31:28 AM »

Yes, I only used the eight 111-marker guys. I thought that was the purpose. It was a chore to extract those eight due to the combined multi-copy markers. I will send you my modified copy if you wish.

When MikeW gets the interlaced version, maybe he can code a marker-length macro calc.

Logged

Jdean
Old Hand
****
Offline Offline

Posts: 678


« Reply #56 on: May 04, 2012, 09:34:19 AM »

Yes I only used the eight 111 marker guys. I thought that was the purpose. I was a chore to extract those eight  due to combined multicopy markers. I will sent you my mod copy if you wish.

 When MikeW gets the interlaced version maybe he can code a marker length macro calc.




I copy the data into excel as text and then use the replace function in notepad to change the hyphens to tabs, normally works quite well.
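
The same hyphen-to-tab step can be scripted. A minimal sketch in Python, assuming the haplotype row is already tab-separated text and that hyphens only appear inside multi-copy markers; the function name and the sample values are made up for illustration.

```python
def split_multicopy(row, sep="-"):
    """Replace hyphens with tabs so each copy of a multi-copy marker
    (e.g. a 385 or 464 value) lands in its own spreadsheet column."""
    return row.replace(sep, "\t")


if __name__ == "__main__":
    # Hypothetical row fragment: one single-copy value, then two multi-copy values.
    sample = "13\t11-14\t15-15-17-17"
    print(split_multicopy(sample))
```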
Logged


Mike Walsh
Guru
*****
Offline Offline

Posts: 2964


WWW
« Reply #57 on: May 04, 2012, 10:08:33 AM »

Yes I only used the eight 111 marker guys. I thought that was the purpose. I was a chore to extract those eight  due to combined multicopy markers. I will sent you my mod copy if you wish.

 When MikeW gets the interlaced version maybe he can code a marker length macro calc.

I will, but I just want to wait until Ken produces a "non-draft" version.
Logged

R1b-L21>L513(DF1)>S6365>L705.2(&CTS11744,CTS6621)
Mark Jost
Old Hand
****
Offline Offline

Posts: 707


« Reply #58 on: May 04, 2012, 11:12:06 AM »

There is also no sigma calc in the beta. Did Ken say this was a beta or final?
Logged

Mike Walsh
Guru
*****
Offline Offline

Posts: 2964


WWW
« Reply #59 on: May 04, 2012, 11:49:57 AM »

There also is no sigma calc added in the beta as well. Did Ken say this was a beta or final?

To be honest, I've just opened it up. I haven't tried to figure it out.  Please go on to the Hg I forum and ask Ken.  I know he thinks Sigma's are a critical part of any model.

http://archiver.rootsweb.ancestry.com/th/index/Y-DNA-HAPLOGROUP-I
Logged

spanjool
Member
**
Offline Offline

Posts: 38


« Reply #60 on: May 06, 2012, 04:08:28 AM »

I would like to bring up the effect of data selection on the MRCA calculations.

From the total pool of P312+ samples, U152, L21, etc. were selected out, leaving a wider spread of SNPs and haplotypes and therefore a higher MRCA (let's call it the peripheral MRCA).
For the SNPs that were selected out, the MRCA will be lower (let's call it the central MRCA).

The Z196 SNP then stepped into the peripheral selection of P312* and created a central MRCA, leaving behind a P312* with a high MRCA.

An example (MRCA/Coalescence in generations, followed by the difference)
Z1418          103/84       19
Z196           120/102      18
Z196exM153     130/111      19
M153            36/30        6
Z209            99/61        8
Z220           153/117      36
Another (111 STR)
L176.2         124/91       33
L165           103/73       30
L176.2xL165    154/126      28
Z1418 is a central selection in the P312 clade; Z196 is more widespread than this selection (as Mike W has emphasized many times).

The difference between MRCA and coalescence has to be taken into account.
The latter points more to neutral mutations (mutation pressure); the former relates to strong effects such as founders or bottlenecks.
The smaller the difference, the more stable the population and the more trustworthy the MRCA (bell-shaped pairwise mismatch distribution in the STR alleles).
A bigger difference points to an unsettled MRCA because of a relatively high peripheral selection; together with a ragged pairwise mismatch distribution, it is indicative of a subpopulation in harsh circumstances and in isolation (no gene flow).
A medium difference indicates a subpopulation with reasonable growth and wealth but affected by gene flow, i.e. less isolated.
Hans
Logged

R1b-Z220*
JeanL
Old Hand
****
Offline Offline

Posts: 425


« Reply #61 on: May 06, 2012, 09:55:49 AM »


An example (MRCA/Coalescence in generations)

M153              36/30        6


The difference between MRCA and Coalescence have to be taken in account.
As the latter points more to neutral mutations  (mutation pressure); the first related to strong effects like founders of bottlenecks.

The smaller the difference the more the population is stable and the more trustable the MRCA (bell shaped pair wise mismatches in the STR alleles).

A bigger difference points to a non settled MRCA because of a relative high peripheral selection; together with a ragged shaped pair wise mismatch indicative for a subpopulation in harsh circumstances and in isolation (no gene flow).

The medium differences shows a subpopulation with reasonable growth and wealth but affected by gene flow; less isolated.
Hans

So what was the sigma that you got for R-M153? Right now I see the MRCA was 36 generations, or 900 ybp (using 25 years/gen) or 1080 ybp (using 30 years/gen). Would you say that it is an accurate estimate of the age of R-M153?
Logged
gtc
Old Hand
****
Offline Offline

Posts: 238


« Reply #62 on: May 06, 2012, 11:33:26 AM »


Repost of the text in the link:

"A set of mutation rate estimates based on about 4,000 haplotypes at 111 STR level.

Make that 5,000 haplotypes. ;-)

Interesting thread, guys.
« Last Edit: May 06, 2012, 11:34:39 AM by gtc » Logged

Y-DNA: R1b-Z12* (R1b1a2a1a1a3b2b1a1a1) GGG-GF Ireland (roots reportedly Anglo-Norman)
mtDNA: I3b (FMS) Maternal lines Irish
Mike Walsh
Guru
*****
Offline Offline

Posts: 2964


WWW
« Reply #63 on: May 06, 2012, 05:17:01 PM »

thanks mark for all your good work!  I don't really think STR's can provide much better than +/-50 to 100 years here?  

This may surprise you, because you know I think STR diversity is useful, but...
there is no way we have the precision to be within 100 years of anything.
Logged

Jdean
Old Hand
****
Offline Offline

Posts: 678


« Reply #64 on: May 06, 2012, 06:58:32 PM »

thanks mark for all your good work!  I don't really think STR's can provide much better than +/-50 to 100 years here?  

This may surprise you, because you know I think STR diversity is useful, but...
there is no way we have the precision to be within 100 years of anything.

Agreed, no matter what time scale we are talking about or how many loci are involved.
Logged


Mark Jost
Old Hand
****
Offline Offline

Posts: 707


« Reply #65 on: May 06, 2012, 07:45:49 PM »

thanks mark for all your good work!  I don't really think STR's can provide much better than +/-50 to 100 years here?  

This may surprise you, because you know I think STR diversity is useful, but...
there is no way we have the precision to be within 100 years of anything.
Maybe his specific range of years was at fault, but the concept was correct.

For TMRCAs based on an exact match at all loci, adding more markers increases the precision of the test. With 100 markers, the 50% probability corresponds to 2 generations (exact value 1.7).

A pretty good precision I would suggest!
 
What is the effect of adding even more markers?
http://nitro.biosci.arizona.edu/ftdna/TMRCA.html
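
As a rough check on the 1.7 figure, here is a minimal Python sketch. It assumes an exact match at all n markers, a single per-marker rate of 0.002 per generation (an assumption, not stated in this post, but it reproduces the quoted value), and the simple model P(TMRCA <= t) = 1 - exp(-2 * mu * n * t). The function name is hypothetical.

```python
import math


def tmrca_quantile(n_markers, prob=0.5, mu=0.002):
    """Generations t such that P(TMRCA <= t) = prob, given an exact match
    at n_markers loci, under P(TMRCA <= t) = 1 - exp(-2 * mu * n * t).

    mu is an assumed average per-marker mutation rate per generation.
    """
    return -math.log(1.0 - prob) / (2.0 * mu * n_markers)


if __name__ == "__main__":
    for n in (12, 25, 37, 67, 100, 111):
        print(f"{n:>3} markers: 50% within {tmrca_quantile(n):.1f} generations")
```

Under these assumptions the 50% point shrinks in proportion to 1/n, so each added marker helps less than the one before. This only covers the exact-match case; it says nothing about the precision of a TMRCA for a set of haplotypes that differ.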

MJost
Logged

Mark Jost
Old Hand
****
Offline Offline

Posts: 707


« Reply #66 on: May 06, 2012, 07:49:29 PM »

thanks mark for all your good work!  I don't really think STR's can provide much better than +/-50 to 100 years here?  My only comment to date re: the new rates is that I would say 388 is too fast for R1b.  The data set may be biased by I,J entries where 388 has a higher modal and rate.

Its obvious to me that you cherry-picked a little in your selection of 8 haplotypes.  was it because these all had 111 dys loci values?  I wonder if you used only 67 and applied it to as many as have that number measured what you would get?  That would be more representative of a "random" set of entries?

I removed the guy who has the largest GD from the rest, kit 120200, just to show the close relatedness of the remaining haplotypes.
Quote
YrsPerGen* Count   AGE  Generations  YBP  Founder  Generations  YBP   MoD
30  N=7  GA coal=  9.5  283.6     GA=  11.4  342.9

Logged

ironroad41
Old Hand
****
Offline Offline

Posts: 219


« Reply #67 on: May 07, 2012, 07:46:39 AM »

thanks mark for all your good work!  I don't really think STR's can provide much better than +/-50 to 100 years here?  My only comment to date re: the new rates is that I would say 388 is too fast for R1b.  The data set may be biased by I,J entries where 388 has a higher modal and rate.

Its obvious to me that you cherry-picked a little in your selection of 8 haplotypes.  was it because these all had 111 dys loci values?  I wonder if you used only 67 and applied it to as many as have that number measured what you would get?  That would be more representative of a "random" set of entries?

I remove the guy who has the largest GD from the rest, kit 120200 just to show the close relatedness of the remaining haplotypes.
Quote
YrsPerGen* Count   AGE  Generations  YBP  Founder  Generations  YBP   MoD
30  N=7  GA coal=  9.5  283.6     GA=  11.4  342.9


I think you meant entry 120820, Daniel McFarland. Yes, they are all closely related and split off from the main branch at different times. If you remove the entries with a higher number of mutations, you reduce the TMRCA, as you have shown. So you need a spread of old and newer entries to find the correct (?) TMRCA. The value of 11 at 426 is a large contributor and requires a large number of entries/a large number of DYS loci to compensate for it and outweigh it in the estimate. By majority vote, it doesn't appear that 2124, the direct descendant, has had a mutation since 391 went from 11 to 10? I have learned a lot working this set of data as entries have accumulated. It is a highly correlated set, but you have to be careful not to overcount mutations due to close relationships. Re: 120820, did you count the 19 at 458 as one step or two steps?
Logged
Mark Jost
Old Hand
****
Offline Offline

Posts: 707


« Reply #68 on: May 07, 2012, 08:48:57 AM »

It is the researcher's choice which kits to use, but I just used the DNA kits with 111 markers that the project administrator placed in the Red category.

The GDs come from MikeW's 111-marker spreadsheet, which I modified for GDs; it counts every mutation from the modal or from a selected modal. The list below is from the selected modal.
Quote

KIT      GD @ 111 Selected mode   Std 1-25   439   459   DYS464   Std 26-37   YCAII   CDY   Std 38-67   425   413   Std 68-111
2909     2                        0          0     0     0        0           0       0     2           0     0     0
108707   2                        0          0     0     0        0           0       1     0           0     0     1
2124     3                        0          0     0     0        0           0       0     0           0     0     3
190435   3                        1          0     0     0        0           0       2     0           0     0     0
131269   4                        1          0     0     0        0           0       1     0           0     0     2
133637   5                        0          0     0     0        0           0       1     0           0     0     4
191228   8                        1          0     1     2        0           0       2     1           0     0     1
120820   15                       2          0     0     1        4           0       1     0           0     0     7
Logged

Mark Jost
Old Hand
****
Offline Offline

Posts: 707


« Reply #69 on: May 07, 2012, 09:08:15 AM »

Researcher's choice on what kit to utilize but I just used the DNA kits the the project administrator placed in the Red category with 111 markers.

In checking GD it is in MikeW's 111 marker spread sheet that I modified for GDs, it counts every mutation from modal or selected modal. The below list is from Selected Modal
Quote

KIT   GD @ 111 Selected mode   Std 1-25   439   459   DYS464   Std  26-37   YCAII   CDY   Std  38-67   425   413   Std  68-111
2909   2   0   0   0   0   0   0   0   2   0   0   0
108707   2   0   0   0   0   0   0   1   0   0   0   1
2124   3   0   0   0   0   0   0   0   0   0   0   3
190435   3   1   0   0   0   0   0   2   0   0   0   0
131269   4   1   0   0   0   0   0   1   0   0   0   2
133637   5   0   0   0   0   0   0   1   0   0   0   4
191228   8   1   0   1   2   0   0   2   1   0   0   1
120820   15   2   0   0   1   4   0   1   0   0   0   7

An after thought to show GD average with or without 120820:
GD Avg:5.3, Min:2, Max:15, N=8      
GD Avg:3.9, Min:1, Max:8, N=7      
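
The two averages can be recomputed from the GD column of the table quoted above. A minimal Python sketch, assuming the GD values exactly as posted (it does not recompute the modal after a kit is dropped); the function name is made up for illustration.

```python
from statistics import mean

# GD @ 111 from the selected modal, as listed in the table above.
GD_AT_111 = {
    "2909": 2, "108707": 2, "2124": 3, "190435": 3,
    "131269": 4, "133637": 5, "191228": 8, "120820": 15,
}


def mean_gd(gds, exclude=()):
    """Average genetic distance over all kits not listed in `exclude`."""
    return mean(gd for kit, gd in gds.items() if kit not in exclude)


if __name__ == "__main__":
    print(f"N=8: {mean_gd(GD_AT_111):.2f}")
    print(f"N=7: {mean_gd(GD_AT_111, exclude={'120820'}):.2f}")
```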
Logged

ironroad41
Old Hand
****
Offline Offline

Posts: 219


« Reply #70 on: May 07, 2012, 10:01:26 AM »

I would question GDs exceeding one, except possibly at CDYa,b. Consider especially the result you obtained after dropping the entry with the most mutations: some 300+ years. The probability of two mutations at the same DYS locus is vanishingly small. I do not usually use 385a,b; 389i,ii; the 464 series; 459a,b; or YCAIIa,b in my computations. I don't think the mutation rates are well understood?

I think the GD's exceeding one are probably multistep single event mutations?

Edit:  I do not believe the concept of GD describes a process which has a range in excess of 100:1.  All events are not equally likely even though it is a random process.
« Last Edit: May 07, 2012, 10:04:00 AM by ironroad41 » Logged
Mark Jost
Old Hand
****
Offline Offline

Posts: 707


« Reply #71 on: May 07, 2012, 10:45:30 AM »

I would question GD's exceeding one except for possibly at CDYa,b.  Consider especially the result you attained after dropping the entry with the most mutations. Some 300+ years.  

If you look at the outlier, most of the differences are in the Std 26-37 and Std 68-111 panels, which, since he is a true member of the lineage, suggests a very, very early split in the branch. Here are the GDs from the Base Modal:
Quote
Base Modal   
GD Avg:21.6, Min:20, Max:26, N=8   
108707   20
2124   20
190435   20
2909   21
131269   22
133637   22
120820   22
191228   26
Quote

The probability of two mutations at the same dys loci is vanishingly small.  I do not usually use 385a,b; 389i,ii, the 464 series, 459a,b, the YCAIIa,b in my computations.  I don't think the mutation rates are well understood?

I think the GD's exceeding one are probably multistep single event mutations?

Edit:  I do not believe the concept of GD describes a process which has a range in excess of 100:1.  All events are not equally likely even though it is a random process.
KenN's Generations 111T doesn't use DYS385, 389, 459, or the CDYs (11 markers).
MJost

Logged

Mark Jost
Old Hand
****
Offline Offline

Posts: 707


« Reply #72 on: May 07, 2012, 10:50:04 AM »

I have been in communications with Marko Heinila concerning his new mutation rate calculations and his new Phylo chart. He responded as follows:

I updated my dropbox page
http://dl.dropbox.com/u/50201824/jsphylosvg_trees.html
so that "interactive" trees now cover more haplogroups and time values embedded in the XML trees are based on the new mutation rates.

The fact that these mutation rates are somewhat higher than what I used before seems to be quite reasonable in the case of some deep genealogical clusters. The majority of trees do not have that much structure, however. Some of the clearer cases are Scandinavian/Scottish R1a1a1/R1a1a1h (111 STR) and MacGregor Ian Cam in R-L21 (67 STR).

-Marko H.


>MJost wrote: A few days ago I used KenN's new Gen111T spreadsheet; it had your new rates included but used his own by default. His 100-marker (no multi-copy markers) rate was nearly identical to yours: KenN's 100 had a sum of 0.243231 and yours was 0.237478.


These are both based on the same set of about 4,000 111-level samples. The idea in the estimation is that each haplotype pair is considered an independent random draw from a model distribution. The model distribution gives the ratio of mismatches to matches at a given marker when pairs with a given total number of matching markers are considered. The pair data is then used to solve for the mutation rates. This is the same idea as in Chandler's paper on mutation rate estimation.

The difference between the two estimates is that one set uses weight one for each pair. This would work ideally if the dataset were composed of isolated close haplotype pairs that are distant from the other pairs. The problem with actual data is that deeper tree branches with lots of derived samples get overweighted, since there are many associated haplotype pairs, all of them sharing some evolution in the deeper past. The quality of the mutation rate estimates is then reduced, since some mutations are (effectively) used many times in the estimation process.

The other set mentioned on the dropbox page uses "phylogenetic weight". This should reduce the error in the independent draw assumption. "Isolated pairs" get weight one, but more distant pairs within bigger clusters are downweighted since they are interdependent. -Marko H.


>MJost wrote: I don't totally understand exactly what the process in Chandler's paper on mutation rate estimation does, but I defer to your knowledge and results. Chandler's expressions for the “mutation model curve” (MMC) of Hutchison et al. (2004), and his outline of a procedure for using the high-match end of the MMC to extract mutation rates, are on my list to understand more fully sometime.

In the case of haplotype pairs with one mismatch, the relative frequencies of pairs with various mismatching loci are proportional to the relative mutation rates. This can be generalized to a larger number of mismatching loci.
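
The one-mismatch case can be sketched in a few lines of Python. This is only an illustration of the idea as described here, not Marko's actual code: every pair gets weight one (no phylogenetic weighting), the haplotypes are made-up toy data, and the function name is hypothetical.

```python
from collections import Counter
from itertools import combinations


def relative_rates_from_single_mismatches(haplotypes):
    """Estimate relative per-locus rates from pairs that differ at exactly
    one locus: the share of such pairs mismatching at locus i is taken as
    the relative mutation rate of locus i."""
    counts = Counter()
    for a, b in combinations(haplotypes, 2):
        mismatches = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
        if len(mismatches) == 1:
            counts[mismatches[0]] += 1
    total = sum(counts.values())
    n_loci = len(haplotypes[0])
    return [counts[i] / total if total else 0.0 for i in range(n_loci)]


if __name__ == "__main__":
    # Toy three-locus haplotypes, invented for the example.
    toy = [(13, 24, 14), (13, 25, 14), (13, 24, 14), (14, 24, 14), (13, 23, 14)]
    print(relative_rates_from_single_mismatches(toy))
```

The result is only relative (the shares sum to one), so an overall scale still has to come from somewhere else, such as father-son data.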

For a given time distance characteristic of a given number of mismatches, I use an expression for the probability of mismatch at a given locus derived from a symmetric up/down model. (It is actually still the same result as long as the up/down ratio and the up+down sum stay constant.) From this one can compute the probability of mismatch at locus i if n loci do not match in total. Mutation rates are solved by fitting these quantities against observations.

Still another thing is the rate "calibration": the expected number of mutations for datasets like YHRD is scaled to match the observed number by adjusting the overall scale. That is,
expected_mutations = sum_i estimated_mutation_rate_i * yhrd_meioses_for_locus_i
is set equal to
yhrd_observed_mutations = sum_i yhrd_mutations_in_locus_i
by adjusting the overall scale of the mutation rate estimates.
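
The calibration step amounts to a single scale factor. A minimal Python sketch, assuming one list entry per locus; the function name and all the numbers are invented for illustration and are not YHRD figures.

```python
def calibrate(relative_rates, meioses, observed_mutations):
    """Rescale relative rates so that the expected number of mutations,
    sum_i rate_i * meioses_i, equals the observed total, sum_i mutations_i."""
    expected = sum(r * m for r, m in zip(relative_rates, meioses))
    observed = sum(observed_mutations)
    scale = observed / expected
    return [r * scale for r in relative_rates]


if __name__ == "__main__":
    # Hypothetical inputs for three loci.
    relative = [1.0, 2.0, 0.5]          # relative rates from the pair fitting
    meioses = [10000, 8000, 12000]      # father-son transmissions observed per locus
    mutations = [20, 45, 15]            # mutations observed per locus
    print(calibrate(relative, meioses, mutations))
```

Only the overall scale changes; the ratios between loci stay whatever the haplotype-pair fitting produced.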

I use a cross-validation argument to choose the maximum number of locus mismatches considered in the calculation. For given observational data such as YHRD or Ballantyne et al., and given mutation rate estimates, one can compute the probability at which the observed set of per-locus mutation counts {M_1, M_2, ..., M_n} actually occurs in an "experiment" with a given number of transmissions and mutation rates. Better use of haplotype data produces a higher probability for directly and independently observed data.
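
One way to score a set of rate estimates against independent father-son data is a log-likelihood. The sketch below assumes independent binomial counts per locus, which is my assumption; the text above does not say which distribution is used. The function name and numbers are hypothetical.

```python
import math


def log_likelihood(rates, meioses, mutations):
    """Log-probability of the observed per-locus mutation counts, assuming
    each locus is an independent binomial with the given rate."""
    total = 0.0
    for mu, n, m in zip(rates, meioses, mutations):
        log_choose = math.lgamma(n + 1) - math.lgamma(m + 1) - math.lgamma(n - m + 1)
        total += log_choose + m * math.log(mu) + (n - m) * math.log1p(-mu)
    return total


if __name__ == "__main__":
    # Invented data: two loci with 20 and 40 observed mutations.
    meioses = [10000, 8000]
    mutations = [20, 40]
    candidate_a = [0.002, 0.005]   # matches the observed frequencies
    candidate_b = [0.005, 0.002]   # rates swapped between the loci
    print(log_likelihood(candidate_a, meioses, mutations))
    print(log_likelihood(candidate_b, meioses, mutations))
```

The rate set with the higher log-likelihood is the one the independent data favors, which is how different mismatch cut-offs could be compared.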

*******************

MJost
Logged

MarkoH
Member
**
Offline Offline

Posts: 20


« Reply #73 on: May 08, 2012, 12:03:49 PM »

Just for completeness:

At ftdna.com/public, there are more than 60,000 unique haplotypes at 37 level, and almost 35,000 with 67. The question is just what happens if more data is used with the same estimation method that produced the estimates in the 111T calculator.  This spreadsheet contains a collection of mutation rate estimates based on various datasets.

Columns B (61,981 x37; 35-36/37), D (34,184 x67; 64-66/67), and F (3,803 x111 102-110/111) are the most relevant, others could be ignored.  

The B column, say, considers 35/37 and 36/37 matching pairs in the dataset of 61,982 samples. 35-36/37 comes from a cross-validation argument, where the YHRD and Ballantyne et al. data suggested that there is no improvement from including larger mismatches.

The 111T spreadsheet used column F as one option. The second option is a set based on the same data as column F (102-110/111) but estimated with a more complex method. According to cross-validation, the more complex method improved when larger mismatches were included (86-110/111); it also agreed more closely with the column D results (64-66/67) without using more data than the F estimates.

-Marko H.

« Last Edit: May 08, 2012, 12:19:00 PM by MarkoH » Logged
Mike Walsh
Guru
*****
Offline Offline

Posts: 2964


WWW
« Reply #74 on: May 08, 2012, 01:09:08 PM »

-Marko H.

Thank you for joining.
Logged
