World Families Forums - TMRCA calculations

Author Topic: TMRCA calculations  (Read 8184 times)
ironroad41
Old Hand
****
Offline Offline

Posts: 219


« Reply #100 on: May 11, 2012, 03:18:09 PM »

I get confused. I edited my post.  Yes, I doubt that the O'Neills are that old.  If they are, it sure conflicts with all of MikeW's work.

I'm still not convinced we have a good handle on time using STRs.  The original Variance/ASD formulation came very early in the game (2001), when we knew hardly any SNPs, didn't recognize multi-step mutations, and in general had very little data.
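For readers unfamiliar with the Variance/ASD idea, here is a minimal sketch, assuming a strict single-step mutation model in which the expected squared distance from the founder value grows as rate times generations. The haplotypes, founder values, rates and the 30-year generation length are all made-up illustrations, not data from this thread.

```python
from statistics import mean

def asd_tmrca(haplotypes, founder, rates):
    """Generations back to the founder, from average squared distance (ASD)."""
    n_loci = len(founder)
    # Mean squared deviation from the assumed founder value, per locus.
    asd_per_locus = [
        mean((h[i] - founder[i]) ** 2 for h in haplotypes)
        for i in range(n_loci)
    ]
    # Under the single-step model, E[ASD at a locus] = rate * generations,
    # so summing over loci gives generations = sum(ASD) / sum(rates).
    return sum(asd_per_locus) / sum(rates)

# Hypothetical 4-marker haplotypes (repeat counts).
haplotypes = [
    [13, 24, 14, 11],
    [13, 25, 14, 11],
    [13, 24, 15, 10],
    [14, 24, 14, 11],
]
founder = [13, 24, 14, 11]            # assumed founder (modal) haplotype
rates = [0.002, 0.003, 0.002, 0.004]  # illustrative rates per generation

generations = asd_tmrca(haplotypes, founder, rates)
years = generations * 30              # assumed 30 years per generation
```

Because the distance is squared, a single two-step mutation counts as much as four one-step mutations, which is one reason unrecognized multi-steps can distort an estimate of this kind.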

I am also concerned about the effect of "catastrophes" and their impact on our calculations.  The probability of one haplotype having 4 or 5 very slow mutations in a few thousand years is unimaginably low. JMHO
Logged
ironroad41
Old Hand
****
Offline Offline

Posts: 219


« Reply #101 on: May 12, 2012, 02:57:22 PM »

 It appears to be quite simple.  If I take a set of 21 entries and 23 dys loci (out of the first 37) and estimate the TMRCA for a group who is Z253+ and L226- I get c. 1000 BC.  If I do the same type of estimate on a group of Z253+,  L226+  I usually get a TMRCA c. 400 AD.  It seems pretty clear to me?  Note L226 is a subclade of Z253.

So L226 is younger than Z253, I'll alert the press :)

Actually I think I worked out what you were on about above. I think you were referring to M269* not M269 which would make the rest of your statements make more sense and which I should have realised at the time (apologies)

However it's feasible for a group of people to be negative for all known SNPs below an old SNP and still have low diversity when compared with each other.

I've been mulling over the issue of tree lines meeting at an SNP.  My thinking has changed.  All the entries below M226, e.g., merge at this SNP, and then they all have the same tree line to the next SNP, say Z253.  So all the tree lines are of the same length, but what the lower SNP does is "bias", in some sense, the TMRCA estimate, since many of the entries will share the same tree line from M226 to Z253.

As I've observed with the Ian Cam of Clan Gregor, it is best to select "independent" entries, or, said another way, entries with as different a set of mutations as possible.  This would minimize the type of bias described above.  Even when evaluating coalescence time or TMRCAs, it would seem prudent to "cherry-pick" the data set to minimize any biases.
Logged
MarkoH
Member
**
Offline Offline

Posts: 20


« Reply #102 on: May 12, 2012, 03:48:24 PM »

(...)
So all the tree lines are of the same length, but what the lower SNP does is "bias", in some sense, the TMRCA estimate, since many of the entries will share the same tree line from M226 to Z253.

As I've observed with the Ian Cam of Clan Gregor, it is best to select "independent" entries, or, said another way, entries with as different a set of mutations as possible.  This would minimize the type of bias described above.  Even when evaluating coalescence time or TMRCAs, it would seem prudent to "cherry-pick" the data set to minimize any biases.

This is approximately the idea that is implemented in the "weighting" inherent to certain tree estimation methods, such as WPGM (Weighted Pair-Group Method) and also Neighbor Joining.  The two tree lines "below" a bifurcation point get weights (1/2, 1/2) compared to the tree line entering the node from the deeper past (deeper past in the case of WPGM).  It is also the same difficulty, and the same suggested cure, that I tried to explain in the context of mutation rate estimation.  Nothing prevents doing TMRCA estimation with this kind of weighting for a given tree, either.  "Cherry picking" is a type of tree estimation that focuses on major splits.
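The (1/2, 1/2) weighting can be sketched as a single cluster-merge step. This is only an illustration of the textbook WPGMA update next to the unweighted (UPGMA) one, not the routine used for the trees discussed here; the function name and the numbers are hypothetical.

```python
def merged_distance(d_ik, d_jk, size_i, size_j, weighted=True):
    """Distance from the merged cluster (i + j) to another cluster k."""
    if weighted:
        # WPGMA: the two merging branches count equally (1/2, 1/2),
        # no matter how many samples sit below each of them.
        return 0.5 * d_ik + 0.5 * d_jk
    # UPGMA: every sample counts equally, so big clusters dominate.
    total = size_i + size_j
    return (size_i * d_ik + size_j * d_jk) / total

# Cluster i holds 9 closely related entries, cluster j holds a single entry.
wpgma = merged_distance(4.0, 10.0, size_i=9, size_j=1, weighted=True)
upgma = merged_distance(4.0, 10.0, size_i=9, size_j=1, weighted=False)
```

With the pair-group weighting, the nine near-duplicates no longer swamp the lone entry, which is roughly what hand-picking "independent" entries tries to achieve.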

« Last Edit: May 12, 2012, 03:55:18 PM by MarkoH » Logged
ColinUSA
New Member
*
Offline Offline

Posts: 4


« Reply #103 on: May 13, 2012, 08:55:06 AM »

Quote
I updated my dropbox page
http://dl.dropbox.com/u/50201824/jsphylosvg_trees.html
so that "interactive" trees now cover more haplogroups and time values embedded to xml trees are based on the new mutation rates.

Hi Marko,
It's nice that these trees link back to the original data when you click on a haplotype. Are they supposed to also display the age of a node when you click a node? Apart from the xml, what methods do you use to construct a tree?
Cheers,
Colin
Logged
MarkoH
Member
**
Offline Offline

Posts: 20


« Reply #104 on: May 13, 2012, 11:28:33 AM »

Quote
I updated my dropbox page
http://dl.dropbox.com/u/50201824/jsphylosvg_trees.html
so that "interactive" trees now cover more haplogroups and time values embedded to xml trees are based on the new mutation rates.

Hi Marko,
It's nice that these trees link back to the original data when you click on a haplotype. Are they supposed to also display the age of a node when you click a node? Apart from the xml, what methods do you use to construct a tree?
Cheers,
Colin

One can get age estimates by opening the trees in the .zip file with the Archaeopteryx viewer.  The trees displayed on the page use "jsPhyloSVG", which has fewer features.  However, the underlying files have branch length values (in centuries) which are used to scale the drawing; see, e.g., the data file for the 111 STR R1 tree: R1-111.xml (this has been reformatted from 111/majority/R1.xml in the zip file).
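If the files follow the standard phyloXML layout (an assumption; the sample document below is invented and is not taken from R1-111.xml), node ages can be read off by summing branch lengths from a node down to its tips. A minimal sketch using only the Python standard library:

```python
import xml.etree.ElementTree as ET

NS = {"p": "http://www.phyloxml.org"}  # standard phyloXML namespace

# Invented example tree; branch lengths are in centuries, as in the post.
SAMPLE = """<phyloxml xmlns="http://www.phyloxml.org">
  <phylogeny rooted="true">
    <clade>
      <branch_length>10</branch_length>
      <clade>
        <name>kit-1</name>
        <branch_length>12</branch_length>
      </clade>
      <clade>
        <branch_length>4</branch_length>
        <clade><name>kit-2</name><branch_length>8</branch_length></clade>
        <clade><name>kit-3</name><branch_length>8</branch_length></clade>
      </clade>
    </clade>
  </phylogeny>
</phyloxml>"""

def tip_depths(clade, depth=0.0, out=None):
    """Map each named tip to its summed branch length below `clade`."""
    out = {} if out is None else out
    children = clade.findall("p:clade", NS)
    if not children:
        out[clade.findtext("p:name", default="?", namespaces=NS)] = depth
    for child in children:
        length = float(
            child.findtext("p:branch_length", default="0", namespaces=NS)
        )
        tip_depths(child, depth + length, out)
    return out

root_clade = ET.fromstring(SAMPLE).find("p:phylogeny/p:clade", NS)
depths = tip_depths(root_clade)                       # in centuries
age_years = {name: d * 100 for name, d in depths.items()}
```

The root clade's own branch length is deliberately ignored here, since it lies above the root node rather than below it.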

Trees are constructed with weighted minimum distance algorithms, using an FNJ (fast neighbor joining) tree as the optimization starting point. Both STR and SNP data are used.  Besides minimum distance, maximum likelihood based tree modifications are sometimes used as a third tree optimization stage, though not in this version. The final result is rooted, and time estimates are computed with a maximum likelihood optimization for haplotypes and branch lengths.  This process produces speculative binary trees; several such trees can then be summarized as majority constructions that (on the page) only contain splits that are present in all underlying binary phylogenies; time values are averages over the underlying constructions.   This uses experimental routines I have written on several occasions over the last few years.
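The summarizing step, keeping only the splits present in all underlying binary trees, can be sketched as follows. This is a minimal illustration assuming rooted trees written as nested tuples of sample names; it is not the experimental code mentioned above, and the example trees are invented.

```python
def clades(tree, found=None):
    """Return (leaf set of `tree`, set of leaf sets of all internal nodes)."""
    found = set() if found is None else found
    if isinstance(tree, str):          # a tip
        return frozenset([tree]), found
    leaves = frozenset()
    for child in tree:
        child_leaves, _ = clades(child, found)
        leaves |= child_leaves
    found.add(leaves)                  # record this internal node's clade
    return leaves, found

def shared_clades(trees):
    """Clades that occur in every one of the given rooted trees."""
    return set.intersection(*(clades(t)[1] for t in trees))

# Two speculative binary trees over the same five samples.
t1 = ((("a", "b"), "c"), ("d", "e"))
t2 = (("a", ("b", "c")), ("d", "e"))

shared = shared_clades([t1, t2])
```

Here the two trees disagree about which pair is closest within {a, b, c}, so that grouping survives only as an unresolved three-way node, while {d, e} is kept as it stands.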

Edit: The "jsPhyloSVG" trees on the above mentioned page now have root branches. Root branch length is always set to  1,000 years to give a rough idea of the time scales.
« Last Edit: May 13, 2012, 02:45:49 PM by MarkoH » Logged
ColinUSA
New Member
*
Offline Offline

Posts: 4


« Reply #105 on: May 14, 2012, 08:20:17 AM »

Quote
The final result is rooted and time estimates are computed with a maximum likelihood optimization for haplotypes and branch lengths.
Sounds pretty good. I understand how branch lengths are variables subject to optimization but have only a vague idea of how haplotypes would be optimized, by that do you mean constructs for the root and nodes? Something akin to "median vectors" as used in the Fluxus network diagrams?

Quote
This uses experimental routines I have written in several occasions over the last few years.
I hope you are developing something we project admins might be able to use. We really need a new tool that like the Phylip routine kitsch constructs a time constrained rooted tree and like Fluxus connects the haplotypes with knowledge of shared mutations.

Cheers,
Colin
Logged
MarkoH
Member
**
Offline Offline

Posts: 20


« Reply #106 on: May 14, 2012, 09:40:29 AM »

Quote
The final result is rooted and time estimates are computed with a maximum likelihood optimization for haplotypes and branch lengths.
Sounds pretty good. I understand how branch lengths are variables subject to optimization but have only a vague idea of how haplotypes would be optimized, by that do you mean constructs for the root and nodes? Something akin to "median vectors" as used in the Fluxus network diagrams?

"Minimum distance" algorithms produce median vectors. In a binary tree, minimum change haplotypes are medians of the three nearby nodes.  (Related algorithms also allow one to "predict" missing values, though not uniquely.)
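As a minimal sketch of the median construction (the neighbour haplotypes are invented): for an internal node of a binary tree, taking the per-locus median of its three neighbours minimizes the total number of repeat-count steps to those neighbours.

```python
from statistics import median

def median_haplotype(n1, n2, n3):
    """Per-locus median of the three neighbouring node haplotypes."""
    return [int(median(values)) for values in zip(n1, n2, n3)]

def total_steps(node, neighbours):
    """Sum of absolute repeat-count differences to all neighbours."""
    return sum(abs(a - b) for nb in neighbours for a, b in zip(node, nb))

parent = [13, 24, 14]
child_1 = [13, 25, 14]
child_2 = [14, 25, 15]

node = median_haplotype(parent, child_1, child_2)
cost = total_steps(node, [parent, child_1, child_2])
```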

Maximum likelihood methods optimize log likelihood, that is, summed log probabilities of STR transitions over tree branches.  This quantity depends on ancestral and derived haplotypes as well as the branch length. It can be optimized with respect to both.  A typical case where maximum likelihood and medians differ is one where the median construction would suggest no change in a fast locus over a large branch length, but maximum likelihood prefers otherwise. For example, if the neighbor node haplotypes and branch lengths (in generations) for a node are (A, 10), (A, 10), (B, 1), the median would be value A, but B represents maximum likelihood for a fast locus.  ("Fast locus" means a mutation rate of more than about 2/100 in this example, if kept simple with a symmetric stepwise model and a one-step change.) This also shows how maximum likelihood haplotypes have more variability at fast loci than the minimum distance ones.
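The (A, 10), (A, 10), (B, 1) example can be checked numerically. The sketch below assumes a symmetric single-step model, for which the probability of a net change of k repeats over a branch is exp(-x) * I_k(x) with x = rate * generations; the repeat values 12 and 13 are arbitrary stand-ins for A and B. To first order, the likelihood ratio of B over A is (5*rate)^2 / (rate/2) = 50*rate, which exceeds 1 once the rate passes about 2/100.

```python
import math

def step_prob(k, rate, generations, terms=30):
    """P(net change of k repeats) under a symmetric single-step model.

    Equals exp(-x) * I_|k|(x), x = rate * generations, where I is the
    modified Bessel function, evaluated here from its power series.
    """
    x = rate * generations
    k = abs(k)
    series = sum(
        (x / 2) ** (2 * m + k) / (math.factorial(m) * math.factorial(m + k))
        for m in range(terms)
    )
    return math.exp(-x) * series

def node_likelihood(node_value, neighbours, rate):
    """Product of branch transition probabilities for one candidate value."""
    return math.prod(
        step_prob(value - node_value, rate, gens) for value, gens in neighbours
    )

A, B = 12, 13                              # B is one step away from A
neighbours = [(A, 10), (A, 10), (B, 1)]    # (haplotype value, generations)

def ml_value(rate):
    """Candidate node value (A or B) with the higher likelihood."""
    return max((A, B), key=lambda v: node_likelihood(v, neighbours, rate))

slow_choice = ml_value(0.01)   # rate below the ~2/100 threshold
fast_choice = ml_value(0.03)   # rate above it
```

For the slow rate the maximum likelihood value agrees with the median (A); for the fast rate it switches to B, the value of the nearest neighbour.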

The inherent problem with median vectors and maximum likelihood constructions is that fast loci cannot be reconstructed in the distant past; instead the methods produce compromise haplotypes with fewer node-to-node changes than a realistic solution.  (Theoretically, this could be avoided with a Bayesian approach and Metropolis-Hastings-like simulation of internal node states, but this is not practical with large networks.)  I use an additional heuristic to phase out faster loci in the more distant past, so that time underestimation is mitigated to a degree.

Quote
Quote
This uses experimental routines I have written in several occasions over the last few years.
I hope you are developing something we project admins might be able to use. We really need a new tool that like the Phylip routine kitsch constructs a time constrained rooted tree and like Fluxus connects the haplotypes with knowledge of shared mutations.

Possible, but would require lots of software work for user friendliness.
« Last Edit: May 14, 2012, 05:46:10 PM by MarkoH » Logged
ColinUSA
New Member
*
Offline Offline

Posts: 4


« Reply #107 on: May 16, 2012, 07:31:19 PM »

Thank you, that was informative. I searched for phylogenetic trees with key words such as maximum likelihood and Bayesian to see what might turn up. Couldn't find anything that dealt with STR haplotypes, they all seem to be for DNA sequences.
Cheers,
Colin
Logged
Mark Jost
Old Hand
****
Online Online

Posts: 707


« Reply #108 on: May 16, 2012, 11:10:22 PM »

Thank you, that was informative. I searched for phylogenetic trees with key words such as maximum likelihood and Bayesian to see what might turn up. Couldn't find anything that dealt with STR haplotypes, they all seem to be for DNA sequences.
Cheers,
Colin
Colin,

You may have this already, but here ya go! Basic STR for Light bedside reading.  :)

http://nitro.biosci.arizona.edu/courses/EEB596/handouts/Bayesian.pdf

http://nitro.biosci.arizona.edu/ftdna/models.html

MJost

Logged

148326
Pos: Z245 L459 L21 DF13**
Neg: DF23 L513 L96 L144 Z255 Z253 DF21 DF41 (Z254 P66 P314.2 M37 M222  L563 L526 L226 L195 L193 L192.1 L159.2 L130 DF63 DF5 DF49)
WTYNeg: L555 L371 (L9/L10 L370 L302/L319.1 L554 L564 L577 P69 L626 L627 L643 L679)
ColinUSA
New Member
*
Offline Offline

Posts: 4


« Reply #109 on: May 22, 2012, 07:59:35 AM »

Hi Mark,
I thought I found some software to try called Lamarc. Then I read these two statements!
1. A good LAMARC work-out can use hundreds of megabytes of RAM,
2. it's not unusual for a complete, solid run of LAMARC to take a week or two
http://evolution.genetics.washington.edu/lamarc/documentation/tutorial.html
Cheers,
Colin
Logged
Mark Jost
Old Hand
****
Online Online

Posts: 707


« Reply #110 on: May 22, 2012, 02:19:07 PM »

Colin,

Guess you wanna try it first???  LOL This is something like MarkoH uses I think.

I might look at it next week while on vacation.... prolly not!

MJost
Logged

Mike Walsh
Guru
*****
Offline Offline

Posts: 2963


WWW
« Reply #111 on: May 29, 2012, 01:08:59 PM »

Marko, I saved a copy of your TMRCA estimates from last year because it is the best, most consistent look across major haplogroups with large samples and long haplotypes, IMO.
http://dl.dropbox.com/u/17907527/TMRCAs_for_major_Y_Hgs_by_Heinila_2011.html

I realize you are busy, but just thought I'd check. Do you have any updates for this?
« Last Edit: May 29, 2012, 01:17:55 PM by Mikewww » Logged

R1b-L21>L513(DF1)>L705.2
razyn
Old Hand
****
Offline Offline

Posts: 405


« Reply #112 on: May 29, 2012, 03:06:48 PM »


Marko, I saved a copy of your TMRCA estimates from last year because it is the best, most consistent look across major haplogroups with large samples and long haplotypes, IMO.

Cool, I didn't realize anybody had saved this, when it suddenly went non-public last fall.

Marko was the first, and for a long time the only, person to take Z196 seriously enough to include it in such a table.  It isn't obvious -- you have to scroll way to the bottom to see it (or some of the other SNPs that were new last year, and had small sample size).  We are now working on DF27, and on many subclades of Z196 -- only a few of which were tested last year.  But this was way better than nothing, and I only had tiny bits of it in my head, or my notes.  Most of the said notes were on DNA-Forums, and are now just as non-public as Marko's old site.
Logged

R1b Z196*
ironroad41
Old Hand
****
Offline Offline

Posts: 219


« Reply #113 on: May 30, 2012, 07:01:38 AM »

Marko, I saved a copy of your TMRCA estimates from last year because it is best, consistent look across major haplogroups with large samples and long haplotypes, IMO.
http://dl.dropbox.com/u/17907527/TMRCAs_for_major_Y_Hgs_by_Heinila_2011.html

I realize you are busy, but just thought I'd check. Do you have any updates for this?

I think this work is "proof" of a substantial loss of diversity in R1b somewhat after M73, or possibly even earlier, back to prior to M-269.  Note how the interclade for R-V88 to R-M73 is 11K, with decreasing intraclades, but right at 312 the bottom falls out and everything below is 4.2K or less.  What happened?  I have talked to Marko and he agrees the lack of data suggests a bottleneck occurred!
Logged
Mike Walsh
Guru
*****
Offline Offline

Posts: 2963


WWW
« Reply #114 on: May 30, 2012, 11:37:48 AM »

I think this work is "proof" of a substantial loss of diversity in R1b somewhat after M73, or possibly even earlier, back to prior to M-269.  Note how the interclade for R-V88 to R-M73 is 11K, with decreasing intraclades, but right at 312 the bottom falls out and everything below is 4.2K or less.  What happened?  I have talked to Marko and he agrees the lack of data suggests a bottleneck occurred!
Of course, many bottlenecks have occurred for almost all paternal lineages. In fact, most did not survive one bottleneck somewhere along the line and went extinct.

Is there some outstanding revelation available with this knowledge?  Remember, most of these TMRCA calculations (at least the intraclade ones) are not estimating the birth date of an SNP, just the time of the most recent common ancestor of those still living (and tested).

We do not know how much "loss of diversity" there was if we don't know how much diversity (and population size) was there prior to the bottleneck.  The only thing we can do is look at interclade ages where valid ones are available.   The amount of diversity between the birth of an SNP and that lineage's expansion after a bottleneck is unknown. There could have been just one thin lineage struggling along for a long time before the burst of success.
« Last Edit: May 30, 2012, 11:41:41 AM by Mikewww » Logged

MarkoH
Member
**
Offline Offline

Posts: 20


« Reply #115 on: May 30, 2012, 11:38:01 PM »

These spreadsheets collect together TMRCA numbers from the current 67 STR run1 and run2: average TMRCAs for haplogroups and for SNPs.  The idea would be to compare these against each other. There are some issues related to details of the SNP tree, and I removed cases that might be affected. Marko H.
Logged
Jdean
Old Hand
****
Offline Offline

Posts: 678


« Reply #116 on: May 31, 2012, 05:11:21 AM »

Marko

Not wanting to barrage you with questions but when you’ve got five maybe you could have a think about this.

I was having a chat with somebody the other day who is of the opinion that different haplogroups have different mutation rates. Personally I can’t think of a logical reason why they should unless the modal for a locus was particularly different from one group to another.

However I commented that if this effect did happen then your methods of calculating mutation rates would probably show it.
Logged

Y-DNA R-DF49*
MtDNA J1c2e
Kit No. 117897
Ysearch 3BMC9

MarkoH
Member
**
Offline Offline

Posts: 20


« Reply #117 on: May 31, 2012, 10:19:44 AM »

I was having a chat with somebody the other day who is of the opinion that different haplogroups have different mutation rates. Personally I can’t think of a logical reason why they should unless the modal for a locus was particularly different from one group to another.

However I commented that if this effect did happen then your methods of calculating mutation rates would probably show it.

There are some studies that seemingly suggested different mutation rates for different haplogroups.  However, similar reasoning might also have suggested that mutation rates depend on anything that could be used for data classification: region, surname, etc.  Strictly speaking, dependence on haplogroup would mean that a few SNPs affect mutation rates.  Rather than that, it is more likely that samples from some haplogroup were, within the dataset, coincidentally correlated with some other factors that affect mutation rates.  Some genome-wide studies suggest that mutation rates are highly variable "within and between families" (Conrad et al, Nature Genetics 43, 712–714 (2011)).  If so, estimating the true average rate would require larger datasets than the usual error-limit argument suggests, namely that the error variance of a mutation rate estimate is inversely proportional to the number of observed mutations.
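The "usual error limit argument" can be sketched as follows, assuming mutation counts are Poisson distributed with one common rate, so that the relative variance of the estimate is 1/k for k observed mutations. The counts below are invented for illustration.

```python
import math

def rate_with_error(mutations, transmissions):
    """Rate estimate and its standard error under a plain Poisson model."""
    rate = mutations / transmissions
    relative_sd = 1 / math.sqrt(mutations)  # relative variance is 1/k
    return rate, rate * relative_sd

# Same underlying rate, four times as much data: the error halves.
small = rate_with_error(25, 10_000)
large = rate_with_error(100, 40_000)
```

If rates really do vary within and between families, the counts are more scattered than a single Poisson rate allows, and error bars computed this way would be too narrow.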





Logged
Mike Walsh
Guru
*****
Offline Offline

Posts: 2963


WWW
« Reply #118 on: May 31, 2012, 11:45:35 AM »

What is the difference between the "haplogroup" table and the "SNP" table of TMRCA calculations?

Are you saying that the "haplogroup" table uses mutation rates specific (tailored) to the respective haplogroups while the SNP table uses one broad set of mutation rates?
Logged

alan trowel hands.
Guru
*****
Offline Offline

Posts: 2012


« Reply #119 on: May 31, 2012, 12:32:43 PM »

I think this work is "proof" of a substantial loss of diversity in R1b somewhat after M73, or possibly even earlier, back to prior to M-269.  Note how the interclade for R-V88 to R-M73 is 11K, with decreasing intraclades, but right at 312 the bottom falls out and everything below is 4.2K or less.  What happened?  I have talked to Marko and he agrees the lack of data suggests a bottleneck occurred!

Perhaps we are looking at a hunter-gatherer people who were experiencing constant bottlenecks, and maybe a major one at the Younger Dryas about 11000 years ago.  Subsequently the group remained a low population that did not start to pick up until farming arrived late, perhaps in the Copper Age.
Logged
MarkoH
Member
**
Offline Offline

Posts: 20


« Reply #120 on: May 31, 2012, 01:07:22 PM »

What is the difference between the "haplogroup" table and the "SNP" table of TMRCA calculations?

The haplogroup table is for ISOGG haplogroups.  The "SNP table" is mostly for "new SNP's" that do not have an ISOGG code. (Some defining SNP's may end up there if there are back mutations.)  Probably I should have combined the two like before.

The SNP table lists first the narrowest ISOGG haplogroup that contains all positives for the "new non-ISOGG SNP"; this is used to organize the list by location.  Then come the non-ISOGG SNP found within that haplogroup, the count of positives for this non-ISOGG SNP, and the related TMRCA estimate (the age estimate for the subtree covering all positives).
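The "narrowest haplogroup that contains all positives" step can be pictured as a lowest-common-ancestor lookup in the haplogroup tree. The sketch below is only an assumption about how such a lookup might be done; the tree labels (A, A1, A1a, ...) and the function names are hypothetical and are not taken from the actual program or from ISOGG.

```python
def path_to_root(node, parent):
    """Return the list of nodes from `node` up to the root of the tree."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def narrowest_common_haplogroup(haplogroups, parent):
    """Deepest tree node that is an ancestor of (or equal to) every
    haplogroup in `haplogroups`."""
    paths = [path_to_root(h, parent) for h in haplogroups]
    common = set(paths[0]).intersection(*paths[1:])
    # Walking upward from the first sample, the first shared node is the deepest one.
    for node in paths[0]:
        if node in common:
            return node
    raise ValueError("haplogroups do not share a common root")

# Hypothetical toy tree, child -> parent.
parent = {"A1": "A", "A2": "A", "A1a": "A1", "A1b": "A1"}

# Haplogroups of the samples that tested positive for some new SNP.
positives = ["A1a", "A1b", "A1a"]
print(narrowest_common_haplogroup(positives, parent))
```

In this toy tree the positives fall in A1a and A1b, so the new SNP would be listed under A1, with the count of positives and the subtree age next to it.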

There is a technical difference in handling these two. The ISOGG haplogroup codes are used as constraints on where a specific sample can be placed in the tree; this covers all SNP's that seem to be equivalent to ISOGG codes.  If there is no defined ISOGG code, an SNP is handled more or less like an STR marker, without forced constraints that assume a unique mutation. There are some rough checks for non-ISOGG SNP's, such as exclusion of clear non-UEP cases.

Quote
Are you saying that the "haplogroup" table uses mutation rates specific (tailored) to the respective haplogroups while the SNP table uses one broad set of mutation rates?

Not at all, they all come from the same calculation and use the same set of mutation rates.
« Last Edit: May 31, 2012, 01:18:24 PM by MarkoH » Logged
Jdean
Old Hand
****
Offline Offline

Posts: 678


« Reply #121 on: May 31, 2012, 01:22:05 PM »


There are some studies that seemingly suggested  different mutation rates for different haplogroups.  However, similar reasoning might have also suggested that mutation rates would depend on anything that could be used for data classification, region, surname etc.  Strictly speaking dependence on haplogroup  would mean that a few SNP's would affect mutation rates.  Rather than that, it would be more likely that samples from some haplogroup were co-incidentally within the dataset correlated with some other factors that affect mutation rates.  Some genome wide studies suggest that mutation rates are highly variable "within and between families"   (Conrad et al, Nature Genetics 43, 712–714 (2011)). If so, the estimation of true average rate would require larger datasets than what is suggested by usual error limit argument of mutation rate  error variance being inversely proportional to observed mutations.
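A rough worked example of the error-limit point, under assumptions that are not in the post: mutation counts are Poisson, each family has its own rate with a given coefficient of variation, and every family contributes the same number of transmissions. Under those assumptions the relative variance of the estimated average rate is 1/(expected mutations) + CV^2/(number of families), so the usual 1/sqrt(mutations) limit is only the special case with no variation between families. The numbers below are made up for illustration.

```python
import math

def relative_standard_error(expected_mutations, n_families, rate_cv=0.0):
    """Relative standard error of an average mutation rate estimate.

    expected_mutations: total mutations expected in the dataset
    n_families:         number of independent families (assumed equal size)
    rate_cv:            coefficient of variation of the rate between families
    """
    poisson_part = 1.0 / expected_mutations        # usual error-limit argument
    family_part = rate_cv ** 2 / n_families        # extra between-family term
    return math.sqrt(poisson_part + family_part)

# Hypothetical dataset: 100 observed mutations spread over 50 families.
print(relative_standard_error(100, 50))             # no family variation
print(relative_standard_error(100, 50, rate_cv=1))  # strong family variation
```

If rates really vary between families, adding more transmissions within the same families stops helping once the second term dominates; only more families reduce it, which is one way to read the remark that larger datasets would be required.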


So is a problem of investigating this idea the lack of suitably large enough databases ?

As you say it would be difficult to imagine how a SNP could affect mutation rates, but maybe outside forces could. However then there is the conundrum of why these would affect one haplogroup more than another.

I've come across a couple of examples of groups of people with the same name turning up in clusters. They are clearly more closely related to each other than to other members of the cluster, but outside the range you would expect inside a genealogical timeframe. The possibility that these families might have high mutation rates has occurred to me, but then there are problems associated with that as well, since in one example there are subgroups that appear to exhibit 'normal' mutation rates. I have wondered if this may be explained by families originating from the same village and consequently adopting the same surname, but then again it could be happenstance.
« Last Edit: May 31, 2012, 01:24:16 PM by Jdean » Logged

Y-DNA R-DF49*
MtDNA J1c2e
Kit No. 117897
Ysearch 3BMC9

MarkoH
Member
**
Offline Offline

Posts: 20


« Reply #122 on: May 31, 2012, 01:39:10 PM »


There are some studies that seemingly suggested  different mutation rates for different haplogroups.  However, similar reasoning might have also suggested that mutation rates would depend on anything that could be used for data classification, region, surname etc.  Strictly speaking dependence on haplogroup  would mean that a few SNP's would affect mutation rates.  Rather than that, it would be more likely that samples from some haplogroup were co-incidentally within the dataset correlated with some other factors that affect mutation rates.  Some genome wide studies suggest that mutation rates are highly variable "within and between families"   (Conrad et al, Nature Genetics 43, 712–714 (2011)). If so, the estimation of true average rate would require larger datasets than what is suggested by usual error limit argument of mutation rate  error variance being inversely proportional to observed mutations.


So is a problem of investigating this idea the lack of suitably large enough databases ?


One can imagine many reasons that might affect mutation rates, like autosomal genes and their activity (epigenetics), different mtDNA haplogroups producing different levels of oxidative stress, and so on.  The question would be whether it is somehow possible to predict mutation rates better in some specific group than by using an average over everything: this is a rather complex issue.



Logged
Mike Walsh
Guru
*****
Offline Offline

Posts: 2963


« Reply #123 on: May 31, 2012, 01:48:15 PM »


There are some studies that seemingly suggested  different mutation rates for different haplogroups.  However, similar reasoning might have also suggested that mutation rates would depend on anything that could be used for data classification, region, surname etc.  Strictly speaking dependence on haplogroup  would mean that a few SNP's would affect mutation rates.  Rather than that, it would be more likely that samples from some haplogroup were co-incidentally within the dataset correlated with some other factors that affect mutation rates.  Some genome wide studies suggest that mutation rates are highly variable "within and between families"   (Conrad et al, Nature Genetics 43, 712–714 (2011)). If so, the estimation of true average rate would require larger datasets than what is suggested by usual error limit argument of mutation rate  error variance being inversely proportional to observed mutations.


So is a problem of investigating this idea the lack of suitably large enough databases ?

One can imagine many reasons that might affect mutation rates, like autosomal genes and their activity (epigenetics), different mtDNA haplogroups producing different levels of oxidative stress, and so on.  The question would be whether it is somehow possible to predict mutation rates better in some specific group than by using an average over everything: this is a rather complex issue.

I have not read of any connections between the typical junk Y DNA SNPs used for haplogroup designations and Y STR mutation rates and alleles.   Do you think there is any direct cause/effect relationship between Y SNPs and Y STRs?  I just don't see a biological one but I'm no biologist.
« Last Edit: May 31, 2012, 01:50:20 PM by Mikewww » Logged

R1b-L21>L513(DF1)>L705.2
MarkoH
Member
**
Offline Offline

Posts: 20


« Reply #124 on: May 31, 2012, 03:22:01 PM »

These spreadsheets collect together TMRCA numbers from the current 67-STR run1 and run2: average TMRCAs for haplogroups and for SNPs.  The idea would be to compare these against each other. There are some issues related to details of the SNP tree, and I removed cases that might be affected. Marko H.

One of the main problems here is that I have used the ISOGG SNP index to capture the ISOGG tree. Unfortunately the website doesn't provide the tree in spreadsheet format. The index is the only electronically usable document at isogg.org; I didn't use the ISOGG tree until I noticed that the index contains haplogroup codes. Unfortunately the index is out of date and contains many errors, especially in the case of Hg R.  My approximation of the ISOGG tree can be found here.
« Last Edit: May 31, 2012, 03:45:48 PM by MarkoH » Logged