World Families Forums - STR Wars: Is diversity meaningful? more meaningful than Hg frequency?

Author Topic: STR Wars: Is diversity meaningful? more meaningful than Hg frequency?  (Read 15867 times)
Mike Walsh
« Reply #125 on: April 24, 2012, 06:23:16 PM »

Quote from: ironroad41
...
I would like to emphasize one other aspect of the Goldstein derivation, in which he states that each DYS locus can be used to infer the TMRCA, but in practice several are used and averaged.  Note: I do not believe this calculation can be made using Ken's approach, since he uses averages of mutation rates? ...

If you are critiquing Ken Nordtvedt's TMRCA methodology you should probably read his web site documentation and understand his spreadsheet. http://knordtvedt.home.bresnan.net/  You can also get direct answers from him on the Rootsweb Hg I forum.  He'll answer, particularly if you have a critique.

I've seen where Anatole Klyosov uses an average rate across a set of markers.   Nordtvedt aggregates STRs into a summary TMRCA but he does call them individual experiments and he does use the individual STR mutation rates in his spreadsheet formulas. He has a column for each STR.  Anyway, I don't think this is averaging the rates together in the sense that you mean, but I'm not sure what you mean.  I think when you get down to the specifics you have to talk about the details of the formulas.
« Last Edit: April 24, 2012, 06:24:13 PM by Mikewww » Logged

R1b-L21>L513(DF1)>L705.2
ironroad41
« Reply #126 on: April 25, 2012, 07:04:18 AM »

I left my statement re: Ken's approach as a question, since I haven't looked over his work in quite a while.  If he uses individual DYS locus rates, then his approach should be amenable to the same SD calculation.  My major point was that if the rates of the loci are similar, then the estimates are closer and the SD is smaller.
Logged
Mike Walsh
« Reply #127 on: April 25, 2012, 08:51:14 AM »

Okay, so you are not critiquing Ken's methodology then, because you haven't read his work for quite a while. Since you are mentioning him by name and using hypotheticals like "if he uses", then to be fair to him, why don't you challenge him directly?  If you feel uncomfortable, craft a set of very specific questions and I'll ask them on the Hg I Rootsweb forum so he will answer. That way the questions are somewhat anonymous from your perspective.  The guy is good with math, so I expect he has spent a lot of time on the issues related to this.


Logged

R1b-L21>L513(DF1)>L705.2
Mike Walsh
« Reply #128 on: April 25, 2012, 12:54:09 PM »

I'm just cataloging this from the Busby thread, since Busby did an analysis of the linear duration of STRs which somewhat questions the concept, but then seems to rely on them (STRs) to make the case about various forms of R1b in Europe.

I also agree that STR evaluation is useful.  I just think that using limited numbers like 10 or 15 is not enough.  That's what I see when I do my own comparisons on hundreds of long haplotypes, anyway.  I also think Busby's application of STRs does not match their own linear duration standards. That is an attack, but perhaps I just don't understand. Can you explain?

You are right; they showed that there is a significant effect of microsatellite choice on age estimates, and they should have used that finding when calculating the TMRCA of the R-S127 haplogroup, which is in figure 4a. However, in figure 2 they did not calculate TMRCA in generations but explored the bootstrapped variance, and in fact they do not seem to think that variance is affected by the choice of STRs, which is why they used 10 STRs in figure 2.  In a nutshell, they showed that microsatellite choice can have an effect on age estimates but still used a combined set of 10 STRs to explore variance.  Perhaps they think one should choose the STRs for a TMRCA calculation based on similarity of mutation rates and the presumed time span for common ancestry, i.e. use the average mut/marker for the slowest or fastest STRs depending on the presumed TMRCA, not the average mut/marker for the whole set, but use the combined set of STRs if you want to calculate variance.

This is where I get confused about Busby's theme. I don't really understand which methods they think are best, but at least I see they value STR diversity in their analyses, just using different techniques, I guess.


« Last Edit: April 25, 2012, 01:11:16 PM by Mikewww » Logged

R1b-L21>L513(DF1)>L705.2
ironroad41
« Reply #129 on: April 25, 2012, 04:04:26 PM »

This is always a contentious issue. I think STR diversity is useful. There are challenges and they must be considered in context.

In my opinion, people are fine with it until it disagrees with their theory, then they must shoot it down rather than adjust their theory. To me it is just another data point, and unfortunately we are in dire need of those.

Anyway, let's discuss this topic here so we don't have to argue the points over and over again in other topics, drowning them out.

As always seems to happen, we have strayed from your original question and can't see the forest for the trees (or something like that).  My observations have been little discussed.  My major point in answering you is that I do not believe most Y-STR DYS loci follow a drunkard's walk model, which is mathematically equivalent to using ASD/variance to describe the process.  I know that Nordtvedt is using variance, but my reference for that derivation has been Goldstein et al. (who, by the way, heads up the human genome lab at Duke Univ.).  I believe, based on analyzing the data set I referenced, that his model does not match the data.  I'm not throwing rocks at anyone; he had no data!  1. No distribution of allele values around the modal for the set of DYS loci.  2. No knowledge of multisteps.  When you include these factors, I have to conclude that the model doesn't work.

Additionally, the data also suggest that if many of the DYS loci mutate away from their modal, then the most probable next mutation is back to the modal, because, except for the 5% of multisteps, there aren't any entries with values beyond +/-1 of the modal.

So, to answer your original question bluntly, I would say that diversity isn't meaningful, since it's masked by hidden mutations, which make the elapsed time look shorter: we count fewer mutations than really occurred, and I don't think ASD/variance can handle that.  (Note: the original statement re: ASD/variance compensating for hidden mutations was based on the drunkard's walk model, where the distance from the modal increases with time and the squaring of the difference between the modal and the present value does compensate for back mutations.)

I would be very interested in seeing some data from existing R1b data sets re: STR locus distributions around the modal.  I simply don't have the math tools to extract that from the datasets myself.
Logged
MHammers
« Reply #130 on: April 25, 2012, 05:25:02 PM »

@Mikewww or anyone familiar with Generations7 spreadsheet

Do you know if there is an explanation somewhere as to what math operation Ken uses to account for hidden mutations?  

« Last Edit: April 25, 2012, 05:26:07 PM by MHammers » Logged

Ydna: R1b-Z253**


Mtdna: T

Mike Walsh
« Reply #131 on: April 25, 2012, 06:10:24 PM »

You should probably look at his formulas and his powerpoint charts where he charts out and tries to explain his methodology.

I think the answer is something along the lines of what John Chandler is saying.  This is from reply #5 of this thread.

A recent conversation from Rootsweb:
Quote from: general question
My own layman's viewpoint has always been to wonder how such unknowable factors like bottle-necks, back mutations, etc. can ever be adequately compensated for
Here is a response from a Scientist at MIT. John Chandler is the guy who calculated the mutation rates most of us use.
Quote from: John Chandler
That "etc." is exactly the difficulty. I'll point out in passing that back mutations are automatically accounted for in the variance method, ...
http://archiver.rootsweb.ancestry.com/th/read/genealogy-dna/2012-03/1333051203

My understanding of the explanation is that their mathematical model does not care about hidden mutations or even multi-step mutations. The mutation rates were derived based on visible mutations so, as long as they have adequate data to build the mutation rates, the way the TMRCA method uses them is consistent.  We should not think of the published mutation rate as literally the physical rate of change per the STR, but rather the observable rate of change.

What is required is that the STRs act somewhat consistently; in other words, the expected (predicted) rates up and down should be the same, and the rates shouldn't change given the allele value, etc.   This would be where the concern about STRs reaching saturation and high allele values comes into play.  If an STR doesn't show linear duration (of its rate) during the timeframe we care about, then it is not helpful.   The goal of the math model is to include STRs that are linear or "on average" (in aggregate) linear.
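
A minimal sketch of the claim above that back mutations are handled by the variance method. It assumes the very model under dispute in this thread: an unbounded, symmetric, single-step walk with a constant rate. The rate (0.05 per generation), the lineage count and the function names are illustrative choices, not anything taken from Nordtvedt's or Chandler's work; the rate is set far higher than real STR rates so that plenty of hidden mutations pile up in a short run.

```python
import random

def simulate_lineage(generations, rate, rng):
    """Walk one locus down one lineage; return (net offset, mutation events)."""
    offset = 0   # repeat count relative to the founder's value
    events = 0   # every mutation that really happened, visible or not
    for _ in range(generations):
        if rng.random() < rate:
            offset += rng.choice((-1, 1))  # symmetric single step, no bounds
            events += 1
    return offset, events

def summarize(lineages=2000, generations=400, rate=0.05, seed=1):
    """Average the true event count and the observable offsets over lineages."""
    rng = random.Random(seed)
    runs = [simulate_lineage(generations, rate, rng) for _ in range(lineages)]
    mean_events = sum(e for _, e in runs) / lineages
    mean_abs_offset = sum(abs(o) for o, _ in runs) / lineages
    mean_sq_offset = sum(o * o for o, _ in runs) / lineages
    return mean_events, mean_abs_offset, mean_sq_offset

mean_events, mean_abs_offset, mean_sq_offset = summarize()
print(f"true mutation events per lineage:  {mean_events:.1f}")
print(f"mean visible offset from founder:  {mean_abs_offset:.1f}")
print(f"mean squared offset (variance):    {mean_sq_offset:.1f}")
print(f"rate * generations:                {0.05 * 400:.1f}")
```

Under these assumptions the mean squared offset tracks the true number of mutation events even though the visible offset is much smaller. If the walk is not symmetric, or not unbounded, that equality is no longer guaranteed.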
« Last Edit: April 25, 2012, 06:11:35 PM by Mikewww » Logged

R1b-L21>L513(DF1)>L705.2
Mike Walsh
« Reply #132 on: April 25, 2012, 06:24:16 PM »

Quote from: ironroad41
...  My major point in answering you is that I do not believe most Y-STR DYS loci follow a drunkard's walk model, which is mathematically equivalent to using ASD/variance to describe the process.  I know that Nordtvedt is using variance, but my reference for that derivation has been Goldstein et al. (who, by the way, heads up the human genome lab at Duke Univ.).  I believe, based on analyzing the data set I referenced, that his model does not match the data.  I'm not throwing rocks at anyone; he had no data!  1. No distribution of allele values around the modal for the set of DYS loci.  2. No knowledge of multisteps.  When you include these factors, I have to conclude that the model doesn't work...


I haven't read Goldstein's report. Would you mind posting it again?

All I can say is that it is apparent when looking at R1b haplogroup haplotypes... real ones, lots of them and long ones ...   that STR diversity generally increases with haplogroups that are bigger (older) branches on the Y DNA tree.  In other words, it actually happens that STR variance is higher for haplogroups that the SNP-based Y DNA tree says are older.  -  This is observable. Not hypothetical. Please check reply #72 in this thread and around it. I've done this for pretty much all of R-L11. It works nicely.

Is STR variance precise?  No, but folks like Nordtvedt take great pains to produce confidence ranges that you can use, and use advanced techniques like interclade comparisons to improve precision.

Academics and testing companies also use STR diversity and have been for a long time.

I know you are aware of Marko Heinila's TMRCA method. He said it is NOT ASD/variance based, so that might alleviate your fears.  He calls it a "maximum likelihood" method, which I believe is especially well suited for back or multi-step mutations... but it matters little. Marko comes up with TMRCAs for the R1b haplogroups that are similar to what Nordtvedt's method does.

Are all STRs good in terms of their linearity with time? No, surely not. The multi-copy ones aren't very linear at all. Some of the faster ones, or at least the high allele value ones may not be reliable either.

Is it possible that some samples of haplotypes are biased by a particular group?  Sure, that is what the "resampling" thing is all about in the Busby and Myres work.  However, this is primarily an intraclade problem. Nordtvedt's interclade approach can reduce or eliminate those biases significantly.  

Maybe the mutation rates are all wrong, but I don't think anyone can effectively argue that most of FTDNA's STRs don't accumulate variance with time.  It's also intuitive, if you consider that most of these STRs are single steppers per event and you overlay that on to the family structure (tree).
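
The comparison described in this reply (per-marker STR variance computed for each haplogroup, then compared across clades) can be sketched roughly as follows. The haplotypes are made-up toy values with three markers each, and `mean_marker_variance` is a hypothetical helper name; real comparisons in this thread use hundreds of long haplotypes and exclude multi-copy markers.

```python
from statistics import pvariance

def mean_marker_variance(haplotypes):
    """Average, over markers, of the population variance of repeat counts."""
    # zip(*haplotypes) turns rows (one per person) into columns (one per marker).
    per_marker = [pvariance(column) for column in zip(*haplotypes)]
    return sum(per_marker) / len(per_marker)

# Toy, made-up haplotypes; not real project data.
older_clade = [[13, 24, 14], [14, 23, 15], [12, 25, 13], [13, 24, 14]]
younger_clade = [[13, 24, 14], [13, 24, 14], [14, 24, 14], [13, 24, 14]]

older = mean_marker_variance(older_clade)
younger = mean_marker_variance(younger_clade)
print(f"older: {older:.4f}  younger: {younger:.4f}  ratio: {older / younger:.1f}")
```

Only the ratio between clades is used here, so it does not depend on any particular mutation rate; turning a variance into years would need per-marker rates, which this sketch deliberately leaves out.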
« Last Edit: April 25, 2012, 06:43:28 PM by Mikewww » Logged

R1b-L21>L513(DF1)>L705.2
MHammers
« Reply #133 on: April 25, 2012, 08:17:03 PM »

Thanks Mike.

What about using a Poisson distribution process to help gauge how many hidden mutations are accumulated over time?  For example,  Let's say the average observable genetic distance between any two L11+'s is 20.  Poisson should show us how many should be the average at x point in time.  Maybe 30 at 6000 years, 40 at 8000, or only a small increase.
Logged

Ydna: R1b-Z253**


Mtdna: T

MHammers
« Reply #134 on: April 25, 2012, 09:26:34 PM »

I ran a simple Poisson distribution with Excel using an average mutation rate of .0023 and average generation time of 30 years/G over 49 markers.  This is to see how many mutation events can be expected in x time between two haplotypes.

For 67 generations or 2000 years, I get 7 mutations with the probability mass function.  At 10,000 years, 37 mutations with the same.  

This hypothetically includes hidden mutations.  Many L11 members are 20+ away from others in observable mutations, so approximately 37 on average when including back or multi-step mutations might not be far off.  However, this is still a simple model for what we are trying to answer and the snp L11 is probably closer to 2,000 than 10,000 years old.
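
The Excel calculation above can be reproduced along these lines. This is a sketch, not the original spreadsheet: the function names are made up, and it assumes one shared rate of 0.0023 per marker per generation, 49 markers and 30 years per generation, as stated in the post. The "most likely count" is the value at which the Poisson probability mass function peaks.

```python
import math

def expected_mutations(rate, markers, years, years_per_generation=30):
    """Poisson mean: per-marker rate x markers x generations elapsed."""
    return rate * markers * (years / years_per_generation)

def poisson_pmf(k, mean):
    """P(exactly k events), computed in log space to avoid overflow."""
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))

def most_likely_count(mean, k_max=200):
    """The k in 0..k_max with the highest Poisson probability."""
    return max(range(k_max + 1), key=lambda k: poisson_pmf(k, mean))

for years in (2000, 10000):
    mean = expected_mutations(0.0023, 49, years)
    print(years, round(mean, 2), most_likely_count(mean))
```

With these inputs the peaks fall at 7 and 37 mutations, matching the figures in the post. Note that the count is for a single line of descent; if it is meant for two haplotypes that each descend separately from the common ancestor, both lines accumulate mutations and the number of transmissions would double.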



« Last Edit: April 25, 2012, 11:07:18 PM by MHammers » Logged

Ydna: R1b-Z253**


Mtdna: T

Mark Jost
« Reply #135 on: April 25, 2012, 10:58:46 PM »

MikeH

I manually calculated the rate used and here is what I show.
Using the same average 0.0023 mutation rate equals 1 mutation per 435 birth events per marker.

435 / 49 markers equals 8.9 birth events per mutation across the haplotype.

@49 markers: 8.9 x 30 years per generation equals 267 years per mutation
(using 25 yrs per gen equals 222.5)

2000 years divided by 267 equals 7.5 mutations, so at 10K yrs 37.5.
(2000/222.5 = 9.0 mutations, 10K = 45)

Pretty close.

MJost
Logged

148326
Pos: Z245 L459 L21 DF13**
Neg: DF23 L513 L96 L144 Z255 Z253 DF21 DF41 (Z254 P66 P314.2 M37 M222  L563 L526 L226 L195 L193 L192.1 L159.2 L130 DF63 DF5 DF49)
WTYNeg: L555 L371 (L9/L10 L370 L302/L319.1 L554 L564 L577 P69 L626 L627 L643 L679)
ironroad41
« Reply #136 on: April 26, 2012, 07:57:35 AM »

I didn't see the term multisteps discussed by John?  I do note that when he refers to compensation for hidden mutations, he is making reference to DYS loci that behave like a drunkard's walk model and are unbounded.  His comment about the linearity of a DYS locus is appropriate, and I believe the number of mutations is undercounted because of the boundedness of many of the DYS loci.

I like his presentation of the Zhiv problem and how they found a constant fudge factor to compensate for some unknown factor in the mutational process.  I happen to believe the unknown factor is real and is related to the hidden mutation issue.

You don't need the fudge factor if you can intelligently count mutations; when you can't, then maybe it is the best option when you're trying to infer large TMRCAs.
Logged
Mike Walsh
« Reply #137 on: April 26, 2012, 09:36:05 AM »

... What about using a Poisson distribution process to help gauge how many hidden mutations are accumulated over time?  For example,  Let's say the average observable genetic distance between any two L11+'s is 20.  Poisson should show us how many should be the average at x point in time.  Maybe 30 at 6000 years, 40 at 8000, or only a small increase.
I don't know the statistics well enough to comment on the advantages or disadvantages. I know the "Maximum Likelihood" method that Marko Heinila uses can be applied to a Poisson distribution, but I don't have any details on Marko's formulas.  He might have them posted somewhere.

John Chandler would probably comment if you post this on Rootsweb GENEALOGY-DNA.
Logged

R1b-L21>L513(DF1)>L705.2
ironroad41
« Reply #138 on: April 26, 2012, 09:40:48 AM »

The Goldstein/Stumpf paper is from Science, Vol. 291, 2 March 2001

I would expect diversity of a set of haplotypes to increase with time.  As time elapses, more of the slower mutations occur, which have a very small probability of reoccurring.  I think the medium-rate haplotypes (mostly tetra motif) go in and out randomly as they mutate around the modal?

Marko traces/uses apparent mutations as does Ken.

I am arguing that most tetra motif DYS loci don't accumulate variance with time.  An increase in variance requires an unbounded model.  I don't see that in the small amount of data I have looked at?
Logged
Mike Walsh
« Reply #139 on: April 26, 2012, 09:55:19 AM »

Quote from: ironroad41
Marko traces/uses apparent mutations as does Ken.

Yes, of course, because that's all that is observable.

Quote from: ironroad41
...
I am arguing that most tetra motif DYS loci don't accumulate variance with time.

What tetra STR markers out of FTDNA's first 67 should be eliminated?  Please provide the list.  It should be easy to run a couple of comparisons.   Maybe this will line up with Marko Heinila's linear duration analysis, in which case the "36 linear" markers that I use will be appropriate.

Quote from: ironroad41
An increase in variance requires an unbounded model.

An infinitely unbounded model is not required, just a generally linear relationship for the time duration that is applicable.
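
The disagreement in this exchange can be made concrete with a small simulation. This is only a sketch under stated assumptions: the "bounded" rule is one possible reading of ironroad41's description (a locus sitting at modal +/-1 always steps back to the modal on its next mutation), multi-step mutations are ignored, and the rate of 0.01 per generation is exaggerated so the run is quick. None of the names come from any published method.

```python
import random

def mean_square_offset(generations, rate, bounded, lineages=1000, seed=7):
    """Mean squared distance from the modal after the given number of generations."""
    rng = random.Random(seed)
    total = 0
    for _ in range(lineages):
        offset = 0
        for _ in range(generations):
            if rng.random() < rate:
                if bounded and offset != 0:
                    offset = 0  # assumed rule: next mutation returns to the modal
                else:
                    offset += rng.choice((-1, 1))  # symmetric single step
        total += offset * offset
    return total / lineages

RATE = 0.01  # exaggerated, illustrative rate per generation
results = {}
for generations in (50, 200, 800):
    free = mean_square_offset(generations, RATE, bounded=False)
    capped = mean_square_offset(generations, RATE, bounded=True)
    results[generations] = (free, capped)
    print(f"{generations:4d} generations  unbounded: {free:.2f}  bounded: {capped:.2f}")
```

In the unbounded walk the variance keeps growing roughly in proportion to time, while under the assumed bounded rule it levels off near 0.5 and stops carrying age information. Which picture is closer to a given marker over a given time span is exactly the empirical question being argued here.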
« Last Edit: April 26, 2012, 10:12:42 AM by Mikewww » Logged

R1b-L21>L513(DF1)>L705.2
ironroad41
« Reply #140 on: April 26, 2012, 10:15:10 AM »

The set you probably should use depends on the time frame of interest.  This was Busby's observation, but not his practice, if I read his paper correctly.  It's a probability issue.  For independent events, as mutations are, the probability of two mutations at a locus is equal to the P(1) mutation squared.  I don't have a good rule for picking; I observe, whatever their rates are, that CDYa,b can have more than one mutation per entry in a relatively short time, hundreds of years.  Maybe you can scale from their rate to estimate which DYS loci have a low probability of two mutations in 1K years and so on?

When I say bounded, I mean that (excepting multisteps) the mutational process at a DYS locus is bounded/confined to modal +/-1.
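
One way to put numbers on the suggestion above (which loci have a low probability of two mutations in 1K years) is to treat the mutation count at a locus as Poisson. That is an assumption of this sketch, as are the function names and the "hypothetical fast marker" rate of 0.03, which is a placeholder and not a measured CDY rate; 0.0023 is the average rate used elsewhere in this thread.

```python
import math

def p_at_least(k_min, mean):
    """P(at least k_min events) for a Poisson count with the given mean."""
    below = sum(math.exp(-mean) * mean**k / math.factorial(k) for k in range(k_min))
    return 1.0 - below

def repeat_hit_odds(rate, years, years_per_generation=30):
    """Return (P(>=1 mutation), P(>=2 mutations)) at one locus over the period."""
    mean = rate * years / years_per_generation
    return p_at_least(1, mean), p_at_least(2, mean)

for label, rate in (("thread average", 0.0023), ("hypothetical fast marker", 0.03)):
    once, twice = repeat_hit_odds(rate, 1000)
    print(f"{label}: P(>=1) = {once:.4f}, P(>=2) = {twice:.4f}")
```

For small expected counts the Poisson chance of two or more hits comes out near half of P(1) squared, so the squaring rule in the post gives the right order of magnitude; for a fast marker the two-hit probability is no longer negligible within 1K years.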
Logged
Mike Walsh
« Reply #141 on: April 26, 2012, 10:16:05 AM »

Quote from: Mike Walsh
I haven't read Goldstein's report. Would you mind posting it again?

Quote from: ironroad41
The Goldstein/Stumpf paper is from Science, Vol. 291, 2 March 2001

That paper is only available for a fee.  Please post the excerpts that apply.
Logged

R1b-L21>L513(DF1)>L705.2
Mark Jost
« Reply #142 on: April 26, 2012, 10:24:49 AM »

I posted this on DNA-forums last year.

http://www.plosone.org/article/info:doi/10.1371/journal.pone.0007276

Our findings suggest that Y chromosome STRs of increased repeat unit size have a lower rate of evolution, which has significant relevance in population genetic and evolutionary studies.


Principal Findings
In order to study the evolutionary dynamics of STRs according to repeat unit size, we analysed variation at 24 Y chromosome repeat loci: 1 tri-, 14 tetra-, 7 penta-, and 2 hexanucleotide loci. According to our results, penta- and hexanucleotide repeats have approximately two times lower repeat variance and diversity than tri- and tetranucleotide repeats, indicating that their mutation rate is about half of that of tri- and tetranucleotide repeats.
Logged

148326
Pos: Z245 L459 L21 DF13**
Neg: DF23 L513 L96 L144 Z255 Z253 DF21 DF41 (Z254 P66 P314.2 M37 M222  L563 L526 L226 L195 L193 L192.1 L159.2 L130 DF63 DF5 DF49)
WTYNeg: L555 L371 (L9/L10 L370 L302/L319.1 L554 L564 L577 P69 L626 L627 L643 L679)
Mark Jost
« Reply #143 on: April 26, 2012, 10:28:54 AM »

Here is the Google Cache of the thread.

http://webcache.googleusercontent.com/search?q=cache:skOm4nTP5SQJ:dna-forums.org/index.php%3F/topic/16142-star-wars-i-mean-str-wars-for-r1b/page__st__20&hl=en&gl=us&prmd=imvns&strip=1
Logged

148326
Pos: Z245 L459 L21 DF13**
Neg: DF23 L513 L96 L144 Z255 Z253 DF21 DF41 (Z254 P66 P314.2 M37 M222  L563 L526 L226 L195 L193 L192.1 L159.2 L130 DF63 DF5 DF49)
WTYNeg: L555 L371 (L9/L10 L370 L302/L319.1 L554 L564 L577 P69 L626 L627 L643 L679)
ironroad41
« Reply #144 on: April 26, 2012, 10:35:49 AM »

Quote from: Mike Walsh
I haven't read Goldstein's report. Would you mind posting it again?

Quote from: ironroad41
The Goldstein/Stumpf paper is from Science, Vol. 291, 2 March 2001

Quote from: Mike Walsh
That paper is only available for a fee.  Please post the excerpts that apply.

Did you try www.sciencemag.org?  For older issues I believe you can access them without cost, but it may require you to register?  If I have it stored on-line, I'll email it to you.
Logged
Mike Walsh
« Reply #145 on: April 26, 2012, 10:49:32 AM »

You have an argument. This is fine.

...
I am arguing that most tetra motif dys loci don't accumulate variance with time.

I ask you for some detail so I can modify my variance calculations and look at STRs you think are appropriate. I'm volunteering to do this for you.  I don't really think this effort is going to lead to anything, but I'm willing to test your argument with real data like I have on Marko's "linear markers" or Ken's idea of "more markers is better except multi-copy, etc."

What tetra STR markers out of FTDNA's first 67 should be eliminated?  Please provide the list.  It should be easy to run a couple of comparisons.   Maybe this will line up with Marko Heinila's linear duration analysis in which case the "36 linear" markers that I use will be appropriate.

Below is your answer.  My request to help you is simple but you are not helping me help you.

The set you probably should use depends on the time frame of interest.  This was Busby's observation, but not his practice, if I read his paper correctly.  It's a probability issue.  For independent events, as mutations are, the probability of two mutations at a locus is equal to the probability of one mutation, P(1), squared.  I don't have a good rule for picking.  I observe that, whatever their rates are, CDYa,b can have more than one mutation per entry in a relatively short time, hundreds of years.  Maybe you can scale from their rate to estimate which DYS loci have a low probability of two mutations in 1K years, and so on?
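The scaling idea in the paragraph above can be sketched numerically. This is only an illustration under stated assumptions: mutations at a locus are treated as a Poisson process, the per-generation rates are made-up round numbers (not measured rates for any named marker), and 30 years per generation is assumed. Under those assumptions the chance of two or more mutations is roughly half the square of the expected mutation count when that count is small, which is the same order of magnitude as the "P(1) squared" rule of thumb.

```python
import math


def prob_at_least_two(rate_per_gen, generations):
    """Probability of two or more mutations at one locus.

    Assumes mutations arrive as a Poisson process with the given
    per-generation rate (an assumption, not something the thread states).
    """
    lam = rate_per_gen * generations          # expected number of mutations
    return 1.0 - math.exp(-lam) * (1.0 + lam)  # 1 - P(0) - P(1)


YEARS = 1000
YEARS_PER_GEN = 30                 # assumed generation length
generations = YEARS / YEARS_PER_GEN

# Hypothetical slow, medium and fast rates, chosen only for illustration.
for rate in (0.0005, 0.002, 0.02):
    p = prob_at_least_two(rate, generations)
    print(f"rate {rate:.4f}/gen: P(>=2 mutations in {YEARS} years) = {p:.4f}")
```

A locus could then be kept for a given time frame only if this probability stays under some chosen threshold; the threshold itself would be a judgment call.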

When I say bounded I mean that (excepting multisteps) the mutational process at a DYS locus is bounded/confined to modal +/-1.

You don't have to agree with the results, but please provide specifics on your argument so it can be tested in some manner.

I think we've gone over this, but CDYa,b are multi-copy markers and no one that I know of uses them in TMRCA calculations. They are already excluded from the argument.  I exclude DYS385, YCAII, DYS464, DYS459, DYS413, DYF395S1, DYS425 (possible null), DYS439 (possible null) in any of my STR variance calculations. I do include those in straight GD calculations using modified infinite allele techniques.

I have played with adding and subtracting STRs and comparing relative variance across haplogroups. I've done this more systematically with the linearity estimates Marko Heinila has provided.  I can tell you, it doesn't make much difference as long as you get enough STRs (individual experiments) going. The benefits of the law of large numbers seem to apply.
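To make "each STR is an individual experiment" concrete, here is a minimal sketch of a variance-style estimate. It is not Ken Nordtvedt's spreadsheet or the exact calculation used in this thread. It assumes a simple single-step mutation model in which the expected squared deviation from the founder's value grows as rate times generations; the haplotypes, modal values and mutation rates are all invented toy numbers.

```python
from statistics import mean


def per_marker_generations(haplotypes, modal, rates):
    """One TMRCA estimate (in generations) per marker.

    For each marker: mean squared deviation from the modal value,
    divided by that marker's own mutation rate.
    """
    estimates = {}
    for marker, rate in rates.items():
        squared_dev = [(h[marker] - modal[marker]) ** 2 for h in haplotypes]
        estimates[marker] = mean(squared_dev) / rate
    return estimates


def combined_generations(estimates):
    """Average the per-marker estimates into one summary figure."""
    return mean(list(estimates.values()))


# Toy data: values and rates are hypothetical, for illustration only.
modal = {"DYS393": 13, "DYS390": 24}
rates = {"DYS393": 0.001, "DYS390": 0.003}
haplotypes = [
    {"DYS393": 13, "DYS390": 24},
    {"DYS393": 14, "DYS390": 24},
    {"DYS393": 13, "DYS390": 25},
    {"DYS393": 13, "DYS390": 23},
]

estimates = per_marker_generations(haplotypes, modal, rates)
print(estimates)
print(combined_generations(estimates))
```

With only two markers the per-marker estimates disagree a lot; adding more markers is what lets the scatter average out, which is the law-of-large-numbers point.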

I am not going to do extra research and gyrations unless you can be specific on what you want to test and do your own homework.   Do you want to improve the processes, or do you just not like the answers?
« Last Edit: April 26, 2012, 11:38:39 AM by Mikewww » Logged

R1b-L21>L513(DF1)>L705.2
Mike Walsh
Guru
*****
Offline Offline

Posts: 2963


WWW
« Reply #146 on: April 26, 2012, 11:23:21 AM »

I posted this on DNA-forums last year.

http://www.plosone.org/article/info:doi/10.1371/journal.pone.0007276

Our findings suggest that Y chromosome STRs of increased repeat unit size have a lower rate of evolution, which has significant relevance in population genetic and evolutionary studies. ...

umm... this is making a little more sense to me in terms of the academic back and forth.

"Decreased Rate of Evolution in Y Chromosome STR Loci of Increased Size of the Repeat Unit" by Jarve also includes Zhivotovsky as an author.  Zhivotovsky is the guy who gets his name hung on as the label for the famous (or infamous) evolutionary mutation rates.  I should go try to find Nordtvedt's Rootsweb posts. He really just plain calls the Zhivotovsky evolutionary rates bad science.  That's another side discussion, but it would make sense that given criticism, Zhivotovsky would need to go out and find some bad STRs to help support what some people call his times 3 fudge factor.

Nevertheless, some STRs probably do behave non-linearly outside of certain time ranges. Marko Heinila addressed this with a statistical analysis across tens of thousands of haplotypes. Don't ask me about his method. He's way beyond me. It seemed  logical when he presented it on the "TMRCA report" thread (Aug 2011) on DNA forums. I don't remember any arguments against his methods.

Here are all the markers where the "timeframe for each locus where saturation effects are relatively insignificant" was greater than 5000 years.  I don't use the multi-copy markers, even if he included them.
 
Quote
DYS426      > 100000
DYS447      > 100000
DYS590      > 100000
DYS641      > 100000
DYS472      > 100000
DYS425      > 100000
DYS436      > 100000
DYS490      > 100000
DYS450      > 100000
DYS617      > 100000
DYS492      > 100000
DYF395S1b      93052
DYS455         92365
DYS388         91912
DYS392         63939
DYS438         44590
DYS578         42906
DYS448         35579
DYS454         32780
YCAIIa         32468
DYS385a        31095
DYS520         26205
DYS531         24566
DYS446         24038
DYS594         24008
YCAIIb         23585
DYS385b        23191
DYS640         22915
DYS568         16304
DYS607         15957
DYS557         15291
DYS481         14970
DYS413b        14512
DYS537         13943
DYS437         13858
DYF395S1a      13021
DYS487         11721
DYF406S1       11405
DYS570         10071
DYS565          9546
DYS393          9512
DYS459a         8550
DYS413a         8471
DYS449          8044
DYS19a          7964
DYS390          7178
DYS511          5475
DYS572          5285
DYS442          5260
DYS444          5163

In my "36 linear" marker set I'm not using the ones at the bottom, like DYS572. I'm only using STRs with timeframes greater than 7000 years (to cover the Neolithic time.)  As I've said, I don't use any multi-copy markers.
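The selection rule just described (keep loci whose timeframe exceeds 7000 years, drop multi-copy markers) can be written as a small filter. This is a sketch, not the actual "36 linear" list: only a handful of rows from the table above are included, "> 100000" is stored as the lower bound 100000, and the exclusion names are taken from the exclusions listed earlier in the thread, so applying them here is an assumption.

```python
# A few rows copied from the quoted table; "> 100000" is stored as 100000.
LINEARITY_YEARS = {
    "DYS426": 100000,
    "DYS425": 100000,
    "DYS392": 63939,
    "DYS385a": 31095,
    "DYS393": 9512,
    "DYS459a": 8550,
    "DYS390": 7178,
    "DYS572": 5285,
    "DYS444": 5163,
}

# Multi-copy marker families, matched by name prefix (assumed convention).
MULTI_COPY_PREFIXES = ("DYS385", "YCAII", "DYS464", "DYS459", "DYS413", "DYF395S1")
# Markers dropped for possible null values.
OTHER_EXCLUDED = {"DYS425", "DYS439"}


def select_linear_markers(table, min_years=7000):
    """Return markers above the time threshold, skipping excluded ones."""
    selected = []
    for marker, years in table.items():
        if marker.startswith(MULTI_COPY_PREFIXES) or marker in OTHER_EXCLUDED:
            continue
        if years > min_years:
            selected.append(marker)
    return selected


print(select_linear_markers(LINEARITY_YEARS))
```

Changing `min_years` shows how sensitive the marker set is to the time frame of interest.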
« Last Edit: April 26, 2012, 11:25:38 AM by Mikewww » Logged

R1b-L21>L513(DF1)>L705.2
ironroad41
Old Hand
****
Offline Offline

Posts: 219


« Reply #147 on: April 26, 2012, 12:29:56 PM »

You have an argument. This is fine.

...
I am arguing that most tetra motif dys loci don't accumulate variance with time.

I ask you for some detail so I can modify my variance calculations and look at STRs you think are appropriate. I'm volunteering to do this for you.  I don't really think this effort is going to lead to anything, but I'm willing to test your argument with real data like I have on Marko's "linear markers" or Ken's idea of "more markers is better except multi-copy, etc."

What tetra STR markers out of FTDNA's first 67 should be eliminated?  Please provide the list.  It should be easy to run a couple of comparisons.   Maybe this will line up with Marko Heinila's linear duration analysis in which case the "36 linear" markers that I use will be appropriate.

Below is your answer.  My request to help you is simple but you are not helping me help you.

The set you probably should use depends on the time frame of interest.  This was Busby's observation, but not his practice, if I read his paper correctly.  It's a probability issue.  For independent events, as mutations are, the probability of two mutations at a locus is equal to the probability of one mutation, P(1), squared.  I don't have a good rule for picking.  I observe that, whatever their rates are, CDYa,b can have more than one mutation per entry in a relatively short time, hundreds of years.  Maybe you can scale from their rate to estimate which DYS loci have a low probability of two mutations in 1K years, and so on?

When I say bounded I mean that (excepting multisteps) the mutational process at a DYS locus is bounded/confined to modal +/-1.

You don't have to agree with the results, but please provide specifics on your argument so it can be tested in some manner.

I think we've gone over this, but CDYa,b are multi-copy markers and no one that I know of uses them in TMRCA calculations. They are already excluded from the argument.  I exclude DYS385, YCAII, DYS464, DYS459, DYS413, DYF395S1, DYS425 (possible null), DYS439 (possible null) in any of my STR variance calculations. I do include those in straight GD calculations using modified infinite allele techniques.

I have played with adding and subtracting STRs and comparing relative variance across haplogroups. I've done this more systematically with the linearity estimates Marko Heinila has provided.  I can tell you, it doesn't make much difference as long as you get enough STRs (individual experiments) going. The benefits of the law of large numbers seem to apply.

I am not going to do extra research and gyrations unless you can be specific on what you want to test and do your own homework.   Do you want to improve the processes, or do you just not like the answers?

I can't answer many of your queries.  I think it is important first to agree, or disagree, on my premise that many of the dys loci (medium rate) are limited/bounded.  I've provided a dataset that suggests they are, but I think we need more data.

A prior paper by Goldstein, referenced in Busby, gives a linearity equation.  That's what Busby used.  I don't know what range of values for each STR Marko used.  If he didn't recognize the problem with multisteps, I would question his definition of linearity.

I'm not asking you to run any test cases yet, since I don't know how to specify what you are asking.  If someone who is much cleverer with software than I am could create some distribution tables, then we can evaluate that data and determine the next step.

I know Ken's opinion of Zhivotovsky.  That said, a lot of knowledgeable folks, as you know, are supportive of his approach.  What I'm trying to do is come up with an understanding of why he had to fudge the data sets referenced by Chandler.  I don't think we are chasing ghosts here.

I appreciate all the attention you've paid to my comments.  I am limited in what guidance I can provide.
Logged
JeanL
Old Hand
****
Offline Offline

Posts: 425


« Reply #148 on: April 26, 2012, 12:46:05 PM »


Here are all the markers where the "timeframe for each locus where saturation effects are relatively insignificant" was greater than 5000 years.  I don't use the multi-copy markers, even if he included them.
 
Quote
DYS426      > 100000
DYS447      > 100000
DYS590      > 100000
DYS641      > 100000
DYS472      > 100000
DYS425      > 100000
DYS436      > 100000
DYS490      > 100000
DYS450      > 100000
DYS617      > 100000
DYS492      > 100000
DYF395S1b      93052
DYS455         92365
DYS388         91912
DYS392         63939
DYS438         44590
DYS578         42906
DYS448         35579
DYS454         32780
YCAIIa         32468
DYS385a        31095
DYS520         26205
DYS531         24566
DYS446         24038
DYS594         24008
YCAIIb         23585
DYS385b        23191
DYS640         22915
DYS568         16304
DYS607         15957
DYS557         15291
DYS481         14970
DYS413b        14512
DYS537         13943
DYS437         13858
DYF395S1a      13021
DYS487         11721
DYF406S1       11405
DYS570         10071
DYS565          9546
DYS393          9512
DYS459a         8550
DYS413a         8471
DYS449          8044
DYS19a          7964
DYS390          7178
DYS511          5475
DYS572          5285
DYS442          5260
DYS444          5163

In my "36 linear" marker set I'm not using the ones at the bottom, like DYS572. I'm only using STRs with timeframes greater than 7000 years (to cover the Neolithic time.)  As I've said, I don't use any multi-copy markers.

Perhaps it would be good to know what methodology he used, because he gets linearity durations three to four times greater than those reported in the Busby et al. (2011) study.

For example, Busby et al. get 19244 ybp of linearity for DYS392, whereas the list above shows 63939 ybp, which is 3.3 times greater. For DYS438 it is 12465 ybp (Busby et al.) vs. 44590 ybp (above), again 3.6 times greater. Likewise DYS437: 4357 ybp (Busby et al.) vs. 13858 ybp (above), and DYS19: 1888 ybp (Busby et al.) vs. 7964 ybp (above).

There are some STRs, such as DYS439, DYS635, DYS456, DYS389I, DYS389II, DYS458 and Y-GATA-H4, that I couldn’t find above. Others do not differ by as much: DYS448 (Busby et al. 25381 ybp vs. 35579 ybp above) and DYS393 (5648 ybp in Busby et al. vs. 9512 ybp above). The exception is DYS390, which gets 9211 ybp in Busby et al. vs. 7178 ybp above. The main point here is that, out of the 7 STRs that overlap, 6 have their linearity inflated. What’s worse, STRs such as DYS437, DYS19 and DYS393, which are being used as “most linear” because the list above gives them a linearity of more than 7000 ybp, actually show a linearity well below 7000 ybp in Busby et al.
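The comparison in the two paragraphs above can be checked with a few lines of arithmetic. The figures are the ones quoted in this post (Busby et al. values as given here, the other column from the quoted list); the list's "DYS19a" entry is assumed to correspond to DYS19. Nothing else is added.

```python
# Linearity durations in years before present, as quoted in this post.
BUSBY = {
    "DYS392": 19244, "DYS438": 12465, "DYS437": 4357, "DYS19": 1888,
    "DYS448": 25381, "DYS393": 5648, "DYS390": 9211,
}
# Values from the quoted list; "DYS19a" there is assumed to be DYS19.
QUOTED_LIST = {
    "DYS392": 63939, "DYS438": 44590, "DYS437": 13858, "DYS19": 7964,
    "DYS448": 35579, "DYS393": 9512, "DYS390": 7178,
}

THRESHOLD = 7000  # cutoff used for the "linear" marker set

ratios = {m: QUOTED_LIST[m] / BUSBY[m] for m in BUSBY}
longer_in_list = [m for m, r in ratios.items() if r > 1]
# Markers that pass the cutoff in the quoted list but not in Busby et al.
disputed = [m for m in BUSBY
            if QUOTED_LIST[m] > THRESHOLD and BUSBY[m] < THRESHOLD]

for marker, ratio in ratios.items():
    print(f"{marker}: {BUSBY[marker]} vs {QUOTED_LIST[marker]} (x{ratio:.1f})")
print("longer in quoted list:", len(longer_in_list), "of", len(ratios))
print("pass cutoff only in quoted list:", disputed)
```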

Again, I don’t know how that person came up with those numbers. I know how Busby et al. came up with theirs: they were based on the observed range of alleles at each locus and the mutation rates measured in father-son pairs.
Logged
Mike Walsh
Guru
*****
Offline Offline

Posts: 2963


WWW
« Reply #149 on: April 26, 2012, 12:46:12 PM »

....    I don't know what range of values for each STR Marko used.  If he didn't recognize the problem with multisteps, I would question his definition of linearity. ...

You were the one who referred me to Marko Heinila. I was not familiar with him until you put me in contact with him.  He definitely recognizes and tries to account for back-mutations and multi-step mutations. It is my understanding that this is why he chose to use the "maximum likelihood" method.

His definition of linearity is very clear.
Quote from: Marko Heinila
timeframe for each locus where saturation effects are relatively insignificant

I am beginning to think you won't accept anything that does not fit your preset theories on Doggerland or on various clans.  If that is the actual basis for your disagreement, that is fine - just say so.
« Last Edit: April 26, 2012, 01:08:12 PM by Mikewww » Logged

R1b-L21>L513(DF1)>L705.2