World Families Forums - STR Wars: Is diversity meaningful? more meaningful than Hg frequency?

Author Topic: STR Wars: Is diversity meaningful? more meaningful than Hg frequency?  (Read 27712 times)
Mike Walsh
Guru
*****
Offline Offline

Posts: 2964


WWW
« on: April 12, 2012, 12:17:08 PM »

This is always a contentious issue. I think STR diversity is useful. There are challenges and they must be considered in context.

In my opinion, people are fine with it until it disagrees with their theory, then they must shoot it down rather than adjust their theory. To me it is just another data point, and unfortunately we are in dire need of those.

Anyway, let's discuss this topic here so we don't have to argue the points over and over again in other topics, drowning them out.
Logged

R1b-L21>L513(DF1)>S6365>L705.2(&CTS11744,CTS6621)
Mike Walsh
Guru
*****
Offline Offline

Posts: 2964


WWW
« Reply #1 on: April 12, 2012, 12:24:35 PM »

Are there any scientific papers out there that use or show different Y DNA STR mutation rates for different haplogroups?
Logged

R1b-L21>L513(DF1)>S6365>L705.2(&CTS11744,CTS6621)
JeanL
Old Hand
****
Offline Offline

Posts: 425


« Reply #2 on: April 12, 2012, 12:30:24 PM »

This is why I like looking at the relative STR variance numbers because that avoids the mutation rate issues you are talking about, but variance still gives indications of direction/migration.

My only issue with relative STR variance is that one measures STR variance off what one presumes to be the modal haplotype for a given population. I mentioned on a different thread the two key assumptions that are made when calculating modals:

1-The ancestral allele for a given locus is the one that minimizes the number of mutations at that locus for a population.

2- The ancestral allele is still present in the sample being analyzed.

Moreover, often, if not always, the STR variance is given as an overall figure, not for each locus analyzed. So what does it matter if population A has an excessive variance coming from locus DYS-XXX, if locus DYS-XXX is known to mutate very fast? Does that somehow make that population older because it has a higher overall variance? What if population-B doesn't have as many mutations in DYS-XXX, but has more than twice the number of mutations population-A has on a different locus, DYS-XXY, which is known to mutate very slowly? When one looks at the overall variance, population-A is still going to have more variance than population-B, but once the variance is broken down per locus, we find that population-B accumulated more variance in the slower marker than population-A. Of course there are at least two possible explanations for this phenomenon:

1-) Population-B for some odd reason(environmental, positive selection, modal allele having more repetitions) actually accumulates mutations on DYS-XXY at a faster rate than population-A.

2-)Population-B accumulates mutations on DYS-XXY at the same rate as population-A, but it just so happens that on locus DYS-XXX population-B has experienced more back-mutations than population-A.
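
To put numbers on the per-locus point, here is a minimal Python sketch. Everything in it is made up for illustration: the five haplotypes per population, the repeat counts, and the idea that DYS-XXX is the fast marker and DYS-XXY the slow one. It also measures spread around the sample mean rather than around a presumed modal, which is a simplification.

```python
from statistics import pvariance

# Hypothetical repeat counts; each tuple is (DYS-XXX fast, DYS-XXY slow).
pop_a = [(22, 12), (24, 12), (26, 12), (28, 12), (25, 13)]
pop_b = [(24, 11), (25, 12), (25, 13), (26, 14), (25, 12)]

LOCI = ("DYS-XXX", "DYS-XXY")

def per_locus_variance(haplotypes):
    """Population variance of the repeat count at each locus."""
    columns = zip(*haplotypes)  # one column of repeat counts per locus
    return {name: pvariance(col) for name, col in zip(LOCI, columns)}

def overall_variance(haplotypes):
    """Sum of the per-locus variances, i.e. the single 'overall' figure."""
    return sum(per_locus_variance(haplotypes).values())

for label, pop in (("A", pop_a), ("B", pop_b)):
    print(label, per_locus_variance(pop), "overall:", overall_variance(pop))
```

With these invented numbers population-A wins on overall variance purely because of the fast marker, while population-B carries more variance at the slow one, which is exactly the situation the overall figure hides.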

« Last Edit: April 12, 2012, 12:35:53 PM by JeanL » Logged
Mike Walsh
Guru
*****
Offline Offline

Posts: 2964


WWW
« Reply #3 on: April 12, 2012, 04:43:44 PM »

This is why I like looking at the relative STR variance numbers because that avoids the mutation rate issues you are talking about, but variance still gives indications of direction/migration.

My only issue with relative STR variance is that one measures STR variance off what one presumes to be the modal haplotype for a given population. I mentioned on a different thread the two key assumptions that are made when calculating modals:

1-The ancestral allele for a given locus is the one that minimizes the number of mutations at that locus for a population.

2- The ancestral allele is still present in the sample being analyzed.

Moreover, often, if not always, the STR variance is given as an overall figure, not for each locus analyzed. So what does it matter if population A has an excessive variance coming from locus DYS-XXX, if locus DYS-XXX is known to mutate very fast? Does that somehow make that population older because it has a higher overall variance? What if population-B doesn't have as many mutations in DYS-XXX, but has more than twice the number of mutations population-A has on a different locus, DYS-XXY, which is known to mutate very slowly? When one looks at the overall variance, population-A is still going to have more variance than population-B, but once the variance is broken down per locus, we find that population-B accumulated more variance in the slower marker than population-A. Of course there are at least two possible explanations for this phenomenon:

1-) Population-B for some odd reason(environmental, positive selection, modal allele having more repetitions) actually accumulates mutations on DYS-XXY at a faster rate than population-A.

2-)Population-B accumulates mutations on DYS-XXY at the same rate as population-A, but it just so happens that on locus DYS-XXX population-B has experienced more back-mutations than population-A.

The use of variance in statistical analysis is pretty standard stuff.
Quote
variance is a measure of how far a set of numbers is spread out. It is one of several descriptors of a probability distribution...

Real-world distributions ... not fully known, unlike the behavior of perfect dice or an ideal distribution such as the normal distribution, because it is impractical to account for every raindrop. Instead one estimates the mean and variance of the whole distribution as the computed mean and variance of a sample of n observations drawn suitably randomly from the whole sample space,
http://en.wikipedia.org/wiki/Variance


If I understand your concerns about variance, I think there is a counter-consideration - The Law of Large Numbers -
Quote
According to the law, the average of the results obtained from a large number of trials should be close to the expected value, and will tend to become closer as more trials are performed.
http://en.wikipedia.org/wiki/Law_of_large_numbers

This is fairly intuitive but guys much smarter than I have figured this out long ago.

Ken Nordtvedt describes STR variance based calculations as "individual experiments", one per STR. It is true that any one STR set of allele frequencies for a sample population may not be representative of the total population. However, the more STR "experiments" you run, the more likely you are to get an accurate result.  Most of the variance calculations I've displayed lately have been on 49 STRs. That is a pretty healthy set, particularly compared to academics performing analysis on only 10 or 15 STRs.

Having more STRs is a good thing.  So is having more haplotypes (a larger sample.)
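
The "individual experiments" idea can be sketched with a toy simulation. This is only an illustration under strong assumptions that are not from the thread: every STR gives an independent, equally noisy estimate of the same true value, and the noise is Gaussian. The function name and all numbers are invented.

```python
import random
from statistics import pstdev

def averaged_estimates(n_loci, n_trials=2000, true_value=100.0, noise_sd=40.0, seed=1):
    """Average n_loci noisy per-STR estimates, repeated n_trials times."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_trials):
        per_locus = [rng.gauss(true_value, noise_sd) for _ in range(n_loci)]
        estimates.append(sum(per_locus) / n_loci)
    return estimates

# Spread of the averaged estimate for a 15-STR set versus a 49-STR set.
spread_15 = pstdev(averaged_estimates(15))
spread_49 = pstdev(averaged_estimates(49))
print(f"15 STRs: spread {spread_15:.1f}; 49 STRs: spread {spread_49:.1f}")
```

Under these assumptions the spread shrinks roughly as noise_sd divided by the square root of the number of STRs. The sketch says nothing about what happens when the STRs differ widely in mutation rate.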

« Last Edit: April 12, 2012, 04:44:25 PM by Mikewww » Logged

R1b-L21>L513(DF1)>S6365>L705.2(&CTS11744,CTS6621)
Mike Walsh
Guru
*****
Offline Offline

Posts: 2964


WWW
« Reply #4 on: April 12, 2012, 04:50:36 PM »

There is a concern that STR variance is not linear with time (number of generations.)
Quote from: Maliclavelli
There is no variance when mutations have happened many times forwards and backwards, as happened long ago. This principle holds only for a short lapse of time. I have said this to you many times in the past, and you are free to believe what you like, but for very ancient times I'm afraid that your theories will come out wrong. The aDNA that JeanL also spoke of has already demonstrated this.

Has anyone done any analysis on STRs that have short durations?  Maliclavelli, what do you consider a short lapse of time?

I am aware that Busby et al did an analysis of 15 or 20 STRs.   Marko Heinila has evaluated all 67 STRs of FTDNA's 67-marker set across tens of thousands of haplotypes, in an effort to determine linear duration.
Logged

R1b-L21>L513(DF1)>S6365>L705.2(&CTS11744,CTS6621)
Mike Walsh
Guru
*****
Offline Offline

Posts: 2964


WWW
« Reply #5 on: April 12, 2012, 05:02:48 PM »

Some people have concerns that back mutations are "hidden" and therefore cause an error.
Quote from: Maliclavelli
There is no variance when mutations have happened many times forwards and backwards...

A recent conversation from Rootsweb:
Quote from: general question
My own layman's viewpoint has always been to wonder how such unknowable factors like bottle-necks, back mutations, etc. can ever be adequately compensated for
Here is a response from a Scientist at MIT. John Chandler is the guy who calculated the mutation rates most of us use.
Quote from: John Chandler
That "etc." is exactly the difficulty. I'll point out in passing that back mutations are automatically accounted for in the variance method, and true bottlenecks are quite rare (since total extinction is the usual outcome of a steep decline). However, variable fecundity, whether systematic or random, introduces an unknown distortion into any statistical method based solely on the sampling of the current population. In other words, the "coalescence time" is necessarily a biased estimate of the TMRCA -- the bias direction is known, but the amount is not.
http://archiver.rootsweb.ancestry.com/th/read/genealogy-dna/2012-03/1333051203

Chandler confirmed that back mutations are accounted for, but I quoted the whole answer because Chandler alludes to "coalescence time."  Most intraclade TMRCA estimates we see are estimated from the "coalescence time." The true TMRCA is unknowable and all of these methods just estimate the original time of expansion.
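
Chandler's remark about back mutations can be checked with a small simulation. This is a sketch under assumptions that are mine, not his: symmetric single-step mutations, independent lineages descending from one founder, and no limit on allele size. The rate is exaggerated so the run stays short.

```python
import random

def simulate_asd(mu, generations, n_lineages, seed=7):
    """Average squared distance from the founder's allele at one STR."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_lineages):
        offset = 0  # repeats gained or lost relative to the founder
        for _ in range(generations):
            if rng.random() < mu:
                # Steps go either way, so back mutations happen freely.
                offset += rng.choice((-1, 1))
        total += offset * offset
    return total / n_lineages

mu, generations = 0.02, 200  # hypothetical, exaggerated rate
asd = simulate_asd(mu, generations, n_lineages=4000)
expected = mu * generations  # what a linear variance-versus-time rule predicts
print(f"simulated {asd:.2f} vs expected {expected:.2f}")
```

In this idealized model the average squared distance still grows as rate times generations even though individual lineages step back and forth, which is the sense in which the variance method absorbs back mutations. It does not address fecundity effects or limits on allele range.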

M222 might be a good example. Its TMRCAs are generally youthful, less than 2000 ybp, but this is really the coalescence time. M222 has a distinctive haplotype and interclade TMRCA estimates with other DF23* subclades indicate M222's lineage broke away from DF23* long ago, maybe 4000 ybp. M222 could have been "born" anywhere from 4000 to 1500 years ago and there is no way of knowing for sure where in that time period.  We just know that fairly recently M222 began expanding in earnest.
« Last Edit: April 13, 2012, 07:59:55 AM by Mikewww » Logged

R1b-L21>L513(DF1)>S6365>L705.2(&CTS11744,CTS6621)
JeanL
Old Hand
****
Offline Offline

Posts: 425


« Reply #6 on: April 12, 2012, 05:51:57 PM »


Ken Nordtvedt describes STR variance based calculations as "individual experiments", one per STR. It is true that any one STR set of allele frequencies for a sample population may not be representative of the total population. However, the more STR "experiments" you run, the more likely you are to get an accurate result.  Most of the variance calculations I've displayed lately have been on 49 STRs. That is a pretty healthy set, particularly compared to academics performing analysis on only 10 or 15 STRs.

Having more STRs is a good thing.  So is having more haplotypes (a larger sample.)


While I agree that the more loci one analyzes the more accurate the results would be, my main concern is that when one mixes slow and fast mutating loci, one is undermining the relative variance at each locus. I can tell you that a set of 49 STRs where some STRs are three or four orders of magnitude slower than others isn’t going to yield results more accurate than a set of 10 or 15 STRs where all STRs have very similar mutation rates. I would say that when it comes to STRs quantity matters, but quality matters more. Choose STRs that have similar (i.e. they are not two orders of magnitude apart) mutation rates and calculate the variance using those, and you should be ok.
« Last Edit: April 12, 2012, 06:02:27 PM by JeanL » Logged
Skip McDonald
New Member
*
Offline Offline

Posts: 3


« Reply #7 on: April 12, 2012, 06:23:39 PM »

I agree that there are orders of magnitude differences in the rate that STR markers mutate, but they also stay in predictable ranges.   That indicates that there are different rates for the same STR depending on its value.   The chance of a 16 changing to a 17 is probably not the same rate as a 17 moving to an 18 or back to a 16.


Our rates are averages over a large population and a range of values, and should be applied to the Macro questions of comparing different large populations against each other.   If you try to apply them to a Micro sized question the answers you get will be absolutely wrong, but not entirely useless.   What we get is an educated guess, but a guess nonetheless.

Testing more and more markers is absolutely the best way to help improve these guesses.   Excluding fast moving markers is probably a bad idea as there is useful information there and it should improve your guesses in a large population.

More sophisticated models are called for; perhaps one day we will have better mutation rates based on both the STR and the STR value.   But to do that we need more people to test and to test more STRs.

The other "Elephant in the room" from a statistical standpoint is that the population we have is NOT a random sample,  often large clusters of close/distant kin get tested.   Age estimates that ignore known kinship of participants may skew the results.   Likewise people with virtually identical haplotypes that can document they have no common ancestor for 6 or 7 generations don't have those years added to estimates either.   Many researchers assume they are the same and may even throw out the duplicates.

The bottom line is that Statistics isn't perfect but is one of the best tools we have.

My 2 cents..

Skip
« Last Edit: April 12, 2012, 06:31:00 PM by Skip McDonald » Logged
Mike Walsh
Guru
*****
Offline Offline

Posts: 2964


WWW
« Reply #8 on: April 12, 2012, 06:33:32 PM »

Ken Nordtvedt describes STR variance based calculations as "individual experiments", one per STR. It is true that any one STR set of allele frequencies for a sample population may not be representative of the total population. However, the more STR "experiments" you run, the more likely you are to get an accurate result.  Most of the variance calculations I've displayed lately have been on 49 STRs. That is a pretty healthy set, particularly compared to academics performing analysis on only 10 or 15 STRs.

Having more STRs is a good thing.  So is having more haplotypes (a larger sample.)

While I agree that the more loci one analyzes the more accurate the results would be, my main concern is that when one mixes slow and fast mutating loci, one is undermining the relative variance at each locus. I can tell you that a set of 49 STRs where some STRs are three or four orders of magnitude slower than others isn’t going to yield results more accurate than a set of 10 or 15 STRs where all STRs have very similar mutation rates. I would say that when it comes to STRs quantity matters, but quality matters more....

Ken Nordtvedt runs simulations on the different methodologies he evaluates.  He does concur there is a potential saturation effect with faster STRs, but in his simulations he says the positives of removing some of those STRs are outweighed by the negatives of cutting out STRs, and cutting out fast STRs definitely reduces precision. It's a question of using a watch to measure hours versus using a calendar.

An M222 hobbyist/researcher, Sandy Paterson, has done simulations on the number of STRs to use and he comes up with 50.  This is partially why I'm using the 49 non-multi-copy/non-null STRs of FTDNA's first 67.
http://archiver.rootsweb.ancestry.com/th/read/dna-r1b1c7/2012-03/1332498888

Quote from: JeanL
Choose STRs that have similar (i.e. they are not two orders of magnitude apart) mutation rates and calculate the variance using those, and you should be ok.

Do you have any papers or research that demonstrates this is effective?

I don't know where and how to draw the line based on statistics and I don't run true simulations, but I've made some comparison runs to see if I could "eyeball" any distinctions.

I've run through multiple comparisons, some of which you can probably find on this forum, of selected STR sets based on Marko Heinila's analysis of the linearity of STRs.  Generally, there is not much difference in the relative positioning of variance between haplogroups between using 49 mixed speed markers or Marko's 36 "best" linear markers (out of the first 67.) There is one exception - U198.

I've also tried to weight each STR against its maximum variance so that no STR would have more weight than another. That didn't work out so well. I received some crazy results. I think it goes back to using a calendar to measure hours: every now and then even the slowest STRs have fairly quick successive mutations. It's like the calendar page turned on that STR when I'm only trying to measure 10 or 12 hours worth of time.

Ultimately, the Law of Large Numbers can average or "wash" out aberrations. Nothing is perfect, though.
« Last Edit: April 13, 2012, 08:02:31 AM by Mikewww » Logged

R1b-L21>L513(DF1)>S6365>L705.2(&CTS11744,CTS6621)
rms2
Board Moderator
Guru
*****
Offline Offline

Posts: 5023


« Reply #9 on: April 12, 2012, 07:49:21 PM »

I'm certainly no expert on the math of variance calculations, but I do think they must be considered within the context of other evidence, like history, the distribution of a y haplogroup, its known ethnolinguistic affiliations, etc. I think the SNP trail, where one can be established, is probably more important than variance.

Variance cannot be the sole consideration in trying to determine where a particular y haplogroup originated or where it was at a particular time. It can only establish an upper bound on the age of a haplogroup in a place, for one thing, barring something odd like a bottleneck or genetic drift, both of which are nearly impossible to prove outside of known historical incidents (like a plague, for example).

A haplogroup could have a fairly high variance in a place and yet be a relatively late arrival there.

Witness Mike's recent North American U106 variance calculation. It was fairly high, even relative to places in Europe. If it were the sole or even paramount consideration, someone might be tempted to conclude that U106 has been in North America for millennia.

« Last Edit: April 12, 2012, 07:50:17 PM by rms2 » Logged

JeanL
Old Hand
****
Offline Offline

Posts: 425


« Reply #10 on: April 12, 2012, 08:54:16 PM »

Do you have any papers or research that demonstrates this is effective?
While there aren’t any papers that address the issue directly, the Busby et al(2011) publication talked about an appreciable effect of microsatellite choice on age estimates.
Quote from: Busby et al(2011)
We further investigate the young, STR-based time to the most recent common ancestor estimates proposed so far for R-M269-related lineages and find evidence for an appreciable effect of microsatellite choice on age estimates.

Ultimately, the Law of Large Numbers can average or "wash" out aberrations. Nothing is perfect, though.
Yes, and no. The law of large numbers would average or “wash” out aberrations when these aberrations are nothing but outliers, but when you have 49 markers, and half of them are slow, and the other half are fast, then the law of large numbers doesn’t do squat for us.  Now if in your sample you have 44 STRs which have a somewhat similar mutation rate, and 5 that have a different (slower or faster) one, then yeah, chances are any effects would be “washed” out.




-----------------------------------------------------------------------------------------------------------
I agree that there are orders of magnitude differences in the rate that STR markers mutate, but they also stay in predictable ranges.   That indicates that there are different rates for the same STR depending on its value.   The chance of a 16 changing to a 17 is probably not the same rate as a 17 moving to an 18 or back to a 16.

In fact, what happens is that the mutation rate increases as the total length of the repeats increases. So a mutation from 17 to 18 is more likely than a mutation from 16 to 17. So when we try to estimate the time that it took population A to mutate from ancestral allele 13 to allele 16, we need to take into account that the mutation rate was slower from 13 to 14 than from 14 to 15, which in turn was slower than from 15 to 16. Now imagine folks that simply use a mean mutation rate, or rather a constant mutation rate. If the change in mutation rate were linear then sure, one could use an average mutation rate, but the thing is that it isn’t linear.
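
A small worked example of why a single average rate can mislead here. The three step rates below are hypothetical, not measured values, and back mutations are ignored to keep the arithmetic simple. The expected wait for a step with per-generation probability r is 1/r generations.

```python
# Hypothetical per-generation forward rates for each step; not measured values.
step_rates = {(13, 14): 0.001, (14, 15): 0.002, (15, 16): 0.006}

def expected_generations(rates):
    """Expected total wait for consecutive forward steps (back mutations ignored)."""
    return sum(1.0 / r for r in rates)

# Wait using the allele-dependent rate for each step.
varying = expected_generations(step_rates.values())

# Wait if every step is assigned the arithmetic-mean rate instead.
average_rate = sum(step_rates.values()) / len(step_rates)
constant = expected_generations([average_rate] * len(step_rates))

print(f"allele-dependent rates: {varying:.0f} generations")
print(f"single average rate:    {constant:.0f} generations")
```

With these made-up rates the constant-rate calculation gives a noticeably shorter time than the step-by-step one, because the slow first step dominates the wait. How large the gap is in real data depends on rates nobody in this thread has supplied.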

Testing more and more markers is absolutely the best way to help improve these guesses.   Excluding fast moving markers is probably a bad idea as there is useful information there and it should improve your guesses in a large population.   

More sophisticated models are called for; perhaps one day we will have better mutation rates based on both the STR and the STR value.   But to do that we need more people to test and to test more STRs.

Couldn’t agree more on that. 

The other "Elephant in the room" from a statistical standpoint is that the population we have is NOT a random sample,  often large clusters of close/distant kin get tested.   Age estimates that ignore known kinship of participants may skew the results.   

Indeed a lot of samples used here come from FTDNA Projects, so unfortunately there is a lack of randomness, which is vital for statistical analyses. I tried to use only samples from published studies, but even those often offer too little resolution (i.e. they only test a limited number of STRs, or give too basic a resolution at the SNP level). Hopefully this situation will change in the near future.
« Last Edit: April 12, 2012, 08:54:43 PM by JeanL » Logged
Mike Walsh
Guru
*****
Offline Offline

Posts: 2964


WWW
« Reply #11 on: April 13, 2012, 08:12:16 AM »

I'm certainly no expert on the math of variance calculations, but I do think they must be considered within the context of other evidence, like history, the distribution of a y haplogroup, its known ethnolinguistic affiliations, etc. I think the SNP trail, where one can be established, is probably more important than variance.
I agree, although I'd add a couple of other disciplines to the list to consider in context, including archeology, geography, terrain, climate and what's known about prehistory. I'm sure you would agree.

I think that looking at variance and the SNP trail go hand in glove as we try to look at deeper resolution as new SNPs are discovered.

Variance cannot be the sole consideration in trying to determine where a particular y haplogroup originated or where it was at a particular time. It can only establish an upper bound on the age of a haplogroup in a place, for one thing, barring something odd like a bottleneck or genetic drift, both of which are nearly impossible to prove outside of known historical incidents (like a plague, for example).

A haplogroup could have a fairly high variance in a place and yet be a relatively late arrival there.....
I agree, although I'd add a couple of other disciplines to the list to consider in context, including archeology, geography, terrain, climate and what's known about prehistory. I'm sure you would agree.

This is one of the difficulties about looking at variance by geography that is not inherent in looking at variance by haplogroup.  We know the haplogroup is all related people but within a geography there are really probably a mix of sub-haplogroups, some of which came in at different times.

Variance can be high in a geography, but how does one tell a pooling point or crossroads from a launch/origin point?

Nevertheless, if variance is low for a haplogroup in a location, it is young there.


Logged

R1b-L21>L513(DF1)>S6365>L705.2(&CTS11744,CTS6621)
Mike Walsh
Guru
*****
Offline Offline

Posts: 2964


WWW
« Reply #12 on: April 13, 2012, 08:21:43 AM »

Ultimately, the Law of Large Numbers can average or "wash" out aberrations. Nothing is perfect, though.
Yes, and no. The law of large numbers would average or “wash” out aberrations when these aberrations are nothing but outliers, but when you have 49 markers, and half of them are slow, and the other half are fast, then the law of large numbers doesn’t do squat for us.  Now if in your sample you have 44 STRs which have a somewhat similar mutation rate, and 5 that have a different (slower or faster) one, then yeah, chances are any effects would be “washed” out.
That is not true, the Law of Large Numbers is still applicable. You may argue that 49 markers is not enough. That is fine, but Sandy Paterson (the M222 researcher) has done simulations and determined that 50 was enough for reasonable precision.

As I said, Ken Nordtvedt has also run simulations on this and concludes you want a mix of slow and fast markers. You don't want to discard the fast markers unless you have to, like is done with the multi-copy markers.

Most of the scientific genetic research available today is based on 15, 10 or fewer markers.  You should write a counter-argument paper to tell them they are all wrong.

Even Busby, who you cite, uses STR diversity on only 10 markers to justify their primary point against Balaresque, that there are no clines across Europe for R1b-L11/S127.
Logged

R1b-L21>L513(DF1)>S6365>L705.2(&CTS11744,CTS6621)
Mike Walsh
Guru
*****
Offline Offline

Posts: 2964


WWW
« Reply #13 on: April 13, 2012, 08:31:40 AM »

Choose STRs that have similar (i.e. they are not two orders of magnitude apart) mutation rates and calculate the variance using those, and you should be ok.
Do you have any papers or research that demonstrates this is effective?
While there aren’t any papers that address the issue directly, the Busby et al(2011) publication talked about an appreciable effect of microsatellite choice on age estimates.

The Busby discussion was based on picking microsatellites (STR markers) that had long linear durations. They did not try to pick STRs based on whether they had similar mutation rates or not.

Perhaps Busby's real problem was not considering enough STRs and so not taking advantage of more "individual experiments." I can see that the fewer STRs you have, the more critical it becomes to pick ones that are representative and linear within your target population.  The problem is, how do you really know you are picking the right ones and not just cherry picking the data?

BTW, Busby left a gaping logic hole in the STRs they used for their analysis.
Logged

R1b-L21>L513(DF1)>S6365>L705.2(&CTS11744,CTS6621)
JeanL
Old Hand
****
Offline Offline

Posts: 425


« Reply #14 on: April 13, 2012, 08:43:23 AM »

That is not true, the Law of Large Numbers is still applicable. You may argue that 49 markers is not enough. That is fine, but Sandy Paterson (the M222 researcher) has done simulations and determined that 50 was enough for reasonable precision.

As I said, Ken Nordtvedt has also run simulations on this and concludes you want a mix of slow and fast markers. You don't want to discard the fast markers unless you have to, like is done with the multi-copy markers.

Most of the scientific genetic research available today is based on 15, 10 or fewer markers.  You should write a counter-argument paper to tell them they are all wrong.

Even Busby, who you cite, uses STR diversity on only 10 markers to justify their primary point against Balaresque, that there are no clines across Europe for R1b-L11/S127.

Well, if you think the law of large numbers still applies, check the variance on a set of slow markers, then on a set of fast markers and then on the combined set of the two.  See if your combined variance falls anywhere within the standard deviation of the mean variance of either of the other two sets. Again, I’m not talking about a set of 49 STRs where 44 STRs have similar mutation rates, I’m talking about a set of 49 STRs where about 50% of them are slow markers and 50% are fast markers.
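
One way to run that kind of check is sketched below. The per-locus variances and mutation rates are invented and deliberately built to agree with a simple "variance = rate x generations" model, so this only shows the bookkeeping, not a real result. The helper names are hypothetical.

```python
# Hypothetical (variance, mutation rate per generation) pairs, one per locus.
slow = [(0.05, 0.0005), (0.04, 0.0004), (0.06, 0.0006)]
fast = [(0.60, 0.006), (0.75, 0.0075), (0.45, 0.0045)]

def mean_variance(loci):
    """Raw average variance across a marker set."""
    return sum(v for v, _ in loci) / len(loci)

def generations_estimate(loci):
    """Summed variance over summed rate (assumes variance = rate x generations)."""
    return sum(v for v, _ in loci) / sum(mu for _, mu in loci)

for label, loci in (("slow", slow), ("fast", fast), ("combined", slow + fast)):
    print(f"{label:9s} mean variance {mean_variance(loci):.3f}  "
          f"estimated generations {generations_estimate(loci):.1f}")
```

The raw mean variance of the combined set lands between the two subsets and is driven mostly by the fast markers, so it is not directly comparable to either. Dividing by the rates puts the three sets on one scale, and with real data a fast-marker estimate falling well below the slow-marker one would be a hint of saturation.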

As for Dr. Nordtvedt, yeah, one could mix fast and slow markers if one presumes that the TMRCA of the set is fairly recent, and that the number of mutations on each slow marker is either 0 or just one, as the time frame isn’t long enough for any of the very slow ones to have back-mutated.

I don’t think scientists are wrong in using only 10-15 STRs. It would be preferable to use larger numbers, but oftentimes budget constraints lead us to choose the most cost effective option.

You are right Busby used 10-15 STRs, but his team also showed that the TMRCA varied a lot when he changed the choice of STRs, see Figure S4 in his study.  

The Busby discussion was based on picking microsatellites (STR markers) that had long linear durations. They did not try to pick STRs based on whether they had similar mutation rates or not.

You wanna take a guess which STRs have the longer linearity: fast or slow mutating ones? See figure-1 in the Busby study to get your answer.
« Last Edit: April 13, 2012, 08:52:40 AM by JeanL » Logged
Mike Walsh
Guru
*****
Offline Offline

Posts: 2964


WWW
« Reply #15 on: April 13, 2012, 11:14:31 AM »

...
The Busby discussion was based on picking microsatellites (STR markers) that had long linear durations. They did not try to pick STRs based on whether they had similar mutation rates or not.

You wanna take a guess which STRs have the longer linearity: fast or slow mutating ones? See figure-1 in the Busby study to get your answer.

My point is still correct. Busby et al sought, as they should, to find linear correlation with time. Yes, that generally means slower markers rather than faster, but that is NOT the criterion and is not a 100% rule.

Anyway, Busby's analysis was only half-hearted. Marko Heinila's is much more thorough.  The real finding is that STR markers with high absolute allele values (i.e. 30, 31, etc.) are the ones that are saturated and have linearity concerns.

This is the study that shows that.  "Decreased Rate of Evolution in Y Chromosome STR Loci of Increased Size of the Repeat Unit" by Jarve et al, 2009. Marko Heinila's work also reflects this but it is not a 100% rule either.

Marko's perspective is that the mutation rate actually increases as a marker reaches the high end number of repeat units, but that back-mutations increase dramatically so they appear to "saturate."

It is not the mutation rate that is the issue although any marker with a high mutation rate could easily end up in the high end of the allele range.
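
To show what "saturation" looks like, here is a toy simulation. It is not Marko Heinila's model: as a crude stand-in it assumes symmetric single-step mutations with hard limits on allele size, where a step that would leave the allowed range simply does not happen. The rate, range and lineage count are invented.

```python
import random

def variance_over_time(mu, checkpoints, lo, hi, start, n_lineages=2000, seed=3):
    """Variance of repeat counts across lineages at each checkpoint generation."""
    rng = random.Random(seed)
    alleles = [start] * n_lineages
    results = {}
    for gen in range(1, max(checkpoints) + 1):
        for i in range(n_lineages):
            if rng.random() < mu:
                step = rng.choice((-1, 1))
                # Hard range limit: steps outside [lo, hi] are discarded.
                if lo <= alleles[i] + step <= hi:
                    alleles[i] += step
        if gen in checkpoints:
            m = sum(alleles) / n_lineages
            results[gen] = sum((a - m) ** 2 for a in alleles) / n_lineages
    return results

mu = 0.05  # hypothetical, exaggerated rate
curve = variance_over_time(mu, checkpoints={20, 400}, lo=10, hi=14, start=12)
for gen in sorted(curve):
    print(f"gen {gen}: variance {curve[gen]:.2f}, linear rule predicts {mu * gen:.2f}")
```

Early on the simulated variance tracks rate times generations closely. Much later it flattens out near the variance of an even spread across the allowed alleles, far below what the linear rule predicts, so the marker stops working as a clock.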

Still, Ken Nordtvedt has run this through simulations and his statistical outcomes show that the gain in linearity from removing faster markers is not worth the lost precision.  Again, he does agree that multi-copy markers should be removed in such calculations.
« Last Edit: April 13, 2012, 11:17:14 AM by Mikewww » Logged

R1b-L21>L513(DF1)>S6365>L705.2(&CTS11744,CTS6621)
Mike Walsh
Guru
*****
Offline Offline

Posts: 2964


WWW
« Reply #16 on: April 13, 2012, 01:12:28 PM »

... I don’t think scientists are wrong in using only 10-15 STRs. It would be preferable to use larger numbers, but oftentimes budget constraints lead us to choose the most cost effective option.

You are right Busby used 10-15 STRs, but his team also showed that the TMRCA varied a lot when he changed the choice of STRs, see Figure S4 in his study.  

The Busby discussion was based on picking microsatellites (STR markers) that had long linear durations. They did not try to pick STRs based on whether they had similar mutation rates or not.

You wanna take a guess which STRs have the longer linearity: fast or slow mutating ones? See figure-1 in the Busby study to get your answer.

I've looked at the Busby data in detail and compared it with Marko Heinila's analysis and the study I just cited.

Scientists are not arbitrarily wrong if their budget is limited and they use only 10-15 STRs, but they are dramatically decreasing their precision and dramatically increasing their risk of having a wrong conclusion.

Let's look at one of Busby's illogical applications.

"The peopling of Europe and the cautionary tale of Y chromosome lineage R-M269" by Busby et al, 2010.

The 2nd column is the ybp.  I added the "xxx"s to show you which markers Busby used in their R1b-L11/S127 STR diversity calculations.

Quote from: Busby
Fifteen Y-STRs with mutation rates, range of alleles and estimate of duration of linearity. All STRs investigated in this study are shown with their mutation rates (μ), estimated from Ballantyne et al, and range of observed alleles, R, with 95% CI is taken from the YHRD. θ(R)/2μ is an estimate of the duration of linearity.
(from Table 1)

Y-STR_______ θ(R)/2μ
DYS448______   25381
DYS392______   19244  XXX
DYS438______   12465  XXX
DYS390______    9211  XXX
DYS393______    5648  XXX <<<
DYS439______    4861  XXX <<<
DYS437______    4357  XXX <<<
DYS635______    4221
DYS456______    3289
DYS389II____    3111  XXX <<<
DYS391______    2554  XXX <<<
DYS458______    1944
DYS19_______    1888  XXX <<<
Y-GATA-H4___    1630
DYS389I_____     953  XXX <<<

Please note that Busby's key conclusion, which is a counter-argument to Balaresque's R1b Neolithic argument, is based on the STR diversity of R1b-L11/S127.
Quote from: Busby
(Abstract)
Our analysis reveals no
geographical trends in diversity, in contradiction to expectation under the Neolithic hypothesis...
(Conclusions)
Alternatively,if R-S127 originated prior to the Neolithic wave of expansion, then either it was already present in most of Europe before the expansion, or the mutation occurred in the east, and was spread before or after the expansion, in which case we would expect higher diversity in the east closer to the origins of agriculture, which is not what we observe.

Notice that the Neolithic revolution started some 10k ybp and was spreading across Europe about 7k ybp.
Quote from: Busby
(Introduction)
Following the development of agriculture in the Fertile Crescent some 10000 years ago, this technology spread from the Near East westward into Europe...

Go back up and look at Busby's Table 1. Only three of the ten STRs they used to draw their conclusion have enough linear duration according to their own evaluation! I put "<<<"s next to the STRs with a linear duration of less than 7k ybp.
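The count is easy to check. A minimal sketch, using only the θ(R)/2μ values and the "XXX" flags from the table quoted above; the 7000 ybp cutoff is the figure used in this post for the Neolithic spread across Europe, not something Busby states as a threshold.

```python
# Duration-of-linearity estimates (theta(R)/2mu) as quoted from Busby's Table 1 above.
# The boolean records whether the marker carries the "XXX" flag, i.e. was used
# in the R1b-L11/S127 diversity calculations.
TABLE_1 = {
    "DYS448": (25381, False),
    "DYS392": (19244, True),
    "DYS438": (12465, True),
    "DYS390": (9211, True),
    "DYS393": (5648, True),
    "DYS439": (4861, True),
    "DYS437": (4357, True),
    "DYS635": (4221, False),
    "DYS456": (3289, False),
    "DYS389II": (3111, True),
    "DYS391": (2554, True),
    "DYS458": (1944, False),
    "DYS19": (1888, True),
    "Y-GATA-H4": (1630, False),
    "DYS389I": (953, True),
}

THRESHOLD_YBP = 7000  # assumed cutoff taken from the post, not from the paper

used = {name: dur for name, (dur, flagged) in TABLE_1.items() if flagged}
long_enough = sorted(n for n, d in used.items() if d >= THRESHOLD_YBP)
too_short = sorted(n for n, d in used.items() if d < THRESHOLD_YBP)

print(len(used), "markers used;", len(long_enough), "linear past the cutoff:", long_enough)
print(len(too_short), "fall short:", too_short)
```

Only DYS390, DYS392 and DYS438 clear the cutoff; the other seven markers used fall below it.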

I've asked this on this forum, on DNA-forums and on Rootsweb. Isn't this a major flaw in their logic? Their own analysis argues against the validity of their primary conclusion, which was to argue against Balaresque.  No one has yet responded as to the logic.
« Last Edit: April 13, 2012, 01:15:52 PM by Mikewww » Logged

R1b-L21>L513(DF1)>S6365>L705.2(&CTS11744,CTS6621)
Mike Walsh
« Reply #17 on: April 13, 2012, 01:18:35 PM »

... The other "Elephant in the room" from a statistical standpoint is that the population we have is NOT a random sample,  often large clusters of close/distant kin get tested.   ...
Skip, I agree. I think a scientifically designed, cross-sectional random sampling of Europe and Western Asia, including the Near East, is needed. It should be based on long haplotypes and high resolution deep clade testing. We don't have that anywhere that I can see. Hence, we are speculating.
« Last Edit: April 13, 2012, 01:19:08 PM by Mikewww » Logged

R1b-L21>L513(DF1)>S6365>L705.2(&CTS11744,CTS6621)
Mike Walsh
« Reply #18 on: April 13, 2012, 01:26:20 PM »

Ken Nordtvedt describes STR variance based calculations as "individual experiments", one per each STR. It is true that any one STR set of allele frequencies for a sample population may not be representative of the total population. However, the more STR "experiments" you run the more likely you are to receive an accurate result.  Most of the variance calculations I've displayed lately have been on 49 STRs. That is a pretty healthy set, particularly compared to academics performing analysis on only 10 or 15 STRs.

Having more STRs is a good thing.  So is having more haplotypes (a larger sample.)
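As a rough illustration of the "individual experiments" idea, here is a toy simulation. It is not Ken Nordtvedt's method; it assumes independent lineages from one founder, symmetric single-step mutations, and the same (exaggerated) mutation rate on every marker, so it only speaks to the sampling noise and says nothing about the slow-versus-fast question. It compares how much an age estimate bounces around when it is averaged over 10 markers versus 49.

```python
import random
import statistics

def marker_variance(n_lineages, generations, mu, rng):
    """Simulate one STR and return its variance about the founder allele (offset 0)."""
    total = 0
    for _ in range(n_lineages):
        allele = 0
        for _ in range(generations):
            if rng.random() < mu:
                allele += rng.choice((-1, 1))
        total += allele * allele
    return total / n_lineages

def age_estimate(variances, mu):
    """Average the per-marker variances and divide by the mutation rate."""
    return statistics.mean(variances) / mu

# Hypothetical settings chosen for speed, not realism.
MU, GENERATIONS, LINEAGES, TRIALS = 0.05, 40, 40, 40
rng = random.Random(2012)

est_10, est_49 = [], []
for _ in range(TRIALS):
    variances = [marker_variance(LINEAGES, GENERATIONS, MU, rng) for _ in range(49)]
    est_10.append(age_estimate(variances[:10], MU))  # only the first 10 markers
    est_49.append(age_estimate(variances, MU))       # all 49 markers

spread_10 = statistics.stdev(est_10)
spread_49 = statistics.stdev(est_49)
print("true age:", GENERATIONS, "generations")
print("10 markers: mean", round(statistics.mean(est_10), 1), "spread", round(spread_10, 2))
print("49 markers: mean", round(statistics.mean(est_49), 1), "spread", round(spread_49, 2))
```

Both sets land near the true age on average, but the 49-marker estimate scatters noticeably less from trial to trial.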

While I agree that the more loci one analyzes the more accurate the results would be, my main concern is that when one mixes slow and fast mutating loci, one is undermining the relative variance on each locus. I can tell you that a set of 49 STRs where some STRs are three and four orders of magnitude slower than others isn’t going to yield results more accurate than a set of 10 or 15 STRs where all STRs mutate with a very similar mutation rate. I would say that when it comes to STRs quantity matters, but quality matters more....

Ken Nordtvedt runs simulations on different methodologies he evaluates.  He does concur there is a potential saturation effect with faster STRs but in his simulations he says the positives of removing some of those STRs are outweighed by the negatives of cutting out STRs, and cutting out fast STRs definitely reduces precision. It's a question of using a watch to measure hours versus using a calendar.

An M222 hobbyist/researcher, Sandy Paterson, has done simulations on the number of STRs to use and he comes up with 50.  This is partially why I'm using the 49 non-multi-copy/non-null STRs of FTDNA's first 67.
http://archiver.rootsweb.ancestry.com/th/read/dna-r1b1c7/2012-03/1332498888

Quote from: JeanL
Choose STRs that have similar (i.e. they are not two orders of magnitude apart) mutation rates and calculate the variance using those, and you should be ok.

Do you have any papers or research that demonstrates this is effective?

I don't know where and how to draw the line based on statistics and I don't run true simulations, but I've made some comparison runs to see if I could "eyeball" any distinctions.

I've run through multiple comparisons, some of which you can probably find on this forum, of selected STR sets based on Marko Heinila's analysis of the linearity of STRs.  Generally, there is not much difference in the relative positioning of variance between haplogroups between using 49 mixed speed markers or Marko's 36 "best" linear markers (out of the first 67.) There is one exception - U198.
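For anyone who wants to try this kind of comparison, here is a minimal sketch of one way to do it. It assumes variance is measured about the modal allele of each marker and then averaged across markers; that is one common convention and not necessarily the exact spreadsheet method used for the runs above. The marker names are real but the haplotypes are made-up toy values, and the two groups are hypothetical.

```python
from collections import Counter

def mean_variance(haplotypes, markers):
    """Average per-marker variance about the modal allele for a set of haplotypes."""
    total = 0.0
    for marker in markers:
        alleles = [h[marker] for h in haplotypes]
        modal = Counter(alleles).most_common(1)[0][0]
        total += sum((a - modal) ** 2 for a in alleles) / len(alleles)
    return total / len(markers)

# Toy data only: two hypothetical groups, three markers each.
group_a = [
    {"DYS393": 13, "DYS390": 24, "DYS19": 14},
    {"DYS393": 13, "DYS390": 25, "DYS19": 14},
    {"DYS393": 13, "DYS390": 23, "DYS19": 15},
    {"DYS393": 13, "DYS390": 24, "DYS19": 14},
]
group_b = [
    {"DYS393": 13, "DYS390": 24, "DYS19": 14},
    {"DYS393": 13, "DYS390": 24, "DYS19": 14},
    {"DYS393": 13, "DYS390": 25, "DYS19": 14},
    {"DYS393": 13, "DYS390": 24, "DYS19": 14},
]

ALL_MARKERS = ["DYS393", "DYS390", "DYS19"]
SUBSET = ["DYS393", "DYS390"]  # stand-in for a hand-picked "more linear" set

ratio_all = mean_variance(group_a, ALL_MARKERS) / mean_variance(group_b, ALL_MARKERS)
ratio_subset = mean_variance(group_a, SUBSET) / mean_variance(group_b, SUBSET)
print("A/B relative variance, all markers:", round(ratio_all, 2))
print("A/B relative variance, subset:", round(ratio_subset, 2))
```

In this toy case group A stays ahead of group B under both marker sets even though the ratio itself shifts. That relative ordering is the thing to compare between a 49-marker run and a 36-marker run.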

I've also tried to weight each STR against its maximum variance so that no STR would have more weight than another. That didn't work out so well. I received some crazy results. I think it goes back to using a calendar to measure hours, and every now and then even the slowest STRs have fairly quick successive mutations. It's like the calendar page turned on that STR when I'm only trying to measure 10 or 12 hours worth of time.

Ultimately, the Law of Large Numbers can average or "wash" out aberrations. Nothing is perfect, though.

There are those that disagree with some of Vincent Vizachero's arguments. He has been a long term project admin for the R1b ht35 project and has a very large database for R1b. My position is that he is very credible on R1b, just like I consider Ken Nordtvedt very credible on TMRCAs, John Chandler on mutation rates and Marko Heinila on STR linear durations and TMRCAs.

Quote from: Vincent Vizachero
For young haplogroups (e.g. within R-M269) that random component in the GD matrix swamps the true phylogenetic signal with such short (e.g. 67 marker) haplotypes such that the relationship between the haplotypes proposed by the algorithms is almost entirely phantom.

For old haplogroups (e.g. more than 25 ky old) the problem of non-linear accumulation of GD due to marker saturation becomes the dominant problem. Creating trees from STRs in this timeframe is typically not necessary, thankfully, now that our SNP-based trees are so much more complete than they were several years ago.
http://archiver.rootsweb.ancestry.com/th/read/GENEALOGY-DNA/2012-04/1334313410

Linear duration of STRs is an issue (I don't deny that,) but you can see that Vizachero considers this to be the issue with haplogroups that are old (of 25k ybp age) and he does not consider R-M269 in that category.  

This is in general agreement with Ken Nordtvedt's simulations, although Ken never gives a time threshold at which limited linear duration makes the shorter-duration STRs a net negative (an accuracy risk).
« Last Edit: April 13, 2012, 05:05:13 PM by Mikewww » Logged

R1b-L21>L513(DF1)>S6365>L705.2(&CTS11744,CTS6621)
eochaidh
« Reply #19 on: April 13, 2012, 02:07:26 PM »

Maybe I've missed it, but has there ever been a case where a population with the highest frequency (percentage and/or numbers) has had a greater diversity than places with a lower frequency (percentage and /or numbers)?

And, if an SNP DF952b+ (just for fun) was found in Ireland among 95% of the L21+ men and was found in Germany  and France among 20 L21+ men total, would the origin of the SNP be Continental if the diversity was higher among the 20 German men? I would say the answer would be yes among most people on these forums.

I will also say again that as soon as one L226+ is found on the Continent then L226 becomes Continental. Actually, I'd say that L226+ is already thought of as Continental, but nothing on the Continent has been found yet. :)

The big question is why this is accepted theory.
Logged

Y-DNA: R1b DF23
mtDNA: T2g
Mike Walsh
« Reply #20 on: April 13, 2012, 02:21:23 PM »

Maybe I've missed it, but has there ever been a case where a population with the highest frequency (percentage and/or numbers) has had a greater diversity than places with a lower frequency (percentage and /or numbers)?....
I don't know.

I don't think even those two numbers, in context of each other, are enough to declare an origination point.  The archaeology, the cultures, linguistics, terrain, etc. all must be considered.
« Last Edit: April 13, 2012, 02:21:56 PM by Mikewww » Logged

R1b-L21>L513(DF1)>S6365>L705.2(&CTS11744,CTS6621)
NealtheRed
« Reply #21 on: April 13, 2012, 02:23:49 PM »

Maybe I've missed it, but has there ever been a case where a population with the highest frequency (percentage and/or numbers) has had a greater diversity than places with a lower frequency (percentage and /or numbers)?

And, if an SNP DF952b+ (just for fun) was found in Ireland among 95% of the L21+ men and was found in Germany  and France among 20 L21+ men total, would the origin of the SNP be Continental if the diversity was higher among the 20 German men? I would say the answer would be yes among most people on these forums.

I will also say again that as soon as one L226+ is found on the Continent then L226 becomes Continental. Actually, I'd say that L226+ is already thought of as Continental, but nothing on the Continent has been found yet. :)

The big question is why this is accepted theory.

I would bet that L226 arose in the Isles. I don't think it has been found on the Continent, but I may be wrong.

I would say the likelihood is stronger that Z253, L226's father, has a Continental origin.
Logged

Y-DNA: R-Z255 (L159.2+) - Downing (Irish Sea)


MTDNA: HV4a1 - Centrella (Avellino, Italy)


Ysearch: 4PSCK



eochaidh
« Reply #22 on: April 13, 2012, 02:48:13 PM »

If the place with the highest diversity is never in the place of highest frequency, then something is wrong.
Logged

Y-DNA: R1b DF23
mtDNA: T2g
Jean M
« Reply #23 on: April 13, 2012, 04:03:59 PM »

If the place with the highest diversity is never in the place of highest frequency, then something is wrong.

It will depend on the pattern of movement. Where a mutation occurs in a fairly stable population, we would expect it to very gradually spread out in all directions from the point of origin. That leaves a pattern with a high frequency centre which should also be high in diversity. R1b-U152 looks roughly like that.

Where a mutation occurs at the spearhead of a migration, you can expect to see the highest density at the point where the migration hits a barrier such as an ocean and is forced to stop. See clines and waves.

In reality of course the nice neat patterns created by one kind of movement are likely to be messed up later by another movement. So we can't expect everything to look exactly like a computer model.
Logged
Mike Walsh
« Reply #24 on: April 13, 2012, 04:57:23 PM »

Maybe I've missed it, but has there ever been a case where a population with the highest frequency (percentage and/or numbers) has had a greater diversity than places with a lower frequency (percentage and /or numbers)?....
I don't know.

I don't think even those two numbers, in context of each other, are enough to declare an origination point.  The archaeology, the cultures, linguistics, terrain, etc. all must be considered.


I don't know the answer to your question, but I would be surprised if there weren't some situations, perhaps many, in which highest diversity and highest frequency correspond.

However, I don't get the point in looking for that just for the sake of looking for that. We have many challenges and difficulties with all of this data, which is being well discussed. Why go looking for hypothetical situations?

If the place with the highest diversity is never in the place of highest frequency, then something is wrong.

What's the point you are trying to make?
« Last Edit: April 13, 2012, 04:58:43 PM by Mikewww » Logged

R1b-L21>L513(DF1)>S6365>L705.2(&CTS11744,CTS6621)