Analyzing population structure using Slatkin & Maddison's s
Background
This is a parametric bootstrapping approach, where an explicit model is used to generate a distribution of expected values of a test statistic. If the observed value of this test statistic falls outside of the 95% confidence interval of the simulated distribution, the model is rejected. In this case, the test statistic, s, is counting the number of changes (in parsimony steps) of a character corresponding to population membership (Slatkin & Maddison 1989). In the instructions below, we deal with two types of taxa: genes and populations. Genes are the gene sequences used to generate a gene tree; each gene sequence usually corresponds to a single sampled individual. Populations are those groupings, determined by the user, for hierarchically organizing individual gene sequences into groupings based on geography, morphology, or other information.Instructions & Example
These instructions are provided with a worked example. Because each model being tested is likely specific to the question being asked, many parameter values and models of population structure will be unique to each study. The models and parameter values used in this example are for educational purposes only, and we encourage investigators to allow reasonable consideration of model paremeters.Getting Started — You will need a Mesquite file with the following parts:
- Taxa Block corresponding to genes sampled
- Reconstructed gene tree
- One or more Taxa Blocks corresponding to populations — You may have multiple population taxa blocks corresponding to different models of population structure. For the ease of this example, we will only deal with a single population taxa block.
- Association between the gene taxa and population taxa
- Population Tree Block corresponding to model(s) being tested — Instructions for creating these model trees are provided below.
Calculating observed value of s. The easiest way to do this is to add a Tree Legend to the gene tree. If you do not have a tree window open showing the reconstructed gene tree, open one using Taxa&Trees > New Tree Window. Add the legend from the Analysis:Tree menu (Analysis:Tree > Tree Legend...), and select "s of Slatkin & Maddison". If you have more than one association included in the file, you will be asked to choose which association to use for the calculations. Record the number that appears in the tree legend — this is the observed value of the test statistic.
In our example, we have 20 gene sequences, sampled from 4 populations. The color of the gene sequences corresponds to the population from which they were sampled (note: taxon name colors are only used for aesthetics and are not required for this type of analysis. If you would like to color taxon names, you will need to set up "Groups" of taxa, which are not the same as associations.). In this example, s is counting the number of changes (in parsimony steps) of a character corresponding to population membership (as indicated by color): Counting the most parsimonious number of changes leads to s = 9 |
In our example, we would like to test two models; the first is a model of population fragmentation, in which all four populations became isolated from one another, some time in the past, say 100,000 generations ago. Starting with a default symmetrical tree, we use the collapse branch tool () to collapse the branches indicated in red on the tree to the left, to create the tree on the right. |
||
Fragmentation Model |
For our alternative model, we will test the hypothesis that Populations 1 & 2 and Populations 3 & 4 were isolated in two refugia 100000 generations ago, and only recently split (1000 generations ago) from two into four populations. For this model tree, we manipulate the fragmentation model tree to reflect the alternative model, and store a copy of this tree under another name (using Store Tree As... instead of Store Tree to avoid overwriting the fragmentation model tree; storing this model tree in the same tree block as the fragmentation model tree will make things easier later on). Using the arrow tool () in the population tree window, we join the branches of populations 1 & 2 and populations 3 & 4 (below, left). To assign branch lengths to individual branches, use the Adjust Branch Length tool () on each branch, assigning branches 1-4 a value of 1000 generations (recent split) and branches 5 & 6 a value of 99000 (so the age of the split is 100000 generations ago). Because the branch lengths of branches 1-4 are relatively short, it may be difficult to visualize them when branches are drawn proportional to lengths (below, right), so turning off Branches Proportional to Length may ease manipulation.
Refugia Model |
We will use the two model trees, fragmentation and refugia, to generate a simulated distribution of the test statistic.
Simulating distributions of test statistic. To simulate a distribution of Slatkin & Maddison's s, bring the tree window of the population tree corresponding to the model you wish to test to the front. Select Analysis > New Bar & Line Chart for > Trees. In the "Select Taxa" dialog that opens, select the block of taxa corresponding to the gene sequences. Choose "Simulated Trees" as the source of trees, and "Coalescence Contained within Current Tree" as the Tree Simulator (in older versions of Mesquite, you may have to check the "Show Secondary Choices" box for this tree simulator to be an option). Set the effective population size when prompted, and choose "s of Slatkin & Maddison" as the value to calculate for trees (s of Slatkin & Maddison may be a Secondary Choice, so you'll have to check the appropriate box for it to appear as an option). You will be prompted to enter the number of trees for the chart, and asked if the current window is appropriate for simulations (you should check to be sure the tree window corresponding to the model you wish to test is the one Mesquite is querying you about). If you have multiple associations in your file, you will be prompted to choose the association for generating gene trees first, then you will be prompted again to choose the association for calculating s. These will often be the same, but certain situations may arise when the association on which the model is based is not the same as the association for which the statistic is calculated.
Mesquite uses a stepwise process to generate the distribution of the test statistic. It repeats the following steps N times, where N is the number of trees you asked Mesquite to use for calculations:
- Simulate a gene tree within population gene tree
- Calculate the value of s for that gene tree and given association
- Record the value of s
- Discard gene tree
After Mesquite has recorded the simulated values, a Chart Window will show a histogram of the simulated distribution. By clicking on the "Text" tab, you can see the numbers used to generate the frequency distribution for the test statistic. These numbers can be copied & pasted into a spreadsheet program to generate graphs outside of Mesquite. Additionally, in the "Graphics" view, you can ask Mesquite to draw lines on the distribution corresponding to 95% confidence interval. In the Chart Window, select Chart > Analysis > Percentiles.... From here, you can choose upper or lower bounds, as well as the size of the percentiles (e.g. for the 90% confidence interval, enter 0.1 for the Percentile Boundary).
The chart window displaying the distribution of s is dependent on the Tree Window showing the population tree on which the gene trees were simulated. By default, the Chart Window is set to re-calculate s if any changes are made to the Tree Window. Thus, if you make changes to the Tree Window (re-arrange branches, change branch lengths, or scroll between trees), the chart will re-simulate gene trees and perform necessary calculations to reflect the distibution of s corresponding to the tree shown in the Tree Window. The auto-recalculate option can be toggled on/off from the Chart menu, if desired. However, if you are testing more than one model of population structure, it is easiest to:
- Generate a distribution for the first model to be tested
- Record the distribution (copy & paste from the "Text" tab into a spreadsheet program)
- In the Tree Window, scroll to the next model to be tested; the distribution will be automatically recalculated, based on the new model
- Repeat for each model being tested, recording the distribution of s for each model
If the observed value of s falls outside of the 95% confidence interval for a particular model, that model, and the associated parameter values (branch lengths, population size, etc.) are not supported by the data.
We begin by simulating a distribution of s for the fragmentation model. From the Tree Window showing the fragmentation model population tree, we select Analysis > New Bar & Line Chart for > Trees, choosing, as prompted, Genes Taxa Block (taxa to calculate value), Simulated Trees (source of trees), Coalescence Contained within Current Tree (tree simulator), 10000 (effective population size), s of Slatkin & Maddison (Value to calculate), 1000 (trees for the chart), and check to be certain that Mesquite is using the fragmentation model tree. The chart window opens, showing the distribution of simulated values of s. By clicking on the "Text" tab, we can see the details of the simulations: Because our observed values of s = 9 was not within the 95% confidence interval of the simulated distribution, we reject the fragmentation model. To test the refugia model, we can take advantage of the Chart Window's dependancy on the Tree Window. Leaving the Chart Window open, we can scroll to the refugia model tree in the Tree Window. The Chart Window will be updated with new simulations, based on the refugia model tree (if it does not, you may need to turn on "Auto-recalculate" from the Chart menu in the Chart Window). By selecting Chart > Analysis > Percentiles..., setting the Percentile boundary to 0.05, and checking the boxes for both upper and lower tails, we can see the 95% confidence interval for this distribution: Because our observed value s = 9 falls within the 95% confidence interval, we do not reject the two refugia model. By clicking on the "Text" tab, the full details of the distribution can be seen: From these results, we reject the fragmentation model in favor of the two refugia model. Further tests of deeper and shallower divergence times can provide confidence interval for this model, for given parameter values. |
In closing, it is important to consider parameter values (Ne, divergence times, etc.) and the explanatory power of the tests. In the example above, if we had used an effective population size 100 times larger (Ne = 1,000,000), the fragmentation model is not rejected (the observed value of s = 9 falls within the 95% confidence interval of the simulated distribution of s). We encourage careful consideration of reasonable parameter values in these approaches. Rather than relying on one or a few point estimates of parameter values, it may be most informative to explore parameter space, and define that parameter space that would best explain the given pattern of genetic variation.