Analyzing Molecular Data

Molecular data (DNA or protein sequences) can be simulated and analyzed in various ways in Mesquite. Most of the features discussed elsewhere concerning editing and analysis of general categorical data also apply to molecular data; here we focus on features specifically designed for sequence data.


BLASTing sequences

The following features allow one to BLAST sequences against GenBank or a local BLASTable database.

BLAST in Web Browser

Select a sequence or portion thereof in the data matrix. Choose Matrix>Search>BLAST in Web Browser, and Mesquite will send a BLAST request to GenBank to search for matching sequences. Your default web browser should open and take you to NCBI's BLAST page.

Top BLAST Matches

This tool searches a database using BLAST, returns information about the top hits, and optionally imports them into the current matrix. To use it, select a sequence or portion thereof in the data matrix and choose Matrix>Search>Top BLAST Matches. You will be given the choice between BLASTing the NCBI GenBank server and BLASTing a local database that you have created on your computer.
If you choose to BLAST GenBank directly, you will be asked to choose among various options.


If you wish instead to BLAST a local database, perhaps containing sequences you generated, you will first need to create a BLASTable database. You can do this by following the instructions on this NCBI BLAST page. After you install the NCBI tools, you can turn a FASTA file containing your sequences into a BLASTable database as follows. Suppose that your FASTA file is called "FASTAfileName" and that you wish to create a BLASTable nucleotide database called "BLASTableDatabaseName". Then use the following command in your computer's terminal shell:

makeblastdb -in FASTAfileName -out BLASTableDatabaseName -dbtype nucl

This will create three files: BLASTableDatabaseName.nin, BLASTableDatabaseName.nsq, and BLASTableDatabaseName.nhr. Move these to whatever folder is required for NCBI's BLAST tools to function (e.g., on macOS, this is a "db" folder in a "blast" folder in your home directory). You can then BLAST the BLASTableDatabaseName database directly from within Mesquite.

Within Mesquite, after you choose to BLAST a local database, you will first be asked to specify some details about the local database before you choose the BLAST options in the "Local BLAST Options" dialog box. In "Database to search", enter the name of the database (e.g., "BLASTableDatabaseName"). You needn't enter any other options in this dialog box for databases you have created yourself. However, if you are going to do a blastX to a local protein database that you downloaded from GenBank, you will need to check the "Use ID in Definition" box.
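The database-building step above can also be scripted. The following is a minimal sketch, not part of Mesquite: it only assembles the makeblastdb command shown above and reports which of the three named files are not yet present in a folder. The function names are hypothetical, and newer releases of the NCBI tools may write additional files beyond the three listed here.

```python
from pathlib import Path

# Suffixes named in the text for a nucleotide database. Newer BLAST+
# releases may write additional files, so treat this as a minimum.
NUCLEOTIDE_SUFFIXES = (".nin", ".nsq", ".nhr")


def makeblastdb_command(fasta_file, db_name, dbtype="nucl"):
    """Return the makeblastdb argument list used in the text above."""
    return ["makeblastdb", "-in", str(fasta_file),
            "-out", str(db_name), "-dbtype", dbtype]


def missing_database_files(db_name, folder="."):
    """List the expected database files that are not present in `folder`."""
    folder = Path(folder)
    names = [db_name + suffix for suffix in NUCLEOTIDE_SUFFIXES]
    return [name for name in names if not (folder / name).exists()]


if __name__ == "__main__":
    command = makeblastdb_command("FASTAfileName", "BLASTableDatabaseName")
    print(" ".join(command))
    print("Not yet present:", missing_database_files("BLASTableDatabaseName"))
```

To actually build the database, the argument list could be passed to `subprocess.run(command, check=True)`, assuming makeblastdb is installed and on your PATH.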

If your BLAST executables are not in the default folder in which the installer placed them, you will need to uncheck "BLAST programs in default location" and specify the path to their folder. If the BLAST databases are not in the default folder specified by the installer, you will need to uncheck "BLAST databases in default location" and specify the path to their folder.

Fetch & Add GenBank Sequences

This option, in Matrix>Utilities>Fetch & Add GenBank Sequences, allows one to enter a comma-delimited list of GenBank accession numbers, which Mesquite will acquire from GenBank and import into the current matrix.
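If you keep accession numbers in a spreadsheet or text file, it can help to tidy the list before pasting it into Mesquite. The sketch below is an illustration only: it cleans a comma-delimited string and, as a separate convenience, builds a request URL for NCBI's E-utilities efetch service. Using efetch here is an assumption of this sketch, not a statement about how Mesquite itself retrieves sequences, and the function names are hypothetical.

```python
from urllib.parse import urlencode

# NCBI E-utilities efetch endpoint (an assumption of this sketch, not
# necessarily the service that Mesquite uses internally).
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def parse_accessions(text):
    """Split a comma-delimited string into clean, de-duplicated accessions."""
    accessions = []
    for item in text.split(","):
        accession = item.strip()
        # Skip empty entries (e.g. from a trailing comma) and repeats.
        if accession and accession not in accessions:
            accessions.append(accession)
    return accessions


def efetch_fasta_url(accessions):
    """Build an efetch URL requesting the accessions as FASTA text."""
    query = urlencode({
        "db": "nuccore",
        "id": ",".join(accessions),
        "rettype": "fasta",
        "retmode": "text",
    })
    return EFETCH_URL + "?" + query
```

For example, `", ".join(parse_accessions(raw_text))` gives a clean comma-delimited list suitable for pasting into the dialog box.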

Simulating DNA sequence evolution

DNA sequence evolution can be simulated to build statistical tests, for instance via parametric bootstrapping. See the page on simulating DNA sequences.
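To give a sense of what such a simulation involves, here is a minimal sketch that evolves a sequence along a single branch under the Jukes-Cantor (JC69) model. It is an illustration under stated assumptions, not Mesquite's simulator: the models Mesquite actually offers are described on the page mentioned above, and the function names here are hypothetical. The sketch assumes the input contains only A, C, G, and T, and that the branch length is the expected number of substitutions per site.

```python
import math
import random

BASES = "ACGT"


def jc69_change_probability(branch_length):
    """Probability that a site ends in a different base under JC69.

    branch_length is the expected number of substitutions per site.
    """
    return 0.75 * (1.0 - math.exp(-4.0 * branch_length / 3.0))


def evolve_jc69(sequence, branch_length, rng=random):
    """Return a descendant of `sequence` after evolving along one branch.

    Assumes `sequence` contains only the characters A, C, G, and T.
    """
    p_change = jc69_change_probability(branch_length)
    descendant = []
    for base in sequence:
        if rng.random() < p_change:
            # Under JC69 the three other bases are equally likely.
            base = rng.choice([b for b in BASES if b != base])
        descendant.append(base)
    return "".join(descendant)
```

For a parametric bootstrap, one would repeat such a simulation many times over a whole tree under the fitted model and recompute the test statistic on each simulated matrix; passing a seeded `random.Random` instance as `rng` makes the replicates reproducible.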

Statistics for DNA sequences

Calculations for categorical characters in general can be applied to DNA sequences. For example, parsimony calculations can be made for DNA sequences, as can basic descriptive statistics such as the percent of a sequence or character that is missing data or gaps. In addition, there are several modules specifically designed for DNA data, illustrated by examples in Mesquite_Folder/examples/Molecular. These calculate compositional bias:

[Figures: gcBiasTaxa.jpg, gcSequence.jpg, gcBiasColorMatrix.jpg]
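As a rough illustration of the kind of descriptive statistics mentioned above, the following sketch computes GC content and the percent of a sequence that is missing data or gaps. It is not Mesquite's implementation. It assumes "?" marks missing data and "-" marks a gap, and it ignores ambiguity codes when computing GC content; the function names are hypothetical.

```python
def gc_fraction(sequence):
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # With no unambiguous bases the fraction is undefined.
    return gc / total if total else float("nan")


def percent_missing_or_gap(sequence, missing="?", gap="-"):
    """Percent of sites in the sequence that are missing data or gaps."""
    if not sequence:
        return 0.0
    flagged = sum(1 for ch in sequence if ch in missing or ch in gap)
    return 100.0 * flagged / len(sequence)
```

Applying `gc_fraction` to each row of a matrix gives a per-taxon value; applying it to each column gives a per-character value.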


DNA Distances

Mesquite supports several distances for DNA data.

Several options are available (in the Distance Parameters submenu) for dealing with ambiguous bases and gaps.
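To show why the treatment of ambiguous bases and gaps matters, here is a minimal sketch of one simple distance, the uncorrected p-distance, with one possible policy: any site at which either sequence has a gap, missing data, or an ambiguity code is skipped. This policy and the function name are assumptions made for illustration; they are not a description of the options in Mesquite's Distance Parameters submenu.

```python
UNAMBIGUOUS = set("ACGT")


def p_distance(seq_a, seq_b):
    """Uncorrected p-distance: proportion of compared sites that differ.

    Sites where either sequence has a gap, missing data, or an ambiguity
    code are skipped (pairwise deletion); this is one possible policy.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in UNAMBIGUOUS and b in UNAMBIGUOUS:
            compared += 1
            if a != b:
                differing += 1
    # With no comparable sites the distance is undefined.
    return differing / compared if compared else float("nan")
```

A different policy, such as counting a gap as a fifth state, would change both the numerator and the denominator, so distances computed under different settings are not directly comparable.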



Statistics for Protein Data



[Figures: stepsVsHydro.jpg, aahydroColorMatrix.jpg]

Visualizing tertiary structure

Although there are not yet dedicated windows for visualizing phylogenetic statistics in the context of molecular structure, features have been added to the Scattergram chart to allow it to be adapted for this purpose. For instance, in this image cytochrome B is shown, with the amino acids colored according to a simple phylogenetic statistic: the number of parsimony steps on a phylogeny. The colors are smoothed by a moving window, and show that several coils of the molecule, a few at the left and one deep at the right, evolve more rapidly than others. This example is illustrated in the data file at Mesquite_Folder/examples/Molecular/06-cytochromeB.nex

[Figure: cytStructure.jpg]
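The moving-window smoothing mentioned above can be sketched as follows. This is only an illustration: it assumes a centered window of odd width that is truncated at the ends of the sequence and a plain arithmetic mean, which may differ from the window Mesquite applies. The function name is hypothetical.

```python
def moving_window_mean(values, width=5):
    """Smooth per-site values with a centered window, truncated at the ends."""
    if width < 1 or width % 2 == 0:
        raise ValueError("width must be a positive odd integer")
    half = width // 2
    smoothed = []
    for i in range(len(values)):
        # Near the ends the window simply contains fewer sites.
        window = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed
```

Here `values` would be the number of parsimony steps at each amino acid position, in sequence order. Wider windows emphasize regional trends, such as whole coils that evolve rapidly, at the cost of blurring individual sites.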

To build such a chart, begin with a file with a matrix of protein sequences. The procedure is also described in the example files 08-cytochromeBlinked.nex and 09-cytochromeBscatter.nex.


Sequence data within populations

See the page on population genetics.

Reconstructing ancestral states

Ancestral states of molecular characters can be reconstructed as described in the page on reconstructing ancestral states. Likelihood methods are not yet available for molecular characters.