Analyzing Molecular Data

Molecular data (DNA or protein sequences) can be simulated and analyzed in various ways in Mesquite. Most of the features discussed elsewhere concerning editing and analysis of general categorical data also apply to molecular data; here we focus on features specifically designed for sequence data.

BLASTing sequences

The following features allow one to BLAST sequences against GenBank or a local BLASTable database.

BLAST in Web Browser

Select a sequence or portion thereof in the data matrix. Choose Matrix>Search>BLAST in Web Browser, and Mesquite will send a BLAST request to GenBank to search for matching sequences. Your default web browser should open and take you to NCBI's BLAST page.

Top BLAST Matches

This tool will search a database using BLAST, and return information about the top hits, and optionally import them into the current matrix. You will be given the choice to either BLAST NCBI's database, or BLAST a local database that you have created. To use this tool, select a sequence or portion thereof in the data matrix. Choose Matrix>Search>Top BLAST Matches. You will be given the choice between BLASTing the NCBI GenBank Server, or BLASTing a local database, on your computer.

You will first need to install the BLAST+ executables on your local computer. You can downoad the latest version from here: https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/. Note which folder they are installed in, as that will be relevant when you set up some of the tools.

BLASTing GenBank

If you choose to BLAST GenBank directly, you will be asked to choose among various options, including:

maximum number of matches to be sought by Mesquite, and the eValue cutoff for acceptable hits
whether the report of results should be saved to a text file
whether details of the taxonomic lineage of the sequences should be reported
whether to import the top matches into the current matrix
the maximum time in seconds to wait for the BLAST search to complete.

BLASTing a database on your computer

If you wish instead to BLAST a local database, perhaps containing sequences you generated, then you will first need to create a BLASTable database. You can do this in one of two ways:

Invoke the makeblastdb command yourself: In your computer's command-line terminal shell, change to the directory containing the FASTA file. Let's imagine that your FASTA file is called "FASTAfileName", and that you wish to create a BLASTable nucleotide database called "BLASTableDatabaseName". Then use the following command in your computer's terminal shell:

makeblastdb -in FASTAfileName -out BLASTableDatabaseName -dbtype nucl

Note: if you are using the version 2.10.0 or later of NCBI BLAST+ tools, you will need to also add "-blastdb_version 4" to the command, as follows:

makeblastdb -in FASTAfileName -out BLASTableDatabaseName -dbtype nucl -blastdb_version 4

Ask Mesquite to create the databases: If you choose this option, then Mesquite will take a directory of FASTA files, and for each of them call makeblastdb to create a BLASTable database. To do this, touch on Mesquite's log window. You will then see a Utilities menu. In there, choose "Make BLASTable files from FASTA...". You will get a dialog box like this:

After choosing your options here, you will be asked for the directory containing the FASTA files. After you choose, Mesquite will take each file in that directory that ends in the requested extension (e.g., "fas"), and ask makeblastdb to make a database for each one. The file names used will be based upon the original FASTA file name. So, for example, if the FASTA file name is "fileName.fas", then the BLASTABLE database will be referred to as "fileNameDB".

Either of these two approaches will create three or four files for each FASTA file, with extensions .nin, .nsq, and .nhr.

Move the files created to whatever folder is required for NCBI's BLAST tools to function (e.g., on the macOS, this is a "db" folder in a "blast" folder in your home directory). You can then BLAST the BLASTableDatabaseName file directly from within Mesquite. Within Mesquite, after you choose to BLAST a local database by choosing Matrix>Search>Top BLAST Matches and then choosing Lcoal Blaster, you will be presented with the Local BLAST Options dialog box:

In the "Database to search", put the name of one or more databases (e.g., "BLASTableDatabaseName"); if you put multiple, then separate their nams by commas. These databases will need to be either in the default location expected by your installation of NCBI's BLAST executables (e.g., in a "db" folder within a "blast" folder within your home directory), or in another folder you specify - you choose those options using the choices that specify the database locations at the bottom of the dialog box. If you enter a "*" in the "Databases to search" field, then Mesquite will BLAST all databases that are contained within the folder containing the databases. You needn't enter any other options in this dialog box for databases you have created yourself. However, if you are going to do a blastX to a local protein database that you downloaded from GenBank, you will need to check "Use ID in Definition" box.

If your BLAST executables are not in the default folder that an installer placed them, then you will need uncheck "BLAST programs in default location" and specify the path to the folder.

Fetch & Add GenBank Sequences

This option, in Matrix>Utilities> Fetch & Add GenBank Sequences, allows one to enter a comma-delimited list of GenBank accession numbers, and Mesquite will acquire these from GenBank and import them into the current matrix.

Simulating DNA sequence evolution

DNA sequence evolution can be simulated to build statistical tests, for instance via parametric bootstrapping. See the page on simulating DNA sequences.

Statistics for DNA sequences

Calculations for categorical characters in general can be applied to DNA sequences. For example, parsimony calculations can be made for DNA sequences, as can basic descriptive statistics such as the percent of a sequence or character that is missing data or gaps. In addition, there are several modules specifically designed for DNA data, illustrated by examples in Mesquite_Folder/examples/Molecular. These calculate compositional bias:

ACGT Compositional Bias — This module supplies the compositional bias of taxa, measured over the taxon's sequence. The bias is treated as a continuous character, and thus can be used wherever characters are used, as for instance in the reconstruction of the evolution of compositional bias as shown in the image below. It can return either the proportion G+C, or separately A, C, G, and T proportions.

Character Compositional Bias — This module supplies the compositional bias for characters. It calculates the percent of taxa with particular nucleotides (GC bias, or individual frequency of A, C, G or T) for a character. The image below shows a moving window analysis of compositional bias along a sequence; the instructions for generating the chart are given here.

GC bias coloring of matrices — The cells of the Character Matrix Editor may be colored according to a moving window of GC bias along the sequence, as shown below, by selecting Display
>Color Cells>Color By Cell Value, then once shown the colors can be smoothed by a moving window analysis by selecting Display>Moving Window (for colors).

DNA Distances

Mesquite supports several distances for DNA data:

Uncorrected (P) distances
Jukes-Cantor
Kimura 2-parameter
F81 (Felsenstein, 1981)
F84 (Felsenstein, 1984)

There are several options available (in the Distance Parameters submenu) for dealing with ambiguous bases and gaps:

Count Sites with Gap in Pair: If this is chosen, then when the distance between two sequences is calculated, then sites that contain a gap in one of the sequences but not the other will be by default included in the calculation. If this option is unchecked, then any site at which either sequence is a gap will be excluded.
Estimate Ambiguity Differences: If this is chosen, then sites in which one member of the pair has an ambiguity have their contribution to the distance estimated based upon the base patterns at non-ambiguous sites.
Base Frequence on Entire Matrix: If this is chosen, then those distances that use as parameters the frequencies of bases (A, C, G, and T) will calculate those frequencies based on all sequences; if it is not chosen, then the frequencies are calculated based upon just the two sequences being compared at any one time.

Statistics for Protein Data

Site hydrophobicity — This module supplies the average amino acid hydrophobicity, averaged across taxa, for each site. It can be used in charts, for instance to see the relationship between a phylogenetic statistic for the site (character) and it average hydrophobicity. This chart, for example, shows parsimony character steps as a function of hydrophobicity:

Amino Acid hydrophobicity — The cells of the Character Matrix Editor may be colored according to a moving window of hydrophobicity along the sequence, as shown below, by selecting Display>Color Cells>Color By Cell Value, then once shown the colors can be smoothed by a moving window analysis by selecting Display>Moving Window (for colors).

Visualizing tertiary structure

Although there are not yet dedicated windows for visualizing phylogenetic statistics in the context of molecular structure, features have been added to the Scattergram chart to allow it to be adapted for this purpose. For instance, in this image cytochrome B is shown, with the amino acids colored according to a simple phylogenetic statistic: the number of parsimony steps on a phylogeny. The colors are smoothed by a moving window, and show that several coils of the molecule, a few at the left and one deep at the right, evolve more rapidly than others. This example is illustrated in the data file at Mesquite_Folder/examples/Molecular/06-cytochromeB.nex

To build such a chart, begin with a file with a matrix of protein sequences. The procedure is also described in the example files 08-cytochromeBlinked.nex and 09-cytochromeBscatter.nex.

Select New Linked Matrix from the Characters menu. When a matrix is made to be linked to a second matrix, the two matrices are constrained to have the same number of characters.
Indicate that you want the linked matrix to be a Continuous matrix, and link it to your protein matrix. Then, turn it into a three dimensional matrix (Taxa X Characters X Coordinates [x, y and z]) by using Add Item and Rename Item in the Utilities submenu of the Matrix menu of the Character Matrix Editor. The x,y,z coordinates could be added for all taxa if known, but otherwise only one taxon needs to be filled out (because we will use the average x,y,z coordinates for the amino acids).
Once the linked matrix of xyz amino acid positions is entered, select Analysis>New Scattergram for> Characters. Indicate you want the scattergram to be for Stored Characters. (If you have "Use Stored Characters/Matrices by Default" turned on in the Defaults submenu if the File menu, Mesquite won't ask you and will simply use Stored Characters.) Indicate Same value for the two axes. In the dialog box "Values for axes", choose Mean Value of Character (Linked Matrix). In response to "Use characters from which matrix? (for Character Source)" choose the protein sequence matrix as the matrix to be used. This will plot the sites (amino acids, characters) in their correct places, but as a series of round spots.
To change the appearance of the plot, select Join the Dots in the Special Effects submenu of the Scattergram menu. Then select Thick Joints, deselect Show Dots, deselect Join First to Last, and set the marker size larger (e.g., 8). This will result in a plot as shown above, but without the colors.
Next, choose Color by Third Value from the Colors menu and choose the value by which to color the amino acids. For parsimony steps, for instance, choose under Character Value with current tree.
Finally, to use a moving window to smooth the colors, select Moving Window for Colors from the Colors menu and indicate the window size (e.g., 5).

Sequence data within populations

See the page on population genetics.

Reconstructing ancestral states

Ancestral states of molecular characters can be reconstructed as described in the page on reconstructing ancestral states. Likelihood methods are not yet available for molecular characters.