
Analyzing Molecular Data

Molecular data (DNA or protein sequences) can be simulated and analyzed in various ways in Mesquite. Most of the features discussed elsewhere concerning editing and analysis of general categorical data also apply to molecular data; here we focus on features specifically designed for sequence data.



BLASTing sequences

The following features allow one to BLAST sequences against GenBank or a local BLASTable database.

BLAST in Web Browser

Select a sequence or portion thereof in the data matrix. Choose Matrix>Search>BLAST in Web Browser, and Mesquite will send a BLAST request to GenBank to search for matching sequences. Your default web browser should open and take you to NCBI's BLAST page.

Top BLAST Matches

This tool searches a database using BLAST, returns information about the top hits, and optionally imports them into the current matrix. To use it, select a sequence or portion thereof in the data matrix and choose Matrix>Search>Top BLAST Matches. You will be given the choice between BLASTing the NCBI GenBank server and BLASTing a local database that you have created on your computer.

You will first need to install the BLAST+ executables on your local computer. You can download the latest version from here: https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/. Note which folder they are installed in, as that will be relevant when you set up some of the tools.

BLASTing GenBank

If you choose to BLAST GenBank directly, you will be asked to choose among various options, including:

BLASTing a database on your computer

If you wish instead to BLAST a local database, perhaps containing sequences you generated, then you will first need to create a BLASTable database. You can do this in one of two ways:

Invoke the makeblastdb command yourself: In your computer's command-line terminal shell, change to the directory containing the FASTA file. Let's imagine that your FASTA file is called "FASTAfileName", and that you wish to create a BLASTable nucleotide database called "BLASTableDatabaseName". Then use the following command in your computer's terminal shell:

makeblastdb -in FASTAfileName -out BLASTableDatabaseName -dbtype nucl

Note: if you are using version 2.10.0 or later of the NCBI BLAST+ tools, you will also need to add "-blastdb_version 4" to the command, as follows:

makeblastdb -in FASTAfileName -out BLASTableDatabaseName -dbtype nucl -blastdb_version 4
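For scripted workflows, the same call can be assembled programmatically. The following is a minimal Python sketch and not part of Mesquite: the function name `makeblastdb_command` and its `legacy_v4` parameter are hypothetical, and the code only builds the argument list shown above. Actually running it assumes that makeblastdb is on your PATH.

```python
import subprocess


def makeblastdb_command(fasta_name, db_name, legacy_v4=False):
    """Build the makeblastdb argument list for a nucleotide database."""
    cmd = ["makeblastdb", "-in", fasta_name, "-out", db_name, "-dbtype", "nucl"]
    if legacy_v4:
        # Per the note above, add this with BLAST+ 2.10.0 or later.
        cmd += ["-blastdb_version", "4"]
    return cmd


if __name__ == "__main__":
    cmd = makeblastdb_command("FASTAfileName", "BLASTableDatabaseName", legacy_v4=True)
    print(" ".join(cmd))
    # Uncomment to run makeblastdb in the current directory:
    # subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.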

Ask Mesquite to create the databases: If you choose this option, then Mesquite will take a directory of FASTA files, and for each of them call makeblastdb to create a BLASTable database. To do this, touch on Mesquite's log window. You will then see a Utilities menu. In there, choose "Make BLASTable files from FASTA...". You will get a dialog box like this:

After choosing your options here, you will be asked for the directory containing the FASTA files. After you choose, Mesquite will take each file in that directory that ends in the requested extension (e.g., "fas"), and ask makeblastdb to make a database for each one. The database names are based upon the original FASTA file names. So, for example, if the FASTA file name is "fileName.fas", then the BLASTable database will be referred to as "fileNameDB".
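The naming rule just described can be sketched in Python as follows. This is an illustration only, not Mesquite's own code; the function name `planned_databases` is hypothetical, and the sketch lists what would be created without calling makeblastdb.

```python
from pathlib import Path


def planned_databases(fasta_dir, extension="fas"):
    """Return (FASTA path, database name) pairs for files with the given extension."""
    pairs = []
    for path in sorted(Path(fasta_dir).iterdir()):
        if path.is_file() and path.suffix == "." + extension:
            # "fileName.fas" -> "fileNameDB", following the rule described above.
            pairs.append((path, path.stem + "DB"))
    return pairs
```

Each pair could then be handed to a makeblastdb call with the FASTA path as `-in` and the database name as `-out`.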

Either of these two approaches will create three or four files for each FASTA file, including files with the extensions .nin, .nsq, and .nhr.

Move the files created to whatever folder is required for NCBI's BLAST tools to function (e.g., on macOS, this is a "db" folder in a "blast" folder in your home directory). You can then BLAST the BLASTableDatabaseName database directly from within Mesquite. After you choose to BLAST a local database by choosing Matrix>Search>Top BLAST Matches and then choosing Local Blaster, you will be presented with the Local BLAST Options dialog box:

In the "Database to search" field, put the name of one or more databases (e.g., "BLASTableDatabaseName"); if you enter more than one, separate their names with commas. These databases will need to be either in the default location expected by your installation of NCBI's BLAST executables (e.g., in a "db" folder within a "blast" folder within your home directory), or in another folder you specify; you choose between these using the options that specify the database locations at the bottom of the dialog box. If you enter a "*" in the "Databases to search" field, then Mesquite will BLAST all databases that are contained within the folder containing the databases. You needn't enter any other options in this dialog box for databases you have created yourself. However, if you are going to do a blastX to a local protein database that you downloaded from GenBank, you will need to check the "Use ID in Definition" box.
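To make the comma-separated and "*" conventions concrete, here is a small Python sketch of how such a field could be interpreted. It is not Mesquite's implementation: the function name `resolve_databases` is hypothetical, and recognising a nucleotide database by its .nin file is an assumption made for illustration.

```python
from pathlib import Path


def resolve_databases(field_text, db_folder):
    """Turn the text of the database field into a list of database names."""
    entry = field_text.strip()
    if entry == "*":
        # Assumption: each nucleotide database has one .nin file named after it.
        return sorted(p.stem for p in Path(db_folder).glob("*.nin"))
    # Otherwise split on commas and drop surrounding whitespace and empty items.
    return [name.strip() for name in entry.split(",") if name.strip()]
```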

If your BLAST executables are not in the default folder in which an installer placed them, then you will need to uncheck "BLAST programs in default location" and specify the path to the folder.

Fetch & Add GenBank Sequences

This option, in Matrix>Utilities>Fetch & Add GenBank Sequences, allows one to enter a comma-delimited list of GenBank accession numbers; Mesquite will then acquire these sequences from GenBank and import them into the current matrix.
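For readers who want to retrieve the same records outside Mesquite, the sketch below turns a comma-delimited accession list into a request URL for NCBI's E-utilities `efetch` service. How Mesquite itself contacts GenBank is not described here, so the use of `efetch`, the `nuccore` database, and the function name `efetch_fasta_url` are all assumptions for illustration. The sketch only builds the URL and makes no network request.

```python
from urllib.parse import urlencode, urlparse, parse_qs

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accession_text):
    """Build an efetch URL returning FASTA text for a comma-delimited accession list."""
    accessions = [a.strip() for a in accession_text.split(",") if a.strip()]
    if not accessions:
        raise ValueError("no accession numbers given")
    params = {
        "db": "nuccore",          # assumed: nucleotide records
        "id": ",".join(accessions),
        "rettype": "fasta",
        "retmode": "text",
    }
    return f"{EFETCH_URL}?{urlencode(params)}"
```

The resulting URL can be opened with any HTTP client; the returned FASTA text could then be imported into a matrix.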

Simulating DNA sequence evolution

DNA sequence evolution can be simulated to build statistical tests, for instance via parametric bootstrapping. See the page on simulating DNA sequences.

Statistics for DNA sequences

Calculations for categorical characters in general can be applied to DNA sequences. For example, parsimony calculations can be made for DNA sequences, as can basic descriptive statistics such as the percent of a sequence or character that is missing data or gaps. In addition, there are several modules specifically designed for DNA data, illustrated by examples in Mesquite_Folder/examples/Molecular. These calculate compositional bias:

gcBiasTaxa.jpg


gcSequence.jpg


gcBiasColorMatrix.jpg
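As a rough illustration of a compositional statistic, the following Python sketch computes the proportion of G and C among the unambiguous bases of a sequence. Mesquite's exact definition of compositional bias is not given above, so treat this as one simple possibility; the function name `gc_fraction` is hypothetical.

```python
def gc_fraction(sequence):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of a sequence."""
    seq = sequence.upper()
    counts = {base: seq.count(base) for base in "ACGT"}
    total = sum(counts.values())
    if total == 0:
        # Only gaps, missing data, or ambiguity codes: no defined value.
        return None
    return (counts["G"] + counts["C"]) / total
```

Applying the function to each taxon's sequence gives a per-taxon value; applying it to each column of the matrix gives a per-character value.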


DNA Distances

Mesquite supports several distances for DNA data:

There are several options available (in the Distance Parameters submenu) for dealing with ambiguous bases and gaps:
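To show why the treatment of ambiguous bases and gaps matters, here is a Python sketch of an uncorrected distance (the proportion of differing sites) that simply skips any site where either sequence has a gap or a non-ACGT symbol. Both the choice of distance and the skipping rule are assumptions for illustration, not a statement of which distances or options Mesquite provides; the function name `p_distance` is hypothetical.

```python
def p_distance(seq_a, seq_b):
    """Proportion of differing sites, ignoring sites with gaps or ambiguous bases."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # one possible treatment: drop the site from the comparison
        compared += 1
        if a != b:
            differing += 1
    # With no comparable sites the distance is undefined.
    return differing / compared if compared else None
```

Other treatments, such as counting a gap as a difference, would change both the numerator and the denominator, which is why such options affect the resulting distances.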



Statistics for Protein Data



stepsVsHydro.jpg




aahydroColorMatrix.jpg

Visualizing tertiary structure

Although there are not yet dedicated windows for visualizing phylogenetic statistics in the context of molecular structure, features have been added to the Scattergram chart to allow it to be adapted for this purpose. For instance, in this image cytochrome B is shown, with the amino acids colored according to a simple phylogenetic statistic: the number of parsimony steps on a phylogeny. The colors are smoothed by a moving window, and show that several coils of the molecule, a few at the left and one deep at the right, evolve more rapidly than others. This example is illustrated in the data file at Mesquite_Folder/examples/Molecular/06-cytochromeB.nex
cytStructure.jpg
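The moving-window smoothing mentioned above can be sketched in Python as follows. The window size and the handling of the sequence ends are assumptions, since the settings used for the image are not stated; the function name `moving_average` is hypothetical.

```python
def moving_average(values, window=5):
    """Smooth per-site values with a centred moving window (shorter at the ends)."""
    if window < 1:
        raise ValueError("window must be at least 1")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        # Clip the window at the start and end of the sequence.
        start = max(0, i - half)
        end = min(len(values), i + half + 1)
        segment = values[start:end]
        smoothed.append(sum(segment) / len(segment))
    return smoothed
```

Given a list of parsimony step counts, one per amino acid site, the smoothed values could be mapped to colors so that rapidly evolving regions stand out.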

To build such a chart, begin with a file with a matrix of protein sequences. The procedure is also described in the example files 08-cytochromeBlinked.nex and 09-cytochromeBscatter.nex.


Sequence data within populations

See the page on population genetics.

Reconstructing ancestral states

Ancestral states of molecular characters can be reconstructed as described in the page on reconstructing ancestral states. Likelihood methods are not yet available for molecular characters.