
Managing Molecular Data

Molecular data (DNA or protein sequences) can be edited, manipulated, simulated and analyzed in various ways in Mesquite. Most of the features discussed elsewhere concerning editing and analysis of general categorical data also apply to molecular data; here we focus on features specifically designed for sequence data.

DNAMatrix.gif

Editing molecular data

Molecular data can be imported from files in NBRF, FASTA, GenBank/GenPept, PHYLIP, CLUSTAL, and simple table formats. They can also be exported to some of these formats.
The Character Matrix Editor can be used to edit a molecular sequence matrix. Standard ambiguity codes are allowed.
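As an illustration of what the standard (IUPAC) nucleotide ambiguity codes mean, here is a minimal Python sketch. It is not Mesquite code; the names `IUPAC` and `compatible` are hypothetical and chosen only for this example.

```python
# IUPAC nucleotide ambiguity codes: each symbol stands for a set of bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def compatible(code_a: str, code_b: str) -> bool:
    """Return True if the two codes share at least one possible base."""
    return bool(set(IUPAC[code_a.upper()]) & set(IUPAC[code_b.upper()]))


print(compatible("R", "A"))  # R means A or G, so it is compatible with A
print(compatible("R", "Y"))  # purines and pyrimidines share no base
```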

Alter/Transform Tools

The following can be applied to all or the selected portions of a molecular sequence matrix in the Character Matrix Editor. These are available under the Alter/Transform submenu of the Matrix menu (some of these may be under "Other Choices"):
Other options may appear; see the page on characters for standard choices in this submenu. You can also apply the other editing tools described for character matrices.

The view of the matrix can be adjusted in various ways in the Display menu. Cells can be colored according to the state at the site (Color Cells submenu, Character State) or according to a value such as the GC bias (Color Cells submenu, Cell Value; you can request that this coloring use a moving window). Examples of this are shown below. The Display menu contains other options, such as a Bird's eye view, which narrows the cells to show more of the sequences.
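To give a sense of what a moving-window cell value might look like, the following Python sketch computes the GC fraction in a window centred on each site. It is an assumption-laden illustration: Mesquite's own definition of GC bias and its window handling may differ, and the functions `gc_fraction` and `moving_gc` are hypothetical names.

```python
from typing import List, Optional


def gc_fraction(window: str) -> Optional[float]:
    """Fraction of G/C among unambiguous bases; None if the window has none."""
    bases = [b for b in window.upper() if b in "ACGT"]
    if not bases:
        return None
    return sum(b in "GC" for b in bases) / len(bases)


def moving_gc(seq: str, width: int = 5) -> List[Optional[float]]:
    """GC fraction for a window centred on each site (truncated at the ends)."""
    half = width // 2
    return [
        gc_fraction(seq[max(0, i - half): i + half + 1])
        for i in range(len(seq))
    ]


print(moving_gc("AAGG", width=3))
```

Gaps and ambiguity codes are simply skipped here; that choice is an assumption made for the sketch.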

Copy Sequence (at bottom of Edit menu) copies the selected cells of the matrix into the computer's clipboard as a sequence. That is, whereas the standard Copy would place into the clipboard selected pieces of the matrix in tab-delimited text format (e.g., if the sequence AATCA is selected, "A-tab-A-tab-T-tab-C-tab-A" would be copied), this modified Copy Sequence command does not include tabs (thus, "AATCA" would be copied). This style of copying is useful when interacting with programs like Sequencher (TM). For instance, if you want to find a piece of sequence in a matrix in Mesquite within a chromatogram viewer of Sequencher, do the following: select the sequence in Mesquite, choose Copy Sequence, then go to Sequencher, select Find Bases, and paste the sequence as the search string.
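The difference between the two clipboard formats can be sketched in a few lines of Python. The function names are hypothetical and only mimic the behavior described above.

```python
def copy_as_table(cells):
    """Standard Copy: selected cells joined by tabs."""
    return "\t".join(cells)


def copy_as_sequence(cells):
    """Copy Sequence: selected cells joined with no separator."""
    return "".join(cells)


selected = ["A", "A", "T", "C", "A"]
print(repr(copy_as_table(selected)))
print(repr(copy_as_sequence(selected)))
```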

Alignment

The Align package contains utilities for sequence alignment. These include manual alignment tools (for shifting blocks of sequence, for example), and automated tools (e.g., sending a region of the matrix to MAFFT or MUSCLE to align, or a pairwise alignment tool in the editor that will align one sequence to another). See the Align manual for more details.
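For readers unfamiliar with pairwise alignment, here is a minimal Python sketch of global alignment by dynamic programming (Needleman-Wunsch with a linear gap cost). This is a textbook technique shown for orientation only. It is not necessarily the algorithm used by Mesquite's pairwise aligner, and the function name and scoring values are assumptions.

```python
def pairwise_align(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align a and b; return (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


print(pairwise_align("ACGT", "AGT"))
```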

Finding Sequences

You have several options to find sequences in a matrix.
Pieces of sequences can be found using the Find Sequence and Find All Sequences submenus of the Edit menu. The current options are:

Combining molecular matrices or sequences

Often you have sequences in different matrices, and you need to fuse them into a single matrix for analyses. You may be adding new sequences from an existing gene, or you may be adding new genes for existing taxa.

Adding new taxa/sequences

If you want to take two matrices and concatenate them vertically, i.e. add new taxa to an existing set of gene sequences, then you can do it either from a menu or using Drag and Drop.
gene1beforeConcat.jpg
genePQRBeforeInclude.jpg
becomesArrow.gif
genesAfterInclude.jpg
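Conceptually, vertical concatenation appends rows (taxa) to a matrix. The Python sketch below models a matrix as a dictionary from taxon name to sequence. The function `add_taxa` is hypothetical, and both the padding with gaps and the error on duplicate names are assumptions made for this illustration, not a description of what Mesquite does.

```python
def add_taxa(existing, incoming, gap="-"):
    """Append the taxa in `incoming` to `existing` (vertical concatenation)."""
    duplicates = set(existing) & set(incoming)
    if duplicates:
        raise ValueError(f"taxa present in both matrices: {sorted(duplicates)}")
    merged = {**existing, **incoming}
    # Pad shorter sequences so every row has the same number of characters.
    width = max((len(seq) for seq in merged.values()), default=0)
    return {taxon: seq.ljust(width, gap) for taxon, seq in merged.items()}


gene1 = {"A": "ACGT", "B": "ACGA"}
new_sequences = {"P": "AC", "Q": "ACGG"}
print(add_taxa(gene1, new_sequences))
```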

Here is a video demonstrating various ways to merge matrices from different files:

 

Adding genes

There are two aspects to bringing together data from two separate genes: first, to bring the sequences from the two genes into the same file as two separate matrices, and second, to concatenate them into a single matrix. Here we will explain how to bring them into the same file; in the next section we will explain how to concatenate the matrices.
Our goal is to bring the two sequence matrices into a common file, with both belonging to a common block of taxa. That way, if you subsequently edit taxon names or add and delete taxa, these changes will affect both matrices synchronously.
If you have two files, each with sequences from a different gene but with identical blocks of taxa (i.e. the list of taxa is the same and with the exact same names), then you can open one file first, then use Include File to bring the second matrix into the same file. This will result in both matrices belonging to the same taxa block.
If you have two files, each with sequences from a different gene but with blocks of taxa that are non-identical because taxon names are different, then the only mechanism currently available in Mesquite to bring them together is Fused Matrix Export (NEXUS), described under Concatenating Genes.
If you have two files, each with sequences from a different gene but with blocks of taxa that are non-identical because of differing taxon inclusion, then you can use Merge Taxa & Matrices from File to bring the two files together. This requires that any taxa that are shared between the two files have the exact same names. This is the procedure:
  1. First, open the first file. Then select Merge Taxa & Matrices from File from the Taxa & Trees menu, and choose the second file.
  2. You will be asked "To which block of taxa do you want to fuse the taxa being read in?". In most cases you will have just one choice, the taxa block in your first file; choose it and hit OK.
  3. It may warn you that there are duplicate taxon names; click OK because you were expecting taxon names to be the same in the two files.
  4. Then it will ask you to "Select matrix with which to fuse the matrix being read." If the new matrix being read in concerns a different gene, then you will probably want to hit Cancel so as to make a new matrix for these sequences. (You would hit OK if the second file being read in contained the same gene but new taxa; see Adding New Taxa/Sequences above.)
  5. You should now have a file with one taxa block and two matrices:

File 1:
gene1beforeConcat.jpg

File 2:
gene2beforeConcat.jpg
becomesArrow.gif
Merged File:
genesAfterMerge.jpg
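The outcome of the merge (one taxa block, two matrices) can be sketched in Python as follows. The function `merge_taxa_blocks` is hypothetical, and filling absent taxa with the missing-data symbol `?` is an assumption of this sketch rather than documented Mesquite behavior.

```python
def merge_taxa_blocks(gene1, gene2, missing="?"):
    """Return (taxa, matrix1, matrix2) sharing one combined list of taxa.

    Taxa are matched by exact name, as the merge procedure requires.
    """
    # Taxa from the first file come first; new taxa from the second follow.
    taxa = list(gene1) + [t for t in gene2 if t not in gene1]
    width1 = max((len(s) for s in gene1.values()), default=0)
    width2 = max((len(s) for s in gene2.values()), default=0)
    matrix1 = {t: gene1.get(t, missing * width1) for t in taxa}
    matrix2 = {t: gene2.get(t, missing * width2) for t in taxa}
    return taxa, matrix1, matrix2


file1 = {"A": "ACGT", "B": "ACGA"}
file2 = {"B": "TTG", "C": "TAG"}
taxa, m1, m2 = merge_taxa_blocks(file1, file2)
print(taxa)
print(m1)
print(m2)
```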

Concatenating genes

If you want to take two sequence matrices and concatenate them horizontally to make long sequences including both genes, then how you do this depends on whether your taxa have the exact same names in the two matrices, or not.
gene1beforeConcat.jpg
gene2beforeConcat.jpg
becomesArrow.gif
genesAfterConcat.jpg

gene1beforeConcat2.jpg
gene2beforeInclude.jpg
becomesArrow.gif
genesAfterConcat.jpg

The Fused Matrix exporter permits you to export these into a single matrix as long as you have indicated how the sequences correspond to one another. To do this, we suggest you create a new taxa block representing the species or specimens. In this example, create a taxa block "Species" with taxa A, B and C. This will be the "master block of taxa" that will organize the export. (Alternatively, you could choose one of the genes' taxa blocks as the master block.) Set up a Taxa Association between the master block of taxa and each of the other blocks of taxa. With the first Taxa Association, between species and Gene 1, indicate that "A1" belongs with species A, "B1" belongs with species B, and "C1" belongs with species C. Set up the species-Gene 2 association similarly. The two taxa associations will look like this in the List of Taxa window for the master taxa block:
assocBeforeFusedExport.gif
Then, when you choose Fused Matrix Export, choose Species as your master taxa block. The exporter will then find all of the data corresponding to each species, either under the species taxon itself or under one of the linked taxa indicated by the Taxa Association, and compose a fused matrix. If a single master taxon has more than one corresponding taxon in one of the other matrices, the data are merged using the same rules as for Merge Taxa.
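The logic of a fused export driven by taxa associations can be sketched in Python. This is a simplified illustration under stated assumptions: each master taxon is linked to at most one taxon per gene (the Merge Taxa rules for multiple linked taxa are not modeled), absent data are filled with `?`, and the function name `fused_matrix` is hypothetical.

```python
def fused_matrix(master_taxa, genes, missing="?"):
    """Concatenate gene matrices horizontally, guided by taxa associations.

    genes: list of (matrix, association) pairs, where
      matrix maps a gene-specific taxon name to its sequence, and
      association maps a master taxon to its gene-specific taxon name.
    """
    fused = {}
    for master in master_taxa:
        parts = []
        for matrix, association in genes:
            width = max((len(s) for s in matrix.values()), default=0)
            # Use the linked name if there is one, else the master name itself.
            name = association.get(master, master)
            parts.append(matrix.get(name, missing * width))
        fused[master] = "".join(parts)
    return fused


gene1 = {"A1": "ACGT", "B1": "ACGA", "C1": "ACGG"}
gene2 = {"A2": "TTG", "B2": "TAG", "C2": "TCG"}
assoc1 = {"A": "A1", "B": "B1", "C": "C1"}
assoc2 = {"A": "A2", "B": "B2", "C": "C2"}
print(fused_matrix(["A", "B", "C"], [(gene1, assoc1), (gene2, assoc2)]))
```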

Other Tools for Managing Molecular Matrices

Managing sequences in different matrices, especially if from different genes, can be difficult. Several functions assist in this. These features are not restricted to molecular data, but we anticipate most of their use will be with sequences.

Display of Sequences

Protein-coding sequences can be colored by the amino acid into which a triplet would be translated (under the genetic code for that triplet) by choosing Display>Color Cells>Color Nucleotide by Amino Acid.

Consensus Sequences

Consensus Sequences can be displayed above the character matrix by choosing Display>Add Info Strip>Consensus Sequence Strip, as indicated below by the arrows:

consensus.gif

In the above example, two consensus sequence strips are displayed, with slightly different options.
Options are available by touching on the consensus sequence:

consensusAA.gif
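As a rough illustration of what a consensus sequence is, the Python sketch below computes a simple majority-rule consensus over aligned sequences. The options Mesquite offers for its consensus strip are not modeled here. The threshold, the use of `N` for sites without a majority, and the function name `consensus` are all assumptions.

```python
from collections import Counter


def consensus(sequences, threshold=0.5, ignore="-?"):
    """Majority-rule consensus of equal-length aligned sequences.

    A state is reported only if its frequency among non-ignored cells
    exceeds `threshold`; otherwise 'N' is reported. Columns with no data
    are reported as '-'.
    """
    result = []
    for column in zip(*sequences):
        counts = Counter(c.upper() for c in column if c not in ignore)
        if not counts:
            result.append("-")
            continue
        state, n = counts.most_common(1)[0]
        result.append(state if n / sum(counts.values()) > threshold else "N")
    return "".join(result)


print(consensus(["ACGT", "ACGA", "ATG-"]))
```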


Codon Positions

You can assign codon positions to a portion of a sequence by the following steps:

This is explained in the following video:






Genetic Codes

The genetic code for sequence data can be specified in Mesquite's List of Characters window (by choosing List of Characters, and then Columns>Current Genetic Codes, or with a data matrix frontmost, Matrix>Genetic Codes...). Genetic codes are assigned to individual characters (thus allowing one to have a mixed matrix of mitochondrial and nuclear data, for example). To assign a genetic code, select the characters, and use the popup menu of the title of the "Genetic Code" column in the List of Characters window.
The genetic code affects, among other things, the Translate DNA to Protein command, as well as the coloring of nucleotide sequences if Color Nucleotide by Amino Acid is chosen.
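To show why the choice of genetic code matters for translation, here is a small Python sketch that translates the same nucleotide sequence under the standard code and the vertebrate mitochondrial code (NCBI translation tables 1 and 2). It is not Mesquite code. The names `CODES`, `codon_table`, and `translate` are hypothetical, and for simplicity the sketch applies one code to a whole sequence rather than assigning codes character by character as Mesquite does.

```python
BASES = "TCAG"

# Amino acids for the 64 codons, with codons enumerated in TCAG order
# (first base varies slowest). '*' marks a stop codon.
CODES = {
    "standard":
        "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
    "vertebrate mitochondrial":
        "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG",
}


def codon_table(code_name):
    """Build a codon -> amino acid dictionary for the named code."""
    codons = (a + b + c for a in BASES for b in BASES for c in BASES)
    return dict(zip(codons, CODES[code_name]))


def translate(seq, code_name="standard"):
    """Translate in frame 1; unknown codons become 'X', a partial codon is dropped."""
    table = codon_table(code_name)
    seq = seq.upper()
    return "".join(
        table.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


dna = "ATGTGAATA"
print(translate(dna, "standard"))
print(translate(dna, "vertebrate mitochondrial"))
```

The codons TGA and ATA are read differently under the two codes, which is why the translations of the same sequence differ.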