
File format:

Click here for more about tree file formats than you'll ever want to know

Newick

The simplest and most common file format for phylogenetic trees is the Newick (or New Hampshire) format. Newick files are plain text files that contain nothing but parenthetical statements describing one or more tree shapes. You can verify this by opening your tree file in a text editor such as Notepad on Windows or TextEdit on macOS: if you see only lots of parentheses with text inside, it's probably a Newick file. Regrettably, the Newick "standard" is not followed very carefully by different programs. We have attempted to be permissive in what the monophylizer accepts, but there are some caveats to keep in mind:

Newick files can contain multiple tree descriptions. The monophylizer only reads the first tree.
The monophylizer operates strictly on topology. Branch lengths are therefore optional: you can use either cladograms or phylograms. If you do provide branch lengths, they can be arbitrary-length integers, floating point numbers, or use scientific notation.
The tips in your tree should have names that consist of the species name (Genus species or Genus species subspecies), a separator (such as the | symbol), and a unique identifier such as a specimen or sequence identifier. Each name should therefore look something like 'Genus species|ID2347'. You can specify which record separator symbol you're using in the 'Tree reading' tab. You can use other symbols besides | as long as they don't have special meaning in Newick. The following are therefore disallowed: ,'"[]();:_
BOLD has a tendency to break the standard, especially when you export trees with more metadata than strictly needed. If you export trees from BOLD, make sure the trees contain species names and unique identifiers, but nothing else. Metadata such as names of collection localities or taxonomic authorities are quite likely to contain characters that have special meanings in Newick trees, yet BOLD does nothing to prevent their unsafe inclusion. For example, a place name such as Côte d'Azur is problematic, firstly because the circumflex accent might not transfer correctly (this has to do with character encoding) and secondly because the apostrophe is interpreted as an opening quote for which a file reader that follows the standard will expect a closing quote. Likewise, any text that is [inside square brackets] is interpreted in a special way: file readers that follow the standard assume that such text is a comment that should normally be stripped out of the tree description (some programs insert special data in these comments, though this is "non-standard"). When BOLD inserts square brackets into your tree description this can break the tree reading, especially when there is an opening bracket but no closing one, something that we've actually seen "in the wild".
Whitespace is normally stripped out of Newick trees before they are read. If a name in a tree needs to contain spaces, such as the space between the genus name and the specific epithet, this is done either by using an underscore symbol _ instead of a space, or by putting the entire name in quotes. This allows certain programs to insert spaces in tree descriptions as indentation, to show the "depth" of the taxon in the tree description. When such trees are read by a standard-compliant tree reader, all the indentation spaces are stripped, as they should be. However, BOLD inserts spaces between genus names and specific epithets without quoting the names. Any strictly compliant tree reader is going to remove those spaces as well, which results in concatenation of the genus and species names. We provide a workaround for this (incorrect) syntax variant in the 'Tree reading' tab, but it is up to you to know the meaning of whitespace because a tree reader can't guess this automatically.
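
The whitespace and naming conventions above can be illustrated with a short Python sketch. This is not the monophylizer's own parser (which is written in Perl); the function names normalize_label and split_tip are hypothetical, and the handling of underscores when whitespace is kept is an assumption made for the sake of the example.

```python
def normalize_label(raw, keep_whitespace=False):
    """Turn one raw Newick tip label into a readable name.

    Simplified illustration: quoted labels keep their spaces, underscores
    in unquoted labels become spaces, and other whitespace in unquoted
    labels is stripped unless keep_whitespace is set (the BOLD workaround).
    """
    raw = raw.strip()
    if len(raw) >= 2 and raw[0] == "'" and raw[-1] == "'":
        return raw[1:-1]  # quoted: contents are taken literally
    if keep_whitespace:
        # Assumption: collapse runs of whitespace to single spaces
        return " ".join(raw.split()).replace("_", " ")
    # Standard behaviour: unquoted whitespace disappears entirely
    return "".join(raw.split()).replace("_", " ")


def split_tip(name, separator="|"):
    """Split 'Genus species|ID' into (species, identifier)."""
    species, sep, identifier = name.rpartition(separator)
    if not sep or not species or not identifier:
        raise ValueError(f"no {separator!r}-separated identifier in {name!r}")
    return species, identifier


# Unquoted space, read strictly: genus and epithet are concatenated
print(normalize_label("Genus species|ID2347"))
# Same label with the whitespace workaround enabled
print(split_tip(normalize_label("Genus species|ID2347", keep_whitespace=True)))
```

The first call shows why unquoted BOLD-style names are a problem for a strict reader, and the second shows what the 'Keep whitespace in Newick species names' option is meant to preserve.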

Here are examples of usable Newick files:

Newick files with quoted whitespace:
Newick files with underscored whitespace:
Technically invalid whitespace usage as exported by BOLD, allowed by the monophylizer (if 'Keep whitespace in Newick species names' is checked):
Indented Newick files with underscored whitespace. These are technically valid, but the monophylizer only accepts these if 'Keep whitespace in Newick species names' is unchecked. This is so that the common case of BOLD trees as input can be processed with default settings:
Newick cladograms. All syntax rules about quoting and whitespace apply, the only difference is the absence of branch lengths:

Nexus

The Nexus standard is a more complex, plain text-based file format for phylogenetic data. Nexus files can contain other data besides trees, such as sequence alignments or other types of character state matrices, distance matrices, and so on. Many commonly used programs read and/or write Nexus data. These include FigTree, Mesquite, TreeAnnotator, MrBayes, MacClade, PAUP*, etc. Nexus files often - but by no means always - have the extension *.nex or *.nxs. You can recognize whether your file is a Nexus file by opening it in a text editor such as Notepad on Windows or TextEdit on macOS. If the very first word of the file is #nexus, in either lower or upper case, then the file is most likely a Nexus file. Regrettably, the Nexus "standard" is not followed very carefully by different programs. We have attempted to be permissive in what the monophylizer accepts, but there are some caveats to keep in mind:

Nexus files can contain multiple tree descriptions. The monophylizer only reads the first tree.
The monophylizer operates strictly on topology. Branch lengths are therefore optional: you can use either cladograms or phylograms. If you do provide branch lengths, they can be arbitrary-length integers, floating point numbers, or use scientific notation.
Nexus files can contain data "blocks" besides trees. The monophylizer ignores all non-tree blocks.
Tree descriptions in Nexus files follow the same syntax as Newick. However, in some cases the taxon names are replaced with integers that correspond with a numbered list of taxon names, the so-called "translation table". This is done to save space and cut down on redundant names in files that contain many trees for the same taxa (for example in Bayesian analyses). The monophylizer accepts either usage: names embedded inside the tree statements or a separate translation table are both fine.
The tips in your tree should have names that consist of the species name (Genus species or Genus species subspecies), a separator (such as the | symbol), and a unique identifier such as a specimen or sequence identifier. Each name should therefore look something like 'Genus species|ID2347'. You can specify which record separator symbol you're using in the 'Tree reading' tab. You can use other symbols besides | as long as they don't have special meaning in Nexus. The following are therefore disallowed: ,'"[]();:_
BOLD has a tendency to break the standard, especially when you export trees with more metadata than strictly needed. If you export trees from BOLD, make sure the trees contain species names and unique identifiers, but nothing else. Metadata such as names of collection localities or taxonomic authorities are quite likely to contain characters that have special meanings in Nexus trees, yet BOLD does nothing to prevent their unsafe inclusion. For example, a place name such as Côte d'Azur is problematic, firstly because the circumflex accent might not transfer correctly (this has to do with character encoding) and secondly because the apostrophe is interpreted as an opening quote for which a file reader that follows the standard will expect a closing quote. Likewise, any text that is [inside square brackets] is interpreted in a special way: file readers that follow the standard assume that such text is a comment that should normally be stripped out of the tree description (some programs insert special data in these comments, though this is "non-standard"). When BOLD inserts square brackets into your tree description this can break the tree reading, especially when there is an opening bracket but no closing one, which we've seen "in the wild".
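
The translation table mentioned above can be pictured with a small Python sketch. It assumes the table has already been read into a dictionary and only shows how integer labels in a tree statement map back to names; the function name apply_translation and the example taxa are hypothetical, and this is not how the monophylizer itself reads Nexus.

```python
import re


def apply_translation(tree, table):
    """Replace integer tip labels in a Newick-style tree statement.

    A label is taken to be a run of digits that directly follows '(' or ','
    and is followed by ':', ',' or ')'. Branch lengths follow ':' and are
    therefore left untouched. Integers missing from the table are kept.
    """
    def lookup(match):
        return table.get(match.group(1), match.group(1))

    return re.sub(r"(?<=[(,])\s*(\d+)(?=\s*[:,)])", lookup, tree)


# Hypothetical translation table and tree statement
table = {"1": "Aus_bus|ID1", "2": "Aus_cus|ID2", "3": "Aus_cus|ID3"}
tree = "(1:0.1,(2:0.2,3:0.3):0.05);"
print(apply_translation(tree, table))
```

Either form, with the names embedded or with a separate table, describes the same tree.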

Here are examples of usable Nexus files:

Nexus files with underscored whitespace:
Nexus files without a translation table:
Nexus files with quoted whitespace:
Nexus cladograms:

XML-based formats

Many of the ambiguities and different dialects in plain text-based file formats are due to differences in the way whitespace is treated and the way data is broken up into "words" or tokens. Extensible markup language (XML), and the generic parsers that are available for it, go some way to removing these tokenization issues. In addition, because formal grammars can be defined and published, some other ambiguities (such as number formats) can be resolved as well. Two such grammars have been developed for phylogenetics: phyloXML and NeXML. The former was developed with use cases in comparative genomics (such as orthology assessment) in mind; the latter was developed to represent the block-like data structure of Nexus files as XML. You can recognize these files by opening them in a text editor such as Notepad on Windows or TextEdit on macOS. In both cases, as for all XML documents, they are likely to start with an XML declaration such as the following: <?xml version="1.0" encoding="UTF-8"?>, which is then followed, respectively, by a statement that begins with <phyloxml for phyloXML, or with <nex:nexml for NeXML. Assuming your files are valid (which can be verified using the respective grammars of the standards) there are only a few caveats to keep in mind when using these files in the monophylizer:

Both phyloXML and NeXML files can contain multiple tree descriptions. The monophylizer only reads the first tree.
The monophylizer operates strictly on topology. Branch lengths are therefore optional: you can use either cladograms or phylograms.
The tips in your tree should have names that consist of the species name (Genus species or Genus species subspecies), a separator (such as the | symbol), and a unique identifier such as a specimen or sequence identifier. Each name should therefore look something like 'Genus species|ID2347'. You can specify which record separator symbol you're using in the 'Tree reading' tab. You can use other symbols besides | as long as they don't have special meaning in Newick. The following are therefore disallowed: ,'"[]();:_
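
The recognition hints given for each format (parentheses for Newick, a leading #nexus for Nexus, and the opening element for phyloXML and NeXML) can be combined into a rough check. The Python sketch below is only a heuristic under those assumptions; the function name guess_format is hypothetical and the monophylizer does not necessarily detect formats this way.

```python
import re


def guess_format(text):
    """Guess the tree file format from the start of the file's contents.

    Returns 'nexus', 'newick', 'phyloxml', 'nexml', or None if unsure.
    """
    stripped = text.lstrip()
    if stripped[:6].lower() == "#nexus":
        return "nexus"
    if stripped.startswith("("):
        return "newick"
    # First element tag, skipping declarations (<?...), comments and
    # doctypes (<!...); an optional namespace prefix such as 'nex:' is ignored
    match = re.search(r"<(?![?!/])(?:[\w.-]+:)?([\w.-]+)", stripped)
    if match and match.group(1).lower() in ("phyloxml", "nexml"):
        return match.group(1).lower()
    return None


print(guess_format('<?xml version="1.0" encoding="UTF-8"?>\n<nex:nexml version="0.9">'))
```

A check like this only looks at the first few characters, so it says nothing about whether the rest of the file is valid.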

Here are examples of usable XML-based files:

The following options configure how tip names are processed. The monophylizer operates on trees where every tip must have both a species name (a binomial or trinomial) and a unique identifier, which are demarcated using a record separator. When you export trees from BOLD this separator is by default the '|' symbol, so you can leave this setting as is when your trees come from BOLD. If you generate your trees in some other way make sure you update the settings here accordingly.

Record separator:

Keep whitespace in Newick species names

Split subspecies

Write output as TSV:

Additional metadata:


FAQ

How does the program determine what's para- or polyphyletic?

The steps of the algorithm are described in the manuscript (see: 'How to cite') of which this service is a part. In short:

Label all interior nodes with a pre- and a post-order index.
Extract all distinct taxa from the tree.
For each taxon:
1. Collect all tips that belong to it
2. Find the MRCA for the collected tips
3. Collect all descendants of the MRCA. If this set is identical to the set of step 1, then the taxon is monophyletic and the analysis moves on to the next taxon.
4. Collect all nodes that subtend tips from the focal taxon as well as at least one other taxon and sort these by their post-order index.
5. Group the collected, sorted nodes into distinct root-to-tip paths. Internal nodes that are nested in each other are identified (and collected in the same group) by checking that the pre-order index of the focal node is larger, and the post-order index of the focal node is smaller than that of the next node in the sorted list. If there is more than one distinct root-to-tip path (i.e., group), the taxon is considered polyphyletic, otherwise paraphyletic.
6. For each first (i.e. most recent) node in each group, collect all subtended species. The union of these sets across groups forms the set of entangled species.
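
The steps above can be sketched in Python as follows. This is a simplified reading of the description, not the Perl implementation from the manuscript: the tree is assumed to be given as nested tuples with tip labels as strings, the function names are hypothetical, and excluding the focal taxon from its own set of entangled species is an assumption.

```python
from collections import defaultdict


def species_of(tip, separator="|"):
    """Species part of a tip label such as 'Genus species|ID2347'."""
    return tip.rpartition(separator)[0]


def index_tree(tree):
    """Return one record per node, listed in post-order.

    Each record holds the node's pre- and post-order index and the set
    of tip labels it subtends.
    """
    nodes = []
    counter = {"pre": 0, "post": 0}

    def visit(node):
        record = {"pre": counter["pre"], "tips": set()}
        counter["pre"] += 1
        if isinstance(node, str):          # a tip
            record["tips"].add(node)
        else:                              # an interior node
            for child in node:
                record["tips"] |= visit(child)["tips"]
        record["post"] = counter["post"]
        counter["post"] += 1
        nodes.append(record)
        return record

    visit(tree)
    return nodes


def assess(tree, separator="|"):
    """Map each species to (assessment, set of entangled species)."""
    nodes = index_tree(tree)
    by_species = defaultdict(set)
    for tip in nodes[-1]["tips"]:          # the root is last in post-order
        by_species[species_of(tip, separator)].add(tip)

    results = {}
    for species, tips in by_species.items():
        # Steps 1-3: the MRCA is the first node in post-order holding all tips
        mrca = next(n for n in nodes if tips <= n["tips"])
        if mrca["tips"] == tips:
            results[species] = ("monophyletic", set())
            continue
        # Step 4: nodes subtending the focal species plus at least one other
        mixed = [n for n in nodes if n["tips"] & tips and n["tips"] - tips]
        mixed.sort(key=lambda n: n["post"])
        # Step 5: consecutive nodes belong together if one is nested in the next
        groups = [[mixed[0]]]
        for prev, nxt in zip(mixed, mixed[1:]):
            if prev["pre"] > nxt["pre"] and prev["post"] < nxt["post"]:
                groups[-1].append(nxt)
            else:
                groups.append([nxt])
        assessment = "polyphyletic" if len(groups) > 1 else "paraphyletic"
        # Step 6: species under the most recent node of each group
        entangled = set()
        for group in groups:
            entangled |= {species_of(t, separator) for t in group[0]["tips"]}
        entangled.discard(species)         # assumption: report other taxa only
        results[species] = (assessment, entangled)
    return results


# Hypothetical four-tip tree in which 'Aus bus' occurs in two separate clades
tree = (("Aus bus|1", "Aus cus|3"), ("Aus bus|2", "Aus dus|5"))
for name, (assessment, entangled) in sorted(assess(tree).items()):
    print(name, assessment, sorted(entangled))
```

In this example the two 'Aus bus' tips sit in separate mixed clades, so the sketch finds two groups in step 5 and reports the taxon as polyphyletic.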

How do I interpret the results?

The output of the algorithm is presented in tabular form. Each row represents one taxon from the tree. The 'assessment' column shows whether that taxon is mono-, para- or polyphyletic. The 'tanglees' column shows with which other taxa, if any, the focal taxon is entangled.

How do I export the results?

You can copy and paste the results from the browser window into a spreadsheet program. Easier still would be to check the 'TSV' box in the 'output' tab. The results will then be written as tab-separated data that can be imported directly into spreadsheet programs, R, etc.
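
As an illustration of working with the tab-separated output outside a spreadsheet, the Python sketch below tallies the values in one column. The header names used here ('species', 'assessment', 'tanglees') and the rows are made up for the example; check the header line of your own output file before relying on them.

```python
import csv
import io
from collections import Counter


def summarize(tsv_text, column="assessment"):
    """Count how many rows carry each value in the given column."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return Counter(row[column] for row in reader)


# Made-up rows, purely for illustration
example = (
    "species\tassessment\ttanglees\n"
    "Aus bus\tmonophyletic\t\n"
    "Aus cus\tparaphyletic\tAus dus\n"
    "Aus dus\tmonophyletic\t\n"
)
print(summarize(example))
```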

How is this algorithm implemented?

The algorithm is written in the Perl programming language and uses the Bio::Phylo libraries to read the input data.

Can I use this "off-line"?

Yes. You can run the script locally. Consult the embedded Perl-Doc documentation in the script or run perl monophylizer.pl --help for more info.

Can I use this through an API?

Yes. The data and parameters are uploaded through an HTTP POST request with multipart/form-data encoding and the results are returned in the response, so this service functions as a RESTful web service. You will need to provide the following parameters:

infile, which is a file upload
format, input file format, one of: newick, nexus, nexml, phyloxml
separator, the character that separates the taxon name from its identifier. Default is |
trinomials, which is an optional argument that, when given any value other than '0', indicates that subspecific epithets need to be parsed
astsv, which is an optional argument that, when given any value other than '0', indicates that the output must be written as tab-separated data
metadata, which is an optional file upload with additional tab-separated data to join with the taxa in the output
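
A request with these parameters could be assembled as in the Python sketch below, which uses only the standard library. The service URL is not given here, so the one in the usage comment is a placeholder; the helper names encode_multipart and build_request and the upload filename tree.txt are likewise hypothetical.

```python
import urllib.request
import uuid


def encode_multipart(fields, files):
    """Encode form fields and file uploads as multipart/form-data.

    fields: dict of name -> string value
    files:  dict of name -> (filename, bytes content)
    Returns (body, content_type).
    """
    boundary = uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines += [
            f"--{boundary}".encode(),
            f'Content-Disposition: form-data; name="{name}"'.encode(),
            b"",
            str(value).encode("utf-8"),
        ]
    for name, (filename, content) in files.items():
        lines += [
            f"--{boundary}".encode(),
            f'Content-Disposition: form-data; name="{name}"; '
            f'filename="{filename}"'.encode(),
            b"Content-Type: application/octet-stream",
            b"",
            content,
        ]
    lines += [f"--{boundary}--".encode(), b""]
    return b"\r\n".join(lines), f"multipart/form-data; boundary={boundary}"


def build_request(url, tree_bytes, fmt="newick", separator="|", astsv=True):
    """Build (but do not send) a POST request using the parameters listed above."""
    fields = {"format": fmt, "separator": separator}
    if astsv:
        fields["astsv"] = "1"
    body, content_type = encode_multipart(
        fields, {"infile": ("tree.txt", tree_bytes)}
    )
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )


# Usage (placeholder URL, replace with the actual service address):
# request = build_request("http://example.org/monophylizer", b"(A_a|1,B_b|2);")
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

The optional trinomials and metadata parameters are left out of the sketch; they would be added to the fields and files dictionaries in the same way.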

How is this code licensed?

The analysis script proper is licensed under the Apache License, which is very permissive for most forms of re-use.

Who wrote this code?

The analysis code was written by Rutger Vos.

Where is this hosted?

The web service is hosted at Naturalis Biodiversity Center; the source code is in a GitHub repository.

How to cite?

This service is part of a publication. If you use this service in your research, please cite the publication:

Mutanen, M. et al. 2016. Species-Level Poly- and Paraphyly in DNA barcode Gene Trees: Strong Operational Bias in European Lepidoptera. Systematic Biology

How to format taxonomic names?

Taxonomic names are read without any intelligence: the expectation is that names consist of the genus, the specific epithet, and, optionally, the subspecific epithet, followed by an identifier. Anything else, such as 'sp.', 'cv.', and so on, is going to cause problems and will have to be avoided for this analysis.
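
To screen tip names before uploading, a check along the following lines may help. It is a Python sketch with a deliberately strict pattern (capitalized genus, lower-case epithets, optional hyphens) that is an assumption rather than the rule the monophylizer applies; the function name looks_usable is hypothetical.

```python
import re

# Genus, specific epithet, and an optional subspecific epithet
NAME_PATTERN = re.compile(r"^[A-Z][a-z]+ [a-z-]+(?: [a-z-]+)?$")


def looks_usable(tip, separator="|"):
    """True if the tip reads as 'Genus species[ subspecies]<separator>ID'."""
    name, sep, identifier = tip.rpartition(separator)
    return bool(sep and identifier and NAME_PATTERN.match(name))


for tip in ["Aus bus|ID1", "Aus bus cus|ID2", "Aus sp.|ID3", "Aus bus"]:
    print(tip, looks_usable(tip))
```

Names flagged by such a check, for instance those containing 'sp.', are the ones to clean up or remove before running the analysis.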

I am getting error messages?

This is almost certainly because your input file is somehow syntactically invalid. On the 'upload' tab there is a link to extensive documentation, with example files, to explain what input file formats are accepted. If you can't figure out what's wrong with your file, try opening it in Mesquite and exporting it as NEXUS: its interpretation of the NEXUS standard is readily understood by this service, and Mesquite might give you useful feedback about your file.