Ziheng Yang (z.yang@ucl.ac.uk)
PAML is a program package for phylogenetic analyses of DNA or protein sequences using maximum likelihood. It is maintained and distributed for academic use free of charge by Ziheng Yang. ANSI C source code as well as PowerMac and Windows 95/NT executables are provided.
This document is about downloading and compiling PAML and getting started. See the manual (pamlDOC.pdf) for more information about running programs in the package.
Possible uses of the programs are
A summary of the types of analyses performed by different programs in the package is given below.
baseml: ML analysis of nucleotide sequences: estimation of tree topology, branch lengths, and substitution parameters under a variety of nucleotide substitution models (JC69, K80, F81, F84, HKY85, TN93, REV); constant or gamma rates for sites; molecular clock (rate constancy among lineages) or no clock, among-gene and within-gene variation of substitution rates; models for combined analyses of multiple sequence data sets; calculation of substitution rates at sites; reconstruction of ancestral nucleotides.
basemlg: ML analysis of nucleotide sequences under the model of gamma rates among sites. The (continuous) gamma model is used with one of the following substitution models: JC69, K80, F81, F84, HKY85, TN93, and REV.
codonml (codeml with seqtype = 1): ML analysis of protein-coding DNA sequences using codon substitution models (e.g., Goldman and Yang 1994); calculation of the codon-usage table; estimation of synonymous and nonsynonymous substitution rates; likelihood ratio test of positive selection or relaxed selective constraints along lineages based on the dN/dS rate ratios; identification of amino acid sites or evolutionary lineages potentially under positive selection; reconstruction of ancestral codon sequences.
aaml (codeml with seqtype = 2): ML analysis of amino acid sequences under a number of amino acid substitution models (Poisson, Proportional, empirical models such as those of Dayhoff et al., Jones et al., mtREV24, and mtmam, and REV); constant or gamma-distributed rates among sites; molecular clock (rate constancy among lineages) or no clock, among-gene and within-gene variation of substitution rates; models for combined analyses of multiple gene data; calculation of substitution rates at sites; reconstruction of ancestral amino acid sequences.
pamp: Parsimony-based analyses for a given tree topology, estimation of the substitution pattern by the method of Yang and Kumar (1996); estimation of the gamma parameter for variable rates among sites by the method of moments, the method of Sullivan et al. (1995), and the method of Yang and Kumar (1996); reconstruction of ancestral character states using the algorithm of Hartigan (1973) and an unpublished "improved parsimony" method.
mcmctree: Bayesian estimation of phylogenies using DNA sequence data (Rannala and Yang, 1996; Yang and Rannala, 1997). Markov chain Monte Carlo calculation of posterior probabilities of trees. The algorithm is too slow to be usable.
evolver: This program used to be named listtree and does miscellaneous things, such as listing all rooted and unrooted trees for a given number of species, generating random trees with branch lengths from a birth-death process with species sampling, and calculating tree bipartition distances. It now also simulates nucleotide, codon, or amino acid sequence data sets. Parameters for the simulation are specified in the files MCbase.dat, MCcodon.dat, and MCaa.dat. You can run the program to see the main menu, and then consult one of those files to see the details. This program can easily fill your hard disk.
What does PAML do?
PAML is not good for tree making. There are a few options for heuristic tree search, but they do not work well except for small data sets of only a few species. If you hope to use PAML to compare trees from relatively large data sets, one possibility is to get a collection of candidate trees and then compare them using more sophisticated models implemented in PAML. You can get candidate trees by using other programs/methods implemented in PAUP*, PHYLIP, MOLPHY etc.
PAML may be useful if you are interested in the process of sequence evolution. The two main programs, baseml and codeml, implement a number of sophisticated models, which you can use to construct likelihood ratio tests of evolutionary hypotheses. Right now, the following options/models do not seem to be available in other packages.
Windows 95/98/NT/2000/XP. Download the win32 archive (paml*.*.win32.exe). The self-extracting archive has all the files for the package (source code, example data files, control files, documentation, and executables). Run it and it will unpack into a directory with all the files. Programs in the package (the .EXE files) are simple Win32 console applications, and do not support mice or menus. Open a "command prompt" box and type the name of the program rather than double-clicking the program name from Windows Explorer.
UNIX, Linux, Mac OS X or other systems. Download the UNIX archive (paml*.*.tar.gz or paml*.*.tar.Z) and save it on the disk. Then unpack it into a folder. Decompress it with either
gzip -d paml*.*.tar.gz
or
uncompress paml*.*.tar.Z
Then
tar xf paml*.*.tar
Change directory (cd) to src/. Type make to compile. You might have to open and edit the file Makefile (or copy from Makefile.UNIX) before you compile. For example, you can change cc to gcc and -fast to -O3 or -O4. Read readme.txt in the same folder for compiling instructions.
Mac OS X. You should open a command terminal (Applications-Utilities-Terminal) and then compile and run the programs from the terminal. You cd to the paml folder and then look at the readme.txt or Makefile or Makefile.UNIX files. To compile the programs you either type the command make or copy the commands in the readme file. You will need the Mac Developer Toolkit installed on your Mac, otherwise you will get a "Command not found" error with either cc or make. You can go to the Apple web site to download and install the Toolkit (http://developer.apple.com/tools/index.html). There are some more notes about running programs on Mac OS X or UNIX at the FAQ page.
For linux administrators, a spec file paml.spec for the linux installer rpm has been kindly prepared by Hunter Matthews <thm@duke.edu>.
PowerMacs (PPC or G3 prior to OS X). Download the compressed self-extracting archive for the PowerMac. There is usually a time lag before the Mac version is made available. When you run the PowerMac programs, a command-line window will pop up. You can then type in the name of the control file. You can also hit Enter to use the default control file names (baseml.ctl for baseml and basemlg, codeml.ctl for codeml). The sequence data files and tree structure files do not have fixed names and can be specified in the control files. Thanks to Andrew Rambaut for preparing the archive.
The programs in the distribution are essentially the copies I work on every day, as I make only minor changes before release to the public. So the programs are not always well tested. Models that I have never used myself, even if they look sensible or possible from options in the control file, should be taken with great caution. I have included example data sets that were used in our papers for the purpose of error checking. You are encouraged to duplicate our analysis first to check that the program works and also to get familiar with the format of the data file and the interpretation of results.
Programs baseml and codeml estimate parameters and calculate the log-likelihood values, but do not calculate the likelihood ratio statistics. You need to do the subtraction yourself. The theory is as follows. If a more general model involves p parameters and has log-likelihood value l1, and a simpler model (which is a special case of the general model) has q parameters with log-likelihood value l0, then 2(l1 - l0) can be compared with a chi-square distribution with d.f. = p - q. Suppose we want to test whether the transition/transversion rate ratio kappa = 1. We run the JC69 model and get l0, and run K80 to get l1. Then we compare 2(l1 - l0) with the chi-square distribution with one degree of freedom.
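The subtraction and the chi-square comparison described above can be sketched in a few lines of Python. This is only an illustration and is not part of PAML: the function names chi2_sf and lrt are made up here, and the log-likelihood values in the usage lines are hypothetical numbers, not output from any real analysis. The tail probability is built from the closed forms for 1 and 2 degrees of freedom and stepped up two degrees at a time, so only the standard library is needed.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with df degrees of freedom."""
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Closed-form starting points: df = 1 (odd df) or df = 2 (even df).
    if df % 2 == 1:
        k = 1
        q = math.erfc(math.sqrt(half))
    else:
        k = 2
        q = math.exp(-half)
    # Step up two degrees of freedom at a time:
    # Q(k + 2) = Q(k) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1)
    while k < df:
        q += math.exp((k / 2.0) * math.log(half) - half - math.lgamma(k / 2.0 + 1.0))
        k += 2
    return q


def lrt(l0, l1, df):
    """Return (2*(l1 - l0), p-value) for a nested-model likelihood ratio test.

    l0: log likelihood under the simpler model (q parameters)
    l1: log likelihood under the more general model (p parameters)
    df: p - q
    """
    stat = 2.0 * (l1 - l0)
    return stat, chi2_sf(stat, df)


# Hypothetical log-likelihood values, for illustration only.
l0 = -1000.0  # e.g. read from the output of a JC69 run
l1 = -998.0   # e.g. read from the output of a K80 run
stat, p = lrt(l0, l1, df=1)
print(f"2(l1 - l0) = {stat:.3f}, d.f. = 1, p = {p:.4f}")
```

With these made-up values the statistic is 4.0 and the p-value is about 0.0455, so kappa = 1 would be rejected at the 5% level. The log-likelihood values have to be copied by hand from the result files of the two runs; the sketch does not read PAML output.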
Running PAML. Most programs in the PAML package have control files that specify the names of the sequence data file, the tree structure file, and models and options for the analysis. The default control files are baseml.ctl for baseml and basemlg, codeml.ctl for codeml, pamp.ctl for pamp, and mcmctree.ctl for mcmctree. The program evolver does not have a control file, and uses a simple user interface. All you do is type evolver and then choose the options. For other programs, you should prepare a sequence data file and a tree structure file, and modify the appropriate control files before running the programs. The formats of those files are detailed in the documentation in the package.
You need to prepare a sequence data file (e.g., brown.nuc) and modify the options in the appropriate control file. If you have chosen runmode = 0 or 1 in the control file, which means that the tree topologies are specified, you also need to prepare a tree structure file (e.g., trees.4s). On UNIX or Windows systems, you run the programs from a command prompt by typing the program name (for example, baseml), optionally followed by the name of the control file if you do not want the default one.
On the Mac, you simply click on the program name or icon. You can do this on a Windows machine too, but it is better if you open a command box and run the program from there.
Update history and bug fixes are collected here.
Due to the number of emails I receive about the programs, I will not reply to some emails if I believe the answers are readily available from the documentation (pamlDOC.pdf), the web site, the FAQ page and/or the examples included in the package. Questions of this nature include "how do I specify such and such options to do such and such analysis", "can you tell me how maximum likelihood works", and "my supervisor wants results by this afternoon. Can you do the analysis for me?"
I will try to reply to reports of bugs and problems, or to questions whose answers are not easily available from those sources. When you have a problem, try to identify where the problem lies. Most of the programs in PAML read your control file and sequence file before doing any calculation. You can ask questions such as "Does the program read the sequences correctly?" and "Does it read my tree file correctly?". Problems at this stage can mostly be solved by editing the files.
When reporting problems, please mention the version number of the package you use (for example, 3.0c for UNIX) and include a copy of the control file (baseml.ctl or codeml.ctl). Please let me know exactly what happened and when, and include screen output generated by the program, especially the last few lines on the screen. I would also like to know the number of sequences and the sequence length in case the problem has to do with the size of the data set.
Try to provide enough information for me to understand and reproduce the problem. The most frustrating email I get says "PAML does not work on my data. Can you help?", without any explanation about what the problem is.