Package SubsMat
source code
Substitution matrices, log odds matrices, and operations on them.
General:
-------
This module provides a class and a few routines for generating
substitution matrices, similar ot BLOSUM or PAM matrices, but based on
user-provided data.
The class used for these matrices is SeqMat
Matrices are implemented as a dictionary. Each index contains a 2-tuple,
which are the two residue/nucleotide types replaced. The value differs
according to the matrix's purpose: e.g in a log-odds frequency matrix, the
value would be log(Pij/(Pi*Pj)) where:
Pij: frequency of substitution of letter (residue/nucletide) i by j
Pi, Pj: expected frequencies of i and j, respectively.
Usage:
-----
The following section is layed out in the order by which most people wish
to generate a log-odds matrix. Of course, interim matrices can be
generated and investigated. Most people just want a log-odds matrix,
that's all.
Generating an Accepted Replacement Matrix:
-----------------------------------------
Initially, you should generate an accepted replacement matrix (ARM)
from your data. The values in ARM are the _counted_ number of
replacements according to your data. The data could be a set of pairs
or multiple alignments. So for instance if Alanine was replaced by
Cysteine 10 times, and Cysteine by Alanine 12 times, the corresponding
ARM entries would be:
['A','C']: 10,
['C','A'] 12
As order doesn't matter, user can already provide only one entry:
['A','C']: 22
A SeqMat instance may be initialized with either a full (first
method of counting: 10, 12) or half (the latter method, 22) matrices. A
Full protein alphabet matrix would be of the size 20x20 = 400. A Half
matrix of that alphabet would be 20x20/2 + 20/2 = 210. That is because
same-letter entries don't change. (The matrix diagonal). Given an
alphabet size of N:
Full matrix size:N*N
Half matrix size: N(N+1)/2
If you provide a full matrix, the constructore will create a half-matrix
automatically.
If you provide a half-matrix, make sure of a (low, high) sorted order in
the keys: there should only be
a ('A','C') not a ('C','A').
Internal functions:
Generating the observed frequency matrix (OFM):
----------------------------------------------
Use: OFM = _build_obs_freq_mat(ARM)
The OFM is generated from the ARM, only instead of replacement counts, it
contains replacement frequencies.
Generating an expected frequency matrix (EFM):
---------------------------------------------
Use: EFM = _build_exp_freq_mat(OFM,exp_freq_table)
exp_freq_table: should be a freqTableC instantiation. See freqTable.py for
detailed information. Briefly, the expected frequency table has the
frequencies of appearance for each member of the alphabet
Generating a substitution frequency matrix (SFM):
------------------------------------------------
Use: SFM = _build_subs_mat(OFM,EFM)
Accepts an OFM, EFM. Provides the division product of the corresponding
values.
Generating a log-odds matrix (LOM):
----------------------------------
Use: LOM=_build_log_odds_mat(SFM[,logbase=10,factor=10.0,roundit=1])
Accepts an SFM. logbase: base of the logarithm used to generate the
log-odds values. factor: factor used to multiply the log-odds values.
roundit: default - true. Whether to round the values.
Each entry is generated by log(LOM[key])*factor
And rounded if required.
External:
---------
In most cases, users will want to generate a log-odds matrix only, without
explicitly calling the OFM --> EFM --> SFM stages. The function
build_log_odds_matrix does that. User provides an ARM and an expected
frequency table. The function returns the log-odds matrix.
Methods for subtraction, addition and multiplication of matrices:
-----------------------------------------------------------------
* Generation of an expected frequency table from an observed frequency
matrix.
* Calculation of linear correlation coefficient between two matrices.
* Calculation of relative entropy is now done using the
_make_relative_entropy method and is stored in the member
self.relative_entropy
* Calculation of entropy is now done using the _make_entropy method and
is stored in the member self.entropy.
* Jensen-Shannon distance between the distributions from which the
matrices are derived. This is a distance function based on the
distribution's entropies.
|
_build_obs_freq_mat(acc_rep_mat)
build_obs_freq_mat(acc_rep_mat): Build the observed frequency matrix,
from an accepted replacements matrix The acc_rep_mat matrix should be
generated by the user. |
source code
|
|
|
_exp_freq_table_from_obs_freq(obs_freq_mat) |
source code
|
|
|
_build_exp_freq_mat(exp_freq_table)
Build an expected frequency matrix exp_freq_table: should be a
FreqTable instance |
source code
|
|
|
_build_subs_mat(obs_freq_mat,
exp_freq_mat)
Build the substitution matrix |
source code
|
|
|
_build_log_odds_mat(subs_mat,
logbase=2,
factor=10.0,
round_digit=0,
keep_nd=0)
_build_log_odds_mat(subs_mat,logbase=10,factor=10.0,round_digit=1):
Build a log-odds matrix logbase=2: base of logarithm used to build
(default 2) factor=10.: a factor by which each matrix entry is
multiplied round_digit: roundoff place after decimal point keep_nd:
if true, keeps the -999 value for non-determined values (for which
there are no substitutions in the frequency substitutions matrix). |
source code
|
|
|
make_log_odds_matrix(acc_rep_mat,
exp_freq_table=None,
logbase=2,
factor=1.0,
round_digit=9,
keep_nd=0) |
source code
|
|
|
observed_frequency_to_substitution_matrix(obs_freq_mat) |
source code
|
|
|
|
|
two_mat_relative_entropy(mat_1,
mat_2,
logbase=2,
diag=3) |
source code
|
|
|
|
|
two_mat_DJS(mat_1,
mat_2,
pi_1=0.5,
pi_2=0.5) |
source code
|
|
|
NOTYPE = 0
|
|
ACCREP = 1
|
|
OBSFREQ = 2
|
|
SUBS = 3
|
|
EXPFREQ = 4
|
|
LO = 5
|
|
EPSILON = 1e-14
|
|
diagNO = 1
|
|
diagONLY = 2
|
|
diagALL = 3
|
|
__package__ = ' Bio.SubsMat '
|
_build_log_odds_mat(subs_mat,
logbase=2,
factor=10.0,
round_digit=0,
keep_nd=0)
| source code
|
_build_log_odds_mat(subs_mat,logbase=10,factor=10.0,round_digit=1):
Build a log-odds matrix logbase=2: base of logarithm used to build
(default 2) factor=10.: a factor by which each matrix entry is multiplied
round_digit: roundoff place after decimal point keep_nd: if true, keeps
the -999 value for non-determined values (for which there are no
substitutions in the frequency substitutions matrix). If false, plants
the minimum log-odds value of the matrix in entries containing -999
|