Sampling from Potts Models

The mcmc_sample function allows you to sample sequences from a Potts model using Markov Chain Monte Carlo (MCMC) methods. This page provides an overview of the sampling process, including how to use the mcmc_sample function, its inputs, and outputs. The docstring of mcmc_sample provides extra details.

Overview

The mcmc_sample function supports different modes of sampling.

Chain: sample sequences along a Markov chain for a specified number of steps.
Tree: sample sequences along the branches of a phylogenetic tree.
Discrete: the evolutionary time corresponds to a discrete number of mcmc steps.
Continuous: the evolutionary time corresponds to a continuous time markov chain, as in many phylogenetics methods.

It is of course possible to mix modes (1,2) with (3,4). The sampling process is controlled by the structure SamplingParameters, which allows you to customize the behavior of the MCMC algorithm.

Chain Sampling

Start an evolutionary trajectory, i.e. a montecarlo chain, which is sampled at specified times. The code below samples M=5 sequences, each separated by 50 discrete Gibbs steps.

julia> L, q, M = 20, 21, 5; # sequence of lenth 20, 21 amino acids, 5 samples
julia> g = PottsGraph(L, q, Float64; init=:rand); # random graph
julia> params = SamplingParameters(; sampling_type=:discrete, Teq=50, burnin=100)SamplingParameters{Int64}
  sampling_type = :discrete
  step_type = :gibbs
  Teq = 50
  burnin = 100
  step_meaning = :accepted
  fraction_gap_step = 0.9
  branchlength_meaning = BranchLengthMeaning(:step, :exact)
julia> initial_sequence = AASequence(L)AASequence{Int64}([4, 6, 1, 8, 19, 5, 12, 1, 13, 14, 16, 14, 18, 13, 21, 20, 18, 5, 1, 1])
julia> result = mcmc_sample(g, M, params; init=initial_sequence);

The output is a named tuple with the following fields:

sequences: an alignment or vector of sampled sequences.
tvals: a vector with the number of steps at each sample.
info: information about the run.
params: parameters used for the run as a Dict.

julia> result.sequences # a BioSequenceMappings.AlignmentAlignment of M=5 sequences of length L=20 - Shown as `MxL` matrix
5×20 adjoint(::Matrix{Int64}) with eltype Int64:
 18   7   7  19  20   6   3   2  12  …   6   2  18  12  14   3  16  12  21
 17  14  15  21   8  10  10  19  18      4   3   5  11  13   4  16   9   3
 10  15  18  11  11   6  12   1   6      4   1   9  11   8   5   9  10   1
 18   1  17  15  11   8   2   8  16     15  10  16  12   9  20  10  17   1
 20   6   1   5  21  17   7   8  11     19   6  13  18  14  21  18  19   7
julia> write("/tmp/example.fasta", result.sequences) # can be written to fasta format
julia> run(`cat /tmp/example.fasta`) # headers correspond to the time of sampling>100
TGGVWFCAMRAFATMPCRMY
>150
SPQYHKKVTRADCELNDRIC
>200
KQTLLFM-FTAD-ILHEIK-
>250
T-SQLHAHRTPQKRMIWKS-
>300
WF-EYSGHLKCVFNTPYTVG
Process(`cat /tmp/example.fasta`, ProcessExited(0))

The burnin parameter means that 100 steps are performed before taking the first sample in the chain. The first sequence of the chain is therefore different from initial_sequence:

julia> AASequence(result.sequences[1]) != initial_sequence
true

If we want to start the chain from a specific sequence, we just set burnin to 0.

julia> params = SamplingParameters(; sampling_type=:discrete, Teq=50, burnin=0);

julia> result = mcmc_sample(g, M, params; init=initial_sequence);

julia> AASequence(result.sequences[1]) == initial_sequence
true

It is also possible to sample at non regularly spaced times by providing a vector of time values instead of the number M of sequences.

julia> time_values = round.(Int, logrange(10, 1000, length=M)) # times at which the chain is sampled
5-element Vector{Int64}:
   10
   32
  100
  316
 1000

julia> result = mcmc_sample(g, time_values, params; init=initial_sequence);

julia> length(result.sequences)
5

julia> result.tvals == time_values
true

julia> result.sequences.names # sequence labels in the alignment correspond to sampling time
5-element Vector{String}:
 "10"
 "32"
 "100"
 "316"
 "1000"

Tree Sampling

Sampling along the branches of the tree is also done using mcmc_sample. The input tree is provided either from a newick file, either as a TreeTools.Tree objet, see TreeTools.

julia> params = SamplingParameters(; sampling_type=:discrete, Teq=50, burnin=0);
julia> result = mcmc_sample(g, "../../example/small_tree_integers.nwk", params; init=initial_sequence);
julia> result.tree # the tree object, with sequences stored at each node
                                                    _________________________ 1
                           ________________________|
                          |                        |_________________________ 2
  ________________________|
 |                        |                         _________________________ 7
 |                        |________________________|
 |                                                 |_________________________ 8
_|
 |                                                  _________________________ 3
 |                         ________________________|
 |                        |                        |_________________________ 4
 |________________________|
                          |                         _________________________ 5
                          |________________________|
                                                   |_________________________ 6
julia> result.tree.root.data.seq == initial_sequence # the initial sequence is placed at the roottrue

The return value contains alignments of leaf and internal node sequences. Sequences in the alignments are labeled according to the labels of the nodes in the input tree. If nodes are not labeled in the input tree, they are automatically assigned labels which can be read in the output tree.

julia> result.internal_sequencesAlignment of M=7 sequences of length L=20 - Shown as `MxL` matrix
7×20 adjoint(::Matrix{Int64}) with eltype Int64:
 15  18  16   6   8   5  18   2   9  …  14   2   1  17  13  19  12  15  19
 13  20   1  10   4  13   4   3  15      2  16  20  19  17   2  14  17  20
 12   4   3  20  20  21   9   3   1      4  16  11   2  10  19  14   5  17
  6  10   1  21  21   5   2  16  17      3   9  18   9  18  16   3   1   9
  5   2   8   7   1  11  17  15  21      3  16   4   5  17   2  14  12  15
 21   3   2   8  20   5  21   4  14  …   3  12  18  13  10   3   8   1   5
  4   6   1   8  19   5  12   1  13     14  18  13  21  20  18   5   1   1
julia> result.leaf_sequencesAlignment of M=8 sequences of length L=20 - Shown as `MxL` matrix
8×20 adjoint(::Matrix{Int64}) with eltype Int64:
  4   9   7   6  21  12  21   6   2  …   9  10  19   6  13   1   5  16  17
 11   5   6   4  18  18   2   5   6     12   2  15  17  13   1  12  11   6
  9  10  20  10  19  17  13  12   5     11   2  15  19  20   6  21  17  14
 20  15  20  20  21  10   1  11   1      3   2   9  17  17   2  14   4  19
 12  10   9   9  20   9   8  15  15     14  16  11  15   4  14  13  16  17
 18  19  11  12   7  19  11   3  17  …   2   5  11   4   8   8   4  21   9
  6   2  20   9   2  11  10   8  17     18  16  17  20  11   1  17  21  15
 21   2  19  10   3  15  21   6   5      2  16   1   7  13  20  18   8  12
julia> result.leaf_sequences.names8-element Vector{String}:
 "1"
 "2"
 "7"
 "8"
 "3"
 "4"
 "5"
 "6"

Alternatively, sequences can be stored in dictionaries indexed by node labels:

julia> result = mcmc_sample(
           g, "../../example/small_tree_integers.nwk", params;
           init=initial_sequence, alignment_output=false
       );
julia> result.leaf_sequencesDict{String, AASequence{Int64}} with 8 entries:
  "3" => AASequence{Int64}([13, 5, 12, 9, 11, 12, 14, 10, 19, 18, 20, 18, 20, 9…
  "4" => AASequence{Int64}([20, 1, 14, 13, 16, 10, 7, 15, 16, 15, 11, 8, 2, 9, …
  "1" => AASequence{Int64}([6, 17, 15, 11, 20, 10, 9, 1, 19, 6, 18, 10, 18, 7, …
  "5" => AASequence{Int64}([6, 20, 17, 13, 3, 15, 10, 19, 20, 19, 10, 8, 16, 21…
  "2" => AASequence{Int64}([4, 4, 11, 8, 4, 10, 6, 20, 11, 14, 12, 21, 6, 14, 4…
  "7" => AASequence{Int64}([2, 3, 5, 19, 18, 11, 7, 9, 16, 6, 7, 20, 4, 5, 4, 1…
  "6" => AASequence{Int64}([8, 11, 12, 16, 5, 20, 3, 17, 20, 4, 18, 21, 11, 19,…
  "8" => AASequence{Int64}([20, 5, 3, 6, 9, 11, 1, 20, 12, 20, 20, 21, 16, 9, 2…

It is sometimes useful to perform sampling several times in a row. This is done by calling mcmc_sample(graph, tree, M, params), where M is the number of repetitions. In this case, it is can be practical to have all sequences sampled at a given node grouped in a single alignment, instead of having a list of alignments that each correspond to one tree. This is achieved by calling PottsEvolver.pernode_alignment on the output of mcmc_sample:

julia> M = 5 # five repeats5
julia> result = mcmc_sample(
           g, "../../example/small_tree_integers.nwk", M, params; init=initial_sequence
       );
julia> result = PottsEvolver.pernode_alignment(result);
julia> result.sequences["1"] # all sequences sampled at node "1"Alignment of M=5 sequences of length L=20 - Shown as `MxL` matrix
5×20 adjoint(::Matrix{Int64}) with eltype Int64:
  1   2   6  18  15   1   6   9  15  …  19  17   4  20  21  11   9   3   7
 17  13   2   2   1   2   6  16  13     20   1   2  21  21  17  16   5   5
 13   3  15  11   1   4   3  16  18      2   9  17   9  12  12  15  19  19
  9  16  12   6  19  12   5   2   3     20   7   4  20   5   3  11   7  19
 17  10  17   6   9   6  17  14  11      9   4   7   1  20  19   5  17  15

Interpreting branch lengths

When sampling from a continuous time model, any real valued branch length is valid. Things get more complicated when sampling from a discrete model, where evolutionary time is measured in number of MCMC steps. The BranchLengthMeaning type is used to facilitate this. It is provided as a field of SamplingParameters. For instance

SamplingParameters(; branchlength_meaning=BranchLengthMeaning(; type=:sweep, length=:exact))

will first multiply each branch in the tree by a factor L (length of the sequence), and expects the result to be exactly an integer.

Continuous time sampling

Discrete sampling is quite straightforward: the time corresponds to a discrete number of mcmc steps. In continuous time however, it is customary to scale time so that the expected number of mutations per site per unit of time is 1. This is done by setting the average_substitution_rate field of SamplingParameters, see Generative continuous time model reveals epistatic signatures in protein evolution. This average substituion rate can be computed either by sampling the potts model or by using an existing alignment of sequences.

julia> n_samples = 100 # compute the average transition rate from 100 samples100
julia> Teq = 10*L; # discrete steps between samples
julia> step_type = :glauber;
julia> Ω_1 = let
           # first generate sample with discrete time, then use it to compute the average rate
           s0 = AASequence(L)
           params = SamplingParameters(; sampling_type=:discrete, Teq)
           aln = mcmc_sample(g, M, params; init=s0).sequences
           write("/tmp/tmp.fasta", aln) # save the alignment for use below
           aln = map(AASequence, aln)
           # compute the average transition rate from the sampled alignment
           PottsEvolver.average_transition_rate(g, step_type, aln)
       end193.48655308240185
julia> Ω_2 = let
           # Read the previously sampled alignment, stored in a fasta
           # Ω_1 == Ω_2
           PottsEvolver.average_transition_rate(g, step_type, "/tmp/tmp.fasta")
       end193.48655308240185
julia> Ω_3 = let
           # call the average_transition_rate function directly, which will sample the potts model
           s0 = AASequence(L)
           PottsEvolver.average_transition_rate(g, step_type, s0; n_samples, Teq)
       end190.02859162220207

Now, define SamplingParameters(; sampling_type=:continuous, average_substitution_rate=Ω_1) to have a time-normalized MCMC.

Additional Utilities

Logging

You can control the verbosity of the sampling process using the verbose and logfile parameters:

julia> result = mcmc_sample(g, 1000, params; init=:random_aa, verbose=1);┌ Info: Sampling 1000 sequences using the following settings:
│     - Type of sampling: discrete
│     - Type of sequence = AASequence{Int64}
│     - Step style = gibbs
│     - Step meaning = accepted
│     - Fraction of gap steps (if codon) = 0.9
└     - Initial/last time step: 0/49950
[ Info: Initial sequence: AASequence{Int64}([19, 10, 4, 15, 1, 13, 13, 8, 12, 8, 5, 4, 5, 17, 1, 5, 3, 16, 7, 17])
[ Info: Initializing with 0 burnin iterations... 
[ Info: Sampling...
[ Info: Sampling done in 0.628777712 seconds