Sampling from Potts Models
The mcmc_sample function allows you to sample sequences from a Potts model using Markov Chain Monte Carlo (MCMC) methods. This page provides an overview of the sampling process, including how to use the mcmc_sample function, its inputs, and outputs. The docstring of mcmc_sample provides extra details.
Overview
The mcmc_sample function supports different modes of sampling.
- Chain: sample sequences along a Markov chain for a specified number of steps.
- Tree: sample sequences along the branches of a phylogenetic tree.
- Discrete: the evolutionary time corresponds to a discrete number of mcmc steps.
- Continuous: the evolutionary time corresponds to a continuous time markov chain, as in many phylogenetics methods.
It is of course possible to mix modes (1,2) with (3,4). The sampling process is controlled by the structure SamplingParameters, which allows you to customize the behavior of the MCMC algorithm.
Chain Sampling
Start an evolutionary trajectory, i.e. a montecarlo chain, which is sampled at specified times. The code below samples M=5 sequences, each separated by 50 discrete Gibbs steps.
julia> L, q, M = 20, 21, 5; # sequence of lenth 20, 21 amino acids, 5 samplesjulia> g = PottsGraph(L, q, Float64; init=:rand); # random graphjulia> params = SamplingParameters(; sampling_type=:discrete, Teq=50, burnin=100)SamplingParameters{Int64} sampling_type = :discrete step_type = :gibbs Teq = 50 burnin = 100 step_meaning = :accepted fraction_gap_step = 0.9 branchlength_meaning = BranchLengthMeaning(:step, :exact)julia> initial_sequence = AASequence(L)AASequence{Int64}([4, 6, 1, 8, 19, 5, 12, 1, 13, 14, 16, 14, 18, 13, 21, 20, 18, 5, 1, 1])julia> result = mcmc_sample(g, M, params; init=initial_sequence);
The output is a named tuple with the following fields:
sequences: an alignment or vector of sampled sequences.tvals: a vector with the number of steps at each sample.info: information about the run.params: parameters used for the run as aDict.
julia> result.sequences # a BioSequenceMappings.AlignmentAlignment of M=5 sequences of length L=20 - Shown as `MxL` matrix 5×20 adjoint(::Matrix{Int64}) with eltype Int64: 18 7 7 19 20 6 3 2 12 … 6 2 18 12 14 3 16 12 21 17 14 15 21 8 10 10 19 18 4 3 5 11 13 4 16 9 3 10 15 18 11 11 6 12 1 6 4 1 9 11 8 5 9 10 1 18 1 17 15 11 8 2 8 16 15 10 16 12 9 20 10 17 1 20 6 1 5 21 17 7 8 11 19 6 13 18 14 21 18 19 7julia> write("/tmp/example.fasta", result.sequences) # can be written to fasta formatjulia> run(`cat /tmp/example.fasta`) # headers correspond to the time of sampling>100 TGGVWFCAMRAFATMPCRMY >150 SPQYHKKVTRADCELNDRIC >200 KQTLLFM-FTAD-ILHEIK- >250 T-SQLHAHRTPQKRMIWKS- >300 WF-EYSGHLKCVFNTPYTVG Process(`cat /tmp/example.fasta`, ProcessExited(0))
The burnin parameter means that 100 steps are performed before taking the first sample in the chain. The first sequence of the chain is therefore different from initial_sequence:
julia> AASequence(result.sequences[1]) != initial_sequence
trueIf we want to start the chain from a specific sequence, we just set burnin to 0.
julia> params = SamplingParameters(; sampling_type=:discrete, Teq=50, burnin=0);
julia> result = mcmc_sample(g, M, params; init=initial_sequence);
julia> AASequence(result.sequences[1]) == initial_sequence
trueIt is also possible to sample at non regularly spaced times by providing a vector of time values instead of the number M of sequences.
julia> time_values = round.(Int, logrange(10, 1000, length=M)) # times at which the chain is sampled
5-element Vector{Int64}:
10
32
100
316
1000
julia> result = mcmc_sample(g, time_values, params; init=initial_sequence);
julia> length(result.sequences)
5
julia> result.tvals == time_values
true
julia> result.sequences.names # sequence labels in the alignment correspond to sampling time
5-element Vector{String}:
"10"
"32"
"100"
"316"
"1000"Tree Sampling
Sampling along the branches of the tree is also done using mcmc_sample. The input tree is provided either from a newick file, either as a TreeTools.Tree objet, see TreeTools.
julia> params = SamplingParameters(; sampling_type=:discrete, Teq=50, burnin=0);julia> result = mcmc_sample(g, "../../example/small_tree_integers.nwk", params; init=initial_sequence);julia> result.tree # the tree object, with sequences stored at each node_________________________ 1 ________________________| | |_________________________ 2 ________________________| | | _________________________ 7 | |________________________| | |_________________________ 8 _| | _________________________ 3 | ________________________| | | |_________________________ 4 |________________________| | _________________________ 5 |________________________| |_________________________ 6julia> result.tree.root.data.seq == initial_sequence # the initial sequence is placed at the roottrue
The return value contains alignments of leaf and internal node sequences. Sequences in the alignments are labeled according to the labels of the nodes in the input tree. If nodes are not labeled in the input tree, they are automatically assigned labels which can be read in the output tree.
julia> result.internal_sequencesAlignment of M=7 sequences of length L=20 - Shown as `MxL` matrix 7×20 adjoint(::Matrix{Int64}) with eltype Int64: 15 18 16 6 8 5 18 2 9 … 14 2 1 17 13 19 12 15 19 13 20 1 10 4 13 4 3 15 2 16 20 19 17 2 14 17 20 12 4 3 20 20 21 9 3 1 4 16 11 2 10 19 14 5 17 6 10 1 21 21 5 2 16 17 3 9 18 9 18 16 3 1 9 5 2 8 7 1 11 17 15 21 3 16 4 5 17 2 14 12 15 21 3 2 8 20 5 21 4 14 … 3 12 18 13 10 3 8 1 5 4 6 1 8 19 5 12 1 13 14 18 13 21 20 18 5 1 1julia> result.leaf_sequencesAlignment of M=8 sequences of length L=20 - Shown as `MxL` matrix 8×20 adjoint(::Matrix{Int64}) with eltype Int64: 4 9 7 6 21 12 21 6 2 … 9 10 19 6 13 1 5 16 17 11 5 6 4 18 18 2 5 6 12 2 15 17 13 1 12 11 6 9 10 20 10 19 17 13 12 5 11 2 15 19 20 6 21 17 14 20 15 20 20 21 10 1 11 1 3 2 9 17 17 2 14 4 19 12 10 9 9 20 9 8 15 15 14 16 11 15 4 14 13 16 17 18 19 11 12 7 19 11 3 17 … 2 5 11 4 8 8 4 21 9 6 2 20 9 2 11 10 8 17 18 16 17 20 11 1 17 21 15 21 2 19 10 3 15 21 6 5 2 16 1 7 13 20 18 8 12julia> result.leaf_sequences.names8-element Vector{String}: "1" "2" "7" "8" "3" "4" "5" "6"
Alternatively, sequences can be stored in dictionaries indexed by node labels:
julia> result = mcmc_sample( g, "../../example/small_tree_integers.nwk", params; init=initial_sequence, alignment_output=false );julia> result.leaf_sequencesDict{String, AASequence{Int64}} with 8 entries: "3" => AASequence{Int64}([13, 5, 12, 9, 11, 12, 14, 10, 19, 18, 20, 18, 20, 9… "4" => AASequence{Int64}([20, 1, 14, 13, 16, 10, 7, 15, 16, 15, 11, 8, 2, 9, … "1" => AASequence{Int64}([6, 17, 15, 11, 20, 10, 9, 1, 19, 6, 18, 10, 18, 7, … "5" => AASequence{Int64}([6, 20, 17, 13, 3, 15, 10, 19, 20, 19, 10, 8, 16, 21… "2" => AASequence{Int64}([4, 4, 11, 8, 4, 10, 6, 20, 11, 14, 12, 21, 6, 14, 4… "7" => AASequence{Int64}([2, 3, 5, 19, 18, 11, 7, 9, 16, 6, 7, 20, 4, 5, 4, 1… "6" => AASequence{Int64}([8, 11, 12, 16, 5, 20, 3, 17, 20, 4, 18, 21, 11, 19,… "8" => AASequence{Int64}([20, 5, 3, 6, 9, 11, 1, 20, 12, 20, 20, 21, 16, 9, 2…
It is sometimes useful to perform sampling several times in a row. This is done by calling mcmc_sample(graph, tree, M, params), where M is the number of repetitions. In this case, it is can be practical to have all sequences sampled at a given node grouped in a single alignment, instead of having a list of alignments that each correspond to one tree. This is achieved by calling PottsEvolver.pernode_alignment on the output of mcmc_sample:
julia> M = 5 # five repeats5julia> result = mcmc_sample( g, "../../example/small_tree_integers.nwk", M, params; init=initial_sequence );julia> result = PottsEvolver.pernode_alignment(result);julia> result.sequences["1"] # all sequences sampled at node "1"Alignment of M=5 sequences of length L=20 - Shown as `MxL` matrix 5×20 adjoint(::Matrix{Int64}) with eltype Int64: 1 2 6 18 15 1 6 9 15 … 19 17 4 20 21 11 9 3 7 17 13 2 2 1 2 6 16 13 20 1 2 21 21 17 16 5 5 13 3 15 11 1 4 3 16 18 2 9 17 9 12 12 15 19 19 9 16 12 6 19 12 5 2 3 20 7 4 20 5 3 11 7 19 17 10 17 6 9 6 17 14 11 9 4 7 1 20 19 5 17 15
Interpreting branch lengths
When sampling from a continuous time model, any real valued branch length is valid. Things get more complicated when sampling from a discrete model, where evolutionary time is measured in number of MCMC steps. The BranchLengthMeaning type is used to facilitate this. It is provided as a field of SamplingParameters. For instance
SamplingParameters(; branchlength_meaning=BranchLengthMeaning(; type=:sweep, length=:exact))will first multiply each branch in the tree by a factor L (length of the sequence), and expects the result to be exactly an integer.
Continuous time sampling
Discrete sampling is quite straightforward: the time corresponds to a discrete number of mcmc steps. In continuous time however, it is customary to scale time so that the expected number of mutations per site per unit of time is 1. This is done by setting the average_substitution_rate field of SamplingParameters, see Generative continuous time model reveals epistatic signatures in protein evolution. This average substituion rate can be computed either by sampling the potts model or by using an existing alignment of sequences.
julia> n_samples = 100 # compute the average transition rate from 100 samples100julia> Teq = 10*L; # discrete steps between samplesjulia> step_type = :glauber;julia> Ω_1 = let # first generate sample with discrete time, then use it to compute the average rate s0 = AASequence(L) params = SamplingParameters(; sampling_type=:discrete, Teq) aln = mcmc_sample(g, M, params; init=s0).sequences write("/tmp/tmp.fasta", aln) # save the alignment for use below aln = map(AASequence, aln) # compute the average transition rate from the sampled alignment PottsEvolver.average_transition_rate(g, step_type, aln) end193.48655308240185julia> Ω_2 = let # Read the previously sampled alignment, stored in a fasta # Ω_1 == Ω_2 PottsEvolver.average_transition_rate(g, step_type, "/tmp/tmp.fasta") end193.48655308240185julia> Ω_3 = let # call the average_transition_rate function directly, which will sample the potts model s0 = AASequence(L) PottsEvolver.average_transition_rate(g, step_type, s0; n_samples, Teq) end190.02859162220207
Now, define SamplingParameters(; sampling_type=:continuous, average_substitution_rate=Ω_1) to have a time-normalized MCMC.
Additional Utilities
Logging
You can control the verbosity of the sampling process using the verbose and logfile parameters:
julia> result = mcmc_sample(g, 1000, params; init=:random_aa, verbose=1);┌ Info: Sampling 1000 sequences using the following settings: │ - Type of sampling: discrete │ - Type of sequence = AASequence{Int64} │ - Step style = gibbs │ - Step meaning = accepted │ - Fraction of gap steps (if codon) = 0.9 └ - Initial/last time step: 0/49950 [ Info: Initial sequence: AASequence{Int64}([19, 10, 4, 15, 1, 13, 13, 8, 12, 8, 5, 4, 5, 17, 1, 5, 3, 16, 7, 17]) [ Info: Initializing with 0 burnin iterations... [ Info: Sampling... [ Info: Sampling done in 0.628777712 seconds