Utilities

Hamming distance

hamming(x, y) returns the normalized Hamming distance between two integer vectors.

Alignment statistics

Pairwise Hamming distance

pairwise_hamming(A) returns a vector containing all Hamming distances between pairs of sequences in A. For large alignments, it is often practical to consider only a subset of Hamming distances: the step=n keyword can be used to only consider every nth sequence. The function is also adapted to two alignments: [pairwise_hamming(A,B)] will consider all pairs of sequences with one member in A and the other one in B.

Statistics

All these functions accept a vector of weights as a second argument, and will by default use A.weights if it is not provided.

Weights

In DCA-like algorithms, it is customary to weight sequences by their degree of phylogenetic relations. Typically, the weight of a sequence is inversely proportional to the number of other sequences with a Hamming distance smaller than some threshold. For an alignment X, computing the weights is done using compute_weights:


julia> A = read_fasta("../../example/toy_fasta_dna.fasta")Alignment of M=5 sequences of length L=10 - Shown as `MxL` matrix 5×10 adjoint(::Matrix{Int64}) with eltype Int64: 5 3 4 2 4 1 5 1 5 5 2 4 4 1 4 2 3 2 2 3 4 3 5 3 3 5 5 3 5 1 3 3 5 2 2 1 5 3 1 2 2 3 2 5 3 2 4 3 2 4
julia> compute_weights(A) # compute and return the weight vector, as well as the effective number of sequences([0.2, 0.2, 0.2, 0.2, 0.2], 5.0)
julia> compute_weights!(A); # same, but sets the weights field in A
julia> A.weights5-element Vector{Float64}: 0.2 0.2 0.2 0.2 0.2