Utilities
Hamming distance
hamming(x, y)
returns the normalized Hamming distance between two integer vectors.
Alignment statistics
Pairwise Hamming distance
pairwise_hamming(A)
returns a vector containing all Hamming distances between pairs of sequences in A
. For large alignments, it is often practical to consider only a subset of Hamming distances: the step=n
keyword can be used to only consider every nth sequence. The function is also adapted to two alignments: [pairwise_hamming(A,B)
] will consider all pairs of sequences with one member in A
and the other one in B
.
Statistics
- the profile of the alignment:
site_specific_frequencies(A)
returns aq x L
matrix whereq
is the size of the alphabet andL
the length of sequences, with element(a, i)
being the fraction of sequences in which charactera
was found at positioni
. - the pairwise frequencies:
pairwise_frequencies(A)
- the pairwise correlations:
pairwise_correlations(A)
All these functions accept a vector of weights as a second argument, and will by default use A.weights
if it is not provided.
Weights
In DCA-like algorithms, it is customary to weight sequences by their degree of phylogenetic relations. Typically, the weight of a sequence is inversely proportional to the number of other sequences with a Hamming distance smaller than some threshold. For an alignment X
, computing the weights is done using compute_weights
:
julia> A = read_fasta("../../example/toy_fasta_dna.fasta")
Alignment of M=5 sequences of length L=10 - Shown as `MxL` matrix 5×10 adjoint(::Matrix{Int64}) with eltype Int64: 5 3 4 2 4 1 5 1 5 5 2 4 4 1 4 2 3 2 2 3 4 3 5 3 3 5 5 3 5 1 3 3 5 2 2 1 5 3 1 2 2 3 2 5 3 2 4 3 2 4
julia> compute_weights(A) # compute and return the weight vector, as well as the effective number of sequences
([0.2, 0.2, 0.2, 0.2, 0.2], 5.0)
julia> compute_weights!(A); # same, but sets the weights field in A
julia> A.weights
5-element Vector{Float64}: 0.2 0.2 0.2 0.2 0.2