Documentation for BioSequenceMappings.

BioSequenceMappings.Alignment - Type
mutable struct Alignment{A, T<:Integer}
    data::Matrix{T}
    alphabet::Union{Nothing, Alphabet{A,T}}
    weights::Vector{Float64} = ones(size(data, 2)) / size(data, 2) # phylogenetic weights of sequences
    names::Vector{String} = fill("", size(data, 2))
end

Biological sequences as vectors of type T<:Integer. data stores sequences in columns: size(data) returns a tuple (L, M), with L the sequence length and M the number of sequences. When displayed, data is shown as an M x L matrix to match the traditional alignment layout.

alphabet (of type Alphabet{A,T}) represents the mapping between the integers in data and biological symbols of type A (nucleotides, amino acids, ...). If it is nothing, the alignment cannot be mapped to biological sequences.

weights represent phylogenetic weights and are initialized to 1/M. They must sum to 1. names are the labels of the sequences and are expected to be in the same order as the columns of data. They do not have to be unique and can be ignored.

Important: When built from a matrix, assumes that the sequences are stored in columns.

Methods

  • getindex(X::Alignment, i) returns a matrix/vector X.data[:, i].
  • for s in X::Alignment iterates over sequences.
  • eachsequence(X::Alignment) returns an iterator over sequences (Vector{Int}).
  • eachsequence_weighted(X::Alignment) returns an iterator over sequences and weights as tuples.
  • subaln(X::Alignment, idx) constructs the sub-alignment defined by index idx.
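
The column-wise storage described above can be illustrated with a plain matrix, without loading the package. This is only a minimal sketch: `data`, `weights` and `second_sequence` are ordinary Julia variables standing in for the corresponding fields and indexing behavior, not an actual Alignment.

```julia
# Three sequences of length 4, stored as columns (L = 4, M = 3).
data = [1 2 1;
        2 2 3;
        3 1 3;
        4 4 4]

L, M = size(data)             # L = sequence length, M = number of sequences
weights = fill(1 / M, M)      # uniform weights that sum to 1

second_sequence = data[:, 2]  # analogous to X[2], i.e. X.data[:, 2]

# Iterating over columns visits one sequence at a time.
for (i, s) in enumerate(eachcol(data))
    println("sequence ", i, ": ", collect(s))
end
```

Keeping sequences in columns means each sequence is contiguous in memory, since Julia arrays are column-major.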
BioSequenceMappings.Alignment - Method
Alignment(data::AbstractMatrix{T}; alphabet = :auto, kwargs...)

Keyword argument alphabet can be :auto, :none/nothing, or an input to the constructor Alphabet. Other keyword arguments are passed to the default constructor of Alignment.

BioSequenceMappings.Alignment - Method
Alignment(data::AbstractMatrix, alphabet; kwargs...)

data is a matrix of integers, with sequences stored in columns. alphabet can be either

  • an Alphabet
  • nothing: no conversion from integers to biological symbols.
  • something to build an alphabet from (e.g. a symbol like :aa, a string, ...). The constructor Alphabet will be called like so: Alphabet(alphabet).

If the types of alphabet and data mismatch, data is converted.

BioSequenceMappings.Alphabet - Type
struct Alphabet{A,I}
    characters::Vector{A}
    char_to_index::Dict{A, I}
    index_to_char::Dict{I, A}
    default_char = nothing
    default_index
end

Structure allowing the mapping from biological symbols of type A to integers of type I. The typical use case would be Alphabet{Char, Int}. Alphabet can be constructed

  • from a Vector of symbols and an optional type I, e.g. Alphabet(['A','C','G','T'], UInt8)::Alphabet{Char, UInt8}
  • from a String and an optional type, e.g. Alphabet("ACGT")
  • from a mapping Dict{A, I} where I<:Integer: Alphabet(Dict('A'=>1, 'C'=>2))
  • from a Symbol, using default alphabets, e.g. Alphabet(:nt)
  • from an integer, using default alphabets (see ?default_alphabets).
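
The two-way mapping held by char_to_index and index_to_char can be sketched with plain dictionaries. The helper names `encode` and `decode` below are hypothetical and not part of the package, and the 1-based indexing in string order is an assumption made for illustration.

```julia
characters = collect("ACGT")   # Vector{Char}

# Symbol -> integer, with UInt8 playing the role of the integer type I.
char_to_index = Dict(c => UInt8(i) for (i, c) in enumerate(characters))
# Integer -> symbol, the inverse mapping.
index_to_char = Dict(i => c for (c, i) in char_to_index)

# Hypothetical helpers: map a string to integers and back.
encode(seq::AbstractString) = [char_to_index[c] for c in seq]
decode(x::AbstractVector{<:Integer}) = join(index_to_char[i] for i in x)

x = encode("GATTACA")
s = decode(x)
```

Storing both dictionaries trades a little memory for constant-time lookups in either direction.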
BioSequenceMappings.compute_weights - Function
compute_weights(X::AbstractAlignment, θ = 0.2; normalize = true)

Compute phylogenetic correction weights for the sequences of X. The weight of sequence S is 1/N, where N is the number of sequences in X at Hamming distance less than H from S (including S itself). The threshold H is floor(θ⋅L), where L is the sequence length.

The return value is a tuple (weights, Meff), where Meff is the sum of the weights (pre-normalization). If normalize is true, the weights are normalized to sum to one.
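
The weighting rule described above can be sketched on a plain integer matrix with sequences in columns. `weights_sketch` and `count_differences` are hypothetical names, and this naive double loop is written for readability rather than taken from the package.

```julia
# Number of positions at which two sequences differ.
count_differences(x, y) = count(i -> x[i] != y[i], eachindex(x, y))

function weights_sketch(data::AbstractMatrix{<:Integer}, θ = 0.2; normalize = true)
    L, M = size(data)
    H = floor(Int, θ * L)   # distance threshold
    w = zeros(Float64, M)
    for m in 1:M
        sm = view(data, :, m)
        # Count the sequence itself, plus every other sequence closer than H.
        N = 1 + count(n -> n != m && count_differences(sm, view(data, :, n)) < H, 1:M)
        w[m] = 1 / N
    end
    Meff = sum(w)           # sum of weights before normalization
    return (normalize ? w ./ Meff : w), Meff
end

# Columns 1 and 2 differ at one site; column 3 is far from both.
data = [1 1 2;
        1 1 2;
        1 1 2;
        1 2 2]
w, Meff = weights_sketch(data, 0.5)
```

Here the two similar sequences share their weight, so the three sequences count as two effective ones.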

BioSequenceMappings.default_alphabet - Method
default_alphabet(q::Int, T::Type)
  • if q==21, amino acids
  • if q==5, nucleotides
  • if q==4, nucleotides without gaps
  • if q==2, binary (0, 1)
  • else, if q<21, return the restriction of amino acids to the first q symbols
  • if q>21, fails
BioSequenceMappings.eachsequence - Method
eachsequence(X::AbstractAlignment[, indices]; skip)

Return an iterator over the sequences in X. If indices is specified, consider only the sequences at the corresponding indices. Use the integer keyword argument skip to return only one sequence out of every skip (roughly 1:skip:end).
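
The effect of skip can be pictured on a plain matrix by stepping through the column indices. This is a sketch of the idea only; the variable `every` is a hypothetical stand-in for the skip argument.

```julia
data = reshape(collect(1:20), 2, 10)   # 10 sequences of length 2, stored as columns
every = 3                              # plays the role of `skip`

# Lazily iterate over columns 1, 4, 7 and 10 as views (no copies are made).
sequences = (view(data, :, i) for i in 1:every:size(data, 2))

picked = [collect(s) for s in sequences]
```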

BioSequenceMappings.find_sequence - Method
find_sequence(label::AbstractString, aln::AbstractAlignment)

Find sequence with name label in aln, and return (index, sequence). Scales as the number of sequences.

Important: returns a view of the sequence.

BioSequenceMappings.hamming - Method
hamming(x, y; normalize=true, positions=nothing)

Hamming distance between vectors x and y. If positions is given, only the sites listed in that vector are considered.
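
A minimal sketch of such a distance follows. `hamming_sketch` is a hypothetical name, and dividing by the number of sites considered when normalize is true is an assumption about the normalization, not something the docstring states.

```julia
function hamming_sketch(x, y; normalize = true, positions = nothing)
    # Compare all sites unless a subset of positions is given.
    sites = isnothing(positions) ? eachindex(x, y) : positions
    d = count(i -> x[i] != y[i], sites)
    # Assumption: normalization divides by the number of sites considered.
    return normalize ? d / length(sites) : d
end

x = [1, 2, 3, 4]
y = [1, 2, 4, 3]
d_norm = hamming_sketch(x, y)
d_raw = hamming_sketch(x, y; normalize = false)
d_sub = hamming_sketch(x, y; normalize = false, positions = [1, 2])
```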

BioSequenceMappings.match_sequences - Method
match_sequences(pattern, aln::AbstractAlignment)

Find sequences whose name matches pattern in aln, and return (indices, sequences). Sequences are returned as columns.

Important: returns a view of the sequences.

BioSequenceMappings.pairwise_correlations - Function
pairwise_correlations(X, w=X.weights; as_mat=false)

Compute connected correlations: the difference between the pairwise frequencies and the product of the single site frequencies. See ?pairwise_frequencies for the shape of the output.

BioSequenceMappings.pairwise_frequencies - Function
pairwise_frequencies(X::AbstractAlignment, w=X.weights; as_mat=false)

Return a q x q x L x L tensor. The (a, b, i, j) element is the fraction of sequences for which we see a at position i and b at position j.

If as_mat=true, will return a qL x qL matrix, with q x q blocks representing correlations between two specific columns.
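
The tensor layout, and its relation to connected correlations, can be sketched as follows. `pairwise_frequencies_sketch` is a hypothetical name; the sketch assumes states are integers in 1:q and that the weights sum to one.

```julia
function pairwise_frequencies_sketch(data::AbstractMatrix{<:Integer}, w, q)
    L, M = size(data)
    f2 = zeros(Float64, q, q, L, L)
    for m in 1:M, j in 1:L, i in 1:L
        # Sequence m contributes its weight to the (a, b, i, j) entry it realizes.
        f2[data[i, m], data[j, m], i, j] += w[m]
    end
    return f2
end

data = [1 1 2;
        2 1 2]          # L = 2 sites, M = 3 sequences, q = 2 states
w = fill(1 / 3, 3)
f2 = pairwise_frequencies_sketch(data, w, 2)

# Single-site frequencies sit on the diagonal: f1[a, i] = f2[a, a, i, i].
f1 = [f2[a, a, i, i] for a in 1:2, i in 1:2]

# Connected correlations: C[a, b, i, j] = f2[a, b, i, j] - f1[a, i] * f1[b, j].
C = [f2[a, b, i, j] - f1[a, i] * f1[b, j] for a in 1:2, b in 1:2, i in 1:2, j in 1:2]
```

With this definition, an entry of C is zero whenever the two sites are statistically independent for that pair of states.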

BioSequenceMappings.pairwise_hamming - Method
pairwise_hamming(X, Y; step=1, step_left, step_right, as_vec=true, kwargs...)
pairwise_hamming(X; kwargs...)

Return all hamming distances between sequences of X and Y. In the second form, consider pairs of sequences in X.

Only consider one sequence every step. step_left and step_right can be used to skip sequences in X or in Y, respectively. This is useful for large alignments, as the number of computations grows with the product of the sizes of the alignments.

By default, the return value is a vector organized like [H(1,2), H(1,3), ..., H(M-1, M)], with H standing for the Hamming distance and M for the number of sequences. If a matrix is preferred, use as_vec=false.

Extra keyword arguments are passed to hamming.
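
The ordering of the returned vector can be sketched for the single-alignment form. `pairwise_hamming_sketch` is a hypothetical name, and it returns raw mismatch counts for simplicity, whereas hamming defaults to normalize=true.

```julia
function pairwise_hamming_sketch(data::AbstractMatrix)
    M = size(data, 2)
    dists = Int[]
    # The outer loop runs over m and the inner loop over n > m,
    # giving the order H(1,2), H(1,3), ..., H(M-1,M).
    for m in 1:(M - 1), n in (m + 1):M
        push!(dists, count(data[:, m] .!= data[:, n]))
    end
    return dists
end

data = [1 1 2;
        1 1 2;
        1 1 2;
        1 2 2]
dists = pairwise_hamming_sketch(data)
```

The vector has M(M-1)/2 entries, which is why skipping sequences matters for large alignments.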

BioSequenceMappings.read_fasta - Method
read_fasta(fastafile::AbstractString; alphabet = :auto, kwargs...)
read_fasta(
    fastafile::AbstractString, alphabet;
    weights = false, theta = 0.2, verbose = false,
)
BioSequenceMappings.site_specific_frequencies - Function
site_specific_frequencies(X::AbstractAlignment[, weights=X.weights]; as_vec=false)

Return the site specific frequencies of X. If as_vec, the result is a vector of length Lxq. Otherwise, it is a matrix of q rows and L columns (default).
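
The q x L layout can be sketched on a plain matrix. `site_frequencies_sketch` is a hypothetical name; states are assumed to be integers in 1:q, and the column-major flattening used for the vector form is an assumption about the ordering.

```julia
function site_frequencies_sketch(data::AbstractMatrix{<:Integer}, w, q)
    L, M = size(data)
    f1 = zeros(Float64, q, L)
    for m in 1:M, i in 1:L
        # Sequence m adds its weight to the state it carries at site i.
        f1[data[i, m], i] += w[m]
    end
    return f1
end

data = [1 1 2;
        2 1 2]                         # L = 2 sites, M = 3 sequences
f1 = site_frequencies_sketch(data, fill(1 / 3, 3), 2)

# Vector form of length q*L (assumed ordering: all states of site 1, then site 2).
f1_vec = vec(f1)
```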

BioSequenceMappings.subsample_random - Method
subsample_random(X::AbstractAlignment, m::Int)

Return an Alignment with m sequences taken randomly from X. Sampling is done without replacement, meaning the m sequences are all at different positions in X.
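
Sampling without replacement can be sketched by drawing m distinct column indices. `subsample_sketch` is a hypothetical name, and using a random permutation is one simple way to do this, not necessarily what the package does.

```julia
using Random

function subsample_sketch(data::AbstractMatrix, m::Int; rng = Random.default_rng())
    M = size(data, 2)
    m <= M || throw(ArgumentError("cannot draw $m sequences out of $M without replacement"))
    idx = randperm(rng, M)[1:m]   # m distinct column indices
    return data[:, idx], idx
end

data = reshape(collect(1:20), 2, 10)   # 10 sequences of length 2
sub, idx = subsample_sketch(data, 4)
```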

BioSequenceMappings.translate - Method
translate(x, original_alphabet::Alphabet, new_alphabet::Alphabet)

Return the translation in new_alphabet of an integer or a vector of integers x that is expressed in original_alphabet.
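
Translation amounts to going from an integer to its symbol in the original alphabet, then to that symbol's integer in the new one. The sketch below uses plain collections; `translate_sketch`, `old_chars` and `new_index` are hypothetical names, and both alphabets are assumed to be 1-based in string order.

```julia
old_chars = collect("ACGT")   # index i in the original alphabet means old_chars[i]
new_index = Dict(c => i for (i, c) in enumerate("TGCA"))   # symbol -> index in the new alphabet

# Integer -> symbol (original alphabet) -> integer (new alphabet).
translate_sketch(i::Integer) = new_index[old_chars[i]]
translate_sketch(x::AbstractVector{<:Integer}) = map(translate_sketch, x)

single = translate_sketch(1)
many = translate_sketch([1, 2, 3, 4])
```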
