Reference · BioSequenceMappings.jl

Documentation for BioSequenceMappings.

BioSequenceMappings.Alignment
BioSequenceMappings.Alignment
BioSequenceMappings.Alignment
BioSequenceMappings.Alphabet
BioSequenceMappings.compute_weights
BioSequenceMappings.compute_weights!
BioSequenceMappings.default_alphabet
BioSequenceMappings.eachsequence
BioSequenceMappings.find_sequence
BioSequenceMappings.hamming
BioSequenceMappings.match_sequences
BioSequenceMappings.named_sequences
BioSequenceMappings.pairwise_correlations
BioSequenceMappings.pairwise_frequencies
BioSequenceMappings.pairwise_hamming
BioSequenceMappings.read_fasta
BioSequenceMappings.site_specific_frequencies
BioSequenceMappings.subsample
BioSequenceMappings.subsample
BioSequenceMappings.subsample_random
BioSequenceMappings.symbols
BioSequenceMappings.translate

BioSequenceMappings.Alignment — Type

mutable struct Alignment{A,T} where {A, T<:Integer}

    data::Matrix{T}
    alphabet::Union{Nothing, Alphabet{A,T}}
    weights::Vector{Float64} = ones(size(dat,1))/size(dat,1) # phylogenetic weights of sequences
    names::Vector{String} = fill("", size(dat, 1))

Biological sequences as vectors of type T<:Integer. data stores sequences in columns: size(dat) returns a tuple (L, M) with L the length and M the number of sequences. When displayed, shows data as an MxL matrix to match with traditional alignments.

alphabet{A,T} represents the mapping between integers in data and biological symbols of type A (nucleotides, amino acids...). If nothing, the alignment cannot be mapped to biological sequences.

weights represent phylogenetic weights, and are initialized to 1/M. They must sum to 1. names are the label of sequences, and are expected to be in the same order as the columns of data. They do not have to be unique, and can be ignored

Important: When built from a matrix, assumes that the sequences are stored in columns.

Methods

getindex(X::Alignment, i) returns a matrix/vector X.data[:, i].
for s in X::Alignment iterates over sequences.
eachsequence(X::Alignment) returns an iterator over sequences (Vector{Int}).
eachsequence_weighted(X::Alignment) returns an iterator over sequences and weights as tuples
subaln(X::Alignment, idx) constructs the subaln defined by index idx.

BioSequenceMappings.Alignment — Method

Alignment(data::AbstractMatrix{T}; alphabet = :auto, kwargs...)

Keyword argument alphabet can be :auto, :none/nothing, or an input to the constructor Alphabet. Other keyword arguments are passed to the default constructor of Alignment.

BioSequenceMappings.Alignment — Method

Alignment(data::AbstractMatrix, alphabet; kwargs...)

data is a matrix of integers, with sequences stored in columns. alphabet can be either

an Alphabet
nothing: no conversion from integers to biological symbols.
something to build an alphabet from (e.g. a symbol like :aa, a string, ...). The constructor Alphabet will be called like so: Alphabet(alphabet).

If the types of alphabet and data mismatch, data is converted.

data can also have the following shape:

vector of integer vectors, e.g. [[1,2], [3,4]]: each element is considered as a sequence
vector of integers: single sequence alignment

BioSequenceMappings.Alphabet — Type

struct Alphabet{A,I}
    characters::Vector{A}
    char_to_index::Dict{A, I}
    index_to_char::Dict{I, A}
    default_char = nothing
    default_index
end

Structure allowing the mapping from biological symbols of type A to integers of type I. The typical use case would be Alphabet{Char, Int}. Alphabet can be constructed

from a Vector of symbols and an optional type I, e.g. Alphabet(['A','C','G','T'], UInt8)::Alphabet{Char, UInt8}
from a String and an optional type, e.g. Alphabet("ACGT")
from a mapping Dict{A, I} where I<:Integer: Alphabet(Dict('A'=>1, 'C'=>2))
from a Symbol, using default alphabets, e.g. Alphabet(:nt)
from an integer, using default alphabets (see ?default_alphabets).

BioSequenceMappings.compute_weights — Function

compute_weights(X::AbstractAlignment, θ = 0.2; normalize = true)

Compute phylogenetic correction weights for sequences of X. The weight sequence S is 1/N, where N is the number of sequences in X at hamming distance less than H from S (including S itself). The threshold H is floor(θ⋅L) where L is the sequence length.

The return value is a tuple (weights, Meff), where Meff is the sum of weights (pre-normalization). If normalize, weights are normalized to sum to one. .

BioSequenceMappings.compute_weights! — Function

compute_weights!(X, θ; kwargs...)

Compute and set weights for X. See compute_weights.

BioSequenceMappings.default_alphabet — Method

default_alphabet(q::Int, T::Type)

if q==2, binary (0, 1)
if 3 <= q <= 4, nucleotides without gaps
if q==5, nucleotides
if 5 < q <= 21, amino acids
if q>21, fails

BioSequenceMappings.eachsequence — Method

eachsequence(X::AbstractAlignment[, indices]; skip)

Return an iterator over the sequences in X. If indices is specified, consider only sequences at the corresponding indices. Use the integer argument skip to return only one sequence every skip (~ 1:skip:end).

BioSequenceMappings.find_sequence — Method

find_sequence(label::AbstractString, aln::AbstractAlignment)

Find sequence with name label in aln, and return (index, sequence). Scales as the number of sequences. Return the first sequence that matches the label.

!!! Return a view of the sequence.

BioSequenceMappings.hamming — Method

hamming(x, y; normalize=true, positions=nothing)

Hamming distance between Vectors x and y. Only sites in vector positions will be considered.

BioSequenceMappings.match_sequences — Method

match_sequences(pattern, aln::AbstractAlignment)

Find sequences whose name matches label in aln, and return (indices, sequences). Sequences are returned as columns.

!!! Return a view of the sequences.

BioSequenceMappings.named_sequences — Method

named_sequences(X::AbstractAlignment; skip)

Return an iterator of the form (name, sequence) over X.

BioSequenceMappings.pairwise_correlations — Function

pairwise_correlations(X, w=X.weights; as_mat=false)

Compute connected correlations: the difference between the pairwise frequencies and the product of the single site frequencies. See ?pairwise_frequencies for the shape of the output.

BioSequenceMappings.pairwise_frequencies — Function

pairwise_frequencies(X::AbstractAlignment, w=X.weights; as_mat=false)

Return a q x q x L x L tensor. The (a, b, i, j) element is the fraction of sequences for which we see a at position i and b at position j.

If as_mat=true, will return a qL x qL matrix, with q x q blocks representing correlations between two specific columns.

BioSequenceMappings.pairwise_hamming — Method

pairwise_hamming(X, Y; step=1, step_left, step_right, as_vec=true, kwargs...)
pairwise_hamming(X; as_vec, step, kwargs...)

Return all hamming distances between sequences of X and Y. In the second form, consider pairs of sequences in X.

Only consider sequences every step. step_left and step_right can be used to skip sequence either in X or in Y. This is useful for large alignment, as the number of computations grows with the product of the size of the alignments

By default, the return value is a vector organized like [H(1,2), H(1,3), ..., H(M-1, M)] with H standing for hamming distance and M for the number of sequences. If a matrix is prefered, use as_vec=false

Extra keyword arguments are passed to hamming.

BioSequenceMappings.read_fasta — Method

read_fasta(fastafile::AbstractString; alphabet = :auto, kwargs...)
read_fasta(
    fastafile::AbstractString, alphabet;
    weights = false, theta = 0.2, verbose = false,
)

BioSequenceMappings.site_specific_frequencies — Function

site_specific_frequencies(X::AbstractAlignment[, weights=X.weights]; as_vec=false)

Return the site specific frequencies of X. If as_vec, the result is a vector of length Lxq. Otherwise, it is a matrix of q rows and L columns (default).

BioSequenceMappings.subsample — Method

subsample(X::AbstractAlignment, labels::AbstractVector{<:AbstractString})

Return an Alignment containing only sequences whose name is in labels.

BioSequenceMappings.subsample — Method

subsample(X::AbstractAlignment, indices)

Return an Alignment containing only the sequences of X at indices.

BioSequenceMappings.subsample_random — Method

subsample_random(X::AbstractAlignment, m::Int)

Return an Alignment with m sequences taking randomly from X. Sampling is done without replacement, meaning the m sequences are all at different positions in X.

BioSequenceMappings.symbols — Method

symbols(alphabet)

Return the vector of symbols/characters used by alphabet.

BioSequenceMappings.translate — Method

translate(x, original_alphabet::Alphabet, new_alphabet::Alphabet)

Return the translation in new_alphabet of an integer or a vector of integers x that is expressed in original_alphabet.