Documentation for BioSequenceMappings.
- BioSequenceMappings.Alignment
- BioSequenceMappings.Alignment
- BioSequenceMappings.Alignment
- BioSequenceMappings.Alphabet
- BioSequenceMappings.compute_weights
- BioSequenceMappings.compute_weights!
- BioSequenceMappings.default_alphabet
- BioSequenceMappings.eachsequence
- BioSequenceMappings.find_sequence
- BioSequenceMappings.hamming
- BioSequenceMappings.match_sequences
- BioSequenceMappings.named_sequences
- BioSequenceMappings.pairwise_correlations
- BioSequenceMappings.pairwise_frequencies
- BioSequenceMappings.pairwise_hamming
- BioSequenceMappings.read_fasta
- BioSequenceMappings.site_specific_frequencies
- BioSequenceMappings.subsample
- BioSequenceMappings.subsample
- BioSequenceMappings.subsample_random
- BioSequenceMappings.symbols
- BioSequenceMappings.translate
BioSequenceMappings.Alignment — Type
mutable struct Alignment{A,T} where {A, T<:Integer}
    data::Matrix{T}
    alphabet::Union{Nothing, Alphabet{A,T}}
    weights::Vector{Float64} = ones(size(dat,2))/size(dat,2) # phylogenetic weights of sequences
    names::Vector{String} = fill("", size(dat, 2))
Biological sequences as vectors of type T<:Integer. data stores sequences in columns: size(dat) returns a tuple (L, M) with L the length and M the number of sequences. When displayed, data is shown as an M x L matrix to match traditional alignments.
alphabet{A,T} represents the mapping between integers in data and biological symbols of type A (nucleotides, amino acids...). If nothing, the alignment cannot be mapped to biological sequences.
weights represent phylogenetic weights, and are initialized to 1/M. They must sum to 1.
names are the labels of the sequences, and are expected to be in the same order as the columns of data. They do not have to be unique, and can be ignored.
Important: When built from a matrix, assumes that the sequences are stored in columns.
Methods
- getindex(X::Alignment, i) returns a matrix/vector X.data[:, i].
- for s in X::Alignment iterates over sequences.
- eachsequence(X::Alignment) returns an iterator over sequences (Vector{Int}).
- eachsequence_weighted(X::Alignment) returns an iterator over sequences and weights as tuples.
- subaln(X::Alignment, idx) constructs the subalignment defined by index idx.
BioSequenceMappings.Alignment — Method
Alignment(data::AbstractMatrix{T}; alphabet = :auto, kwargs...)
Keyword argument alphabet can be :auto, :none/nothing, or an input to the constructor Alphabet. Other keyword arguments are passed to the default constructor of Alignment.
BioSequenceMappings.Alignment — Method
Alignment(data::AbstractMatrix, alphabet; kwargs...)
data is a matrix of integers, with sequences stored in columns. alphabet can be either
- an Alphabet
- nothing: no conversion from integers to biological symbols.
- something to build an alphabet from (e.g. a symbol like :aa, a string, ...). The constructor Alphabet will be called like so: Alphabet(alphabet).
If the types of alphabet and data mismatch, data is converted.
data can also have the following shape:
- vector of integer vectors, e.g. [[1,2], [3,4]]: each element is considered as a sequence
- vector of integers: single sequence alignment
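The accepted shapes can be related to the column layout with a small plain-Julia sketch. It uses only Base functions and does not call the package; how the constructor performs the conversion internally is an assumption, and the variable names are illustrative.

```julia
# Vector of integer vectors: each element becomes one column (one sequence).
seqs = [[1, 2], [3, 4]]
data = reduce(hcat, seqs)            # 2×2 matrix, sequences stored in columns

# Vector of integers: a single sequence, i.e. an L×1 matrix.
single = reshape([1, 2, 3], :, 1)

println(size(data))    # (L, M) = (2, 2)
println(size(single))  # (L, M) = (3, 1)
```

In both cases the first dimension is the sequence length L and the second is the number of sequences M.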
BioSequenceMappings.Alphabet — Type
struct Alphabet{A,I}
characters::Vector{A}
char_to_index::Dict{A, I}
index_to_char::Dict{I, A}
default_char = nothing
default_index
end
Structure allowing the mapping from biological symbols of type A to integers of type I. The typical use case would be Alphabet{Char, Int}. Alphabet can be constructed
- from a Vector of symbols and an optional type I, e.g. Alphabet(['A','C','G','T'], UInt8)::Alphabet{Char, UInt8}
- from a String and an optional type, e.g. Alphabet("ACGT")
- from a mapping Dict{A, I} where I<:Integer: Alphabet(Dict('A'=>1, 'C'=>2))
- from a Symbol, using default alphabets, e.g. Alphabet(:nt)
- from an integer, using default alphabets (see ?default_alphabets).
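To make the two lookup tables concrete, here is a minimal plain-Julia sketch of a symbol-to-integer mapping and its inverse. It mimics the char_to_index and index_to_char fields with ordinary dictionaries and does not use the package; the sequence "GATTACA" and the variable names are made up for illustration.

```julia
# Symbols of a small nucleotide alphabet, indexed from 1.
chars = collect("ACGT")                                   # Vector{Char}
char_to_index = Dict(c => UInt8(i) for (i, c) in enumerate(chars))
index_to_char = Dict(i => c for (c, i) in char_to_index)

# Encode a string as integers, then decode it back to symbols.
encoded = [char_to_index[c] for c in "GATTACA"]
decoded = join(index_to_char[i] for i in encoded)

println(encoded)
println(decoded)
```

Storing both directions trades a little memory for constant-time lookups when encoding and decoding.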
BioSequenceMappings.compute_weights — Function
compute_weights(X::AbstractAlignment, θ = 0.2; normalize = true)
Compute phylogenetic correction weights for sequences of X. The weight of sequence S is 1/N, where N is the number of sequences in X at hamming distance less than H from S (including S itself). The threshold H is floor(θ⋅L) where L is the sequence length.
The return value is a tuple (weights, Meff), where Meff is the sum of weights (pre-normalization). If normalize, weights are normalized to sum to one.
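The weighting rule can be sketched in plain Julia as follows. This is a naive O(M²L) illustration of the description above, not the package's implementation; sketch_weights is a hypothetical name, and counting S itself unconditionally (even when H is 0) is an assumption.

```julia
# Sequences are stored in columns, as in Alignment.data.
function sketch_weights(data::AbstractMatrix{<:Integer}, θ = 0.2; normalize = true)
    L, M = size(data)
    H = floor(Int, θ * L)                  # distance threshold
    w = zeros(Float64, M)
    for i in 1:M
        N = 1                              # S itself is always counted (assumption)
        for j in 1:M
            j == i && continue
            # Unnormalized Hamming distance between sequences i and j.
            d = count(k -> data[k, i] != data[k, j], 1:L)
            d < H && (N += 1)
        end
        w[i] = 1 / N
    end
    Meff = sum(w)                          # computed before normalization
    normalize && (w ./= Meff)
    return w, Meff
end

data = [1 1 3;
        2 2 4;
        1 1 1;
        2 2 2;
        1 2 3]                             # L = 5, M = 3
w, Meff = sketch_weights(data, 0.5)        # H = floor(2.5) = 2
println(w, " ", Meff)
```

Here the first two sequences differ at a single site, so they share their weight, while the third sequence is far from both and keeps a full weight before normalization.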
BioSequenceMappings.compute_weights! — Function
BioSequenceMappings.default_alphabet — Method
default_alphabet(q::Int, T::Type)
- if q==2, binary (0, 1)
- if 3 <= q <= 4, nucleotides without gaps
- if q==5, nucleotides
- if 5 < q <= 21, amino acids
- if q>21, fails
BioSequenceMappings.eachsequence — Method
eachsequence(X::AbstractAlignment[, indices]; skip)
Return an iterator over the sequences in X. If indices is specified, consider only sequences at the corresponding indices. Use the integer argument skip to return only one sequence every skip (~ 1:skip:end).
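The effect of skip can be pictured with a plain-Julia generator over matrix columns. This is only an illustration of the 1:skip:end behaviour described above; sketch_eachsequence is a hypothetical helper, not the package function.

```julia
# Iterate over every `skip`-th column of a matrix, returning views (no copies).
sketch_eachsequence(data; skip = 1) =
    (view(data, :, i) for i in 1:skip:size(data, 2))

data = [1 3 5 7;
        2 4 6 8]                      # L = 2, M = 4
seqs = collect(sketch_eachsequence(data; skip = 2))
println(seqs)                         # columns 1 and 3
```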
BioSequenceMappings.find_sequence — Method
find_sequence(label::AbstractString, aln::AbstractAlignment)
Find sequence with name label in aln, and return (index, sequence). Scales as the number of sequences. Return the first sequence that matches the label.
Note: returns a view of the sequence.
BioSequenceMappings.hamming — Method
hamming(x, y; normalize=true, positions=nothing)
Hamming distance between Vectors x and y. Only sites in vector positions will be considered.
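A minimal plain-Julia sketch of such a distance is shown below. The name sketch_hamming is hypothetical, and two points are assumptions rather than documented behaviour: that normalize divides by the number of compared sites, and that positions=nothing means all sites.

```julia
function sketch_hamming(x::AbstractVector, y::AbstractVector;
                        normalize = true, positions = nothing)
    length(x) == length(y) || throw(ArgumentError("sequences must have equal length"))
    # Compare all sites unless a subset of positions is given.
    sites = positions === nothing ? eachindex(x) : positions
    d = count(i -> x[i] != y[i], sites)
    # Assumption: normalizing divides by the number of compared sites.
    return normalize ? d / length(sites) : d
end

x = [1, 2, 3, 4]
y = [1, 2, 4, 3]
println(sketch_hamming(x, y))                       # fraction of differing sites
println(sketch_hamming(x, y; normalize = false))    # raw count
println(sketch_hamming(x, y; positions = [1, 2]))   # restricted to two sites
```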
BioSequenceMappings.match_sequences — Method
match_sequences(pattern, aln::AbstractAlignment)
Find sequences whose name matches pattern in aln, and return (indices, sequences). Sequences are returned as columns.
Note: returns a view of the sequences.
BioSequenceMappings.named_sequences — Method
named_sequences(X::AbstractAlignment; skip)
Return an iterator of the form (name, sequence) over X.
BioSequenceMappings.pairwise_correlations — Function
pairwise_correlations(X, w=X.weights; as_mat=false)
Compute connected correlations: the difference between the pairwise frequencies and the product of the single site frequencies. See ?pairwise_frequencies for the shape of the output.
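Written as a formula, the definition above reads as follows. The symbols are notation chosen here, not taken from the package: f_{ij}(a,b) is the pairwise frequency of state a at position i and state b at position j, and f_i(a) is the single site frequency of state a at position i.

```latex
C_{ij}(a, b) = f_{ij}(a, b) - f_i(a)\, f_j(b)
```

With this definition, C_{ij}(a, b) vanishes whenever positions i and j are statistically independent.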
BioSequenceMappings.pairwise_frequencies — Function
pairwise_frequencies(X::AbstractAlignment, w=X.weights; as_mat=false)
Return a q x q x L x L tensor. The (a, b, i, j) element is the fraction of sequences for which we see a at position i and b at position j.
If as_mat=true, will return a qL x qL matrix, with q x q blocks representing correlations between two specific columns.
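The tensor layout and its matrix form can be illustrated with a short plain-Julia sketch. It is a direct, unoptimized reading of the description and not the package's code; sketch_pairwise_frequencies is a hypothetical name, the use of weights as per-sequence fractions is an assumption, and so is the block ordering chosen for the qL x qL matrix.

```julia
function sketch_pairwise_frequencies(data::AbstractMatrix{<:Integer}, q::Integer,
                                     w::AbstractVector{<:Real})
    L, M = size(data)
    f2 = zeros(Float64, q, q, L, L)
    # Each sequence m adds its weight to entry (a, b, i, j) = (x_i, x_j, i, j).
    for m in 1:M, j in 1:L, i in 1:L
        f2[data[i, m], data[j, m], i, j] += w[m]
    end
    return f2
end

data = [1 1 2;
        2 1 2]                       # L = 2 positions, M = 3 sequences, states 1:2
w = fill(1 / 3, 3)                   # uniform weights summing to 1
f2 = sketch_pairwise_frequencies(data, 2, w)

# Matrix form: row (i-1)*q + a, column (j-1)*q + b (ordering is an assumption).
f2_mat = reshape(permutedims(f2, (1, 3, 2, 4)), 4, 4)
println(f2[:, :, 1, 2])
```

Each q x q slice f2[:, :, i, j] sums to one when the weights do, which is a quick sanity check on the output.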
BioSequenceMappings.pairwise_hamming — Method
pairwise_hamming(X, Y; step=1, step_left, step_right, as_vec=true, kwargs...)
pairwise_hamming(X; as_vec, step, kwargs...)
Return all hamming distances between sequences of X and Y. In the second form, consider pairs of sequences in X.
Only consider sequences every step. step_left and step_right can be used to skip sequences either in X or in Y. This is useful for large alignments, as the number of computations grows with the product of the sizes of the alignments.
By default, the return value is a vector organized like [H(1,2), H(1,3), ..., H(M-1, M)] with H standing for hamming distance and M for the number of sequences. If a matrix is preferred, use as_vec=false.
Extra keyword arguments are passed to hamming.
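The ordering of the returned vector can be made explicit with a plain-Julia sketch of the single-alignment form. The helper names are hypothetical, and the sketch returns raw mismatch counts, whereas the package forwards keyword arguments such as normalize to hamming.

```julia
# Raw number of differing sites between two equal-length sequences.
hamming_count(x, y) = count(i -> x[i] != y[i], eachindex(x))

function sketch_pairwise_hamming(data::AbstractMatrix{<:Integer}; step = 1)
    idx = collect(1:step:size(data, 2))     # keep one sequence every `step`
    out = Int[]
    # Pairs are visited as (1,2), (1,3), ..., (M-1,M).
    for a in 1:length(idx), b in (a + 1):length(idx)
        push!(out, hamming_count(view(data, :, idx[a]), view(data, :, idx[b])))
    end
    return out
end

data = [1 1 2;
        2 1 2;
        3 3 1]                               # three sequences of length 3
println(sketch_pairwise_hamming(data))       # [H(1,2), H(1,3), H(2,3)]
println(sketch_pairwise_hamming(data; step = 2))
```

For M sequences the vector has M(M-1)/2 entries, which is why subsampling with step quickly reduces the cost on large alignments.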
BioSequenceMappings.read_fasta — Method
read_fasta(fastafile::AbstractString; alphabet = :auto, kwargs...)
read_fasta(
    fastafile::AbstractString, alphabet;
    weights = false, theta = 0.2, verbose = false,
)
BioSequenceMappings.site_specific_frequencies — Function
site_specific_frequencies(X::AbstractAlignment[, weights=X.weights]; as_vec=false)
Return the site specific frequencies of X. If as_vec, the result is a vector of length Lxq. Otherwise, it is a matrix of q rows and L columns (default).
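As an illustration of the two output shapes, here is a plain-Julia sketch that accumulates weighted counts per site. It is not the package's implementation: sketch_site_frequencies is a hypothetical name, and the element order of the flattened vector (column-major, site by site) is an assumption.

```julia
function sketch_site_frequencies(data::AbstractMatrix{<:Integer}, q::Integer,
                                 w::AbstractVector{<:Real}; as_vec = false)
    L, M = size(data)
    f1 = zeros(Float64, q, L)               # q rows (states), L columns (sites)
    for m in 1:M, i in 1:L
        f1[data[i, m], i] += w[m]           # sequence m contributes its weight
    end
    return as_vec ? vec(f1) : f1
end

data = [1 1 2;
        2 1 2]                              # L = 2, M = 3, states 1:2
w = fill(1 / 3, 3)
f1 = sketch_site_frequencies(data, 2, w)
f1_vec = sketch_site_frequencies(data, 2, w; as_vec = true)
println(f1)
```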
BioSequenceMappings.subsample — Method
subsample(X::AbstractAlignment, labels::AbstractVector{<:AbstractString})
Return an Alignment containing only sequences whose name is in labels.
BioSequenceMappings.subsample — Method
subsample(X::AbstractAlignment, indices)
Return an Alignment containing only the sequences of X at indices.
BioSequenceMappings.subsample_random — Method
subsample_random(X::AbstractAlignment, m::Int)
Return an Alignment with m sequences taken at random from X. Sampling is done without replacement, meaning the m sequences are all at different positions in X.
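Sampling without replacement can be sketched on a plain matrix with the Random standard library. The helper sketch_random_columns is hypothetical and operates on raw data rather than an Alignment; using a random permutation is one possible way to draw distinct indices, not necessarily the one used by the package.

```julia
using Random

function sketch_random_columns(data::AbstractMatrix, m::Integer;
                               rng = Random.default_rng())
    M = size(data, 2)
    m <= M || throw(ArgumentError("cannot draw $m sequences from $M"))
    idx = randperm(rng, M)[1:m]        # m distinct column indices
    return data[:, idx], idx
end

data = [1 3 5;
        2 4 6]                         # L = 2, M = 3
sub, idx = sketch_random_columns(data, 2)
println(size(sub))                     # (2, 2)
```

Because the indices come from a permutation, no column can be selected twice.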
BioSequenceMappings.symbols — Method
BioSequenceMappings.translate — Method
translate(x, original_alphabet::Alphabet, new_alphabet::Alphabet)
Return the translation in new_alphabet of an integer or a vector of integers x that is expressed in original_alphabet.
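Conceptually, translating goes from an integer to its symbol in the original alphabet, then from that symbol to its integer in the new alphabet. The plain-Julia sketch below illustrates this with ordinary vectors and dictionaries; sketch_translate and the two example alphabets are hypothetical, and how the package treats symbols missing from the new alphabet is not covered here.

```julia
old_chars = collect("ACGT")                   # original alphabet: A=1, C=2, G=3, T=4
new_chars = collect("-ACGT")                  # new alphabet with a gap: -=1, A=2, ...
new_index = Dict(c => i for (i, c) in enumerate(new_chars))

# Integer in the old alphabet -> symbol -> integer in the new alphabet.
sketch_translate(x::Integer, old_chars, new_index) = new_index[old_chars[x]]
sketch_translate(x::AbstractVector{<:Integer}, old_chars, new_index) =
    [sketch_translate(xi, old_chars, new_index) for xi in x]

println(sketch_translate(3, old_chars, new_index))
println(sketch_translate([1, 4], old_chars, new_index))
```

In this sketch, a symbol absent from the new alphabet raises a KeyError.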