Documentation for BioSequenceMappings.
BioSequenceMappings.AlignmentBioSequenceMappings.AlignmentBioSequenceMappings.AlignmentBioSequenceMappings.AlphabetBioSequenceMappings.compute_weightsBioSequenceMappings.compute_weights!BioSequenceMappings.default_alphabetBioSequenceMappings.eachsequenceBioSequenceMappings.find_sequenceBioSequenceMappings.hammingBioSequenceMappings.match_sequencesBioSequenceMappings.named_sequencesBioSequenceMappings.pairwise_correlationsBioSequenceMappings.pairwise_frequenciesBioSequenceMappings.pairwise_hammingBioSequenceMappings.read_fastaBioSequenceMappings.site_specific_frequenciesBioSequenceMappings.subsampleBioSequenceMappings.subsampleBioSequenceMappings.subsample_randomBioSequenceMappings.symbolsBioSequenceMappings.translate
BioSequenceMappings.Alignment — Typemutable struct Alignment{A,T} where {A, T<:Integer} data::Matrix{T}
alphabet::Union{Nothing, Alphabet{A,T}}
weights::Vector{Float64} = ones(size(dat,1))/size(dat,1) # phylogenetic weights of sequences
names::Vector{String} = fill("", size(dat, 1))Biological sequences as vectors of type T<:Integer. data stores sequences in columns: size(dat) returns a tuple (L, M) with L the length and M the number of sequences. When displayed, shows data as an MxL matrix to match with traditional alignments.
alphabet{A,T} represents the mapping between integers in data and biological symbols of type A (nucleotides, amino acids...). If nothing, the alignment cannot be mapped to biological sequences.
weights represent phylogenetic weights, and are initialized to 1/M. They must sum to 1. names are the label of sequences, and are expected to be in the same order as the columns of data. They do not have to be unique, and can be ignored
Important: When built from a matrix, assumes that the sequences are stored in columns.
Methods
getindex(X::Alignment, i)returns a matrix/vectorX.data[:, i].for s in X::Alignmentiterates over sequences.eachsequence(X::Alignment)returns an iterator over sequences (Vector{Int}).eachsequence_weighted(X::Alignment)returns an iterator over sequences and weights as tuplessubaln(X::Alignment, idx)constructs the subaln defined by indexidx.
BioSequenceMappings.Alignment — MethodAlignment(data::AbstractMatrix{T}; alphabet = :auto, kwargs...)Keyword argument alphabet can be :auto, :none/nothing, or an input to the constructor Alphabet. Other keyword arguments are passed to the default constructor of Alignment.
BioSequenceMappings.Alignment — MethodAlignment(data::AbstractMatrix, alphabet; kwargs...)data is a matrix of integers, with sequences stored in columns. alphabet can be either
- an
Alphabet nothing: no conversion from integers to biological symbols.- something to build an alphabet from (e.g. a symbol like
:aa, a string, ...). The constructorAlphabetwill be called like so:Alphabet(alphabet).
If the types of alphabet and data mismatch, data is converted.
data can also have the following shape:
- vector of integer vectors, e.g. [[1,2], [3,4]]: each element is considered as a sequence
- vector of integers: single sequence alignment
BioSequenceMappings.Alphabet — Typestruct Alphabet{A,I}
characters::Vector{A}
char_to_index::Dict{A, I}
index_to_char::Dict{I, A}
default_char = nothing
default_index
endStructure allowing the mapping from biological symbols of type A to integers of type I. The typical use case would be Alphabet{Char, Int}. Alphabet can be constructed
- from a
Vectorof symbols and an optional typeI, e.g.Alphabet(['A','C','G','T'], UInt8)::Alphabet{Char, UInt8} - from a
Stringand an optional type, e.g.Alphabet("ACGT") - from a mapping
Dict{A, I}whereI<:Integer:Alphabet(Dict('A'=>1, 'C'=>2)) - from a
Symbol, using default alphabets, e.g.Alphabet(:nt) - from an integer, using default alphabets (see
?default_alphabets).
BioSequenceMappings.compute_weights — Functioncompute_weights(X::AbstractAlignment, θ = 0.2; normalize = true)Compute phylogenetic correction weights for sequences of X. The weight sequence S is 1/N, where N is the number of sequences in X at hamming distance less than H from S (including S itself). The threshold H is floor(θ⋅L) where L is the sequence length.
The return value is a tuple (weights, Meff), where Meff is the sum of weights (pre-normalization). If normalize, weights are normalized to sum to one. .
BioSequenceMappings.compute_weights! — Functioncompute_weights!(X, θ; kwargs...)Compute and set weights for X. See compute_weights.
BioSequenceMappings.default_alphabet — Methoddefault_alphabet(q::Int, T::Type)- if
q==2, binary (0, 1) - if
3 <= q <= 4, nucleotides without gaps - if
q==5, nucleotides - if
5 < q <= 21, amino acids - if
q>21, fails
BioSequenceMappings.eachsequence — Methodeachsequence(X::AbstractAlignment[, indices]; skip)Return an iterator over the sequences in X. If indices is specified, consider only sequences at the corresponding indices. Use the integer argument skip to return only one sequence every skip (~ 1:skip:end).
BioSequenceMappings.find_sequence — Methodfind_sequence(label::AbstractString, aln::AbstractAlignment)Find sequence with name label in aln, and return (index, sequence). Scales as the number of sequences. Return the first sequence that matches the label.
!!! Return a view of the sequence.
BioSequenceMappings.hamming — Methodhamming(x, y; normalize=true, positions=nothing)Hamming distance between Vectors x and y. Only sites in vector positions will be considered.
BioSequenceMappings.match_sequences — Methodmatch_sequences(pattern, aln::AbstractAlignment)Find sequences whose name matches label in aln, and return (indices, sequences). Sequences are returned as columns.
!!! Return a view of the sequences.
BioSequenceMappings.named_sequences — Methodnamed_sequences(X::AbstractAlignment; skip)Return an iterator of the form (name, sequence) over X.
BioSequenceMappings.pairwise_correlations — Functionpairwise_correlations(X, w=X.weights; as_mat=false)Compute connected correlations: the difference between the pairwise frequencies and the product of the single site frequencies. See ?pairwise_frequencies for the shape of the output.
BioSequenceMappings.pairwise_frequencies — Functionpairwise_frequencies(X::AbstractAlignment, w=X.weights; as_mat=false)Return a q x q x L x L tensor. The (a, b, i, j) element is the fraction of sequences for which we see a at position i and b at position j.
If as_mat=true, will return a qL x qL matrix, with q x q blocks representing correlations between two specific columns.
BioSequenceMappings.pairwise_hamming — Methodpairwise_hamming(X, Y; step=1, step_left, step_right, as_vec=true, kwargs...)
pairwise_hamming(X; as_vec, step, kwargs...)Return all hamming distances between sequences of X and Y. In the second form, consider pairs of sequences in X.
Only consider sequences every step. step_left and step_right can be used to skip sequence either in X or in Y. This is useful for large alignment, as the number of computations grows with the product of the size of the alignments
By default, the return value is a vector organized like [H(1,2), H(1,3), ..., H(M-1, M)] with H standing for hamming distance and M for the number of sequences. If a matrix is prefered, use as_vec=false
Extra keyword arguments are passed to hamming.
BioSequenceMappings.read_fasta — Methodread_fasta(fastafile::AbstractString; alphabet = :auto, kwargs...)
read_fasta(
fastafile::AbstractString, alphabet;
weights = false, theta = 0.2, verbose = false,
)BioSequenceMappings.site_specific_frequencies — Functionsite_specific_frequencies(X::AbstractAlignment[, weights=X.weights]; as_vec=false)Return the site specific frequencies of X. If as_vec, the result is a vector of length Lxq. Otherwise, it is a matrix of q rows and L columns (default).
BioSequenceMappings.subsample — Methodsubsample(X::AbstractAlignment, labels::AbstractVector{<:AbstractString})Return an Alignment containing only sequences whose name is in labels.
BioSequenceMappings.subsample — Methodsubsample(X::AbstractAlignment, indices)Return an Alignment containing only the sequences of X at indices.
BioSequenceMappings.subsample_random — Methodsubsample_random(X::AbstractAlignment, m::Int)Return an Alignment with m sequences taking randomly from X. Sampling is done without replacement, meaning the m sequences are all at different positions in X.
BioSequenceMappings.symbols — Methodsymbols(alphabet)Return the vector of symbols/characters used by alphabet.
BioSequenceMappings.translate — Methodtranslate(x, original_alphabet::Alphabet, new_alphabet::Alphabet)Return the translation in new_alphabet of an integer or a vector of integers x that is expressed in original_alphabet.