Documentation for BioSequenceMappings.
BioSequenceMappings.Alignment
BioSequenceMappings.Alignment
BioSequenceMappings.Alignment
BioSequenceMappings.Alphabet
BioSequenceMappings.compute_weights
BioSequenceMappings.compute_weights!
BioSequenceMappings.default_alphabet
BioSequenceMappings.eachsequence
BioSequenceMappings.find_sequence
BioSequenceMappings.hamming
BioSequenceMappings.match_sequences
BioSequenceMappings.named_sequences
BioSequenceMappings.pairwise_correlations
BioSequenceMappings.pairwise_frequencies
BioSequenceMappings.pairwise_hamming
BioSequenceMappings.read_fasta
BioSequenceMappings.site_specific_frequencies
BioSequenceMappings.subsample
BioSequenceMappings.subsample
BioSequenceMappings.subsample_random
BioSequenceMappings.symbols
BioSequenceMappings.translate
BioSequenceMappings.Alignment
— Typemutable struct Alignment{A,T} where {A, T<:Integer}
data::Matrix{T}
alphabet::Union{Nothing, Alphabet{A,T}}
weights::Vector{Float64} = ones(size(dat,1))/size(dat,1) # phylogenetic weights of sequences
names::Vector{String} = fill("", size(dat, 1))
Biological sequences as vectors of type T<:Integer
. data
stores sequences in columns: size(dat)
returns a tuple (L, M)
with L
the length and M
the number of sequences. When displayed, shows data
as an MxL
matrix to match with traditional alignments.
alphabet{A,T}
represents the mapping between integers in data
and biological symbols of type A
(nucleotides, amino acids...). If nothing
, the alignment cannot be mapped to biological sequences.
weights
represent phylogenetic weights, and are initialized to 1/M
. They must sum to 1. names
are the label of sequences, and are expected to be in the same order as the columns of data
. They do not have to be unique, and can be ignored
Important: When built from a matrix, assumes that the sequences are stored in columns.
Methods
getindex(X::Alignment, i)
returns a matrix/vectorX.data[:, i]
.for s in X::Alignment
iterates over sequences.eachsequence(X::Alignment)
returns an iterator over sequences (Vector{Int}
).eachsequence_weighted(X::Alignment)
returns an iterator over sequences and weights as tuplessubaln(X::Alignment, idx)
constructs the subaln defined by indexidx
.
BioSequenceMappings.Alignment
— MethodAlignment(data::AbstractMatrix{T}; alphabet = :auto, kwargs...)
Keyword argument alphabet
can be :auto
, :none
/nothing
, or an input to the constructor Alphabet
. Other keyword arguments are passed to the default constructor of Alignment
.
BioSequenceMappings.Alignment
— MethodAlignment(data::AbstractMatrix, alphabet; kwargs...)
data
is a matrix of integers, with sequences stored in columns. alphabet
can be either
- an
Alphabet
nothing
: no conversion from integers to biological symbols.- something to build an alphabet from (e.g. a symbol like
:aa
, a string, ...). The constructorAlphabet
will be called like so:Alphabet(alphabet)
.
If the types of alphabet
and data
mismatch, data
is converted.
data
can also have the following shape:
- vector of integer vectors, e.g. [[1,2], [3,4]]: each element is considered as a sequence
- vector of integers: single sequence alignment
BioSequenceMappings.Alphabet
— Typestruct Alphabet{A,I}
characters::Vector{A}
char_to_index::Dict{A, I}
index_to_char::Dict{I, A}
default_char = nothing
default_index
end
Structure allowing the mapping from biological symbols of type A
to integers of type I
. The typical use case would be Alphabet{Char, Int}
. Alphabet
can be constructed
- from a
Vector
of symbols and an optional typeI
, e.g.Alphabet(['A','C','G','T'], UInt8)::Alphabet{Char, UInt8}
- from a
String
and an optional type, e.g.Alphabet("ACGT")
- from a mapping
Dict{A, I}
whereI<:Integer
:Alphabet(Dict('A'=>1, 'C'=>2))
- from a
Symbol
, using default alphabets, e.g.Alphabet(:nt)
- from an integer, using default alphabets (see
?default_alphabets
).
BioSequenceMappings.compute_weights
— Functioncompute_weights(X::AbstractAlignment, θ = 0.2; normalize = true)
Compute phylogenetic correction weights for sequences of X
. The weight sequence S
is 1/N
, where N
is the number of sequences in X
at hamming distance less than H
from S
(including S
itself). The threshold H
is floor(θ⋅L)
where L
is the sequence length.
The return value is a tuple (weights, Meff)
, where Meff
is the sum of weights (pre-normalization). If normalize
, weights are normalized to sum to one. .
BioSequenceMappings.compute_weights!
— Functioncompute_weights!(X, θ; kwargs...)
Compute and set weights for X
. See compute_weights
.
BioSequenceMappings.default_alphabet
— Methoddefault_alphabet(q::Int, T::Type)
- if
q==2
, binary (0, 1) - if
3 <= q <= 4
, nucleotides without gaps - if
q==5
, nucleotides - if
5 < q <= 21
, amino acids - if
q>21
, fails
BioSequenceMappings.eachsequence
— Methodeachsequence(X::AbstractAlignment[, indices]; skip)
Return an iterator over the sequences in X
. If indices
is specified, consider only sequences at the corresponding indices. Use the integer argument skip
to return only one sequence every skip
(~ 1:skip:end
).
BioSequenceMappings.find_sequence
— Methodfind_sequence(label::AbstractString, aln::AbstractAlignment)
Find sequence with name label
in aln
, and return (index, sequence)
. Scales as the number of sequences. Return the first sequence that matches the label.
!!! Return a view of the sequence.
BioSequenceMappings.hamming
— Methodhamming(x, y; normalize=true, positions=nothing)
Hamming distance between Vectors
x
and y
. Only sites in vector positions
will be considered.
BioSequenceMappings.match_sequences
— Methodmatch_sequences(pattern, aln::AbstractAlignment)
Find sequences whose name matches label
in aln
, and return (indices, sequences)
. Sequences are returned as columns.
!!! Return a view of the sequences.
BioSequenceMappings.named_sequences
— Methodnamed_sequences(X::AbstractAlignment; skip)
Return an iterator of the form (name, sequence)
over X
.
BioSequenceMappings.pairwise_correlations
— Functionpairwise_correlations(X, w=X.weights; as_mat=false)
Compute connected correlations: the difference between the pairwise frequencies and the product of the single site frequencies. See ?pairwise_frequencies
for the shape of the output.
BioSequenceMappings.pairwise_frequencies
— Functionpairwise_frequencies(X::AbstractAlignment, w=X.weights; as_mat=false)
Return a q x q x L x L
tensor. The (a, b, i, j)
element is the fraction of sequences for which we see a
at position i
and b
at position j
.
If as_mat=true
, will return a qL x qL
matrix, with q x q
blocks representing correlations between two specific columns.
BioSequenceMappings.pairwise_hamming
— Methodpairwise_hamming(X, Y; step=1, step_left, step_right, as_vec=true, kwargs...)
pairwise_hamming(X; as_vec, step, kwargs...)
Return all hamming distances between sequences of X
and Y
. In the second form, consider pairs of sequences in X
.
Only consider sequences every step
. step_left
and step_right
can be used to skip sequence either in X
or in Y
. This is useful for large alignment, as the number of computations grows with the product of the size of the alignments
By default, the return value is a vector organized like [H(1,2), H(1,3), ..., H(M-1, M)]
with H
standing for hamming distance and M
for the number of sequences. If a matrix is prefered, use as_vec=false
Extra keyword arguments are passed to hamming
.
BioSequenceMappings.read_fasta
— Methodread_fasta(fastafile::AbstractString; alphabet = :auto, kwargs...)
read_fasta(
fastafile::AbstractString, alphabet;
weights = false, theta = 0.2, verbose = false,
)
BioSequenceMappings.site_specific_frequencies
— Functionsite_specific_frequencies(X::AbstractAlignment[, weights=X.weights]; as_vec=false)
Return the site specific frequencies of X
. If as_vec
, the result is a vector of length Lxq
. Otherwise, it is a matrix of q
rows and L
columns (default).
BioSequenceMappings.subsample
— Methodsubsample(X::AbstractAlignment, labels::AbstractVector{<:AbstractString})
Return an Alignment
containing only sequences whose name is in labels
.
BioSequenceMappings.subsample
— Methodsubsample(X::AbstractAlignment, indices)
Return an Alignment
containing only the sequences of X
at indices
.
BioSequenceMappings.subsample_random
— Methodsubsample_random(X::AbstractAlignment, m::Int)
Return an Alignment
with m
sequences taking randomly from X
. Sampling is done without replacement, meaning the m
sequences are all at different positions in X
.
BioSequenceMappings.symbols
— Methodsymbols(alphabet)
Return the vector of symbols/characters used by alphabet
.
BioSequenceMappings.translate
— Methodtranslate(x, original_alphabet::Alphabet, new_alphabet::Alphabet)
Return the translation in new_alphabet
of an integer or a vector of integers x
that is expressed in original_alphabet
.