The Alignment type

An Alignment essentially contains a set of aligned biological sequences mapped to integers. It is a subtype of the more general AbstractAlignment. The precise type is of the form Alignment{A,T} where A is the type of biological symbols (see The Alphabet type) and T the type of integers. Its fields are simple and self-explanatory:

  • data::Matrix{T}: the mapped sequences. Each column of the matrix is one sequence.
  • alphabet::Alphabet{A,T}: the alphabet defining the mapping from symbols to integers.
  • weights::Vector{Float64}: weights associated with each sequence. DCA-like methods assign phylogenetic weights to sequences when infering models.
  • names::Vector{String}: names of the sequences, corresponding to the description string if the alignment was built from a FASTA file.

The dimensions of an alignment can be obtained using size: length $\cross$ number of sequences

Reading & writing

For now, the only file format that this package interacts with is FASTA. Reading an alignment is simple:


julia> A = read_fasta("../../example/PF00014.fasta")Alignment of M=100 sequences of length L=53 - Shown as `MxL` matrix 100×53 adjoint(::Matrix{Int64}) with eltype Int64: 1 3 10 10 14 14 17 5 7 19 … 15 15 4 3 9 17 18 3 1 1 3 15 15 14 11 8 8 7 13 6 5 4 1 1 1 1 1 1 19 3 8 11 14 16 5 5 7 5 16 10 15 3 16 10 9 3 2 18 3 17 11 17 14 17 14 7 18 16 2 5 3 5 13 21 3 1 19 3 2 6 14 12 5 10 7 14 10 5 10 3 5 10 6 3 10 11 3 11 15 17 10 4 9 7 18 … 5 5 2 3 5 16 2 3 1 2 3 13 11 14 9 19 16 7 14 5 10 5 3 10 5 21 3 1 19 3 17 11 14 2 19 15 7 14 16 5 5 3 5 5 17 3 1 6 3 2 11 10 2 4 4 7 14 11 5 5 3 10 5 10 3 1 1 3 11 2 5 19 4 11 16 19 21 5 5 3 10 2 17 3 15 ⋮ ⋮ ⋱ ⋮ ⋮ 9 3 12 5 10 14 5 10 7 14 17 18 5 3 17 10 18 3 1 2 3 7 11 14 10 5 18 7 14 5 17 5 3 10 5 18 3 15 1 3 13 11 14 17 19 19 7 9 10 5 17 3 5 16 15 3 1 9 3 17 6 14 14 4 20 7 17 6 5 9 3 15 10 2 3 5 2 3 5 11 14 19 13 8 7 16 … 21 5 5 3 9 7 12 3 1 1 3 10 11 16 11 4 15 7 17 5 5 5 3 10 10 18 3 1 14 3 5 15 14 12 4 2 7 14 5 2 17 3 7 21 8 3 10 1 3 21 19 7 16 13 5 7 21 15 5 17 3 5 2 16 3 1 9 3 12 11 14 10 4 18 7 18 15 5 2 3 5 2 19 3 2
julia> size(A) # length and number of sequences(53, 100)

When reading from FASTA, the choice of the alphabet is made by reading the first five sequences, and comparing the observed characters with the list of default alphabets (see The Alphabet type). If they fit one of the defaults, it will be used. Otherwise, an alphabet will be created ad hoc:


julia> A = read_fasta("../../example/strange_characters.fasta"); # warning produced because no default alphabet was found┌ Warning: Could not find a default alphabet for characters ['!', '-', '/', '0', '9', '@', 'A', 'C', 'G', 'T'] Using Alphabet{Char,Int64}: ['!', '-', '/', '0', '9', '@', 'A', 'C', 'G', 'T'] @ BioSequenceMappings ~/work/BioSequenceMappings.jl/BioSequenceMappings.jl/src/IO.jl:62
julia> A.alphabet |> symbols |> prod"!-/09@ACGT"

Writing to a FASTA file is just as easy:

julia> write("new_fasta_file.fasta", A) # or...
julia> open("new_fasta_file.fasta", "w") do io write(io, A) end

Accessing & iterating

Sequences can be accessed by indexing. Indexing using a range will return a view in the underlying data matrix.

julia> A[1] # the first sequence of the alignment53-element view(::Matrix{Int64}, :, 1) with eltype Int64:
  1
  3
 10
 10
 14
 14
 17
  5
  7
 19
  ⋮
 15
 15
  4
  3
  9
 17
 18
  3
  1
julia> size(A[1:5]) # 5 sequences of length L(53, 5)
Tip

When indexing or when iterating, the return value is a reference to the sequence and not a copy. For this reason, if you modify the sequence, the alignment itself will be modified


julia> s = A[1]'1×53 adjoint(view(::Matrix{Int64}, :, 1)) with eltype Int64: 1 3 10 10 14 14 17 5 7 19 3 … 17 15 15 4 3 9 17 18 3 1
julia> s[1] = 2121
julia> A[1]' # the first element has been modified1×53 adjoint(view(::Matrix{Int64}, :, 1)) with eltype Int64: 21 3 10 10 14 14 17 5 7 19 3 … 17 15 15 4 3 9 17 18 3 1

The fact that calls like A[1:5] return a matrix-like object can be inconvenient. To iterate over sequences as vectors of integers, one can use the eachsequence function. It takes the alignment as a first argument and optionally indices, and return an iterator.

julia> for s in eachsequence(A)
       	# s: vector of integers
       end
julia> map(length, eachsequence(A, 1:5:16)) # length of sequences 1, 6, 11. 164-element Vector{Int64}: 53 53 53 53
julia> collect(eachsequence(A; skip=25)) # collect as list of vectors, taking one every 25 sequences4-element Vector{SubArray{Int64, 1, Matrix{Int64}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true}}: [1, 3, 10, 10, 14, 14, 17, 5, 7, 19 … 17, 15, 15, 4, 3, 9, 17, 18, 3, 1] [1, 3, 17, 5, 16, 6, 4, 2, 2, 14 … 18, 5, 18, 4, 3, 11, 13, 18, 3, 1] [19, 3, 11, 11, 15, 10, 4, 18, 7, 14 … 17, 10, 15, 5, 3, 5, 18, 4, 3, 1] [1, 3, 19, 7, 19, 17, 11, 14, 7, 14 … 17, 16, 5, 2, 3, 5, 4, 16, 3, 2]

If the name of the sequence is needed, the iterator named_sequences can be used instead, taking the same arguments and iterating over tuples of the form (name, sequence).

Finding sequences

The package provides find_sequence to find sequences by name in an alignment, and match_sequences to match all sequences with a particular name pattern.

julia> A = read_fasta(joinpath(example_dir, "toy_fasta_dna.fasta"));

julia> n = A.names[1] # name of the first sequence in the alignment
"sequence_1"

julia> find_sequence(n, A) # index and sequence
(1, [5, 3, 4, 2, 4, 1, 5, 1, 5, 5])

julia> indices, sequences = match_sequences(r"sequence_[0-9]", A); # using a regex 

julia> indices
5-element Vector{Int64}:
 1
 2
 3
 4
 5
Note

find_sequence searches for an exact match. It is based on findfirst, and will thus return the first match it finds. If nothing is found, then the result is nothing. On the other hand, match_sequences is based on occursin and findall. The returned sequences are references to original objects, as when indexing.

Creating subalignments

The call subsample(A, indices) will create an AbstractAlignment of the same type as A by taking only the sequences at indices. It uses the same alphabet as A and copies over the names of the sequences. Note that subsample copies the underlying data, creating a completely independent object.


julia> A = read_fasta("../../example/toy_fasta_dna.fasta")Alignment of M=5 sequences of length L=10 - Shown as `MxL` matrix 5×10 adjoint(::Matrix{Int64}) with eltype Int64: 5 3 4 2 4 1 5 1 5 5 2 4 4 1 4 2 3 2 2 3 4 3 5 3 3 5 5 3 5 1 3 3 5 2 2 1 5 3 1 2 2 3 2 5 3 2 4 3 2 4
julia> A[1][1] # first character of the first sequence5
julia> B = subsample(A, 1:2:5)Alignment of M=3 sequences of length L=10 - Shown as `MxL` matrix 3×10 adjoint(::Matrix{Int64}) with eltype Int64: 5 3 4 2 4 1 5 1 5 5 4 3 5 3 3 5 5 3 5 1 2 3 2 5 3 2 4 3 2 4
julia> B[1][1] = 55
julia> A[1][1] # A remains unchanged5

With subsample_random, it is also possible to create a random subalignment by picking sequences from the original one. For now, this is only possible without replacement, i.e. the same sequence cannot be picked twice. To just pick one sequence at random without creating a new alignment object, just call rand.

julia> subsample_random(A, 3) # new alignment using three random sequences from AAlignment of M=3 sequences of length L=10 - Shown as `MxL` matrix
3×10 adjoint(::Matrix{Int64}) with eltype Int64:
 3  3  5  2  2  1  5  3  1  2
 2  4  4  1  4  2  3  2  2  3
 5  3  4  2  4  1  5  1  5  5
julia> subsample_random(A, 12) # sampling without replacement: this will error since size(A, 1) < 12ERROR: AssertionError: Cannot take 12 different sequences from alignment of size 5
julia> rand(A) # one random sequence from A (returns a view)10-element view(::Matrix{Int64}, :, 4) with eltype Int64: 3 3 5 2 2 1 5 3 1 2

OneHotAlignment

TBA