BioSequenceMappings

The aim of the package is to facilitate the task of converting biological sequences (nucleotides, amino acids) to integers or to onehot representation. It also provides simple function to manipulate alignments or to compute simple statistics.

Note: I do not frequently use the onehot format, so for now it is not documented / not well tested. I'll develop it more when I get the time.

Installation

From the Julia REPL:

julia> using Pkg
julia> Pkg.add("BioSequenceMappings") Updating registry at `~/.julia/registries/General.toml` Resolving package versions... Installed BioSequenceMappings ─ v0.1.5 Updating `~/work/BioSequenceMappings.jl/BioSequenceMappings.jl/docs/Project.toml` [a84bc454] ~ BioSequenceMappings v0.1.6 `~/work/BioSequenceMappings.jl/BioSequenceMappings.jl` ⇒ v0.1.5 Updating `~/work/BioSequenceMappings.jl/BioSequenceMappings.jl/docs/Manifest.toml` [a84bc454] ~ BioSequenceMappings v0.1.6 `~/work/BioSequenceMappings.jl/BioSequenceMappings.jl` ⇒ v0.1.5 Precompiling project... 1980.8 ms ✓ BioSequenceMappings 1 dependency successfully precompiled in 3 seconds. 79 already precompiled. 1 dependency precompiled but a different version is currently loaded. Restart julia to access the new version. Otherwise, loading dependents of this package may trigger further precompilation to work with the unexpected version.
julia> using BioSequenceMappings

Usage

Filter sequences by Hamming distance

Load an alignment, find all sequences with a Hamming distance smaller than 66% to the first one, create a new alignment object from them and save it to a file.

julia> using BioSequenceMappings
julia> A = read_fasta("../../example/PF00014.fasta");
julia> size(A) # 100 sequences of length 53(53, 100)
julia> s0 = A[1];
julia> s0' # transpose for legibility1×53 adjoint(view(::Matrix{Int64}, :, 1)) with eltype Int64: 1 3 10 10 14 14 17 5 7 19 3 … 17 15 15 4 3 9 17 18 3 1
julia> indices = findall(s -> hamming(s, s0; normalize=true) < 0.66, A)5-element Vector{Int64}: 1 33 64 83 86
julia> B = subsample(A, indices)Alignment of M=5 sequences of length L=53 - Shown as `MxL` matrix 5×53 adjoint(::Matrix{Int64}) with eltype Int64: 1 3 10 10 14 14 17 5 7 19 … 17 15 15 4 3 9 17 18 3 1 1 3 6 11 14 11 17 10 7 6 17 11 13 5 3 16 11 2 3 1 6 3 11 5 14 14 21 18 7 16 17 5 4 4 3 12 16 18 3 1 1 3 11 11 15 19 4 5 7 14 17 21 15 5 3 15 10 18 3 1 1 3 16 5 17 10 13 5 7 14 17 10 17 5 3 11 17 18 3 1
julia> write("PF00014_subsample.fasta", B)

Remove columns with too many gaps

Load an alignment, find columns with more than 5% gaps and remove them, then create a new alignment object.


julia> A = read_fasta("../../example/PF00014.fasta"); # uses the default alphabet `Alphabet(:aa)`
julia> gap_digit = A.alphabet('-')1
julia> f1 = site_specific_frequencies(A)21×53 Matrix{Float64}: 0.41 0.05 0.04 0.03 0.02 0.01 … 0.01 0.01 0.01 0.02 0.02 0.62 0.1 0.0 0.06 0.07 0.06 0.13 0.0 0.0 0.16 0.14 0.0 0.06 0.0 0.95 0.0 0.0 0.0 0.0 0.99 0.0 0.0 0.0 0.98 0.0 0.0 0.0 0.05 0.03 0.02 0.0 0.0 0.02 0.06 0.01 0.0 0.0 0.0 0.0 0.07 0.09 0.05 0.02 0.0 0.28 0.15 0.01 0.0 0.07 0.08 0.0 0.03 0.02 0.0 0.03 … 0.0 0.0 0.0 0.02 0.0 0.0 0.0 0.0 0.01 0.02 0.01 0.0 0.0 0.01 0.01 0.0 0.0 0.0 0.0 0.0 0.03 0.02 0.0 0.01 0.0 0.0 0.01 0.03 0.0 0.0 0.13 0.0 0.02 0.01 0.0 0.06 0.0 0.05 0.02 0.06 0.0 0.0 0.0 0.0 0.05 0.02 0.03 0.15 0.0 0.12 0.11 0.08 0.0 0.11 ⋮ ⋮ ⋱ ⋮ 0.0 0.0 0.07 0.0 0.02 0.01 0.0 0.02 0.08 0.02 0.0 0.0 0.03 0.0 0.0 0.0 0.62 0.22 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.03 0.09 0.03 0.02 0.0 0.13 0.05 0.01 0.0 0.05 0.04 0.0 0.07 0.0 0.03 0.07 … 0.0 0.06 0.13 0.05 0.0 0.08 0.0 0.0 0.16 0.02 0.06 0.05 0.0 0.03 0.12 0.12 0.0 0.0 0.02 0.0 0.04 0.01 0.0 0.01 0.0 0.02 0.01 0.23 0.0 0.0 0.16 0.0 0.04 0.04 0.01 0.03 0.0 0.01 0.02 0.06 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.01 0.0 0.0 0.0 0.0 0.0 0.0 0.02 0.02 0.0 0.01 … 0.0 0.0 0.02 0.09 0.0 0.0
julia> non_gapped_colums = findall(j -> f1[gap_digit, j] <= 0.05, 1:size(f1, 2))' # transpose for legibility1×51 adjoint(::Vector{Int64}) with eltype Int64: 2 3 4 5 6 7 8 9 10 11 12 … 44 45 46 47 48 49 50 51 52
julia> B = Alignment(A.data[non_gapped_colums', :], A.alphabet; A.names) # sequences are stored as columnsAlignment of M=100 sequences of length L=51 - Shown as `MxL` matrix 100×51 adjoint(::Matrix{Int64}) with eltype Int64: 3 10 10 14 14 17 5 7 19 3 … 17 15 15 4 3 9 17 18 3 3 15 15 14 11 8 8 7 13 3 18 6 5 4 1 1 1 1 1 3 8 11 14 16 5 5 7 5 3 18 16 10 15 3 16 10 9 3 3 17 11 17 14 17 14 7 18 3 17 16 2 5 3 5 13 21 3 3 2 6 14 12 5 10 7 14 3 16 10 5 10 3 5 10 6 3 3 11 15 17 10 4 9 7 18 3 … 17 5 5 2 3 5 16 2 3 3 13 11 14 9 19 16 7 14 3 17 5 10 5 3 10 5 21 3 3 17 11 14 2 19 15 7 14 3 17 16 5 5 3 5 5 17 3 3 2 11 10 2 4 4 7 14 3 17 11 5 5 3 10 5 10 3 3 11 2 5 19 4 11 16 19 3 18 21 5 5 3 10 2 17 3 ⋮ ⋮ ⋱ ⋮ ⋮ 3 12 5 10 14 5 10 7 14 3 18 17 18 5 3 17 10 18 3 3 7 11 14 10 5 18 7 14 3 17 5 17 5 3 10 5 18 3 3 13 11 14 17 19 19 7 9 3 18 10 5 17 3 5 16 15 3 3 17 6 14 14 4 20 7 17 3 18 6 5 9 3 15 10 2 3 3 5 11 14 19 13 8 7 16 3 … 17 21 5 5 3 9 7 12 3 3 10 11 16 11 4 15 7 17 3 18 5 5 5 3 10 10 18 3 3 5 15 14 12 4 2 7 14 3 18 5 2 17 3 7 21 8 3 3 21 19 7 16 13 5 7 21 3 18 15 5 17 3 5 2 16 3 3 12 11 14 10 4 18 7 18 3 17 15 5 2 3 5 2 19 3