BioSequenceMappings
The aim of the package is to facilitate the task of converting biological sequences (nucleotides, amino acids) to integers or to onehot representation. It also provides simple function to manipulate alignments or to compute simple statistics.
Note: I do not frequently use the onehot format, so for now it is not documented / not well tested. I'll develop it more when I get the time.
Installation
From the Julia REPL:
julia> using Pkg
julia> Pkg.add("BioSequenceMappings")
Updating registry at `~/.julia/registries/General.toml` Resolving package versions... Installed BioSequenceMappings ─ v0.1.5 Updating `~/work/BioSequenceMappings.jl/BioSequenceMappings.jl/docs/Project.toml` [a84bc454] ~ BioSequenceMappings v0.1.6 `~/work/BioSequenceMappings.jl/BioSequenceMappings.jl` ⇒ v0.1.5 Updating `~/work/BioSequenceMappings.jl/BioSequenceMappings.jl/docs/Manifest.toml` [a84bc454] ~ BioSequenceMappings v0.1.6 `~/work/BioSequenceMappings.jl/BioSequenceMappings.jl` ⇒ v0.1.5 Precompiling project... 1980.8 ms ✓ BioSequenceMappings 1 dependency successfully precompiled in 3 seconds. 79 already precompiled. 1 dependency precompiled but a different version is currently loaded. Restart julia to access the new version. Otherwise, loading dependents of this package may trigger further precompilation to work with the unexpected version.
julia> using BioSequenceMappings
Usage
Filter sequences by Hamming distance
Load an alignment, find all sequences with a Hamming distance smaller than 66% to the first one, create a new alignment object from them and save it to a file.
julia> using BioSequenceMappings
julia> A = read_fasta("../../example/PF00014.fasta");
julia> size(A) # 100 sequences of length 53
(53, 100)
julia> s0 = A[1];
julia> s0' # transpose for legibility
1×53 adjoint(view(::Matrix{Int64}, :, 1)) with eltype Int64: 1 3 10 10 14 14 17 5 7 19 3 … 17 15 15 4 3 9 17 18 3 1
julia> indices = findall(s -> hamming(s, s0; normalize=true) < 0.66, A)
5-element Vector{Int64}: 1 33 64 83 86
julia> B = subsample(A, indices)
Alignment of M=5 sequences of length L=53 - Shown as `MxL` matrix 5×53 adjoint(::Matrix{Int64}) with eltype Int64: 1 3 10 10 14 14 17 5 7 19 … 17 15 15 4 3 9 17 18 3 1 1 3 6 11 14 11 17 10 7 6 17 11 13 5 3 16 11 2 3 1 6 3 11 5 14 14 21 18 7 16 17 5 4 4 3 12 16 18 3 1 1 3 11 11 15 19 4 5 7 14 17 21 15 5 3 15 10 18 3 1 1 3 16 5 17 10 13 5 7 14 17 10 17 5 3 11 17 18 3 1
julia> write("PF00014_subsample.fasta", B)
Remove columns with too many gaps
Load an alignment, find columns with more than 5% gaps and remove them, then create a new alignment object.
julia> A = read_fasta("../../example/PF00014.fasta"); # uses the default alphabet `Alphabet(:aa)`
julia> gap_digit = A.alphabet('-')
1
julia> f1 = site_specific_frequencies(A)
21×53 Matrix{Float64}: 0.41 0.05 0.04 0.03 0.02 0.01 … 0.01 0.01 0.01 0.02 0.02 0.62 0.1 0.0 0.06 0.07 0.06 0.13 0.0 0.0 0.16 0.14 0.0 0.06 0.0 0.95 0.0 0.0 0.0 0.0 0.99 0.0 0.0 0.0 0.98 0.0 0.0 0.0 0.05 0.03 0.02 0.0 0.0 0.02 0.06 0.01 0.0 0.0 0.0 0.0 0.07 0.09 0.05 0.02 0.0 0.28 0.15 0.01 0.0 0.07 0.08 0.0 0.03 0.02 0.0 0.03 … 0.0 0.0 0.0 0.02 0.0 0.0 0.0 0.0 0.01 0.02 0.01 0.0 0.0 0.01 0.01 0.0 0.0 0.0 0.0 0.0 0.03 0.02 0.0 0.01 0.0 0.0 0.01 0.03 0.0 0.0 0.13 0.0 0.02 0.01 0.0 0.06 0.0 0.05 0.02 0.06 0.0 0.0 0.0 0.0 0.05 0.02 0.03 0.15 0.0 0.12 0.11 0.08 0.0 0.11 ⋮ ⋮ ⋱ ⋮ 0.0 0.0 0.07 0.0 0.02 0.01 0.0 0.02 0.08 0.02 0.0 0.0 0.03 0.0 0.0 0.0 0.62 0.22 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.03 0.09 0.03 0.02 0.0 0.13 0.05 0.01 0.0 0.05 0.04 0.0 0.07 0.0 0.03 0.07 … 0.0 0.06 0.13 0.05 0.0 0.08 0.0 0.0 0.16 0.02 0.06 0.05 0.0 0.03 0.12 0.12 0.0 0.0 0.02 0.0 0.04 0.01 0.0 0.01 0.0 0.02 0.01 0.23 0.0 0.0 0.16 0.0 0.04 0.04 0.01 0.03 0.0 0.01 0.02 0.06 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.01 0.0 0.0 0.0 0.0 0.0 0.0 0.02 0.02 0.0 0.01 … 0.0 0.0 0.02 0.09 0.0 0.0
julia> non_gapped_colums = findall(j -> f1[gap_digit, j] <= 0.05, 1:size(f1, 2))' # transpose for legibility
1×51 adjoint(::Vector{Int64}) with eltype Int64: 2 3 4 5 6 7 8 9 10 11 12 … 44 45 46 47 48 49 50 51 52
julia> B = Alignment(A.data[non_gapped_colums', :], A.alphabet; A.names) # sequences are stored as columns
Alignment of M=100 sequences of length L=51 - Shown as `MxL` matrix 100×51 adjoint(::Matrix{Int64}) with eltype Int64: 3 10 10 14 14 17 5 7 19 3 … 17 15 15 4 3 9 17 18 3 3 15 15 14 11 8 8 7 13 3 18 6 5 4 1 1 1 1 1 3 8 11 14 16 5 5 7 5 3 18 16 10 15 3 16 10 9 3 3 17 11 17 14 17 14 7 18 3 17 16 2 5 3 5 13 21 3 3 2 6 14 12 5 10 7 14 3 16 10 5 10 3 5 10 6 3 3 11 15 17 10 4 9 7 18 3 … 17 5 5 2 3 5 16 2 3 3 13 11 14 9 19 16 7 14 3 17 5 10 5 3 10 5 21 3 3 17 11 14 2 19 15 7 14 3 17 16 5 5 3 5 5 17 3 3 2 11 10 2 4 4 7 14 3 17 11 5 5 3 10 5 10 3 3 11 2 5 19 4 11 16 19 3 18 21 5 5 3 10 2 17 3 ⋮ ⋮ ⋱ ⋮ ⋮ 3 12 5 10 14 5 10 7 14 3 18 17 18 5 3 17 10 18 3 3 7 11 14 10 5 18 7 14 3 17 5 17 5 3 10 5 18 3 3 13 11 14 17 19 19 7 9 3 18 10 5 17 3 5 16 15 3 3 17 6 14 14 4 20 7 17 3 18 6 5 9 3 15 10 2 3 3 5 11 14 19 13 8 7 16 3 … 17 21 5 5 3 9 7 12 3 3 10 11 16 11 4 15 7 17 3 18 5 5 5 3 10 10 18 3 3 5 15 14 12 4 2 7 14 3 18 5 2 17 3 7 21 8 3 3 21 19 7 16 13 5 7 21 3 18 15 5 17 3 5 2 16 3 3 12 11 14 10 4 18 7 18 3 17 15 5 2 3 5 2 19 3