BioSequenceMappings
The aim of the package is to facilitate the task of converting biological sequences (nucleotides, amino acids) to integers or to onehot representation. It also provides simple function to manipulate alignments or to compute simple statistics.
Note: I do not frequently use the onehot format, so for now it is not documented / not well tested. I'll develop it more when I get the time.
Installation
From the Julia REPL:
julia> using Pkg
julia> Pkg.add("https://github.com/PierreBarrat/BioSequenceMappings.jl")
ERROR: `https://github.com/PierreBarrat/BioSequenceMappings.jl` is not a valid package name. Perhaps you meant `https://github.com/PierreBarrat/BioSequenceMappings` The argument appears to be a URL or path, perhaps you meant `Pkg.add(url="...")` or `Pkg.add(path="...")`.
julia> using BioSequenceMappings
Usage
Filter sequences by Hamming distance
Load an alignment, find all sequences with a Hamming distance smaller than 66% to the first one, create a new alignment object from them and save it to a file.
julia> using BioSequenceMappings
julia> A = read_fasta("../../example/PF00014.fasta");
julia> size(A) # 100 sequences of length 53
(53, 100)
julia> s0 = A[1];
julia> s0' # transpose for legibility
1×53 adjoint(view(::Matrix{Int64}, :, 1)) with eltype Int64: 1 3 10 10 14 14 17 5 7 19 3 … 17 15 15 4 3 9 17 18 3 1
julia> indices = findall(s -> hamming(s, s0; normalize=true) < 0.66, A)
5-element Vector{Int64}: 1 33 64 83 86
julia> B = subsample(A, indices)
Alignment of M=5 sequences of length L=53 - Shown as `MxL` matrix 5×53 adjoint(::Matrix{Int64}) with eltype Int64: 1 3 10 10 14 14 17 5 7 19 … 17 15 15 4 3 9 17 18 3 1 1 3 6 11 14 11 17 10 7 6 17 11 13 5 3 16 11 2 3 1 6 3 11 5 14 14 21 18 7 16 17 5 4 4 3 12 16 18 3 1 1 3 11 11 15 19 4 5 7 14 17 21 15 5 3 15 10 18 3 1 1 3 16 5 17 10 13 5 7 14 17 10 17 5 3 11 17 18 3 1
julia> write("PF00014_subsample.fasta", B)
Remove columns with too many gaps
Load an alignment, find columns with more than 5% gaps and remove them, then create a new alignment object.
julia> A = read_fasta("../../example/PF00014.fasta"); # uses the default alphabet `aa_alphabet`
julia> gap_digit = A.alphabet('-')
1
julia> f1 = site_specific_frequencies(A)
21×53 Matrix{Float64}: 0.41 0.05 0.04 0.03 0.02 0.01 … 0.01 0.01 0.01 0.02 0.02 0.62 0.1 0.0 0.06 0.07 0.06 0.13 0.0 0.0 0.16 0.14 0.0 0.06 0.0 0.95 0.0 0.0 0.0 0.0 0.99 0.0 0.0 0.0 0.98 0.0 0.0 0.0 0.05 0.03 0.02 0.0 0.0 0.02 0.06 0.01 0.0 0.0 0.0 0.0 0.07 0.09 0.05 0.02 0.0 0.28 0.15 0.01 0.0 0.07 0.08 0.0 0.03 0.02 0.0 0.03 … 0.0 0.0 0.0 0.02 0.0 0.0 0.0 0.0 0.01 0.02 0.01 0.0 0.0 0.01 0.01 0.0 0.0 0.0 0.0 0.0 0.03 0.02 0.0 0.01 0.0 0.0 0.01 0.03 0.0 0.0 0.13 0.0 0.02 0.01 0.0 0.06 0.0 0.05 0.02 0.06 0.0 0.0 0.0 0.0 0.05 0.02 0.03 0.15 0.0 0.12 0.11 0.08 0.0 0.11 ⋮ ⋮ ⋱ ⋮ 0.0 0.0 0.07 0.0 0.02 0.01 0.0 0.02 0.08 0.02 0.0 0.0 0.03 0.0 0.0 0.0 0.62 0.22 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.03 0.09 0.03 0.02 0.0 0.13 0.05 0.01 0.0 0.05 0.04 0.0 0.07 0.0 0.03 0.07 … 0.0 0.06 0.13 0.05 0.0 0.08 0.0 0.0 0.16 0.02 0.06 0.05 0.0 0.03 0.12 0.12 0.0 0.0 0.02 0.0 0.04 0.01 0.0 0.01 0.0 0.02 0.01 0.23 0.0 0.0 0.16 0.0 0.04 0.04 0.01 0.03 0.0 0.01 0.02 0.06 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.01 0.0 0.0 0.0 0.0 0.0 0.0 0.02 0.02 0.0 0.01 … 0.0 0.0 0.02 0.09 0.0 0.0
julia> non_gapped_colums = findall(j -> f1[gap_digit, j] <= 0.05, 1:size(f1, 2))' # transpose for legibility
1×51 adjoint(::Vector{Int64}) with eltype Int64: 2 3 4 5 6 7 8 9 10 11 12 … 44 45 46 47 48 49 50 51 52
julia> B = Alignment(A.data[non_gapped_colums', :], A.alphabet; A.names) # sequences are stored as columns
Alignment of M=100 sequences of length L=51 - Shown as `MxL` matrix 100×51 adjoint(::Matrix{Int64}) with eltype Int64: 3 10 10 14 14 17 5 7 19 3 … 17 15 15 4 3 9 17 18 3 3 15 15 14 11 8 8 7 13 3 18 6 5 4 1 1 1 1 1 3 8 11 14 16 5 5 7 5 3 18 16 10 15 3 16 10 9 3 3 17 11 17 14 17 14 7 18 3 17 16 2 5 3 5 13 21 3 3 2 6 14 12 5 10 7 14 3 16 10 5 10 3 5 10 6 3 3 11 15 17 10 4 9 7 18 3 … 17 5 5 2 3 5 16 2 3 3 13 11 14 9 19 16 7 14 3 17 5 10 5 3 10 5 21 3 3 17 11 14 2 19 15 7 14 3 17 16 5 5 3 5 5 17 3 3 2 11 10 2 4 4 7 14 3 17 11 5 5 3 10 5 10 3 3 11 2 5 19 4 11 16 19 3 18 21 5 5 3 10 2 17 3 ⋮ ⋮ ⋱ ⋮ ⋮ 3 12 5 10 14 5 10 7 14 3 18 17 18 5 3 17 10 18 3 3 7 11 14 10 5 18 7 14 3 17 5 17 5 3 10 5 18 3 3 13 11 14 17 19 19 7 9 3 18 10 5 17 3 5 16 15 3 3 17 6 14 14 4 20 7 17 3 18 6 5 9 3 15 10 2 3 3 5 11 14 19 13 8 7 16 3 … 17 21 5 5 3 9 7 12 3 3 10 11 16 11 4 15 7 17 3 18 5 5 5 3 10 10 18 3 3 5 15 14 12 4 2 7 14 3 18 5 2 17 3 7 21 8 3 3 21 19 7 16 13 5 7 21 3 18 15 5 17 3 5 2 16 3 3 12 11 14 10 4 18 7 18 3 17 15 5 2 3 5 2 19 3