BioSequenceMappings

The aim of the package is to facilitate the task of converting biological sequences (nucleotides, amino acids) to integers or to onehot representation. It also provides simple function to manipulate alignments or to compute simple statistics.

Note: I do not frequently use the onehot format, so for now it is not documented / not well tested. I'll develop it more when I get the time.

Installation

From the Julia REPL:

julia> using Pkg
julia> Pkg.add("https://github.com/PierreBarrat/BioSequenceMappings.jl")ERROR: `https://github.com/PierreBarrat/BioSequenceMappings.jl` is not a valid package name. Perhaps you meant `https://github.com/PierreBarrat/BioSequenceMappings` The argument appears to be a URL or path, perhaps you meant `Pkg.add(url="...")` or `Pkg.add(path="...")`.
julia> using BioSequenceMappings

Usage

Filter sequences by Hamming distance

Load an alignment, find all sequences with a Hamming distance smaller than 66% to the first one, create a new alignment object from them and save it to a file.

julia> using BioSequenceMappings
julia> A = read_fasta("../../example/PF00014.fasta");
julia> size(A) # 100 sequences of length 53(53, 100)
julia> s0 = A[1];
julia> s0' # transpose for legibility1×53 adjoint(view(::Matrix{Int64}, :, 1)) with eltype Int64: 1 3 10 10 14 14 17 5 7 19 3 … 17 15 15 4 3 9 17 18 3 1
julia> indices = findall(s -> hamming(s, s0; normalize=true) < 0.66, A)5-element Vector{Int64}: 1 33 64 83 86
julia> B = subsample(A, indices)Alignment of M=5 sequences of length L=53 - Shown as `MxL` matrix 5×53 adjoint(::Matrix{Int64}) with eltype Int64: 1 3 10 10 14 14 17 5 7 19 … 17 15 15 4 3 9 17 18 3 1 1 3 6 11 14 11 17 10 7 6 17 11 13 5 3 16 11 2 3 1 6 3 11 5 14 14 21 18 7 16 17 5 4 4 3 12 16 18 3 1 1 3 11 11 15 19 4 5 7 14 17 21 15 5 3 15 10 18 3 1 1 3 16 5 17 10 13 5 7 14 17 10 17 5 3 11 17 18 3 1
julia> write("PF00014_subsample.fasta", B)

Remove columns with too many gaps

Load an alignment, find columns with more than 5% gaps and remove them, then create a new alignment object.


julia> A = read_fasta("../../example/PF00014.fasta"); # uses the default alphabet `aa_alphabet`
julia> gap_digit = A.alphabet('-')1
julia> f1 = site_specific_frequencies(A)21×53 Matrix{Float64}: 0.41 0.05 0.04 0.03 0.02 0.01 … 0.01 0.01 0.01 0.02 0.02 0.62 0.1 0.0 0.06 0.07 0.06 0.13 0.0 0.0 0.16 0.14 0.0 0.06 0.0 0.95 0.0 0.0 0.0 0.0 0.99 0.0 0.0 0.0 0.98 0.0 0.0 0.0 0.05 0.03 0.02 0.0 0.0 0.02 0.06 0.01 0.0 0.0 0.0 0.0 0.07 0.09 0.05 0.02 0.0 0.28 0.15 0.01 0.0 0.07 0.08 0.0 0.03 0.02 0.0 0.03 … 0.0 0.0 0.0 0.02 0.0 0.0 0.0 0.0 0.01 0.02 0.01 0.0 0.0 0.01 0.01 0.0 0.0 0.0 0.0 0.0 0.03 0.02 0.0 0.01 0.0 0.0 0.01 0.03 0.0 0.0 0.13 0.0 0.02 0.01 0.0 0.06 0.0 0.05 0.02 0.06 0.0 0.0 0.0 0.0 0.05 0.02 0.03 0.15 0.0 0.12 0.11 0.08 0.0 0.11 ⋮ ⋮ ⋱ ⋮ 0.0 0.0 0.07 0.0 0.02 0.01 0.0 0.02 0.08 0.02 0.0 0.0 0.03 0.0 0.0 0.0 0.62 0.22 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.03 0.09 0.03 0.02 0.0 0.13 0.05 0.01 0.0 0.05 0.04 0.0 0.07 0.0 0.03 0.07 … 0.0 0.06 0.13 0.05 0.0 0.08 0.0 0.0 0.16 0.02 0.06 0.05 0.0 0.03 0.12 0.12 0.0 0.0 0.02 0.0 0.04 0.01 0.0 0.01 0.0 0.02 0.01 0.23 0.0 0.0 0.16 0.0 0.04 0.04 0.01 0.03 0.0 0.01 0.02 0.06 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.01 0.0 0.0 0.0 0.0 0.0 0.0 0.02 0.02 0.0 0.01 … 0.0 0.0 0.02 0.09 0.0 0.0
julia> non_gapped_colums = findall(j -> f1[gap_digit, j] <= 0.05, 1:size(f1, 2))' # transpose for legibility1×51 adjoint(::Vector{Int64}) with eltype Int64: 2 3 4 5 6 7 8 9 10 11 12 … 44 45 46 47 48 49 50 51 52
julia> B = Alignment(A.data[non_gapped_colums', :], A.alphabet; A.names) # sequences are stored as columnsAlignment of M=100 sequences of length L=51 - Shown as `MxL` matrix 100×51 adjoint(::Matrix{Int64}) with eltype Int64: 3 10 10 14 14 17 5 7 19 3 … 17 15 15 4 3 9 17 18 3 3 15 15 14 11 8 8 7 13 3 18 6 5 4 1 1 1 1 1 3 8 11 14 16 5 5 7 5 3 18 16 10 15 3 16 10 9 3 3 17 11 17 14 17 14 7 18 3 17 16 2 5 3 5 13 21 3 3 2 6 14 12 5 10 7 14 3 16 10 5 10 3 5 10 6 3 3 11 15 17 10 4 9 7 18 3 … 17 5 5 2 3 5 16 2 3 3 13 11 14 9 19 16 7 14 3 17 5 10 5 3 10 5 21 3 3 17 11 14 2 19 15 7 14 3 17 16 5 5 3 5 5 17 3 3 2 11 10 2 4 4 7 14 3 17 11 5 5 3 10 5 10 3 3 11 2 5 19 4 11 16 19 3 18 21 5 5 3 10 2 17 3 ⋮ ⋮ ⋱ ⋮ ⋮ 3 12 5 10 14 5 10 7 14 3 18 17 18 5 3 17 10 18 3 3 7 11 14 10 5 18 7 14 3 17 5 17 5 3 10 5 18 3 3 13 11 14 17 19 19 7 9 3 18 10 5 17 3 5 16 15 3 3 17 6 14 14 4 20 7 17 3 18 6 5 9 3 15 10 2 3 3 5 11 14 19 13 8 7 16 3 … 17 21 5 5 3 9 7 12 3 3 10 11 16 11 4 15 7 17 3 18 5 5 5 3 10 10 18 3 3 5 15 14 12 4 2 7 14 3 18 5 2 17 3 7 21 8 3 3 21 19 7 16 13 5 7 21 3 18 15 5 17 3 5 2 16 3 3 12 11 14 10 4 18 7 18 3 17 15 5 2 3 5 2 19 3