The Alphabet type
Basics
An Alphabet contains the information needed to map biological symbols to integers and back. The full type is Alphabet{A,I}, where A is the type of biological symbols (typically Char) and I is a subtype of Integer.
The simplest way to create an alphabet is from a list of symbols:
julia> A = Alphabet(['A', 'C', 'G', 'T', '-'])
custom Alphabet{Char,Int64} with mapping ['A', 'C', 'G', 'T', '-']
The created alphabet A maps each Char representing a nucleotide to an Int according to the index at which it appears in the input vector: 'A' => 1, 'C' => 2, and so on. Note that we could have created the same alphabet from a string (since it is based on Chars) or from a dictionary:
julia> B = Alphabet("ACGT-");
julia> C = Alphabet(Dict('A'=>1, 'C'=>2, 'G'=>3, 'T'=>4, '-'=>5));
julia> A == B
true
julia> A == C
true
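To make the index rule concrete, here is a minimal sketch in plain Julia that does not use the package. It only illustrates the idea that a symbol is mapped to its position in the input; the variable names are made up for this example and say nothing about how Alphabet is implemented internally.

```julia
# Plain-Julia illustration of the index rule; none of these names come from the package.
symbols_vec = ['A', 'C', 'G', 'T', '-']

# symbol -> integer: each symbol is mapped to its position in the input.
from_vector = Dict(c => i for (i, c) in enumerate(symbols_vec))
from_string = Dict(c => i for (i, c) in enumerate("ACGT-"))
from_dict   = Dict('A' => 1, 'C' => 2, 'G' => 3, 'T' => 4, '-' => 5)

# All three constructions describe the same mapping.
same_mapping = from_vector == from_string == from_dict   # true

# integer -> symbol: the inverse direction is just indexing into the vector.
first_symbol = symbols_vec[from_vector['A']]              # 'A'
```

This is why the three constructors above compare equal: they all encode the same symbol-to-position table.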
The alphabet is used to map symbols to integers and back. This is done by calling the object directly, as a function:
julia> A('A') # mapping Char to Int
1
julia> A(1) # mapping Int to Char
'A': ASCII/Unicode U+0041 (category Lu: Letter, uppercase)
julia> A("A-GT") # mapping a string to a vector
4-element Vector{Int64}:
1
5
3
4
julia> A([1,2,3]) # mapping a vector to a string
"ACG"
If needed, the mapping is accessible using the symbols function:
julia> symbols(A)
5-element Vector{Char}:
'A': ASCII/Unicode U+0041 (category Lu: Letter, uppercase)
'C': ASCII/Unicode U+0043 (category Lu: Letter, uppercase)
'G': ASCII/Unicode U+0047 (category Lu: Letter, uppercase)
'T': ASCII/Unicode U+0054 (category Lu: Letter, uppercase)
'-': ASCII/Unicode U+002D (category Pd: Punctuation, dash)
julia> A |> symbols |> prod # as a string
"ACGT-"
Default alphabets
The package comes with three default alphabets:
- an amino-acid alphabet Alphabet(:aa), using the mapping "-ACDEFGHIKLMNPQRSTVWY";
- a nucleotide alphabet Alphabet(:dna), using the mapping "-ACGT";
- a "binary" alphabet Alphabet(:binary), which I found useful for simulations, using the mapping "01".
They can be accessed by calling Alphabet(name), where name is a symbol corresponding to any of the default alphabets. The symbolic names can easily be found:
julia> BioSequenceMappings.aa_alphabet_names # also works with nt and binary alphabets
(:aa, :AA, :aminoacids, :amino_acids)
julia> Alphabet(:aa) == Alphabet("-ACDEFGHIKLMNPQRSTVWY")
true
julia> Alphabet(:amino_acids)([1,2,3])
"-AC"
Each default alphabet is also associated with a specific cardinality of biological symbols through the function default_alphabet. This means that an integer vector with elements ranging from 1 to q will be associated with the following alphabets:
julia> default_alphabet(2) == Alphabet(:binary) # q == 2
true
julia> default_alphabet(5) == Alphabet(:nt) # q == 5
true
julia> default_alphabet(21) == Alphabet(:aa) # 5 < q <= 21
true
julia> default_alphabet(15) == Alphabet(:aa) # 5 < q <= 21
true
This association is useful to create Alignment objects from a matrix of integers without having to specify the alphabet manually.
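The rule above can be written down as a small lookup in plain Julia. This is only a sketch of the stated association, not the package's code: guess_mapping is a hypothetical helper, taking q as the largest entry of the matrix is an assumption, and the behaviour for values of q not listed above is not documented here, so the sketch simply throws an error in that case.

```julia
# Hypothetical helper mirroring the rule stated above; it returns the mapping string.
function guess_mapping(q::Integer)
    q == 2 && return "01"                            # binary
    q == 5 && return "-ACGT"                         # nucleotides
    5 < q <= 21 && return "-ACDEFGHIKLMNPQRSTVWY"    # amino acids
    # Assumption: the text does not say what happens for other values of q.
    throw(ArgumentError("no default mapping for q = $q in this sketch"))
end

X = [1 2 5; 3 4 1]          # integer-encoded sequences
q = maximum(X)              # assumption: q is taken as the largest index seen, here 5
mapping = guess_mapping(q)  # "-ACGT"
```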
Default characters
When reading biological sequences, it can be convenient to map all unexpected characters to a default symbol, for instance the gap. This can be achieved by providing the default_char keyword argument when constructing the alphabet:
julia> A_default = Alphabet("ACGT-"; default_char = '-')
custom Alphabet{Char,Int64} with mapping ['A', 'C', 'G', 'T', '-']
julia> A_default("ABCDEF") # 'unknown' chars are mapped to '-', in turn mapped to 5
6-element Vector{Int64}:
1
5
2
5
5
5
julia> A("ABCDEF") # if no defaults are provided, fails
ERROR: Symbol B not in alphabet, and no defaults set.
This also works the other way around: integers that are not in the range of the alphabet are mapped to the default symbol:
julia> A_default(1:10) # indices larger than 5 are mapped to the gap
"ACGT------"
julia> A(1:10) # if no defaults are provided, fails
ERROR: 6 is not in alphabet range, and no defaults set.
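The fallback behaviour in both directions can be sketched with a dictionary lookup that has a default value. This is a plain-Julia illustration under the assumption that an unknown symbol or index is simply replaced by the default symbol; encode, decode and the other names are hypothetical and not part of the package.

```julia
symbols_list = collect("ACGT-")
index_of = Dict(c => i for (i, c) in enumerate(symbols_list))
default_char = '-'

# Symbol -> integer: unknown symbols fall back to the index of the default symbol.
encode(c::Char) = get(index_of, c, index_of[default_char])

# Integer -> symbol: out-of-range indices fall back to the default symbol.
decode(i::Integer) = checkbounds(Bool, symbols_list, i) ? symbols_list[i] : default_char

encoded = [encode(c) for c in "ABCDEF"]   # [1, 5, 2, 5, 5, 5]
decoded = join(decode(i) for i in 1:10)   # "ACGT------"
```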
Using specific integer types
When created as above, the alphabet will default to using Int as the integer type. If dealing with large amounts of data, it can be beneficial to use a more compact type. This is done by providing an extra argument of the desired type when constructing the alphabet:
julia> B = Alphabet("ACGT-", UInt8)
custom Alphabet{Char,UInt8} with mapping ['A', 'C', 'G', 'T', '-']
julia> B == A
false
julia> B("A-")
2-element Vector{UInt8}:
0x01
0x05
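To give a rough sense of the saving, the following sketch compares the storage of the same encoded sequence as Int64 and as UInt8, using only Base Julia. The length n is an arbitrary choice for illustration.

```julia
n = 1_000_000
seq_int = ones(Int64, n)   # one sequence position per element, 8 bytes each
seq_u8  = ones(UInt8, n)   # same values, 1 byte each

bytes_int = sizeof(seq_int)   # 8_000_000
bytes_u8  = sizeof(seq_u8)    # 1_000_000

# UInt8 can hold indices up to 255, more than enough for the 21-symbol amino-acid alphabet.
max_index = typemax(UInt8)    # 0xff
```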
Translating between alphabets
I often end up with an integer vector X that represents a sequence, but encoded with a mapping different from the one I am used to. The translate function lets me convert it to another integer vector with the right mapping.
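Conceptually, such a conversion amounts to decoding with the old mapping and re-encoding with the new one. The sketch below does this in plain Julia with a lookup table; it is an illustration of the idea, not the package's translate implementation, and the names old_mapping, new_mapping and table are made up for this example.

```julia
old_mapping = collect("TCGA-")   # an unusual ordering of the nucleotides
new_mapping = collect("-ACGT")   # the default nucleotide ordering

# table[i] is the index in new_mapping of the symbol stored at index i of old_mapping.
table = [findfirst(==(c), new_mapping) for c in old_mapping]   # [5, 3, 4, 2, 1]

X = [2, 2, 5, 4, 5]   # "CC-A-" under old_mapping
Y = table[X]          # [3, 3, 1, 2, 1], i.e. "CC-A-" under new_mapping
```

The table approach assumes every symbol of the source mapping also exists in the target mapping; otherwise findfirst returns nothing for that entry. The package's own translate call on the same data follows.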
julia> strange_alphabet = Alphabet("TCGA-"); # the default is "-ACGT"
julia> nt_alphabet = Alphabet(:dna); # the default for nucleotides
julia> X = Int[2, 2, 5, 4, 5]; # representing the sequence "CC-A-" according to the above
julia> strange_alphabet(X)
"CC-A-"
julia> nt_alphabet(X) # this is obviously wrong - nt_alphabet uses "-ACGT"
"AATGT"
julia> Y = translate(X, strange_alphabet, nt_alphabet)
5-element Vector{Int64}:
3
3
1
2
1
julia> nt_alphabet(Y) # now this works
"CC-A-"