Iteration: going through a tree

TreeTools offers two different ways to iterate through nodes: post/pre-order traversals, and arbitrary iteration.

Traversals

The call traversal(tree, style) returns an iterator over the nodes of tree. style can take two values: :preorder or :postorder. The exact definition of the two traversals can be found here. In short, the traversal guarantees that:

for :postorder (resp. :preorder), children nodes are always visited before (resp. after) parent nodes;
the order in which children of a node are visited is the same as that given in children(node);
the order is "depth-first": iteration over a subtree finishes before the next subtree starts.

julia> tree = parse_newick_string("((A:1,B:1)AB:2,C:3)R;");

julia> map(label, traversal(tree, :postorder))
5-element Vector{String}:
 "A"
 "B"
 "AB"
 "C"
 "R"

julia> map(label, traversal(tree, :preorder))
5-element Vector{String}:
 "R"
 "AB"
 "A"
 "B"
 "C"

julia> map(label, traversal(tree, :postorder)) == map(label, traversal(tree)) # postorder is the default
true

Of course, this can also be used in a loop:

julia> using TreeTools # hide
julia> for node in traversal(tree) # :postorder is the default
       	# do something with `node`
       end

Note that traversal can also be called on TreeNode objects:

julia> map(label, traversal(tree, :preorder)) == map(label, traversal(root(tree), :preorder))
true

The traversal function accepts boolean keyword arguments leaves, root and internals. If any of those is set to false, the corresponding nodes are skipped:

julia> map(label, traversal(tree, :postorder, leaves=false))
2-element Vector{String}:
 "AB"
 "R"

julia> map(label, traversal(tree, :preorder, internals=false))
3-element Vector{String}:
 "A"
 "B"
 "C"

One can also skip nodes by passing a function f as the first argument. The traversal will only return nodes n such that f(n) is true:

julia> map(label, traversal(n -> label(n)[1] == 'A', tree, :postorder))
2-element Vector{String}:
 "A"
 "AB"

`map`, `count`, etc...

In some cases, one wants to do something like count the number of nodes that have a given property, or apply some function f to each node and collect the result. To facilitate this, TreeTools extends the Base functions map, map! and count to Tree and TreeNode objects. Using these functions will traverse the tree in post-order. If called on a TreeNode object, they will only iterate through the clade defined by this node.


julia> map(branch_length, tree) # Branch length of all nodes, in postorder5-element Vector{Union{Missing, Float64}}:
 1.0
 1.0
 2.0
 3.0
  missing
julia> map!(tree) do node # Double the length of all branches - map! returns `nothing`
       	x = branch_length(node)
       	if !ismissing(x) branch_length!(node, 2*x) end
       end
julia> map(branch_length, tree) # Doubled branch length, except for root (`missing`)5-element Vector{Union{Missing, Float64}}:
 2.0
 2.0
 4.0
 6.0
  missing
julia> count(n -> label(n)[1] == 'A', tree) # count the nodes with a label starting with 'A'2

Note that there is a difference between the TreeTools extension of map! with respect Base: in TreeTools, map! returns nothing instead of an array.

Arbitrary order

As explained in Basic concepts, a Tree object is mainly a dictionary mapping labels to TreeNode objects. We can thus iterate through nodes in the tree using this dictionary. For this, TreeTools provides the nodes, leaves and internals methods. This will traverse the tree in an arbitrary order but is faster than traversal.

julia> for n in leaves(tree)
       	println("$(label(n)) is a leaf")
       endB is a leaf
A is a leaf
C is a leaf
julia> for n in internals(tree)
       	println("$(label(n)) is an internal node")
       endR is an internal node
AB is an internal node
julia> map(label, nodes(tree)) == union(
       	map(label, leaves(tree)),
       	map(label, internals(tree))
       )true

A note on speed

Iterating through tree using nodes(tree) will be faster than using traversal(tree, ...). This is mainly because of my inability to write an efficient iterator (any help appreciated). Below is a simple example where we define functions that count the number of nodes in a tree:

count_nodes_traversal(tree) = sum(x -> 1, traversal(tree, :postorder)) # Traversing the tree in post-order while summing 1
count_nodes_arbitrary(tree) = sum(x -> 1, nodes(tree)) # Arbitrary order

These two functions obviously give the same result, but not with the same run time. Here, we try it on the example/tree_10.nwk file:

julia> tree = read_tree("../../examples/tree_10.nwk")
                                                          _________________ 10
  _______________________________________________________|
 |                                                       |               __ 8
 |                                                       |______________|
 |                                                                      |,_ 9
 |                                                                      ||
_|                                                                       |, 4
 |                                                                       ||
 |                                                                        | 6
 |
 |                                           ______________________________ 2
 |                                          |
 |__________________________________________|                             , 1
                                            |          ___________________|
                                            |         |                   | 5
                                            |_________|
                                                      |                  ,_ 3
                                                      |__________________|
                                                                         |_ 7
julia> using BenchmarkTools
julia> @btime count_nodes_traversal(tree)  1.758 μs (5 allocations: 288 bytes)
19
julia> @btime count_nodes_arbitrary(tree)  58.042 ns (0 allocations: 0 bytes)
19

Benchmarks show that this time difference tends to reduce when trees get larger (hundreds of leaves). In any case, if a fast post/pre-order is needed, the only solution in the current state of TreeTools is to "manually" program it using a recursive function. The code below defines a more efficient to count nodes, traversing the tree in post-order.

function count_nodes_recursive(n::TreeNode) # this counts the number of nodes below `n`
	cnt = 0
	for c in children(n)
		cnt += count_nodes_recursive(c) #
	end
	return cnt + 1
end

count_nodes_recursive(tree::Tree) = count_nodes_recursive(tree.root)

This will run faster than count_nodes_traversal, and does not allocate. For counting nodes, this is really overkill, and one could just call length(nodes(tree)). In particular, traversing the tree in a precise order does not matter at all. But for more complex use case, writing short recursive code as above does not add a lot of complexity.

julia> @btime count_nodes_recursive(tree)  79.777 ns (0 allocations: 0 bytes)
19

Iteration: going through a tree

Traversals

map, count, etc...

Arbitrary order

A note on speed

`map`, `count`, etc...