UNTANGLING GENOME ASSEMBLIES
ORGANIZATION
BC Cancer
DATE
2009
THE PROJECT
Sequencing a genome can give rich information about how an organism functions, but obtaining a genome sequence made up of nucleotides (abbreviated A, C, G, and T) is not as simple as reading from one end to the other. Instead, sequencing involves first breaking up the genome into small fragments, sequencing those pieces, and then putting them all back together in a process known as genome assembly.
Researchers at the Michael Smith Genome Sciences Centre needed a way to efficiently analyze and debug their genome assembly algorithm, Assembly By Short Sequences (ABySS). Off-the-shelf tools did not support their use cases and struggled to handle large genomes. I designed and developed ABySS-Explorer, which uses a novel graph visualization to address their needs. The work won an IEEE VIS Best Paper award in 2009.
THE DESIGN
Starting Point
This project began with a hand-annotated printout of an ABySS assembly graph. Shaun Jackman, one of the ABySS co-authors and developers, was using this image to interpret the algorithm's output and asked if I could help.
Transform the Data
Genomes are full of repeated sequences, so it's common for assembled sequences to overlap. The ABySS algorithm produces a graph with sequences as vertices and points of overlap as edges.
Reasoning about data encoded this way is quite tricky, and it is not consistent with how most scientists visualize DNA sequences: as long strings or lines.
To tackle this, I flipped the representation to instead visualize assembled sequences as edges and points of overlap as vertices. In the example below, it's now easier to see that sequence 1 shares an overlap with sequences 2, 3, and 4 (left versus right graph).
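The flip described above can be sketched in a few lines of Python. This is a minimal illustration and not the ABySS-Explorer implementation (the application itself was written in Java). The function name `flip_overlap_graph`, the `"in"`/`"out"` end labels, and the integer junction ids are all hypothetical, and the sketch ignores strand and reverse complements. The assumption is that every sequence has two ends, and that an overlap from sequence `a` to sequence `b` means the end of `a` and the start of `b` meet at the same junction.

```python
def flip_overlap_graph(sequences, overlaps):
    """Convert a sequences-as-vertices graph into a sequences-as-edges graph.

    sequences: iterable of sequence ids
    overlaps:  iterable of (a, b) pairs meaning "the end of a overlaps the start of b"
    Returns a dict mapping each sequence id to a (start_junction, end_junction) pair.
    """
    # Every sequence has two ends; each end starts as its own junction.
    parent = {}
    for s in sequences:
        for side in ("in", "out"):
            parent[(s, side)] = (s, side)

    def find(end):
        # Union-find lookup with path halving.
        while parent[end] != end:
            parent[end] = parent[parent[end]]
            end = parent[end]
        return end

    # An overlap merges the "out" end of a with the "in" end of b.
    for a, b in overlaps:
        root_a, root_b = find((a, "out")), find((b, "in"))
        if root_a != root_b:
            parent[root_b] = root_a

    # Number the merged junctions in order of first appearance.
    junction_ids = {}

    def junction(end):
        return junction_ids.setdefault(find(end), len(junction_ids))

    return {s: (junction((s, "in")), junction((s, "out"))) for s in sequences}


# The example from the text: sequence 1 overlaps sequences 2, 3, and 4.
edges = flip_overlap_graph([1, 2, 3, 4], [(1, 2), (1, 3), (1, 4)])
print(edges)
```

In the flipped result, sequence 1 ends at the same junction where sequences 2, 3, and 4 begin, so the shared overlap becomes a single vertex instead of three separate edges.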
Sequence Length
The next key challenge was how to depict the length of each sequence. In an email exchange, Martin Wattenberg asked, "would it make sense to use curved or even squiggly edges to represent longer segments?"
Inspired by this idea, I encoded each sequence as a wave with one oscillation for each fixed number of nucleotides. Long sequences condense into nearly solid shapes, and I rendered them as asymmetric leaf-like shapes to capture their directionality.
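As a rough sketch of the wave encoding, the Python below samples a sine wave along a straight edge, with the number of oscillations proportional to sequence length. The function name `wave_points` and every parameter value are assumptions made for illustration: the text only says "a fixed number of nucleotides" per oscillation, so the 100 used here is a placeholder. The asymmetric leaf-like envelope is left out to keep the example short.

```python
import math


def wave_points(seq_length, edge_length=100.0, bases_per_oscillation=100,
                amplitude=5.0, samples_per_oscillation=16):
    """Sample (x, y) points of a wave drawn along a horizontal edge.

    seq_length:            sequence length in nucleotides
    edge_length:           drawn length of the edge (hypothetical units)
    bases_per_oscillation: nucleotides represented by one full oscillation (assumed)
    """
    oscillations = seq_length / bases_per_oscillation
    # Enough samples to keep each oscillation smooth, and at least two points.
    n = max(2, math.ceil(oscillations * samples_per_oscillation) + 1)
    points = []
    for i in range(n):
        t = i / (n - 1)  # position along the edge, from 0 to 1
        x = t * edge_length
        y = amplitude * math.sin(2 * math.pi * oscillations * t)
        points.append((x, y))
    return points


# A 300-nucleotide sequence gets three oscillations over the same edge length.
short = wave_points(300)
longer = wave_points(3000)
print(len(short), len(longer))
```

Because the drawn edge length stays fixed while the oscillation count grows, longer sequences pack their waves more tightly, which is what makes them read as nearly solid shapes.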
Colour
Colour is a powerful channel and I saved it for last. In this example, I used colour to annotate parts of an assembled genome from a lymphoma. Orange and blue map to distant regions in a healthy reference genome, highlighting where genomic rearrangements have likely occurred in this cancer.
MY CONTRIBUTIONS
DEFINE
the problem
I worked closely with computational biologists to understand their evolving analytical needs and defined the scope of the original design.
WRANGLE
the data
I transformed the raw output of the ABySS algorithm into a form that served the visual tasks and supported performant interaction.
DESIGN
the prototypes
I iteratively designed ABySS-Explorer through both sketches and prototypes in code that were routinely critiqued and tested by end users.
DEPLOY
the solution
I built the initial ABySS-Explorer application in Java and later supervised junior developers who extended its functionality.
COMMUNICATE
the methods
I wrote the IEEE VIS paper describing the design, methods and applications, which won a best paper award.