Genomics: the end of the beginning

The cells of every living organism contain a set of fundamental instructions – the genome. The genome consists of long DNA molecules that form chromosomes, which provide the information the cell needs in order to operate. Understanding our genome means understanding our genetic identity, which sets us apart from other individuals and other species.

Enter the genomic era. In 2001, the draft sequence of the human genome was completed after more than a decade of international effort and at a cost of 2.7 billion dollars. This landmark achievement of sequencing and assembling nearly 3 billion base pairs was not the beginning of the end, but the end of the beginning. Since then, incredible advances in DNA sequencing technologies have started to revolutionize biology. Today, generating a draft human genome sequence can cost around $1000 and can be completed in days. In parallel, the remarkable adoption of the CRISPR/Cas bacterial immune system as an efficient DNA-editing tool has made it easy to create new genomic sequences. Now, the stage is set to take on the monumental challenge of deciphering the secrets of the genome.

Genome structure and function

The cells of our body share the same genome and thus the same set of genes. Yet a muscle cell, a blood cell and a brain cell are all very different from each other in shape and function. How is this possible? Rather than constantly activating all their genes, our cells operate somewhat like a computer (or a mathematical function): based on its current condition and on input from its surroundings, a cell calculates how to act, resulting in activation of a specific set of appropriate genes. The complex regulatory system that performs this calculation gives each cell its unique functional identity, and understanding how it works is one of the central goals of modern biology.

Interestingly, gene regulation is closely linked to the spatial organization of the genome. For example, special regulatory sequences can physically interact and control the activation of genes that are far from them in the genomic sequence. It is thus not surprising that the 3D organization of the genome is not random, but hierarchical and ordered. In fact, the spatial organization of the genome is associated with most nuclear processes, both natural and in disease, in a wide range of biological species.

What we study

The two fundamental questions we study are:

  1. How is the 3D organization encoded by the genome, given that sequence information is only 1D? What sequences and molecules are involved?
  2. How does the 3D organization of the genome mediate its function? How does this actually work mechanistically? Can we go beyond correlations and figure out the cause from the consequence?

How we study it

We use high-throughput experiments, mainly based on next-generation DNA sequencing technology. One of the central techniques we use is Hi-C, an experiment that allows us to measure, across the entire genome and at a single point in time, which pairs of loci contact each other in 3D. Measuring hundreds of millions of interacting pairs in a single experiment, we construct interaction maps: matrices that show how frequently every pair of loci in the genome interacts.
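To make the idea of an interaction map concrete, here is a minimal sketch of how a list of contacting locus pairs could be turned into a matrix of contact counts. It is an illustration only, not our actual pipeline: the function name `build_contact_map`, the single toy chromosome, the bin size and the example coordinates are all assumptions made for this example.

```python
def build_contact_map(pairs, genome_length, bin_size):
    """Count contacts between equal-sized genomic bins.

    pairs: iterable of (pos_a, pos_b) coordinates on one toy chromosome
           (hypothetical input format, chosen for illustration).
    Returns a symmetric list-of-lists matrix of contact counts.
    """
    n_bins = -(-genome_length // bin_size)  # ceiling division
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for pos_a, pos_b in pairs:
        i, j = pos_a // bin_size, pos_b // bin_size
        matrix[i][j] += 1
        if i != j:
            # Keep the map symmetric: a contact between i and j
            # is also a contact between j and i.
            matrix[j][i] += 1
    return matrix


# Toy data: six contacts on a 100-bp "chromosome", binned at 25 bp.
toy_pairs = [(5, 12), (8, 95), (14, 18), (40, 47), (41, 90), (3, 9)]
contact_map = build_contact_map(toy_pairs, genome_length=100, bin_size=25)
for row in contact_map:
    print(row)
```

A real experiment involves hundreds of millions of pairs and many chromosomes, so sparse matrices and normalization steps would be needed in practice; the sketch only shows the counting step.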

Hi-C interaction map

Some of the major challenges in this field include using these interaction maps to computationally identify structural patterns, analyzing these patterns, and associating them with other functional genomic data. We use a variety of computational approaches, with a focus on machine learning and probabilistic models. However, we try to go beyond simple descriptive and correlative analysis, to produce quantitative predictive models that incorporate explicit assumptions about the underlying biological mechanisms. By studying a diverse set of biological systems, we aim to uncover general principles underlying genome structure and function.

Using Hi-C to solve outstanding challenges in genome assembly

Despite the huge advances in DNA sequencing technology, sequencing alone is not sufficient to produce high-quality genome sequences de novo. Because these technologies measure short sequences, computational methods are needed to assemble them into larger sequences. However, the success of these assembly methods is limited, resulting in fragmented genomes, which pose a huge problem for genomic research.

3D bridges between genome fragments

We have recently developed a way of repurposing Hi-C data to solve this set of problems. The method is based on the notion that some of the strongest interaction patterns found in Hi-C are canonical and appear in all genomes. This means that even for a previously unobserved genome, we can expect these interaction patterns to be present. So, given a fragmented genome, we can perform Hi-C, map the data to the genome fragments, and then computationally rearrange the fragments such that they produce the expected canonical interaction patterns.
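The rearrangement step can be illustrated with a toy sketch. It assumes that the canonical pattern being exploited is the decay of contact frequency with distance along the chromosome, so that fragments with many mutual contacts should end up close together. The scoring function, the exhaustive search and the contact counts below are all hypothetical choices made for illustration; they are not the published method, which has to handle far more fragments than a brute-force search allows.

```python
from itertools import permutations


def ordering_score(order, contacts):
    """Sum contact counts, each weighted by 1/separation in the proposed order.

    Orderings that place strongly interacting fragments next to each
    other receive a higher score.
    """
    position = {frag: idx for idx, frag in enumerate(order)}
    score = 0.0
    for (a, b), count in contacts.items():
        score += count / abs(position[a] - position[b])
    return score


def best_order(fragments, contacts):
    """Exhaustively search all orderings (feasible only for a few fragments)."""
    best = max(permutations(fragments),
               key=lambda order: ordering_score(order, contacts))
    # An ordering and its reverse score identically; report one fixed form.
    return min(best, best[::-1])


# Toy contact counts between four fragments whose true order is A-B-C-D.
toy_contacts = {
    ("A", "B"): 90, ("B", "C"): 80, ("C", "D"): 85,
    ("A", "C"): 30, ("B", "D"): 25, ("A", "D"): 10,
}
shuffled = ["C", "A", "D", "B"]
print(best_order(shuffled, toy_contacts))
```

On this toy input the search recovers the order A, B, C, D from the shuffled fragments. Note that contact data alone cannot distinguish an ordering from its mirror image, which is why the sketch reports a single fixed representative of the two.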

This idea (published back-to-back with similar work from Jay Shendure’s lab) has since been further developed in both academia and industry, and was recently used to assemble the frog, quinoa, goat, mosquito and barley genomes.

Nature covers (Xenopus, quinoa and barley) featuring some of the genomes that were assembled using our Hi-C scaffolding idea

In addition, this notion can also be used to solve other related problems including long-range haplotype phasing, metagenome deconvolution and structural variation identification.

We are currently developing exciting new ways of using Hi-C to solve genome assembly problems.