Bioinformatics Seminar

The Bioinformatics Seminar is co-sponsored by the Department of Mathematics at the Massachusetts Institute of Technology and the Theory of Computation group at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL). The seminar series focuses on highlighting areas of research in the field of computational biology. This year, we are hoping to highlight three topics: (1) evolution and computational approaches to modeling and understanding it, (2) generative AI for biology/biomedicine, and (3) algorithms for computational biology/genomics.

Fall 2024

Lectures are on Wednesdays, 11:30am - 1:00pm ET
Location: 32G-575 (Stata Center at MIT; Gates Tower; 5th Floor)
Zoom link for virtual attendees: https://harvard.zoom.us/j/99103715484

Date Speaker Title/Abstract
Sept. 11 Roshan Rao
(EvolutionaryScale)

Multimodal Protein Foundation Models

How can multimodality improve representations of proteins? Foundation models have shown promise in building powerful representations for many domains. Language models are able to access a vast quantity of human knowledge and are able to perform limited reasoning over this body of knowledge. Protein models learn the evolutionary patterns in proteins, enabling prediction of protein structure and function. This talk will cover the development of protein foundation models, understanding the representations they build, and how they scale. Finally, it will cover incorporating modalities beyond protein sequences, and how additional data could be added to produce better representations in the future.

Sept. 18 Ben Langmead
(JHU)

Pan-genomic advances for fighting reference bias

Sequencing data analysis often begins with aligning reads to a reference genome, where the reference takes the form of a linear string of bases. But linearity leads to reference bias, a tendency to miss or misreport alignments containing non-reference alleles, which can confound downstream statistical and biological results. This is a major concern in human genomics; we don't want to live in a world where diagnostics and therapeutics are differentially effective depending on whether and where our genetic variants happen to match the reference.

Fortunately, computer science and bioinformatics are meeting the moment. We can now index and align sequencing reads to references that include many population variants. I will present some of the major insights that have shaped this journey from the early days of efficient genome indexing -- especially the Burrows-Wheeler Transform -- continuing through recent methods for indexing graph-shaped references and references that include many genomes. I will emphasize recent results that show how to optimize simple and complex pan-genome representations for effective avoidance of reference bias. Finally, I will outline promising methods for addressing the bias, including new ideas for how to measure bias, new proposals in compressed indexing, and new workflows that integrate genotype imputation to reduce reference bias.
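For readers unfamiliar with the Burrows-Wheeler Transform mentioned in the abstract, the following is a minimal teaching sketch and not the speaker's implementation. It builds the transform naively from sorted rotations and counts pattern occurrences with backward search, the idea behind compressed genome indexes. The function names (`bwt`, `inverse_bwt`, `count_occurrences`) and the `$` sentinel are illustrative assumptions; production aligners use suffix arrays and sampled rank tables instead of the quadratic steps shown here.

```python
def bwt(text, sentinel="$"):
    """Return the Burrows-Wheeler Transform of text (naive rotation sort)."""
    if sentinel in text:
        raise ValueError("text must not contain the sentinel character")
    s = text + sentinel
    # Sort all cyclic rotations; the transform is the last column.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last, sentinel="$"):
    """Recover the original text by repeatedly prepending and sorting."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(c + row for c, row in zip(last, table))
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]  # drop the sentinel


def count_occurrences(last, pattern):
    """Count exact matches of pattern using backward search on the BWT."""
    first = sorted(last)
    # first_index[c] = position of the first row that starts with c.
    first_index = {}
    for i, c in enumerate(first):
        first_index.setdefault(c, i)

    def occ(c, i):
        # Number of occurrences of c in last[:i] (linear scan for clarity).
        return last[:i].count(c)

    lo, hi = 0, len(last)  # half-open range of matching rows
    for c in reversed(pattern):
        if c not in first_index:
            return 0
        lo = first_index[c] + occ(c, lo)
        hi = first_index[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed)
    print(inverse_bwt(transformed))
    print(count_occurrences(transformed, "ana"))
```

The search touches only the transformed string and a small table of counts, which is why this family of indexes can be extended from a single linear reference to collections of genomes.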

Sept. 25 Ard Louis
(Oxford)*

Does evolution have an inbuilt bias towards highly compressible phenotypes? 

Darwinian evolution proceeds by natural selection acting on random variation. I will argue that, although mutations are random, the novel phenotypes they produce can be highly biased towards simple or more compressible forms. This bias is so strong that it can dramatically shape the spectrum of adaptive outcomes. The basic intuition follows from an algorithmic twist on the infinite monkey theorem inspired by the fact that natural selection doesn’t act directly on mutations, but rather on the phenotypes that are generated by developmental programmes. If monkeys type at random in a computer language, they are much more likely to generate outputs derived from shorter algorithms. This intuition can be formalised with the coding theorem of algorithmic information theory, predicting that random mutations are exponentially more likely to result in simpler, more compressible phenotypes with low descriptional (Kolmogorov) complexity. Evidence for this evolutionary Occam’s razor can be found in the symmetry in protein complexes [1], and in the simplicity of RNA secondary structures [2], gene regulatory networks, leaf shape, and Richard Dawkins’ biomorphs model of development [3]. This principle may also extend to machine learning, offering insights into why neural networks generalize well on typical datasets [4].

[1] Symmetry and simplicity spontaneously emerge from the algorithmic nature of evolution, IG Johnston, et al., PNAS 119 (11), e2113883119 (2022).
[2] Phenotype bias determines how RNA structures occupy the morphospace of all possible shapes, K Dingle, F Ghaddar, P Sulc, AA Louis, Molecular Biology and Evolution 39, msab280 (2021).
[3] Bias in the arrival of variation can dominate over natural selection in Richard Dawkins’s biomorphs, NS Martin, CQ Camargo, AA Louis, PLOS Comp. Bio. 20 (3), e1011893 (2024).
[4] Do deep neural networks have an inbuilt Occam's razor? C Mingard, H Rees, G Valle-Pérez, AA Louis, arXiv preprint arXiv:2304.06670.
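
As background for the coding theorem mentioned in the abstract above, its standard textbook statement (due to Levin) can be written as follows. This is a general statement from algorithmic information theory, not a formula taken from the talk: the symbols $U$ (a universal prefix-free machine), $m(x)$ (the probability that a uniformly random program for $U$ outputs $x$), and $K(x)$ (the prefix Kolmogorov complexity of $x$) are notation introduced here for illustration.

```latex
\[
  m(x) \;=\; \sum_{p \,:\, U(p) = x} 2^{-|p|},
  \qquad
  -\log_2 m(x) \;=\; K(x) + O(1).
\]
% Equivalently, up to a multiplicative constant independent of x,
\[
  m(x) \;=\; 2^{-K(x) + O(1)},
\]
% so outputs with short descriptions are exponentially more likely
% to be produced by a randomly chosen program.
```

In the evolutionary reading of the abstract, a randomly mutated genotype plays the role of the random program and the phenotype plays the role of the output, which is why simpler phenotypes are predicted to arise far more often.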

Oct. 2 Smita Krishnaswamy
(Yale)*

Inferring and Characterizing Cellular and Neural Dynamics with Geometric and Topological Deep Learning

In the last decade there has been a data revolution in biology with the advent of high-throughput, high-dimensional data modalities such as single-cell RNA-sequencing, fMRI data, molecular structure data and other modalities. A key issue in these data types is that they provide static snapshots of highly dynamic biological entities. In this talk I will cover our work inferring and characterizing cellular and neural dynamics during various processes. First, I will cover how to infer cell state dynamics during differentiation and disease with a neural ODE framework called MIOflow that is regularized with data geometric and manifold priors. Then I will discuss RITINI, our recent graph ODE network which allows us to learn gene regulation that underlies cellular dynamics, and potentially find new targets for treatments of disease. I will showcase applications of these in triple negative breast cancer and human embryonic stem cell differentiation. Once these dynamics are available, I will showcase tools to quantify and classify these dynamics based on graph signal processing and topological data analysis. This will involve our learnable geometric scattering transform to capture spatial signal patterns, as well as persistent homology and other tools to quantify time-varying patterns. Applications to characterization of brain activity data will be presented.

Oct. 9 Sriram Sankararaman
(UCLA)

Understanding the genetic basis of complex traits from Biobank-scale data: Statistical and Computational challenges

The quest to understand the interplay between evolution, genes and traits has been revolutionized by the collection of rich phenotypic and genetic data across millions of individuals in diverse populations. However, analyses of these Biobank-scale datasets present substantial statistical and computational challenges.

I will describe how we bring together statistical and computational insights to design accurate and highly scalable algorithms for a suite of problems that arise in the analysis of Biobank data: highly scalable randomized inference algorithms to dissect the genetic architecture of complex traits and deep-learning-based phenotype imputation to deal with complex patterns of missingness. By applying these methods to about half a million individuals from the UK Biobank, we obtain novel insights into how genetic effects are distributed across the genome, the relative contributions of additive, dominance and gene-environment interaction effects to trait variation, and new genes that confer risk for hard-to-measure diseases.

Oct. 16 Kevin K. Yang
(Microsoft Research)

Deep generative models for protein engineering

Deep generative models are increasingly powerful tools for the in silico design of novel proteins. Recently, a family of generative models called diffusion models has demonstrated the ability to generate biologically plausible proteins that are dissimilar to any actual proteins seen in nature, enabling unprecedented capability and control in de novo protein design. However, current state-of-the-art models generate protein structures, which limits the scope of their training data and restricts generations to a small and biased subset of protein design space. Here, we introduce a general-purpose diffusion framework, EvoDiff, that combines evolutionary-scale data with the distinct conditioning capabilities of diffusion models for controllable protein generation in sequence space. EvoDiff generates high-fidelity, diverse, and structurally-plausible proteins that cover natural sequence and functional space. Critically, EvoDiff can generate proteins inaccessible to structure-based models, such as those with disordered regions, while maintaining the ability to design scaffolds for functional structural motifs, demonstrating the universality of our sequence-based formulation. We envision that EvoDiff will expand capabilities in protein engineering beyond the structure-function paradigm toward programmable, sequence-first design.

Oct. 23 Bin Yu
(UC Berkeley)*

Veridical Data Science and PCS Uncertainty Quantification

Data Science is central to AI and has driven most of the recent advances in biomedicine and beyond. Human judgment calls are ubiquitous at every step of the data science life cycle (DSLC). We will introduce Veridical (truthful) Data Science (VDS) based on three core principles of data science: Predictability, Computability and Stability (PCS) to formally take into account the human judgment calls as sources of uncertainty. PCS will be showcased through collaborative research in prostate cancer detection and in seeking genetic drivers of a heart disease. We will end with ongoing research on PCS uncertainty quantification (UQ) that addresses two prominent but unconventional sources of uncertainty in the DSLC, arising from data cleaning and algorithm choices.

Oct. 30 Adam Phillippy
(NIH)

Telomere-to-telomere genome assembly and alignment

In 2022, roughly 20 years after the conclusion of the Human Genome Project, we were finally able to complete the last 8% of the human genome, which had been missing from all prior assemblies. Our complete, gapless, “telomere-to-telomere” assembly revealed over 200 Mbp of novel sequence, comprising some of the most repetitive and structurally variable regions of the genome. In addition to new methods for genome sequencing and assembly, these regions have also required new methods for sequence alignment, annotation, and analysis that account for their unique evolutionary properties. I will cover some of the key algorithmic details that have now enabled the routine assembly and analysis of complete human genomes, and the new biology we are uncovering.

Nov. 6 Ava Amini
(Microsoft Research)

Learning the functional consequences of cell state across human cancers

Assessing how alterations in DNA control disease progression and overall cellular function is a core component of cancer biology and has largely driven how we search for and assign therapies. The advent of single-cell RNA-sequencing (scRNA-seq) has reshaped our understanding of human cancers by revealing that tumors are complex systems of interacting cells and exhibit substantial variation in transcriptional states in addition to mutational heterogeneity. Despite the generation of many high-resolution and multimodal single-cell atlases of cancer, we still have a limited understanding of the relative importance and functional consequences of cell state diversity in human malignancy. Addressing this problem is equal parts biology and computer science. In Project Ex Vivo, a joint cancer research collaboration between Microsoft Research and the Broad Institute, we are leveraging the knowledge within a diverse group of computer scientists, experimentalists, clinicians, and computational biologists to better understand the complexity of cell state phenotypes in cancer. I will discuss our efforts to build AI models to better define, model, and therapeutically target cell states in cancer.

Nov. 13 Michael Desai
(Harvard)

TBA

Nov. 20 Jesse Bloom
(Fred Hutchinson Cancer Center)*

Interpreting the evolution of SARS-CoV-2 and other viruses

Some human viruses including SARS-CoV-2 and seasonal influenza evolve rapidly to erode antibody immunity. I will discuss how new high-throughput experimental techniques including deep mutational scanning and sequencing-based neutralization assays can be used to understand and to some extent forecast this evolution.

Nov. 27 Aleksandra Walczak
(Ecole Normale Supérieure)*

How personalised is your immune repertoire?

Immune repertoires provide a unique fingerprint reflecting the immune history of individuals, with potential applications in precision medicine. Can this information be used to identify a person uniquely? If it really is a personalised medical record, can it inform us about the outcomes of a COVID-19 infection? I will show how statistical analysis of immune repertoire sequencing experiments can answer these questions.

Dec. 4 Tristan Bepler
(New York Structural Biology Center)

TBA

Dec. 11 David Van Valen
(Caltech)*

TBA

*Indicates the speaker will be presenting over Zoom. Otherwise, they will be presenting in person.

Past Terms

A listing of the Bioinformatics Seminar series home pages from prior terms.

Organizers and Information

The Bioinformatics Seminar is hosted by Bonnie Berger, Simons Professor of Mathematics at MIT and head of the Computation and Biology group at CSAIL. Professor Berger is also Faculty of Harvard-MIT Health Sciences & Technology, Associate Member of the Broad Institute of MIT and Harvard, Faculty of MIT CSB, and Affiliated Faculty of Harvard Medical School.

The seminar is announced weekly via email to members of the seminar's mailing list and to those on CSAIL's event calendar list. It is also posted in the BioWeek calendar.

Bonnie Berger: bab@mit.edu

Anna Sappington (TA): asapp@mit.edu

To be added to the seminar's email announcement list, or if you have any questions about the seminar, please email bioinfo@csail.mit.edu and cc TA Anna Sappington (asapp@mit.edu).

If you plan to enroll in the associated course, 18.418/HST.504: Topics in Computational Molecular Biology, please contact Professor Berger (bab@mit.edu) and cc TA Anna Sappington (asapp@mit.edu) for more information.