Bioinformatics Seminar

The Bioinformatics Seminar is co-sponsored by the Department of Mathematics at the Massachusetts Institute of Technology and the Theory of Computation group at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL). The seminar series highlights areas of research in computational biology. This year, we hope to feature three topics: (1) evolution and computational approaches to modeling and understanding it, (2) generative AI for biology/biomedicine, and (3) algorithms for computational biology/genomics.

Fall 2024

Lectures are on Wednesdays, 11:30am - 1:00pm ET
Location: 32G-575 (Stata Center at MIT; Gates Tower; 5th Floor)
Zoom link for virtual attendees: https://harvard.zoom.us/j/99103715484

Date Speaker Title/Abstract
Sept. 11 Roshan Rao
(EvolutionaryScale)

Multimodal Protein Foundation Models

How can multimodality improve representations of proteins? Foundation models have shown promise in building powerful representations for many domains. Language models are able to access a vast quantity of human knowledge and are able to perform limited reasoning over this body of knowledge. Protein models learn the evolutionary patterns in proteins, enabling prediction of protein structure and function. This talk will cover the development of protein foundation models, understanding the representations they build, and how they scale. Finally, it will cover incorporating modalities beyond protein sequences, and how additional data could be added to produce better representations in the future.

Sept. 18 Ben Langmead
(JHU)

Pan-genomic advances for fighting reference bias

Sequencing data analysis often begins with aligning reads to a reference genome, where the reference takes the form of a linear string of bases. But linearity leads to reference bias, a tendency to miss or misreport alignments containing non-reference alleles, which can confound downstream statistical and biological results. This is a major concern in human genomics; we don't want to live in a world where diagnostics and therapeutics are differentially effective depending on whether and where our genetic variants happen to match the reference.

Fortunately, computer science and bioinformatics are meeting the moment. We can now index and align sequencing reads to references that include many population variants. I will present some of the major insights that have shaped this journey, from the early days of efficient genome indexing -- especially the Burrows-Wheeler Transform -- through recent methods for indexing graph-shaped references and references that include many genomes. I will emphasize recent results that show how to optimize simple and complex pan-genome representations for effective avoidance of reference bias. Finally, I will outline promising methods for addressing the bias, including new ideas for how to measure bias, new proposals in compressed indexing, and new workflows that integrate genotype imputation to reduce reference bias.

Sept. 25 Ard Louis
(Oxford)*

Does evolution have an inbuilt bias towards highly compressible phenotypes? 

Darwinian evolution proceeds by natural selection acting on random variation. I will argue that, although mutations are random, the novel phenotypes they produce can be highly biased towards simple or more compressible forms. This bias is so strong that it can dramatically shape the spectrum of adaptive outcomes. The basic intuition follows from an algorithmic twist on the infinite monkey theorem inspired by the fact that natural selection doesn’t act directly on mutations, but rather on the phenotypes that are generated by developmental programmes. If monkeys type at random in a computer language, they are much more likely to generate outputs derived from shorter algorithms. This intuition can be formalised with the coding theorem of algorithmic information theory, predicting that random mutations are exponentially more likely to result in simpler, more compressible phenotypes with low descriptional (Kolmogorov) complexity. Evidence for this evolutionary Occam’s razor can be found in the symmetry in protein complexes [1], and in the simplicity of RNA secondary structures [2], gene regulatory networks, leaf shape, and Richard Dawkins’ biomorphs model of development [3]. This principle may also extend to machine learning, offering insights into why neural networks generalize well on typical datasets [4].

[1] Symmetry and simplicity spontaneously emerge from the algorithmic nature of evolution, IG Johnston, et al., PNAS 119 (11), e2113883119 (2022).
[2] Phenotype bias determines how RNA structures occupy the morphospace of all possible shapes, Kamaludin Dingle, Fatme Ghaddar, Petr Sulc, and Ard A. Louis, Molecular Biology and Evolution 39, msab280 (2021).
[3] Bias in the arrival of variation can dominate over natural selection in Richard Dawkins’s biomorphs, NS Martin, CQ Camargo, AA Louis, PLOS Comp. Bio. 20 (3), e1011893 (2024).
[4] Do deep neural networks have an inbuilt Occam's razor? C Mingard, H Rees, G Valle-Pérez, AA Louis, arXiv preprint arXiv:2304.06670.

Oct. 2 Smita Krishnaswamy
(Yale)*

TBA

Oct. 9 Sriram Sankararaman
(UCLA)

Understanding the genetic basis of complex traits from Biobank-scale data: Statistical and Computational challenges

The quest to understand the interplay between evolution, genes, and traits has been revolutionized by the collection of rich phenotypic and genetic data across millions of individuals in diverse populations. However, analyses of these Biobank-scale datasets present substantial statistical and computational challenges.

I will describe how we bring together statistical and computational insights to design accurate and highly scalable algorithms for a suite of problems that arise in the analysis of Biobank data: highly scalable randomized inference algorithms to dissect the genetic architecture of complex traits, and deep-learning-based phenotype imputation to deal with complex patterns of missingness. By applying these methods to about half a million individuals from the UK Biobank, we obtain novel insights into how genetic effects are distributed across the genome, the relative contributions of additive, dominance, and gene-environment interaction effects to trait variation, and new genes that confer risk for hard-to-measure diseases.

Oct. 16 Kevin K. Yang
(Microsoft Research)

TBA

Oct. 23 Bin Yu
(UC Berkeley)*

TBA

Oct. 30 Adam Phillippy
(NIH)

TBA

Nov. 6 Ava Amini
(Microsoft Research)

TBA

Nov. 13 Michael Desai
(Harvard)

TBA

Nov. 20 Jesse Bloom
(Fred Hutchinson Cancer Center)*

TBA

Nov. 27 Aleksandra Walczak
(Ecole Normale Supérieure)*

TBA

Dec. 4 Tristan Bepler
(New York Structural Biology Center)

TBA

Dec. 11 David Van Valen
(Caltech)*

TBA

*Indicates the speaker will be presenting over Zoom. Otherwise, they will be presenting in person.

Past Terms

A listing of the Bioinformatics Seminar series home pages from prior terms.

Organizers and Information

The Bioinformatics Seminar is hosted by Bonnie Berger, Simons Professor of Mathematics at MIT and head of the Computation and Biology group at CSAIL. Professor Berger is also Faculty of Harvard-MIT Health Sciences & Technology, Associate Member of the Broad Institute of MIT and Harvard, Faculty of MIT CSB, and Affiliated Faculty of Harvard Medical School.

The seminar is announced weekly via email to members of the seminar's mailing list and to those on CSAIL's event calendar list. It is also posted in the BioWeek calendar.

Bonnie Berger: bab@mit.edu

Anna Sappington (TA): asapp@mit.edu

To be added to the seminar's email announcement list, or for any questions about the seminar, please email bioinfo@csail.mit.edu and cc TA Anna Sappington (asapp@mit.edu).

If you plan to enroll in the associated course, 18.418/HST.504: Topics in Computational Molecular Biology, please contact Professor Berger (bab@mit.edu) and cc TA Anna Sappington (asapp@mit.edu) for more information.