In this interview, John Mattick shares his view on “RNA, the Epicenter of Genetic Information”, the title of a book that has just been published, and on the past, present and future of RNA biology and its role for life.
For whom and why did you write the book "RNA, the Epicenter of Genetic Information" with the subtitle "A New Understanding of Molecular Biology"?
This book is not for the public but scientific peers. We wrote the book to explain how genetic information has been misunderstood. This cannot be done in a review article nor a two-hour lecture. To understand how we got to this point, one must go back to the beginning of molecular biology and follow its fascinating history with a focus on how the roles of RNA were defined.
The book was prompted by the background work done by my former student and co-author Paulo Amaral. He included an appendix documenting the early history of RNA research in his PhD thesis. I thought this was a good starting point for a book; it took three years to put it together. I am pleased with the outcome and the endorsements it received, for example, from Tom Cech and Joan Steitz.
The book contains a lot of quotes. Throughout the history of Molecular Biology, progress was clumsy because, over and over again, the great and the good were skeptical of unexpected findings and opposed ideas that turned out to be correct. Prominent scientists said, "We discovered rRNA, tRNA and mRNA; the hard work has been done," genes = proteins, and the rest is now just detail.
It was the same story for epigenetics and histone modifications. David Allis faced tremendous difficulties until their importance was established. And there are many stories like this. Barbara McClintock’s discovery of transposable elements, which she correctly insisted are “controlling elements”, is another example. In the book, I put the quote from the American musical Porgy and Bess by George and Ira Gershwin: "It ain't necessarily so - The things that you're liable to read in the Bible - ain't necessarily so." My adaptation is, "What you are liable to read in the textbooks ain't necessarily so." I am not talking about factual knowledge like the detailed structure of the ribosome, but the conceptual framework itself.
Regarding concepts, what causes the doubts regarding the functionality of long noncoding RNAs?
In the early 2000s, the transcriptome projects surprisingly revealed tens of thousands of long transcripts with little or no protein-coding potential. The underlying problem was to accept that the textbooks would need to be rewritten if they were functional. These RNAs emerged out of the mist, and no conventional explanations for gene regulation could accommodate such a large army of molecules that had not been accounted for previously. A common refrain was: "They might be noise." Two arguments were used to strengthen this notion:
One argument was that these RNAs are lowly expressed and less conserved than protein-coding sequences. There are several problems with that argument, one of them being how conservation was assessed. In 2002, the mouse and human genome papers used ancient transposons common to both species to assess the rate of “neutral” evolution and found a similar degree of divergence in the rest of the genome, which they concluded is also evolving neutrally and the RNAs expressed from it must also be non-functional. This is an entirely circular argument. Nowadays, it is evident that transposable elements are major features of genome biology.
The second argument was that conservation imputes function and thus the rapidly evolving long noncoding RNAs (lncRNAs) are less likely to be functional. However, conservation is a relative measure, and low conservation imputes nothing. There must be lineage specificity: regulatory sequences, including promoters, evolve much more quickly than highly constrained protein-coding sequences. Protein sequences must maintain their structure for their function. On the other hand, regulatory sequences, including RNAs, have much more plastic structure-function relationships than proteins. Evolutionary developmental researchers will tell you that it is evident that phenotypic variation comes largely from regulatory sequence variation and not protein sequence variation. Thus, there is positive selection for variation in regulatory architecture, which underpins phenotypic radiation.
Where does this fundamental resistance come from, e.g., the number of protein-coding genes between C. elegans and humans is similar? Should this not hint at the presence of other significant differences between the genomes of these organisms?
The resistance to accepting the functionality of lncRNAs is fundamentally a consequence of the orthodox conceptual framework of gene regulation, compounded by reductionism. These RNAs did not fit, and most researchers were focussed on their gene or protein of interest, not on how the system works. When the C. elegans genome was published and it became clear that the number of protein-coding genes was similar and that many are orthologous to those in humans, it was assumed that the combinatorics of transcription factor regulation provides more than sufficient power to enable the developmental programming of a worm or a human. However, the assertion that transcription factor combinatorics could explain everything about gene regulation and diversity was vague. It was never justified theoretically, mathematically or mechanistically.
There are two interesting features of transcription factors: Nearly all of them contain intrinsically disordered regions (IDRs), and most can bind RNA. How does it work mechanistically that a transcription factor binds to different promoters in different cells at various stages of development? There is no answer to that in conventional space. However, the data show that zinc finger transcription factors have a higher affinity for RNA:DNA hybrids than they do for double-stranded DNA. RNA:DNA hybrids and triplexes occur all over the genome, so a plausible explanation is that RNA molecules select the exact binding sequence of transcription factors in a given cell at a given time. RNA regulatory networks direct where transcription factors bind to the genome to control transcription.
The so-called 95% "junk" has crucial functions?
The junk idea has a long history dating back to the 1930s when theoretical biologists considered the size of genomes. They argued that the mutational load would be too high if there were the same density of protein-encoding sequences in humans as there is in bacteria. A nucleotide variation in a protein that changes a codon or introduces a stop codon is often catastrophic. A nucleotide variation in a regulatory RNA, however, changes the regulatory architecture, which is the basis of quantitative trait variation. Back then, there was a long argument between so-called "Mendelians" and those working in agriculture and animal breeding who understood that quantitative trait variation is not usually a function of protein-coding mutations. This has since been confirmed by genome-wide association studies (GWAS).
The second argument was based on the C-value enigma. Some organisms like certain amoebae, arthropods and amphibians have much more DNA per cell than humans. The assumption was that they have variable amounts of junk, which justifies the assumption that the human genome can also contain a lot of junk. However, the increase in noncoding sequences compared to organismal complexity suggests a massive expansion in the regulatory architecture. The only way to invalidate this proposition would be to identify downward exceptions: complex organisms with little noncoding DNA. None have been found to date.
The fundamental mistake was to think that proteins transact most genetic information in complex organisms. Most of the information is, in fact, transacted by RNAs, and most genes produce RNAs, which then organize cell fate decisions from fertilization to the adult.
In terms of molecular evolution, did RNA pass on the heredity part to DNA and the catalytic part to proteins?
That is a fair summary. RNA was likely the ancestral molecule because it combines the two critical functions of information storage and catalysis. Information storage was then outsourced to the more stable and easily replicable DNA, which was an intelligent move of evolution. Catalytic activities were largely outsourced to proteins because they possess more chemical versatility. The proof that RNA preceded proteins is simple: peptide bond formation in the ribosome is an RNA-catalyzed reaction.
Moving to another RNA-catalyzed process, Splicing: You consider the discovery of introns as the biggest shock in molecular biology. Why?
When the discovery was made that genes are not collinear with their protein products in complex organisms and that protein-coding sequences are split into bits located over vast territories, the reaction was, "Wow, what is going on here?" My big criticism is that nobody at this moment took the chance to reconsider what was really known and not known about genetic information, especially in complex organisms. Somebody once said, "The best science is done at the point of greatest surprise." If that is true, then molecular biology was found wanting because quickly and almost universally introns were condemned as another manifestation of and evidence for junk sequences in the genome.
Walter Gilbert wrote an article to rationalize the presence of introns. He suggested that they were remnants of the primordial assembly of genes, where fragments of protein-coding sequences ("exons") were interspersed with other RNA sequences ("introns"). Evolution then built proteins by removing the intervening sequences by splicing after transcription. He proposed that bacteria lost introns under the pressure of rapid replication but that they were retained in the slower-growing eukaryotes.
There is a significant fault with this argument. Gilbert used a colorful expression in that article, writing that animals retained “the full stigmata of their birth”, which is an evolutionary non sequitur. The eukaryotes were unicellular for at least 2 billion years under the same selective pressures of rapid replication prior to the emergence of multicellularity. So, his argument tacitly suggests that there was a clade of unicellular eukaryotes sitting under a proverbial evolutionary rock, waiting for the sunny day that they would emerge as complex organisms. That does not make sense.
The more reasonable argument is that self-splicing group II introns, which exist in out of the way places in bacteria, recolonized genes after the separation of transcription and translation. The early eukaryotes were scavengers. As a scavenger, if you start engulfing things, you must protect your genome, and so transcription and translation were separated. This provided a window for group II introns to invade genes and splice themselves out before translation and led to the formation of the spliceosome. These internal segments then became the substrate for positive selection for RNA regulatory functions that were produced in parallel with the protein-coding sequences.
On the topic of cellular compartmentalization, how does phase separation relate to the origin of life?
Most people would agree that phase separation has been one of the most exciting developments in molecular cell biology in the last decade and has been staring us in the face since nucleoli were first observed microscopically. Phase separation also gives another dimension to the prebiotic assembly of life. RNA-protein interactions drive phase separation. Interestingly, the most ancient codons are the ones that specify the amino acids in the IDRs.
So, the plausible scenario is that RNA has a function beyond information storage and catalysis: the ability to cooperate with primitive peptides to form phase-separated domains. These domains can then become reaction centers for biochemical and genetic evolution, producing a protocell.
Phase separation is the hidden and overlooked dimension of the organization of the cell, and of the chromatin, during development. The proportion of IDR-containing proteins has increased enormously with organismal complexity and scales with it. Nearly all proteins that control mammalian development contain IDRs. Genetic loci called enhancers – of which there are ~400,000 in the human genome – control the spatiotemporal patterns of gene expression during development by inducing chromatin rearrangements, and enhancers express lncRNAs in the cells in which they are active, which is likely the mechanistic basis of their function.
Staying with the spatial organization, cellular RNA localization is also far from functionally and mechanistically being understood.
Over a decade ago, we showed specific expression and localization of particular long noncoding RNAs to unknown subcellular locations. Seeing these images was a "Wow" moment. Somewhere around 30% of lncRNAs go to the cytoplasm, while the others are retained in the nucleus. Specific cellular localization provides further evidence against the opinion that lncRNAs are just transcriptional noise. LncRNAs comprise the major information complement of the genome, they are highly alternatively spliced, and their structure appears to be modular. If we can work out which structures in lncRNAs perform which functions, we can elucidate their mechanisms and pathways.
There are thousands of publications on lncRNAs, but most are descriptive. Many labs look at long noncoding RNAs in cancer, differentiation, or something else. And then they see one changing, perturb it and report that something happens. However, that is not getting to the heart of how they work, although it is valuable because it adds to the weight of evidence of the functionality of long noncoding RNAs.
We are trying to decipher the mechanisms of lncRNA action, but I am not a big fan of getting deep down in the trench because I think you can lose your way. But we do need to determine the structure-function relationships for lncRNAs. So, I am dreaming of a new Rfam. This Rfam, like Pfam for proteins, would tell you based on the sequence that a lncRNA contains, e.g., a Polycomb binding domain. Then we can start putting some structure into understanding what lncRNAs are doing and where they are going.
Do we have enough data to take this type of approach?
The lack of training data is the problem. I think the field of RNA structural biology will grow and that the data will come. However, RNA structures are complex and sensitive to base changes and modifications. The way we have started to tackle this problem is to look at high confidence predictions of two-dimensional RNA structures. When we did that, we could show that almost 20% of the human genome was conserved at the level of predicted RNA structure.
We then looked in an evolutionary series at how nucleotide variations affected the predicted structure, that is, whether we could find changes in predicted stems accompanied by a complementary change that would maintain the stem, in other words, co-evolution. The more depth you have in your evolutionary series, the more statistical confidence you have that your two-dimensional structure projections are correct. By our estimation, there are 10 million conserved RNA structures in the mammalian genome, with 2 million classified as high confidence.
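The co-evolution test described above can be illustrated with a small sketch. It is not the pipeline used in the work mentioned in the interview; the function names, the classification labels and the toy alignment are all hypothetical, and the sketch assumes the two alignment columns being compared have already been predicted to pair in the reference sequence.

```python
# Watson-Crick pairs plus the G-U wobble pair, which can also hold an RNA stem together.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def classify_pair(ref, seq, i, j):
    """Classify aligned columns i and j of seq relative to the reference sequence."""
    ref_pair = (ref[i], ref[j])
    pair = (seq[i], seq[j])
    if pair == ref_pair:
        return "unchanged"
    if pair not in PAIRS:
        return "disrupted"  # the bases can no longer pair
    if pair[0] != ref_pair[0] and pair[1] != ref_pair[1]:
        return "compensatory"  # both bases changed and pairing is maintained
    return "consistent"  # one base changed and pairing is maintained, e.g. G-C to G-U


def covariation_summary(alignment, i, j):
    """Count each class for columns i and j; alignment[0] is taken as the reference."""
    ref = alignment[0]
    counts = {"unchanged": 0, "consistent": 0, "compensatory": 0, "disrupted": 0}
    for seq in alignment[1:]:
        counts[classify_pair(ref, seq, i, j)] += 1
    return counts


# Hypothetical toy alignment: columns 0 and 8 are assumed to pair in the reference.
ALIGNMENT = [
    "GGGAAACCC",  # reference, G-C
    "GGGAAACCC",  # G-C, unchanged
    "AGGAAACCU",  # A-U, compensatory
    "GGGAAACCU",  # G-U, consistent
    "AGGAAACCC",  # A-C, disrupted
    "CGGAAACCG",  # C-G, compensatory
]

summary = covariation_summary(ALIGNMENT, 0, 8)
```

Many compensatory changes and few disruptions across a deep evolutionary series would support the predicted stem, which is the intuition behind the statistical confidence mentioned above.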
You call RNA "The computational engine of the cell". How do you see the future of RNA research and which dogmas will be overturned?
The big dogmas to be overturned are the ideas that genes mostly encode proteins and that the human genome is full of junk, which is the complete reverse of the truth. On the contrary, the human genome is a highly efficient information suite. Measured in bits, it amounts to only 825 megabytes, less than the size of Microsoft Word, yet it contains the information that puts 30 plus trillion cells in the right places with all their specialized architecture and functions.
The genome contains a sophisticated program that takes a single cell and directs the development of the entire organism, e.g., to form all your bones and muscles in the right shapes and places. For me, after the brain, bones are the most fascinating organ of the body because of the variation in their architecture. Plus, bones have different densities, and all that architectural variation is derived from the same or highly similar cell types. So, most of the programming for human development does not have to do with cell differentiation. It is easy to specify cell types. Instead, it is about organizing the architecture, and almost nobody in molecular biology thinks about this, another example of a casualty of reductionism.
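As a back-of-the-envelope check, the 825 megabyte figure can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions rather than the calculation used in the interview: it assumes a haploid genome of about 3.3 billion bases, two bits per base (four possible bases), and decimal megabytes. The constant and function names are illustrative.

```python
import math

# Assumption: approximate length of the haploid human genome in bases.
HAPLOID_BASES = 3_300_000_000

# Four possible bases (A, C, G, T) need log2(4) = 2 bits each.
BITS_PER_BASE = math.log2(4)


def genome_megabytes(n_bases, bits_per_base=BITS_PER_BASE):
    """Convert a genome length in bases to decimal megabytes (10**6 bytes)."""
    total_bits = n_bases * bits_per_base
    total_bytes = total_bits / 8
    return total_bytes / 1e6


size_mb = genome_megabytes(HAPLOID_BASES)
```

With these assumptions `size_mb` comes out at 825; a different genome length or a binary definition of the megabyte would shift the number somewhat.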
Another dogma to be overturned is that transcription factors control development. They execute functions, but RNAs exercise the actual control, a huge conceptual change that may take another decade for people to accept.
At the practical level, once we understand the structure-function relationships and pathways, we can manipulate them, for example, in the case of genetic variations underpinning complex human traits and disorders. These variations primarily lie in intergenic regions, not protein-coding ones. The intergenic haplotype blocks identified by the GWAS studies are replete with lncRNAs, which are the candidates for the underlying mechanistic basis.
Can it be that simple that it is only one or a few long noncoding RNAs, or would it not be a complex interplay of many genetic loci?
I think it will be possible to identify the best treatment for people with particular subsets of complex disorders. The GWAS data indicate that there are 50 or 100 loci that contribute, but it does not necessarily mean that one needs to reconfigure the whole network, but rather address just the part that is damaged. Once we understand the mechanistic basis of the variations underpinning complex disorders, it is likely that at least some of these damaging changes can be corrected in some way.
Which new technological approaches are most urgently needed?
For me, this is high-resolution single-cell sequencing. That might sound odd because everyone is doing single-cell sequencing nowadays, but it is not yet high resolution, with few exceptions. Most single-cell sequencing polls the 3' ends of the abundant protein-coding transcripts. There are two problems with this. One is unknown to most people: the 3' UTRs of many genes are not necessarily co-expressed with their associated protein-coding sequences but can be expressed independently. The evidence for this comes from both in situ hybridization and genetics. So, the fact that you get a 3' end in your sequencing data set does not mean that the actual protein is being produced.
The more significant problem is that most sequencing does not poll the splice variants. LncRNAs are almost universally alternatively spliced, potentially varying their protein cargoes and genomic targets at every one of the ~60 trillion cell fate decisions that must be made during development. Therefore, we need a high-resolution analysis of the transcriptional output of cells at every stage. That will be impossible in humans, but it should be possible in mice. You need to be able to sequence the entire length of all transcripts – all mRNAs, lncRNAs and small RNAs – in a cell.
Where do you see problems with today's molecular biology?
One is the lack of reproducibility of many findings in molecular cell biology. There are two big causes: One is based on self-interest, and the other is cell culture. When I was director of a medical research institute, a senior colleague asked my permission to spend five million dollars to buy a large batch of fetal calf serum. I asked him why he would want to buy such a massive amount all at once. He answered that it would provide a stable five-year supply of serum, which would guarantee reproducibility of experiments. I asked if he realized what he had just said, namely that the experiments may not be reproducible by others; using fetal calf serum or anything undefined in experiments can give you batch effects. The only way around it is to repeat the experiments with three or five batches and see if you get the same result independent of the batch used.
The other reason is psychological. If you work on a specific gene or pathway, then it is in your interest to ensure that that gene or pathway is important because your career and grant applications depend on it. So, the design and interpretation of every experiment is subconsciously dedicated to this proposition.
A background problem is founder fallacies and validation creep. There is an excellent article by Marc Halfon on this topic with reference to the understanding of enhancers. Early on, the idea was put forward that enhancers bind transcription factors and then loop around to bring those transcription factors into contact with the promoter of the target genes. That is the standard model, but there is no evidence whatsoever for transcription factor crosstalk, beyond the fact that enhancers cause local topological rearrangements in chromatin. It was just a conceptual proposal, but such generalizations often become founder fallacies. The proposition may have been a reasonable hypothesis, but became the conventional explanation and an article of faith, which has biased the interpretation of experimental data ever since.
How did your research journey lead to RNA?
I did my PhD in the seventies during the period when the cloning revolution was underway. I studied DNA replication in yeast mitochondria. For my postdoc, I went to Houston to work on fatty acid synthase. Regarding RNA, a couple of months into my postdoc, I talked on Friday night over a beer to a friend who told me about introns.
So, from there on, I have been intrigued by the RNA that is transcribed but not translated. My immediate response to hearing about introns from my friend was that there might be some other form of information being transacted by these sequences. However, the data and technologies were limited in those days so exploring the idea was difficult. Nevertheless, it remained my intellectual hobby and in the early nineties I started actively working on the idea that other information was being transacted by RNA, especially in complex organisms. I see biology primarily in terms of information rather than chemistry. Sure, the information is transacted by chemistry, and it is essential to understand it, but my interest is the type of information in genomes.
What advice would you give to young researchers besides being good at bioinformatics?
Do what suits your soul and do your best. There is no inferiority or superiority in different career paths, and everybody is different.
Find time to read and think. One of the problems with modern science, which should not be understood as a criticism of scientists, is that we do not have enough time to read and think. Thinking is crucial for planning experiments, and there is no point in executing improperly designed experiments.
Look for the things that do not make sense. The unexpected results are the most exciting, and you should follow up on them rather than just put them aside because they do not fit the current way of thinking. Unusual observations usually lead to new insights. As well, you should always be thinking about the generalizability of what you do, keeping in mind that your interpretation may not be correct.
Above all, be curious.
The origin story and emergence of molecular biology is muddled. The early triumphs in bacterial genetics and the complexity of animal and plant genomes complicate an intricate history. This book documents the many advances, as well as the prejudices and founder fallacies. It highlights the premature relegation of RNA to simply an intermediate between gene and protein, the underestimation of the amount of information required to program the development of multicellular organisms, and the dawning realization that RNA is the cornerstone of cell biology, development, brain function and probably evolution itself. Key personalities, their hubris as well as prescient predictions are richly illustrated with quotes, archival material, photographs, diagrams and references to bring the people, ideas and discoveries to life, from the conceptual cradles of molecular biology to the current revolution in the understanding of genetic information.
Cover image and book description from the book by John Mattick & Paulo Amaral “RNA, the Epicenter of Genetic Information” published under a CC BY-NC-ND 4.0 license.
In 1978, John Mattick obtained his PhD from Monash University, Melbourne, Australia, researching mitochondrial DNA replication and mutation. For his postdoc, he studied the organization and function of the fatty acid synthase complex at the Baylor College of Medicine, Houston, USA. As an independent group leader, he worked first at the Commonwealth Scientific and Industrial Research Organisation in Sydney, Australia, developing a DNA vaccine and studying bacterial type IV pilus assembly. In 1988, he moved to the University of Queensland in Brisbane, Australia, where his research increasingly became focused on non-coding RNA. From 2012 to 2018, he served as the Executive Director of the Garvan Institute of Medical Research in Sydney. Parallel to studying the role of non-coding RNAs, John Mattick developed several clinical sequencing initiatives, starting at the Garvan Institute and continuing during his time in the UK as Chief Executive of Genomics England, before returning to UNSW in Sydney in 2020 as Professor of RNA Biology.
Interview conducted by Dominik Theler on May 6, 2022.