From laborer …
The use of mechanical computation in scientific research is not new, and neither is the involvement of leading scientists in creating such machines to help them with their work. The Pascaline mechanical calculator, for example, was invented by mathematician and physicist Blaise Pascal when he was just 19 years old, in 1642. The invention came about as Pascal, who was assisting his father, a tax commissioner, grew increasingly bored with adding long columns of numbers. The Pascaline could only add and subtract, but the mathematician and philosopher Gottfried von Leibniz extended Pascal’s ideas with his invention of the Stepped Reckoner, believing that “it is unworthy of excellent men to lose hours like slaves in the labor of calculation, which could be safely relegated to anyone else if machines were used.” Johannes Kepler preceded both Pascal and von Leibniz in using a calculating machine in his revolutionary astronomical research. The machine used by Kepler, nicknamed Speeding Clock, was developed by Wilhelm Schickard in 1623.
From the 17th century until not long ago, computers served as no more than research tools. Like Pascal when he was helping with his father’s tax calculations, they performed essential but repetitive and uninspired work. Unlike Pascal in his later work, they did not contribute towards reviewing existing scientific knowledge, forming hypotheses, planning experiments or interpreting the scientific meaning of experimental results.
… to apprentice …
The history of computers as research assistants started at least thirty years ago. In 1978, at Stanford University, Mark Stefik and Peter Friedland started developing the MOLGEN program, designed for use in molecular biology research (not to be confused with the later Molgen Program developed at the University of Bayreuth in Germany for computing molecular structures). MOLGEN included knowledge of various tools and experimental methods for manipulating DNA, such as cutting and splicing, as well as techniques for testing and measuring the results of the manipulation, chemical processes for accelerating or inhibiting chemical reactions, and so on.
As inspiration for the development of MOLGEN, the team studied the process of experimentation and discovery in a 13-year-long research program led by Charles Yanofsky, which focused on regulatory mechanisms in E. coli bacteria. Through interviews with Yanofsky and others on his team, they identified concrete evidence for four types of reasoning. Briefly, these were: Data-Driven Reasoning, where empirical data drives alterations to the proposed theory; Theory-Driven Reasoning, where the growing theory itself suggests further extensions (for example, when Yanofsky’s team discovered that an area of DNA had a role in regulating the activation of a nearby gene, the theory suggested that there exists a protein which binds to that region); Analogy to Other Biological Systems, where the work of other researchers serves to validate the theories being developed and suggests improvements; and Analogy to Distantly Related Systems, where a seemingly irrelevant domain provides inspiration (for example, some DNA researchers compared the DNA transcription mechanisms to control techniques in railroads). All of these types of reasoning were already well-known, but the MOLGEN team managed to document them in a concrete domain and follow up by partially codifying them into an artificial reasoning system.
MOLGEN had two main parts. One part held abstract schemes of experiments, which may be considered “master templates” for experimental designs. When this part was presented with specifications of experimental needs, it selected an appropriate template and filled in the empty “slots” in the template to match the experimenter’s requirements, following the rules in its knowledge base. The other part of MOLGEN was able to generate a highly detailed experimental design based on such an experimental plan, specifying the sub-steps, the tools, the necessary materials, the correct sequence, and so on. The process of generating a detailed plan from a “skeletal plan”, described in a later paper, was again based on human strategies. Humans often attempt to build a new plan from known existing plans, and they have strategies for finding a solution when no known plan satisfies the experiment’s needs. In the paper, the process was explained in this way: “For example, suppose the experimenter cannot find a workable instantiation for a denaturation (strand-separation) step. He uses his dictionary to find that denaturation really means breaking the hydrogen bonds between the two strands of DNA. He then attempts to find a new skeletal plan for the subgoal of hydrogen bond breaking and repeats the experiment design process for that skeletal plan. The idea is that he is now treating denaturation as the goal of an experiment design problem, that is, that subgoals are generated when instantiation fails.” Besides planning an experiment, MOLGEN could also “read” an existing experimental plan and report on its completeness, correctness, and feasibility.
MOLGEN served as inspiration for further research in the formation of scientific theories (including its direct descendant, the MOLGEN-II Project), as well as for other research projects into deductive reasoning and into plan formation.
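For readers who think in code, the refinement of a skeletal plan can be sketched in a few lines of Python. This is only an illustration of the idea in the quoted passage, not MOLGEN’s actual implementation: the step names, the TECHNIQUES table, and the SUBGOALS “dictionary” are all hypothetical.

```python
# Hypothetical knowledge base: abstract step -> concrete techniques that are
# currently workable (an empty list means instantiation fails).
TECHNIQUES = {
    "denaturation": [],
    "break-hydrogen-bonds": ["heat the sample"],
    "hybridization": ["cool slowly with probe"],
    "detection": ["autoradiography"],
}

# Hypothetical "dictionary": what an abstract step really means, written as
# a skeletal plan for the corresponding subgoal.
SUBGOALS = {
    "denaturation": ["break-hydrogen-bonds"],
}


def instantiate(skeletal_plan):
    """Refine a list of abstract steps into (step, technique) pairs.

    When a step has no workable technique, it is treated as the goal of a
    new design problem: the skeletal plan for its subgoal is refined instead.
    """
    detailed = []
    for step in skeletal_plan:
        options = TECHNIQUES.get(step, [])
        if options:
            detailed.append((step, options[0]))
        elif step in SUBGOALS:
            detailed.extend(instantiate(SUBGOALS[step]))
        else:
            raise ValueError(f"no plan found for step: {step}")
    return detailed


plan = instantiate(["denaturation", "hybridization", "detection"])
```

In this toy run the denaturation step cannot be instantiated directly, so the sketch falls back on the subgoal of breaking hydrogen bonds, just as the experimenter in the quotation does.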
… to scientist?
In order to present one of the most fascinating projects involving this meaning of “computer science”, we first need a detour into how scientists study cellular biology. To date, we have sequenced the genomes of many organisms. In other words, we have read their genetic code. But have we fully decoded it? Not by a long shot. Even for the simpler organisms, it remains a tough challenge to discover what the function of each gene actually is and how it participates in the incredibly complex network of interactions between genes, intracellular processes, intercellular processes, and the overall environment. This is a highly important challenge for early 21st-century science, and its results will probably continue to gain influence over every aspect of our lives.
While we might be more attracted to decoding the human genetic code, it is also highly useful to study much simpler organisms. For instance, extensive research has been conducted on baker’s yeast (Saccharomyces cerevisiae), which is the main type of yeast used in baking and brewing. This unicellular organism has become an important model for understanding the function and organization of eukaryotic cells. While it is not simple by any means, it is simpler than some other forms of eukaryotic life (such as multicellular organisms), and its rapid lifecycle enables researchers to try out many experimental procedures efficiently and quickly. It has about 6,000 genes, and it is estimated that about 23% of them are shared with humans.
In order to understand the yeast cell, it is not enough to view the list of 13 million base pairs comprising its genome. What does each gene do? When is it activated and deactivated? What is its effect on other genes, other cellular molecules, and the functioning of the cell? How do all these moderate the activity of this gene? Professor Steve Oliver of the University of Manchester explains: “we aim to determine how the 6,000 or so genes in the yeast genome interact to allow this simple eukaryotic cell to grow, divide, develop, and respond to environmental changes. If this fully integrative view of the yeast cell can be obtained, it should provide an important ‘navigational aid’ to guide our studies of more complex genomes, such as those of humans, crop plants, and farm animals. We are carrying out functional genomic experiments at four levels of analysis: the genome, transcriptome, proteome, and metabolome. These represent, respectively, the complete complement of genes, mRNA molecules, proteins, and metabolites present in the yeast cell.”
To reach such a holistic understanding, many experiments are required. To take a simplified example, consider a gene G1 (part of the genome), which codes for a protein P1 (part of the proteome). P1 participates in the metabolization of nutrient molecule M1. To discover the fact that G1 takes part in “digesting” M1 (part of the metabolome), we need to observe it being activated and transcribed into the mRNA (part of the transcriptome) which causes the production of P1. We also need to know or discover P1’s reaction with M1, or with another molecule which is produced from M1 during its metabolization. All of these processes and observations might not take place unless M1 is present in the yeast cell’s environment. A simple mechanism for M1 to trigger the activation of G1, and thereby the production of the molecular apparatus required for digesting M1, is for M1 to bind to the activation site of yet another protein, P2. The structure of P2 in the absence of M1 is such that it can bind to a location on the chromosome adjacent to G1 and inhibit G1’s activation. The presence of M1 changes the structure of P2 so that it no longer inhibits G1. P2, of course, is produced from the information coded in another gene – call it G2.
Sounds complicated? Not at all, by the standards of molecular biology, where far more complex webs of interaction are commonplace. Yet, how can scientists unravel the above process? After all, the two genes and proteins mentioned in this example are part of a far larger roster of players in the complex and ongoing cell activities, where molecules are constantly being produced or taken apart, molecules bond and change their structures and activity, and genes are turned either ‘on’ or ‘off’. The way to tease out smaller “plot lines” in this multi-player drama is through extensive experiments, where scientists vary the genome, e.g. by removing (or controlling the activation of) genes which have been implicated in the biochemical process being studied. The scientists also vary the environment, e.g. by controlling the availability of metabolites (such as nutrient molecules). Observing the changes in the metabolic activity as well as in the production of various proteins and mRNA molecules in each such experiment can supply the required information to discover these webs of interaction.
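The G1/G2 example can be written down as a tiny Boolean model, and the experiments just described (varying the genome and the environment) become a loop over its inputs. This is a deliberately crude sketch: real regulation is quantitative rather than on/off, and the function and parameter names are invented for the illustration.

```python
def observe(m1_present, g1_intact=True, g2_intact=True):
    """Toy model of the example: return what an experiment would measure."""
    p2_produced = g2_intact                       # G2 codes for P2
    p2_blocks_g1 = p2_produced and not m1_present  # M1 reshapes P2
    g1_transcribed = g1_intact and not p2_blocks_g1
    p1_produced = g1_transcribed                  # G1 codes for P1
    return {
        "g1_mrna": g1_transcribed,
        "m1_digested": m1_present and p1_produced,
    }


# Vary the genome (knock-outs) and the environment (M1 absent or present).
STRAINS = {
    "wild type": {},
    "G1 knocked out": {"g1_intact": False},
    "G2 knocked out": {"g2_intact": False},
}

results = {
    (strain, m1_present): observe(m1_present, **knockouts)
    for strain, knockouts in STRAINS.items()
    for m1_present in (False, True)
}
```

Even in this toy version, no single experiment tells the whole story. Only the G2 knock-out grown without M1 reveals that G1 is normally held “off” by the product of G2, because it is the one case where G1’s mRNA appears although the nutrient is absent.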
This detour has hopefully demonstrated how important this kind of research is, and why many experiments are required in order to achieve the kind of goals Professor Oliver has set – holistic, fully integrative understanding of cellular function. Returning to the subject of “computer as scientist”, we can pose the question: is it possible to mechanize significant portions of this type of research?
In 2004, a large group of British biologists and computer scientists reported the development of a “robot scientist” dedicated to just this purpose.
This “robot scientist” may not be impressive visually (see picture below). It is composed of several computers linked to machinery that controls the flow of various nutrients into a yeast culture and measures the growth and biochemistry of that culture. However, exterior impressions can be misleading: while the robot continues the tradition started by MOLGEN three decades earlier, it incorporates at least two major advances. Firstly, it can generate hypotheses from observed facts, and then plan experiments to support or disprove those hypotheses. Secondly, it fully closes the loop by independently performing the experiments that it has planned, deriving new facts from the results of these experiments, and using the new findings to form yet more hypotheses leading to new experiments.
The part of the software which builds the hypotheses has a modus operandi similar to that of a police detective trying to solve a crime mystery. The machine seeks the assumptions that best explain the known facts. In logic, such a thinking process is called “abduction” – inferring the explanation from the evidence. This method can be considered as the opposite of deductive thinking. Deduction infers effects given known causes, whereas abduction infers causes given known effects. The results of abduction may, of course, be wrong, since there may be many potential explanations for any combination of items of evidence. This is why the scientific method calls for supporting or disproving such reasoning through the deduction of additional effects under the assumption that the inferred explanation is correct, and then designing an experiment to look for these effects.
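The difference between the two directions of inference is easy to show in code. The sketch below is purely illustrative and is not the Robot Scientist’s software: the RULES table, with its two candidate gene functions and the effects each would predict, is made up for the example.

```python
# Hypothetical rules: each candidate cause maps to the effects it predicts.
RULES = {
    "gene codes for enzyme A": {
        "no growth on minimal medium",
        "growth restored by metabolite X",
    },
    "gene codes for enzyme B": {
        "no growth on minimal medium",
        "growth restored by metabolite Y",
    },
}


def deduce(cause):
    """Deduction: from a known cause to the effects it implies."""
    return RULES[cause]


def abduce(observed):
    """Abduction: from observed effects to every cause that explains them."""
    return sorted(c for c, effects in RULES.items() if observed <= effects)


def discriminating_effects(causes):
    """Effects predicted by some, but not all, of the candidate causes.

    Assumes at least one candidate cause is given.
    """
    predicted = [deduce(c) for c in causes]
    return set.union(*predicted) - set.intersection(*predicted)


candidates = abduce({"no growth on minimal medium"})
to_test = discriminating_effects(candidates)
```

Here abduction leaves two rival explanations for the single observation, and deduction from each of them points to the metabolite experiments that could tell them apart.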
Strengths and limitations
The robot built by the Robot Scientist project can form hypotheses, design experiments to test the hypotheses, and repeat this process to generate new scientific knowledge. In my opinion, this justifies the label of “scientist”. Still, the robot can fulfill the role of scientist only within strictly circumscribed contexts.
In order to enable the robot to form new hypotheses, known information must be categorized and possible inferences need to be represented. The research team created a rich mathematical model of the interactions between the four levels being studied (genome, transcriptome, proteome, and metabolome), and of the experimental techniques available for designing new experiments. This model is the key ingredient of the robot’s ability to follow the scientific method. It is an impressive achievement, and it may well be the first occasion where Artificial Intelligence has the potential to create “something new” in science, independently of any human being.
Still, the boundaries of this rich model are the boundaries of the robot’s science. For example, if the model does not include the fact that the same protein might appear in two different three-dimensional structures (as has been found in research on prions), there is no way for the robot to acquire this knowledge, and thus the robot may fail to make progress in a potentially promising direction. It is also unclear whether such modeling can be extended to a wider research domain in the same field, such as studying multicellular life, where the experimental conditions may affect the development and growth of a creature with many different types of cells. In any case, any such extension of the model must be performed by the robot’s human designers. This robot is unable to modify and improve its own internal models, though it can and does enrich the facts and relationships represented using its models. Lastly, other fields of research may be less amenable to formal mathematical modeling – for example, studies and experiments in cognitive psychology.
These limitations imply that even if we grant the robot the rank of “scientist”, its scientific potential cannot (yet?) compete with that of the average human scientist. Still, its real contribution to science means that it can accelerate the work of human researchers by going beyond the automation of running experiments and analyzing their results, and delivering some real – though repetitive – scientist-grade achievements. These achievements may contribute towards better and cheaper food production, new medical treatments, and many other benefits of scientific progress.
How does the robot’s performance compare with human performance? To answer this question, the robot’s software was adjusted so that graduate computer scientists and biologists could choose the next experiment, instead of leaving this task to the robot’s mechanisms for hypothesis generation and experiment planning. This gave a measure of human performance, which was then compared to the robot’s performance. The result: the robot performed as well as the best humans. Another test of the robot’s “scientific understanding” was done by comparing its sophisticated experimental planning to two simple methods of planning: choosing the cheapest experiment in each step, or choosing experiments randomly. In all cases, experimental cost included the time required for performing the tests and the price of materials used in these tests. It was found that, to achieve the same level of accuracy in predictions inferred from experimental results, the robot required only 1% of the costs of random experimental plans and 33% of the costs of naïve selection of the cheapest experiments. This eliminates the possibility that the task set to the robot was too simple.
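Why should a planner that weighs information against cost beat one that always grabs the cheapest experiment? A toy comparison makes the point. The sketch below is not the project’s selection algorithm. It assumes four made-up hypotheses, four made-up experiments with arbitrary costs, and a simple heuristic: cost divided by the expected number of hypotheses eliminated, taking all surviving hypotheses as equally likely.

```python
from collections import Counter

HYPOTHESES = ["H1", "H2", "H3", "H4"]

# Hypothetical experiments: name -> (cost, outcome predicted by each hypothesis).
EXPERIMENTS = {
    "A": (1, {"H1": 0, "H2": 0, "H3": 0, "H4": 0}),  # cheap but uninformative
    "B": (2, {"H1": 0, "H2": 0, "H3": 0, "H4": 1}),
    "C": (3, {"H1": 0, "H2": 0, "H3": 1, "H4": 1}),
    "D": (4, {"H1": 0, "H2": 1, "H3": 0, "H4": 1}),
}


def run(strategy, truth):
    """Run experiments until one hypothesis is left; return (cost, survivors)."""
    remaining = list(HYPOTHESES)
    unused = dict(EXPERIMENTS)
    total = 0
    while len(remaining) > 1 and unused:
        name = strategy(remaining, unused)
        cost, predicted = unused.pop(name)
        total += cost
        # Keep only the hypotheses that agree with the observed outcome.
        remaining = [h for h in remaining if predicted[h] == predicted[truth]]
    return total, remaining


def cheapest(remaining, unused):
    """Naive strategy: always pick the cheapest experiment not yet run."""
    return min(unused, key=lambda name: unused[name][0])


def cost_per_elimination(remaining, unused):
    """Pick the experiment with the lowest cost per hypothesis it is
    expected to eliminate."""
    def score(name):
        cost, predicted = unused[name]
        groups = Counter(predicted[h] for h in remaining)
        expected_left = sum(n * n for n in groups.values()) / len(remaining)
        eliminated = len(remaining) - expected_left
        return cost / eliminated if eliminated > 0 else float("inf")
    return min(unused, key=score)


naive_cost, _ = run(cheapest, truth="H1")
informed_cost, _ = run(cost_per_elimination, truth="H1")
```

In this small example the informed strategy never pays for the uninformative experiment A, so it reaches the same conclusion at a lower total cost. The gap here is small only because the example is tiny.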
By another set of measures, the robot scientist is far superior to any human: every day, it performs about a thousand experiments and collects over 200,000 observations. Such accuracy and speed are essential in such fields as drug design, where an immense number of experiments is often required in order to create and run initial tests (“screening”) on potential molecules and potential ways of producing such molecules.
In the past few years, this Robot Scientist has undergone further enhancements and upgrades. In its current incarnation, as described in a recent paper, it “is capable of autonomously designing and initiating >1,000 new strain/nutrient experiments each day, with each experiment lasting up to 4 days, using over 50 different yeast strains”. It continues to perform the double duty of being a researcher in molecular biology as well as being a vehicle of research in artificial intelligence.
The Principal Investigator of the Robot Scientist Project, Professor Ross King of the Department of Computer Science in the University of Wales, Aberystwyth, describes an example of automated discovery where the Robot Scientist independently hypothesized that a certain gene, of previously unknown function, codes for a specific protein. Given the biological activity of that protein, the robot then inferred that if this is the gene’s function, then a mutant strain of yeast where this gene is “knocked out” will not be able to grow on some types of media where normal strains of yeast can grow. It further inferred that the mutant strain would recover its ability to grow if any one of three metabolites was added (because adding these molecules would sidestep the lack of the protein). Performing the growth experiments on this strain, the robot found that the predictions were mostly verified, except for one metabolite whose addition did not recover growth. King postulates an additional enzymatic step to explain this discrepancy (apparently the robot did not postulate it independently). The bottom line: the robot contributed just about all the work required to discover the function of one of the nearly 2,000 yeast genes whose role is unknown, though it needed assistance from a human team member in resolving one difficulty – not unlike human researchers… This illustrates not just the strength of this approach but also the critical need for mechanizing some of this research – there are simply too many genes, proteins, and metabolites even in such simple cells.
Employing robots to perform these kinds of scientific tasks will free human scientists to do what they do best: step back, look at the data, and make the small or large intuitive leaps which keep other scientists – human and robotic – productive.
This column has focused mostly on autonomous robotic research. The next column in this series will look at the implications of automated representation of scientific knowledge, and how these have the potential to contribute to, and collaborate with, human scientific research.
Acknowledgement: Some of this text has previously appeared, in different form, in articles by this author which were published in Galileo, the Israeli Magazine of Science and Thought, and appears here with the kind permission of Galileo Magazine.