Engineered proteins have numerous applications in synthetic biology, impacting fields such as therapeutics, materials, fuels, and agriculture. However, the space of possible protein sequences is so vast that mapping sequence to function, or to its numerical quantification, “fitness”, is extremely challenging. Currently, directed evolution (DE) is the primary experimental method for improving protein fitness. It involves iterative rounds of mutagenesis and screening to accumulate beneficial low-order mutations, which makes DE campaigns often time-consuming and resource-intensive.
Machine learning (ML) holds significant promise to accelerate this process by mapping sequence to fitness in silico, providing platforms for synthetic biology applications at speed and scale. Thus, our goal is to develop easy-to-implement and generalizable ML methods to inform and simplify the wet-lab engineering of diverse proteins.
To achieve this goal, we propose an approach that synergizes information from three modalities: sequence, structure, and biophysical simulations. We test our approach on three datasets: a well-studied public dataset on an immunoglobulin-binding protein domain (GB1) and two lab-generated datasets, one for an enzyme and one for an antibody. The enzyme dataset includes thousands of sequence-fitness pairs for the β-subunit of tryptophan synthase (TrpB), a model enzyme system. The antibody dataset comprises coronavirus-antibody binding data.
Our preliminary results from method development on GB1 show that combining structural information with protein sequence information outperforms both either modality on its own and baselines from prior work. We are still investigating whether simulated structures of mutants, as well as biophysical features such as energy and charge distributions, are useful as an additional modality.
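
As a rough illustration of what combining the sequence and structure modalities can look like in code, the sketch below concatenates one-hot sequence features with a per-variant structural feature vector and fits a ridge regression to fitness. This is only one simple way to fuse modalities and is an assumption for illustration, not necessarily the method used here. All data are synthetic placeholders, and the names `one_hot`, `fit_ridge`, `predict`, and `struct_feats` are hypothetical.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(seq):
    """Flattened one-hot encoding of a sequence (len(seq) * 20 features)."""
    x = np.zeros((len(seq), len(AMINO_ACIDS)))
    x[np.arange(len(seq)), [AA_INDEX[aa] for aa in seq]] = 1.0
    return x.ravel()


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression; the intercept is regularised too, for brevity."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append intercept column
    A = Xb.T @ Xb + alpha * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ y)


def predict(w, X):
    """Apply fitted ridge weights (including intercept) to new feature rows."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ w


# Synthetic placeholder data (assumption): random variants at 4 mutated sites,
# random 8-dimensional "structural" features, and random fitness values.
rng = np.random.default_rng(0)
n_variants, n_sites, n_struct = 200, 4, 8
seqs = ["".join(rng.choice(list(AMINO_ACIDS), size=n_sites)) for _ in range(n_variants)]
seq_feats = np.stack([one_hot(s) for s in seqs])
struct_feats = rng.normal(size=(n_variants, n_struct))
fitness = rng.normal(size=n_variants)

# Fuse the two modalities by concatenating their feature vectors per variant.
combined = np.concatenate([seq_feats, struct_feats], axis=1)

# Fit on the first 150 variants, predict fitness for the remaining 50.
w = fit_ridge(combined[:150], fitness[:150])
preds = predict(w, combined[150:])
```

With real data, `struct_feats` would come from a structure-derived representation and `fitness` from measured values; comparing held-out performance of `seq_feats` alone, `struct_feats` alone, and `combined` is the kind of ablation the comparison above describes.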