**Review Article**

A summary of some of our group's recent work is published in the Accounts of Materials Research (Feb 2022):

https://pubs.acs.org/doi/10.1021/accountsmr.1c00238

**Summary Poster**


**Combining Physics-Based Priors with Machine Learning**

Machine learning approaches, particularly deep neural networks, continue to prove extraordinary interpolators: given a large enough dataset, a neural network trained on an appropriate representation will almost certainly perform well on similar data, but it is likely to transfer poorly to new situations. However, in materials and chemistry we have strong physics-based priors for functional forms (harmonics, exponentials, etc.), symmetries and invariants (rotation, permutation, point groups, etc.) and statistical behavior (statistical mechanics). We are interested in combining powerful back-propagation and reinforcement-learning-based approaches with these strong priors to create accurate and fast simulators.
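As a small illustration of what a symmetry prior looks like in practice, the sketch below builds a descriptor that is unchanged by rotation and by reordering of the atoms. It is a generic textbook construction, not the representation used in our models; the function names and coordinates are made up for the example, and element types are ignored for brevity.

```python
import math
from itertools import combinations

def sorted_pair_distances(coords):
    """Return all interatomic distances, sorted ascending.

    Distances do not change under rotation or translation, and sorting
    removes any dependence on the order in which atoms are listed.
    """
    return sorted(math.dist(a, b) for a, b in combinations(coords, 2))

def rotate_z(coords, angle):
    """Rotate every point about the z axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in coords]

# A bent three-atom arrangement (arbitrary illustrative coordinates).
atoms = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]

# Rotate the geometry and list the atoms in reverse order.
transformed = rotate_z(atoms, 0.7)[::-1]

original = sorted_pair_distances(atoms)
moved = sorted_pair_distances(transformed)
```

A model that only ever sees `original` or `moved` cannot distinguish the two geometries, so it does not need to learn rotational or permutational invariance from data.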

**Modeling complex reactive potential energy surfaces with active learning**

Modeling dynamical effects in chemical reactions typically requires *ab initio* molecular dynamics (MD) simulations, but these simulations are computationally expensive and hence restricted to lower-accuracy electronic structure methods and sparse sampling. By combining high-throughput *ab initio* calculations, graph-convolution interatomic potentials, active learning and transfer learning, we have been able to accelerate these simulations by a factor of 1000 over the traditional *ab initio* MD approach.

**Ang, S.J., Wang, W., Schwalbe-Koda, D., Axelrod, S. & Gómez-Bombarelli, R. Active Learning Accelerates Ab Initio Molecular Dynamics on Pericyclic Reactive Energy Surfaces. (2020).**
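One common way to decide which geometries deserve a new *ab initio* calculation is query-by-committee: train several models and label the points where they disagree most. The sketch below assumes that selection rule for illustration; the paper should be consulted for the acquisition strategy actually used. The function name and the energies are hypothetical.

```python
from statistics import pvariance

def select_for_labeling(ensemble_predictions, n_select):
    """Pick the geometries on which the ensemble disagrees most.

    ensemble_predictions[m][i] is the energy that model m predicts for
    geometry i. Returns the indices of the n_select geometries with the
    largest variance across models, most uncertain first.
    """
    n_geometries = len(ensemble_predictions[0])
    variances = [
        pvariance([model[i] for model in ensemble_predictions])
        for i in range(n_geometries)
    ]
    ranked = sorted(range(n_geometries), key=lambda i: variances[i], reverse=True)
    return ranked[:n_select]

# Three hypothetical models, four candidate geometries (made-up energies).
predictions = [
    [-1.00, -0.50, 0.20, 1.10],
    [-1.01, -0.10, 0.21, 1.60],
    [-0.99, -0.90, 0.19, 0.60],
]
to_label = select_for_labeling(predictions, n_select=2)
```

In a full loop, the selected geometries would be sent to the reference electronic structure method, added to the training set, and the models retrained before the next round of sampling.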

**Differentiable Molecular Simulations**

Molecular dynamics simulations use statistical mechanics at the atomistic scale to enable both the elucidation of fundamental mechanisms and the engineering of matter for desired tasks. The behavior of molecular systems at the microscale is typically simulated with differential equations parameterized by a Hamiltonian, or energy function. In order to derive predictive microscopic models, one wishes to infer a molecular Hamiltonian that agrees with observed macroscopic quantities. For engineering, one wishes to control the Hamiltonian to achieve desired simulation outcomes. In both cases, the goal is to modify the Hamiltonian such that emergent properties of the simulated system match a given target. We are interested in exploring differentiable molecular dynamics simulations. In these, bulk target observables and simulation outcomes can be analytically differentiated with respect to Hamiltonians, opening up new routes for parameterizing Hamiltonians to infer macroscopic models and to develop control protocols for external parameters.

**Controlling a polymer fold with Graph Neural Networks**

**Wang, W., Axelrod, S. & Gómez-Bombarelli, R. Differentiable Molecular Simulations for Control and Learning. (2020). arXiv: https://arxiv.org/abs/2003.00868**
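To make the idea of differentiating a simulation outcome with respect to the Hamiltonian concrete, here is a minimal sketch for a one-dimensional harmonic oscillator. It assumes a single Hamiltonian parameter (the spring constant `k`) and propagates the derivatives by hand through every velocity-Verlet step, as a stand-in for the automatic differentiation a real implementation would rely on. All names and parameter values are illustrative.

```python
def simulate(k, x0=1.0, v0=0.0, mass=1.0, dt=0.01, n_steps=500):
    """Velocity-Verlet dynamics under H = p^2 / (2 m) + k x^2 / 2.

    Returns the time average of x^2 and its derivative with respect to k.
    The derivative is obtained by propagating dx/dk and dv/dk alongside
    x and v, i.e. by differentiating every integrator step.
    """
    x, v = x0, v0
    dx, dv = 0.0, 0.0            # initial conditions do not depend on k
    total, dtotal = 0.0, 0.0
    for _ in range(n_steps):
        force = -k * x
        dforce = -x - k * dx     # d(force)/dk by the product rule
        v_half = v + 0.5 * dt * force / mass
        dv_half = dv + 0.5 * dt * dforce / mass
        x = x + dt * v_half
        dx = dx + dt * dv_half
        force = -k * x           # force at the updated position
        dforce = -x - k * dx
        v = v_half + 0.5 * dt * force / mass
        dv = dv_half + 0.5 * dt * dforce / mass
        total += x * x
        dtotal += 2.0 * x * dx   # chain rule for the observable x^2
    return total / n_steps, dtotal / n_steps

observable, gradient = simulate(k=2.0)

# Cross-check the propagated derivative against a central finite difference.
eps = 1e-6
finite_difference = (simulate(2.0 + eps)[0] - simulate(2.0 - eps)[0]) / (2 * eps)
```

The returned gradient is what an optimizer would use to adjust `k` until the simulated observable matches a target value.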

**Auto-encoders for coarse-grained dynamics**

Molecular dynamics simulations provide theoretical insight into the microscopic behavior of materials in the condensed phase and, as a predictive tool, enable computational design of new compounds. However, because of the large temporal and spatial scales involved in thermodynamic and kinetic phenomena in materials, atomistic simulations are often computationally infeasible. Coarse-graining methods allow simulating larger systems, by reducing the dimensionality of the simulation, and taking longer timesteps, by averaging out fast motions. Coarse-graining involves two coupled learning problems: defining the mapping from an all-atom to a reduced representation, and parametrizing a Hamiltonian over the coarse-grained coordinates. Multiple statistical mechanics approaches have addressed the latter, but the former is generally a hand-tuned process based on chemical intuition. We developed Autograin, an optimization framework based on auto-encoders to learn both tasks simultaneously [1]. Autograin is trained to learn the optimal mapping between the all-atom and reduced representations, using the reconstruction loss to facilitate the learning of coarse-grained variables. In addition, a force-matching method is applied to variationally determine the coarse-grained potential energy function. This procedure is tested on a number of model systems, including single-molecule and bulk-phase periodic simulations.

**Training a Coarse-Grained auto-encoder to parametrize discrete mapping and coordinates reconstruction**

**Wang, W. & Gómez-Bombarelli, R. Coarse-graining auto-encoders for molecular dynamics. (2019).** *npj Comput. Mater.*
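The two ingredients named above, a mapping from atoms to beads and a force-matching objective, can be sketched in a few lines. This example assumes a fixed, hand-chosen centre-of-mass mapping in which every atom belongs to exactly one bead, whereas Autograin learns the mapping; it is meant only to show what the quantities are. All function names and numbers are hypothetical.

```python
def map_to_beads(positions, masses, assignment, n_beads):
    """Centre-of-mass mapping: assignment[i] is the bead that atom i joins."""
    bead_mass = [0.0] * n_beads
    bead_pos = [[0.0, 0.0, 0.0] for _ in range(n_beads)]
    for pos, m, b in zip(positions, masses, assignment):
        bead_mass[b] += m
        for d in range(3):
            bead_pos[b][d] += m * pos[d]
    return [[c / bead_mass[b] for c in bead_pos[b]] for b in range(n_beads)]

def map_forces(forces, assignment, n_beads):
    """For this mapping, the force on a bead is the sum over its atoms."""
    bead_force = [[0.0, 0.0, 0.0] for _ in range(n_beads)]
    for f, b in zip(forces, assignment):
        for d in range(3):
            bead_force[b][d] += f[d]
    return bead_force

def force_matching_loss(mapped_forces, model_forces):
    """Mean squared deviation between mapped and coarse-grained forces."""
    squared = [
        (a - b) ** 2
        for fa, fb in zip(mapped_forces, model_forces)
        for a, b in zip(fa, fb)
    ]
    return sum(squared) / len(squared)

# Four atoms on a line, grouped into two beads (made-up data).
positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
masses = [1.0, 3.0, 1.0, 1.0]
forces = [(1.0, 0.0, 0.0), (-0.5, 0.0, 0.0), (0.2, 0.0, 0.0), (0.3, 0.0, 0.0)]
assignment = [0, 0, 1, 1]

beads = map_to_beads(positions, masses, assignment, n_beads=2)
mapped = map_forces(forces, assignment, n_beads=2)

# Forces that a hypothetical coarse-grained potential predicts for the beads.
model_forces = [[0.5, 0.0, 0.0], [0.0, 0.0, 0.0]]
loss = force_matching_loss(mapped, model_forces)
```

Minimizing this loss over the parameters of the coarse-grained potential is the force-matching step; learning `assignment` itself is the part handled by the auto-encoder.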

**Message-passing neural networks for classical force field parameterization**

Classical force fields use simple energy terms to decompose the total energy of molecular systems. Because of their speed and reasonable accuracy, they are the workhorse of atomistic simulations at large length and time scales. However, because they are typically parametrized based on hand-defined atomic types, they are hard to transfer to unseen chemistries and rely on careful, mostly manual parametrization. We are working to utilize deep neural networks to map chemical environments to classical force field terms in a continuous fashion. Message-Passing Neural Networks use the chemical graph to predict the best classical energy terms for any given moiety, based on the available training data.
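The toy sketch below shows the general shape of this idea: atom features are mixed along the bonds of the chemical graph, and the resulting environment-dependent features are read out as the parameters of a harmonic bond term. The update rule, the readout and every coefficient are invented for illustration; a trained network would replace them with learned functions.

```python
import math

def message_passing(features, bonds, n_rounds=2):
    """Repeatedly mix each atom's feature with the mean over its neighbours."""
    neighbours = {i: [] for i in range(len(features))}
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)
    h = list(features)
    for _ in range(n_rounds):
        updated = []
        for i, h_i in enumerate(h):
            if neighbours[i]:
                mean = sum(h[j] for j in neighbours[i]) / len(neighbours[i])
                updated.append(0.5 * h_i + 0.5 * mean)
            else:
                updated.append(h_i)  # isolated atom keeps its feature
        h = updated
    return h

def bond_parameters(h_i, h_j):
    """Hypothetical readout; symmetric in the two atoms of the bond."""
    pair = h_i + h_j
    force_constant = 300.0 + 20.0 * pair  # made-up coefficients
    rest_length = 1.0 + 0.05 * pair
    return force_constant, rest_length

def bond_energy(coords, features, bonds):
    """Sum of harmonic terms 0.5 * k * (r - r0)^2 over all bonds."""
    h = message_passing(features, bonds)
    energy = 0.0
    for i, j in bonds:
        k, r0 = bond_parameters(h[i], h[j])
        r = math.dist(coords[i], coords[j])
        energy += 0.5 * k * (r - r0) ** 2
    return energy

# Three-atom chain; atoms 0 and 1 start with the same feature value.
features = [0.0, 0.0, 1.0]
bonds = [(0, 1), (1, 2)]
coords = [(0.0, 0.0, 0.0), (1.1, 0.0, 0.0), (2.3, 0.0, 0.0)]

h = message_passing(features, bonds)
energy = bond_energy(coords, features, bonds)
```

Atoms 0 and 1 would share a single atom type in a lookup-table force field, but after message passing they carry different features because their environments differ, so their bonds receive different parameters.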

**Representation Learning**

**Using conformer ensembles to improve property prediction**

Machine learning has shown great promise for predicting molecular properties. Typically one uses a 2D molecular graph or a single 3D structure as input for the model. In reality, however, molecules are neither a 2D structure nor a single 3D structure, but rather a set of continuously interconverting conformers. Moreover, one or a few conformers with low statistical weight may be the main determinant of a property. For example, a single conformer may be responsible for a drug successfully binding a target protein. We are developing molecular representations that utilize the conformer ensemble and learn the importance of any one conformer to the property.

**Molecular representations of the latanoprost molecule. Top: SMILES string. Left: stereochemical formula. Right: overlay of conformers.**

**Axelrod, S. & Gómez-Bombarelli, R. GEOM: Energy-annotated molecular conformations for property prediction and molecular generation. (2020).**
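The statistical weight of a conformer mentioned above is its Boltzmann weight, which follows directly from the conformer energies. The sketch below computes these weights and a weighted property average as a simple baseline. It is not the learned conformer weighting we are developing; the energies and property values are invented, and room temperature with energies in kcal/mol is assumed.

```python
import math

# Gas constant in kcal/(mol K) times an assumed temperature of 298.15 K.
KT_KCAL_PER_MOL = 0.0019872 * 298.15

def boltzmann_weights(energies, kt=KT_KCAL_PER_MOL):
    """Statistical weight of each conformer from its energy."""
    lowest = min(energies)  # shift energies for numerical stability
    factors = [math.exp(-(e - lowest) / kt) for e in energies]
    partition = sum(factors)
    return [f / partition for f in factors]

def ensemble_average(values, weights):
    """Weighted average of a per-conformer property."""
    return sum(v * w for v, w in zip(values, weights))

energies = [0.0, 0.5, 2.0]     # made-up conformer energies, kcal/mol
properties = [1.0, 1.2, 5.0]   # made-up per-conformer property values

weights = boltzmann_weights(energies)
average = ensemble_average(properties, weights)
```

In this made-up case the third conformer has by far the largest property value but the smallest weight, which is the situation where a plain Boltzmann average can understate the conformer that matters and a learned weighting may help.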

**Materials Design**

**Inverse design**

Most chemists understand molecules as undirected graphs with atoms as the nodes and bonds as the edges.* Carrying out global optimization over such a discrete space is very difficult: the existing arsenal of optimization algorithms works much better if gradients are available. The domain of molecules has hard constraints about which graphs are valid (think pentavalent carbon) and also soft constraints regarding which molecules are desirable and makeable. In computer-driven molecular discovery projects, all these constraints are typically captured using hand-made rules, both to define the search domain and to filter poor candidates.

We use deep generative models to create a continuous embedding of molecules on which we can apply gradient-based optimization [2]. Generative models have the ability to capture the statistics that define the training data, which ideally would result in an embedding that inherently captures the chemistry that chemists can make.

*Theoretical chemists might think of 3D arrangements of nuclei that are local minima on the potential energy hypersurface, or perhaps of an ensemble of thermally accessible minima. But most chemists are not theoretical chemists.

**R. Gómez-Bombarelli et al. Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules. (2018) ACS Central Science**
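Once molecules live in a continuous space, optimization reduces to following the gradient of a property predictor defined on that space. The sketch below shows only that step, using a two-dimensional latent vector and a toy quadratic surrogate whose maximum is placed by hand. The encoder, the decoder and the trained property predictor of a real generative model are all omitted, and every name here is hypothetical.

```python
def property_surrogate(z, optimum=(1.0, -2.0)):
    """Toy differentiable property predictor with a single maximum."""
    return -sum((zi - oi) ** 2 for zi, oi in zip(z, optimum))

def surrogate_gradient(z, optimum=(1.0, -2.0)):
    """Analytic gradient of the toy surrogate with respect to z."""
    return [-2.0 * (zi - oi) for zi, oi in zip(z, optimum)]

def gradient_ascent(z, step=0.1, n_steps=100):
    """Move the latent point uphill on the surrogate property."""
    for _ in range(n_steps):
        grad = surrogate_gradient(z)
        z = [zi + step * gi for zi, gi in zip(z, grad)]
    return z

z_start = [0.0, 0.0]   # latent code of a hypothetical starting molecule
z_opt = gradient_ascent(z_start)
```

In a full pipeline, `z_opt` would be passed through the decoder to obtain candidate molecules, which then still need to be checked for validity and synthesizability.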

**High-throughput virtual screening**

High-throughput virtual screening combines computational investigation with experiments to optimize a given property within vast libraries of compounds. By rationally deciding which materials will be simulated or synthesized, the best candidates can be found among a combinatorially large number of options.

**Open-Source Code**

The open-source repositories for our work are collected in our group GitHub organization:

https://github.com/learningmatter-mit

**Funding Sources**