Research Areas

Group poster for our research summary

Check out the group poster

Combining physics-based priors with machine learning 

Machine learning approaches, particularly deep neural networks, continue to prove extraordinary interpolators: given a large enough dataset, a neural network trained on an appropriate representation will almost certainly perform well on data similar to its training set, but it is likely to transfer poorly to new situations. In materials and chemistry, however, we have strong physics-based priors for functional forms (harmonics, exponentials, etc.), symmetries and invariants (rotation, permutation, point groups, etc.), and statistical behavior (statistical mechanics). We are interested in combining powerful back-propagation and reinforcement-learning-based approaches with these strong priors to create accurate and fast simulators.

Auto-encoders for coarse-grained dynamics

Molecular dynamics simulations provide theoretical insight into the microscopic behavior of materials in the condensed phase and, as a predictive tool, enable computational design of new compounds. However, because of the large temporal and spatial scales involved in thermodynamic and kinetic phenomena in materials, atomistic simulations are often computationally infeasible. Coarse-graining methods make larger systems accessible by reducing the dimensionality of the simulation, and allow longer timesteps by averaging out fast motions. Coarse-graining involves two coupled learning problems: defining the mapping from an all-atom to a reduced representation, and parametrizing a Hamiltonian over the coarse-grained coordinates. Multiple statistical mechanics approaches have addressed the latter, but the former is generally a hand-tuned process based on chemical intuition. We developed Autograin, an optimization framework based on auto-encoders that learns both tasks simultaneously [1]. Autograin is trained to learn the optimal mapping between the all-atom and reduced representations, using the reconstruction loss to facilitate the learning of coarse-grained variables. In addition, a force-matching method is applied to variationally determine the coarse-grained potential energy function. The procedure has been tested on a number of model systems, including single-molecule and bulk-phase periodic simulations.
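
The two loss terms described above can be sketched in a few lines. This is a minimal illustration under simplifying assumptions, not the Autograin implementation: the atom-to-bead assignment is fixed by hand rather than learned, the decoder simply places every atom at its bead, and the reference force on a bead is taken as the sum of the atomistic forces on its atoms (a common choice for hard, non-overlapping assignments). All function names are hypothetical.

```python
def cg_map(coords, assignment):
    """Map all-atom coordinates to coarse-grained bead positions.

    coords: list of (x, y, z) tuples, one per atom.
    assignment: assignment[i] is the bead index of atom i. It is fixed by
    hand here; in a learned scheme this mapping would be optimized.
    Every bead must own at least one atom. Each bead is placed at the
    unweighted centroid of its atoms.
    """
    n_beads = max(assignment) + 1
    sums = [[0.0, 0.0, 0.0] for _ in range(n_beads)]
    counts = [0] * n_beads
    for atom, bead in zip(coords, assignment):
        counts[bead] += 1
        for k in range(3):
            sums[bead][k] += atom[k]
    return [tuple(s / counts[b] for s in sums[b]) for b in range(n_beads)]


def reconstruction_loss(coords, assignment):
    """Mean squared error of the simplest decoder: each atom sits at its bead."""
    beads = cg_map(coords, assignment)
    sq = sum((atom[k] - beads[bead][k]) ** 2
             for atom, bead in zip(coords, assignment)
             for k in range(3))
    return sq / (3 * len(coords))


def mapped_forces(forces, assignment):
    """Reference force on each bead: sum of the atomistic forces on its atoms."""
    n_beads = max(assignment) + 1
    out = [[0.0, 0.0, 0.0] for _ in range(n_beads)]
    for force, bead in zip(forces, assignment):
        for k in range(3):
            out[bead][k] += force[k]
    return [tuple(v) for v in out]


def force_matching_loss(cg_forces_pred, forces, assignment):
    """Mean squared deviation between predicted and mapped bead forces."""
    ref = mapped_forces(forces, assignment)
    sq = sum((p[k] - r[k]) ** 2
             for p, r in zip(cg_forces_pred, ref)
             for k in range(3))
    return sq / (3 * len(ref))


# Four collinear atoms grouped into two beads.
coords = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (10.0, 0.0, 0.0), (12.0, 0.0, 0.0)]
assignment = [0, 0, 1, 1]
beads = cg_map(coords, assignment)
loss = reconstruction_loss(coords, assignment)
```

In a trainable version, the assignment and the decoder would be parametrized and both losses minimized together by gradient descent; the functions above only evaluate the objectives for a given mapping.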

[1] - W. Wang and R. Gómez-Bombarelli. Coarse-Graining Auto-Encoders for Molecular Dynamics. arXiv:1812.02706. Under review.

Message-passing neural networks for classical force field parameterization

Molecular dynamics and Monte Carlo simulations are standard tools used to predict thermodynamic and kinetic properties of materials. However, their accuracy depends on the force fields used to model the system's potential energy surface (PES). Quantum chemistry methods like DFT can determine the PES to high accuracy, but their computational complexity limits applications to small systems over short timescales. In practice, the need for a more computationally efficient and scalable PES has led to the widespread use of classical force fields (CFFs) in atomistic simulations. CFFs write the total energy of a system as a sum over simple functions of the atomic positions. In the simplest cases, bonds can be treated as springs, while electrostatic energies can be modeled using point-charge Coulomb interactions. CFFs have relatively few parameters, are fit to optimize agreement with calculated bulk properties, and do not suffer significantly from overfitting. However, they are tailored to specific classes of molecules and suffer from low generalizability.
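
To make the "sum over simple functions" concrete, here is a minimal sketch of the simplest case mentioned above: harmonic springs for bonds plus point-charge Coulomb interactions. The function names, the parameter layout, and the unit convention (kcal/mol, Å, elementary charges) are assumptions chosen for illustration and are not tied to any particular force field or simulation package.

```python
import math

# Coulomb constant in kcal*Angstrom/(mol*e^2): charges in e and distances in
# Angstrom then give energies in kcal/mol.
COULOMB_K = 332.0637


def harmonic_bond_energy(coords, bonds):
    """Sum of spring energies 0.5 * k * (r - r0)^2.

    bonds: list of (i, j, k_spring, r0) with atom indices i, j.
    """
    return sum(0.5 * k * (math.dist(coords[i], coords[j]) - r0) ** 2
               for i, j, k, r0 in bonds)


def coulomb_energy(coords, charges, excluded=frozenset()):
    """Point-charge Coulomb energy over all pairs i < j.

    excluded: set of (i, j) pairs with i < j to skip, e.g. directly bonded atoms.
    """
    energy = 0.0
    n_atoms = len(coords)
    for i in range(n_atoms):
        for j in range(i + 1, n_atoms):
            if (i, j) in excluded:
                continue
            r = math.dist(coords[i], coords[j])
            energy += COULOMB_K * charges[i] * charges[j] / r
    return energy


def total_energy(coords, bonds, charges, excluded=frozenset()):
    """Total energy as a plain sum of the bonded and electrostatic terms."""
    return (harmonic_bond_energy(coords, bonds)
            + coulomb_energy(coords, charges, excluded))
```

Real force fields add further terms (angles, dihedrals, dispersion), but each one enters the total in the same additive way.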

Recently, machine learning approaches have used highly parameterized neural network potentials, such as the SchNet architecture, to predict energies to within chemical accuracy (< 1 kcal/mol). However, their large representational capacity means that they frequently fail to generalize across chemistries or to molecular configurations not seen in the training data. Moreover, these models are generally difficult to interpret, which limits their usefulness for probing underlying physical processes. We use machine learning approaches to find a balance between these two extremes. We explore the trade-offs between functional complexity and generalizability in fitting CFFs; focus on simple and scalable functional forms for interatomic interactions; and take advantage of large quantum chemistry datasets to densely sample the chemical configuration domain.

One tool, called AuTopology, uses message-passing neural networks (MPNNs) to generate continuous atom types by convolving over molecular graphs. These atomic embeddings encode information about an atom's bonded chemical environment but, unlike more general neural network potentials, do not contain distances or other geometric information. The embeddings then serve as inputs to a series of TopologyNet networks, which produce parameterizations for bonded (bond, angle, proper and improper dihedral) and non-bonded (Coulomb, dispersion) interactions. While the functional forms of these energy terms can in principle be arbitrary, we focus on simple, computationally fast forms that enforce physics-based regularization and transferability and that are commonly implemented in atomistic simulation packages, such as harmonic or Morse potentials for bonds and Lennard-Jones potentials for non-bonded pair interactions. For example, BondNet predicts equilibrium bond lengths and spring constants given the MPNN embeddings of pairs of bonded neighbors, while PairNet uses atomic embeddings to predict partial charges and polarizabilities. The MPNN and TopologyNet models are jointly trained by fitting to DFT-calculated atomic forces, optionally with additional loss terms for predicted atomic charges, total charges, and molecular dipoles. In addition, we implement an induced point dipole model to capture polarization effects. By training with thousands of geometries spanning diverse chemistries, we drive the MPNN layers to learn information-rich atomic embeddings that can produce optimally parameterized classical force fields through the TopologyNet layers. This tool can be used to generate force fields that are as specific as required for a given application and that outperform off-the-shelf force field alternatives without being susceptible to the potentially catastrophic overfitting of general neural network potentials.
A universal, fast CFF would open the door to coupling atomistic simulation and high-throughput virtual screening for the discovery and design of new materials.
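
The idea of predicting bond parameters from atomic embeddings can be illustrated with a toy stand-in for a BondNet-style layer. This is a sketch under stated assumptions, not the AuTopology code: the embeddings and weight vectors are hypothetical placeholders for trained quantities, the two embeddings are summed so that the result does not depend on atom order, and a softplus keeps the spring constant and the equilibrium length positive.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x)); maps any real number to a positive one."""
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)


def bond_parameters(h_i, h_j, w_k, w_r0, b_k=0.0, b_r0=0.0):
    """Predict (spring constant, equilibrium length) from two atom embeddings.

    h_i, h_j: embeddings of the two bonded atoms (lists of floats).
    w_k, w_r0, b_k, b_r0: hypothetical weights of a single linear layer,
    standing in for a trained network.
    """
    pooled = [a + b for a, b in zip(h_i, h_j)]  # symmetric in (i, j)
    k = softplus(sum(w * p for w, p in zip(w_k, pooled)) + b_k)
    r0 = softplus(sum(w * p for w, p in zip(w_r0, pooled)) + b_r0)
    return k, r0


def harmonic_energy(r, k, r0):
    """Harmonic bond energy for a bond of length r."""
    return 0.5 * k * (r - r0) ** 2


def harmonic_force(r, k, r0):
    """Force along the bond coordinate, -dE/dr."""
    return -k * (r - r0)


# Illustrative numbers only.
k, r0 = bond_parameters([0.2, -0.1, 0.5], [1.0, 0.3, -0.4],
                        w_k=[0.5, 0.1, -0.2], w_r0=[0.1, 0.2, 0.3])
```

Because the predicted parameters feed a fixed functional form, forces follow analytically from the energy, which is what makes fitting to reference forces straightforward in such a setup.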

Inverse chemical design

Most chemists understand molecules as undirected graphs with atoms as nodes and bonds as edges.* Carrying out global optimization over such a discrete space is very difficult: the existing arsenal of optimization algorithms works much better when gradients are available. The domain of molecules has hard constraints on which graphs are valid (think pentavalent carbon) and soft constraints on which molecules are desirable and makeable. In computer-driven molecular discovery projects, all these constraints are typically captured using hand-crafted rules, both to define the search domain and to filter out poor candidates.
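
As an example of what such a hand-made rule can look like, the sketch below flags atoms whose total bond order exceeds a maximum valence. The valence table is a deliberately simplified assumption (neutral atoms, a handful of elements), and the function name and data layout are hypothetical; production cheminformatics toolkits handle charges, aromaticity, and hypervalent elements far more carefully.

```python
# Simplified maximum valences for neutral atoms (illustrative assumption).
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "F": 1}


def overvalent_atoms(atoms, bonds):
    """Return indices of atoms whose summed bond order exceeds the table value.

    atoms: list of element symbols, e.g. ["C", "H", "H", "H", "H"].
    bonds: list of (i, j, order) tuples describing the molecular graph.
    """
    used = [0] * len(atoms)
    for i, j, order in bonds:
        used[i] += order
        used[j] += order
    return [idx for idx, element in enumerate(atoms)
            if used[idx] > MAX_VALENCE[element]]


# Methane passes the rule; a carbon with five hydrogens does not.
methane = (["C", "H", "H", "H", "H"],
           [(0, 1, 1), (0, 2, 1), (0, 3, 1), (0, 4, 1)])
ch5 = (["C", "H", "H", "H", "H", "H"],
       [(0, 1, 1), (0, 2, 1), (0, 3, 1), (0, 4, 1), (0, 5, 1)])
```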

We use deep generative models to create a continuous embedding of molecules on which we can apply gradient-based optimization [2]. Generative models can capture the statistics that define the training data, which ideally results in an embedding that inherently captures the chemistry that chemists can make.
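
The optimization step itself is ordinary gradient ascent once molecules live in a continuous space. The sketch below is a toy under loud assumptions: the encoder and decoder are omitted entirely, and the property predictor is a made-up quadratic surrogate with an analytic gradient rather than a trained model. In a realistic workflow the starting vector would come from encoding a molecule, the gradient from automatic differentiation of a learned predictor, and the final vector would be decoded back into a molecule.

```python
def property_surrogate(z, z_best):
    """Toy differentiable property on latent vectors; maximal at z_best."""
    return -sum((a - b) ** 2 for a, b in zip(z, z_best))


def surrogate_gradient(z, z_best):
    """Analytic gradient of property_surrogate with respect to z."""
    return [-2.0 * (a - b) for a, b in zip(z, z_best)]


def optimize_latent(z0, z_best, step=0.1, n_steps=100):
    """Plain gradient ascent on the surrogate, starting from latent point z0."""
    z = list(z0)
    for _ in range(n_steps):
        grad = surrogate_gradient(z, z_best)
        z = [a + step * g for a, g in zip(z, grad)]
    return z


# Hypothetical latent vectors: z_start would be an encoded seed molecule.
z_start = [0.0, 0.0, 0.0]
z_target = [1.0, -2.0, 0.5]
z_opt = optimize_latent(z_start, z_target)
```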

 

*Theoretical chemists might think of 3D arrangements of nuclei that are local minima on the potential energy hypersurface, or perhaps of an ensemble of thermally accessible minima. But most chemists are not theoretical chemists.

[2] - R. Gómez-Bombarelli et al. Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules. ACS Central Science 4 (2), p. 268-276.

High-throughput virtual screening

High-throughput virtual screening combines computational investigation with experiments to optimize a given property within vast libraries of compounds. By rationally deciding which materials to simulate or synthesize, we can find the best candidates within combinatorially large search spaces.

Funding sources