Sunghwan Kim - PubChem3D: Shape compatibility filtering using molecular shape quadrupoles

Version 3

      Publication Details (including relevant citation information):

      S. Kim, E.E. Bolton, and S.H. Bryant;

      Journal of Cheminformatics, 2011, 3, 25.




      PubChem provides a 3-D neighboring relationship, which involves finding the maximal shape overlap between two static compound 3-D conformations, a computationally intensive step. It is highly desirable to avoid this overlap computation, especially when it can be determined with certainty that a conformer pair cannot meet the criteria to be a 3-D neighbor. To that end, PubChem employs a series of pre-filters, based on the concept of volume, that remove approximately 65% of all conformer neighbor pairs prior to shape overlap optimization. That molecular volume, a somewhat vague concept, is this effective raises a question: can the existing PubChem 3-D neighboring relationship, which consists of billions of shape-similar conformer pairs from tens of millions of unique small molecules, be used to identify additional shape descriptor relationships? More specifically, can one place an upper bound on shape similarity using other "fuzzy" shape-like concepts such as length, width, and height?
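      To illustrate how a volume-based pre-filter can rule out a pair without any alignment, here is a minimal Python sketch. It is not PubChem's implementation. It assumes the common definition ST = V_ab / (V_a + V_b - V_ab) and assumes that the overlap volume V_ab cannot exceed the smaller of the two volumes (true for hard-sphere intersection volumes; an assumption for other volume models). The function names and the default threshold argument are hypothetical.

```python
def st_upper_bound(vol_a: float, vol_b: float) -> float:
    """Upper bound on shape Tanimoto (ST) from the two volumes alone.

    Assumes ST = V_ab / (V_a + V_b - V_ab) and V_ab <= min(V_a, V_b).
    ST grows with V_ab, so the best case is V_ab = min(V_a, V_b),
    which gives ST <= min(V_a, V_b) / max(V_a, V_b).
    """
    if vol_a <= 0 or vol_b <= 0:
        raise ValueError("volumes must be positive")
    return min(vol_a, vol_b) / max(vol_a, vol_b)


def can_skip_overlap(vol_a: float, vol_b: float, st_threshold: float = 0.8) -> bool:
    """True if the pair cannot reach the ST threshold under the assumptions above."""
    return st_upper_bound(vol_a, vol_b) < st_threshold
```

      Under these assumptions, a pair with volumes 300 and 500 has a bound of 0.6 and can be discarded at ST ≥ 0.8, while a pair with volumes 400 and 500 sits exactly at the bound and must still be aligned.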


      Shape descriptors were computed for all conformers in a basis set of 4.18 billion 3-D neighbor pairs, identified from single-conformer-per-compound neighboring of 17.1 million molecules. These steric shape descriptors included several forms of molecular volume and shape quadrupoles, which essentially embody the length, width, and height of a conformer. For a given 3-D neighbor conformer pair, the volume and each quadrupole component (Qx, Qy, and Qz) were binned and their frequency of occurrence was examined. Per molecular volume type, this effectively produced three different maps, one per quadrupole component (Qx, Qy, and Qz), of allowed values for the similarity metric, shape Tanimoto (ST) ≥ 0.8. The efficiency of these relationships (in terms of true positives, true negatives, false positives, and false negatives) as a function of ST threshold was determined in a test run of 13.2 billion conformer pairs not previously considered by the 3-D neighbor set. At an ST ≥ 0.8, a filtering efficiency of 40.4% of true negatives was achieved with only 32 false negatives out of 24 million true positives, when applying the separate Qx, Qy, and Qz maps in series (Qxyz). This efficiency increased linearly as a function of ST threshold in the range 0.8-0.99. The Qx filter was consistently the most efficient, followed by Qy and then by Qz. Use of a monopole volume showed the best overall performance, followed by the self-overlap volume and then by the analytic volume. Application of the monopole-based Qxyz filter in a "real world" test of 3-D neighboring of 4,218 chemicals of biomedical interest against 26.1 million molecules in PubChem reduced the total CPU cost of neighboring by 24-38% and, if used as the initial filter, removed from consideration 48.3% of all conformer pairs at almost negligible computational overhead.
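      The idea of a binned map of allowed values, applied in series, can be sketched in Python as follows. This is a simplified illustration and not the authors' code: it keys each map only on the binned quadrupole values of the two conformers and omits the volume dimension, and the bin width, function names, and dictionary layout are all hypothetical.

```python
BIN_WIDTH = 0.5  # hypothetical bin size; the source does not state one


def to_bin(value: float, width: float = BIN_WIDTH) -> int:
    """Map a descriptor value to an integer bin index."""
    return int(value // width)


def build_map(neighbor_pairs, width: float = BIN_WIDTH) -> set:
    """Collect every bin pair observed among known 3-D neighbor pairs.

    neighbor_pairs holds (q_a, q_b) tuples for one quadrupole component.
    Both orderings are stored so the lookup is symmetric.
    """
    allowed = set()
    for q_a, q_b in neighbor_pairs:
        a, b = to_bin(q_a, width), to_bin(q_b, width)
        allowed.add((a, b))
        allowed.add((b, a))
    return allowed


def passes_qxyz(conf_a: dict, conf_b: dict, maps: dict,
                width: float = BIN_WIDTH) -> bool:
    """Apply the Qx, Qy, and Qz maps in series.

    Returns False at the first component whose bin pair was never
    observed among known neighbors, so the overlap step can be skipped.
    """
    for comp in ("Qx", "Qy", "Qz"):
        key = (to_bin(conf_a[comp], width), to_bin(conf_b[comp], width))
        if key not in maps[comp]:
            return False
    return True
```

      A pair that passes all three lookups still needs the full overlap optimization; the maps only rule pairs out. Because the maps are built from observed neighbors, a true neighbor whose bin pair never appeared in the basis set would be rejected, which is how a small number of false negatives can arise.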


      Basic shape descriptors, such as those embodied by size, length, width, and height, can be highly effective in identifying shape-incompatible compound conformer pairs. When performing a 3-D search using a shape similarity cut-off, computation can be avoided by identifying conformer pairs that cannot meet the result criteria. Applying this methodology as a filter for PubChem 3-D neighboring computation yielded an improvement of 31%, increasing the average conformer pair throughput from 154,000 to 202,000 per second per CPU core.

