Prediction of Synthetic Accessibility Based on Commercially Available Compound Databases
Yoshifumi Fukunishi,*,†,‡ Takashi Kurosawa,‡,§ Yoshiaki Mikami,‡,§ and Haruki Nakamura∥
†Molecular Proﬁling Research Center for Drug Discovery (molprof), National Institute of Advanced Industrial Science and Technology (AIST), 2-3-26, Aomi, Koto-ku, Tokyo 135-0064, Japan
‡Technology Research Association for Next-Generation Natural Products Chemistry, 2-3-26, Aomi, Koto-ku, Tokyo 135-0064, Japan
§Hitachi Solutions East Japan, 12-1 Ekimae-honcho, Kawasaki-ku, Kawasaki, Kanagawa 210-0007, Japan
∥Institute for Protein Research, Osaka University, 3-2 Yamadaoka, Suita, Osaka 565-0871, Japan
*S Supporting Information
A compound’s synthetic accessibility (SA) is an important aspect of drug design, since in some cases computer-designed compounds cannot be synthesized.1,2 And while the concept of SA is well-known among medicinal and synthetic chemists, it is less familiar among computational chemists, who generally lack a background in synthetic chemistry. Therefore, although computational chemists often engage in computer-guided drug design, they may ﬁnd SA estimation of newly designed compounds diﬃcult. An automatic SA evaluation program would be helpful in such cases and should be required when
new compounds are generated by a de novo design program. There is thus a great need for an automatic SA prediction program. Accordingly, several computational methods have been developed, such as CAESA,3 RECAP,4 WODCA,5 LHASA,6 RASA,7 RSsvm,8 AIPHOS,9 and SYLVIA,10 to
perform retrosynthesis11,12 and/or SA prediction for the compounds in question. In these methods, the synthetic reaction path is predicted by software, and both the diﬃculty of reactions and the availability of reagents (starting materials) are
were not taken into consideration in these approaches. Also, these SA prediction techniques depend on the availability of reagents, the catalogs of which have changed every year.
The study most closely related to our present study was reported by Takaoka et al.13 In this study, several chemists evaluated the SA (in the original paper, “synthetic easiness”) of each compound in a teaching data set by expert manual assessment. The compounds in the teaching data set were described by molecular descriptors, and the weights of the descriptors were determined to reproduce the SA. This approach worked well, and the report suggested that SA is a useful but not well-deﬁned concept.14 In the present work, we present a new method for predicting SA based on a commercially available compound database and molecular descriptors. The idea behind our method was very similar to that in a previous study.15 In both the present and the above- mentioned methods,13,15 reaction databases and retrosynthesis analyses were not necessary. SA was estimated from the
probability of the existence of substructures of the compound calculated based on a compound library, the number of
examined using a database. These methods work well and
should be useful in drug design. However, the steric eﬀects (such as atom collisions and other interatomic interactions)
Received: July 17, 2014
© XXXX American Chemical Society A dx.doi.org/10.1021/ci500568d | J. Chem. Inf. Model. XXXX, XXX, XXX−XXX
symmetry atoms, and the number of chiral centers in the compound. A steric eﬀect, such as that of atomic collision, could be partially taken into consideration by the probability of the existence of substructures of the compound. It has recently become easy to access free compound libraries, such as PUBCHEM,16 ChEMBL,17 ZINC,18,19 and LigandBOX.20,21
Since our method uses a compound library but not a reaction database, it should be easy for compound vendors to customize their predictions using our approach. Then, the vendors could establish their own libraries with the resulting compounds.
In addition, we examined the relationship between the sales price and SA of approved drugs. Recently, the sales prices of newly approved drugs have increased. The recently developed kinase inhibitors are especially expensive. Of course, the price of a drug is critical to its practical adoption by patients. For this reason, the ability to predict the price of a new drug in the drug-design process would be highly useful. Our hypothesis is that drugs that are diﬃcult to synthesize will be expensive, based on the fact that, in the past, some approved drugs were not actually produced because their sales prices were not high enough to yield proﬁts. We checked this hypothesis in the present study by examining 100 new drugs and 100 generic drugs.
In the current study, SA was calculated as follows.
SA = c1Sprob + c2Ssym + c3Nchiral + c4Sgraph‐complexity + c0
Figure 1. ECFP-like atom-based and bond-based fragmentations.
In bond-based fragmentation, the ith fragment of M order fragmentation consists of the atoms connected to one of the two atoms of the ith bond by contiguous M bonds.
The atom-based and bond-based fragments include the atom-type information and bond-type information, respectively, in addition to the topology, and the deﬁnitions of atom and bond types follow those of the Sybyl mol2 ﬁle format.
All compounds in the library were decomposed into M order fragments. Let N and N(i) be the total number of N order fragments found in the library and the total number of N order fragments found in the library that were exactly the same as the ith fragment of the compound in question, respectively. The probability of existence of the ith fragment in library (P(i)) is given by
P(i) = N(i)/N
The total probability of existence of the compound in question is then given as
where c , c , c , c , and c
are ﬁtting parameters. Parameters c −
Ptotal = ∏ P(i)
0 1 2 3 4 0 i
c4 are optimized to reproduce the SA determined by expert manual assessment. In the present study, the SA determined by expert manual assessment is named “human SA” and the SA calculated by eq 1 is named “calculated SA”. SA is estimated from the probability of existence (Sprob) calculated based on a
compound library, the number of symmetry atoms (Nsym), the total number of atoms (Ntot) of the compound, the number of chiral centers of the compound (Nchiral), and the graph complexity (Sgraph‑complexity).22 Nsym is the number of chemically (topologically) equivalent atoms. The importance of symmetry, chirality, and graph complexity was suggested previously.15
If a compound consists of fragments frequently found in an available-compound database, it should be easy to synthesize. On the other hand, if a compound consists of fragments rarely or never found in an available-compound database, it would be diﬃcult to synthesize. The probability of existence (Sprob) was calculated on the basis of the decomposition of the compound into fragments, and the probability of existence of each fragment was estimated according to the compound library. This idea was introduced previously.15 We used two fragmentation methods to obtain the molecular descriptors: atom-based fragmentation and bond-based fragmentation (see Figure 1). Our descriptors were very similar to the extended connectivity ﬁngerprint (ECFP) descriptors developed by SciTegic, which are substructural ﬁngerprints.15
In atom-based fragmentation, the ith fragment of M order
fragmentation consists of the atoms connected to the ith atom by contiguous M bonds.
Sprob = log10(Ptotal) = ∑ log10(P(i))
The compound library should consist of already-synthesized available compounds.
Chemical reactions occur at chemically equivalent atoms. Thus, it should be easier to synthesize symmetrical substructures than nonsymmetrical substructures. The synthesis of a chiral structure is known to involve a diﬃcult reaction. A previous report showed that the graph complexity was important for evaluating SA. Thus, we examined the Nsym, Nchiral, and Sgraph‑complexity.
We also calculated the formation heat of the compound in
question by using MOPAC PM3 (Quantum Chemistry Program Exchange, (QCPE), Indiana University, Bloomington, Indiana, USA, http://openmopac.net/index.html), since the stability of a compound depends on its formation heat. In general, distorted unstable compounds (such as cubane) should be diﬃcult to synthesize, even if the compound could be synthesized.
In this study, the correlations between two sets of values were calculated (i.e., the correlation between the predicted value and the manual assessment, the correlation between the predicted values, and so on). The all-correlation coeﬃcients
(R) were Pearson product-moment correlation coeﬃcients (Pearson correlation coeﬃcients).
B dx.doi.org/10.1021/ci500568d | J. Chem. Inf. Model. XXXX, XXX, XXX−XXX
⦁ DATA PREPARATION
Our compound data set consisted of 256 compounds, including 12 from the SYLVIA data,12 100 from the RASA data,9 and 144 randomly selected from the KEGG DRUG database (http:// www.genome.jp/kegg/drug/).23 KEGG DRUG is a free database of approved drugs. For all 256 compounds, 2 chemists from the Nard Institute and 2 chemists from Fujimoto Chemicals evaluated the SA. For the RASA data set, 5 chemists evaluated the SA, and for SYLVIA, 11 chemists evaluated the previously published SA. For the KEGG DRUG data set, the SA was averaged across the results by the four chemists. For the RASA and SYLVIA data sets, we calculated the average SA for all evaluated results. The lists of the human SA values are summarized in the Supporting Information (Table S1).
Table 1 shows the correlation between each descriptor and the SA evaluated by the chemists (human SA). The actual
Table 1. Pearson Correlation Coeﬃcients between Human SA and Descriptors
To calculate S
, a library of known existing compounds is
R L1 L2 L3
Ssym 0.267 0.267 0.267
Schiral 0.428 0.428 0.428
heata 0.060 0.060 0.060
Sgraph 0.345 0.345 0.345
A1b 0.118 0.111 0.111
A2b 0.332 0.320 0.310
A3b 0.500 0.507 0.499
A4b 0.494 0.503 0.505
A5b 0.446 0.457 0.465
A6b 0.404 0.419 0.428
B1b 0.402 0.404 0.400
B2b 0.474 0.479 0.478
necessary. We prepared nine libraries to evaluate the library- dependence and size-dependence of our prediction method. Libraries L1, L2, and L3 were nonredundant, and their compounds were randomly selected from the LigandBOX database.20,21 Each library (L1, L2, and L3) consisted of 100 000 compounds. Libraries M1, M2, and M3 were 50 000- compound libraries randomly selected from L1, L2, and L3, respectively. Libraries S1, S2, and S3 were 10 000-compound libraries randomly selected from M1, M2, and M3, respectively. The compound data of the libraries (L1−L3, M1−M3, S1− S3) extracted from the LigandBOX database were prepared as three-dimensional (3D) coordinates in the Sybyl mol2 ﬁle format; however, only 2D structure data were used in the present study. The atom and bond types also followed the Sybyl mol2 format. The 3D structure of each compound was optimized by energy minimization using the General AMBER force ﬁeld (GAFF)24 and myPresto/cosgene25 with MOPAC
AM1 atomic charge. The details of the database were described in our previous papers.20,21 The 3D structures of the 256- compound set were prepared in the same manner as in the LigandBOX procedure. The 3D structures were prepared to calculate the heat of formation, but in the prediction models (FA1−FA6, FB1, and FB2), only the 2D structures were used. For the price examination, new drugs and generic drugs were extracted from DrugBank (http://www.drugbank.ca/).26 About 100 new drugs and 100 generic drugs were arbitrarily selected. When multiple brands were available, the average price was used. The lists of the drugs are summarized in the Supporting
Information (Table S2).
⦁ RESULTS AND DISCUSSION
⦁ Correlation between Human SA and Each Term of the Prediction Model. We prepared a compound database of human SA that consisted of 256 compounds and had 4 chemists from 2 chemical vendors (2 from Nard Institute and 2 from Fujimoto Chemicals) evaluate each compound. The SA of each compound was ranked from 1 (easy to synthesize) to 10 (diﬃcult to synthesize), as in previous works.9,12,15 The SA of each compound was compared with the calculated compound descriptors. The data sets are described in detail in the Data Preparation section. This data set was used in all validation tests in the present study.
The atom-based Sprob values were calculated from 1- to 6- order fragments (A1−A6) and the bond-based Sprob values were calculated from 1- to 2-order fragments (B1 and B2). We did not examine further bond-order fragments, since the results were similar to those obtained for the atom-based fragments.
aHeat of formation calculated by MOPAC PM3. bSprob descriptor.
Table 2. Average Sprob Values and Their Standard Deviation over a 256-Compound Set Calculated Using the L1 Library
model log10(Sprob) σ of log10(Sprob)
A1 −16.7 −60.7
A2 −85.5 −63.3
A3 −134.1 −64.0
A4 −153.7 −64.9
A5 −162.5 −65.0
A6 −168.4 −65.1
B1 −153.3 −71.6
B2 −223.9 −48.8
values of Sprob and the deviations are summarized in Table 2. The three compound libraries (L1, L2, and L3) were used as the database for the atom-based (A1−A6) and bond-based (B1 and B2) descriptors. Most of the descriptors were correlated with the human SA; the exceptions were the atom-based ﬁrst- order fragment (A1) and the heat of formation. This means that the P(i) value of the ith atom (eq 3) could show the SA around the ith atom. This could tell us which part of the compound in question should be diﬃcult to synthesize. The results obtained by the atomic-based fragmentation were almost the same as those obtained by the bond-based fragmentation. As was the case in the previous reports, the Ssym, Nchiral, and Sgraph complexity showed good correlations with the human SA. We examined the correlation between Sprob and the human SA using the L1, L2, L3, M1, M2, M3, S1, S2, and S3 libraries, since the atom-based (A1−A6) and bond-based (B1 and B2) descriptors were all correlated highly with human SA. The calculated SA was obtained by using eq 1 with c2 = c3 = c4 = 0, and the other two parameters (c0 and c1) were optimized. Table 3 shows the computational root-mean-square deviation error (RMSD), the average error, and the correlation coeﬃcient between the human SA values and the SA values calculated by
these eight models. The models used were SA1−SA6, SB1, and SB2, and these models employed A1−A6, B1, and B2 as Sprob,
respectively. The results were obtained by leave-one-out
(LOO) cross-validation tests and multiple-linear regression
Table 3. SA Results Calculated by Using Only the Sprob Term (Two-Parameter (c0 and c1) Model)a
MLR library M2
level RMSD average error
R RMSD average error
R level RMSD average error
R RMSD average error
A1 1.980 1.555 0.115 1.973 1.549 0.118 A6 1.509 1.171 0.399 1.502 1.166 0.401
A2 4.222 3.910 0.333 4.202 3.893 0.332 B1 1.585 1.177 0.402 1.577 1.171 0.403
A3 1.810 1.411 0.498 1.800 1.405 0.500 B2 1.463 1.108 0.472 1.455 1.103 0.473
A4 1.501 1.165 0.492 1.494 1.160 0.494 library
A5 1.501 1.177 0.444 1.493 1.171 0.446 M3 cross -validation test MLR
A6 1.519 1.200 0.402 1.512 1.194 0.404 average average
B1 1.569 1.167 0.401 1.561 1.162 0.402 level RMSD error R RMSD error R
B2 1.464 1.108 0.473 1.456 1.103 0.474 A1 2.080 1.640 0.081 2.073 1.634 0.084
average average A4 1.510 1.185 0.397 1.503 1.180 0.399
level RMSD error R RMSD error R A5 1.540 1.209 0.360 1.533 1.203 0.362
A1 1.991 1.570 0.108 1.984 1.564 0.111 A6 1.559 1.224 0.336 1.552 1.219 0.339
A2 4.188 3.787 0.311 4.169 3.772 0.310 B1 1.559 1.158 0.390 1.552 1.153 0.392
A3 1.938 1.532 0.497 1.928 1.526 0.499 B2 1.469 1.129 0.444 1.462 1.124 0.446
A4 1.517 1.160 0.503 1.509 1.154 0.505 library
A5 1.489 1.148 0.463 1.482 1.143 0.465 S2 cross-validation test MLR
A6 1.494 1.154 0.425 1.486 1.149 0.428 average average
B1 1.607 1.191 0.399 1.598 1.185 0.400 level RMSD error R RMSD error R
B2 1.469 1.105 0.476 1.461 1.100 0.478 A1 2.539 1.990 0.016 2.532 1.983 0.013
average average A4 1.465 1.133 0.446 1.458 1.128 0.448
level RMSD error R RMSD error R A5 1.509 1.173 0.393 1.502 1.168 0.395
A1 2.080 1.639 0.083 2.073 1.633 0.086 A6 1.535 1.196 0.361 1.528 1.191 0.363
A2 4.012 3.672 0.355 3.998 3.659 0.352 B1 1.564 1.155 0.414 1.555 1.150 0.415
A3 1.647 1.284 0.511 1.638 1.278 0.513 B2 1.454 1.101 0.471 1.447 1.095 0.472
A4 1.461 1.128 0.490 1.454 1.123 0.492 aMLR: multilinear regression. R: Pearson correlation coeﬃcient.
A5 1.483 1.151 0.439 1.476 1.145 0.441
average average A4 1.454 1.116 0.506 1.447 1.111 0.508
level RMSD error R RMSD error R A5 1.474 1.134 0.453 1.467 1.129 0.456
A1 1.987 1.564 0.108 1.980 1.558 0.111 A6 1.496 1.153 0.411 1.489 1.148 0.414
A2 4.206 3.843 0.322 4.187 3.828 0.320 B1 1.601 1.185 0.400 1.592 1.179 0.401
A3 1.886 1.493 0.505 1.877 1.486 0.507 B2 1.462 1.100 0.477 1.454 1.095 0.478
A4 1.503 1.151 0.501 1.495 1.146 0.503 library
A5 1.491 1.153 0.455 1.483 1.148 0.457 S1 cross-validation test MLR
A6 1.500 1.160 0.417 1.493 1.155 0.419 average average
B1 1.589 1.181 0.403 1.581 1.175 0.404 level RMSD error R RMSD error R
B2 1.466 1.107 0.478 1.458 1.102 0.479 A1 2.729 2.129 0.014 2.721 2.122 0.013
average average A4 1.477 1.145 0.434 1.470 1.140 0.436
level RMSD error R RMSD error R A5 1.511 1.179 0.391 1.505 1.174 0.393
A1 2.086 1.641 0.093 2.079 1.635 0.095 A6 1.534 1.193 0.361 1.527 1.188 0.363
A2 3.746 3.303 0.364 3.736 3.293 0.363 B1 1.568 1.161 0.407 1.560 1.155 0.408
A3 1.609 1.237 0.487 1.600 1.232 0.489 B2 1.461 1.108 0.463 1.454 1.103 0.464
A4 1.498 1.170 0.454 1.491 1.164 0.456 library
A5 1.511 1.186 0.412 1.504 1.181 0.415 S3 cross-validation test MLR
A6 1.538 1.211 0.374 1.531 1.205 0.376 average average
B1 1.573 1.168 0.390 1.565 1.163 0.392 level RMSD error R RMSD error R
B2 1.469 1.116 0.459 1.462 1.110 0.461 A1 2.544 1.986 0.022 2.537 1.980 0.019
(MLR) over the data of all 256 compounds using the L1, L2, L3, M1, M2, M3, S1, S2, and S3 libraries. In the leave-one-out cross-validation test, 1 datum was selected as the test datum to be predicted, and the other data were used as teaching data to
generate the prediction model equation. The test data were selected one after another in a given data set until all data had been selected as test data. The results obtained by the atom- based fourth-order fragmentation (SA4) were better than those
Table 4. SA Results Calculated by Equation 1 (Five-Parameter (c0−c4) Model)a
MLR library M2
level RMSD average error
R RMSD average error
R level RMSD average error
R RMSD average error
A1 1.357 1.049 0.516 1.326 1.026 0.516 A6 1.336 1.034 0.535 1.307 1.013 0.536
A2 1.338 1.023 0.534 1.308 1.002 0.535 B1 1.353 1.044 0.519 1.323 1.021 0.519
A3 1.316 1.011 0.555 1.287 0.990 0.556 B2 1.333 1.021 0.539 1.304 1.000 0.539
A4 1.309 1.001 0.561 1.280 0.980 0.562 library
A5 1.324 1.018 0.546 1.296 0.997 0.547 M3 cross -validation test MLR
A6 1.336 1.034 0.536 1.307 1.013 0.536 average average
B1 1.353 1.043 0.519 1.323 1.020 0.520 level RMSD error R RMSD error R
B2 1.331 1.015 0.540 1.302 0.994 0.541 A1 1.357 1.050 0.515 1.327 1.027 0.516
average average A4 1.336 1.033 0.536 1.305 1.011 0.538
level RMSD error R RMSD error R A5 1.351 1.050 0.522 1.320 1.028 0.522
A1 1.358 1.050 0.515 1.327 1.028 0.515 A6 1.359 1.059 0.514 1.327 1.036 0.515
A2 1.339 1.027 0.533 1.310 1.005 0.533 B1 1.355 1.044 0.517 1.325 1.021 0.518
A3 1.312 1.007 0.558 1.283 0.986 0.559 B2 1.334 1.015 0.537 1.304 0.994 0.539
A4 1.307 0.996 0.563 1.278 0.975 0.564 library
A5 1.319 1.005 0.552 1.290 0.985 0.553 S2 cross-validation test MLR
A6 1.329 1.019 0.542 1.300 0.997 0.543 average average
B1 1.354 1.045 0.518 1.324 1.023 0.519 level RMSD error R RMSD error R
B2 1.333 1.020 0.539 1.304 0.998 0.539 A1 1.356 1.048 0.516 1.326 1.026 0.517
average average A4 1.312 1.001 0.559 1.282 0.980 0.561
level RMSD error R RMSD error R A5 1.336 1.029 0.536 1.306 1.008 0.537
A1 1.357 1.049 0.516 1.326 1.026 0.516 A6 1.349 1.048 0.524 1.318 1.026 0.525
A2 1.339 1.028 0.533 1.310 1.007 0.533 B1 1.353 1.044 0.520 1.322 1.021 0.520
A3 1.313 1.011 0.558 1.284 0.990 0.559 B2 1.327 1.014 0.545 1.298 0.993 0.545
A4 1.310 0.999 0.561 1.281 0.978 0.562 aMLR: multilinear regression. R: Pearson correlation coeﬃcient.
A5 1.324 1.015 0.547 1.295 0.995 0.548
average average A4 1.299 0.988 0.571 1.270 0.968 0.572
level RMSD error R RMSD error R A5 1.317 1.005 0.554 1.288 0.984 0.555
A1 1.357 1.050 0.516 1.326 1.027 0.516 A6 1.331 1.025 0.541 1.301 1.003 0.542
A2 1.338 1.027 0.534 1.309 1.006 0.534 B1 1.354 1.045 0.518 1.324 1.023 0.519
A3 1.312 1.008 0.558 1.283 0.987 0.559 B2 1.331 1.018 0.540 1.302 0.997 0.541
A4 1.311 0.997 0.560 1.281 0.977 0.561 library
A5 1.323 1.013 0.548 1.294 0.992 0.549 S1 cross-validation test MLR
A6 1.333 1.028 0.539 1.303 1.007 0.540 average average
B1 1.354 1.044 0.519 1.323 1.022 0.519 level RMSD error R RMSD error R
B2 1.332 1.019 0.539 1.303 0.998 0.540 A1 1.358 1.048 0.515 1.326 1.025 0.516
average average A4 1.326 1.017 0.545 1.296 0.995 0.547
level RMSD error R RMSD error R A5 1.342 1.039 0.530 1.312 1.017 0.531
A1 1.358 1.050 0.515 1.327 1.027 0.515 A6 1.352 1.051 0.521 1.321 1.029 0.522
A2 1.342 1.026 0.530 1.312 1.005 0.531 B1 1.354 1.044 0.519 1.323 1.022 0.519
A3 1.325 1.013 0.546 1.296 0.992 0.547 B2 1.334 1.021 0.538 1.305 1.000 0.538
A4 1.327 1.012 0.543 1.298 0.991 0.545 library
A5 1.336 1.026 0.534 1.307 1.005 0.536 S3 cross-validation test MLR
A6 1.347 1.044 0.524 1.317 1.022 0.525 average average
B1 1.355 1.044 0.517 1.324 1.021 0.518 level RMSD error R RMSD error R
B2 1.337 1.019 0.535 1.307 0.998 0.536 A1 1.356 1.048 0.517 1.325 1.026 0.517
obtained by the other fragmentations and were similar to those obtained by the bond-based second-order fragmentation. The correlation coeﬃcients between the predicted values and the human SAs were about 0.5 for models SA4 and SB2.
⦁ SA Prediction by the Theoretical Model and Its Accuracy. We examined the prediction accuracy of eq 1 with all c0−c4 parameters free and with the use of L1, L2, L3, M1, M2, M3, S1, S2, and S3 libraries. Eight prediction models
(FA1−FA6, FB1, and FB2) were examined. In each model, Sprob was calculated by each of the A1−A6, B1, and B2 descriptors. Table 4 shows the computational root-mean-square deviation error (RMSD), the average error, and the correlation
coeﬃcient between the human SAs and the values calculated by these eight models (FA1−FA6, FB1, and FB2). The results were obtained by LOO cross-validation tests and the MLR across the data of all 256 compounds using the L1, L2, L3, M1, M2, M3, S1, S2, and S3 libraries, just as in the above examinations of models SA1−SA6, SB1, and SB2.
The results obtained by the atom-based fourth-order
fragmentation were better than those obtained with the other fragmentations and similar to those obtained by the bond-based second-order fragmentation. The correlation coeﬃcients between the predicted SA values and the human SAs were about 0.56 for each of models FA3 and FA4. The computa- tional results did not depend on the size of the libraries used for the calculation of the descriptors, and a library containing 10 000−100 000 compounds was large enough for prediction. Since our method required a compound library but not a reaction database, it should be easy to customize predictions for compound vendors by using this method. Compound vendors could make their own libraries of the compounds they have made. Figure 2 shows the correlations between the predicted
Figure 2. Correlation between the human SA and calculated SA obtained by the cross-validation test by model FA4 based on library L1. The solid line shows a ﬁtted curve.
values obtained by model FA4 with library L1 and the synthetic accessibilities evaluated by the chemists. The correlation graphs obtained using libraries L2 and L3 were similar to those in Figure 1, and they were omitted. There were clear correlations between the calculated and the human SAs.
Table 5 shows the correlations between the SAs calculated by
prediction). The SA values were calculated for the 256- compounds set as well as for the above procedures. Since the larger fragments include the smaller fragments, the predicted values were correlated with each other. In particular, the SA4 and FB2 models showed that all correlation coeﬃcients were larger than 0.35, while some models did not correlate at all for example, there were no correlations between SA1 and SA3, between SA1 and SB1, or between SA2 and SA6. Tables 3 and 4 show that model FA4 was suitable for SA prediction, because model FA4 was able to yield results similar to those of the other models and maintained a high prediction accuracy.
In Figure 2, the gradient of the ﬁtted line was 0.32 and this gradient was lower than the ideal gradient, 1. The ﬁtted gradients did not change depending on the choice of the compound libraries (S1, S2, S3, M1, M2, M3, L1, L2, and L3). The reason for the low gradient could be the lack of descriptors to evaluate the SA of the test set compounds. In other words, the diversities of the compound libraries were relatively low comparing to the test set. This idea is supported by the weak dependence of the calculated SA on the library size in Table 4. One of the reasons of the weak dependence of result (R and error) on the library size could be low diversity of the compound libraries, and the P(i) values in eq 3 did not change by the choice of the library. The prediction could be improved by using a compound library of rich of diversity.
We also applied a 2-fold cross-validation test. In this test, the 256-compounds set was arbitrarily dived into two groups. One group was used as the teaching set and the other was used as the test set. The test was repeated by exchanging the teaching and test sets, and an average of the two predictions was used as the result. The 2-fold cross-validation tests were performed three times, using model FA4 with the L1, L2, and L3 libraries, respectively. For these three tests, the correlation coeﬃcient between the predicted SA values and the human SA values was 0.40, and the average error of each cross-validation test was
1.20. When the number of compounds in the teaching set was reduced by half, the current prediction method still worked, although the prediction accuracy decreased.
The LigandBOX database (the current version as of June 2013) consists of 4 586 865 compounds supplied by 44 compound vendors. The biggest three libraries are Enamine (1 883 713 compounds), Vitas (1 163 246 compounds), and TimTec (1 066 286 compounds). The average number of entries per library across all vendors is 179 748 compounds. However, if the biggest three vendors are excluded, the average number of entries per library drops to 92 576 compounds. Indeed, 31 out of 44 vendors (70%) supply fewer than 92 000
models SA1−SA6, B1, and B2 (only S
was used for the
compounds. Because our prediction method can work with a compound database of smaller than 100 000 compounds, these
vendors could use our method to conduct an easy and
Table 5. Pearson Correlation Coeﬃcient between SAs Calculated Using Only the Sprob Term (Two-Parameter (c0 and c1) model)
A1 A2 A3 A4 A5 A6 B1 B2
automatic SA evaluation based on their own compound libraries before their manual assessments.
4.3. Correlation among Human SAs. Table 6 shows the correlations among the human SAs. As shown in previous reports, the human SAs were strongly dependent on the skill
A1 1.00 0.58 0.04 0.38 0.54 0.66 0.07 0.35 and experience of the individual chemist. Thus, the simple
A2 1.00 0.75 0.47 0.30 0.15 0.61 0.52 average of the correlation coeﬃcients (R) was 0.586. The
A3 1.00 0.91 0.79 0.69 0.78 0.86 results obtained by the two chemists who belonged to the same
A4 1.00 0.97 0.91 0.80 0.94 company were similar to each other. The results obtained by
A5 1.00 0.98 0.77 0.92 the chemists from diﬀerent companies were diﬀerent from each
A6 1.00 0.72 0.89 other. SA would thus be expected to depend on the equipment
B1 1.00 0.91 available at the individual company, as well as on the training
B2 1.00 and experience of the company chemists. This would explain
Table 6. Pearson Correlation Coeﬃcient between Human SAs Evaluated by Two Chemistsa
FMP1 FMP2 NARD1 NARD2 RASA SYLVIA
FMP1 1.000 0.931 0.572 0.557 0.355 0.369
FMP2 0.931 1.000 0.552 0.530 0.285 0.413
NARD1 0.572 0.552 1.000 0.862 0.474 0.944
NARD2 0.557 0.530 0.862 1.000 0.438 0.923
RASA 0.355 0.285 0.474 0.438 1.000 N.D.
SYLVIA 0.369 0.413 0.944 0.923 N.D. 1.000
aFMP: Fujimoto Chemicals. NARD: Nard Institute.
why the R of our human SA values was much lower than those reported in the previous studies (R = 0.7−0.8).9,12 If, in fact,
showed a weak correlation to the calculated SA values. Their correlation coeﬃcients were 0.32 and 0.29, respectively. The prices of generic drugs showed the same trends as the new drugs. The reason for this should be that the price of a generic drug follows the price of the original drug. The average SA of a new drug was 5.6, slightly higher than that of a generic drug (average SA = 5.5). We also examined the list of the correlation coeﬃcients between the sales prices and the SA values (Table 8), broken down by drug. The R values of antiacute and
Table 8. Pearson Correlation Coeﬃcient (R) between the Sales Prices of Newly Approved Drugs and Their SA Values
our method was not eﬀective at SA prediction, then the
those between the human SAs (R = 0.59 with a standard injection drug 13 −0.09 20 0.03
deviation of 0.22 and average error = 1.1). Thus, our SA (/mg)
prediction method was considered to function eﬀectively. injection drug 13 −0.08 20 0.19
4.4. Relationship between the Sales Price of (/mmol)
correlation coeﬃcient between the calculated SA and the human SA would be expected to be much lower than that between human SAs. However, the R value (0.56) and the average error (1.2) obtained by the FA4 model were similar to
dosage form topical drug (/mg)
topical drug (/mmol)
of drugs R
of drugs R
oral drug (/mg) 48 0.08 66 0.03
oral drug (/mmol) 48 0.14 66 0.24
pathophysiology antichronic disease drug (/mg) 73 0.08 52 0.02
antichronic disease drug (/mmol) 73 0.15 52 0.04
antiacute disease drug (/mg) 7 −0.02 7 0.38
antiacute disease drug (/mmol) 7 −0.02 7 0.43
anticancer drug 11 −0.07 11 0.31
Approved Drugs and SA. Figure 3 shows the correlation
(/mg) anticancer drug
11 −0.02 11 0.46
Figure 3. Correlation between the sales price of approved drugs in the
US and SAs. The sales price of an approved drug is the price per mole
of a new drug in the US. The solid line shows a ﬁtted curve.
between the logarithm of the sales price of a newly approved drug in the United States (US) and the calculated SA value obtained by model FA4. The drug price is the price per mole. The price showed a weak correlation to the calculated SA, with a correlation coeﬃcient R of 0.32. The correlation coeﬃcients between the sales prices of approved drugs/generics and the SA values are summarized in Table 7. The prices per weight, mole, and tablet did not correlate with the SA values. The logarithms of the prices showed a weak correlation to the calculated SA values. The prices of new drugs in the United States and Japan
Table 7. Pearson Correlation Coeﬃcient between the Sales Prices of Approved Drugs/Generics and the SA Values
US new approved drug generic drug
price per weight (mg) 0.03 0.11
price per mole 0.02 0.11
Log(price per mol) 0.32 0.37
Japan new approved drug generic drug
price per weight (mg) 0.05 0.27
price per mole 0.12 0.43
Log(price per mole) 0.29 0.27
anticancer drugs in Japan showed weak positive correlations, while the others did not show correlation. In the United States, some R values became negative.
The approved drug sets included some natural compounds, such as steroids, oligosaccharides, macrolides, and other large cyclic compounds. These natural compounds are extremely diﬃcult to synthesize, but they are still relatively inexpensive. However, the current teaching sets for human SA evaluations consist of only non-natural compounds. The diﬀerence in the data sets could be one of the reasons why the correlation is poor.
The correlation coeﬃcients between the sales prices of drugs in the United States and Japan are summarized in Table 9. The
Table 9. Pearson Correlation Coeﬃcient (R) between Sales Prices of Drugs in the United States and Japan
newly approved drugs generic drugs
price/mg 0.14 0.81
price/mmol 0.14 0.84
Log(price/mg)a 0.77 0.63
Log(price/mmol)a 0.79 0.66
aCorrelation coeﬃcient between the logarithms of the drug prices in the United States and the logarithms of the prices in Japan.
correlation between the logarithm of the price of drugs in the US and the price in Japan was very high (R = about 0.8), even though drug prices in the United States and Japan are controlled by their respective governments, as shown in Table 8.
The price of a drug may depend on many kinds of costs, including the costs of development, phase trials, and patents. This means that a drug that is diﬃcult to synthesize is not always expensive. The average SA of a drug is about 6, meaning that the average diﬃculty of drug synthesis is not extremely high.
4.5. Miscellaneous Discussion on the Prediction
Model. The predicted SA strongly depended on Sprob. If the compound library consisted of approved drugs, Sprob reﬂected the speciﬁc features of those drugs. For example, the probability of ﬁnding a toxic substructure (like −COH, −N−N−, or −O−
O−) should be low. Thus, Sprob should be a measure of drug-
likeness. Since drug-likeness is not a well-deﬁned idea, we did not examine in detail the behavior of Sprob values in the present study.
The calculated SA results obtained by model FA4 were not substantially better than those obtained by model SA4. This means that the Sgraph‑complexity, Schi, and Ssym terms do not improve the results dramatically. The substructure information includes some information about chirality and symmetry. For example, the symmetry of −CH2 or −CH3 is represented by
the A1−A6, B1, and B2 descriptors. If four atoms connected to
a center C atom are all diﬀerent from each other, the center C atom must be a chiral center. This information is also included in the A1−A6, B1, and B2 descriptors. In practical use, the calculated SA results obtained by model FA4 were less dependent on the size of the compound library compared to the calculated SA results obtained by model SA4. Thus, model FA4 should be more useful than model SA4.
The prediction accuracy of our model (R = 0.56 by model FA4) was not superior to the prediction results obtained by the previous studies. One of the possible explanations for these homogeneous results is that the concept of SA remains unclear, and thus there is a limit to the prediction accuracy. Users will tend to choose the computer software most suitable for their purpose among the various types of SA prediction methods.
Our method expends about 1 min in generating a histogram of the substructures present in a library of 10 000 compounds (S1, S2, and S3), and the calculation of eq 1 for each compound costs about 0.01 s on a Xeon 5400 processor (3 GHz). The computational time required for histogram generation of the substructures of a library was almost proportional to the number of compounds in the library.
We proposed a new method for predicting SA. The input compound was decomposed into ECFP-like substructures, and the probabilities of the substructures were estimated using a database search on a compound library of 10 000−100 000 available compounds. In addition, the symmetry, the number of chiral centers, and the graph complexity of the compounds were considered to predict the SA. The predicted SA showed good agreement with the SA estimated by chemists, with a correlation coeﬃcient of 0.56. Considering that the correlation coeﬃcient between SAs estimated by the chemists was around 0.58, the computational prediction worked well. Model FA4 showed the best prediction accuracy among the eight tested models, and the most important descriptor was the probability
of ﬁnding the substructure in a given compound library. The number of compounds in a teaching library could be small; namely, 10 000−100 000 compounds were suﬃcient to calculate the SA. Thus, it should be easy to customize predictions for compound vendors using this method. These vendors could establish their own compound libraries made up of the compounds they have produced. The source code of the program will be freely available from our website (http:// presto.protein.osaka-u.ac.jp/myPresto4/index.php?lang=en).
In our approach, the major contribution to the calculation of the SA value was made by Sprob, which was calculated by using a group contribution method. This means that the probability of the existence of a fragment structure around each atom could reveal the partial SA of that structure. Thus, it would be possible to determine which part of a given compound would be diﬃcult to synthesize.
We also examined the correlation between drug prices and SA. For both new and generic drugs, the price did not depend on SA and thus presumably depended on the total development cost and other factors. Therefore, the theoretical prediction of drug prices is likely to be very diﬃcult.
⦁ ASSOCIATED CONTENT
*S Supporting Information
Table S1: the human SA values with the compound structures. Table S2: the sales prices of approved drugs with the compound sctructures. This material is available free of charge via the Internet at http://pubs.acs.org.
*Tel.: +81-3-3599-8290. Fax: +81-3-3599-8099. E-mail: y-
The authors declare no competing ﬁnancial interest.
This work was supported by grants from the National Institute
of Advanced Industrial Science and Technology (AIST), the New Energy and Industrial Technology Development Organ- ization of Japan (NEDO), and the Ministry of Economy, Trade, and Industry (METI), of Japan. Y.F. thanks Dr. Isao Yasumatsu for helpful suggestions.
⦁ Schneider, G.; Fechner, U. Computer-based de novo Design of Drug like Molecules. Nat. Rev. Drug Discovery 2005, 4, 649−663.
⦁ Baber, J. C.; Feher, M. Predicting Synthetic Accessibility:
Application in Drug Discovery and Development. Mini-Rev. Med. Chem. 2004, 4, 681−692.
⦁ Gillet, V.; Myatt, G.; Zsoldos, Z.; Johnson, A. SPROUT, HIPPO
and CAESA: Tool for de novo Structure Generation and Estimation of Synthetic Accessibility. Perspect. Drug Discovery Des. 1995, 3, 34−50.
⦁ Lewell, X. Q.; Judd, D. B.; Watson, S. P.; Hann, M. RECAP-
retrosynthetic Combinatorial Analysis Procedure: a Powerful New Technique for Identifying Privileged Molecular Fragments with Useful Applications in Combinatorial Chemistry. J. Chem. Inf. Comput. Sci. 1998, 38, 511−522.
⦁ Pförtner, M.; Sitzmann, M. Handbook of Cheminformatics; Gasteiger, J., Ed., Wiley-VCH: Weinheim, 2003; pp 1457−1507.
⦁ Johnson, A.; Marshall, C.; Judson, P. Starting Material Oriented
Retrosynthetic Analysis in the LHASA Program. 1. General description. J. Chem. Inf. Comput. Sci. 1992, 32, 411−417.
⦁ Huang, Q.; Li, L.-L.; Yang, S.-Y. RASA: a Rapid Retrosynthesis- based Scoring Method for the Assessment of Synthetic Accessibility of Drug-like Molecules. J. Chem. Inf. Model. 2011, 51, 2768−2777.
⦁ Podolyan, Y.; Walters, M. A.; Karypis, G. Assessing Synthetic
Accessibility of Chemical Compounds Using Machine Learning Methods. J. Chem. Inf. Model. 2010, 50, 979−991.
⦁ Funatsu, K.; Sasaki, S. Computer-assisted Organic Synthesis Design and Reaction Prediction System, “AIPHOS. Tetrahedron Comput. Methodol. 1988, 1, 27−38.
⦁ Boda, K.; Seidel, T.; Gasteiger, J. Structure and Reaction Based
Evaluation of Synthetic Accessibility. J. Comput.-Aided Mol. Des. 2007,
⦁ Corey, E. J.; Wipke, W. T. Computer-assisted Design of Complex Organic Syntheses. Science 1969, 166, 178−192.
⦁ Timothy, D.; Salatin, T. D.; Jorgensen, W. L. Computer-assisted
Mechanistic Evaluation of Organic Reactions. 1. overview. J. Org. Chem. 1980, 45, 2043−2057.
⦁ Takaoka, Y.; Endo, Y.; Yamanobe, S.; Kakinuma, H.; Okubo, T.;
Shimazaki, Y.; Ota, T.; Sumiya, S.; Yoshikawa, K. Development of a Method for Evaluating Drug-likeness and Ease of Synthesis Using a Data Set in which Compounds are Assigned Scores Based on Chemists’ Intuition. J. Chem. Inf. Comput. Sci. 2003, 43, 1269−1275.
⦁ Lajiness, M. S.; Maggiora, G. M.; Shanmugasundaram, V.
Assessment of the Consistency of Medicinal Chemists in Reviewing Sets of Compounds. J. Med. Chem. 2004, 47, 4891−4896.
⦁ Ertl, P.; Schuffenhauer, A. Estimation of Synthetic Accessibility
Score of Drug-like Molecules Based on Molecular Complexity and Fragment Contributions. J. Cheminform. 2009, 1, 8.
⦁ Wang, Y.; Xiao, J.; Suzek, T. O.; Zhang, J.; Bryant, S. H. PubChem: a Public Information System for Analyzing Bioactivities of Small Molecules. Nucleic Acids Res. 2009, 37, W623−W633.
⦁ Gaulton, A.; Bellis, L. J.; Bentro, A. P.; Chambers, J.; Davies, M.;
Hersey, A.; Light, Y.; McGlinchey, S.; Michalovich, D.; Al-Lazikani, B.; Overington, P. ChEMBL: a Large-scale Bioactivity Database for Drug Discovery. Nucleic Acids Res. 2011, 40, D1100−D1107.
⦁ Irwin, J. I.; Shoichet, B. K. ZINC – a Free Database of
Commercially Available Compounds for Virtual Screening. J. Chem. Info. Model. 2005, 45, 177−182.
⦁ Irwin, J. I.; Sterling, T.; Mysinger, M. M.; Bolstad, E. S.; Coleman, R. G. ZINC − a Free Tool to Discover Chemistry for Biology. J. Chem. Info. Model. 2012, 52, 1757−1768.
⦁ Fukunishi, Y.; Sugihara, Y.; Mikami, Y.; Sakai, K.; Kusudo, H.;
Nakamura, H. Advanced in-silico Drug Screening to Achieve High Hit Ratio – Development of 3D-compound Database. Synthesiology 2008, 2, 64−72.
⦁ Kawabata, T.; Sugihara, Y.; Fukunishi, Y.; Nakamura, H.
LigandBox – a Database for 3D Structures of Chemical Compounds.
BIOPHYSICS. 2013, 9, 113−121.
⦁ Bertz, S. H. The First General Index of Molecular Complexity. J. Am. Chem. Sci. 1981, 103, 3599−3601.
⦁ Kanehisa, M.; Goto, S.; Sato, Y.; Furumichi, M.; Tanabe, M.
KEGG for Integration and Interpretation of Large-scale Molecular Data Sets. Nucleic Acids Res. 2012, 34, D109−D114.
⦁ Wang, J.; Wolf, R. M.; Caldwell, J. W.; Kollman, P. A.; Case, D.
A. Development and Testing of a General Amber Force Field. J. Comput. Chem. 2004, 25, 1157−1174.
⦁ Fukunishi, Y.; Mikami, Y.; Nakamura, H. The Filling Potential
Method: a Method for Estimating the Free Energy Surface for Protein- ligand Docking. J. Phys. Chem. B 2003, 107, 13201−13210.
⦁ Law, V.; Knox, C.; Djoumbou, Y.; Jewison, T.; Guo, A. C.; Liu,
Y.; Maciejewski, A.; Arndt, D.; Wilson, M.; Neveu, V.; Tang, A.;
Gabriel, G.; Ly, C.; Adamjee, S.; Dame, Z. T.; Han, B.; Zhou, Y.; Wishart, D. S. DrugBank 4.0: Shedding New Light on Drug Metabolism. Nucleic Acids Res. 2014, 42, D1091−D1097.FEN1-IN-4