Statistical Aspects of the Interrelation Between the Biological Activity of Chemical Compounds and Their Molecular Structure
Mukhomorov V. K.
Universita Degli Studi di Napoli "Federico II", Via Cintia, Napoli, Italy
To cite this article:
Mukhomorov V. K. Statistical Aspects of the Interrelation Between the Biological Activity of Chemical Compounds and Their Molecular Structure. Biomedical Statistics and Informatics. Vol. 1, No. 1, 2016, pp. 24-34. doi: 10.11648/j.bsi.20160101.14
Received: December 7, 2016; Accepted: December 27, 2016; Published: January 20, 2017
Abstract: An attempt was made to construct an adequate model relating the radioprotective properties of biologically active chemical compounds to their electronic and information factors. The biological activity (radiation protective effect) of chemical compounds was analyzed in relation to their electronic sign and information function. Statistical comparison of qualitative indices revealed that the electronic and information signs are the most informative characteristics of the molecules responsible for radiation protective action. Correlation equations are given for the electronic- and information-dependent change in the antiradiation properties of the molecule. Quantitative estimates were made associating the protective efficiency of the chemical compounds under study with variations in the electronic parameters and the dose of chemicals.
Keywords: Bioactivity, Statistics, Molecular Structure, Electronic Sign, Information Function, Radioprotector, Statistical Criterion, Contingency, Correlation
1. Introduction
Knowledge of the quantitative stochastic interrelation between the chemical structure of a molecule and its biological activity has important theoretical and practical significance. It is necessary both to clarify the mechanism of the biochemical action of molecules and to search for promising new drugs. It is known that the classical apparatus of probability theory and mathematical statistics is the basis of the stochastic simulation of natural phenomena. The main task of such research is to estimate the closeness of the causal relationships between the explanatory parameters and the response of the biological system.
A causal relationship implies that recurrence of the same causes leads to the same consequences. However, a causal relationship can be subject to fluctuations due to random deviations. These fluctuations arise from uncontrolled and unaccounted factors and are identified by statistical laws.
One of the most relevant issues in the modern chemistry of biologically active substances is the problem of creating new effective radioprotectors. The main demands on these drugs are a low effective dose, low toxicity and a lack of side effects. The existence of side effects significantly limits the practical applicability of radioprotectors. Statistical methods are the most rational way to solve problems associated with the action of a combination of factors on a biosystem. Since the effect of the interaction of a drug with a biosystem depends on many conditions, it has a probabilistic nature. Therefore, it is preferable to use a probabilistic model.
It is not always possible to construct an adequate model that describes the relationship between the chemical structure of a compound and its biological activity. If the model is overloaded with a large number of non-essential characteristics, its use becomes almost impossible. At the same time, nothing can compensate for the shortcomings of the model if the main link has been lost. Therefore, an adequate model should reproduce the basic properties of the chemical compounds as closely as possible. Clarifying the connection between molecular structure and biological activity will allow a targeted search for new chemicals and can also contribute to deciphering the mechanisms of their bioactivity.
2. Method and Discussion
To describe the interrelation of bioactivity with molecular structure, we use descriptors (attributes) whose calculation requires only the structural formula of the chemical compound. We take into account the remark of Alexander P. and Bacq Z. [1] on the importance of the primary chemical structure of the drug in the mechanism of protection against ionizing radiation.
We use the average number of electrons in the outer shell of the atoms as a sign of the molecule [2]:
$Z = \frac{1}{N} \sum_{i} n_{i} Z_{i}$ (1)
where n_{i} is the number of atoms of the i-th kind and Z_{i} is the number of electrons in the outer electron shell of an atom of that kind. The summation is performed over all kinds of atoms in the molecule; N is the total number of atoms. In [3] it was shown that the empirical pseudopotential can be represented in the following analytical form
(2)
where the two additional terms are corrections to the Coulomb potential. These corrections depend on the distance r between the molecule and the electron.
Two groups of chemical compounds are given in Table 1 [4, 5]. The first group contains compounds with an effective radioprotective action (dose ≤ 1 mM/kg; survival of more than 50%; these compounds are marked with a "+" sign). The second group contains compounds that show no anti-radiation activity at high doses (dose > 2 mM/kg; these compounds are marked with a "-" sign). This choice of compounds restricts the size of the sample.
Our goal is to find a classification rule that separates the active and inactive chemical compounds in a statistically reliable way. To do this, we use the association method (statistical methods for rates and proportions) for signs that have an alternative variation ("yes" or "no"). The observations and the sign Z of the molecules can be represented as a 2 × 2 (tetrachoric) table (Table 2). We analyze the interrelation between the bioactivity of the chemical compounds and the magnitude of the sign Z.
N | Chemical compounds | I. P. dose, mM/kg [4, 5] | A. R. P. [4, 5] | Z^{*)} | H, bit |
1 | H_{2}N (CH_{2})_{4}CH (NH)_{2}CH_{2}SH | 0.34 | + | 2.346 | 1.460 |
2 | H_{2}NC (=NH)CH_{2}CH_{2}SH | 0.61 | + | 2.571 | 1.611 |
3 | H_{2}NCH_{2}CH_{2}SCN | 0.49 | + | 2.833 | 1.730 |
4 | H_{2}NCH_{2}CH_{2}CH_{2}NHCH_{2}CH_{2}SH | 0.56 | + | 2.273 | 1.418 |
5 | (CH_{3})_{2}NC(=NH)CH_{2}SH | 0.85 | + | 2.471 | 1.545 |
6 | (CH_{3})_{3}CNHCSNHCH_{2}CH_{2}OH | 0.71 | + | 2.444 | 1.583 |
7 | CH_{2}=CHCH_{2}NHCH_{2}CH_{2}SH | 0.85 | + | 2.333 | 1.411 |
8 | CH_{3}CH_{2}CH (NH_{2})CH_{2}SH | 0.95 | + | 2.235 | 1.378 |
9 | (CH_{3})_{2}CH (CH_{2})_{5}NH (CH_{2})S_{2}O_{3}H | 0.07 | + | 2.556 | 1.628 |
10 | CH_{3} (CH_{2})_{6}NH (CH_{2})_{2}S_{2}O_{3}H | 0.29 | + | 2.556 | 1.628 |
11 | H_{2}C=C (CH_{3})CH_{2}SC (=NH)NH_{2} | 0.31 | + | 2.556 | 1.568 |
12 | CH_{3}NH (CH_{2})_{3}NHCH_{2}CH_{2}CH_{2}SPO_{3}H_{2} | 0.31 | + | 2.606 | 1.798 |
13 | H_{2}N (CH_{2})_{5}NHCH_{2}CH_{2}SPO_{3}H_{2} | 0.62 | + | 2.606 | 1.798 |
14 | H_{2}NCH_{2}C (CH_{3})_{2}CH_{2}NHCH_{2}CH_{2}SPO_{3}H_{2} | 0.62 | + | 2.606 | 1.798 |
15 | CH_{2}=C (NH_{2})CH_{2}CH_{2}SH | 0.15 | + | 2.400 | 1.472 |
16 | H_{2}N (CH_{2})_{5}CH (NH_{2})CH_{2}SPO_{3}H_{2} | 0.21 | + | 2.606 | 1.798 |
17 | Cyclo-C_{6}H_{11}NHP (O) (OH)SH | 0.19 | + | 2.741 | 1.818 |
18 | H_{2}N (CH_{2})_{3}NHCH_{2}CH_{2}CH_{2}SPO_{3}H_{2} | 0.32 | + | 2.667 | 1.849 |
19 | H_{2}NCH_{2}CH (CH_{3})CH_{2}NHCH_{2}CH_{2}SPO_{3}H_{2} | 0.44 | + | 2.667 | 1.849 |
20 | H_{2}NCH_{2}CH_{2}CH (CH_{3})NHCH_{2}CH_{2}SPO_{3}H_{2} | 0.66 | + | 2.667 | 1.849 |
21 | L (+)=H_{2}N (CH_{2})_{4}CH (NH_{2})CH_{2}SPO_{3}H_{2} | 0.14 | + | 2.667 | 1.849 |
22 | H_{2}N (=NH)CH_{2}SSCH_{2}CH_{2} (=NH)NH_{2} | 0.07 | + | 2.667 | 1.641 |
23 | H_{2}NCH_{2}CH_{2}CH_{2}NHCH_{2}CH_{2}SPO_{3}H_{2} | 0.07 | + | 2.741 | 1.904 |
24 | H_{2}NC (=NH)NHCH_{2}CH (CH_{3}) CH_{2}NH (CH_{2})SPO_{3}H_{2} | 0.07 | + | 2.813 | 1.945 |
25 | H_{2}NC (=NH)NH (CH_{2})_{3}NH (CH_{2})_{3}SPO_{3}H_{2} | 0.08 | + | 2.743 | 1.897 |
26 | H_{2}NC (=NH)CH_{2}SH | 0.13 | + | 2.727 | 1.686 |
27 | H_{2}N (CH_{2})_{3}NHCH_{2}CH (OH)CH_{2}SPO_{3}H_{2} | 0.82 | + | 2.774 | 1.890 |
28 | CH_{3}CH_{2}CH_{2}CH_{2}NHP (O) (OH)SH | 0.15 | + | 2.667 | 1.868 |
29 | H_{2}C=CHCH_{2}NHCH_{2}CH_{2}SH | 0.85 | + | 2.333 | 1.411 |
30 | H_{2}NCH_{2}CH (OH)CH_{2}NHCH_{2}CH_{2}SPO_{3}H_{2} | 0.33 | + | 2.857 | 1.943 |
31 | H_{2}NC (=NH)NHCH_{2}CH_{2}NHCH_{2}CH_{2}SPO_{3}H_{2} | 0.10 | + | 2.897 | 1.997 |
32 | | 0.10 | + | 2.813 | 1.649 |
33 | | 0.05 | + | 2.813 | 1.649 |
34 | H_{2}N CH_{2}CH_{2}SCN | 0.49 | + | 2.833 | 1.730 |
35 | H_{2}NCH_{2}CH_{2}SSCH_{2}CH_{2}NH_{2} | 0.99 | + | 2.500 | 1.571 |
36 | H_{2}N (NH)CNHCH_{2}CH_{2}S_{2}O_{3}H | 0.50 | + | 3.300 | 2.082 |
37 | H_{2}N (CH_{2})_{3}NH (CH_{2})_{2}SPO_{5}H_{2} | 0.31 | + | 2.875 | 1.919 |
38 | H_{2}N- (CH_{2})_{4}NH (CH_{2})_{2}SPO_{5}H_{2} | 0.77 | + | 2.875 | 1.919 |
39 | CH_{3}CONHCH_{2}CH_{2}SS (CH_{2})_{4}SO_{2}H | 0.17 | + | 2.813 | 1.781 |
40 | H_{2}NCH_{2}CH_{2}SS CH_{2}CONH_{2} | 0.60 | + | 2.727 | 1.794 |
41 | L (-)-H_{2}NCH_{2}CH_{2}CH (NH_{2})CH_{2}SPO_{3}H_{2} | 0.63 | + | 2.833 | 1.966 |
42 | H_{2}NC (=NH)NH (CH_{2})_{3}NHCH_{2}CH_{2}SPO_{3}H_{2} | 0.10 | + | 2.813 | 1.945 |
43 | HO_{2}S (CH_{2})_{4}-SSS- (CH_{2})_{4}SO_{2}H | 0.06 | + | 2.971 | 1.739 |
44 | H_{2}O_{3}PSCH_{2}CH_{2}NH (CH_{2})_{3}NH CH_{2}CH_{2} SPO_{3}H_{2} | 0.35 | + | 2.974 | 2.014 |
45 | H_{2}NCH_{2}CH_{2}SSCH_{2}COOH | 0.30 | + | 3.000 | 1.918 |
46 | CH_{3}S (CH_{2})_{3}NHC (=NH)CH_{2}S_{2}O_{3}H | 0.19 | + | 3.000 | 1.939 |
47 | H_{2}NCH_{2}CH (NH_{2})CH_{2}SPO_{3}H_{2} | 1.00 | + | 2.952 | 2.032 |
48 | H_{2}NCH_{2}CH_{2}SPO_{3}H_{2} | 0.64 | + | 3.125 | 2.078 |
49 | H_{2}NC (=NH)NHCH_{2}CH_{2}SPO_{3}H_{2} | 0.25 | + | 3.143 | 2.131 |
50 | HSCH_{2}CONHNHCOCH_{2}SH | 0.83 | + | 3.222 | 2.059 |
51 | Histamine (H- imidazole-4- ethanamine) | 0.90 | + | 2.588 | 1.447 |
52 | Mexaminum | 0.05 | + | 2.643 | 1.473 |
53 | Serotonin (5-hydroxytryptamine) | 0.06 | + | 2.720 | 1.514 |
54 | Thiazolidin | 0.85 | + | 2.333 | 1.411 |
55 | H_{2}NCH_{2}CH_{2}CH_{2}CH_{2}CH_{2}SSH | 0.24 | + | 2.381 | 1.454 |
56 | H_{2}NCH_{2}CH_{2}CH_{2}NHCH_{2}CH_{2}CH_{2}SPO_{3}H | 0.33 | + | 2.724 | 1.883 |
57 | (CH_{3})_{2}NC_{6}H_{4}CH (OH)S (CH_{2})NH_{2} | 0.78 | + | 2.600 | 1.600 |
58 | CH_{2}=CHCH_{2}NHCSNH_{2} | 6.89 | - | 2.667 | 1.640 |
59 | CH_{3}CH (NH_{2})COSH | 11.4 | - | 2.769 | 1.823 |
60 | H_{2}NCH_{2}CH_{2}SO_{2}NH_{2} | 4.83 | - | 2.933 | 1.907 |
61 | H_{2}NSSO_{3}H | 4.65 | - | 4.222 | 1.891 |
62 | H_{2}NCH_{2}COSH | 11.0 | - | 3.000 | 1.961 |
63 | CH_{3}CH_{2}CH_{2}NHCSNH_{2} | 4.23 | - | 2.471 | 1.545 |
64 | HCONHCH_{2}CH (CH_{3})SH | 3.36 | - | 2.625 | 1.717 |
65 | H_{2}NCH_{2}CH_{2}COSH | 9.51 | - | 2.769 | 1.823 |
66 | (CH_{3})C (SH)CH (NH_{2})COOH | 13.4 | - | 2.938 | 1.875 |
67 | (CH_{3})_{2}NCSSH | 4.12 | - | 2.769 | 1.669 |
68 | CH_{3} CH (NH_{2})COSH | 11.4 | - | 2.769 | 1.823 |
69 | H_{2}NCOCH (NH_{2})CH_{2}SH | 9.99 | - | 2.800 | 1.857 |
70 | H_{2}NC (=NH)SCH_{2}CH_{2}CH_{2}SO_{3}H | 10.1 | - | 3.143 | 2.012 |
71 | (CH_{3})_{2} NNHCH_{2}CH_{2}SH | 4.16 | - | 2.316 | 1.457 |
72 | CH_{3}CH_{2}OCOCH_{2}NHCSSCH_{2}CH_{3} | 5.07 | - | 2.522 | 1.491 |
73 | H_{2}C=CHCH_{2}NHC (O)SCH_{2}COOCH_{2}CH_{3} | 4.93 | - | 2.846 | 1.174 |
74 | HO (CH_{2})_{2}CH_{2}NHCH_{2}CH_{2}S_{2}O_{3}H | 3.72 | - | 2.960 | 1.855 |
75 | 4- (2- Mercaptooxazolyl)-erythrite | 8.97 | - | 3.000 | 1.807 |
76 | H_{2}NCH_{2}CH_{2}SC (O)CH_{2} | 3.91 | - | 2.733 | 1.774 |
77 | BrC_{6}H_{4}O (CH_{2})_{4}NHCH_{2}CH_{2}S_{2}O_{3}H | 2.13 | - | 3.000 | 1.878 |
78 | CH_{3}CH_{2}CH_{2}NHCSNH_{2} | 4.23 | - | 2.471 | 1.545 |
79 | CH_{3}CH_{2}SC (S)NHCH_{2}COOH | 5.59 | - | 3.053 | 1.925 |
80 | HO_{2}CCH_{2}NHCONHCH_{2}CH_{2}SH | 5.62 | - | 3.048 | 1.936 |
81 | Thionicotinamide | 4.71 | - | 3.067 | 1.706 |
82 | CH_{3}SC (O)CH_{2}CH_{2}NHCONHCH_{2}CH_{2} SC (O)SCH_{3} | 12.3 | - | 3.031 | 1.918 |
83 | HOCH_{2} (CHOH)_{2}CH_{2}NHCH_{2}CH_{2}S_{2}O_{3}H | 3.07 | - | 3.067 | 1.853 |
84 | HOCH_{2}CHOHCH_{2}NHCH_{2}CH_{2}S_{2}O_{3}H | 7.60 | - | 3.077 | 1.880 |
85 | 2-Carboxypyrrolidine-1- dithiocarboxylic acid | 5.24 | - | 3.211 | 1.958 |
86 | CH_{3}OCOCH_{2}CH_{2}SO_{2}CH_{2}CH (NH_{2})COOH | 3.18 | - | 3.143 | 1.901 |
87 | H_{2}NC (=NH)SCH_{2}CH_{2}CH_{2}SO_{3}H | 10.1 | - | 3.143 | 2.013 |
88 | [H_{2}NC (=NH)NHCH (COOH)CH_{2}S]_{2}- | 3.09 | - | 3.167 | 2.017 |
89 | N-Oxide 4-mercaptodihydropyridine | 7.87 | - | 2.970 | 1.892 |
90 | H_{2}NCH_{2}CHOHCH_{2}S_{2}O_{3}H | 5.35 | - | 3.263 | 1.970 |
91 | H_{2}NCH_{2}CH (CH_{2}OH)S_{2}O_{3}H | 4.81 | - | 3.263 | 1.970 |
92 | CH_{3}C (=NH)SCH_{2}CH_{2}CH_{2}S_{2}O_{3}H | 5.08 | - | 3.130 | 1.951 |
93 | 2-Furyl-CH_{2}NHC (=NH)CH_{2}S_{2}O_{3}H | 4.00 | - | 3.360 | 2.049 |
94 | H_{2}NCONHCH_{2}CH_{2}S_{2}O_{3}H | 5.00 | - | 3.474 | 2.103 |
95 | γ- (S-Purinyl) Thiopropylsulphonic acid | 4.42 | - | 3.407 | 2.089 |
96 | HCONHCH_{2} CH (CH_{3})SH | 3.36 | - | 2.625 | 1.717 |
97 | CF_{3}CF_{2}CH_{2}OCOCH_{2}CH_{2}NHCH_{2}CH_{2}S_{2}O_{3}H | 3.00 | - | 3.818 | 2.249 |
98 | (NC)_{2}C=C (SH)_{2} | 3.94 | - | 4.000 | 1.922 |
99 | 1, 2, 5-Thiadiazole-3-carboxylic acid | 7.69 | - | 4.146 | 1.842 |
100 | 1, 2, 5-Thiadiazole -3, 4-dicarboxylic acid | 4.60 | - | 4.462 | 2.162 |
^{*)} The number of electrons in the outer shell of an atom: Z (H) = 1, Z (C) = 4, Z (N) = 5, Z (S) = 6, Z (P) = 5, Z (O) = 6, Z (Pb) = 4, Z (Br, F) = 7.
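As a minimal sketch of equation (1), the sign Z can be computed from the atom counts of a structural formula and the outer-shell electron numbers listed in the footnote to Table 1. The function name `electron_sign` and the dictionary format for the composition are illustrative assumptions, not part of the original description.

```python
# Outer-shell electron counts as listed in the footnote to Table 1.
VALENCE_ELECTRONS = {
    "H": 1, "C": 4, "N": 5, "S": 6, "P": 5, "O": 6, "Pb": 4, "Br": 7, "F": 7,
}


def electron_sign(composition):
    """Average number of outer-shell electrons per atom (the sign Z).

    `composition` maps an element symbol to the number of such atoms in
    the molecule (a hypothetical input format chosen for this sketch).
    """
    total_atoms = sum(composition.values())
    total_electrons = sum(
        VALENCE_ELECTRONS[element] * count for element, count in composition.items()
    )
    return total_electrons / total_atoms


# Compound 48 of Table 1, H2NCH2CH2SPO3H2: C2 H8 N1 O3 P1 S1 (16 atoms).
compound_48 = {"C": 2, "H": 8, "N": 1, "O": 3, "P": 1, "S": 1}
print(round(electron_sign(compound_48), 3))  # 3.125, as listed in Table 1
```

The same call with the composition of compound 3 (C3 H6 N2 S1) gives 2.833, which also agrees with the tabulated value.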
First of all, we need to set the threshold value Z^{(th)} of the sign, which separates effective radioprotectors from ineffective ones in a statistically significant way. We first determine the mean value of the sign Z for the sample of chemical compounds (Table 1). We obtained the following statistics for the average value of Z:
N = 100, Z^{(av)} = 2.87 ± 0.04, Z_{min} = 2.235, Z_{max} = 4.462, S_{z} = 0.40. (3)
Here Z_{min} and Z_{max} are the minimum and maximum values of the sign Z; S_{z} is the standard deviation of the sample. The average value Z^{(av)} should be compatible with the other elements of the sample. Typically, the maximum and minimum sample elements are the questionable ones. An element is an outlier of the sample if the following inequality holds:
$\tau = \frac{|Z_{max (min)} - Z^{(av)}|}{S_{z}} > \tau_{1-p}(f)$ (4)
where f is the number of degrees of freedom and \tau_{1-p}(f) is the tabulated fractile of the τ-distribution of the maximum deviation [6]. Let us verify the compatibility of the sample points:
(5)
Here f is equal to 100. From inequality (5) it follows that compound number N = 100 (Z = 4.462) is not compatible with the other sample elements. Consequently, this compound must be excluded from the sample and the calculation of the average value must be repeated. After repeating the calculations, we found that the compounds numbered 96, 97, 98, and 99 must also be excluded from the sample. The average value now has the following statistics:
N = 95, Z^{(av)} = 2.80 ± 0.03, Z_{min} = 2.235, Z_{max} = 3.474, S_{z} = 0.27. (6)
Here Z_{min} and Z_{max} are the minimum and maximum values of Z in the sample that contains N = 95 elements; f = 95. The sample satisfies the following inequality:
χ^{2} = 3.094 < χ^{2}_{cr} = 14.1, p = 0.88, N = 95. (7)
Thus, the sample is homogeneous and fits the normal distribution. Here the p value gives the significance level of the criterion, which determines the probability of error (~ 10%); f is the number of degrees of freedom. The Shapiro-Wilk criterion is also satisfied: W = 0.989 > W_{cr} = 0.950.
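The compatibility check applied to statistics (3) can be sketched as follows. The explicit form of criterion (4) is not legible in the text, so the sketch assumes the usual maximum relative deviation, |x - mean| / S; the function name is hypothetical, and the critical fractile has to be taken from the tables in [6], which are not reproduced here.

```python
def max_relative_deviation(extreme, mean, std):
    """Maximum relative deviation of a suspicious sample element.

    Assumed form of the statistic in (4): |x - mean| / S.
    """
    return abs(extreme - mean) / std


# Statistics (3): N = 100, mean Z = 2.87, S_z = 0.40.
tau_max = max_relative_deviation(4.462, 2.87, 0.40)  # largest Z, compound 100
tau_min = max_relative_deviation(2.235, 2.87, 0.40)  # smallest Z

print(round(tau_max, 2))  # 3.98
print(tau_min < tau_max)  # True: the maximum is the more suspicious element
```

The value obtained for the maximum is then compared with the tabulated fractile for the chosen significance level and f degrees of freedom.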
Now we can determine the average value of Z for the effective and ineffective radioprotectors (N = 95). As a result, we obtained the following statistics:
N_{1} = 57, Z_{1}^{(av)} = 2.71 ± 0.03, Z_{min} = 2.235, Z_{max} = 3.300, S_{z1} = 0.24,
N_{2} = 38, Z_{2}^{(av)} = 2.95 ± 0.04, Z_{min} = 2.316, Z_{max} = 3.474, S_{z2} = 0.27. (8)
The values of Z are located around Z_{1}^{(av)} and Z_{2}^{(av)} for the effective and ineffective chemical compounds, respectively. Using tabulated values of the t-distribution, we can verify whether the distinction in the average values of the sign Z (Z_{2}^{(av)} > Z_{1}^{(av)}) is statistically significant. First, we compare the variances of the samples: F = 1.34 < F_{cr}. That is, the difference between the variances is not statistically significant. Then we use the following inequality [7]:
(9)
Inequality (9) shows that at the 5% significance level the null hypothesis of equality of the average values can be rejected. Consequently, the difference between the average values of Z for the effective and ineffective compounds is statistically significant.
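The comparison of the two average values can be illustrated from the summary statistics (8). The explicit form of inequality (9) is not legible in the text; the sketch below assumes the pooled-variance Student statistic, which is the usual choice when the variances do not differ significantly. The function name is hypothetical.

```python
import math


def pooled_t(mean1, s1, n1, mean2, s2, n2):
    """Two-sample Student t statistic with a pooled variance estimate."""
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean2 - mean1) / standard_error


# Statistics (8): effective (N1 = 57) and ineffective (N2 = 38) compounds.
t_value = pooled_t(2.71, 0.24, 57, 2.95, 0.27, 38)
degrees_of_freedom = 57 + 38 - 2

# The two-sided 5% critical value of t for 93 degrees of freedom is about 1.99.
print(round(t_value, 2), degrees_of_freedom)
```

With these rounded inputs the statistic is about 4.5, well above the critical value, which agrees with the conclusion drawn from inequality (9).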
In the first approximation, we can assume that the average value Z^{(av)} = 2.80 is a threshold that separates chemical compounds with different radioprotective efficiency. However, it is better to choose the threshold value by repeatedly testing various Z values close to Z^{(av)} (for example, within the mean error). One can then use the value of Z that leads to the most convincing statistical inference. This approach is demonstrated in the search for classification rules by statistical methods for rates and proportions.
According to the analysis, it is preferable to choose the threshold Z^{(th)} = 2.87. Importantly, the chemical compounds NN = 97 - 100 have a sign Z noticeably larger than the average value, so they do not violate the rule that ineffective compounds have the larger Z.
We need to verify whether the separation of the chemical compounds into two conditional groups is the result of random factors. It is convenient to start the description of the classification with the construction of a table of mutual contingency (or association) [8, 9] (cross-selection method). Figure 1 shows the distribution of the chemical compounds over the quadrants of the rectangular 2 × 2 table (table of "four fields"). Each cell of the table gives the number (frequency) q_{ij} of objects. Obviously, the closer the contingency table is to diagonal form, the better the classification model describes the phenomenon. For the objects in each quadrant, we do not assume the existence of a functional mathematical relationship between the dependent variable and the explanatory variable.
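[Figure 1, panels A and B: distribution of the chemical compounds over the quadrants of the 2 × 2 table.]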
Contingency (association) method is applicable, if the sample size satisfies the following inequality: . It is generally believed that the frequencies q_{ij} meet the inequality of subject to i ≠ j [8].
To determine the Pearson contingency coefficient Φ [9] between the radioprotective efficacy and value of sign Z we use the following equation:
(10)
Here the number of degrees of freedom is f = N – 2; q_{11} = 45 is the number of effective chemical compounds (D ≤ 1 mM/kg) with Z < Z^{(th)}; q_{12} = 12 is the number of effective chemical compounds (D ≤ 1 mM/kg) with Z > Z^{(th)}; q_{22} = 29 is the number of ineffective chemical compounds (D > 2 mM/kg) with Z > Z^{(th)}; q_{21} = 14 is the number of ineffective chemical compounds (D > 2 mM/kg) with Z < Z^{(th)} (Table 2). For tetrachoric contingency tables, the Yule coefficient of association can also be used [8]:
$Q = \frac{q_{11} q_{22} - q_{12} q_{21}}{q_{11} q_{22} + q_{12} q_{21}}$ (11)
The coefficient Q = 0.77 points to the existence of an interrelation between the signs. Obviously, this coefficient lies in the range -1 ≤ Q ≤ 1.
The signs RE (the radioprotective efficiency) and Z are independent if the product of the marginal (unconditional) proportions is equal to the joint proportion (see Table 2). For example, we obtained the following result: P_{2} · _{1}P = 0.41 · 0.57 = 0.23, whereas p_{12} = 0.12. These proportions differ considerably. The greater the difference, the stronger the interdependence of the signs RE and Z.
The application of the threshold value Z^{(th)} leads to more convincing statistical results than the use of the average value Z^{(av)}. The values in brackets in Table 2 are the statistical results obtained for the average value Z^{(av)}. Using the average value also suggests a correlation of the signs at the significance level α = 0.05. In this case, however, the interrelation is too weak: Φ = 0.19. Therefore, it is preferable to use the threshold value 2.87. The adequacy of the model can be verified using the empirical error, which is the fraction of misclassified objects: Δ = (q_{12} + q_{21})/N. Using the data in Table 2, we found the following value of the empirical error of the model: Δ = 0.26. Applying the threshold value reduces the empirical error of the model by approximately 21%.
The sign of Z | RE: effective chemical compounds, D ≤ 1 mM/kg | RE: ineffective chemical compounds, D > 2 mM/kg | The total sum |
Z < Z^{(th)} | q_{11} = 45 (36); p_{11} = 0.45 (0.36) | q_{21} = 14 (12); p_{21} = 0.14 (0.12) | 59 (48); P_{1} = 0.59 (0.48) |
Z > Z^{(th)} | q_{12} = 12 (21); p_{12} = 0.12 (0.21) | q_{22} = 29 (31); p_{22} = 0.29 (0.31) | 41 (52); P_{2} = 0.41 (0.52) |
The total sum | 57 (57); _{1}P = 0.57 (0.57) | 43 (43); _{2}P = 0.43 (0.43) | N = 100; 1.00 |
Q = 0.77 (0.63), Φ = 0.39 (0.19), χ^{2 *)}, SE = 0.09 (0.09), K = 0.43 (0.32), |r_{tet}| = 0.68 (0.53), Δ = 0.26 (0.33).
^{*)} The chi-square value was calculated using equation (17).
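Two of the summary statistics of Table 2 can be reproduced directly from the cell frequencies. The sketch below assumes the standard Yule coefficient of association for Q and takes the empirical error as the fraction of off-diagonal (misclassified) objects; the function names are hypothetical.

```python
def yule_q(q11, q12, q21, q22):
    """Yule coefficient of association for a 2 x 2 table."""
    return (q11 * q22 - q12 * q21) / (q11 * q22 + q12 * q21)


def empirical_error(q11, q12, q21, q22):
    """Fraction of misclassified objects (off-diagonal cells)."""
    return (q12 + q21) / (q11 + q12 + q21 + q22)


# Frequencies of Table 2 for the threshold 2.87 and, in brackets there,
# for the average value 2.80.
threshold_table = dict(q11=45, q12=12, q21=14, q22=29)
average_table = dict(q11=36, q12=21, q21=12, q22=31)

print(round(yule_q(**threshold_table), 2))  # 0.77
print(round(yule_q(**average_table), 2))    # 0.63
print(empirical_error(**threshold_table))   # 0.26
print(empirical_error(**average_table))     # 0.33
```

The relative reduction of the error, (0.33 - 0.26) / 0.33, is about 21%, in line with the figure quoted in the text.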
Let us check the representativeness of the sample (Table 1). Using a table of random numbers [10], we draw a partial sample from the data of Table 1. The method of random numbers avoids involuntary and systematic mistakes in the preparation of the sample. As a result, we obtained the following sequence of random numbers:
03, 47, 43, 73, 86, 36, 96, 46, 63, 71, 62, 33, 26, 16, 80, 45, 60, 11, 14, 10, 74, 24, 67, 42, 81, 57, 20, 53, 32, 37, 27, 07, 51, 79, 89, 76, 66, 56, 50, 90. (12)
A series of random numbers can be obtained starting from any point of the table of random numbers. We wrote down all the random numbers that do not exceed 96 [6]. Matching these random numbers with the numbers of the chemical compounds in Table 1 gave a partial sample of 40 items. In the partial sample, the sequence of chemical compounds is chosen "with an open mind" [10]. The statistics of the partial sample are as follows:
N = 40, Z^{(av)} = 2.82 ± 0.04, Z_{min} = 2.316, Z_{max} = 3.300, S_{z} = 0.23.
N_{1} = 24, Z_{1}^{(av)} = 2.78 ± 0.04, Z_{min} = 2.333, Z_{max} = 3.300, S_{z1} = 0.21,
N_{2} = 16, Z_{2}^{(av)} = 2.88 ± 0.07, Z_{min} = 2.316, Z_{max} = 3.263, S_{z2} = 0.25. (13)
This result is similar to the statistics (6), and the sign of Z is represented in the same proportion as in the original sample.
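The composition of the partial sample can be checked from the random numbers (12) alone, because compounds 1 to 57 in Table 1 are the effective ones and the remaining compounds are ineffective. The following sketch only splits the listed numbers by class; the function name is an assumption made for illustration.

```python
# Random numbers of sequence (12), read as compound numbers of Table 1.
RANDOM_NUMBERS = [
    3, 47, 43, 73, 86, 36, 96, 46, 63, 71, 62, 33, 26, 16, 80, 45, 60, 11, 14, 10,
    74, 24, 67, 42, 81, 57, 20, 53, 32, 37, 27, 7, 51, 79, 89, 76, 66, 56, 50, 90,
]
LAST_EFFECTIVE = 57  # compounds 1-57 are marked "+" in Table 1


def split_by_class(numbers, last_effective=LAST_EFFECTIVE):
    """Split compound numbers into effective and ineffective subsets."""
    effective = [n for n in numbers if n <= last_effective]
    ineffective = [n for n in numbers if n > last_effective]
    return effective, ineffective


effective, ineffective = split_by_class(RANDOM_NUMBERS)
print(len(RANDOM_NUMBERS), len(effective), len(ineffective))  # 40 24 16
```

The counts reproduce N = 40, N_{1} = 24 and N_{2} = 16 of statistics (13), and the share of effective compounds (24/40 = 0.60) is close to that of the full sample (57/100).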
The standard error of the contingency coefficient can be assessed using the following equation:
(14)
The significance is tested using the chi-square test [9]:
(15)
i.e., at the α = 0.05 significance level the null hypothesis can be rejected. For normally distributed data, one can additionally use the tetrachoric coefficient of association (-1 ≤ r_{tet} ≤ 1):
(16)
However, if the distribution of frequencies on the borders of the two-by-two table is non-uniform, this coefficient becomes unreliable. Therefore, the Pearson goodness-of-fit criterion with the Yates continuity correction is commonly used [8, 9]:
$\chi^{2} = \frac{N (|q_{11} q_{22} - q_{12} q_{21}| - N/2)^{2}}{(q_{11} + q_{12})(q_{21} + q_{22})(q_{11} + q_{21})(q_{12} + q_{22})} > \chi^{2}_{cr}$ (17)
For the frequencies given in brackets in Table 2 (average value Z^{(av)}), χ^{2} = 10.8.
Here N = q_{11} + q_{12} + q_{22} + q_{21} is the sum of all frequencies. Inequality (17) shows that there is a statistically significant interrelation of the signs. However, criterion (17) gives no idea of the strength of the interrelation. The closeness of the linkage between the signs can be assessed using the Pearson coefficient of mutual contingency:
(18)
The mean-square contingency is equal to:
(19)
Using equation (18), we determine the coefficient of mutual contingency K = 0.43 (0.32), which confirms the interrelation of the dichotomous signs.
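The chi-square criterion with the Yates continuity correction can be sketched for the frequencies of Table 2. The formula used here is the standard one for a 2 × 2 table and is assumed to correspond to equation (17); the critical value 3.841 is the tabulated 5% point of the chi-square distribution for one degree of freedom, which is the usual choice for such a table and is likewise an assumption about the original calculation.

```python
def yates_chi_square(q11, q12, q21, q22):
    """Chi-square statistic of a 2 x 2 table with Yates continuity correction."""
    n = q11 + q12 + q21 + q22
    numerator = n * (abs(q11 * q22 - q12 * q21) - n / 2) ** 2
    denominator = (q11 + q12) * (q21 + q22) * (q11 + q21) * (q12 + q22)
    return numerator / denominator


CHI2_CRITICAL = 3.841  # 5% point, one degree of freedom (assumed here)

chi2_threshold = yates_chi_square(45, 12, 14, 29)  # threshold 2.87
chi2_average = yates_chi_square(36, 21, 12, 31)    # average 2.80 (bracketed values)

print(round(chi2_average, 1))  # 10.8
print(chi2_threshold > CHI2_CRITICAL, chi2_average > CHI2_CRITICAL)  # True True
```

Both statistics exceed the critical value, and the one for the threshold 2.87 is the larger of the two, consistent with the preference for that threshold.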
The study of the structure-activity interrelation of the molecules showed that the electronic sign Z is associated with the Shannon information function [11]:
$H = -\sum_{i=1}^{k} p_{i} \log_{2} p_{i}$ (20)
where p_{i} = n_{i}/N, and the following relations hold for p_{i}: 0 < p_{i} ≤ 1, \sum_{i} p_{i} = 1; N = \sum_{i} n_{i}, k is the number of varieties of atoms in the molecule, and N is the total number of atoms. The ratio n_{i}/N determines the relative share of the i-th kind of atom in the molecule [12]. The Shannon function is an integral characteristic of the molecule that determines the measure of uncertainty (or diversity) of the structure of the chemical compound. The larger the value of the function H, the more diverse (in the relative content of atoms) the multicomponent system.
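A minimal sketch of the information function (20), assuming the base-2 Shannon form implied by the unit "bit" in Table 1. The function name and the dictionary format for the composition are illustrative assumptions.

```python
import math


def information_function(composition):
    """Shannon information function H (in bits) of an atomic composition.

    `composition` maps an element symbol to the number of such atoms in
    the molecule (hypothetical input format chosen for this sketch).
    """
    total_atoms = sum(composition.values())
    shares = [count / total_atoms for count in composition.values() if count > 0]
    return -sum(p * math.log2(p) for p in shares)


# Compound 48 of Table 1, H2NCH2CH2SPO3H2: C2 H8 N1 O3 P1 S1.
compound_48 = {"C": 2, "H": 8, "N": 1, "O": 3, "P": 1, "S": 1}
print(round(information_function(compound_48), 3))  # 2.078, as listed in Table 1
```

A molecule built from a single kind of atom gives H = 0, and H grows as the atoms are spread more evenly over more kinds.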
Using the data of Table 1, we determine the average value of the information function:
N = 100, H^{(av)} = 1.80 ± 0.02, H_{min} = 1.174, H_{max} = 2.249, S_{H} = 0.21. (21)
We verify the compatibility of the elements of the sample on the basis of H:
(22)
Here f is equal to 100. Consequently, the sample does not contain incompatible elements. The statistics of the average value of the information function for the effective radioprotectors are as follows:
N_{1} = 57, H_{1}^{(av)} = 1.76 ± 0.03, H_{min} = 1.378, H_{max} = 2.131, S_{H1} = 0.21. (23)
This subset is close to a normal distribution, and the Shapiro-Wilk criterion is satisfied: W = 0.951 > W_{cr} = 0.947. Let us check the compatibility of the elements of this subset:
(24)
Here f is equal to 57. These inequalities point to the absence of incompatible elements.
For the ineffective radioprotectors, the statistics of the average value are as follows:
N_{2} = 43, H_{2}^{(av)} = 1.85 ± 0.03, H_{min} = 1.174, H_{max} = 2.249, S_{H2} = 0.20. (25)
Checking the elements of the second subset leads to the inequalities:
(26)
Here f is equal to 43. From the second inequality it follows that the chemical compound number 16 (H = 1.174 bit) is incompatible with the other elements of the subset. After excluding this element, we obtained the following statistics for the information function:
N_{2} = 42, H_{2}^{(av)} = 1.87 ± 0.03, H_{min} = 1.457, H_{max} = 2.249, S_{H2} = 0.17. (27)
This subset is close to a normal distribution. The Shapiro-Wilk criterion exceeds the critical value: W = 0.964 > W_{cr} = 0.942. The examination of the uniformity of the subset leads to the following inequalities:
(28)
Here f is equal to 42. Thus, the subset comprises only compatible elements. Let us see whether the distinction between the average values H_{1}^{(av)} and H_{2}^{(av)} is statistically significant. We first check the distinction between the variances S_{H1}^{2} and S_{H2}^{2}: F = 1.52 < F_{cr}. That is, the distinction in variance is not statistically significant. Therefore, we must use the following inequality:
N = 99, N_{1} = 57, N_{2} = 42, S_{H1} = 0.21, S_{H2} = 0.17. (29)
Inequality (29) rejects the null hypothesis of equality of the average values of the information function.
Again, we use the association method for qualitative signs. As the boundary value, we choose the following value of the information function (21): H^{(th)} = 1.80 bit. The numerical data are contained in Table 3.
The sign of H, bit | RE: effective chemical compounds, D ≤ 1 mM/kg | RE: ineffective chemical compounds, D > 2 mM/kg | The total sum |
H < H^{(th)} | q_{11} = 31; p_{11} = 0.31 | q_{21} = 11; p_{21} = 0.11 | 42; P_{1} = 0.42 |
H > H^{(th)} | q_{12} = 26; p_{12} = 0.26 | q_{22} = 32; p_{22} = 0.32 | 58; P_{2} = 0.58 |
The total sum | 57; _{1}P = 0.57 | 43; _{2}P = 0.43 | N = 100 |
Q = 0.55, Φ = 0.07, χ^{2 *)}, SE = 0.10, K = 0.25, |r_{tet}| = 0.46, Δ = 0.37.
^{*)} The chi-square value was calculated using equation (17).
Thus, the sign H also separates effective radioprotectors from ineffective chemicals. Varying the threshold H^{(th)} ≡ H^{(av)} = 1.80 bit does not improve the statistical results.
Let us examine these classification rules for chemical compounds that have anti-radiation activity but were not included in the original sample: 1) NH_{2}CH_{2}CH_{2}CH_{2}SH (Dose: 3.79 mM/kg; Z = 2.73, H = 1.43 bit), 2) (CH_{3})_{2}S=O (Dose: 6.4-12.8 mM/kg; Z = 2.60, H = 1.57 bit), 3) NH_{2}CH_{2}CH_{2}NHCOCH_{2}SH (Dose; Z = 2.63, H = 1.77 bit), 4) cysteine (Dose: 1.56-1.94 mM/kg; Z = 2.36, H = 1.49 bit), 5) disulfide β-mercaptoethylamine (Dose: 0.99-1.18 mM/kg; Z = 2.50, H = 1.57 bit), 6) S-β-aminoethylisothiuronium (AET) (Dose: 1.68-2.10 mM/kg; Z = 2.63, H = 1.63 bit), 7) (CH_{3})_{2}N-C_{6}H_{5}-CH (OH)-S-CH_{2}CH_{2}NH_{2} (Dose: 0.88-1.77 mM/kg; Z = 2.55, H = 1.56 bit). Obviously, the signs of these chemical compounds satisfy the inequalities Z < Z^{(th)} and H < H^{(th)}.
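The check of the seven additional compounds can be written out as a short sketch. It assumes that the classification rule reads "Z below 2.87 and H below 1.80 bit", which is the direction suggested by Tables 2 and 3, where the effective compounds concentrate in the rows with the lower values of both signs. The function name is hypothetical.

```python
Z_THRESHOLD = 2.87  # threshold of the electronic sign
H_THRESHOLD = 1.80  # threshold of the information function, bit


def satisfies_rule(z, h):
    """True if both signs lie below their thresholds (assumed rule)."""
    return z < Z_THRESHOLD and h < H_THRESHOLD


# (Z, H) pairs of the seven compounds listed in the text, in the same order.
ADDITIONAL_COMPOUNDS = [
    (2.73, 1.43), (2.60, 1.57), (2.63, 1.77), (2.36, 1.49),
    (2.50, 1.57), (2.63, 1.63), (2.55, 1.56),
]

results = [satisfies_rule(z, h) for z, h in ADDITIONAL_COMPOUNDS]
print(all(results))  # True
```

All seven pairs fall below both thresholds; the third compound is the closest to the boundary, with H = 1.77 bit.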
The analysis has shown that the molecular signs Z and H are interconnected. For the effective radioprotectors, the interrelation can be described by the following linear regression (Fig. 2):
H = A + B·Z, R = 0.87 > R_{cr} = 0.22, N_{1} = 57, S_{1} = 0.122. (30)
The absolute term A and the regression coefficient B are equal to:
A = – 0.332 ± 0.338, S_{A} = 0.169, B = 0.772 ± 0.124, S_{B} = 0.062,
RMSE = 0.109, F = 153.3 >> F_{cr} = 7.12, t = 9.5 > t_{cr}. (31)
Here the statistic S_{1} estimates the variance about the regression line; S_{A} and S_{B} are the standard errors of the regression parameters; R is the sample correlation coefficient. The number of connections is m = 1; the number of degrees of freedom is f = N_{1} - m - 1 [8]. The confidence limits for the free term A and the regression coefficient B at the significance level α = 0.05 were determined according to the formula: A ± t_{0.05}(f)·S_{A}, B ± t_{0.05}(f)·S_{B}.
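The reported regression can be applied as follows. The sketch assumes the linear form H = A + B·Z for equation (30), whose left-hand side is not legible in the text, and uses the tabulated two-sided 5% Student value for f = 55 degrees of freedom, about 2.004, to reproduce the confidence half-widths. The function name is hypothetical.

```python
A, S_A = -0.332, 0.169  # absolute term and its standard error
B, S_B = 0.772, 0.062   # regression coefficient and its standard error
T_CRITICAL = 2.004      # two-sided 5% Student value for f = 57 - 1 - 1 = 55


def predict_h(z):
    """Information function predicted from the electronic sign Z (assumed form)."""
    return A + B * z


# Compound 48 of Table 1 has Z = 3.125 and a tabulated H of 2.078 bit.
predicted = predict_h(3.125)
half_width_a = T_CRITICAL * S_A
half_width_b = T_CRITICAL * S_B

print(round(predicted, 2))                            # 2.08
print(round(half_width_a, 2), round(half_width_b, 2))  # 0.34 0.12
```

The prediction differs from the tabulated value by far less than S_{1} = 0.122, and the half-widths agree with the limits ± 0.338 and ± 0.124 quoted in (31).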
For chemical agents which do not possess effective radiation protective action, this interrelation is nonlinear (Figure 2) and can be approximated by the following analytical form:
, , , ,
N_{2} = 43, R = 0.89, RMSE = 0.074,
(32)
We can obtain additional information about the nonlinear dependence H(Z) (Figure 2) by constructing a variation series of grouped chemical compounds. Six to eight groups are usually used for a sample size N ≈ 40-60. First, the variation series is ranked (for example, in ascending order of Z). Then the data are grouped by the factor Z. It is convenient to group at regular intervals. We chose the number of groups n = 6. The width of the interval can be determined using the following equation: h = (Z_{max} - Z_{min})/n. For each group, we calculate the mean values Z_{i}^{(av)} and H_{i}^{(av)}, where i = 1, 2,..., 6. After that, the ratios of the average values are compared. We give some of the relationships:
,
,
,
(33)
This parameter should be close to a constant value for a linear approximation. The frequencies of the sample units in the groups (3_{(1)}, 9_{(2)}, 10_{(3)}, 13_{(4)}, 5_{(5)}, 3_{(6)}) are close to a normal distribution: W = 0.902 > W_{cr} = 0.788.
The separation into groups allows us to calculate the empirical correlation ratio η_{emp} = \sqrt{S_{b}^{2}/S_{t}^{2}} = 0.84; here S_{b}^{2} is the variance between groups and S_{t}^{2} is the total variance of the original sample of 43 elements. Obviously, the empirical ratio varies from zero to unity. The ratio allows us to quantify the impact of Z on the variation of the resultant variable H.
Next, we can calculate the theoretical correlation ratio: η_{theor} = \sqrt{S_{H(Z)}^{2}/S^{2}}. Here S_{H(Z)}^{2} = 0.02 is the variance of the balanced (fitted) values of the information function. The variance of the empirical (actual) values of the resultant variable is S^{2} = 0.025. The theoretical correlation ratio is η_{theor} = 0.89 (the coefficient of determination is η_{theor}^{2} = 0.79). It is known that a nonlinear relationship between signs is strong (on the Chaddock scale) if the following inequality holds: 0.7 < η < 0.9.
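Both correlation ratios can be sketched in a few lines. The definitions assumed here are the usual ones: the empirical ratio is the square root of the between-group variance divided by the total variance, and the theoretical ratio is the square root of the variance of the fitted values divided by the variance of the actual values. The function names and the small demonstration groups are hypothetical; only the variances 0.02 and 0.025 come from the text.

```python
import math


def empirical_correlation_ratio(groups):
    """Square root of the between-group share of the total variance."""
    values = [v for group in groups for v in group]
    n = len(values)
    grand_mean = sum(values) / n
    total_var = sum((v - grand_mean) ** 2 for v in values) / n
    between_var = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2 for group in groups
    ) / n
    return math.sqrt(between_var / total_var)


def theoretical_correlation_ratio(fitted_variance, actual_variance):
    """Square root of the ratio of fitted to actual variance."""
    return math.sqrt(fitted_variance / actual_variance)


# Variances quoted in the text for the ineffective compounds.
eta_theoretical = theoretical_correlation_ratio(0.02, 0.025)
print(round(eta_theoretical, 2))  # 0.89

# Hypothetical grouped data: all variation lies between the two groups.
print(empirical_correlation_ratio([[1.0, 1.0], [3.0, 3.0]]))  # 1.0
```

When the group means coincide, the empirical ratio is zero; when all variation lies between the groups, it reaches unity, which is the range stated in the text.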
Figure 3. (1) Scattering pattern of the electronic and information signs for the radioprotector series CH_{3}(CH_{2})_{m}NHCH_{2}CH_{2}SSO_{3}H (m = 0, 1,…, 17) [4]; regression equation with RMSE = 0.0006. (2) Series of chemical compounds CH_{3}(CH_{2})_{m}NH(CH_{2})_{n}SPO_{3}H_{2} (m = 2, 3, 4; n = 2, 3) [13]; regression equation with RMSE = 0.0009. (3) Series of chemical compounds NH_{2}(CH_{2})_{m}SH (m = 2, 3, 4, 5) [5]; regression equation with RMSE = 0.0011.
The analysis has shown that the information function is related to the value of π. The value π = 0.52 [14] defines the additional contribution of a CH_{2} group to the hydrophobicity of a molecule. Figure 4 shows this relationship for the radioprotectors CH_{3}(CH_{2})_{m}NHCH_{2}CH_{2}SSO_{3}H (m = 0, 1,…, 17), CH_{3}(CH_{2})_{m}NH(CH_{2})_{n}SPO_{3}H_{2} (m = 2, 3, 4; n = 2, 3), and NH_{2}(CH_{2})_{m}SH (m = 2, 3, 4, 5).
The positive interrelation between the signs Z and H is not random. The information function determines the diversity of the molecular structure, which in turn is determined by the number of different atoms forming a bound complex of atoms, i.e., a molecule. At the same time, the structure of the molecule is not an arbitrary set of various atoms but is determined by the valence electrons in the outer electron shell. Apparently, this quantum-chemical property establishes the interrelation of the two signs Z and H for molecular structures. The RMSE is so low that the regression equations (Figures 3 and 4) approach a functional dependence.
Some distinctions between effective and ineffective radioprotectors can be obtained by analyzing the frequency with which atoms appear in the molecule. Figure 5 shows the frequency of occurrence of the atoms (C, H, N, O, S, P) in the molecule.
Using the data of Table 1, we can approximately indicate the frequency of occurrence of atoms in a molecule of a hypothetical effective agent (for a homogeneous sample): P ~ 1, S ~ 1, N ~ 2, O ~ 3, C ~ 5-6, H ~ 17 (Figure 5)^{1}. At the same time, the most probable distribution of atoms in an ineffective agent (hypothetical molecule) is as follows: P ~ 1, N ~ 1, O ~ 1, S ~ 2, C ~ 4, H ~ 8-10.
3. Conclusion
The proposed classification rules make it possible to identify similarities between molecular structures. These rules can be practically useful in a preliminary forecast of the bioactivity of new chemical compounds. It should be noted that the calculation of the signs Z and H requires only knowledge of the chemical structural formula. This greatly simplifies the preliminary search for new bioactive chemicals. The classification rules indicate whether an effective biological action can be expected from a chemical compound. The ability to separate biologically active compounds from inactive ones on the basis of the sign Z is apparently due to the existence of a real molecular electrostatic potential. The magnitude of this potential varies from molecule to molecule. Moreover, for effective chemical compounds there is a threshold of the electrostatic potential, which lies below a certain value (in absolute value). The method described in this article has yielded positive results in studies of the antifungal activity and toxicity of chemical compounds [15]. This method was also used in the analysis of the activity of carcinogenic chemicals [16].
However, it should be noted that these rules are not sensitive to isoelectronic molecular systems or to isomers. This approach gives the most reliable results when analyzing homologous series of chemical compounds. Homologous series are generally characterized by signs that satisfy the compatibility condition.
Abbreviations
I. P. - intraperitoneal, A. R. P. - antiradiation protection, RE - radioprotective efficiency, RMSE - root mean square error.
References
Footnotes
^{1} This sequence of numbers is close to the Fibonacci series: 1, 1, 2, 3, 5, 8, 13.