Data source:

In the present study, a total of nine high-throughput screening (HTS) assays data were downloaded from National Centre for Advancing Translational Sciences (NCATS) website and used for the machine learning (ML)-based models training and evaluation (Table 1). Initially, nine tab-separated files were downloaded from the NCATS website. Of the nine files, each contained information such as “Sample ID”, “Sample Name”, “PubChem SID”, “Primary MOA”, “Assay Name”, “CAS ID”, “AC50 value”, “Efficacy”, “Maximum Response”, “Drug Name”, “SMILES”, etc. Of the given columns, two columns, namely “SMILES” and “MAX_RESPONSE”, were extracted from each tab-separated file for further processing and machine learning-based models training and validation. According to the description given on the NCATS data browser (https://opendata.ncats.nih.gov/covid19/databrowser), a molecule with a MAX_RESPONSE value approaching -100 cab be considered as highly active while the one with MAX_RESPONSE value approaching zero or towards positive values can be considered as least active or inactive.

Table 1. List of assays used in the present study with the data source.

Assay ID

Assay name

Target category

Download link

1

Spike-ACE2 protein-protein interaction

(AlphaLISA)

Viral entry

https://opendata.ncats.nih.gov/covid19/assay?aid=1

2

Spike-ACE2 protein-protein interaction

(TruHit Counterscreen)

Counterscreen

https://opendata.ncats.nih.gov/covid19/assay?aid=2

6

ACE2 enzymatic activity

Viral entry

https://opendata.ncats.nih.gov/covid19/assay?aid=6

8

TMPRSS2 enzymatic activity

Viral entry

https://opendata.ncats.nih.gov/covid19/assay?aid=8

9

3CL enzymatic activity

Viral replication

https://opendata.ncats.nih.gov/covid19/assay?aid=9

14

SARS-CoV-2 cytopathic effect (CPE)

Live virus infectivity

https://opendata.ncats.nih.gov/covid19/assay?aid=14

15

SARS-CoV-2 cytopathic effect

(host tox counterscreen)

Counterscreen

https://opendata.ncats.nih.gov/covid19/assay?aid=15

20

HEK293 cell line toxicity

Counterscreen

https://opendata.ncats.nih.gov/covid19/assay?aid=20

21

Human fibroblast toxicity

Counterscreen

https://opendata.ncats.nih.gov/covid19/assay?aid=21


Brief description of HTS assays:

The nine HTS assays (Table 1) used to test compounds’ bio-activities by NCATS can be broadly categorized into four different types viz. assay used to determine the molecules that (i) prevent viral entry into host cells (ii) prevent viral replication into host cells (iii) reverse the cytopathic effect of host cells (caused by SARS-CoV-2 virion) and (iv) show toxic effects against normal human/host cells.

AlphaLISA assay (Table 1, Assay ID-1) contains the therapeutic molecules that can potentially disrupt the interaction between SARS-CoV-2 Spike protein and the human host ACE2 receptor. The interaction between SARS-CoV-2 Spike protein and host ACE2 receptor plays a vital role during the viral entry into the human host. Thus, this assay aims to identify the molecules with the potential to prevent SARS-CoV-2 entry into the human host. The second assay, i.e., TruHit counterscreen (Table 1, Assay ID-2), helps identify false-positive compounds that interfere with the AlphaLISA readout in a non-specific manner.

ACE2 enzymatic activity assay (Table 1, Assay ID-6) measures the ACE2 inhibitory potential of compounds to prevent the disruption of endogenous enzyme function. The TMPRSS2 enzymatic activity assay (Table 1, Assay ID-8) is specific to measure the TMPRSS2 inhibitory potential of compounds. Recent studies have highlighted the importance of human TMPRSS2 as an antiviral drug target and made attempts to screen the molecules with inhibitory potential against TMPRSS2.

The 3CL enzymatic activity (Table 1, Assay ID-9) measures the inhibitory potential of molecules against viral 3-chymotrypsin like protease (3CLpro), the main protease of SARS-CoV-2. The enzyme is pivotal for the viral replication within human host cells. Thus, it is an important anti-SARS-CoV-2 drug target in COVID-19 drug discovery.

CPE assay (Table 1, Assay ID-14) determines the ability of a compound to reverse the cytopathic effect caused by SARS-CoV-2 in Vero E6 host cells. This assay can be used for the high-throughput screening of compounds with antiviral activity. In contrast, there are chances that a compound itself may show cytotoxicity; therefore, a counterscreen (Table 1, Assay ID-15) to measure the toxicity of compounds against host (Vero E6) cells is used for the detection of such compounds.

The general toxicity of compounds against HEK293 and Hh-Wt cell lines is measured with the help of HEK293 cell line toxicity (Table 1, Assay ID-20) and Human fibroblast toxicity assay (Table 1, Assay ID-21), respectively. More details about the assays used in the present study can be retrieved from the NCATS website (Table 1).

Descriptors and fingerprints (FPs) calculation:

The SMILES structures of the molecules are used for the calculation of a total of 1444 1D, 2D descriptors and 12 different types of fingerprints (FPs) calculation using open-source software PaDEL (v2.21).

Data pre-processing:

The pre-processing of data is an important task for any machine learning study. Therefore, a systematic approach is applied to process the molecular structure data while molecular descriptors and fingerprints calculation (Figure 3). The parameters opted on PaDEL software (before actually starting the descriptors / FPs calculation) are “Remove salt”, “Detect aromaticity”, “Standardize nitro groups”, “Max. threads -1”, “Max. waiting jobs -1”, “Max. Running time per molecule: 12,00,000 milliseconds”, and “Retain molecules order”. Furthermore, only those molecules for which all the descriptor/fingerprint values are calculated have been used for ML-based models training, validation and further analysis.

Figure 3: A systematic computational approach used for building ML-based QSAR prediction models and their usage by users. (A). Flow diagram depicting the overall strategy used to train, evaluate and build the ML-based QSAR prediction models. (B). The best prediction models can be used by users to predict the anti-SARS-CoV-2 activity and human cell toxicity of compounds.

Preparation of datasets for models training and validation:

Data pre-processing and filtering are followed by redundancy removal to retrieve the dataset of unique molecules. Therefore, the molecules possessing identical descriptor or FPs values and maximum response values are included only once. The unique dataset of molecules was further split into a training dataset (80% molecules) and a external validation dataset (20% molecules). The training datasets are used for training and internal validation (through five-fold cross-validation technique) of the ML-based models, while external validation datasets are kept separate for the final or external validation of the developed models. The number of compounds used for the trainning (through five-fold cross-validation technique) and evaluation of best ML-based QSAR models are given in Table 2.

Table 2. The number of molecules used for the training and evaluation of best prediction models.

Assay ID

Number of unique molecules

Number of training dataset molecules

Number of external validation dataset molecules

1

3370

2696

674

2

3235

2588

647

6

3376

2701

675

8

5144

4115

1029

9

11007

8806

2201

14

9909

7927

1982

15

9080

7264

1816

20

9694

7755

1939

21

4491

3593

898


Descriptors or feature selection:

In the past, it has been observed that all the features or descriptors are not equally important to predict the biological activity of the molecules. Therefore, a feature selection technique in WEKA v3.8.2 is applied to determine the most relevant descriptors and fingerprints associated with the biological activity of the molecules. Thus, “CfsSubsetEval” (with default parameter values) as “Attribute Evaluator” with “BestFirst” as “Search Method” (with default parameter values) is used as feature selection techniques for the present study. This technique has been extensively used for features selection in various chemical QSAR and classification-related studies.

Tools used for model building:

An open-source data mining and ML tool, WEKA (v3.8.2), has been used in the present study to train and validate the prediction models.

Cross-validation technique used:

The prioritization of best models (with the highest efficacy) is important in any ML-based study. In the present study, thousands of models (the results for best models from each assay are given in Table 3) were trained with all regression techniques available in WEKA, however, selection of the best models is made through the five-fold cross-validation technique. This technique has been widely used in previous studies to prioritize the best prediction/classification models. In the five-fold cross-validation technique, the whole training dataset is divided into five almost equal-sized parts and each part is used in the training and testing of the prediction models.

Formulae used to evaluate the models' performance:

The in-built functions available with WEKA (v3.8.2), such as Pearson Correlation Coefficient (R), mean absolute error (MAE) and root mean squared error (RMSE), have been used to evaluate the models' performance through five-fold cross-validation technique. In both internal and external validation, the models with the highest R-value and lowest MAE and RMSE values are selected as the best prediction models. The latter are hosted on the ASCovPred webserver for public use. The equations for the used formulae are as given below:





For ith compound, Yi and Xi represent predicted and actual maximum response value, respectively. N is total number of compounds. The value of R is used to measure the quality of model. The value of R varies from −1 to +1. The negative value of R shows the negative correlation with a particular property or feature. Thus, higher the value of R, better will be the quality of model in terms of the predicted maximum response value of the compounds.

Table 3. Performance of best ML-based QSAR models on training and external validation dataset of compounds.

Performance evaluation of best models with training dataset compounds 

(five-fold cross-validation)

Performance evaluation of best models with external validation dataset compounds

Assay ID

Deployed on ASCoVPred
web server and standalone

Descriptors or fingerprints type

Number of input features

R

R2

MAE

RMSE

WEKA technique used for training the model with training dataset of compounds

R

R2

MAE

RMSE

1
[Help]

Yes

SubFPC*

307

0.67 

0.44 

14.96

21.22

weka.classifiers.meta.AdditiveRegression (RandomForest)

0.66 

0.42 

17.41

24.95

2
[Help]

Yes

1D & 2D

50

0.72

0.51

13.51

18.83

weka.classifiers.meta.AdditiveRegression (RandomForest)

0.74

0.55

14.44

20.50

6
[Help]

No

ExtendedFingerprinter

28

0.32

0.10

14.31

37.09

weka.classifiers.trees.RandomForest

0.57 

0.32 

16.68 

44.88 

8
[Help]

Yes

SubFPC*

307

0.52

0.24

11.12

36.47

weka.classifiers.trees.RandomForest

0.73

0.36

14.99

44.64

9
[Help]

No

SubFPC*

307

0.40

0.16

5.90

11.31

weka.classifiers.meta.RandomCommittee (RandomForest)

0.45 

0.19 

5.11 

9.14 

14
[Help]

No

1D & 2D 

1444 

0.50 

0.25 

7.43 

14.59 

weka.classifiers.meta.RandomSubSpace (RandomForest)

0.43 

0.18 

8.23 

14.90 

15
[Help]

Yes

1D & 2D  

1444 

0.66 

0.43 

13.12 

21.13 

weka.classifiers.meta.AdditiveRegression (RandomForest)

0.65 

0.42 

14.11 

21.93 

20
[Help]

Yes

SubFPC*

307

0.66

0.44

26.68

34.16

weka.classifiers.meta.RandomCommittee (RandomForest)

0.68

0.46

26.41

33.87

21
[Help]

No

1D & 2D 

45 

0.43 

0.18 

12.61 

19.69 

weka.classifiers.meta.RandomCommittee (RandomForest)

0.51 

0.26 

11.99 

18.63 

*SubstructureFingerprintCount

Identification of descriptors of FPs associated with the maximum response:

To understand the relative contribution of different descriptors and fingerprints (FPs) in determining the molecules’ maximum response values, assay-wise, R values were calculated between these. The molecules with top 10 positively and negatively correlated descriptors or FPs values (Supplementary_files 10-19, Table 1) were shortlisted from the training dataset molecules along with their maximum response values and sorted in the ascending order of maximum response values. Further, top 100 records (highly active compounds) (Supplementary_files 10-19, Table 2) and bottom 100 records (least active compounds) (Supplementary_files 10-19, Table 3) were selected for both positively and negatively correlated descriptors and FPs. Finally, the average values were calculated for each descriptor and FP (Supplementary_files 10-19, Table 4). Grouped bar charts were plotted to compare the average values of descriptors / FPs amongst highly active and least active compounds (Supplementary_files 10-19, Figure 1). Thus, the grouped bar charts depict the differences of descriptors / FPs average values amongst highly active and least active compounds.

Note: Please refer to Table 4 (given below) to see all the Supplementary files.

Table 4. Supplementary excel files containing results for performance of all ML-based QSAR models and data analysis results.

Assay ID

Deployed on ASCoVPred web server and standalone

Assay Name

Target Category

ML-based prediction results

Data analysis results

1

Yes

Spike-ACE2 protein-protein interaction (AlphaLISA)

Viral Entry

Supplementary_file1.xlsx 

Supplementary_file10.xlsx  Supplementary_file11.xlsx

2

Yes

Spike-ACE2 protein-protein interaction (TruHit Counterscreen)

Counterscreen

Supplementary_file2.xlsx 

Supplementary_file12.xlsx   Supplementary_file13.xlsx 

6

No

ACE2 enzymatic activity

Viral Entry

Supplementary_file3.xlsx 

N/A 

8

Yes

TMPRSS2 enzymatic activity

Viral Entry

Supplementary_file4.xlsx 

Supplementary_file14.xlsx   Supplementary_file15.xlsx 

9

No

3CL enzymatic activity

Viral replication

Supplementary_file5.xlsx 

N/A 

14

No

SARS-CoV-2 cytopathic effect (CPE)

Live virus infectivity

Supplementary_file6.xlsx 

N/A 

15

Yes

SARS-CoV-2 cytopathic effect

(host tox counterscreen)

Counterscreen

Supplementary_file7.xlsx 

Supplementary_file16.xlsx   Supplementary_file17.xlsx

20

Yes

HEK293 cell line toxicity

Counterscreen

Supplementary_file8.xlsx 

Supplementary_file18.xlsx   Supplementary_file19.xlsx 

21

No

Human fibroblast toxicity

Counterscreen

Supplementary_file9.xlsx 

N/A