Collect as many chemicals as possible with known values of the activity, toxicity, or property of interest.
Always collect the data from reliable sources such as published articles and well-known online databases.
The collection should ideally cover the full range of chemicals: those with high, moderate, and low or no activity (in terms of pIC50, the range, i.e., maxActivity - minActivity, should ideally be >3 log units).
Remove structural and response outliers, if any.
Make sure there are no duplicates or activity cliffs in the modeling set.
Try to calculate a wide range of descriptors. You may always start with simple, meaningful descriptors (constitutional, electrotopological state atom, etc.); if you are not satisfied with the resulting model, calculate all kinds of descriptors.
Try different data set division methods: random as well as rational methods (especially for small to moderate data sets).
For small data sets (<30-50 data points), you may skip the data set division step. Use the whole data set as the training set and validate the models with cross-validation techniques (leave-one-out, leave-many-out, etc.).
You can employ the double cross-validation technique to find more diverse QSAR models, especially for small data sets.
Perform best subset selection using the top 20-30 descriptors identified by genetic algorithms, stepwise MLR, etc., to identify the best possible models from those descriptors.
If you are not getting a good regression model, try developing a classification-based model.
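Some of the checks in the tips above (activity range, duplicates, leave-one-out cross-validation) can be scripted before modeling. The following is only a minimal sketch under stated assumptions, not the method used by any particular tool: the function names and the example values are hypothetical, the model is a simple one-descriptor linear regression rather than a full MLR, and Q2 is computed as 1 - PRESS/SS using the mean response of the whole training set.

```python
from statistics import mean


def activity_range(pic50):
    """Return maxActivity - minActivity for a list of pIC50 values."""
    return max(pic50) - min(pic50)


def find_duplicates(compound_ids):
    """Return identifiers that occur more than once, in order of first repeat."""
    seen, duplicates = set(), []
    for cid in compound_ids:
        if cid in seen and cid not in duplicates:
            duplicates.append(cid)
        seen.add(cid)
    return duplicates


def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept.

    Assumes the descriptor values in x are not all identical.
    """
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def loo_q2(x, y):
    """Leave-one-out Q2: refit without each compound, then predict it."""
    press = 0.0
    for i in range(len(x)):
        x_train = x[:i] + x[i + 1:]
        y_train = y[:i] + y[i + 1:]
        slope, intercept = fit_line(x_train, y_train)
        press += (y[i] - (slope * x[i] + intercept)) ** 2
    y_mean = mean(y)
    ss_total = sum((yi - y_mean) ** 2 for yi in y)
    return 1.0 - press / ss_total


if __name__ == "__main__":
    # Hypothetical data: one descriptor and the pIC50 of six compounds.
    ids = ["c1", "c2", "c3", "c4", "c5", "c6"]
    descriptor = [1.2, 2.0, 2.9, 3.5, 4.1, 5.0]
    pic50 = [4.1, 5.0, 5.6, 6.3, 6.8, 7.9]

    print("Activity range:", activity_range(pic50))
    print("Duplicates:", find_duplicates(ids))
    print("Q2 (LOO):", loo_q2(descriptor, pic50))
```

For a real data set, the same loop structure applies with a multi-descriptor model in place of fit_line.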
Along with the missing value error, the tool also shows the exact location of the missing value in the input file, i.e., the row number and column number. If there is indeed a missing value at that location, rectify the issue by removing the column (descriptor) or by imputing the missing value.
However, if there is no missing value and/or the reported column number is not present in the input file, the issue is most probably caused by a hidden end-of-line character in the input file. Fortunately, this can be easily solved by freshly preparing the input file:
HOW TO Freshly prepare the input file
Select and copy the entire data (only the data) from your current input file (.csv or .xlsx), paste it into a new .csv or .xlsx file, and save the new file. Now use the newly saved file as your input file. You should not face the issue anymore.
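For .csv input, the same check and clean-up can also be done with a short script. This is only a sketch under assumptions, not part of the tool: the function names are hypothetical, the file is assumed to be comma-delimited with no quoted multi-line fields, and "hidden end-of-line characters" is taken to mean stray carriage returns, blank lines, or lines containing only delimiters.

```python
import csv
import io


def find_missing_cells(text):
    """Return 1-based (row, column) positions of empty cells in CSV text."""
    missing = []
    reader = csv.reader(io.StringIO(text, newline=""))
    for r, row in enumerate(reader, start=1):
        for c, cell in enumerate(row, start=1):
            if cell.strip() == "":
                missing.append((r, c))
    return missing


def clean_csv_text(text):
    """Normalise line endings and drop blank or delimiter-only lines."""
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    kept = [ln.rstrip() for ln in lines if ln.strip(", \t") != ""]
    return "\n".join(kept) + "\n"


if __name__ == "__main__":
    # Hypothetical file content with a blank line, a delimiter-only line,
    # and one genuinely missing descriptor value.
    raw = "ID,D1,D2\r\nc1,1.0,2.0\r\n\r\nc2,,3.0\r\n,,\r\n"
    cleaned = clean_csv_text(raw)
    print("Missing cells after clean-up:", find_missing_cells(cleaned))
```

Any position still reported after the clean-up is a genuinely missing value, to be handled by removing the descriptor or by imputation as described above.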