Our research introduces a farthest point sampling (FPS) strategy, applied within targeted chemical feature spaces, to generate well-distributed training datasets. This approach improves model performance by increasing the diversity of the training data in the chemical feature space. We evaluated the strategy across a range of machine learning (ML) models, including artificial neural networks (ANN), support vector machines (SVM), and random forests (RF), using datasets that cover key physicochemical properties. Our findings show that models trained on FPS-selected data markedly outperform those trained on randomly sampled (RS) data in predictive accuracy and robustness, and that they overfit notably less, especially when the training dataset is small.
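As an illustration of the general idea, the following is a minimal sketch of greedy farthest point sampling over a feature matrix. It is not the authors' implementation: the Euclidean distance metric, the choice of starting point, and the names `farthest_point_sampling`, `features`, `n_samples`, and `start_index` are all assumptions made for this example.

```python
import numpy as np


def farthest_point_sampling(features, n_samples, start_index=0):
    """Greedy FPS sketch: repeatedly add the point farthest from the selected set.

    Assumptions (not from the source): Euclidean distance in feature space
    and a caller-chosen starting point.
    """
    features = np.asarray(features, dtype=float)
    n_points = features.shape[0]
    if not 1 <= n_samples <= n_points:
        raise ValueError("n_samples must be between 1 and the number of points")

    selected = [start_index]
    # Distance from every point to its nearest already-selected point.
    min_dist = np.linalg.norm(features - features[start_index], axis=1)

    for _ in range(n_samples - 1):
        # The next sample is the point with the largest nearest-selected distance.
        next_index = int(np.argmax(min_dist))
        selected.append(next_index)
        # Update nearest-selected distances with the newly added point.
        new_dist = np.linalg.norm(features - features[next_index], axis=1)
        min_dist = np.minimum(min_dist, new_dist)

    return selected


if __name__ == "__main__":
    # Toy 1-D "feature space" with a cluster near 0 and one distant point.
    toy = [[0.0], [1.0], [2.0], [10.0]]
    print(farthest_point_sampling(toy, 3))
```

On the toy data, the distant point is picked immediately after the starting point, which is the behavior that spreads the training set across the feature space rather than concentrating it in dense regions. If the data contain exact duplicates, this sketch can return a repeated index once every distinct point has been selected.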
Graphical illustration of farthest point sampling in chemical space
Comparison of mean squared error (MSE) between FPS and random sampling (RS)
MSE compared by sampling in different chemical spaces
Heatmap of MSE for different machine learning models
MSE for different physicochemical datasets
t-SNE distributions for FPS and RS