Empirical Analysis of Dataset Size Impact on Classification Performance in Precision Agriculture Using Machine Learning Models

Khadija Lechqar, Mohammed Errais

Abstract


This study empirically investigates the relationship between dataset size and classification performance in precision agriculture applications. Seven machine learning models (Decision Tree, Random Forest, Logistic Regression, SVM, Gaussian Naïve Bayes, KNN, and AdaBoost) were evaluated on seven agricultural datasets ranging from 100 to 4,000 samples. Performance was assessed using five metrics: accuracy, precision, recall, F1-score, and ROC-AUC. The methodology involved two phases: initial evaluation using complete datasets, followed by systematic analysis of subdivided datasets to examine performance variation with data volume. Statistical analysis using Pearson correlation coefficients revealed no significant correlation between dataset size and model performance (r = 0.12, p > 0.05). Results indicate that Random Forest and Decision Tree models achieved the highest average performance across datasets (88.48% and 85.37% accuracy, respectively). The findings suggest that dataset quality and problem characteristics have greater influence on classification performance than dataset size alone in precision agriculture applications.
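
The correlation test mentioned in the abstract can be illustrated with a minimal sketch. It computes a Pearson coefficient between dataset size and accuracy, then the t statistic used to judge significance. The function names `pearson_r` and `t_statistic` are hypothetical, and the size and accuracy values are placeholders chosen for illustration. They are not the study's data, and this is not necessarily how the authors ran the analysis.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def t_statistic(r, n):
    """t statistic for H0 'no linear correlation' (n - 2 degrees of freedom).

    Undefined for |r| == 1, where the denominator becomes zero.
    """
    return r * math.sqrt((n - 2) / (1 - r ** 2))


# Placeholder values for illustration only; NOT the paper's measurements.
sizes = [100, 500, 1000, 1500, 2000, 3000, 4000]
accuracies = [0.86, 0.91, 0.79, 0.88, 0.83, 0.90, 0.85]

# Two-sided critical t value at alpha = 0.05 with 5 degrees of freedom.
T_CRIT_DF5 = 2.571

r = pearson_r(sizes, accuracies)
t = t_statistic(r, len(sizes))
significant = abs(t) > T_CRIT_DF5
print(f"r = {r:.3f}, t = {t:.3f}, significant at 0.05: {significant}")
```

With only seven size/performance pairs the test has little power, so a non-significant result of this kind shows that no linear trend was detected rather than that size has no effect.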



DOI: https://doi.org/10.31449/inf.v49i23.8137

This work is licensed under a Creative Commons Attribution 3.0 License.