A data science approach to integrating hybrid type, environmental, weather, and soil data for predicting maize yield

Format

Thesis

Abstract

[EMBARGOED UNTIL 05/01/2026] Accurate yield prediction is essential for guiding agronomic practices, supporting crop breeding, and informing policy decisions. This thesis presents a data-driven modeling framework for predicting maize yield by integrating hybrid type, environmental, soil, and weather datasets from the Genomes to Fields (G2F) initiative (2014-2023). The dataset comprises over 5,000 maize hybrids evaluated across more than 270 environments. After the datasets were cleaned and harmonized and features were engineered, several machine learning models were tested with different categorical encoding techniques. CatBoost combined with pandas Categorical encoding performed best, achieving a test RMSE of 2.591 Mg/ha and a Pearson correlation of 0.687. Hyperparameter tuning through Grid Search, Random Search, Bayesian Optimization, and Hyperopt further improved performance stability. SHAP and LIME were used to interpret the model and identified the variables with the greatest influence on predicted maize yield, including hybrid type, soil texture, planting date, and weather conditions during flowering and grain filling. The model's effectiveness was demonstrated in two use cases: (1) recommending top hybrids for farmers in specific environments, including under simulated wet and dry conditions, and (2) assessing hybrid performance across environments to support breeders and researchers. This study presents a robust, modular, and interpretable approach to predicting crop yields that can be extended to other crops and tailored to various stakeholders in agriculture. The work lays the groundwork for data-informed agricultural planning, hybrid evaluation, and climate-resilient crop management strategies.
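
For readers unfamiliar with the two evaluation metrics quoted in the abstract, the following is a minimal sketch of how a test RMSE (in Mg/ha) and a Pearson correlation can be computed from observed and predicted yields. It is not the thesis code: the function names `rmse` and `pearson_r` and the four toy yield values are hypothetical and chosen only for illustration.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error, in the same units as the inputs (e.g. Mg/ha)."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical toy yields in Mg/ha, NOT values from the G2F dataset.
observed_yield = [8.0, 10.0, 12.0, 9.0]
predicted_yield = [9.0, 9.0, 11.0, 10.0]

print("RMSE (Mg/ha):", rmse(observed_yield, predicted_yield))
print("Pearson r:", pearson_r(observed_yield, predicted_yield))
```

RMSE penalizes the size of prediction errors in yield units, while the Pearson correlation measures how well predictions track the ranking and spread of observed yields, so the two figures in the abstract are complementary.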

Degree

M.S.
