Date of Award
2025
Document Type
Thesis
Degree Name
Master of Science in Software Engineering (MSSE)
Department
Electrical and Computer Engineering
Committee Chair
David Coe
Committee Member
Aleksandar Milenkovic
Committee Member
Rahul Bhadani
Research Advisor
David Coe
Subject(s)
Code generators--Evaluation, Artificial intelligence, Analysis of variance
Abstract
As generative Artificial Intelligence (AI) tools become increasingly common in software development, there is a growing need to understand how well these tools perform beyond simply producing code that runs. This thesis examines the performance of four popular generative AI models, ChatGPT (GPT-4 mini), GitHub Copilot, Code LLaMA 3.3, and DeepSeek Web, in generating code that is not only functionally correct but also efficient and maintainable. To do this, we tested each model on six real-world-style coding problems sourced from LeetCode, covering a range of algorithmic challenges such as dynamic programming, graph traversal, and array manipulation. Using a consistent prompting strategy, we collected Python code samples from each model and evaluated them with established software engineering metrics: Lines of Code, Cyclomatic Complexity, Halstead Complexity, and the Maintainability Index. We then applied a detailed statistical analysis, including ANOVA, post hoc testing, and nonparametric methods, to determine which models performed best most consistently. Our results show that the problem type has the largest effect on the complexity and length of the generated code, whereas maintainability depends substantially on the AI model itself. LLaMA produced the most maintainable code across the board, while GitHub Copilot often generated more complex, harder-to-maintain solutions. ChatGPT and DeepSeek showed similar and generally solid performance, falling in the middle of the ranking. This research goes beyond simple pass/fail benchmarks and provides a clearer, more nuanced picture of how generative AI tools behave on practical programming tasks. Developers, educators, and tool makers can use these findings to choose the right AI assistant for their needs and to better understand where these models excel and where they still fall short.
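The abstract names its metrics and statistical tests only at a high level. The following is a minimal illustrative sketch of how such an evaluation could be scripted in Python, assuming the radon package (for Lines of Code, Cyclomatic Complexity, Halstead Complexity, and the Maintainability Index) and scipy (for a one-way ANOVA); the package choices, the score() helper, and the sample numbers are hypothetical and are not taken from the thesis.

# Hypothetical sketch: score one generated solution with the metrics named in the
# abstract, then compare models with a one-way ANOVA. Assumes the radon and scipy
# packages; the thesis does not state which tooling it actually used.
from radon.raw import analyze
from radon.complexity import cc_visit
from radon.metrics import h_visit, mi_visit
from scipy import stats

def score(source: str) -> dict:
    """Compute LOC, Cyclomatic Complexity, Halstead volume, and Maintainability Index."""
    raw = analyze(source)                                      # raw line counts
    cc_blocks = cc_visit(source)                               # per-function cyclomatic complexity
    cc = max((b.complexity for b in cc_blocks), default=1)
    halstead_volume = h_visit(source).total.volume             # module-level Halstead volume
    mi = mi_visit(source, multi=True)                          # Maintainability Index (0-100)
    return {"loc": raw.loc, "cc": cc, "halstead_volume": halstead_volume, "mi": mi}

# One-way ANOVA on Maintainability Index scores grouped by model
# (the numbers below are illustrative placeholders, not thesis results).
mi_by_model = {
    "chatgpt":  [68.2, 71.5, 64.9, 70.1, 66.3, 69.8],
    "copilot":  [55.4, 58.7, 52.1, 60.3, 57.9, 54.6],
    "llama":    [74.8, 77.2, 72.5, 76.0, 73.9, 75.4],
    "deepseek": [67.1, 69.4, 65.8, 70.6, 68.2, 66.9],
}
f_stat, p_value = stats.f_oneway(*mi_by_model.values())
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")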
Recommended Citation
Gontiya, Rajavi, "Analyzing code generation by AI models: an ANOVA-based study of quality, consistency, and composite ranking" (2025). Theses. 770.
https://louis.uah.edu/uah-theses/770