Date of Award
2025
Document Type
Thesis
Degree Name
Master of Science in Software Engineering (MSSE)
Department
Electrical and Computer Engineering
Committee Chair
David Coe
Committee Member
Aleksandar Milenkovic
Committee Member
Rahul Bhadani
Research Advisor
David Coe
Subject(s)
Code generators--Evaluation, Artificial intelligence, Analysis of variance
Abstract
As generative Artificial Intelligence (AI) tools become increasingly common in software development, there is a growing need to understand how well these tools perform beyond simply producing code that runs. This thesis examines the performance of four popular generative AI models, ChatGPT (GPT-4 mini), GitHub Copilot, Code LLaMA 3.3, and DeepSeek Web, in generating code that is not only functionally correct but also efficient and maintainable. To do this, we tested each model on six real-world-style coding problems sourced from LeetCode, covering a range of algorithmic challenges such as dynamic programming, graph traversal, and array manipulation. Using a consistent prompting strategy, we collected Python code samples from each model and evaluated them with established software engineering metrics: Lines of Code, Cyclomatic Complexity, Halstead Complexity, and the Maintainability Index. We then applied a detailed statistical analysis, including ANOVA, post hoc testing, and nonparametric methods, to determine which models performed best most consistently. Our results show that the problem type has the largest effect on the complexity and length of the generated code, whereas maintainability depends substantially on the AI model itself. LLaMA produced the most maintainable code across the board, while GitHub Copilot often generated more complex, harder-to-maintain solutions. ChatGPT and DeepSeek showed similar and generally solid performance, falling in the middle of the ranking. This research goes beyond simple pass/fail benchmarks and provides a clearer, more nuanced picture of how generative AI tools behave on practical programming tasks. Developers, educators, and tool makers can use these findings to choose the right AI assistant for their needs and to better understand where these models excel and where they still fall short.
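The abstract names its metrics and statistical tests only at a high level. The following is a minimal illustrative sketch of how such an evaluation could be scripted in Python, assuming the radon package (for Lines of Code, Cyclomatic Complexity, Halstead Complexity, and the Maintainability Index) and scipy (for a one-way ANOVA); the package choices, the score() helper, and the sample numbers are hypothetical and are not taken from the thesis.

# Hypothetical sketch: score one generated solution with the metrics named in the
# abstract, then compare models with a one-way ANOVA. Assumes the radon and scipy
# packages; the thesis does not state which tooling it actually used.
from radon.raw import analyze
from radon.complexity import cc_visit
from radon.metrics import h_visit, mi_visit
from scipy import stats

def score(source: str) -> dict:
    """Compute LOC, Cyclomatic Complexity, Halstead volume, and Maintainability Index."""
    raw = analyze(source)                                      # raw line counts
    cc_blocks = cc_visit(source)                               # per-function cyclomatic complexity
    cc = max((b.complexity for b in cc_blocks), default=1)
    halstead_volume = h_visit(source).total.volume             # module-level Halstead volume
    mi = mi_visit(source, multi=True)                          # Maintainability Index (0-100)
    return {"loc": raw.loc, "cc": cc, "halstead_volume": halstead_volume, "mi": mi}

# One-way ANOVA on Maintainability Index scores grouped by model
# (the numbers below are illustrative placeholders, not thesis results).
mi_by_model = {
    "chatgpt":  [68.2, 71.5, 64.9, 70.1, 66.3, 69.8],
    "copilot":  [55.4, 58.7, 52.1, 60.3, 57.9, 54.6],
    "llama":    [74.8, 77.2, 72.5, 76.0, 73.9, 75.4],
    "deepseek": [67.1, 69.4, 65.8, 70.6, 68.2, 66.9],
}
f_stat, p_value = stats.f_oneway(*mi_by_model.values())
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")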
Recommended Citation
Gontiya, Rajavi, "Analyzing code generation by AI models: an ANOVA-based study of quality, consistency, and composite ranking" (2025). Theses. 770.
https://louis.uah.edu/uah-theses/770