Date of Award

2025

Document Type

Thesis

Degree Name

Master of Science in Software Engineering (MSSE)

Department

Electrical and Computer Engineering

Committee Chair

David Coe

Committee Member

Aleksandar Milenkovic

Committee Member

Rahul Bhadani

Research Advisor

David Coe

Subject(s)

Code generators--Evaluation, Artificial intelligence, Analysis of variance

Abstract

As generative Artificial Intelligence (AI) tools become increasingly common in software development, there is a growing need to understand how well they perform beyond simply producing code that runs. This thesis examines the performance of four popular generative AI models, ChatGPT (GPT-4 mini), GitHub Copilot, Code LLaMA 3.3, and DeepSeek Web, in generating code that is not only functionally correct but also efficient and maintainable. To do this, we tested each model on six real-world-style coding problems sourced from LeetCode, covering a range of algorithmic challenges such as dynamic programming, graph traversal, and array manipulation. Using a consistent prompting strategy, we collected Python code samples from each model and evaluated them with established software engineering metrics: Lines of Code, Cyclomatic Complexity, Halstead Complexity, and the Maintainability Index. We then applied a detailed statistical analysis, including ANOVA, post hoc testing, and nonparametric methods, to determine which models performed best most consistently. Our results show that the type of problem has the largest impact on the complexity and length of the code, but the choice of AI model strongly influences how maintainable the code is. LLaMA produced the most maintainable code across the board, while GitHub Copilot often generated more complex, harder-to-maintain solutions. ChatGPT and DeepSeek showed similar and generally solid performance, landing somewhere in the middle. This research goes beyond simple pass/fail benchmarks and provides a clearer, more nuanced understanding of how generative AI tools behave on practical programming tasks. Developers, educators, and tool makers can use these findings to choose the right AI assistant for their needs and to better understand where these models shine and where they still fall short.
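
The sketch below illustrates the metric-extraction step the abstract describes; it is not taken from the attached notebook. It computes Lines of Code, Cyclomatic Complexity, Halstead volume, and the Maintainability Index for generated snippets using the radon Python package, and the (model, problem, code) samples are hypothetical placeholders standing in for the collected solutions.

    import pandas as pd
    from radon.raw import analyze
    from radon.complexity import cc_visit
    from radon.metrics import h_visit, mi_visit

    def metrics_for(source: str) -> dict:
        """Compute the study's four metrics for one generated Python snippet."""
        return {
            "loc": analyze(source).loc,                       # Lines of Code
            "cyclomatic": sum(b.complexity for b in cc_visit(source)),
            "halstead_volume": h_visit(source).total.volume,  # Halstead volume
            "maintainability": mi_visit(source, multi=True),  # 0-100 scale
        }

    # Hypothetical (model, problem, generated code) records; the study itself
    # covered four models and six LeetCode problems.
    samples = [
        ("ChatGPT", "two_sum",
         "def two_sum(nums, target):\n"
         "    seen = {}\n"
         "    for i, n in enumerate(nums):\n"
         "        if target - n in seen:\n"
         "            return [seen[target - n], i]\n"
         "        seen[n] = i\n"),
        ("Copilot", "two_sum",
         "def two_sum(nums, target):\n"
         "    for i in range(len(nums)):\n"
         "        for j in range(i + 1, len(nums)):\n"
         "            if nums[i] + nums[j] == target:\n"
         "                return [i, j]\n"),
    ]
    df = pd.DataFrame([{"model": m, "problem": p, **metrics_for(src)}
                       for m, p, src in samples])
    print(df)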
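A second sketch shows the statistical step: a one-way ANOVA across models with scipy, Tukey HSD post hoc pairwise comparisons with statsmodels, and a Kruskal-Wallis test as the nonparametric check. The Maintainability Index scores here are randomly generated so the example runs on its own; they are not the thesis data.

    import numpy as np
    import pandas as pd
    from scipy.stats import f_oneway, kruskal
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    rng = np.random.default_rng(0)
    # Synthetic Maintainability Index scores, six problems per model; the
    # values are random placeholders, not the thesis measurements.
    df = pd.DataFrame({
        "model": np.repeat(["ChatGPT", "Copilot", "DeepSeek", "LLaMA"], 6),
        "mi": np.concatenate([
            rng.normal(70, 5, 6),   # ChatGPT
            rng.normal(60, 5, 6),   # Copilot
            rng.normal(70, 5, 6),   # DeepSeek
            rng.normal(80, 5, 6),   # LLaMA
        ]),
    })

    groups = [g["mi"].to_numpy() for _, g in df.groupby("model")]
    print(f_oneway(*groups))                         # one-way ANOVA across models
    print(pairwise_tukeyhsd(df["mi"], df["model"]))  # post hoc pairwise tests
    print(kruskal(*groups))                          # nonparametric counterpart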

Anova_Analysis 11161.ipynb (748 kB)
ANOVA Analysis

DataSheet_Thesis 11161.csv (8 kB)
Data Sheet
