Date of Award
2024
Document Type
Thesis
Degree Name
Master of Science (MS)
Department
Computer Science
Committee Chair
Tathagata Mukherjee
Committee Member
Letha Etzkorn
Committee Member
Chaity Banerjee Mukherjee
Research Advisor
Tathagata Mukherjee
Subject(s)
Source code (Computer science), Python (Computer program language), Comparative semantics
Abstract
In this thesis, we conducted a comparative study of methods for generating Python source code embeddings and evaluated their effectiveness using semantic labels. We used both word embedding models, such as Word2Vec and GloVe, and document embedding models to capture the semantic meaning of Python source code. In the word embedding evaluation, Word2Vec combined with cosine distance achieved the highest nearest-neighbor precision, 0.5790. In the evaluation of Python source code (document) embeddings, our analysis across two datasets showed that Doc2Vec paired with cosine distance outperformed the other methods at semantic code similarity detection, achieving an AUROC between 0.80 and 0.81 and an AUPR between 0.82 and 0.83. Notably, transformer-based methods such as CodeBERT and GPT-2 underperformed when used solely for inference, likely because these large language models are better suited to tasks such as code completion and code recommendation than to generating robust source code embeddings.
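As an illustration of the kind of evaluation the abstract describes, the following minimal sketch scores pairs of embeddings by cosine similarity and computes AUROC against binary semantic labels. It is not the thesis code: the function names, the toy two-dimensional vectors, and the labels are all hypothetical, and a real evaluation would use embeddings produced by a model such as Doc2Vec.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def auroc(scores, labels):
    """AUROC as the probability that a positive pair outscores a negative pair.

    Ties count as half a win. Quadratic in the number of pairs, which is
    fine for a small illustration.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative pair")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: (embedding_a, embedding_b, label), where label 1
# means the two code snippets are semantically similar.
pairs = [
    ([1.0, 0.0], [0.9, 0.1], 1),
    ([0.0, 1.0], [0.1, 0.9], 1),
    ([1.0, 0.0], [0.0, 1.0], 0),
    ([0.5, 0.5], [1.0, -1.0], 0),
]
scores = [cosine_similarity(a, b) for a, b, _ in pairs]
labels = [y for _, _, y in pairs]
print(auroc(scores, labels))
```

In this toy setup the similar pairs are nearly parallel and the dissimilar pairs are orthogonal, so the ranking is perfect. On real embeddings the score falls between 0.5 (chance) and 1.0.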
Recommended Citation
Gyawali, Binita, "A comparative study of methods for modeling Python source code semantic similarity" (2024). Theses. 724.
https://louis.uah.edu/uah-theses/724