Date of Award

2024

Document Type

Thesis

Degree Name

Master of Science (MS)

Department

Computer Science

Committee Chair

Tathagata Mukherjee

Committee Member

Letha Etzkorn

Committee Member

Chaity Banerjee Mukherjee

Research Advisor

Tathagata Mukherjee

Subject(s)

Source code (Computer science), Python (Computer program language), Comparative semantics

Abstract

In this thesis, we conducted a comparative study of methods for generating Python source code embeddings and evaluated their effectiveness using semantic labels. We used both word embedding models, such as Word2Vec and GloVe, and document embedding models to capture the semantic meaning of Python source code. In the word embedding evaluation, Word2Vec combined with cosine distance achieved the highest nearest-neighbor precision, 0.5790. In the evaluation of Python source code (document) embeddings, our analysis across two datasets showed that Doc2Vec paired with cosine distance outperformed the other methods in semantic code similarity detection, achieving an AUROC between 0.80 and 0.81 and an AUPR between 0.82 and 0.83. Notably, transformer-based methods such as CodeBERT and GPT-2 underperformed when used solely for inference, likely because these large language models are better suited to tasks such as code completion and code recommendation than to generating robust source code embeddings.
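As a rough illustration of the evaluation described in the abstract, the sketch below scores pairs of embedding vectors by cosine distance and summarizes how well those scores separate semantically similar pairs from dissimilar ones using AUROC. It is not the thesis's code: the function names, the toy two-dimensional vectors, and the labels are all made up for illustration, and AUROC is computed here by direct pairwise comparison rather than with any particular library.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def auroc(labels, scores):
    """AUROC as the probability that a positive item outscores a negative one.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: (embedding A, embedding B, label),
# where label 1 means "semantically similar" and 0 means "dissimilar".
pairs = [
    ([1.0, 0.0], [1.0, 0.1], 1),
    ([0.0, 1.0], [0.1, 1.0], 1),
    ([1.0, 0.0], [0.0, 1.0], 0),
    ([1.0, 1.0], [1.0, -1.0], 0),
]

labels = [label for _, _, label in pairs]
# Smaller distance should mean "more similar", so negate it to get a score.
scores = [-cosine_distance(a, b) for a, b, _ in pairs]
toy_auroc = auroc(labels, scores)
print(f"AUROC on toy pairs: {toy_auroc:.2f}")
```

The pairwise AUROC is quadratic in the number of pairs, which is fine for a sketch; a real evaluation over a full dataset would more plausibly use a rank-based implementation from an established library.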
