Date of Award
2017
Document Type
Thesis
Degree Name
Master of Science (MS)
Department
Computer Science
Committee Chair
Ramazan S. Aygun
Committee Member
Daniel Rochowiak
Committee Member
Marc Pusey
Subject(s)
Proteins--Analysis, Crystallization, Data integration (Computer science)
Abstract
Many companies have developed commercial screen kits containing different combinations of chemicals for protein crystallization trials. Scientists typically use screen kits from several companies to crystallize a single protein. The differing data representations and naming conventions used by these companies make the automated analysis of crystallization experiments difficult and time-consuming. Matching headers between the input and output screens must be identified, and the data must then be copied under the corresponding headers in the output file. To reduce the human effort this requires, we present an algorithm based on linguistic schema matching and data integration that automatically finds the matching elements between the two screen kit schemas using three syntactic similarity measures and then transforms the input screen file into the required output screen format. The approach was tested on several commercial screens from different companies and evaluated using two metrics. The experiments showed an overall accuracy of 97% and an F-measure of 0.99, both significantly better than the two other matchers we compared against. Protein screen kits also name chemicals inconsistently, because there is no standard format for the names used in the screens, which makes the analysis task difficult. The method proposed in this thesis therefore also produces an output file with consistent names for the chemicals.
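The header-matching and transformation steps described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the abstract does not name its three syntactic similarity measures, so the ones used here (a character-level ratio from `difflib`, token Jaccard, and trigram Dice) are stand-ins. The equal weighting, the 0.4 threshold, and all header names are likewise assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher


def edit_similarity(a, b):
    # Character-level similarity ratio in [0, 1] (stand-in measure 1).
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_jaccard(a, b):
    # Jaccard overlap of the word tokens in each header (stand-in measure 2).
    ta = set(re.findall(r"[a-z0-9]+", a.lower()))
    tb = set(re.findall(r"[a-z0-9]+", b.lower()))
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def trigram_dice(a, b):
    # Dice coefficient over character trigrams (stand-in measure 3).
    def grams(s):
        s = re.sub(r"[^a-z0-9]", "", s.lower())
        return {s[i:i + 3] for i in range(len(s) - 2)}

    ga, gb = grams(a), grams(b)
    return 2 * len(ga & gb) / (len(ga) + len(gb)) if ga and gb else 0.0


def combined(a, b):
    # Assumption: the three measures are simply averaged with equal weight.
    return (edit_similarity(a, b) + token_jaccard(a, b) + trigram_dice(a, b)) / 3


def match_headers(input_headers, output_headers, threshold=0.4):
    # For each output header, keep the best-scoring input header
    # if its combined score reaches the (assumed) threshold.
    mapping = {}
    for out in output_headers:
        best = max(input_headers, key=lambda h: combined(h, out))
        if combined(best, out) >= threshold:
            mapping[out] = best
    return mapping


def transform(rows, mapping, output_headers):
    # Copy each input value under its matched output header;
    # unmatched output headers are left empty.
    result = []
    for row in rows:
        result.append({
            out: row.get(mapping[out], "") if out in mapping else ""
            for out in output_headers
        })
    return result


# Hypothetical screen headers and one hypothetical row of screen data.
input_headers = ["Well No", "Salt Conc", "Precipitant Name"]
output_headers = ["Well", "Salt Concentration", "Precipitant"]
rows = [{"Well No": "A1", "Salt Conc": "0.2 M", "Precipitant Name": "PEG 4000"}]

mapping = match_headers(input_headers, output_headers)
converted = transform(rows, mapping, output_headers)
```

Averaging several measures is one simple way to keep a single weak signal, such as an abbreviation that defeats token matching, from blocking an otherwise clear match. A real matcher would also need to stop two output headers from claiming the same input header, which this sketch does not attempt.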
Recommended Citation
Shrestha, Midusha, "Schema matching and data integration with consistent naming for protein crystallization screens" (2017). Theses. 212.
https://louis.uah.edu/uah-theses/212