METHOD OF PARALLEL HYBRID SEARCH FOR LARGE-SCALE CODE REPOSITORIES
DOI: https://doi.org/10.15588/1607-3274-2025-3-6

Keywords: hybrid code search, vector search, semantic embeddings, code summarization, LLM-generated metadata, cosine similarity, textual relevance, class and method retrieval, class-based indexing, software engineering

Abstract
Context. Modern software systems contain extensive and growing codebases, making code retrieval a critical task for software engineers. Traditional code search methods rely on keyword-based matching or structural analysis but often fail to capture the semantic intent of user queries or struggle with unstructured and inconsistently documented code. Recently, semantic vector search and large language models (LLMs) have shown promise in enhancing code understanding. The problem is to design a scalable, accurate, hybrid code search method capable of retrieving relevant code snippets based on both textual queries and semantic context, while supporting parallel processing and metadata enrichment.
Objective. The goal of the study is to develop a hybrid method for semantic code search by combining keyword-based filtering and embedding-based retrieval enhanced with LLM-generated summaries and semantic tags. The aim is to improve accuracy and efficiency in locating relevant code elements across large code repositories.
Method. A two-path search method with post-processing is proposed, in which textual keyword search and embedding-based semantic search are executed in parallel. Code blocks are preprocessed with the GPT-4o model to generate natural-language summaries and semantic tags.
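The two parallel paths and their merge step can be sketched as follows. The toy corpus, its metadata fields, the three-dimensional embeddings, and the weighted-sum fusion rule are all illustrative assumptions rather than the paper's exact implementation, which targets a .NET codebase with GPT-4o-generated summaries stored alongside vector embeddings:

```python
# Minimal sketch of the two-path hybrid search: a textual keyword path and a
# semantic embedding path run concurrently, and their scores are merged.
from concurrent.futures import ThreadPoolExecutor
import math

# Hypothetical index: each code block carries an LLM-generated summary,
# semantic tags, and a (toy, 3-dimensional) embedding vector.
CORPUS = [
    {"id": "ParseConfig", "summary": "reads configuration file into settings object",
     "tags": ["config", "io"], "embedding": [0.9, 0.1, 0.0]},
    {"id": "SendEmail", "summary": "sends notification email via smtp client",
     "tags": ["email", "network"], "embedding": [0.0, 0.2, 0.9]},
]

def cosine(a, b):
    # Cosine similarity between two vectors, as used for semantic retrieval.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def keyword_path(query_terms):
    # Textual path: score blocks by keyword overlap with summary and tags.
    results = {}
    for block in CORPUS:
        text = block["summary"].split() + block["tags"]
        hits = sum(1 for t in query_terms if t in text)
        if hits:
            results[block["id"]] = hits / len(query_terms)
    return results

def semantic_path(query_embedding):
    # Semantic path: cosine similarity against stored block embeddings.
    return {b["id"]: cosine(query_embedding, b["embedding"]) for b in CORPUS}

def hybrid_search(query_terms, query_embedding, alpha=0.5):
    # Execute both paths in parallel, then merge scores with a weighted sum
    # (this fusion rule is an assumption, not the paper's exact formula).
    with ThreadPoolExecutor(max_workers=2) as pool:
        kw_future = pool.submit(keyword_path, query_terms)
        sem_future = pool.submit(semantic_path, query_embedding)
        kw, sem = kw_future.result(), sem_future.result()
    ids = set(kw) | set(sem)
    scored = {i: alpha * kw.get(i, 0.0) + (1 - alpha) * sem.get(i, 0.0) for i in ids}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

print(hybrid_search(["config", "file"], [1.0, 0.0, 0.0]))
```

In a production setting the two paths would query a full-text engine and a vector database rather than in-memory lists, but the parallel-execution and score-merging structure stays the same.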
Results. The method has been implemented and validated on a .NET codebase, demonstrating improved precision in retrieving semantically relevant methods. The combination of parallel search paths and LLM-generated metadata enhanced both result quality and responsiveness. Additionally, LLM post-processing was applied to the top-ranked results, enabling more precise identification of the code lines within each retrieved snippet that match the query. Remaining results can be refined further on demand.
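As an illustration of that post-processing step, the sketch below picks out query-relevant lines inside a retrieved snippet. A plain keyword match stands in for the GPT-4o call used in the described system, and both the snippet and the query terms are invented:

```python
# Stand-in for LLM post-processing: locate the lines of a retrieved snippet
# that mention any query term (the real system delegates this to GPT-4o).
def matching_lines(snippet: str, query_terms: list[str]) -> list[tuple[int, str]]:
    """Return (1-based line number, stripped line) pairs containing a query term."""
    hits = []
    for n, line in enumerate(snippet.splitlines(), start=1):
        lowered = line.lower()
        if any(term.lower() in lowered for term in query_terms):
            hits.append((n, line.strip()))
    return hits

# Hypothetical C# snippet returned by the hybrid search.
snippet = """public Settings ParseConfig(string path)
{
    var text = File.ReadAllText(path);
    return JsonSerializer.Deserialize<Settings>(text);
}"""

print(matching_lines(snippet, ["config", "deserialize"]))
```

Because only the top-ranked snippets go through this step, the expensive LLM call is bounded per query, while lower-ranked results can be refined lazily when the user asks for them.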
Conclusions. Experimental findings confirm the operability and practical applicability of the proposed hybrid code search framework. The system’s modular architecture supports real-time developer workflows, and its extensibility enables future improvements through active learning and user feedback. Further research may focus on optimizing embedding selection strategies, integrating automatic query rewriting, and scaling across polyglot code environments.
References
Kumar Vivek, Bhatt Chinmay, Namdeo Varsha A framework for document plagiarism detection using Rabin–Karp method, International Journal of Innovative Research in Technology and Management, 2021, Vol. 5, pp. 17–30.
Zhang Ashley Ge, Chen Yan, Oney Steve RunEx: Augmenting Regular-Expression Code Search with Runtime Values, Proceedings of the 2023 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC), 2023, pp. 145–155. DOI: 10.1109/VL-HCC57772.2023.00024
Karnalim Oscar, Simon Syntax Trees and Information Retrieval to Improve Code Similarity Detection, Proceedings of the Twenty-Second Australasian Computing Education Conference (ACE 2020), 2020, pp. 48–55. DOI: 10.1145/3373165.3373171
Liu Chao, Xia Xin, Lo David, Liu Zhiwei, Hassan Ahmed E., Li Shanping CodeMatcher: Searching Code Based on Sequential Semantics of Important Query Words. Ithaca: arXiv, 2020, 36 p. (Preprint / arXiv; 2005.14373). DOI: 10.1145/3465403
Kong Xianglong, Chen Hongyu, Yu Ming, Zhang Lixiang Boosting Code Search with Structural Code Annotation, Electronics, 2022, Vol. 11, No. 19, P. 3053. DOI: 10.3390/electronics11193053
Gotmare Khilesh Deepak, Li Junnan, Joty Shafiq, Hoi Steven C. H. Cascaded Fast and Slow Models for Efficient Semantic Code Search. Ithaca: arXiv, 2021, 12 p. (Preprint / arXiv; 2110.07811). DOI: 10.48550/arXiv.2110.07811
Jain Sarthak, Dora Aditya, Sam Ka Seng, Singh Prabhat LLM Agents Improve Semantic Code Search. Ithaca: arXiv, 2024, 12 p. (Preprint / arXiv; 2408.11058). DOI: 10.48550/arXiv.2408.11058
Khan M. A. M. Development of a code search engine using natural language processing technique: Graduate thesis. IUT, Department of Computer Science and Engineering, 2023, 65 p.
Deng Zhongyang, Xu Ling, Liu Chao, Huangfu Luwen, Yan Meng Code semantic enrichment for deep code search, Journal of Systems and Software, 2024, Vol. 207, P. 111856. DOI: 10.1016/j.jss.2023.111856
Chen Junkai, Hu Xing, Li Zhenhao, Gao Cuiyun, Xia Xin, Lo David Code Search Is All You Need? Improving Code Suggestions with Code Search, Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (ICSE 2024), Lisbon, Portugal, April 14–20, 2024, Article No. 73, pp. 1–13. DOI: 10.1145/3597503.3639085
Nate Suraj, Patil Om, Medar Shreenidhi, Deshmukh Jyoti A Survey on Transformer-based Models in Code Summarization, International Research Journal on Advanced Engineering Hub (IRJAEH), 2025, Vol. 3, pp. 740–745. DOI: 10.47392/IRJAEH.2025.0103
Parmar Mihir, Deilamsalehy Hanieh, Dernoncourt Franck, Yoon Seunghyun, Rossi Ryan A., Bui Trung Towards Enhancing Coherence in Extractive Summarization: Dataset and Experiments with LLMs, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024, pp. 19810–19820. DOI: 10.18653/v1/2024.emnlp-main.1106
Korade Nilesh Bhikaji, Salunke Mahendra B., Bhosle Amol, Kumbharkar Prashant Babarao, Asalkar Gayatri, Khedkar Rutuja G. Strengthening Sentence Similarity Identification Through OpenAI Embeddings and Deep Learning, International Journal of Advanced Computer Science and Applications, 2024, Vol. 15, No. 4, pp. 821–829. DOI: 10.14569/IJACSA.2024.0150485
OpenAI. New and improved embedding model [Electronic resource], OpenAI. Mode of access: https://openai.com/index/new-and-improved-embedding-model (date of access: 09.04.2025). – Title from screen.
Patil Rajvardhan, Boit Sorio, Gudivada Venkat N., Nandigam Jagadeesh A Survey of Text Representation and Embedding Techniques in NLP, IEEE Access, 2023, Vol. 11, pp. 36120–36146. DOI: 10.1109/ACCESS.2023.3266377
Jiang Xue, Wang Weiren, Tian Shaohan, Wang Hao, Lookman Turab, Su Yanjing Applications of natural language processing and large language models in materials discovery, npj Computational Materials, 2025, Vol. 11. DOI: 10.1038/s41524-025-01554-0
Qdrant. Qdrant Vector Database: High-performance vector similarity search [Electronic resource], Qdrant Documentation. Mode of access: https://qdrant.tech/qdrant-vector-database (date of access: 09.04.2025). – Title from screen.
Elastic. Elasticsearch: The Official Distributed Search & Analytics Engine [Electronic resource], Elastic. Mode of access: https://www.elastic.co/elasticsearch (date of access: 09.04.2025). – Title from screen.
Hoyt Charles Tapley, Berrendorf Max, Galkin Mikhail, Tresp Volker, Gyori Benjamin M. A Unified Framework for Rank-based Evaluation Metrics for Link Prediction in Knowledge Graphs. Ithaca: arXiv, 2022, 18 p. (Preprint / arXiv; 2203.07544). DOI: 10.48550/arXiv.2203.07544
Copyright (c) 2025 V. O. Boiko

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.