METHOD OF PARALLEL HYBRID SEARCH FOR LARGE-SCALE CODE REPOSITORIES

V. O. Boiko

doi:10.15588/1607-3274-2025-3-6

Authors

V. O. Boiko, Khmelnytskyi National University, Khmelnytskyi, Ukraine

DOI:

https://doi.org/10.15588/1607-3274-2025-3-6

Keywords:

hybrid code search, vector search, semantic embeddings, code summarization, LLM-generated metadata, cosine similarity, textual relevance, class and method retrieval, class-based indexing, software engineering

Abstract

Context. Modern software systems contain extensive and growing codebases, making code retrieval a critical task for software engineers. Traditional code search methods rely on keyword-based matching or structural analysis, but they often fail to capture the semantic intent of user queries and struggle with unstructured or inconsistently documented code. Recently, semantic vector search and large language models (LLMs) have shown promise in enhancing code understanding. The problem is to design a scalable and accurate hybrid code search method that retrieves relevant code snippets based on both textual queries and semantic context, while supporting parallel processing and metadata enrichment.
Objective. The goal of the study is to develop a hybrid method for semantic code search by combining keyword-based filtering and embedding-based retrieval enhanced with LLM-generated summaries and semantic tags. The aim is to improve accuracy and efficiency in locating relevant code elements across large code repositories.
Method. A two-path search method with post-processing is proposed, in which textual keyword search and embedding-based semantic search are executed in parallel. Code blocks are preprocessed with the GPT-4o model to generate natural-language summaries and semantic tags.
Results. The method has been implemented and validated on a .NET codebase, demonstrating improved precision in retrieving semantically relevant methods. The combination of parallel search paths and LLM-generated metadata enhanced both result quality and responsiveness. Additionally, LLM post-processing was applied to the top-ranked results, enabling more precise identification of the code lines that match the query within the retrieved snippets. The remaining results can be refined further on demand.
Conclusions. Experimental findings confirm the operability and practical applicability of the proposed hybrid code search framework. The system’s modular architecture supports real-time developer workflows, and its extensibility enables future improvements through active learning and user feedback. Further research may focus on optimizing embedding selection strategies, integrating automatic query rewriting, and scaling across polyglot code environments.
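
The two-path idea described in the Method paragraph can be illustrated with a minimal Python sketch. It is not the author's implementation: the entry names, summaries, tags, the bag-of-words `toy_embed` function, and the `alpha` weighting are all hypothetical stand-ins chosen so the example runs without external services. In the described method the summaries and tags come from GPT-4o and the vectors from a learned embedding model; here they are hand-written and counted from tokens.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
import math
import re


@dataclass
class CodeEntry:
    name: str      # class or method identifier (hypothetical)
    summary: str   # natural-language summary (hand-written stand-in for LLM output)
    tags: list     # semantic tags (hand-written stand-in for LLM output)


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_path(query, entries):
    """Textual path: fraction of query tokens found among an entry's tags."""
    q = set(tokenize(query))
    scores = {}
    for e in entries:
        tags = {t.lower() for t in e.tags}
        scores[e.name] = len(q & tags) / len(q) if q else 0.0
    return scores


def toy_embed(text, vocab):
    """Assumption: a token-count vector stands in for a real embedding model."""
    tokens = tokenize(text)
    return [float(tokens.count(w)) for w in vocab]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def semantic_path(query, entries):
    """Semantic path: cosine similarity between query and summary vectors."""
    vocab = sorted({w for e in entries for w in tokenize(e.summary)} | set(tokenize(query)))
    qv = toy_embed(query, vocab)
    return {e.name: cosine(qv, toy_embed(e.summary, vocab)) for e in entries}


def hybrid_search(query, entries, alpha=0.5):
    """Run both paths in parallel, then merge scores with a weighted sum."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        kw_future = pool.submit(keyword_path, query, entries)
        sem_future = pool.submit(semantic_path, query, entries)
        kw, sem = kw_future.result(), sem_future.result()
    merged = [(e.name, alpha * kw[e.name] + (1 - alpha) * sem[e.name]) for e in entries]
    # Highest combined score first; ties are broken alphabetically by name.
    return sorted(merged, key=lambda item: (-item[1], item[0]))


ENTRIES = [
    CodeEntry("UserRepository.FindByEmail",
              "Looks up a user record by email address in the database",
              ["user", "email", "lookup"]),
    CodeEntry("InvoiceService.CalculateTotal",
              "Computes the invoice total including tax and discounts",
              ["invoice", "tax", "total"]),
    CodeEntry("CacheManager.Evict",
              "Removes expired items from the in-memory cache",
              ["cache", "eviction"]),
]

results = hybrid_search("find user by email", ENTRIES)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

The weighted sum is only one possible way to merge the two result sets; the abstract does not state which fusion rule the method uses. The LLM post-processing step applied to the top-ranked results is not covered by this sketch.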

Author Biography

V. O. Boiko, Khmelnytskyi National University, Khmelnytskyi, Ukraine

Assistant of the Department of Software Engineering

References

Kumar Vivek, Chinmay Bhatt, Varsha Namdeo A framework for document plagiarism detection using Rabin Karp method, International Journal of Innovative Research in Technology and Management, 2021, Vol. 5, pp. 17–30.

Zhang Ashley Ge, Chen Yan, Oney Steve RunEx: Augmenting Regular-Expression Code Search with Runtime Values, Proceedings of the 2023 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC), 2023, pp. 145–155. DOI: 10.1109/VL-HCC57772.2023.00024

Karnalim Oscar, Simon Syntax Trees and Information Retrieval to Improve Code Similarity Detection, Proceedings of the Twenty-Second Australasian Computing Education Conference (ACE 2020), 2020, pp. 48–55. DOI: 10.1145/3373165.3373171

Liu Chao, Xia Xin, Lo David, Liu Zhiwei, Hassan Ahmed E., Li Shanping CodeMatcher: Searching Code Based on Sequential Semantics of Important Query Words. Ithaca, arXiv, 2020, 36 p. (Preprint / arXiv; 2005.14373). DOI: 10.1145/3465403

Kong Xianglong, Chen Hongyu, Yu Ming, Zhang Lixiang Boosting Code Search with Structural Code Annotation, Electronics, 2022, Vol. 11, No. 19, P. 3053. DOI: 10.3390/electronics11193053

Gotmare Khilesh Deepak, Li Junnan, Joty Shafiq, Hoi Steven C. H. Cascaded Fast and Slow Models for Efficient Semantic Code Search. Ithaca, arXiv, 2021, 12 p. (Preprint / arXiv; 2110.07811). DOI: 10.48550/arXiv.2110.07811

Jain Sarthak, Dora Aditya, Sam Ka Seng, Singh Prabhat LLM Agents Improve Semantic Code Search. Ithaca, arXiv, 2024, 12 p. (Preprint / arXiv; 2408.11058). DOI: 10.48550/arXiv.2408.11058

Khan M. A. M. Development of a code search engine using natural language processing technique: Graduate thesis. IUT, Department of Computer Science and Engineering, 2023, 65 p.

Deng Zhongyang, Xu Ling, Liu Chao, Huangfu Luwen, Yan Meng Code semantic enrichment for deep code search, Journal of Systems and Software, 2024, Vol. 207, P. 111856. DOI: 10.1016/j.jss.2023.111856

Chen Junkai, Hu Xing, Li Zhenhao, Gao Cuiyun, Xia Xin, Lo David Code Search Is All You Need? Improving Code Suggestions with Code Search, Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (ICSE 2024), Lisbon, Portugal, April 14–20, 2024, Article No. 73, pp. 1–13. DOI: 10.1145/3597503.3639085

Nate Suraj, Patil Om, Medar Shreenidhi, Deshmukh Jyoti A Survey on Transformer-based Models in Code Summarization, International Research Journal on Advanced Engineering Hub (IRJAEH), 2025, Vol. 3, pp. 740–745. DOI: 10.47392/IRJAEH.2025.0103

Parmar Mihir, Deilamsalehy Hanieh, Dernoncourt Franck, Yoon Seunghyun, Rossi Ryan A., Bui Trung Towards Enhancing Coherence in Extractive Summarization: Dataset and Experiments with LLMs, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024, pp. 19810–19820. DOI: 10.18653/v1/2024.emnlp-main.1106

Korade Nilesh Bhikaji, Salunke Mahendra B., Bhosle Amol, Kumbharkar Prashant Babarao, Asalkar Gayatri, Khedkar Rutuja G. Strengthening Sentence Similarity Identification Through OpenAI Embeddings and Deep Learning, International Journal of Advanced Computer Science and Applications, 2024, Vol. 15, No. 4, pp. 821–829. DOI: 10.14569/IJACSA.2024.0150485

OpenAI. New and improved embedding model [Electronic resource], OpenAI, Mode of access: https://openai.com/index/new-and-improved-embeddingmodel (date of access: 09.04.2025). – Title from screen.

Patil Rajvardhan, Boit Sorio, Gudivada Venkat N., Nandigam Jagadeesh A Survey of Text Representation and Embedding Techniques in NLP, IEEE Access, 2023, Vol. 11, pp. 36120–36146. DOI: 10.1109/ACCESS.2023.3266377

Jiang Xue, Wang Weiren, Tian Shaohan, Wang Hao, Lookman Turab, Su Yanjing Applications of natural language processing and large language models in materials discovery, npj Computational Materials, 2025, Vol. 11. DOI: 10.1038/s41524-025-01554-0

Qdrant. Qdrant Vector Database: High-performance vector similarity search [Electronic resource], Qdrant Documentation. Mode of access: https://qdrant.tech/qdrant-vectordatabase (date of access: 09.04.2025). – Title from screen.

Elastic. Elasticsearch: The Official Distributed Search & Analytics Engine [Electronic resource], Elastic. Mode of access: https://www.elastic.co/elasticsearch (date of access: 09.04.2025). – Title from screen.

Hoyt Charles Tapley, Berrendorf Max, Galkin Mikhail, Tresp Volker, Gyori Benjamin M. A Unified Framework for Rank-based Evaluation Metrics for Link Prediction in Knowledge Graphs. Ithaca, arXiv, 2022, 18 p. (Preprint / arXiv; 2203.07544). DOI: 10.48550/arXiv.2203.07544
