METHOD OF IMPERATIVE VARIABLES FOR SEARCH AUTOMATION OF TEXTUAL CONTENT IN UNSTRUCTURED DOCUMENTS
DOI:
https://doi.org/10.15588/1607-3274-2024-2-12
Keywords:
textual search, unstructured text documents, natural language processing, rule-based search, generative artificial intelligence, imperative variables
Abstract
Context. Many approaches to textual search exist. Methods such as pattern matching and optical character recognition are widely used to retrieve information from documents, and their effectiveness is proven. However, they rely on a common or predictable document structure, so unstructured documents are neglected. The problem is automating textual search in documents with unstructured content. The object of the study was the development of a method and its implementation in an efficient model for searching content in unstructured textual information.
Objective. The goal of the work is to implement a rule-based textual search method and a model for finding and retrieving information from documents with unstructured text content.
Method. To achieve the research goal, a method of rule-based textual search in heterogeneous content was developed and applied in an appropriately designed model. It is based on natural language processing, which has improved in recent years as generative artificial intelligence has become more available.
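The abstract does not detail the rules or the natural language processing step, so the following is only a minimal sketch of what rule-based search over free-form text can look like. It swaps in plain regular expressions for the language-processing step, and the `Rule` class, the rule names, and the sample text are all hypothetical illustrations, not the author's method.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    """Hypothetical rule: a label plus a regular expression.

    Assumption: a regex stands in for the NLP-based matching that the
    abstract mentions but does not specify.
    """
    name: str
    pattern: str


def split_sentences(text: str) -> list[str]:
    # Naive sentence splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def apply_rules(text: str, rules: list[Rule]) -> dict[str, list[str]]:
    """Return, for each rule name, the sentences in which the rule matches."""
    sentences = split_sentences(text)
    results: dict[str, list[str]] = {}
    for rule in rules:
        compiled = re.compile(rule.pattern, re.IGNORECASE)
        results[rule.name] = [s for s in sentences if compiled.search(s)]
    return results


if __name__ == "__main__":
    sample = (
        "The contract was signed on 12 May 2023. "
        "Payment is due within 30 days. "
        "Contact: jane@example.com."
    )
    rules = [
        Rule("date", r"\b\d{1,2} \w+ \d{4}\b"),
        Rule("email", r"[\w.+-]+@[\w-]+\.[\w.]+"),
    ]
    for name, hits in apply_rules(sample, rules).items():
        print(name, hits)
```

Because the rules are data rather than code, the matching step could in principle be replaced by a language-model call without changing how the rules are declared.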
Results. The method has been implemented in a designed model that serves as a pattern, or framework, of unstructured textual search for software engineers. An application programming interface has been implemented.
Conclusions. The conducted experiments have confirmed that the proposed software is operable and support recommending it for practical use in solving problems of textual search in unstructured documents. Prospects for further research include improving performance through multithreading or parallelization for large textual documents, along with optimization approaches that minimize the impact of the OpenAI application programming interface's content processing limitations. Further investigation might also extend the use of imperative variables in programming and software development.
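The multithreading direction named in the conclusions can be illustrated with a small sketch: split a large text into chunks that stay under a size limit, then process the chunks concurrently. This is an assumption-laden illustration, not the proposed software. The functions `split_into_chunks`, `search_chunk`, and `parallel_search` are hypothetical, the character limit is arbitrary, and a local substring match stands in for any remote API call.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(text: str, max_chars: int) -> list[str]:
    """Greedily pack whitespace-separated words into chunks of at most
    max_chars characters. A single word longer than max_chars becomes its
    own oversized chunk rather than being cut."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    chunks: list[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


def search_chunk(chunk: str, term: str) -> list[str]:
    # Hypothetical per-chunk work: a case-insensitive substring match on
    # words. In a real system this is where a slow, rate-limited call
    # would sit.
    return [w for w in chunk.split() if term.lower() in w.lower()]


def parallel_search(text: str, term: str, max_chars: int = 2000,
                    workers: int = 4) -> list[str]:
    """Search all chunks concurrently and return matches in document order."""
    chunks = split_into_chunks(text, max_chars)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the order of the input chunks.
        per_chunk = pool.map(lambda c: search_chunk(c, term), chunks)
        return [match for matches in per_chunk for match in matches]


if __name__ == "__main__":
    document = "alpha invoice beta Invoice-42 gamma"
    print(parallel_search(document, "invoice", max_chars=14))
```

Threads suit this pattern when the per-chunk work is dominated by waiting on network responses. Splitting on word boundaries can separate a multi-word phrase across two chunks, so a production version would likely need overlapping chunks or sentence-aware splitting, and the worker count would have to respect whatever rate limits the remote service imposes.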
License
Copyright (c) 2024 В. О. Бойко
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Creative Commons Licensing Notifications in the Copyright Notices
The journal allows the authors to hold the copyright without restrictions and to retain publishing rights without restrictions.
The journal allows readers to read, download, copy, distribute, print, search, or link to the full texts of its articles.
The journal allows reuse and remixing of its content in accordance with a Creative Commons license CC BY-SA.
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License CC BY-SA that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work.