Mestrado em Ciência da Computação (INF)

Permanent URI for this collection: http://200.137.215.59/tede/handle/tde/225


Recent Submissions

Now showing 1 - 20 of 318
  • Enhanced-ProBlock: adaptações em uma abordagem descentralizada, baseada em checagem humana e apoiada por blockchain para consenso ponderado na detecção de desinformação
    (Universidade Federal de Goiás, 2025-05-26) Damacena, Pedro Henrique Campos; Borges, Vinicius da Cunha Martins; http://lattes.cnpq.br/6904676677900593; Borges, Vinicius da Cunha Martins; http://lattes.cnpq.br/6904676677900593; Lima, Eliomar Araújo de; http://lattes.cnpq.br/1362170231777201; Graciano Neto, Valdemar Vicente; http://lattes.cnpq.br/9864803557706493
    Embargoed.
  • Verificação semi-automática de fatos em português: enriquecimento de corpus via busca e extração de alegação
    (Universidade Federal de Goiás, 2025-06-10) Gomes, Juliana Resplande Sant'Anna; Galvão Filho, Arlindo Rodrigues; http://lattes.cnpq.br/7744765287200890; Galvão Filho, Arlindo Rodrigues; http://lattes.cnpq.br/7744765287200890; Lima, Eliomar Araújo de; http://lattes.cnpq.br/1362170231777201; Soares, Telma de Woerle de Lima; http://lattes.cnpq.br/6296363436468330
    The accelerated dissemination of disinformation often outpaces the capacity for manual fact-checking, highlighting the urgent need for Semi-Automated Fact-Checking (SAFC) systems. Within the Portuguese language context, there is a noted scarcity of publicly available datasets (corpora) that integrate external evidence, an essential component for developing robust SAFC systems, as many existing resources focus solely on classification based on intrinsic text features. This dissertation addresses this gap by developing, applying, and analyzing a methodology to enrich Portuguese news corpora (Fake.Br, COVID19.BR, MuMiN-PT) with external evidence. The approach simulates a user’s verification process, employing Large Language Models (LLMs, specifically Gemini 1.5 Flash) to extract the main claim from texts and search engine APIs (Google Search API, Google FactCheck Claims Search API) to retrieve relevant external documents (evidence). Additionally, a data validation and preprocessing framework, including near-duplicate detection, is introduced to enhance the quality of the base corpora. The main results demonstrate the methodology’s viability, providing enriched corpora and analyses that confirm the utility of claim extraction, the influence of original data characteristics on the process, and the positive impact of enrichment on the performance of classification models (BERTimbau and Gemini 1.5 Flash), especially with fine-tuning. This work contributes valuable resources and insights for advancing SAFC in Portuguese.
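A minimal sketch of the near-duplicate detection step used to clean the base corpora (the dissertation's exact method is not reproduced here; character-shingle Jaccard similarity and all function names are illustrative assumptions):

```python
# Illustrative near-duplicate detection for corpus cleaning. The dissertation
# does not publish its exact method; this version compares character 5-gram
# shingle sets with Jaccard similarity (names and threshold are hypothetical).

def shingles(text: str, k: int = 5) -> set:
    """Lower-cased character k-gram shingles with normalized whitespace."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two shingle sets (1.0 for two empty sets)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def is_near_duplicate(t1: str, t2: str, threshold: float = 0.8) -> bool:
    return jaccard(shingles(t1), shingles(t2)) >= threshold
```

Pairs scoring above the threshold would be collapsed to a single copy before enrichment; at corpus scale this is typically done with MinHash/LSH rather than exact pairwise comparison.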
  • Proposal and Evaluation of Efficient Pruning Approaches for Multi-Vector Representation in Passage Retrieval
    (Universidade Federal de Goiás, 2025-06-13) Chihururu, Alex Michael; Rosa, Thierson Couto; http://lattes.cnpq.br/4414718560764818; Rosa, Thierson Couto; http://lattes.cnpq.br/4414718560764818; Brandão, Wladmir Cardoso; http://lattes.cnpq.br/4935788335854516; Martins, Wellington Santos; http://lattes.cnpq.br/3041686206689904
    Multi-vector retrieval models employ bi-encoders to generate contextualized embeddings for queries and passages, and have proven highly effective in capturing fine-grained token-level interactions. Models such as ColBERT, ColBERTv2, and PLAID leverage all token-level output vectors from the encoder to accurately model query-passage relationships. However, storing dense vectors for every token in each passage results in substantial memory overhead. Additionally, query latency is significantly affected by the computational cost of computing inner products between each query token and all passage tokens to obtain similarity scores. In this work, we explore pruning techniques applied to passage vectors produced by PLAID, aiming to remove less important token vectors to improve memory efficiency and reduce query processing time, with minimal impact on retrieval effectiveness. We propose two novel pruning methods: MLM Max with Token Reordering (MMTR) and TF-IDF pruning. We conducted extensive experiments on both in-domain and zero-shot (out-of-domain) datasets, following best-practice evaluation protocols. Our results show that MMTR consistently yields the smallest effectiveness drop compared to the original, unpruned PLAID model. We observe that retaining 50% of the passage token embeddings provides the best trade-off between effectiveness, index size, and latency across most datasets. Interestingly, on certain out-of-domain datasets, pruning acts as a form of noise reduction, where retaining only 25% of the token embeddings leads to improved retrieval performance over the full, unpruned index.
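The TF-IDF pruning idea can be illustrated with a small sketch; this is a simplification under assumed details (real scoring works over the model tokenizer's vocabulary, and the retained indices would select rows of the passage's embedding matrix; function names are hypothetical):

```python
import math
from collections import Counter

# Illustrative TF-IDF-based token pruning (not the authors' implementation):
# for each passage, keep only the fraction `keep` of token positions with the
# highest TF-IDF weight; only those positions' embeddings would be indexed.

def idf_table(passages: list[list[str]]) -> dict[str, float]:
    """Smoothed inverse document frequency over tokenized passages."""
    n = len(passages)
    df = Counter(tok for p in passages for tok in set(p))
    return {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}

def prune_positions(passage: list[str], idf: dict[str, float], keep: float = 0.5) -> list[int]:
    """Indices of the token positions to retain, ranked by TF-IDF weight."""
    tf = Counter(passage)
    scores = [(tf[tok] * idf.get(tok, 0.0), i) for i, tok in enumerate(passage)]
    k = max(1, int(len(passage) * keep))
    top = sorted(scores, reverse=True)[:k]
    return sorted(i for _, i in top)
```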
  • Sliding puzzles gym: a scalable benchmark for state representation in visual reinforcement learning
    (Universidade Federal de Goiás, 2025-01-09) Oliveira, Bryan Lincoln Marques de; Soares, Telma Woerle de Lima; http://lattes.cnpq.br/6296363436468330; Soares, Telma Woerle de Lima; http://lattes.cnpq.br/6296363436468330; Maximo, Marcos Ricardo Omena de Albuquerque; http://lattes.cnpq.br/1610878342077626; Vieira, Flávio Henrique Teles; http://lattes.cnpq.br/0920629723928382
    Embargoed.
  • Channel-aware inter-slice scheduling for SLA assurance: theoretical and simulation-based approaches
    (Universidade Federal de Goiás, 2025-06-20) Silva, Daniel Campos da; Cardoso, Kleber Vieira; http://lattes.cnpq.br/0268732896111424; Cardoso, Kleber Vieira; http://lattes.cnpq.br/0268732896111424; Rocha, Flávio Geraldo Coelho; http://lattes.cnpq.br/5583470206347446; Kibilda, Jacek
    As the diversity of applications with heterogeneous Quality of Service (QoS) requirements grows in mobile networks, network slicing emerges as a key technology to meet Service Level Agreements (SLAs) by isolating resources among different types of services grouped into independent slices. Efficient inter-slice radio resource scheduling (RRS) is crucial in this context, directly governing the achievable throughput - and thus SLA assurance - while also enabling energy efficiency gains in low-demand scenarios through reduced resource usage and power consumption in the base station. This thesis investigates high-performance inter-slice RRS characteristics, including channel-awareness, intra-slice RRS prediction, SLA-drift-oriented allocation, dynamic slice resource proportions, and fairness among users within the same slice. We formulate the RRS problem mathematically, facilitating the design of RRS heuristics approximating optimal solutions. Through simulations, we demonstrate how our proposed algorithms outperform state-of-the-art schedulers in both SLA assurance and resource efficiency.
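The scheduling characteristics listed above can be illustrated with a toy greedy loop; this is a schematic stand-in, not one of the heuristics proposed in the thesis, and slice names and numbers are invented:

```python
# Schematic greedy inter-slice scheduler: each round, the next resource block
# (RB) goes to the slice with the largest remaining SLA deficit, and that
# slice's achievable per-RB rate (its channel quality) is credited against its
# throughput target. All values are toy numbers.

def schedule_rbs(slices: dict[str, dict], total_rbs: int) -> dict[str, int]:
    """slices maps name -> {'target': Mb/s owed by the SLA,
                            'rate': achievable Mb/s per RB (channel-aware)}."""
    alloc = {s: 0 for s in slices}
    served = {s: 0.0 for s in slices}
    for _ in range(total_rbs):
        # Largest SLA drift first; ties broken by the better channel.
        deficits = {s: slices[s]["target"] - served[s] for s in slices}
        best = max(slices, key=lambda s: (deficits[s], slices[s]["rate"]))
        if deficits[best] <= 0:   # all SLAs met: leave remaining RBs idle,
            break                 # saving power at the base station
        alloc[best] += 1
        served[best] += slices[best]["rate"]
    return alloc
```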
  • Orquestração de recursos para a oferta de serviços, em infraestruturas híbridas de computação de borda e nuvem, com foco em aplicações de realidade mista
    (Universidade Federal de Goiás, 2025-05-30) Fraga, Luciano de Souza; Pinto, Leizer de Lima; http://lattes.cnpq.br/0611031507120144; Cardoso, Kleber Vieira; http://lattes.cnpq.br/0268732896111424; Cardoso, Kleber Vieira; http://lattes.cnpq.br/0268732896111424; Pinto, Leizer de Lima; http://lattes.cnpq.br/0611031507120144; Rezende, José Ferreira de; http://lattes.cnpq.br/8588117212005149; Bueno, Elivelton Ferreira; http://lattes.cnpq.br/2764240045623948
    Efficient resource allocation in hybrid edge-cloud computing environments is becoming increasingly important due to the growing adoption of mixed reality applications and the widespread use of devices with limited energy, processing, and memory resources. A well-designed allocation strategy not only ensures compliance with the quality of service (QoS) requirements of these applications but also promotes optimized use of computational and network infrastructure, resulting in lower operational costs. In this work, we propose a model based on Integer Linear Programming (ILP) aimed at maximizing the fulfillment of demand generated by user devices, while minimizing the cost associated with the use of virtual machines responsible for processing. We evaluate the complexity of the model and propose structural simplifications, in addition to developing a heuristic designed to reduce solution generation time. Finally, we introduce a proactive approach based on a predictive model that anticipates resource usage patterns, contributing to more accurate decisions compared to reactive strategies. Experimental results demonstrate significant improvements in the volume of demand served when compared to other approaches in the literature, as well as highlight the benefits of adopting proactive strategies for resource allocation.
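A rough sketch of the objective (maximize served demand while keeping VM cost low) under assumed, simplified inputs; the actual ILP model and heuristic in the dissertation are more elaborate, and all names here are illustrative:

```python
# Toy greedy stand-in for the allocation problem: serve as much demand as
# possible, opening the cheapest VMs that fit. Inputs are invented.

def allocate(demands: list[int], vms: list[dict]) -> tuple[int, int]:
    """demands: per-device resource requests; vms: {'capacity', 'cost'} each.
    Returns (total demand served, total cost of opened VMs)."""
    free = []                                     # [cost, remaining capacity]
    pool = sorted(vms, key=lambda v: v["cost"])   # cheapest VMs first
    served = cost = 0
    for d in sorted(demands, reverse=True):       # largest demands first
        slot = next((f for f in free if f[1] >= d), None)
        if slot is None:
            vm = next((v for v in pool if v["capacity"] >= d), None)
            if vm is None:
                continue                          # demand cannot be served
            pool.remove(vm)
            cost += vm["cost"]
            slot = [vm["cost"], vm["capacity"]]
            free.append(slot)
        slot[1] -= d
        served += d
    return served, cost
```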
  • Segmentação dinâmica de objetos aplicada à odometria visual
    (Universidade Federal de Goiás, 2024-10-02) Oliveira, Thiago Henrique de; Laureano, Gustavo Teodoro; http://lattes.cnpq.br/4418446095942420; Laureano, Gustavo Teodoro; http://lattes.cnpq.br/4418446095942420; Osório, Fernando Santos; http://lattes.cnpq.br/7396818382676736; Soares, Anderson da Silva; http://lattes.cnpq.br/1096941114079527
    The presence of dynamic objects in a scene can significantly impair the performance of visual odometry methods. Even with the use of robust methods, it is not always possible to avoid outliers and interferences in the estimation of the camera’s movement. This type of object introduces characteristic points whose movement does not align with the actual movement performed by the camera. To filter these objects, this work presents a neural network architecture that combines RGB images and optical flow to segment regions that exhibit moving objects, even while the camera itself moves. To enable the training of the network, a methodology for quick annotation of object detection datasets is presented to add semantic masks of moving objects to 98,491 images of an urban navigation dataset. The proposed neural network was trained and evaluated with these data and proved adequate for use as a dynamic object filter in visual odometry tasks. To evaluate the proposed model, comparisons of visual odometry algorithms with and without the use of filtering are presented. Based on the results obtained in this work, the identification and filtering of dynamic objects in an image emerges as a fundamental step in the task of visual odometry, being essential for applications involving the presence of dynamic objects.
  • Mineração de argumentos em documentos jurídicos em Português
    (Universidade Federal de Goiás, 2024-12-02) Evangelista, Euripedes Balsanulfo; Silva, Nádia Félix Felipe da; http://lattes.cnpq.br/7864834001694765; Silva, Nádia Félix Felipe da; Pereira, Fabíola Souza Fernande; Cordeiro, Douglas Farias
    This work presents an argument mining approach applied to Brazilian labor court documents. Although the mining of arguments in legal documents has been a subject of study for over a decade, only one work has been found that specifically applies this study to Brazilian Portuguese in the legal domain. In this dissertation, we thoroughly explore all the necessary steps to achieve the objective of the argument mining task. Thus, our approach consists of using a Transformer-based language model trained on a domain-specific corpus of Brazilian labor justice, and we report an F1-score of 88.86% on the classification task. The proposal outperformed BERTimbau by 1.88% and DeBERTa by 3.39%.
  • Detecção de traços de depressão em textos na língua portuguesa considerando aspectos culturais e regionais do Brasil com o uso de LLMs
    (Universidade Federal de Goiás, 2024-11-25) Rodrigues, Leidiane Beatriz Passos; Lago, Marilúcia Pereira do; http://lattes.cnpq.br/4323131996569717; Fernandes, Deborah Silva Alves; http://lattes.cnpq.br/0380764911708235; Fernandes, Deborah Silva Alves; Lago, Marilucia Pereira do; Silva, Nádia Félix Felipe da; Pires, Sandrerley Ramos
    Embargoed.
  • Otimização de Desempenho de Análise Federada para Redes de Próxima Geração (B5G/6G)
    (Universidade Federal de Goiás, 2025-02-20) Sebastião, Xavier Paulino; Ribeiro, Maria Do Rosario Campos; Oliveira Júnior, Antonio Carlos de; http://lattes.cnpq.br/3148813459575445; Oliveira Júnior, Antonio Carlos de; Moreira, Waldir; Ribeiro, Maria Do Rosario Campos; Lopes, Victor Hugo Lazaro
    The rapid growth of interconnected devices across various sectors in the world has led to the generation of large volumes of data. Depending on its nature and specific needs, a significant part of this data is handled and analyzed using data science techniques to support decision-making. Alongside these advancements, as institutions rely more on data-driven systems, they also face increasing threats and security challenges that compromise the privacy of their clients or collaborators, consequently damaging their reputation. Federated Analytics (FA) is an innovative approach for preserving data security and privacy by implementing collaborative analysis of data from distributed devices without sharing raw data. However, in cases where FA operates over wireless transmission, challenges such as interference, signal degradation, and network congestion may arise. These factors can make the wireless transmission unreliable, introducing delays and causing corruption in responses and updates received at the central server, thereby compromising the quality of the final aggregated FA results. This work proposes an integrated framework for simulating FA in real 5G network conditions. The framework applies two algorithms: a channel-aware power allocation algorithm to efficiently allocate transmission power to user equipments (UEs) based on distance and channel conditions, and a synchronous FA5GLENA integration algorithm to couple FA with NS-3 5G-LENA and aggregate results for optimized performance under 5G network conditions. To evaluate the impact of the network on FA, three scenarios were compared: (1) uniform maximum power allocation for all UEs, (2) random power allocation, and (3) the channel-aware power allocation algorithm. The simulation results show that the channel-aware algorithm outperforms the uniform and random power allocation scenarios in both the network and FA operation. On FA, the algorithm achieved statistically higher accuracy (93.17%), precision (93.31%), and recall (93.09%) compared to uniform allocation (accuracy: 55.96%, precision: 56.02%, recall: 55.90%) and random allocation (accuracy: 42%, precision: 42.02%, recall: 41.96%), highlighting the superiority of the algorithm in enhancing FA performance.
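The channel-aware power allocation can be caricatured as follows; the log-distance pathloss model, the target received power, and every constant are illustrative assumptions, not the algorithm evaluated in the dissertation:

```python
import math

# Illustrative channel-aware power allocation: each UE gets just enough
# transmit power to reach a target received power under a log-distance
# pathloss model, capped at the UE's maximum. All constants are invented.

def tx_power_dbm(distance_m: float, target_rx_dbm: float = -90.0,
                 pl0_db: float = 40.0, exponent: float = 3.0,
                 max_dbm: float = 23.0) -> float:
    """Transmit power (dBm) so the received power hits target_rx_dbm."""
    pathloss = pl0_db + 10 * exponent * math.log10(max(distance_m, 1.0))
    return min(max_dbm, target_rx_dbm + pathloss)
```

Under this sketch, nearby UEs transmit at far lower power than a uniform maximum-power scheme would use, while distant UEs saturate at the cap.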
  • Modelo de linguagem para o mercado de ações Brasileiro: Uma abordagem baseada em análise de sentimentos usando o modelo BERTimbau
    (Universidade Federal de Goiás, 2024-10-03) Araujo, Leandro dos Santos; Fernandes, Deborah Silva Alves; http://lattes.cnpq.br/0380764911708235; Fernandes, Deborah Silva Alves; Santos, Adam Dreyton Ferreira dos; Soares, Fabrízzio Alphonsus Alves de Melo Nunes
    Embargoed.
  • Reconhecimento de entidades nomeadas em editais de licitação
    (Universidade Federal de Goiás, 2024-11-29) Souza Filho, Ricardo Pereira de; Silva, Nádia Félix Felipe da; http://lattes.cnpq.br/7864834001694765; Silva, Nádia Félix Felipe da; Fernandes, Deborah Silva Alves; Souza, Ellen Polliana Ramos
    This work explores the use of large language models (LLMs) for information extraction in public procurement notices, focusing on the Named Entity Recognition (NER) task. Given the diverse and unstandardized nature of these documents, the study proposes a methodology that integrates semantic selection techniques with Zero-Shot and Few-Shot scenarios, aiming to optimize the annotation and entity extraction process, reduce manual intervention, and improve accuracy. The first step involved building an annotated corpus containing named entities from procurement notices. Subsequently, the BERTimbau, BERTikal, and mDeBERTa models were trained in a supervised manner using this annotated dataset. Experiments showed that BERTimbau achieved the best overall performance, with an F1-score above 0.80. In the Zero-Shot and Few-Shot scenarios, various prompt templates and example selection strategies were tested. Models such as GPT-4 and LLaMA achieved performance comparable to supervised models when aided by semantically relevant examples, despite modest results in the absence of examples. The results indicate that combining enriched prompts with examples and the pre-selection of relevant sentences during the annotation phase contributes to greater accuracy and efficiency in the NER process for procurement notices. The proposed methodology can be applied to information extraction, with potential impacts on transparency and auditing in public procurement.
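The example pre-selection step for Few-Shot prompts might look like the sketch below, with plain token overlap standing in for the semantic similarity actually used; all function names and the prompt template are illustrative:

```python
# Sketch of example pre-selection for few-shot NER prompts. Token-set overlap
# is a crude stand-in for semantic similarity; names are hypothetical.

def overlap(a: str, b: str) -> float:
    """Jaccard overlap between the token sets of two sentences."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def pick_examples(query: str, pool: list[str], k: int = 2) -> list[str]:
    """The k annotated sentences most similar to the query sentence."""
    return sorted(pool, key=lambda s: overlap(query, s), reverse=True)[:k]

def build_prompt(query: str, pool: list[str], k: int = 2) -> str:
    shots = "\n".join(f"Sentença: {s}" for s in pick_examples(query, pool, k))
    return f"{shots}\nSentença: {query}\nEntidades:"
```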
  • Avaliação de grandes modelos de linguagem na simplificação de texto de decisões jurídicas utilizando pontuações de legibilidade como alvo
    (Universidade Federal de Goiás, 2024-11-29) Paula, Antônio Flávio Castro Torres de; Camilo Junior, Celso Gonçalves; http://lattes.cnpq.br/6776569904919279; Camilo Júnior, Celso Gonçalves; Oliveira, Sávio Salvarino Teles de; Naves, Eduardo Lázaro Martins
    The complexity of language used in legal documents, such as technical terms and legal jargon, hinders access to and understanding of the Brazilian justice system for laypeople. This work presents text simplification approaches and assesses the state-of-the-art by considering large language models with readability scoring as a parameter for simplification. Due to limited resources for text simplification in Portuguese, especially within the legal domain, the application of a methodology based on text modification using readability scoring enables experiments that leverage the knowledge acquired during the training of these large language models, while also allowing for automatic evaluation without the need for labeled data. This study evaluates the simplification capabilities of large language models by using eleven models as case studies. Additionally, a real corpus was developed, based on legal rulings from the Brazilian justice system.
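Using a readability score as the simplification target can be sketched as below; the Flesch-style formula and the vowel-run syllable heuristic are rough approximations, not the scoring used in the dissertation:

```python
import re

# Crude readability scorer in the spirit of a Flesch-style index. Both the
# formula constants and the syllable heuristic are approximations. A higher
# score reads as easier text.

VOWELS = re.compile(r"[aeiouáéíóúâêôãõà]+", re.IGNORECASE)

def syllables(word: str) -> int:
    """Approximate syllable count: runs of (accented) vowels."""
    return max(1, len(VOWELS.findall(word)))

def flesch_like(text: str) -> float:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"\w+", text)
    n_words = max(1, len(words))
    syl = sum(syllables(w) for w in words)
    return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (syl / n_words)
```

In a simplification loop, the model would be re-prompted until the candidate rewrite's score exceeds the requested target.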
  • Avaliação de Grandes Modelos de Linguagem para Raciocínio em Direito Tributário
    (Universidade Federal de Goiás, 2024-11-22) Presa, João Paulo Cavalcante; Oliveira, Sávio Salvarino Teles de; http://lattes.cnpq.br/1905829499839846; Camilo Junior, Celso Gonçalves; http://lattes.cnpq.br/6776569904919279; Camilo Júnior, Celso Gonçalves; Oliveira, Sávio Salvarino Teles de; Silva, Nádia Félix Felipe da; Leite, Karla Tereza Figueiredo
    Tax law is essential for regulating relationships between the State and taxpayers, being crucial for tax collection and maintaining public functions. The complexity and constant evolution of tax laws make their interpretation an ongoing challenge for legal professionals. Although Natural Language Processing (NLP) has become a promising technology in the legal field, its application in Brazilian tax law, especially for legal entities, remains a relatively unexplored area. This work evaluates the use of Large Language Models (LLMs) in Brazilian tax law covering federal tax aspects, analyzing their ability to process questions and generate answers in Portuguese for legal entities’ queries. For this purpose, we built an original dataset composed of real questions and answers provided by experts, allowing us to evaluate the ability of both proprietary and open-source LLMs to generate legally valid answers. The research uses quantitative and qualitative metrics to measure the accuracy and relevance of generated answers, capturing aspects of legal reasoning and semantic coherence. As contributions, this work presents a dataset specific to the tax law domain, a detailed evaluation of different LLMs’ performance in legal reasoning tasks, and an evaluation approach that combines quantitative and qualitative metrics, thus advancing the application of artificial intelligence in the analysis of tax laws and regulations.
  • Integração de uma aplicação de realidade aumentada com sistemas 5G seguindo o padrão 3GPP
    (Universidade Federal de Goiás, 2024-12-04) Cardoso, Pabllo Borges; Cardoso, Kleber Vieira; http://lattes.cnpq.br/0268732896111424; Corrêa, Sand Luz; http://lattes.cnpq.br/3386409577930822; Cardoso, Kleber Vieira; Freitas, Leandro Alexandre; Oliveira Junior, Antonio Carlos de
    Based on the standards defined by the 3rd Generation Partnership Project (3GPP), this work validates the 5G Media Streaming (5GMS) model, using the MR-Leo prototype as a case study. MR-Leo is a Mixed Reality (MR) application designed to explore the potential of these technologies in high-demand computational environments. The study begins with a review of advancements enabled by 5G networks, emphasizing their ability to provide low-latency connectivity, high bandwidth, and support for heterogeneous devices at scale. Additionally, the frameworks CAPIF and SEAL are discussed as tools to facilitate interoperability and API management in the 5G architecture, though recognized for their technical complexity and limited practical adoption. Edge computing is then investigated as a strategic component capable of bringing computational resources closer to end users, reducing latencies and enhancing the performance of intensive algorithms critical for MR applications. The validation of the proposed study was carried out in three distinct scenarios: a local controlled environment, an emulated 5G network, and a real 5G callbox. Experimental evaluation demonstrated the superiority of the protocol combined with video compression, achieving consistent metrics that meet the key performance indicators (KPIs) defined in the literature. The comparative qualitative analysis highlighted significant compatibilities as well as gaps, such as the absence of a functional component equivalent to the 5GMS Application Function (AF). In this regard, this work makes important contributions by demonstrating the technical feasibility of delivering MR services on 5G networks through edge computing.
  • Arquitetura holística de redes 6G: integração das camadas de comunicação espacial, aérea, terrestre, marinha e submarina com gêmeos digitais e inteligência artificial
    (Universidade Federal de Goiás, 2024-11-22) Araújo, Antonia Vanessa Dias; Oliveira Júnior, Antonio Carlos de; http://lattes.cnpq.br/3148813459575445; Oliveira Júnior, Antonio Carlos de; Moreira, Rodrigo; Freitas, Leandro Alexandre
    This dissertation proposes a holistic architecture for 6G networks, aiming at the integration of space, aerial, terrestrial, maritime, and submarine communication networks, targeting global and continuous connectivity. The integration of these networks, especially non-terrestrial networks (NTN), with terrestrial infrastructure presents significant technical and architectural challenges. The study focuses on modeling a unified architecture that fosters interaction between these network layers, with an emphasis on extreme and ubiquitous coverage. The methodology involves a detailed analysis of technological challenges and key enablers, such as digital twins, artificial intelligence (AI), and network orchestration, which facilitate the integration and efficient operation of 6G networks. The proposal is evaluated through simulations, highlighting the synergy between the different network components and their ability to provide ubiquitous and transparent communication to the user. It concludes that the proposed architecture provides a promising foundation for the implementation of innovative use cases, such as emergency communications, environmental monitoring, telemedicine, and smart agriculture, emphasizing the importance of extreme global coverage as one of the architectural cornerstones.
  • Avaliação de Grandes Modelos de Linguagem para Classificação de Documentos Jurídicos em Português
    (Universidade Federal de Goiás, 2024-11-26) Santos, Willgnner Ferreira; Oliveira, Sávio Salvarino Teles de; http://lattes.cnpq.br/1905829499839846; Galvão Filho, Arlindo Rodrigues; http://lattes.cnpq.br/7744765287200890; Galvão Filho, Arlindo Rodrigues; Oliveira, Sávio Salvarino Teles de; Fanucchi, Rodrigo Zempulski; Soares, Anderson da Silva
    The increasing procedural demand in judicial institutions has caused a workload overload, impacting the efficiency of the legal system. This scenario, exacerbated by limited human resources, highlights the need for technological solutions to streamline the processing and analysis of documents. In light of this reality, this work proposes a pipeline for automating the classification of these documents, evaluating four methods of representing legal texts at the pipeline’s input: original text, summaries, centroids, and document descriptions. The pipeline was developed and tested at the Public Defender’s Office of the State of Goiás (DPE-GO). Each approach implements a specific strategy to structure the input texts, aiming to enhance the models’ ability to interpret and classify legal documents. A new Portuguese dataset was introduced, specifically designed for this application, and the performance of Large Language Models (LLMs) was evaluated in classification tasks. The analysis results demonstrate that the use of summaries improves classification accuracy and maximizes the F1-score, optimizing the use of LLMs by reducing the number of tokens processed without compromising precision. These findings highlight the impact of textual representations of documents and the potential of LLMs for the automatic classification of legal documents, as in the case of DPE-GO. The contributions of this work indicate that the application of LLMs, combined with optimized textual representations, can significantly increase the productivity and quality of services provided by judicial institutions, promoting advancements in the overall efficiency of the legal system.
  • Em Busca do Estado da Arte e da Prática sobre Schema Matching na Indústria Brasileira - Resultados Preliminares de uma Revisão de Literatura e uma Pesquisa de Opinião
    (Universidade Federal de Goiás, 2024-09-02) Borges, Ricardo Henricki Dias; Ribeiro, Leonardo Andrade; http://lattes.cnpq.br/4036932351063584; Graciano Neto, Valdemar Vicente; http://lattes.cnpq.br/9864803557706493; Graciano Neto, Valdemar Vicente; Ribeiro, Leonardo Andrade; Frantz, Rafael Zancan; Galvao Filho, Arlindo Rodrigues
    The integration of systems and interoperability between different databases are critical challenges in information technology, mainly due to the diversity of data schemas. The schema matching technique is essential for unifying these schemas, facilitating research, analysis, and knowledge discovery. This dissertation investigates the application of schema matching in the Brazilian software industry, focusing on understanding the reasons for its low adoption. The research included a systematic mapping of the use of Artificial Intelligence (AI) algorithms and similarity techniques in schema matching, as well as a survey with 35 professionals in the field. The results indicate that, although schema matching offers significant improvements in data integration processes, such as reducing time and increasing accuracy, most professionals are unfamiliar with the term, even among those who use similar tools. The low adoption of these techniques can be attributed to the lack of free or open source tools and the absence of implementation plans within companies. The dissertation highlights the need for initiatives that overcome these barriers, empower professionals, and promote broader use of schema matching in the Brazilian industry.
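One of the similarity techniques surveyed, name-based matching, reduces to something like this sketch; real matchers also exploit data types, instances, and schema structure, and the function here is illustrative:

```python
from difflib import SequenceMatcher

# Minimal illustration of name-based schema matching: map each source column
# to its most similar target column when the string similarity clears a
# cutoff. Column names and the cutoff are invented examples.

def best_matches(src: list[str], tgt: list[str], cutoff: float = 0.6) -> dict[str, str]:
    """Map each source column to its most similar target column, if any."""
    pairs = {}
    for s in src:
        scored = [(SequenceMatcher(None, s.lower(), t.lower()).ratio(), t) for t in tgt]
        score, match = max(scored)
        if score >= cutoff:
            pairs[s] = match
    return pairs
```

Raising the cutoff trades recall for precision, which is exactly the tuning decision practitioners reported struggling with.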
  • Legal Domain Adaptation in Portuguese Language Models - Developing and Evaluating RoBERTa-based Models on Legal Corpora
    (Universidade Federal de Goiás, 2024-05-28) Garcia, Eduardo Augusto Santos; Lima, Eliomar Araújo de; http://lattes.cnpq.br/1362170231777201; Silva, Nádia Félix Felipe da; http://lattes.cnpq.br/7864834001694765; Silva, Nádia Félix Felipe da; Lima, Eliomar Araújo de; Soares, Anderson da Silva; Placca, José Avelino
    This research investigates the application of Natural Language Processing (NLP) within the legal domain for the Portuguese language, emphasizing the importance of domain adaptation for pre-trained language models, such as RoBERTa, using specialized legal corpora. We compiled and pre-processed a Portuguese legal corpus, named LegalPT, addressing the challenges of high near-duplicate document rates in legal corpora and conducting a comparison with generic web-scraped corpora. Experiments with these corpora revealed that pre-training on a combined dataset of legal and general data resulted in a more effective model for legal tasks. Our model, called RoBERTaLexPT, outperformed larger models trained solely on generic corpora, such as BERTimbau and Albertina-PT-*, and other legal models from similar works. For evaluating the performance of these models, we propose in this Master’s dissertation a legal benchmark composed of several datasets, including LeNER-Br, RRI, FGV, UlyssesNER-Br, CEIA-Entidades, and CEIA-Frases. This study contributes to the improvement of NLP solutions in the Brazilian legal context by openly providing enhanced models, a specialized corpus, and a rigorous benchmark suite.
  • Análise de um Fluxo Completo Automatizado de Etapas Voltado ao Reconhecimento de Texto em Imagens de Prescrições Médicas Manuscritas
    (Universidade Federal de Goiás, 2024-01-10) Corrêa, André Pires; Lima, Eliomar Araújo de; http://lattes.cnpq.br/1362170231777201; Nascimento, Hugo Alexandre Dantas do; http://lattes.cnpq.br/2920005922426876; Nascimento, Hugo Alexandre Dantas do; Costa, Ronaldo Martins da; Pedrini, Hélio; Lima, Eliomar Araújo de
    Compounding pharmacies deal with large volumes of medical prescriptions on a daily basis, whose data needs to be manually inputted into information management systems to properly process their customers’ orders. A considerable portion of these prescriptions tend to be written by doctors with poorly legible handwriting, which can make decoding them an arduous and time-consuming process. Previous works have investigated the use of machine learning for medical prescription recognition. However, the accuracy rates in these works are still fairly low and their approaches tend to be rather limited, as they typically utilize small datasets, focus only on specific steps of the automated analysis pipeline or use proprietary tools, which makes it difficult to replicate and analyse their results. The present work contributes towards filling this gap by presenting an end-to-end process for automated data extraction from handwritten medical prescriptions, from text segmentation, to recognition and post-processing. The approach was built based on an evaluation and adaptation of multiple existing methods for each step of the pipeline. The methods were evaluated on a dataset of 993 images of medical prescriptions with 27,933 annotated words, produced with the support of a compounding pharmacy that participated in the project. The results obtained by the best performing methods indicate that the developed approach is reasonably effective, reaching an accuracy of 68% in the segmentation step, and a character accuracy rate of 86.8% in the text recognition step.
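One common post-processing step for recognizer output is snapping each recognized word to the closest entry in a domain lexicon; the sketch below illustrates the idea, with an invented lexicon and cutoff rather than the methods actually evaluated in the dissertation:

```python
from difflib import get_close_matches

# Illustrative post-processing for handwriting recognition output: correct
# each word to the closest lexicon entry when the match is close enough.
# The lexicon entries and the cutoff are invented examples.

LEXICON = ["dipirona", "amoxicilina", "ibuprofeno", "creme", "capsulas"]

def correct(word: str, cutoff: float = 0.75) -> str:
    hits = get_close_matches(word.lower(), LEXICON, n=1, cutoff=cutoff)
    return hits[0] if hits else word

def correct_line(line: str) -> str:
    return " ".join(correct(w) for w in line.split())
```

Words with no sufficiently close lexicon entry (dosages, quantities, unknown terms) pass through unchanged, so the step can only reduce, never introduce, vocabulary errors.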