DrugsLM - Small Language Model for Drug Information

Master's Thesis Project | Federal University of Paraná (UFPR) | Computer Science Department

DrugsLM is a specialized Small Language Model (SLM) trained on drug package inserts and other pharmacological databases, designed to understand and generate accurate and simple pharmaceutical information.


🎓 Academic Context

This project is part of a Master's thesis in Computer Science at the Federal University of Paraná (UFPR), Curitiba, Brazil. The research focuses on:

  • Democratizing access to complex pharmacological information
  • Structuring unstructured data from official pharmaceutical documentation
  • Domain adaptation of Language Models to pharmacological information
  • Resource-efficient fine-tuning strategies for Small Language Models (SLMs)
  • Validation and reliability of Generative AI in healthcare contexts

Researcher: Vinícius de Lima Gonçalves
Advisor: Professor Eduardo Todt, PhD
Institution: Department of Computer Science, UFPR

🎯 Project Vision

The working hypothesis is that high-quality outcomes depend more on rigorously structured data than on massive scale, which favors Small Language Models (SLMs). Knowledge Graphs are intended to supply precise context and granularity, and a comparison of architectures is intended to show that data structure is key to resource-efficient, reliable pharmacological AI.


📋 Experimental Assets Lineage (Data Roadmap)

The diagram below shows the project's complete data acquisition and processing roadmap as a data-centric asset chart, with each asset colored by its current status. Each node also links to the main module responsible for acquiring the respective data, whether completed or in progress.

Legend

flowchart LR

    classDef done fill:#a8d5ba,stroke:#6da382,stroke-width:2px,color:#212529;
    classDef active fill:#ffe6a7,stroke:#c9a655,stroke-width:3px,color:#212529;
    classDef must fill:#f4b6c2,stroke:#c07c88,stroke-width:2px,color:#212529;
    classDef could fill:#add8e6,stroke:#7daab6,stroke-width:1px,stroke-dasharray: 5 5,color:#212529;
    classDef drop fill:#e3e3e3,stroke:#b0b0b0,stroke-width:1px,stroke-dasharray: 2 2,color:#6c757d;

    L1(Complete):::done --- L2(In Progress):::active --- L3(Must Be):::must --- L4(Could Be):::could --- L5(Dropped):::drop

flowchart TD

    classDef done fill:#a8d5ba,stroke:#6da382,stroke-width:2px,color:#212529;
    classDef active fill:#ffe6a7,stroke:#c9a655,stroke-width:3px,color:#212529;
    classDef must fill:#f4b6c2,stroke:#c07c88,stroke-width:2px,color:#212529;
    classDef could fill:#add8e6,stroke:#7daab6,stroke-width:1px,stroke-dasharray: 5 5,color:#212529;
    classDef drop fill:#e3e3e3,stroke:#b0b0b0,stroke-width:1px,stroke-dasharray: 2 2,color:#6c757d;
    classDef hidden fill:none,stroke:none,color:none,width:0px,height:0px;

    subgraph DA["Data Acquisition"]

        AnvisaCat[ANVISA Drug Catalog]:::done
        AnvisaPage[ANVISA Drug Pages]:::active
        AnvisaPDF[Package Insert PDFs]:::must

        Wiki[Wikipedia Drugs]:::could
        WikiCat[Wikipedia Drug Categories]:::could
        WikiPage[Wikipedia Drug Pages]:::could

        Drugs[Drugs.com]:::drop
        DrugsCat[Drugs.com Catalog]:::drop
        DrugsPage[Drugs.com Pages]:::drop

        AnvisaCat ==> AnvisaPage ==> AnvisaPDF
        Wiki -.-> WikiCat -.-> WikiPage
        Drugs -.-> DrugsCat -.-> DrugsPage

    end

    subgraph DP["Data Pre-Processing"]

        AExt[ ANVISA PDF Extracted ]:::must
        WExt[ Wiki HTML Extracted ]:::could
        DExt[ Drugs HTML Extracted ]:::drop

        AParser[ ANVISA Data Parsed ]:::must
        WParser[ Wiki Data Parsed ]:::could
        DParser[ Drugs Data Parsed ]:::drop

        AClean[ ANVISA Data Cleaned ]:::must
        WClean[ Wiki Data Cleaned ]:::could
        DClean[ Drugs Data Cleaned ]:::drop

        AStand[ ANVISA Data Standardized ]:::must
        WStand[ Wiki Data Standardized ]:::could
        DStand[ Drugs Data Standardized ]:::drop

        Join[Data Sources Joined]:::must
        Dedup[Data Sources Deduplicated]:::must
        Review[Data Reviewed ]:::must


        AnvisaPDF ==> AExt ==> AParser ==> AClean ==> AStand ==> Join ==> Dedup ==> Review
        WikiPage --> WExt -->  WParser --> WClean --> WStand --> Join
        DrugsPage -.-> DExt -.->  DParser -.-> DClean -.-> DStand -.-> Join 

    end

    subgraph DS["Data Structuring"]

        SST[Simple Structured Text]:::must

        NER[Named Entity Recognition]:::could
        RST[Related Structured Text]:::could
        GST[Graph Structured Text]:::could

        Review ==> SST
        SST --> NER --> RST --> GST

    end

    subgraph DStorage["Data Storage"]
        VDB[(Vector Database)]:::must
        GDB[(Graph Database)]:::could

        SST ==> VDB
        GST --> GDB
    end


    %% Links
    click AnvisaCat "reference/scraper/anvisa/catalog/" "See Catalog Activity Diagram"
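
The join and deduplication steps in the roadmap above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's implementation: the `DrugRecord` fields, the name-based duplicate key, and the sample records are all hypothetical, and the real pipeline may compare records in a different way.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrugRecord:
    """Hypothetical standardized record; field names are illustrative only."""
    source: str  # e.g. "anvisa" or "wiki"
    name: str    # drug name as found in the source
    text: str    # cleaned package-insert or page text


def standardize(record: DrugRecord) -> DrugRecord:
    """Lowercase the name and collapse whitespace so sources can be compared."""
    normalized = " ".join(record.name.lower().split())
    return DrugRecord(record.source, normalized, record.text.strip())


def join_sources(*sources: list[DrugRecord]) -> list[DrugRecord]:
    """Concatenate standardized records from every source, in argument order."""
    return [standardize(r) for records in sources for r in records]


def deduplicate(records: list[DrugRecord]) -> list[DrugRecord]:
    """Keep only the first record seen for each normalized name (assumed key)."""
    seen: set[str] = set()
    unique: list[DrugRecord] = []
    for record in records:
        if record.name not in seen:
            seen.add(record.name)
            unique.append(record)
    return unique


# Illustrative sample data, not taken from the project's datasets.
anvisa = [DrugRecord("anvisa", "DIPIRONA  Monoidratada", " Analgesic and antipyretic. ")]
wiki = [
    DrugRecord("wiki", "dipirona monoidratada", "Also known as metamizole."),
    DrugRecord("wiki", "Ibuprofen", "A nonsteroidal anti-inflammatory drug."),
]

joined = join_sources(anvisa, wiki)
reviewed_input = deduplicate(joined)
print([(r.source, r.name) for r in reviewed_input])
```

Passing the ANVISA records first means they win whenever two sources describe the same drug, which mirrors the diagram's treatment of ANVISA as the primary path. That ordering is a choice made for this sketch only.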

🚀 Quick Start

📖 Getting Started

Environment setup, Docker guide, and first scraper execution

Installation Guide →

🏗️ Architecture

Technical decisions, data flows, and design patterns

System Design →

🛠️ Infrastructure

Container setup, Selenium Grid, and hardware specs

Deployment Info →

📚 API Reference

Module documentation, scrapers, and code examples

Browse API Docs →

Next Steps: Set up your development environment →

๐Ÿค Contributing

This is an active research project. If you're interested in collaborating or have suggestions, feel free to open an issue or reach out.


📄 License

This project is licensed under the BSD License. See LICENSE for details.