DrugsLM - Small Language Model for Drug Information
Master's Thesis Project | Federal University of Paraná (UFPR) | Computer Science Department
DrugsLM is a specialized Small Language Model (SLM) trained on drug package inserts and pharmacological databases, designed to understand and generate accurate, plain-language pharmaceutical information.
Academic Context
This project is part of a Master's thesis in Computer Science at the Federal University of Paraná (UFPR), Curitiba, Brazil. The research focuses on:
- Democratizing access to complex pharmacological information
- Structuring unstructured data from official pharmaceutical documentation
- Domain-adaptation of Language Models for pharmacological information
- Resource-efficient fine-tuning strategies for Small Language Models (SLMs)
- Validation and reliability of Generative AI in healthcare contexts
Researcher: Vinícius de Lima Gonçalves
Advisor: Professor Eduardo Todt, PhD
Institution: Department of Computer Science, UFPR
Project Vision
The working hypothesis is that high-quality outcomes depend more on rigorously structured data than on massive scale, which favors Small Language Models (SLMs). Knowledge Graphs are intended to supply precise, fine-grained context. A comparison of architectures is intended to show that data structure is key to resource-efficient, reliable pharmacological AI.
Experimental Assets Lineage (Data Roadmap)
The diagram below shows the complete data acquisition and processing roadmap for this project as a data-centric asset chart, with the current status of every asset. Each node also links to the main module responsible for acquiring the respective data, whether completed or in progress.
Legend
flowchart LR
classDef done fill:#a8d5ba,stroke:#6da382,stroke-width:2px,color:#212529;
classDef active fill:#ffe6a7,stroke:#c9a655,stroke-width:3px,color:#212529;
classDef must fill:#f4b6c2,stroke:#c07c88,stroke-width:2px,color:#212529;
classDef could fill:#add8e6,stroke:#7daab6,stroke-width:1px,stroke-dasharray: 5 5,color:#212529;
classDef drop fill:#e3e3e3,stroke:#b0b0b0,stroke-width:1px,stroke-dasharray: 2 2,color:#6c757d;
L1(Complete):::done --- L2(In Progress):::active --- L3(Must Be):::must --- L4(Could Be):::could --- L5(Dropped):::drop
flowchart TD
classDef done fill:#a8d5ba,stroke:#6da382,stroke-width:2px,color:#212529;
classDef active fill:#ffe6a7,stroke:#c9a655,stroke-width:3px,color:#212529;
classDef must fill:#f4b6c2,stroke:#c07c88,stroke-width:2px,color:#212529;
classDef could fill:#add8e6,stroke:#7daab6,stroke-width:1px,stroke-dasharray: 5 5,color:#212529;
classDef drop fill:#e3e3e3,stroke:#b0b0b0,stroke-width:1px,stroke-dasharray: 2 2,color:#6c757d;
classDef hidden fill:none,stroke:none,color:none,width:0px,height:0px;
subgraph DA["Data Acquisition"]
AnvisaCat[ANVISA Drug Catalog]:::done
AnvisaPage[ANVISA Drug Pages]:::active
AnvisaPDF[Package Insert PDFs]:::must
Wiki[Wikipedia Drugs]:::could
WikiCat[Wikipedia Drug Categories]:::could
WikiPage[Wikipedia Drug Pages]:::could
Drugs[Drugs.com]:::drop
DrugsCat[Drugs.com Catalog]:::drop
DrugsPage[Drugs.com Pages]:::drop
AnvisaCat ==> AnvisaPage ==> AnvisaPDF
Wiki -.-> WikiCat -.-> WikiPage
Drugs -.-> DrugsCat -.-> DrugsPage
end
subgraph DP["Data Pre-Processing"]
AExt[ ANVISA PDF Extracted ]:::must
WExt[ Wiki HTML Extracted ]:::could
DExt[ Drugs HTML Extracted ]:::drop
AParser[ ANVISA Data Parsed ]:::must
WParser[ Wiki Data Parsed ]:::could
DParser[ Drugs Data Parsed ]:::drop
AClean[ ANVISA Data Cleaned ]:::must
WClean[ Wiki Data Cleaned ]:::could
DClean[ Drugs Data Cleaned ]:::drop
AStand[ ANVISA Data Standardized ]:::must
WStand[ Wiki Data Standardized ]:::could
DStand[ Drugs Data Standardized ]:::drop
Join[Data Sources Joined]:::must
Dedup[Data Sources Deduplicated]:::must
Review[Data Reviewed ]:::must
AnvisaPDF ==> AExt ==> AParser ==> AClean ==> AStand ==> Join ==> Dedup ==> Review
WikiPage --> WExt --> WParser --> WClean --> WStand --> Join
DrugsPage -.-> DExt -.-> DParser -.-> DClean -.-> DStand -.-> Join
end
subgraph DS["Data Structuring"]
SST[Simple Structured Text]:::must
NER[Named Entity Recognition]:::could
RST[Related Structured Text]:::could
GST[Graph Structured Text]:::could
Review ==> SST
SST --> NER --> RST --> GST
end
subgraph DStorage["Data Storage"]
VDB[(Vector Database)]:::must
GDB[(Graph Database)]:::could
SST ==> VDB
GST --> GDB
end
%% Links
click AnvisaCat "reference/scraper/anvisa/catalog/" "See Catalog Activity Diagram"
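The "Data Sources Joined" and "Data Sources Deduplicated" stages of the roadmap can be illustrated with a short Python sketch. This is only a minimal illustration under stated assumptions, not the project's implementation: the `DrugRecord` shape, the `normalize_name` comparison key, and the rule that ANVISA records win over other sources are all hypothetical choices made for the example.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class DrugRecord:
    """Hypothetical standardized record; the field names are illustrative."""
    name: str
    source: str  # e.g. "anvisa" or "wikipedia"
    text: str


# Assumed priority: a lower number wins when two sources describe the same drug.
SOURCE_PRIORITY = {"anvisa": 0, "wikipedia": 1}


def source_rank(record: DrugRecord) -> int:
    """Rank a record by its source; unknown sources rank last."""
    return SOURCE_PRIORITY.get(record.source, len(SOURCE_PRIORITY))


def normalize_name(name: str) -> str:
    """Build a comparison key: strip accents, lowercase, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(without_accents.lower().split())


def join_sources(*sources: list[DrugRecord]) -> list[DrugRecord]:
    """Join: concatenate already-standardized records from every source."""
    return [record for source in sources for record in source]


def deduplicate(records: list[DrugRecord]) -> list[DrugRecord]:
    """Dedup: keep one record per normalized name, preferring the best-ranked source."""
    best: dict[str, DrugRecord] = {}
    for record in records:
        key = normalize_name(record.name)
        current = best.get(key)
        if current is None or source_rank(record) < source_rank(current):
            best[key] = record
    return list(best.values())
```

Calling `deduplicate(join_sources(anvisa_records, wiki_records))` on two lists of `DrugRecord` objects yields one record per drug name. Matching on a normalized name is the simplest possible key; a real pipeline would more likely match on a registry identifier or active ingredient, since brand names alone can collide.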
Quick Start
Getting Started
Environment setup, Docker guide, and first scraper execution
Installation Guide →
Next Steps: Set up your development environment →
Contributing
This is an active research project. If you're interested in collaborating or have suggestions, feel free to open an issue or reach out.
License
This project is licensed under the BSD License. See LICENSE for details.