Dear students, we really do not have any open thesis positions left... We have already teamed up with 8 new master's and 2 bachelor's students for our MSc thesis topics. Check back in Summer 2025 for new opportunities! Cheers, Marco

Open thesis projects @TDS Lab

  1. [NLP] Narrative reporting with LLMs in HealthyChronos

    NB: This project requires mastery of Dutch.

    HealthyChronos is an app that helps young people after cancer treatment to manage their energy levels and regain control over their lives. Based on structured data, such as sleep and activity patterns, and unstructured data, such as free-text entries about the user's personal history and goals, the application produces a report and advice on how the user is doing in light of their goals. In this internship, you will explore the state of the art in retrieval-augmented generation (RAG) with Large Language Models (LLMs), both for extracting information from external databases and for finding out which narrative formats for presenting the report appeal most to users, through live testing with a small user panel. A main goal is to make the report as accurate and reliable as possible with respect to the information in the database, and narrative formats could help achieve this. There is also room for secondary research interests, such as how LLMs could help produce reports in simple language or in other languages, or whatever else sparks your interest. Part of the internship will be carried out at HealthyChronos in Leiden, so that the source code and data can be used. If you want to make a positive impact on other people's lives with your NLP skills, this is the internship for you!
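
    As a rough illustration of the retrieval-augmented generation setup mentioned above, the sketch below shows a generic retrieve-then-generate loop in Python. This is not HealthyChronos code: the record structure and the embed and llm_generate calls are hypothetical placeholders for whatever embedding model and LLM end up being used.

      # Hedged sketch of a retrieval-augmented report generator.
      # `embed` and `llm_generate` are hypothetical placeholders for an
      # embedding model and an LLM completion call; `records` stands in for
      # the user's structured (sleep, activity) and free-text (goals) data.

      def cosine(a, b):
          dot = sum(x * y for x, y in zip(a, b))
          na = sum(x * x for x in a) ** 0.5
          nb = sum(y * y for y in b) ** 0.5
          return dot / (na * nb) if na and nb else 0.0

      def generate_report(question, records, embed, llm_generate, k=5):
          # 1. retrieve the k records most relevant to the question
          q_vec = embed(question)
          ranked = sorted(records, key=lambda r: cosine(embed(r["text"]), q_vec),
                          reverse=True)
          context = "\n".join(r["text"] for r in ranked[:k])
          # 2. generate a narrative report grounded only in the retrieved context
          prompt = ("Using only the information below, write a short report on how "
                    "the user is doing with respect to their goals.\n\n"
                    + context + "\n\nQuestion: " + question)
          return llm_generate(prompt)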

    Supervisors: Bram van Dijk (LUMC), Max van Duijn (LIACS), Marco Spruit
  2. [ML] Balanced and balancing distance measures for mixed variable types

    Many AI, ML and data science methods depend on a notion of distance, which often acts as a dissimilarity measure between observations in a data set. Real-world data sets frequently contain variables of mixed types within a single data set, e.g. continuous, ordinal, nominal/categorical and binary. In such cases, dissimilarity is almost always measured with Gower's distance: numeric variables are min-max-scaled, non-numeric variables contribute a distance of 1 if the values are unequal and 0 if they are equal, and the per-dimension contributions are simply added, as in the Manhattan distance. The implication is that distances are dominated by the categorical dimensions, since a non-zero categorical contribution equals the largest possible contribution of a numeric dimension, whose scaled differences are typically much smaller. Moreover, the average distance per dimension is not equalized (not even if the dimensions are normalized or standardized first) and is dominated by imbalanced columns. This project will develop a balanced version of Gower's distance in which every feature contributes equally on average, while still allowing features to be re-weighted. The resulting distance measure will be used for risk stratification of people with metabolic syndrome on a large-scale data warehouse with health, demographic and socio-economic data, but is expected to find widespread use in distance-based machine learning tasks on heterogeneous data.
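
    For concreteness, the sketch below implements the standard Gower's distance just described in plain Python (the function name and argument layout are illustrative, not taken from an existing library). The balanced variant to be developed in this project would additionally rescale each feature's contribution so that all features contribute equally on average.

      def gower_distance(x, y, ranges, is_numeric):
          # x, y       : sequences of mixed feature values for two observations
          # ranges     : per-feature (max - min) over the data set; ignored for
          #              non-numeric features
          # is_numeric : per-feature booleans, True for numeric features
          total = 0.0
          for xi, yi, r, numeric in zip(x, y, ranges, is_numeric):
              if numeric:
                  # min-max-scaled absolute difference, always in [0, 1]
                  total += abs(xi - yi) / r if r > 0 else 0.0
              else:
                  # nominal/binary feature: 0 if equal, 1 if unequal
                  total += 0.0 if xi == yi else 1.0
          # Gower's original formulation averages over the features; summing
          # instead (Manhattan-style) only rescales all distances by a constant.
          return total / len(x)

      # Toy usage: height (numeric), smoker (binary), blood group (nominal).
      # (0.15 / 0.60 + 1 + 0) / 3 ≈ 0.417 -- the single unequal categorical
      # value already outweighs the numeric difference, as described above.
      d = gower_distance([1.70, True, "A"], [1.85, False, "A"],
                         ranges=[0.60, None, None],
                         is_numeric=[True, False, False])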

    Daily supervisors: Marcel Haas (LUMC), Marco Spruit
  3. [NLP] Extracting Adverse Drug Reactions from SmPC Using Large Language Models

    Background
    Previous research has demonstrated the effectiveness of natural language processing techniques in extracting adverse drug reactions (ADRs) from Summary of Product Characteristics (SmPC) documents. However, the potential of large language models (LLMs) for this task remains unexplored.

    Objective
    To develop and evaluate a method using large language models to automatically extract adverse drug reactions from SmPC documents, comparing its performance to previous NLP approaches.

    Methods

    1. Data Collection:
      • Scrape SmPC documents from the Electronic Medicines Compendium (EMC), focusing on section 4.8 (Undesirable effects).
      • Use the same dataset of 647 medicines as in the previous study for comparability.
    2. LLM-based Extraction:
      • Fine-tune a pre-trained LLM (e.g., BERT, RoBERTa, or GPT) on a subset of manually annotated SmPC documents.
      • Develop prompts to guide the LLM in identifying and extracting ADRs and their frequencies (a minimal sketch follows this Methods list).
      • Implement post-processing steps to clean and standardize extracted ADRs.
    3. Evaluation:
      • Use the same subset of 37 commonly prescribed medicines for manual review.
      • Calculate precision, recall, and F1-score to assess performance.
      • Compare results with the previous rule-based NLP approach.
    4. Error Analysis:
      • Analyze false positives and false negatives to identify areas for improvement.
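
    To make steps 2 and 3 concrete, the sketch below shows one possible way to prompt an LLM for ADRs and to score the extractions. The llm callable and the exact prompt wording are illustrative assumptions, not part of the study design.

      # Sketch of prompt-based ADR extraction (step 2) and evaluation (step 3).
      # `llm` is a hypothetical placeholder for any chat/completion API call.

      def extract_adrs(section_4_8_text, llm):
          prompt = (
              "Below is section 4.8 (Undesirable effects) of an SmPC document.\n"
              "List every adverse drug reaction with its stated frequency\n"
              "(e.g. 'very common', 'common', 'rare'), one per line in the form\n"
              "'reaction | frequency'.\n\n" + section_4_8_text
          )
          adrs = set()
          for line in llm(prompt).splitlines():
              if "|" in line:
                  # light post-processing: lower-case and strip whitespace
                  reaction, frequency = (p.strip().lower() for p in line.split("|", 1))
                  adrs.add((reaction, frequency))
          return adrs

      def precision_recall_f1(predicted, gold):
          # predicted, gold: sets of (reaction, frequency) pairs per medicine
          tp = len(predicted & gold)
          precision = tp / len(predicted) if predicted else 0.0
          recall = tp / len(gold) if gold else 0.0
          f1 = (2 * precision * recall / (precision + recall)
                if precision + recall else 0.0)
          return precision, recall, f1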

    Expected Outcomes
    An LLM-based pipeline that extracts ADRs and their frequencies from section 4.8 of SmPC documents, together with a quantitative comparison (precision, recall, F1-score) against the previous rule-based NLP approach on the same evaluation set.

    Significance
    This study will explore the potential of LLMs in improving the accuracy and efficiency of ADR extraction from SmPC documents, potentially enhancing pharmacovigilance and drug safety monitoring processes.

    Advisors: Ian Shen (ext.), Marco Spruit