Topics for Bachelor Theses, Master Theses and Lab Rotations in Statistics

This page lists different topics that can be turned into bachelor theses, master theses and lab rotations for students in applied statistics, data science, economics, etc., depending on individual qualifications. If you are interested, get in contact with the responsible person listed for the topic.

Title: LASSO regularization and group fixed effects
Short description: Fixed effects specifications in panel data make it possible to control for various types of unobserved heterogeneity, but they considerably inflate the number of parameters to be estimated. To overcome this problem, group fixed effects approaches aim to identify sub-groups in the data that share the same fixed effects structure. In this thesis, regularization approaches such as the fused LASSO will be investigated with respect to their ability to identify group fixed effects in panel data.
Contact: Thomas Kneib (tkneib@uni-goettingen.de)
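
As a rough illustration of the fused-LASSO idea in the topic above, the sketch below evaluates a least-squares loss plus a pairwise fusion penalty on unit-specific fixed effects; units whose effects are fused to the same value form a group. The intercept-only model, the function name `fused_lasso_objective`, and the toy data are assumptions made for illustration and are not part of the topic description.

```python
import numpy as np


def fused_lasso_objective(y, unit, alpha, lam):
    """Least-squares loss plus a pairwise fusion penalty on fixed effects.

    y     : (n,) responses
    unit  : (n,) integer index of the unit each observation belongs to
    alpha : (N,) candidate fixed effects, one per unit
    lam   : non-negative penalty weight
    """
    resid = y - alpha[unit]
    loss = 0.5 * np.sum(resid ** 2)
    # All pairwise differences alpha_i - alpha_j; keep each pair once (i < j).
    diffs = alpha[:, None] - alpha[None, :]
    penalty = lam * np.sum(np.abs(np.triu(diffs, k=1)))
    return loss + penalty


# Toy panel: three units, two observations each (illustrative values).
y = np.array([1.0, 1.2, 0.9, 1.1, 3.0, 3.2])
unit = np.array([0, 0, 1, 1, 2, 2])

separate = np.array([1.1, 1.0, 3.1])   # one effect per unit
grouped = np.array([1.05, 1.05, 3.1])  # units 0 and 1 share an effect

obj_separate = fused_lasso_objective(y, unit, separate, lam=0.5)
obj_grouped = fused_lasso_objective(y, unit, grouped, lam=0.5)
```

With this penalty weight the grouped solution has the smaller objective, because the slightly worse fit is outweighed by the smaller fusion penalty. An actual thesis would minimize the objective over `alpha` rather than compare two hand-picked candidates.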

Title: Bayesian Quantile Regression with Errors in Variables
Short description: When covariates are measured with error, this can imply considerable bias in the estimates for the corresponding effects in a regression model. In a Bayesian setup, a statistical model can be assumed for the measurement error such that the true covariate values become part of the set of unknown parameters to be estimated. In particular, the true values can be included in a Markov chain Monte Carlo simulation algorithm. In this thesis, existing models shall be extended in at least one of two directions: (i) implement Bayesian error correction schemes for Bayesian quantile regression or (ii) implement a flexible Dirichlet process mixture prior for the true covariate values.
Contact: Thomas Kneib (tkneib@uni-goettingen.de)

Title: Predicting wealth from satellite images
Short description: Socio-economic indicators such as the wealth index are normally measured with expensive surveys. It would be preferable to predict them from widely available satellite data alone. In this project, the SustainBench dataset (https://sustainlab-group.github.io/sustainbench/docs/datasets/) can be used to train a model that predicts one of these socio-economic indicators from satellite images. For master theses, the change over time or space could also be modelled.
Contact: Vera Stein (vera.stein@uni-goettingen.de)

Title: Exploring the Relationship Between Climate Variables and Conflict in Ethiopia
Short description: The Climate-Conflict-Vulnerability-Index dataset (https://climate-conflict.org) contains many climate- and conflict-related variables mapped to smaller regions. The relation between different climate and conflict variables can, for example, be explored for one specific country. If you are interested in one specific climate or conflict variable, that variable could also be investigated more closely. Feel free to look at the dataset and come up with your own ideas.
Contact: Vera Stein (vera.stein@uni-goettingen.de)

Title: Time-series analysis of heatwaves in Ghana
Short description: The Climate-Conflict-Vulnerability-Index dataset (https://climate-conflict.org) contains many climate- and conflict-related variables mapped to smaller regions over time. You can choose a climate- or conflict-related variable (e.g. heatwaves, wildfires, hunger) and a region, and use time-series methods to predict how this variable will change in the future.
Contact: Vera Stein (vera.stein@uni-goettingen.de)

Title: Exploring the use of emojis in Telegram channels
Short description: Different questions can be explored here, such as how the emojis used in a message relate to the number of times that message is forwarded, or which topics are mainly associated with the different emojis on Telegram.
Contact: Vera Stein (vera.stein@uni-goettingen.de)

Title: Stochastic Gradient Hamiltonian Monte Carlo for Bayesian distributional structured additive regression
Short description: Master's thesis. Structured additive distributional regression models allow flexible, component-wise specification of all parameters of a response distribution, such as mean, scale, and shape, using structured additive predictors based on penalized splines and other smooth components. While full Bayesian inference is attractive for uncertainty quantification, it becomes computationally demanding for large datasets or complex model structures. To address this challenge, this thesis integrates Stochastic Gradient Hamiltonian Monte Carlo, a scalable gradient-based Markov chain Monte Carlo method that leverages minibatch stochastic gradients, into the distributional regression setting. Prior programming experience in Python is an essential prerequisite for this project.
Chen, T., Fox, E. B., & Guestrin, C. (2014). Stochastic Gradient Hamiltonian Monte Carlo (No. arXiv:1402.4102). arXiv. https://doi.org/10.48550/arXiv.1402.4102
Contact: Johannes Brachem (brachem@uni-goettingen.de)
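
To make the minibatch idea in the SGHMC topic above more concrete, here is a minimal sketch of the friction-based update from Chen et al. (2014) on a toy problem: the mean of Gaussian data with known unit variance and a flat prior. The identity mass matrix, the gradient-noise estimate set to zero, the step size, the friction value, and all function names are assumptions chosen for illustration; this is not the distributional regression sampler the thesis would develop.

```python
import numpy as np


def sghmc(grad_u_hat, theta0, n_steps, eps, friction, rng):
    """SGHMC with identity mass matrix and gradient-noise estimate set to zero."""
    theta, r = float(theta0), 0.0
    draws = np.empty(n_steps)
    for t in range(n_steps):
        theta = theta + eps * r
        # Friction term plus injected noise with variance 2 * friction * eps.
        noise = rng.normal(0.0, np.sqrt(2.0 * friction * eps))
        r = r - eps * grad_u_hat(theta) - eps * friction * r + noise
        draws[t] = theta
    return draws


rng = np.random.default_rng(0)
y = rng.normal(2.0, 1.0, size=1000)  # toy data, known unit variance
n, batch_size = y.size, 100


def grad_u_hat(theta):
    """Minibatch estimate of the gradient of U(theta) = sum_i (theta - y_i)^2 / 2."""
    idx = rng.integers(0, n, size=batch_size)
    return n * (theta - y[idx].mean())


draws = sghmc(grad_u_hat, theta0=0.0, n_steps=20000, eps=1e-3, friction=10.0, rng=rng)
posterior_mean_estimate = draws[5000:].mean()  # discard the first draws as warm-up
```

Under the flat prior the exact posterior is centred at the sample mean of `y`, so `posterior_mean_estimate` should land close to `y.mean()`. Because the gradient noise is not corrected for in this sketch, the spread of the draws is somewhat wider than the exact posterior.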

Title: Scalable MCMC Sampling for Bayesian Penalized Transformation Models
Short description: Master's thesis. Penalized transformation models (PTMs) are a semiparametric location-scale regression family that estimates a response's conditional distribution directly from the data and models the location and scale through structured additive predictors. The core of the model is a monotonically increasing transformation function that relates the response distribution to a reference distribution. One current limitation of Bayesian PTMs is slow Markov chain Monte Carlo (MCMC) sampling, which makes large datasets challenging. This thesis develops a fast and numerically robust Iteratively Re-Weighted Least Squares (IWLS) sampler for PTMs. Prior programming experience in Python is an essential prerequisite for this project.
Brachem, J., Wiemann, P. F. V., & Kneib, T. (2025). Bayesian penalized transformation models: Structured additive location-scale regression for arbitrary conditional distributions (No. arXiv:2404.07440v4). arXiv. https://doi.org/10.48550/arXiv.2404.07440
Contact: Johannes Brachem (brachem@uni-goettingen.de)

Title: Bayesian Penalized Transformation Models for Count Data
Short description: Master's thesis. Penalized transformation models (PTMs) are a semiparametric location-scale regression family that estimates a response's conditional distribution directly from the data and models the location and scale through structured additive predictors. The core of the model is a monotonically increasing transformation function that relates the response distribution to a reference distribution. This thesis extends Bayesian PTMs to count data. Prior programming experience in Python is an essential prerequisite for this project.
Carlan, M., & Kneib, T. (2022). Bayesian discrete conditional transformation models. Statistical Modelling. https://doi.org/10.1177/1471082X221114177
Brachem, J., Wiemann, P. F. V., & Kneib, T. (2025). Bayesian penalized transformation models: Structured additive location-scale regression for arbitrary conditional distributions (No. arXiv:2404.07440v4). arXiv. https://doi.org/10.48550/arXiv.2404.07440
Contact: Johannes Brachem (brachem@uni-goettingen.de)

Title: MCMC Sampling of Bayesian structured additive models under linear constraints
Short description: Master's thesis. Linear constraints can be incorporated into Bayesian structured additive models by reparameterizing the model, or by directly conducting constrained sampling without changes to the model. This thesis implements constrained sampling in Python in the probabilistic programming framework Liesel and compares the performance of this implementation to constraints via reparameterization in terms of speed, numerical stability, and scalability. Prior programming experience in Python is an essential prerequisite for this project.
Kneib, T., Klein, N., Lang, S., & Umlauf, N. (2019). Modular regression—A Lego system for building structured additive distributional regression models with tensor product interactions. TEST, 28(1), 1–39. https://doi.org/10.1007/s11749-019-00631-z
Contact: Johannes Brachem (brachem@uni-goettingen.de)

Title: Simulation-based calibration of Bayesian additive distributional regression
Short description: Bachelor's or Master's thesis. This thesis investigates simulation-based calibration (SBC) as a diagnostic tool for Bayesian additive distributional regression models. Additive distributional regression extends classical regression by modeling not only the mean but all parameters of a response distribution, such as scale, shape, or skewness, through structured additive predictors. This flexibility allows rich, data-driven modeling. SBC is an approach based on repeated simulation from the prior and refitting the model to assess whether posterior inferences are well-calibrated. The thesis outlines the SBC procedure, discusses practical choices such as summary statistics and rank-based diagnostics, and adapts it to the specific structure of additive distributional regression. Through simulation experiments, the thesis evaluates how violations of model assumptions, prior choices, or numerical issues affect calibration.
Talts, S., Betancourt, M., Simpson, D., Vehtari, A., & Gelman, A. (2020). _Validating Bayesian Inference Algorithms with Simulation-Based Calibration_ (No. arXiv:1804.06788). arXiv. https://doi.org/10.48550/arXiv.1804.06788
Klein, N., Kneib, T., Lang, S., & Sohn, A. (2015). Bayesian structured additive distributional regression with an application to regional income inequality in Germany. _The Annals of Applied Statistics_, _9_(2), 1024–1052. https://doi.org/10.1214/15-AOAS823
Umlauf, N., Klein, N., Simon, T., & Zeileis, A. (2021). **bamlss**: A Lego Toolbox for Flexible Bayesian Regression (and Beyond). _Journal of Statistical Software_, _100_(4). https://doi.org/10.18637/jss.v100.i04
Contact: Johannes Brachem (brachem@uni-goettingen.de)
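
The SBC procedure described in the topic above (simulate from the prior, simulate data, refit, compute ranks) can be sketched on a toy model. The sketch assumes a conjugate normal model with known variance, so that exact posterior draws stand in for an MCMC fit; the model, the number of simulations, and the number of posterior draws are illustrative assumptions, not choices prescribed by the topic.

```python
import numpy as np

rng = np.random.default_rng(1)
n_sims, n_obs, n_draws = 500, 10, 19

# 1. Draw the parameter from the prior: mu ~ N(0, 1).
mu = rng.normal(0.0, 1.0, size=n_sims)
# 2. Simulate data given the parameter: y_i ~ N(mu, 1).
y = rng.normal(mu[:, None], 1.0, size=(n_sims, n_obs))
# 3. "Fit" the model. Here the exact conjugate posterior stands in for MCMC:
#    mu | y ~ N(sum(y) / (n_obs + 1), 1 / (n_obs + 1)).
post_mean = y.sum(axis=1) / (n_obs + 1)
post_sd = 1.0 / np.sqrt(n_obs + 1)
draws = rng.normal(post_mean[:, None], post_sd, size=(n_sims, n_draws))
# 4. Rank of each prior draw among its posterior draws (0, ..., n_draws).
ranks = (draws < mu[:, None]).sum(axis=1)
# 5. If inference is calibrated, ranks are uniform on {0, ..., n_draws}.
rank_counts = np.bincount(ranks, minlength=n_draws + 1)
```

A histogram of `rank_counts` that is roughly flat indicates calibrated inference. Replacing step 3 with a deliberately wrong posterior (for example, a posterior standard deviation that is too small) would produce a U-shaped histogram, which is the kind of violation the thesis would study.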

Title: Modeling Extreme Precipitation using Distributional Regression
Short description: Bachelor's or Master's thesis. This thesis applies distributional regression techniques to model extreme precipitation events. Unlike classical regression, which focuses only on the mean, distributional regression models all parameters of a chosen response distribution, such as location, scale, and shape, allowing a much more detailed characterization of precipitation behavior, especially in the tails. This thesis explores suitable extreme-value distributions and incorporates structured additive predictors to capture spatial effects. Using data from the Community Earth System Model (CESM) Large Ensemble Project, it compares model fits and assesses predictive performance for rare, high-impact events.
Stasinopoulos, D. M., & Rigby, R. A. (2008). Generalized Additive Models for Location Scale and Shape (GAMLSS) in R. _Journal of Statistical Software_, _23_, 1–46. https://doi.org/10.18637/jss.v023.i07
https://gamlss-dev.github.io/gamlss2/
Contact: Johannes Brachem (brachem@uni-goettingen.de)

Title: Modeling Social Media Addiction using Distributional Regression
Short description: Bachelor's thesis. This thesis uses distributional regression to model and analyze social media addiction among students, leveraging a public dataset on student social media usage and relationships. The dataset captures anonymized survey responses, including usage patterns, relationship status, and various behavioral and demographic covariates from students across multiple regions. In contrast to standard regression, which only models the conditional mean of an outcome, distributional regression (as in GAMLSS / structured additive distributional regression models) allows all parameters of the response distribution (e.g., location, scale, possibly skewness or kurtosis) to depend on predictors. Applied to social media addiction, this approach can capture not only the expected level of the “addiction score” but also how its variability and tail behaviour (i.e. risk of extreme addiction) change with covariates such as social-media usage intensity, demographic variables, relationship status, or mental-health indicators. This thesis describes the choice of a suitable response distribution, the specification of structured additive predictors (e.g., smooth effects, interactions, or categorical covariates), and estimation of the full distributional model. It then presents results from fitting the model to the dataset, analyzing how different factors influence not only the mean addiction score but also variability and distributional shape.
Data: https://www.kaggle.com/datasets/adilshamim8/social-media-addiction-vs-relationships
Stasinopoulos, D. M., & Rigby, R. A. (2008). Generalized Additive Models for Location Scale and Shape (GAMLSS) in R. Journal of Statistical Software, 23, 1–46. https://doi.org/10.18637/jss.v023.i07
https://gamlss-dev.github.io/gamlss2/
Contact: Johannes Brachem (brachem@uni-goettingen.de)

Title: Modeling Flight Prices using Distributional Regression
Short description: Bachelor's thesis. This thesis applies distributional regression to the problem of modeling and predicting flight ticket prices, using a publicly available flight-price dataset. The dataset contains a variety of features: airline, source and destination cities, departure and arrival times, flight class, number of stops, duration, time until departure, etc., as explanatory variables and ticket price as the target variable. Unlike standard regression, which estimates only the conditional mean of price, distributional regression (e.g., in the style of Generalized Additive Model for Location, Scale and Shape / GAMLSS) allows modeling of the entire conditional distribution of prices: not only the average price, but also how variability, skewness or tail-behavior depend on covariates. In this thesis, a suitable parametric distribution for ticket prices is chosen (e.g., a skewed or heavy-tailed continuous distribution), and distributional regression is used to let its parameters (location, scale, possibly shape) vary as functions of flight-specific covariates (airline, date/time of booking and travel, stops, class, origin/destination, etc.). The model specification uses additive predictors (potentially with smooth or categorical terms) to capture non-linear and complex effects. By fitting this model to the dataset, the thesis explores how different factors influence not only the expected price, but also the uncertainty and the likelihood of extreme (very high or low) prices.
Data: https://www.kaggle.com/datasets/shubhambathwal/flight-price-prediction
Stasinopoulos, D. M., & Rigby, R. A. (2008). Generalized Additive Models for Location Scale and Shape (GAMLSS) in R. Journal of Statistical Software, 23, 1–46. https://doi.org/10.18637/jss.v023.i07
Contact: Johannes Brachem (brachem@uni-goettingen.de)
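
The core idea shared by the distributional regression topics above, letting scale as well as location depend on covariates, can be illustrated with a small sketch. It assumes a Gaussian response with a linear predictor for the mean and a linear predictor with log link for the standard deviation; the simulated data, the covariate, and the parameter values are hypothetical, and a real analysis of ticket prices would likely use a skewed or heavy-tailed distribution and smooth terms instead.

```python
import numpy as np


def gaussian_location_scale_nll(beta, gamma, X, y):
    """Negative log-likelihood with mean X @ beta and log-scale X @ gamma."""
    mu = X @ beta
    log_sigma = X @ gamma  # log link keeps the scale positive
    z = (y - mu) / np.exp(log_sigma)
    return np.sum(0.5 * np.log(2.0 * np.pi) + log_sigma + 0.5 * z ** 2)


rng = np.random.default_rng(2)
n = 2000
x = rng.uniform(0.0, 1.0, size=n)      # hypothetical covariate
X = np.column_stack([np.ones(n), x])   # intercept + linear effect

beta_true = np.array([1.0, 2.0])
gamma_true = np.array([-1.0, 1.5])     # scale grows with x
y = rng.normal(X @ beta_true, np.exp(X @ gamma_true))

# Mean-only alternative: same mean, but one constant scale for all observations.
resid_sd = np.std(y - X @ beta_true)
gamma_const = np.array([np.log(resid_sd), 0.0])

nll_distributional = gaussian_location_scale_nll(beta_true, gamma_true, X, y)
nll_mean_only = gaussian_location_scale_nll(beta_true, gamma_const, X, y)
```

On these simulated data the model with a covariate-dependent scale attains a clearly lower negative log-likelihood than the constant-scale alternative. In practice both `beta` and `gamma` would be estimated, for example with the GAMLSS software cited above.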

Title: (Generalized) Linear Models with Stochastic Variational Inference and Sparse Matrix Representations
Short description: Stochastic Variational Inference (SVI) provides a scalable framework for approximating Bayesian posterior distributions via optimization. The core idea is to posit a parametric family of variational distributions over the latent variables, indexed by a vector of variational parameters, and to adjust these parameters so that the variational distribution closely matches the true posterior using stochastic optimization methods. In this project, the student will leverage sparse matrix representations of both the design matrix and the precision matrix to implement an efficient Python library for (Generalized) Linear Models based on SVI, with an emphasis on computational speed and scalability to high-dimensional data.
Hoffman, M. D., Blei, D. M., Wang, C., & Paisley, J. (2013). Stochastic Variational Inference. Journal of Machine Learning Research, 14(1), 1303–1347. https://www.jmlr.org/papers/volume14/hoffman13a/hoffman13a.pdf
Contact: Gianmarco Callegher (gianmarco.callegher@uni-goettingen.de)

Title: Normalizing Flows for Variational Inference in Liesel
Short description: Normalizing flows provide a flexible way to construct expressive variational approximations by transforming a simple base distribution through a sequence of invertible mappings. In this project, the student will extend Liesel’s variational inference module with normalizing-flow-based posterior approximations. The work will involve implementing a set of flow transformations (e.g., planar, autoregressive flows) in JAX, integrating them into Liesel’s computational graph and vi API, and benchmarking their performance against existing Gaussian variational families and MCMC-based baselines within Liesel.
Papamakarios, G., Nalisnick, E., Rezende, D. J., Mohamed, S., & Lakshminarayanan, B. (2021). Normalizing Flows for Probabilistic Modeling and Inference. Journal of Machine Learning Research, 22(57), 1–64. https://www.jmlr.org/papers/volume22/19-1028/19-1028.pdf
Riebl, H., Wiemann, P. F. V., & Kneib, T. (2022). Liesel: A Probabilistic Programming Framework for Developing Semi-Parametric Regression Models and Custom Bayesian Inference Algorithms. arXiv preprint arXiv:2209.10975. https://arxiv.org/abs/2209.10975
Contact: Gianmarco Callegher (gianmarco.callegher@uni-goettingen.de)
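
For orientation on the normalizing-flow topic above, the following sketch shows a single planar flow layer and the change-of-variables formula it relies on. It is written in plain NumPy with a standard normal base distribution and hand-picked parameters; all names and values are assumptions for illustration, and it does not use or represent Liesel's API, where the flows would be implemented in JAX.

```python
import numpy as np


def planar_flow(z, u, w, b):
    """One planar flow layer f(z) = z + u * tanh(w @ z + b).

    Returns the transformed point and log|det Jacobian|. The map is
    invertible when w @ u >= -1; this sketch does not enforce that.
    """
    a = w @ z + b
    z_new = z + u * np.tanh(a)
    psi = (1.0 - np.tanh(a) ** 2) * w        # h'(a) * w
    log_det = np.log(np.abs(1.0 + u @ psi))  # matrix determinant lemma
    return z_new, log_det


def standard_normal_logpdf(z):
    """Log-density of a standard normal base distribution."""
    return -0.5 * (z @ z) - 0.5 * z.size * np.log(2.0 * np.pi)


# Hypothetical flow parameters with w @ u > 0.
u = np.array([0.5, -0.3])
w = np.array([1.0, 0.2])
b = 0.1

z0 = np.array([0.4, -1.2])                   # a point from the base distribution
z1, log_det = planar_flow(z0, u, w, b)
# Change of variables: log q(z1) = log q0(z0) - log|det J|.
log_q_z1 = standard_normal_logpdf(z0) - log_det
```

Stacking several such layers and summing their log-determinants gives a more expressive variational family, whose parameters would then be optimized against the evidence lower bound.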

Title: Spike-and-Slab Variational Inference for Vine Copula Models in Liesel
Short description: This project aims to develop a Bayesian vine-copula regression framework with automatic covariate selection via spike-and-slab priors, fitted using Stochastic Variational Inference (SVI) in Liesel. Vine copulas provide a flexible decomposition of multivariate dependence into a sequence of bivariate copulas, allowing for rich and interpretable joint models. Spike-and-slab priors enable sparse, probabilistic variable selection, helping to identify which covariates drive the strength and structure of dependence.
The student will focus on model design and on integrating vine copulas into the existing Liesel framework. Concretely, they will (i) specify an appropriate vine-copula-based model for multivariate responses with spike-and-slab variable selection, (ii) implement the vine copula log-probability and sampling methods in Liesel, and (iii) use Liesel’s existing module for SVI-based posterior estimation, alongside the available MCMC routines, to compare performance.
Aas, K., Czado, C., Frigessi, A., & Bakken, H. (2009). Pair-copula constructions of multiple dependence. Insurance: Mathematics and Economics, 44(2), 182–198. https://doi.org/10.1016/j.insmatheco.2007.02.001
Riebl, H., Wiemann, P. F. V., & Kneib, T. (2022). Liesel: A Probabilistic Programming Framework for Developing Semi-Parametric Regression Models and Custom Bayesian Inference Algorithms. arXiv preprint arXiv:2209.10975. https://arxiv.org/abs/2209.10975
Contact: Gianmarco Callegher (gianmarco.callegher@uni-goettingen.de)

Title: Leaf shape heterogeneity analysis
Short description: Leaf shape variability is an important characteristic of plant development and health. It is driven mostly by genetic and environmental factors. Studying and modelling such variability can help to understand and forecast the growth of the tree. Fresh leaves from juvenile beech trees grown in an experiment were harvested at the same time as the entire tree in August-September 2021. For each tree, all leaves were harvested and a sample of 60 to 120 leaves was scanned on a flatbed scanner. The dataset consists of the raw images from these scans. The aim of this project is to extract the shape, size, and average colour of the leaves from the images and to estimate the Fréchet mean and variance via the elastic metric approach. Further, we study the modes of variation through Geodesic Principal Component Analysis (GPCA).
Contact: Alejandro Pereira (alejandro.pereira@uni-goettingen.de)

Title: Score Matching for Directional regression
Short description: In this project we aim to develop and implement the score matching estimator for directional/circular regression models. Substantial groundwork exists, see e.g. [Mardia et al., 2016; Kato et al., 2025]. Score matching estimators are widely used for directional distributions because their normalizing constants are usually computationally intractable. Common applications of directional statistics include forestry, ecology, and palaeomagnetism, among many others. In this project we will focus on the Kent distribution, also known as the five-parameter Fisher-Bingham distribution, due to its flexibility and convenient functional form. To the best of my understanding, the score matching estimator has not been implemented in the most popular libraries in R, Python, or Julia. Your task would be to develop and implement the estimator in your language of choice, and then to test the result using either simulated or real data.
Contact: Alejandro Pereira (alejandro.pereira@uni-goettingen.de)

Title: Bayesian Compositional regression
Short description: In this project we study the Bayesian approach to compositional regression using Liesel. Compositional data are data that represent proportions of a whole. That is, a compositional data point can be represented as a real-valued vector with positive entries constrained to sum to 1 [Aitchison, 1982]. Mathematically, this is defined as the simplex:

$$S^d = \left\{ x = [x_1, x_2, \dots, x_d] \in \mathbb{R}^d \;\middle|\; x_i > 0,\ i = 1, 2, \dots, d;\ \sum_{i=1}^d x_i = 1 \right\}$$

This type of data arises naturally in econometrics (e.g. household expenditure surveys) and biology (e.g. microbiome data), to name a few. You would then model the data using a probability model defined on the simplex $S^d$. Some classical probability models include the Dirichlet distribution and the logit-normal distribution, both already included in TensorFlow. Depending on the structure of the data, potential models include VARs, additive models, and spatio-temporal models, among others. You would then implement the statistical model in Python using the Liesel framework.
Contact: Alejandro Pereira (alejandro.pereira@uni-goettingen.de)
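
To illustrate the simplex constraint in the compositional regression topic above, the sketch below rescales raw amounts to proportions and maps them to unconstrained coordinates and back. The centered log-ratio transform is a standard tool from Aitchison's approach but is not mentioned in the topic description, so its use here is an assumption; the function names and the expenditure values are hypothetical, and no Liesel functionality is used.

```python
import numpy as np


def closure(x):
    """Rescale positive parts so that they sum to 1 (a point in the simplex)."""
    x = np.asarray(x, dtype=float)
    return x / x.sum(axis=-1, keepdims=True)


def clr(x):
    """Centered log-ratio transform: simplex -> zero-sum vectors in R^d."""
    log_x = np.log(x)
    return log_x - log_x.mean(axis=-1, keepdims=True)


def clr_inverse(z):
    """Inverse of clr (a softmax): zero-sum vectors in R^d -> simplex."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


# Hypothetical household expenditure (food, housing, other), raw amounts.
spending = np.array([300.0, 900.0, 300.0])
shares = closure(spending)      # [0.2, 0.6, 0.2]
z = clr(shares)                 # unconstrained coordinates, sums to 0
shares_back = clr_inverse(z)    # recovers the original shares
```

Working in transformed coordinates is one way to apply standard regression machinery to compositional responses; modelling directly on the simplex, for example with a Dirichlet likelihood, is the alternative named in the description.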