Self-Supervised Learning on Source Code to Assist Software Developers J. Villmow. PhD Dissertation, Hochschule RheinMain, 2025. doi
My PhD thesis, which I worked on from 2019 to 2025. The core question: can we teach transformers to understand code structurally using syntax trees rather than flat token sequences? I developed generative pretraining tasks that use abstract syntax trees to select training spans, trained contrastive retrieval models for code search, and built CodeBuddy, an IDE tool for self-supervised contextualized code retrieval based on cursor position alone. I also explored using language model likelihoods as scoring functions to detect identifier naming guideline violations.
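A minimal sketch of the span-selection idea, using Python's ast module purely for illustration (the thesis covers more languages and richer pretraining tasks): spans to mask are aligned with complete subtrees of the syntax tree instead of arbitrary token windows.

```python
import ast

def subtree_spans(source: str):
    """Character spans of statement-level AST subtrees.

    Spans aligned with complete subtrees can be masked for span-infilling
    pretraining, instead of masking arbitrary token windows."""
    tree = ast.parse(source)
    line_starts = [0]  # character offset of each line start
    for line in source.splitlines(keepends=True):
        line_starts.append(line_starts[-1] + len(line))
    spans = []
    for node in ast.walk(tree):
        if isinstance(node, ast.stmt):
            start = line_starts[node.lineno - 1] + node.col_offset
            end = line_starts[node.end_lineno - 1] + node.end_col_offset
            spans.append((start, end))
    return spans

code = "x = 1\nif x:\n    y = x + 1\n"
for start, end in subtree_spans(code):
    print(repr(code[start:end]))  # 'x = 1', the whole if-block, 'y = x + 1'
```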
Evaluating Contextualized Code Search in Practical User Studies J. Villmow, A. Ulges, U. Schwanecke. INFORMATIK 2024. doi
We integrated a code search tool into an IDE and evaluated it with students and a professional development team. The tool retrieves relevant snippets based on the developer's current editing context, making cross-team code reuse transparent and practical.
How Well Can Masked Language Models Spot Identifiers That Violate Naming Guidelines? J. Villmow*, V. Campos*, J. Petry, A. Abbad Andaloussi, A. Ulges, B. Weber. SCAM 2023. doi
We tested whether language models can detect poorly named identifiers by scoring them against 28 common naming guidelines. A generative scoring approach worked best, outperforming larger code models on a dataset of over 6,000 manually annotated identifiers.
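A hedged sketch of generative scoring with a causal LM; the checkpoint below is a placeholder, not one of the paper's models. An identifier is scored by the log-likelihood the model assigns to it in its code context, and unusually low scores flag candidate violations.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; the paper's models, prompts, and thresholds differ.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def identifier_logprob(prefix: str, identifier: str) -> float:
    """Mean log-probability of the identifier tokens given the code prefix."""
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    ident_ids = tok(identifier, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, ident_ids], dim=1)
    logits = model(input_ids).logits
    # Each identifier token is predicted from the position before it.
    log_probs = logits[0, prefix_ids.size(1) - 1 : -1].log_softmax(-1)
    token_lp = log_probs.gather(1, ident_ids[0].unsqueeze(1)).squeeze(1)
    return token_lp.mean().item()

ctx = "def total_price(items):\n    "
print(identifier_logprob(ctx, "item_sum"))  # plausible name: higher score
print(identifier_logprob(ctx, "xqzt"))      # likely violation: lower score
```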
Value Stream Repair Using Graph Structure Learning M. Wrzalik, J. Eversheim, J. Villmow, A. Ulges, D. Krechel, S. Spieckermann, R. Forstner. IEA/AIE 2023. doi
We applied graph neural networks to automatically detect and repair broken dependencies in manufacturing value streams.
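A minimal sketch of the general recipe, not the paper's architecture: a GNN encodes nodes of the value stream graph, and a bilinear scorer rates candidate dependency edges so missing or broken ones can be flagged. All dimensions and the toy graph below are illustrative.

```python
import torch
import torch.nn as nn

class EdgeScorer(nn.Module):
    """One round of mean-neighbor message passing plus a bilinear edge score."""
    def __init__(self, in_dim: int, hid_dim: int):
        super().__init__()
        self.msg = nn.Linear(in_dim, hid_dim)
        self.bilinear = nn.Bilinear(hid_dim, hid_dim, 1)

    def forward(self, x, adj, src, dst):
        # x: (N, in_dim) node features; adj: dense (N, N) adjacency.
        deg = adj.sum(1, keepdim=True).clamp(min=1)
        h = torch.relu(self.msg(adj @ x / deg))           # aggregated neighbors
        return self.bilinear(h[src], h[dst]).squeeze(-1)  # score per candidate edge

# Toy value stream with 4 process nodes and one candidate dependency (0 -> 2).
x = torch.randn(4, 8)
adj = torch.tensor([[0., 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0, 0]])
scorer = EdgeScorer(8, 16)
print(scorer(x, adj, src=torch.tensor([0]), dst=torch.tensor([2])))
```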
Addressing Leakage in Self-Supervised Contextualized Code Retrieval J. Villmow, V. Campos, A. Ulges, U. Schwanecke. COLING 2022. doi
Self-supervised code retrieval fails when trivial patterns leak between training pairs. We introduced deleaking techniques and a new evaluation dataset of aligned code clones, achieving state-of-the-art on retrieval, clone detection, and defect detection.
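An illustrative sketch of one deleaking idea, under the assumption that shared identifiers are the leakage channel (the paper's techniques are more involved): identifiers appearing in both the query context and the target snippet are consistently renamed to neutral placeholders before the pair is used for training.

```python
import re

IDENT = re.compile(r"\b[A-Za-z_][A-Za-z0-9_]*\b")
KEYWORDS = {"def", "return", "if", "else", "for", "while", "in", "not"}

def deleak_pair(query: str, target: str):
    """Rename identifiers shared by query and target to neutral placeholders,
    so a retriever cannot match the pair on trivial lexical overlap."""
    shared = (set(IDENT.findall(query)) & set(IDENT.findall(target))) - KEYWORDS
    mapping = {name: f"var{i}" for i, name in enumerate(sorted(shared))}

    def rename(s: str) -> str:
        return IDENT.sub(lambda m: mapping.get(m.group(0), m.group(0)), s)

    return rename(query), rename(target)

q = "def load_config(path):\n    data = read_file(path)"
t = "config = load_config(settings_path)"
print(deleak_pair(q, t))  # 'load_config' becomes 'var0' on both sides
```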
A Structural Transformer with Relative Positions in Trees for Code-to-Sequence Tasks J. Villmow, A. Ulges, U. Schwanecke. IJCNN 2021. doi
We extended transformer attention with relative positional embeddings on syntax trees and added a loss for predicting lowest common ancestors. This gave the model a structural prior over code, improving code summarization by 6% F1 over prior work.
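A simplified sketch of tree-relative positions (the paper's encoding and the LCA loss are richer): the relative position of two nodes is the number of steps up to their lowest common ancestor and down from it, and each (up, down) pair indexes a learned attention bias.

```python
def path_to_root(node, parent):
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def tree_relative_position(a, b, parent):
    """(steps up from a to the LCA, steps down from the LCA to b).
    Each such pair indexes a learned embedding added as an attention bias."""
    ancestors_a = {n: i for i, n in enumerate(path_to_root(a, parent))}
    for down, n in enumerate(path_to_root(b, parent)):
        if n in ancestors_a:  # first common ancestor is the LCA
            return ancestors_a[n], down
    raise ValueError("nodes are not in the same tree")

# Toy AST: module -> func -> (name, body -> stmt)
parent = {"func": "module", "name": "func", "body": "func", "stmt": "body"}
print(tree_relative_position("name", "stmt", parent))  # (1, 2): up 1, down 2
```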
ConTest: A Unit Test Completion Benchmark featuring Context J. Villmow, J. Depoix, A. Ulges. NLP4Prog @ ACL 2021. doi
A large-scale benchmark (365k examples) for predicting assert statements from test setup and focal method. We showed that including surrounding context code is crucial for accurate assertion generation.
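An illustration of the task format, with made-up Java snippets and a serialization that is only a stand-in for the benchmark's actual one: the model sees the test up to the assertion plus the focal method, and must produce the assert statement.

```python
# Illustrative task instance; the benchmark's exact serialization differs.
test_prefix = """@Test
public void testAdd() {
    Calculator calc = new Calculator();
    int result = calc.add(2, 3);"""

focal_method = "public int add(int a, int b) { return a + b; }"

# Input: test code up to the assertion plus surrounding context (focal method).
model_input = test_prefix + "\n// focal method:\n" + focal_method
target = "assertEquals(5, result);"  # the assert statement to be predicted
print(model_input, "\n->", target)
```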
Relation Specific Transformations for Open World Knowledge Graph Completion H. Shah, J. Villmow, A. Ulges. TextGraphs @ COLING 2020. doi
We extended knowledge graph completion to unseen entities by learning relation-specific transformations from text embeddings into graph embedding space, giving substantial improvements over relation-agnostic baselines.
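A hedged sketch, using a TransE-style scorer as a stand-in for the base models evaluated in the paper: an unseen entity's text embedding is projected into graph-embedding space with a matrix learned per relation, then scored like a regular graph embedding.

```python
import torch
import torch.nn as nn

class RelationSpecificMap(nn.Module):
    """One projection matrix per relation from text space into graph space,
    scored here with a TransE-style distance as a stand-in base model."""
    def __init__(self, n_relations: int, text_dim: int, graph_dim: int):
        super().__init__()
        self.maps = nn.Parameter(0.01 * torch.randn(n_relations, graph_dim, text_dim))
        self.rel_emb = nn.Embedding(n_relations, graph_dim)

    def forward(self, text_emb, rel, tail_emb):
        # Project each unseen head entity's text embedding with its relation's matrix.
        head = torch.bmm(self.maps[rel], text_emb.unsqueeze(-1)).squeeze(-1)
        # Smaller translational distance = more plausible triple.
        return -(head + self.rel_emb(rel) - tail_emb).norm(dim=-1)

model = RelationSpecificMap(n_relations=5, text_dim=768, graph_dim=100)
text = torch.randn(2, 768)     # text embeddings of two unseen head entities
rel = torch.tensor([0, 3])
tail = torch.randn(2, 100)     # graph embeddings of known tail entities
print(model(text, rel, tail))  # two triple scores
```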
Bidirectional Transformer Language Models for Smart Autocompletion of Source Code F. Binder, J. Villmow, A. Ulges. INFORMATIK 2020. doi
We pretrained a RoBERTa model on 250M lines of Java code and used it to rank candidate method calls, suggesting the right method for the current code context with 98% top-3 accuracy.
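A hedged sketch of masked-LM method ranking with a placeholder checkpoint (the paper pretrains its own RoBERTa on Java); for simplicity, each candidate method name is assumed to be a single token here.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Placeholder checkpoint; the paper's model is pretrained on Java code.
tok = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base").eval()

@torch.no_grad()
def rank_methods(context: str, candidates: list[str]):
    """Rank candidate method names by MLM probability at the mask position.
    Simplification: each candidate is scored by its first token only."""
    inputs = tok(context.replace("<mask>", tok.mask_token), return_tensors="pt")
    mask_pos = (inputs.input_ids[0] == tok.mask_token_id).nonzero()[0, 0]
    probs = model(**inputs).logits[0, mask_pos].softmax(-1)
    scores = []
    for cand in candidates:
        ids = tok(cand, add_special_tokens=False).input_ids
        scores.append((cand, probs[ids[0]].item()))
    return sorted(scores, key=lambda s: -s[1])

print(rank_methods("reader.<mask>(buffer);", ["read", "close", "flush"]))
```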
An Open-World Extension to Knowledge Graph Completion Models H. Shah*, J. Villmow*, A. Ulges, U. Schwanecke, F. Shafait. AAAI 2019. doi
We combined graph embedding models with text-based entity representations to predict facts about entities never seen during training, achieving state-of-the-art on open-world knowledge graph benchmarks.
Automatic Keyphrase Extraction Using Recurrent Neural Networks J. Villmow, M. Wrzalik, D. Krechel. MLDM 2018. doi
An early application of sequence labeling with RNNs for keyphrase extraction from scientific documents.
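A minimal sketch of the sequence-labeling setup, with illustrative hyperparameters: a bidirectional LSTM tags each token with a BIO label marking keyphrase spans.

```python
import torch
import torch.nn as nn

class KeyphraseTagger(nn.Module):
    """BiLSTM sequence labeler with BIO tags (B-KP, I-KP, O) per token."""
    def __init__(self, vocab_size: int, emb_dim=100, hid_dim=128, n_tags=3):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hid_dim, n_tags)

    def forward(self, token_ids):
        h, _ = self.lstm(self.emb(token_ids))
        return self.out(h)  # (batch, seq_len, n_tags) tag logits

tagger = KeyphraseTagger(vocab_size=10_000)
tokens = torch.randint(0, 10_000, (1, 12))  # one 12-token sentence
tags = tagger(tokens).argmax(-1)            # 0=O, 1=B-KP, 2=I-KP (convention here)
print(tags)
```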
* shared first authorship