Weak supervision uses information from databases to first obtain labeled data, from which a prediction model can be learned. For example, a database could have entries such as “Barack Obama” and “Michael Jackson” as person entries (PER), and “New York”, “Honolulu”, etc. as location entries (LOC). The sentence “Barack Obama was born in Honolulu” could then obtain the weakly supervised labels “Barack/PER Obama/PER was/O born/O in/O Honolulu/LOC” and be used as a training example (together with many other weakly supervised sentences) to train a named entity recognition (NER) model. Previous approaches did this as a two-step process (first automatic labeling, then training).
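To make the automatic labeling step concrete, here is a minimal sketch of dictionary-based weak labeling in Python. The entity lists, tag names, and the simple whitespace tokenizer are illustrative assumptions, not part of any particular system:

def weak_label(sentence, gazetteers):
    """Tag each token covered by a gazetteer entry; all other tokens get 'O'."""
    tokens = sentence.split()
    labels = ["O"] * len(tokens)
    for tag, entries in gazetteers.items():
        for entry in entries:
            entry_tokens = entry.split()
            n = len(entry_tokens)
            # Scan for the (possibly multi-token) entry inside the sentence.
            for i in range(len(tokens) - n + 1):
                if tokens[i:i + n] == entry_tokens:
                    for j in range(i, i + n):
                        labels[j] = tag
    return list(zip(tokens, labels))

gazetteers = {"PER": {"Barack Obama", "Michael Jackson"},
              "LOC": {"New York", "Honolulu"}}
print(weak_label("Barack Obama was born in Honolulu", gazetteers))
# [('Barack', 'PER'), ('Obama', 'PER'), ('was', 'O'), ('born', 'O'),
#  ('in', 'O'), ('Honolulu', 'LOC')]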
The goal of this project is to improve on the state of the art in weakly supervised learning with an integrated approach, in which training the NER model has knowledge of the automatic labeling process. This is made possible by keeping track of so-called “labeling functions”, i.e. the specific rules/reasons why a token was labeled in a certain way. For example, the word “Barack” was labeled “PER” because “Barack Obama” was in the list “PER_LIST_1” (and was therefore annotated by a labeling function of the same name).
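The following hedged sketch shows one way labeling functions could be tracked so their identities stay attached to the labels they produce. The function name PER_LIST_1 follows the example above; the vote data structure is an assumption for illustration:

def lf_per_list_1(tokens):
    """Labeling function: vote 'PER' for tokens covered by entries in PER_LIST_1."""
    per_list_1 = {"Barack Obama", "Michael Jackson"}  # assumed list contents
    votes = [None] * len(tokens)  # None means this function abstains
    for entry in per_list_1:
        entry_tokens = entry.split()
        n = len(entry_tokens)
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == entry_tokens:
                for j in range(i, i + n):
                    votes[j] = "PER"
    return votes

tokens = "Barack Obama was born in Honolulu".split()
# Each labeling function keeps its name, so training can later weight or
# denoise its votes instead of treating all weak labels as equally reliable.
labeling_functions = {"PER_LIST_1": lf_per_list_1}
votes = {name: lf(tokens) for name, lf in labeling_functions.items()}
print(votes)
# {'PER_LIST_1': ['PER', 'PER', None, None, None, None]}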
Prerequisites for students: Knowledge of probability theory and some initial experience with training machine learning models (e.g. Transformer-based language models)
Project open to: Business Analytics, Data Science, Digital Humanities (out of 3 students, maximum 2 from Digital Humanities or Business Analytics)
Number of students: 1-3
