GitLab License Compliance Bot

A Python bot that scans GitLab repos for license risk before legal asks, plus a browser extension that pushes threat intel into MISP.

DevTools, Compliance, Security

Role

Software Engineer (part-time)

Date

2024-06-01

Read time

3 min read

Most research labs treat open-source license compliance as a thing legal worries about once a year. Then a customer asks for a software bill of materials, and someone spends two weeks walking through requirements.txt files trying to remember why a GPL package ended up in a commercial codebase. That was the problem space at SnT.

I worked on this part-time during my master's. Two complementary tools came out of it: a license compliance bot, and a browser extension for threat intel ingestion into MISP.

The compliance bot

A Python service that scans GitLab repositories on a schedule and on every merge request. It walks the dependency tree, classifies licenses by risk tier (permissive, weak-copyleft, strong-copyleft, "talk to a lawyer"), and writes structured reports back to the project. Flask exposes the REST surface; Docker keeps the runtime portable across the lab's GitLab runners.
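To make the tiering concrete, here is a minimal sketch of how a dependency list could be folded into a report. It is not the bot's actual code: the `RiskTier` enum, the `TIER_BY_SPDX_ID` table, the report shape, and the package names in the usage line are all assumptions for illustration. The one idea it tries to capture is that anything the table does not recognise falls into the "talk to a lawyer" tier rather than being treated as safe.

```python
from enum import IntEnum


class RiskTier(IntEnum):
    """Risk tiers, ordered so that a higher value means more risk."""
    PERMISSIVE = 0
    WEAK_COPYLEFT = 1
    STRONG_COPYLEFT = 2
    TALK_TO_A_LAWYER = 3


# Illustrative mapping only (assumption): a real table would be much longer
# and would be reviewed by whoever owns compliance decisions.
TIER_BY_SPDX_ID = {
    "MIT": RiskTier.PERMISSIVE,
    "Apache-2.0": RiskTier.PERMISSIVE,
    "BSD-3-Clause": RiskTier.PERMISSIVE,
    "MPL-2.0": RiskTier.WEAK_COPYLEFT,
    "LGPL-3.0-only": RiskTier.WEAK_COPYLEFT,
    "GPL-3.0-only": RiskTier.STRONG_COPYLEFT,
    "AGPL-3.0-only": RiskTier.STRONG_COPYLEFT,
}


def classify(spdx_id):
    """Return the tier for an SPDX ID; unknown or missing IDs escalate."""
    return TIER_BY_SPDX_ID.get(spdx_id, RiskTier.TALK_TO_A_LAWYER)


def build_report(dependencies):
    """Build a structured report from a {package: spdx_id_or_None} mapping."""
    findings = [
        {"package": name, "license": spdx_id, "tier": classify(spdx_id).name}
        for name, spdx_id in sorted(dependencies.items())
    ]
    # The project is only as safe as its riskiest dependency.
    worst = max(
        (classify(spdx_id) for spdx_id in dependencies.values()),
        default=RiskTier.PERMISSIVE,
    )
    return {"worst_tier": worst.name, "findings": findings}


# Hypothetical package names, used only to exercise the sketch.
report = build_report({"web-framework": "BSD-3-Clause", "some-gpl-lib": "GPL-3.0-only"})
```

Using an ordered enum means the project-level verdict is a plain `max` over the per-package tiers, which keeps the report logic short.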

The interesting part wasn't the scanning. It was the classification. License names aren't standardized. "MIT" and "MIT License" and "Expat" mean the same thing; "BSD" without a number means whatever the author thought it meant in 2008. I built a normalization layer on top of SPDX identifiers, plus a manual override table for the long tail. The override table grew quietly and now contains most of the institutional knowledge.
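A rough sketch of what such a normalization layer can look like follows. The names `KNOWN_SPDX_IDS`, `OVERRIDES`, and `normalize_license` are hypothetical, and both tables are tiny stand-ins for the real ones. The override table is consulted first, so a manual decision always wins over automatic matching, and an ambiguous string like a bare "BSD" is deliberately mapped to `None` so that it gets a human review.

```python
import re

# A handful of canonical SPDX identifiers (assumption: the real layer would
# load the full SPDX license list instead of hard-coding a few).
KNOWN_SPDX_IDS = {"MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause", "GPL-3.0-only"}

# Manual override table for the long tail: cleaned free-text -> SPDX ID,
# or None when the string is too ambiguous to resolve automatically.
OVERRIDES = {
    "mit license": "MIT",
    "expat": "MIT",
    "apache 2.0": "Apache-2.0",
    "apache license 2.0": "Apache-2.0",
    "bsd": None,  # no clause count given, so a human has to decide
}

# Case-insensitive lookup for exact SPDX identifiers.
_SPDX_BY_LOWER = {spdx_id.lower(): spdx_id for spdx_id in KNOWN_SPDX_IDS}


def _clean(raw):
    """Lower-case and collapse whitespace so lookups are forgiving."""
    return re.sub(r"\s+", " ", raw.strip().lower())


def normalize_license(raw):
    """Map a declared license string to an SPDX ID, or None if unresolved."""
    if not raw:
        return None
    key = _clean(raw)
    if key in OVERRIDES:  # manual decisions take precedence
        return OVERRIDES[key]
    return _SPDX_BY_LOWER.get(key)


resolved = [normalize_license(s) for s in ("MIT", "MIT License", "Expat", "BSD")]
```

Returning `None` for anything unresolved pairs naturally with a "talk to a lawyer" tier: the normalizer never guesses, it only reports what it can justify.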

A small React/Next.js dashboard surfaces findings to research leads. GitLab CI hooks run the bot on push, on MR, and as a nightly full sweep.

The MISP extension

Separate project, same general spirit: take a manual workflow and make it disappear into the tools analysts already use. Threat analysts spend their day reading reports, blog posts, sometimes raw paste sites, and harvesting indicators of compromise. The standard workflow was: see an IOC, copy it, switch to MISP, fill a form. Easy to skip. Easy to mistype.

The extension wraps PyMISP and lets the analyst push an enriched observable directly from the page. Right-click, classify, publish. Done.
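The "classify" step is the part that is easiest to show in isolation. The sketch below is an assumption about how a selected string could be mapped to a MISP attribute type before publishing; it uses only the standard library and does not call PyMISP. The helper names `refang` and `guess_attribute_type` are hypothetical, the choice of `ip-dst` for bare IP addresses is a simplification, and the domain pattern is intentionally loose.

```python
import ipaddress
import re

# Hex digests are told apart by length alone.
_HASH_TYPE_BY_LENGTH = {32: "md5", 40: "sha1", 64: "sha256"}
_HEX_RE = re.compile(r"[0-9a-fA-F]+")
# Loose hostname pattern: dot-separated labels ending in an alphabetic TLD.
_DOMAIN_RE = re.compile(
    r"([a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,63}", re.IGNORECASE
)


def refang(value):
    """Undo common defanging such as 'hxxp://' and '[.]' seen in reports."""
    value = value.strip().replace("[.]", ".")
    if value.lower().startswith("hxxp"):
        value = "http" + value[4:]
    return value


def guess_attribute_type(value):
    """Guess a MISP attribute type for a selected string, or None if unsure."""
    value = refang(value)
    if value.lower().startswith(("http://", "https://")):
        return "url"
    try:
        ipaddress.ip_address(value)
        return "ip-dst"  # assumption: treat bare IPs as destinations
    except ValueError:
        pass
    if _HEX_RE.fullmatch(value) and len(value) in _HASH_TYPE_BY_LENGTH:
        return _HASH_TYPE_BY_LENGTH[len(value)]
    if _DOMAIN_RE.fullmatch(value):
        return "domain"
    return None  # let the analyst pick the type by hand


guessed = guess_attribute_type("hxxp://evil[.]example/payload")
```

A guess like this would only pre-fill the type; the analyst still confirms it before anything is published, which keeps the right-click flow fast without removing the human check.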

What I learned

The compliance work is the more boring of the two, and probably the higher-leverage one. License risk is the kind of debt that compounds silently. By the time someone notices, you're stripping dependencies out of a system that has been in production for three years.

The override table is the part I underestimated. I assumed SPDX would do most of the work. In practice, half of real-world repos either lie about their license or use a string that isn't in any standard. The lesson generalizes: assume the spec is aspirational, and build a manual escape hatch on day one.

Stack

Python, Flask, Docker, React, Next.js, GitLab CI/CD, PyMISP