Michaël Romagné is a Machine Learning Engineer at GitGuardian, focused on integrating advanced NLP models into the platform and implementing MLOps best practices across the company.
Today's Machine Learning Engineers are no longer confined to experimental notebooks – they are architects of production-ready AI systems that drive business impact.
As hands-on developers, they need a tech stack that enables rapid experimentation and smooth deployment to production while maintaining their independence and workflow efficiency.
At GitGuardian, we leverage state-of-the-art NLP techniques to solve critical cybersecurity challenges. Our primary focus has been on reducing false positives in our Secrets Detection Engine and using contextual analysis to identify services associated with leaked secrets. To support these tasks from experimentation to launch, we’ve put together a complete MLOps stack powered by open-source technologies.
In this article, we’ll review the key pieces of our tech stack and demonstrate how they help us run repeatable experiments, work together smoothly, and manage our resources effectively. Let's go!
Building a Robust ML Experimentation Pipeline
The GitGuardian ML team's core mission is to fine-tune Transformer models to achieve the highest precision while maintaining fast inference speeds. This type of model requires substantial computing resources for effective training.
From the start, the team identified three major pain points that were significantly impacting our development velocity and team collaboration:
- Our datasets, models, and experiments were difficult to track, share, and reproduce because we didn't have version control in place.
- We lacked a flexible way to visualize metrics and explore our data, making it challenging to gain actionable insights during the experimentation process.
- ML Engineers spent excessive time manually configuring and setting up EC2 instances in AWS for prototype training, which led to workflow inefficiencies and reduced productivity.
To address these challenges, we needed a robust MLOps infrastructure that would streamline our workflow. After carefully evaluating various tools and approaches, we decided to build our stack around three core open-source technologies.
Version Control for ML Assets with DVC
The first solution we implemented is DVC (Data Version Control), which has become essential to our experimentation pipelines. DVC lets us version datasets, model artifacts, and parameters right alongside our code, making our workflows more efficient and team collaboration smoother. As Iterative puts it simply: “DVC is Git for Data.”
A DVC pipeline lives in a `dvc.yaml` file:
```yaml
stages:
  prepare:
    cmd: python prepare.py
    deps:
      - prepare.py
      - data/raw/
    outs:
      - train.csv
      - test.csv
  train:
    cmd: python train.py
    deps:
      - train.py
      - train.csv
    outs:
      - model.joblib
  evaluate:
    cmd: python evaluate.py
    deps:
      - evaluate.py
      - model.joblib
      - test.csv
    metrics:
      - accuracy.json
```
This defines a typical three-stage training pipeline (prepare, train, evaluate): each stage declares its command, dependencies, and outputs, and DVC derives the execution graph from them.
DVC has a lot to offer to streamline ML workflows:
- Reproducibility: Ever had that moment when you can't recreate the exact conditions that led to your best model? DVC solves this headache by tracking everything - your data, code, parameters, and dependencies - across all pipeline stages. It's like having a time machine for your experiments - you can jump back to any previous version of your model or results with just a few commands.
- Pipeline modularity: With DVC, each pipeline stage is a distinct piece that you can easily swap out or modify when you need to tweak something, without rebuilding the whole thing.
- Data Versioning: DVC stores only lightweight .dvc pointer files in Git while efficiently managing your actual data files in remote storage. It's like having a smart storage system that knows exactly where everything is, without bloating your repository.
- Caching: DVC's caching is smart enough to know what's already been computed and only runs what's necessary - saving you precious time and computing resources.
- Collaboration: DVC makes collaboration actually work in ML projects. Your team can share results and parameters through Git while DVC handles the big files behind the scenes.
These features make DVC an indispensable tool for MLOps teams to iterate fast while maintaining reproducibility with large datasets and complex pipelines.
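To make this concrete, here is roughly how you interact with such a pipeline day to day (a minimal sketch; revisions and parameter names are illustrative):

```bash
# Run only the stages whose code, data, or parameters changed (cached stages are skipped)
dvc repro

# Push data, models, and metrics to remote storage so teammates can pull them
dvc push

# Reproduce a past experiment: check out the code, then the matching data and models
git checkout <commit-or-tag>
dvc checkout

# Run a tracked experiment without polluting the Git history
dvc exp run --set-param train.learning_rate=3e-5
```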
Now that artifacts are versioned for each experiment, how can we efficiently manage them?
Model Registry with GTO
In addition to DVC, Iterative maintains GTO (Git Tag Ops), a tool that lets you tag your best models and artifacts. GTO creates a simple mapping between (`artifact_name`, `version`) and (`file_path`, `commit_hash`), making model versions easily accessible to MLEs. Instead of dealing with complex file paths or commit hashes, you can reference models using a simple name and tag, like `my_awesome_model@v0.0.1`, or assign stages like `my_awesome_model#prod`.
GTO is a great basis for creating an artifact registry within your Git repository, with the Model Registry being a particularly valuable implementation. As Iterative explains:
“Such a registry serves as a centralized place to store and operationalize your artifacts along with their metadata, manage artifact’s life-cycle, versions & releases, and easily automate tests and deployments using GitOps.”
We’ll explore how to leverage these versions in our release pipeline later in this article.
The combination of DVC and GTO enables us to reference specific commits in our Git history and easily retrieve key artifacts like models, datasets, and parameters. This capability is crucial for team collaboration and efficiency.
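In practice, registering a version and promoting it to a stage comes down to a couple of commands that create Git tags under the hood (a sketch; exact flags may vary across GTO versions):

```bash
# Register a new version of the artifact at the current commit
gto register my_awesome_model --version v0.0.1

# Promote that version to the "prod" stage
gto assign my_awesome_model --version v0.0.1 --stage prod

# List registered artifacts, their versions, and their stages
gto show
```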
Awesome Web Apps with Streamlit
While DVC excels at versioning, it wasn't designed for interactive exploration of pipeline outputs like metrics or model behavior. Iterative does offer DVC Studio for this, but it is not open source.
We therefore looked for a lightweight and flexible tool to build web applications for comparing experiments and sharing results with teammates and the broader company. Streamlit has established itself as the standard for building simple web apps with custom visualizations, thanks to its intuitive library and a strong community that develops plugins extending the tool.
Here is an example of a simple UI that both helps our ML team debug its models and lets other teams experiment with the ML team's work.
Thanks to DVC, retrieving previous versions of our model and getting prediction scores (in this case, the False Positive Score) on custom inputs (the `.env` file content) became straightforward. Streamlit is the ideal tool for tinkering with models and displaying such predictions.
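As an illustration, a stripped-down version of such an app might look like this (a sketch only: the model path, score name, and scikit-learn-style pipeline are hypothetical, not our production code):

```python
import dvc.api
import joblib
import streamlit as st


@st.cache_resource
def load_model(rev: str):
    # Pull the model artifact tracked by DVC at a given Git revision (path is illustrative)
    with dvc.api.open("data/models/my_awesome_model/model.joblib", rev=rev, mode="rb") as f:
        return joblib.load(f)


st.title("False Positive Scorer")

rev = st.text_input("Model revision (Git tag or commit)", value="my_awesome_model@v0.0.1")
env_content = st.text_area("Paste a .env file content to score")

if st.button("Score") and env_content:
    model = load_model(rev)
    score = model.predict_proba([env_content])[0][1]  # assumes a scikit-learn-like text pipeline
    st.metric("False Positive Score", f"{score:.2%}")
```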
Easy Cloud Instances Config with SkyPilot
With versioning and data visualization under control, managing cloud resources was the next logical step. To automate the creation and configuration of AWS EC2 instances, we chose SkyPilot. Installation is straightforward, and once your cloud provider credentials are set, all you need is a `skypilot.yaml` manifest like the one below:
```yaml
resources:
  cloud: aws
  # A GPU instance with 1 Tesla V100
  instance_type: p3.2xlarge
  region: # region
  # use_spot: true
  image_id: # we use Deep Learning AMI GPU PyTorch 2.0.0 (Ubuntu 20.04) in eu-west-1

workdir: ./

envs:
  ENV_VAR: value

file_mounts:
  ~/.netrc: ~/.netrc

# Install PDM and python dependencies
setup: |
  echo "Running setup."
  export PYTHONPATH="/home/ubuntu/sky_workdir:$PYTHONPATH"
  export PATH="/home/ubuntu/.local/bin:$PATH"
  echo 'export PYTHONPATH="/home/ubuntu/sky_workdir:$PYTHONPATH"' >> ~/.zshrc
  echo 'export PATH="/home/ubuntu/.local/bin:$PATH"' >> ~/.zshrc
  pip install --user pdm && pdm install -L pdm.gpu.lock

# Run the DVC pipeline and push the results to remote storage
run: |
  source ~/.zshrc
  pdm run dvc exp run -s --pull scripts/training/dvc.yaml -n $EXPTAG
  pdm run dvc exp push origin $EXPTAG
```
You then have full control over your cloud instances and clusters:
- Create instances and run scripts: `sky launch -c mycluster skypilot.yaml`
- See the status of instances: `sky status`
- Kill a cluster: `sky down mycluster`
- Automatically schedule an autostop of the cluster 10 minutes after the end of the task, and launch the cluster in detached mode: `sky launch -d -c mycluster2 cluster.yaml -i 10 --down`
- SSH into a cluster: `ssh mycluster`
This stack makes experiments both scalable and reproducible. Metrics are presented through neat visualizations, while intuitive tagging makes artifacts accessible to everyone. All essential building blocks are in place to develop high-quality Machine Learning models. The final step is to package and deploy these models.
Package and Serve NLP models
Build an Inference Service in a blink with BentoML
For our first integrations in GitGuardian’s products, we developed services that could serve our top-performing models under heavy load while maintaining state-of-the-art performance. The packaging process needed to be straightforward so that any team member could easily build and deploy their own service.
BentoML proved to be an excellent solution for several key reasons:
- It’s built upon Starlette, an ASGI framework designed for building asynchronous Python web services
- It offers multiple integrations for serving modern NLP models, including ONNX Runtime and vLLM
- Its Services / Runners function as task queues, allowing you to manage traffic through max concurrency settings, timeouts, and micro batching
- It’s flexible enough to build complex inference workflows while exposing models through REST or gRPC endpoints
In BentoML, the definition of a Service can be as simple as this:
```python
import bentoml
from transformers import pipeline


@bentoml.service(
    resources={"cpu": "2"},
    traffic={"timeout": 10},
)
class Summarization:
    def __init__(self) -> None:
        self.pipeline = pipeline('summarization')

    @bentoml.api
    def summarize(self, text: str) -> str:
        result = self.pipeline(text)
        return result[0]['summary_text']
```
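Assuming this class lives in a `service.py` file, serving it locally and calling it is just as short (the endpoint URL and payload shape follow BentoML's defaults; adjust to your setup):

```bash
# Start a local HTTP server (BentoML listens on port 3000 by default)
bentoml serve service:Summarization

# Call the endpoint generated for the `summarize` API
curl -X POST http://localhost:3000/summarize \
  -H "Content-Type: application/json" \
  -d '{"text": "BentoML turns a Python class into an inference service."}'
```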
BentoML packages models with all necessary dependencies into 'bentos', which are essentially folders that can later be used to build a Docker image. By using DVC and GTO to track these bentos, the deployment of ML inference services can be automated through CI/CD. Here is an example DVC pipeline that builds a bento:
```yaml
# Artifacts section for GTO
artifacts:
  my_awesome_model:
    path: data/models/my_awesome_model
    desc: My awesome model.
    type: model
    labels:
      - demo

stages:
  # Push the prod model artifact to the BentoML model store
  store_models:
    cmd: >-
      BENTOML_ENV=development BENTOML_DO_NOT_TRACK=true BENTOML_HOME=$(pwd)/data/bentoml
      python -m demo.save_to_bento
      --model-artifact-name my_awesome_model
      --model-artifact-stage prod
    deps:
      - demo/save_to_bento.py
      # Add other important files to track
    outs:
      - data/bentoml/models/

  # Build the bento folder
  build_bento:
    cmd: >-
      BENTOML_DO_NOT_TRACK=true BENTOML_HOME=$(pwd)/data/bentoml
      bentoml build --bentofile demo_model/bentofile.yaml --version demo-version
    deps:
      - demo/bentofile.yaml # Config file for the service
      # Add other important files to track
    outs:
      - data/bentoml/bentos/my_awesome_model/demo-version
```
BentoML helps teams create complete inference services with just a few lines of Python code, while managing all service settings through a single YAML file. With simple commands, you can package your service into a Docker image that's ready for high-performance deployment on any cloud platform. The main concepts behind the tool are detailed in this blog post that we highly recommend as an introduction.
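For instance, once the bento above is built, producing the Docker image is a single command (the tag is illustrative and must match the bento store used in the pipeline):

```bash
# Build an OCI image from the bento produced by the build_bento stage
BENTOML_HOME=$(pwd)/data/bentoml bentoml containerize my_awesome_model:demo-version
```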
Optimizing Production for Real-Time and Batch Secrets Classification
Our production deployment architecture is designed around three key requirements:
- We need real-time ML classification as soon as a secret leak is detected
- We have a large historical database of secret leaks that requires periodic analysis. For example, when deploying a new model, we need to recompute classifications for all historical incidents.
- Input data contains sensitive information and must remain within our production Kubernetes cluster's security boundary
We use ONNX Runtime to power our Transformer model’s inference engine. By leveraging optimizations like graph optimization and quantization, we achieved excellent real-time performance with low latency, even on modest CPU instances.
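The snippet below sketches the kind of setup involved, using standard ONNX Runtime graph optimization and dynamic quantization utilities (file names are placeholders, and this is not our exact export pipeline):

```python
import onnxruntime as ort
from onnxruntime.quantization import QuantType, quantize_dynamic

# Quantize the exported Transformer weights to int8 to shrink the model and speed up CPU inference
quantize_dynamic("model.onnx", "model.quant.onnx", weight_type=QuantType.QInt8)

# Enable all graph optimizations (node fusion, constant folding, etc.) at session creation
options = ort.SessionOptions()
options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
options.intra_op_num_threads = 2  # tuned for small CPU instances

session = ort.InferenceSession("model.quant.onnx", options, providers=["CPUExecutionProvider"])
# `inputs` would be the tokenized batch, e.g. {"input_ids": ..., "attention_mask": ...}
# logits = session.run(None, inputs)
```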
For our compute-heavy back-population jobs, where we periodically reprocess our extensive history of secret leaks, we needed more flexible scaling options. To manage these demand spikes efficiently, we implemented Kubernetes’ Horizontal Pod Autoscaler (HPA). This automatically adjusts our deployment size based on incoming request volume, ensuring we maintain optimal performance during peak periods without wasting resources during quiet times.
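A simplified version of such an HPA manifest is shown below (names and thresholds are placeholders; scaling on request volume assumes a custom metrics adapter such as the Prometheus adapter is installed, and a plain CPU target works as a simpler alternative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: secrets-classifier
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: secrets-classifier
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second   # exposed via a custom metrics adapter
        target:
          type: AverageValue
          averageValue: "50"
```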
As the final step, we implemented comprehensive monitoring and logging. Tools like Prometheus and Grafana provide real-time insights into model performance and system health, allowing us to catch potential issues early. Monitoring metrics such as latency, request rates, and error rates gives us a clear view of our service's responsiveness and reliability. For more sophisticated visualizations, particularly for business monitoring, we created dedicated Streamlit web applications.
Conclusion
That’s a wrap for the overview of the GitGuardian MLOps stack! As you've seen, we designed it to be modular and to achieve an ideal tradeoff between usability and performance.
While we did not cover some aspects of our setup, such as our complete service testing suite and job orchestration, we hope this article provides a practical guide and inspiration for ML engineers building or refactoring their own MLOps stack to meet the demands of modern Machine Learning.