Balancing AI Performance and Safety: Lessons from PyData Berlin
Hi Nicolas! How was PyData in Berlin? Have you ever participated in a conference like this before?
This was my first PyData event, and it was a very positive experience! It was quite different from the academic conferences where I used to present research papers when I was at Amazon Alexa AI. PyData attracts more practitioners and companies doing data science and ML, with a focus on engineering, development practices, and tooling. It's a diverse mix of attendees from different organizations and open-source contributors. I gave a 30-minute talk after our ML team submitted a proposal to the CFP.
I didn't learn much new about ML best practices, which made me realize our team at GitGuardian already has solid practices in place. However, I gained insights into development practices outside of ML and into open-source contributions. It was also great for networking and talking about what we do at GitGuardian!
Your talk was titled "Would you rely on ChatGPT to dial 911?" Can you tell us what it was about?
Yes, I chose a catchy title, but the idea I presented was about the increasing presence of machine learning (ML) in our lives. We often hear that in 90% of cases, ML outperforms humans at a specific task, but in the remaining 10%, or even 1% or 0.1%, the machine might falter. These cases can be critical, like when calling emergency services. You can't afford an ML model to fail there, even if it performs better than humans 99.9% of the time. To illustrate the urgency, I looked at statistics from the OECD, which showed a 200% increase in AI-related incidents over the last year. These incidents range from minor annoyances to significant errors: a recent case involved Google AI summarizing search results incorrectly. That one was funny, but there are also severe cases, like computer vision systems in supermarkets incorrectly flagging people for theft, or, even more extreme, the infamous Tesla accidents. The talk was about integrating ML safely into user-facing systems to minimize such issues.
Did you ever deal with a similar issue, like with Alexa and calling 911?
Actually, the 911 example is spot on. When someone asks Alexa to call 911, it goes through various ML models, but ultimately, a deterministic system ensures the call is made without relying on more complex ML predictions, to avoid errors.
When you talk about deterministic ML models, aren't ML models generally based on statistics and probabilities? What defines a deterministic model in ML?
That's a great question. The distinction between deterministic and probabilistic models can be complex. Variability in ML models arises in two places: during training or retraining, and in production inference. Even though ML models are rooted in probability theory, this doesn’t mean their behavior is unpredictable. Good ML models should always produce the same output for the same input. Minor variations can occur due to hardware or infrastructure differences but are usually negligible.
In training, variability can be more impactful. Each time you retrain a model, its performance can change due to the inherent randomness in the training process, much more so than with a simple rule-based system. This unpredictability during retraining is often what people mean when they say ML models aren't deterministic. In ML, there's a principle known as the CACE principle: "Changing Anything Changes Everything."
Many people assume that probabilistic models are inherently unpredictable, but that's not true. Deterministic models consistently produce the same results given the same inputs, while probabilistic models involve randomness during training but are usually deterministic in inference. Understanding that variability in ML primarily affects model training, not final outputs, is crucial to demystifying these misconceptions.
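To picture the distinction, here is a small sketch using scikit-learn on a toy dataset (purely illustrative, not tied to any system discussed here): inference is repeatable for a fixed trained model, while retraining with a different seed can change the learned parameters.

```python
# Purely illustrative: inference determinism vs. training variability on a toy dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Inference is deterministic: same fitted model + same input => same output.
model = SGDClassifier(random_state=42).fit(X, y)
assert (model.predict(X[:5]) == model.predict(X[:5])).all()

# Training is where variability lives: retraining with a different seed can yield
# different weights, and therefore slightly different behavior ("changing anything
# changes everything").
retrained = SGDClassifier(random_state=7).fit(X, y)
print("Same weights after retraining?", bool((model.coef_ == retrained.coef_).all()))
```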
What would you say are the most common misconceptions today about ML?
One thing I often hear is that machine learning models are unpredictable and uncontrollable. This is a false belief. With proper testing before deployment, you can know what they will predict. Additionally, there are numerous methods to control and constrain a model's output: general techniques as well as ones specific to generative AI like ChatGPT. For instance, alignment techniques ensure the model adheres to certain principles during training, and you can constrain generation to a specific vocabulary to guarantee consistent outputs, as sketched below.
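To make the vocabulary-constraint idea concrete, here is a minimal, library-agnostic sketch (the function name and token IDs are purely illustrative): at each decoding step, the scores of every token outside an allowed set are masked out, so the model can only ever produce outputs from that set.

```python
import numpy as np

def constrain_logits(logits: np.ndarray, allowed_token_ids: set[int]) -> np.ndarray:
    """Mask every token outside the allowed vocabulary before sampling/argmax."""
    masked = np.full_like(logits, -np.inf)
    allowed = list(allowed_token_ids)
    masked[allowed] = logits[allowed]
    return masked

# Toy example: a 10-token vocabulary where only tokens 2, 5, and 7 are allowed.
logits = np.random.randn(10)
next_token = int(np.argmax(constrain_logits(logits, {2, 5, 7})))
assert next_token in {2, 5, 7}
```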
Another misconception is that machine learning will replace everything and solve all problems. For many use cases, simpler models or even regex can be just as effective, cheaper, and more reliable than ML. The combination of simple rules and machine learning often provides the best user experience and a balance of cost and accuracy.
Is that a guiding principle for ML development at GitGuardian?
Exactly. For example, at GitGuardian, we don't try to replace our secret detection engine with machine learning. Instead, we use that engine to efficiently and reliably identify an initial pool of secrets, scanning many commits cost-effectively. Then, we layer machine learning on top to add context and validation. This hybrid approach leverages the strengths of both systems.
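In very broad strokes, the hybrid idea looks like this (a simplified illustration, not GitGuardian's actual engine; the pattern, scoring function, and threshold are placeholders): cheap, deterministic rules propose candidate secrets, then an ML score filters and contextualizes them.

```python
# Simplified illustration of a rules-then-ML pipeline (not GitGuardian's real engine).
import re

# A deliberately naive rule: long tokens assigned to sensitive-looking keys.
CANDIDATE_PATTERN = re.compile(
    r"(?:api[_-]?key|token|secret)\s*[:=]\s*['\"]?([A-Za-z0-9_\-]{16,})"
)

def rule_based_candidates(text: str) -> list[str]:
    """First layer: fast, predictable, easy to audit."""
    return CANDIDATE_PATTERN.findall(text)

def ml_confidence(candidate: str) -> float:
    """Second layer placeholder: in practice this would be a trained classifier."""
    # Hypothetical stand-in: reward longer, more varied strings.
    return min(1.0, len(set(candidate)) / 20)

def detect_secrets(text: str, threshold: float = 0.6) -> list[str]:
    return [c for c in rule_based_candidates(text) if ml_confidence(c) >= threshold]

print(detect_secrets('api_key = "A1b2C3d4E5f6G7h8I9j0"'))
```

The rules stay cheap, auditable, and predictable, while the ML layer only has to judge the candidates they surface.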
Models like the Confidence Scorer or our recently announced FP Remover are excellent examples of how we blend rule-based systems and machine learning. These models are very robust because we extensively and rigorously test them to ensure they meet all requirements and don't degrade performance on critical use cases. We expose them to varied data during training to handle potential differences in real-world inputs, which is part of making our models more error-resistant.
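As an illustration of that kind of regression testing, here is a minimal pytest-style sketch (the model class and the test cases are hypothetical placeholders): critical inputs get explicit tests, so a retrained model that degrades on them never ships.

```python
import pytest

class DummySecretClassifier:
    """Stand-in for the real model; in practice a fixture would load the retrained model."""
    def predict(self, text: str) -> bool:
        return "key" in text.lower()

@pytest.fixture
def candidate_model():
    return DummySecretClassifier()

# Hand-curated examples that must never regress, whatever the aggregate metrics say.
CRITICAL_CASES = [
    ("AKIA-style AWS key in a config file", True),
    ("random UUID used as a database row id", False),  # a classic false-positive trap
]

@pytest.mark.parametrize("text,expected_is_secret", CRITICAL_CASES)
def test_no_regression_on_critical_cases(candidate_model, text, expected_is_secret):
    assert candidate_model.predict(text) == expected_is_secret
```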
Lastly, we use advanced monitoring tools in production to quickly identify and resolve issues, ensuring any problems can be swiftly addressed or rolled back.
Recently, I was talking with Arnault (GitGuardian's lead ML engineer), and he mentioned something interesting: he said that at GitGuardian, our rule-based model could already be considered an AI system. What’s your take on that?
I actually agree with Arnault. When people talk about AI, they often think of the latest technologies like ChatGPT and get dazzled by the hype. However, AI isn't necessarily about those flashy, state-of-the-art technologies. It includes any system that performs tasks better or more efficiently than humans would. Under that definition, our secrets detection engine fits perfectly as AI. Machine learning is just a small subset of AI.
The main difference between machine learning and other forms of AI is that machine learning models learn their rules from data, whereas other AI systems might have rules explicitly coded. But both fall under the AI umbrella. For instance, Netflix uses complex rule-based systems for recommendations, and no one would argue that it isn’t AI just because it relies on predefined rules. We could find more examples of such systems that are widely accepted as AI despite not being purely machine learning-based.
Do you have any comments specifically regarding generative AI? Does mixing deterministic and probabilistic models add complexity, or is it a different challenge altogether?
With generative AI, controlling and ensuring the reliability of models becomes increasingly difficult. Unlike traditional models that might choose from a limited set of categories, these models can generate a wide range of outputs: phrases, images, and more, meaning their output space is vast and less constrained. They also often take in multiple types of inputs—text, voice, images—making them multimodal. The broader the space for generation or prediction, the higher the risk, and the more challenging it becomes to control.
Given these complexities, new solutions are being developed to address the specific risks of generative AI. For instance, guardrails for large language models (LLMs) aim to control what the LLM can generate, verifying if the output is valid or not. In AI companies, red teams specifically attempt to break LLMs by finding ways to make them produce undesirable results.
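Conceptually, a guardrail is just a deterministic layer wrapped around the model. Here is a minimal, generic sketch (not any particular guardrails library; the patterns, functions, and fallback message are illustrative): every generated response passes through explicit checks before reaching the user, and anything that fails falls back to a safe, predictable answer.

```python
import re

# Deterministic checks applied to every generated answer (patterns are examples only).
BLOCKED_PATTERNS = [
    re.compile(r"(?i)\b(ssn|social security number)\b"),
    re.compile(r"(?i)ignore (all )?previous instructions"),
]

def output_is_valid(text: str, max_chars: int = 2000) -> bool:
    if len(text) > max_chars:
        return False
    return not any(pattern.search(text) for pattern in BLOCKED_PATTERNS)

def guarded_answer(user_prompt: str, generate) -> str:
    """`generate` stands in for whatever callable wraps the LLM."""
    draft = generate(user_prompt)
    if output_is_valid(draft):
        return draft
    return "Sorry, I can't help with that request."  # deterministic fallback

# Usage with a fake generator, for illustration:
print(guarded_answer("hello", lambda prompt: "Sure, here is some help."))
```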
On the other end, you can always add layers to make models safer, but if you overdo it, you make them unusable. A good example is the model called Goody-2. Its creators cheekily, and rightly, call it "Outrageously safe": it's a language model that was trained so rigorously to avoid offensive language and any potential problem that it became practically useless. You could ask it about the weather in Paris, and it would respond with something like, "I cannot answer this because it might influence your decision to go outside."
Even the biggest companies seem to be struggling with this because we keep seeing incredible errors. We talked about Google, but it's the same across the board.
It's quite alarming, but we shouldn't get too pessimistic and assume it's all doom and gloom. The problem is magnified with generative AI systems because they're becoming more integrated into our daily lives: soon they'll be on your iPhone, everywhere. This omnipresence increases the risk, since these models touch many more aspects of our lives than the more isolated systems that were only used by specific clients. The likelihood of encountering a problem is simply much higher.
Besides the risks, what's their key advantage compared to traditional techniques or the usual mix of simpler models?
The real opportunity with this new era of generative models is significant. They aren't fundamentally different from what came before, but the way they're developed and the resources poured into them have changed. Previously, we had small expert systems trained to perform specific tasks very well. Now, a single model can handle a wide range of tasks without being explicitly trained for them, performing about as well as a human right out of the gate. Plus, these models are multimodal, handling text, images, and even mathematical tasks simultaneously, which is a fundamental leap forward. To me, this versatility starts to resemble "intelligence," but, let's be clear, these systems still often fall short on other crucial aspects of intelligence, like reasoning.
Finally, do you have any tips or recommendations for people wanting to integrate ML into their projects?
My first recommendation is a bit of a cliché but still very relevant: "Don't do ML if it's not needed." If you can solve your problem with simpler rule-based systems that are cheaper and more efficient, go for it.
The second recommendation is to deeply understand the user experience. Those working in ML and data science often focus too much on improving model performance on datasets, forgetting what's truly important to the user. This ties back to my talk: certain data points matter more than others. For instance, Alexa responding accurately to "Play Spotify" is much more critical than it telling a joke. So anyone looking to incorporate ML into their projects should have product managers focused on UX and product needs, and should align their metrics and training with user satisfaction as much as possible.
At GitGuardian, for example, our strategy is to optimize models that directly impact user experience and product performance. We don't waste resources on improving detectors nobody uses or issues that don't affect user satisfaction. Having direct input from product managers helps us ensure that our models stay relevant.
Lastly, beyond developing good models, it's essential to have robust testing and monitoring in place. People need to understand that an ML engineer is still an engineer: good testing and monitoring practices are just as vital as developing the models themselves. The principles are the same as for any other engineering task: make sure your tools and guidelines for testing and monitoring are top-notch so you can maintain model performance and reliability.
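As a small example of that monitoring mindset, here is an illustrative sketch (the class, window size, and thresholds are hypothetical): track the live rate of positive predictions and flag a drift when it strays too far from the baseline measured at validation time, which is often the first visible symptom of a data or model problem in production.

```python
from collections import deque

class PredictionRateMonitor:
    """Alert when the live positive-prediction rate drifts from the validation baseline."""

    def __init__(self, baseline_rate: float, window: int = 1000, tolerance: float = 0.10):
        self.baseline_rate = baseline_rate
        self.recent = deque(maxlen=window)
        self.tolerance = tolerance

    def record(self, is_positive: bool) -> None:
        self.recent.append(is_positive)

    def drift_detected(self) -> bool:
        if len(self.recent) < self.recent.maxlen:
            return False  # wait for a full window before judging
        live_rate = sum(self.recent) / len(self.recent)
        return abs(live_rate - self.baseline_rate) > self.tolerance

# In production, record() would be called on every prediction, and drift_detected()
# would feed the alerting system so a misbehaving model can be rolled back quickly.
monitor = PredictionRateMonitor(baseline_rate=0.05)
```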
Thanks, Nicolas. This has been very insightful!
Thanks!