Episode 53: Differential Privacy Demystified (Pt II)
In this episode, Jon Prial continues his conversation about differential privacy with Yevgeniy Vahlis, Georgian Partners’ Director of Applied Research. Find out more about how differential privacy works as Yevgeniy explains it using simple, everyday examples. He then goes on to describe why differential privacy isn’t just for the likes of Google and Apple, but rather something that most companies should be taking a close look at.
Jon Prial: Welcome to part two of our Impact Podcast on differential privacy. If you haven’t listened to part one, I encourage you to do so. Today we’ll go deeper and learn more about both the data, which can be made differentially private, and the models that are developed to run against that data.
This is a topic that I know you’ll be hearing a lot more about across the industry. I think you’ll find it quite interesting right now, today.
I’m Jon Prial. Welcome to the Impact Podcast.
Jon: Just recently, The Wall Street Journal published an article on differential privacy. Yeah, you and I were talking about a tech topic in the Wall Street Journal. I guess we’ve crossed out of early adoption phase.
Anyway, it was a pretty cool example. It was focused on how data might be modified so that you’d be unable to get the answer to a specific question about a specific person, even when you had other data.
Today I’ve got Yevgeniy Vahlis with me. He’s the director of Security First on Georgian’s Impact Team. Now, Yevgeniy, the work you’ve been doing with our portfolio companies, steps you away from the data, correct?
Yevgeniy Vahlis: The way to think about it is you’re anonymizing the thing that actually matters, which is either the machine learning model or the statistics that you compute. You’re not anonymizing individual records.
What you really want is the statistics, the insights, the analytics, the machine learning models. You almost never actually want the raw data itself. You want the insights from the data. Differential privacy helps you anonymize those insights.
Jon: In our first podcast, we spoke about the trade-offs that you have to make between perfect accuracy and perfect privacy protection. If you could stay on that thought, take me to an example and try to cover both the point of view of modifying the data, as well as modifying the model. That will be great.
Yevgeniy: Let’s think about different ways that you could protect data when you want to do analytics. Say your database is a traditional relational database. You have a bunch of columns representing fields. You remove some of those columns that you don’t need for your analytics. We know that that doesn’t work.
We know that you can cross‑reference the remaining columns with other data sets and recover usually, most of the data. Let’s go to an extreme. Let’s say we set all the data to zeros. Data is binary in computers, zeros and ones. We just make sure everything is zeros. That clearly does protect privacy of the data. Nothing is left.
What if we didn’t have to set the data to all zeros? What if we could look at a numeric representation of the data, which we need to do anyway to do any kind of analytics because analytics works on numeric data, and add some noise to it.
Tweak the data a little bit so that each record is not exactly the original record. Or the way we combine them is not exactly by computing an average, but by computing an average plus some random noise. Or by training a logistic regression classifier and adding some noise to the parameters or the model.
That noise, if you construct it carefully enough and with awareness of your objective, can both achieve the goal of hiding the data effectively, going towards this idea of setting everything to zero, while still preserving enough of the shape of the data so that you still get your accuracy in prediction, classification, and analytics.
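A minimal sketch of this idea, using the classic Laplace mechanism on an average. The records, the epsilon value, and the `dp_average` function are all illustrative assumptions, not from the episode:

```python
import random

def dp_average(values, epsilon, value_range):
    """Differentially private average via the Laplace mechanism.

    Changing one record can shift the mean by at most value_range / n,
    so we add Laplace noise scaled to that sensitivity.
    """
    n = len(values)
    sensitivity = value_range / n
    true_avg = sum(values) / n
    # Sample Laplace(0, sensitivity / epsilon) as a signed exponential.
    scale = sensitivity / epsilon
    noise = random.choice([-1, 1]) * random.expovariate(1 / scale)
    return true_avg + noise

miles = [40_000, 100_000, 55_000, 72_000, 61_000]  # hypothetical records
noisy_avg = dp_average(miles, epsilon=1.0, value_range=120_000)
```

A smaller epsilon means more noise and stronger privacy; a larger epsilon preserves more of the shape of the data, which is exactly the accuracy-versus-privacy trade-off from part one.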
Jon: I’m quite fascinated by your use of the phrase, “Shape of the data.” Actually, that’s very helpful to me. Machine learning models are the result of all the data so it makes sense that you’re primarily focused on the model in the right context. Then you make sure that the data you’re using is correct. Yes?
Yevgeniy: Yeah. Where we’re going is where you really protect the models. You don’t protect the data. You want to train many different kinds of models on the data. You want to keep the data intact.
Once you’ve trained the model, you know your use case. You know what you’re trying to predict. You want to strip out everything else that’s not required to produce your prediction. Because of the complexity of the math underneath your prediction, you have to match that with the right complexity of anonymization.
If you think about it, the machine learning models are very complex. Using something as simple as removing fields from your data is too crude of a tool to accomplish the goal.
Just as we go to great lengths and build complex mathematical algorithms to achieve good accuracy in analytics (the goal is not complexity for its own sake, obviously), we want to be as sophisticated as we have to be in protecting the data, so that we remove everything else but don’t remove the things that we need.
Jon: Let’s step back and build up to that model discussion. I’d like to start with the simple view of making data differentially private by adding noise. I’m realizing this isn’t something trivial. You’ve got to have a sense of what you’ve got, what your intent with the data is, and more.
Yevgeniy: Let’s say all we care about is who has more air miles. Jon or Yevgeniy, who has more air miles on their frequent flyer account? We don’t care how much each of us has. We just want to know who has more.
If I take some numbers and make some assumption about how far they are from each other, I’ll assume that whoever has more has at least double. Then I’ll add some random number to each one.
I’ll generate a random number in my computer between zero and a hundred and add it to Jon’s number of miles. Then I’ll generate another random number and add it to Yevgeniy’s miles.
Now, since I assume there’s a certain gap between the two numbers and I’ve added the randomness, I still preserve the order, but I can no longer recover the exact number. Now this may not be differentially private, but that’s the general idea.
What I wanted to show you here is that I’ve considered the problem and I’ve looked at the assumptions I can make about the data, that there’s a certain gap between the two numbers, and I’ve used that to limit how much noise I’m adding to the data.
I said I’m going to choose a random number between 0 and 100. I didn’t say I’m going to choose a random number between 0 and 10 billion. That would completely destroy all the information and probably even change the order.
In other words, you don’t want the randomness to dominate the signal. You want to make sure that the noise is significantly smaller than the signal. In this case, the signal is who has more, the fact that one has at least twice as many miles as the other.
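That comparison can be sketched directly. The balances below are hypothetical, and the gap assumption from the episode (the larger balance is at least double the smaller) is what keeps the order intact despite the noise:

```python
import random

def noisy_compare(miles_a, miles_b):
    """Add small uniform noise to each balance before comparing.

    Because the noise (at most 100) is far smaller than the assumed
    gap between the two balances, the ordering is preserved even
    though the exact counts can no longer be recovered.
    """
    return miles_a + random.uniform(0, 100), miles_b + random.uniform(0, 100)

jon, yevgeniy = 40_000, 100_000          # hypothetical balances
noisy_jon, noisy_yev = noisy_compare(jon, yevgeniy)
print(noisy_yev > noisy_jon)             # prints True: order survives the noise
```

Had the noise ranged up to 10 billion instead, as Yevgeniy notes, it would swamp the signal and even the ordering would be lost.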
Jon: Good. Keep going. Now, how do we change the model versus the data?
Yevgeniy: Let’s take this air miles example one step further. Let’s say I’ve figured out that everyone who’s working remotely for Georgian Partners has at least 100,000 air miles. Everyone who’s working locally has a little bit less, 90,000 or so.
If I’m just given the number of air miles that someone has and I need to tell you whether they’re remote or local, I’ll just check whether it’s above or below 95,000.
How can I make this differentially private? Well, I know that there’s a certain gap. Again, I know that the remote people have 100,000 and the local people have about 90,000. I know there’s a certain gap. I know that my objective is to tell you, given the number of air miles, whether someone is remote or local.
I’ll take this sharp threshold which I’ve computed from the exact data, which is exactly the average between 90,000 and 100,000, 95,000. That’s my classification threshold. If it’s above 95, it’s remote. If it’s below 95, it’s local.
I’m going to perturb it a little bit. I’m going to choose some random number between minus a thousand and plus a thousand. I’m going to add it to this threshold of 95,000. Now my model has changed. My data hasn’t changed, but my model in this case is very simple.
It consists of a single number, which is the threshold based on which I decide who’s remote and who’s local. By moving it around a little bit, I’ve made the model differentially private. At the same time, I haven’t moved it so much that I can no longer use it for answering this question.
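That single-number model can be sketched as follows, using the hypothetical 95,000 threshold and a random shift of up to 1,000 in either direction:

```python
import random

def private_threshold(threshold=95_000, noise_range=1_000):
    """Release a randomly perturbed decision threshold instead of the exact one."""
    return threshold + random.uniform(-noise_range, noise_range)

def classify(air_miles, threshold):
    """Label a person remote or local from their air miles alone."""
    return "remote" if air_miles > threshold else "local"

t = private_threshold()
# Given the assumed gap (remote ~100,000 miles, local ~90,000),
# a shift of at most 1,000 never changes any prediction:
print(classify(100_000, t), classify(90_000, t))  # prints: remote local
```

The data never changes; only the released model does, and the gap assumption bounds how much noise the model can absorb without losing accuracy.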
Jon: In our prior podcast on differential privacy, we spoke about these bigger companies. You talked about Google and Apple. Before we start to go deeper, take me through your thoughts on why this isn’t just for the big guns.
Yevgeniy: Absolutely. I even think that it’s more important for growth-stage companies and early-stage startups to use differential privacy as a differentiator. The reason is that, unlike Apple, Google, and Facebook, growth-stage companies can’t afford a massive loss of confidence in their predictive product.
Imagine you’re building an AI agent and that agent reveals some information about your customers that is really sensitive. As a growth-stage company, that’s potentially either a significant hit to your reputation or even a business ending event.
Apple uses differential privacy as a differentiator and as a way to position itself as a leader on privacy. Growth-stage companies can do the same, but it’s more than that. I think that they have to use tools like differential privacy if they are building AI because if you’re building AI without the right security measures in place, that’s the first thing that’s going to fail.
When it fails because you haven’t given any thought to security, it’s going to be a catastrophic failure.
Jon: Just to set this next section up. Last week I walked past a guy wearing a T-shirt that said something like, “Between the two of us, Bill Gates and I are worth $80 billion.” Foolishly, I pointed it out to my wife. She didn’t laugh.
It kept me thinking more about the underlying data and how to extract an individual’s information. I’d love for you to go further into how a model is protected with differential privacy and the levels of protection that can be had.
Yevgeniy: When you build a predictive model, let’s say it’s a classifier, many business people will be familiar with logistic regression. It’s one of the most common first things to try when you’re trying to classify your data as being type A or type B. Is the person going to buy or not going to buy?
How do I protect something like that? In machine learning, the way that logistic regression behaves is it creates essentially a line or a sheet between the two data sets, the data set where the class is a yes and the data set where the class is a no. It creates a way to split those two.
That partition reveals a lot about the original data. Just as an example, if you only had two data points, one of type A and one of type B, that partition would be exactly in the middle between the two data points. If you know one of the data points, you know the other.
It’s like your example of the T-shirt with Bill Gates’ net worth. If I know Bill Gates’ net worth and I know the average, then I know the other person’s net worth.
Jon: I couldn’t lie. Bill Gates’ net worth is public knowledge, of course.
Yevgeniy: Exactly. That’s a very simple case because you only have two data points. You may have millions or tens of millions of data points. The idea is the same. There’s something that tells you how to split these two types of data.
What differential privacy does is to move that barrier from its original position, which is derived purely from the original data, to some slightly random location around that original. If you think about it as a wall somewhere between two data sets, that wall gets moved around back and forth a bit and maybe rotated a bit, randomly.
That prevents this type of data recovery, like what you get in the Bill Gates case. Even if you know all the data except one point, and you want to figure out the type of that point, if the model has been protected through differential privacy, you can’t do that.
On the other hand, if we just had the logistic regression model and we know all the data, except a single record, we can easily recover that record. That’s exactly the same as the Bill Gates case.
Jon: If you’re a business delivering these types of solutions, you need to give a level of comfort to all of the companies whose data you’re aggregating. As you mentioned earlier, some of them might be competitors, so they need a level of comfort that no matter what happens, everybody benefits from this aggregation of all the data.
All of the companies, all of your clients, are benefiting from it, and they’re protected at the same time. You’re going to yield a stronger solution and have no issues communicating the risks of breaches.
Yevgeniy: Exactly. When you talk about the risks of aggregation, you just defer to the well-understood mathematics behind differential privacy. By well understood, I don’t mean that the legal and compliance teams have to understand it completely.
It’s the researchers who are constantly trying to find flaws in various security techniques that believe differential privacy works. Here you are essentially standing on the shoulders of giants and saying, “I’m leveraging this technology that the best minds in the world believe to be secure.”
Jon: Thanks. I liked your reference to standing on the shoulders of giants. During the development of the principles of security first, you focused on leveraging existing, well-known technologies where appropriate and not reinventing the wheel. It’s good to see you come back to this.
As we wrap this up, are there a couple of Security First principles that you can highlight?
Yevgeniy: Yeah. I would say there are a few principles that are relevant. We can look at number three, which is creating new value through security and privacy, and number eight, which is designing systems to reduce the impact of a compromise.
If you’re thinking about creating new value through security and privacy, there’s no better example I can think of than differential privacy. You’re enabling aggregation of data across data sets that you’re not allowed or may not feel comfortable aggregating, because you’re using the right security controls, in this case, differential privacy.
That leads, potentially, to things like solving the cold-start problem, where you have a new customer that has no data and you want to use everyone else’s data to help them out. It helps increase everyone’s accuracy because your overall data set is much richer than any individual customer’s.
You’re getting all sorts of benefits and it’s because you’re doing the right things to protect the data. Then, if we want to think about the impact of compromise, differential privacy is a good example, as well.
We want to design our products in a way that if we do get hacked, the damage is limited. If you think about what happens if a product that uses machine learning gets hacked, the parameters or the trained models are leaked and revealed to the hacker.
If those parameters contain sensitive information, that sensitive information has been disclosed and may trigger both a compliance-level breach, with notification obligations, compensation, and losses, and the simple fact that you’ve hurt your customers.
By using differential privacy, leaking the machine learning model parameters doesn’t actually disclose any of the original data, because it’s provably impossible to reverse engineer these models back to the data. You’ve reduced the impact of a breach in this case.
Jon: Machine learning and artificial intelligence are already changing the software game significantly. Differential privacy looks to be one technology that will help improve the results a business can deliver to its customers, with some confidence that it will protect the most valuable asset of all: people’s data.
Thanks for listening. We’ll be sure to bring you more on this topic. Check out our other pieces on security first or other thesis areas on our website georgianpartners.com. For the Impact Podcast, I’m Jon Prial.