Modern Cyber with Jeremy Snyder - Episode 32

Sounil Yu of Knostic on AI

In this episode of Modern Cyber recorded earlier in 2024, Jeremy sits down with Sounil Yu, co-founder of Knostic.ai, to discuss the growing implications of artificial intelligence (AI) in cybersecurity.

Sounil Yu of Knostic on AI

Podcast Transcript

00:08
All right. Hello, and welcome back to another episode of the Modern Cyber podcast, brought to you by FireTail, leaders in API security. Find us online at firetail.io. Please remember to rate, like, subscribe, share with your friends, et cetera. I am delighted to be joined by somebody who really doesn't need much of an introduction; especially if you work in the cybersecurity space, I'm sure it's a very well-known name. I am joined today by Sounil Yu. Sounil is the co-founder at Knostic.ai, spelled K-N-O-S-T-I-C dot AI.

00:35
Previously, Sounil was the CISO and head of research at JupiterOne, as well as chief security scientist at Bank of America, among other cybersecurity leadership roles. Sounil is a well-known name, as I said, in the cybersecurity space. He created the Cyber Defense Matrix and the DIE triad, which are reshaping approaches to cybersecurity. I know I find them incredibly informative and helpful in terms of shaping some product direction and things that we're doing over here at FireTail. I hope all of you are actually referring to them.

01:03
Sounil has an MS in electrical engineering from Virginia Tech, a BS in electrical engineering, and a BA in economics from Duke University. Just by way of disclosure, before we dive into today's conversation, Sounil Yu is an advisory board member at FireTail. Sounil, all of that is really the story of, I think, a long and noteworthy career. I thank you so much for taking the time to join us today. Thanks for having me. Awesome.

01:27
I want to talk today about something that I know is on the mind of pretty much everybody as we sit here in early 2024, and that is AI. I kind of made the joke recently at a talk that I gave at a meetup, which was that I think we're legally required to mention AI in pretty much every conversation that we go into at this point, and especially some of the unknowns that I think are filling people's minds.

01:56
Just like the early days of cloud, when people were worried, oh, what is this cloud thing going to do? Is it safe? Is it secure? Can we trust it, et cetera? It feels to me very much like we're in a similar position with AI. How does it feel to you? Well, there's a degree to which it feels like the Groundhog Day cycle where we're waking up to what appears to be a repeat of what we've already been through.

02:19
But at the same time, there are some substantively different challenges that we're facing here. And so I've actually thought about this quite a bit in terms of what are the patterns that we've seen in the past? And do these patterns apply going forward? Do these patterns actually help us see the shape of the future and help us then define the problem space in a systematic way? So yeah, I've actually thought about this quite a bit. And I think when it comes to the main differences, though, it is one of impact and scale.

02:48
And this is why I think a lot of folks have concerns around the existential threats associated with it. There are existential threats with AI in general, but also interesting things happening with generative AI, and that's really what's causing most of the enthusiasm that we're seeing today. What are some of those patterns, and where do you see them kind of being

03:13
the same as in the past, or different from what we've seen with new waves of emerging technology over the last, you know, 15, 20 years? Yeah, well, actually, here's an example. First of all, I operate in models, and you kind of mentioned a couple of them earlier. So there's a model that I use that I didn't come up with, but I think I'm one of the first to apply it to this space. The model is called the DIKW pyramid. It stands for data, information, knowledge, wisdom.

03:43
So it's a pyramid, and just imagine this pyramid as you go up. Yep. It's almost like an OSI model. Yeah, kind of like the OSI model. Well, what we're seeing is this progression of technologies that have allowed us to take advantage of data, take advantage of information, and now tools like ChatGPT allow us to take advantage of knowledge. As we try to define the problem space for knowledge, for ChatGPT, this pyramid actually provides a good anchor for understanding the rest of the problem space.

04:13
What are the things that we dealt with for data? Okay. Data security, data quality, data governance, data lineage, data provenance, all these things, right? Data privacy. Well, these words, these suffixes that we apply can apply to information as well, information security. So then what exactly is knowledge security? Or let's look at data quality. Well, you know what a data quality issue probably is. Yeah, yeah, yeah. What do you think is a knowledge quality issue?

04:43
That's a really interesting question. Because, I mean, to your point, on the data quality side, I think about bad data, incomplete data, duplicate data, things like that. And I always tend to think of that pyramid model that you mentioned; I worked on a big data project years ago where the tagline for the project was turning data into information. So from what you described in that pyramid model, I would think that you have more data than you have information, and then you have

05:12
more information than you have knowledge, right? The pyramid gets more narrow as you go towards the top. So a knowledge quality issue would be if I have a piece of knowledge, or a piece of information that has turned into knowledge, that is potentially bad. It's bad in one way because it's been reinforced from multiple sources, and so we've kind of accepted it as knowledge,

05:39
but there's a fundamental flaw, maybe, in the assumption or in the information that led to this knowledge. And I can think of one other scenario, which is that the knowledge was applicable up to a point in time, but there's a change in landscape that makes the knowledge no longer applicable going forward. But maybe that's not what you had in mind. I'm curious. Yeah, and that may map to a different word we associate with data. So when you have aged data, you have aged knowledge.

06:08
Right? Yeah. Yeah. When we talk about data quality, you talked about, you know, fields in the wrong place or duplicate data, whatever else it is. But when we talk about knowledge quality in the world of ChatGPT, knowledge quality would be something like hallucinations. OK. OK. So this allows us to then see the shape of the future. And we say, OK, as we look at the problem space, it becomes clearer to me that we're going to face

06:37
similar issues to what we've seen in the data and information world, and they now match a pattern for the knowledge world as well. But I'll give you another example. Let's take data privacy versus knowledge privacy. Well, what's the difference? Well, back, I don't know, a dozen years ago, we had a situation where Facebook released something called the Social Graph. Yeah. And it was amazing. It was great, because you could find out all these interesting things about your friends, like

07:06
what movies they liked, what music they liked, what pizza toppings they liked, what sexual orientation they were, and what political preferences they had. Right. Things that Cambridge Analytica abused. What was abused wasn't your data privacy. What was abused was your knowledge privacy. And it was substantively different and, frankly, more private, right? A deeper thing than just your data privacy. I mean, who cares about your social security number?

07:35
If you're revealing things that are deeper in nature about who you are, then that's a bigger violation of privacy. That's interesting because even going back to that pyramid analogy, then I would think that the way Cambridge Analytica kind of extracted or kind of betrayed the knowledge privacy was by scraping a bunch of the data and then transforming it into knowledge either from their side or via the graph.

08:04
Or by inference. Or by inference, okay, which is maybe what I mean by the graph, but okay. And what are these LLMs? They are, yeah, they're inference models. Yeah. This is a really interesting point. When you think about the problems that this creates, aside from, let's say, bad knowledge or potentially bad knowledge,

08:29
what are some of the other issues that you think are real? I mean, you mentioned that hallucinations are something that could be a source of bad knowledge. I would tend to think that a good system should have a human in the loop for any emerging technology before it's trained and can be trusted to a sufficient level to be operational. But I know that that's not always the case. I know some organizations kind of rush to adopt new technology super quickly. And there are,

08:57
you know, real problems of data volumes and data volume management that are beyond human capacity. So I get that as well. But what are some of the other issues? Because we hear about hallucinations. We hear about the potential leakage of intellectual property when there's a model that's kind of a hybrid, a use of a public LLM supplemented with organizational data. What are some of the things that you think are real problems, and some of the things that you think are not real problems? So I'll start with the not-real problems.

09:25
There is a common misconception, as you pointed out, that if I use something like a public ChatGPT service and I upload all my corporate content up there, then all of a sudden someone else can query ChatGPT and get that content. And the way I think about this is, I'll use an analogy here. So imagine you're a geography teacher. To prepare for your geography class, you pull together all the facts of the world and

09:54
get ready for your class, and you walk into your first day of class and realize all the students are flat-earthers. Okay. So in the homework assignment that you give, the students transmit to you what they consider to be their sensitive, proprietary information about all these conspiracy theories as to why the world is flat. Now, do you, as a professor, take that homework, all the content from the homework assignments, and use it to teach next semester's class?

10:23
That would be poisoning your own truths, right? But what you would do is fine-tune your syllabus, fine-tune your curriculum, so that you're addressing the misconceptions of these students, right? So we hear this notion that OpenAI and others are fine-tuning their models based on our inputs, but that's the type of fine-tuning I would look at it as. It doesn't mean that next semester's class is going to all of a sudden hear about

10:52
all these stupid flat-earther theories. Rather, for this particular class, I'm going to fine-tune it for the situation at hand. Now, this doesn't deobligate the professor from having proper security on the briefcase that holds all the homework assignments. It doesn't mean the professor can negligently leave all the homework behind in the teachers' lounge so that other employees can come in and see that content. So

11:22
we still should expect obligations from the foundation model providers to secure their application and to make sure that their employees can't just poke into people's content and pull that out. And we certainly know, no question whatsoever, that these foundation model companies are a nation-state target, right? So, of course, we should expect them to be targeted as such and to have the proper controls on the back end so that if someone is poking around looking at

11:51
customer data, so to speak, then they have controls for that. So anyway, I'm not completely dismissing the concern, but the common misconception of, will the next person that asks ChatGPT get content from my stuff that I uploaded? That's just not how it works. Yeah. So that's an overblown fear that I think people have. Then there's another fear that I don't think people are fully aware of

12:20
that we're also seeing. And this goes back to my pyramid model again. We talk about what data security is, what information security is. What exactly is knowledge security? Well, to me, knowledge security is about how you control for need-to-know. So you start aggregating all this content within an LLM, and all of a sudden, again, it enables you to ask amazing questions about... Let's go back to the Facebook social graph question again.

12:49
I think it's entirely okay to learn about people's common interests around music and art and food. But does everyone have a need to know your sexual preferences? And you would say, no, that doesn't make sense. But how do you enable one side without enabling the other side? And the way to do that is to control for need-to-know. And we already have a way to do that today. Because if you go to ChatGPT and ask the question, hey, how do I build a biological weapon,

13:19
it basically says, I can't tell you, because you don't have a need to know. But that one-size-fits-all model won't work for an enterprise. We need ways to break that down in a more segmented manner. And that's the problem that I think we're going to face in the near future. And that, by the way, is the problem that Knostic is trying to tackle as well. Got it. So it's that kind of authorized access to elements of knowledge

13:46
that may or may not be authorized, right? And if I think about this from an organizational perspective, you can just think of the typical stuff, right? Due to your role or your level in an organization, you may or may not be privy to all of the company's financials. You should never be privy to other people's salaries and HR benefits and things like that. So is it this type of access to data slash information that you're thinking about? Well, it's also a richer language around how we describe access.

14:16
At the data level, we only have maybe a dozen attributes that describe access. At the next level, the information level, we have an even richer set of attributes, hundreds of attributes. At the knowledge level, we have thousands of attributes, and those attributes can be used to describe what you actually have access to from a need-to-know standpoint. Yeah. Today, we just have this coarse way of saying this is company confidential, but that doesn't really dictate whether or not you have a need to know.
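
To make that concrete, here is a minimal sketch of what a need-to-know check at the knowledge level might look like. It is illustrative only; the attribute names, topics, and policy shape are assumptions for the example, not Knostic's actual implementation.

```python
# Illustrative sketch only: a toy need-to-know gate in front of an LLM answer.
# The attributes, topics, and policies below are made up for the example.

from dataclasses import dataclass, field

@dataclass
class AskerContext:
    role: str
    department: str
    projects: set = field(default_factory=set)

# Each topic maps to a predicate over the asker's attributes that
# establishes (or fails to establish) need-to-know.
KNOWLEDGE_POLICIES = {
    "company_financials": lambda ctx: ctx.role in {"cfo", "controller"},
    "employee_salaries": lambda ctx: ctx.department == "hr" and ctx.role == "comp_analyst",
    "ma_target_analysis": lambda ctx: "project_falcon" in ctx.projects,
}

def answer_allowed(topic: str, ctx: AskerContext) -> bool:
    """Allow the LLM to answer only if the asker has need-to-know for the topic."""
    policy = KNOWLEDGE_POLICIES.get(topic)
    return bool(policy and policy(ctx))

# Same question, different askers, different outcomes.
engineer = AskerContext(role="engineer", department="platform")
deal_lead = AskerContext(role="director", department="corp_dev",
                         projects={"project_falcon"})

print(answer_allowed("ma_target_analysis", engineer))   # False -> refuse or redact
print(answer_allowed("ma_target_analysis", deal_lead))  # True  -> answer
```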

14:44
And we see this all the time, of course, in the national security space. Just because you have a secret clearance or a top secret clearance doesn't grant you need-to-know. Well, how is that actually established? It's entirely based on context and that language that I mentioned earlier. Yeah. It's interesting that you use the word attribute there in that context, because from the work that we do on the API security side, one of the challenges that we're seeing right now

15:09
is that I think a lot of apps and APIs are kind of reaching the natural limit of how far they can go in terms of fine-grained authorization with the classic RBAC model, right? Role-based access control. And it's the kind of thing where, like, Jeremy, due to his position and the groups that he's a member of and the departments that he's a member of, should have access to this data, these systems, et cetera,

15:35
except for also these things, because of some attribute, like he's part of a project team or an M&A initiative or whatever, and that doesn't really fit. And that's where we see organizations starting to move away from that. And even the cloud providers are providing a pretty rich set of tools to go beyond that, which I know actually frustrates a bunch of my friends who are cloud security practitioners, because dealing with IAM on a lot of the cloud providers' platforms is kind of a nightmare, as I'm sure you know as well.
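
As a rough illustration of the gap being described here, the difference between a pure role check and an attribute-aware one can be sketched in a few lines. The roles, permissions, and tags are invented for the example and don't reflect any particular cloud provider's IAM model.

```python
# Illustrative sketch: why pure RBAC strains under the "except for..." cases.
# Roles, permissions, and resource tags are invented for the example.

ROLE_PERMISSIONS = {
    "vp_sales": {"read:pipeline", "read:financial_summary"},
}

def rbac_allows(role: str, permission: str) -> bool:
    # Role alone decides; there is no way to express
    # "unless the document belongs to a deal team he isn't on".
    return permission in ROLE_PERMISSIONS.get(role, set())

def abac_allows(permission: str, user: dict, resource: dict) -> bool:
    # Attribute-based: the decision can also consider project membership
    # and the resource's sensitivity tag, not just the role.
    if permission not in ROLE_PERMISSIONS.get(user["role"], set()):
        return False
    if resource.get("tag") == "ma_restricted":
        return resource.get("deal") in user.get("projects", set())
    return True

jeremy = {"role": "vp_sales", "projects": {"project_falcon"}}
doc = {"tag": "ma_restricted", "deal": "project_osprey"}

print(rbac_allows("vp_sales", "read:financial_summary"))   # True, no nuance
print(abac_allows("read:financial_summary", jeremy, doc))  # False: not on that deal team
```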

16:05
It's really interesting; this challenge of managing access to data is, in a way, something that we've been trying to solve in cybersecurity for as long as I've been doing it. And it's the same problem over and over again, just in different flavors, because of the changes in the technology landscape over the years. Right. Right. And so that's that Groundhog Day pattern from the very beginning. We've seen this pattern before and it looks familiar.

16:35
But when you start digging deeper, it's like, whoa, this is definitely different. But at least I can anticipate what the problem space is. And that's been actually really helpful. But yeah, it's going to be up to us as practitioners, as entrepreneurs, to figure out how to fully understand the space and create solutions to address this next tier, this new level that we have. On the Groundhog Day side, I'm really curious to get your perspective on something, because

17:03
You know, I've spent a long time in cloud security, right? The last kind of seven years of my career, I include the API security work that we do now in that broader cloud security bucket because pretty much all modern API development is done on cloud platforms. We don't see a ton of API innovation on-prem. We do see some people with kind of, you know, VMware environments or what have you, sure. But what we saw with cloud security and cloud adoption and

17:29
kind of the early days of cloud when I started working at one of the cloud providers in 2010 was organizations would take one of two models. They would either take the kind of liberal model of enable the organization, give developers the access they need, whether that's credentials, keys, accounts, what have you. Some had guardrails, some didn't. And a lot of them kind of just, you know, set them free, go do what you need to do, innovate, move quickly, et cetera. Others were much more the

17:56
what I would call the CASB path, right? The cloud access security broker. You can't do anything unless it matches these patterns. And we saw all kinds of technical attempts to kind of constrain that through tools like CASB, through things like service catalog, et cetera. With new technologies or emerging technologies, what do you think is the right approach if there is one?

18:21
Well, I don't know if there's a right approach, but there are more opportunities for different approaches. Okay. And even in the context of AI and that knowledge domain again, we have new tools, a new language, to be able to express access, as we mentioned earlier. Yeah. And in the past, when we had a very limited set of attributes, or a limited set of

18:50
permits, I guess you could say, it limited our opportunity to allow the business to do what it needs to do. We are known as the department of no, because often it's a binary decision, yes or no. Yeah. But it's actually interesting in the knowledge space, because now we have an opportunity to do something different with knowledge, in that we can reshape knowledge. Things are more malleable at this new tier.

19:17
And that has its pros and cons. There are cons in that it seems a lot more non-deterministic. But the opportunity is to say, how do I ensure that if you need access to something, I still give you what you need, but only what you need and not more? Right. OK. And the way I equate that is to say, let's imagine that you're a

19:46
nuclear physicist and I'm a fifth grader who asks you about quantum field theory. Well, I wouldn't want you to tell me something that blows up my head with string theory and fourteen dimensions or whatever. But if I were a PhD candidate, I would expect you to share that with me, right? There's a certain level of need-to-know that could harm one individual but could help another. And...

20:15
I think we have to think about what the opportunity is that this new tier allows us, that we were unable to do in the past. And in Groundhog Day, at the end of the movie, he succeeds in figuring out how to break the cycle. Yeah. And I think, I don't know if this will allow us to break out of the cycle, but there are only four levels in the pyramid, the DIKW pyramid.

20:44
And I reserve the very top one, the wisdom one, for the human, for now. Yeah. But the nice thing is, I'm not sure where we go past that. And so, who knows, we might solve it after this. Yeah. There's another big question around AI that I get asked a lot. I don't know that I myself am smart enough to answer it, and I'm no data scientist, but one of the big questions I get asked is who has the upper hand with AI right now?

21:12
Is it the attackers or is it the defenders? And I can see arguments on both sides of the table. I know, for instance, in our own environment, when we stand up APIs for testing purposes for our own products, et cetera, for anything that we put public access on, we see traffic, we see probing, we see attempted attacks, typically within like three to five minutes. And that's stuff without a DNS name. That's just a random API

21:39
sitting on a random IP address in a random region of one of the cloud providers. And so, to some extent, hackers have cloud, hackers have automation, security by obscurity is dead. But we don't see a lot of break-ins. We see a lot of attempted break-ins. So there's some part of me that says, okay, they've not yet used AI to break into APIs. Maybe that's because they don't care, maybe because they just haven't incorporated it yet. But I do see the argument for AI enabling more

22:09
creativity, both more scale and more scope, in the types of attacks. But then on the flip side, there's a big part of me that says, actually, we're pretty much running on one hundred percent virtualized platforms right now. We do very little with hardware-based security anymore. All of security is data-based, whether that's telemetry data, whether that is, let's say, real-time calculations of authorized access or unauthorized access. It's all kind of a data science problem. And AI gives a huge amount of

22:37
firepower to the defenders to incorporate it into that. Because as I mentioned earlier, the data volumes are pretty much always beyond human comprehension. Where do you see the balance or where do you see the advantage if there is one? Sure. Yeah. So I do actually see an advantage and the advantage goes to the first mover. Okay. The defender is never the first mover. Yeah. The attacker beats the defender, but there is an entity that does move before the attacker.

23:06
and that's the builder. So the business is usually the builder. And to your point earlier, yeah, there's an amazing ability for these generative AI tools to be creative. And I think that creativity, channeled towards things like vulnerability research, is amazing. I think that is something that we absolutely should expect to see from both attackers, but also from the builders.

23:34
I think that the builders have a huge opportunity to have the upper hand by using generative AI tools to actually go find the vulnerabilities before the attacker does. And to me, that's a builder obligation, not a defender obligation. Because as the builder, you always have first-mover advantage, because you are the one building it. And if you can build it in a way where you leverage these AI tools to

24:02
clean up cruft that the attacker will eventually find using the same tools, then I think the business and the builder have the first-mover advantage and therefore have the advantage over the attacker. If the defender is coming in after the attacker with our own set of AI tools, then I think, yeah, the attackers are going to constantly win. Yeah. Yeah, I get it. There's a saying that I heard from a friend of mine who used to work on the red team side, and

24:31
they had a saying, which was, if you weren't able to access whatever system you were targeting, try again, try harder, be more creative. And I find that the creativity mindset is a really interesting one to bear in mind, but not one that I would say cyber defenders have a long reputation for. And, you know, I think a lot of defense up until now has been very rules-driven, and

25:00
potentially rightly so, because a lot of the time, most of the problems that we're looking for can be boiled down to a set of policies or rules that you don't violate unless there's business justification for it. Very often that's things like, you know, no open-access S3 buckets; as a rule, that's a very easy kind of rule to encapsulate and then codify into a product. But I do think, going forward, your point about

25:27
building that capability into defense mechanisms is a really powerful one. I want to shift gears for the last topic around AI for today, which is regulatory frameworks. We've seen the first wave of regulations around AI coming out of the EU. It tends to be that the EU is the first to unleash regulations on these things. If you can't innovate, you regulate, right?

25:55
You know, I do think there's innovation going on over there as well, but, you know, fair enough. I get the rationale for regulation. I may or may not agree with every aspect of the regulatory framework that they came out with, but I understand it. And I guess I have two questions for you. Number one is, do you think that that's the beginning of a wave of regulation that is actually going to proliferate in more geographies and in more industries, or

26:24
maybe we'll have some industry-specific regulation? And number two is, is regulation necessarily a good or bad thing? Because the concern that I have is that, okay, so we regulate it. That means we have to play by the rules as legitimate companies building legitimate products, doing legitimate business, but bad actors are not going to play by those rules. And we all know that. So take them in either order you like. Well,

26:49
some people, and I'm in the same camp, have equated AI with the building of weapons of mass destruction. And when Oppenheimer and team were planning for the Manhattan Project, they thought through more than just the bomb itself; they thought through the ramifications of what would happen across society as a whole:

27:20
not just what happens when you explode a bomb, but how are we going to transport it? How are we going to deal with the waste products? What's it going to do to society? They had to think through all these different things because it was such an existential risk, right? Now, there's debate amongst many as to whether AI is going to be an existential risk and whether it will ever even get there. But

27:49
I don't think anybody would say it's a zero probability. Most people would say it's a pretty distinct probability. And even if you think it's a very low probability, wouldn't you still at least want to take efforts towards safeguarding the human race from such potential harm? Right. And so I look at regulation in that sort of view and say,

28:19
we're going to have some really rudimentary efforts, and we're going to stumble through these things. And no doubt, it's actually going to be slower than the pace of innovation. The technologies that we have are going to vastly exceed the speed at which we roll out this regulation. And that's, I think, a real problem, which is why I also like not just the regulation, but also the code of conduct. So there was a code of conduct that was

28:47
released a few months ago that basically says, hey, companies and industries that are using these, that are building AI capabilities and products, we would like for you to adhere to this code of conduct. Right now it has no teeth, no enforcement behind it or anything, but at least it starts establishing some basis to say, this is what we think is...

29:13
I'm going to put these in air quotes because it's so loosely defined right now, but these are the things that are safe and ethical and responsible. And that's something that we're having to wrestle with, to understand: what do you mean by safe? What do you really mean by ethical? What do you mean by responsible? These words are very loosely understood. And we need to come up with a way to fully understand that before we can even think about truly enforcing some of these regulations. Because, you

29:40
know, how do you regulate safety when you don't even know what the definition of safety is? Yeah. But that said, yes, do we need safety regulations? We absolutely do. We just don't know what the parameters of those are going to be. So anyway, I'm in support of having regulation, or at least having the conversation about regulation brought to the table, because it causes people to pause and think and say,

30:08
what is it that we're actually trying to achieve here? What are we building and what's the real goal? Because, going back to the example with need-to-know, we jam all this content into this knowledge base and all of a sudden, again, it's super powerful like nuclear energy, but it's super dangerous like a nuclear bomb. Yeah, yeah. It's a really interesting point, because I think, to your point,

30:38
getting somebody to pause and ask what you're really trying to do here is something that comes up very frequently when we see things like the overreach of data collection from social media companies, ride-share apps, what have you. It's always one of the questions when they're testifying before Congress: why did you collect this data? And it's the kind of thing where nobody paused to ask, what are we trying to achieve here? They just collect it. And I mean, that's kind of a smaller-scale or smaller-scope example

31:08
than what could be achieved through LLMs and what is potentially happening with AI today. What about the second aspect of it: us behaving in an ethical, regulated manner and our adversaries not necessarily doing so? Does that cause you any concerns, or does that cause you to just kind of go back to, hey, use it in appropriate ways within what you're building and you'll always maintain the first-mover advantage?

31:36
Yeah, this is hard. And we see analogies here, whether it's with nuclear weapons or bioweapons and so on and so forth. And it's really hard for us to establish these norms and have people adhere to them. We do have norms for biological weapons. Are there bad actors still trying to build biological weapons? Absolutely. But we at least have this common understanding of, guys, let's not kill ourselves and the whole world in the

32:06
process of what we're doing here. The real challenge of artificial intelligence is that it's not just the government actors that are involved; it's every enterprise out there that's madly rushing ahead, optimizing towards profit. Yeah. Which is actually, let me just drop a quick note on that. There was a reason why OpenAI was initially set up as a nonprofit. Yeah. The reason why Anthropic is a nonprofit, okay, there's a reason.

32:35
I don't have the full details, but it's some of the concerns around OpenAI having a for-profit aspect to it. Does that drive the right incentives? And one would argue, perhaps it doesn't, right? It creates the wrong kinds of incentives and the wrong kind of motivation, moving towards a model that drives us

33:04
faster and faster towards extinction, versus doing it in a way that allows us to control that speed so that we don't have a massive car crash. Yeah. Well, on that uplifting note, I think we've covered a broad range of topics, and I think it's been a fascinating conversation. So on behalf of myself and all of our listeners, I thank you for taking the time to talk to us today. This has been Sounil Yu joining you on the Modern Cyber Podcast to talk about AI. Thanks.
