Vulnerability Research in the Age of AI

The research process

Vulnerability research can, approximately, be split up into a number of discrete stages:

Ideation: This is the first stage, in which we identify something worth looking into, whether an ecosystem, an industry, or a single project. Usually, at this stage, there is some expectation of impact for the things we find, and all good researchers want to identify the highest-impact vulnerabilities they can.
Literature review: Also known as 'prior art,' this is a stage of reviewing what has already been done for the topic at hand. We look at which areas have already received attention, which are secure, and which haven't had the level of inspection we may be able to bring.
Finding stuff: This is the bulk of a research project, in which we actually perform the grunt work of exploring the targets. This includes identifying attack surfaces, risks, and, hopefully, new high-severity vulnerabilities.
Proving it: A hunch, a line of code, or a suspicious-looking control flow is nothing if you cannot prove impact. In this stage, we take what we've identified in previous steps and devise a proof of concept: some kind of script or other payload that shows off the impact of what we've identified.
Reporting and fixing: With a completed project and hopefully some juicy proofs of concept, the final steps are to report to vendors and get the bugs fixed. Hopefully, along with your proofs of concept and writeups, you're also including patches or suggested mitigations for what you've found.

AI, in all its forms, be it chatbots, agents, or autonomous hive minds, has varying degrees of success when it comes to the different stages of a research project. In the following sections, I will discuss each stage in turn, covering my own successes and failures in these areas, where the future might be, and how we can still retain our roles in the process.

1. Ideation

This is the first step of any research project and, I think, the biggest moat for human researchers. Sure, if your only goal is "I want to find some XSS to get some bug bounty payments," then ideation is just a matter of listing the bug bounty programs by average payout. But if you want to perform novel, interesting, or even cool research, then a little more thought is necessary. Deciding what to target and how approaches an art form.

An LLM can help here, for sure. LLMs are great at summarizing the state of the art of a problem of interest, often from a prompt of just a few words, cutting down what used to take hours of Googling. They can be a very quick sounding board for ideas to throw against the wall and see what sticks. They can also be useful for finding tangents, derivative work, or other offshoots from your initial idea.

This is what I referred to in the intro as 'shifting left'. In the upcoming sections, we'll explore the impact of LLMs in improving research velocity, including feasibly having a whole pipeline of AI agents that do everything for you, but the initial steps, the planning, and the creativity will always require someone to kick it off.

2. Literature review

Once you have the kernel of an idea, LLMs can help refine the goals. Given a vague initial idea, "I want to look at technology X", something like Gemini's Deep Research can take that away and give you a deep dive into where that technology is used, what kind of impactful vulnerabilities might exist, and maybe where to focus your initial energy. The output I receive also includes a literature review and can identify (to some extent) the gaps. I've had some success here in having the LLM compare existing work with new features and documentation to find more recent attack surfaces than those previously explored.

3. Finding stuff

The area with the most complexity, and as a consequence, the area with the most variation in experience, is what I've called the 'finding stuff' phase. The actual 'how' of security research is an enormous topic, one that I could not thoroughly cover in three times as many words as this article. There are, however, a few key areas that are worth covering.

Direct discovery

The first part I wanted to discuss is what I have (just now) coined 'direct discovery'. This is where you tell your LLM to go find you some vulnerabilities in a particular codebase, and it will go away and perform a code review on the whole repository. Hopefully, it will then come back with a list of findings it has identified. This is what things like Claude Code Security do.

LLMs are okay at this. There are, of course, stacks of optimizations you can apply to make things a little better, things like subagents, limiting your research to one vulnerability class at a time, or limiting yourself to a single area of the codebase at a time, for example. But fundamentally, due to things like limited context windows, LLMs (in my experience) have only really had strong success where a vulnerability is relatively well constrained.

This works well for many relatively shallow vulnerabilities, like SQL injection, where you're appending user data to statements, or XSS, where you're returning user data without appropriate encoding. It also has some success with more context-dependent vulnerabilities, which historically have been quite tricky to scan for due to differences across codebases. An example of this is missing authentication and similar issues, where what authentication looks like is hard to determine statically but easy for the LLM to infer. Where an LLM often falls down is in wider-ranging vulnerabilities, such as those in business logic that require looking at completely disparate parts of the codebase.
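To make the 'shallow' patterns concrete, here is a minimal Python sketch of the two bug shapes described above, each next to a common fix. The schema, function names, and payloads are all hypothetical and exist only for illustration; the code uses nothing beyond Python's standard sqlite3 and html modules.

```python
import html
import sqlite3


def make_db() -> sqlite3.Connection:
    """Build a throwaway in-memory database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "a-secret"), ("bob", "b-secret")],
    )
    return conn


def find_user_vulnerable(conn, name):
    # SQL injection: user data is appended directly to the statement.
    query = "SELECT name, secret FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def find_user_safe(conn, name):
    # Fix: a parameterized query keeps user data out of the SQL text.
    return conn.execute(
        "SELECT name, secret FROM users WHERE name = ?", (name,)
    ).fetchall()


def greet_vulnerable(name):
    # XSS: user data is returned without encoding.
    return "<p>Hello, " + name + "</p>"


def greet_safe(name):
    # Fix: HTML-encode user data before placing it in markup.
    return "<p>Hello, " + html.escape(name) + "</p>"
```

Both bugs are visible within a single function, with the user-controlled value and the dangerous sink a line or two apart, which is roughly what 'well constrained' means here.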

As a consequence of using LLMs to find the 'low-hanging fruit' and do the grunt work of reading through all the code, a researcher is less likely to become thoroughly familiar with a particular codebase. That familiarity is extremely beneficial when it comes to finding the more complex, high-impact, and deeply ingrained security vulnerabilities.

In many cases, there is significant overlap between what an LLM can identify and what a traditional SAST scanner can identify. And if that is the case, why not go with a SAST scanner, which will be faster and cheaper than paying for an LLM by the token?

However.

I firmly believe this is the area where LLMs are improving the fastest, and I fully expect that what I've written here will no longer hold true in the next few years. Context windows are getting larger, tooling is getting smarter, and as soon as an LLM can hold your entire project in its effective context window, vulnerability researchers might be cooked.

When an LLM can reason across an entire code base at once, deep-seated business logic vulnerabilities that currently need a human's time and understanding can be spotted in an instant by an LLM. SAST products will still naturally have a place for consistency, reliability, and confidence in results. Still, LLM-based scanners may also be an important part of the stack for the more wide-reaching and deeply complex vulnerabilities.

I cut my teeth on complex, multi-system business logic vulnerabilities, with nothing more than a very strong understanding of a particular codebase. But the looming specter of LLMs that could reproduce that in a fraction of the time is certainly a game-changer, and it doesn't seem like it will be that long until my entire 7-year back catalog can be reproduced in a day.

Whilst the fun may be leaving for the researchers themselves, this is an absolute boon for application owners, who can find and secure more complex vulnerabilities than ever before.

Indirect discovery

The complementary area to 'direct discovery' is indirect discovery. This refers to second-order discovery: work where you have to build something first before the research begins in earnest, such as fuzzing harnesses.

LLMs are amazing at this. It's just writing code after all, exactly what LLMs are designed to do.

Recently, as an experiment, I turned to Claude Opus for a project I had always wanted to fuzz, but where the time investment to get a harness up and running had never seemed worth it given the likelihood of zero findings. Using the Cursor CLI, I asked it to write a fuzzing harness for Honggfuzz, targeting a particular part of the wider application. I then went and made a coffee. Upon my return, the LLM had cloned the repo, analyzed the project to identify exactly what I cared about, and written a fuzzer that targeted it. It had even written a set of scripts to validate that the fuzzer was hitting the right functionality, and a Makefile for me. Its final instruction was to just run 'make fuzz'. Lo and behold, it worked perfectly, the first time. It had created a complete, complex harness in minutes that would have taken me at least a day or two.
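For readers who have not written one, the shape of a harness can be sketched in a few lines. To be clear, this is not the harness from the experiment, and it is not how Honggfuzz is driven: real harnesses for it are typically native code, and the fuzzer supplies coverage-guided inputs. The toy below is a plain random-input loop in Python, with a hypothetical target (parse_record) and hypothetical helper names, meant only to show the idea of one entry point that feeds bytes to exactly the code you care about.

```python
import random


def parse_record(data: bytes) -> int:
    """Hypothetical target: a length-prefixed record parser with a bug."""
    if not data:
        raise ValueError("empty input")  # expected, well-defined rejection
    length = data[0]
    body = data[1:]
    total = 0
    # Bug: the declared length is never checked against the real body size.
    for i in range(length):
        total += body[i]  # IndexError when length > len(body)
    return total


def fuzz_one(data: bytes) -> None:
    """Harness entry point: feed one input to exactly the code under test."""
    try:
        parse_record(data)
    except ValueError:
        pass  # documented rejections are not findings


def run(iterations: int = 200, seed: int = 0) -> list:
    """Toy driver: random inputs, collecting any that crash the harness."""
    rng = random.Random(seed)
    crashes = []
    for _ in range(iterations):
        size = rng.randrange(1, 17)
        data = bytes(rng.randrange(256) for _ in range(size))
        try:
            fuzz_one(data)
        except Exception:
            crashes.append(data)
    return crashes
```

Calling run() returns the inputs that triggered an unexpected exception. The real engineering effort, and the part the LLM saved me, is everything this toy skips: building the target with instrumentation, isolating the interesting functionality, and checking that inputs actually reach it.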

This power also makes it trivial to create tooling that would otherwise not be worth the time. Honggfuzz, as a fuzzer, is great, but it takes a second to read progress from its UI, especially progress over time. Generally, this isn't a major problem, but to extend my experiment, I asked Opus to create a dashboard for me with graphs and statistics for the fuzzing job. Specifically, I was interested to see the delta-pc, or the change in new code blocks visited over time. It's useful to know that the fuzzer is still finding inputs that exercise new code paths, and it's nice to track how it slows down over time.
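The statistic itself is simple arithmetic. As a minimal sketch, assume you already have periodic samples of (elapsed seconds, cumulative code blocks visited); how those samples are collected from the fuzzer is deliberately left out, and both the sample format and the numbers below are made up for illustration.

```python
def coverage_deltas(samples):
    """Return (elapsed_seconds, new_blocks_since_previous_sample) pairs.

    `samples` is a list of (elapsed_seconds, cumulative_blocks) tuples,
    sorted by time. This format is an assumption made for the sketch.
    """
    deltas = []
    for (_, prev_blocks), (t, blocks) in zip(samples, samples[1:]):
        deltas.append((t, blocks - prev_blocks))
    return deltas


def seconds_since_last_new_block(samples):
    """How long the fuzzer has gone without reaching a new code block."""
    last_growth = samples[0][0]
    for t, delta in coverage_deltas(samples):
        if delta > 0:
            last_growth = t
    return samples[-1][0] - last_growth


# Made-up samples: fast discovery early on, then a plateau.
samples = [(0, 0), (60, 850), (120, 1100), (180, 1130), (240, 1130), (300, 1130)]
```

Plotting the output of coverage_deltas(samples) gives the slowing-down curve described above, and a growing value from seconds_since_last_new_block(samples) is a reasonable hint that the job has plateaued.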

I am confident there are some amazingly built wrappers around Honggfuzz which will show these graphs amongst many others, and if I did fuzzing a lot more than I do, I would definitely lean towards looking into them. For this experiment, though, given the very low time investment I wanted to make, finding and setting up a project like that was not worth it. What was worth it was to describe to Opus what I wanted to see, and then go to a meeting. When I came back to it, I had a perfect web interface that showed me exactly what I wanted to see. Sure, it was ugly, and the code was terrible, but it took me 30 seconds of 'effort', and a little bit of waiting (while I did other things) to achieve something that improved my quality of life.

LLMs may not be able to replace well-built tools in minutes, but they can definitely replace tools where quality isn't that important. This is a massive force multiplier for research, as you can explore more paths, faster, with more depth than ever before. Have an idea that you know will take a decent amount of engineering effort, but is a very long shot? Have an LLM do it. Have a slight annoyance that isn't worth looking up an existing solution for, but whose fix would fractionally improve your quality of life? Have an LLM do it.

4. Proving it

You can do all of the coolest research in the world, but if all you have is a suspicious line of code or a hunch that something is vulnerable in a specific way, nobody will believe you. All good research comes with a strong Proof of Concept (PoC), which demonstrates the vulnerability and impact you've identified. LLMs have essentially solved this in my experience.

Once a vulnerability is found, and you've got enough detail to describe it, LLMs excel at producing a working proof of concept to show off the vulnerability.

As part of a different experiment I attempted recently, I used a particular language interpreter that I knew published its vulnerabilities as issues, with enough detail to reproduce them. I tasked Claude Opus with picking one of the memory corruption vulnerabilities that did not require any external dependencies (or even any imports) and creating a working proof-of-concept that demonstrated successful exploitation.

The LLM successfully browsed the issue tracker, found a previously reported memory corruption vulnerability, cloned the repository, and got to work. After an hour or so, it had a complete script that exploited a UAF vulnerability. It had elected to build the interpreter with AddressSanitizer (ASAN) and prove the UAF there. I tasked it to go further and get rid of ASAN, extending the proof of concept to show control over the instruction pointer. This took a little longer because it had to investigate the interpreter's memory model, perform heap grooming, and find a complete, valid payload. Nevertheless, after a few hours, I had an updated script which, when run under a debugger, showed a crash with RIP at 0xcafef00d.

I had a little less success going from this stage to full shellcode execution, mainly because I gave the LLM some difficult-to-work-with constraints (no ROP chains), but it still successfully made it to calling system("/bin/sh"), which is significant. I think the shortfall is partially due to some of the early choices it made, which kneecapped it slightly.

I am confident that I could have done a better job than the LLM and achieved full shellcode execution without ROP chains. But I am also confident that it would likely have taken me a week or so (I'm pretty rusty with UAF). The LLM was able to complete the work with about 2 minutes of my prompting and a few hours of running. This is unbeatable. I doubt even the best UAF exploiters in the world would have been able to achieve the same in under a few hours of active work.

I love writing proof-of-concept code, especially for complex vulnerabilities, which present a really interesting challenge and puzzle. But often, at the end of the day, we just need to prove the vulnerability, not have fun, and so LLMs have trounced security researchers here also.

5. Reporting and fixing

You've had your idea, you've done your research, and you've created your PoC (or an LLM has done it all for you). You still need to communicate the findings to the vendor/maintainer/whoever to get them resolved.

Somewhat unsurprisingly, LLMs can usually do a pretty good job of writing up a vulnerability, its root cause, impact, and suggested mitigations. This can be a massive time saver. However, until now, the impacts of all LLM output have fallen directly to the researchers who chose to use the LLMs. Reporting changes that. The report will likely end up in the hands of someone who did not employ the LLM, may not be involved in the research work, and, in the case of open source projects, may have very little time to respond to such a report.

It is therefore absolutely critical that a researcher respects the time of the person they are reporting to and ensures that any report they send is completely accurate. This means fully proofreading the writing and understanding both the vulnerability and the system in which it is contained. In my experience, when LLM output requires this level of verification, it is often faster to just write it myself in the first place, to be absolutely sure it's accurate and meaningful.

With LLMs being so good at writing and modifying code, I think it behooves us as researchers to also deliver complete patches alongside our vulnerability reports. Similarly to the above, it's still absolutely necessary to review the LLM-built patch, but with a strong understanding of what you're reporting, born of writing the report yourself, it should be a pretty quick job to review and confirm effectiveness.

Conclusion

LLMs have changed the face of vulnerability research forever. Whilst they may have gaps currently, areas where they are less effective than a dedicated person, those gaps appear to be shrinking rapidly, and before long, we'll all have AI agents finding CVEs all day long.

In some ways, this is a shame. I love security research, and I particularly love writing proofs of concept, but it is reaching the point where focusing on those areas is not a particularly good use of time. In a much more important way, though, this is a boon for security as a whole. Vulnerability research is becoming more and more approachable by anyone, and with more eyes on everything, more vulnerabilities will be found, and applications will become more and more secure. That being said, an over-reliance on LLMs can lead to shallower analysis, and an individual security researcher should still strive to learn the relevant skills themselves to be able to solve the more complex research problems.

But where does this leave security research teams?

Dedicated research teams can pivot from the classic vulnerability research, where the output is lists and lists of CVEs, to more holistic security research. Why find an XSS, which an LLM could do for you, when you could find the next XSS: a new class of security vulnerability that applications are vulnerable to and that LLMs have never seen before? Additionally, with the cost of building things shrinking by the month, security research teams can use their deep security expertise to find new ways to detect and protect against vulnerabilities. Here in Snyk's Security Labs team, we are striving to do both: we want to find new vulnerability classes, new impacts, and new exposure in the enormous amounts of software being produced today, and we want to use our security expertise to ensure that organizations and their applications can be appropriately protected.

Vulnerability research is not dead. Classic vulnerabilities like XSS and SQL injection have been solved problems for decades, yet they still pop up. There will always be a long tail of known, solved vulnerabilities. But the era of security research teams finding the same vulnerabilities is over, and we must strive to push forward where LLMs cannot, and leave it to the LLMs to find the known vulnerabilities.