Esslingen University, Esslingen, Germany
Institute for Intelligent Systems, Esslingen University, Esslingen, Germany
Institute for Intelligent Systems, Esslingen University, Esslingen, Germany
Chair of Communication Networks, University of Tübingen, Tübingen, Germany
Esslingen University, Esslingen, Germany
Keywords: Software Vulnerability Detection, Vulnerability Detection, Software Engineering, Vulnerability Datasets, Benchmark Datasets, LLM, SLR, Survey
The increasing adoption of Large Language Models (LLMs) in software engineering has sparked interest in their use for software vulnerability detection. However, the rapid development of this field has resulted in a fragmented research landscape, with diverse studies that are difficult to compare due to differences in, e.g., system designs and dataset usage. This fragmentation makes it difficult to obtain a clear overview of the state-of-the-art or compare and categorize studies meaningfully. In this work, we present a comprehensive systematic literature review (SLR) of LLM-based software vulnerability detection. We analyze 227 studies published between January 2020 and June 2025, categorizing them by task formulation, input representation, system architecture, and adaptation techniques. Further, we analyze the datasets used, including their characteristics, vulnerability coverage, and diversity. We present a fine-grained taxonomy of vulnerability detection approaches, identify key limitations, and outline actionable future research opportunities. By providing a structured overview of the field, this review improves transparency and serves as a practical guide for researchers and practitioners aiming to conduct more comparable and reproducible research. We publicly release all artifacts and maintain a living repository of LLM-based software vulnerability detection studies.
Vulnerability detection plays a critical role in the software development life cycle, identifying security vulnerabilities before they can be exploited in deployed systems. The growing complexity of modern software, combined with an ever-increasing threat landscape, has led to a surge in reported vulnerabilities. The Common Vulnerabilities and Exposures (CVE) records [1] provide unique identifiers for publicly known software vulnerabilities. Over 40,000 CVEs were published in 2024 alone, with more than 12,000 reported in the first quarter of 2025 [2]. Complementing CVEs, the Common Weakness Enumeration (CWE) [3] offers a hierarchical classification of vulnerability types, serving as a higher-level abstraction for understanding underlying vulnerability patterns.
Vulnerability assessment pipelines as found in industrial development workflows typically consist of multiple steps: detection (identifying whether code contains a vulnerability), localization (pinpointing the exact location in the code), severity estimation (evaluating potential impact), and repair (modifying the code to mitigate the vulnerability), often initiated reactively when a new CVE is disclosed. Despite advances in tooling, e.g., static analysis tools, these processes remain largely manual or only partially automated due to the reliance on expert-defined rule sets and high false positive rates of such tools [4]. As such, existing workflows are insufficient to meet the scale of remediation demands.
At the same time, the adoption of Large Language Models (LLMs) for software engineering tasks has grown rapidly since the introduction of BERT [6] in 2018. LLMs are deep neural network models, typically based on transformer architectures [7], that are pre-trained on massive corpora of natural language, and they have demonstrated impressive capabilities in understanding and generating code. Industry forecasts further emphasize the transformative potential of LLMs. For instance, Anthropic CEO Dario Amodei predicted that by mid-to-late 2025, AI would be writing 90% of code, with developers primarily specifying design goals and constraints [8].
However, while LLMs hold promise in code generation, they also introduce new risks. Generated code often lacks awareness of existing libraries or internal codebases, which may lead to redundancies and the creation of syntactically different but semantically equivalent code. This complicates the detection of vulnerabilities, especially when they are subtle or semantically nuanced. Additionally, LLMs are prone to hallucinations and the generation of insecure code [9, 10, 11, 12]. Development teams may be unaware of vulnerabilities introduced by generated code, whether in their codebase, third-party components, or libraries.
The combination of LLM-generated code and the increasing volume of vulnerabilities calls for automated, scalable, and reliable vulnerability assessment pipelines. Recent research has begun to explore the use of LLMs for vulnerability remediation. In a prior study [13], we surveyed early work on LLM-based vulnerability handling, including detection, localization, and repair, and highlighted open challenges with regard to scalability, data diversity, and model reliability. We continued to extensively document the progress in this field, with a focus on vulnerability detection studies as a prerequisite for all subsequent remediation steps. As the number of studies on LLM-based vulnerability detection continues to grow, we observe a wide variety of system architectures, adaptation techniques, and evaluation methodologies. This heterogeneity leads to two pressing challenges: First, there is currently no comprehensive survey that maps the landscape of LLM-based software vulnerability detection methods, their system designs, and dataset usage. As a result, researchers face difficulties identifying trends, gaps, or best practices in this rapidly evolving area. Second, studies frequently adopt custom datasets, metrics, and data splits, making it difficult to assess progress, reproduce results, or compare models fairly. To address these challenges, we contribute this systematic literature review (SLR) of LLM-based software vulnerability detection studies. Specifically, the contributions are as follows:
The remainder of this paper is structured as follows: We review related surveys in Section 2, and describe the SLR methodology in Section 3. Section 4 presents the comprehensive taxonomy for LLM-based vulnerability detection. We extend this taxonomy and analyze datasets used in Section 5. We discuss current limitations and future research opportunities in Section 6. Finally, we conclude this review in Section 7.
With the growing adoption of LLMs across various domains, numerous works have emerged to explore their applications. We review related surveys that intersect most closely with the topic of LLM-based software vulnerability detection.
Coverage is indicated by the last four columns (LLMs, Input Repr., Adapt. Techniques, Datasets).

| Domain and Reference | Published | Time Frame | #Studies | LLMs | Input Repr. | Adapt. Techniques | Datasets |
|---|---|---|---|---|---|---|---|
| Cybersecurity | | | | | | | |
| Xu et al. [1] | Jul 2024 | 2020–2024 | 127 (17) | ● | ○ | ● | ● |
| Zhang et al. [2] | Dec 2024 | 2023–2024 | 300 (29) | ● | ○ | ◖ | ◖ |
| Ferrag et al. [3] | Jan 2025 | n/s | n/s (6) | ● | ○ | ◖ | ● |
| Software Engineering | | | | | | | |
| Zhang et al. [4] | Sep 2024 | 2020–2024 | 947 (71) | ● | ○ | ◖ | ○ |
| Hou et al. [5] | Dec 2024 | 2017–2024 | 395 (18) | ● | ◖ | ● | ◖ |
| Hu et al. [6] | May 2025 | 2007–2024 | 191 (30) | ● | ○ | ○ | ● |
| Vulnerability Detection | | | | | | | |
| Shiri Harzevili et al. [7] | Nov 2024 | 2011–2024 | 138 (4) | ◖ | ● | ○ | ● |
| Shereen et al. [8] | Dec 2024 | 2006–2024 | 79 (10) | ◖ | ● | ○ | ● |
| Shimmi et al. [9] | Jun 2025 | 2018–2023 | 98 (8) | ◖ | ● | ● | ● |
| Taghavi Far and Feyzi [10] | Jan 2025 | n/s | 119 (n/s) | ● | ○ | ◖ | ● |
| Basic and Giaretta [11] | Apr 2025 | n/s | n/s (n/s) | ● | ○ | ◖ | ○ |
| Zhou et al. [12] | Oct 2024 | 2018–2024 | 58 (40) | ● | ○ | ● | ◖ |
| Sheng et al. [13] | Feb 2025 | 2019–2024 | 58 | ● | ◖ | ● | ● |
| This survey | Jul 2025 | 2020–2025 | 227 | ● | ● | ● | ● |
In Table 1, we provide an overview of related surveys, with particular attention to their coverage of input representations, adaptation techniques, and datasets. Several surveys analyze the application of LLMs in the domain of cybersecurity [15, 16, 17], typically addressing vulnerability detection as one of several sub-tasks. Similarly, surveys in the domain of software engineering [19, 18, 20] categorize LLM usage by phases of the software development life cycle. Vulnerability detection is typically a part of software testing [18] or software quality assurance [19]. Consequently, these broader surveys provide limited insights into LLM applications specific to software vulnerability detection, such as task-specific adaptation techniques and dataset usage. Shiri Harzevili et al. [21], Shereen et al. [22], and Shimmi et al. [23] focus on the application of machine learning and deep learning techniques for software vulnerability detection. Their discussion of studies applying LLMs is limited, covering only four, ten, and eight studies, respectively. Taghavi Far and Feyzi [24] do not focus specifically on software vulnerability detection but consider a wider scope, including phishing detection, threat detection in logs, and patch recommendation. The surveys by Basic and Giaretta [25] and Zhou et al. [26] are more closely aligned with LLM-based software vulnerability detection. However, their scope extends to code security and vulnerability repair, providing only limited insights specific to software vulnerability detection. Sheng et al. [27] present an overview of LLMs applied to vulnerability detection, datasets, metrics, and techniques, focusing on code processing and prompt engineering methods. In contrast to prior surveys, this SLR provides a focused and up-to-date overview of 227 studies, presenting a comprehensive taxonomy of LLM-based software vulnerability detection. We classify existing work across detection tasks and objectives, input representations, system architectures, and adaptation techniques. This is the first comprehensive taxonomy for the research area of LLM-based software vulnerability detection, systematically analyzing state-of-the-art techniques.
A further distinguishing aspect of this SLR is its in-depth analysis of vulnerability detection datasets. High-quality, realistic, and diverse datasets play a crucial role in learning-based software vulnerability detection, prompting studies to analyze, classify, and evaluate their characteristics, quality, and impact on model performance. Previous surveys have touched upon datasets: Shiri Harzevili et al. [21] categorize dataset sources into benchmark, hybrid, open source software, and repository. Shereen et al. [22] compare frequently used datasets with respect to properties such as language, size, annotation, and granularity. Similarly, Shimmi et al. [23] list used datasets with respect to granularity and language. Ferrag et al. [17] provide an overview of datasets suitable for fine-tuning LLMs in security applications. Hou et al. [20] analyze benchmarks for evaluating LLM-based solutions, focusing on construction methods, programming languages, and metrics. Sheng et al. [27] discuss datasets used for LLM-based vulnerability detection, focusing on granularity and programming languages. In contrast to previous surveys that primarily list or categorize datasets by properties such as granularity, language, or source, this review offers a more comprehensive and structured analysis. We categorize datasets using a detailed taxonomy covering type, granularity, source, and labeling methodology, and investigate additional quality dimensions, such as class balance, CWE diversity, and distribution, offering an in-depth analysis of represented vulnerabilities. Further, we discuss emerging trends and use cases in dataset utilization.
Other related studies focus explicitly on vulnerability detection datasets: Lin et al. [28] examine dataset construction methodologies, comparing datasets across six dimensions, such as granularity, vulnerability type, and labeling approach. Guo et al. [29] analyze selected datasets in terms of vulnerability distributions and types. Jain et al. [30] provide a code-centric evaluation of commonly used C/C++ datasets for deep learning-based vulnerability detection. Moreover, the works [31, 4, 32, 33] provide in-depth analyses of quality issues in a few selected vulnerability datasets, such as data imbalance, low vulnerability coverage, biased vulnerability distribution, high duplication rates, and label noise. Despite this growing body of work, a targeted and comprehensive overview of vulnerability detection datasets commonly used for LLM-based software vulnerability detection approaches, along with their limitations and suitability, remains missing. This survey addresses this gap, offering a taxonomy and systematic analysis of datasets, their CWE coverage, diversity, and usage, thereby facilitating better comparability and benchmarking in future research.
We followed established guidelines by Kitchenham et al. [34, 35] for conducting SLRs in software engineering research. We investigate the following research questions RQ1–RQ4, cf. Table 2:

RQ1: How is the vulnerability detection task formulated and approached in LLM-based systems?
RQ2: How is input, particularly the semantics of vulnerabilities, represented and encoded for LLMs?
RQ3: What are the predominant techniques for adapting LLMs to the task of software vulnerability detection?
RQ4: What datasets are used to evaluate LLM-based vulnerability detection approaches, and how do dataset choices impact comparability across studies?

A joint investigation of RQ1–RQ4 is crucial to obtain a holistic understanding of LLM-based vulnerability detection research, from input representations to system design and evaluation, and to identify key levers to enhance robustness, generalization, and comparability. We conducted the SLR in three phases, i.e., (1) literature search, (2) study selection, and (3) study analysis, as detailed in the following.
| RQ | Research Question | Objective | Discussion |
|---|---|---|---|
| RQ1 | How is the vulnerability detection task formulated and approached in LLM-based systems? | This question focuses on how the detection task is formulated, and how this formulation shapes system architecture. | Section 4.1 and Section 4.3 |
| RQ2 | How is input, particularly the semantics of vulnerabilities, represented and encoded for LLMs? | This question investigates how code and vulnerability-related semantics are formatted and passed to the model, with a focus on how effectively vulnerabilities are represented. | Section 4.2 |
| RQ3 | What are the predominant techniques for adapting LLMs to the task of software vulnerability detection? | Here, the focus lies on identifying and categorizing state-of-the-art adaptation strategies, including prompt engineering styles, fine-tuning techniques, and advanced training paradigms. | Section 4.4 |
| RQ4 | What datasets are used to evaluate LLM-based vulnerability detection approaches, and how do dataset choices impact comparability across studies? | This question explores the characteristics of datasets used in current research (such as type, size, and label quality), and how they affect model generalization, benchmarking, and the overall comparability of experimental results across studies. | Section 5 |
To identify relevant studies, we conducted a systematic literature search across three major databases: IEEE Xplore, ACM Digital Library, and arXiv. We selected IEEE Xplore and the ACM Digital Library as they host peer-reviewed publications from conferences and journals. We further include arXiv for preprints to capture ongoing research that is still in the submission process, given the rapidly evolving nature of this field. We manually identified a set of relevant primary studies in our prior publication [13], which we used to derive keywords for an initial search string. We refined this search string using keywords from related surveys [19, 36]. We searched the selected databases for studies whose titles or abstracts matched this search string. Before the cutoff date in July 2025, the database search yielded a total of 552 studies from IEEE Xplore, 344 studies from ACM, and 2287 studies from arXiv. For reproducibility, the final search string is as follows:
Keywords: (LLM OR LLMs OR Large Language Model OR Large Language Models OR pre-train* OR GPT* OR ChatGPT OR T5 OR LLaMA* OR Codex OR BERT) AND ((vulnerability OR vulnerabilities OR CVE OR CWE) AND (detect* OR identif* OR classif* OR analy* OR discover* OR assess*) AND (software OR program OR code))
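To illustrate how such a keyword search can be executed programmatically, the following sketch queries the public arXiv export API with a simplified, hypothetical query string; the full boolean search string above must be adapted to each database's own query syntax, and the IEEE Xplore and ACM searches require their respective portals or APIs.

```python
# Simplified, hypothetical arXiv keyword search (not the paper's exact query).
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

query = 'abs:"large language model" AND abs:"vulnerability detection"'
url = ("http://export.arxiv.org/api/query?"
       + urllib.parse.urlencode({"search_query": query, "start": 0, "max_results": 50}))

with urllib.request.urlopen(url) as response:
    feed = ET.fromstring(response.read())

ns = {"atom": "http://www.w3.org/2005/Atom"}
for entry in feed.findall("atom:entry", ns):
    print(entry.find("atom:title", ns).text.strip())
```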
Following the initial retrieval, we applied a selection process to filter irrelevant studies. We reviewed the title and abstract of each study and screened it to assess its relevance based on predefined inclusion (✓) and exclusion (✗) criteria:
Given the inclusion of preprints from arXiv, we conducted a manual quality check to determine study eligibility. We assessed whether the studies provided a clear contribution, adequately described the proposed workflow and implementation, outlined the evaluation setup (including datasets, baselines, and metrics), and presented coherent findings that support their central claims.
Finally, we performed backward and forward snowballing. We manually reviewed the reference lists of the selected studies (backward snowballing) and used citation tracking via Google Scholar to identify and evaluate additional papers citing the included studies (forward snowballing).
Following the selection process, we identified 227 papers that met our selection criteria. The earliest included study dates back to 2020. The largest share of publications (106 studies, 46.7%) was published in 2024, followed by 54 studies in 2025, underscoring the rapidly growing interest in and significance of LLM-based vulnerability detection. Notably, 83 (36.6%) of the selected studies were sourced from arXiv, reflecting both the fast-paced development of this field and the prevalence of recent works still undergoing peer review.
During the full-text review, we extracted key data points, recording how the vulnerability detection task was defined, which LLMs were used, and what adaptation techniques were employed. Additionally, we extracted information on input representation formats, datasets used for fine-tuning and evaluation, targeted vulnerability types (e.g., specific CWEs), and evaluation metrics. This structured analysis enabled us to derive a comprehensive taxonomy of LLM-based software vulnerability detection approaches, address the research questions, and develop a deeper understanding of current trends, open challenges, and emerging directions in LLM-based software vulnerability detection.
Recent research has adopted a variety of techniques for detecting software vulnerabilities using LLMs. In this section, we present a comprehensive taxonomy that categorizes the different ways LLMs are applied to vulnerability detection. Specifically, we examine task formulations, input representations, system architectures, and adaptation techniques. The overall taxonomy is illustrated in Figure 1. We discuss each category on the basis of the surveyed studies in the following subsections.
LLM-based vulnerability detection can be framed in multiple ways, depending on how the task is formulated and what objectives are prioritized beyond detection. This section outlines and discusses the most common formulations, particularly as classification tasks, and additional objectives.
Vulnerability detection is most commonly formulated as a classification problem, differentiating binary classification, vulnerability-specific classification, and multi-class classification [22]. Binary Classification determines whether a given code contains a security vulnerability or not ("Yes"/"No"). Vulnerability-Specific Classification, as a refinement of the binary formulation, determines whether a given code contains a specific vulnerability, represented, e.g., by a CWE-ID. Multi-Class Classification determines which specific type of vulnerability the given code contains, often using CWE-IDs as class labels. This formulation may use a given pre-defined list of vulnerability types in a prompt engineering setting.
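To make the three formulations concrete, the following prompt templates sketch how each might be phrased; the wording and the candidate CWE list are illustrative and not taken from any specific study.

```python
# Illustrative prompt templates for the three classification formulations.
BINARY = "Is the following code vulnerable? Answer 'Yes' or 'No'.\n{code}"

VULN_SPECIFIC = ("Does the following code contain a {cwe_id} ({cwe_name}) "
                 "vulnerability? Answer 'Yes' or 'No'.\n{code}")

MULTI_CLASS = ("Which of these vulnerability types is contained in the code? "
               "{cwe_list}\nAnswer with exactly one CWE-ID or 'None'.\n{code}")

prompt = MULTI_CLASS.format(
    cwe_list="[CWE-79, CWE-89, CWE-119, CWE-787]",  # example candidate list
    code='query = "SELECT * FROM users WHERE name = \'" + name + "\'"',
)
print(prompt)
```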
The choice of task formulation depends on factors such as the application domain, the availability and quality of labeled data, and the surrounding tooling and infrastructure. The surveyed studies use one or multiple of these formulations. With 156 studies, binary classification remains the most prevalent task formulation, cf. [22], followed by multi-class classification with 90 studies. Only 27 studies [37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63] adopt a vulnerability-specific classification. Notably, Zibaeirad and Vieira [52] systematically apply and compare all three task formulations using different prompts.
Binary classification simplifies the detection task as it is reduced to a decision of whether a code sample is vulnerable or not. While this approach facilitates model training and deployment, it lacks fine-grained insight into the specific type or cause of the vulnerability, limiting its use in real-world vulnerability remediation.
Vulnerability-specific classification is more targeted, which is valuable in contexts involving high-impact CWEs or regulatory compliance. However, it depends on curated datasets and may struggle to generalize to related vulnerability types not present in the training data.
Multi-class classification provides the most detailed perspective by distinguishing among various types of vulnerabilities. This formulation aligns more closely with real-world vulnerability remediation, where identifying the exact nature of a vulnerability is essential for vulnerability patching. However, it introduces additional complexity in both modeling and evaluation. Its effectiveness also depends heavily on high-quality, CWE-labeled, and well-distributed datasets.
Some studies combine or cascade these formulations. For example, a binary classifier may act as a preliminary filter or router, forwarding likely vulnerable samples to a CWE-specific classifier for further analysis [51].
While classification is the primary formulation for vulnerability detection, it offers limited insight into the exact location or root cause of vulnerabilities. Consequently, several studies have extended the scope to incorporate additional objectives that align more closely with real-world remediation workflows. With the growing adoption of LLMs for Code Generation, several studies [64, 65, 41, 42, 66, 67, 68, 69, 55, 63] investigated the vulnerabilities present in LLM-generated code, often using the same LLM for both generation and detection (self-evaluation capabilities). Other studies extend the vulnerability detection task to include additional stages in the vulnerability assessment pipeline, such as Vulnerability Localization [70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 46, 47, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 54, 55, 92, 93, 94], Severity Estimation [75, 95, 96, 97, 98, 99], Vulnerability Repair [64, 41, 100, 101, 102, 75, 95, 42, 103, 104, 105, 66, 46, 106, 67, 84, 86, 107, 108, 89, 52, 97, 55, 57, 62], or Security Testing [67, 97, 59], e.g., validating the results by running the security tests generated [59]. In addition to classification outputs, many approaches prompt for natural language descriptions of the identified vulnerabilities (extending the classification into a generation task), typically referring to the CWE-type in multi-class classification. With the emergence of more advanced reasoning models and structured prompting techniques, studies [109, 74, 100, 110, 77, 45, 66, 46, 47, 111, 82, 85, 112, 86, 50, 107, 113, 114, 89, 115, 90, 53, 116, 117, 97, 54, 55, 57, 59, 93, 99, 61, 62, 118] shift towards Reasoning about root causes of vulnerabilities. For example, Steenhoek et al. [85] perform an error analysis on LLM responses, demonstrating that the models struggle to reason about the code semantics relevant to identifying vulnerabilities, especially subtle semantic differences caused by small textual changes. Wen et al. [93] propose forward and backward reasoning, where the forward reasoning aims to deduce the causes of vulnerabilities, while the backward reasoning seeks to understand the code changes in the fixes. Their results show that LLMs can reason about vulnerabilities but often lack the domain-specific knowledge required to effectively invoke such reasoning.
The developments in additional objectives reflect a broader shift in the role of LLMs from simple classifiers to end-to-end vulnerability assistants that support developers interactively throughout the software development life cycle. The presented objectives also create opportunities for multitask benchmarks that evaluate detection accuracy alongside, e.g., reasoning quality, localization precision, and severity estimation, offering a more holistic perspective on vulnerability remediation.
In the context of LLM-based vulnerability detection, the way source code and related information are represented as input plays a crucial role in model performance. Beyond plain software code, inputs may include natural language vulnerability descriptions or structured abstractions of code. These representations influence the model’s ability to capture both syntax and deeper semantic relationships relevant to vulnerability patterns. Building on the taxonomy introduced for vulnerability repair by Zhang et al. [36], we adapt it for the vulnerability detection task, categorizing input representations as raw, structure-aware, prompt, and conversation-style.
Raw input representations of code involve directly feeding source code into the model, without any additional annotations or transformations [119, 37, 120, 121, 39, 122, 123, 124, 109, 125, 72, 126, 127, 128, 129, 130, 131, 75, 76, 132, 133, 134, 135, 136, 137, 138, 139, 140, 103, 77, 141, 142, 143, 144, 145, 146, 147, 148, 78, 149, 150, 151, 152, 153, 154, 155, 156, 157, 46, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 84, 171, 172, 173, 174, 175, 96, 176, 177, 178, 179, 180, 181, 182, 183, 184, 113, 51, 108, 88, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 98, 200, 58, 201, 202, 203, 204, 205, 118]. This approach treats code as plain text sequences, relying entirely on the model’s pre-trained understanding of programming languages to detect patterns indicative of vulnerabilities. This representation is widely used for training or fine-tuning LLMs. However, raw code inputs may struggle to capture the deeper semantics and context necessary for detecting complex or subtle vulnerabilities. As such, this approach may underperform in scenarios requiring precise reasoning about program logic, data flow, or code structure.
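As a minimal sketch (assuming an encoder-style code LLM such as CodeBERT and a binary classification head), raw code is simply tokenized as plain text and classified; the head here is randomly initialized and would need fine-tuning before its predictions are meaningful.

```python
# Raw-text input: source code is tokenized like any other text sequence.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "microsoft/codebert-base"  # assumption: any encoder-style code LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

code = "void copy(char *dst, char *src) { strcpy(dst, src); }"
inputs = tokenizer(code, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits      # untrained head: illustrative output only
print("vulnerable" if logits.argmax(-1).item() == 1 else "non-vulnerable")
```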
Structure-aware representations enhance the model’s understanding of code by capturing deeper syntactic and semantic structures of code and vulnerabilities, thereby addressing the limitations of relying solely on textual patterns. Code is transformed into graph-based or other structured formats that expose control flow, data flow, and dependency relationships relevant to vulnerability semantics.
Common graph representations include Abstract Syntax Trees (ASTs) [111, 206, 207, 208, 70, 209, 210, 90, 211], Control Flow Graphs (CFGs) [212, 208, 213], Data Flow Graphs (DFGs) [214, 215, 45], Program Dependency Graphs (PDGs) [208, 74, 94], combining data and control dependencies, and Code Property Graphs (CPGs) [216, 217, 218, 219], which integrate multiple code views (AST, CFG, PDG). Some studies employ composite or multi-layered graph structures that integrate several of the above [220, 221, 222, 223], which can outperform other graphs [224], or combine multiple modalities including text, graphs, and visual representations [225].
In addition to graph structures, other structured inputs aim to isolate semantically relevant code regions, such as program slices [226, 73, 227, 228, 117, 91], e.g., meaningful blocks based on AST structure [117], or code gadgets [229, 230, 231, 232, 233, 234, 235, 92, 236, 237]. Code gadgets, in particular, consist of program statements that are semantically related to each other through data or control dependencies [238], isolating vulnerability-relevant code regions. Other studies include program traces, i.e., execution-aware code structure [239, 54]. Mächtle et al. [54] specifically build trace gadgets, i.e., trace the code before slicing. By making structural and semantic dependencies explicit, structure-aware representations enable models to reason more effectively about vulnerability-related behaviors, especially those that span across the codebase or require understanding of program logic and execution flow.
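As a small, hypothetical illustration of a structure-aware input, the following uses Python's built-in ast module to flatten a function into a sequence of AST node types; the surveyed studies typically rely on dedicated program-analysis tooling and richer graph encodings.

```python
# Serializing an AST into a token-like sequence, a lightweight structural encoding.
import ast

source = """
def read_user(path, name):
    data = open(path).read()
    return eval(data)[name]      # dangerous sink (illustrative)
"""
tree = ast.parse(source)
ast_sequence = [type(node).__name__ for node in ast.walk(tree)]
print(" ".join(ast_sequence))
```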
Prompt engineering has emerged as a widely used technique for steering LLMs towards effective vulnerability detection without requiring any modifications to the model’s weights. Prompt engineering formulates the task through strategically designed input prompts, typically consisting of a system prompt (providing general context or behavioral instructions) and a user prompt (specifying the task and providing the code to be analyzed).
Prompt content often extends beyond a basic task instruction. It may include vulnerability reports or CWE descriptions [240, 100, 241, 242, 112], function-level insights such as descriptions of program logic, control flow [86], or static analysis outputs [243, 47]. Well-designed prompts are crucial to ensure the model receives relevant context and is guided toward the correct reasoning path.
Some studies engage the model in a back-and-forth analysis, rather than issuing a single prompt for vulnerability detection [215, 244, 69, 245, 50]. This interactive, Conversation-Style setting enables the model to double-check and revise earlier predictions, aiming to mitigate as many vulnerabilities as possible through a process of progressive refinement [66]. Such settings support multi-step reasoning, for example, by discussing program analysis before initiating vulnerability detection [79, 245, 101, 62, 246]. Conversation-style interactions are particularly prominent in self-refinement workflows, where models are prompted to review and improve their outputs [69, 74, 79, 86, 247].
A notable technique using prompt-based inputs (apart from typical prompt engineering) is prompt-tuning, which involves learning prompts to optimize model performance on a downstream task. Two main prompt forms are distinguished: hard prompts, which are manually crafted sequences of tokens appended to the input, and soft prompts, which are learned continuous embeddings that serve as virtual tokens prepended to the input.
Vulnerability detection systems using LLMs differ significantly in how the models are integrated into the overall architecture. Broadly, we categorize existing approaches into two groups: LLM-centric systems, where the LLM serves as the core analytical component, and hybrid systems, where the LLM is used in conjunction with other models.
In LLM-centric approaches, the vulnerability detection task is handled primarily by the language model itself. These systems rely on the LLM for core tasks such as classification and reasoning, and typically use either prompt engineering or fine-tuning to adapt the model to the detection task. Surveyed studies rely on two main types of LLMs: general-purpose LLMs and code LLMs. Depending on the study setup, models from both categories may be used independently (comparison or benchmark) or in combination (ensemble), e.g., [164, 98].
General-purpose LLMs are pre-trained on natural language corpora and tend to perform well on explanation-based or natural language-heavy tasks. Early examples include BERT [6] and its variants RoBERTa [248] and DistilBERT [249]. Other models widely used for vulnerability detection include: the GPT series [250, 251], with the latest version being GPT-4 [252]; the LLaMA series [253, 254, 255]; the Qwen series [256, 257, 258]; Gemma 1 and 2 [259, 260]; DeepSeek R1 and v2 [261, 262]; Mistral [263] and Mixtral [264]; T5 [265]; Claude 3 [266]; Gemini 1.5 [267]; and Phi2 [268] and Phi3 [269].
Code LLMs are pre-trained specifically on large-scale code corpora for code-related tasks, enhancing performance on syntax-sensitive or program-structure-aware tasks.
Examples include: BERT-based models, such as CodeBERT [270] and GraphCodeBERT [271]; StarCoder [272], and StarCoder2 [273]; UniXcoder [274]; the GPT-based Codex [275]; CodeLlama [276]; CodeQwen 1.5 [256], Qwen2.5-Coder [277]; CodeGemma [278]; DeepSeek-Coder [279, 280]; and CodeT5(+) [281, 282].
Hybrid architectures combine LLMs with other deep learning architectures to leverage complementary strengths. Often, the LLM is used to produce embeddings that are subsequently processed by neural networks such as Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), or Graph Neural Networks (GNNs).
RNNs are neural architectures designed specifically to handle sequential data by modeling temporal and contextual dependencies between elements, such as syntactic, semantic, and control-flow relationships [283]. Prominent RNN variants include Long Short-Term Memory networks (LSTMs) and their bidirectional forms (BiLSTMs), which are designed to handle long-range dependencies effectively. In several studies, BERT embeddings are directly fed into (Bi-)LSTM layers [176, 194]. Other studies enhance this setup by passing code embeddings to (Bi)LSTM modules with added attention layers [234, 183].
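The following PyTorch sketch illustrates such a hybrid setup, assuming a frozen CodeBERT encoder and arbitrary hyperparameters; it is not taken from any specific surveyed system.

```python
# Hybrid architecture sketch: frozen code-LLM token embeddings -> BiLSTM head.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BertBiLSTMClassifier(nn.Module):
    def __init__(self, encoder_name="microsoft/codebert-base", hidden=256, num_classes=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        for p in self.encoder.parameters():   # keep the LLM frozen; train only the head
            p.requires_grad = False
        self.lstm = nn.LSTM(self.encoder.config.hidden_size, hidden,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, input_ids, attention_mask):
        tokens = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        _, (h_n, _) = self.lstm(tokens)
        pooled = torch.cat([h_n[0], h_n[1]], dim=-1)   # final forward/backward states
        return self.classifier(pooled)

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
batch = tokenizer(["int main() { char buf[8]; gets(buf); }"],
                  return_tensors="pt", truncation=True, padding=True)
logits = BertBiLSTMClassifier()(batch["input_ids"], batch["attention_mask"])
```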
CNNs are neural architectures specialized in identifying local spatial patterns or features within data through convolutional filters. Initially developed for image recognition tasks, CNNs have been successfully adapted to textual and sequential data analysis by treating sequences as one-dimensional spatial inputs. This capability allows CNNs to effectively capture localized syntactic and semantic patterns in code sequences, making them useful for tasks such as vulnerability detection [284]. Some studies apply CNNs directly to code or graph embeddings [167, 212]. Others utilize pre-trained embedding weights within TextCNN architectures [139], or integrate CNNs with attention mechanisms to enhance feature extraction [220].
GNNs are specialized neural architectures designed to operate directly on graph-structured data, effectively modeling relationships between nodes and their structural dependencies. Given structured representations of code, such as AST and CFG, several studies combine language models with GNN-based architectures to take advantage of both sequential and structural representations of code [284]. A common approach is to pass code embeddings, e.g., generated by CodeBERT or GraphCodeBERT, into GNN-based classifiers [147, 216, 222, 197, 219]. Other studies use LLMs to compute graph embeddings [221, 77, 141] or multiple modalities of a function, including text, graph, and image representations [225], which are then fed into Graph Convolutional Networks (GCNs). Yang et al. [108] use a GNN as a lightweight adapter layer and concatenate its learned embedding with a fine-tuned LLM.
Several hybrid architectures combine multiple neural components, integrating RNNs, CNNs, and GNNs in sequence or parallel. For example, Chang et al. [285] process CodeBERT embeddings using a two-layer GCN and a BiLSTM. Bahaa et al. [70] use a multi-layer approach composed of a DistilBERT embedding layer followed by CNN and BiLSTM layers. Mei et al. [159] use CodeBERT embeddings in three hybrid models, i.e., CLSTM, CBiLSTM (sequential structure), and CNN-BiLSTM (parallel structure). JianJie and Le [206] present a multi-base classifier combining BERT-encoder, GCN, CNN, and BiLSTM; the outputs are fused via stacking ensemble learning. Sun et al. [228] propose a stacking ensemble strategy that integrates serialization-based and graph-based neural networks to exploit complementary code semantics.
Some studies combine language models with other architectures that do not rely on RNNs, CNNs, or GNNs. Melo et al. [200] evaluate the effectiveness of Sparse Autoencoders (SAEs) when applied to representations from LLMs, which are then classified using traditional machine learning models such as random forests. Peng et al. [92] use CodeBERT embeddings as input to a Transformer-based classifier. Xue et al. [211] perform feature fusion from multiple pre-trained models and process the combined representation with a Transformer model.
LLMs are adapted to the vulnerability detection task through various state-of-the-art techniques, which differ in how they modify the model’s parameters. The primary adaptation techniques are prompt engineering, fine-tuning, and pre-training. Each strategy reflects different trade-offs between computational cost, model adaptation, and task-specific performance. In the following, we discuss the adaptation techniques as observed in the surveyed studies in detail.
Prompt engineering refers to the design and structuring of input prompts in a way that elicits desired behaviors or outputs. As LLMs remain sensitive to the phrasing, context, and formatting of prompts, numerous prompting strategies have emerged to improve task alignment and performance.
Zero-shot prompting refers to providing the LLM with a task description and input (i.e., a potentially vulnerable code snippet) without any labeled examples [64, 38, 71, 122, 286, 109, 65, 287, 247, 101, 102, 95, 42, 138, 110, 144, 104, 44, 288, 45, 289, 215, 80, 208, 155, 47, 290, 106, 48, 291, 81, 235, 292, 293, 242, 85, 172, 112, 86, 175, 96, 294, 50, 68, 179, 209, 107, 295, 69, 89, 188, 191, 192, 115, 52, 296, 53, 97, 55, 56, 297, 59, 60, 62, 298, 63]. Thus, the model is expected to rely solely on its pre-trained knowledge to perform vulnerability detection. Typical prompt formats include simple binary instructions such as "Is the following code vulnerable? Yes or No," or targeted CWE-specific queries like "Does the following code have a CWE-xxx vulnerability?" or "Which of these types of vulnerabilities is contained in the code? [CWE-xx, CWE-yy, ...]". Studies often experiment with arbitrarily extensive variations to the base prompt, such as modifying task phrasing (e.g., "vulnerable", "buggy", "insecure"), providing additional contextual information (e.g., programming language, CWE descriptions, or code function description), or adopting role-based formulations (e.g., "Act as a vulnerability detection system" [215]) to guide model behavior. Some studies further extend the basic prompt with reinforcement-like strategies [38, 66, 299]. For example, Liu et al. [66] implement a scoring system where the model earns points when vulnerabilities are correctly identified and fixed, and is penalized when vulnerabilities remain after the fix. Due to its simplicity, zero-shot prompting is commonly used as a starting point and serves as a baseline for comparison with more advanced prompting techniques.
Few-shot prompting [251], also referred to as in-context learning, extends zero-shot prompting by providing the model with several input-output examples relevant to the task. These examples typically consist of code snippets paired with their labels (e.g., vulnerable or non-vulnerable). By presenting these examples in the prompt, the model is guided to infer the desired task behavior, enabling in-context learning without modifying model weights. The examples used for few-shot prompting can be static, i.e., manually selected and reused across prompts [79, 49, 128, 287, 290, 88, 144, 41, 78, 83, 294, 114, 55, 62], or vulnerability-specific, i.e., tailored examples targeting particular vulnerability types [50, 296]. In another study, the examples are generated by the LLM itself [122]. Other common strategies for example selection, as discussed by Steenhoek et al. [85], include using random vulnerable and non-vulnerable examples from the training dataset [244, 192, 47, 191, 290, 87, 218, 85], constructing contrastive pairs, where both a vulnerable code snippet and its corresponding fixed version are provided within the same prompt [48, 86, 40, 114, 218, 85], or selecting examples from the training dataset based on (embedding) similarity to the input [192, 290, 87, 242, 297, 85].
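A small sketch of few-shot prompt construction from a contrastive pair (a vulnerable snippet and its fixed counterpart); the example pair and wording are illustrative only.

```python
# Building a few-shot prompt from labeled in-context examples.
EXAMPLES = [
    ("strcpy(buf, user_input);", "Yes"),                     # vulnerable
    ("strncpy(buf, user_input, sizeof(buf) - 1);", "No"),    # fixed counterpart
]

def build_few_shot_prompt(code: str) -> str:
    parts = ["Decide whether each code snippet is vulnerable. Answer 'Yes' or 'No'."]
    for snippet, label in EXAMPLES:
        parts.append(f"Code:\n{snippet}\nAnswer: {label}")
    parts.append(f"Code:\n{code}\nAnswer:")
    return "\n\n".join(parts)

print(build_few_shot_prompt("memcpy(dst, src, len);"))
```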
Retrieval-Augmented Generation (RAG) [300] extends the few-shot paradigm by dynamically augmenting the prompt with relevant external information retrieved from a knowledge base. RAG systems use a retriever component to identify and fetch contextually similar entries based on embedding similarity or other relevance metrics. The LLM then works with the provided knowledge to generate the output. A key advantage of RAG over traditional few-shot prompting is its ability to dynamically incorporate domain-specific and up-to-date knowledge, making it especially useful for integrating evolving vulnerability knowledge. In the context of vulnerability detection, RAG-based approaches use a wide range of retrieval content. Some studies retrieve the most similar code snippets or code dependencies along with known vulnerability information [188, 241, 245, 113, 208, 182, 117, 61]. Others rely on abstract code functionality descriptions to better capture the vulnerability semantics and context of the input code [242, 112, 100, 57]. Additionally, vulnerability reports and CWE descriptions are retrieved to guide the model’s reasoning using structured vulnerability information [243, 105, 81, 240, 301].
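The following sketch illustrates the retrieval step with a tiny, hypothetical knowledge base, using TF-IDF similarity as a stand-in for the dense embedding retrievers typically employed.

```python
# Minimal RAG-style retrieval: rank knowledge-base entries by similarity to the
# input code and prepend the best match to the detection prompt.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

knowledge_base = [
    "CWE-787 Out-of-bounds Write: strcpy into a fixed-size stack buffer without length check.",
    "CWE-89 SQL Injection: user input concatenated into a SQL query string.",
    "CWE-476 NULL Pointer Dereference: pointer returned by malloc used without a NULL check.",
]
query = "void f(char *s) { char buf[16]; strcpy(buf, s); }"

vectorizer = TfidfVectorizer().fit(knowledge_base + [query])
scores = cosine_similarity(vectorizer.transform([query]),
                           vectorizer.transform(knowledge_base))[0]
top = max(range(len(knowledge_base)), key=scores.__getitem__)

prompt = (f"Context:\n{knowledge_base[top]}\n\n"
          f"Is the following code vulnerable? Answer Yes or No.\n{query}")
print(prompt)
```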
Chain-of-Thought (CoT) prompting [302] aims to enhance the reasoning capabilities of LLMs by encouraging them to generate intermediate reasoning steps before arriving at a final prediction. Reasoning behavior is typically triggered by simple cues such as "Let’s think step by step" [244, 79, 38, 128, 81, 191, 290, 144, 241, 41, 114, 50, 93, 116, 62, 246], requests for explanation [47, 109, 218], or structured step-by-step instructions [49, 110, 40, 86, 74, 215, 46, 112, 113, 89, 241, 115, 50, 56, 57, 62]. Automated CoT prompt construction has also been explored [294].
Several studies adapt and extend CoT prompting with domain-specific techniques. For instance, Nong et al. [48] propose a vulnerability-semantics-guided prompt format to guide the model through relevant data and control flow facts. More advanced CoT-inspired prompting techniques include self-consistency [303], where the prompt is run multiple times, and predictions are aggregated (e.g., a file is marked vulnerable if the model returns a positive classification in two out of three runs) [86], and Tree-of-Thought (ToT) prompting [86, 56], which branches the reasoning process by generating multiple alternative steps per stage, followed by a selection of the most promising path.
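A minimal sketch of self-consistency voting over repeated CoT runs; ask_model is a hypothetical wrapper around any chat-completion API that returns a 'Yes'/'No' verdict.

```python
# Self-consistency: run the CoT prompt several times and take the majority answer.
from collections import Counter

def self_consistent_detection(code, ask_model, runs=3):
    prompt = ("Let's think step by step about possible vulnerabilities in the code, "
              "then answer with 'Yes' or 'No'.\n" + code)
    votes = Counter(ask_model(prompt) for _ in range(runs))
    return votes.most_common(1)[0][0]        # majority answer across runs

# Dummy model stub for demonstration; replace with a real LLM call.
print(self_consistent_detection("gets(buf);", lambda prompt: "Yes"))
```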
Self-verification builds on the initial prompt-response interaction by prompting the model to re-evaluate, validate, or refine the output. This technique aims to increase detection accuracy and robustness by introducing an additional reasoning loop, either through self-reflection or through interactions with other components.
In single-model setups, the model is prompted to assess its prior response (self-validation or self-refinement) by verifying the correctness of its output [79, 298] or recursively critiquing and improving its initial answer [86, 74, 69]. Zibaeirad and Vieira [115] introduce a "Think & Verify" process: the model first performs initial vulnerability detection and then reassesses its findings, assigning a confidence score and providing a severity rating of the detected issue.
In multi-model settings, self-verification is implemented as an interactive process between components. For example, Wang et al. [179] use a vulnerability description refinement approach, where the output of one model is used to prompt another with corrective cues such as "An expert has found... please recheck it" [179]. Eberhardt and Milánkovich [247] introduce a framework composed of a controller and an evaluator component that work collaboratively using reflection and clarification questions.
Agentic prompting extends traditional prompt engineering by embedding LLMs into autonomous, decision-making loops, following a higher-level behavioral pattern involving multi-step planning, tool use, and self-directed reasoning. Agents may operate individually [82] or as part of multi-agent systems, where agents specialize in subtasks, interact with tools, and collaboratively refine outputs [240, 114, 67, 43, 57, 99, 90]. For example, Yildiz et al. [114] employ Reasoning and Acting (ReAct) agents [304], which follow a thought-action-observation cycle, for vulnerability detection in code repositories. Seo et al. [57] propose a three-agent framework comprising a similarity analyzer, vulnerability verifier, and code patcher; the similarity analyzer performs semantic and taint analysis on LLM-generated code against a RAG database of recent vulnerabilities, while the verifier retrieves matching entries and constructs a one-shot verification example, explaining how the matched vulnerability manifests and its root cause. Widyasari et al. [99] present a courtroom-inspired multi-agent framework with four role-specific agents: security researcher, code author, moderator, and review board that collaboratively assess code vulnerabilities through iterative analysis, argumentation, summarization, and a final verdict based on collective reasoning. Ahmed et al. [90] propose an agent pipeline that sequentially analyzes code via normalization, planning, i.e., summarizing the function and generating a checklist of potential vulnerabilities, identifying required external symbols, vulnerability detection using the prior inputs, and validation, with iterative feedback loops between the detection and validation agents.
The adaptation of LLMs to software vulnerability detection tasks often follows the established pre-training and fine-tuning paradigm. In this context, models are either trained from scratch using large code corpora to capture programming-specific syntax and semantics [119, 121, 73, 239, 139, 227, 143, 152, 161, 176, 184, 214, 190], used in their (pre-)trained form without any additional training, freezing the model weights [221, 77, 206, 148, 225, 167, 177, 220, 212, 98, 200, 197, 92, 219], or further fine-tuned for domain-specific applications. Fine-tuning using vulnerability-specific datasets is the most widely adopted strategy. Studies adopt a range of fine-tuning strategies that vary in their degree of parameter update and architectural design.
Full-parameter fine-tuning refers to tuning all parameters of a pre-trained language model for the downstream task of vulnerability detection. Tuning involves replacing the original output layer with a task-specific layer, such as a linear classifier layer, and training all initial model parameters [305]. This approach is widely adopted in the surveyed studies [119, 37, 120, 70, 121, 122, 123, 124, 109, 125, 72, 126, 127, 73, 239, 128, 130, 131, 75, 76, 135, 136, 137, 138, 233, 139, 227, 140, 103, 141, 143, 145, 146, 229, 147, 78, 234, 149, 150, 152, 216, 153, 154, 156, 157, 158, 159, 111, 232, 161, 235, 162, 163, 306, 164, 165, 166, 207, 242, 169, 170, 307, 84, 171, 172, 173, 213, 174, 230, 96, 176, 178, 179, 181, 209, 182, 183, 184, 51, 88, 185, 214, 187, 189, 190, 222, 191, 192, 226, 194, 195, 91, 198, 199, 54, 236, 228, 58, 201, 93, 211, 202, 203, 204, 94, 237]. While this approach is computationally intensive and less reusable across tasks, it allows the model to specialize in vulnerability detection by adapting all internal representations.
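A minimal sketch of full-parameter fine-tuning on toy data; the model name, labels, and hyperparameters are illustrative assumptions.

```python
# Full-parameter fine-tuning: every encoder weight plus the new head is updated.
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=2)

samples = ["strcpy(buf, user_input);", "strncpy(buf, user_input, sizeof(buf) - 1);"]
labels = torch.tensor([1, 0])            # 1 = vulnerable, 0 = non-vulnerable (toy labels)
batch = tokenizer(samples, padding=True, truncation=True, return_tensors="pt")

optimizer = AdamW(model.parameters(), lr=2e-5)   # all parameters are trainable
model.train()
for _ in range(3):                                # a few toy epochs
    loss = model(**batch, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```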
A specific variant of this technique is Instruction-Tuning, where a language model is fine-tuned on labeled input-output pairs annotated with natural language instructions [74, 142, 186, 188, 60, 196], e.g., "Instruction: Detect whether the following code contains vulnerabilities" [142]. This method teaches the model to better follow human-written prompts by learning general patterns for task completion. In [99], Widyasari et al. use GPT-4o for generating the instructions. Instruction-tuning has shown promise in enhancing performance on unseen instructions and is often combined with parameter-efficient fine-tuning techniques to reduce training cost [74, 142, 188, 60].
Key drawbacks of full-parameter fine-tuning are its high computational cost and the risk of harming the model’s generalization ability. To address these drawbacks, several studies adopt parameter-efficient fine-tuning (PEFT) techniques, which adapt only a small subset of model parameters or learn external modules for new tasks while keeping most of the pre-trained model parameters frozen. To categorize the PEFT methods encountered in the surveyed studies, we adopt the taxonomy by Han et al. [308], covering additive and reparameterized PEFT techniques.
Additive PEFT techniques introduce a small number of trainable parameters that are strategically positioned within the model architecture. During fine-tuning, only the weights of these additional modules or parameters are updated, which results in a substantial reduction in storage, memory, and computational resource requirements.
Adapter-Tuning inserts small trainable neural modules, i.e., adapters, into a frozen pre-trained model. During fine-tuning, only the adapter parameters are updated, while the model’s parameters remain unchanged. Rather than employing default configurations that use the same adapter settings across all layers, Akli et al. [195] present AutoAdapt to automatically discover task-specific, layer-wide adapter configurations, allowing each layer to adopt distinct parameters.
Prompt-Tuning prepends adjustable embedding vectors, referred to as soft prompts, to the start of the input sequence. This technique keeps the pre-trained model weights frozen while tuning only the parameters of the soft prompts [285, 153, 181]. Various adaptations of prompt-tuning have been investigated. Ren et al. [306] integrate prompt-tuning within a reinforcement learning framework. Feng et al. [217] explore graph-enhanced prompt-tuning, which incorporates structural code information by embedding graph features into soft prompt vectors.
Apart from the methods mentioned above, Other-Additive approaches have been used that strategically train additional parameters during the fine-tuning process. Li et al. [151] introduce a vulnerability steering vector that represents the concept of vulnerability in the representation space. The vector is injected into the activation values of the corresponding layer, guiding the model’s behavior without modifying all its parameters. Wang et al. [180] use (IA)3, an activation-scaling method that modifies intermediate activations using learnable rescaling vectors.
Reparameterized PEFT techniques transform a model’s architecture by modifying how its parameters are represented and trained. This typically involves replacing large components of the model with smaller, low-rank versions during training, reducing the number of parameters that need to be updated. At inference time, the model can be converted back to its original parameterization.
The most widely adopted reparameterization technique in the surveyed studies falls under the Low-Rank Decomposition category: LoRA (Low-Rank Adaptation) [309]. LoRA injects trainable low-rank decomposition matrices into the attention layers of a frozen language model, enabling task-specific adaptation with significantly fewer parameters [74, 142, 39, 134, 138, 223, 155, 46, 168, 172, 210, 113, 191, 193, 297, 93, 60].
LoRA Derivatives have also been explored to improve efficiency and adaptability. Tian et al. [175] use PiSSA (Principal Singular Values and Singular Vectors Adaptation) [310], which shares the same architecture as LoRA but adopts a different initialization method. Ibanez-Lissen et al. [198] use GaLore (Gradient Low-Rank Projection) [311], a memory-efficient training strategy that allows full-parameter training while requiring less memory than common low-rank approaches such as LoRA. Another efficient PEFT design is QLoRA (Quantized LoRA) [312], which applies LoRA to a quantized version of the language model to further reduce memory consumption. QLoRA has been used in multiple surveyed studies to efficiently fine-tune large models for code vulnerability tasks [188, 125, 172, 108].
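A minimal LoRA sketch using the Hugging Face peft library; rank, alpha, dropout, and target modules are illustrative defaults rather than values from any surveyed study.

```python
# LoRA: only small low-rank matrices (and the classification head) are trained.
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=2)

config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8, lora_alpha=16, lora_dropout=0.1,
    target_modules=["query", "value"],   # attention projections of the RoBERTa-style encoder
)
model = get_peft_model(base, config)
model.print_trainable_parameters()       # most base-model weights remain frozen
```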
Beyond supervised learning, several advanced learning paradigms have been employed to enhance the performance and generalization capabilities of vulnerability detection models: contrastive learning, causal learning, multi-task learning, knowledge distillation, and continual learning.
Contrastive learning aims to learn more discriminative representations by encouraging the model to pull semantically similar instances closer and push dissimilar ones apart in the representation space, thereby improving the model’s ability to distinguish between structurally similar but semantically different code, such as secure versus vulnerable code. Several studies adapt contrastive learning techniques to the vulnerability detection domain [111, 207, 210, 129, 128, 144]. Notably, Du et al. [129] propose mutual nearest neighbor contrastive learning to align the source and target domains. For multi-class classification, Ding et al. [128] introduce class-aware contrastive learning, which minimizes similarity only between samples with different class labels to improve type-specific discrimination. Ji et al. [144] present a hierarchical contrastive learning framework to bring vector representations of related CWEs closer together, combining supervised and self-supervised contrastive losses to promote geometric spread and better inter-class separation.
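As a minimal sketch of the idea (the margin, distance metric, and pooling are assumptions, not taken from the cited studies), a pairwise contrastive objective over code embeddings can be written as follows.

```python
# Pairwise contrastive loss: same-label embeddings are pulled together,
# different-label embeddings are pushed apart by at least a margin.
import torch
import torch.nn.functional as F

def pairwise_contrastive_loss(embeddings, labels, margin: float = 1.0):
    dist = torch.cdist(embeddings, embeddings)             # pairwise Euclidean distances
    same = labels.unsqueeze(0).eq(labels.unsqueeze(1)).float()
    positive = same * dist.pow(2)                          # pull same-label pairs together
    negative = (1 - same) * F.relu(margin - dist).pow(2)   # push different-label pairs apart
    mask = 1 - torch.eye(len(labels))                      # ignore self-pairs
    return ((positive + negative) * mask).sum() / mask.sum()

emb = torch.randn(4, 768, requires_grad=True)    # e.g., [CLS] embeddings from a code LLM
labels = torch.tensor([1, 1, 0, 0])              # 1 = vulnerable, 0 = non-vulnerable
loss = pairwise_contrastive_loss(emb, labels)
loss.backward()
```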
Causal learning guides models to focus on genuine cause-and-effect relationships rather than relying on spurious correlations in the data, e.g., variable or API names that are not actual causes of vulnerabilities. Rahman et al. [163] identify such spurious features via semantic-preserving perturbations (e.g., variable renaming or dead-code injection) and then systematically remove their influence using causal inference techniques. Specifically, they simulate interventions, i.e., asking how the model would behave if certain spurious features were actively changed, and block non-causal paths by conditioning on known spurious factors. This strategy allows the model to rely on invariant, causally-relevant features and thereby improve robustness and generalization to unseen or perturbed code scenarios, such as different projects or naming conventions.
Multi-task learning has emerged as an effective paradigm for improving generalization by jointly training on related tasks, thereby reducing overfitting and enhancing model robustness across a range of tasks. Several studies adopt multi-task learning to enhance vulnerability analysis. Fu et al. [75] fine-tune CodeBERT under the multi-task setting for predicting CWE-ID and CWE-type, while Chen et al. [94] combine CWE-ID classification with line-level vulnerability detection. Curto et al. [72] perform binary vulnerability detection alongside multi-class vulnerability categorization. Similarly, Steenhoek et al. [84] fine-tune CodeBERT for simultaneous binary classification, multi-class vulnerability type prediction, and localization. Du et al. [74] apply multi-task instruction-tuning for vulnerability detection, localization, and interpretation of root causes. Russo et al. [166] leverage shared information between self-admitted technical debt (SATD) and vulnerabilities to jointly detect both issues. Ding et al. [239] use multi-task pre-training to jointly learn static and dynamic code properties.
Knowledge distillation is a model compression strategy in which a smaller model, i.e., the student, is trained to replicate the behavior of a larger model, i.e., the teacher. By learning from the teacher’s output distributions, the student achieves comparable performance while significantly reducing computational cost, making it suitable for resource-constrained settings. Some studies directly adopt pre-trained distilled models such as DistilBERT for vulnerability detection [126, 223], while other studies apply knowledge distillation explicitly, using teacher and student models from the BERT family [231, 205]. Nguyen et al. [160] adopt a multi-teacher setup with CNN and GNN models to distill knowledge into a student LLM backbone. A more advanced approach is presented by Fu et al. [132], who introduce a hierarchical knowledge distillation framework. The CWE label space is split into multiple sub-distributions based on semantic similarity, and individual CNN teacher models are trained on each subset. A student language model then learns to generalize across the teacher outputs, improving its ability to handle multi-class vulnerability detection tasks. Weyssow et al. [118] distill structured reasoning capabilities from a teacher LLM into smaller student LLMs using reinforcement learning from AI feedback (RLAIF). Panichella [204] combines metamorphic testing with many-objective optimization for distillation of LLMs for code, measuring model robustness across semantically equivalent code variants. Ibanez-Lissen et al. [198] use linear classifier probes (LPs) to identify valuable layers in pre-trained LLMs before fine-tuning, guiding selective compression and adaptation based on task-specific relevance.
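The core training objective common to most of these setups can be illustrated by the standard distillation loss below, which mixes a softened teacher-matching term with the ground-truth cross-entropy. Temperature and mixing weight are illustrative assumptions.

```python
# Sketch of a standard knowledge distillation objective.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # KL term matches the teacher's softened distribution; scaled by T^2 as usual.
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)        # supervision from ground-truth labels
    return alpha * kd + (1 - alpha) * ce
```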
Retraining a model on all known vulnerabilities each time a new vulnerability is disclosed would demand considerable computational resources and time. Continual learning enables models to incrementally learn new knowledge over time, such as new programming languages or up-to-date vulnerability data, without forgetting what has already been learned (Catastrophic Forgetting). Gao et al. [133] propose an adaptive regularization strategy that preserves the most representative (i.e., informative and diverse) samples in each dataset for model retraining under project-level, language-level, and time-level continual learning settings. Similarly, Zhang et al. [189] present an incremental learning approach that allows the model to incorporate new programming languages without degrading performance on those it has already learned. Tian et al. [58] introduce a training-free parameter fusion method that merges independent classification heads fine-tuned on different vulnerability types, allowing the fused model to adapt to new categories while preserving multi-class classification capabilities.
Other data-centric learning techniques have been explored in the surveyed studies to improve model performance, i.e., Positive and Unlabeled (PU) learning, active learning, and adversarial training. PU learning tackles the issue of limited and noisy labels by training models using only positively (vulnerable) labeled and unlabeled data [201, 147]. Active learning aims to reduce labeling costs by strategically selecting the most informative samples for annotation. For example, Lan et al. [203] combine dataset maps with active learning to prioritize valuable samples while filtering out those that may harm performance. Adversarial training enhances model robustness by introducing crafted perturbed code examples during training, improving resilience to noise and attacks [214, 94, 246].
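For illustration, a minimal uncertainty-sampling strategy for active learning could look as follows: unlabeled code samples are ranked by predictive entropy and the top-k are forwarded to annotators. The `predict_proba` callable is a placeholder for any trained detector; the selection criterion is a generic assumption rather than the method of a specific surveyed study.

```python
# Sketch of uncertainty-based sample selection for active learning.
import numpy as np

def select_for_annotation(unlabeled_samples, predict_proba, k: int = 100):
    """Return the k samples the current model is least certain about."""
    probs = np.asarray([predict_proba(code) for code in unlabeled_samples])  # (N, num_classes)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)                   # predictive entropy
    most_uncertain = np.argsort(-entropy)[:k]
    return [unlabeled_samples[i] for i in most_uncertain]
```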
Figure 2 maps the relationships between system architecture, input representation, adaptation techniques, and task formulation across the 227 surveyed studies. Each node is annotated with the absolute number of studies assigned to that category, providing a quantitative overview of the landscape. The diagram enables a holistic view of how LLMs are combined with the different input representations and adaptation strategies. The rightmost part of the diagram visualizes how these techniques correlate with task formulations, highlighting the dominance of binary classification. The diagram further reveals underexplored combinations, such as the use of structure-aware inputs with PEFT and prompting methods for vulnerability-specific classification. This overview serves as a compact visual summary of the field and offers starting points for researchers to explore novel system designs or revisit rarely studied combinations.
In this SLR, we place particular emphasis on investigating the datasets used in the training and evaluation of LLM-based software vulnerability detection approaches. Datasets play a central role in model development and evaluation: Their quality, structure, and coverage of diverse vulnerability types directly affect model performance and generalization. Dataset choice can influence which methods appear to perform better, potentially favoring certain vulnerability types. Systematically comparing datasets is essential to understand their suitability for the vulnerability detection task, assess their limitations, and ensure meaningful evaluations.
To systematically analyze the datasets used in the surveyed studies, we extend the presented taxonomy to cover datasets. We outline the taxonomy based on the datasets used most commonly for training or evaluation in the surveyed studies, discussing type, granularity, source, and labeling. Further, we investigate the CWE coverage and diversity of selected datasets, discussing limitations such as long-tail distributions and lack of representative vulnerability coverage. Finally, we discuss trends in dataset use, including comparability of evaluations across different use cases.
We categorize vulnerability datasets according to type, granularity, source, and labelling, as shown in Figure 3, to capture their quality and realistic nature. Similar to related surveys [22, 17, 24, 27, 29, 31], we provide a comprehensive overview of commonly used datasets in the surveyed studies as well as selected, more recent datasets suited to address open challenges of LLM-based methods, refer to Table 3. We briefly introduce these datasets along with the taxonomy and discuss the resulting practical challenges and biases in dataset construction. In addition, we compare the selected datasets based on further characteristics such as their programming languages covered, as well as quantitative characteristics, i.e., dataset size, the share of samples representing vulnerabilities, and the number of distinct CWEs covered, cf. Table 4, as a proxy for label diversity and comprehensiveness.
We distinguish three dataset types: (i) synthetic, (ii) real-world, and (iii) mixed.
Synthetic datasets are artificially generated, often by injecting predefined vulnerability patterns into source code. The synthetic datasets Juliet C/C++ [313] and Juliet Java [314] (subsets of the Software Assurance Reference Dataset (SARD) [315]) provide comprehensive sets of test cases and entire test suites. These test suites are widely used to benchmark both traditional static analysis tools and deep learning-based vulnerability detectors. A key strength of synthetic datasets lies in their reliable labels, as they inject known vulnerable patterns into code, thereby ensuring clear ground truth annotations. However, this injection-based approach often results in limited code diversity and reduced realism, as the code samples tend to be simple and may not reflect the complexity of real-world source code. Further, the manual effort required to represent a diverse set of vulnerabilities can be labor-intensive.
Real-World datasets are sourced from open-source software projects, e.g., Devign [316], or curated vulnerability databases, e.g., Big-Vul [317] and MegaVul [318]. These datasets are more suited for a realistic evaluation, capturing the challenges posed by detecting software vulnerabilities in real-world projects. However, because the effort of crawling and parsing software projects makes manual verification impractical at scale, real-world datasets often suffer from label noise.
Mixed datasets combine real-world code with synthetic data, leveraging augmentation to enrich or balance the dataset. For example, the SARD [315] dataset contains a mix of production, synthetic, and academic vulnerable samples. VulDeePecker [238] combines samples from SARD and the National Vulnerability Database (NVD) [319]. Draper [320] combines synthetic functions from the Juliet test suite with open-source code from software packages and public repositories. Mixed datasets offer a practical trade-off between realism and control, making them particularly valuable for covering underrepresented or rare vulnerability types.
Vulnerabilities can occur at various structural levels within the software, from single lines to entire files. Dataset granularity refers to the level at which code is labeled or segmented within the dataset. Following Shereen et al. [22], we differentiate coarse, medium, and fine-grained levels.
Coarse-grained datasets operate at the Project, Commit, or File level. ReposVul [321] focuses on extracting dependencies on multiple granularity levels, including repository-level and file-level. Similarly, CrossVul [322] labels samples at the file level, thereby aligning closely with practical use cases in vulnerability detection. While this level of granularity offers realistic scenarios for analysis, it also introduces a high risk of noise due to the large code context surrounding the vulnerability.
Medium-grained datasets segment code at the Program, Slice or Function level. The largest share of commonly used vulnerability datasets is labeled at the function level [315, 320, 316, 317, 323, 324, 325, 124, 326, 100, 318, 149], cf. Table 3. The VulDeePecker dataset, also known as the Code Gadget Database (CGD), represents program slices as code gadgets composed of a number of program statements that are semantically related to each other through data or control dependencies. While this granularity supports a more focused analysis, it may miss critical contextual information when vulnerabilities span multiple functions or require interprocedural analysis.
Fine-grained datasets label vulnerabilities at the Line level, enabling precise vulnerability localization. Big-Vul, D2A [323], FormAI, and ReposVul provide annotations at the line level. Such a narrow scope, however, omits the broader context required to detect certain vulnerabilities, especially those involving complex data flows or control dependencies.
| Dataset | Year | Type | Granularity | Source | Labeling | #Used | Resource |
| SARD [315] | 2006 | Mixed | File | Open-Source | Synthetically | 28 | [315] |
| Juliet C/C++ [313] | 2017 | Synthetic | File | Open-Source | Synthetically | 9 | [313] |
| Juliet Java [314] | 2017 | Synthetic | File | Open-Source | Synthetically | 6 | [314] |
| VulDeePecker [238] | 2018 | Mixed | Slice | Constructed | Security Vendor | 9 | [327] |
| Draper [320] | 2018 | Mixed | Function | Constructed | Tool, Synthetically | 8 | [328] |
| Devign [316] | 2019 | Real | Function | Open-Source | Developer | 53 | [329] |
| Big-Vul [317] | 2020 | Real | Function, Line | Collected | Security Vendor | 50 | [330] |
| D2A [323] | 2021 | Real | Function, Line | Open-Source | Tool | 10 | [331] |
| ReVeal [324] | 2021 | Real | Function | Collected | Developer | 30 | [332] |
| CVEfixes (v1.0.8) [325] | 2021 | Real | Function, Commit, File | Collected | Security Vendor | 14 | [333] |
| CrossVul [322] | 2021 | Real | File | Collected | Security Vendor | 4 | [334] |
| SecurityEval (v2.1) [335] | 2022 | Mixed | Program | Constructed | Synthetically, Tool | 5 | [336] |
| DiverseVul [124] | 2023 | Real | Function | Collected | Developer | 21 | [337] |
| SVEN [338] | 2023 | Real | Program | Constructed | Developer | 4 | [339] |
| FormAI* [340] | 2023 | Synthetic | Program, Line | Constructed | Tool | 1 | [341] |
| ReposVul* [321] | 2024 | Real | Project, File, Function, Line | Collected | Tool | 2 | [342] |
| PrimeVul [128] | 2024 | Real | Function | Constructed | Security Vendor | 15 | [343] |
| PairVul* [100] | 2024 | Real | Function | Collected | Security Vendor | 1 | [344] |
| MegaVul* (2024/04) [318] | 2024 | Real | Function | Collected | Security Vendor | 2 | [345] |
| CleanVul* [149] | 2024 | Real | Function | Open-Source | Developer, Tool | 1 | [346] |
To categorize the source of the dataset, we adapt the four categories defined by Hou et al. [19], i.e., (i) open-source, (ii) collected, (iii) constructed, and (iv) closed-source.
Open-Source datasets are derived from public data collections accessible through open-source platforms or repositories, e.g., GitHub repositories. Examples of open-source datasets are SARD, Devign, D2A, and CleanVul [149]. Their accessibility ensures transparency and reproducibility, making them widely used in academic research. However, datasets sourced directly from GitHub repositories often lack reliable ground-truth labels and are sensitive to repository selection and curation quality, i.e., the specific projects chosen and how samples are filtered, labeled, and balanced, leading to biased, noisy, or unrepresentative data.
Collected datasets are those scraped, mined, and extracted by researchers from various sources such as security trackers or related databases (e.g., NVD [319], CVE [1], or CWE database [3]). MegaVul, for example, is collected by crawling the CVE database along with CVE-related open-source projects hosted across Git-based platforms. While authoritative vulnerability records provide structured metadata and standardized vulnerability classifications, the mapping of records to source code can be non-trivial and may miss undocumented or undisclosed vulnerabilities.
Constructed datasets are created by modifying or augmenting one or multiple other (collected) datasets, either manually or using semi-automatic methods, to better align with domain-specific objectives, e.g., focusing on specific CWEs or programming languages. Constructed datasets include the mixed datasets VulDeePecker and Draper. Further, SecurityEval [335] is constructed from sources such as GitHub’s CodeQL documentation, the CWE database, Sonar static analyzer rules, and prior Copilot-based studies. Additional samples were crafted by the authors. SVEN [338] refines and validates samples from Big-Vul, CrossVul, and VUDENC [347], focusing on nine common CWE types. The PrimeVul dataset [326] merges and de-duplicates data from Big-Vul, CrossVul, CVEfixes [325], and DiverseVul [124].
Closed-Source datasets are obtained from commercial or industrial entities. Closed-source datasets are absent from the listed commonly used datasets. Notably, only two surveyed studies [73, 211] use an industrial, closed-source dataset. Although such datasets offer significant potential for research targeting real-world deployment scenarios, they are often subject to restrictions that limit the publication and sharing of company-internal data. Such restricted access limits reproducibility, public benchmarking, and broader adoption within the research community.
Accurate labeling is essential for constructing high-quality vulnerability datasets and reliable evaluations. To categorize the labeling of datasets, we adopt the four main label origin categories by Croft et al. [4]: (i) security vendor-provided, (ii) developer-provided, (iii) tool-created, and (iv) synthetically created.
Security Vendor-provided labels are derived from curated vulnerability databases maintained by security vendors, i.e., organizations that collect and standardize information on disclosed vulnerabilities from various advisories. An example database is the NVD. Many entries include links to corresponding patches on security trackers or GitHub sites, enabling the mapping of vulnerabilities to real-world source code. For example, the Big-Vul dataset was collected by crawling the public CVE database and linking the CVE entries to related source code changes in GitHub projects. This labeling strategy is reliable and consistent with real-world vulnerability disclosures. However, it is limited to publicly disclosed cases and includes historical data with labels that may not align with up-to-date vulnerability mappings [321].
Developer-provided labels are extracted directly from a project’s development history or issue tracker systems on platforms such as GitHub. A commonly used strategy involves vulnerability fixing commits (VFCs), identifying vulnerable code before a fix and secure code after a fix. VFCs are often identified via project references by the NVD or keyword search. The Devign dataset (also referred to as FFmpeg and QEMU) consists of vulnerable functions extracted from VFCs identified via keyword-based filtering and manually labeled by security researchers. Devign is also part of the CodeXGLUE benchmark [348]. Contrary to manually labeling VFCs, relying solely on VFCs often results in noisy labels due to the simplifying assumption that all pre-commit code is vulnerable and all post-commit code is secure [4]. In practice, VFCs may contain unrelated changes such as refactoring or test updates, and fixes are sometimes distributed across multiple commits. Further, labeling based solely on modified lines may overlook vulnerable context and cross-line dependencies. To address these challenges, newer datasets propose refined VFC labeling strategies. The PrimeVul dataset [326] uses two labeling strategies: PRIMEVUL-ONEFUNC, which marks a function as vulnerable only if it is the only function modified by a VFC, and PRIMEVUL-NVDCHECK, which cross-references CVE descriptions with modified functions. CleanVul addresses the label noise using an LLM, which assigns a confidence score from 0 (no vulnerability detected) to 4 (very high likelihood of vulnerability) to identify vulnerability-fixing changes and filter out unrelated modifications such as test updates or refactorings.
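The simplifying assumption behind naive VFC-based labeling can be made explicit with the following sketch, which labels every pre-fix function version as vulnerable and every post-fix version as secure. The data structures are illustrative assumptions; as discussed above, this heuristic is exactly where refactorings and test updates introduce label noise.

```python
# Sketch of the naive VFC labeling heuristic discussed above.
def label_from_vfc(changed_functions):
    """changed_functions: iterable of (func_before, func_after) pairs from one
    vulnerability-fixing commit."""
    samples = []
    for before, after in changed_functions:
        # Simplifying assumption: every pre-fix version is vulnerable and every
        # post-fix version is secure; unrelated changes in the commit add noise here.
        samples.append({"code": before, "label": 1})
        samples.append({"code": after, "label": 0})
    return samples
```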
Tool-created labels are automatically generated using, e.g., static analyzers or formal verification tools. For instance, in the Draper dataset, labels are generated using static analyzers and then manually mapped to binary vulnerability labels and CWEs by a team of security experts. Similarly, the D2A dataset applies differential analysis to VFC version pairs from several open-source projects using static analysis tools, identifying disappearing bugs likely to represent true vulnerabilities. Notably, the FormAI dataset [340] consists of compilable and self-contained C programs generated using GPT-3.5 turbo, incorporating varying levels of complexity to simulate a broad spectrum of coding patterns. Labels are derived using formal verification. ReposVul jointly employs LLMs for evaluating the relevance between code changes and vulnerability fixes, and static analysis tools for checking the vulnerabilities in the changes. While tools enable large-scale dataset creation, they are often criticized for generating a high volume of false positives. Static analysis tools, in particular, produce numerous warnings, many of which originate from low-confidence detections [4]. Formal verification offers higher detection reliability but at a higher computational cost.
Synthetically created labels result from injecting known vulnerable patterns into code, generating controlled artificial samples, e.g., as used by the synthetic Juliet test suites. This method is useful for training on rare vulnerability types and balancing datasets. However, such patterns may oversimplify the complexity of real-world vulnerabilities.
| Dataset | Programming Language | Granularity | Size | %Vuln | Multi-Class | #CWEs |
| SARD [315] | C/C++, C#, Java, PHP | File | 450000+ | n/s | ✓ | 150+ |
| Juliet C/C++ [313] | C/C++ | File | 64099 | n/s | ✓ | 118 |
| Juliet Java [314] | Java | File | 28881 | n/s | ✓ | 112 |
| VulDeePecker [238] | C/C++ | Slice | 61638 | 28.76 | ✓ | 2 |
| Draper [320] | C/C++ | Function | 1274366 | 6.47 | ✓ | 4 |
| Devign [316] | C/C++ | Function | 48687 | 47.97 | ✗ | n/a |
| Big-Vul [317] | C/C++ | Function, Line | 264919 | 4.46 | ✓ | 91 |
| D2A [323] | C/C++ | Function, Line | 1295623 | 1.44 | ✓ | n/a |
| ReVeal [324] | C/C++ | Function | 18169 | 9.16 | ✗ | n/a |
| CVEfixes (v1.0.8) [325] | 27 languages | Function, Commit, File | 277948 | 45.50 | ✓ | 272 |
| CrossVul [322] | 40+ languages | File | 27476 | 50.00 | ✓ | 168 |
| SecurityEval (v2.1) [335] | Python | Program | 121 | 100.00 | ✓ | 69 |
| DiverseVul [124] | C/C++ | Function | 330492 | 5.7 | ✓ | 150 |
| SVEN [338] | C/C++, Python | Program | 1606 | 50.00 | ✓ | 9 |
| FormAI* [340] | C | Program, Line | 246549 | 80.23 | (✓) | (41) |
| ReposVul* [321] | C/C++, Java, Python | Project, File, Function, Line | 232465 | 0.74 | ✓ | 236 |
| PrimeVul [128] | C/C++ | Function | 235768 | 2.96 | ✓ | 140 |
| PairVul* [100] | C/C++ | Function | 8628 | 50.00 | ✓ | 95 |
| MegaVul* (2024/04) [318] | C/C++ | Function | 353873 | 5.08 | ✓ | 176 |
| CleanVul* [149] | C/C++, C#, Java, JavaScript, Python | Function | 11632 | 50.00 | ✗ | n/a |
Commonly used vulnerability datasets are heavily biased towards C/C++, as shown in Table 4. This bias reflects the widespread use of these languages in technological infrastructure. Fewer datasets cover multiple programming languages, i.e., SARD, CVEfixes, CrossVul, and CleanVul; however, the covered languages are not represented in equal shares within these datasets. Recent studies have begun constructing datasets for less covered languages, e.g., Python [291] and Rust [155], contributing to broader language diversity in vulnerability research.
The size of a dataset is strongly influenced by the dataset source and its labeling method. Smaller datasets typically involve manual curation, enabling high label quality but limited scalability. SecurityEval, for example, focuses on evaluating secure code generation in Python. It consists of 121 manually curated prompt-based samples across 69 CWE types, resulting in a dataset with 100% vulnerable samples. Similarly, SVEN focuses on evaluating secure code generation, offering improved label reliability with a 50% share of vulnerable samples, but is limited in scope due to its focus on a fixed subset of CWEs. In contrast, larger datasets rely on automated pipelines for scalability. DiverseVul [124] scales up the volume and variety of functions compared to earlier VFC-based datasets such as Devign, ReVeal [324], Big-Vul, CrossVul, and CVEfixes, with a share of 5.7% vulnerable samples.
The dataset type and the labeling method used in dataset construction influence both the quality and realism, cf. Figure 4. Synthetic datasets, such as Juliet or FormAI, offer high label accuracy due to controlled injection of known vulnerability patterns but often lack the structural complexity and semantic diversity found in real-world code. In contrast, datasets derived from real-world sources provide greater realism, better reflecting practical vulnerability scenarios. However, they often suffer from label noise due to assumptions inherent in VFC-based labeling. These datasets also vary in the proportion of vulnerable samples. Highly imbalanced datasets, i.e., where vulnerabilities are underrepresented, reflect the natural rarity of vulnerabilities in real code. While realistic, such an imbalance poses challenges for training models, which may become biased towards predicting the majority class. To mitigate this challenge, preprocessing techniques are often required to ensure effective training, e.g., resampling: oversampling the minority (vulnerable) class or undersampling the majority (non-vulnerable) class to achieve a more balanced class distribution. Figure 4 visualizes this trade-off between labeling accuracy and realism, providing a practical framework for selecting datasets for different research objectives.
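As a concrete illustration of the resampling mentioned above, the following sketch randomly oversamples the minority class before training. The sample format is an illustrative assumption; in practice, resampling should be applied to the training split only, never to validation or test data.

```python
# Sketch of random oversampling to counter class imbalance in the training split.
import random

def oversample_minority(samples, label_key="label", seed=42):
    """samples: list of dicts with a binary label under `label_key` (1 = vulnerable)."""
    rng = random.Random(seed)
    vuln = [s for s in samples if s[label_key] == 1]
    non_vuln = [s for s in samples if s[label_key] == 0]
    minority, majority = (vuln, non_vuln) if len(vuln) < len(non_vuln) else (non_vuln, vuln)
    # Duplicate random minority samples until both classes are the same size.
    upsampled = minority + [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = majority + upsampled
    rng.shuffle(balanced)
    return balanced
```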
Vulnerability detection is often framed as a binary classification task, i.e., distinguishing between vulnerable and non-vulnerable code. The ReVeal dataset provides security-related patches collected from bug tracking systems, offering paired vulnerable and fixed code samples labeled in binary form.
Many datasets go beyond binary classification by providing labels for specific vulnerability types, enabling multi-class classification. CVEfixes is a comprehensive and automatically curated dataset collected from CVE entries in the NVD. It identifies linked open-source projects and extracts function-level vulnerable and fixed code samples from VFCs. The latest release at the time of writing (v1.0.8) includes all CVEs published up to July 23, 2024, covering 272 distinct CWEs. The PairVul dataset [100] takes a more controlled approach by focusing on pairs of vulnerable and non-vulnerable code with high lexical similarity. These pairs are extracted from Linux kernel CVEs, labeled using VFCs, and filtered to ensure that the associated fixes are not reverted or modified in subsequent commits. While the dataset offers high-quality and balanced samples, its scope is currently limited to selected Linux kernel vulnerabilities.
A common limitation across vulnerability datasets is the lack of coverage of recently disclosed vulnerabilities. Static datasets may quickly become outdated, reducing their relevance for real-time or forward-looking research. This highlights the need for continuously maintained and automatically extensible datasets, such as CVEfixes, which is supported by an automated pipeline.
Beyond classification, datasets such as Big-Vul, CrossVul, CVEfixes, and ReposVul offer rich metadata, including CWE descriptions and severity scores. Such metadata supports more advanced tasks such as severity estimation, root-cause reasoning, and context-aware vulnerability analysis, broadening their use beyond binary classification.
To assess the real-world applicability and limitations of LLM-based vulnerability detection approaches, it is crucial to use datasets that capture a diverse range of vulnerability types. Such diversity enables more realistic evaluations, as it reflects the variety and complexity of vulnerabilities encountered in practical software development. As shown in Table 4, the number and variety of vulnerability types (i.e., number of CWE classes) represented vary considerably across the selected datasets. Some datasets focus narrowly on specific vulnerability types and CWE classes, while others aim for broader CWE diversity. To assess the diversity and composition of CWEs, we analyze selected real-world and mixed datasets providing multi-class labels (cf. Table 3) in more depth, focusing on their CWE coverage and distribution. Specifically, we analyze VulDeePecker, Draper, Big-Vul, CVEfixes, CrossVul, SecurityEval, DiverseVul, SVEN, ReposVul, PrimeVul, PairVul, and MegaVul. D2A and FormAI are excluded from this analysis. D2A is not considered, as it provides only static analyzer outputs without CWE annotations. Similarly, FormAI, while performing manual CWE mapping, does not include these mappings as part of the dataset but only error types provided by the formal verification tool.
As discussed in prior works [127, 142, 318], real-world vulnerability datasets exhibit a characteristic long-tail distribution: A small number of CWE types are heavily represented, while the majority of CWE types appear only rarely, often with just one or two examples. This characteristic is visualized in Figure 5, which shows the count of CWEs for the 15 most and least represented CWE classes for selected datasets. For example, while CVEfixes covers 272 distinct CWE types, the majority are sparsely represented. A similar pattern is observed for DiverseVul, which covers 150 CWEs. Such skewed distributions lead to biased learning, where models disproportionately focus on a small subset of vulnerability types. MegaVul and SecurityEval represent efforts to encourage a more balanced representation. SecurityEval, in particular, provides one to six manually curated examples for 69 CWEs. These efforts represent a step towards reducing long-tail effects and developing more generalizable vulnerability detection models.
| CWE-ID | Pillar Name | Description | #CWEs |
| CWE-284 | Improper Access Control | Weaknesses related to protection mechanisms such as authentication, authorization, and accountability | 166 |
| CWE-435 | Improper Interaction Between Multiple Correctly-Behaving Entities | Weaknesses that arise due to unexpected consequences when multiple entities interact, even though each entity behaves correctly in isolation | 16 |
| CWE-664 | Improper Control of a Resource Through its Lifetime | Resource management weaknesses such as improper initialization, reuse, or cleanup; includes cases where explicit instructions for resource creation, usage, or destruction are not properly followed | 367 |
| CWE-682 | Incorrect Calculation | Calculations that generate incorrect or unintended results later used in security-critical decisions or resource management | 14 |
| CWE-691 | Insufficient Control Flow Management | Logic or flow vulnerabilities such as incorrect conditions or decision-making | 84 |
| CWE-693 | Protection Mechanism Failure | Weaknesses caused by bypassing, misconfiguring, or incorrectly using security mechanisms | 100 |
| CWE-697 | Incorrect Comparison | Weaknesses in comparison logic between variables, objects, or values | 22 |
| CWE-703 | Improper Check or Handling of Exceptional Conditions | Weaknesses where exceptional conditions that rarely occur during normal operation are not properly anticipated or handled | 59 |
| CWE-707 | Improper Neutralization | Inadequate sanitization or escaping of input/output data where the application fails to ensure that data conforms to expected formats and is safe | 144 |
| CWE-710 | Improper Adherence to Coding Standards | Violations of safe and established programming practices | 195 |
For a more structured comparison of the CWE distributions across the datasets, we adopt the CWE-1000 Research View [349]. This view provides a hierarchical grouping of CWEs intended for research and analysis. It groups CWEs into broad research concepts to help identify inter-dependencies, shared characteristics, and root causes of vulnerabilities. The CWE-1000 view spans multiple levels of abstraction: At the highest level are pillars and categories, which provide high-level groupings but are not intended for direct vulnerability mapping. Beneath these are the class, base, and variant levels, with variant representing the most specific level, i.e., language- or technology-specific weaknesses. By design, the CWE-1000 view organizes every weakness in the CWE catalog into one of ten pillars, cf. Table 5.
We consider the mapping of CWE-1000 to cluster the CWEs represented in the individual datasets. This clustering enables a structured comparison beyond CWE counts by highlighting which vulnerability types are emphasized or underrepresented, assessing the thematic diversity and coverage of vulnerability datasets. Notably, many datasets were built using the NVD, which also includes deprecated or discouraged CWE entries not mapped by CWE-1000, as well as general categories such as CWE-OTHER or CWE-NOINFO, used when insufficient information is available. To account for these cases, we introduce an additional group labeled not mapped, capturing all CWE entries that fall outside the scope of the CWE-1000 view. For each dataset, we compute the relative share of each CWE based on the vulnerable functions (or files, depending on the granularity provided). A special case is the CVEfixes dataset, where VFCs are associated with CVE identifiers, and each CVE may be linked to one or more CWEs. Some commits address multiple CVEs, and the associated CWE mappings may span different levels of specificity. To normalize the distributions and ensure a consistent comparison, we compute the share of each CWE based on the total number of CWE labels assigned to all vulnerable samples, rather than the number of vulnerable functions.
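The normalization step described above can be illustrated with the following sketch, which counts CWE labels over all vulnerable samples, groups them by CWE-1000 pillar, and reports relative shares. The pillar mapping shown is a small assumed excerpt for illustration; a complete mapping would be derived directly from the CWE-1000 view.

```python
# Sketch of grouping CWE labels by CWE-1000 pillar and computing relative shares.
from collections import Counter

# Assumed excerpt of a CWE -> pillar mapping; verify against the CWE-1000 view.
PILLAR_OF = {"CWE-787": "CWE-664", "CWE-119": "CWE-664",
             "CWE-79": "CWE-707", "CWE-89": "CWE-707"}

def pillar_distribution(cwe_labels):
    """cwe_labels: one entry per CWE label assigned to a vulnerable sample."""
    counts = Counter(PILLAR_OF.get(cwe, "not mapped") for cwe in cwe_labels)
    total = sum(counts.values())
    return {pillar: n / total for pillar, n in counts.items()}
```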
Figure 6 visualizes the distribution of the top 25 CWE types covered by vulnerable samples across the selected datasets. Among the mixed datasets and those with a CWE-specific focus, VulDeePecker, Draper, and SVEN include only a small number of CWEs, i.e., two, four, and nine, respectively. SecurityEval features the widest variety of CWEs among mixed datasets, with a similar distribution across its covered types. For SecurityEval, the CWEs that fall within the cross-dataset top 25 account for only around 20% of its CWE labels. In comparison, the real-world datasets exhibit a more pronounced long-tail distribution: the top 25 CWEs typically account for around 80% of all vulnerability labels. This highlights a strong skew towards a small set of frequently occurring vulnerability types.
The most widely represented CWEs across datasets are those associated with the pillar Improper Control of a Resource Through its Lifetime, which includes memory and resource management vulnerabilities that are commonly found in C/C++, the most dominant language across the studied datasets. This pillar is also the largest in terms of CWE count (cf. Table 5), followed by Improper Neutralization, encompassing issues related to input validation and sanitization. Other pillars are underrepresented in comparison to their relative size in the CWE view. These pillars include Improper Adherence to Coding Standards and Improper Access Control. Notably, some pillars, such as Improper Interaction Between Multiple Correctly-Behaving Entities, Protection Mechanism Failure, and Incorrect Comparison, are not represented at all among the top 25 CWEs, and are less represented in the overall CWEs covered, cf. Figure 7. These categories are inherently more difficult to capture in real-world datasets, particularly at the function level, as they often involve complex inter-component dependencies, system-level interactions, or contextual analysis that is not easily observable from isolated code snippets.
A considerable share of CWEs in the analyzed datasets fall into the not mapped category. This is often the result of outdated or abstract vulnerability labels sourced from the NVD. For instance, VulDeePecker covers two high-level CWEs, i.e., the class CWE-119 Improper Restriction of Operations within the Bounds of a Memory Buffer [350] and the category CWE-399 Resource Management Errors [351], flattening the existing CWE hierarchy and obscuring more specific vulnerability types. Importantly, the use of abstract categories such as classes for mapping has been discouraged since 2019, as they are frequently misapplied in low-information vulnerability reports where more specific child CWEs would be more appropriate [350]. The use of categories for vulnerability mapping is explicitly prohibited [351]. Several other CWEs commonly found in the selected datasets are also discouraged for use in vulnerability mappings, e.g., CWE-20 Improper Input Validation, CWE-200 Exposure of Sensitive Information to an Unauthorized Actor, and CWE-400 Uncontrolled Resource Consumption. In addition, pillars such as CWE-284 Improper Access Control and CWE-703 Improper Check or Handling of Exceptional Conditions are often used as placeholders in vulnerability reports lacking detailed analysis, leading to imprecise dataset annotations.
Some CWEs are marked as allowed with review, meaning they can be used in vulnerability mapping but should be applied carefully and only when more specific alternatives are not available. For example, CWE-120 Buffer Copy without Checking Size of Input is frequently selected based on keyword presence (’Classic Buffer Overflow’), but is only appropriate when the vulnerability involves unchecked buffer copy operations [352]. Similar caution applies to CWEs such as CWE-362 Concurrent Execution using Shared Resource with Improper Synchronization (’Race Condition’) and CWE-94 Improper Control of Generation of Code (’Code Injection’).
The use of general or deprecated CWE labels presents challenges to dataset quality and limits the effectiveness of CWE-based classification and reasoning. Notably, there are 555 CWEs in the CWE-1000 view that are allowed for vulnerability mapping but are not represented as vulnerable samples in any of the analyzed datasets. This indicates that current datasets cover only a subset of CWE types, leaving many mappable software weaknesses absent from training and benchmarking. As a result, tools trained on these datasets may lack generalization capabilities when faced with previously unseen vulnerabilities. Further, evaluation results appear artificially strong, as they are often biased towards commonly occurring CWEs while excluding rare and long-tail vulnerability types. To address these limitations, more effort is needed to incorporate the CWE hierarchy into training and evaluation protocols, moving beyond binary labels to consider hierarchically related CWEs. For example, Tamberg and Bahsi [86] follow an evaluation method in which both parent and child CWEs are considered valid classifications using MITRE’s Research Concepts view.
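A hierarchy-aware evaluation in the spirit of Tamberg and Bahsi [86] can be sketched as follows: a predicted CWE counts as correct if it equals the ground truth or lies on the same ancestor chain in the CWE-1000 parent relation. The parent map shown is an assumed excerpt of the real hierarchy, included only for illustration.

```python
# Sketch of a hierarchy-aware CWE match using a parent relation from the CWE-1000 view.
PARENT = {"CWE-787": "CWE-119", "CWE-119": "CWE-118", "CWE-118": "CWE-664"}  # assumed excerpt

def ancestors(cwe, parent=PARENT):
    """Collect all ancestors of a CWE by following the parent relation."""
    chain = set()
    while cwe in parent:
        cwe = parent[cwe]
        chain.add(cwe)
    return chain

def hierarchy_match(predicted: str, actual: str) -> bool:
    """Accept exact matches as well as ancestor/descendant relations."""
    return (predicted == actual
            or predicted in ancestors(actual)
            or actual in ancestors(predicted))
```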
Understanding how vulnerability datasets are used over time provides insight into community practices, widely used benchmark datasets, and emerging research directions. Analyzing temporal trends also reveals how quickly new datasets are adopted, how long benchmark datasets remain influential, and where gaps or biases may persist in evaluation setups. Such an analysis is particularly important in this rapidly evolving field to guide future dataset development and advance performance through more representative and robust evaluations.
Figure 8 visualizes the usage and usage frequency of the selected vulnerability datasets by the surveyed studies from 2020 to mid-2025. Red markers indicate the publication dates of the datasets. The figure shows that even today, early deep learning datasets such as Devign remain in widespread use, serving as a foundational dataset for graph-based and learning-based vulnerability detection models. The introduction of Big-Vul in September 2020 further enhanced dataset diversity and volume; Big-Vul also continues to be widely adopted. The persistent use of Devign and Big-Vul is partly due to their role in enabling comparability across studies. Classic test suites such as SARD and Juliet are also frequently used (cf. Figure 8); their synthetic nature makes them useful for controlled evaluations. However, these earlier datasets often lack current CWE coverage and may reinforce outdated vulnerability mappings.
Figure 8 also shows that since 2021, there has been a clear trend towards real-world vulnerability datasets, e.g., CVEfixes, CrossVul, and DiverseVul. Several more recent datasets, such as FormAI, PrimeVul, PairVul, and CleanVul, are explicitly designed for evaluating LLM-based vulnerability detection approaches. For these datasets, adoption lags of approximately one to six months can be observed between a dataset’s release and adoption in published research. This lag reflects the time needed for several practical steps, including dataset validation, integration into existing toolchains and model pipelines, and broader community uptake. In many cases, adoption may also depend on the availability of documentation and preprocessing scripts. Further, early adopters often play a key role in demonstrating a dataset’s utility, which can influence its acceptance and reuse.
Another notable trend is the increasing use of newly created or custom datasets, see row "Custom" in Figure 8. A total of 71 studies either use or introduce custom datasets, often to address perceived limitations in existing datasets. Common motivations include correcting unrealistic data distributions, expanding project and vulnerability diversity, including more recent disclosures, or improving overall dataset quality (e.g., fixing incomplete or incorrectly merged functions) [318]. This trend is also driven by domain-specific or task-aligned needs. While such efforts encourage innovation and enable tailored evaluations, they also complicate comparability and may introduce implicit biases tied to individual evaluation design choices and construction assumptions.
The choice of suitable vulnerability datasets reflects a combination of common benchmark datasets, task alignment, and system design. In total, 158 distinct datasets are used across the surveyed studies, illustrating a high degree of fragmentation. Most studies either use or introduce custom datasets, with 126 datasets appearing only once. Only eight datasets are reused in more than ten studies, highlighting the lack of a standardized suite of benchmark datasets and consistency in dataset selection. Many studies use more than one dataset, often combining datasets to increase training data diversity or to evaluate model generalization, e.g., vulnerability types and programming languages. Figure 9 highlights the most frequently used dataset pairs across all surveyed studies. Among these, the most common combinations are ReVeal and Devign, Big-Vul and Devign, as well as Big-Vul and ReVeal. These pairings often complement each other in terms of data characteristics, such as vulnerable percentage or combining datasets with broad versus narrow CWE coverage.
The use of many different datasets and combinations, however, hampers direct comparability of evaluation results, particularly when studies apply different splits for training, validation, and testing, or pre-process the same datasets differently. Figure 10 offers insight into overlaps and divergences in dataset usage across specific use cases. When fine-tuning LLMs for vulnerability detection, e.g., using LoRA (Low-Rank Adaptation, 17 studies), researchers often resort to large and diverse datasets, most commonly DiverseVul, Devign, Big-Vul, and ReVeal. However, many studies make use of additional, lesser-used datasets. This practice reduces comparability and makes it difficult to replicate results or isolate model performance. RAG studies (17 studies) most commonly use Devign and Big-Vul, each appearing in four studies. Big-Vul, in particular, is often used for knowledge base construction of CWE entries due to the provided metadata, e.g., CWE descriptions. Five RAG studies introduce custom datasets, which hinders comparison, especially concerning the semantics of the retrieved context. Agentic approaches (8 studies) show even greater fragmentation, with only two studies aligning on PrimeVul. GNN-based hybrid approaches most frequently rely on Devign, followed by custom datasets, ReVeal, and Big-Vul. Devign is particularly suited for GNNs due to its composite code representation, encoding program control and data dependencies as heterogeneous edges within a unified graph structure. The absence of shared benchmarks within the discussed use cases highlights the need for controlled and standardized evaluation protocols.
Despite significant progress in leveraging LLMs for software vulnerability detection, we identified several limitations in this SLR. In the following, we outline the key limitations derived from the literature and propose actionable research directions to address them. Table 6 summarizes the discussed limitations and future research opportunities.
The limitations identified in this review emerge from recurring issues observed across the surveyed studies and span key aspects such as dataset realism and representativeness, vulnerability diversity, methodological rigor, and evaluation practices. We structure these findings into seven themes L1-L7 that outline the most pressing limitations.
| Limitations | Future Research Opportunities |
| L1 Dataset and Detection Granularity | |
| • Focus on function-level inputs | • Explore coarser detection granularity (project- and repo-level) |
| • Fails to capture complex vulnerabilities with dependencies | • Augment code context |
| • Focus on C/C++ code | • Expand to other languages |
| • Develop multilingual and cross-project benchmarks | |
| L2 Dataset Labeling and Quality | |
| • Synthetic datasets oversimplify real-world complexity | • Build new datasets focusing on vulnerability diversity |
| • Automated labeling introduces noise and redundancy | • Improve label quality (manual verification, improved VFC tools) |
| • Class imbalance and long-tail distribution of CWE types | • Reevaluate approaches on diverse and representative datasets |
| L3 Study Evaluation and Comparability | |
| • Inconsistent use of datasets | • Align dataset use across similar use cases for comparability |
| • Lack of predefined train/validate/test splits | • Use datasets with documented splits |
| • Diverse modifications to datasets | • Evaluate on multiple datasets for comparability |
| • Document all preprocessing and filtering steps | |
| • Encourage open science | |
| L4 Up-to-Date Vulnerability Knowledge | |
| • Static datasets quickly become outdated | • Use automatically updated datasets (e.g., CVEfixes) |
| • Retraining is expensive and often infeasible | • Further explore RAG and continual learning |
| L5 Code and Vulnerability Representation | |
| • Raw code often used without structure or abstraction | • Investigate structure-aware representations |
| • Reliance on superficial text patterns | • Combine LLMs with structural models (e.g., GNNs) |
| • Sensitivity to semantic-preserving transformations | • Reevaluate robustness on perturbed or transformed code |
| L6 Model Interpretability and Explainability | |
| • Limited insight into predictions | • Develop metrics for explanation trustworthiness |
| • Hallucinated or misleading justifications, self-contradictions | • Use structured explanation formats |
| • Leverage external knowledge to ground reasoning | |
| L7 Integration into Pipelines and Workflows | |
| • Few evaluations in realistic development settings | • Integrate tools into IDEs or pipelines with developer feedback |
| • Poor generalization to production codebases | • Evaluate models on proprietary or closed-source datasets |
| • High computational cost of tuning and inference | • Develop CWE-type classifiers for finer-grained prediction |
| • Explore agentic systems for reasoning or self-assessment | |
| • Explore model compression for lightweight deployment | |
In the surveyed studies, vulnerability detection is typically applied at the level of individual functions or small program slices. This focus on function-level granularity has been influenced by the dominance of function-level vulnerability datasets (cf. Table 3) and partly by architectural limitations, particularly the restricted input lengths of earlier LLMs [85]. However, methods have been proposed to enhance context, such as the work by Chen et al. [123], and LLMs now support significantly longer input contexts, e.g., 128,000 tokens for GPT-4 [252]. The focus on function-level detection fails to capture the surrounding context of, and to generalize to, more complex vulnerabilities that span multiple functions or files. For example, Huynh et al. [246] investigate how varying levels of code context influence detection performance, comparing snippets enriched with comments and docstrings against full-file inputs. The findings suggest that even with extended context windows, current models still perform poorly on distributed vulnerabilities. Further, the surveyed studies focus predominantly on detecting vulnerabilities in C/C++ code (cf. Table 4), with limited coverage of other widely used programming languages, such as Java or Python. This narrow programming language coverage limits the applicability of current models in diverse, multilingual software environments.
Synthetic datasets offer high label-precision, as vulnerabilities are injected and, thus, well-defined. However, synthetic datasets may oversimplify real-world scenarios and fail to capture the complexity and diversity of real-world vulnerabilities, limiting their effectiveness in preparing models for deployment in realistic environments. Similarly, the common formulation as a binary classification task (cf. Figure 2) lacks the granularity needed for practical application, where identifying the specific vulnerability type is essential for severity assessment, prioritization, and appropriate repair. Real-world datasets are typically labeled using automated methods such as VFCs or static analysis tools. While these approaches enable large-scale dataset creation, they are associated with label noise and redundancy, which reduce the uniqueness and representativeness of the dataset [4]. Although manual labeling or verification can improve precision, it remains a resource-intensive process, especially when scaling to larger datasets.
Further, most datasets suffer from a strong class imbalance, where vulnerable samples are heavily outnumbered by non-vulnerable ones (cf. Table 4), and from a long-tail distribution of vulnerability types, with few common CWEs dominating and many others being severely underrepresented (cf. Figure 5 and Figure 6). These imbalances pose significant challenges for both training balance and evaluation reliability, resulting in models achieving substantially higher performance when identifying frequently occurring vulnerabilities compared to less common ones [246].
The lack of consistency and transparency in dataset use and evaluation protocols complicates meaningful comparison across studies. Even within the same use case, studies are often not comparable due to the use of different datasets (cf. Figure 10). Further, most datasets do not provide predefined train/validate/test splits, which are essential for ensuring a consistent and fair evaluation [22]. Only a few studied datasets offer standardized splits, e.g., Draper [320], CodeXGLUE [348] (a subset of Devign [316]), Big-Vul [317], and PrimeVul [326]. Without fixed splits, results can vary significantly, even when using the same dataset. For example, if subsequent studies fine-tune on the same dataset but apply different splits, direct performance comparisons become unreliable. In addition, several studies further modify or refine existing datasets [189, 88, 353, 134, 178], or augment them with custom data [162]. While such adaptations are often motivated by practical needs, e.g., improving data quality or evaluating generalization, they introduce additional variability and complicate comparability with prior and future works.
The number of disclosed software vulnerabilities continues to grow rapidly [2]. As a result, static datasets may quickly become outdated. Models trained on such outdated datasets may lack awareness of newly discovered vulnerabilities and emerging vulnerability patterns, limiting their practical relevance and effectiveness in real-time scenarios and for the detection of zero-day vulnerabilities [48]. Another challenge lies in how to efficiently expose models to this growing volume of vulnerability information. Since retraining models from scratch is computationally expensive and impractical at high frequency, there is a critical need for scalable methods to update or augment model knowledge.
In most surveyed studies, the code to be analyzed is provided as raw code or embedded directly within the prompt during inference (cf. Figure 2). However, models frequently rely on superficial textual patterns rather than capturing the underlying semantic structures of vulnerabilities [128, 354, 53]. As a result, model performance degrades under semantic-preserving code transformations, such as variable renaming, code reordering, or formatting changes, highlighting a lack of robustness and true semantic understanding [290, 165, 162, 353, 204, 62]. While contrastive learning is designed to produce more discriminative embeddings, models still often fail to draw the classification boundary correctly and identify the vulnerable patterns [128].
Current LLM-based vulnerability detection approaches offer limited insight into why a model classifies the input code as vulnerable or non-vulnerable. Even in studies that use CoT prompting or explicitly investigate reasoning as an additional objective, issues with hallucinated or misleading justifications are commonly reported [101, 106]. Similarly, studies implementing self-verification report instances of inaccurate corrections, self-contradictions, and hallucinated justifications [69]. These issues undermine the trustworthiness of model outputs and hinder their integration in development workflows where interpretability and reliability are critical.
Despite growing interest, the integration of LLM-based vulnerability detection into practical development workflows remains underexplored. Only a few studies investigate real-world applicability through prototype tools or user studies [75, 84], though such efforts are essential for assessing feasibility and trust in practical settings. Generalization to realistic codebases also remains challenging, with models often underperforming outside curated benchmarks [355].
Resource efficiency is a further limiting factor for real-world deployment, as fine-tuning and inference with LLMs remain computationally expensive. While model compression techniques such as knowledge distillation reduce inference latency and memory usage [126, 205, 198], their practical application for deployment on developer machines remains unexplored. This gap is particularly relevant when considering the privacy constraints of proprietary codebases, which may require on-premise or offline analysis [196].
Building on the identified limitations, each theme reflects core challenges in current research and highlights open questions that future work must address to enable more robust, generalizable, and reproducible progress in LLM-based software vulnerability detection. In this section, we outline promising research directions, offering actionable insights to guide future work towards addressing these challenges.
To move beyond the limitations of function-level detection, future work should focus on context-enhanced vulnerability detection, including project- and repository-level analysis [191, 182, 45, 43]. Initial steps in this direction include augmenting function-level samples with additional context such as surrounding code lines [91], function arguments, external functions, type definitions, global variables, environmental constraints [32], as well as dependency and execution flows [202]. For example, Ahmed et al. [90] use GPT-4 to identify the required contexts for a given vulnerability in a function, but still find limitations in pinpointing vulnerable statements and their root causes in complex real-world code. Li et al. [116] demonstrate that incorporating execution and data context improves model performance and reasoning quality, emphasizing that the challenge lies in precise, context-aware vulnerability reasoning. Similarly, Yang et al. [218] propose a program analysis-based approach that abstracts complex function calls into primitive API representations to enrich contextual understanding. Further investigating such context-enhancements is essential for capturing dependencies that span multiple functions or files, enabling the detection of more complex vulnerability types that are otherwise missed in isolated, function-level analysis.
The current language focus on C/C++ should be expanded to include other widely used languages such as Python, Java, and Rust [356]. Establishing multilingual and cross-project benchmarks will be essential for assessing robustness and model generalization.
To address current limitations in dataset construction, future work should focus on building real-world vulnerability datasets with improved label quality and balanced coverage across CWE types. Label quality can be improved through two main strategies: (1) refining existing datasets via targeted manual labeling to correct and verify automatically generated labels [338], and (2) advancing automated labeling methods, such as VFCs, to reduce noise and improve labeling reliability [326, 149, 90, 357, 358]. Balanced coverage across frequent and less common vulnerability types can be achieved, e.g., by using LLMs to generate vulnerable samples for specific vulnerability types [359, 196, 360].
In addition, new datasets should be constructed with minimal overlap to existing corpora to prevent data leakage and inflated performance due to prior exposure during model training. These datasets should set a focus on programming language and vulnerability diversity, capturing variations in human-written and LLM-generated code, complex multi-line vulnerabilities, and less frequently represented CWE types [361]. A practical direction for dataset expansion is to use the set of mappable-but-unused CWEs identified in Section 5.2, i.e., CWE classes that are currently not represented in the discussed datasets. Revisiting existing vulnerability detection approaches on more representative datasets may yield new insights into their robustness and effectiveness in real-world vulnerability detection.
To enhance reproducibility and comparability across studies, future research should prioritize standardized evaluation protocols. Specifically, we recommend the use of datasets with documented train/validation/test splits and careful alignment with datasets used in related studies to ensure meaningful comparisons within the same use case. We provide the mappings of datasets to surveyed use cases in the artifacts [14].
To assess generalization and maximize comparability with prior works, models should be evaluated on multiple datasets, combining recent datasets such as PrimeVul [326] or MegaVul [318] with established benchmark datasets such as Devign [316] or Big-Vul [317]. All pre-processing and filtering steps must be documented to enable replication. In support of open science [362], sharing of code, evaluation scripts, and model checkpoints is encouraged.
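A simple way to make splits documentable is to fix the shuffling seed, persist the resulting sample IDs, and record a checksum of the split file in the published artifacts. The sketch below assumes string sample IDs and an 80/10/10 ratio; both are illustrative choices.

```python
# Hedged sketch: producing a documented, reproducible train/validation/test split.
# File name, seed, and split ratios are assumptions; the point is to persist the
# split and its checksum alongside the published artifacts.
import hashlib
import json
import random

def make_split(sample_ids: list[str], seed: int = 42, ratios=(0.8, 0.1, 0.1)) -> dict:
    ids = sorted(sample_ids)            # sort first so the split is order-independent
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train, n_val = int(n * ratios[0]), int(n * ratios[1])
    split = {
        "seed": seed,
        "train": ids[:n_train],
        "validation": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],
    }
    blob = json.dumps(split, sort_keys=True).encode()
    split["sha256"] = hashlib.sha256(blob).hexdigest()  # report this in the artifacts
    return split

with open("split.json", "w") as f:
    json.dump(make_split([f"sample-{i}" for i in range(1000)]), f, indent=2)
```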
Given the broad and continuously growing number of CVEs, ensuring that models remain informed about newly discovered vulnerabilities is crucial for maintaining relevance in practical scenarios. To address the limitation of static datasets quickly becoming outdated, future work should focus on continuously maintained and automatically extensible datasets, such as CVEfixes [325], which employs an automated pipeline for updates. Advanced adaptation techniques also offer promising pathways to bridge this knowledge gap. RAG enables LLMs to access external, up-to-date knowledge at inference time. Continual learning can further support adaptability by integrating new vulnerability knowledge while preserving performance on previously seen vulnerability types. However, both techniques have seen limited adoption in the context of software vulnerability detection (cf. Figure 2). Further research should explore their effectiveness, especially in handling large and heterogeneous vulnerability information. For RAG, this includes investigating how different knowledge base representations, e.g., structured CWE hierarchy graphs or abstract function descriptions, affect retrieval quality and downstream detection accuracy.
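A minimal RAG step for vulnerability detection might retrieve the most similar CWE descriptions for a code snippet and prepend them to the prompt, as sketched below. The `embed` function is a placeholder for any embedding model, and the three-entry knowledge base is purely illustrative; a real system would index the full CWE corpus or richer structured knowledge.

```python
# Hedged sketch of a retrieval-augmented detection step: retrieve the most similar
# CWE descriptions for a code snippet and prepend them to the prompt.
# `embed` is a placeholder for any embedding model returning a vector.
import numpy as np

KNOWLEDGE_BASE = [
    ("CWE-79", "Improper Neutralization of Input During Web Page Generation (XSS)"),
    ("CWE-89", "Improper Neutralization of Special Elements used in an SQL Command"),
    ("CWE-787", "Out-of-bounds Write"),
]

def retrieve(code: str, embed, k: int = 2) -> list[tuple[str, str]]:
    query = embed(code)
    entries = [(cwe, desc, embed(desc)) for cwe, desc in KNOWLEDGE_BASE]
    scored = sorted(
        entries,
        key=lambda e: float(np.dot(query, e[2]) / (np.linalg.norm(query) * np.linalg.norm(e[2]))),
        reverse=True,
    )
    return [(cwe, desc) for cwe, desc, _ in scored[:k]]

def rag_prompt(code: str, embed) -> str:
    context = "\n".join(f"{cwe}: {desc}" for cwe, desc in retrieve(code, embed))
    return f"Relevant weakness descriptions:\n{context}\n\nCode:\n{code}\n\nIs this code vulnerable?"
```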
To address the need for robust representations that capture vulnerability semantics, future work should further focus on structure-aware input representations. Directions include graph-based representations, such as control and data flow graphs, (LLM-generated) abstract descriptions that capture semantics and code behavior [363, 100], and multi-modal inputs, e.g., combining API abstractions, data flow graphs, and natural language documentation [218]. Understanding how different representations complement each other can enhance vulnerability detection capabilities. Ideally, such representations should be language-agnostic to support generalization across diverse programming languages (cf. multi-language studies [287, 188]). In addition, hybrid architectures should be further explored. GNNs, in particular, have demonstrated potential for capturing code semantics and structural dependencies [285, 221, 77, 141, 206, 147, 216, 225, 108, 222]. Future work should also evaluate the robustness of different representation strategies against syntactic perturbations and domain shifts, such as LLM-generated or refactored code, to better assess real-world generalization capabilities.
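As a small example of a structural input modality, the sketch below derives call targets and simple def-use pairs from Python source with the standard `ast` module; this auxiliary view could be supplied alongside the raw code. It is a toy stand-in for the richer control- and data-flow graphs used in the cited hybrid approaches.

```python
# Hedged sketch: a lightweight structural view (call targets and def-use pairs)
# extracted with Python's standard `ast` module, as one possible auxiliary input.
import ast

def structural_summary(source: str) -> dict:
    tree = ast.parse(source)
    calls, assigns = [], []
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            calls.append(node.func.id)
        if isinstance(node, ast.Assign):
            targets = [t.id for t in node.targets if isinstance(t, ast.Name)]
            used = [n.id for n in ast.walk(node.value) if isinstance(n, ast.Name)]
            assigns.extend((u, t) for t in targets for u in used)
    return {"calls": calls, "def_use_pairs": assigns}

snippet = "query = 'SELECT * FROM users WHERE id=' + user_id\nexecute(query)"
print(structural_summary(snippet))  # {'calls': ['execute'], 'def_use_pairs': [('user_id', 'query')]}
```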
To build trust in LLM-based vulnerability detection systems, future research should go beyond standard evaluation metrics (e.g., accuracy, recall, precision, F1) and incorporate dedicated metrics that assess the trustworthiness and consistency of model-generated explanations. As LLMs increasingly take on roles in security workflows, their predictions must not only be correct but also justifiable and transparent to developers. Justification includes distinguishing between correct predictions made for the wrong reasons (spurious correlations) and those grounded in valid vulnerability semantics [32].
Interpretability may further be enhanced through structured explanation formats, such as vulnerability propagation paths or flow annotations. These structured rationales offer greater reliability than natural language justifications and can be cross-checked against program logic or expert knowledge. Further, integrating RAG techniques with structured vulnerability knowledge, such as CWE hierarchies or exploit chains, could improve reasoning by anchoring the model’s output in explicit and verifiable context. This approach may not only strengthen the factual grounding of explanations but also align with the information needs of developers and auditors in practice.
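A structured explanation could, for instance, be expressed as a propagation path from an untrusted source to a dangerous sink, serialized in a machine-checkable form. The schema below is an illustrative assumption, not a standardized format.

```python
# Hedged sketch of a structured explanation format: a machine-checkable propagation
# path from an untrusted source to a dangerous sink, instead of free-form text.
# Field names are illustrative, not a standardized schema.
from dataclasses import dataclass, field, asdict
import json

@dataclass
class Step:
    file: str
    line: int
    statement: str
    role: str  # "source", "propagation", or "sink"

@dataclass
class VulnerabilityExplanation:
    cwe_id: str
    summary: str
    path: list[Step] = field(default_factory=list)

explanation = VulnerabilityExplanation(
    cwe_id="CWE-89",
    summary="Unsanitized user input reaches an SQL execution call.",
    path=[
        Step("app.py", 12, "user_id = request.args['id']", "source"),
        Step("app.py", 14, "query = 'SELECT ... ' + user_id", "propagation"),
        Step("app.py", 15, "cursor.execute(query)", "sink"),
    ],
)
print(json.dumps(asdict(explanation), indent=2))
```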
To ensure the practical applicability of LLM-based vulnerability detection, future work should focus on integration into real-world development workflows. Such work should, for example, integrate solutions into IDEs and CI/CD pipelines, conduct field studies, and collect developer feedback. Closed-source or proprietary codebases can serve as valuable testbeds for evaluating real-world performance, robustness, and usability.
Agentic systems and reasoning-driven workflows offer promising pathways to autonomously assist developers through tasks such as self-assessment, explanation generation, or iterative refinement. For example, Farr et al. [296] propose an expert-in-the-loop system to route code into the three categories "automatic quarantine", "deployment clearance", or "manual review", based on model confidence scores, helping prioritize expert attention where most needed.
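A minimal sketch of such confidence-based routing is shown below; the three category names follow the description above, while the concrete thresholds are assumptions that would need to be calibrated per project and model.

```python
# Hedged sketch of confidence-based routing in an expert-in-the-loop workflow.
# The thresholds are illustrative assumptions, not values from the cited study.
def route(vulnerability_probability: float,
          quarantine_threshold: float = 0.9,
          clearance_threshold: float = 0.1) -> str:
    if vulnerability_probability >= quarantine_threshold:
        return "automatic quarantine"      # high-confidence vulnerable
    if vulnerability_probability <= clearance_threshold:
        return "deployment clearance"      # high-confidence benign
    return "manual review"                 # uncertain: prioritize expert attention

assert route(0.95) == "automatic quarantine"
assert route(0.02) == "deployment clearance"
assert route(0.50) == "manual review"
```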
To further improve generalization to practical vulnerability remediation workflows, approaches should move towards CWE-type classification. Promising directions include the development of type-specific classifiers that align with the CWE hierarchy and research views, cf. [37, 39, 127], enabling more actionable insights and effective prioritization.
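For example, fine-grained CWE predictions could be aggregated into coarser ancestor classes of the CWE hierarchy to support prioritization, as sketched below. The parent mapping shown is a small illustrative excerpt, not the full hierarchy.

```python
# Hedged sketch: aggregating fine-grained CWE predictions into coarser ancestor
# classes of the CWE hierarchy. The mapping is a small illustrative excerpt.
CWE_ANCESTOR = {
    "CWE-787": "CWE-119",  # Out-of-bounds Write -> Improper Restriction of Memory Buffer Operations
    "CWE-125": "CWE-119",  # Out-of-bounds Read  -> same ancestor
    "CWE-89": "CWE-74",    # SQL Injection       -> Injection
    "CWE-79": "CWE-74",    # XSS                 -> Injection
}

def to_ancestor_class(predicted_cwe: str) -> str:
    return CWE_ANCESTOR.get(predicted_cwe, predicted_cwe)

def aggregate(predictions: list[str]) -> dict[str, int]:
    counts: dict[str, int] = {}
    for cwe in predictions:
        ancestor = to_ancestor_class(cwe)
        counts[ancestor] = counts.get(ancestor, 0) + 1
    return counts

print(aggregate(["CWE-787", "CWE-125", "CWE-89"]))  # {'CWE-119': 2, 'CWE-74': 1}
```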
Finally, trade-offs between performance and resource efficiency must be considered for real-world deployment. Fine-tuning smaller models for on-premise use is both practical and privacy-preserving [196]. Alternatively, model compression techniques such as knowledge distillation and quantization offer promising pathways for enabling lightweight deployment on developer machines.
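As one concrete option, post-training dynamic quantization in PyTorch replaces the linear layers of a fine-tuned classifier with 8-bit equivalents at load time; the sketch below uses a toy stand-in model. Knowledge distillation would instead train a smaller student on the larger model's outputs and is not shown here.

```python
# Hedged sketch: post-training dynamic quantization of a fine-tuned detector with
# PyTorch, as one lightweight-deployment option. TinyDetector is a toy stand-in.
import torch
import torch.nn as nn

class TinyDetector(nn.Module):
    """Stand-in for a fine-tuned vulnerability classification head."""
    def __init__(self, hidden: int = 768):
        super().__init__()
        self.encoder = nn.Linear(hidden, hidden)
        self.classifier = nn.Linear(hidden, 2)  # vulnerable / non-vulnerable

    def forward(self, x):
        return self.classifier(torch.relu(self.encoder(x)))

model = TinyDetector().eval()
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
print(quantized)  # Linear layers replaced by dynamically quantized equivalents
```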
In this systematic literature review, we analyzed 227 studies on LLM-based software vulnerability detection published between January 2020 and June 2025. To structure this rapidly evolving field, we introduced a comprehensive taxonomy that covers task formulation, i.e., the classification task presented to the LLM; input representation, i.e., how code and context are provided to the LLM; system architecture, differentiating between LLM-centric and hybrid approaches; and model adaptation techniques, including prompt engineering, fine-tuning, and learning paradigms. The analysis shows that most studies use simple approaches, such as binary code classification (vulnerable/non-vulnerable) and general-purpose LLMs. Common techniques include full-parameter fine-tuning and zero-shot prompting. More advanced techniques such as parameter-efficient fine-tuning, continual learning, retrieval-augmented generation, and agentic workflows are only beginning to emerge. We further analyzed the datasets used for training and evaluation, extending the taxonomy to include dataset type, granularity, source, and labeling. We investigated commonly used datasets with respect to their realism, vulnerability coverage, diversity, and usage trends. Despite notable progress in the field, we identified several key limitations that hinder the practical adoption of LLMs in software vulnerability detection, including limited detection granularity, limited robustness in code representations, and outdated vulnerability knowledge. In particular, dataset-related limitations, such as low vulnerability type diversity and class imbalance, pose challenges to generalization and cross-study comparability. To address these limitations, we outlined actionable research directions, such as advancing structure-aware and language-agnostic input representations, aligning datasets across use cases to enable cross-study comparison, and adopting RAG and continual learning techniques to improve adaptability. By mapping existing studies, identifying open challenges, and proposing future research directions, this review aims to guide researchers and practitioners in selecting techniques and datasets, and to support more comparable and reproducible research, ultimately advancing the development of reliable, generalizable, and practically applicable LLM-based vulnerability detection systems.
Acknowledgements This work has been supported in part by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) Project-ID 528745080 - FIP 68. The authors alone are responsible for the content of the paper.
[2] CVE Metrics 2025
[3] CWE - Common Weakness Enumeration 2025
[4] Data Quality for Software Vulnerability Datasets International Conference on Software Engineering (ICSE) 2023 121-133 IEEE/ACM 10.1109/ICSE48619.2023.00022
[5] Coding-PTMs: How to Find Optimal Code Pre-trained Models for Code Embedding in Vulnerability Detection? International Conference on Automated Software Engineering (ASE) 2024 1732–1744 IEEE/ACM 10.1145/3691620.3695539
[6] Bert: Pre-training of Deep Bidirectional Transformers for Language Understanding Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT) 2019 4171–4186 ACL
[7] Attention is All you Need Advances in Neural Information Processing Systems (NeurIPS) 2017 30 5998-6008 Curran Associates, Inc.
[8] CEO Speaker Series With Dario Amodei of Anthropic 2025 Interview by Michael Froman
[9] Asleep at the Keyboard? Assessing the Security of GitHub Copilot’s Code Contributions Communications of the ACM 2025 68 2 96–105 10.1145/3610721
[10] No Need to Lift a Finger Anymore? Assessing the Quality of Code Generation by ChatGPT IEEE Transactions on Software Engineering (TSE) 2024 50 6 1548-1584 10.1109/TSE.2024.3392499
[11] 2024. Exploring and Evaluating Hallucinations in LLM-Powered Code Generation. https://arxiv.org/abs/2404.00971
[12] 2025. Security Weaknesses of Copilot-Generated Code in GitHub Projects: An Empirical Study. https://arxiv.org/abs/2310.02059
[13] Vulnerability Handling of AI-Generated Code - Existing Solutions and Open Challenges Conference on AI, Science, Engineering, and Technology (AIxSET) 2024 145-148 IEEE 10.1109/AIxSET62544.2024.00026
[14] Artifacts - Awesome-LLM4SVD 2025
[15] 2025. Large Language Models for Cyber Security: A Systematic Literature Review. https://arxiv.org/abs/2405.04760
[16] When LLMs Meet Cybersecurity: A Systematic Literature Review Cybersecurity 2025 8 1 55
[17] Generative AI in Cybersecurity: A Comprehensive Review of LLM Applications and Vulnerabilities Internet of Things and Cyber-Physical Systems (IOTCPS) 2025 5 1–46 10.1016/j.iotcps.2025.01.001
[18] 2024. A Survey on Large Language Models for Software Engineering. https://arxiv.org/abs/2312.15223
[19] Large Language Models for Software Engineering: A Systematic Literature Review ACM Transactions on Software Engineering and Methodology (TOSEM) 2024 33 8 10.1145/3695988
[20] 2025. Assessing and Advancing Benchmarks for Evaluating Large Language Models in Software Engineering Tasks. https://arxiv.org/abs/2505.08903
[21] A Systematic Literature Review on Automated Software Vulnerability Detection Using Machine Learning ACM Computing Surveys 2024 57 3 10.1145/3699711
[22] 2024. SoK: On Closing the Applicability Gap in Automated Vulnerability Detection. https://arxiv.org/abs/2412.11194
[23] 2025. AI-Based Software Vulnerability Detection: A Systematic Literature Review. https://arxiv.org/abs/2506.10280
[24] Large Language Models for Software Vulnerability Detection: A Guide for Researchers on Models, Methods, Techniques, Datasets, and Metrics International Journal of Information Security (IJIS) 2025 24 2 78 10.1007/s10207-025-00992-7
[25] 2025. From Vulnerabilities to Remediation: A Systematic Literature Review of LLMs in Code Security. https://arxiv.org/abs/2412.15004
[26] Large Language Model for Vulnerability Detection and Repair: Literature Review and the Road Ahead ACM Transactions on Software Engineering and Methodology (TOSEM) 2025 34 5 10.1145/3708522
[27] 2025. LLMs in Software Security: A Survey of Vulnerability Detection Techniques and Insights. https://arxiv.org/abs/2502.07049
[28] Vulnerability Dataset Construction Methods Applied To Vulnerability Detection: A Survey International Conference on Dependable Systems and Networks Workshops (DSN-W) 2022 141-146 IEEE 10.1109/DSN-W54100.2022.00032
[29] A Comprehensive Analysis on Software Vulnerability Detection Datasets: Trends, Challenges, and Road Ahead International Journal of Information Security 2024 23 5 3311–3327 10.1007/s10207-024-00888-y
[30] A Code Centric Evaluation of C/C++ Vulnerability Datasets for Deep Learning Based Vulnerability Detection Techniques Innovations in Software Engineering Conference (ISEC) 2023 ACM 10.1145/3578527.3578530
[31] An Investigation of Quality Issues in Vulnerability Detection Datasets European Symposium on Security and Privacy Workshops (EuroS&PW) 2023 29-33 IEEE 10.1109/EuroSPW59978.2023.00008
[32] Top Score on the Wrong Exam: On Benchmarking in Machine Learning for Vulnerability Detection Proceedings of the ACM on Software Engineering (ASE) 2025 2 ISSTA 10.1145/3728887
[33] R+R: Security Vulnerability Dataset Quality Is Critical Annual Computer Security Applications Conference (ACSAC) 2024 1047-1061 IEEE 10.1109/ACSAC63791.2024.00086
[34] Guidelines for Performing Systematic Literature Reviews in Software Engineering Keele University 2007
[35] Guidelines for Conducting Systematic Mapping Studies in Software Engineering: An Update Information and Software Technology (INFSOF) 2015 64 1–18 10.1016/j.infsof.2015.03.007
[36] 2024. A Systematic Literature Review on Large Language Models for Automated Program Repair. https://arxiv.org/abs/2405.01466
[37] 2024. From Generalist to Specialist: Exploring CWE-Specific Vulnerability Detection. https://arxiv.org/abs/2408.02329
[38] Enhancing Software Code Vulnerability Detection Using GPT-4o and Claude-3.5 Sonnet: A Study on Prompt Engineering Techniques Electronics 2024 13 13 2657 10.3390/electronics13132657
[39] 2024. RealVul: Can We Detect Vulnerabilities in Web Applications with LLM? https://arxiv.org/abs/2410.07573
[40] 2025. Can LLM Prompting Serve as a Proxy for Static Analysis in Vulnerability Detection. https://arxiv.org/abs/2412.12039
[41] 2025. Helping LLMs Improve Code Generation Using Feedback from Testing and Static Analysis. https://arxiv.org/abs/2412.14841
[42] 2025. How Well Do Large Language Models Serve as End-to-End Secure Code Agents for Python? https://arxiv.org/abs/2408.10495
[43] 2025. RepoAudit: An Autonomous LLM-Agent for Repository-Level Code Auditing. https://arxiv.org/abs/2501.18160
[44] Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities Conference on Software Testing, Verification and Validation (ICST) 2025 103-114 IEEE 10.1109/ICST62969.2025.10988968
[45] 2025. IRIS: LLM-Assisted Static Analysis for Detecting Security Vulnerabilities. https://arxiv.org/abs/2405.17238
[46] 2025. Towards Explainable Vulnerability Detection with Large Language Models. https://arxiv.org/abs/2406.09701
[47] Effectiveness of ChatGPT for Static Analysis: How Far Are We? International Conference on AI-Powered Software (AIware) 2024 151–160 ACM 10.1145/3664646.3664777
[48] 2024. Chain-of-Thought Prompting of Large Language Models for Discovering and Fixing Software Vulnerabilities. https://arxiv.org/abs/2402.17230
[49] A New Approach to Web Application Security: Utilizing GPT Language Models for Source Code Inspection Future Internet 2023 15 10 326
[50] LLMs Cannot Reliably Identify and Reason About Security Vulnerabilities (Yet?): A Comprehensive Evaluation, Framework, and Benchmarks Symposium on Security and Privacy (SP) 2024 862-880 IEEE 10.1109/SP54263.2024.00210
[51] 2025. One-for-All Does Not Work! Enhancing Vulnerability Detection by Mixture-of-Experts (MoE). https://arxiv.org/abs/2501.16454
[52] 2024. VulnLLMEval: A Framework for Evaluating Large Language Models in Software Vulnerability Detection and Patching. https://arxiv.org/abs/2409.10756
[53] SV-TrustEval-C: Evaluating Structure and Semantic Reasoning in Large Language Models for Source Code Vulnerability Analysis Symposium on Security and Privacy (SP) 2025 3014-3032 IEEE 10.1109/SP61157.2025.00191
[54] 2025. Trace Gadgets: Minimizing Code Context for Machine Learning-Based Vulnerability Prediction. https://arxiv.org/abs/2504.13676
[55] 2025. Can You Really Trust Code Copilots? Evaluating Large Language Models from a Code Security Perspective. https://arxiv.org/abs/2505.10494
[56] ♪ With a Little Help from My (LLM) Friends: Enhancing Static Analysis with LLMs to Detect Software Vulnerabilities International Workshop on Large Language Models for Code (LLM4Code) 2025 25-32 IEEE/ACM 10.1109/LLM4Code66737.2025.00008
[57] 2025. AutoPatch: Multi-Agent Framework for Patching Real-World CVE Vulnerabilities. https://arxiv.org/abs/2505.04195
[58] 2025. You Only Train Once: A Flexible Training Framework for Code Vulnerability Detection Driven by Vul-Vector. https://arxiv.org/abs/2506.10988
[59] 2025. Leveraging Large Language Models for Command Injection Vulnerability Analysis in Python: An Empirical Study on Popular Open-Source Projects. https://arxiv.org/abs/2505.15088
[60] A Method of SQL Injection Attack Detection Based on Large Language Models International Conference on Computer Network Technology and Electronic and Information Engineering (CNTEIE) 2024 154-158 IEEE 10.1109/CNTEIE66268.2024.00035
[61] SSRFSeek: An LLM-based Static Analysis Framework for Detecting SSRF Vulnerabilities in PHP Applications International Seminar on Artificial Intelligence, Networking and Information Technology (AINIT) 2025 939-944 IEEE 10.1109/AINIT65432.2025.11035424
[62] SecureMind: A Framework for Benchmarking Large Language Models in Memory Bug Detection and Repair International Symposium on Memory Management (ISMM) 2025 27–40 ACM 10.1145/3735950.3735954
[63] 2025. SafeGenBench: A Benchmark Framework for Security Vulnerability Detection in LLM-Generated Code. https://arxiv.org/abs/2506.05692
[65] 2025. A Comprehensive Study of LLM Secure Code Generation. https://arxiv.org/abs/2503.15554
[66] 2024. From Solitary Directives to Interactive Encouragement! LLM Secure Code Generation by Natural Language Prompting. https://arxiv.org/abs/2410.14321
[67] 2024. AutoSafeCoder: A Multi-Agent Framework for Securing LLM Code Generation through Static Analysis and Fuzz Testing. https://arxiv.org/abs/2409.10737
[68] 2023. Enhancing Large Language Models for Secure Code Generation: A Dataset-driven Study on Vulnerability Mitigation. https://arxiv.org/abs/2310.16263
[69] Fight Fire With Fire: How Much Can We Trust ChatGPT on Source Code-Related Tasks? IEEE Transactions on Software Engineering (TSE) 2024 50 12 3435-3453 10.1109/TSE.2024.3492204
[70] DB-CBIL: A DistilBert-Based Transformer Hybrid Model Using CNN and BiLSTM for Software Vulnerability Detection IEEE Access 2024 12 64446-64460 10.1109/ACCESS.2024.3396410
[71] 2023. Using ChatGPT as a Static Application Security Testing Tool. https://arxiv.org/abs/2308.14434
[72] MultiVD: A Transformer-based Multitask Approach for Software Vulnerability Detection International Conference on Security and Cryptography (SECRYPT) 2024 416–423 SCITEPRESS
[73] Leveraging Deep Learning Models for Cross-function Null Pointer Risks Detection International Conference On Artificial Intelligence Testing (AITest) 2023 107-113 IEEE 10.1109/AITest58265.2023.00025
[74] 2024. Generalization-Enhanced Code Vulnerability Detection via Multi-Task Instruction Fine-Tuning. https://arxiv.org/abs/2406.03718
[75] AIBugHunter: A Practical Tool for Predicting, Classifying and Repairing Software Vulnerabilities Empirical Software Engineering (EMSE) 2024 29 1 4
[76] LineVul: A Transformer-based Line-level Vulnerability Prediction International Conference on Mining Software Repositories (MSR) 2022 608–620 ACM 10.1145/3524842.3528452
[77] 2024. Unintentional Security Flaws in Code: Automated Defense via Root Cause Analysis. https://arxiv.org/abs/2409.00199
[78] Software Vulnerability Prediction in Low-Resource Languages: An Empirical Study of CodeBERT and ChatGPT International Conference on Evaluation and Assessment in Software Engineering (EASE) 2024 679–685 ACM 10.1145/3661167.3661281
[79] Enhancing Static Analysis for Practical Bug Detection: An LLM-Integrated Approach Proceedings of the ACM on Programming Languages (PACMPL) 2024 8 OOPSLA1 10.1145/3649828
[80] 2024. VulDetectBench: Evaluating the Deep Capability of Vulnerability Detection with Large Language Models. https://arxiv.org/abs/2406.07595
[81] 2024. Automated Software Vulnerability Static Code Analysis Using Generative Pre-Trained Transformer Models. https://arxiv.org/abs/2408.00197
[82] 2024. A Preliminary Study on Using Large Language Models in Software Pentesting. https://arxiv.org/abs/2401.17459
[83] Large Language Models for In-File Vulnerability Localization Can Be "Lost in the End" Proceedings of the ACM on Software Engineering (PACMSE) 2025 2 FSE 10.1145/3715758
[84] Closing the Gap: A User Study on the Real-world Usefulness of AI-powered Vulnerability Detection & Repair in the IDE International Conference on Software Engineering (ICSE) 2025 1-13 IEEE/ACM 10.1109/ICSE55347.2025.00126
[85] 2025. To Err is Machine: Vulnerability Detection Challenges LLM Reasoning. https://arxiv.org/abs/2403.17218
[86] Harnessing Large Language Models for Software Vulnerability Detection: A Comprehensive Benchmarking Study IEEE Access 2025 13 29698-29717 10.1109/ACCESS.2025.3541146
[87] 2024. Beyond ChatGPT: Enhancing Software Quality Assurance Tasks with Diverse LLMs and Validation Techniques. https://arxiv.org/abs/2409.01001
[88] Multitask-Based Evaluation of Open-Source LLM on Software Vulnerability IEEE Transactions on Software Engineering (TSE) 2024 50 11 3071-3087 10.1109/TSE.2024.3470333
[89] 2025. An Insight into Security Code Review with LLMs: Capabilities, Obstacles, and Influential Factors. https://arxiv.org/abs/2401.16310
[90] 2025. SecVulEval: Benchmarking LLMs for Real-World C/C++ Vulnerability Detection. https://arxiv.org/abs/2505.19828
[91] 2025. FuncVul: An Effective Function Level Vulnerability Detection Model using LLM and Code Chunk. https://arxiv.org/abs/2506.19453
[92] PTLVD: Program Slicing and Transformer-based Line-level Vulnerability Detection System International Working Conference on Source Code Analysis and Manipulation (SCAM) 2023 162-173 IEEE 10.1109/SCAM59687.2023.00026
[93] 2025. Boosting Vulnerability Detection of LLMs via Curriculum Preference Optimization with Synthetic Reasoning Data. https://arxiv.org/abs/2506.07390
[94] 2025. Improving Vulnerability Type Prediction and Line-Level Detection via Adversarial Training-based Data Augmentation and Multi-Task Learning. https://arxiv.org/abs/2506.23534
[95] ChatGPT for Vulnerability Detection, Classification, and Repair: How Far Are We? Asia-Pacific Software Engineering Conference (APSEC) 2023 632-636 IEEE 10.1109/APSEC60848.2023.00085
[96] 2025. Streamlining Security Vulnerability Triage with Large Language Models. https://arxiv.org/abs/2501.18908
[97] 2025. VADER: A Human-Evaluated Benchmark for Vulnerability Assessment, Detection, Explanation, and Remediation. https://arxiv.org/abs/2505.19395
[98] An Ensemble Transformer Approach with Cross-Attention for Automated Code Security Vulnerability Detection and Documentation International Symposium on Digital Forensics and Security (ISDFS) 2025 1-6 IEEE 10.1109/ISDFS65363.2025.11012039
[99] 2025. Let the Trial Begin: A Mock-Court Approach to Vulnerability Detection using LLM-Based Agents. https://arxiv.org/abs/2505.10961
[100] 2025. Vul-RAG: Enhancing LLM-based Vulnerability Detection via Knowledge-level RAG. https://arxiv.org/abs/2406.11147
[101] May the Source Be with You: On ChatGPT, Cybersecurity, and Secure Coding Information 2024 15 9 572 10.3390/info15090572
[102] Exploring AI for Vulnerability Detection and Repair Cyber Awareness and Research Symposium (CARS) 2024 1-9 IEEE 10.1109/CARS61786.2024.10778769
[103] 2024. Enhancing Source Code Security with LLMs: Demystifying The Challenges and Generating Reliable Repairs. https://arxiv.org/abs/2409.00571
[104] Code Vulnerability Repair with Large Language Model Using Context-Aware Prompt Tuning Security and Privacy Workshops (SPW) 2025 283-287 IEEE 10.1109/SPW67851.2025.00040
[105] Assessing the Effectiveness of LLMs in Android Application Vulnerability Analysis International Conference on Attacks and Defenses for Internet-of-Things (ADIoT) 2024 139–154 Springer 10.1007/978-3-031-85593-1_9
[106] 2023. Can Large Language Models Find And Fix Vulnerable Software? https://arxiv.org/abs/2308.10345
[107] 2023. Exploring the Limits of ChatGPT in Software Security Applications. https://arxiv.org/abs/2312.05275
[108] 2024. Security Vulnerability Detection with Multitask Self-Instructed Fine-Tuning of Large Language Models. https://arxiv.org/abs/2406.05892
[109] Automating the Detection of Code Vulnerabilities by Analyzing GitHub Issues International Workshop on Large Language Models for Code (LLM4Code) 2025 41-48 IEEE/ACM 10.1109/LLM4Code66737.2025.00010
[110] Software Vulnerability and Functionality Assessment using Large Language Models International Workshop on NL-Based Software Engineering (NLBSE) 2024 25–28 ACM/IEEE 10.1145/3643787.3648036
[111] Distinguishing Look-Alike Innocent and Vulnerable Code by Subtle Semantic Representation Learning and Explanation Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE) 2023 1611–1622 ACM 10.1145/3611643.3616358
[112] 2025. LLM4Vuln: A Unified Evaluation Framework for Decoupling and Enhancing LLMs' Vulnerability Reasoning. https://arxiv.org/abs/2401.16185
[113] DLAP: A Deep Learning Augmented Large Language Model Prompting Framework for Software Vulnerability Detection Journal of Systems and Software (JSS) 2025 219 112234 10.1016/j.jss.2024.112234
[114] 2025. Benchmarking LLMs and LLM-based Agents in Practical Vulnerability Detection for Code Repositories. https://arxiv.org/abs/2503.03586
[115] 2025. Reasoning with LLMs for Zero-Shot Vulnerability Detection. https://arxiv.org/abs/2503.17885
[116] 2025. Everything You Wanted to Know About LLM-based Vulnerability Detection But Were Afraid to Ask. https://arxiv.org/abs/2504.13474
[117] 2025. SAVANT: Vulnerability Detection in Application Dependencies through Semantic-Guided Reachability Analysis. https://arxiv.org/abs/2506.17798
[118] 2025. R2Vul: Learning to Reason about Software Vulnerabilities with Reinforcement Learning and Structured Reasoning Distillation. https://arxiv.org/abs/2504.04699
[119] Unified Pre-training for Program Understanding and Generation Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT) 2021 2655–2668 ACL 10.18653/v1/2021.naacl-main.211
[120] 2024. Vulnerability Detection in Popular Programming Languages with Language Models. https://arxiv.org/abs/2412.15905
[121] 2020. Exploring Software Naturalness through Neural Language Models. https://arxiv.org/abs/2006.12641
[122] 2023. Transformer-based Vulnerability Detection in Code at EditTime: Zero-shot, Few-shot, or Fine-tuning? https://arxiv.org/abs/2306.01754
[123] Bridge and Hint: Extending Pre-trained Language Models for Long-Range Code International Symposium on Software Testing and Analysis (ISSTA) 2024 274–286 ACM 10.1145/3650212.3652127
[124] DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection International Symposium on Research in Attacks, Intrusions and Defenses (RAID) 2023 654–668 IEEE 10.1145/3607199.3607242
[125] Can a Llama Be a Watchdog? Exploring Llama 3 and Code Llama for Static Application Security Testing International Conference on Cyber Security and Resilience (CSR) 2024 395-400 IEEE 10.1109/CSR61664.2024.10679444
[126] On the Compression of Language Models for Code: An Empirical Study on CodeBERT International Conference on Software Analysis, Evolution and Reengineering (SANER) 2025 12-23 IEEE 10.1109/SANER64311.2025.00010
[127] Improving Long-Tail Vulnerability Detection Through Data Augmentation Based on Large Language Models International Conference on Software Maintenance and Evolution (ICSME) 2024 262-274 IEEE 10.1109/ICSME58944.2024.00033
[128] Vulnerability Detection with Code Language Models: How Far are We? International Conference on Software Engineering (ICSE) 2025 1729-1741 IEEE/ACM 10.1109/ICSE55347.2025.00038
[129] Joint Geometrical and Statistical Domain Adaptation for Cross-domain Code Vulnerability Detection Conference on Empirical Methods in Natural Language Processing (EMNLP) 2023 12791–12800 ACL
[130] Python Source Code Vulnerability Detection with Named Entity Recognition Computers & Security (COSE) 2024 140 103802 10.1016/j.cose.2024.103802
[131] SecureFalcon: Are We There Yet in Automated Software Vulnerability Detection With LLMs? Transactions on Software Engineering (TSE) 2025 51 4 1248-1265 10.1109/TSE.2025.3548168
[132] VulExplainer: A Transformer-Based Hierarchical Distillation for Explaining Vulnerability Types IEEE Transactions on Software Engineering (TSE) 2023 49 10 4550–4565
[133] Keeping Pace with Ever-Increasing Data: Towards Continual Learning of Code Intelligence Models International Conference on Software Engineering (ICSE) 2023 30-42 IEEE/ACM 10.1109/ICSE48619.2023.00015
[134] Evaluating LLaMA 3.2 for Software Vulnerability Detection European Interdisciplinary Cybersecurity Conference (EICC) 2025 38–51 Springer
[135] The EarlyBIRD Catches the Bug: On Exploiting Early Layers of Encoder Models for More Efficient Code Classification Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE) 2023 895–907 ACM 10.1145/3611643.3616304
[136] BiT5: A Bidirectional NLP Approach for Advanced Vulnerability Detection in Codebase Procedia Computer Science 2024 233 812–821
[137] DetectBERT: Code Vulnerability Detection Global Conference on Communications and Information Technologies (GCCIT) 2024 1-21 IEEE 10.1109/GCCIT63234.2024.10862235
[138] Outside the Comfort Zone: Analysing LLM Capabilities in Software Vulnerability Detection European Symposium on Research in Computer Security (ESORICS) 2024 271–289 Springer 10.1007/978-3-031-70879-4_14
[139] VulBERTa: Simplified Source Code Pre-Training for Vulnerability Detection International Joint Conference on Neural Networks (IJCNN) 2022 1-8 IEEE 10.1109/IJCNN55064.2022.9892280
[140] Leveraging an Enhanced CodeBERT-Based Model for Multiclass Software Defect Prediction via Defect Classification IEEE Access 2025 13 24383-24397 10.1109/ACCESS.2024.3525069
[141] DFEPT: Data Flow Embedding for Enhancing Pre-Trained Model Based Vulnerability Detection Asia-Pacific Symposium on Internetware (Internetware) 2024 95–104 ACM 10.1145/3671016.3671388
[142] 2025. Investigating Large Language Models for Code Vulnerability Detection: An Experimental Study. https://arxiv.org/abs/2412.18260
[143] StagedVulBERT: Multigranular Vulnerability Detection With a Novel Pretrained Code Model IEEE Transactions on Software Engineering (TSE) 2024 50 12 3454-3471 10.1109/TSE.2024.3493245
[144] Applying Contrastive Learning to Code Vulnerability Type Classification Conference on Empirical Methods in Natural Language Processing (EMNLP) 2024 11942–11952 ACL
[145] Vulnerability Classification on Source Code Using Text Mining and Deep Learning Techniques International Conference on Software Quality, Reliability, and Security Companion (QRS-C) 2024 47-56 IEEE 10.1109/QRS-C63300.2024.00017
[146] Vulnerability Prediction using Pre-trained Models: An Empirical Evaluation International Conference on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS) 2024 1-6 IEEE 10.1109/MASCOTS64422.2024.10786510
[147] A Source Code Vulnerability Detection Method Based on Positive-Unlabeled Learning International Conference on Robotics, Intelligent Control and Artificial Intelligence (RICAI) 2024 551-556 IEEE 10.1109/RICAI64321.2024.10911761
[148] Fine-Tuning Transformer LLMs for Detecting SQL Injection and XSS Vulnerabilities International Conference on Artificial Intelligence in Information and Communication (ICAIIC) 2025 0946-0951 IEEE 10.1109/ICAIIC64266.2025.10920868
[149] 2025. CleanVul: Automatic Function-Level Vulnerability Detection in Code Commits Using LLM Heuristics. https://arxiv.org/abs/2411.17274
[150] Distilled and Contextualized Neural Models Benchmarked for Vulnerable Function Detection Mathematics 2022 10 23 4482 10.3390/math10234482
[151] Steering Large Language Models for Vulnerability Detection International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025 1-5 IEEE 10.1109/ICASSP49660.2025.10887736
[152] Pre-training by Predicting Program Dependencies for Vulnerability Analysis Tasks International Conference on Software Engineering (ICSE) 2024 1863-1875 IEEE/ACM 10.1145/3597503.3639142
[153] Assessing the Effectiveness of Vulnerability Detection via Prompt Tuning: An Empirical Study Asia-Pacific Software Engineering Conference (APSEC) 2023 415-424 IEEE 10.1109/APSEC60848.2023.00052
[154] Detecting Integer Overflow Errors in Java Source Code via Machine Learning International Conference on Tools with Artificial Intelligence (ICTAI) 2021 724-728 IEEE 10.1109/ICTAI52525.2021.00115
[155] 2025. HALURust: Exploiting Hallucinations of Large Language Models to Detect Vulnerabilities in Rust. https://arxiv.org/abs/2503.10793
[156] Exploring Transformers for Multi-Label Classification of Java Vulnerabilities International Conference on Software Quality, Reliability and Security (QRS) 2022 43-52 IEEE 10.1109/QRS57517.2022.00015
[157] 2024. Enhancing Reverse Engineering: Investigating and Benchmarking Large Language Models for Vulnerability Analysis in Decompiled Binaries. https://arxiv.org/abs/2411.04981
[158] SecureQwen: Leveraging LLMs for Vulnerability Detection in Python Codebases Computers & Security (COSE) 2025 148 104151 10.1016/j.cose.2024.104151
[159] Detecting Vulnerabilities in IoT Software: New Hybrid Model and Comprehensive Data Analysis Journal of Information Security and Applications (JISA) 2023 74 103467 10.1016/j.jisa.2023.103467
[160] 2024. SAFE: Advancing Large Language Models in Leveraging Semantic and Syntactic Relationships for Software Vulnerability Detection. https://arxiv.org/abs/2409.00882
[161] An Empirical Study on Software Defect Prediction Using CodeBERT Model Applied Sciences 2021 11 11 4793
[162] 2024. Enhancing Pre-Trained Language Models for Vulnerability Detection via Semantic-Preserving Data Augmentation. https://arxiv.org/abs/2410.00249
[163] Towards Causal Deep Learning for Vulnerability Detection International Conference on Software Engineering (ICSE) 2024 IEEE/ACM 10.1145/3597503.3639170
[164] EnStack: An Ensemble Stacking Framework of Large Language Models for Enhanced Vulnerability Detection in Source Code International Conference on Big Data (BigData) 2024 6356-6364 IEEE 10.1109/BigData62323.2024.10825609
[165] Uncovering the Limits of Machine Learning for Automatic Vulnerability Detection USENIX Security Symposium (USENIX Security) 2024 4247–4264 USENIX
[166] Leveraging Multi-Task Learning to Improve the Detection of SATD and Vulnerability International Conference on Program Comprehension (ICPC) 2025 1-12 IEEE/ACM 10.1109/ICPC66645.2025.00017
[167] Deep Learning-Based Framework for Automated Vulnerability Detection in Android Applications International Bhurban Conference on Applied Sciences and Technology (IBCAST) 2023 1-5 IEEE 10.1109/IBCAST59916.2023.10713017
[168] Finetuning Large Language Models for Vulnerability Detection IEEE Access 2025 13 38889-38900 10.1109/ACCESS.2025.3546700
[169] VulSim: Leveraging Similarity of Multi-Dimensional Neighbor Embeddings for Vulnerability Detection USENIX Security Symposium (USENIX Security) 2024 1777–1794 USENIX
[170] Cyber Security Vulnerability Detection Using Natural Language Processing World AI IoT Congress (AIIoT) 2022 174-178 IEEE 10.1109/AIIoT54504.2022.9817336
[171] 2023. Do Language Models Learn Semantics of Code? A Case Study in Vulnerability Detection. https://arxiv.org/abs/2311.04109
[172] 2024. Code Vulnerability Detection: A Comparative Analysis of Emerging Large Language Models. https://arxiv.org/abs/2409.10490
[173] 2024. Enhanced LLM-Based Framework for Predicting Null Pointer Dereference in Source Code. https://arxiv.org/abs/2412.00216
[174] Optimizing Pre-trained Language Models for Efficient Vulnerability Detection in Code Snippets International Conference on Computer and Communications (ICCC) 2023 2139-2143 IEEE 10.1109/ICCC59590.2023.10507456
[175] SQL Injection Vulnerability Detection Based on Pissa-Tuned Llama 3 Large Language Model International Conference on Frontier Technologies of Information and Computer (ICFTIC) 2024 255-259 IEEE 10.1109/ICFTIC64248.2024.10912886
[176] Software Defect Prediction Employing BiLSTM and BERT-based Semantic Feature Soft Computing 2022 26 16 7877–7891
[177] 2025. ANVIL: Anomaly-based Vulnerability Identification without Labelled Training Data. https://arxiv.org/abs/2408.16028
[178] 2024. Line-level Semantic Structure Learning for Code Vulnerability Detection. https://arxiv.org/abs/2407.18877
[179] 2024. M2CVD: Enhancing Vulnerability Semantic through Multi-Model Collaboration for Code Vulnerability Detection. https://arxiv.org/abs/2406.05940
[180] Parameter-efficient Multi-classification Software Defect Detection Method based on Pre-trained LLMs International Journal of Computational Intelligence Systems (IJCIS) 2024 17 1 152 10.1007/s44196-024-00551-3
[181] Prompt Tuning in Code Intelligence: An Experimental Evaluation IEEE Transactions on Software Engineering (TSE) 2023 49 11 4869-4885 10.1109/TSE.2023.3313881
[182] 2024. VulEval: Towards Repository-Level Evaluation of Software Vulnerability Detection. https://arxiv.org/abs/2404.15596
[183] VulD-CodeBERT: CodeBERT-Based Vulnerability Detection Model for C/C++ Code International Conference on Communications, Information System and Computer Engineering (CISCE) 2024 914-919 IEEE 10.1109/CISCE62493.2024.10653337
[184] Software Vulnerabilities Detection Based on a Pre-trained Language Model International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom) 2023 904-911 IEEE 10.1109/TrustCom60117.2023.00129
[185] Deep Neural Embedding for Software Vulnerability Discovery: Comparison and Optimization Security and Communication Networks 2022 2022 1 5203217 10.1155/2022/5203217
[186] 2024. Your Instructions Are Not Always Helpful: Assessing the Efficacy of Instruction Fine-tuning for Software Vulnerability Detection. https://arxiv.org/abs/2401.07466
[187] Intelligent Detection of Vulnerable Functions in Software through Neural Embedding-based Code Analysis International Journal of Network Management (IJNM) 2023 33 3 e2198 10.1002/nem.2198
[188] 2025. Benchmarking Large Language Models for Multi-Language Software Vulnerability Detection. https://arxiv.org/abs/2503.01449
[189] 2024. MVD: A Multi-Lingual Software Vulnerability Detection Framework. https://arxiv.org/abs/2412.06166
[190] Python Source Code Vulnerability Detection Based on CodeBERT Language Model International Conference on Algorithms, Computing and Artificial Intelligence (ACAI) 2024 1-6 IEEE 10.1109/ACAI63924.2024.10899694
[191] 2024. Comparison of Static Application Security Testing Tools and Large Language Models for Repo-level Vulnerability Detection. https://arxiv.org/abs/2407.16235
[192] Large Language Model for Vulnerability Detection: Emerging Results and Future Directions International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER) 2024 47–51 ACM/IEEE 10.1145/3639476.3639762
[193] Detecting Source Code Vulnerabilities Using Fine-Tuned Pre-Trained LLMs International Conference on Signal Processing (ICSP) 2024 238-242 IEEE 10.1109/ICSP62129.2024.10846595
[194] Security Vulnerability Detection Using Deep Learning Natural Language Processing Conference on Computer Communications Workshops (INFOCOM WKSHPS) 2021 1-6 IEEE 10.1109/INFOCOMWKSHPS51825.2021.9484500
[195] AutoAdapt: On the Application of AutoML for Parameter-Efficient Fine-Tuning of Pre-Trained Code Models ACM Transactions on Software Engineering and Methodology (TOSEM) 2025 Just Accepted Just Accepted 10.1145/3734867
[196] 2025. Case Study: Fine-tuning Small Language Models for Accurate and Private CWE Detection in Python Code. https://arxiv.org/abs/2504.16584
[197] LineVD: Statement-level Vulnerability Detection using Graph Neural Networks International Conference on Mining Software Repositories (MSR) 2022 596–607 ACM 10.1145/3524842.3527949
[198] LPASS: Linear Probes as Stepping Stones for Vulnerability Detection using Compressed LLMs Journal of Information Security and Applications (JISA) 2025 93 104125 10.1016/j.jisa.2025.104125
[199] LLM-Based Approach for Buffer Overflow Detection in Source Code International Conference on Computer and Information Technology (ICCIT) 2024 1898-1902 IEEE 10.1109/ICCIT64611.2024.11021816
[200] 2025. Are Sparse Autoencoders Useful for Java Function Bug Detection? https://arxiv.org/abs/2505.10375
[201] When Less is Enough: Positive and Unlabeled Learning Model for Vulnerability Detection International Conference on Automated Software Engineering (ASE) 2023 345-357 IEEE/ACM 10.1109/ASE56229.2023.00144
[202] 2025. Learning to Focus: Context Extraction for Efficient Code Vulnerability Detection with Language Models. https://arxiv.org/abs/2505.17460
[203] 2025. Smart Cuts: Enhance Active Learning for Vulnerability Detection by Pruning Bad Seeds. https://arxiv.org/abs/2506.20444
[204] Metamorphic-Based Many-Objective Distillation of LLMs for Code-Related Tasks International Conference on Software Engineering (ICSE) 2025 1001-1013 IEEE/ACM 10.1109/ICSE55347.2025.00230
[205] Greening Large Language Models of Code International Conference on Software Engineering: Software Engineering in Society (ICSE-SEIS) 2024 142–153 ACM 10.1145/3639475.3640097
[206] Code Defect Detection Method Based on BERT and Ensemble International Conference on Computer and Communications (ICCC) 2023 2130-2138 IEEE 10.1109/ICCC59590.2023.10507306
[207] DP-CCL: A Supervised Contrastive Learning Approach Using CodeBERT Model in Software Defect Prediction IEEE Access 2024 12 22582-22594 10.1109/ACCESS.2024.3362896
[208] GRACE: Empowering LLM-based Software Vulnerability Detection with Graph Structure and In-Context Learning Journal of Systems and Software (JSS) 2024 212 112031
[209] SCALE: Constructing Structured Natural Language Comment Trees for Software Vulnerability Detection International Symposium on Software Testing and Analysis (ISSTA) 2024 235–247 ACM 10.1145/3650212.3652124
[210] SCL-CVD: Supervised Contrastive Learning for Code Vulnerability Detection via GraphCodeBERT Computers & Security (COSE) 2024 145 103994 10.1016/j.cose.2024.103994
[211] AIDetectVul: Software Vulnerability Detection Method Based on Feature Fusion of Pre-trained Models International Conference on Consumer Electronics and Computer Engineering (ICCECE) 2025 258-263 IEEE 10.1109/ICCECE65250.2025.10985370
[212] Vulnerability Detection by Learning From Syntax-Based Execution Paths of Code IEEE Transactions on Software Engineering (TSE) 2023 49 8 4196-4212 10.1109/TSE.2023.3286586
[213] Enhancing Source Code Vulnerability Detection Using Flattened Code Graph Structures International Conference on Frontier Technologies of Information and Computer (ICFTIC) 2024 209-213 IEEE 10.1109/ICFTIC64248.2024.10913325
[214] PATVD: Vulnerability Detection Based on Pre-training Techniques and Adversarial Training Smartworld, Ubiquitous Intelligence & Computing, Scalable Computing & Communications, Digital Twin, Privacy Computing, Metaverse, Autonomous & Trusted Vehicles (SmartWorld/UIC/ScalCom/DigitalTwin/PriComp/Meta) 2022 1774-1781 IEEE 10.1109/SmartWorld-UIC-ATC-ScalCom-DigitalTwin-PriComp-Metaverse56740.2022.00253
[215] Exploration On Prompting LLM With Code-Specific Information For Vulnerability Detection International Conference on Software Services Engineering (SSE) 2024 273-281 IEEE 10.1109/SSE62657.2024.00049
[216] Vul-LMGNNs: Fusing Language Models and Online-distilled Graph Neural Networks for Code Vulnerability Detection Information Fusion 2025 115 102748 10.1016/j.inffus.2024.102748
[217] 2025. CGP-Tuning: Structure-Aware Soft Prompt Tuning for Code Vulnerability Detection. https://arxiv.org/abs/2501.04510
[218] 2025. Context-Enhanced Vulnerability Detection Based on Large Language Model. https://arxiv.org/abs/2504.16877
[219] GraphCodeBERT-Augmented Graph Attention Networks for Code Vulnerability Detection Conference on Artificial Intelligence (CAI) 2025 912-917 IEEE 10.1109/CAI64502.2025.00161
[220] 2023. DefectHunter: A Novel LLM-Driven Boosted-Conformer-based Code Vulnerability Detection Mechanism. https://arxiv.org/abs/2309.15324
[221] An Unbiased Transformer Source Code Learning with Semantic Vulnerability Graph European Symposium on Security and Privacy (EuroS&P) 2023 144-159 IEEE 10.1109/EuroSP57164.2023.00018
[222] SVulDetector: Vulnerability Detection based on Similarity using Tree-based Attention and Weighted Graph Embedding Mechanisms Computers & Security (COSE) 2024 144 103930 10.1016/j.cose.2024.103930
[223] Enhancing Vulnerability Detection Efficiency: An Exploration of Light-weight LLMs with Hybrid Code Features Journal of Information Security and Applications (JISA) 2025 88 103925 10.1016/j.jisa.2024.103925
[224] Comparing the Performance of Different Code Representations for Learning-based Vulnerability Detection Asia-Pacific Symposium on Internetware (Internetware) 2023 174–184 ACM 10.1145/3609437.3609464
[225] Function-Level Vulnerability Detection Through Fusing Multi-Modal Knowledge International Conference on Automated Software Engineering (ASE) 2023 1911-1918 IEEE/ACM 10.1109/ASE56229.2023.00084
[226] BERT-Based Vulnerability Type Identification with Effective Program Representation International Conference on Wireless Algorithms, Systems, and Applications (WASA) 2022 271–282 Springer 10.1007/978-3-031-19208-1_23
[227] BBVD: A BERT-based Method for Vulnerability Detection International Journal of Advanced Computer Science and Applications (IJACSA) 2022 13 12 10.14569/IJACSA.2022.01312103
[228] An Enhanced Vulnerability Detection in Software Using a Heterogeneous Encoding Ensemble Symposium on Computers and Communications (ISCC) 2023 1214-1220 IEEE 10.1109/ISCC58397.2023.10217978
[229] VulDeBERT: A Vulnerability Detection System Using BERT International Symposium on Software Reliability Engineering Workshops (ISSREW) 2022 69-74 IEEE 10.1109/ISSREW55968.2022.00042
[230] Transformer-Based Language Models for Software Vulnerability Detection Annual Computer Security Applications Conference (ACSAC) 2022 481–496 ACM 10.1145/3564625.3567985
[231] VulDefend: A Novel Technique based on Pattern-exploiting Training for Detecting Software Vulnerabilities Using Language Models Jordan International Joint Conference on Electrical Engineering and Information Technology (JEEIT) 2023 287-293 IEEE 10.1109/JEEIT58638.2023.10185860
[232] VulDetect: A novel technique for detecting software vulnerabilities using Language Models International Conference on Cyber Security and Resilience (CSR) 2023 105-110 IEEE 10.1109/CSR57506.2023.10224924
[233] VULREM: Fine-Tuned BERT-Based Source-Code Potential Vulnerability Scanning System to Mitigate Attacks in Web Applications Applied Sciences 2024 14 21 9697 10.3390/app14219697
[234] Automated Software Vulnerability Detection via Pre-trained Context Encoder and Self Attention International Conference on Digital Forensics and Cyber Crime (ICDF2C) 2021 248–264 Springer
[235] Software Vulnerability Detection using Large Language Models International Symposium on Software Reliability Engineering Workshops (ISSREW) 2023 112-119 IEEE 10.1109/ISSREW60843.2023.00058
[236] XGV-BERT: Leveraging Contextualized Language Model and Graph Neural Network for Efficient Software Vulnerability Detection The Journal of Supercomputing 2025 81 6 750 10.1007/s11227-025-07198-7
[237] Adversarial Training for Robustness Enhancement in LLM-Based Code Vulnerability Detection International Conference on Communications, Information System and Computer Engineering (CISCE) 2025 1147-1152 IEEE 10.1109/CISCE65916.2025.11065803
[238] VulDeePecker: A Deep Learning-Based System for Vulnerability Detection Network and Distributed Systems Security Symposium (NDSS) 2018 1-15 The Internet Society 10.14722/ndss.2018.23158
[239] TRACED: Execution-aware Pre-training for Source Code International Conference on Software Engineering (ICSE) 2024 IEEE/ACM 10.1145/3597503.3608140
[240] LLM-CloudSec: Large Language Model Empowered Automatic and Deep Vulnerability Analysis for Intelligent Clouds International Conference on Computer Communications Workshops (INFOCOM WKSHPS) 2024 1-6 IEEE 10.1109/INFOCOMWKSHPS61880.2024.10620804
[241] Research on the LLM-Driven Vulnerability Detection System Using LProtector International Conference on Data Science and Computer Application (ICDSCA) 2024 192-196 IEEE 10.1109/ICDSCA63855.2024.10859408
[242] Software Vulnerability Detection Using LLM: Does Additional Information Help? Annual Computer Security Applications Conference Workshops (ACSAC Workshops) 2024 216-223 IEEE 10.1109/ACSACW65225.2024.00031
[243] 2024. Boosting Cybersecurity Vulnerability Scanning based on LLM-supported Static Application Security Testing. https://arxiv.org/abs/2409.15735
[244] 2023. How Far Have We Gone in Vulnerability Detection Using Large Language Models. https://arxiv.org/abs/2311.12420
[245] Software Vulnerability Detection with GPT and In-Context Learning International Conference on Data Science in Cyberspace (DSC) 2023 229-236 IEEE 10.1109/DSC59305.2023.00041
[246] Detecting Code Vulnerabilities using LLMs International Conference on Dependable Systems and Networks (DSN) 2025 401-414 IEEE/IFIP 10.1109/DSN64029.2025.00047
[247] VulnGPT: Enhancing Source Code Vulnerability Detection Using AutoGPT and Adaptive Supervision Strategies International Conference on Distributed Computing in Smart Systems and the Internet of Things (DCOSS-IoT) 2024 450-454 IEEE 10.1109/DCOSS-IoT61029.2024.00072
[248] 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. https://arxiv.org/abs/1907.11692
[249] 2020. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. https://arxiv.org/abs/1910.01108
[250] Language Models are Unsupervised Multitask Learners OpenAI Blog 2019 1 8 9
[251] Language Models are Few-shot Learners Advances in Neural Information Processing Systems (NeurIPS) 2020 33 1877–1901 Curran Associates, Inc.
[252] 2024. GPT-4 Technical Report. https://arxiv.org/abs/2303.08774
[253] 2023. LLaMA: Open and Efficient Foundation Language Models. https://arxiv.org/abs/2302.13971
[254] 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models. https://arxiv.org/abs/2307.09288
[255] 2024. The Llama 3 Herd of Models. https://arxiv.org/abs/2407.21783
[256] 2023. Qwen Technical Report. https://arxiv.org/abs/2309.16609
[257] 2024. Qwen2 Technical Report. https://arxiv.org/abs/2407.10671
[258] 2025. Qwen2.5 Technical Report. https://arxiv.org/abs/2412.15115
[259] 2024. Gemma: Open Models Based on Gemini Research and Technology. https://arxiv.org/abs/2403.08295
[260] 2024. Gemma 2: Improving Open Language Models at a Practical Size. https://arxiv.org/abs/2408.00118
[261] 2025. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. https://arxiv.org/abs/2501.12948
[262] 2024. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. https://arxiv.org/abs/2405.04434
[263] 2023. Mistral 7B. https://arxiv.org/abs/2310.06825
[264] 2024. Mixtral of Experts. https://arxiv.org/abs/2401.04088
[265] Exploring the Limits of Transfer Learning with a Unified Text-to-text Transformer Journal of Machine Learning Research (JMLR) 2020 21 140 1–67
[266] The Claude 3 Model Family: Opus, Sonnet, Haiku Claude-3 Model Card 2024 1 1
[267] 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. https://arxiv.org/abs/2403.05530
[268] Phi-2: The Surprising Power of Small Language Models Microsoft Research Blog 2023 1 3 3
[269] 2024. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. https://arxiv.org/abs/2404.14219
[270] 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. https://arxiv.org/abs/2002.08155
[271] 2021. GraphCodeBERT: Pre-training Code Representations with Data Flow. https://arxiv.org/abs/2009.08366
[272] 2023. StarCoder: may the source be with you! https://arxiv.org/abs/2305.06161
[273] 2024. StarCoder 2 and The Stack v2: The Next Generation. https://arxiv.org/abs/2402.19173
[274] 2022. UniXcoder: Unified Cross-Modal Pre-training for Code Representation. https://arxiv.org/abs/2203.03850
[275] 2021. Evaluating Large Language Models Trained on Code. https://arxiv.org/abs/2107.03374
[276] 2024. Code Llama: Open Foundation Models for Code. https://arxiv.org/abs/2308.12950
[277] 2024. Qwen2.5-Coder Technical Report. https://arxiv.org/abs/2409.12186
[278] 2024. CodeGemma: Open Code Models Based on Gemma. https://arxiv.org/abs/2406.11409
[279] 2024. DeepSeek-Coder: When the Large Language Model Meets Programming – The Rise of Code Intelligence. https://arxiv.org/abs/2401.14196
[280] 2024. DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence. https://arxiv.org/abs/2406.11931
[281] 2021. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. https://arxiv.org/abs/2109.00859
[282] 2023. CodeT5+: Open Code Large Language Models for Code Understanding and Generation. https://arxiv.org/abs/2305.07922
[283] A Survey of Machine Learning for Big Code and Naturalness ACM Computing Surveys (CSUR) 2018 51 4 10.1145/3212695
[284] Software Vulnerability Detection Using Deep Neural Networks: A Survey Proceedings of the IEEE 2020 108 10 1825-1848 10.1109/JPROC.2020.2993293
[285] Fine-Tuning Pre-trained Model with Optimizable Prompt Learning for Code Vulnerability Detection International Symposium on Software Reliability Engineering (ISSRE) 2024 108-119 IEEE 10.1109/ISSRE62328.2024.00021
[286] 2023. Evaluation of ChatGPT Model for Vulnerability Detection. https://arxiv.org/abs/2304.07232
[287] 2024. Large Language Models for Secure Code Assessment: A Multi-Language Empirical Study. https://arxiv.org/abs/2408.06428
[288] A Qualitative Study on Using ChatGPT for Software Security: Perception vs. Practicality International Conference on Trust, Privacy and Security in Intelligent Systems, and Applications (TPS-ISA) 2024 107-117 IEEE 10.1109/TPS-ISA62245.2024.00022
[289] Evaluating Large Language Models in Vulnerability Detection Under Variable Context Windows International Conference on Machine Learning and Applications (ICMLA) 2024 1131-1134 IEEE 10.1109/ICMLA61862.2024.00173
[290] 2024. Learning-based Models for Vulnerability Detection: An Extensive Study. https://arxiv.org/abs/2408.07526
[291] New Tricks to Old Codes: Can AI Chatbots Replace Static Code Analysis Tools? European Interdisciplinary Cybersecurity Conference (EICC) 2023 13-18 ACM 10.1145/3590777.3590780
[292] Enhancing Code Security Through Open-source Large Language Models: A Comparative Study International Symposium on Foundations and Practice of Security (FPS) 2023 233-249 Springer 10.1007/978-3-031-57537-2_15
[293] VulnerAI: GPT Based Web Application Vulnerability Detection International Conference on Artificial Intelligence, Metaverse and Cybersecurity (ICAMAC) 2024 1-6 IEEE 10.1109/ICAMAC62387.2024.10828788
[294] Manual Prompt Engineering is Not Dead: A Case Study on Large Language Models for Code Vulnerability Detection with DSPy International Conference on Data Science and Machine Learning Applications (CDMA) 2025 168-173 IEEE 10.1109/CDMA61895.2025.00034
[295] Evaluating the Impact of Conventional Code Analysis Against Large Language Models in API Vulnerability Detection European Interdisciplinary Cybersecurity Conference (EICC) 2024 57-64 ACM 10.1145/3655693.3655701
[296] 2025. Expert-in-the-Loop Systems with Cross-Domain and In-Domain Few-Shot Learning for Software Vulnerability Detection. https://arxiv.org/abs/2506.10104
[297] 2025. Large Language Models for Multilingual Vulnerability Detection: How Far Are We? https://arxiv.org/abs/2506.07503
[298] 2025. Beyond Static Pattern Matching? Rethinking Automatic Cryptographic API Misuse Detection in the Era of LLMs. Proceedings of the ACM on Software Engineering (PACMSE) 2, ISSTA, Article ISSTA006 (2025), 24 pages. doi:10.1145/3728875
[299] 2025. CASTLE: Benchmarking Dataset for Static Code Analyzers and LLMs towards CWE Detection. In International Symposium on Theoretical Aspects of Software Engineering (TASE). Springer, Limassol, Cyprus, 253–272.
[300] 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 33. Curran Associates, Inc., 9459–9474.
[301] 2024. LLbezpeky: Leveraging Large Language Models for Vulnerability Detection. https://arxiv.org/abs/2401.01269
[302] 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 35. Curran Associates, Inc., 24824–24837.
[303] 2023. Self-Consistency Improves Chain of Thought Reasoning in Language Models. https://arxiv.org/abs/2203.11171
[304] 2023. ReAct: Synergizing Reasoning and Acting in Language Models. In International Conference on Learning Representations (ICLR), 1–33.
[305] 2023. An Empirical Study of Parameter-Efficient Fine-Tuning Methods for Pre-Trained Code Models. In International Conference on Automated Software Engineering (ASE). IEEE/ACM, Kirchberg, Luxembourg, 397–408. doi:10.1109/ASE56229.2023.00125
[306] 2024. ProRLearn: Boosting Prompt Tuning-based Vulnerability Detection by Reinforcement Learning. Automated Software Engineering (ASE) 31, 2 (2024), 38. doi:10.1007/s10515-024-00438-9
[307] 2023. Software Defect Prediction via Code Language Models. In International Conference on Communication Technology and Information Technology (ICCTIT). IEEE, Xi'an, China, 97–102. doi:10.1109/ICCTIT60726.2023.10435711
[308] 2024. Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey. https://arxiv.org/abs/2403.14608
[309] 2022. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations (ICLR).
[310] 2024. PiSSA: Principal Singular Values and Singular Vectors Adaptation of Large Language Models. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 37. Curran Associates, Inc., 121038–121072.
[311] 2024. GaLore: Memory-efficient LLM Training by Gradient Low-rank Projection. In International Conference on Machine Learning (ICML). JMLR.
[312] 2023. QLoRA: Efficient Finetuning of Quantized LLMs. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 36. Curran Associates, Inc., 10088–10115.
[313] 2017. Juliet C/C++ 1.3. https://samate.nist.gov/SARD/test-suites/112.
[314] 2017. Juliet Java 1.3. https://samate.nist.gov/SARD/test-suites/111.
[315] 2006. Software Assurance Reference Dataset. https://samate.nist.gov/SARD.
[316] 2019. Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks. In International Conference on Neural Information Processing Systems (NeurIPS). Curran Associates, Inc., Vancouver, BC, Canada, Article 915, 11 pages.
[317] 2020. A C/C++ Code Vulnerability Dataset with Code Changes and CVE Summaries. In International Conference on Mining Software Repositories (MSR). ACM, Seoul, South Korea, 508–512. doi:10.1145/3379597.3387501
[318] 2024. MegaVul: A C/C++ Vulnerability Dataset with Comprehensive Code Representations. In International Conference on Mining Software Repositories (MSR). ACM, Lisbon, Portugal, 738–742. doi:10.1145/3643991.3644886
[319] 2025. National Vulnerability Database (NVD). https://nvd.nist.gov/.
[320] 2018. Automated Vulnerability Detection in Source Code Using Deep Representation Learning. In International Conference on Machine Learning and Applications (ICMLA). IEEE, Orlando, FL, USA, 757–762. doi:10.1109/ICMLA.2018.00120
[321] 2024. ReposVul: A Repository-Level High-Quality Vulnerability Dataset. In International Conference on Software Engineering: Companion Proceedings (ICSE-Companion). IEEE/ACM, Lisbon, Portugal, 472–483. doi:10.1145/3639478.3647634
[322] 2021. CrossVul: A Cross-Language Vulnerability Dataset with Commit Data. In Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE). ACM, Athens, Greece, 1565–1569. doi:10.1145/3468264.3473122
[323] 2021. D2A: A Dataset Built for AI-Based Vulnerability Detection Methods Using Differential Analysis. In International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). IEEE/ACM, Madrid, Spain, 111–120. doi:10.1109/ICSE-SEIP52600.2021.00020
[324] 2022. Deep Learning Based Vulnerability Detection: Are We There Yet? IEEE Transactions on Software Engineering (TSE) 48, 9 (2022), 3280–3296. doi:10.1109/TSE.2021.3087402
[325] 2021. CVEfixes: Automated Collection of Vulnerabilities and Their Fixes from Open-Source Software. In International Conference on Predictive Models and Data Analytics in Software Engineering (PROMISE). ACM, Athens, Greece, 30–39. doi:10.1145/3475960.3475985
[326] 2024. Vulnerability Detection with Code Language Models: How Far Are We? https://arxiv.org/abs/2403.18624
[327] 2017. VulDeePecker: A Deep Learning-Based System for Vulnerability Detection. https://github.com/CGCL-codes/VulDeePecker.
[328] 2018. Draper VDISC Dataset - Vulnerability Detection in Source Code. https://osf.io/d45bw/.
[329] 2020. Devign. https://github.com/epicosy/devign.
[330] 2020. A C/C++ Code Vulnerability Dataset with Code Changes and CVE Summaries. https://github.com/ZeoVan/MSR_20_Code_vulnerability_CSV_Dataset.
[331] 2021. D2A Dataset and Generation Pipeline. https://github.com/IBM/D2A.
[332] 2020. Deep Learning based Vulnerability Detection: Are We There Yet? https://github.com/VulDetProject/ReVeal.
[333] 2024. CVEfixes Dataset: Automatically Collected Vulnerabilities and Their Fixes from Open-Source Software. https://zenodo.org/records/13118970.
[334] 2021. Cross-Language Vulnerability Dataset with File Changes and Commit Messages. https://zenodo.org/records/4734050.
[335] 2022. SecurityEval Dataset: Mining Vulnerability Examples to Evaluate Machine Learning-Based Code Generation Techniques. In International Workshop on Mining Software Repositories Applications for Privacy and Security (MSR4P&S). ACM, Singapore, 29–33. doi:10.1145/3549035.3561184
[336] 2022. SecurityEval. https://github.com/s2e-lab/SecurityEval.
[337] 2025. DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection. https://github.com/wagner-group/diversevul.
[338] 2023. Large Language Models for Code: Security Hardening and Adversarial Testing. In Conference on Computer and Communications Security (CCS). ACM, Copenhagen, Denmark, 1865–1879. doi:10.1145/3576915.3623175
[339] 2023. SVEN: Security Hardening and Adversarial Testing for Code LLMs. https://github.com/eth-sri/sven.
[340] 2023. The FormAI Dataset: Generative AI in Software Security through the Lens of Formal Verification. In International Conference on Predictive Models and Data Analytics in Software Engineering (PROMISE). ACM, San Francisco, CA, USA, 33–43. doi:10.1145/3617555.3617874
[341] 2023. FormAI Dataset. https://github.com/FormAI-Dataset/FormAI-dataset.
[342] 2024. [ICSE 2024 Industry Challenge Track] Official Implementation of "ReposVul: A Repository-Level High-Quality Vulnerability Dataset". https://github.com/Eshe0922/ReposVul.
[343] 2024. PrimeVul: Vulnerability Detection with Code Language Models: How Far Are We? https://github.com/DLVulDet/PrimeVul.
[344] 2024. KnowledgeRAG4LLMVulD. https://github.com/KnowledgeRAG4LLMVulD/KnowledgeRAG4LLMVulD/tree/main/dataset.
[345] 2023. MegaVul. https://github.com/Icyrockton/MegaVul.
[346] 2024. CleanVul. https://github.com/yikun-li/CleanVul.
[347] 2022. VUDENC: Vulnerability Detection with Deep Learning on a Natural Codebase for Python. Information and Software Technology (INFSOF) 144 (2022), 106809. doi:10.1016/j.infsof.2021.106809
[348] 2021. CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation. https://arxiv.org/abs/2102.04664
[349] 2023. CWE-1000: Research Concepts.
[351] 2023. CWE CATEGORY: Resource Management Errors.
[353] 2025. Semantic-Preserving Transformations as Mutation Operators: A Study on Their Effectiveness in Defect Detection. In International Conference on Software Testing, Verification and Validation Workshops (ICSTW). IEEE, Naples, Italy, 337–346. doi:10.1109/ICSTW64639.2025.10962512
[354] 2023. How to Get Better Embeddings with Code Pre-Trained Models? An Empirical Study. https://arxiv.org/abs/2311.08066
[355] 2024. Revisiting the Performance of Deep Learning-Based Vulnerability Detection on Realistic Datasets. IEEE Transactions on Software Engineering (TSE) 50, 8 (2024), 2163–2177. doi:10.1109/TSE.2024.3423712
[356] 2025. PYPL PopularitY of Programming Language Index. https://pypl.github.io/PYPL.html.
[357] 2025. Mono: Is Your "Clean" Vulnerability Dataset Really Solvable? Exposing and Trapping Undecidable Patches and Beyond. https://arxiv.org/abs/2506.03651
[358] 2025. ICVul: A Well-labeled C/C++ Vulnerability Dataset with Comprehensive Metadata and VCCs. In International Conference on Mining Software Repositories (MSR). IEEE/ACM, Ottawa, ON, Canada, 154–158. doi:10.1109/MSR66628.2025.00034
[359] 2025. GVI: Guided Vulnerability Imagination for Boosting Deep Vulnerability Detectors. In International Conference on Software Engineering (ICSE). IEEE/ACM, Ottawa, ON, Canada, 2867–2879. doi:10.1109/ICSE55347.2025.00214
[360] 2025. From Theory to Practice: Code Generation Using LLMs for CAPEC and CWE Frameworks. In International Workshop on Large Language Models for Code (LLM4Code). IEEE/ACM, Ottawa, ON, Canada, 137–144. doi:10.1109/LLM4Code66737.2025.00022
[361] 2023. An Empirical Study of Deep Learning Models for Vulnerability Detection. In International Conference on Software Engineering (ICSE). IEEE/ACM, Melbourne, Australia, 2237–2248. doi:10.1109/ICSE48619.2023.00188
[362] 2023. Open Science in Software Engineering: A Study on Deep Learning-Based Vulnerability Detection. IEEE Transactions on Software Engineering (TSE) 49, 4 (2023), 1983–2005. doi:10.1109/TSE.2022.3207149
[363] 2025. Enhancing Vulnerability Detection via Inter-procedural Semantic Completion. Proceedings of the ACM on Software Engineering (ASE) 2, ISSTA, Article ISSTA037 (2025), 23 pages. doi:10.1145/3728912