Diff-based SCA with AI is Broken — Real Examples from Pipfile.lock, yarn.lock, and Cargo.lock

Software Composition Analysis (SCA) is the process of identifying open source and third-party components used in an application codebase. SCA tools typically identify known vulnerabilities, license compliance issues, and other metadata about the packages. Security and Open Source Compliance teams are the primary users of SCA tools.
DevOps Era of SCA and the need to scan Pull Requests (PRs)
In the DevOps era, SCA moved into CI/CD. Pull requests (PRs) are scanned to identify newly introduced vulnerable packages and, more recently, malicious ones. The DevOps promise of “fail fast” is achieved by blocking the PR merge if any vulnerable packages are found. To do this, SCA tools need to identify only the packages newly introduced in the PR, not those of the entire codebase.
While it may appear to be a simple process of inventory gathering and vulnerability database matching, SCA tools have grown complex over the years, especially due to the nuances of various package managers, lockfiles, and vulnerability databases.
AI Era of Diff-based SCA Scanners
In the AI era, diff-based SCA scanning seems to be the new approach: a shortcut that avoids dealing with the complexity and nuances of various package managers, languages, and the ways open source packages are introduced into a codebase. Simply throw the git diff, or the GitHub API equivalent of it, at an LLM and it will do its magic. For example, consider the prompt below:
    You are a software engineer reviewing a pull request in DIFF FILE FORMAT.
    Analyse the diff below and identify the added, updated and removed packages in
    the lockfile.

    Provide your output following the JSON schema:

    [
      {
        "package_name": "string",
        "version": "string",
        "change_type": "added | updated | removed"
      }
    ]

    Here is the diff:

    diff --git a/requirements.txt b/requirements.txt
    new file mode 100644
    index 0000000..44ec824
    --- /dev/null
    +++ b/requirements.txt
    @@ -0,0 +1,4 @@
    +bittenso-cli==9.9.4
    +qbittensor==9.9.4
    +bitensor==9.9.5
    +bittenso==9.9.5
Gemini 2.5 Flash provides the following output 10/10 times:
    [
      { "package_name": "bittenso-cli", "version": "9.9.4", "change_type": "added" },
      { "package_name": "qbittensor", "version": "9.9.4", "change_type": "added" },
      { "package_name": "bitensor", "version": "9.9.5", "change_type": "added" },
      { "package_name": "bittenso", "version": "9.9.5", "change_type": "added" }
    ]
This may seem like a great approach that completely avoids the complexity of lockfile parsing, but it has some inherent limitations.
In this blog post, we will explore how diff-based SCA scanners work, their limitations, and why malicious actors can easily circumvent them.
How Diff-Based SCA Scanners Work
git diff based scanners in Pull Requests (PRs) analyze the .diff data, either fetched from a VCS platform such as GitHub or computed locally from the branch changes.
These scanners try to extract updated packages from package manager lockfiles and manifests, such as package-lock.json, go.mod, and requirements.txt, and use the extracted data as the source of truth for identifying newly introduced vulnerable packages.
This extraction can be carried out either by parsing, which includes regular expressions and more sophisticated algorithms, or by leveraging Artificial Intelligence (AI) models to identify the changes in the lockfiles. The scanner uses the + and - markers in the .diff data to identify added, updated, and removed packages.
With either method, diff-based SCA scanners have an inherent limitation due to how git diff works.
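To make the parsing route concrete, here is a minimal sketch of a naive regex-based extractor for added lines in a requirements.txt diff. The function name `added_packages` and the regular expression are illustrative assumptions, not taken from any particular scanner.

```python
import re

# Matches an added requirements.txt line such as "+name==1.2.3".
# Hypothetical pattern for illustration; real requirement syntax is richer.
ADDED_PIN = re.compile(r"^\+([A-Za-z0-9][A-Za-z0-9._-]*)==([^\s;#]+)")


def added_packages(diff_text: str) -> list[tuple[str, str]]:
    """Return (name, version) pairs for pinned packages on added lines."""
    found = []
    for line in diff_text.splitlines():
        if line.startswith("+++"):  # file header, not an added line
            continue
        match = ADDED_PIN.match(line)
        if match:
            found.append((match.group(1), match.group(2)))
    return found


SAMPLE_DIFF = """\
diff --git a/requirements.txt b/requirements.txt
new file mode 100644
index 0000000..44ec824
--- /dev/null
+++ b/requirements.txt
@@ -0,0 +1,4 @@
+bittenso-cli==9.9.4
+qbittensor==9.9.4
+bitensor==9.9.5
+bittenso==9.9.5
"""

for name, version in added_packages(SAMPLE_DIFF):
    print(name, version)
```

This works for requirements.txt only because each line carries both the package name and the version. Formats that split this information across lines do not have that property.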
The Partial Data Source Problem
The diff data from VCS platforms, and from the git diff command itself, contains only the changes, which means it is a partial data source. This is intended: the purpose of diff is to show the changes, not the entire file.
Hence, for lockfiles that spread a package's data over multiple lines, a change can easily be missed or manipulated when either the package name or the version is absent from the .diff data.
Some lockfiles and manifests that use multi-line syntax are:
- Pipfile.lock
- poetry.lock
- pom.xml
- uv.lock
- yarn.lock
- Cargo.lock
Examples:
Here is a Pipfile.lock diff snippet showing a package version updated from 0.4.2 to 0.4.5. Because git diff includes only a few unchanged context lines around each change (three by default), the line carrying the package name falls outside the hunk (@@ -139,7 +139,7 @@), so we do not know which package was updated.
    diff --git a/Pipfile.lock b/Pipfile.lock
    index 3cfcaeee35..98a61e32bf 100644
    --- a/Pipfile.lock
    +++ b/Pipfile.lock
    @@ -139,7 +139,7 @@
                     "sha256:854bf444933e37f5824ae7bfc1e98d5bce2ebe4160d46b5edf346a89358e99da",
                     "sha256:e6c6b4334fc50988a639d9b98aa429a0b57da6e17b9a44f0451f930b6967b7a4"
                 ],
                 "markers": "sys_platform == 'win32'",
    -            "version": "==0.4.2"
    +            "version": "==0.4.5"
             },
             "coverage": {
    @@ -147,50 +147,50 @@
                     "toml"
In this case, the SCA scanner has no way to know which package was updated and therefore cannot check it for vulnerabilities, which is a clear flaw in the scanner.
Another example:
Here is a pnpm-lock.yaml diff snippet in which the name of the first updated package is cut off. The hunk header, @@ -1115,20 +1115,20 @@ importers:, carries only the top-level importers: key, not the package name. From the scanner's point of view this is unpredictable, because we do not control where git diff places the hunk boundaries.
    diff --git a/pnpm-lock.yaml b/pnpm-lock.yaml
    index 2c91da270bcc6..a41f0956e3d52 100644
    --- a/pnpm-lock.yaml
    +++ b/pnpm-lock.yaml
    @@ -1115,20 +1115,20 @@ importers:
             specifier: 1.0.0
    -        version: 1.0.0
    +        version: 1.0.2
           '@types/babel__code-frame':
    -        specifier: 7.0.2
    -        version: 7.0.2
    +        specifier: 7.0.6
    +        version: 7.0.6
A similar example from a yarn.lock diff:
    diff --git a/yarn.lock b/yarn.lock
    index 6809cdb40b..8dc0b8bf37 100644
    --- a/yarn.lock
    +++ b/yarn.lock
    @@ -1766,6 +1766,16 @@ cli-cursor@^2.1.0:
       dependencies:
    -    restore-cursor "^2.0.0"
    +    restore-cursor "^2.0.1"
These examples clearly show the Partial Data Source Problem with diff-based SCA scanners, which malicious actors can easily exploit to circumvent them.
Solution
Given the nature of the problem, especially when the required context is missing from a diff file, the solution is a trade-off between:
- Completeness and reliability at the cost of handling complexity
- Simplicity and ease of use at the cost of missing some packages
For serious security use cases, completeness and reliability are key. Heuristics such as increasing the number of context lines in the diff can help, but they do not guarantee that the required context will be present. In addition, LLMs are inherently nondeterministic, so the output can vary for the same input.
For vet and its GitHub Action integration, vet-action, we decided to do the heavy lifting, since vet is already aware of the different lockfile formats and handles the nuances of different package managers. For reliability, we follow the steps below:
- Identify the changed lockfiles in the PR
- Fetch the complete lockfile from the base branch, parse it with vet, and use the result as exceptions for the subsequent scanning step
- Fetch the complete lockfile from the head branch and scan it with vet, applying the exceptions from the previous step
This approach ensures reliability while avoiding parser differentials, where different parsers can yield different results. It also keeps the vet lockfile parsers as the single source of truth.
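The base-versus-head idea in the steps above can be sketched for a single format. This is not vet's implementation: the functions `pipfile_lock_packages` and `changed_packages` and the package names are hypothetical, and only pinned entries in a complete Pipfile.lock are handled.

```python
import json


def pipfile_lock_packages(lock_text: str) -> dict[str, str]:
    """Map package name to pinned version from a complete Pipfile.lock."""
    data = json.loads(lock_text)
    packages = {}
    for section in ("default", "develop"):
        for name, meta in data.get(section, {}).items():
            version = meta.get("version")  # may be absent for unpinned entries
            if version:
                packages[name] = version.removeprefix("==")
    return packages


def changed_packages(base_text: str, head_text: str) -> list[dict]:
    """Compare full lockfiles from the base and head branches."""
    base = pipfile_lock_packages(base_text)
    head = pipfile_lock_packages(head_text)
    changes = []
    for name, version in sorted(head.items()):
        if name not in base:
            kind = "added"
        elif base[name] != version:
            kind = "updated"
        else:
            continue  # unchanged, nothing to report
        changes.append({"package_name": name, "version": version, "change_type": kind})
    for name, version in sorted(base.items()):
        if name not in head:
            changes.append({"package_name": name, "version": version, "change_type": "removed"})
    return changes


# Hypothetical lockfile contents for the two branches.
BASE = json.dumps({"default": {"example-a": {"version": "==0.4.2"},
                               "example-b": {"version": "==1.0.0"}}})
HEAD = json.dumps({"default": {"example-a": {"version": "==0.4.5"},
                               "example-c": {"version": "==2.0.0"}}})

for change in changed_packages(BASE, HEAD):
    print(change)
```

Because both sides are parsed from complete files, every reported change carries a package name and a version, regardless of how the diff between the two files happens to be split into hunks.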
Conclusion
In this blog, we explored the subtle issues with diff-based SCA scanners, which malicious actors can easily exploit, or which can simply let packages pass through undetected. These are a few that came to our attention, but there may be many more. Parsing lockfiles or SBOMs is the way forward where reliability and accuracy are first-class requirements.
Author

Kunal Singh
safedep.io