Diff-based SCA with AI is Broken — Real Examples from Pipfile.lock, yarn.lock, and Cargo.lock
Software Composition Analysis (SCA) is the process of identifying open source and third-party components used in an application codebase. SCA tools typically identify known vulnerabilities, license compliance issues, and other metadata about the packages. Security and Open Source Compliance teams are the primary users of SCA tools.
DevOps Era of SCA and the need to scan Pull Requests (PRs)
In the DevOps era, SCA moved into CI/CD. Pull requests (PRs) are scanned to identify newly introduced vulnerable and, more recently, malicious packages. The DevOps promise of “fail fast” is achieved by blocking the PR merge if any such package is found. To do this, SCA tools need to identify only the packages newly introduced in the PR, not those in the entire codebase.
While it may appear to be a simple process of inventory gathering and vulnerability database matching, SCA tools have grown complex over the years, especially due to the nuances of various package managers, lockfiles, and vulnerability databases.
AI Era of Diff-based SCA Scanners
In the AI era, diff-based SCA scanning seems to be the new approach. It is a
shortcut that avoids dealing with the complexity and nuances of various package
managers, languages, and the ways open source packages are introduced into a
codebase: simply hand the git diff, or its GitHub API equivalent, to
an LLM and let it do its magic. For example, consider the prompt below:

    You are a software engineer reviewing a pull request in DIFF FILE FORMAT.
    Analyse the diff below and identify the added, updated and removed packages in
    the lockfile.

    Provide your output following the JSON schema:

    [ { "package_name": "string", "version": "string", "change_type": "added | updated | removed" } ]

    Here is the diff:

    diff --git a/requirements.txt b/requirements.txt
    new file mode 100644
    index 0000000..44ec824
    --- /dev/null
    +++ b/requirements.txt
    @@ -0,0 +1,4 @@
    +bittenso-cli==9.9.4
    +qbittensor==9.9.4
    +bitensor==9.9.5
    +bittenso==9.9.5

Gemini 2.5 Flash provides the following output 10/10 times:

    [
      { "package_name": "bittenso-cli", "version": "9.9.4", "change_type": "added" },
      { "package_name": "qbittensor", "version": "9.9.4", "change_type": "added" },
      { "package_name": "bitensor", "version": "9.9.5", "change_type": "added" },
      { "package_name": "bittenso", "version": "9.9.5", "change_type": "added" }
    ]

This may seem like a great approach, completely avoiding the complexity of lockfile parsing, but there are some inherent limitations of this approach.
In this blog post, we will explore how diff-based SCA scanners work, their limitations, and why they can be easily circumvented by malicious actors.
How Diff-Based SCA Scanners Work
Diff-based scanners for pull requests (PRs) analyze the .diff data
fetched from a VCS platform such as GitHub, or computed locally from the
branch changes.
These scanners try to extract updated packages from package manager lockfiles
such as package-lock.json, go.mod, and requirements.txt, and use the extracted data
as the source of truth for identifying newly introduced vulnerable packages.
The extraction can be carried out either by parsing, which includes regex
and more sophisticated algorithms, or by leveraging Artificial Intelligence
(AI) models to identify the changes in the lockfiles. The scanner uses the + and -
markers in the .diff file to identify added, updated, and removed
packages.
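As a rough illustration of the parsing route, here is a minimal sketch that classifies packages from the + and - lines of a requirements.txt diff. The function name `extract_changes` and the `PINNED` pattern are hypothetical, and the sketch assumes every requirement is pinned as `name==version`, which real requirements files do not guarantee. It is not how any particular scanner is implemented.

```python
import re

# Assumption: requirements are pinned as "name==version". Real files allow
# many more specifier forms, extras, markers, and options.
PINNED = re.compile(r"^([A-Za-z0-9][A-Za-z0-9._-]*)==(\S+)$")


def extract_changes(diff_text: str) -> list:
    """Classify packages in a requirements.txt diff as added, updated or removed."""
    added = {}
    removed = {}
    for line in diff_text.splitlines():
        # File headers ("--- a/..." and "+++ b/...") also start with - or +.
        if line.startswith(("+++", "---")):
            continue
        if not line.startswith(("+", "-")):
            continue  # context lines, hunk headers, metadata
        match = PINNED.match(line[1:].strip())
        if not match:
            continue
        name, version = match.groups()
        (added if line[0] == "+" else removed)[name] = version

    changes = []
    for name, version in added.items():
        # A name on both sides of the diff is treated as a version update.
        kind = "updated" if name in removed else "added"
        changes.append({"package_name": name, "version": version, "change_type": kind})
    for name, version in removed.items():
        if name not in added:
            changes.append({"package_name": name, "version": version, "change_type": "removed"})
    return changes


SAMPLE_DIFF = """\
diff --git a/requirements.txt b/requirements.txt
new file mode 100644
index 0000000..44ec824
--- /dev/null
+++ b/requirements.txt
@@ -0,0 +1,4 @@
+bittenso-cli==9.9.4
+qbittensor==9.9.4
+bitensor==9.9.5
+bittenso==9.9.5
"""

for change in extract_changes(SAMPLE_DIFF):
    print(change["package_name"], change["version"], change["change_type"])
# bittenso-cli 9.9.4 added
# qbittensor 9.9.4 added
# bitensor 9.9.5 added
# bittenso 9.9.5 added
```

This works here only because requirements.txt keeps the package name and version on the same line, so every changed line is self-describing.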
Whichever method is used, diff-based SCA scanners have an inherent limitation
due to how git diff works.
The Partial Data Source Problem
The diff data from a VCS platform, or from the git diff command itself,
contains only the changes plus a few surrounding context lines, so it is partial data. This is intended:
the purpose of a diff is to show the changes, not the entire file.
Hence, for lockfiles that use a multi-line syntax to represent the state of a package,
data can easily be missed or manipulated when either the package name or the version
is absent from the .diff data.
Some lockfiles that use multi-line syntax are:
- Pipfile.lock
- poetry.lock
- pom.xml
- uv.lock
- yarn.lock
- Cargo.lock
Examples:
Here is a snippet from a Pipfile.lock diff in which a package
version is updated from 0.4.2 to 0.4.5. Because git diff emits only the changed lines
plus a few unchanged context lines per hunk (@@ -139,7 +139,7 @@), the line carrying the package name falls outside the hunk and we cannot tell which package changed.

    diff --git a/Pipfile.lock b/Pipfile.lock
    index 3cfcaeee35..98a61e32bf 100644
    --- a/Pipfile.lock
    +++ b/Pipfile.lock
    @@ -139,7 +139,7 @@
                     "sha256:854bf444933e37f5824ae7bfc1e98d5bce2ebe4160d46b5edf346a89358e99da",
                     "sha256:e6c6b4334fc50988a639d9b98aa429a0b57da6e17b9a44f0451f930b6967b7a4"
                 ],
                 "markers": "sys_platform == 'win32'",
    -            "version": "==0.4.2"
    +            "version": "==0.4.5"
             },
             "coverage": {
    @@ -147,50 +147,50 @@
                     "toml"

In this case, the SCA scanner has no way to know which package was updated and therefore cannot check it for vulnerabilities, which is a clear flaw in the scanner.
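To make the failure concrete, here is a minimal sketch of a naive hunk scanner run against the hunk above. The function `version_changes` and both regular expressions are hypothetical and written only for this illustration. The scanner pairs each added "version" line with the closest package key that precedes it inside the same hunk, and reports `None` when no such key exists.

```python
import re
from typing import Optional

# A package entry in Pipfile.lock opens with a line like:  "name": {
NAME_LINE = re.compile(r'^\s*"([^"]+)":\s*\{\s*$')
VERSION_LINE = re.compile(r'^\s*"version":\s*"==([^"]+)"')


def version_changes(hunk: str) -> list:
    """Return (package_name, new_version) for each added version line."""
    results = []
    current_name: Optional[str] = None  # last package key seen in this hunk
    for line in hunk.splitlines():
        if line.startswith("@@"):
            current_name = None  # a new hunk starts with no known package
            continue
        marker, body = line[:1], line[1:]
        name = NAME_LINE.match(body)
        if name:
            current_name = name.group(1)
            continue
        version = VERSION_LINE.match(body)
        if marker == "+" and version:
            results.append((current_name, version.group(1)))
    return results


HUNK = """\
@@ -139,7 +139,7 @@
                 "sha256:854bf444933e37f5824ae7bfc1e98d5bce2ebe4160d46b5edf346a89358e99da",
                 "sha256:e6c6b4334fc50988a639d9b98aa429a0b57da6e17b9a44f0451f930b6967b7a4"
             ],
             "markers": "sys_platform == 'win32'",
-            "version": "==0.4.2"
+            "version": "==0.4.5"
         },
         "coverage": {
"""

print(version_changes(HUNK))
# [(None, '0.4.5')]
```

The only package key visible in the hunk is "coverage", which opens the next entry. A scanner that grabbed the nearest key in either direction would attribute the update to the wrong package, which is arguably worse than reporting nothing.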
Another example:
Here is a snippet from a pnpm-lock.yaml diff in which the package
name is cut off at the start of the hunk, @@ -1115,20 +1115,20 @@ importers:.
A scanner cannot predict where git diff will cut the data,
so it cannot rely on the package name being present.

    diff --git a/pnpm-lock.yaml b/pnpm-lock.yaml
    index 2c91da270bcc6..a41f0956e3d52 100644
    --- a/pnpm-lock.yaml
    +++ b/pnpm-lock.yaml
    @@ -1115,20 +1115,20 @@ importers:
             specifier: 1.0.0
    -        version: 1.0.0
    +        version: 1.0.2
           '@types/babel__code-frame':
    -        specifier: 7.0.2
    -        version: 7.0.2
    +        specifier: 7.0.6
    +        version: 7.0.6

A similar example from a yarn.lock diff:

    diff --git a/yarn.lock b/yarn.lock
    index 6809cdb40b..8dc0b8bf37 100644
    --- a/yarn.lock
    +++ b/yarn.lock
    @@ -1766,6 +1766,16 @@ cli-cursor@^2.1.0:
       dependencies:
    -    restore-cursor "^2.0.0"
    +    restore-cursor "^2.0.1"

These examples clearly show the Partial Data Source Problem with diff-based SCA scanners,
which malicious actors can easily exploit to circumvent the SCA
scanners.
Solution
Given the nature of the problem, especially when the required context is
missing in a diff file, the solution is a trade-off between
- Completeness and reliability at the cost of handling complexity
- Simplicity and ease of use at the cost of missing some packages
For serious security use cases, completeness and reliability are key. Heuristics such as increasing the diff context size can help, but they do not guarantee that the required context will be present. In addition, LLMs are inherently nondeterministic, so the output can vary for the same input.
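The context-size heuristic can be sketched with Python's standard `difflib`, whose `n` parameter plays the same role as the context width of a unified diff (with git, the `-U<n>` option). This is an illustration under assumptions rather than a measurement of any scanner: `make_lockfile`, `name_in_diff`, and the package `example-pkg` are hypothetical, and `difflib` is used here as a stand-in for git diff.

```python
import difflib


def make_lockfile(version: str, hash_count: int) -> list:
    """Build the lines of a Pipfile.lock-like entry for a hypothetical package."""
    hashes = [f'        "sha256:{i:064x}"' for i in range(hash_count)]
    lines = ['"example-pkg": {', '    "hashes": [']
    lines += [h + "," for h in hashes[:-1]] + hashes[-1:]
    lines += ["    ],", f'    "version": "=={version}"', "}"]
    return lines


def name_in_diff(context: int, hash_count: int) -> bool:
    """Check whether the package name survives in a diff with `context` lines."""
    old = make_lockfile("0.4.2", hash_count)
    new = make_lockfile("0.4.5", hash_count)
    diff = difflib.unified_diff(
        old, new, "a/Pipfile.lock", "b/Pipfile.lock", n=context, lineterm=""
    )
    return any("example-pkg" in line for line in diff)


print(name_in_diff(context=3, hash_count=2))   # False
print(name_in_diff(context=5, hash_count=2))   # True
print(name_in_diff(context=5, hash_count=10))  # False
```

The distance between the name and the version grows with the number of hashes, so any fixed context width that works for one package can fail for the next.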
For vet and its GitHub Action integration,
vet-action, we decided to do the heavy
lifting, since vet is already aware of the different lockfile formats and
handles the nuances of different package managers. For reliability, we follow
the steps below:
- Identify the changed lockfiles in the PR
- Fetch the complete lockfile from base branch, parse it with vet and use as exceptions for the subsequent scanning step
- Fetch the complete lockfile from the head branch, scan it with vet while having the exceptions from the previous step
This approach ensures reliability while avoiding parser differentials where
different parsers can yield different results. This also maintains vet
lockfile parsers as the single source of truth.
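The base-versus-head idea in the steps above can be sketched as follows. This is not vet's implementation: `parse_pipfile_lock` and `newly_introduced` are hypothetical helpers, the package names are made up, and the sketch assumes the usual Pipfile.lock layout with "default" and "develop" sections keyed by package name.

```python
import json


def parse_pipfile_lock(text: str) -> set:
    """Return the (name, version) pairs pinned in a complete Pipfile.lock."""
    data = json.loads(text)
    packages = set()
    for section in ("default", "develop"):
        for name, meta in data.get(section, {}).items():
            # Entries without a pinned version (e.g. VCS dependencies) get "".
            packages.add((name, meta.get("version", "").removeprefix("==")))
    return packages


def newly_introduced(base_text: str, head_text: str) -> set:
    """Packages in head that are not in base; base acts as the exception list."""
    return parse_pipfile_lock(head_text) - parse_pipfile_lock(base_text)


# Hypothetical lockfile contents for the base and head branches.
BASE = json.dumps({"default": {"example-pkg": {"version": "==0.4.2"}}, "develop": {}})
HEAD = json.dumps(
    {
        "default": {
            "example-pkg": {"version": "==0.4.5"},
            "other-pkg": {"version": "==1.0.0"},
        },
        "develop": {},
    }
)

print(sorted(newly_introduced(BASE, HEAD)))
# [('example-pkg', '0.4.5'), ('other-pkg', '1.0.0')]
```

Because both sides are parsed from complete files by the same parser, every reported change carries its package name and version, regardless of how the diff would have been cut.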
Conclusion
In this blog, we explored the subtle issues with diff-based SCA scanners,
which can be easily exploited by malicious actors or can simply let packages pass
through undetected. These are a few that came to our attention, but there
can be many more. Parsing lockfiles or SBOMs is the way forward where
reliability and accuracy are first-class requirements.
Author
Kunal Singh
safedep.io