Diff-based SCA with AI is Broken — Real Examples from Pipfile.lock, yarn.lock, and Cargo.lock

Kunal Singh

Software Composition Analysis (SCA) is the process of identifying the open source and third-party components used in an application codebase. SCA tools typically report known vulnerabilities, license compliance issues, and other metadata about those packages. Security and open source compliance teams are the primary users of SCA tools.

DevOps Era of SCA and the need to scan Pull Requests (PRs)

In the DevOps era, SCA moved into the CI/CD pipeline. Pull requests (PRs) are scanned to identify newly introduced vulnerable and, more recently, malicious packages. The DevOps promise of “fail fast” is achieved by blocking the PR merge if any such packages are found. To do this, SCA tools need to identify only the packages newly introduced by the PR, not every package in the codebase.

While this may look like a simple process of inventory gathering and vulnerability database matching, SCA tools have grown complex over the years, largely because of the nuances of various package managers, lockfiles, and vulnerability databases.

AI Era of Diff-based SCA Scanners

In the AI era, diff-based SCA scanning seems to be the new approach: a shortcut that avoids dealing with the nuances of various package managers, languages, and the ways open source packages are introduced into a codebase. Simply hand the git diff, or its GitHub API equivalent, to an LLM and let it do its magic. For example, consider the prompt below:

You are a software engineer reviewing a pull request in DIFF FILE FORMAT.
Analyse the diff below and identify the added, updated and removed packages in
the lockfile.
Provide your output following the JSON schema:
[
  {
    "package_name": "string",
    "version": "string",
    "change_type": "added | updated | removed"
  }
]
Here is the diff:
diff --git a/requirements.txt b/requirements.txt
new file mode 100644
index 0000000..44ec824
--- /dev/null
+++ b/requirements.txt
@@ -0,0 +1,4 @@
+bittenso-cli==9.9.4
+qbittensor==9.9.4
+bitensor==9.9.5
+bittenso==9.9.5

Gemini 2.5 Flash provides the following output 10/10 times:

[
  {
    "package_name": "bittenso-cli",
    "version": "9.9.4",
    "change_type": "added"
  },
  {
    "package_name": "qbittensor",
    "version": "9.9.4",
    "change_type": "added"
  },
  {
    "package_name": "bitensor",
    "version": "9.9.5",
    "change_type": "added"
  },
  {
    "package_name": "bittenso",
    "version": "9.9.5",
    "change_type": "added"
  }
]

This may seem like a great approach because it avoids the complexity of lockfile parsing entirely, but it has some inherent limitations.


In this blog post, we will explore how diff-based SCA scanners work, their limitations, and why they can be easily circumvented by malicious actors.

How Diff-Based SCA Scanners Work

Scanners based on git diff analyze the diff of a pull request (PR), either fetched from a VCS platform such as GitHub or computed locally from the branch changes.

These scanners try to extract updated packages from package manager manifests and lockfiles, such as package-lock.json, go.mod, and requirements.txt, and use the extracted data as the source of truth for identifying newly introduced vulnerable packages.

This extraction can be carried out either by parsing, which includes regular expressions and more sophisticated algorithms, or by using an AI model to identify the changes in the lockfiles. The scanner relies on the + and - markers in the diff to identify added, updated, and removed packages.

Whichever method is used, diff-based SCA scanners have an inherent limitation that comes from how git diff works.
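
The +/- classification described above can be sketched in a few lines of Python. This is a minimal illustration, not any scanner's actual implementation: extract_changes, the PIN regex, and the sample diff with its package names are all hypothetical, and only simple name==version pins in a requirements.txt diff are handled.

```python
import re

# Matches simple "name==version" pins only; real requirements files allow far more.
PIN = re.compile(r"^([A-Za-z0-9][A-Za-z0-9._-]*)==([A-Za-z0-9.*+!_-]+)$")


def extract_changes(diff_text: str) -> list[dict]:
    """Classify pinned packages in a requirements.txt diff by their +/- markers."""
    removed: dict[str, str] = {}
    added: dict[str, str] = {}
    for line in diff_text.splitlines():
        # Skip file headers such as "--- a/requirements.txt" and "+++ b/requirements.txt".
        if line.startswith(("---", "+++")):
            continue
        # Context lines, hunk headers, and the "diff --git" line carry no +/- marker.
        if not line or line[0] not in "+-":
            continue
        match = PIN.match(line[1:].strip())
        if not match:
            continue
        name, version = match.groups()
        (added if line[0] == "+" else removed)[name] = version

    changes = []
    for name, version in added.items():
        # A name on both sides of the diff is an update, otherwise an addition.
        change_type = "updated" if name in removed else "added"
        changes.append({"package_name": name, "version": version, "change_type": change_type})
    for name, version in removed.items():
        if name not in added:
            changes.append({"package_name": name, "version": version, "change_type": "removed"})
    return changes


# Hypothetical diff; the package names and versions are made up for illustration.
SAMPLE_DIFF = """\
diff --git a/requirements.txt b/requirements.txt
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,3 +1,3 @@
 alpha==1.0.0
-beta==1.2.0
+beta==2.0.0
-gamma==0.9.1
+delta==3.1.4
"""

if __name__ == "__main__":
    for change in extract_changes(SAMPLE_DIFF):
        print(change)
```

This works here only because requirements.txt keeps the package name and version on the same line, so every changed line is self-describing.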

The Partial Data Source Problem

The diff data returned by a VCS platform, or by the git diff command itself, contains only the changes, which makes it a partial data source. This is by design: the purpose of a diff is to show what changed, not the entire file.

As a result, for lockfiles that spread a package's data across multiple lines, a package can easily be missed or manipulated whenever its name or its version is absent from the diff.

Some lockfiles and manifests that use multi-line syntax are:

  • Pipfile.lock
  • poetry.lock
  • pom.xml
  • uv.lock
  • yarn.lock
  • Cargo.lock

Examples:

Here is a Pipfile.lock diff snippet showing a package version updated from 0.4.2 to 0.4.5. Because git diff emits only the changed lines plus a few lines of surrounding context per hunk (here @@ -139,7 +139,7 @@), the line carrying the package name falls outside the hunk, so the diff never tells us which package was updated.

diff --git a/Pipfile.lock b/Pipfile.lock
index 3cfcaeee35..98a61e32bf 100644
--- a/Pipfile.lock
+++ b/Pipfile.lock
@@ -139,7 +139,7 @@
"sha256:854bf444933e37f5824ae7bfc1e98d5bce2ebe4160d46b5edf346a89358e99da",
"sha256:e6c6b4334fc50988a639d9b98aa429a0b57da6e17b9a44f0451f930b6967b7a4"
],
"markers": "sys_platform == 'win32'",
- "version": "==0.4.2"
+ "version": "==0.4.5"
},
"coverage": {
@@ -147,50 +147,50 @@
"toml"

In this case the SCA scanner has no way of knowing which package was updated, so it cannot check that package for vulnerabilities. This is a clear flaw in the scanner.
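
To make the failure concrete, here is a hedged Python sketch of a naive hunk scanner applied to the Pipfile.lock hunk above. The function version_changes and both regexes are hypothetical, and the indentation in SAMPLE_HUNK is assumed from the usual Pipfile.lock layout because the snippet above does not preserve it.

```python
import re

# A changed version line, e.g.: -            "version": "==0.4.2"
VERSION_LINE = re.compile(r'^([+-])\s*"version":\s*"==([^"]+)"')
# A package entry in Pipfile.lock opens with a line such as: "coverage": {
# Section keys like "default": { also match; a real parser would tell them apart.
PACKAGE_KEY = re.compile(r'^[ +-]?\s*"([^"]+)":\s*\{\s*$')


def version_changes(hunk_text: str) -> list[dict]:
    """Pair each changed version line with the nearest package key seen above it."""
    current_name = None
    changes = []
    for line in hunk_text.splitlines():
        if line.startswith("@@"):
            current_name = None  # context does not carry over between hunks
            continue
        key = PACKAGE_KEY.match(line)
        if key:
            current_name = key.group(1)
            continue
        version = VERSION_LINE.match(line)
        if version:
            changes.append({
                "sign": version.group(1),
                "version": version.group(2),
                "package_name": current_name,  # None when the name is outside the hunk
            })
    return changes


# The hunk from the Pipfile.lock example; leading whitespace is an assumption.
SAMPLE_HUNK = """\
@@ -139,7 +139,7 @@
                 "sha256:854bf444933e37f5824ae7bfc1e98d5bce2ebe4160d46b5edf346a89358e99da",
                 "sha256:e6c6b4334fc50988a639d9b98aa429a0b57da6e17b9a44f0451f930b6967b7a4"
             ],
             "markers": "sys_platform == 'win32'",
-            "version": "==0.4.2"
+            "version": "==0.4.5"
         },
         "coverage": {
"""

if __name__ == "__main__":
    for change in version_changes(SAMPLE_HUNK):
        print(change)
```

Both version lines come back with package_name set to None. The only package key visible in the hunk is "coverage", which belongs to the next entry, so a scanner or model that guesses from nearby text risks attributing the update to the wrong package.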

Another example:

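Here is a pnpm-lock.yaml diff snippet in which the name of the first updated package falls outside the hunk. The hunk header, @@ -1115,20 +1115,20 @@ importers:, carries only a nearby section heading (importers:), not the package name. From the scanner's point of view this is unpredictable, because whether the name survives depends on how far it sits from the changed lines.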

diff --git a/pnpm-lock.yaml b/pnpm-lock.yaml
index 2c91da270bcc6..a41f0956e3d52 100644
--- a/pnpm-lock.yaml
+++ b/pnpm-lock.yaml
@@ -1115,20 +1115,20 @@ importers:
specifier: 1.0.0
- version: 1.0.0
+ version: 1.0.2
'@types/babel__code-frame':
- specifier: 7.0.2
- version: 7.0.2
+ specifier: 7.0.6
+ version: 7.0.6

A similar example from a yarn.lock diff:

diff --git a/yarn.lock b/yarn.lock
index 6809cdb40b..8dc0b8bf37 100644
--- a/yarn.lock
+++ b/yarn.lock
@@ -1766,6 +1766,16 @@ cli-cursor@^2.1.0:
dependencies:
- restore-cursor "^2.0.0"
+ restore-cursor "^2.0.1"

These examples show the Partial Data Source Problem in diff-based SCA scanners, which malicious actors can easily exploit to circumvent them.

Solution

Given the nature of the problem, especially when the required context is missing from the diff, the solution is a trade-off between:

  1. Completeness and reliability at the cost of handling complexity
  2. Simplicity and ease of use at the cost of missing some packages

For serious security use cases, completeness and reliability are key. Heuristics such as increasing the number of context lines in the diff can help, but they do not guarantee that the required context will be present. In addition, LLMs are inherently nondeterministic, so the output can vary for the same input.
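
The limits of the context-size heuristic can be illustrated with Python's difflib, which produces unified diffs with a configurable number of context lines, much like git diff -U<n>. This is a sketch under assumptions: diff_mentions and fake_entry are hypothetical helpers, the lockfile entry is synthetic with placeholder hashes, and difflib output is not byte-for-byte identical to git's.

```python
import difflib


def diff_mentions(base: str, head: str, needle: str, context: int) -> bool:
    """Return True if a unified diff with `context` context lines contains `needle`."""
    diff = difflib.unified_diff(
        base.splitlines(),
        head.splitlines(),
        fromfile="a/Pipfile.lock",
        tofile="b/Pipfile.lock",
        n=context,
        lineterm="",
    )
    return any(needle in line for line in diff)


def fake_entry(name: str, version: str, hash_count: int) -> str:
    """Build a Pipfile.lock-style entry; the hashes are placeholders, not real digests."""
    hashes = ",\n".join(f'                "sha256:{i:064x}"' for i in range(hash_count))
    return (
        f'        "{name}": {{\n'
        f'            "hashes": [\n{hashes}\n            ],\n'
        f'            "version": "=={version}"\n'
        f'        }}'
    )


# Same hypothetical package before and after a version bump, with 12 hash lines
# sitting between the package name and the version line.
base = fake_entry("example-pkg", "1.0.0", hash_count=12)
head = fake_entry("example-pkg", "1.0.1", hash_count=12)

if __name__ == "__main__":
    for context in (3, 10, 15):
        print(context, diff_mentions(base, head, "example-pkg", context))
```

In this synthetic entry the name sits 15 lines above the version, so 3 or 10 context lines both miss it and only 15 captures it. The distance grows with the number of hashes an entry carries, so any fixed context size can be defeated by a long enough entry.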

For vet and its GitHub Action integration, vet-action, we decided to do the heavy lifting, since vet already understands the different lockfile formats and handles the nuances of different package managers. For reliability, we follow the steps below:

  1. Identify the changed lockfiles in the PR
  2. Fetch the complete lockfile from the base branch, parse it with vet, and use the result as exceptions for the subsequent scanning step
  3. Fetch the complete lockfile from the head branch and scan it with vet, applying the exceptions from the previous step

This approach ensures reliability while avoiding parser differentials where different parsers can yield different results. This also maintains vet lockfile parsers as the single source of truth.
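
The idea behind these steps can be sketched in Python for a single format. This is not vet's implementation: parse_pipfile_lock and newly_introduced are hypothetical names, only Pipfile.lock is handled, and the sample lockfiles and versions are illustrative. The point is that both sides are parsed in full by the same parser, and the base packages act as exceptions.

```python
import json


def parse_pipfile_lock(text: str) -> set[tuple[str, str]]:
    """Return (name, version) pairs from the default and develop sections."""
    data = json.loads(text)
    packages = set()
    for section in ("default", "develop"):
        for name, meta in data.get(section, {}).items():
            version = meta.get("version", "").removeprefix("==")
            packages.add((name, version))
    return packages


def newly_introduced(base_text: str, head_text: str) -> list[tuple[str, str]]:
    """Packages present in the head lockfile but not in the base lockfile."""
    exceptions = parse_pipfile_lock(base_text)  # step 2: base packages are exceptions
    head = parse_pipfile_lock(head_text)        # step 3: parse the complete head lockfile
    return sorted(head - exceptions)


# Illustrative lockfiles; a real Pipfile.lock also carries _meta, hashes, and markers.
BASE_LOCK = json.dumps({
    "default": {
        "colorama": {"version": "==0.4.2"},
        "coverage": {"version": "==7.4.0"},
    },
    "develop": {},
})
HEAD_LOCK = json.dumps({
    "default": {
        "colorama": {"version": "==0.4.5"},
        "coverage": {"version": "==7.4.0"},
    },
    "develop": {"tomli": {"version": "==2.0.1"}},
})

if __name__ == "__main__":
    for name, version in newly_introduced(BASE_LOCK, HEAD_LOCK):
        print(name, version)
```

Because the name and version are read from the parsed structure rather than from diff lines, the updated colorama entry is reported with its name regardless of how many lines separate the two in the file.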

Conclusion

In this blog post, we explored the subtle issues with diff-based SCA scanners, which malicious actors can exploit, or which can simply let packages pass through undetected. These are a few that came to our attention, but there may be many more. Parsing complete lockfiles or SBOMs is the way forward wherever reliability and accuracy are first-class requirements.

Author

Kunal Singh

safedep.io
