Diff-based SCA with AI is Broken — Real Examples from Pipfile.lock, yarn.lock, and Cargo.lock

Kunal Singh

• Sep 19, 2025 • 6 min read

Software Composition Analysis (SCA) is the process of identifying the open source and third-party components used in an application codebase. SCA tools typically report known vulnerabilities, license compliance issues, and other metadata about those packages. Security and open source compliance teams are the primary users of SCA tools.

DevOps Era of SCA and the need to scan Pull Requests (PRs)

In the DevOps era, SCA moved into CI/CD pipelines. Pull requests (PRs) are scanned to identify newly introduced vulnerable and, more recently, malicious packages. The DevOps promise of “fail fast” is achieved by blocking the PR merge if any vulnerable packages are found. To do this, SCA tools need to identify only the packages newly introduced in the PR, not those in the entire codebase.

While this may appear to be a simple process of inventory gathering and vulnerability database matching, SCA tools have grown complex over the years, especially due to the nuances of various package managers, lockfiles, and vulnerability databases.

AI Era of Diff-based SCA Scanners

In the AI era, diff-based SCA scanning seems to be the new approach: a shortcut that avoids dealing with the complexity and nuances of various package managers, languages, and the ways open source packages are introduced into a codebase. Simply hand the git diff, or its GitHub API equivalent, to an LLM and let it do its magic. For example, consider the prompt below:

You are a software engineer reviewing a pull request in DIFF FILE FORMAT.
Analyse the diff below and identify the added, updated and removed packages in
the lockfile.

Provide your output following the JSON schema:

[
  {
    "package_name": "string",
    "version": "string",
    "change_type": "added | updated | removed"
  }
]

Here is the diff:

diff --git a/requirements.txt b/requirements.txt
new file mode 100644
index 0000000..44ec824
--- /dev/null
+++ b/requirements.txt
@@ -0,0 +1,4 @@
+bittenso-cli==9.9.4
+qbittensor==9.9.4
+bitensor==9.9.5
+bittenso==9.9.5

Gemini 2.5 Flash produced the following output in 10 out of 10 runs:

[
  {
    "package_name": "bittenso-cli",
    "version": "9.9.4",
    "change_type": "added"
  },
  {
    "package_name": "qbittensor",
    "version": "9.9.4",
    "change_type": "added"
  },
  {
    "package_name": "bitensor",
    "version": "9.9.5",
    "change_type": "added"
  },
  {
    "package_name": "bittenso",
    "version": "9.9.5",
    "change_type": "added"
  }
]

This may seem like a great approach that completely avoids the complexity of lockfile parsing, but it has some inherent limitations.

In this blog post, we will explore how diff-based SCA scanners work, their limitations, and why they can be easily circumvented by malicious actors.

How Diff-Based SCA Scanners Work

Scanners based on git diff analyze the .diff data for a pull request, either fetched from a VCS platform such as GitHub or computed locally from the branch changes.

These scanners try to extract updated packages from package manager lockfiles and manifests, such as package-lock.json, go.mod, and requirements.txt, and use the extracted data as the source of truth for identifying newly introduced vulnerable packages.

This extraction can be carried out either by parsing, which includes regex and other more sophisticated algorithms, or by using Artificial Intelligence (AI) models to identify the changes in the lockfiles. The scanner uses the + and - markers in the .diff file to identify added, updated, and removed packages.
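As a rough illustration of the parsing route, here is a minimal Python sketch of a regex-based extractor for a requirements.txt diff. The function name extract_changes and the PINNED pattern are hypothetical, only name==version pins are recognised, and this is not how any particular scanner is implemented.

```python
import re

# Hypothetical pattern: matches only pinned lines such as "name==1.2.3".
PINNED = re.compile(r"^([A-Za-z0-9][A-Za-z0-9._-]*)==([^\s;#]+)")


def extract_changes(diff_text: str) -> list[dict]:
    """Classify packages in a requirements.txt diff using the +/- line markers."""
    added, removed = {}, {}
    for line in diff_text.splitlines():
        if line.startswith(("+++", "---")):
            continue  # file header lines, not content changes
        if not line or line[0] not in "+-":
            continue  # context lines, hunk headers, diff metadata
        match = PINNED.match(line[1:].strip())
        if not match:
            continue
        name, version = match.groups()
        (added if line[0] == "+" else removed)[name] = version

    changes = []
    for name, version in added.items():
        # A name that was both removed and added is treated as a version update.
        change_type = "updated" if name in removed else "added"
        changes.append(
            {"package_name": name, "version": version, "change_type": change_type}
        )
    for name, version in removed.items():
        if name not in added:
            changes.append(
                {"package_name": name, "version": version, "change_type": "removed"}
            )
    return changes
```

This works for requirements.txt only because each line carries both the package name and the version. The same line-by-line logic has nothing to pair a version with when the name lives on a different line.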

Whichever method is used, diff-based SCA scanners have an inherent limitation that comes from how git diff works.

The Partial Data Source Problem

The diff data from a VCS platform, or from the git diff command itself, contains only the changes plus a few lines of surrounding context, so it is partial data. This is intended: the purpose of a diff is to show what changed, not the entire file.

Hence, for lockfiles that use multi-line syntax to represent a package, data can easily be missed or manipulated when either the package name or the version is absent from the .diff data.

Some lockfiles and manifests that use multi-line syntax are:

Pipfile.lock
poetry.lock
pom.xml
uv.lock
yarn.lock
Cargo.lock

Examples:

Here is a snippet from a Pipfile.lock diff. It shows a package version updated from 0.4.2 to 0.4.5, but because git diff omits unchanged lines outside the hunk's context window (@@ -139,7 +139,7 @@), the package name never appears.

diff --git a/Pipfile.lock b/Pipfile.lock
index 3cfcaeee35..98a61e32bf 100644
--- a/Pipfile.lock
+++ b/Pipfile.lock
@@ -139,7 +139,7 @@
                 "sha256:854bf444933e37f5824ae7bfc1e98d5bce2ebe4160d46b5edf346a89358e99da",
                 "sha256:e6c6b4334fc50988a639d9b98aa429a0b57da6e17b9a44f0451f930b6967b7a4"
             ],
             "markers": "sys_platform == 'win32'",
-             "version": "==0.4.2"
+             "version": "==0.4.5"
         },
         "coverage": {
@@ -147,50 +147,50 @@
                 "toml"

In this case, the SCA scanner has no way to know which package was updated, and hence cannot check it for vulnerabilities. This is a clear flaw in the scanner.
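To make the failure concrete, here is a minimal Python sketch of a hunk-level extractor for Pipfile.lock. It looks backwards from each added "version" line for the package key that owns it, using only the lines present in the hunk. The names version_changes_in_hunk, PACKAGE_KEY and VERSION_LINE are hypothetical, and the regexes assume the usual pretty-printed Pipfile.lock layout.

```python
import re

# Assumed layout: a package entry opens with a line like `"name": {`.
PACKAGE_KEY = re.compile(r'^\s*"([^"]+)":\s*\{\s*$')
VERSION_LINE = re.compile(r'^\s*"version":\s*"==([^"]+)"')


def version_changes_in_hunk(hunk_lines: list[str]) -> list[tuple]:
    """Return (package_name_or_None, new_version) for each added version line."""
    results = []
    for index, line in enumerate(hunk_lines):
        if not line.startswith("+"):
            continue
        match = VERSION_LINE.match(line[1:])
        if not match:
            continue
        name = None
        # Walk backwards through the hunk looking for the owning package key.
        for previous in reversed(hunk_lines[:index]):
            body = previous[1:]  # drop the diff marker column
            if body.strip() in ("}", "},"):
                break  # left the current entry without finding its key
            key = PACKAGE_KEY.match(body)
            if key:
                name = key.group(1)
                break
        results.append((name, match.group(1)))
    return results
```

Run against the first hunk above, this sketch yields a version of 0.4.5 with a package name of None: the version change is visible, but nothing in the hunk says which package it belongs to.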

Another example:

Here is a snippet from a pnpm-lock.yaml diff in which the package name for the first change falls outside the hunk (@@ -1115,20 +1115,20 @@ importers:). From the scanner's point of view this is unpredictable, because it depends on where git diff happens to place the hunk boundary.

diff --git a/pnpm-lock.yaml b/pnpm-lock.yaml
index 2c91da270bcc6..a41f0956e3d52 100644
--- a/pnpm-lock.yaml
+++ b/pnpm-lock.yaml
@@ -1115,20 +1115,20 @@ importers:
         specifier: 1.0.0
-         version: 1.0.0
+         version: 1.0.2
       '@types/babel__code-frame':
-        specifier: 7.0.2
-        version: 7.0.2
+        specifier: 7.0.6
+        version: 7.0.6

A similar example from a yarn.lock diff:

diff --git a/yarn.lock b/yarn.lock
index 6809cdb40b..8dc0b8bf37 100644
--- a/yarn.lock
+++ b/yarn.lock
@@ -1766,6 +1766,16 @@ cli-cursor@^2.1.0:
   dependencies:
-     restore-cursor "^2.0.0"
+     restore-cursor "^2.0.1"

These examples clearly show the Partial Data Source Problem in diff-based SCA scanners, which malicious actors can easily exploit to circumvent them.

Solution

Given the nature of the problem, especially when the required context is missing from a diff file, the solution is a trade-off between:

Completeness and reliability at the cost of handling complexity
Simplicity and ease of use at the cost of missing some packages

For serious security use cases, completeness and reliability are key. Heuristics such as increasing the number of diff context lines can help, but they do not guarantee that the required context will be present. In addition, LLMs are inherently nondeterministic, so the output can vary for the same input.
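The context-size heuristic can be illustrated with Python's standard difflib, whose unified_diff function takes the number of context lines as n, much like git diff's -U<n> option. This is a sketch under assumptions: the lockfile fragment, the package name example-pkg, and the helper names are made up for illustration.

```python
import difflib

# Hypothetical Pipfile.lock fragment; "example-pkg" and the hashes are placeholders.
BASE = """\
{
    "default": {
        "example-pkg": {
            "hashes": [
                "sha256:aaa",
                "sha256:bbb",
                "sha256:ccc",
                "sha256:ddd"
            ],
            "version": "==0.4.2"
        }
    }
}
"""
HEAD = BASE.replace("==0.4.2", "==0.4.5")


def lockfile_diff(base: str, head: str, context: int) -> str:
    """Unified diff of two lockfile texts with `context` unchanged lines per side."""
    lines = difflib.unified_diff(
        base.splitlines(),
        head.splitlines(),
        fromfile="a/Pipfile.lock",
        tofile="b/Pipfile.lock",
        n=context,
        lineterm="",
    )
    return "\n".join(lines)


def names_package(diff_text: str, package: str) -> bool:
    """True if the quoted package key appears anywhere in the diff text."""
    return f'"{package}"' in diff_text


print(names_package(lockfile_diff(BASE, HEAD, 3), "example-pkg"))   # False
print(names_package(lockfile_diff(BASE, HEAD, 10), "example-pkg"))  # True
```

With the default of three context lines the package key is seven lines above the change and is cut off. Ten lines happens to be enough here, but an entry with more hashes would push the key out of range again, which is why a larger context is a heuristic and not a guarantee.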

For vet and its GitHub Action integration, vet-action, we decided to do the heavy lifting, since vet is already aware of the different lockfile formats and handles the nuances of different package managers. For reliability, we follow the steps below:

Identify the changed lockfiles in the PR
Fetch the complete lockfile from the base branch, parse it with vet, and use the result as exceptions for the subsequent scanning step
Fetch the complete lockfile from the head branch and scan it with vet, applying the exceptions from the previous step

This approach ensures reliability while avoiding parser differentials, where different parsers can yield different results. It also keeps vet's lockfile parsers as the single source of truth.
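The base-versus-head idea can be sketched in a few lines of Python for a single format. This is not vet's implementation: the function names are hypothetical, only Pipfile.lock is handled, and entries without a pinned version are recorded with an empty version string.

```python
import json


def pipfile_lock_packages(text: str) -> set[tuple[str, str]]:
    """Parse a complete Pipfile.lock and return its (name, version) pairs."""
    data = json.loads(text)
    found = set()
    for section in ("default", "develop"):
        for name, meta in data.get(section, {}).items():
            # Pinned versions are stored as "==1.2.3"; strip the operator.
            version = meta.get("version", "").removeprefix("==")
            found.add((name, version))
    return found


def newly_introduced(base_text: str, head_text: str) -> list[tuple[str, str]]:
    """Packages present in the head lockfile but not in the base lockfile."""
    base = pipfile_lock_packages(base_text)  # acts as the exception list
    head = pipfile_lock_packages(head_text)
    return sorted(head - base)
```

Because both sides are parsed from complete files, every reported version arrives with its package name, no matter how far apart the two sit in the lockfile.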

Conclusion

In this post, we explored the subtle issues with diff-based SCA scanners, which malicious actors can easily exploit, or which can simply let packages pass through undetected. These are a few that came to our attention, but there may be many more. Parsing lockfiles or SBOMs is the way forward where reliability and accuracy are first-class requirements.


