Huge performance degradation with large license candidate files due to a bug #31

Open
SamiHiltunen opened this issue Feb 6, 2023 · 1 comment
Labels: help wanted (Extra attention is needed)

Comments

@SamiHiltunen

The library builds a regex out of the normalized first lines of the license files, and later uses that regex to split candidate files.

The problem is that the first line of the App-s2p.txt license normalizes to an empty string. The regex therefore contains an empty alternative and matches at the beginning and end of every line, as a regex tester shows. The bug is visible in the regex itself: search for ||, which is where that license's first line would go.

This causes a huge performance degradation in repositories with large files that match the license filename pattern. One example of such a repository is https://gitlab.com/tikiwiki/tiki, which contains a large file called copyright.txt. Detecting the license for that repository took 22s; with the patch below it takes 260ms:

diff --git a/licensedb/internal/db.go b/licensedb/internal/db.go
index a7254fd..d69118e 100644
--- a/licensedb/internal/db.go
+++ b/licensedb/internal/db.go
@@ -176,6 +176,11 @@ func loadLicenses() *database {
 		if len(header.Name) <= 6 {
 			continue
 		}
+
+		if header.Name == "./App-s2p.txt" {
+			continue
+		}
+
 		key := header.Name[2 : len(header.Name)-4]
 		text := make([]byte, header.Size)
 		readSize, readErr := archive.Read(text)

What would be the appropriate fix here?

@bzz added the help wanted label on Jul 11, 2023
@angshumukherjee100

We are also seeing slowness with large license files, and the degradation is much worse than described above (from seconds to close to an hour). It goes away when we downgrade from 4.3.1 to 4.3.0.

I cannot share the license file, so I will try to debug this and update this issue with what I find about the bottleneck.
