Native support for scanning docker images (transparent nested .tar unpacking) #674

Open
hlein opened this issue Jul 27, 2022 · 1 comment


hlein commented Jul 27, 2022

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Description

It would be nice if trufflehog could smartly scan nested .tar files, as seen in e.g. docker containers.

Problem to be Addressed

When scanning a docker image tarball (such as one saved with docker save ...), trufflehog currently prints only the top-level .tar filename for every hit. That gives no indication of which layer inside the image, or which file path inside a container launched from the image, contains the hit.

Description of the Preferred Solution

Ideally, trufflehog would keep track of where it is while looking inside tar archives, and would do so recursively, because docker images are typically nested .tar files with one inner archive per layer. It would then print that context on a hit, maybe something like:

File: foo.tar:b0d4d7051229875a2bfd9809c631c9899748f0e1fc6f408a446048dc6b60ca20:etc/secrets

Maybe this would be something generalized, that makes trufflehog filesystem smarter. Or, it might have to be a dedicated mode, trufflehog archive or something. Uncompressed .tar is one thing; I expect compressed archives would be more painful.
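
A minimal sketch of the record-keeping described above, written in Python for brevity rather than in trufflehog's own Go. The helper name walk_nested_tar is hypothetical and not part of trufflehog. It assumes uncompressed .tar only, and it assumes that any regular member whose name ends in .tar is a nested archive:

```python
import io
import tarfile


def walk_nested_tar(fileobj, prefix):
    """Yield a colon-joined context path for every regular file,
    descending into members that look like nested .tar archives.

    Hypothetical helper for illustration; not trufflehog code.
    """
    # "r:" means uncompressed tar only; fileobj must be seekable.
    with tarfile.open(fileobj=fileobj, mode="r:") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, symlinks, devices, ...
            path = f"{prefix}:{member.name}"
            if member.name.endswith(".tar"):
                inner = archive.extractfile(member)
                # Buffer the inner archive in memory so it can be read
                # independently of the outer one. Large layers would
                # need a temp file or streaming instead.
                yield from walk_nested_tar(io.BytesIO(inner.read()), path)
            else:
                yield path
```

Called as walk_nested_tar(open("foo.tar", "rb"), "foo.tar"), this would yield strings such as foo.tar:<layer-dir>/layer.tar:etc/secrets, close to the File: line proposed above but with the layer.tar member name kept in the path.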

Additional Context

There is a FUSE filesystem for mounting archives, https://github.com/mxmlnkn/ratarmount, which also supports recursive/nested archives and transparently turns archive files into subdirectories.

So for example:

mkdir -p some_container
ratarmount -c -r -o ro,allow_other some_container.tar some_container
trufflehog filesystem --directory=some_container 2>&1 | tee "trufflehog_some_container.out"

Found unverified result 🐷🔑❓
Detector Type: URI
Raw result: http://user:host@foo:3128
File: some_container/e60a0dfc08a94dabb221d8a28c6fdbeaa7cab0c146d35e8eff8e50bc2e4c194b/layer.tar/usr/lib/python2.7/site-packages/urlgrabber/grabber.py

Found unverified result 🐷🔑❓
Detector Type: URI
Raw result: http://username:password@host.com:80/path
File: some_container/96e436883f4940841fc9f1f7e935bada3859d2ffb0e5455952438d844f8e9c26/layer.tar/usr/lib/python2.7/site-packages/pip/_vendor/urllib3/util/url.py

Found unverified result 🐷🔑❓
Detector Type: PrivateKey
Raw result: -----BEGIN PRIVATE KEY-----
MIICd[snip]
-----END PRIVATE KEY-----
File: some_container/b0d4d7051229875a2bfd9809c631c9899748f0e1fc6f408a446048dc6b60ca20/layer.tar/usr/share/doc/perl-IO-Socket-SSL/example/simulate_proxy.pl
...

Or for a large collection of them:

# As root (allow_other is used): mount each image tarball on a directory of the same name
for A in *.tar ; do
  D="${A%.tar}"
  mkdir -p "$D"
  ratarmount -r -o ro,allow_other "$A" "$D"
done

# As a regular user: scan each mounted image, skipping any that already have output
for A in *.tar ; do
  D="${A%.tar}"
  test -s "trufflehog_${D}.out" && continue
  echo "$D"
  trufflehog filesystem --directory="$D" 2>"trufflehog_${D}.err" | tee "trufflehog_${D}.out"
done

If adding native nested-archive support does not seem worth it/desirable, then perhaps just polish/improve this example and document it somewhere.

Contributor

nyanshak commented May 2, 2023

I was going to have a look into this but realized I probably don't have enough time to untangle this right now since it's tied to multiple things, so instead I'll try to leave some notes that might be helpful for anyone else looking into it.

Right now the Handler interface has FromFile(context.Context, io.Reader) chan ([]byte). For the archive handler, we might instead want each item sent on the channel to be a (path string, []byte) pair. Then we could update some field on chunk.SourceMetadata to represent any sub-archive paths.
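
To make the shape of that change concrete, here is a rough sketch in Python rather than Go. The name chunks_with_paths and the chunk size are hypothetical, and this is not trufflehog's Handler. It reads a single-level, uncompressed tar stream and pairs each chunk with the path of the member it came from, instead of yielding bare bytes:

```python
import tarfile
from typing import BinaryIO, Iterator, Tuple

CHUNK_SIZE = 10 * 1024  # hypothetical value, not trufflehog's real chunk size


def chunks_with_paths(
    reader: BinaryIO, chunk_size: int = CHUNK_SIZE
) -> Iterator[Tuple[str, bytes]]:
    """Yield (member path, chunk) pairs from a tar stream.

    Illustrative only: the generator stands in for a Go channel whose
    items carry a path alongside the bytes.
    """
    # "r|" reads the archive as a forward-only stream, so the reader
    # does not need to be seekable (similar in spirit to an io.Reader).
    with tarfile.open(fileobj=reader, mode="r|") as archive:
        for member in archive:
            if not member.isfile():
                continue
            stream = archive.extractfile(member)
            while True:
                chunk = stream.read(chunk_size)
                if not chunk:
                    break
                yield member.name, chunk
```

A consumer could then copy the path from each pair into whatever metadata field ends up holding sub-archive paths.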

The problems that I see with it:

  • Right now, the path can be used to link to the file (e.g., provide a direct link to the file in GitHub), but the sub-archive (archive in archive) can't be linked to in the sources, so this is one factor that would indicate a new field is needed.
  • Different Source types have unique fields for setting paths. For example, in filesystem, it would be .Filesystem.File, S3 would be .S3.File, GitHub's is .Github.File. Even File itself is not guaranteed, as in the case of Circleci, which might be .Circleci.Link (not sure).

One suggestion would be to add something like ArchivePath to SourceMetadata directly, where you can set full paths, like some_container/b0d4d7051229875a2bfd9809c631c9899748f0e1fc6f408a446048dc6b60ca20/layer.tar:/usr/share/doc/perl-IO-Socket-SSL/example/simulate_proxy.pl. More generally, it could look like PATH_TO_FILE_IN_ARCHIVE[:PATH_TO_FILE_IN_SUB_ARCHIVE]...
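
As an illustration of the PATH_TO_FILE_IN_ARCHIVE[:PATH_TO_FILE_IN_SUB_ARCHIVE]... form, a small sketch in Python of building and splitting such a value. The function names are hypothetical, and rejecting components that contain ":" is an assumption made here to keep the format unambiguous, since file names may legally contain colons:

```python
from typing import List

SEPARATOR = ":"  # separator taken from the format proposed above


def join_archive_path(parts: List[str]) -> str:
    """Join per-archive path components into one ArchivePath-style string."""
    for part in parts:
        if SEPARATOR in part:
            # Assumption: refuse ambiguous input rather than invent an
            # escaping scheme that the proposal does not define.
            raise ValueError(f"path component contains {SEPARATOR!r}: {part!r}")
    return SEPARATOR.join(parts)


def split_archive_path(archive_path: str) -> List[str]:
    """Split an ArchivePath-style string back into its components."""
    return archive_path.split(SEPARATOR)
```

For example, join_archive_path(["img/abc/layer.tar", "/usr/share/doc/x.pl"]) gives img/abc/layer.tar:/usr/share/doc/x.pl, and split_archive_path reverses it. A real implementation would need to decide how to escape or otherwise handle colons in file names.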
