Native support for scanning docker images (transparent nested .tar unpacking) #674

hlein · 2022-07-27T07:45:12Z

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or other comments that do not add relevant new information or questions; they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Description

It would be nice if trufflehog could transparently scan nested .tar files, such as those found in docker images.

Problem to be Addressed

When scanning a docker image tarball (such as one saved with docker save ...), trufflehog currently prints only the top-level .tar filename for every hit. That gives no indication of which component inside the image, or which file path inside a container launched from the image, contains the hit.

Description of the Preferred Solution

Ideally, trufflehog would keep track of where it is when looking inside tar archives, and would do so recursively, because docker images are typically nested .tar files made up of multiple layers. It would then print that context on a hit, maybe something like:

File: foo.tar:b0d4d7051229875a2bfd9809c631c9899748f0e1fc6f408a446048dc6b60ca20:etc/secrets

Maybe this could be something generalized that makes trufflehog filesystem smarter. Or it might have to be a dedicated mode, trufflehog archive or something. Uncompressed .tar is one thing; I expect compressed archives would be more painful.
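
To make the idea concrete, here is a minimal Python sketch of the record-keeping described above. It is not trufflehog code (trufflehog is written in Go) and it only handles uncompressed tar, matching the easy case mentioned above. The helper names open_plain_tar, walk and make_tar are made up for this illustration. Note that the context path it produces keeps the full member name (for example abc123/layer.tar), so it is slightly longer than the foo.tar:<layer>:etc/secrets form suggested above.

```python
import io
import tarfile
from typing import Dict, Iterator, Optional, Tuple


def open_plain_tar(data: bytes) -> Optional[tarfile.TarFile]:
    """Return a TarFile if data is an uncompressed tar archive, else None."""
    try:
        # mode "r:" means "read, no compression"; anything else raises ReadError
        return tarfile.open(fileobj=io.BytesIO(data), mode="r:")
    except tarfile.ReadError:
        return None


def walk(archive: tarfile.TarFile, prefix: str) -> Iterator[Tuple[str, bytes]]:
    """Yield (context_path, content) for each regular file, descending into nested tars."""
    for member in archive:
        if not member.isfile():
            continue
        # Reads the whole member into memory; fine for a sketch, costly for big layers.
        data = archive.extractfile(member).read()
        path = f"{prefix}:{member.name}"
        nested = open_plain_tar(data)
        if nested is None:
            yield path, data
        else:
            with nested:
                yield from walk(nested, path)


def make_tar(files: Dict[str, bytes]) -> bytes:
    """Build an in-memory uncompressed tar (demo helper only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as archive:
        for name, content in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(content)
            archive.addfile(info, io.BytesIO(content))
    return buf.getvalue()


if __name__ == "__main__":
    # Hypothetical image layout: one layer tarball plus a manifest.
    layer = make_tar({"etc/secrets": b"hunter2"})
    image = make_tar({"abc123/layer.tar": layer, "manifest.json": b"[]"})
    with open_plain_tar(image) as top:
        for context_path, content in walk(top, "foo.tar"):
            print(context_path, len(content))
```

A scanner would run its detectors on each yielded content and report the accompanying context path instead of only the top-level filename.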

Additional Context

There is a FUSE filesystem for mounting archives, https://github.com/mxmlnkn/ratarmount, which also supports recursive/nested archives and transparently turns archive files into subdirectories.

So for example:

mkdir -p some_container
ratarmount -c -r -o ro,allow_other some_container.tar some_container
trufflehog filesystem --directory=some_container 2>&1 | tee "trufflehog_some_container.out"

Found unverified result 🐷🔑❓
Detector Type: URI
Raw result: http://user:host@foo:3128
File: some_container/e60a0dfc08a94dabb221d8a28c6fdbeaa7cab0c146d35e8eff8e50bc2e4c194b/layer.tar/usr/lib/python2.7/site-packages/urlgrabber/grabber.py

Found unverified result 🐷🔑❓
Detector Type: URI
Raw result: http://username:[email protected]:80/path
File: some_container/96e436883f4940841fc9f1f7e935bada3859d2ffb0e5455952438d844f8e9c26/layer.tar/usr/lib/python2.7/site-packages/pip/_vendor/urllib3/util/url.py

Found unverified result 🐷🔑❓
Detector Type: PrivateKey
Raw result: -----BEGIN PRIVATE KEY-----
MIICd[snip]
-----END PRIVATE KEY-----
File: some_container/b0d4d7051229875a2bfd9809c631c9899748f0e1fc6f408a446048dc6b60ca20/layer.tar/usr/share/doc/perl-IO-Socket-SSL/example/simulate_proxy.pl
...

Or for a large collection of them:

# Step 1 (shown at a root prompt in the original): mount each tarball read-only
for A in *.tar ; do
  D="${A%.tar}"
  mkdir -p "$D"
  ratarmount -r -o ro,allow_other "$A" "$D"
done

# Step 2 (as a regular user): scan each mount point, skipping ones already scanned
for A in *.tar ; do
  D="${A%.tar}"
  test -s "trufflehog_${D}.out" && continue
  echo "$D"
  trufflehog filesystem --directory="$D" 2>"trufflehog_${D}.err" | tee "trufflehog_${D}.out"
done

If adding native nested-archive support does not seem worth it/desirable, then perhaps just polish/improve this example and document it somewhere.
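
As one possible bit of polish for the workaround, the mounted paths in the saved output can be rewritten into the compact tarball:layer:path form proposed earlier. The sketch below is only an illustration and assumes two things: the mount point is named after the tarball minus its .tar suffix (as in the loops above), and layers appear as <64 hex characters>/layer.tar (as in the sample output). The function names are made up.

```python
import re
from typing import Iterable, Iterator

# Assumed layout under the mount point: <image>/<64-hex layer id>/layer.tar/<path>
LAYER_RE = re.compile(
    r"^(?P<image>[^/]+)/(?P<layer>[0-9a-f]{64})/layer\.tar/(?P<path>.+)$"
)


def compact_path(mounted_path: str) -> str:
    """Turn a ratarmount-style path into '<image>.tar:<layer>:<path>'."""
    match = LAYER_RE.match(mounted_path)
    if match is None:
        return mounted_path  # not inside a layer tarball; leave untouched
    return f"{match['image']}.tar:{match['layer']}:{match['path']}"


def rewrite_lines(lines: Iterable[str]) -> Iterator[str]:
    """Rewrite only the 'File: ' lines of a trufflehog report, pass the rest through."""
    prefix = "File: "
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(prefix):
            yield prefix + compact_path(line[len(prefix):])
        else:
            yield line


if __name__ == "__main__":
    sample = [
        "Detector Type: PrivateKey",
        "File: some_container/"
        "b0d4d7051229875a2bfd9809c631c9899748f0e1fc6f408a446048dc6b60ca20"
        "/layer.tar/usr/share/doc/perl-IO-Socket-SSL/example/simulate_proxy.pl",
    ]
    for out in rewrite_lines(sample):
        print(out)
```

To filter a saved report, pass an open file such as trufflehog_some_container.out to rewrite_lines and print the results.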


nyanshak · 2023-05-02T18:04:57Z

I was going to look into this, but realized I probably don't have enough time to untangle it right now since it's tied to multiple things. Instead, I'll leave some notes that might be helpful for anyone else looking into it.

Right now the Handler interface has FromFile(context.Context, io.Reader) chan ([]byte). For the archive handler, we might instead want the return type to be (path string, []byte). Then we could update some field on chunk.SourceMetadata to represent any sub-archive paths.

The problems that I see with it:

Right now, the path can be used to link to the file (e.g., provide a direct link to the file in GitHub), but the sub-archive (archive in archive) can't be linked to in the sources, so this is one factor that would indicate a new field is needed.
Different Source types have unique fields for setting paths. For example, in filesystem, it would be .Filesystem.File, S3 would be .S3.File, GitHub's is .Github.File. Even File itself is not guaranteed, as in the case of Circleci, which might be .Circleci.Link (not sure).

One suggestion would be to add something like ArchivePath to SourceMetadata directly, where you can set full paths such as some_container/b0d4d7051229875a2bfd9809c631c9899748f0e1fc6f408a446048dc6b60ca20/layer.tar:/usr/share/doc/perl-IO-Socket-SSL/example/simulate_proxy.pl. More generally, it could look like PATH_TO_FILE_IN_ARCHIVE[:PATH_TO_FILE_IN_SUB_ARCHIVE]...
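
To illustrate the PATH[:SUBPATH]... format, here is a small Python sketch of joining and splitting such a value. It is purely hypothetical: ArchivePath does not exist in trufflehog as far as this thread shows, and the function names are made up. One thing it surfaces is that ":" is a legal character in Unix filenames, so a real field would need escaping or would have to store the segments as a list rather than one string. The sketch simply rejects segments that contain the separator.

```python
from typing import List

SEPARATOR = ":"  # separator used in the format suggested above


def join_archive_path(segments: List[str]) -> str:
    """Join one path per archive nesting level into a single ArchivePath string."""
    if not segments:
        raise ValueError("at least one path segment is required")
    for segment in segments:
        if SEPARATOR in segment:
            # Ambiguous without an escaping scheme, so refuse instead of guessing.
            raise ValueError(f"segment contains {SEPARATOR!r}: {segment!r}")
    return SEPARATOR.join(segments)


def split_archive_path(archive_path: str) -> List[str]:
    """Split an ArchivePath string back into one path per nesting level."""
    return archive_path.split(SEPARATOR)


if __name__ == "__main__":
    segments = [
        "some_container/"
        "b0d4d7051229875a2bfd9809c631c9899748f0e1fc6f408a446048dc6b60ca20"
        "/layer.tar",
        "/usr/share/doc/perl-IO-Socket-SSL/example/simulate_proxy.pl",
    ]
    joined = join_archive_path(segments)
    print(joined)
    print(split_archive_path(joined) == segments)
```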

hlein added the enhancement label Jul 27, 2022

zricethezav mentioned this issue Aug 28, 2023

Docker Image Identification in Tar Files #1643

