Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: added debian parser #3543

Open
wants to merge 19 commits into
base: main
Choose a base branch
from
Open

Conversation

joydeep049
Copy link
Contributor

@joydeep049 joydeep049 commented Nov 27, 2023

closes #2917

@joydeep049 joydeep049 closed this Nov 27, 2023
@joydeep049 joydeep049 reopened this Nov 27, 2023
@joydeep049
Copy link
Contributor Author

Only review and Merging left @terriko @anthonyharrison

@anthonyharrison
Copy link
Contributor

@crazytrain328 Can you please add some tests and include some sample data to demonstrate the parser working.

@joydeep049
Copy link
Contributor Author

@anthonyharrison Could you tell me how to do that? I have worked on adding fuzz testing to parsers before. Do i do the same here?

Signed-off-by: Joydeep Tripathy <[email protected]>
@codecov-commenter
Copy link

codecov-commenter commented Nov 29, 2023

Codecov Report

Attention: Patch coverage is 57.42574% with 43 lines in your changes are missing coverage. Please review.

Project coverage is 80.35%. Comparing base (d6cbe40) to head (d0b260a).
Report is 179 commits behind head on main.

Files Patch % Lines
cve_bin_tool/parsers/deb.py 55.29% 34 Missing and 4 partials ⚠️
test/test_language_scanner.py 66.66% 4 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3543      +/-   ##
==========================================
+ Coverage   75.41%   80.35%   +4.93%     
==========================================
  Files         808      823      +15     
  Lines       11983    12799     +816     
  Branches     1598     1999     +401     
==========================================
+ Hits         9037    10284    +1247     
+ Misses       2593     2089     -504     
- Partials      353      426      +73     
Flag Coverage Δ
longtests 75.27% <53.46%> (-0.15%) ⬇️
win-longtests 78.54% <57.42%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@terriko terriko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking promising!

We do probably need a test. I'd suggest you craft a slightly empty .deb file that has the appropriate product data but no actual product files. Put it in the test/assets directory (alongside the other test.deb and .cab and .apk files -- you might be able to re-use the existing test.deb but I don't know if the data you need was stripped or not). Then build a test that calls the functions you wrote to parse the test file and verify that it's finding the correct data.

You'll also want to add documentation explaining how it works and what parts of the .deb control data it's using to do its thing. e.g. If I type cve-bin-tool mydebfile.deb what's going to happen?

@joydeep049
Copy link
Contributor Author

joydeep049 commented Dec 4, 2023

Hello @terriko
I was busy with my end semester exams (finally they are over).
Finally I'm free and I'll be able to contribute regularly.
Thanx for the help on this.

@joydeep049
Copy link
Contributor Author

I am supposed to get a list of the Deb_products to add in the script only after i run a test with DebParser using a test.deb file. I chose to use the test.deb file in the test/assets section.
I am using the following code to test it

import sys
import os
sys.path.append('/home/joydeep/dev/cve-bin-tool')
from cve_bin_tool.parsers.deb import DebParser

from cve_bin_tool.cvedb import CVEDB
from cve_bin_tool.log import LOGGER

cve_db= CVEDB()
logger= LOGGER

file_path = os.path.join(os.getcwd(), 'test.deb')

deb_parser= DebParser(cve_db=cve_db,logger=logger)

deb_parser.run_checker(file_path)

I have brought the test.deb file into the same directory as my testing file. I think there is something wrong in the way im calling Logger. Can you help me @terriko @anthonyharrison ?

@anthonyharrison
Copy link
Contributor

@crazytrain328

The test is OK in your local environment to prove the functionality.. However what we need is a test within the cve-bin-tool test environment where we can add it to the test suite.

The language parsers all have test files in the test/language_data directory. Can you add your test.deb file in this directory and then update the test_language_scanner file to add your test code. I suggest you add a new test test_debian_package which calls the scanner and then asserts that the results are as expected.

Can you confirm that the parser is doing more than is already covered in the extractor module and tested in the test_extractor file which explicitly has a a test for files with a .deb extension.

@joydeep049
Copy link
Contributor Author

@anthonyharrison
This test is not working in my Local Environment. It executes but it does not give any output . Since all the outputs to the console in the run_checker() function is through the logger object, I thought the way in which Im using logger in my test code is wrong.

@joydeep049
Copy link
Contributor Author

joydeep049 commented Dec 7, 2023

I tried to change a few things but my local test still gives no output.
For a change, I set the logging level down to DEBUG, but that does not help.

My code for testing:

import sys
import os
import logging  # Import the logging module

sys.path.append('/home/joydeep/dev/cve-bin-tool')
from cve_bin_tool.parsers.deb import DebParser
from cve_bin_tool.cvedb import CVEDB
from cve_bin_tool.log import LOGGER

LOGGER.setLevel(logging.DEBUG)  

cve_db = CVEDB()
logger = LOGGER

file_path = os.path.join(os.getcwd(), 'test.deb')

deb_parser = DebParser(cve_db=cve_db, logger=logger)

deb_parser.run_checker(file_path)

Modified run_checker() function:

def run_checker(self, filename):
        """Process .deb control file with file existence check"""
        self.logger.debug(f"Scanning .deb control file: {filename}")

        # Check if the file exists
        if not os.path.exists(filename):
            self.logger.error(f"File not found: {filename}")
            return  # Exit the method if file doesn't exist

        try:
            with open(filename) as file:
                control_data = file.read()
            product, version = self.parse_control_file(control_data)
            if product and version:
                product_info = self.find_vendor(product, version)
                if product_info:
                    yield from product_info
            else:
                self.logger.debug(f"No product/version found in {filename}")
        except Exception as e:
            self.logger.error(f"Error processing file {filename}: {e}")

        self.logger.debug(f"Done scanning file: {filename}")

Im stuck! Please help @terriko @anthonyharrison.

@anthonyharrison
Copy link
Contributor

@crazytrain328 Can you provide the test.deb file that you are using?

Tried to run the parser in my environment. The run_checker routine wan't being called. However if I call a different routine in the class it does get called so I suspected there was something wrong with the way run-checker is defined/being called.

I created the following routine

    def do_it(self, filename):
        print ("DO IT")
        try:
            print ("Read file")
            with open(filename) as file:
                control_data = file.read()
            print ("File read")
        except:
            print ("We have a problem")
        print ("DONE IT")

And called this instead of run_checker. This resulted in the exception being called when reading a .deb file (I used the test.deb file in the test/assets directory). Renaming this as run_checker, does result in the run_checker being called. So I think you need to work through the run_checker routine line by line to validate the operation. Using print statement rather than logging may also help.

@joydeep049
Copy link
Contributor Author

@anthonyharrison
I am also using the test.deb in the test/assets directory.
But I will go through the run_checker() definition once again.

@anthonyharrison
Copy link
Contributor

anthonyharrison commented Dec 8, 2023 via email

@joydeep049
Copy link
Contributor Author

joydeep049 commented Dec 10, 2023

I was able to run the run_checker() function.
But now I am getting an Exception : Unable to open database file.

I created my own .deb file and am trying to parse its control file with my DebParser.
I tried to call the GoParser with the go.mod file as well and it is giving me the same error.(Database one)
Heres the local testing code:

import os
import sys
import logging
sys.path.append('/home/joydeep/dev/cve-bin-tool')
from cve_bin_tool.parsers.deb import DebParser

from cve_bin_tool.cvedb import CVEDB
from cve_bin_tool.log import LOGGER

# Set logger to DEBUG level
LOGGER.setLevel(logging.DEBUG)

# Verify the test file path
file_path = '/home/joydeep/mypackage/DEBIAN/control'
if not os.path.exists(file_path):
    raise FileNotFoundError(f"The file {file_path} does not exist.")

# Instantiate the database and the parser
cve_db = CVEDB()
deb_parser = DebParser(cve_db=cve_db, logger=LOGGER)

# Run the parser
try:
    for info in deb_parser.run_checker(file_path):
        print(info)
except Exception as e:
    LOGGER.error(f"Exception occurred: {e}")

Is there something wrong with the way Im using CVEDB?
@anthonyharrison @terriko

@terriko
Copy link
Contributor

terriko commented Dec 12, 2023

If I had to guess, the database problem is that your database hasn't been created or updated in a while. So you're making a new CVEDB() but you're not populating it.

To update your local db and make sure it's functioning:
cve-bin-tool -u now main/test/csv/triage.csv

(The file doesn't matter, I just chose one from our test suite so you can see if the database is working in other code.)

And in your code you'd probably want to call cvedb.get_cvelist_if_stale() to do the equivalent update. That said, this is where using the existing pytest harness would help you a lot over writing a separate test, as we already have database setup code and stuff in the existing test/test_* files and when you run it all in github actions you have access to the cached database so you don't have to initialize it yourself. I'd strongly recommend that you move your test into pytest and use the existing framework before spending too long debugging this: you're going to have to do it eventually anyhow because we need all tests to run through there before this code can be merged, so might as well just learn to do it that way first instead of figuring it out twice.

@joydeep049
Copy link
Contributor Author

joydeep049 commented Dec 13, 2023

If I had to guess, the database problem is that your database hasn't been created or updated in a while. So you're making a new CVEDB() but you're not populating it.

To update your local db and make sure it's functioning: cve-bin-tool -u now main/test/csv/triage.csv

(The file doesn't matter, I just chose one from our test suite so you can see if the database is working in other code.)

And in your code you'd probably want to call cvedb.get_cvelist_if_stale() to do the equivalent update. That said, this is where using the existing pytest harness would help you a lot over writing a separate test, as we already have database setup code and stuff in the existing test/test_* files and when you run it all in github actions you have access to the cached database so you don't have to initialize it yourself. I'd strongly recommend that you move your test into pytest and use the existing framework before spending too long debugging this: you're going to have to do it eventually anyhow because we need all tests to run through there before this code can be merged, so might as well just learn to do it that way first instead of figuring it out twice.

How do I get the DEBIAN_PRODUCTS which i have to add in the test/test_language_scanner.py file, when i write the test using the existing pytest setup?

@terriko
Copy link
Contributor

terriko commented Dec 13, 2023

How do I get the DEBIAN_PRODUCTS which i have to add in the test/test_language_scanner.py file, when i write the test using the existing pytest setup?

Usually you'd make this manually (i.e. cut and paste the data that you used when you created the file).

For example, if you look at https://github.com/intel/cve-bin-tool/blob/main/test/language_data/requirements.txt and then at the PYTHON_PRODUCTS array in https://github.com/intel/cve-bin-tool/blob/main/test/test_language_scanner.py you'll see that the test is just a subset of what could have been detected from the file.

In your case, since a debian package often contains only one product, you may have an array that's just the one thing you put into the metadata of the file, so you could probably write something like

def test_python_package(self, filename: str) -> None:
   assert scanner.scan_file(filename) == "debian_package"

Although you'll have to account for it returning an array rather than a single string or whatever it actually does (sorry, I've got to run to a meeting so I don't have time to double-check the api myself, but you can probably figure it out from the other tests!)

Copy link
Contributor

@terriko terriko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting closer, but it looks like the test doesn't yet match what it's actually returning:

 =========================== short test summary info ===========================
FAILED test/test_language_scanner.py::TestLanguageScanner::test_language_package[D:\\a\\cve-bin-tool\\cve-bin-tool\\test\\language_data\\test_control.deb-products9] - AssertionError: assert 'aptitude' in ['dpkg', 'apt', 'lintian']
FAILED test/test_language_scanner.py::TestLanguageScanner::test_debian_control[D:\\a\\cve-bin-tool\\cve-bin-tool\\test\\language_data\\test_control.deb] - AssertionError: assert ['dpkg', 'apt', 'lintian'] == ['dpkg', 'apt...debuild', ...]
  At index 2 diff: 'lintian' != 'aptitude'
  Right contains 7 more items, first extra item: 'debmake'
  Full diff:
    [
     'dpkg',
     'apt',
  -  'aptitude',
  -  'debmake',
     'lintian',
  -  'debuild',
  -  'dh_make',
  -  'debconf',
  -  'alien',
  -  'gdebi',
    ]
=========== 2 failed, 1869 passed, 32 skipped in 1232.78s (0:20:32) ===========

Note that there's failures only on windows due to an unrelated issue with the linux tests that I'm working on in #3631 -- if you run the linux tests on a local linux machine they'll probably fail too.

@terriko
Copy link
Contributor

terriko commented Dec 19, 2023

Oh, and if you want to run just your new test to see how it works on your system, you can use the -k option:

pytest -vv -k test_control.deb

should probably get you just the new piece you added so you don't have to wait for a whole file worth of tests (or the whole test suite!) to complete.

@joydeep049
Copy link
Contributor Author

joydeep049 commented Dec 20, 2023

Oh, and if you want to run just your new test to see how it works on your system, you can use the -k option:

pytest -vv -k test_control.deb

should probably get you just the new piece you added so you don't have to wait for a whole file worth of tests (or the whole test suite!) to complete.

All the products that my test_control.deb has I have listed in the DEBIAN_PRODUCTS list ..I also modified the debparser code to be able to extract the products and their versions more efficiently, but still it does not give me the desired output.

One thing I read about debian control files is that while the actual package has a .deb extension, the control file inside the package (which basically contains metadata about the debian package) is actually a text file (without extension). Should I write my tests to process a control.txt file?

@terriko
Copy link
Contributor

terriko commented Jan 3, 2024

Typically, you'd want to have the test process a .deb and find and parse the control.txt, so... both?

@joydeep049
Copy link
Contributor Author

So I've been trying to find ways on how to unpack Debian packages using python and so far haven't had any luck in that .
The test.deb file in test/assets has a structure like:
test.deb
--->control.tar.xz
---> .
---> control
--->usr/bin ...
Need Ideas on how to proceed.

@joydeep049
Copy link
Contributor Author

joydeep049 commented Feb 5, 2024

Hello @terriko , @anthonyharrison @b31ngd3v @Rexbeast2
Finally I was able to make my code parse a debian package and bring out the contents of its control file. However, I wouldnt expect any cves to actualy be present since this is purely a test file. So, if I write the usual test in test_language_scanner , Im bound to get an assertion error. How do i solve this?
Also , please help me with the issue in bandit linter as it says that tarfile library has high severity. I went through all the docs and the ways and still couldnt find a way to build a tarfile extractor without using that library. Even the internal library of python uses a function shutil.unpack_archive which in turn uses _extract_tarfile functions which uses tarfile library.
What to do ?

@joydeep049
Copy link
Contributor Author

joydeep049 commented Feb 21, 2024

Only review and merge left.
Thanx
@terriko
(I want to remove some of the comments from the code but If i do it now it will get stuck in the CI due to failing tests. Will open another doc issue for that)

Eagerly Waiting for review :)

Copy link
Contributor

@terriko terriko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry it's taken so long to get back to this one. There's two discussions happening behind the scenes that affect this PR:

  1. The tarfile issue. We need to make sure that this code also extracts only the files needed and that it can't clobber any existing files. I think the best solution for you is to extract only the one specific file you need and no other files.
  2. Checking in binary files to our git repo. We've been doing a lot of this in the past and it's getting flagged by OpenSSF. I think the right choice in this case is have the test actually construct the .deb file that we'll use for testing.

I also think we need to do some thinking to make sure that we aren't duplicating effort or over-riding existing code that previously unpacked debian files and looks for strings. I'm honestly not sure what the right choice is to streamline that, but maybe it won't be so bad if this particular code is only extracting the control data so we're at least not unpacking two full archives into /tmp/


# Change the working directory to the temp_dir for extraction
os.chdir(temp_dir)
await aio_run_command(["ar", "x", filename])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. We should check to make sure ar exists before we run it.
  2. Is there any chance that ar could escape the temp directory? I know we're having this problem with tar but I don't know if ar has the same problem.
  3. If we're only using the control data to identify the .deb file, let's change the code to only extract that part and absolutely no other files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ar, as used in the aio_run_command() function, helps to extract the contents of the given debian package into a temporary directory, inside which we keep extracting till we reach the control file, then the contents of the file are written onto control_data variable and then the temporary directory is closed, deleting all the extracted contents.
I think my code already does what you want it to.
As for why im using -x to extract all the files inside the package - It is because the control files are sometimes present in different directories than one might expect it to . It might be present directly inside the debian package or maybe inside another tar file in the package. That is why all the files are being extracted here.

@@ -236,13 +238,15 @@ def test_language_package_none_found(self, filename: str) -> None:
(str(TEST_FILE_PATH / "Package.resolved"), SWIFT_PRODUCTS),
(str(TEST_FILE_PATH / "composer.lock"), PHP_PRODUCTS),
(str(TEST_FILE_PATH / "cpanfile"), PERL_PRODUCTS),
(str(TEST_FILE_PATH / "test.deb"), DEBIAN_PRODUCTS),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have test_language_scanner create the .deb file rather than checking it in to the test directory?

We've typically just put small test files in the directory and let it be, but it's hurting our OpenSSF score when we provide things like .deb packages that are basically installable (weirdly, it doesn't flag on the thousand .tar.gz files... yet).

I think .deb files use mostly tools we already have installed so it should be possible to write python or a makefile to generate the file here and add some code to skip the test if the file can't be built. Sorry that you get stuck as a guinea pig here!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Umm, I was working on this and was able to write something like this..

import os
import subprocess

def create_debian_package(directory, package_name, version, architecture, description, maintainer):
    # Create the necessary directory structure
    debian_dir = os.path.join(directory, 'DEBIAN')
    os.makedirs(debian_dir, exist_ok=True)

    # Create the control file
    control_content = f"""Package: {package_name}
Version: {version}
Architecture: {architecture}
Maintainer: {maintainer}
Description: {description}
"""
    with open(os.path.join(debian_dir, 'control'), 'w') as control_file:
        control_file.write(control_content)

    # Build the package
    subprocess.run(['dpkg-deb', '--build', directory, f'{package_name}_{version}_{architecture}.deb'])

if __name__ == '__main__':
    create_debian_package(
        directory='mypackage',
        package_name='mypackage',
        version='1.0',
        architecture='all',
        description='Example package',
        maintainer='Joydeep <[email protected]>'
    )

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I add this file in the current PR or as a different PR?
This one has been going on for way too long 🥲🥲

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test.deb can of course be created for testing. But, would'nt that be like extra memory space?
Even if I create the debian package as a temporary file everytime, it would still be extra work

Copy link
Contributor

@terriko terriko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a heads up: this has a bunch of merge conflicts now and will need some work.

@joydeep049
Copy link
Contributor Author

Just a heads up: this has a bunch of merge conflicts now and will need some work.

I'll get back to solving this as soon as we have the PURL generation for language parsers and their tests figured out.
@terriko @anthonyharrison

@terriko
Copy link
Contributor

terriko commented Apr 3, 2024

Marking this as blocked so I don't look at it again until after 3.3 is out.

@terriko terriko added the blocked label Apr 3, 2024
@joydeep049
Copy link
Contributor Author

Marking this as blocked so I don't look at it again until after 3.3 is out.

Sure! I did mention working on this issue as part of stretch goals in my GSOC project.
Maybe I'll get to it in the community bonding period.

Btw, When will the 3.3 version be coming out? @terriko

@joydeep049
Copy link
Contributor Author

joydeep049 commented Apr 20, 2024

Hello @terriko , Since the release is out I was thinking we can finally work on finishing this one.
Or should this be prioritised after 3.3.1 release is out?
(Merge conflicts have been resolved)

@terriko
Copy link
Contributor

terriko commented Apr 22, 2024

I've barely started with 3.3.1 planning so I expect this will get merged long before there, but it's going to be at least a few weeks. I'm severely backlogged on non-cve-bin-tool stuff at the moment and have to put my focus elsewhere.

@joydeep049
Copy link
Contributor Author

I've barely started with 3.3.1 planning so I expect this will get merged long before there, but it's going to be at least a few weeks. I'm severely backlogged on non-cve-bin-tool stuff at the moment and have to put my focus elsewhere.

Absolutely no problem at all!
I think this one is almost ready to merge except that I would have to add a script which creates a temporary debian file to test the code itself.
But I thought it may end up being more expensive than what we are doing now.

@terriko terriko removed the blocked label Apr 22, 2024
@terriko
Copy link
Contributor

terriko commented Apr 22, 2024

As for testing files: for now, let's got with just including the .deb in git. I'm going to have to deal with the OpenSSF's insistence on there not being binary files eventually but I think at the moment it's more important to me that we have a functional test if it's not super easy to just have a makefile for it or something.

@joydeep049
Copy link
Contributor Author

As for testing files: for now, let's got with just including the .deb in git. I'm going to have to deal with the OpenSSF's insistence on there not being binary files eventually but I think at the moment it's more important to me that we have a functional test if it's not super easy to just have a makefile for it or something.

So would the current structure work?
I'm using the test.deb file in test/assets

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat: "language" parser for .deb control data
4 participants