How to create code2vec input #186

Open
messiGao opened this issue Nov 16, 2023 · 9 comments

Comments

@messiGao

I ran the extractor with `java -cp JavaExtractor-0.0.1-SNAPSHOT.jar JavaExtractor.App --max_path_length 8 --max_path_width 2 --dir test.java > file.txt`, then ran `python3 code2vec.py --load models/java14_model/saved_model_iter8.release --test file.txt`, but I get this error:

return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.InvalidArgumentError: Expect 201 fields but have 4 in record
[[{{node IteratorGetNext}}]]

@urialon
Collaborator

urialon commented Nov 16, 2023

Hi @messiGao ,
Thank you for your interest in our work.

I think there is some confusion here: the exception is raised by TensorFlow, whereas the Java command you mentioned does not involve TensorFlow at all.

May I also ask what kind of task you are looking into?
Maybe I can recommend a newer model.

Best,
Uri

@messiGao
Author

messiGao commented Nov 16, 2023

I want to use the `--test` option to export `<TEST_FILE>.vectors`, but I don't know what kind of TEST_FILE is correct. When I asked GPT-4, the answer was to use the JavaExtractor to convert my test.java to test.txt.

@messiGao
Author

Additionally, my aim is to store a Java codebase in a vector database so that I can run similarity searches and retrieve the code files relevant to my query.

@urialon
Collaborator

urialon commented Nov 16, 2023

Hi @messiGao ,

Please see https://github.com/neulab/code-bert-score
You don't need the approach itself, but the project provides Hugging Face models, including one specifically for Java called neulab/codebert-java.

That lets you load the model with the Hugging Face library and use it like any other BERT-like model.

Best,
Uri
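For readers following this suggestion, here is a minimal sketch of how embeddings for similarity search might be produced with that model. Only the model name `neulab/codebert-java` comes from the thread; the mean-pooling step, the function names, and the `max_length` value are assumptions made for illustration and are not taken from the code-bert-score project. `embed_code` needs `torch` and `transformers` installed and downloads the model on first use.

```python
import math
from typing import List, Sequence


def embed_code(snippets: Sequence[str],
               model_name: str = "neulab/codebert-java") -> List[List[float]]:
    """Return one mean-pooled vector per snippet (hypothetical helper)."""
    # Imported lazily so the similarity helpers below work without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    batch = tokenizer(list(snippets), padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, tokens, dim)

    # Average the token vectors, ignoring padding positions (an assumption;
    # other pooling strategies are possible).
    mask = batch["attention_mask"].unsqueeze(-1).float()
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return pooled.tolist()


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_similarity(query_vec: Sequence[float],
                       corpus_vecs: Sequence[Sequence[float]]) -> List[int]:
    """Return corpus indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, v) for v in corpus_vecs]
    return sorted(range(len(corpus_vecs)), key=lambda i: scores[i], reverse=True)
```

In practice the vectors returned by `embed_code` would be written to the vector database, which would then take over the ranking that `rank_by_similarity` does here in plain Python.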

@asyed79gatech

I have a similar dilemma with regard to creating embeddings of C# code using a code2vec model I have trained. As @messiGao mentioned, I want to use the "--test" option to create a .vectors file as described in the repo, but when I execute the command, it gives the following error:

```
tensorflow.python.framework.errors_impl.InvalidArgumentError: Expect 201 fields but have 2 in record
         [[node IteratorGetNext (defined at /usr/local/lib/python3.7/dist-packages/tensorflow_core/python/framework/ops.py:1751) ]]
```

@urialon
Collaborator

urialon commented Feb 22, 2024

Hi @asyed79gatech ,
Thank you for your interest in our work.

I believe that you haven't run the preprocess.sh script on the data.

However, in general I recommend using the newer https://github.com/neulab/code-bert-score project. It is based on Hugging Face, which is actively maintained.

Best,
Uri
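A quick way to check whether a file has the shape the model is complaining about is to count the fields on each line. The sketch below is a diagnostic only. The number 201 is taken from the error messages in this thread; the space delimiter and the helper names are assumptions, not something the thread confirms.

```python
from collections import Counter
from typing import Iterable

EXPECTED_FIELDS = 201  # taken from the error message: "Expect 201 fields"


def field_counts(lines: Iterable[str], delimiter: str = " ") -> Counter:
    """Count how many lines have each number of delimiter-separated fields."""
    counts = Counter()
    for line in lines:
        counts[len(line.rstrip("\n").split(delimiter))] += 1
    return counts


def looks_preprocessed(lines: Iterable[str],
                       expected: int = EXPECTED_FIELDS) -> bool:
    """True only if there is at least one line and every line has `expected` fields."""
    counts = field_counts(lines)
    return bool(counts) and set(counts) == {expected}
```

For example, `with open("file.txt") as f: print(field_counts(f))` shows the distribution of field counts. Raw extractor output with three path contexts on a line would report 4 fields, which matches the "have 4 in record" part of the first error above.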

@asyed79gatech

Hi @urialon

Thanks for your prompt response. I thought we only needed to run the preprocess.sh script when training the code2vec model. I already have a trained, released model and want it to generate embeddings for a vector store.
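To illustrate why some preprocessing may still be needed at inference time, the sketch below pads one line of extractor output to a fixed number of fields. It assumes that the 201 expected fields are one target name plus 200 path contexts, separated by single spaces, with missing contexts left as empty fields. That reading is inferred from the error message and is not confirmed in this thread; `pad_extractor_line` is a hypothetical helper, and the repository's own preprocess.sh remains the authoritative route, since it may do more than fix the field count.

```python
MAX_CONTEXTS = 200  # assumption: 201 expected fields = 1 target name + 200 contexts


def pad_extractor_line(line: str, max_contexts: int = MAX_CONTEXTS) -> str:
    """Pad (or truncate) one extractor line to 1 + max_contexts space-separated fields."""
    parts = line.strip().split(" ")
    target, contexts = parts[0], parts[1:max_contexts + 1]
    # Missing contexts become empty fields, which show up as trailing spaces.
    fields = [target] + contexts + [""] * (max_contexts - len(contexts))
    return " ".join(fields)
```

Truncating extra contexts is a simplification chosen for brevity. This sketch only makes the field count consistent; it does not check that the contexts are ones the trained model knows about.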

@XuPing1234

"I use a command like `java -cp JavaExtractor-0.0.1-SNAPSHOT.jar JavaExtractor.App --max_path_length 8 --max_path_width 2 --dir test.java > file.txt`, then `python3 code2vec.py --load models/java14_model/saved_model_iter8.release --test file.txt`, but get the error: return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict, tensorflow.python.framework.errors_impl.InvalidArgumentError: Expect 201 fields but have 4 in record [[{{node IteratorGetNext}}]]"

Hello, have you resolved your issue? How can Java source code be converted into the input format required by code2vec?

@zhaojialinnn

"I use a command like `java -cp JavaExtractor-0.0.1-SNAPSHOT.jar JavaExtractor.App --max_path_length 8 --max_path_width 2 --dir test.java > file.txt`, then `python3 code2vec.py --load models/java14_model/saved_model_iter8.release --test file.txt`, but get the error: return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict, tensorflow.python.framework.errors_impl.InvalidArgumentError: Expect 201 fields but have 4 in record [[{{node IteratorGetNext}}]]"

"Hello, have you resolved your issue? How can Java source code be converted into the input format required by code2vec?"

Hello, I encountered the same issue. Have you resolved it?


5 participants