
Repo level concatenation of data #43

Open
Bytes-Explorer opened this issue Nov 21, 2023 · 38 comments

Comments

@Bytes-Explorer

Can you share more details on the technique for repo level concatenation part?

@guoday
Collaborator

guoday commented Nov 24, 2023

We first parse the dependencies between files, e.g. A->B, B->C, B->D, and then rearrange the file positions based on their dependencies, e.g. A, B, C, D. For file paths, we add them to each code file as a comment. An example is shown in https://github.com/deepseek-ai/DeepSeek-Coder#4-repository-level-code-completion
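As a rough sketch of this step (file names and contents are invented for illustration, and the helper name is hypothetical): once a dependency order such as A, B, C, D is fixed, concatenation just prepends each file's path as a comment.

```python
def concat_repo(ordered_paths, files):
    """Concatenate files in dependency order, prefixing each file's
    content with its path as a comment."""
    return "\n".join(f"# {path}\n{files[path]}" for path in ordered_paths)

# Toy repo matching the A -> B, B -> C, B -> D example above.
files = {
    "A.py": "import B\n",
    "B.py": "import C\nimport D\n",
    "C.py": "def c():\n    return 1\n",
    "D.py": "def d():\n    return 2\n",
}
sample = concat_repo(["A.py", "B.py", "C.py", "D.py"], files)
print(sample.startswith("# A.py"))  # -> True
```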

@Bytes-Explorer
Author

Thank you for your response. Is this done for all languages in the data?

@guoday
Collaborator

guoday commented Nov 27, 2023

Only for Python, Java, C#, C, and C++.

@Bytes-Explorer
Author

Thank you @guoday

@Bytes-Explorer
Author

@guoday Do you then do repo level dedup for all programming languages or just the above languages?

@guoday
Collaborator

guoday commented Nov 28, 2023

Just the above languages. Other languages use file-level dedup.

@Bytes-Explorer
Author

@guoday Thank you for your prompt responses. I was curious if you did any ablation studies/evaluations to understand if repo level concatenation helped the model performance in a significant way.

@guoday
Collaborator

guoday commented Nov 28, 2023

Not yet. We will try to evaluate the model on repo-level benchmarks. For function-level benchmarks, repo-level concatenation neither helps nor hurts model performance.

@Bytes-Explorer
Author

Bytes-Explorer commented Nov 28, 2023

Do you have your own repo level benchmark or use a standard one?

@guoday
Collaborator

guoday commented Nov 28, 2023

We will use public datasets like RepoCoder and CrossCodeEval to evaluate.

@Bytes-Explorer
Author

Ok thanks, was aware of those. Once again, appreciate your prompt responses. I look forward to reading the technical report from your group. Thanks!

@Casi11as

Hello, I would like to know the details of the concatenation of data. Assume that the structure of the parsed dependencies is as in the picture; what are the concatenation results? Is it ACF, ADF, ADG, BCF, BDF, BDG, BE? 7 pieces?

@guoday
Collaborator

guoday commented Nov 29, 2023

First, we select the file with the smallest incoming degree, and if there are multiple files with the smallest incoming degree, we randomly choose one. This process is repeated until a dependency order is obtained. For your example, there are many possibilities, one of which could be BACDFGE.

@Casi11as

First, we select the file with the smallest incoming degree, and if there are multiple files with the smallest incoming degree, we randomly choose one. This process is repeated until a dependency order is obtained. For your example, there are many possibilities, one of which could be BACDFGE.

In other words, will all the files of the same language in a repo be concatenated into just one sample?

@guoday
Collaborator

guoday commented Nov 29, 2023

Theoretically, yes. However, to shorten the sample length, we parse a repository in advance and then divide it into multiple independent subgraphs based on dependencies, with each independent subgraph regarded as a sample.

@Casi11as

Thanks! So what are the rules for dividing into subgraphs? Taking the picture I posted above as an example, what subgraphs will it be divided into?

@slamandar

Regarding repo-level concatenation, I have a related question.

In a batch, one sample may contain multiple documents from different files, such as repo_a/file_a and repo_a/file_b. When concatenating these files into one sample for pre-training and calculating the loss, will there still be an attention mask to prevent file_b from attending to file_a?

If there is an attention mask, how can it serve the purpose of capturing the repository context? If there isn't, training by simply concatenating the beginning and end of different files seems somewhat peculiar.

@guoday
Collaborator

guoday commented Nov 29, 2023

The term "independent subgraph" refers to a weakly connected subgraph. First, convert the directed graph into an undirected graph, and then divide the graph into multiple connected subgraphs; that is, within each subgraph, any two vertices are connected by a path. In your example, the whole graph forms a single connected subgraph, so it is not divided further. The following code divides the graph into subgraphs.

from collections import defaultdict

# convert the directed graph into an undirected graph
def to_undirected(graph):
    undirected_graph = defaultdict(set)
    for node in graph:
        undirected_graph[node]  # touch the node so isolated files are kept
        for neighbor in graph[node]:
            undirected_graph[node].add(neighbor)
            undirected_graph[neighbor].add(node)
    return undirected_graph

# Use DFS to find all connected subgraphs.
def dfs(graph, node, visited, subgraph):
    visited[node] = True
    subgraph.add(node)
    for neighbor in graph[node]:
        if not visited[neighbor]:
            dfs(graph, neighbor, visited, subgraph)

# obtain all subgraphs
def get_subgraphs(graph):
    undirected_graph = to_undirected(graph)
    visited = {node: False for node in undirected_graph}
    subgraphs = []
    for node in undirected_graph:
        if not visited[node]:
            subgraph = set()
            dfs(undirected_graph, node, visited, subgraph)
            subgraphs.append(subgraph)
    return subgraphs
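The same decomposition can also be written iteratively (a variant sketch, not the authors' exact code), which sidesteps Python's recursion limit on very large repositories:

```python
from collections import defaultdict

def get_subgraphs(graph):
    """Treat edges as undirected and return the weakly connected components."""
    undirected = defaultdict(set)
    for node, neighbors in graph.items():
        undirected[node]  # ensure isolated nodes appear
        for nb in neighbors:
            undirected[node].add(nb)
            undirected[nb].add(node)
    visited, components = set(), []
    for start in undirected:
        if start in visited:
            continue
        # Iterative DFS with an explicit stack instead of recursion.
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in visited:
                continue
            visited.add(node)
            comp.add(node)
            stack.extend(undirected[node])
        components.append(comp)
    return components

# Two independent subgraphs -> two training samples.
deps = {"A": ["B"], "B": [], "C": ["D"], "D": []}
print(sorted(sorted(c) for c in get_subgraphs(deps)))
# -> [['A', 'B'], ['C', 'D']]
```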

@guoday
Collaborator

guoday commented Nov 29, 2023

Regarding repo-level concatenation, I have a related question.

In a batch, one sample may contain multi docs from different files, such as repo_a/file_a and repo_a/file_b. When concatenating these files into one sample for pre-training and calculating loss, will there still be an attention mask to prevent file_b from attending file_a?

If there is an attention mask, how can it serve the purpose of capturing the repository context? If there isn't, training by simply concatenating the beginning and end of different files seems somewhat peculiar.

If file_b depends on file_a, why is there a need for an attention mask to prevent file_b from attending file_a? Conversely, if file_b doesn't depend on file_a, we wouldn't concatenate these files into a single sample.

@guoday
Collaborator

guoday commented Nov 29, 2023

Regarding repo-level concatenation, I have a related question.
In a batch, one sample may contain multi docs from different files, such as repo_a/file_a and repo_a/file_b. When concatenating these files into one sample for pre-training and calculating loss, will there still be an attention mask to prevent file_b from attending file_a?
If there is an attention mask, how can it serve the purpose of capturing the repository context? If there isn't, training by simply concatenating the beginning and end of different files seems somewhat peculiar.

If file_b depends on file_a, why is there a need for an attention mask to prevent file_b from attending file_a? Conversely, if file_b doesn't depend on file_a, we wouldn't concatenate these files into a single sample.

In specific scenarios, such as the one described in #43 (comment), the input sequence is BACDFGE. In this case, even though file E does not directly depend on file A, file E is still allowed to attend to file A. This approach enables the model to effectively utilize contextual information from various files within the repository, thereby enhancing its overall comprehension and performance.

@slamandar

Thank you for your prompt and detailed response!

One last question.

In one sample, is there a need for a special token between the concatenated files, so that the model can distinguish that there are multiple files and avoid generating code like "import package" after the main content in some downstream scenarios?

@guoday
Collaborator

guoday commented Nov 29, 2023

In fact, a special token is required. However, we incorporate comments such as #utils.py and #model.py before each file to indicate to the model that the code completion is at the repository level.

@slamandar

Completely understand. Thanks again for your quick response!

@Bytes-Explorer
Author

@guoday I was also wondering what you do with the other files, like build files or metadata files? Thanks

@vaisaxena

@guoday Thanks for the details above. It was quite helpful. One follow up question.

Do you take care of cycles that may appear in the dependency graph of the files? How do you handle them? This is the case where A->B, B->C, C->A.

@dongs0104

Regarding repo-level concatenation, I have a related question.
In a batch, one sample may contain multi docs from different files, such as repo_a/file_a and repo_a/file_b. When concatenating these files into one sample for pre-training and calculating loss, will there still be an attention mask to prevent file_b from attending file_a?
If there is an attention mask, how can it serve the purpose of capturing the repository context? If there isn't, training by simply concatenating the beginning and end of different files seems somewhat peculiar.

If file_b depends on file_a, why is there a need for an attention mask to prevent file_b from attending file_a? Conversely, if file_b doesn't depend on file_a, we wouldn't concatenate these files into a single sample.

In specific scenarios, such as the one described in #43 (comment), the input sequence is BACDFGE. In this case, even though file E does not directly depend on file A, the file E is allowed to attend the file A. This approach enables the model to effectively utilize contextual information from various files within the repository, thereby enhancing its overall comprehension and performance.

@guoday
But nodes A and B are not connected in the directed graph, so I have a question: what if A and B contain similar contents? Did you run get_subgraphs and then re-order at the repo level again?

@guoday
Collaborator

guoday commented Dec 18, 2023

Regarding repo-level concatenation, I have a related question.
In a batch, one sample may contain multi docs from different files, such as repo_a/file_a and repo_a/file_b. When concatenating these files into one sample for pre-training and calculating loss, will there still be an attention mask to prevent file_b from attending file_a?
If there is an attention mask, how can it serve the purpose of capturing the repository context? If there isn't, training by simply concatenating the beginning and end of different files seems somewhat peculiar.

If file_b depends on file_a, why is there a need for an attention mask to prevent file_b from attending file_a? Conversely, if file_b doesn't depend on file_a, we wouldn't concatenate these files into a single sample.

In specific scenarios, such as the one described in #43 (comment), the input sequence is BACDFGE. In this case, even though file E does not directly depend on file A, the file E is allowed to attend the file A. This approach enables the model to effectively utilize contextual information from various files within the repository, thereby enhancing its overall comprehension and performance.

@guoday But nodes A and B are not connected in the directed graph, so I have a question: what if A and B contain similar contents? Did you run get_subgraphs and then re-order at the repo level again?

Nodes A and B are connected in an undirected graph, meaning they belong to the same input sequence. If A and B have similar contents, B can leverage the content of A as additional context to enhance the completion process (assuming B follows A in the sequence). We do not re-order these nodes.

@reignianor

reignianor commented Jan 9, 2024

Truly remarkable work! I am curious about the advantages of repo concatenation in your training process. Do you first pre-train using file-level code (at a 4K window) and then continue training with repo-level code (at a 16K window)? What about pre-training with repo-level code at a 4K window first?

@zte-tcb

zte-tcb commented Jan 15, 2024

@guoday Thanks for the details above. It was quite helpful. One follow up question.

Do you take care of cycles that may appear in the dependency graph of the files ? How do you handle that ? This is the case where A->B, B->C, C->A

I have the same doubts.

@juncaofish

@guoday Thanks for the details above. It was quite helpful. One follow up question.
Do you take care of cycles that may appear in the dependency graph of the files ? How do you handle that ? This is the case where A->B, B->C, C->A

I have the same doubts.

Actually, couldn't the dependencies of files within the same repository be represented as a DAG? The case you show, A->B, B->C, C->A, should be impossible, since it would cause a circular reference problem.

@guoday
Collaborator

guoday commented Jan 20, 2024

@guoday Thanks for the details above. It was quite helpful. One follow up question.
Do you take care of cycles that may appear in the dependency graph of the files ? How do you handle that ? This is the case where A->B, B->C, C->A

I have the same doubts.

Actually, couldn't the dependencies of files within the same repository be represented as a DAG? The case you show, A->B, B->C, C->A, should be impossible, since it would cause a circular reference problem.

The algorithm employs a modified topological sort. Unlike the standard approach that selects nodes with zero in-degrees, this algorithm selects nodes with minimal in-degrees, which allows it to handle cycles within the graph.
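A reconstruction of that modified topological sort from the description above (not the authors' code; ties are broken alphabetically here for determinism, whereas the thread says they are broken randomly):

```python
def dependency_order(graph):
    """Order files by repeatedly removing a node with minimal in-degree.

    `graph` maps a node to the nodes it points to. Unlike Kahn's
    algorithm, which only removes zero-in-degree nodes and stalls on
    cycles, picking the currently *minimal* in-degree always makes
    progress, so cyclic dependencies are handled gracefully."""
    indeg = {node: 0 for node in graph}
    for node in graph:
        for succ in graph[node]:
            indeg[succ] = indeg.get(succ, 0) + 1
    remaining = set(indeg)
    order = []
    while remaining:
        # The thread breaks ties randomly; we break them alphabetically
        # so this sketch is deterministic.
        node = min(remaining, key=lambda n: (indeg[n], n))
        order.append(node)
        remaining.remove(node)
        for succ in graph.get(node, ()):
            if succ in remaining:
                indeg[succ] -= 1
    return order

# Acyclic case from the thread: A->B, B->C, B->D orders as A, B, C, D.
print(dependency_order({"A": ["B"], "B": ["C", "D"], "C": [], "D": []}))
# -> ['A', 'B', 'C', 'D']
# A cycle A->B->C->A still yields a complete order instead of stalling.
print(dependency_order({"A": ["B"], "B": ["C"], "C": ["A"]}))
# -> ['A', 'B', 'C']
```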

@slamandar

In the process of reproducing repository-level data concatenation, I have a question.

Is the file-level data or the unparsed language data (excluding Python/Java/C/C++/C#) included in the long-context continued pre-training dataset?

@kail8

kail8 commented Feb 29, 2024

Hi, I'm really impressed by your advanced work. I have an extra question: in the repo-level concatenation, if one file depends on some huge files or libraries (such as torch or transformers), the concatenated sample will inevitably exceed the window size / context length. How do you deal with this problem? @guoday

@guoday
Collaborator

guoday commented Feb 29, 2024

In the process of reproducing repository-level data concatenation, I have a question.

Is the file-level data or the unparsed language data (excluding Python/Java/C/C++/C#) included in the long-context continued pre-training dataset?

For unparsed language data or repository-level code that surpasses 32KB, we split it into file-level data for use in continued pre-training.
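A minimal sketch of that fallback (the helper name is hypothetical, and only the 32KB cap comes from the thread): keep the concatenated repo sample if it fits the cap, otherwise emit each file as its own sample.

```python
MAX_BYTES = 32 * 1024  # the 32KB cap mentioned above

def to_training_samples(repo_sample, file_contents):
    """Return one repo-level sample when it fits the cap, otherwise
    fall back to one sample per file."""
    if len(repo_sample.encode("utf-8")) <= MAX_BYTES:
        return [repo_sample]
    return list(file_contents)

# A small repo stays as one concatenated sample.
print(to_training_samples("# a.py\nx = 1\n", ["# a.py\nx = 1\n"]))
# An oversized repo is split back into file-level samples.
print(to_training_samples("x" * (MAX_BYTES + 1), ["f1", "f2"]))  # -> ['f1', 'f2']
```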

@guoday
Collaborator

guoday commented Feb 29, 2024

Hi, I'm really impressed by your advanced work. I have an extra question: in the repo-level concatenation, if one file depends on some huge files or libraries (such as torch or transformers), the concatenated sample will inevitably exceed the window size / context length. How do you deal with this problem? @guoday

For unparsed language data or repository-level code that surpasses 32KB, we split it into file-level data for use in continued pre-training.

@kail8

kail8 commented Feb 29, 2024

Hi, I'm really impressed by your advanced work. I have an extra question: in the repo-level concatenation, if one file depends on some huge files or libraries (such as torch or transformers), the concatenated sample will inevitably exceed the window size / context length. How do you deal with this problem? @guoday

For unparsed language data or repository-level code that surpasses 32KB, we split it into file-level data for use in continued pre-training.

Got it. Thanks for your quick response!

@Calvinnncy97

May I know how the dependencies are parsed?

@virtualzx-nad

Hi @guoday! When I use the model, how do I structure my repo in my prompt to take advantage of DeepSeek's understanding of repo structures? How should I separate different files in the same repo, and how do I denote filenames? My repo also contains different languages, so just adding # filename.py doesn't seem to be good enough.
