
Repo level concatenation of data #43

Open
Bytes-Explorer opened this issue Nov 21, 2023 · 38 comments

Comments

@Bytes-Explorer

Can you share more details on the technique for repo level concatenation part?

@guoday
Collaborator

guoday commented Nov 24, 2023

We first parse the dependencies between files, e.g. A->B, B->C, B->D, and then rearrange the file positions based on their dependencies, e.g. A, B, C, D. For file paths, we add them to each code file as a comment. An example is shown in https://github.com/deepseek-ai/DeepSeek-Coder#4-repository-level-code-completion
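As a rough sketch of this step (file names and contents are invented for illustration, and the helper name is hypothetical): once a dependency order such as A, B, C, D is fixed, concatenation just prepends each file's path as a comment.

```python
def concat_repo(ordered_paths, files):
    """Concatenate files in dependency order, prefixing each file's
    content with its path as a comment."""
    return "\n".join(f"# {path}\n{files[path]}" for path in ordered_paths)

# Toy repo matching the A -> B, B -> C, B -> D example above.
files = {
    "A.py": "import B\n",
    "B.py": "import C\nimport D\n",
    "C.py": "def c():\n    return 1\n",
    "D.py": "def d():\n    return 2\n",
}
sample = concat_repo(["A.py", "B.py", "C.py", "D.py"], files)
print(sample.startswith("# A.py"))  # -> True
```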

@Bytes-Explorer
Author

Thank you for your response. Is this done for all languages in the data?

@guoday
Collaborator

guoday commented Nov 27, 2023

Only for Python, Java, C#, C, and C++.

@Bytes-Explorer
Author

Thank you @guoday

@Bytes-Explorer
Author

@guoday Do you then do repo level dedup for all programming languages or just the above languages?

@guoday
Collaborator

guoday commented Nov 28, 2023

Just the above languages. Other languages use file-level dedup.

@Bytes-Explorer
Author

@guoday Thank you for your prompt responses. I was curious if you did any ablation studies/evaluations to understand if repo level concatenation helped the model performance in a significant way.

@guoday
Collaborator

guoday commented Nov 28, 2023

Not yet. We will try to evaluate the model on repo-level benchmarks. For function-level benchmarks, repo-level concatenation neither helps nor hurts model performance.

@Bytes-Explorer
Author

Bytes-Explorer commented Nov 28, 2023

Do you have your own repo level benchmark or use a standard one?

@guoday
Collaborator

guoday commented Nov 28, 2023

We will use public datasets like RepoCoder and CrossCodeEval to evaluate.

@Bytes-Explorer
Author

Ok thanks, was aware of those. Once again, appreciate your prompt responses. I look forward to reading the technical report from your group. Thanks!

@Casi11as

Hello, I would like to know the details of the concatenation of data. Assume that the structure of the parsed dependencies is as in the picture; what are the concatenation results? Is it ACF, ADF, ADG, BCF, BDF, BDG, BE? 7 pieces?

@guoday
Collaborator

guoday commented Nov 29, 2023

First, we select the file with the smallest incoming degree, and if there are multiple files with the smallest incoming degree, we randomly choose one. This process is repeated until a dependency order is obtained. For your example, there are many possibilities, one of which could be BACDFGE.

@Casi11as

First, we select the file with the smallest incoming degree, and if there are multiple files with the smallest incoming degree, we randomly choose one. This process is repeated until a dependency order is obtained. For your example, there are many possibilities, one of which could be BACDFGE.

In other words, will all the files of the same language in a repo be concatenated into just one sample?

@guoday
Collaborator

guoday commented Nov 29, 2023

Theoretically, yes. However, to shorten the sample length, we parse a repository in advance and then divide it into multiple independent subgraphs based on dependencies, with each independent subgraph regarded as a sample.

@Casi11as

Thanks! So what are the rules for dividing into subgraphs? Taking the picture I posted above as an example, what subgraphs will it be divided into?

@slamandar

Regarding repo-level concatenation, I have a related question.

In a batch, one sample may contain multiple documents from different files, such as repo_a/file_a and repo_a/file_b. When concatenating these files into one sample for pre-training and calculating the loss, will there still be an attention mask to prevent file_b from attending to file_a?

If there is an attention mask, how can it serve the purpose of capturing the repository context? If there isn't, training by simply concatenating the beginning and end of different files seems somewhat peculiar.

@guoday
Collaborator

guoday commented Nov 29, 2023

The term "independent subgraph" refers to a weakly connected subgraph. First, convert the directed graph into an undirected graph, and then divide the graph into multiple connected subgraphs; that is, within each subgraph, any two vertices are connected by a path. In your example, the whole graph forms a single connected subgraph, so it is not divided further. The following code divides the graph into subgraphs.

from collections import defaultdict

# convert the directed graph into an undirected graph
def to_undirected(graph):
    undirected_graph = defaultdict(set)
    for node in graph:
        undirected_graph[node]  # touch the node so isolated files are kept
        for neighbor in graph[node]:
            undirected_graph[node].add(neighbor)
            undirected_graph[neighbor].add(node)
    return undirected_graph

# Use DFS to find all connected subgraphs.
def dfs(graph, node, visited, subgraph):
    visited[node] = True
    subgraph.add(node)
    for neighbor in graph[node]:
        if not visited[neighbor]:
            dfs(graph, neighbor, visited, subgraph)

# obtain all subgraphs
def get_subgraphs(graph):
    undirected_graph = to_undirected(graph)
    visited = {node: False for node in undirected_graph}
    subgraphs = []
    for node in undirected_graph:
        if not visited[node]:
            subgraph = set()
            dfs(undirected_graph, node, visited, subgraph)
            subgraphs.append(subgraph)
    return subgraphs
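The same decomposition can also be written iteratively (a variant sketch, not the authors' exact code), which sidesteps Python's recursion limit on very large repositories:

```python
from collections import defaultdict

def get_subgraphs(graph):
    """Treat edges as undirected and return the weakly connected components."""
    undirected = defaultdict(set)
    for node, neighbors in graph.items():
        undirected[node]  # ensure isolated nodes appear
        for nb in neighbors:
            undirected[node].add(nb)
            undirected[nb].add(node)
    visited, components = set(), []
    for start in undirected:
        if start in visited:
            continue
        # Iterative DFS with an explicit stack instead of recursion.
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in visited:
                continue
            visited.add(node)
            comp.add(node)
            stack.extend(undirected[node])
        components.append(comp)
    return components

# Two independent subgraphs -> two training samples.
deps = {"A": ["B"], "B": [], "C": ["D"], "D": []}
print(sorted(sorted(c) for c in get_subgraphs(deps)))
# -> [['A', 'B'], ['C', 'D']]
```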

@guoday
Collaborator

guoday commented Nov 29, 2023

Regarding repo-level concatenation, I have a related question.

In a batch, one sample may contain multi docs from different files, such as repo_a/file_a and repo_a/file_b. When concatenating these files into one sample for pre-training and calculating loss, will there still be an attention mask to prevent file_b from attending file_a?

If there is an attention mask, how can it serve the purpose of capturing the repository context? If there isn't, training by simply concatenating the beginning and end of different files seems somewhat peculiar.

If file_b depends on file_a, why is there a need for an attention mask to prevent file_b from attending file_a? Conversely, if file_b doesn't depend on file_a, we wouldn't concatenate these files into a single sample.

@guoday
Collaborator

guoday commented Nov 29, 2023

Regarding repo-level concatenation, I have a related question.
In a batch, one sample may contain multi docs from different files, such as repo_a/file_a and repo_a/file_b. When concatenating these files into one sample for pre-training and calculating loss, will there still be an attention mask to prevent file_b from attending file_a?
If there is an attention mask, how can it serve the purpose of capturing the repository context? If there isn't, training by simply concatenating the beginning and end of different files seems somewhat peculiar.

If file_b depends on file_a, why is there a need for an attention mask to prevent file_b from attending file_a? Conversely, if file_b doesn't depend on file_a, we wouldn't concatenate these files into a single sample.

In specific scenarios, such as the one described in #43 (comment), the input sequence is BACDFGE. In this case, even though file E does not directly depend on file A, file E is still allowed to attend to file A. This approach enables the model to effectively utilize contextual information from various files within the repository, thereby enhancing its overall comprehension and performance.

@slamandar

Thank you for your prompt and detailed response!

One last question.

In one sample, is there a need for a special token between the concatenated files, so that the model can distinguish that there are multiple files and avoid generating code like "import package" after the main content in some downstream scenarios?

@guoday
Collaborator

guoday commented Nov 29, 2023

In fact, a special token is required. However, we incorporate comments such as #utils.py and #model.py before each file to indicate to the model that the code completion is at the repository level.

@slamandar

Completely understand. Thanks again for your quick response!

@Bytes-Explorer
Author

@guoday I was also wondering what you do with the other files, like build files or metadata files? Thanks

@vaisaxena

@guoday Thanks for the details above. It was quite helpful. One follow up question.

Do you take care of cycles that may appear in the dependency graph of the files? How do you handle them? This is the case where A->B, B->C, C->A.

@dongs0104

Regarding repo-level concatenation, I have a related question.
In a batch, one sample may contain multi docs from different files, such as repo_a/file_a and repo_a/file_b. When concatenating these files into one sample for pre-training and calculating loss, will there still be an attention mask to prevent file_b from attending file_a?
If there is an attention mask, how can it serve the purpose of capturing the repository context? If there isn't, training by simply concatenating the beginning and end of different files seems somewhat peculiar.

If file_b depends on file_a, why is there a need for an attention mask to prevent file_b from attending file_a? Conversely, if file_b doesn't depend on file_a, we wouldn't concatenate these files into a single sample.

In specific scenarios, such as the one described in #43 (comment), the input sequence is BACDFGE. In this case, even though file E does not directly depend on file A, the file E is allowed to attend the file A. This approach enables the model to effectively utilize contextual information from various files within the repository, thereby enhancing its overall comprehension and performance.

@guoday
But nodes A and B are not connected in the directed graph, so I have a question: what if A and B contain similar contents? Did you run get_subgraphs and then re-order at the repo level again?

@guoday
Collaborator

guoday commented Dec 18, 2023

Regarding repo-level concatenation, I have a related question.
In a batch, one sample may contain multi docs from different files, such as repo_a/file_a and repo_a/file_b. When concatenating these files into one sample for pre-training and calculating loss, will there still be an attention mask to prevent file_b from attending file_a?
If there is an attention mask, how can it serve the purpose of capturing the repository context? If there isn't, training by simply concatenating the beginning and end of different files seems somewhat peculiar.

If file_b depends on file_a, why is there a need for an attention mask to prevent file_b from attending file_a? Conversely, if file_b doesn't depend on file_a, we wouldn't concatenate these files into a single sample.

In specific scenarios, such as the one described in #43 (comment), the input sequence is BACDFGE. In this case, even though file E does not directly depend on file A, the file E is allowed to attend the file A. This approach enables the model to effectively utilize contextual information from various files within the repository, thereby enhancing its overall comprehension and performance.

@guoday But nodes A and B are not connected in the directed graph, so I have a question: what if A and B contain similar contents? Did you run get_subgraphs and then re-order at the repo level again?

Nodes A and B are connected in an undirected graph, meaning they belong to the same input sequence. If A and B have similar contents, B can leverage the content of A as additional context to enhance the completion process (assuming B follows A in the sequence). We do not re-order these nodes.

@reignianor

reignianor commented Jan 9, 2024

Truly remarkable work! I am curious about the advantages of repo concatenation in your training process. Do you first pre-train using file-level code (at a 4K window) and then continue training with repo-level code (at a 16K window)? What about pre-training with repo-level code at a 4K window first?

@zte-tcb

zte-tcb commented Jan 15, 2024

@guoday Thanks for the details above. It was quite helpful. One follow up question.

Do you take care of cycles that may appear in the dependency graph of the files ? How do you handle that ? This is the case where A->B, B->C, C->A

I have the same doubts.

@juncaofish

@guoday Thanks for the details above. It was quite helpful. One follow up question.
Do you take care of cycles that may appear in the dependency graph of the files ? How do you handle that ? This is the case where A->B, B->C, C->A

I have the same doubts.

Actually, couldn't the dependencies of files within the same repository be represented as a DAG? The case you show, A->B, B->C, C->A, should be impossible, since it would cause a circular reference problem.

@guoday
Collaborator

guoday commented Jan 20, 2024

@guoday Thanks for the details above. It was quite helpful. One follow up question.
Do you take care of cycles that may appear in the dependency graph of the files ? How do you handle that ? This is the case where A->B, B->C, C->A

I have the same doubts.

Actually, couldn't the dependencies of files within the same repository be represented as a DAG? The case you show, A->B, B->C, C->A, should be impossible, since it would cause a circular reference problem.

The algorithm employs a modified topological sort. Unlike the standard approach that selects nodes with zero in-degrees, this algorithm selects nodes with minimal in-degrees, which allows it to handle cycles within the graph.
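A reconstruction of that modified topological sort from the description above (not the authors' code; ties are broken alphabetically here for determinism, whereas the thread says they are broken randomly):

```python
def dependency_order(graph):
    """Order files by repeatedly removing a node with minimal in-degree.

    `graph` maps a node to the nodes it points to. Unlike Kahn's
    algorithm, which only removes zero-in-degree nodes and stalls on
    cycles, picking the currently *minimal* in-degree always makes
    progress, so cyclic dependencies are handled gracefully."""
    indeg = {node: 0 for node in graph}
    for node in graph:
        for succ in graph[node]:
            indeg[succ] = indeg.get(succ, 0) + 1
    remaining = set(indeg)
    order = []
    while remaining:
        # The thread breaks ties randomly; we break them alphabetically
        # so this sketch is deterministic.
        node = min(remaining, key=lambda n: (indeg[n], n))
        order.append(node)
        remaining.remove(node)
        for succ in graph.get(node, ()):
            if succ in remaining:
                indeg[succ] -= 1
    return order

# Acyclic case from the thread: A->B, B->C, B->D orders as A, B, C, D.
print(dependency_order({"A": ["B"], "B": ["C", "D"], "C": [], "D": []}))
# -> ['A', 'B', 'C', 'D']
# A cycle A->B->C->A still yields a complete order instead of stalling.
print(dependency_order({"A": ["B"], "B": ["C"], "C": ["A"]}))
# -> ['A', 'B', 'C']
```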

@slamandar

In the process of reproducing repository-level data concatenation, I have a question.

Is the file-level data or the unparsed language data (excluding Python/Java/C/C++/C#) included in the long-context continued pre-training dataset?

@kail8

kail8 commented Feb 29, 2024

Hi, I'm really impressed by your advanced work. I have an extra question: in the repo-level concatenation, if one file depends on some huge files or libraries (such as torch or transformers), the concatenated sample will inevitably exceed the window size / context length. How do you deal with this problem? @guoday

@guoday
Collaborator

guoday commented Feb 29, 2024

In the process of reproducing repository-level data concatenation, I have a question.

Is the file-level data or the unparsed language data (excluding Python/Java/C/C++/C#) included in the long-context continued pre-training dataset?

For unparsed language data or repository-level code that surpasses 32KB, we split it into file-level data for use in continued pre-training.
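A minimal sketch of that fallback (the helper name is hypothetical, and only the 32KB cap comes from the thread): keep the concatenated repo sample if it fits the cap, otherwise emit each file as its own sample.

```python
MAX_BYTES = 32 * 1024  # the 32KB cap mentioned above

def to_training_samples(repo_sample, file_contents):
    """Return one repo-level sample when it fits the cap, otherwise
    fall back to one sample per file."""
    if len(repo_sample.encode("utf-8")) <= MAX_BYTES:
        return [repo_sample]
    return list(file_contents)

# A small repo stays as one concatenated sample.
print(to_training_samples("# a.py\nx = 1\n", ["# a.py\nx = 1\n"]))
# An oversized repo is split back into file-level samples.
print(to_training_samples("x" * (MAX_BYTES + 1), ["f1", "f2"]))  # -> ['f1', 'f2']
```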

@guoday
Collaborator

guoday commented Feb 29, 2024

Hi, I'm really impressed by your advanced work. I have an extra question: in the repo-level concatenation, if one file depends on some huge files or libraries (such as torch or transformers), the concatenated sample will inevitably exceed the window size / context length. How do you deal with this problem? @guoday

For unparsed language data or repository-level code that surpasses 32KB, we split it into file-level data for use in continued pre-training.

@kail8

kail8 commented Feb 29, 2024

Hi, I'm really impressed by your advanced work. I have an extra question: in the repo-level concatenation, if one file depends on some huge files or libraries (such as torch or transformers), the concatenated sample will inevitably exceed the window size / context length. How do you deal with this problem? @guoday

For unparsed language data or repository-level code that surpasses 32KB, we split it into file-level data for use in continued pre-training.

Got it. Thanks for your quick response!

@Calvinnncy97

May I know how the dependencies are parsed?

@virtualzx-nad

Hi @guoday! When I use the model, how do I structure my repo in my prompt to take advantage of DeepSeek's understanding of repo structures? How should I separate different files in the same repo, and how do I denote filenames? My repo also contains different languages, so just adding # filename.py doesn't seem to be good enough.
