Skip to content

Sync repo #1143

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Nov 4, 2017
Merged

Sync repo #1143

merged 1 commit into from
Nov 4, 2017

Conversation

joshsmith
Copy link
Contributor

@joshsmith joshsmith commented Oct 31, 2017

What's in this PR?

This is almost good to merge now.

Closes #1111
Closes #1133
Closes #1136
Closes #971

end
defp add_merged(%Ecto.Changeset{} = changeset) do
changeset |> put_change(:merged, false)
end
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently GitHub does not pass merged along when listing the records.

Can also remove:

additions changed_files comments commits deletions mergeable_state review_comments 

@joshsmith
Copy link
Contributor Author

I will mention that it is really, really hard to debug problems happening inside of these Multi fns.

@joshsmith
Copy link
Contributor Author

I fixed an issue where marshall_result was not providing us with the github_issue result for github_pull_requests, but there appears to be a result missing. For our test repo we should be seeing 31 issues created, but it looks like there are only 30, with perhaps a missed validation error.

I wanted to at least get this spike working before worrying too much about error handling. Hard to get it working without it, though.

@joshsmith
Copy link
Contributor Author

I managed to debug it. I'm getting a single task back that's got an {:error, :validating_tasks} that's failing due to the timestamp from GitHub being before the modified_at.

It turns out this is due to a bug: we're not setting the modified_from field to "code_corps" for anything that's being update from the Code Corps app.

Of course, this means my strategy here is probably flawed. We won't get back updated issues/tasks when a sync isn't happening for the first time. We could modify this by returning back the matched issues regardless of whether the sync happens. That would allow my strategy here to continue to work. But it's probably worth stepping back and re-evaluating this approach.

That said, the general outline of how to approach the problem is there as a start.

@joshsmith
Copy link
Contributor Author

I ran a mix ecto.reset and it works all the way through. So this is effectively spiked. But lots of issues obviously remain with how this works.

tasks = Enum.map(tasks, fn task -> {:ok, task} end)
sync_comments(repo, github_issues, tasks)
repo
end
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inside here we could sync each step, modify the GithubRepo sync state, and then return the repo when done.

@begedin
Copy link
Contributor

begedin commented Oct 31, 2017

@joshsmith I managed to spike out a less coupled, fully working GithubRepo sync approach. I've encountered several difficulties while doing this, one of which I'm still trying to find a solution for.

Problems

/repos/:owner/:name/comments retrieves repository comments, NOT issue comments

Really, in hindsight, it makes sense, but it completely dodged my attention. The comments retrieved by that endpoint are code comments, not comments from within issues.

I had to implement an API.Issue.comments endpoint and then also implemented a utility endpoint API.Repository.issue_comments which preloads all GithubIssue records for a provided GithubRepo, then calls API.Issue.comments/1 for each of them and aggregates the results.

This makes the request somewhat slow, since it fires off parallel :head requests equal to the issue count on the repo, but it's still faster than doing each of those individually.

To allow for this, I also had to add a function clause to the GithubComment.create_or_update_comment which only requires a payload and associates with an issue matching by Repo.get_by(&1, url: issue_url). This decouples the function and we can likely switch the webhook syncing code to using it as well. More importantly, it allows for a simple pipe during the repo sync process.

i believe we need a sync_state on the ProjectGithubRepo as well

Once a GithubRepo is synced, and it receiving webhooks, there is no need to sync it again. When a project is connected to it, the only thing that's left to do is to create necessary Task and Comment records on that Project.

I added a virtual sync_state field to the ProjectGithubRepo for now, just so I could spike things out, but we should a proper field. I'm thinking the states are

~w(unsynced syncing_tasks errored_syncinc_tasks syncinc_comments errored_syncinc_comments)

our current sync flow does not allow creating Task or Comment records separate from GithubIssue and GithubComment respectively.

This is the issue that I'm stuck with.

The major part of the reason we introduced GithubIssue and GithubComment as separate entities was to decouple this process and allow easier updating when a single issue needs to sync as a task across multiple projects, etc.

However, due to the way we match the user, there is no way for us to actually separate the two. The user is matched from the initial payload, but no user information is stored on the GithubIssue/GithubComment for later use.

I considered several potential solutions for this

Make fetching payloads a process separate from the main syncing flow

Basically, we would sync a project_github_repo, along the lines of

with {:ok, pr_payloads} <- repo |> fetch_pull_requests,
        {:ok, issue_payloads} <- repo |> fetch_issues,
        {:ok, comment_payloads} <- repo |> fetch_comments,
        {:ok, %GithubRepo{sync_state: "receiving_webhooks"} = repo} <- repo |> sync_repo(pr_payloads, issue_payloads, comment_payloads),
        {:ok, %ProjectGithubRepo{sync_state: "synced"} = project_github_repo} <- project_github_repo |> sync_project_github_repo(issue_payloads, comment_payloads) 
do
  {:ok, project_githubrepo}
end

Then, we would sync a project by syncing each of its project_github_repo children.

Really, though, this is a bandaid fix, since it does not resolve the actual problem of too tight coupling and just allows us to implement the sync process sooner.

Try to make it work like webhooks work, keep it coupled and perform the sync of Task/GithubIssue as well as Comment/GithubComment in unison

I don't often say this, but I'm strongly against this option.

The whole process is complicated and hard to understand as is, and this will only make it worse.

Add github_user_id field to GithubIssue and GithubComment

This would be a temporary fix, but it would possibly allow us to defer the full solution into a separate PR.

Basically, we store a github user id field with the records, allowing us to match later. However, this does not provide us with enough information for the case when the author of the comment/issue does not have a codecorps accoutn linked to github, so that case would be unhandled.

I don't favor this solution, but wanted to mention it for the sake of argument.

Introduce GithubUser

As part of this PR we

  • introduce a GithubUser record
  • as we sync comments and issues, a GithubUser record is found or created alongside each of them

The record basically holds the fields we've added to User

field :avatar_url, :string
field :email, :string
field :github_id, :integer
field :username, :string
field :type, :string, default: "User" # or "Bot"

Once we start syncing tasks and comments, we match a GithubUser with a User (find or create) and associate appropriately.

As part of subsequent PRs we can then

  • Switch the webhook syncing code to use GithubUser to. We could do it here as well, but we don't have to.
  • Potentially decouple our webhook syncing code, consolidate further, move out of multis, etc.
  • Potentially clean up our User model a bit
  • ???

@begedin
Copy link
Contributor

begedin commented Oct 31, 2017

We could speed up comment fetching with Repository.issue_comments by storing comment_count on the GithubIssue (there is a "comments": comment_count key in the github payload) and then only sync github issues with comment count greater than 0.

@begedin
Copy link
Contributor

begedin commented Oct 31, 2017

From our chat offline, what we need to do here

  • Add sync state field to project github repo as well, with changesets,
  • make Sync.mark/1 work correctly
  • Add GithubUser
  • Modify GithubIssue/GithubComment sync to match or create a GithubUser, to decouple from Comment/Task sync
  • Modify Comment/Task sync to match against GithubUser and find or create User
  • Finalize project github repo sync
  • Remove Repository.comments and Issue.comments if we end up not using them
  • Add fns for counting number of records being worked and add to syncing details
  • Write tests
  • Create issue to decouple webhook syncing using new sync process as guideline, making use of GithubUser
  • Create issue to extract User :github_ fields into GithubUser (maybe needs a migration)

@joshsmith
Copy link
Contributor Author

joshsmith commented Oct 31, 2017

For adding GithubUser we'll need to:

  • add references to User, GithubIssue, GithubPullRequest, GithubComment
  • add all these to models
  • add model tests
  • add all these to views and view tests

@joshsmith
Copy link
Contributor Author

This is going to need some significant cleanup of changesets and the way changes are propagated to the various GithubX models during sync.

@joshsmith
Copy link
Contributor Author

This is mostly working, but on a first pass through a project, it seems to not create tasks in the syncing_tasks step, moves onto syncing_comments, and then hits a nil Task that it can't match on:

** (WithClauseError) no with clause matching: nil
    (code_corps) lib/code_corps/github/sync/comment/comment/comment.ex:63: CodeCorps.GitHub.Sync.Comment.Comment.find_or_create_comment/2
    (elixir) lib/enum.ex:1255: Enum."-map/2-lists^map/1-0-"/2
    (code_corps) lib/code_corps/github/sync/comment/comment/comment.ex:58: CodeCorps.GitHub.Sync.Comment.Comment.sync_project_github_repo/1
    (code_corps) lib/code_corps/github/sync/sync.ex:163: CodeCorps.GitHub.Sync.sync_project_github_repo/1

@joshsmith
Copy link
Contributor Author

joshsmith commented Nov 1, 2017

I wonder if it is worth adding in a synced_at timestamp to the GithubRepo and ProjectGithubRepo.

@joshsmith
Copy link
Contributor Author

joshsmith commented Nov 1, 2017

My first goals tomorrow:

  • get the existing tests passing
  • write an acceptance test for the full sync
  • write unit tests for
    • views
    • models
  • refactor
  • add record counts to sync state

Copy link
Contributor

@begedin begedin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pretty much implemented the way I would continue off.

I got a suggested issue for adapter behavior, but that's about it.

github_issue
|> Map.from_struct
|> MapTransformer.transform_inverse(@issue_mapping)
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The naming makes sense, but considering we were switching already, I'm thinking maybe we want to write a behavior module and document what an adapter needs to look like and what the naming convention is.

I'm just imagining where, in a couple of weeks, I write a new one and end up using a new scheme because I wasn't 100% what the old one is.

@joshsmith
Copy link
Contributor Author

This is missing tests for the GitHub.Sync that hit the error states for repo and project_repo sync.

Copy link
Contributor Author

@joshsmith joshsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some notes on things we can still improve here.

|> unique_github_constraint()
end

def update_with_github_user_changeset(struct, %{} = params) do
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing spec and docs.

@@ -36,6 +37,12 @@ defmodule CodeCorps.GitHub.Adapters.Comment do
{:url, ["url"]}
]

def to_comment_attrs(%GithubComment{} = github_comment) do
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing spec and docs.

@@ -54,6 +54,12 @@ defmodule CodeCorps.GitHub.Adapters.Issue do
payload |> MapTransformer.transform(@task_mapping)
end

def to_issue_attrs(%GithubIssue{} = github_issue) do
github_issue
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing spec and docs.

@@ -41,6 +47,47 @@ defmodule CodeCorps.GitHub.Sync.Comment.Comment do
|> ResultAggregator.aggregate
end

def sync_project_github_repo(%ProjectGithubRepo{github_repo: %GithubRepo{} = _} = project_github_repo) do
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing spec and doc.

@@ -32,6 +34,36 @@ defmodule CodeCorps.GitHub.Sync.Issue.Task do
|> ResultAggregator.aggregate
end

def sync_project_github_repo(%ProjectGithubRepo{github_repo: %GithubRepo{} = _} = project_github_repo) do
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing spec and doc.

end
end

defp put_sync_opt(opts, key, map) do
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing spec and doc.

end

@spec sync_repo(GithubRepo.t) :: {:ok, GithubRepo.t}
def sync_repo(%GithubRepo{} = repo) do
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing doc.

end
end

def sync_project_github_repo(%ProjectGithubRepo{} = project_github_repo) do
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing spec and doc.

nil -> changeset
_ -> do_validate_time_after(changeset, field, previous_time, current_time)
end
end
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs updated unit tests.

@@ -28,6 +29,8 @@ defmodule CodeCorpsWeb.ProjectGithubRepoController do
{:ok, :authorized} <- current_user |> Policy.authorize(:create, %ProjectGithubRepo{}, params),
{:ok, %ProjectGithubRepo{} = project_github_repo} <- create_project_repo_changeset(params) |> Repo.insert do

TaskSupervisor.start_child(:background_processor, fn -> CodeCorps.GitHub.Sync.sync_project_github_repo(project_github_repo) end)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not tested anywhere right now.

Copy link
Contributor

@begedin begedin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of ideas I have, plus a potential change that needs to be reverted, but other than that, this looks great!


with {:ok, project_github_repo} <- project_github_repo |> mark_project_repo("syncing_github_repo"),
{:ok, %GithubRepo{sync_state: "receiving_webhooks"}} <- repo |> sync_repo(),
project_github_repo <- Repo.get(ProjectGithubRepo, project_github_repo.id) |> preload_project_github_repo(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd actually prefer something like

preloads <- [:project, github_repo: [:github_app_installation, [github_issues: [:github_comments, :github_user]]]],
project_github_repo <- project_github_repo |> Repo.preload(preloads)

I think it's actually easier to follow. Either that, or renaming preload_project_gihub_repo to load_associations.

Seems somewhat more explicit to me.

{:ok, %GithubRepo{sync_state: "receiving_webhooks"}} <- repo |> sync_repo(),
project_github_repo <- Repo.get(ProjectGithubRepo, project_github_repo.id) |> preload_project_github_repo(),
{:ok, project_github_repo} <- project_github_repo |> mark_project_repo("syncing_tasks"),
{:ok, _tasks} <- project_github_repo |> Sync.Issue.Task.sync_project_github_repo() |> sync_step(:sync_tasks),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying to read Sync.Issue.Task.sync_project_github_repo makes me think we may have nested too deep.

project_github_repo |> Sync.TaskSyncer.sync

Looks a bit clearer, but I'm still not happy with it. However, the we could alias Sync.TaskSyncer and just call

project_github_repo |> TaskSyncer.sync

What do you think about flattening out our Sync namespace that way? Nesting Task and GithubIssue under Issue together seems like we did just for the sake of them being semantically related, but logically, we probably want the two processes as decoupled as possible.

sync/sync.ex
sync/task_syncer.ex
sync/github_issue_syncer.ex

etc. looks nicer to me.

For now, as part of this PR, I'm thinking we

alias CodeCorps.GitHub.Sync.Issue.Task, as: TaskSyncer

and do the same for other modules. What do you think?

defp marshall_result({:ok, %{github_pull_request: pull_request}}), do: {:ok, pull_request}
defp marshall_result({:ok, %{github_issue: _, tasks: _}} = result), do: result
defp marshall_result({:ok, %{github_issue: github_issue}}), do: {:ok, github_issue}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these still needed? I think we didn't touch transactions in this PR

@joshsmith joshsmith merged commit 886434b into develop Nov 4, 2017
@joshsmith joshsmith deleted the 1111-sync-comments branch November 4, 2017 23:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants