Skip to content

Implement ML Features: Word2Vec #491

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 20 commits into from
Apr 29, 2020
Merged

Implement ML Features: Word2Vec #491

merged 20 commits into from
Apr 29, 2020

Conversation

GoEddie
Copy link
Contributor

@GoEddie GoEddie commented Apr 15, 2020

We are excited to review your PR.

So we can do the best job, please check:

  • There's a descriptive title that will make sense to other developers some time from now.
  • There's associated issues. All PR's should have issue(s) associated - unless a trivial self-evident change such as fixing a typo. You can use the format Fixes #nnnn in your description to cause GitHub to automatically close the issue(s) when your PR is merged.
  • Your change description explains what the change does, why you chose your approach, and anything else that reviewers should know.
  • You have included any necessary tests in the same PR.

Hi,

This implements Word2Vec and Word2VecModel so that the Word2Vec example can be completed (see https://spark.apache.org/docs/2.2.0/mllib-feature-extraction.html#word2vec), this is for issue #381

Thanks,

Ed Elliott

@imback82
Copy link
Contributor

@GoEddie Each E2E test is taking at least one more minute with this change. Could you check and see if you can reduce it.

@GoEddie GoEddie changed the title Implement ML Features: Word2Vec [WIP] Implement ML Features: Word2Vec Apr 16, 2020
@GoEddie
Copy link
Contributor Author

GoEddie commented Apr 16, 2020

@GoEddie Each E2E test is taking at least one more minute with this change. Could you check and see if you can reduce it.

@imback82 I removed the extra time from TestWord2VecTest but Word2VecModel still adds about 10 seconds because calling model.Fit causes spark to dump out loads of log messages, if I turn off logging (or set it to Error) then it is much faster - is it ok to disable logging or is it needed for other tests? I could set it to None, call model.Fit then set it back to Warn (or info?) again.

I tested with pyspark and that does the same thing, if logging is on then it adds the same time so it isn't a dotnet spark thing.

@imback82
Copy link
Contributor

if I turn off logging (or set it to Error) then it is much faster - is it ok to disable logging or is it needed for other tests?

Cool, can we set it to error for other tests? Also, please ping me when you remove WIP. Thanks!

@GoEddie GoEddie changed the title [WIP] Implement ML Features: Word2Vec Implement ML Features: Word2Vec Apr 17, 2020
@GoEddie
Copy link
Contributor Author

GoEddie commented Apr 17, 2020

Hi @imback82 - can you review it again now the build is faster again?

@imback82 imback82 requested review from suhsteve and imback82 April 18, 2020 05:35
@imback82 imback82 added the enhancement New feature or request label Apr 18, 2020

Word2VecModel model = word2vec.Fit(documentDataFrame);

const int expectedSynonyms = 2;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we expect 2? (sorry I am not familiar with this model).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

findSynonyms takes the word to check and the maximum amount of synonyms to return so 2 is checking that the result is limited to 2 rows.

{
_spark = fixture.Spark;
//Calling Word2Vec.Fit is really slow with logging on, makes the test really slow
_spark.SparkContext.SetLogLevel("OFF");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, even ERROR doesn't help?

Copy link
Contributor

@imback82 imback82 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except for one nit comment, thanks @GoEddie!

@GoEddie
Copy link
Contributor Author

GoEddie commented Apr 28, 2020

Thanks @imback82 I have fixed that indentation issue!

Copy link
Member

@suhsteve suhsteve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@imback82 imback82 merged commit 1e88f27 into dotnet:master Apr 29, 2020
@MikeRys MikeRys mentioned this pull request May 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants