Implement ML Features: Word2Vec #491
@GoEddie Each E2E test is taking at least one more minute with this change. Could you check whether you can reduce it?
@imback82 I removed the extra time from TestWord2VecTest, but Word2VecModel still adds about 10 seconds because calling model.Fit causes Spark to dump out loads of log messages. If I turn off logging (or set it to Error), it is much faster. Is it OK to disable logging, or is it needed for other tests? I could set it to None, call model.Fit, then set it back to Warn (or Info?) again. I tested with pyspark and it does the same thing: with logging on it adds the same time, so it isn't a dotnet spark thing.
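The "turn logging off, fit, then turn it back on" idea from the comment above could be sketched in PySpark roughly as follows. This is only an illustration under assumptions: the helper name `log_level` and its `during`/`after` parameters are hypothetical, and the level to restore is passed in explicitly (assumed to be "WARN") rather than read back from Spark. The only Spark call used is `SparkContext.setLogLevel`.

```python
from contextlib import contextmanager


@contextmanager
def log_level(spark_context, during="OFF", after="WARN"):
    """Temporarily change the Spark log level around a noisy call.

    `log_level`, `during` and `after` are hypothetical names for this sketch.
    The level to restore is supplied by the caller (assumed "WARN" here).
    """
    spark_context.setLogLevel(during)
    try:
        yield spark_context
    finally:
        # Restore the level even if the wrapped call raises.
        spark_context.setLogLevel(after)


# Usage with a real session (assumes `spark`, `word2vec` and `df` already exist):
# with log_level(spark.sparkContext):
#     model = word2vec.fit(df)
```

Scoping the change this way keeps the quieter level limited to the slow call, so other tests that rely on log output would not be affected.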
Cool, can we set it to error for other tests? Also, please ping me when you remove WIP. Thanks!
Hi @imback82 - can you review it again now that the build is faster again?
src/csharp/Microsoft.Spark.E2ETest/IpcTests/ML/Feature/Word2VecModelTests.cs
Word2VecModel model = word2vec.Fit(documentDataFrame);

const int expectedSynonyms = 2;
Why do we expect 2? (sorry I am not familiar with this model).
findSynonyms takes the word to check and the maximum number of synonyms to return, so the 2 checks that the result is limited to 2 rows.
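To make the "limited to 2 rows" point concrete, here is a minimal pure-Python sketch of a top-n nearest-word lookup by cosine similarity. It is not Spark's implementation: the function name `nearest_words` is hypothetical and the toy vectors are made up purely for illustration.

```python
import math


def nearest_words(vectors, word, num):
    """Return at most `num` (word, similarity) pairs closest to `word`.

    `vectors` maps each vocabulary word to a list of floats. The query
    word itself is excluded from the result.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    # Highest similarity first, then cut off at the requested maximum.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:num]


# Toy, made-up vectors (not from any trained model).
toy_vectors = {
    "spark": [1.0, 0.0, 0.0],
    "hadoop": [0.9, 0.1, 0.0],
    "flink": [0.8, 0.0, 0.2],
    "banana": [0.0, 1.0, 0.0],
}
top_two = nearest_words(toy_vectors, "spark", 2)
```

With three other words in the toy vocabulary, asking for 2 returns exactly two pairs, which is the property the test's `expectedSynonyms` constant asserts.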
src/csharp/Microsoft.Spark.E2ETest/IpcTests/ML/Feature/Word2VecTests.cs
{
    _spark = fixture.Spark;
    // Calling Word2Vec.Fit is really slow with logging on, which makes the test really slow
    _spark.SparkContext.SetLogLevel("OFF");
Oh, even ERROR doesn't help?
LGTM except for one nit comment, thanks @GoEddie!
Thanks @imback82, I have fixed that indentation issue!
Co-Authored-By: Steve Suh
LGTM
Hi,
This implements Word2Vec and Word2VecModel so that the Word2Vec example can be completed (see https://spark.apache.org/docs/2.2.0/mllib-feature-extraction.html#word2vec). This is for issue #381.
Thanks,
Ed Elliott