Implement ML Features: Word2Vec #491

GoEddie · 2020-04-15T23:02:34Z

We are excited to review your PR.

So we can do the best job, please check:

There's a descriptive title that will make sense to other developers some time from now.
There are associated issues. All PRs should have issue(s) associated, unless it is a trivial self-evident change such as fixing a typo. You can use the format Fixes #nnnn in your description to cause GitHub to automatically close the issue(s) when your PR is merged.
Your change description explains what the change does, why you chose your approach, and anything else that reviewers should know.
You have included any necessary tests in the same PR.

Hi,

This implements Word2Vec and Word2VecModel so that the Word2Vec example can be completed (see https://spark.apache.org/docs/2.2.0/mllib-feature-extraction.html#word2vec). This is for issue #381.

Thanks,

Ed Elliott

imback82 · 2020-04-16T04:29:38Z

@GoEddie Each E2E test is taking at least one more minute with this change. Could you check and see if you can reduce it?

GoEddie · 2020-04-16T22:12:26Z

> @GoEddie Each E2E test is taking at least one more minute with this change. Could you check and see if you can reduce it?

@imback82 I removed the extra time from TestWord2VecTest, but Word2VecModel still adds about 10 seconds because calling model.Fit causes Spark to dump out loads of log messages. If I turn off logging (or set it to Error) then it is much faster. Is it OK to disable logging, or is it needed for other tests? I could set it to None, call model.Fit, then set it back to Warn (or Info?) again.

I tested with PySpark and that does the same thing: if logging is on, it adds the same time, so it isn't a dotnet spark thing.
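
The "turn logging off, call Fit, turn it back on" idea described above could look roughly like the sketch below. It relies on the `SparkContext.SetLogLevel` call that this PR's tests use. The `FitQuietly` helper name and the choice of `"WARN"` as the level to restore are illustrative assumptions, since the thread does not say what the previous level was.

```csharp
using Microsoft.Spark.ML.Feature;
using Microsoft.Spark.Sql;

public static class QuietFitSketch
{
    // Hypothetical helper (not part of this PR): silence Spark logging
    // only while Fit runs, then switch to an assumed previous level.
    public static Word2VecModel FitQuietly(
        SparkSession spark,
        Word2Vec word2vec,
        DataFrame documents,
        string restoreLevel = "WARN") // assumption: the prior level is not queried here
    {
        spark.SparkContext.SetLogLevel("OFF");
        try
        {
            return word2vec.Fit(documents);
        }
        finally
        {
            // Runs even if Fit throws, so later tests keep their logging.
            spark.SparkContext.SetLogLevel(restoreLevel);
        }
    }
}
```

The `try`/`finally` is a design choice for shared test fixtures: a failing Fit would otherwise leave logging off for every test that reuses the same session.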

imback82 · 2020-04-17T02:40:34Z

> if I turn off logging (or set it to Error) then it is much faster - is it ok to disable logging or is it needed for other tests?

Cool, can we set it to error for other tests? Also, please ping me when you remove WIP. Thanks!

GoEddie · 2020-04-17T21:07:46Z

Hi @imback82 - can you review it again now that the build is faster?

src/csharp/Microsoft.Spark/ML/Feature/Word2VecModel.cs

imback82 · 2020-04-18T20:45:06Z

src/csharp/Microsoft.Spark.E2ETest/IpcTests/ML/Feature/Word2VecModelTests.cs

+
+            Word2VecModel model = word2vec.Fit(documentDataFrame);
+
+            const int expectedSynonyms = 2;


Why do we expect 2? (sorry I am not familiar with this model).

GoEddie · 2020-04-20

findSynonyms takes the word to check and the maximum number of synonyms to return, so 2 checks that the result is limited to 2 rows.
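
As a rough illustration of that limit, the sketch below fits a model on a tiny sample and asks for at most two synonyms. The sample sentence, the queried word, and the column names are assumptions made for the example, not values taken from the PR's test.

```csharp
using System;
using Microsoft.Spark.ML.Feature;
using Microsoft.Spark.Sql;

public static class FindSynonymsSketch
{
    public static void Main()
    {
        SparkSession spark = SparkSession.Builder().GetOrCreate();

        // Illustrative input: one row holding an array of words.
        DataFrame documents =
            spark.Sql("SELECT split('Hi I heard about Spark', ' ') AS text");

        Word2VecModel model = new Word2Vec()
            .SetInputCol("text")
            .SetOutputCol("result")
            .SetMinCount(1) // keep every word, since the sample is tiny
            .Fit(documents);

        const int maxSynonyms = 2;

        // The queried word must be in the fitted vocabulary.
        DataFrame synonyms = model.FindSynonyms("Spark", maxSynonyms);

        // The row count should not exceed the requested maximum.
        Console.WriteLine(synonyms.Count() <= maxSynonyms);

        spark.Stop();
    }
}
```

Like any .NET for Apache Spark application, this would need to be launched through spark-submit rather than run as a standalone executable.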

src/csharp/Microsoft.Spark.E2ETest/IpcTests/ML/Feature/Word2VecTests.cs

imback82 · 2020-04-18T20:48:31Z

src/csharp/Microsoft.Spark.E2ETest/IpcTests/ML/Feature/Word2VecModelTests.cs

+        {
+            _spark = fixture.Spark;
+            //Calling Word2Vec.Fit is really slow with logging on, makes the test really slow
+            _spark.SparkContext.SetLogLevel("OFF");


Oh, even ERROR doesn't help?

imback82

LGTM except for one nit comment, thanks @GoEddie!

src/csharp/Microsoft.Spark/ML/Feature/Word2Vec.cs

GoEddie · 2020-04-28T20:52:25Z

Thanks @imback82 I have fixed that indentation issue!

suhsteve

LGTM

GOEddieUK added 12 commits April 15, 2020 23:32

Word2Vec and Word2VecModel

c6e873a

Merge branch 'master' of github.com:dotnet/spark into ml/word2vec

0a8d821

whitespace

1c2d55a

tidying:

b2245dc

whitespace

daf6f2d

reverting csproj file change

17325df

reverting csproj file change

f28e914

reverting csproj file change

0a969e3

reverting csproj file change

fe6fb61

reverting csproj file change

14f5312

reverting csproj file change

3b135a9

reverting csproj file change

18c3789

GoEddie mentioned this pull request Apr 15, 2020

[FEATURE REQUEST]: Implement ML Features #381

Open

39 tasks

speeding up tests

29242f7

GoEddie changed the title ~~Implement ML Features: Word2Vec~~ [WIP] Implement ML Features: Word2Vec Apr 16, 2020

GOEddieUK added 2 commits April 16, 2020 22:22

disabling logging

f1bfb7b

removing logging off

5e36609

disabling logging for test

2b1ded3

GoEddie changed the title ~~[WIP] Implement ML Features: Word2Vec~~ Implement ML Features: Word2Vec Apr 17, 2020

imback82 requested review from suhsteve and imback82 April 18, 2020 05:35

imback82 assigned GoEddie Apr 18, 2020

imback82 added the enhancement New feature or request label Apr 18, 2020

imback82 reviewed Apr 18, 2020

tidying after review

c215d4b

imback82 reviewed Apr 26, 2020

src/csharp/Microsoft.Spark/ML/Feature/Word2Vec.cs (outdated, resolved)

incorrect indentation

6a2904b

suhsteve reviewed Apr 28, 2020

GoEddie and others added 2 commits April 28, 2020 22:45

Apply suggestions from code review

77ce67e

Co-Authored-By: Steve Suh

feedback after review

5b4a99a

suhsteve approved these changes Apr 29, 2020

imback82 approved these changes Apr 29, 2020

imback82 merged commit 1e88f27 into dotnet:master Apr 29, 2020

MikeRys mentioned this pull request May 7, 2020

Prep 0.11.0 release #507

Merged

