-
Notifications
You must be signed in to change notification settings - Fork 76
Documentation and tests for the first and firstOrNull functions
#1547
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
| df.drop(df.nrow).firstOrNull { isHappy } shouldBe null | ||
| } | ||
|
|
||
| @Test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now I haven't added a test on an empty dataframe since if we do emptyDf.groupBy { age }.first(), we do not get the group column. This is a known issue, Jolan fixed it, but it is not merged yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
related to #1531
| val firstHappy = reducedGrouped.values()[0] | ||
| val firstUnhappy = reducedGrouped.values()[1] | ||
|
|
||
| firstHappy shouldBe dataFrameOf( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
grouped.first().values()[0].name.lastName shouldBe "Cooper" also worked (and similar approach in some further tests), but I decided to refrain from using generated properties, as they were only generated for df without transformations, and attempts to access these properties in some further transformations cause issues. I think not using the compiler plugin further in the file is more correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
*gradle plugin.
The compiler plugin is not enabled here. If it were, you would indeed be able to use extension properties with transformations :)
| } | ||
|
|
||
| @Test | ||
| fun `first on GroupBy with predicate`() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be an issue here:
df.groupBy { isHappy }.first{ age > 10 }
works fine: ReducedGroupBy contains columns isHappy, name (firstName, lastName), age, city, weight, but
df.groupBy { isHappy }.first{ age > 100 }
returns ReducedGroupBy without the name column.
For example, this test passes, but it seems to me that it should not:
grouped.first { age > 100 }.values()[0] shouldBe dataFrameOf("isHappy", "age", "city", "weight")(true, null, null, null)[0].
That's why for now I haven't added a test for a predicate that doesn't match any row.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same reason as #1531 I assume?
| )[0] | ||
| } | ||
|
|
||
| @Test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I faced an issue here:
ReducedPivot (i.e. pivot.first()) and ReducedPivotGroupBy are not rendered correctly in notebooks when there is null in any row resulting from first().
If I replace null in such a row with some value, it is displayed correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how do they look when they are rendered incorrectly?
| )[0] | ||
| } | ||
|
|
||
| @Test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a simpler dataframe here for readability. Otherwise it was more laborious for a reader to validate the result.
| ).shouldAllBeEqual() | ||
| } | ||
|
|
||
| @Test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it fine to put these tests in the same class with FirstColumnsSelectionDsl? Or is it better to put them in a different class?
If it is better to put them in a different class, is it worth inheriting from TestBase to reuse df?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's ok to put it here. It's common to have functions with same name in the same file, even if it's CS DSL and DF API
| /** | ||
| * Returns the first value in this [DataColumn]. | ||
| * | ||
| * @param T The type of the values in the [DataColumn]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Line seems redundant to me because param T is always inferred from DataColumn
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I'd omit (everywhere) a @param with type parameter, and add @return!
| /** | ||
| * Selects the first row from each group of the given [GroupBy] | ||
| * and returns a [ReducedGroupBy] containing these rows | ||
| * (one row per group, each row is the first row in its group). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or null if group is empty
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have faced an issue in this case that might be unexpected behavior.
I started with df.take(0).groupBy { age }.first() - it doesn't return null (returns ReducedGroupBy), which is probably fine because there are no groups at all.
But now I have tried
val grouped = df.groupBy { age }
grouped.updateGroups {
if (it == grouped.groups[0]) {
it.take(0)
} else it
}.first()to make the first group empty. And applying first in this case causes an exception:
The problem is found in one of the loaded libraries: check library renderers java.lang.IllegalStateException: Can not insert column `age` because column with this path already exists in DataFrame.
This problem does not occur if every group has at least one row, or if I remove the column age from every group. For example, this works:
grouped.updateGroups {
val new = it.remove { age }
if (it == grouped.groups[0]) {
new.take(0)
} else new
}.first()We get null for an empty group and the first row for others.
But is it expected behavior that we get such an error about conflicting columns? Or maybe I am just obtaining an empty group incorrectly.
The df in this case is:
val df = dataFrameOf(
"name" to columnOf("Alice", "Bob", "Charlie"),
"age" to columnOf(15, 20, 25),
)Or, to use a bit more natural example, we can make a fullJoin of df with val ages = dataFrameOf("age" to columnOf(30)), then group by age, and filter out the row with null value in the group for the key 30. And applying first in this case causes the same exception.
I am reporting this just in case it is a not known issue :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your first issue can also be reproduced in notebook by
df.groupBy { age }.updateGroups { it }.first()I'm not entirely sure why..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah! df.groupBy { age }.updateGroups { it }.first().concat() works fine again. So the issue is in the renderer for ReducedGroupBy in notebooks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can be reproduced outside notebooks with df.groupBy { age }.updateGroups { it }.first().values()
| * employees.groupBy { jobTitle }.first() | ||
| * ``` | ||
| * | ||
| * @param T The type of the values in the [GroupBy]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these lines are redundant because they are always infered
| * ```kotlin | ||
| * // Select the first row for each city. | ||
| * // Returns a ReducedPivot with one column per city and the first row from the group in each column. | ||
| * df.pivot { city }.first() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see if you can come up with representative example. Like, in what situation you'd use this function? What df typically it will be and what ideas one can draw from the result? Will be good if example can convey this
| * the structure remains unchanged — only the contents of each group | ||
| * are replaced with the first row from that group. | ||
| * | ||
| * Equivalent to `reduce { firstOrNull() }`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reduce is internal function, people won't be able to use it like this
| * Reduces this [Pivot] by selecting the first row from each group. | ||
| * | ||
| * Returns a [ReducedPivot] where: | ||
| * - each column corresponds to a [pivot] group — if multiple pivot keys were used, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think text explanations of pivot make it more scary than it is. For first i suggest to not include common pivot logic and refer to pivot kdoc instead
Reference to website with HTML tables or ascii tables might do a better job conveying what's going on
AndreiKingsley
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great job!
Regarding overloads for GroupBy and Pivot - I'm working on a general KDoc system for these operations, so I'll be reworking them anyway in the future, so you can leave them as they are.
| /** | ||
| * Returns the first value in this [DataColumn]. | ||
| * | ||
| * @param T The type of the values in the [DataColumn]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I'd omit (everywhere) a @param with type parameter, and add @return!
| * Returns the first value in this [DataColumn]. | ||
| * | ||
| * @param T The type of the values in the [DataColumn]. | ||
| * |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please, add here and in all other places "See also" section with related operations. For example
See also [firstOrNull], [last], [take].
| * | ||
| * ### Example | ||
| * ```kotlin | ||
| * // Select from the column "age" the first value where the age is greater than 17 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"select" is confusing, as we also have the select operation. I'd say "get"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same applies below
| * ``` | ||
| * | ||
| * @param T The type of the values in the [DataColumn]. | ||
| * @param predicate A lambda expression used to select a value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
*the first value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that it might look a bit ambiguous (especially because of "used"), but when I wrote this, my idea was that the logic of determining the first value is outside the scope of the predicate, and the predicate as a function just returns true if the input satisfies the condition. Do you think we still need to change it to "the first value"?
Maybe it's not that important, but I am clarifying because this part occurs in several places in this file and in last.kt as well :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be important information to know that it will stop calling the predicate after the first true is is found.
That's not conveyed when you say it's "used to select a value" :)
So yes, you explained in the broad sense what a "predicate" is, but I think it will be more valuable if you describe more specifically what it's used for in this specific case.
(similarly when you have a function that takes age: Int and you start explaining what an Int is ;P)
| * This predicate takes a value from the [DataColumn] as an input | ||
| * and returns `true` if the value satisfies the condition or `false` otherwise. | ||
| * | ||
| * @throws [NoSuchElementException] if the [DataColumn] contains no element matching the [predicate] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Helpful! We don't add these enough. Though I would add "@see firstOrNull" somewhere around here so people will know how to avoid this exception
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
holds for the other functions as well
| // region DataFrame | ||
|
|
||
| /** | ||
| * Returns the first row in this [DataFrame]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you could link to DataRow from "row" :) may be helpful
| * ### Example | ||
| * ```kotlin | ||
| * // Select the first row for each city where the population is greater than 100 000. | ||
| * df.pivot { city }.first { population > 100000 } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
100_000 is better readable ;P (and compiles!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or 10e5 or something ;P
| * students.pivot { faculty }.groupBy { enrollmentYear }.first { age > 21 } | ||
| * ``` | ||
| * | ||
| * @param predicate A lambda expression used to select a value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh you can actually also link to RowFilter :) it has some useful kdoc too
| val firstHappy = reducedGrouped.values()[0] | ||
| val firstUnhappy = reducedGrouped.values()[1] | ||
|
|
||
| firstHappy shouldBe dataFrameOf( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
*gradle plugin.
The compiler plugin is not enabled here. If it were, you would indeed be able to use extension properties with transformations :)
| } | ||
|
|
||
| @Test | ||
| fun `first on GroupBy with predicate`() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same reason as #1531 I assume?
| @Test | ||
| fun `first on GroupBy with predicate`() { | ||
| val grouped = df.groupBy { isHappy } | ||
| val reducedGrouped = grouped.first{ it["age"] as Int > 17 && it["city"] != "Moscow" } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't forget to lint :) I recommend using the KtLint plugin. You can also call ktLintFormat from gradle
| )[0] | ||
| } | ||
|
|
||
| @Test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how do they look when they are rendered incorrectly?
Fixes #1279