Skip to content

Conversation

koperagen
Copy link
Collaborator

@koperagen koperagen commented Oct 9, 2025

Consider file structured as this:

header1
data1

header2
data2

I believe it's popular, one example is srt, but i personally had to deal with it a lot

It can be parsed into dataframe:

File("data").readText().lines().chunked(3).map { it.dropLast(1) }.toDataFrame()

Surprisingly here toDataFrame is not generic Iterable.toDataFrame, but completely another function.
Problem: In current shape it's not helpful.
I end up with something very close to what i want, but required change to code is somewhat non-trivial.
image

I'd either have to:

val data = File("data").readText().lines().chunked(3).map { it.dropLast(1) }
val df = buildList { 
  add(listOf("col1", "col2"))
  addAll(data)
}.toDataFrame()

Or switch to completely different route:

val df = File("data").readText().lines().chunked(3).map { it.dropLast(1) }.toDataFrame("data").split("data").cast<List<*>>().into("col1", "col2")

with compiler plugin

val df = File("data").readText().lines().chunked(3).map { it.dropLast(1) }.toDataFrame("data").split { data }.into("col1", "col2")

With this API change:

val df = File("data").readText().lines().chunked(3).map { it.dropLast(1) }.toDataFrame(header = listOf("col1", "col2"))

Plugin will understand resulting schema too

@koperagen koperagen added this to the 1.0.0-Beta4 milestone Oct 9, 2025
@koperagen koperagen self-assigned this Oct 9, 2025
@koperagen koperagen added enhancement New feature or request Compiler plugin Anything related to the DataFrame Compiler Plugin labels Oct 9, 2025
@zaleslaw
Copy link
Collaborator

I agree that this a popular use-case, I also faced with the same and handled with File API parsing

Great, if tiny example, saying, with subtitles will be also added

public fun <T> List<List<T>>.toDataFrame(containsColumns: Boolean = false): AnyFrame =
@Refine
@Interpretable("ValuesListsToDataFrame")
public fun <T> List<List<T>>.toDataFrame(header: List<String>? = null, containsColumns: Boolean = false): AnyFrame =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a breaking change, can we keep the old function and deprecate it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deprecated and moved from io to api package

Copy link
Collaborator

@Jolanrensen Jolanrensen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a useful addition :) it's common in excel/csv's too to take the first row as headers unless headers are supplied explicitly, so it makes sense here too I guess

@koperagen
Copy link
Collaborator Author

@zaleslaw

Great, if tiny example, saying, with subtitles will be also added

Updated documentation.
In the process, updated page with headers to be able to refer to individual operations
Before:
image

After:
image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Compiler plugin Anything related to the DataFrame Compiler Plugin enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants