Create DataFrames From Java Collections #88

Open
dgunning opened this issue Jan 27, 2018 · 1 comment
Comments

@dgunning
Contributor

dgunning commented Jan 27, 2018

Java developers need an easy way to create DataFrames from in-memory Java Collections. This would make Morpheus much more suitable for general Java development.

I am proposing a class called ListSource or CollectionSource that would be able to create a DataFrame from a List of Lists. For example, let's say you read a table from a Word document

XWPFDocument document = new XWPFDocument(stream);
XWPFTable table = document.getTables().get(0);

and you convert the table to a list of Iterables (or Lists)

List<Iterable<XWPFTableCell>> tableData =
        table.getRows().stream()
                .map(XWPFTableRow::getTableCells)
                .collect(Collectors.toList());

you could then create a DataFrame as follows:

  DataFrame<Integer,String> data = new ListSource<XWPFTableCell>()
           .read(options ->{
                options.setData( tableData );
                options.setConverter( XWPFTableCell::getText );
            });

Generally, a lot of data in Java can be converted to Lists of Lists, so this feature would make Morpheus much more widely applicable.
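To make the proposal a bit more concrete, here is a minimal, dependency-free sketch of the conversion step such a ListSource might perform. The class name `ListSourceSketch`, the `Table` holder, and the `read` signature are all hypothetical and not part of Morpheus. It also assumes the first row holds the column names, as in the Word table example; a real implementation would return a DataFrame rather than plain lists.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical sketch only; not part of the Morpheus API. */
public class ListSourceSketch {

    /** Simple holder for the converted header row and body rows. */
    public static final class Table {
        public final List<String> columns;
        public final List<List<String>> rows;

        Table(List<String> columns, List<List<String>> rows) {
            this.columns = columns;
            this.rows = rows;
        }
    }

    /**
     * Converts a list of rows into text, treating the first row as the header
     * (an assumption of this sketch).
     */
    public static <T> Table read(List<? extends Iterable<T>> data,
                                 Function<? super T, String> converter) {
        if (data.isEmpty()) {
            throw new IllegalArgumentException("data must contain at least a header row");
        }
        List<String> columns = convertRow(data.get(0), converter);
        List<List<String>> rows = new ArrayList<>();
        // Skip the header once, here, so callers never need a "+1" offset.
        for (Iterable<T> raw : data.subList(1, data.size())) {
            rows.add(convertRow(raw, converter));
        }
        return new Table(columns, rows);
    }

    /** Applies the converter to every cell of a single row. */
    private static <T> List<String> convertRow(Iterable<T> row,
                                               Function<? super T, String> converter) {
        List<String> out = new ArrayList<>();
        for (T cell : row) {
            out.add(converter.apply(cell));
        }
        return out;
    }
}
```

With something like this in place, the header/body split lives in one method instead of being repeated in every caller's lambda.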

Note that the current Morpheus API already allows the following:

        // Column keys come from the text of the first (header) row
        final Array<String> columns = Array.ofIterable( rows.get(0).getTableCells().stream()
          .map( XWPFTableCell::getText ).collect(toList()));

        // The +1 skips the header row when mapping a DataFrame row ordinal back to the table
        return DataFrame.ofObjects(
                Range.of(1, rows.size()).toArray(),
                columns,
                value -> rows.get( value.rowOrdinal()+1).getTableCells().get(value.colOrdinal()).getText());

but that was trickier to get right because of the long method chains and the +1 offset needed to skip the header row.

@dgunning
Contributor Author

This approach is generally applicable to many unconventional data sources, especially if we add a new TableAdapter utility class that takes a raw table and returns a List of Iterables, one per row.

// Get Canada's Investor Alerts page and find the table with the alerts
        Document doc = Jsoup.connect("https://www.securities-administrators.ca/InvestorAlerts.aspx/").get();
        Element table = doc.getElementById("ctl00_bodyContent_InvestorAlertSearchControl1_InvestorAlertListControl1_GridView_List");

// Convert to a List of Iterables of Elements (one Iterable per row)
        List<Iterable<Element>> tableData =
                new TableAdapter<Element, Element, Element>()
                        .adapt(table,
                                tableElement -> tableElement.getElementsByTag("tr"),
                                rowElement -> {
                                    Elements tdOrTh = rowElement.getElementsByTag("td");
                                    return !tdOrTh.isEmpty() ? tdOrTh : rowElement.getElementsByTag("th");
                                });

// Create the DataFrame

        DataFrame<Integer,String> data = new ListSource<Element>().read(options ->{
            options.setData( tableData );
            options.setConverter( e ->  e.wholeText().trim() );
        });
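For illustration, the TableAdapter mentioned above could be as small as the following sketch. The class name `TableAdapterSketch`, its three type parameters (table, row, cell), and the `adapt` signature are assumptions inferred from the usage in this comment; no such class exists in Morpheus today.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/**
 * Hypothetical sketch of the proposed TableAdapter; not an existing Morpheus class.
 *
 * @param <T> the raw table type
 * @param <R> the row type
 * @param <C> the cell type
 */
public class TableAdapterSketch<T, R, C> {

    /**
     * Extracts the rows from the raw table, then the cells from each row,
     * and returns one Iterable of cells per row.
     */
    public List<Iterable<C>> adapt(T table,
                                   Function<? super T, ? extends Iterable<R>> rowsOf,
                                   Function<? super R, ? extends Iterable<C>> cellsOf) {
        List<Iterable<C>> result = new ArrayList<>();
        for (R row : rowsOf.apply(table)) {
            result.add(cellsOf.apply(row));
        }
        return result;
    }
}
```

Because the adapter only depends on two extractor functions, the same class would work for Jsoup elements, Word tables, or any other structure that can enumerate its rows and cells.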
