
Check and improve CSV Export after Library Pagination #2924

Open
RafaPolit opened this issue May 19, 2020 · 12 comments · May be fixed by #6595

Comments

@RafaPolit
Member

No description provided.

@simonfossom

simonfossom commented May 19, 2020

@fnocetti here's a new icon for CSV export and import.
Could you please replace the current one? Thanks!

icons.zip

@fnocetti
Contributor

fnocetti commented May 21, 2020

After checking, no changes are needed after the Library Pagination implementation.
Changes will be needed when pagination allows fetching documents above the ES limit.
For now, the scope of this issue can be defined by:

  • replace the icon with the one @simonfossom provided
  • implement proper integration testing so we are prepared for when fetching above the limit becomes possible

@RafaPolit Please let me know if you disagree and/or if this affects the priority of this issue.

@fnocetti
Contributor

I agreed con Rafa to test a little bit further and re-check the scope of this issue.

@RafaPolit
Member Author

I love the Spanglish version of "I agreed con Rafa"! :) Yes, please report on what would be required to do batches of 10,000 for exporting large collections.

@fnocetti
Contributor

After checking: if you try to access a set of documents with a query like <offset:9990, limit: 30>, the API will tell you that the result window is too large (>10,000) and suggest using one of the ES APIs meant for that purpose.
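A minimal sketch of the kind of request that triggers this, in TypeScript and assuming a v7-style @elastic/elasticsearch client (the index name is a hypothetical placeholder):

```typescript
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Plain from/size pagination past the 10,000th document.
client
  .search({
    index: 'entities', // hypothetical index name
    body: { query: { match_all: {} }, from: 9990, size: 30 },
  })
  .catch(error => {
    // ES rejects this with an error along the lines of:
    // "Result window is too large, from + size must be less than or equal to: [10000]
    //  ... See the scroll api for a more efficient way to request large data sets."
    console.error(error.meta?.body?.error?.reason ?? error.message);
  });
```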
I'll create a pull request with the new icon and the new integration test when they're ready.
Maybe we can discuss how to proceed with improving the search window in the tech call?
In the meantime, I will continue to investigate how the search_after and scroll APIs could be used within Uwazi to expand that window.

@fnocetti
Contributor

After some research I'm listing my findings here:

  • There are two alternatives to fetch over 10k records:
    • ES Scroll API: a stateful API that lets you fetch contiguous pages of search results by keeping the search context in memory (inside ES's own implementation). The fact that it needs to keep the context alive makes it not ideal for real-time, user-facing navigation, but a very efficient strategy for batch, non-real-time operations (like CSV export). This was the strategy originally proposed by @txau, and I agree it is the best option for the CSV export; see the first sketch after this list. We should be in a good position to add support for it in the search API.
    • ES Search After API: a stateless alternative to Scroll. It achieves the same results without needing to keep the context in memory, but can be more complex to implement because it uses a tuple of order-defining values (based on the indexable properties) as a "from" parameter instead of a simple ordinal number. On the other hand, that makes it a better fit for real-time user navigation features; see the second sketch after this list.
  • After reading about others' experiences, we shouldn't need to implement deep pagination for the UI, because users do not usually scroll more than a few pages of search results. They'd rather look at a few pages and continue refining their search criteria. I'd love to hear @simonfossom's expert opinion on this topic. Maybe we could validate this with some UX research?
  • We need to do some refactoring of the CSV Export class and routes to be able to implement this strategy. Assuming the Scroll API (which seems to be the way to go and is more straightforward to implement):
    • The CSV Exporter class assumes a single page containing the whole set of results to be exported. We need to refactor the algorithm to use pagination.
    • While implementing pagination, the headers pre-computation strategy may stop working correctly, so we will need to refactor it. We might need a greedier strategy, which could even be more efficient and lead to better results.
    • This could require a bit more manipulation of the temporary CSV file, but should not be a problem.
  • Also, after discussing with Mila, this might not be urgent.
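A minimal sketch of how the Scroll API could drive a paginated export loop, in TypeScript and assuming a v7-style @elastic/elasticsearch client; the page size and the `writeRows` callback are hypothetical placeholders, not existing Uwazi code:

```typescript
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Hypothetical row writer: appends a batch of documents to the CSV being built.
type RowWriter = (rows: unknown[]) => Promise<void>;

async function exportAllWithScroll(index: string, query: object, writeRows: RowWriter) {
  // Open a scroll context; ES keeps a snapshot of the matching set alive
  // for the requested window (renewed on every scroll call).
  let response = await client.search({
    index,
    scroll: '1m',
    size: 1000, // page size, well below the 10,000 result-window limit
    body: { query },
  });

  let hits = response.body.hits.hits;
  while (hits.length > 0) {
    await writeRows(hits.map((hit: any) => hit._source));
    // Fetch the next contiguous page from the same scroll context.
    response = await client.scroll({
      scroll_id: response.body._scroll_id,
      scroll: '1m',
    });
    hits = response.body.hits.hits;
  }

  // Release the context so it stops counting against the open-context limit.
  await client.clearScroll({ scroll_id: response.body._scroll_id });
}
```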
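And, for comparison, a sketch of the stateless search_after approach under the same assumptions; the sort fields (`creationDate` plus `_id` as a tiebreaker) are illustrative only, and in practice the tiebreaker should be a unique, indexed field:

```typescript
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Yields one page of _source documents at a time.
async function* pagesWithSearchAfter(index: string, query: object, pageSize = 1000) {
  let cursor: unknown[] | undefined;

  while (true) {
    const response = await client.search({
      index,
      body: {
        query,
        size: pageSize,
        sort: [{ creationDate: 'asc' }, { _id: 'asc' }], // order-defining tuple
        ...(cursor ? { search_after: cursor } : {}),
      },
    });

    const hits = response.body.hits.hits;
    if (hits.length === 0) return;
    yield hits.map((hit: any) => hit._source);

    // The sort values of the last hit become the cursor for the next page.
    cursor = hits[hits.length - 1].sort;
  }
}
```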

@RafaPolit
Member Author

To be revisited when there is more urgency.

For the time being, the workaround is to create filters that produce smaller batches (probably already implemented in every collection) and do partial exports.

@nickdickinson

I think this could be a higher-priority issue. Case: as a user, I've run into this trying to export from the UPR info database to search for keywords related to water and sanitation in records that were not classified by the UPR database as water and sanitation. If I could understand how to paginate the API, I might try that route, but as a user it is a bit difficult to figure out how to use the API, get a list of the entities for filters, etc., and how to paginate. For now, there is no clear way to download more than 10,000 records without many, many manual steps.

@fnocetti
Contributor

Hello @nickdickinson, thank you for your input.
In the current version of Uwazi there's no way of downloading more than 10,000 records. That's a hard limit on the result set size, so even paginating through the API won't help.

cc @RafaPolit @txau:
@nickdickinson has a real user need here. Implementing support for exporting more than 10,000 records will require Uwazi to make use of ElasticSearch's Scroll API, probably asynchronously. Should we discuss this further?

@RafaPolit
Member Author

RafaPolit commented Nov 24, 2021

I know this is not a "true" solution, but you could try to stay under the 10,000 mark by adding extra filters in the right-hand side panel. For example, once you have the filter you need that returns more than 10,000 results, try to find another parameter that would split it into mutually exclusive searches.

Options:

  • select a single state under review. That would probably limit the results considerably. Still, if there are scenarios for all countries, this would require several exports
  • select a single state under review (regional group); that may be enough, and you would get only 7 exports
  • limit them by session: each session has, at most, 4,000 files, so you will probably never surpass the 10,000 mark using sessions. Still, this could mean a lot of exports

This is, as mentioned above, a temporary workaround until this is solved properly. Hope this helps.

@nickdickinson

Thanks @fnocetti and @RafaPolit. I appreciate you considering the issue, as I can imagine it coming up more often for researchers. The workaround is OK if it is not a repeated task. It would also be great for users to be able to download a CSV even if it has more than 10,000 records.

For now it is not preventing my main goal, which is to build a reporting database for Sanitation and Water for All, a coalition of countries and other stakeholders. We want to be able to report per country, what percent of water and sanitation recommendations have been "supported". This is easy to do manually with the filter for the Human Right to Water and Sanitation as it is only a few hundred recommendations. Presumably I could also use the API to refresh the database a few times per year.

I noticed that there are just as many recommendations mentioning water and/or sanitation that are not classified as the HR to water and sanitation but do seem associated, so basically I wanted to research this. Perhaps I will try to use the search queries. Thanks again.

@fnocetti fnocetti added this to the Data layer revamp milestone Jun 27, 2023
@fnocetti fnocetti linked a pull request Mar 15, 2024 that will close this issue
@fnocetti
Contributor

I have written a PoC for this and it works nicely.
I'm using the Scroll API, as we agreed, which seems to be the best option for this scenario. Regardless of the API we use, it will create a point-in-time snapshot to preserve the consistency of the results during the export process. There's a limit to the number of such open contexts Elastic can maintain at the same time (see the sketch after the list below), so there are some questions we need to answer:

  • How are we planning to prevent reaching that limit? It is relatively complicated right now, but it would be pretty simple to achieve with the Queue Workers from V2.
  • Should we limit the "full" export functionality to logged-in users and keep the current limit for visitors? Or should this be available to everyone? This question is related to the previous one in the sense that it could potentially open the door to denial-of-service attacks on the export functionality.
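For reference, and assuming the scroll-based approach, the per-node limit on open scroll contexts is a dynamic cluster setting, so it can at least be inspected or adjusted while we design proper throttling; a hedged sketch with the same hypothetical v7-style client setup:

```typescript
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// search.max_open_scroll_context (default 500 in recent ES versions) caps how many
// scroll contexts each node keeps alive. Raising it only postpones the problem, so
// some queueing/throttling on the Uwazi side would still be needed.
async function raiseScrollContextLimit(limit: number) {
  await client.cluster.putSettings({
    body: { persistent: { 'search.max_open_scroll_context': limit } },
  });
}
```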
