
Check and improve CSV Export after Library Pagination #2924

Open
RafaPolit opened this issue May 19, 2020 · 12 comments · May be fixed by #6595

Comments

@RafaPolit
Member

No description provided.

@simonfossom

simonfossom commented May 19, 2020

@fnocetti here's a new icon for CSV export and import.
Could you please replace the current one? Thanks!

icons.zip

@fnocetti
Contributor

fnocetti commented May 21, 2020

After checking, no changes are needed after the Library Pagination implementation.
Changes will be needed when pagination allows fetching documents above the ES limit.
For now, the scope of this issue can be defined by:

  • replace the icon with the one @simonfossom provided
  • implement proper integration testing so we are prepared for when fetching above the limit becomes possible

@RafaPolit Please let me know if you disagree and/or if this affects the priority of this issue.

@fnocetti
Contributor

I agreed con Rafa to test a little bit further and re-check the scope of this issue.

@RafaPolit
Member Author

I love the Spanglish version of "I agreed con Rafa"! :) Yes, please report on what would be required to do batches of 10,000 for exporting large collections.

@fnocetti
Contributor

After checking: if you try to access a set of documents with a query like <offset:9990, limit: 30>, the API will tell you that the result window is too large (>10,000) and suggest using one of the ES APIs meant for that purpose.
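A minimal sketch of the kind of request that triggers this, in TypeScript and assuming a v7-style @elastic/elasticsearch client (the index name is a hypothetical placeholder):

```typescript
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Plain from/size pagination past the 10,000th document.
client
  .search({
    index: 'entities', // hypothetical index name
    body: { query: { match_all: {} }, from: 9990, size: 30 },
  })
  .catch(error => {
    // ES rejects this with an error along the lines of:
    // "Result window is too large, from + size must be less than or equal to: [10000]
    //  ... See the scroll api for a more efficient way to request large data sets."
    console.error(error.meta?.body?.error?.reason ?? error.message);
  });
```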
I'll create a pull request with the new icon and the new integration test when they're ready.
Maybe we can discuss how to proceed with improving the search window in the tech call?
In the meantime, I will continue to investigate how the search_after and scroll APIs could be used within Uwazi to expand that window.

@fnocetti
Contributor

After some research I'm listing my findings here:

  • There are two alternatives to fetch over 10k records:
    • ES Scroll API: a stateful API that lets you fetch contiguous pages of search results by keeping the search context in memory (inside ES's own implementation). The fact that it needs to keep the context alive makes it not ideal for real-time, user-facing navigation, but a very efficient strategy for batch, non-real-time operations (like CSV export). This was the strategy originally proposed by @txau, and I agree it is the best option for the CSV export; see the first sketch after this list. We should be in a good position to add support for it in the search API.
    • ES Search After API: a stateless alternative to Scroll. It achieves the same results without needing to keep the context in memory, but can be more complex to implement because it uses a tuple of order-defining values (based on the indexable properties) as a "from" parameter instead of a simple ordinal number. On the other hand, that makes it a better fit for real-time user navigation features; see the second sketch after this list.
  • After reading about others' experiences, we shouldn't need to implement deep pagination for the UI, because users do not usually scroll more than a few pages of search results. They'd rather look at a few pages and continue refining their search criteria. I'd love to hear @simonfossom's expert opinion on this topic. Maybe we could validate this with some UX research?
  • We need to do some refactoring of the CSV Export class and routes to be able to implement this strategy. Assuming the Scroll API (which seems to be the way to go and is more straightforward to implement):
    • The CSV Exporter class assumes a single page containing the whole set of results to be exported. We need to refactor the algorithm to use pagination.
    • While implementing pagination, the headers pre-computation strategy may stop working correctly, so we will need to refactor it. We might need a greedier strategy, which could even be more efficient and lead to better results.
    • This could require a bit more manipulation of the temporary CSV file, but should not be a problem.
  • Also, after discussing with Mila, this might not be urgent.
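A minimal sketch of how the Scroll API could drive a paginated export loop, in TypeScript and assuming a v7-style @elastic/elasticsearch client; the page size and the `writeRows` callback are hypothetical placeholders, not existing Uwazi code:

```typescript
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Hypothetical row writer: appends a batch of documents to the CSV being built.
type RowWriter = (rows: unknown[]) => Promise<void>;

async function exportAllWithScroll(index: string, query: object, writeRows: RowWriter) {
  // Open a scroll context; ES keeps a snapshot of the matching set alive
  // for the requested window (renewed on every scroll call).
  let response = await client.search({
    index,
    scroll: '1m',
    size: 1000, // page size, well below the 10,000 result-window limit
    body: { query },
  });

  let hits = response.body.hits.hits;
  while (hits.length > 0) {
    await writeRows(hits.map((hit: any) => hit._source));
    // Fetch the next contiguous page from the same scroll context.
    response = await client.scroll({
      scroll_id: response.body._scroll_id,
      scroll: '1m',
    });
    hits = response.body.hits.hits;
  }

  // Release the context so it stops counting against the open-context limit.
  await client.clearScroll({ scroll_id: response.body._scroll_id });
}
```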
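And, for comparison, a sketch of the stateless search_after approach under the same assumptions; the sort fields (`creationDate` plus `_id` as a tiebreaker) are illustrative only, and in practice the tiebreaker should be a unique, indexed field:

```typescript
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Yields one page of _source documents at a time.
async function* pagesWithSearchAfter(index: string, query: object, pageSize = 1000) {
  let cursor: unknown[] | undefined;

  while (true) {
    const response = await client.search({
      index,
      body: {
        query,
        size: pageSize,
        sort: [{ creationDate: 'asc' }, { _id: 'asc' }], // order-defining tuple
        ...(cursor ? { search_after: cursor } : {}),
      },
    });

    const hits = response.body.hits.hits;
    if (hits.length === 0) return;
    yield hits.map((hit: any) => hit._source);

    // The sort values of the last hit become the cursor for the next page.
    cursor = hits[hits.length - 1].sort;
  }
}
```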

@RafaPolit
Member Author

To be revisited when there is more urgency.

For the time being, the workaround is to create filters that produce smaller batches (probably already implemented in every collection) and do partial exports.

@nickdickinson

I think this could be a higher-priority issue. Case: as a user, I've run into this trying to export from the UPR info database to search for keywords related to water and sanitation in records that were not classified by the UPR database as water and sanitation. If I could understand how to paginate the API, I might try that route, but as a user it is a bit difficult to figure out how to use the API, get a list of the entities for filters, etc., and how to paginate. For now, there is no clear way to download more than 10,000 records without many, many manual steps.

@fnocetti
Contributor

Hello @nickdickinson, thank you for your input.
In the current version of Uwazi there's no way of downloading more than 10,000 records. That's a hard limit on the result set size, so even paginating through the API won't help.

cc @RafaPolit @txau:
@nickdickinson has a real user need here. Implementing support for exporting more than 10,000 records will require Uwazi to make use of ElasticSearch's Scroll API, probably asynchronously. Should we discuss this further?

@RafaPolit
Member Author

RafaPolit commented Nov 24, 2021

I know this is not a "true" solution, but you could try to stay under the 10,000 mark by adding extra filters in the right-hand side panel. For example, once you have the filter you need that returns more than 10,000 results, try to find another parameter that would split it into mutually exclusive searches.

Options:

  • select a single state under review. That would probably limit the results considerably. Still, if there are scenarios for all countries, this would require several exports
  • select a single state under review (regional group); that may be enough, and you would get only 7 exports
  • limit them by session: each session has, at most, 4,000 files, so you will probably never surpass the 10,000 mark using sessions. Still, this could mean a lot of exports

This is, as mentioned above, a temporary workaround until this is solved properly. Hope this helps.

@nickdickinson

Thanks @fnocetti and @RafaPolit. I appreciate you considering the issue, as I can imagine it coming up more often for researchers. The workaround is OK if it is not a repeated task. It would also be great for users to be able to download a CSV even if it has more than 10,000 records.

For now it is not preventing my main goal, which is to build a reporting database for Sanitation and Water for All, a coalition of countries and other stakeholders. We want to be able to report per country, what percent of water and sanitation recommendations have been "supported". This is easy to do manually with the filter for the Human Right to Water and Sanitation as it is only a few hundred recommendations. Presumably I could also use the API to refresh the database a few times per year.

I noticed that there are just as many recommendations mentioning water and/or sanitation that are not classified as the HR to water and sanitation but do seem associated, so basically I wanted to research this. Perhaps I will try to use the search queries. Thanks again.

@fnocetti fnocetti added this to the Data layer revamp milestone Jun 27, 2023
@fnocetti fnocetti linked a pull request Mar 15, 2024 that will close this issue
@fnocetti
Contributor

I have written a PoC for this and it works nicely.
I'm using the Scroll API, as we agreed, which seems to be the best option for this scenario. Regardless of the API we use, it will create a point-in-time snapshot to preserve the consistency of the results during the export process. There's a limit to the number of such open contexts Elastic can maintain at the same time (see the sketch after the list below), so there are some questions we need to answer:

  • How are we planning to prevent reaching that limit? It is relatively complicated right now, but it would be pretty simple to achieve with the Queue Workers from V2.
  • Should we limit the "full" export functionality to logged-in users and keep the current limit for visitors? Or should this be available to everyone? This question is related to the previous one in the sense that it could potentially open the door to denial-of-service attacks on the export functionality.
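For reference, and assuming the scroll-based approach, the per-node limit on open scroll contexts is a dynamic cluster setting, so it can at least be inspected or adjusted while we design proper throttling; a hedged sketch with the same hypothetical v7-style client setup:

```typescript
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// search.max_open_scroll_context (default 500 in recent ES versions) caps how many
// scroll contexts each node keeps alive. Raising it only postpones the problem, so
// some queueing/throttling on the Uwazi side would still be needed.
async function raiseScrollContextLimit(limit: number) {
  await client.cluster.putSettings({
    body: { persistent: { 'search.max_open_scroll_context': limit } },
  });
}
```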
