Description
While #74 focuses on automating Google searches in the narrow domain of finding homepage URLs, I think we can generalize this into something more broadly useful for our purposes, allowing us to make the most of automated Google searches.
How This Works
Instead of having to create a specially designed searcher for each subject we want to search, we maintain a queue of Google searches which we can populate with all required queries and which are run automatically, 100 per day. As we identify more unique data needs, we can create all relevant variations of these queries and add them to the queue, whereupon they will be deployed and their results stored for later retrieval and organization.
This way, we can perform a wider array of data searches without having to spend considerable time modifying the repository and opening pull requests. All we would need to do is generate the queries and place them in the queue.
This will work well for discrete searches that we intend to run on a one-time or occasional basis. For recurring searches, an alternative design would be required.
What the design involves
- Create a `search_queue` table in the database. This contains a list of all search queries to be performed, and would include columns to enable cross-referencing with other tables in the database if desired. This would also provide a way to quickly identify whether a search has been performed, helping to preserve state, and would obviate the `agency_url_search_cache` table. It would also allow us to review historical searches, whereas currently we don't have a clear record of which query produces which results. A possible schema is sketched after this list.
- Create a `search_results` table. All results related to a given search would be uploaded here and linked via foreign key to the search queue table, with columns for `url` and `snippet` (i.e., the short descriptive text supplied by Google about the contents of the web page). This would be done as an alternative to uploading them to Hugging Face for a few reasons:
  - A: It enables more atomic uploads. The nature of a Hugging Face dataset is such that we're incentivized to batch results together into CSV files, as having a CSV file for each result quickly becomes difficult to organize. However, batching results poses a risk: we may perform a large number of searches, then something goes wrong and all of those searches have to be restarted. If each search result is uploaded one at a time, we only risk losing one search if an error occurs.
  - B: These results can be easily converted into a Hugging Face dataset at a later time.
- Modify the Google Searcher so that, instead of specifically focusing on filling agency homepage URLs as it currently does, it pulls searches to perform from the search queue, runs them, and uploads the top 10 results to the `search_results` table (see the runner sketch below).
- Instead of hard-coding the search queries, as is currently done with the agency homepage searcher, queries are generated through a one-time script and then uploaded to the search queue (see the enqueue sketch below). This allows us to more easily inspect pending queries and to create the queries en masse before they are sent.
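As a rough illustration, here is a minimal sketch of what the two tables might look like. It uses sqlite3 only so the example is self-contained; the production database, exact column names, and status values are assumptions rather than settled design:

```python
import sqlite3

# Sketch of the proposed search_queue and search_results tables.
# Column names and types are illustrative assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS search_queue (
    id          INTEGER PRIMARY KEY,
    query       TEXT NOT NULL,
    related_id  INTEGER,                          -- optional cross-reference to another table
    status      TEXT NOT NULL DEFAULT 'pending',  -- 'pending' or 'completed'
    executed_at TEXT                              -- when the search was run
);

CREATE TABLE IF NOT EXISTS search_results (
    id              INTEGER PRIMARY KEY,
    search_queue_id INTEGER NOT NULL REFERENCES search_queue(id),
    url             TEXT NOT NULL,
    snippet         TEXT                          -- Google's short description of the page
);
"""

if __name__ == "__main__":
    conn = sqlite3.connect("search_queue_demo.db")
    conn.executescript(SCHEMA)
    conn.commit()
    conn.close()
```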
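Likewise, a hedged sketch of how the modified Google Searcher could drain the queue each day. The `run_google_search` stub stands in for the real search call; the 100-per-day and top-10 limits mirror the description above, and the function names and status handling are illustrative assumptions:

```python
import sqlite3
from datetime import datetime, timezone

DAILY_LIMIT = 100   # matches the 100-searches-per-day budget described above
TOP_N = 10          # number of results to keep per query


def run_google_search(query: str) -> list[dict]:
    """Placeholder for the actual Google Searcher call.

    The real implementation would call the search API and return a list of
    {"url": ..., "snippet": ...} dicts; this stub only makes the sketch run.
    """
    return []


def run_pending_searches(db_path: str = "search_queue_demo.db") -> None:
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    pending = conn.execute(
        "SELECT id, query FROM search_queue WHERE status = 'pending' LIMIT ?",
        (DAILY_LIMIT,),
    ).fetchall()

    for row in pending:
        results = run_google_search(row["query"])[:TOP_N]
        # Each search is written and committed individually, so a failure
        # mid-run only loses the search currently in flight, not the batch.
        conn.executemany(
            "INSERT INTO search_results (search_queue_id, url, snippet) VALUES (?, ?, ?)",
            [(row["id"], r["url"], r.get("snippet")) for r in results],
        )
        conn.execute(
            "UPDATE search_queue SET status = 'completed', executed_at = ? WHERE id = ?",
            (datetime.now(timezone.utc).isoformat(), row["id"]),
        )
        conn.commit()

    conn.close()


if __name__ == "__main__":
    run_pending_searches()
```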
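Finally, a sketch of the kind of one-time enqueue script described in the last bullet. The query templates and inputs here are purely hypothetical; the point is only that queries can be generated en masse and inspected in the table before any searches run:

```python
import sqlite3

# Hypothetical example: build one query per (record type, location) pair
# and load them all into the queue in a single pass.
RECORD_TYPES = ["use of force policy", "misconduct records"]
LOCATIONS = ["Pittsburgh PA", "Philadelphia PA"]


def enqueue_queries(db_path: str = "search_queue_demo.db") -> None:
    queries = [
        f"{record_type} {location}"
        for record_type in RECORD_TYPES
        for location in LOCATIONS
    ]
    conn = sqlite3.connect(db_path)
    conn.executemany(
        "INSERT INTO search_queue (query) VALUES (?)",
        [(q,) for q in queries],
    )
    conn.commit()
    conn.close()


if __name__ == "__main__":
    enqueue_queries()
```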