
Generalize Google Searcher #76

@maxachis

Description


While #74 focuses on automating Google searches in the narrow domain of finding homepage URLs, I think we can generalize this into something more broadly useful for our purposes, one that lets us make the most of automated Google searches.

How This Works

Instead of having to create a specially designed searcher for each subject we want to search, we maintain a queue of Google searches. We populate it with all required queries, which are then run automatically, 100 per day. As we identify more unique data needs, we can create all relevant variations of those queries and add them to the queue, whereupon they will be run and their results stored for later retrieval and organization.

This way, we can perform a wider array of data searches without having to spend considerable time modifying the repository and opening pull requests. All we would need to do is generate the queries and place them in the queue.

This works well for discrete searches that we intend to run once or a handful of times. For recurring searches, an alternative design would be required.

What the design involves

  1. Create a search_queue table in the database. This contains a list of all search queries to be performed, with columns that enable cross-referencing with other tables in the database where desired. It also provides a quick way to tell whether a search has already been performed, helping to preserve state, and would obviate the agency_url_search_cache table. Finally, it lets us review historical searches, whereas currently we don't have a clear record of which query produced which results. (See the schema sketch after this list.)
  2. Create a search_results table. All results for a given search are uploaded here and linked via foreign key to the search_queue table, with columns for url and snippet (i.e., the short descriptive text Google supplies about the contents of the web page). This would be done as an alternative to uploading them to Hugging Face, for a few reasons:
  • A: It enables more atomic uploads. The nature of a Hugging Face dataset is such that we're incentivized to batch results together into CSV files, since a separate CSV file for each result quickly becomes difficult to organize. But batching poses a risk: we may perform a large number of searches, something goes wrong, and all of those searches have to be restarted. If each search result is uploaded one at a time, an error only costs us a single search.
  • B: These results can easily be converted into a Hugging Face dataset at a later time.
  3. Modify the Google Searcher so that, instead of focusing specifically on filling agency homepage URLs as it currently does, it pulls searches to perform from the search queue, runs them, and uploads the top 10 results to the search_results table (see the daily-run sketch below).
  4. Instead of hard-coding the search queries, as is currently done with the agency homepage searcher, generate queries through a one-time script and upload them to the search queue. This lets us inspect pending queries more easily and create queries en masse before they are sent (also sketched below).
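
To make this concrete, here's a rough sketch of what the two tables could look like, written as a one-time Python setup script. I'm assuming a Postgres database reached via psycopg2; all column names and the DATABASE_URL environment variable are illustrative, not final.

```python
"""One-time setup sketch for the two proposed tables (names are illustrative)."""
import os
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS search_queue (
    id          SERIAL PRIMARY KEY,
    query       TEXT NOT NULL,
    agency_id   INTEGER,                     -- optional cross-reference to other tables
    executed_at TIMESTAMPTZ,                 -- NULL = not yet searched
    created_at  TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE IF NOT EXISTS search_results (
    id              SERIAL PRIMARY KEY,
    search_queue_id INTEGER NOT NULL REFERENCES search_queue (id),
    url             TEXT NOT NULL,
    title           TEXT,
    snippet         TEXT,                    -- Google's short description of the page
    created_at      TIMESTAMPTZ DEFAULT NOW()
);
"""

with psycopg2.connect(os.environ["DATABASE_URL"]) as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
```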
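Here's a sketch of what the modified searcher's daily run could look like, assuming the Google Custom Search JSON API via google-api-python-client. The function name, environment variables, and quota constant are placeholders, not a final implementation.

```python
"""Daily run sketch: pull up to 100 pending queries, search, store top 10 results each."""
import os
import psycopg2
from googleapiclient.discovery import build

DAILY_QUOTA = 100        # free-tier limit on Custom Search queries per day
RESULTS_PER_QUERY = 10

def run_daily_batch():
    service = build("customsearch", "v1", developerKey=os.environ["GOOGLE_API_KEY"])
    conn = psycopg2.connect(os.environ["DATABASE_URL"])

    # Pull the day's batch of pending queries from the queue.
    with conn, conn.cursor() as cur:
        cur.execute(
            "SELECT id, query FROM search_queue "
            "WHERE executed_at IS NULL ORDER BY id LIMIT %s",
            (DAILY_QUOTA,),
        )
        pending = cur.fetchall()

    for queue_id, query in pending:
        response = (
            service.cse()
            .list(q=query, cx=os.environ["GOOGLE_CSE_ID"], num=RESULTS_PER_QUERY)
            .execute()
        )
        rows = [
            (queue_id, item.get("link"), item.get("title"), item.get("snippet"))
            for item in response.get("items", [])
        ]
        # Commit one search at a time, so a failure mid-batch only loses
        # the search currently in flight (point A above).
        with conn, conn.cursor() as cur:
            cur.executemany(
                "INSERT INTO search_results (search_queue_id, url, title, snippet) "
                "VALUES (%s, %s, %s, %s)",
                rows,
            )
            cur.execute(
                "UPDATE search_queue SET executed_at = NOW() WHERE id = %s",
                (queue_id,),
            )
    conn.close()

if __name__ == "__main__":
    run_daily_batch()
```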
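Finally, a sketch of the kind of one-time script that would populate the queue, using agency homepage queries as an example. The agencies table and its columns here are assumptions for illustration only.

```python
"""One-time population sketch: generate query variations en masse and enqueue them."""
import os
import psycopg2

def enqueue_homepage_queries():
    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    with conn, conn.cursor() as cur:
        # Hypothetical source table; swap in whatever drives the queries.
        cur.execute("SELECT id, name, state FROM agencies WHERE homepage_url IS NULL")
        agencies = cur.fetchall()
        cur.executemany(
            "INSERT INTO search_queue (query, agency_id) VALUES (%s, %s)",
            [
                (f"{name} {state} police department homepage", agency_id)
                for agency_id, name, state in agencies
            ],
        )
    conn.close()

if __name__ == "__main__":
    enqueue_homepage_queries()
```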
