Sitemap Generator should generate a partial gleaner configuration #7

Open
webb-ben opened this issue May 21, 2024 · 0 comments
Labels: enhancement (New feature or request)

Is your feature request related to a problem? Please describe.
When the sitemap generator builds its sitemap index file, it already has everything needed to create the sources section of a gleaner configuration file, such as:

gleaner:
  runid: iow # this will be the bucket the output is placed in...
  summon: true # do we want to visit the web sites and pull down the files
  mill: false
context:
  cache: true
contextmaps:
- prefix: "https://schema.org/"
  file: "/home/vagrant/conf/jsonldcontext.jsonld"  # wget http://schema.org/docs/jsonldcontext.jsonld
- prefix: "http://schema.org/"
  file: "/home/vagrant/conf/jsonldcontext.jsonld"  # wget http://schema.org/docs/jsonldcontext.jsonld
summoner:
  after: ""      # "21 May 20 10:00 UTC"   
  mode: full  # full || diff:  If diff compare what we have currently in gleaner to sitemap, get only new, delete missing
  threads: 2 
  delay:  # milliseconds (1000 = 1 second) to delay between calls (will FORCE threads to 1) 
  headless: http://localhost:9222  # URL for headless see docs/headless
millers:
  graph: true
sources:
- active: 'true'
  domain: https://pids.geoconnex.dev
  headless: 'false'
  name: refgages0
  pid: https://gleaner.io/genid/geoconnex
  propername: refgages0
  sourcetype: sitemap
  url: https://pids.geoconnex.dev/sitemap/ref/gages/gages__0.xml 
- active: 'true'
  domain: https://pids.geoconnex.dev
  headless: 'false'
  name: refmainstems
  pid: https://gleaner.io/genid/geoconnex
  propername: refmainstems
  sourcetype: sitemap
  url: https://pids.geoconnex.dev/sitemap/ref/mainstems/mainstems__0.xml  
- active: 'true'
  domain: https://pids.geoconnex.dev
  headless: 'false'
  name: dams0 
  pid: https://gleaner.io/genid/geoconnex
  propername: dams0 
  sourcetype: sitemap
  url: https://pids.geoconnex.dev/sitemap/ref/dams/dams__0.xml 
- active: 'true'
  domain: https://pids.geoconnex.dev
  headless: 'false'
  name: cdss0
  pid: https://gleaner.io/genid/geoconnex
  propername: cdss0
  sourcetype: sitemap
  url: https://pids.geoconnex.dev/sitemap/cdss/co_gages__0.xml
- active: 'true'
  domain: https://pids.geoconnex.dev
  headless: 'false'
  name: nmwdist0 
  pid: https://gleaner.io/genid/geoconnex
  propername: nmwdist0 
  sourcetype: sitemap
  url: https://pids.geoconnex.dev/sitemap/nmwdi/st/nmwdi-st__0.xml

Describe the solution you'd like
Either as a separate step or as a function of the existing sitemap generator workflow, it would be nice to be able to generate this sources section automatically.
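
As a rough illustration of what such a step could look like, here is a minimal Python sketch that reads a sitemap index and prints a sources section. It is not the sitemap generator's actual code. The function names (source_name, sources_from_index, to_yaml) are hypothetical, the pid default is copied from the example above, and the naming rule (directory segments after /sitemap/ plus the chunk number) is an assumption that reproduces refgages0, cdss0, and nmwdist0 but not every hand-written name in the example (it would yield refdams0 rather than dams0).

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Namespace used by standard sitemap and sitemap index files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Keys whose values are written as quoted strings, as in the example config.
QUOTED_KEYS = {"active", "headless"}


def source_name(url):
    """Hypothetical naming rule: directories after /sitemap/ plus chunk number."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if not parts:
        return ""
    if parts[0] == "sitemap":
        parts = parts[1:]
    dirs, filename = parts[:-1], parts[-1] if parts else ""
    match = re.search(r"__(\d+)\.xml$", filename)
    chunk = match.group(1) if match else ""
    return re.sub(r"[^A-Za-z0-9]", "", "".join(dirs)) + chunk


def sources_from_index(index_xml, pid="https://gleaner.io/genid/geoconnex"):
    """Build one source entry per <sitemap><loc> in a sitemap index document."""
    root = ET.fromstring(index_xml)
    sources = []
    for loc in root.findall("sm:sitemap/sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        parsed = urlparse(url)
        name = source_name(url)
        sources.append({
            "active": "true",
            "domain": f"{parsed.scheme}://{parsed.netloc}",
            "headless": "false",
            "name": name,
            "pid": pid,  # assumption: same pid for every source
            "propername": name,
            "sourcetype": "sitemap",
            "url": url,
        })
    return sources


def to_yaml(sources):
    """Render the entries in the same layout as the sources section above."""
    lines = ["sources:"]
    for src in sources:
        for i, (key, value) in enumerate(src.items()):
            prefix = "- " if i == 0 else "  "
            rendered = f"'{value}'" if key in QUOTED_KEYS else value
            lines.append(f"{prefix}{key}: {rendered}")
    return "\n".join(lines) + "\n"
```

Calling to_yaml(sources_from_index(text)) on the contents of a sitemap index would produce a block that can be pasted under the rest of the gleaner configuration. The output is built with plain string formatting to avoid a YAML library dependency; a real implementation might prefer a YAML writer and a configurable naming scheme.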

Describe alternatives you've considered
Ideally, gleaner would be able to parse a sitemap index file and create the source entries itself. Until something like that exists, being able to copy and paste a generated configuration would be a step up from manual entry, especially as the list of sources we are crawling grows.


webb-ben added the enhancement (New feature or request) label on May 21, 2024