ScopeType from Yaml not overrules ScopeType of Docker CMD #774

Open
gitreich opened this issue Feb 20, 2025 · 1 comment
@gitreich
Contributor

Hi,
When I configure a seed in the YAML config like this (only 1 seed!):

seeds:

and start the Docker container with:

docker run -d --name ONB_Btrix_onb_weekly_20250220085219 -e NODE_OPTIONS='--max-old-space-size=32768' -p 41703:41703 -p 42831:42831 -v /home/netarchive/browsertrix/crawls/:/crawls/ webrecorder/browsertrix-crawler:1.5.3 crawl --screencastPort 41703 --healthCheckPort 42831 --scopeType domain --headless --profile /crawls/profiles/default_profile.tar.gz --delay 0 --behaviorTimeout 60 --pageLoadTimeout 60 --waitUntil networkidle0 --saveState always --logging stats,info --config /crawls/config/onb_weekly_20250220085219.yaml --depth 7 --workers 1 --limit 1000 --sizeLimit 26843545600 --text to-warc,final-to-warc --screenshot fullPage --warcInfo ONB_CRAWL_onb_weekly_20250220085219_Depth_7_20250220085221 --userAgentSuffix +ONB_Bot_Btrix_1.5.3, [email protected] --crawlId id_ONB_CRAWL_onb_weekly_20250220085219_Depth_7_20250220085221 --collection onb_weekly_20250220085219

My expectation was that the scopeType from the Docker CMD would be overruled by the scopeType in the YAML config, but in reality the scopeType from the Docker CMD overruled the scopeType in the YAML file (visible already at the second page):

Log File:

{"timestamp":"2025-02-20T08:52:23.499Z","logLevel":"info","context":"general","message":"Browsertrix-Crawler 1.5.3 (with warcio.js 2.4.3)","details":{}}
{"timestamp":"2025-02-20T08:52:23.501Z","logLevel":"info","context":"general","message":"Seeds","details":[{"url":"https://www.onb.ac.at/koop-litera","scopeType":"prefix","include":["/^https?:\/\/www\.onb\.ac\.at\//"],"exclude":[],"allowHash":false,"depth":-1,"sitemap":null,"auth":null,"_authEncoded":null,"maxExtraHops":0,"maxDepth":7}]}
{"timestamp":"2025-02-20T08:52:23.502Z","logLevel":"info","context":"general","message":"Link Selectors","details":[{"selector":"a[href]","extract":"href","isAttribute":false}]}
{"timestamp":"2025-02-20T08:52:23.502Z","logLevel":"info","context":"general","message":"Behavior Options","details":{"message":"{"autoplay":true,"autofetch":true,"autoscroll":true,"siteSpecific":true,"log":"__bx_log","startEarly":true,"clickSelector":"a"}"}}
{"timestamp":"2025-02-20T08:52:23.502Z","logLevel":"info","context":"general","message":"With Browser Profile","details":{"url":"/crawls/profiles/default_profile.tar.gz"}}
{"timestamp":"2025-02-20T08:52:23.654Z","logLevel":"info","context":"healthcheck","message":"Healthcheck server started on 42831","details":{}}
{"timestamp":"2025-02-20T08:52:25.144Z","logLevel":"info","context":"worker","message":"Creating 1 workers","details":{}}
{"timestamp":"2025-02-20T08:52:25.145Z","logLevel":"info","context":"worker","message":"Worker starting","details":{"workerid":0}}
{"timestamp":"2025-02-20T08:52:25.445Z","logLevel":"info","context":"worker","message":"Starting page","details":{"workerid":0,"page":"https://www.onb.ac.at/koop-litera"}}
{"timestamp":"2025-02-20T08:52:25.449Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":0,"total":1,"pending":1,"failed":0,"limit":{"max":1000,"hit":false},"pendingPages":["{"seedId":0,"started":"2025-02-20T08:52:25.160Z","extraHops":0,"url":"https:\/\/www.onb.ac.at\/koop-litera","added":"2025-02-20T08:52:23.727Z","depth":0}"]}}
{"timestamp":"2025-02-20T08:52:25.838Z","logLevel":"info","context":"general","message":"Awaiting page load","details":{"page":"https://www.onb.ac.at/koop-litera","workerid":0}}
{"timestamp":"2025-02-20T08:52:26.666Z","logLevel":"warn","context":"recorder","message":"Skipping URL from unknown frame","details":{"url":"https://www.onb.ac.at/koop-litera","frameId":"EF1D59B6DFFFEB93BE1ECE0AF51F64D7"}}
{"timestamp":"2025-02-20T08:52:30.554Z","logLevel":"warn","context":"recorder","message":"Skipping URL from unknown frame","details":{"url":"https://eu.libraryh3lp.com/chat/[email protected]?skin=14948&referer=https%3A%2F%2Fwww.onb.ac.at%2Fkoop-litera","frameId":"16D571831B96CD8EB7ED25AA00FD4D1D"}}
{"timestamp":"2025-02-20T08:52:35.060Z","logLevel":"warn","context":"general","message":"Invalid Page - URL must start with http:// or https://","details":{"url":"data:text/plain;charset=utf-8,%3C!DOCTYPE%20html%3E%3Chtml%3E%3Cbody%3E%3C%2Fbody%3E%3C%2Fhtml%3E","page":"https://www.onb.ac.at/koop-litera","workerid":0}}
{"timestamp":"2025-02-20T08:52:39.552Z","logLevel":"info","context":"behavior","message":"Running behaviors","details":{"frames":2,"frameUrls":["https://www.onb.ac.at/koop-litera","https://eu.libraryh3lp.com/chat/[email protected]?skin=14948&referer=https%3A%2F%2Fwww.onb.ac.at%2Fkoop-litera"],"page":"https://www.onb.ac.at/koop-litera","workerid":0}}
{"timestamp":"2025-02-20T08:52:39.552Z","logLevel":"info","context":"behavior","message":"Run Script Started","details":{"frameUrl":"https://www.onb.ac.at/koop-litera","page":"https://www.onb.ac.at/koop-litera","workerid":0}}
{"timestamp":"2025-02-20T08:52:39.553Z","logLevel":"info","context":"behavior","message":"Run Script Started","details":{"frameUrl":"https://eu.libraryh3lp.com/chat/[email protected]?skin=14948&referer=https%3A%2F%2Fwww.onb.ac.at%2Fkoop-litera","page":"https://www.onb.ac.at/koop-litera","workerid":0}}
{"timestamp":"2025-02-20T08:52:39.729Z","logLevel":"info","context":"behaviorScript","message":"Behavior log","details":{"state":{"segments":1},"msg":"Skipping autoscroll, page seems to not be responsive to scrolling events","page":"https://www.onb.ac.at/koop-litera","workerid":0}}
{"timestamp":"2025-02-20T08:52:39.730Z","logLevel":"info","context":"behaviorScript","message":"Behavior log","details":{"state":{"segments":1},"msg":"done!","page":"https://www.onb.ac.at/koop-litera","workerid":0}}
{"timestamp":"2025-02-20T08:52:39.802Z","logLevel":"info","context":"behaviorScript","message":"Behavior log","details":{"state":{"segments":1},"msg":"Scrolling down by 30 pixels every 0.075 seconds","page":"https://www.onb.ac.at/koop-litera","workerid":0}}
{"timestamp":"2025-02-20T08:52:40.228Z","logLevel":"info","context":"behavior","message":"Run Script Finished","details":{"frameUrl":"https://eu.libraryh3lp.com/chat/[email protected]?skin=14948&referer=https%3A%2F%2Fwww.onb.ac.at%2Fkoop-litera","page":"https://www.onb.ac.at/koop-litera","workerid":0}}
{"timestamp":"2025-02-20T08:53:25.690Z","logLevel":"info","context":"behaviorScript","message":"Behavior log","details":{"state":{"segments":1},"msg":"done!","page":"https://www.onb.ac.at/koop-litera","workerid":0}}
{"timestamp":"2025-02-20T08:53:25.692Z","logLevel":"info","context":"behavior","message":"Run Script Finished","details":{"frameUrl":"https://www.onb.ac.at/koop-litera","page":"https://www.onb.ac.at/koop-litera","workerid":0}}
{"timestamp":"2025-02-20T08:53:25.692Z","logLevel":"info","context":"behavior","message":"Behaviors finished","details":{"finished":2,"page":"https://www.onb.ac.at/koop-litera","workerid":0}}
{"timestamp":"2025-02-20T08:53:26.788Z","logLevel":"info","context":"pageStatus","message":"Page Finished","details":{"loadState":4,"page":"https://www.onb.ac.at/koop-litera","workerid":0}}
{"timestamp":"2025-02-20T08:53:26.833Z","logLevel":"info","context":"general","message":"Saving crawl state to: /crawls/collections/onb_weekly_20250220085219/crawls/crawl-20250220085326-id_ONB_CRAWL_onb_weekly_20250220085219_Depth_7_20250220085221.yaml","details":{}}
{"timestamp":"2025-02-20T08:53:26.912Z","logLevel":"info","context":"worker","message":"Starting page","details":{"workerid":0,"page":"https://www.onb.ac.at/kalender"}}
{"timestamp":"2025-02-20T08:53:26.914Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":1,"total":376,"pending":1,"failed":0,"limit":{"max":1000,"hit":false},"pendingPages":["{"seedId":0,"started":"2025-02-20T08:53:26.911Z","extraHops":0,"url":"https:\/\/www.onb.ac.at\/kalender","added":"2025-02-20T08:52:34.771Z","depth":1}"]}}
{"timestamp":"2025-02-20T08:53:27.485Z","logLevel":"info","context":"general","message":"Awaiting page load","details":{"page":"https://www.onb.ac.at/kalender","workerid":0}}
{"timestamp":"2025-02-20T08:53:28.817Z","logLevel":"warn","context":"recorder","message":"Skipping URL from unknown frame","details":{"url":"https://eu.libraryh3lp.com/chat/[email protected]?skin=14948&referer=https%3A%2F%2Fwww.onb.ac.at%2Fkalender","frameId":"7AEF6A9CE6FA205AB578E7BB014372E6"}}
{"timestamp":"2025-02-20T08:53:31.563Z","logLevel":"warn","context":"general","message":"Invalid Page - URL must start with http:// or https://","details":{"url":"data:text/plain;charset=utf-8,%3C!DOCTYPE%20html%3E%3Chtml%3E%3Cbody%3E%3C%2Fbody%3E%3C%2Fhtml%3E","page":"https://www.onb.ac.at/kalender","workerid":0}}
{"timestamp":"2025-02-20T08:53:34.959Z","logLevel":"info","context":"behavior","message":"Running behaviors","details":{"frames":2,"frameUrls":["https://www.onb.ac.at/kalender","https://eu.libraryh3lp.com/chat/[email protected]?skin=14948&referer=https%3A%2F%2Fwww.onb.ac.at%2Fkalender"],"page":"https://www.onb.ac.at/kalender","workerid":0}}
{"timestamp":"2025-02-20T08:53:34.959Z","logLevel":"info","context":"behavior","message":"Run Script Started","details":{"frameUrl":"https://www.onb.ac.at/kalender","page":"https://www.onb.ac.at/kalender","workerid":0}}
{"timestamp":"2025-02-20T08:53:34.960Z","logLevel":"info","context":"behavior","message":"Run Script Started","details":{"frameUrl":"https://eu.libraryh3lp.com/chat/[email protected]?skin=14948&referer=https%3A%2F%2Fwww.onb.ac.at%2Fkalender","page":"https://www.onb.ac.at/kalender","workerid":0}}
{"timestamp":"2025-02-20T08:53:35.076Z","logLevel":"info","context":"behaviorScript","message":"Behavior log","details":{"state":{"segments":1},"msg":"Skipping autoscroll, page seems to not be responsive to scrolling events","page":"https://www.onb.ac.at/kalender","workerid":0}}
{"timestamp":"2025-02-20T08:53:35.076Z","logLevel":"info","context":"behaviorScript","message":"Behavior log","details":{"state":{"segments":1},"msg":"done!","page":"https://www.onb.ac.at/kalender","workerid":0}}
{"timestamp":"2025-02-20T08:53:35.571Z","logLevel":"info","context":"behaviorScript","message":"Behavior log","details":{"state":{"segments":1},"msg":"Skipping autoscroll, page seems to not be responsive to scrolling events","page":"https://www.onb.ac.at/kalender","workerid":0}}
{"timestamp":"2025-02-20T08:53:35.572Z","logLevel":"info","context":"behaviorScript","message":"Behavior log","details":{"state":{"segments":1},"msg":"done!","page":"https://www.onb.ac.at/kalender","workerid":0}}
{"timestamp":"2025-02-20T08:53:35.575Z","logLevel":"info","context":"behavior","message":"Run Script Finished","details":{"frameUrl":"https://www.onb.ac.at/kalender","page":"https://www.onb.ac.at/kalender","workerid":0}}
{"timestamp":"2025-02-20T08:53:35.577Z","logLevel":"info","context":"behavior","message":"Run Script Finished","details":{"frameUrl":"https://eu.libraryh3lp.com/chat/[email protected]?skin=14948&referer=https%3A%2F%2Fwww.onb.ac.at%2Fkalender","page":"https://www.onb.ac.at/kalender","workerid":0}}
{"timestamp":"2025-02-20T08:53:35.577Z","logLevel":"info","context":"behavior","message":"Behaviors finished","details":{"finished":2,"page":"https://www.onb.ac.at/kalender","workerid":0}}
{"timestamp":"2025-02-20T08:53:36.637Z","logLevel":"info","context":"pageStatus","message":"Page Finished","details":{"loadState":4,"page":"https://www.onb.ac.at/kalender","workerid":0}}
{"timestamp":"2025-02-20T08:53:36.658Z","logLevel":"info","context":"worker","message":"Starting page","details":{"workerid":0,"page":"https://www.onb.ac.at/oeffnungszeiten"}}
{"timestamp":"2025-02-20T08:53:36.659Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":2,"total":390,"pending":1,"failed":0,"limit":{"max":1000,"hit":false},"pendingPages":["{"seedId":0,"started":"2025-02-20T08:53:36.657Z","extraHops":0,"url":"https:\/\/www.onb.ac.at\/oeffnungszeiten","added":"2025-02-20T08:52:34.771Z","depth":1}"]}}
{"timestamp":"2025-02-20T08:53:37.003Z","logLevel":"info","context":"general","message":"Awaiting page load","details":{"page":"https://www.onb.ac.at/oeffnungszeiten","workerid":0}}
{"timestamp":"2025-02-20T08:53:37.838Z","logLevel":"warn","context":"recorder","message":"Skipping URL from unknown frame","details":{"url":"https://eu.libraryh3lp.com/chat/[email protected]?skin=14948&referer=https%3A%2F%2Fwww.onb.ac.at%2Foeffnungszeiten","frameId":"B863D6A957B12CD3238E0521FFF3AA9D"}}
{"timestamp":"2025-02-20T08:53:40.183Z","logLevel":"warn","context":"general","message":"Invalid Page - URL must start with http:// or https://","details":{"url":"data:text/plain;charset=utf-8,%3C!DOCTYPE%20html%3E%3Chtml%3E%3Cbody%3E%3C%2Fbody%3E%3C%2Fhtml%3E","page":"https://www.onb.ac.at/oeffnungszeiten","workerid":0}}
{"timestamp":"2025-02-20T08:53:41.371Z","logLevel":"info","context":"behavior","message":"Running behaviors","details":{"frames":2,"frameUrls":["https://www.onb.ac.at/oeffnungszeiten","https://eu.libraryh3lp.com/chat/[email protected]?skin=14948&referer=https%3A%2F%2Fwww.onb.ac.at%2Foeffnungszeiten"],"page":"https://www.onb.ac.at/oeffnungszeiten","workerid":0}}

@ikreymer
Member

The YAML file supports per-seed scope rules, which take precedence over the global scope rules. Per-seed rules can only be set in the YAML file, while global rules can be set either in the YAML file or on the command line.

If you do:

seeds:
   - url: https://www.onb.ac.at/koop-litera

scopeType: "prefix"
depth: 7

instead of:

seeds:
   - url: https://www.onb.ac.at/koop-litera
     scopeType: "prefix"
     depth: 7

then the scopeType and depth should be overridable via the command line.
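
The precedence described above can be sketched roughly as follows. This is only an illustrative model, not Browsertrix Crawler's actual implementation: the function `resolve_option` and the dictionaries are hypothetical, and the assumed order is per-seed value first, then the command line, then the global YAML value.

```python
# Illustrative model of the option precedence described in this thread.
# NOTE: resolve_option and these dictionaries are hypothetical names,
# not part of Browsertrix Crawler.

def resolve_option(key, seed, cli_args, yaml_globals, default=None):
    """Return the effective value of `key` for a single seed."""
    if key in seed:            # per-seed setting (YAML only) wins
        return seed[key]
    if key in cli_args:        # command line overrides global YAML
        return cli_args[key]
    return yaml_globals.get(key, default)  # global YAML, else default


# Flags taken from the docker command in the report.
cli_args = {"scopeType": "domain", "depth": 7}

# Case 1: scopeType and depth set on the seed itself.
per_seed_config = {
    "url": "https://www.onb.ac.at/koop-litera",
    "scopeType": "prefix",
    "depth": 7,
}
print(resolve_option("scopeType", per_seed_config, cli_args, {}))

# Case 2: scopeType and depth set globally in the YAML file.
global_seed = {"url": "https://www.onb.ac.at/koop-litera"}
yaml_globals = {"scopeType": "prefix", "depth": 7}
print(resolve_option("scopeType", global_seed, cli_args, yaml_globals))
```

Under these assumptions, the first case yields "prefix" (the command line cannot override it) and the second yields "domain" (the command line wins over the global YAML value).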
