Solving Cloudflare, block_images detection #170

Closed
sebhansen opened this issue Jan 29, 2025 · 16 comments
Labels
detection issue Potential leak in Camoufox.

Comments

@sebhansen

sebhansen commented Jan 29, 2025

Website detecting Camoufox:

The "website" I am trying to reach is moneysupermarket.com. It's a link to a specific car's JSON data. In my normal code I'd simply take all the text and scrape it so I can build on it elsewhere, but it just doesn't get past Cloudflare.

Screenshots:

Image

To Reproduce:

from camoufox import Camoufox
from browserforge.fingerprints import Screen
import time

# Extra headers sent with every request made from the context
headers = {
    'x-frame-options': 'SAMEORIGIN'
}

with Camoufox(
    os=('windows', 'macos'),
    screen=Screen(max_width=1920, max_height=1080),
    headless=False,
    humanize=True,
    block_images=True,
    geoip=True,
    proxy={
        'server': "serverIP",
        'username': "usrname",
        'password': "pass"
    }
) as browser:
    # Keep the context in its own variable instead of overwriting `browser`
    context = browser.new_context(extra_http_headers=headers)
    page = context.new_page()
    # page.goto("https://www.google.com/")
    # page.wait_for_load_state('networkidle')
    page.goto("https://www.moneysupermarket.com/dialogue/insur/van/vehicle-lookup/at?id=8a42e125944fade7019452066c173984")
    page.wait_for_load_state('networkidle')
    # Keep the browser open for inspection
    time.sleep(60000)

I'm simply trying to scrape all the text from the link provided. The code above just shows that we don't get past Cloudflare.

Other questions:

Is it because the WebGL library isn't very good right now, and will this be fixed in the next release with the updated WebGL library?

  1. Are you using a proxy?

Yes, I am using datacenter proxies. This shouldn't be an issue, however, since Zyte gets through every time using the same proxies.

  2. Open the website in a private tab in your personal browser using the same IP. Does it work?

It works. It shows the Cloudflare page for a second and then goes on to show me the correct page.

  3. Is Camoufox detected randomly or every time?

Every time.

  4. What OS are you using?

Win 11

Version:

Pip package: v0.4.9
Camoufox: v134.0.2-beta.20 (Up to date!)

@sebhansen sebhansen added the detection issue Potential leak in Camoufox. label Jan 29, 2025
@netdev1

netdev1 commented Jan 29, 2025

Your code example includes the username and password for your proxy. If they're not just example values, make sure to change them in your proxy provider settings, or people can abuse them.

@sebhansen
Author

Completely forgot to remove that part, thank you.

@daijro
Owner

daijro commented Jan 29, 2025

Hello,

Seems to be an issue with block_images=True. Cloudflare must be checking if images return a 200 response. I will add a leak warning and look for a potential workaround.
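One way to avoid tripping over this setting is to strip it from the launch options before starting the browser. The helper below is only a sketch: `turnstile_safe_options` is a hypothetical name, not part of Camoufox, and it assumes the options are later passed to `Camoufox(**options)` as in the report above.

```python
import warnings


def turnstile_safe_options(**options):
    """Return a copy of the launch options with image blocking switched off.

    Hypothetical helper: it only edits a dict and never touches the browser.
    """
    safe = dict(options)
    if safe.get("block_images"):
        # Warn instead of failing so existing scripts keep running
        warnings.warn(
            "block_images=True appears to break Cloudflare challenges; "
            "disabling it for this launch.",
            stacklevel=2,
        )
        safe["block_images"] = False
    return safe
```

Assumed usage: `with Camoufox(**turnstile_safe_options(headless=False, block_images=True)) as browser: ...`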

@daijro
Owner

daijro commented Jan 29, 2025

@sebhansen I am not familiar with Zyte. Did you have images disabled on their browser as well?

daijro added a commit that referenced this issue Jan 29, 2025

@sebhansen
Author

sebhansen commented Jan 29, 2025

I can't seem to find any specifics on whether or not they block images.
I have now tried with block_images=False, with the same result. The website also only contains text, so no images need to be rendered.
I wish I had more information for you, but I can't really think of anything else.

@sebhansen
Author

I now see that I can press the checkbox manually when testing with the browser visible. Before, it just threw me straight back to the exact same challenge page.

Now my question is: can Camoufox find that box and click it itself, or will I have to find the coordinates for the element manually, which doesn't seem dynamic at all?

@daijro
Owner

daijro commented Jan 29, 2025

Now my question is: can Camoufox find that box and click it itself, or will I have to find the coordinates for the element manually, which doesn't seem dynamic at all?

Unfortunately support for searching within iframes is limited right now due to restrictions in how Camoufox handles content isolation (to prevent detection, cross-process iframe isolation is kept enabled).

However, it should be possible to find the checkbox coordinates given the offset of the Turnstile frame, or, as a more foolproof solution, to find the checkbox in a screenshot using OpenCV.
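A minimal sketch of the offset approach, assuming the caller can obtain a Playwright locator for some element that encloses the widget (the frame itself or its parent container). `checkbox_point` and `click_turnstile` are hypothetical helpers, and the 30 px horizontal offset is a guess to tune against the real widget, not a documented value.

```python
def checkbox_point(frame_box, offset_x=30.0):
    """Guess the checkbox centre from the widget's bounding box.

    frame_box has x, y, width and height keys, the shape returned by
    Playwright's bounding_box(). offset_x is an assumed distance from the
    left edge; it is capped at half the width so the point stays inside.
    """
    x = frame_box["x"] + min(offset_x, frame_box["width"] / 2)
    y = frame_box["y"] + frame_box["height"] / 2  # vertically centred
    return x, y


def click_turnstile(page, container):
    """Click the guessed point with the page mouse.

    Returns False when the container has no bounding box (not visible).
    """
    box = container.bounding_box()
    if box is None:
        return False
    x, y = checkbox_point(box)
    page.mouse.click(x, y)
    return True
```

Because the click goes through `page.mouse` in page coordinates, it does not need to search inside the iframe, which is the part that is limited in Camoufox.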

@sebhansen
Author

@daijro Do you have anything you have done with the coordinates before? I just can't seem to find anything that I can use with Camoufox. I have tried boundingBox and similar calls, but they don't seem to be supported. Here is what I have tried so far.

import sys
import time

from browserforge.fingerprints import Screen
from camoufox import Camoufox

headers = {
    'x-frame-options': 'SAMEORIGIN'
}

# CSS attribute selector; '//*[@src*="..."]' mixes CSS syntax into XPath and is not valid
TURNSTILE_IFRAME = 'iframe[src*="challenges.cloudflare.com"]'

with Camoufox(
    os=('windows', 'macos'),
    screen=Screen(max_width=1920, max_height=1080),
    headless=False,
    humanize=True,
    block_images=False,
    geoip=True,
    proxy={
        'server': "xxx",
        'username': "xxx",
        'password': "xxx"
    }
) as browser:
    context = browser.new_context(extra_http_headers=headers)
    page = context.new_page()

    # Navigate to the page
    page.goto("https://www.moneysupermarket.com/dialogue/insur/van/vehicle-lookup/at?id=8a42e125944fade7019452066c173984", wait_until='networkidle')
    print("Page loaded.")

    # Wait for the Cloudflare challenge iframe to appear
    try:
        print("Waiting for Cloudflare challenge iframe...")
        page.wait_for_selector(TURNSTILE_IFRAME, timeout=30000)
        print("Cloudflare challenge iframe detected.")
    except Exception as e:
        print("Failed to detect Cloudflare challenge iframe:", e)
        sys.exit(1)

    # Locate the iframe
    cloudflare_iframe = page.frame_locator(TURNSTILE_IFRAME)

    # Wait for the checkbox inside the iframe
    try:
        checkbox = cloudflare_iframe.locator('input[type="checkbox"]')
        # Creating a locator never raises; wait_for() is what checks for the element
        checkbox.wait_for(timeout=10000)
        print("Checkbox located inside the iframe.")
    except Exception as e:
        print("Failed to locate checkbox inside the iframe:", e)
        sys.exit(1)

    # Get the bounding box of the checkbox
    try:
        # The Python method is bounding_box(); boundingBox() is the JavaScript name
        bounding_box = checkbox.bounding_box()
        if bounding_box:
            # Calculate the center of the checkbox
            checkbox_x = bounding_box['x'] + bounding_box['width'] / 2
            checkbox_y = bounding_box['y'] + bounding_box['height'] / 2

            # Move the mouse to the checkbox coordinates
            page.mouse.move(checkbox_x, checkbox_y)

            # Click the checkbox
            page.mouse.click(checkbox_x, checkbox_y)

            print("Checkbox clicked successfully!")
        else:
            print("Failed to get bounding box of the checkbox.")
    except Exception as e:
        print("Failed to interact with the checkbox:", e)

    # Keep the browser open for debugging
    time.sleep(60000)

The issue with this is that Camoufox doesn't seem to behave exactly like Playwright here.

@sebhansen
Author

sebhansen commented Jan 29, 2025

Also, I just tested a new website I had issues with: https://www.nettiauto.com/vaihtoautot?posted_by=dealer

I get to the same page with the Cloudflare checkbox, but this one sends me straight back to the exact same Cloudflare challenge, using the same config as for the other site. Here, too, Zyte gets straight through (Zyte is an API for scraping that sends back JSON, HTML, or whatever you want). They maintain Scrapy, if you know what that is.

@sadikhan918

sadikhan918 commented Jan 30, 2025

From my own testing, turnstile captchas usually autosolve, but you can just click on the turnstile frame if it doesn’t.

cloudflare_iframe.click()

Or, if the turnstile itself is not centered in the frame, you can click on a position relative to the top left of the frame:

cloudflare_iframe.click(position={'x': 20, 'y': 20})

block_images also breaks turnstile, and turnstiles will fail even if you click on them.
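For reference, in Playwright's Python API `click(position=...)` belongs to `Locator` (a `FrameLocator` has no `click()`), and the position is measured from the element's top-left corner. The sketch below wraps that call; `click_at_offset` is a hypothetical helper, and the clamping is an added safeguard rather than something Playwright requires.

```python
def click_at_offset(locator, x=20.0, y=20.0):
    """Click a Playwright Locator at (x, y) measured from its top-left corner.

    The offset is clamped to the element's bounding box so the click cannot
    land outside the element. Returns the position used, or None when the
    element is not visible.
    """
    box = locator.bounding_box()
    if box is None:
        return None
    position = {
        "x": max(0.0, min(x, box["width"] - 1)),
        "y": max(0.0, min(y, box["height"] - 1)),
    }
    locator.click(position=position)
    return position
```

Assumed usage: `click_at_offset(page.locator('#cf-turnstile'))`, where the selector is whatever element wraps the widget on the target page.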

@sebhansen
Author

sebhansen commented Jan 30, 2025

@sadikhan918 I'm not quite sure how you are able to give the locator a click function. It just doesn't want to click for me at all. If you have a full piece of code, I'd love to see how it works!

@daijro daijro changed the title from "Cloudflare blocking Camoufox, but not Zyte" to "Solving Cloudflare, block_images detection" Jan 30, 2025
@sadikhan918

@sadikhan918 I'm not quite sure how you are able to give the locator a click function. It just doesn't want to click for me at all. If you have a full piece of code, I'd love to see how it works!

Sorry, I went back and checked my tests and noticed that I wasn't clicking on the iframe itself, but the parent container. That's why the positioning can be important, since the parent can span across the entire width of the page but the turnstile element is only a portion of it. This is some working code to solve it:

from camoufox.sync_api import Camoufox
import time


with Camoufox(headless=False) as browser:
    context = browser.new_context()
    page = context.new_page()

    page.goto('https://2captcha.com/demo/cloudflare-turnstile')
    # Click the parent container of the Turnstile widget, not the iframe itself
    cloudflare_container = page.locator('#cf-turnstile')
    cloudflare_container.click()

    # Leave time to check by eye whether Cloudflare is solved
    time.sleep(100)
    context.close()
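Instead of sleeping and checking by eye, the script could poll for the token. The sketch below assumes an embedded Turnstile widget, which writes its token into a hidden form field named `cf-turnstile-response`. Whether a given page exposes that field is an assumption to verify, and a full-page Cloudflare challenge may not have it at all. `wait_for_turnstile_token` is a hypothetical helper.

```python
import time

# JavaScript run in the page: read the hidden field Turnstile fills in on success
_TOKEN_JS = """() => {
    const field = document.querySelector('[name="cf-turnstile-response"]');
    return field ? field.value : '';
}"""


def wait_for_turnstile_token(page, timeout_s=30.0, poll_s=0.5, sleep=time.sleep):
    """Poll until the response field holds a token, or give up.

    page is anything with a Playwright-style evaluate() method.
    Returns the token string, or None if none appeared within timeout_s.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        token = page.evaluate(_TOKEN_JS)
        if token:
            return token
        if time.monotonic() >= deadline:
            return None
        sleep(poll_s)
```

Assumed usage, right after the click: `token = wait_for_turnstile_token(page)`; a `None` result means the challenge was not solved in time.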

@sebhansen
Author

I agree with the theory here, but it just doesn't seem to work with anything other than that exact site. Have you tried https://sergiodemo.com/security/challenge/legacy-challenge? It's more like a real-life scenario.

Sadly, everything useful is inside the #shadow-root, and I just can't seem to make it press anything or find anything that describes locators I can use.

@sadikhan918

I see what you mean. When I click on the turnstile by selecting the parent div in your example, it “clicks” but doesn’t solve the challenge. I’m not sure how to fix that, but I know that the challenge can be skipped altogether with a good proxy.

@sebhansen
Author

Oh for sure, sadly those will be way too expensive to keep up haha

@sebhansen
Author

I will close this, as I don't think it'll be possible to solve the challenge without just clicking popular spots for the checkbox by coordinates. If anyone figures out a better way, feel free to let me know.

4 participants