Interacting on websites #46

Open · nrllh opened this issue May 12, 2022 · 15 comments

@nrllh commented May 12, 2022

Hi guys,

Since you are heavily changing the pipeline at the moment, I'd like to hear your opinion:

In one of our studies (see Section 4.2.4), we showed that interacting with web pages triggers much more HTTP traffic and helps to explore more about the visited pages. We simulated keys like Page Down, Page Up, and End. I can understand that in such large measurements these can cause longer crawling times, but if we don't do this, we may miss a lot (e.g., because of lazy loading). In many visits we miss images, CSS files, XMLHttpRequests, and JavaScript files, and I think also technologies identified by Wappalyzer.

That's why I'd suggest at least making a minimal simulation (e.g., only the End key). This would allow the crawler to scroll to the end of the page and load more data than we currently see.
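
A minimal sketch of what such a simulation could look like, assuming a Puppeteer-driven crawler (the function name, wait strategy, and timings here are illustrative, not the actual HTTP Archive pipeline):

```js
// Illustrative sketch only: load a page, then simulate the End key so
// lazily loaded resources below the fold get requested.
import puppeteer from 'puppeteer';

async function crawlWithScroll(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle2' });

  // Puppeteer's keyboard API sends the key via the DevTools protocol;
  // whether the browser treats that as a user-generated interaction is
  // one of the open questions discussed further down this thread.
  await page.keyboard.press('End');

  // Give lazy-loaded images, iframes, and XHRs some time to arrive.
  await new Promise((resolve) => setTimeout(resolve, 3000));

  await browser.close();
}
```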

@tunetheweb (Member)

The counter-argument is that we may get an unfair view of page weight if not everyone scrolls down (and particularly to the end).

@nrllh (Author) commented May 12, 2022

It depends on what we want to measure:

  1. What visitors of webpages may get,
  2. What websites provide.

When I think of the Almanac, we tend to measure the 2nd option.

@tunetheweb (Member)

Indeed. Though I'm not entirely clear what you mean by each of those points, as they could be read many ways!

But the reality is visitors may get anything from a cold, viewport-only load (basically what the HTTP Archive crawl does now), to a fuller load if they scroll to the end (your proposal), to even more if they click around and interact with the page.

Not saying adding a Page End check is a bad idea btw (nor that leaving it as is would necessarily be bad either). They're just... different. And it's arguable which is more representative.

@rviscomi (Member)

Thanks @nrllh, this is an interesting suggestion and I'm glad we're exploring it. Good points raised about what we're trying to measure, getting at the philosophy of HTTP Archive.

Our mission is to measure how websites are built. Having a more complete picture of what sites are using/doing would seem to be very valuable. We do need to be careful not to do that in such a way that throws off existing analyses. As in Barry's page weight example, we would want to distinguish between "initial page weight" and "full page weight".

I do think we should pursue this idea, but first we should carefully think through any unintended side effects and ways to mitigate them.

@nrllh (Author) commented May 16, 2022

Because of (mainly) lazy loading, we no longer measure the "real" weight of web pages. However, that is a limitation of most crawlers, because they don't interact with the websites they visit. This is problematic because most real users naturally interact with websites (scrolling down, hovering over images, ...), and many important parts are only loaded after such interactions.

So IMO, interacting with websites is a natural part of browsing, and it is important for exploring more real data about web pages and what real visitors get when they visit them. It also shouldn't be ignored, given today's web dynamics.

It's also hard to find a common solution for mimicking user interactions, but we may find acceptable ways (e.g., simulating scroll keys, hitting the Page Up/Down keys). We could make the distinction with timestamps: note the timestamp just before simulating the interactions and mark the new resources. Then we'd have "the current dataset" + "new resources after interacting", as in the sketch below.
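
A rough sketch of this timestamp idea (the helper name is hypothetical, and it uses the in-page Resource Timing API rather than the crawler's own request log):

```js
// Illustrative sketch: record a timestamp just before the simulated
// interaction, then split resource timing entries into the resources
// loaded initially vs. those loaded after the interaction.
async function splitResources(page) {
  const interactionStart = await page.evaluate(() => performance.now());

  await page.keyboard.press('End');
  await new Promise((resolve) => setTimeout(resolve, 3000));

  return page.evaluate((t0) => {
    // Note: the resource timing buffer is capped (250 entries by default),
    // so a real implementation would raise it or use the request log.
    const entries = performance.getEntriesByType('resource');
    return {
      initial: entries.filter((e) => e.startTime < t0).map((e) => e.name),
      afterInteraction: entries.filter((e) => e.startTime >= t0).map((e) => e.name),
    };
  }, interactionStart);
}
```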

PS: I also think that we should first discuss this carefully and only introduce it if you find it necessary. With this issue, I just wanted to draw your attention to it.

@nrllh (Author) commented Mar 5, 2024

Hi @tunetheweb @rviscomi, I'd like to ping this issue again.

Lazy-loading iframes are now supported in all browsers: https://twitter.com/addyosmani/status/1723416138477179124

I'm still in favor of interactions such as pressing Page Down, scrolling, or jumping to the end of the page, to ensure all website elements are fully loaded. Any thoughts?

@rviscomi (Member) commented Mar 7, 2024

I'm still interested. One unintended side effect to consider is scrolling so early that we disqualify the actual LCP element. So if we do it, we'd need to scroll as unobtrusively as possible.
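
One way this could look (hypothetical sketch): only scroll once no new LCP candidate has been reported for a while, so the input doesn't cut off the real LCP element:

```js
// Illustrative sketch: wait until LCP has "settled" (no new candidate
// reported for settleMs), then simulate the scroll.
async function scrollAfterLcpSettles(page, settleMs = 2000) {
  await page.evaluate((settle) => new Promise((resolve) => {
    let timer = setTimeout(resolve, settle);
    new PerformanceObserver(() => {
      // A new LCP candidate arrived; restart the settle timer.
      clearTimeout(timer);
      timer = setTimeout(resolve, settle);
    }).observe({ type: 'largest-contentful-paint', buffered: true });
  }), settleMs);

  await page.keyboard.press('End');
}
```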

@pmeenan any other thoughts on gracefully simulating a scroll interaction on every page?

@pmeenan (Member) commented Mar 7, 2024

Technically, the best way to do this would be with a custom metric that generates the interaction and waits for a promise that fires after some amount of time or a requestAnimationFrame (or whatever a good end point for the given interaction would be). The trick is going to be triggering an interaction that the browser recognizes as a user-generated one (JS-dispatched events usually don't qualify; I'm not sure if the DevTools events do).
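
As a hedged sketch, the body of such a custom metric might look something like this (note that window.scrollTo is script-generated and so won't count as a trusted user interaction, which is exactly the catch mentioned above):

```js
// Illustrative custom-metric body: trigger a scroll, then resolve after
// a couple of animation frames plus a settle timeout.
return new Promise((resolve) => {
  window.scrollTo(0, document.body.scrollHeight);
  requestAnimationFrame(() => {
    requestAnimationFrame(() => {
      // Allow some time for any lazily triggered requests to start.
      setTimeout(() => resolve(performance.now()), 2000);
    });
  });
});
```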

That said, it will add several seconds to every page tested and I'm not sure of the value. Is there value in JUST measuring a scroll? Is that a common interaction that triggers INP issues (since scroll is supposed to be off-thread)? Can the resulting data be trusted as being useful?

@tunetheweb (Member)

I'm really worried about what this does to the existing data, e.g. the median page load size will increase. And that is arguably just as wrong (if not more so!) as a measure of page request size, depending on how many users scroll.

Ideally we'd want to measure this as a separate run, but that doubles our crawl size and I don't think we can justify that cost (in terms of compute, crawl time, storage, and query size).

My preferred solution is for Lighthouse to do this (I would love to have this in LH to measure CLS more accurately too!) and then have an audit showing the amount of off-screen content loaded on scroll.

@pmeenan (Member) commented Mar 7, 2024

FWIW, custom metrics are collected after the "run" and won't get included in any of the data. I'm still somewhat skeptical about the usefulness of it though.

@tunetheweb (Member)

And so the resources that it requests won't show up in the main HAR?

@pmeenan (Member) commented Mar 7, 2024

Correct. The netlog, devtools and trace collection are all stopped before custom metrics run.

@nrllh (Author) commented Mar 8, 2024

My worry is that we're not getting the full scope of site resources with the current HTTP Archive data. Real user behavior involves a lot of interaction with web pages (scrolling, clicking, etc.); Baymard's study even shows that 70% of mobile users scroll both ways on a page, and interacting with pages loads more data which we currently miss. Especially with lazy-loaded iframes, we will skip a good chunk of content: most third parties, ads, images, scripts, etc.

So in my opinion, without webpage interactions, HTTP Archive data might not fully reflect real user experiences.

@tunetheweb (Member)

But this is always going to be a limitation of a crawler-based approach compared to using field data.

There are many interactions that can influence what a website loads. For example, a "Show cart" feature might open a same-page pop-up which loads more resources. Or clicking on a disclosure widget can load more content. Or a mega-menu.

And do all users scroll? Half-way? All the way down?

What about infinite scroll pages?

The only thing we can say with certainty is, based on the settings the crawler uses (e.g. viewport size), the page loads X resources.

@nrllh (Author) commented Mar 8, 2024

I'm also not a fan of introducing many different kinds of interactions, but scrolling is one of the most common ones, and there are studies showing that many people scroll up and down. If we want more realistic measurements, we should scroll as well, just like most web users.
