Self reference #27

Open: wants to merge 2 commits into base: main

@arrrrny arrrrny commented May 12, 2025

This fixes #24 and enhances self-referencing.

The idea is that once I extract a value via a parser, I want to reuse that extracted value multiple times.

The current implementation does not allow this; I have to select the value again.

The idea of using slot is neat; however, it does not cover every case, such as attributes.
slot also does not cover reusing an extracted element more than once.

This PR allows reusing an element or text once it has been extracted, and extracting attributes of that same element.

@arrrrny arrrrny closed this May 12, 2025
@arrrrny arrrrny reopened this May 14, 2025
@sukhcha-in
Owner

Thanks for the PR :)

Will test it tomorrow.

@sukhcha-in sukhcha-in added the enhancement New feature or request label May 14, 2025
@sukhcha-in
Owner

Hi @arrrrny, I can see the use of _self in ParserType.attribute.
Can you provide example usage of _self in ParserType.element and ParserType.text?

You can add an example to this test file:

import 'package:dart_web_scraper/dart_web_scraper.dart';
import 'package:html/parser.dart';

Map<String, List<Config>> configMap = {
  'example.com': testConfig,
};

List<Config> testConfig = [
  Config(
    usePassedUserAgent: true,
    parsers: {
      "main": [
        Parser(
          id: 'products',
          parent: ['_root'],
          type: ParserType.element,
          selector: ['div[role="listitem"]'],
          multiple: true,
        ),
        Parser(
          id: 'name',
          parent: ['products'],
          type: ParserType.text,
          selector: ['div.product'],
        ),
        Parser(
          id: 'sponsored',
          parent: ['products'],
          type: ParserType.attribute,
          selector: ['_self::data-component-type', '_self::class'],
          cleaner: (data, debug) {
            return data.obj == "sponsored";
          },
        )
      ],
    },
    urlTargets: [
      UrlTarget(
        name: 'main',
        where: [
          "/",
        ],
      ),
    ],
  ),
];

void main() async {
  Uri url = Uri.parse("https://example.com");

  Config? config = getConfig(
    url,
    configs: configMap,
  );
  if (config == null) {
    print("Unsupported URL");
    return;
  }

  WebParser webParser = WebParser();

  Map<String, Object> parsed = await webParser.parse(
    scrapedData: Data(
      url,
      parse(scrapedData),
    ),
    config: config,
    debug: true,
  );

  print(parsed);
}

String scrapedData = """
<!DOCTYPE html>
<html lang="en">
<head>
	<meta charset="UTF-8">
	<title>Test</title>
</head>
<body>
    <div role="listitem" class="sponsored" data-component-type="sponsored">
        <div class="product">Product 1</div>
    </div>
    <div role="listitem" class="sponsored">
        <div class="product">Product 2</div>
    </div>
    <div role="listitem" data-component-type="sponsored">
        <div class="product">Product 3</div>
    </div>
    <div role="listitem">
        <div class="product">Product 4</div>
    </div>
</body>
</html>
""";
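
For readers following the `_self::` selectors in the test file above, here is a minimal sketch of how such selectors might resolve against the four sample list items. It is not the dart_web_scraper implementation: `resolveSelfAttribute` and `selfPrefix` are hypothetical names, the elements are modelled as plain attribute maps, and it assumes selectors are tried in order with the first match winning.

```dart
// Hypothetical sketch; not the dart_web_scraper implementation.

/// Prefix that marks a selector as targeting the already-extracted element.
const selfPrefix = '_self::';

/// Tries each `_self::name` selector in order and returns the first
/// attribute value found, or null when none matches.
/// Assumption: selectors are tried in order and the first hit wins.
String? resolveSelfAttribute(
  Map<String, String> attributes,
  List<String> selectors,
) {
  for (final selector in selectors) {
    // Only `_self::` selectors are handled in this sketch.
    if (!selector.startsWith(selfPrefix)) continue;
    final name = selector.substring(selfPrefix.length);
    final value = attributes[name];
    if (value != null) return value;
  }
  return null;
}

void main() {
  // Attribute maps mirroring the four div[role="listitem"] elements above.
  final items = <Map<String, String>>[
    {'role': 'listitem', 'class': 'sponsored', 'data-component-type': 'sponsored'},
    {'role': 'listitem', 'class': 'sponsored'},
    {'role': 'listitem', 'data-component-type': 'sponsored'},
    {'role': 'listitem'},
  ];
  const selectors = ['_self::data-component-type', '_self::class'];

  for (final attrs in items) {
    // Same comparison as the cleaner in the test config.
    final value = resolveSelfAttribute(attrs, selectors);
    print(value == 'sponsored');
  }
}
```

Under these assumptions the first three items would be flagged as sponsored and the fourth would not, because the second item falls back to its `class` attribute when `data-component-type` is absent.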
