kaboc/dart_text_parser

A Dart package for parsing text flexibly according to preset or custom regular expression patterns.

Usage

Using the preset matchers (URL / email address / phone number)

The package has the following preset matchers:

  • EmailMatcher
  • UrlMatcher
  • UrlLikeMatcher
  • TelMatcher

The example below uses three of them, all except UrlLikeMatcher.

import 'package:text_parser/text_parser.dart';

void main() {
  const text = 'abc https://example.com/sample.jpg. def\n'
      'john.doe@example.com +1-012-3456-7890';

  final parser = TextParser(
    matchers: const [
      EmailMatcher(),
      UrlMatcher(),
      TelMatcher(),
    ],
  );
  final elements = parser.parseSync(text);
  elements.forEach(print);
}

Output:

TextElement(matcherType: TextMatcher, matcherIndex null, offset: 0, text: abc , groups: [])
TextElement(matcherType: UrlMatcher, matcherIndex 1, offset: 4, text: https://example.com/sample.jpg, groups: [])
TextElement(matcherType: TextMatcher, matcherIndex null, offset: 34, text: . def\n, groups: [])
TextElement(matcherType: EmailMatcher, matcherIndex 0, offset: 40, text: john.doe@example.com, groups: [])
TextElement(matcherType: TextMatcher, matcherIndex null, offset: 60, text:  , groups: [])
TextElement(matcherType: TelMatcher, matcherIndex 2, offset: 61, text: +1-012-3456-7890, groups: [])

The regular expression patterns of the preset matchers are not very strict. If one does not fit your use case, overwrite the pattern to make it stricter.

parse() vs parseSync()

parseSync() executes parsing synchronously. To prevent parsing from blocking the UI in Flutter or pausing other tasks in pure Dart, use parse() instead. How parse() runs depends on useIsolate:

  • useIsolate: false
    • Parsing is scheduled as a microtask.
  • useIsolate: true (default)
    • Parsing is executed in an isolate.
    • On Flutter Web, this is treated the same as useIsolate: false since dart:isolate is not supported on the platform.
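
The difference between the two modes can be pictured with plain Dart. The sketch below is not the package's implementation: findNumbers, viaMicrotask and viaIsolate are hypothetical names, and Isolate.run assumes Dart 2.19 or later on a platform that supports dart:isolate.

```dart
import 'dart:async';
import 'dart:isolate';

// Hypothetical stand-in for a synchronous parse: collects runs of digits.
List<String> findNumbers(String text) =>
    RegExp(r'\d+').allMatches(text).map((m) => m.group(0)!).toList();

// Deferred to a microtask: the caller continues first, but the work still
// runs on the calling isolate.
Future<List<String>> viaMicrotask(String text) =>
    Future.microtask(() => findNumbers(text));

// Runs in a separate isolate, so the calling isolate is not blocked while
// the work is in progress.
Future<List<String>> viaIsolate(String text) =>
    Isolate.run(() => findNumbers(text));

Future<void> main() async {
  print(await viaMicrotask('a1 b22 c333')); // [1, 22, 333]
  print(await viaIsolate('a1 b22 c333')); // [1, 22, 333]
}
```

A microtask only postpones the work, so a long parse can still delay rendering; an isolate avoids that at the cost of starting the isolate and copying the result back.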

UrlMatcher vs UrlLikeMatcher

UrlMatcher matches only URLs starting with "http", so it does not match strings such as example.com or //example.com. If you want those to be matched too, use UrlLikeMatcher instead.

matcherType and matcherIndex

matcherType contained in a TextElement object is the type of the matcher that was used to extract the element. matcherIndex is the index of the matcher in the matcher list passed to the matchers argument of TextParser.

Extracting only matching text elements

By default, the result of parse() or parseSync() contains all elements, including the ones that have TextMatcher as matcherType. Those are the parts of the text that did not match any pattern. To exclude them, pass onlyMatches: true when calling parse() or parseSync().

final elements = parser.parseSync(text, onlyMatches: true);
elements.forEach(print);

Output:

TextElement(matcherType: UrlMatcher, matcherIndex 1, offset: 4, text: https://example.com/sample.jpg, groups: [])
TextElement(matcherType: EmailMatcher, matcherIndex 0, offset: 40, text: john.doe@example.com, groups: [])
TextElement(matcherType: TelMatcher, matcherIndex 2, offset: 61, text: +1-012-3456-7890, groups: [])
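
To make the role of offset and onlyMatches concrete, here is a minimal sketch using only dart:core. Segment and splitByPattern are hypothetical names, not part of the package, and the sketch handles a single pattern only. It shows that dropping the unmatched parts does not change the offsets of the matched ones, since an offset is a position in the original text.

```dart
// Hypothetical, simplified counterpart of a parsed element.
class Segment {
  const Segment(this.offset, this.text, this.isMatch);

  final int offset;
  final String text;
  final bool isMatch;
}

// Splits text into matched and unmatched segments in order of appearance.
// With onlyMatches: true, unmatched segments are skipped.
List<Segment> splitByPattern(
  String text,
  RegExp pattern, {
  bool onlyMatches = false,
}) {
  final segments = <Segment>[];
  var cursor = 0;
  for (final m in pattern.allMatches(text)) {
    if (m.start > cursor && !onlyMatches) {
      segments.add(Segment(cursor, text.substring(cursor, m.start), false));
    }
    segments.add(Segment(m.start, m.group(0)!, true));
    cursor = m.end;
  }
  if (cursor < text.length && !onlyMatches) {
    segments.add(Segment(cursor, text.substring(cursor), false));
  }
  return segments;
}

void main() {
  final digits = RegExp(r'\d+');
  const text = 'ab 12 cd 345';

  final all = splitByPattern(text, digits);
  print(all.map((s) => s.offset).toList()); // [0, 3, 5, 9]

  final matches = splitByPattern(text, digits, onlyMatches: true);
  print(matches.map((s) => s.offset).toList()); // [3, 9]
}
```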

Extracting text elements of a particular matcher type

final telElements = elements.whereMatcherType<TelMatcher>().toList();

Or filter the classic way with where():

final telElements = elements.where((elm) => elm.matcherType == TelMatcher).toList();

Conflict between matchers

If multiple matchers match the string at the same position in the text, the one that comes first in the matchers list takes precedence.

final parser = TextParser(matchers: const [UrlLikeMatcher(), EmailMatcher()]);
final elements = parser.parseSync('foo.bar@example.com');

In this example, UrlLikeMatcher matches foo.bar and EmailMatcher matches foo.bar@example.com, but UrlLikeMatcher is used because it is written before EmailMatcher in the matchers list.
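
The precedence rule can be sketched with plain RegExp objects. This is only an illustration under simplified assumptions, not the package's actual algorithm: firstWinner is a hypothetical helper, and the two patterns are rough stand-ins for a URL-like and an email pattern.

```dart
// Returns the match that starts earliest in text. When several patterns
// match at the same position, the one listed first is kept.
Match? firstWinner(String text, List<RegExp> patterns) {
  Match? best;
  for (final pattern in patterns) {
    final m = pattern.firstMatch(text);
    if (m == null) continue;
    // Only a strictly earlier start replaces the current best, so a tie
    // keeps the pattern that appeared earlier in the list.
    if (best == null || m.start < best.start) best = m;
  }
  return best;
}

void main() {
  final urlLike = RegExp(r'[a-z]+\.[a-z]+'); // rough stand-in
  final email = RegExp(r'[\w.]+@[\w.]+'); // rough stand-in
  const text = 'foo.bar@example.com';

  print(firstWinner(text, [urlLike, email])?.group(0)); // foo.bar
  print(firstWinner(text, [email, urlLike])?.group(0)); // foo.bar@example.com
}
```

Swapping the order of the patterns is enough to change which one wins, so list the more specific matcher first.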

Overwriting the pattern of a preset matcher

If you want to parse a sequence of eleven digits after "tel:" as a phone number:

TelMatcher(r'(?<=tel:)\d{11}')
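
The pattern relies on a lookbehind, (?<=tel:), so the "tel:" prefix is required but is not part of the matched text. The behavior of the pattern itself can be checked with dart:core alone; extractTel is a hypothetical helper used only for this illustration.

```dart
final telPattern = RegExp(r'(?<=tel:)\d{11}');

// Returns the eleven digits that directly follow "tel:", or null.
String? extractTel(String text) => telPattern.firstMatch(text)?.group(0);

void main() {
  print(extractTel('Call tel:01234567890 now')); // 01234567890
  print(extractTel('Call 01234567890 now')); // null
}
```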

Using a custom pattern

You can create a matcher with a custom pattern either with PatternMatcher or by extending TextMatcher.

PatternMatcher

const boldMatcher = PatternMatcher(r'\*\*(.+?)\*\*');
final parser = TextParser(matchers: [boldMatcher]);

Custom matcher class

It is also possible to create a matcher class by extending TextMatcher.

Below is an example of a matcher that parses HTML <a> tags into a pair of the href value and the link text.

class ATagMatcher extends TextMatcher {
  const ATagMatcher()
      : super(
          r'\<a\s(?:.+?\s)*?href="(.+?)".*?\>'
          r'\s*(.+?)\s*'
          r'\</a\>',
        );
}

const text = '''
<a class="foo" href="https://example.com/">
  Content inside tags
</a>
''';

final parser = TextParser(
  matchers: const [ATagMatcher()],
  dotAll: true,
);
final elements = parser.parseSync(text, onlyMatches: true);
print(elements.first.groups);

Output:

[https://example.com/, Content inside tags]

ExactMatcher

ExactMatcher escapes the reserved characters of RegExp so that they are treated as literal characters. The parser extracts the substrings that exactly match any of the strings in the list passed as the argument.

TextParser(
  matchers: [
    // 'e.g.' matches only 'e.g.', not 'edge' nor 'eggs'.
    ExactMatcher(['e.g.', 'i.e.']),
  ],
)
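
The effect of escaping can be reproduced with RegExp.escape from dart:core. The sketch below only illustrates the idea and is not necessarily how ExactMatcher is implemented; exactPattern is a hypothetical helper.

```dart
// Builds an alternation in which every word is matched literally.
RegExp exactPattern(List<String> words) =>
    RegExp(words.map(RegExp.escape).join('|'));

void main() {
  final escaped = exactPattern(['e.g.', 'i.e.']);
  // Without escaping, each dot matches any character.
  final unescaped = RegExp('e.g.|i.e.');

  print(escaped.hasMatch('e.g.')); // true
  print(escaped.hasMatch('edge')); // false
  print(unescaped.hasMatch('edge')); // true
}
```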

Groups

Each TextElement in a parse result has a groups property. It is a list of the strings that matched the sub-patterns enclosed in each set of parentheses ( ).

Below is an example of a pattern that matches a Markdown style link.

r'\[(.+?)\]\((.*?)\)'

This pattern has two sets of parentheses: (.+?) in \[(.+?)\] and (.*?) in \((.*?)\). When this matches [foo](bar), the first set of parentheses captures "foo" and the second set captures "bar", so groups results in ['foo', 'bar'].

Tip:

To prevent certain parentheses from capturing a group, add ?: after the opening parenthesis, as in (?:pattern) instead of (pattern).
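
Both behaviors can be checked directly against dart:core's RegExp. This is a minimal sketch of how capturing works at the regular expression level; captureGroups is a hypothetical helper, not a package API.

```dart
// Returns the captured groups of the first match, or an empty list.
List<String?> captureGroups(RegExp pattern, String text) {
  final m = pattern.firstMatch(text);
  if (m == null) return const [];
  // Group 0 is the whole match, so captured groups start at index 1.
  return m.groups([for (var i = 1; i <= m.groupCount; i++) i]);
}

void main() {
  // Two capturing groups.
  print(captureGroups(RegExp(r'\[(.+?)\]\((.*?)\)'), '[foo](bar)'));
  // prints [foo, bar]

  // The first set of parentheses is non-capturing, so only one group remains.
  print(captureGroups(RegExp(r'\[(?:.+?)\]\((.*?)\)'), '[foo](bar)'));
  // prints [bar]
}
```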

Named groups

Named groups are captured too, but their names are lost in the resulting groups list. Below is an example where a single pattern contains both unnamed and named capturing groups.

final parser = TextParser(
  matchers: const [PatternMatcher(r'(?<year>\d{4})-(\d{2})-(?<day>\d{2})')],
);
final elements = parser.parseSync('2020-01-23');
print(elements.first);

Output:

TextElement(matcherType: PatternMatcher, matcherIndex 0, offset: 0, text: 2020-01-23, groups: [2020, 01, 23])
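
For comparison, the same pattern used directly with dart:core's RegExp numbers named and unnamed groups together by position, while the names remain available on RegExpMatch. This sketch shows plain RegExp behavior only; it does not imply that the package exposes the names.

```dart
final datePattern = RegExp(r'(?<year>\d{4})-(\d{2})-(?<day>\d{2})');

void main() {
  final m = datePattern.firstMatch('2020-01-23')!;

  // Positional access covers named and unnamed groups alike.
  print(m.groups([1, 2, 3])); // [2020, 01, 23]

  // Access by name is available only for the named groups.
  print(m.namedGroup('year')); // 2020
  print(m.namedGroup('day')); // 23
}
```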

RegExp options

How a regular expression is treated can be configured in the TextParser constructor.

  • multiLine
  • caseSensitive
  • unicode
  • dotAll

These options are passed to the constructor of RegExp internally, so refer to its documentation for details.
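
As a quick reference, the sketch below shows what each option changes when passed to RegExp directly. matches is a hypothetical helper whose default values mirror those of the RegExp constructor; the defaults used by TextParser are not assumed here.

```dart
// Thin wrapper around the RegExp constructor, for illustration only.
bool matches(
  String pattern,
  String input, {
  bool multiLine = false,
  bool caseSensitive = true,
  bool unicode = false,
  bool dotAll = false,
}) {
  return RegExp(
    pattern,
    multiLine: multiLine,
    caseSensitive: caseSensitive,
    unicode: unicode,
    dotAll: dotAll,
  ).hasMatch(input);
}

void main() {
  // multiLine: ^ and $ also match at line boundaries.
  print(matches(r'^b$', 'a\nb')); // false
  print(matches(r'^b$', 'a\nb', multiLine: true)); // true

  // caseSensitive: false ignores letter case.
  print(matches('abc', 'ABC')); // false
  print(matches('abc', 'ABC', caseSensitive: false)); // true

  // unicode: "." matches a whole code point instead of one UTF-16 code unit.
  print(matches(r'^.$', '\u{1F600}')); // false
  print(matches(r'^.$', '\u{1F600}', unicode: true)); // true

  // dotAll: "." also matches line terminators.
  print(matches('a.b', 'a\nb')); // false
  print(matches('a.b', 'a\nb', dotAll: true)); // true
}
```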
