Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Match start and end of string #13

Open
martinflack opened this issue Mar 20, 2022 · 5 comments
Open

Match start and end of string #13

martinflack opened this issue Mar 20, 2022 · 5 comments

Comments

@martinflack
Copy link

Great project. Is it intentional that a caret ^ does not match the start of the string and a dollar $ does not match the end? Wikipedia seems to think those are included in POSIX regex.

@no-defun-allowed
Copy link
Collaborator

It's currently only intentional for as long as I'm still musing on the nicest way to represent it in a DFA. Gilbert told me it's possible to handle arbitrary lookahead and lookbehind with derivatives, but I still don't get his description. Or I could generalise derivatives to consume boundary information as well as characters.

@madnificent
Copy link

Is something like this on the virtual in-your-head roadmap? I want to benchmark OMRN for matching of terminals of an LL(1) language and therefore only care about the longest match from START.

@no-defun-allowed
Copy link
Collaborator

It is, but I'm not sure of just what to do. At the moment it's possible I could have a function that just scans once from the start, but the issue is about handling ^ and $ which is more general, and there are a couple of ways to do them.

  • The DFA and derivatives could be augmented to check conditions like start, end, start-of-word, end-of-word, etc.
  • More general lookahead/behind could be implemented, if very careful.

A way to match just at the start wouldn't be too hard still. Rather than wrapping every RE in grep as we do, just not wrapping suffices to get a machine that scans only from the start.

@madnificent
Copy link

I don't know my way through the code well enough to give this a stab. Could you point me to the right lines to edit or some branch so I could run an initial vague benchmark for my case?

@no-defun-allowed
Copy link
Collaborator

no-defun-allowed commented Jun 13, 2022

Here is a huge kludge to make OMRN generate code to only scan at the start of string. Note that it makes all regular expressions only match at the start of the string; I'd need to make a separate caches for lexing(?) and grepping functions which requires more changes. Also please pull from master as I needed to not hard-code one thing that should be a use of the (restart) local macro, for this to work.

(in-package :one-more-re-nightmare)

(defclass scan-from-start ()
  ())

;;; Copied verbatim from the START-CODE method for SCAN-EVERYTHING. Oh no 
(defmethod start-code ((strategy scan-from-start) states)
  (destructuring-bind (state) states
    (let ((expression (state-expression state)))
      (cond
        ((state-never-succeeds-p state)
         ;; Just return immediately if we're told to match nothing.
         `(start (return)))
        ((re-empty-p expression)
         ;; Succeed for every character?
         `(start
           (cond
             ((> position end)
              (return))
             (t
              (incf position)
              ,(setf-from-assignments (state-exit-effects state))
              (win ,@(win-locations (state-exit-map state)))))))
        (t
         `(start
           (setf position start)
           (go ,(find-state-name state :bounds-check))))))))

(defmethod initial-states ((strategy scan-from-start) expression)
  ;; Don't do parallel GREP scanning stuff.
  (list (alpha (add-tags expression) (empty-set))))

(defmethod macros-for-strategy append ((strategy scan-from-start))
  '((restart ()
     ;; Don't actually restart the DFA.
     '(return))))

(defun make-default-strategy (layout expression)
  (declare (ignore layout expression))
  (make-instance (dynamic-mixins:mix 'scan-from-start 'call-continuation)))

After which:

CL-USER> (one-more-re-nightmare:first-match "f" "food")
#(0 1)
CL-USER> (one-more-re-nightmare:first-match "f" "oodf")
NIL

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants