Latest revision as of 01:11, 24 December 2020
Raku rules are the regular expression, string matching and general-purpose parsing facility of the Raku programming language, and are a core part of the language. Since Perl's pattern-matching constructs have exceeded the capabilities of formal regular expressions for some time, Raku documentation refers to them exclusively as regexes, distancing the term from the formal definition.
Raku provides a superset of Perl 5 features with respect to regexes, folding them into a larger framework called rules, which provide the capabilities of a parsing expression grammar, as well as acting as a closure with respect to their lexical scope. Rules are introduced with the rule keyword, which has a usage quite similar to subroutine definitions. Anonymous rules can be introduced with the regex (or rx) keyword, or simply be used inline as regexes were in Perl 5 via the m (matching) or s (substitution) operators.
History
In Apocalypse 5, a document outlining the preliminary design decisions for Raku pattern matching, Larry Wall enumerated 20 problems with the "current regex culture". Among these were that Perl's regexes were "too compact and 'cute'", had "too much reliance on too few metacharacters", "little support for named captures", "little support for grammars", and "poor integration with 'real' language".
Between late 2004 and mid-2005, a compiler for Raku-style rules, called the Parrot Grammar Engine (PGE), was developed for the Parrot virtual machine; it was later renamed to the more generic Parser Grammar Engine. PGE is a combination of runtime and compiler for Raku-style grammars that allows any Parrot-based compiler to use these tools for parsing, and also to provide rules to their runtimes.
Among other Raku features, support for named captures was added to Perl 5.10 in 2007.
In May 2012, the reference implementation of Raku, Rakudo, shipped its Rakudo Star monthly snapshot with a working JSON parser built entirely in Raku rules.
Changes from Perl 5
There are only six unchanged features from Perl 5's regexes:
- Literals: word characters (letters, numbers and underscore) matched literally
- Capturing: (...)
- Alternatives: |
- Backslash escape: \
- Repetition quantifiers: *, +, and ?, but not {m,n}
- Minimal matching suffix: *?, +?, ??
A few of the most powerful additions include:
- The ability to reference rules using <rulename> to build up entire grammars.
- A handful of commit operators that allow the programmer to control backtracking during matching.
The following changes greatly improve the readability of regexes:
- Simplified non-capturing groups: [...], which are the same as Perl 5's (?:...)
- Simplified code assertions: <?{...}>
- Whitespace can be included in a regex without being matched, which allows multiline regexes. Use \ (a backslash followed by a space) or ' ' to express literal whitespace.
- Extended regex formatting (Perl 5's /x) is now the default.
Implicit changes
Some of the features of Perl 5 regular expressions are more powerful in Raku because of their ability to encapsulate the expanded features of Raku rules. For example, in Perl 5, there were positive and negative lookahead operators (?=...) and (?!...). In Raku these same features exist, but are called <before ...> and <!before ...>.
However, because before can encapsulate arbitrary rules, it can be used to express lookahead as a syntactic predicate for a grammar. For example, the following parsing expression grammar describes the classic non-context-free language $\{ a^n b^n c^n : n \ge 1 \}$:
S ← &(A !b) a+ B
A ← a A? b
B ← b B? c
In Raku rules that would be:
rule S { <before <A> <!before b>> a+ <B> }
rule A { a <A>? b }
rule B { b <B>? c }
Of course, given the ability to mix rules and regular code, that can be simplified even further:
rule S { (a+) (b+) (c+) <?{ $0.chars == $1.chars == $2.chars }> }
However, this makes use of assertions, which is a subtly different concept in Raku rules, but more substantially different in parsing theory, making this a semantic rather than syntactic predicate. The most important difference in practice is performance. There is no way for the rule engine to know what conditions the assertion may match, so no optimization of this process can be made.
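The contrast between the two kinds of predicate can be sketched outside Raku. The following Python analogue is only an illustration under stated assumptions: the function names (nested, match_syntactic, match_semantic) are hypothetical, the first recognizer hand-codes the PEG above (including its lookahead), and the second uses an ordinary regular expression followed by a length check that plays the role of the semantic predicate.

```python
import re
from typing import Optional


def nested(s: str, i: int, opener: str, closer: str) -> Optional[int]:
    # PEG rule  X <- opener X? closer.
    # Returns the index just past the match, or None on failure.
    # As in a PEG, the optional inner X is greedy and is never retried.
    if i < len(s) and s[i] == opener:
        inner = nested(s, i + 1, opener, closer)
        j = inner if inner is not None else i + 1
        if j < len(s) and s[j] == closer:
            return j + 1
    return None


def match_syntactic(s: str) -> bool:
    # S <- &(A !b) a+ B, anchored at both ends of the input.
    # Syntactic predicate &(A !b): look ahead without consuming input.
    after_a = nested(s, 0, "a", "b")
    if after_a is None or s[after_a:after_a + 1] == "b":
        return False
    # a+ : the predicate already guarantees at least one "a".
    i = 0
    while s[i:i + 1] == "a":
        i += 1
    # B, followed by the end of the input.
    return nested(s, i, "b", "c") == len(s)


_ABC = re.compile(r"(a+)(b+)(c+)")


def match_semantic(s: str) -> bool:
    # Match the shape first, then run arbitrary code on the captures.
    # The regex engine knows nothing about this check, so it cannot
    # optimize around it.
    m = _ABC.fullmatch(s)
    return m is not None and len(m.group(1)) == len(m.group(2)) == len(m.group(3))
```

Both functions accept strings such as "abc" and "aabbcc" and reject "aabbc"; the second is shorter, but its final condition is opaque to the matcher, which is the performance point made above.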
Integration with Perl
In many languages, regular expressions are entered as strings, which are then passed to library routines that parse and compile them into an internal state. In Perl 5, regular expressions shared some of the lexical analysis with Perl's scanner. This simplified many aspects of regular expression usage, though it added a great deal of complexity to the scanner. In Raku, rules are part of the grammar of the language. Unlike in Perl 5, no separate parser exists for rules. This means that code embedded in rules is parsed at the same time as the rule itself and its surrounding code. For example, it is possible to nest rules and code without re-invoking the parser:
rule ab {
    (a.) # match "a" followed by any character
    # Then check to see if that character was "b"
    # If so, print a message.
    { $0 ~~ /b {say "found the b"}/ }
}
The above is a single block of Raku code that contains an outer rule definition, an inner block of assertion code, and inside of that a regex that contains one more level of assertion.
Implementation
Keywords
There are several keywords used in conjunction with Raku rules:
- regex: A named or anonymous regex that ignores whitespace within the regex by default.
- token: A named or anonymous regex that implies the :ratchet modifier.
- rule: A named or anonymous regex that implies the :ratchet and :sigspace modifiers.
- rx: An anonymous regex that takes arbitrary delimiters such as // where regex only takes braces.
- m: An operator form of anonymous regex that performs matches with arbitrary delimiters.
- mm: Shorthand for m with the :sigspace modifier.
- s: An operator form of anonymous regex that performs substitution with arbitrary delimiters.
- ss: Shorthand for s with the :sigspace modifier.
- /.../: Simply placing a regex between slashes is shorthand for rx/.../.
Here is an example of typical use:
token word { \w+ }
rule phrase { <word> * \. }
if $string ~~ / <phrase> \n / {
    ...
}
Modifiers
Modifiers may be placed after any of the regex keywords, and before the delimiter. If a regex is named, the modifier comes after the name. Modifiers control the way regexes are parsed and how they behave. They are always introduced with a leading : character.
Some of the more important modifiers include:
- :i or :ignorecase – Perform matching without respect to case.
- :m or :ignoremark – Perform matching without respect to combining characters.
- :g or :global – Perform the match more than once on a given target string.
- :s or :sigspace – Replace whitespace in the regex with a whitespace-matching rule, rather than simply ignoring it.
- :Perl5 – Treat the regex as a Perl 5 regular expression.
- :ratchet – Never perform backtracking in the rule.
For example:
regex addition { :ratchet :sigspace <term> \+ <expr> }
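For readers more familiar with other regex libraries, several of these modifiers have rough counterparts in Python's re module. The sketch below is an analogy only, not Raku behavior: the sigspace helper is hypothetical and approximates :sigspace by turning each run of whitespace in the pattern into \s+, which is cruder than Raku's whitespace rule; :ignoremark and :ratchet are not shown.

```python
import re


def sigspace(pattern: str) -> str:
    # Hypothetical helper: make whitespace in the pattern significant by
    # replacing each run of it with a whitespace-matching subpattern.
    return r"\s+".join(pattern.split())


# :ignorecase is comparable to the re.IGNORECASE flag.
ignorecase_hit = re.search(r"raku", "RAKU rules", re.IGNORECASE) is not None

# :global is comparable to collecting every non-overlapping match.
all_numbers = re.findall(r"\d+", "a1 b22 c333")

# Raku's default (whitespace in the pattern is ignored) is comparable
# to re.VERBOSE, the counterpart of Perl 5's /x.
whitespace_ignored = re.fullmatch(r"a b c", "abc", re.VERBOSE) is not None

# With significant whitespace, the gaps in the pattern must match
# whitespace in the input.
spaced_input_matches = re.fullmatch(sigspace("a b c"), "a  b\tc") is not None
unspaced_input_matches = re.fullmatch(sigspace("a b c"), "abc") is not None
```

Here the first four results are true or non-empty, while unspaced_input_matches is false, because "abc" contains no whitespace for the pattern's gaps to match.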
Grammars
A grammar may be defined using the grammar keyword. A grammar is essentially just a namespace for rules:
grammar Str::SprintfFormat {
    regex format_token { \%: <index>? <precision>? <modifier>? <directive> }
    token index { \d+ \$ }
    token precision { <flags>? <vector>? <precision_count> }
    token flags { <>+ }
    token precision_count { >\d* | \* ]? ]? }
    token vector { \*? v }
    token modifier { ll | <> }
    token directive { <> }
}
This is the grammar used to define Perl's sprintf string formatting notation.
Outside of this namespace, you could use these rules like so:
if / <Str::SprintfFormat::format_token> / { ... }
A rule used in this way is actually identical to the invocation of a subroutine with the extra semantics and side-effects of pattern matching (e.g., rule invocations can be backtracked).
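The idea of a rule as a subroutine whose invocation can be backtracked can be sketched with Python generators. This is a toy model under stated assumptions, not how any Raku implementation works: every name here (lit, star, seq, ratchet, full_match) is hypothetical, and a rule is modeled as a function that yields each position at which it could stop, so a caller can ask it for another alternative.

```python
from typing import Callable, Iterator

# A rule takes the input and a start index and yields every end index
# at which it can match, in order of preference.
Rule = Callable[[str, int], Iterator[int]]


def lit(text: str) -> Rule:
    def rule(s: str, i: int) -> Iterator[int]:
        if s.startswith(text, i):
            yield i + len(text)
    return rule


def star(inner: Rule) -> Rule:
    def rule(s: str, i: int) -> Iterator[int]:
        for j in inner(s, i):
            if j > i:                  # guard against zero-width loops
                yield from rule(s, j)  # longer matches are offered first
        yield i                        # finally, zero repetitions
    return rule


def seq(*rules: Rule) -> Rule:
    def rule(s: str, i: int) -> Iterator[int]:
        if not rules:
            yield i
            return
        rest = seq(*rules[1:])
        for j in rules[0](s, i):       # re-entering this loop is backtracking
            yield from rest(s, j)
    return rule


def ratchet(inner: Rule) -> Rule:
    # Keep only the first alternative, so the rule cannot be backtracked.
    def rule(s: str, i: int) -> Iterator[int]:
        for j in inner(s, i):
            yield j
            return
    return rule


def full_match(rule: Rule, s: str) -> bool:
    return any(j == len(s) for j in rule(s, 0))
```

In this model, full_match(seq(star(lit("a")), lit("a")), "aaa") succeeds because the starred rule is asked to give back one "a", whereas wrapping the starred rule in ratchet makes the same match fail, which mirrors the effect described for the :ratchet modifier.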
Examples
Here are some example rules in Raku:
rx { a (d | e) f : g }
rx { (ab*) <?{ $0.chars % 2 == 0 }> }
That last is identical to:
rx { (ab[bb]*) }
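The claimed equivalence can be checked by brute force. The sketch below is a Python analogue under stated assumptions, with hypothetical function names: it treats the code assertion as "the text matched by ab* has even length", which amounts to an "a" followed by an odd number of "b" characters, and compares that against a pattern written without any embedded code.

```python
import re
from itertools import product

WITH_ASSERTION = re.compile(r"ab*")     # shape only; length is checked in code
PLAIN = re.compile(r"ab(?:bb)*")        # same language, no embedded code


def accepts_with_assertion(text: str) -> bool:
    return WITH_ASSERTION.fullmatch(text) is not None and len(text) % 2 == 0


def accepts_plain(text: str) -> bool:
    return PLAIN.fullmatch(text) is not None


def agree_up_to(max_len: int) -> bool:
    # Compare the two recognizers on every string over {a, b} up to max_len.
    for n in range(max_len + 1):
        for chars in product("ab", repeat=n):
            text = "".join(chars)
            if accepts_with_assertion(text) != accepts_plain(text):
                return False
    return True
```

Calling agree_up_to(8) compares the two recognizers on all 511 candidate strings; agreement on a bounded set is evidence for the equivalence rather than a proof of it.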
References
- Wall, Larry (June 24, 2002). "Synopsis 5: Regexes and Rules".
- Wall, Larry (June 4, 2002). "Apocalypse 5: Pattern Matching".
- Perl 5.10 now available - Perl Buzz Archived 2008-01-09 at the Wayback Machine
- moritz (May 5, 2012). "Rakudo Star 2012.05 released".
External links
- Raku Grammars - The reference manual page for grammars.
- Grammar tutorial - A tutorial for grammars in Raku
- Synopsis 05 - The standards document covering Perl 6 regexes and rules.
- Perl 6 Regex Introduction - Gentle introduction to Perl 6 regexes.