library(pcre) provides access to Perl Compatible Regular Expressions.
The core facility for string matching in Prolog is provided by DCGs (Definite Clause Grammars). Using DCGs is typically more verbose, but in return gives reuse, modularity, readability and mixing with arbitrary Prolog code. Supporting regular expressions has some advantages: (1) in simple cases the terse specification of a regular expression is more comfortable, (2) many programmers are familiar with them and (3) regular expressions are part of domain specific languages one may wish to implement in Prolog, e.g., SPARQL.
There are roughly three options for adding regular expressions to
Prolog. One is to simply interpret them in Prolog. Given Prolog's
unification and backtracking facilities this is remarkably simple and
performs quite reasonably. Still, implementing all facilities of
modern regular expression engines requires significant effort.
Alternatively, we can compile them into DCGs. This brings terse
expressions to DCGs while staying in the same framework. The
disadvantage is that regular expressions become programs that are hard
to reclaim, making this approach less attractive for applications that
potentially execute many different regular expressions. The final option
is to wrap an existing regular expression engine. This provides access
to a robust implementation for which we only have to document the Prolog
binding. That is the option taken by library(pcre).
This module provides an interface to the PCRE (Perl Compatible Regular Expression) library. This Prolog interface provides an almost comprehensive wrapper around PCRE.
Regular expressions are created from a pattern and options and represented as a SWI-Prolog blob. This implies they are subject to (atom) garbage collection. Compiled regular expressions can safely be used in multiple threads. Most predicates accept an explicitly compiled regular expression, a pattern or a term Pattern/Flags. In the latter two cases a regular expression blob is created and stored in a cache. The cache can be cleared using re_flush/0.
?- re_match("^needle"/i, "Needle in a haystack").
true.
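The two calling styles described above can be sketched as follows. This is a minimal illustration, not part of the library: the predicate names starts_with_needle/1 and starts_with_needle_cached/1 are hypothetical, and the sketch assumes that the compile option caseless(true) has the same effect as the i flag.

```prolog
:- use_module(library(pcre)).

% Style 1: compile explicitly, then match with the resulting blob.
% Assumption: caseless(true) is the option behind the `i` flag.
starts_with_needle(String) :-
    re_compile("^needle", Regex, [caseless(true)]),
    re_match(Regex, String).

% Style 2: pass Pattern/Flags directly; the library compiles the
% pattern and keeps the blob in its cache.
starts_with_needle_cached(String) :-
    re_match("^needle"/i, String).
```

Explicit compilation may be preferable when the same expression is reused many times and the caller wants to control the lifetime of the blob itself.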
Options:
If true, match only at the first position.
... (default false)
If anycrlf, \R only matches CR, LF or CRLF. If unicode, \R matches all Unicode line endings.
Subject string is the end of a line (default false).
... (default true)
... (default true)
... (default false)
If any, recognize any Unicode newline sequence; if anycrlf, recognize CR, LF, and CRLF as newline sequences; if cr, recognize CR; if lf, recognize LF; and finally if crlf, recognize CRLF as newline.
Regex is the output of re_compile/3, a pattern or a term Pattern/Flags, where Pattern is an atom or string. The defined flags and their related options for re_compile/3 are below.
capture_type(Type) option passed to re_compile/3, may be specified using flags if Regex is of the form Pattern/Flags and may be specified at the level of individual captures using a naming convention for the capture name. See re_compile/3 for details.
The example below exploits the typed groups to parse a date specification:
?- re_matchsub("(?<date> (?<year_I>(?:\\d\\d)?\\d\\d) - (?<month_I>\\d\\d) - (?<day_I>\\d\\d) )"/e,
               "2017-04-20", Sub, []).
Sub = re_match{0:"2017-04-20", date:"2017-04-20", day:20, month:4, year:2017}.
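As a hedged sketch of how the typed groups might be used from a program, the hypothetical predicate parse_date/2 below picks the integer fields out of the result dict. It uses a compact pattern without white space, so no flags are needed, and it assumes the dict keys are the group names with the type suffix removed, as in the example above.

```prolog
:- use_module(library(pcre)).

% parse_date(+String, -Date)
% Date is date(Year, Month, Day), where each field is an integer
% because the named groups carry the _I type suffix.
parse_date(String, date(Year, Month, Day)) :-
    re_matchsub("(?<year_I>\\d{4})-(?<month_I>\\d\\d)-(?<day_I>\\d\\d)",
                String, Sub, []),
    get_dict(year, Sub, Year),
    get_dict(month, Sub, Month),
    get_dict(day, Sub, Day).
```

A query such as parse_date("2017-04-20", Date) is then expected to bind Date to a date/3 term with numeric arguments.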
call(Goal, Dict1, V0, V1),
call(Goal, Dict2, V1, V2),
...
call(Goal, Dictn, Vn, V).
This predicate is used to implement re_split/4 and re_replace/4. For example, we can count all matches of a Regex on String using this code:
re_match_count(Regex, String, Count) :-
    re_foldl(increment, Regex, String, 0, Count, []).

increment(_Match, V0, V1) :-
    V1 is V0+1.
After which we can query
?- re_match_count("a", "aap", X).
X = 2.
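The same folding scheme can collect the matched text instead of counting. The sketch below is illustrative only: re_all_matches/3 and collect_match/3 are hypothetical names, and the accumulator is threaded as a difference list, so the initial value is the result list and the final value is the empty list.

```prolog
:- use_module(library(pcre)).

% re_all_matches(+Regex, +String, -Matches)
% Matches is the list of all substrings of String matched by Regex,
% in order of occurrence.
re_all_matches(Regex, String, Matches) :-
    re_foldl(collect_match, Regex, String, Matches, [], []).

% Key 0 of the match dict holds the text of the complete match.
collect_match(Match, [Text|Tail], Tail) :-
    get_dict(0, Match, Text).
```

Threading a difference list keeps the matches in their original order without a final reverse/2.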
?- re_split("a+", "abaac", Split, []).
Split = ["","a","b","aa","c"].

?- re_split(":\\s*"/n, "Age: 33", Split, []).
Split = ['Age', ': ', 33].
Pattern is the pattern text, optionally followed by /Flags. Similar to re_matchsub/4, the final output type can be controlled by a flag a (atom), s (string, default) or n (number if possible, atom otherwise).
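Because re_split/4 returns the separators as well as the text between them, callers that only want the fields need to drop every second element. A minimal sketch, using the hypothetical helpers split_fields/3 and every_other/2:

```prolog
:- use_module(library(pcre)).

% split_fields(+Regex, +String, -Fields)
% Fields holds the text between the matches of Regex, without the
% separators themselves.
split_fields(Regex, String, Fields) :-
    re_split(Regex, String, Parts, []),
    every_other(Parts, Fields).

% every_other(+List, -Odd): keep the 1st, 3rd, 5th, ... elements.
every_other([], []).
every_other([Field], [Field]).
every_other([Field, _Separator|Rest], [Field|Fields]) :-
    every_other(Rest, Fields).
```

For example, split_fields(",\\s*", "a, b,c", Fields) is expected to yield the three fields without the commas.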
\N or $Name. Both N and Name may be written as {N} and {Name} to avoid ambiguities.
Pattern is the pattern text, optionally followed by /Flags. Flags may include g, replacing all occurrences of Pattern. In addition, similar to re_matchsub/4, the final output type can be controlled by a flag a (atom) or s (string, default).
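The effect of the g flag can be sketched with a small white space normaliser. The predicate name squeeze_spaces/2 is hypothetical; the sketch only relies on re_replace/4 with a Pattern/Flags term and a literal replacement.

```prolog
:- use_module(library(pcre)).

% squeeze_spaces(+String, -Clean)
% Replace every run of white space in String by a single blank.
% Without the g flag only the first run would be replaced.
squeeze_spaces(String, Clean) :-
    re_replace("\\s+"/g, " ", String, Clean).
```

With the default output type the result is a string, so squeeze_spaces("a   b \t c", S) is expected to bind S to "a b c".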
regex (see blob/2). Defined Options are listed below. Please consult the PCRE documentation for details.
If anycrlf, \R only matches CR, LF or CRLF. If unicode, \R matches all Unicode line endings.
If true, do caseless matching.
If true, $ not to match newline at end.
If true, . matches anything including NL.
If true, allow duplicate names for subpatterns.
If true, ignore white space and # comments.
If true, PCRE extra features (not much use currently).
If true, force matching to be before newline.
If javascript, JavaScript compatibility.
If true, ^ and $ match newlines within data.
If any, recognize any Unicode newline sequence; if anycrlf (default), recognize CR, LF, and CRLF as newline sequences; if cr, recognize CR; if lf, recognize LF; and finally if crlf, recognize CRLF as newline.
If true, use Unicode properties for \d, \w, etc.
If true, invert greediness of quantifiers.

In addition to the options above that directly map to PCRE flags, the following options are processed:
If true, study the regular expression.
Start-Length. Note that we use Start-Length rather than the more conventional Start-End to allow for immediate use with sub_atom/5 and sub_string/5.
The capture_type specifies the default for this pattern. The interface supports a different type for each named group using the syntax (?<name_T>...), where T is one of S (string), A (atom), I (integer), F (float), N (number), T (term) and R (range). In the current implementation I, F and N are synonyms for T. Future versions may act differently if the parsed value is not of the requested numeric type.
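The interplay between the pattern-wide default and a per-group type can be sketched as follows. The predicate parse_setting/3 is hypothetical, and the sketch assumes that capture_type(atom) is accepted as a compile option that sets the default for all captures, while the _I suffix overrides it for one group.

```prolog
:- use_module(library(pcre)).

% parse_setting(+String, -Key, -Value)
% Key is returned as an atom (the pattern default), Value as an
% integer (the per-group _I suffix).
parse_setting(String, Key, Value) :-
    re_compile("(?<key>\\w+)=(?<value_I>\\d+)", Regex,
               [capture_type(atom)]),
    re_matchsub(Regex, String, Sub, []),
    get_dict(key, Sub, Key),
    get_dict(value, Sub, Value).
```

Under these assumptions, parse_setting("width=42", Key, Value) would give an atom for Key and a number for Value.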
PCRE_CONFIG_* constant after removing PCRE_CONFIG_ and mapping the name to lower case, e.g. utf8, unicode_properties, etc. Value is either a Prolog boolean, integer or atom. Finally, the functionality of pcre_version() is available using the configuration name version.
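A minimal sketch of querying the version, assuming the configuration predicate is re_config/1 and that it takes a term of the form Name(Value). The wrapper name pcre_library_version/1 is hypothetical.

```prolog
:- use_module(library(pcre)).

% pcre_library_version(-Version)
% Version is whatever the underlying PCRE library reports through
% the configuration name `version`.
pcre_library_version(Version) :-
    re_config(version(Version)).
```

The exact form of Version depends on the installed PCRE library, so callers should not rely on a particular layout.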