This module provides an interface to the PCRE (Perl Compatible Regular Expression) library. This Prolog interface provides an almost comprehensive wrapper around PCRE.
Regular expressions are created from a pattern and options and represented as a SWI-Prolog blob. This implies they are subject to (atom) garbage collection. Compiled regular expressions can safely be used in multiple threads. Most predicates accept an explicitly compiled regular expression, a pattern, or a term Pattern/Flags. In the latter two cases a regular expression blob is created and stored in a cache. The cache can be cleared using re_flush/0.
?- re_match("^needle"/i, "Needle in a haystack").
true.
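When the same pattern is applied to many strings, it can be compiled once and the resulting blob passed around explicitly instead of relying on the cache. The following is a minimal sketch under stated assumptions: count_matching/3 is a hypothetical helper, and caseless(true) is taken to be the re_compile/3 option that corresponds to the i flag used above.

```prolog
:- use_module(library(pcre)).
:- use_module(library(apply)).   % include/3

% count_matching(+Pattern, +Strings, -Count)
% Hypothetical helper: compile Pattern once (caseless) and count how
% many elements of Strings match the compiled regular expression.
count_matching(Pattern, Strings, Count) :-
    re_compile(Pattern, Re, [caseless(true)]),
    include(re_match(Re), Strings, Matching),
    length(Matching, Count).
```

A query such as count_matching("^needle", ["Needle in a haystack", "no needle here"], N) should then bind N to the number of strings that start with "needle" in any letter case.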
Options:
anchored(Bool)
    If true, match only at the first position.
bol(Bool)
    Subject string is the beginning of a line (default false).
bsr(Mode)
    If anycrlf, \R only matches CR, LF or CRLF. If unicode, \R matches all Unicode line endings.
empty(Bool)
    An empty string is a valid match (default true).
empty_atstart(Bool)
    An empty string at the start of the subject is a valid match (default true).
eol(Bool)
    Subject string is the end of a line (default false).
newline(Mode)
    If any, recognize any Unicode newline sequence; if anycrlf, recognize CR, LF, and CRLF as newline sequences; if cr, recognize CR; if lf, recognize LF; and finally if crlf, recognize CRLF as newline.
Regex | is the output of re_compile/3, a pattern or a term Pattern/Flags, where Pattern is an atom or string. The defined flags and their related options for re_compile/3 are below. |
The capture_type(Type) option, normally passed to re_compile/3, may also be specified using flags if Regex is of the form Pattern/Flags, and may be specified at the level of individual captures using a naming convention for the capture name. See re_compile/3 for details.
The example below exploits the typed groups to parse a date specification:
?- re_matchsub("(?<date> (?<year_I>(?:\\d\\d)?\\d\\d) - (?<month_I>\\d\\d) - (?<day_I>\\d\\d) )"/e,
               "2017-04-20", Sub, []).
Sub = re_match{0:"2017-04-20", date:"2017-04-20", day:20, month:4, year:2017}.
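The resulting dict can be taken apart with the standard get_dict/3. As a hedged sketch, the hypothetical predicate iso_date/2 below wraps a simplified, whitespace-free variant of the pattern above and returns a date(Year, Month, Day) term. It assumes, as the example output suggests, that the _I suffix is stripped from the key and the captured text is converted to a number.

```prolog
:- use_module(library(pcre)).

% iso_date(+String, -Date)
% Hypothetical helper: Date is date(Year, Month, Day) if String has the
% form YYYY-MM-DD. The typed groups are expected to yield numbers.
iso_date(String, date(Year, Month, Day)) :-
    re_matchsub("^(?<year_I>\\d{4})-(?<month_I>\\d\\d)-(?<day_I>\\d\\d)$",
                String, Sub, []),
    get_dict(year, Sub, Year),
    get_dict(month, Sub, Month),
    get_dict(day, Sub, Day).
```

The predicate fails when the input does not have the expected shape, so it can double as a validity check.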
call(Goal, Dict1, V0, V1),
call(Goal, Dict2, V1, V2),
...
call(Goal, Dictn, Vn, V).
This predicate is used to implement re_split/4 and re_replace/4. For example, we can count all matches of a Regex on String using this code:
re_match_count(Regex, String, Count) :-
    re_foldl(increment, Regex, String, 0, Count, []).

increment(_Match, V0, V1) :-
    V1 is V0+1.
After which we can query:
?- re_match_count("a", "aap", X).
X = 2.
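The same folding scheme can collect the matches rather than count them. The sketch below threads a difference list through the fold: V0 starts as the unbound result list and each call fills in one element. The names all_matches/3 and add_match/3 are hypothetical, and the code assumes the whole match is stored under the integer key 0, as in the re_matchsub/4 example.

```prolog
:- use_module(library(pcre)).

% all_matches(+Regex, +String, -Matches)
% Hypothetical helper: Matches is the list of all substrings of String
% matched by Regex, in order of appearance.
all_matches(Regex, String, Matches) :-
    re_foldl(add_match, Regex, String, Matches, [], []).

% Each step binds the open list to [Match|Tail] and passes Tail on.
% The final tail is closed with [] by the call above.
add_match(Dict, [Match|Tail], Tail) :-
    get_dict(0, Dict, Match).
```

With the default capture type, all_matches("\\d+", "a1b22c333", L) is expected to produce a list of strings.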
?- re_split("a+", "abaac", Split, []).
Split = ["","a","b","aa","c"].

?- re_split(":\\s*"/n, "Age: 33", Split, []).
Split = ['Age', ': ', 33].
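As the examples show, the result alternates between the text outside the matches and the matched separators. When only the fields are of interest, the separators can be dropped afterwards. This is a small sketch using the hypothetical helpers split_fields/3 and fields_only/2. It relies on the list always having an odd number of elements, as in both examples above.

```prolog
:- use_module(library(pcre)).

% split_fields(+Pattern, +String, -Fields)
% Hypothetical helper: like re_split/4, but keeps only the text between
% the separators.
split_fields(Pattern, String, Fields) :-
    re_split(Pattern, String, Parts, []),
    fields_only(Parts, Fields).

% Keep elements at odd positions (fields), skip even ones (separators).
fields_only([Field], [Field]).
fields_only([Field, _Separator|Rest], [Field|Fields]) :-
    fields_only(Rest, Fields).
```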
Pattern | is the pattern text, optionally followed by /Flags. Similar to re_matchsub/4, the final output type can be controlled by a flag a (atom), s (string, default) or n (number if possible, atom otherwise). |
The replacement text may reference captured substrings using \N or $Name. Both N and Name may be written as {N} and {Name} to avoid ambiguities.
Pattern | is the pattern text, optionally followed by /Flags. Flags may include g, replacing all occurrences of Pattern. In addition, similar to re_matchsub/4, the final output type can be controlled by a flag a (atom) or s (string, default). |
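Two short sketches of re_replace/4 follow, one using numbered back references and one using the g flag. Both predicate names are hypothetical. The first assumes that \N in a Prolog string is written as "\\N", since the backslash must itself be escaped.

```prolog
:- use_module(library(pcre)).

% swap_pair(+In, -Out)
% Hypothetical helper: swap the two numbers around the first "-",
% referring to the capture groups as \2 and \1.
swap_pair(In, Out) :-
    re_replace("(\\d+)-(\\d+)", "\\2-\\1", In, Out).

% mask_digits(+In, -Out)
% Hypothetical helper: replace every digit by "#". Without the g flag
% only the first digit would be replaced.
mask_digits(In, Out) :-
    re_replace("\\d"/g, "#", In, Out).
```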
The compiled regular expression is a blob of type regex (see blob/2). The supported Options are listed below. Please consult the PCRE documentation for details.
bsr(Mode)
    If anycrlf, \R only matches CR, LF or CRLF. If unicode, \R matches all Unicode line endings.
caseless(Bool)
    If true, do caseless matching.
dollar_endonly(Bool)
    If true, $ does not match a newline at the end.
dotall(Bool)
    If true, . matches anything including NL.
dupnames(Bool)
    If true, allow duplicate names for subpatterns.
extended(Bool)
    If true, ignore white space and # comments.
extra(Bool)
    If true, PCRE extra features (not much use currently).
firstline(Bool)
    If true, force matching to be before newline.
compat(With)
    If javascript, JavaScript compatibility.
multiline(Bool)
    If true, ^ and $ match newlines within data.
newline(Mode)
    If any, recognize any Unicode newline sequence; if anycrlf (default), recognize CR, LF, and CRLF as newline sequences; if cr, recognize CR; if lf, recognize LF; and finally if crlf, recognize CRLF as newline.
ucp(Bool)
    If true, use Unicode properties for \d, \w, etc.
ungreedy(Bool)
    If true, invert greediness of quantifiers.

In addition to the options above that directly map to PCRE flags, the following options are processed:

optimize(Bool)
    If true, study the regular expression.
capture_type(Type)
    How to return the matched part of the input and any captured groups. If range, a capture is returned as a pair Start-Length. Note that we use Start-Length rather than the more conventional Start-End to allow for immediate use with sub_atom/5 and sub_string/5.
The capture_type specifies the default for this pattern. The interface supports a different type for each named group using the syntax (?<name_T>...), where T is one of S (string), A (atom), I (integer), F (float), N (number), T (term) and R (range). In the current implementation I, F and N are synonyms for T. Future versions may act differently if the parsed value is not of the requested numeric type.
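The range capture type combines naturally with sub_string/5. The sketch below compiles the pattern with capture_type(range) and uses the resulting Start-Length pair to extract the text preceding the first match. Both predicate names are hypothetical, and the example assumes plain ASCII input so that offsets can be read as character positions without further thought.

```prolog
:- use_module(library(pcre)).

% first_match_range(+Pattern, +String, -Start, -Length)
% Hypothetical helper: position of the first match of Pattern in String.
first_match_range(Pattern, String, Start, Length) :-
    re_compile(Pattern, Re, [capture_type(range)]),
    re_matchsub(Re, String, Sub, []),
    get_dict(0, Sub, Start-Length).

% text_before_match(+Pattern, +String, -Before)
% Hypothetical helper: the part of String before the first match.
text_before_match(Pattern, String, Before) :-
    first_match_range(Pattern, String, Start, _Length),
    sub_string(String, 0, Start, _After, Before).
```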
Name is derived from the PCRE_CONFIG_* constant by removing the PCRE_CONFIG_ prefix and mapping the name to lower case, e.g. utf8, unicode_properties, etc. Value is either a Prolog boolean, integer or atom. Finally, the functionality of pcre_version() is available using the configuration name version.
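As an illustration, the library version could be queried as sketched below. This assumes that the configuration predicate is re_config/1 and that it takes a term of the form Name(Value); neither detail is spelled out in the text above, so treat both as assumptions. The wrapper pcre_library_version/1 is a hypothetical name.

```prolog
:- use_module(library(pcre)).

% pcre_library_version(-Version)
% Hypothetical wrapper: Version is whatever the underlying library
% reports for the configuration name version.
% Assumption: re_config/1 accepts a term Name(Value).
pcre_library_version(Version) :-
    re_config(version(Version)).
```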