products:pcre2:syntax
— | products:pcre2:syntax [2016/01/22 15:08] (current) – created - external edit 127.0.0.1 | ||
---|---|---|---|
Line 1: | Line 1: | ||
+ | ====== YuPcre2: RegEx Syntax ====== | ||
+ | {{page> | ||
+ | |||
+ | This is a quick-reference summary of the regular expression syntax supported by YuPcre2. The full syntax and semantics are described in the documentation which accompanies the download package. | ||
+ | |||
+ | ===== Quoting ===== | ||
+ | |||
+ | < | ||
+ | \x where x is non-alphanumeric is a literal x | ||
+ | \Q...\E | ||
+ | </ | ||
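A minimal sketch of building a `\Q...\E` quoted run from arbitrary text. The helper name `quote_literal` is hypothetical and not part of YuPcre2; the only subtle case is text that itself contains `\E`, which this sketch handles by closing the run, emitting an escaped backslash followed by a literal `E`, and reopening the run.

```python
def quote_literal(text: str) -> str:
    # Hypothetical helper: wrap text so that a PCRE2-style engine
    # treats every character between \Q and \E as a literal.
    # An embedded \E would end the quoted run early, so it is
    # rewritten as: \E (close), \\ (literal backslash), E, \Q (reopen).
    return r"\Q" + text.replace(r"\E", r"\E\\E\Q") + r"\E"


print(quote_literal("1+1=2?"))
print(quote_literal(r"C:\Eggs"))
```

The function only produces a pattern string; it does not compile or match anything itself.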
+ | |||
+ | ===== Escaped Characters ===== | ||
+ | |||
+ | This table applies to ASCII and Unicode environments. | ||
+ | |||
+ | < | ||
+ | \a | ||
+ | \cx " | ||
+ | \e | ||
+ | \f form feed (hex 0C) | ||
+ | \n | ||
+ | \r | ||
+ | \t tab (hex 09) | ||
+ | \0dd | ||
+ | \ddd | ||
+ | \o{ddd..} | ||
+ | \U " | ||
+ | \uhhhh | ||
+ | \xhh | ||
+ | \x{hhh..} | ||
+ | </ | ||
+ | |||
+ | Note that \0dd is always an octal code. The treatment of backslash followed by a non-zero digit is complicated; | ||
+ | |||
+ | When \x is not followed by {, from zero to two hexadecimal digits are read, but if PCRE2_ALT_BSUX is set, \x must be followed by two hexadecimal digits to be recognized as a hexadecimal escape; otherwise it matches a literal " | ||
+ | |||
+ | ===== Character Types ===== | ||
+ | |||
+ | < | ||
+ | . any character except newline; | ||
+ | in dotall mode, any character whatsoever | ||
+ | \C one code unit, even in UTF mode (best avoided) | ||
+ | \d a decimal digit | ||
+ | \D a character that is not a decimal digit | ||
+ | \h a horizontal white space character | ||
+ | \H a character that is not a horizontal white space character | ||
+ | \N a character that is not a newline | ||
+ | \p{xx} | ||
+ | \P{xx} | ||
+ | \R a newline sequence | ||
+ | \s a white space character | ||
+ | \S a character that is not a white space character | ||
+ | \v a vertical white space character | ||
+ | \V a character that is not a vertical white space character | ||
+ | \w a " | ||
+ | \W a " | ||
+ | \X a Unicode extended grapheme cluster | ||
+ | </ | ||
+ | |||
+ | The application can lock out the use of \C by setting the PCRE2_NEVER_BACKSLASH_C option. \C is dangerous because it may leave the current matching point in the middle of a UTF-8 or UTF-16 character. | ||
+ | |||
+ | By default, \d, \s, and \w match only ASCII characters, even in UTF-8 mode or in the 16-bit and 32-bit libraries. However, if locale-specific matching is happening, \s and \w may also match characters with code points in the range 128-255. If the PCRE2_UCP option is set, the behaviour of these escape sequences is changed to use Unicode properties and they match many more characters. | ||
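The difference between ASCII-only and Unicode-property matching for `\d` can be illustrated with Python's standard `re` module, used here purely as a stand-in because it exposes the same distinction. Note that the defaults are reversed: in Python, `re.ASCII` gives the behaviour PCRE2 has by default, while Python's default for text patterns is roughly comparable to setting PCRE2_UCP. This is an analogy, not YuPcre2 code.

```python
import re

ARABIC_THREE = "\u0663"  # ARABIC-INDIC DIGIT THREE, general category Nd

# ASCII-only \d: comparable to the PCRE2 default described above.
ascii_only = re.compile(r"\d", re.ASCII)
# Unicode-aware \d: comparable in spirit to matching with PCRE2_UCP set.
unicode_aware = re.compile(r"\d")

print(bool(ascii_only.fullmatch("7")))              # True
print(bool(ascii_only.fullmatch(ARABIC_THREE)))     # False
print(bool(unicode_aware.fullmatch(ARABIC_THREE)))  # True
```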
+ | |||
+ | ===== General Category Properties for \p and \P ===== | ||
+ | |||
+ | < | ||
+ | C Other | ||
+ | Cc | ||
+ | Cf | ||
+ | Cn | ||
+ | Co | ||
+ | Cs | ||
+ | |||
+ | L Letter | ||
+ | Ll Lower case letter | ||
+ | Lm | ||
+ | Lo Other letter | ||
+ | Lt Title case letter | ||
+ | Lu Upper case letter | ||
+ | L& | ||
+ | |||
+ | M Mark | ||
+ | Mc | ||
+ | Me | ||
+ | Mn | ||
+ | |||
+ | N Number | ||
+ | Nd | ||
+ | Nl | ||
+ | No Other number | ||
+ | |||
+ | P Punctuation | ||
+ | Pc | ||
+ | Pd Dash punctuation | ||
+ | Pe Close punctuation | ||
+ | Pf Final punctuation | ||
+ | Pi | ||
+ | Po Other punctuation | ||
+ | Ps Open punctuation | ||
+ | |||
+ | S Symbol | ||
+ | Sc | ||
+ | Sk | ||
+ | Sm | ||
+ | So Other symbol | ||
+ | |||
+ | Z Separator | ||
+ | Zl Line separator | ||
+ | Zp | ||
+ | Zs Space separator | ||
+ | </ | ||
+ | |||
+ | ===== PCRE2 Special Category Properties for \p and \P ===== | ||
+ | |||
+ | < | ||
+ | Xan Alphanumeric: | ||
+ | Xps POSIX space: property Z or tab, NL, VT, FF, CR | ||
+ | Xsp Perl space: property Z or tab, NL, VT, FF, CR | ||
+ | Xuc Universally-named character: one that can be | ||
+ | | ||
+ | Xwd Perl word: property Xan or underscore | ||
+ | </ | ||
+ | |||
+ | Perl and POSIX space are now the same. Perl added VT to its space character set at release 5.18. | ||
+ | |||
+ | ===== Script Names for \p and \P ===== | ||
+ | |||
+ | Ahom, Anatolian_Hieroglyphs, | ||
+ | |||
+ | ===== Character Classes ===== | ||
+ | |||
+ | < | ||
+ | [...] | ||
+ | [^...] | ||
+ | [x-y] range (can be used for hex characters) | ||
+ | [[: | ||
+ | [[: | ||
+ | |||
+ | alnum | ||
+ | alpha | ||
+ | ascii 0-127 | ||
+ | blank space or tab | ||
+ | cntrl | ||
+ | digit | ||
+ | graph | ||
+ | lower lower case letter | ||
+ | print | ||
+ | punct | ||
+ | space white space | ||
+ | upper upper case letter | ||
+ | word same as \w | ||
+ | xdigit | ||
+ | </ | ||
+ | |||
+ | In PCRE2, POSIX character set names recognize only ASCII characters by default, but some of them use Unicode properties if PCRE2_UCP is set. You can use \Q...\E inside a character class. | ||
+ | |||
+ | ===== Quantifiers ===== | ||
+ | |||
+ | < | ||
+ | ? 0 or 1, greedy | ||
+ | ?+ 0 or 1, possessive | ||
+ | ?? 0 or 1, lazy | ||
+ | * 0 or more, greedy | ||
+ | *+ 0 or more, possessive | ||
+ | *? 0 or more, lazy | ||
+ | + 1 or more, greedy | ||
+ | ++ 1 or more, possessive | ||
+ | +? 1 or more, lazy | ||
+ | {n} | ||
+ | {n,m} at least n, no more than m, greedy | ||
+ | {n, | ||
+ | {n, | ||
+ | {n,} n or more, greedy | ||
+ | {n,}+ n or more, possessive | ||
+ | {n,}? n or more, lazy | ||
+ | </ | ||
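A short sketch of how greedy and lazy quantifiers differ on the same subject. It uses Python's `re` module as a stand-in, since these forms are spelled the same way there; possessive quantifiers are left out because older Python versions do not accept them.

```python
import re

subject = "<a><b>"

# Greedy: .+ takes as much as it can, then backs off to the last ">".
greedy = re.match(r"<.+>", subject).group()
# Lazy: .+? takes as little as it can, stopping at the first ">".
lazy = re.match(r"<.+?>", subject).group()
# Bounded repeat {n,m}: at least 2 and at most 3 digits per match.
bounded = re.findall(r"\d{2,3}", "1 22 333 4444")

print(greedy)   # <a><b>
print(lazy)     # <a>
print(bounded)  # ['22', '333', '444']
```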
+ | |||
+ | ===== Anchors and Simple Assertions ===== | ||
+ | |||
+ | < | ||
+ | \b word boundary | ||
+ | \B not a word boundary | ||
+ | ^ start of subject | ||
+ | also after an internal newline in multiline mode | ||
+ | (after any newline if PCRE2_ALT_CIRCUMFLEX is set) | ||
+ | \A start of subject | ||
+ | $ end of subject | ||
+ | also before newline at end of subject | ||
+ | also before internal newline in multiline mode | ||
+ | \Z end of subject | ||
+ | also before newline at end of subject | ||
+ | \z end of subject | ||
+ | \G first matching position in subject | ||
+ | </ | ||
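The effect of multiline mode on `^` and `$` can be sketched with Python's `re` module, which treats these two anchors and `\b` the same way for this example. One caveat: Python's `\Z` matches only at the very end of the subject, so it corresponds to `\z` in the table above and is deliberately not used here.

```python
import re

subject = "one\ntwo\n"

# Default mode: ^ matches only at the start of the subject, and $ only at
# the end or before a final newline, so no single line satisfies both.
default_mode = re.findall(r"^\w+$", subject)
# Multiline mode: ^ and $ also match around internal newlines.
multiline = re.findall(r"^\w+$", subject, re.MULTILINE)
# $ matches before the newline that ends the subject.
before_final_newline = re.search(r"two$", subject) is not None
# \b requires a word/non-word transition on both sides of "cat".
whole_words = re.findall(r"\bcat\b", "cat concat cats cat")

print(default_mode)          # []
print(multiline)             # ['one', 'two']
print(before_final_newline)  # True
print(whole_words)           # ['cat', 'cat']
```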
+ | |||
+ | ===== Match Point Reset ===== | ||
+ | |||
+ | < | ||
+ | \K reset start of match | ||
+ | </ | ||
+ | |||
+ | \K is honoured in positive assertions, but ignored in negative ones. | ||
+ | |||
+ | ===== Alternation ===== | ||
+ | |||
+ | < | ||
+ | expr|expr|expr... | ||
+ | </ | ||
+ | |||
+ | ===== Capturing ===== | ||
+ | |||
+ | < | ||
+ | (...) | ||
+ | (?< | ||
+ | (?' | ||
+ | (? | ||
+ | (?: | ||
+ | (? | ||
+ | | ||
+ | </ | ||
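A brief illustration of numbered, named, and non-capturing groups, run with Python's `re` module as a stand-in. The named group uses the Python-style spelling `(?P<name>...)`, which PCRE2 is documented to accept as well; the other named-group spellings are not available in Python and are not shown.

```python
import re

# Group 1 is named "year", group 2 is a plain numbered group, and the
# trailing (?:...) groups the optional day without capturing it.
pattern = re.compile(r"(?P<year>\d{4})-(\d{2})(?:-\d{2})?")
m = pattern.match("2016-01-22")

print(m.group(0))       # 2016-01-22
print(m.group("year"))  # 2016
print(m.groups())       # ('2016', '01')
```

Only two groups are reported because the non-capturing group consumes text but does not receive a number.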
+ | |||
+ | ===== Atomic Groups ===== | ||
+ | |||
+ | < | ||
+ | (?> | ||
+ | </ | ||
+ | |||
+ | ===== Comment ===== | ||
+ | |||
+ | < | ||
+ | (?# | ||
+ | </ | ||
+ | |||
+ | ===== Option Setting ===== | ||
+ | |||
+ | < | ||
+ | (?i) caseless | ||
+ | (?J) allow duplicate names | ||
+ | (?m) multiline | ||
+ | (?s) single line (dotall) | ||
+ | (?U) default ungreedy (lazy) | ||
+ | (?x) extended (ignore white space) | ||
+ | (? | ||
+ | </ | ||
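The inline options `(?i)`, `(?m)`, `(?s)` and `(?x)` have the same spelling in Python's `re` module, so a small sketch with that module can show their effect. `(?J)` and `(?U)` have no Python equivalent and are omitted; each option is placed at the very start of the pattern, which is where Python requires it.

```python
import re

# (?i) caseless matching
caseless = re.findall(r"(?i)delphi", "Delphi DELPHI delphi")
# (?s) dotall: "." also matches a newline
dot_default = re.findall(r"a.b", "a\nb")
dot_all = re.findall(r"(?s)a.b", "a\nb")
# (?m) multiline: ^ also matches after an internal newline
line_starts = re.findall(r"(?m)^\d+", "1\n22")
# (?x) extended: unescaped white space and # comments are ignored
extended = re.fullmatch(r"(?x) \d{4} - \d{2}  # year-month", "2016-01")

print(caseless)              # ['Delphi', 'DELPHI', 'delphi']
print(dot_default, dot_all)  # [] ['a\nb']
print(line_starts)           # ['1', '22']
print(extended is not None)  # True
```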
+ | |||
+ | The following are recognized only at the very start of a pattern or after one of the newline or \R options with similar syntax. More than one of them may appear. | ||
+ | |||
+ | < | ||
+ | (*LIMIT_MATCH=d) set the match limit to d (decimal number) | ||
+ | (*LIMIT_RECURSION=d) set the recursion limit to d (decimal number) | ||
+ | (*NOTEMPTY) | ||
+ | (*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching | ||
+ | (*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS) | ||
+ | (*NO_DOTSTAR_ANCHOR) no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR) | ||
+ | (*NO_JIT) | ||
+ | (*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE) | ||
+ | (*UTF) | ||
+ | (*UCP) | ||
+ | </ | ||
+ | |||
+ | Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the limits set by the caller of pcre2_match(), | ||
+ | |||
+ | ===== Newline Convention ===== | ||
+ | |||
+ | These are recognized only at the very start of the pattern or after option settings with a similar syntax. | ||
+ | |||
+ | < | ||
+ | (*CR) | ||
+ | (*LF) | ||
+ | (*CRLF) | ||
+ | (*ANYCRLF) | ||
+ | (*ANY) | ||
+ | </ | ||
+ | |||
+ | ===== What \R Matches ===== | ||
+ | |||
+ | These are recognized only at the very start of the pattern or after option settings with a similar syntax. | ||
+ | |||
+ | < | ||
+ | (*BSR_ANYCRLF) | ||
+ | (*BSR_UNICODE) | ||
+ | </ | ||
+ | |||
+ | ===== Lookahead and Lookbehind Assertions ===== | ||
+ | |||
+ | < | ||
+ | (? | ||
+ | (? | ||
+ | (?< | ||
+ | (?< | ||
+ | </ | ||
+ | |||
+ | Each top-level branch of a lookbehind must be of a fixed length. | ||
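The four assertion forms can be sketched with Python's `re` module, which spells them identically. Every lookbehind below is a single fixed-length branch, so it satisfies the length restriction in both engines; the sample strings are made up for illustration.

```python
import re

# Positive lookahead: a word that is followed by a colon.
keys = re.findall(r"\w+(?=:)", "key: value")
# Negative lookahead: a number not followed by another digit or by "%".
plain_numbers = re.findall(r"\d+(?!\d|%)", "10% 25 30%")
# Positive lookbehind: digits preceded by a dollar sign.
prices = re.findall(r"(?<=\$)\d+", "cost $15 and 20")
# Negative lookbehind: whole numbers not preceded by a dollar sign.
non_prices = re.findall(r"(?<!\$)\b\d+", "cost $15 and 20")

print(keys)           # ['key']
print(plain_numbers)  # ['25']
print(prices)         # ['15']
print(non_prices)     # ['20']
```

The `\d` alternative inside the negative lookahead stops the engine from backtracking to a shorter number such as `1` in `10%`.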
+ | |||
+ | ===== Backreferences ===== | ||
+ | |||
+ | < | ||
+ | \n reference by number (can be ambiguous) | ||
+ | \gn | ||
+ | \g{n} | ||
+ | \g{-n} | ||
+ | \k< | ||
+ | \k' | ||
+ | \g{name} | ||
+ | \k{name} | ||
+ | (? | ||
+ | </ | ||
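A minimal sketch of backreferences by number and by name, using Python's `re` module as a stand-in. Python accepts only `\n` and the Python-style `(?P=name)` inside a pattern; the `\g` and `\k` spellings listed above are not available there, so they are not exercised.

```python
import re

# \1 must repeat exactly what group 1 captured: finds a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "it is is fine").group()

# (?P=q) must match the same quote character that opened the string.
quoted = [
    m.group(0)
    for m in re.finditer(r"(?P<q>['\"]).*?(?P=q)", "say \"hi\" or 'yo'")
]

print(doubled)  # is is
print(quoted)   # ['"hi"', "'yo'"]
```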
+ | |||
+ | ===== Subroutine References (Possibly Recursive) ===== | ||
+ | |||
+ | < | ||
+ | (?R) recurse whole pattern | ||
+ | (?n) call subpattern by absolute number | ||
+ | (?+n) call subpattern by relative number | ||
+ | (?-n) call subpattern by relative number | ||
+ | (?& | ||
+ | (? | ||
+ | \g< | ||
+ | \g' | ||
+ | \g< | ||
+ | \g' | ||
+ | \g< | ||
+ | \g' | ||
+ | \g< | ||
+ | \g' | ||
+ | </ | ||
+ | |||
+ | ===== Conditional Patterns ===== | ||
+ | |||
+ | < | ||
+ | (? | ||
+ | (? | ||
+ | |||
+ | (?(n) | ||
+ | (? | ||
+ | (? | ||
+ | (? | ||
+ | (? | ||
+ | (? | ||
+ | (?(R) | ||
+ | (? | ||
+ | (? | ||
+ | (? | ||
+ | (? | ||
+ | (? | ||
+ | </ | ||
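As a sketch of the simplest conditional form, `(?(n)yes)`, the pattern below requires a closing parenthesis only if group 1 matched an opening one. It is run with Python's `re` module, which supports the group-number condition but not the other condition types.

```python
import re

# Group 1 optionally matches "("; the conditional (?(1)\)) then demands
# a ")" only when group 1 took part in the match.
balanced = re.compile(r"^(\()?\d+(?(1)\))$")

for subject in ["42", "(42)", "(42", "42)"]:
    print(subject, bool(balanced.match(subject)))
```

The first two subjects match; the last two do not, because the parentheses are unbalanced.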
+ | |||
+ | ===== Backtracking Control ===== | ||
+ | |||
+ | The following act immediately when they are reached: | ||
+ | |||
+ | < | ||
+ | (*ACCEPT) | ||
+ | (*FAIL) | ||
+ | (*MARK: | ||
+ | </ | ||
+ | |||
+ | The following act only when a subsequent match failure causes a backtrack to reach them. They all force a match failure, but they differ in what happens afterwards. Those that advance the start-of-match point do so only if the pattern is not anchored. | ||
+ | |||
+ | < | ||
+ | (*COMMIT) | ||
+ | (*PRUNE) | ||
+ | (*PRUNE: | ||
+ | (*SKIP) | ||
+ | (*SKIP: | ||
+ | (*MARK: | ||
+ | (*THEN) | ||
+ | (*THEN: | ||
+ | </ | ||
+ | |||
+ | ===== Callouts ===== | ||
+ | |||
+ | < | ||
+ | (?C) callout (assumed number 0) | ||
+ | (?Cn) | ||
+ | (? | ||
+ | </ | ||
+ | |||
+ | The allowed string delimiters are ` ' " ^ % # $ (which are the same for the start and the end), and the starting delimiter { matched with the ending delimiter }. To encode the ending delimiter within the string, double it. |
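The delimiter rule can be sketched as a small string-building helper. The function name `callout_with_string` and its default delimiter are hypothetical; the helper only assembles the `(?C...)` text with the ending delimiter doubled, and does not compile or run a pattern.

```python
# Delimiters that are the same at the start and the end of the string.
SAME_DELIMITERS = set("`'\"^%#$")
# The one paired form: starts with { and ends with }.
PAIRED_DELIMITERS = {"{": "}"}


def callout_with_string(text: str, delimiter: str = "'") -> str:
    # Hypothetical helper, not part of YuPcre2.
    if delimiter in PAIRED_DELIMITERS:
        closing = PAIRED_DELIMITERS[delimiter]
    elif delimiter in SAME_DELIMITERS:
        closing = delimiter
    else:
        raise ValueError(f"unsupported callout delimiter: {delimiter!r}")
    # Encode any ending delimiter inside the text by doubling it.
    return "(?C" + delimiter + text.replace(closing, closing * 2) + closing + ")"


print(callout_with_string("it's"))      # (?C'it''s')
print(callout_with_string("a}b", "{"))  # (?C{a}}b})
```

Choosing a delimiter that does not occur in the text avoids the doubling altogether.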
products/pcre2/syntax.txt · Last modified: 2016/01/22 15:08 by 127.0.0.1