Delphi Inspiration

Components and Applications

User Tools

Site Tools


products:pcre2:syntax

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

products:pcre2:syntax [2016/01/22 15:08] (current)
Line 1: Line 1:
 +====== YuPcre2: RegEx Syntax ======
  
 +{{page>​header}}
 +
 +This is a quick-reference summary of the regular expression syntax supported by YuPcre2. The full syntax and semantics are described in the documentation which accompanies the download package.
 +
 +===== Quoting =====
 +
 +<​code>​
 +  \x         where x is non-alphanumeric is a literal x
 +  \Q...\E ​   treat enclosed characters as literal
 +</​code>​
 +
 +===== Escaped Characters =====
 +
 +This table applies to ASCII and Unicode environments.
 +
 +<​code>​
 +  \a         ​alarm,​ that is, the BEL character (hex 07)
 +  \cx        "​control-x",​ where x is any ASCII printing character
 +  \e         ​escape (hex 1B)
 +  \f         form feed (hex 0C)
 +  \n         ​newline (hex 0A)
 +  \r         ​carriage return (hex 0D)
 +  \t         tab (hex 09)
 +  \0dd       ​character with octal code 0dd
 +  \ddd       ​character with octal code ddd, or backreference
 +  \o{ddd..} ​ character with octal code ddd..
 +  \U         "​U"​ if PCRE2_ALT_BSUX is set (otherwise is an error)
 +  \uhhhh ​    ​character with hex code hhhh (if PCRE2_ALT_BSUX is set)
 +  \xhh       ​character with hex code hh
 +  \x{hhh..} ​ character with hex code hhh..
 +</​code>​
 +
 +Note that \0dd is always an octal code. The treatment of backslash followed by a non-zero digit is complicated;​ for details see the section "​Non-printing characters"​ in the **pcre2pattern** documentation,​ where details of escape processing in EBCDIC environments are also given.
 +
 +When \x is not followed by {, from zero to two hexadecimal digits are read, but if PCRE2_ALT_BSUX is set, \x must be followed by two hexadecimal digits to be recognized as a hexadecimal escape; otherwise it matches a literal "​x"​. Likewise, if \u (in ALT_BSUX mode) is not followed by four hexadecimal digits, it matches a literal "​u"​.
 +
 +===== Character Types =====
 +
 +<​code>​
 +  .          any character except newline;
 +               in dotall mode, any character whatsoever
 +  \C         one code unit, even in UTF mode (best avoided)
 +  \d         a decimal digit
 +  \D         a character that is not a decimal digit
 +  \h         a horizontal white space character
 +  \H         a character that is not a horizontal white space character
 +  \N         a character that is not a newline
 +  \p{xx} ​    a character with the xx property
 +  \P{xx} ​    a character without the xx property
 +  \R         a newline sequence
 +  \s         a white space character
 +  \S         a character that is not a white space character
 +  \v         a vertical white space character
 +  \V         a character that is not a vertical white space character
 +  \w         a "​word"​ character
 +  \W         a "​non-word"​ character
 +  \X         a Unicode extended grapheme cluster
 +</​code>​
 +
 +The application can lock out the use of \C by setting the PCRE2_NEVER_BACKSLASH_C option. It is dangerous because it may leave the current matching point in the middle of a UTF-8 or UTF-16 character.
 +
 +By default, \d, \s, and \w match only ASCII characters, even in UTF-8 mode or in the 16-bit and 32-bit libraries. However, if locale-specific matching is happening, \s and \w may also match characters with code points in the range 128-255. If the PCRE2_UCP option is set, the behaviour of these escape sequences is changed to use Unicode properties and they match many more characters.
 +
 +===== General Category Properties for \p and \P =====
 +
 +<​code>​
 +  C          Other
 +  Cc         ​Control
 +  Cf         ​Format
 +  Cn         ​Unassigned
 +  Co         ​Private use
 +  Cs         ​Surrogate
 +
 +  L          Letter
 +  Ll         Lower case letter
 +  Lm         ​Modifier letter
 +  Lo         Other letter
 +  Lt         Title case letter
 +  Lu         Upper case letter
 +  L& ​        Ll, Lu, or Lt
 +
 +  M          Mark
 +  Mc         ​Spacing mark
 +  Me         ​Enclosing mark
 +  Mn         ​Non-spacing mark
 +
 +  N          Number
 +  Nd         ​Decimal number
 +  Nl         ​Letter number
 +  No         Other number
 +
 +  P          Punctuation
 +  Pc         ​Connector punctuation
 +  Pd         Dash punctuation
 +  Pe         Close punctuation
 +  Pf         Final punctuation
 +  Pi         ​Initial punctuation
 +  Po         Other punctuation
 +  Ps         Open punctuation
 +
 +  S          Symbol
 +  Sc         ​Currency symbol
 +  Sk         ​Modifier symbol
 +  Sm         ​Mathematical symbol
 +  So         Other symbol
 +
 +  Z          Separator
 +  Zl         Line separator
 +  Zp         ​Paragraph separator
 +  Zs         Space separator
 +</​code>​
 +
 +===== PCRE2 Special Category Properties for \p and \P =====
 +
 +<​code>​
 +  Xan        Alphanumeric:​ union of properties L and N
 +  Xps        POSIX space: property Z or tab, NL, VT, FF, CR
 +  Xsp        Perl space: property Z or tab, NL, VT, FF, CR
 +  Xuc        Univerally-named character: one that can be
 +               ​represented by a Universal Character Name
 +  Xwd        Perl word: property Xan or underscore
 +</​code>​
 +
 +Perl and POSIX space are now the same. Perl added VT to its space character set at release 5.18.
 +
 +===== Script Names for \p and \P =====
 +
 +Ahom, Anatolian_Hieroglyphs,​ Arabic, Armenian, Avestan, Balinese, Bamum, Bassa_Vah, Batak, Bengali, Bopomofo, Brahmi, Braille, Buginese, Buhid, Canadian_Aboriginal,​ Carian, Caucasian_Albanian,​ Chakma, Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari, Duployan, Egyptian_Hieroglyphs,​ Elbasan, Ethiopic, Georgian, Glagolitic, Gothic, Grantha, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hatran, Hebrew, Hiragana, Imperial_Aramaic,​ Inherited, Inscriptional_Pahlavi,​ Inscriptional_Parthian,​ Javanese, Kaithi, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Khojki, Khudawadi, Lao, Latin, Lepcha, Limbu, Linear_A, Linear_B, Lisu, Lycian, Lydian, Mahajani, Malayalam, Mandaic, Manichaean, Meetei_Mayek,​ Mende_Kikakui,​ Meroitic_Cursive,​ Meroitic_Hieroglyphs,​ Miao, Modi, Mongolian, Mro, Multani, Myanmar, Nabataean, New_Tai_Lue,​ Nko, Ogham, Ol_Chiki, Old_Hungarian,​ Old_Italic, Old_North_Arabian,​ Old_Permic, Old_Persian,​ Old_South_Arabian,​ Old_Turkic, Oriya, Osmanya, Pahawh_Hmong,​ Palmyrene, Pau_Cin_Hau,​ Phags_Pa, Phoenician, Psalter_Pahlavi,​ Rejang, Runic, Samaritan, Saurashtra, Sharada, Shavian, Siddham, SignWriting,​ Sinhala, Sora_Sompeng,​ Sundanese, Syloti_Nagri,​ Syriac, Tagalog, Tagbanwa, Tai_Le, Tai_Tham, Tai_Viet, Takri, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Tirhuta, Ugaritic, Vai, Warang_Citi,​ Yi.
 +
 +===== Character Classes =====
 +
 +<​code>​
 +  [...]       ​positive character class
 +  [^...] ​     negative character class
 +  [x-y]       range (can be used for hex characters)
 +  [[:​xxx:​]] ​  ​positive POSIX named set
 +  [[:​^xxx:​]] ​ negative POSIX named set
 +
 +  alnum       ​alphanumeric
 +  alpha       ​alphabetic
 +  ascii       0-127
 +  blank       space or tab
 +  cntrl       ​control character
 +  digit       ​decimal digit
 +  graph       ​printing,​ excluding space
 +  lower       lower case letter
 +  print       ​printing,​ including space
 +  punct       ​printing,​ excluding alphanumeric
 +  space       white space
 +  upper       upper case letter
 +  word        same as \w
 +  xdigit ​     hexadecimal digit
 +</​code>​
 +
 +In PCRE2, POSIX character set names recognize only ASCII characters by default, but some of them use Unicode properties if PCRE2_UCP is set. You can use \Q...\E inside a character class.
 +
 +===== Quantifiers =====
 +
 +<​code>​
 +  ?           0 or 1, greedy
 +  ?+          0 or 1, possessive
 +  ??          0 or 1, lazy
 +  *           0 or more, greedy
 +  *+          0 or more, possessive
 +  *?          0 or more, lazy
 +  +           1 or more, greedy
 +  ++          1 or more, possessive
 +  +?          1 or more, lazy
 +  {n}         ​exactly n
 +  {n,m}       at least n, no more than m, greedy
 +  {n,​m}+ ​     at least n, no more than m, possessive
 +  {n,​m}? ​     at least n, no more than m, lazy
 +  {n,}        n or more, greedy
 +  {n,}+       n or more, possessive
 +  {n,}?       n or more, lazy
 +</​code>​
 +
 +===== Anchors and Simple Assertions =====
 +
 +<​code>​
 +  \b          word boundary
 +  \B          not a word boundary
 +  ^           start of subject
 +                also after an internal newline in multiline mode
 +                (after any newline if PCRE2_ALT_CIRCUMFLEX is set)
 +  \A          start of subject
 +  $           end of subject
 +                also before newline at end of subject
 +                also before internal newline in multiline mode
 +  \Z          end of subject
 +                also before newline at end of subject
 +  \z          end of subject
 +  \G          first matching position in subject
 +</​code>​
 +
 +===== Match Point Reset =====
 +
 +<​code>​
 +  \K          reset start of match
 +</​code>​
 +
 +\K is honoured in positive assertions, but ignored in negative ones.
 +
 +===== Alternation =====
 +
 +<​code>​
 +  expr|expr|expr...
 +</​code>​
 +
 +===== Capturing =====
 +
 +<​code>​
 +  (...)           ​capturing group
 +  (?<​name>​...) ​   named capturing group (Perl)
 +  (?'​name'​...) ​   named capturing group (Perl)
 +  (?​P<​name>​...) ​  named capturing group (Python)
 +  (?:​...) ​        ​non-capturing group
 +  (?​|...) ​        ​non-capturing group; reset group numbers for
 +                   ​capturing groups in each alternative
 +</​code>​
 +
 +===== Atomic Groups =====
 +
 +<​code>​
 +  (?>​...) ​        ​atomic,​ non-capturing group
 +</​code>​
 +
 +===== Comment =====
 +
 +<​code>​
 +  (?#​....) ​       comment (not nestable)
 +</​code>​
 +
 +===== Option Setting =====
 +
 +<​code>​
 +  (?i)            caseless
 +  (?J)            allow duplicate names
 +  (?m)            multiline
 +  (?s)            single line (dotall)
 +  (?U)            default ungreedy (lazy)
 +  (?x)            extended (ignore white space)
 +  (?​-...) ​        unset option(s)
 +</​code>​
 +
 +The following are recognized only at the very start of a pattern or after one of the newline or \R options with similar syntax. More than one of them may appear.
 +
 +<​code>​
 +  (*LIMIT_MATCH=d) set the match limit to d (decimal number)
 +  (*LIMIT_RECURSION=d) set the recursion limit to d (decimal number)
 +  (*NOTEMPTY) ​    set PCRE2_NOTEMPTY when matching
 +  (*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching
 +  (*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
 +  (*NO_DOTSTAR_ANCHOR) no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
 +  (*NO_JIT) ​      ​disable JIT optimization
 +  (*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
 +  (*UTF) ​         set appropriate UTF mode for the library in use
 +  (*UCP) ​         set PCRE2_UCP (use Unicode properties for \d etc)
 +</​code>​
 +
 +Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the limits set by the caller of pcre2_match(),​ not increase them. The application can lock out the use of %%(*%%UTF) and %%(*%%UCP) by setting the PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options, respectively,​ at compile time.
 +
 +===== Newline Convention =====
 +
 +These are recognized only at the very start of the pattern or after option settings with a similar syntax.
 +
 +<​code>​
 +  (*CR)           ​carriage return only
 +  (*LF)           ​linefeed only
 +  (*CRLF) ​        ​carriage return followed by linefeed
 +  (*ANYCRLF) ​     all three of the above
 +  (*ANY) ​         any Unicode newline sequence
 +</​code>​
 +
 +===== What \R Matches =====
 +
 +These are recognized only at the very start of the pattern or after option setting with a similar syntax.
 +
 +<​code>​
 +  (*BSR_ANYCRLF) ​ CR, LF, or CRLF
 +  (*BSR_UNICODE) ​ any Unicode newline sequence
 +</​code>​
 +
 +===== Lookahead and Lookbehind Assertions =====
 +
 +<​code>​
 +  (?​=...) ​        ​positive look ahead
 +  (?​!...) ​        ​negative look ahead
 +  (?<​=...) ​       positive look behind
 +  (?<​!...) ​       negative look behind
 +</​code>​
 +
 +Each top-level branch of a look behind must be of a fixed length.
 +
 +===== Backreferences =====
 +
 +<​code>​
 +  \n              reference by number (can be ambiguous)
 +  \gn             ​reference by number
 +  \g{n}           ​reference by number
 +  \g{-n} ​         relative reference by number
 +  \k<​name> ​       reference by name (Perl)
 +  \k'​name' ​       reference by name (Perl)
 +  \g{name} ​       reference by name (Perl)
 +  \k{name} ​       reference by name (.NET)
 +  (?​P=name) ​      ​reference by name (Python)
 +</​code>​
 +
 +===== Subroutine References (Possibly Recursive) =====
 +
 +<​code>​
 +  (?R)            recurse whole pattern
 +  (?n)            call subpattern by absolute number
 +  (?+n)           call subpattern by relative number
 +  (?-n)           call subpattern by relative number
 +  (?&​name) ​       call subpattern by name (Perl)
 +  (?​P>​name) ​      call subpattern by name (Python)
 +  \g<​name> ​       call subpattern by name (Oniguruma)
 +  \g'​name' ​       call subpattern by name (Oniguruma)
 +  \g<​n> ​          call subpattern by absolute number (Oniguruma)
 +  \g'​n' ​          call subpattern by absolute number (Oniguruma)
 +  \g<​+n> ​         call subpattern by relative number (PCRE2 extension)
 +  \g'​+n' ​         call subpattern by relative number (PCRE2 extension)
 +  \g<​-n> ​         call subpattern by relative number (PCRE2 extension)
 +  \g'​-n' ​         call subpattern by relative number (PCRE2 extension)
 +</​code>​
 +
 +===== Conditional Patterns =====
 +
 +<​code>​
 +  (?​(condition)yes-pattern)
 +  (?​(condition)yes-pattern|no-pattern)
 +
 +  (?(n)               ​absolute reference condition
 +  (?​(+n) ​             relative reference condition
 +  (?​(-n) ​             relative reference condition
 +  (?​(<​name>​) ​         named reference condition (Perl)
 +  (?​('​name'​) ​         named reference condition (Perl)
 +  (?​(name) ​           named reference condition (PCRE2)
 +  (?(R)               ​overall recursion condition
 +  (?​(Rn) ​             specific group recursion condition
 +  (?​(R&​name) ​         specific recursion condition
 +  (?​(DEFINE) ​         define subpattern for reference
 +  (?​(VERSION[>​]=n.m) ​ test PCRE2 version
 +  (?​(assert) ​         assertion condition
 +</​code>​
 +
 +===== Backtracking Control =====
 +
 +The following act immediately they are reached:
 +
 +<​code>​
 +  (*ACCEPT) ​      force successful match
 +  (*FAIL) ​        force backtrack; synonym (*F)
 +  (*MARK:​NAME) ​   set name to be passed back; synonym (*:NAME)
 +</​code>​
 +
 +The following act only when a subsequent match failure causes a backtrack to reach them. They all force a match failure, but they differ in what happens afterwards. Those that advance the start-of-match point do so only if the pattern is not anchored.
 +
 +<​code>​
 +  (*COMMIT) ​      ​overall failure, no advance of starting point
 +  (*PRUNE) ​       advance to next starting character
 +  (*PRUNE:​NAME) ​  ​equivalent to (*MARK:​NAME)(*PRUNE)
 +  (*SKIP) ​        ​advance to current matching position
 +  (*SKIP:​NAME) ​   advance to position corresponding to an earlier
 +                  (*MARK:​NAME);​ if not found, the (*SKIP) is ignored
 +  (*THEN) ​        local failure, backtrack to next alternation
 +  (*THEN:​NAME) ​   equivalent to (*MARK:​NAME)(*THEN)
 +</​code>​
 +
 +===== Callouts =====
 +
 +<​code>​
 +  (?C)            callout (assumed number 0)
 +  (?Cn)           ​callout with numerical data n
 +  (?​C"​text"​) ​     callout with string data
 +</​code>​
 +
 +The allowed string delimiters are ` ' " ^ % # $ (which are the same for the start and the end), and the starting delimiter { matched with the ending delimiter }. To encode the ending delimiter within the string, double it.
products/pcre2/syntax.txt · Last modified: 2016/01/22 15:08 (external edit)