Delphi Inspiration

Components and Applications

YuPcre2: Version History

YuPcre2 is a modern regular expression library for Delphi with Perl syntax. It directly supports UnicodeString, AnsiString, and UCS4String, as well as UTF-8 and UTF-16.

YuPcre2 1.13.0 – 12 May 2020

Overview:

  1. Capturing groups that contain recursive backreferences to themselves are no longer automatically atomic.
  2. New options for pcre2_substitute:
    • PCRE2_SUBSTITUTE_LITERAL: The replacement string is literal.
    • PCRE2_SUBSTITUTE_MATCHED: Use pre-existing match data for the first match.
    • PCRE2_SUBSTITUTE_REPLACEMENT_ONLY: Return only replacement string(s).
  3. If PCRE2_UCP is set without PCRE2_UTF, Unicode character properties are used for upper/lower case computations on characters whose code points are greater than 127.
  4. The character tables (for low-valued characters) can now more easily be saved and restored in binary.
  5. Updated to Unicode 13.0.0.

Details:

  • Fix a JIT bug that allowed the fields of the compiled pattern to be read before its existence was checked.
  • Back in the PCRE1 days, capturing groups that contained recursive back references to themselves were atomic because after the end of a repeated group, the captured substrings had their values from the final repetition, not from an earlier repetition that might be the destination of a backtrack. This feature was documented, and was carried over into PCRE2. However, it has now been realized that PCRE2 made this atomicizing unnecessary, and it is confusing when users are unaware of it, making some patterns appear not to be working as expected. Capture values of recursive back references in repeated groups are now correctly backtracked, so this unnecessary restriction has been removed.
  • Added (?* and (?<* as synonyms for (*napla: and (*naplb: to match another regex engine. The Perl regex folks are aware of this usage and have made a note about it.
  • When an assertion is repeated, PCRE2 used to limit the maximum repetition to 1, believing that repeating an assertion is pointless. However, if a positive assertion contains capturing groups, repetition can be useful. In any case, an assertion could always be wrapped in a repeated group. The only restriction that is now imposed is that an unlimited maximum is changed to one more than the minimum.
  • Fix *THEN verbs in lookahead assertions in JIT.
  • The JIT stack is now freed when the low-level stack allocation fails.
  • (?(DEFINE)…) groups were not being handled correctly when checking for the fixed length of a lookbehind assertion. Such a group within a lookbehind should be skipped, as it does not contribute to the length of the group. Instead, the (DEFINE) group was being processed, and if it came at the end of the lookbehind, that end was not correctly recognized. Errors such as “lookbehind assertion is not fixed length” and also “internal error: bad code value in parsed_skip()” could result.
  • Put a limit of 1000 on recursive calls when studying a pattern to search nested groups for starting code units, in order to avoid stack overflow issues. If the limit is reached, it just gives up trying for this optimization.
  • Restore the control verb chain list when exiting from a recurse function in JIT.
  • Fix a crash which occurs when the character type of an invalid UTF character is decoded in JIT.
  • When PCRE2_UCP is set without PCRE2_UTF, Unicode character properties are used for upper/lower case computations on characters whose code points are greater than 127.
  • The function for checking UTF-16 validity was returning an incorrect offset for the start of the error when a high surrogate was not followed by a valid low surrogate. This caused incorrect behaviour, for example when PCRE2_MATCH_INVALID_UTF was set and a match started immediately following the invalid high surrogate, such as aa matching \x{d800}aa.
  • If a DEFINE group immediately preceded a lookbehind assertion, the pattern could be mis-compiled and therefore not match correctly. The pattern that exposed the bug was (?(DEFINE)(?<foo>bar))(?<![-a-z0-9])word, which failed to match “word” because the “move back” value was set to zero.
  • PCRE2_CONFIG_TABLES_LENGTH is added to pcre2_config so that an application that wants to save tables in binary knows how long they are.
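
The UTF-16 validity fix above concerns which offset is reported when a high surrogate is not followed by a valid low surrogate. The following is a minimal Python sketch of such a check, not the library's code: the function name first_invalid_utf16_offset and the list-of-integers input are assumptions made for illustration. It reports the error at the high surrogate itself, so a matcher that skips the invalid unit can resume immediately after it.

```python
def first_invalid_utf16_offset(units):
    """Return the offset of the first invalid code unit, or -1 if all are valid.

    `units` is a sequence of 16-bit code units given as integers.
    (Hypothetical helper for illustration; not part of YuPcre2 or PCRE2.)
    """
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: it must be followed by a low surrogate.
            # If it is not, the error starts at the high surrogate itself,
            # not at the code unit that follows it.
            if i + 1 >= n or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                return i
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            # Isolated low surrogate.
            return i
        else:
            i += 1
    return -1


# The subject from the entry above: \x{d800} followed by "aa".
subject = [0xD800, ord("a"), ord("a")]
bad = first_invalid_utf16_offset(subject)
# A match attempt can resume right after the invalid unit, where "aa" begins.
resume_at = bad + 1
```

With the offset pointing one unit too far, the resume position in this example would skip the first "a", which is the kind of failure the entry describes.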

YuPcre2 1.12.0 – 24 Dec 2019

  • Add a check for the maximum number of capturing subpatterns, which is 65535.
  • Improve the invalid UTF-32 support of the JIT compiler. Now it correctly detects invalid characters in the 0xd800-0xdfff range.
  • Fix minor typo bug in JIT compile when \X is used in a non-UTF string.
  • Add support for matching in invalid UTF strings to the pcre2_match interpreter, and integrate with the existing JIT support via the new PCRE2_MATCH_INVALID_UTF compile-time option.
  • Adjust the limit for “must have” code unit searching, in particular, increase it substantially for non-anchored patterns.
  • Allow (*ACCEPT) to be quantified, because an ungreedy quantifier with a zero minimum is potentially useful.
  • Some changes to the way the minimum subject length is handled:
    • When PCRE2_NO_START_OPTIMIZE is set, no minimum length is computed.
    • An incorrect minimum length could be calculated for a pattern that contained (*ACCEPT) inside a quantified group whose minimum repetition was zero, for example A(?:(*ACCEPT))?B, which incorrectly computed a minimum of 2. The minimum length scan no longer happens for a pattern that contains (*ACCEPT).
    • When no minimum length is set by the normal scan, but a first and/or last code unit is recorded, set the minimum to 1 or 2 as appropriate.
    • When a pattern contains multiple groups with the same number, a back reference cannot know which one to scan for a minimum length. This used to cause the minimum length finder to give up with no result. Now it treats such references as not adding to the minimum length (which it should have done all along).
    • Furthermore, the above action now happens only if the back reference is to a group that exists more than once in a pattern instead of any back reference in a pattern with duplicate numbers.
  • A (*MARK) value inside a successful condition was not being returned by the interpretive matcher (it was returned by JIT). This bug has been mended.
  • The quantifier {1} was always being ignored, but this is incorrect when it is made possessive and applied to an item in parentheses, because a parenthesized item may contain multiple branches or other backtracking points, for example (a|ab){1}+c or (a+){1}+a.
  • DFA matching (using pcre2_dfa_match) was not recognising a partial match if the end of the subject was encountered in a lookahead (conditional or otherwise), an atomic group, or a recursion.
  • Check for integer overflow when computing lookbehind lengths.
  • Implement non-atomic positive lookaround assertions.
  • If a lookbehind contained a lookahead that contained another lookbehind within it, the nested lookbehind was not correctly processed. For example, if (?<=(?=(?<=a)))b was matched to “ab” it gave no match instead of matching “b”.
  • Implemented pcre2_get_match_data_size.
  • Two alterations to partial matching:
    • The definition of a partial match is slightly changed: if a pattern contains any lookbehinds, an empty partial match may be given, because this is another situation where adding characters to the current subject can lead to a full match. Example: c*+(?<=[bc]) with subject “ab”.
    • Similarly, if a pattern could match an empty string, an empty partial match may be given. Example: (?![ab]).* with subject “ab”. This case applies only to PCRE2_PARTIAL_HARD.
    • An empty string partial hard match can be returned for \z and \Z as it is documented that they shouldn't match.
  • A branch that started with (*ACCEPT) was not being recognized as one that could match an empty string.
  • Corrected pcre2_set_character_tables tables data type: was const C_unsigned_char_num_ptr instead of const C_uint8_t_ptr, as generated by pcre2_maketables.
  • Upgraded to Unicode 12.1.0.
  • If the length of one branch of a group exceeded 65535 (the maximum value that is remembered as a minimum length), the whole group's length was incorrectly recorded as 65535, leading to incorrect “no match” when start-up optimizations were in force.
  • The “rightmost consulted character” value was not always correct; in particular, if a pattern ended with a negative lookahead, characters that were inspected in that lookahead were not included.
  • Add the pcre2_maketables_free function.
  • The start-up optimization that looks for a unique initial matching code unit in the interpretive engines uses memchr() in 8-bit mode. When the search is caseless, it was doing so inefficiently, which ended up slowing down the match drastically when the subject was very long. The revised code (a) remembers if one case is not found, so it never repeats the search for that case after a bumpalong and (b) when one case has been found, it searches only up to that position for an earlier occurrence of the other case. This fix applies to both interpretive pcre2_match and to pcre2_dfa_match.
  • While scanning to find the minimum length of a group, if any branch has minimum length zero, there is no need to scan any subsequent branches (a small compile-time performance improvement).
  • Add an underflow check in JIT for the case where the value of the subject string pointer is close to 0.
  • Arrange for classes such as [Aa] which contain just the two cases of the same character, to be treated as a single caseless character. This causes the first and required code unit optimizations to kick in where relevant.
  • Improve the bitmap of starting bytes for positive classes that include wide characters, but no property types, in UTF-8 mode. Previously, on encountering such a class, the bits for all bytes greater than $c4 were set, thus specifying any character with codepoint >= $100. Now the only bits that are set are for the relevant bytes that start the wide characters. This can give a noticeable performance improvement.
  • If the bitmap of starting code units contains only 1 or 2 bits, replace it with a single starting code unit (1 bit) or a caseless single starting code unit if the two relevant characters are case-partners. This is particularly relevant to the 8-bit library, though it applies to all. It can give a performance boost for patterns such as [Ww]ord and (word|WORD). However, this optimization doesn't happen if there is a “required” code unit of the same value (because the search for a “required” code unit starts at the match start for non-unique first code unit patterns, but after a unique first code unit, and patterns such as a*a need the former action).
  • If a non-ASCII character was the first in a starting assertion in a caseless match, the “first code unit” optimization did not get the casing right, and the assertion failed to match a character in the other case if it did not start with the same code unit.
  • Detect empty matches in JIT.
  • Fix a JIT bug that allowed the fields of the compiled pattern to be read before its existence was checked.
  • Capturing groups that contained recursive back references to themselves are no longer atomic.
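
The caseless start-up search described in the memchr() entry above can be pictured with a short sketch. This is an illustration in Python under stated assumptions, not the PCRE2 source: the class name CaselessFirstUnitSearch and its methods are hypothetical, and bytes.find stands in for memchr(). It shows the two ideas from the entry: (a) a case that was not found up to the end of the subject is never searched for again, and (b) once one case is found, the other case is searched only up to that position.

```python
class CaselessFirstUnitSearch:
    """Find the next occurrence of a byte in either case (illustrative only)."""

    def __init__(self, subject: bytes, unit: bytes):
        self.subject = subject
        self.cases = [unit.lower(), unit.upper()]
        # absent[i] is True once cases[i] is known not to occur in the
        # rest of the subject, so later calls skip that search entirely.
        self.absent = [False, False]

    def find_from(self, start: int) -> int:
        """Return the first offset >= start holding either case, or -1."""
        best = -1
        end = len(self.subject)
        for i, case in enumerate(self.cases):
            if self.absent[i]:
                continue
            pos = self.subject.find(case, start, end)
            if pos == -1:
                # Only a search that ran to the end of the subject proves
                # that this case can never be found again.
                if end == len(self.subject):
                    self.absent[i] = True
            else:
                best = pos
                # The other case matters only if it occurs earlier.
                end = pos
        return best


search = CaselessFirstUnitSearch(b"xxxxAxxaxx", b"a")
first = search.find_from(0)    # "A" at offset 4 precedes "a" at offset 7
second = search.find_from(5)   # after a bumpalong, only "a" at 7 remains
third = search.find_from(8)    # nothing left; both cases become absent
```

After the third call both cases are marked absent, so any further bumpalong returns without scanning the subject again.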

YuPcre2 1.11.0 – 8 Oct 2019

  • Fix subject buffer overread in JIT when UTF is disabled and \X or \R has a greater than 1 fixed quantifier.
  • Added support for callouts from pcre2_substitute.
  • Fix an xclass matching issue in JIT.
  • Implement PCRE2_EXTRA_ESCAPED_CR_IS_LF.
  • Implement the Perl 5.28 experimental alphabetic names for atomic groups and lookaround assertions, for example, (*pla:…) and (*atomic:…). These are characterized by a lower case letter following (*.
  • Implement the new Perl “script run” features (*script_run:…) and (*atomic_script_run:…) aka (*sr:…) and (*asr:…).
  • Implement PCRE2_COPY_MATCHED_SUBJECT for pcre2_match (including JIT via pcre2_match) and pcre2_dfa_match, but *not* the pcre2_jit_match fast path. Also, when a match fails, set the subject field in the match data to nil for tidiness - none of the substring extractors should reference this after match failure.
  • If a pattern started with a subroutine call that had a quantifier with a minimum of zero, an incorrect “match must start with this character” could be recorded. Example: (?&xxx)*ABC(?<xxx>XYZ) would (incorrectly) expect 'A' to be the first character of a match.
  • The heap limit checking code in pcre2_dfa_match could suffer from overflow if the heap limit was set very large. This could cause incorrect “heap limit exceeded” errors.
  • If a pattern started with (*MARK), (*COMMIT), (*PRUNE), (*SKIP), or (*THEN) followed by ^ it was not recognized as anchored.
  • With PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL set, escape sequences such as \s which are valid in character classes, but not as the end of ranges, were being treated as literals. An example is [_-\s] (but not [\s-_] because that gave an error at the start of a range). Now an “invalid range” error is given independently of PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL.
  • PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL was affecting known escape sequences such as \eX when they appeared invalidly in a character class. Now the option applies only to unrecognized or malformed escape sequences.
  • The pcre2_dfa_match function was incorrectly handling conditional version tests such as (?(VERSION>=0)…) when the version test was true. Incorrect processing or a crash could result.
  • When PCRE2_UTF is set, allow non-ASCII letters and decimal digits in group names, as Perl does.
  • Implemented PCRE2_EXTRA_ALT_BSUX to support ECMAScript 6's \u{hhh} construct.
  • Compile \p{Any} to be the same as . in PCRE2_DOTALL mode, so that it benefits from auto-anchoring if \p{Any}* starts a pattern.
  • Disable SSE2 JIT optimizations in x86 CPUs when SSE2 is not available.
  • Improve DIUtils.pas Unicode processing to support Unicode Code Points from $000000 to $10FFFF. Adjust remaining source code accordingly.
  • Update DIUtils.pas Unicode functions to Unicode 12.1.0.
  • Remove DI.inc include file. Directly link in DICompilers.inc instead.

YuPcre2 1.10.0 – 7 Mar 2019

  • Fix: TDIRegEx2_8.Replace and TDIRegEx2_16.Replace did not return the start of the string if StartOffset > 0.
  • Adjust TDIRegEx2SearchStream_Enc to DIConverters 1.18.0: Converter functions now use the native unsigned integer type for the length of a string and support strings longer than 2 GB. This change only affects projects using DIConverters 1.18.0.

YuPcre2 1.9.2 – 8 Jan 2019

  • Matching the pattern (*UTF)\C[^\v]+\x80 against an 8-bit string containing multi-code-unit characters caused bad behaviour and possibly a crash.
  • When returning an error from pcre2_pattern_convert, ensure the error offset is set to zero for early errors.
  • Refactored pcre2_dfa_match so that the internal recursive calls no longer use the stack for local workspace and local ovectors. Instead, an initial block of stack is reserved, but if this is insufficient, heap memory is used. The heap limit parameter now applies to pcre2_dfa_match.
  • In pcre2_substitute, with global matching, a pattern that matched an empty string, but never at the starting match offset, was not handled in a Perl-compatible way. The pattern (?<=\G.) is an example of such a pattern. Because \G is in a lookbehind assertion, there has to be a “bumpalong” before there can be a match. The automatic “advance by one character after an empty string match” rule is therefore inappropriate. A more complicated algorithm has now been implemented.
  • When checking to see if a lookbehind is of fixed length, lookaheads were correctly ignored, but quantifiers on lookaheads were not being ignored, leading to an incorrect “lookbehind assertion is not fixed length” error.
  • Updated to Unicode version 11.0.0. As well as the usual addition of new scripts and characters, this involved re-jigging the grapheme break property algorithm because Unicode has changed the way emojis are handled.
  • Fixed an obscure bug that struck when there were two atomic groups not separated by something with a backtracking point. There could be an incorrect backtrack into the first of the atomic groups. A complicated example is (?>a(*:1))(?>b)(*SKIP:1)x|.* matched against “abc”, where the *SKIP shouldn't find a MARK (because it is in an atomic group), but it did.
  • (*ACCEPT:ARG), (*FAIL:ARG), and (*COMMIT:ARG) are now supported.
  • A (*MARK) name was not being passed back for positive assertions that were terminated by (*ACCEPT).
  • Add support for \N{U+dddd}, but only in Unicode mode.
  • Add support for (?^) for unsetting all imnsx options.
  • The PCRE2_EXTENDED (/x) option only ever discarded space characters whose code point was less than 256. Now, when Unicode support is compiled, PCRE2_EXTENDED also discards U+0085, U+200E, U+200F, U+2028, and U+2029, which are additional characters defined by Unicode as “Pattern White Space”. This makes PCRE2 compatible with Perl.
  • In certain circumstances, option settings within patterns were not being correctly processed. For example, the pattern ((?i)A)(?m)B incorrectly matched “ab”. (The (?m) setting lost the fact that (?i) should be reset at the end of its group during the parse process, but without another setting such as (?m) the compile phase got it right.)
  • When serializing a pattern, set the memctl, executable_jit, and tables fields (that is, all the fields that contain pointers) to zeros so that the result of serializing is always the same. These fields are re-set when the pattern is deserialized.
  • In a pattern such as [^\x{100}-\x{ffff}]*[\x80-\xff] which has a repeated negative class with no characters less than 0x100 followed by a positive class with only characters less than 0x100, the first class was incorrectly being auto-possessified, causing incorrect match failures.
  • If the only branch in a conditional subpattern was anchored, the whole subpattern was treated as anchored, when it should not have been, since the assumed empty second branch cannot be anchored. Demonstrated by test patterns such as (?(1)^())b or (?(?=^))b.
  • A repeated conditional subpattern that could match an empty string was always assumed to be unanchored. Now it is checked just like any other repeated conditional subpattern, and can be found to be anchored if the minimum quantifier is one or more.
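
For the PCRE2_EXTENDED entry above, the set of characters that Unicode defines as “Pattern White Space” can be written down directly. The Python sketch below only classifies characters; the names PATTERN_WHITE_SPACE, ADDED_IN_THIS_RELEASE, and is_pattern_white_space are assumptions for illustration. It does not reproduce how an extended-mode pattern is actually parsed, where character classes, escapes, and comments need separate handling.

```python
# Code points with the Unicode Pattern_White_Space property.
PATTERN_WHITE_SPACE = frozenset([
    0x0009, 0x000A, 0x000B, 0x000C, 0x000D,  # tab, LF, VT, FF, CR
    0x0020,                                  # space
    0x0085,                                  # next line (NEL)
    0x200E, 0x200F,                          # left-to-right / right-to-left mark
    0x2028, 0x2029,                          # line / paragraph separator
])

# The additional characters named in the changelog entry.
ADDED_IN_THIS_RELEASE = frozenset([0x0085, 0x200E, 0x200F, 0x2028, 0x2029])


def is_pattern_white_space(ch: str) -> bool:
    """Return True if the single character `ch` is Pattern White Space."""
    return ord(ch) in PATTERN_WHITE_SPACE


sample = "a\u200Eb \u2028c"
kept = "".join(c for c in sample if not is_pattern_white_space(c))
```

Here kept holds only the three letters, since the left-to-right mark, the space, and the line separator are all classified as Pattern White Space.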

YuPcre2 1.9.1 – 1 Jan 2019

  • Fix TDIRegEx2_16.MatchNext, which might not have properly advanced the start offset if the previous match was an empty string.
  • In YuPcre2_RegEx2.pas, replace a few character constants with ordinal constants to work around duplicate case label errors with at least one Delphi 10.3 Rio installation.

YuPcre2 1.9.0 – 24 Dec 2018

  • Support Delphi 10.3 Rio Win32 and Win64.

YuPcre2 1.8.0 – 2 Mar 2018

  • Add new pcre2_config options: PCRE2_CONFIG_NEVER_BACKSLASH_C and PCRE2_CONFIG_COMPILED_WIDTHS.
  • Defined public names for all the pcre2_compile error numbers.
  • When an assertion contained (*ACCEPT) it caused all open capturing groups to be closed (as for a non-assertion ACCEPT), which was wrong and could lead to misbehaviour for subsequent references to groups that started outside the assertion. ACCEPT in an assertion now closes only those groups that were started within that assertion.
  • Although pcre2_jit_match checks whether the pattern is compiled in a given mode, it was also expected that at least one mode is available. This is fixed and pcre2_jit_match returns with PCRE2_ERROR_JIT_BADOPTION when the pattern is not optimized by JIT at all.
  • If a backreference with a minimum repeat count of zero was first in a pattern, apart from assertions, an incorrect first matching character could be recorded. For example, for the pattern (?=(a))\1?b, “b” was incorrectly set as the first character of a match.
  • Characters in a leading positive assertion are considered for recording a first character of a match when the rest of the pattern does not provide one. However, a character in a non-assertive group within a leading assertion such as in the pattern (?=(a))\1?b caused this process to fail. This was an infelicity rather than an outright bug, because it did not affect the result of a match, just its speed. (In fact, in this case, the starting 'a' was subsequently picked up in the study.)
  • Allocate a single callout block on the stack at the start of pcre2_match and set its never-changing fields once only. Do the same for pcre2_dfa_match.
  • Save the extra compile options (set in the compile context) with the compiled pattern (they were not previously saved), add PCRE2_INFO_EXTRAOPTIONS to retrieve them.
  • Added PCRE2_CALLOUT_STARTMATCH and PCRE2_CALLOUT_BACKTRACK bits to a new field callout_flags in callout blocks. The bits are set by pcre2_match, but not by JIT or pcre2_dfa_match. These bits are provided to help with tracking how a backtracking match is proceeding.
  • When PCRE2_FIRSTLINE without PCRE2_NO_START_OPTIMIZE was used in non-JIT matching (both pcre2_match and pcre2_dfa_match) and the matched string started with the first code unit of a newline sequence, matching failed because it was not tried at the newline.
  • Code for giving up a non-partial match after failing to find a starting code unit anywhere in the subject was missing when searching for one of a number of code units (the bitmap case) in both pcre2_match and pcre2_dfa_match. This was a missing optimization rather than a bug.
  • The JIT compiler has been updated.
  • Avoid pointer overflow for unset captures in pcre2_substring_list_get. This could not actually cause a crash because it was always used in a memcpy() call with zero length.
  • Auto-possessification at the end of a capturing group was dependent on what follows the group (e.g. (a+)b would auto-possessify the a+) but this caused incorrect behaviour when the group was called recursively from elsewhere in the pattern where something different might follow. Iterators at the ends of capturing groups are no longer considered for auto-possessification if the pattern contains any recursions.

YuPcre2 1.7.0 – 16 Aug 2017

  • Implement PCRE2_ENDANCHORED, coEndAnchored, and moEndAnchored.
  • Add an explicit limit on the amount of heap used by pcre2_match, set by pcre2_set_heap_limit, TDIPerlRegEx2_8.HeapLimit, TDIDfaRegEx2_16.HeapLimit, and the pattern start (*LIMIT_HEAP=xxx).
  • Extend auto-anchoring etc. to ignore groups with a zero quantifier and single-branch conditions with a false condition (e.g. DEFINE) at the start of a branch. For example, (?(DEFINE)…)^A and (…){0}^B are now flagged as anchored.
  • Implement PCRE2_EXTENDED_MORE and coExtendedMore, and related /xx and (?xx) features.
  • Implement (?n: for PCRE2_NO_AUTO_CAPTURE and coNoAutoCapture, because Perl now has this.
  • Implement extra compile options in the compile context:
    • PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES and coAllowSurrogateEscapes;
    • PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL and coBadEscapeIsLiteral;
    • PCRE2_EXTRA_MATCH_LINE and coMatchLine;
    • PCRE2_EXTRA_MATCH_WORD and coMatchWord.
  • Implement newline type PCRE2_NEWLINE_NUL.
  • A lookbehind assertion that had a zero-length branch caused undefined behaviour when processed by pcre2_dfa_match.
  • The match limit value now also applies to pcre2_dfa_match as there are patterns that can use up a lot of resources without necessarily recursing very deeply.
  • Implement PCRE2_LITERAL and coLiteral.
  • Increased the limit for searching for a “must be present” code unit in subjects from 1000 to 2000 for 8-bit searches, since they are much faster.
  • Arrange for anchored patterns to record and use “first code unit” data, because this can give a fast “no match” without searching for a “required code unit”. Previously only non-anchored patterns did this.
  • Upgraded the Unicode tables from Unicode 8.0.0 to Unicode 10.0.0.
  • Update extended grapheme breaking rules to the latest set that are in Unicode Standard Annex #29.
  • Added experimental foreign pattern conversion facilities (pcre2_pattern_convert and friends).
  • If a hyphen that follows a character class is the last character in the class, Perl does not give a warning. PCRE2 now also treats this as a literal.
  • PCRE2 was not throwing an error for [\d-X] (and similar escapes), as is documented.

YuPcre2 1.6.0 – 3 Apr 2017

New features:

  • Support Delphi 10.2 Tokyo Win32 and Win64.
  • The main interpreter, pcre2_match, has been refactored into a new version that does not use recursive function calls (and therefore the stack) for remembering backtracking positions. The new implementation allows backtracking into recursive group calls in patterns, making it more compatible with Perl, and also fixes some other hard-to-do issues.
    • Now that pcre2_match no longer uses recursive function calls (see above), the “match limit recursion” value seems misnamed. It still exists, and limits the depth of tree that is searched. To avoid future confusion, it has been renamed as “depth limit” in all relevant places (TDIRegEx2Base.MatchLimitDepth, PCRE2_INFO_DEPTHLIMIT, PCRE2_CONFIG_DEPTHLIMIT, PCRE2_ERROR_DEPTHLIMIT, pcre2_set_depth_limit, etc.) but the old names are still available for backwards compatibility.
    • PCRE2_CONFIG_STACKRECURSE is no longer used and is deprecated.
  • Added the PCRE2_INFO_FRAMESIZE item to pcre2_pattern_info and the InfoFrameSize property to TDIRegEx2_8 as well as TDIRegEx2_16.InfoFrameSize.
  • The depth (formerly recursion) limit now applies to DFA matching.

Bug fixes:

  • In the 32-bit library in non-UTF mode, an attempt to find a Unicode property for a character with a code point greater than 0x10ffff (the Unicode maximum) caused a crash.
  • If a lookbehind assertion that contained a back reference to a group appearing later in the pattern was compiled with the PCRE2_ANCHORED option, undefined actions (often a segmentation fault) could occur, depending on what other options were set. An example assertion is (?<!\1(abc)) where the reference \1 precedes the group (abc).
  • Fix memory leak in pcre2_serialize_decode when the input is invalid.
  • Fix potential nil dereference in pcre2_callout_enumerate if called with a nil pattern pointer.
  • The alternative matching function, pcre2_dfa_match, misbehaved if it encountered a character class with a possessive repeat, for example [a-f]{3}+.

YuPcre2 1.5.0 – 17 Feb 2017

New features:

  • Implemented pcre2_code_copy_with_tables.
  • \g{+<number>} (e.g. \g{+2}) is now supported. It is a “forward back reference” and can be useful in repetitions (compare \g{-<number>}). Perl does not recognize this syntax.

Optimizations:

  • When a pattern is too complicated, PCRE2 gives up trying to find a minimum matching length and just records zero. Typically this happens when there are too many nested or recursive back references. If the limit was reached in certain recursive cases it failed to be triggered and an internal error could be the result.
  • The pcre2_dfa_match function now takes note of the recursion limit for the internal recursive calls that are used for lookarounds and recursions within the pattern.
  • Detecting patterns that are too large inside the length-measuring loop saves processing ridiculously long patterns to their end.
  • When autopossessifying, skip empty branches without recursion, to reduce stack usage. Example pattern: X?(R||){3335}.
  • A pattern with very many explicit back references to a group that is a long way from the start of the pattern could take a long time to compile because searching for the referenced group in order to find the minimum length was being done repeatedly. Now up to 128 group minimum lengths are cached and the attempt to find a minimum length is abandoned if there is a back reference to a group whose number is greater than 128. (In that case, the pattern is so complicated that this optimization probably isn't worth it.)
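
The caching described in the last entry above amounts to bounded memoization. The Python sketch below illustrates that idea only, under stated assumptions: the function backref_min_lengths and the callback find_group_min are hypothetical names, and the real compiler works on its internal pattern representation rather than on a list of group numbers. The limit of 128 is the figure quoted in the entry.

```python
MAX_CACHED_GROUP = 128  # limit quoted in the changelog entry


def backref_min_lengths(backrefs, find_group_min):
    """Return the minimum length for each back reference, or None on give-up.

    `backrefs` lists the referenced group numbers in pattern order.
    `find_group_min` is a (hypothetical) slow function that searches the
    pattern for a group and returns its minimum length.
    """
    cache = {}
    result = []
    for group in backrefs:
        if group > MAX_CACHED_GROUP:
            # Too complicated: abandon the minimum-length optimization.
            return None
        if group not in cache:
            # The slow search runs once per distinct group only.
            cache[group] = find_group_min(group)
        result.append(cache[group])
    return result


calls = []


def slow_lookup(group):
    calls.append(group)   # record how often the slow search runs
    return group * 2      # stand-in value for a group's minimum length


lengths = backref_min_lengths([3, 3, 3, 5, 3], slow_lookup)
calls_after_first = list(calls)
gave_up = backref_min_lengths([3, 200], slow_lookup)
```

Five back references trigger only two slow lookups here, and a reference to group 200 makes the function give up, mirroring the trade-off the entry describes.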

Bug fixes:

  • In any wide-character mode (8-bit UTF or any 16-bit or 32-bit mode), without PCRE2_UCP set, a negative character type such as \D in a positive class should cause all characters greater than 255 to match, whatever else is in the class. There was a bug that caused this not to happen if a Unicode property item was added to such a class, for example [\D\P{Nd}] or [\W\pL].
  • There has been a major re-factoring of pcre2_compile. Most syntax checking is now done in the pre-pass that identifies capturing groups. While doing this, some minor bugs and Perl incompatibilities were fixed, including:
    1. \Q\E in the middle of a quantifier such as A+\Q\E+ is now ignored instead of giving an invalid quantifier error.
    2. {0} can now be used after a group in a lookbehind assertion; previously this caused an “assertion is not fixed length” error.
    3. Perl always treats (?(DEFINE) as a “define” group, even if a group with the name “DEFINE” exists. PCRE2 now does likewise.
    4. A recursion condition test such as (?(R2)…) must now refer to an existing subpattern.
    5. A conditional recursion test such as (?(R)…) misbehaved if there was a group whose name began with “R”.
    6. A hyphen appearing immediately after a POSIX character class (for example [[:ascii:]-z]) now generates an error. Perl does accept this as a literal, but gives a warning, so it seems best to fail it in PCRE.
    7. An empty \Q\E sequence may appear after a callout that precedes an assertion condition (it is, of course, ignored).

      One effect of the refactoring is that some error numbers and messages have changed, and the pattern offset given for compiling errors is not always the right-most character that has been read. In particular, for a variable-length lookbehind assertion it now points to the start of the assertion. Another change is that when a callout appears before a group, the “length of next pattern item” that is passed now just gives the length of the opening parenthesis item, not the length of the whole group. A length of zero is now given only for a callout at the end of the pattern.
  • Back references are now permitted in lookbehind assertions when there are no duplicated group numbers (that is, (?| has not been used), and, if the reference is by name, there is only one group of that name. The referenced group must, of course, be of fixed length.
  • Automatic callouts are no longer generated before and after explicit callouts in the pattern.
  • A number of bugs have been mended relating to match start-up optimizations when the first thing in a pattern is a positive lookahead. These all applied only when PCRE2_NO_START_OPTIMIZE was *not* set:
    1. A pattern such as (?=.*X)X$ was incorrectly optimized as if it needed both an initial 'X' and a following 'X'.
    2. Some patterns starting with an assertion that started with .* were incorrectly optimized as having to match at the start of the subject or after a newline. There are cases where this is not true, for example, (?=.*[A-Z])(?=.{8,16})(?!.*[\s]) matches after the start in lines that start with spaces. Starting .* in an assertion is no longer taken as an indication of matching at the start (or after a newline).
  • A pattern with PCRE2_DOTALL (/s) set but not PCRE2_NO_DOTSTAR_ANCHOR, and which started with .* inside a positive lookahead was incorrectly being compiled as implicitly anchored.
  • Fix out-of-bounds read for partial matching of . against an empty string when the newline type is CRLF.
  • The appearance of \p, \P, or \X in a substitution string when PCRE2_SUBSTITUTE_EXTENDED was set caused a segmentation fault (nil dereference).
  • If the starting offset was specified as greater than the subject length in a call to pcre2_substitute an out-of-bounds memory reference could occur.
  • Incorrect data was compiled for a pattern with PCRE2_UCP set without PCRE2_UTF if a class required all wide characters to match (for example, [\s[:^ascii:]]).
  • The limit in the auto-possessification code that was intended to catch overly-complicated patterns and not spend too much time auto-possessifying was being reset too often, resulting in very long compile times for some patterns. Now such patterns are no longer completely auto-possessified.
  • Ignore PCRE2_CASELESS when processing \h, \H, \v, and \V in classes as it just wastes time. In the UTF case it can also produce redundant entries in XCLASS lists caused by characters with multiple other cases and pairs of characters in the same “not-x” sublists.
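
The two lookahead patterns quoted in the start-up optimization fixes above can be checked against any backtracking engine. The sketch below uses Python's `re` module purely as a stand-in (an assumption for illustration; it is not YuPcre2 or PCRE2, and the subject strings are made up). It shows the results a correct engine should give: the first pattern needs only one 'X', and the second can match after the start of a line that begins with spaces.

```python
import re

# Assumption: Python's re engine is used only to illustrate the expected
# matching behaviour; it is not the PCRE2 engine that YuPcre2 wraps.

# (?=.*X)X$ needs a single 'X' at the end of the subject, not two.
one_x = re.compile(r"(?=.*X)X$")
m1 = one_x.search("abcX")

# This pattern is not implicitly anchored: the negative lookahead rejects
# every start position that still has whitespace ahead of it.
password_like = re.compile(r"(?=.*[A-Z])(?=.{8,16})(?!.*[\s])")
m2 = password_like.search("   Password1")

start_one_x = m1.start() if m1 else None
start_password = m2.start() if m2 else None
print(start_one_x, start_password)
```

Both matches start at offset 3 rather than at offset 0. An engine that wrongly required a second 'X', or that treated the leading .* as an anchor, would miss one or both of them.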

YuPcre2 1.4.0 – 31 Jul 2016

New features:

  • Implemented pcre2_code_copy to make a copy of a compiled pattern.
  • Implemented the PCRE2_NO_JIT option for pcre2_match and the moNoJit option for TDIRegEx2Base.MatchOptions.
  • Calls to pcre2_get_error_message with error numbers that are never returned by PCRE2 functions were returning empty strings. Now the error code PCRE2_ERROR_BADDATA is returned.
  • Allow \C in lookbehinds and DFA matching in UTF-32 mode.

Bug fixes:

  • Detect unmatched closing parentheses and give the error in the pre-scan instead of later. Previously the pre-scan carried on and could give a misleading error message. For example, (?J)(?'a'))(?'a') gave a message about invalid duplicate group names.
  • A pattern that included (*ACCEPT) in the middle of a sufficiently deeply nested set of parentheses of sufficient size caused an overflow of the compiling workspace (which was diagnosed, but of course is not desirable).
  • Detect missing closing parentheses during the pre-pass for group identification.
  • Fix a race condition in JIT.
  • Fix register overwrite in JIT when SSE2 acceleration is enabled.

YuPcre2 1.3.0 – 7 May 2016

  • Support Delphi 10.1 Berlin Win32 and Win64.

YuPcre2 1.2.0 – 4 Mar 2016

New features:

  • New option to limit the length of a pattern: TDIRegEx2Base.MaxPatternLength and pcre2_set_max_pattern_length.
  • New option to limit the offset of unanchored matches: TDIRegEx2Base.OffsetLimit and pcre2_set_offset_limit.
  • New pcre2_substitute options PCRE2_SUBSTITUTE_EXTENDED, PCRE2_SUBSTITUTE_UNSET_EMPTY, PCRE2_SUBSTITUTE_UNKNOWN_UNSET, and PCRE2_SUBSTITUTE_OVERFLOW_LENGTH.

Bug fixes:

  • In a character class such as [\W\p{Any}] where both a negative-type escape (“not a word character”) and a property escape were present, the property escape was being ignored.
  • Fixed integer overflow for patterns whose minimum matching length is very, very large.
  • The special sequences [[:<:]] and [[:>:]] gave rise to incorrect compiling errors or other strange effects if compiled in UCP mode.
  • Adding group information caching improves the speed of compiling when checking whether a group has a fixed length and/or could match an empty string, especially when recursion or subroutine calls are involved.
  • If [:^ascii:] or [:^xdigit:] are present in a non-negated class, all characters with code points greater than 255 are in the class. When a Unicode property was also in the class (if PCRE2_UCP is set, escapes such as \w are turned into Unicode properties), wide characters were not correctly handled, and could fail to match. Negated classes such as [^[:^ascii:]\d] were also not working correctly in UCP mode.
  • If PCRE2_AUTO_CALLOUT was set on a pattern that had a (?# comment between an item and its quantifier (for example, A(?#comment)?B), pcre2_compile misbehaved.
  • Similarly, if an isolated \E was present between an item and its quantifier when PCRE2_AUTO_CALLOUT was set, pcre2_compile misbehaved.
  • The error for an invalid UTF pattern string always gave the code unit offset as zero instead of where the invalidity was found.
  • An empty \Q\E sequence between an item and its quantifier caused pcre2_compile to misbehave when auto callouts were enabled.
  • If both PCRE2_ALT_VERBNAMES and PCRE2_EXTENDED were set, and a (*MARK) or other verb “name” ended with whitespace immediately before the closing parenthesis, pcre2_compile misbehaved. Example: (*:abc ), but only when both those options were set.
  • In a number of places pcre2_compile was not handling NUL characters correctly.
  • If a pattern that was compiled with PCRE2_EXTENDED started with white space or a #-type comment that was followed by (?-x), which turns off PCRE2_EXTENDED, and there was no subsequent (?x) to turn it on again, pcre2_compile assumed that (?-x) applied to the whole pattern and consequently mis-compiled it. The fix for this bug means that a setting of any of the (?imsxU) options at the start of a pattern is no longer transferred to the options that are returned by PCRE2_INFO_ALLOPTIONS. In fact, this was an anachronism that should have changed when the effects of those options were all moved to compile time.
  • An escaped closing parenthesis in the “name” part of a (*verb) when PCRE2_ALT_VERBNAMES was set caused pcre2_compile to malfunction.

YuPcre2 1.1.0 – 15 Sep 2015

  • Support Delphi 10 Seattle Win32 and Win64.
  • Match limit check added to recursion.
  • Arrange for the UTF check in pcre2_match and pcre2_dfa_match to look only at the part of the subject that is relevant when the starting offset is non-zero.
  • Improve first character match in JIT with SSE2 on x86.
  • Fixed two assertion failures in JIT.
  • Fixed a corner case of range optimization in JIT.
  • Add the ${*MARK} facility to pcre2_substitute.
  • Implemented PCRE2_ALT_VERBNAMES and coAltVerbnames.
  • Fixed two issues in JIT.

YuPcre2 1.0.1 – 8 Aug 2015

  • Pathological patterns containing many nested occurrences of [: caused pcre2_compile to run for a very long time.
  • A missing closing parenthesis for a callout with a string argument was not being diagnosed, possibly leading to a buffer overflow.
  • A conditional group with only one branch has an implicit empty alternative branch and must therefore be treated as potentially matching an empty string.
  • If (?R was followed by - or +, incorrect behaviour occurred instead of a diagnostic being given.
  • Conditional groups whose condition was an assertion preceded by an explicit callout with a string argument might be incorrectly processed, especially if the string contained \Q.
  • Fix buffer overflow while checking a UTF-8 string if the final multi-byte UTF-8 character was truncated.
  • Finding the minimum matching length of complex patterns with back references and/or recursions can take a long time. There is now a cut-off that gives up trying to find a minimum length when things get too complex.
  • An optimization has been added that speeds up finding the minimum matching length for patterns containing repeated capturing groups or recursions.
  • If a pattern contained a back reference to a group whose number was duplicated as a result of appearing in a (?|…) group, the computation of the minimum matching length gave a wrong result, which could cause incorrect “no match” errors. For such patterns, a minimum matching length cannot at present be computed.
  • Added a check for integer overflow in conditions (?(<digits>) and (?(R<digits>).
  • Fixed an issue when \p{Any} inside an xclass did not read the current character.
  • The JIT compiler did not restore the control verb head in the case of (*THEN) control verbs.
  • The way recursive references such as (?3) are compiled has been re-written because the old way was the cause of many issues. Now, conversion of the group number into a pattern offset does not happen until the pattern has been completely compiled. This does mean that detection of all infinitely looping recursions is postponed till match time. In the past, some easy ones were detected at compile time.
  • A test for a back reference to a non-existent group was missing for items such as \987. This caused incorrect code to be compiled.
  • Error messages for syntax errors following \g and \k were giving inaccurate offsets in the pattern.
  • Improve the performance of starting single character repetitions in JIT.
  • (*LIMIT_MATCH=) now gives an error instead of setting the value to 0.
  • Error messages for syntax errors in (*LIMIT_MATCH=) and (*LIMIT_RECURSION=) now give the right offset instead of zero.
  • The JIT compiler should not check repeats after a {0,1} repeat byte code.
  • The JIT compiler should restore the control chain for empty possessive repeats.
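
The note above about conditional groups with only one branch can be illustrated with a small sketch. It again uses Python's `re` module as a stand-in engine (an assumption for illustration, not YuPcre2 itself), and the pattern (a)?(?(1)b) is a made-up example. When group 1 did not participate in the match, the implicit empty alternative is taken, so the conditional group matches the empty string.

```python
import re

# Assumption: Python's re is used only to show the semantics of a
# one-branch conditional group; it is not the PCRE2 engine.
cond = re.compile(r"(a)?(?(1)b)")

# Group 1 unset: the implicit empty "no" branch is taken.
matches_empty = cond.fullmatch("") is not None

# Group 1 set: the single explicit branch 'b' is required.
matches_ab = cond.fullmatch("ab") is not None

# 'a' alone fails: with group 1 set 'b' is missing, and with group 1
# unset the whole pattern matches only the empty string.
matches_a = cond.fullmatch("a") is not None

print(matches_empty, matches_ab, matches_a)
```

Because the group can match nothing at all, an engine has to treat it as potentially empty when working out things like the minimum match length.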

YuPcre2 1.0.0 – 22 Jul 2015

  • Initial release.
products/pcre2/history.txt · Last modified: 2020/05/12 18:27 (external edit)