TDIHtmlTablesPlugin.CurrentTable property.
TDIHtmlTable.TableNum property.
DIUri_3986 when cleaning up leading /../ and /.. dot segments on an empty base.
Update DIUtils.pas Unicode functions to Unicode 14.0.0.
Delphi compilers with support for the inline directive (starting with Delphi 2005) failed to compile DIHtmlParser *.bpl packages for the Demo and Commercial editions. The compiler stopped with the error “[dcc32 Fatal Error] DIUtils: F2051 Unit DIContainers was compiled with a different version of DIUtils.StrSameIW”. Regular *.exe applications compiled without problems. The DIHtmlParser Source Code also compiled to both *.bpl packages and *.exe applications without problems.
Extend character support to the full range of Unicode Code Points from $000000 to $10FFFF.
Up to now, DIHtmlParser stored code points as WideChars. This limited Unicode support to the Basic Multilingual Plane (BMP) from $0000 to $FFFF. Code points from the Supplementary Planes were converted to the $FFFD replacement character. This worked well for a great number of languages, but less common scripts did not work, nor did the increasingly popular emojis from the Symbols and Pictographs Unicode blocks.
DIHtmlParser 8.0.0 overcomes these limitations and now covers the complete Unicode range. Changes are almost entirely internal and maintain backwards compatibility as much as possible. Existing applications should compile with no or only minor changes. WideChar routines are marked as deprecated and hint at their new complementary UCP routines.
TDIHtmlParser.Data is still a WideChar buffer. However, its content is now fully UTF-16 encoded. This means that it may contain code points > $FFFF which take up two WideChars (surrogate pairs). As a result, indexed access to the buffer is no longer guaranteed to return a complete code point. TDIHtmlParser.Data related methods, like TDIHtmlParser.DataAsStrTrimW, are adjusted accordingly.
UnicodeString utility routines are rewritten to handle full UTF-16, including surrogate pairs. Most of them are in DIUtils.pas. YuUtf.pas also contains new utility routines for UTF-16 testing, encoding, and decoding. Where possible, string handling routines now take NativeInt type parameters for the buffer length.
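The effect of surrogate pairs on indexed access can be illustrated with a minimal sketch. It does not use any DIHtmlParser routine: NextCodePoint is a hypothetical helper written only for this illustration, and the sketch assumes a compiler with a UnicodeString type (Delphi 2009 or later, or Free Pascal). The library's own UTF-16 routines in DIUtils.pas and YuUtf.pas may well be named and shaped differently.

```pascal
program Utf16Walk;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

{ Hypothetical helper, not a DIHtmlParser routine: returns the code point that
  starts at the 1-based index I of S and advances I past it. }
function NextCodePoint(const S: UnicodeString; var I: Integer): UCS4Char;
var
  W1, W2: Word;
begin
  W1 := Ord(S[I]);
  Inc(I);
  // Default: a BMP character, or an unpaired surrogate returned unchanged.
  Result := W1;
  // A high surrogate ($D800..$DBFF) followed by a low surrogate
  // ($DC00..$DFFF) encodes a single code point above $FFFF.
  if (W1 >= $D800) and (W1 <= $DBFF) and (I <= Length(S)) then
  begin
    W2 := Ord(S[I]);
    if (W2 >= $DC00) and (W2 <= $DFFF) then
    begin
      Inc(I);
      Result := $10000 + (UCS4Char(W1 - $D800) shl 10) + UCS4Char(W2 - $DC00);
    end;
  end;
end;

var
  S: UnicodeString;
  I, Count: Integer;
begin
  // 'a', then U+1F600 stored as the surrogate pair $D83D $DE00, then 'b'.
  SetLength(S, 4);
  S[1] := 'a';
  S[2] := WideChar($D83D);
  S[3] := WideChar($DE00);
  S[4] := 'b';

  I := 1;
  Count := 0;
  while I <= Length(S) do
  begin
    WriteLn('U+', IntToHex(Int64(NextCodePoint(S, I)), 6));
    Inc(Count);
  end;
  // The buffer length counts WideChars, not code points.
  WriteLn(Length(S), ' WideChars, ', Count, ' code points');
end.
```

In this sketch S[2] and S[3] are the two halves of one code point, which is why code that indexes such a buffer per WideChar has to check for surrogates before treating an element as a character.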
Other noteworthy changes:
TDIHtmlParser.UCP complements TDIHtmlParser.Char.
TDIHtmlParser.CustomTagStartChar has a new UCS4Char complement CustomTagStartUcp. The same holds for TDIHtmlWriterPlugin.CustomTagStartChar and CustomTagStartUcp.
TDICustomTag.GetStartCode has a new UCS4Char overload. So do GetEmptyElementCode and GetEndCode.
Change TDIHtmlParser.StartCol, EndCol, StartLine, EndLine, StartPos, and EndPos from unsigned Cardinal to signed NativeInt.
DI_No_Classes and DI_No_Unicode_Component (source code only). TDIHtmlParser and TDIHtmlParserPlugin now always descend from TComponent and the Classes unit is always used. Source code only.
Extend DIUtils.pas Unicode processing to support Unicode Code Points from $000000 to $10FFFF. Adjust remaining source code accordingly.
Update DIUtils.pas Unicode functions to Unicode 12.1.0.
DIUtils.pas. There is no error message, so it is not possible to work around the problem. Support for these compilers is therefore removed. At least Delphi 6 is now required to compile DIHtmlParser.
DI.inc include file. Directly link in DICompilers.inc instead.
Fix TDIUnicodeWriter memory leak if TDIUnicodeWriteMethods.Init allocates its own memory.
TDIUnicodeWriter.Clear calls TDIUnicodeWriteMethods.Flush to reset encoder state.
Update DIUtils.pas Unicode functions to Unicode 12.
Read_iso_2022_jp_ms read methods and Write_iso_2022_jp_ms write methods. This is recognized by TDIHtmlCharSetPlugin.
TDIHtmlWriterPlugin.PredefinedEntities:
  peLtAttribValue to encode “<” as &lt; in attribute values.
  Required for XML conformance.
  peGtAttribValue to encode “>” as &gt; in attribute values.
  peQuotNum to encode quotation mark as numeric &#34; instead of &quot;.
peAposNum was not applied to attribute values.
TDIHtmlWriterPlugin properties to force the character used to quote attribute values:
  QuoteHtmlTagsChar
  QuoteCustomTagsChar
  QuoteSsiTagsChar
Mark DIUri as deprecated.
TDIHtmlChangeLinksPlugin uses unit DIUri_3986 instead of the deprecated unit DIUri.
TDIHtmlCharSetPlugin: Fix so that a second <meta http-equiv> tag which is not a content type no longer resets the decoding to the default decoding.
TAG_SECTION, TAG_SECTION_ID and ATTRIB_PLACEHOLDER and ATTRIB_PLACEHOLDER_ID. The new HTML5 tags and attributes are automatically registered by calling RegisterHtmlTags and RegisterHtmlAttribs.
RegisterHtmlDecodingEntities, DIHtmlParser now recognizes all 2231 references listed in the current HTML5 draft.
';' is no longer required. For example, &amp is recognized as '&' just as &amp;, &#38;, and &#x26;.
';'. Change: If a terminating semicolon ';' is present, RegisterDecodingEntity now requires that it be present in the entity name.
TDIHtmlCharSetPlugin recognizes the new HTML5 <meta charset="name"> character encoding declaration.
DIUri_3986.TDIUri.AssignPath and DIUri_3986.TDIUri.AssignHost methods, plus DIUri_3986.UriToFileName with DIUri_3986.TDIUri URI input and UnicodeString filename output.
TDIHtmlParser.SourceStream, the size of the internal source buffer was not correctly calculated. Depending on the decoding, this slowed down reading or even stopped it before the end of the stream was reached.
DIUri.UriToFileName removes 'localhost' from authority, if present. Despite this change, DIUri is now deprecated. Use DIUri_3986 instead.
ColorFromHtml: Improve parsing of #color values, in particular different lengths.
Parse non-conforming #color values as legacy color values.
Add EmptyAttribValues parameter (default = false) to
  TDIHtmlTag.GetCode, TDIHtmlTag.GetStartCode, TDIHtmlTag.GetEmptyElementCode,
  TDICustomTag.GetCode, TDICustomTag.GetStartCode, TDICustomTag.GetEmptyElementCode,
  TDISsiTag.GetCode, TDISsiTag.GetStartCode, TDISsiTag.GetEmptyElementCode.
TDIHtmlParser.FillSourceBuffer (source code edition only).
</script> and </style> so they accept attribute content like the other end-tags. This does not strictly conform to the HTML specifications but is sometimes found in real-world HTML.
EndLine, EndCol, and EndPos functions determine the end of the current HTML piece.
TDIVector or descendants like TDITag and TDIHtmlTag.
</SCRIPT> end tag was missing.
<script> and </script> elements.
<![CDATA[ beginning of ptCDataSection case-sensitively, as per specification.
<![CDATA[ … ]]> sections separately inside JavaScript comments. This fixes a problem with pages that use a commented CDATA section inside a script element but do not properly close this comment before the closing </script> end tag. Such end tags are now recognized by DIHtmlParser.
TDICustomHtmlWriterPlugin intermediate interface for greater flexibility in customizing TDIHtmlWriterPlugin.
TDIHtmlParser.DataAsStrTrim8 convenience method.
DIUtils.pas.
TDIHtmlParser parsing options:
  TDIHtmlParser.EnableComments.
  TDIHtmlParser.EnableEntities.
  TDIHtmlParser.EnableExclamationMarkups.
TDIHtmlParser: When parsing JavaScript, a forward slash “/” inside a regular expression character class was not recognized as such and could lead to an infinite loop.
TDIHtmlCharSetPlugin: Correct decoding function for “GBK” encoding which did not read the 1 to 127 character range.
DIUtils.pas which caused an error when compiled on a Windows OS set to a non-European (Asian, Cyrillic, etc.)
codepage.
TDIHtmlTag, TDICustomTag, TDISsiTag: .ConCatValue must not escape a '&' character in an attribute value immediately followed by a '{' character (HTML 4.01 Section B.7.1).
TDIHtmlParser.TrimAttribValues behaved exactly opposite to what was intended.
TDIHtmlLinksPlugin2.
TDIHtmlCollectLinksPlugin.
TDIHtmlChangeLinksPlugin.
TDIHtmlParser.FindHtmlTag, TDIHtmlParser.FindSsiTag, and TDIHtmlParser.ParseNextHtmlTag.
TDIHtmlParser.EnableHtmlTags property which controls whether HTML tags are properly recognized as such or are simply treated as text. Ignoring HTML tags can be useful for HTML scripting.
TDIHtmlParser.TrimAttribValues property which controls whether whitespace is automatically trimmed when parsing the attribute values of tags.
TDIHtmlParser.DefaultContentScriptType property to determine the content script type from outside the HTML document.
Write_UTF_7 / Read_UTF_7)
Write_UTF_7_ODC / reads as Read_UTF_7)
'<k$R>') as HTML Tags instead of Text. There is also a new piece type ptExclamationMarkup covering inserts starting with an exclamation mark like '<!A>'. It is returned for the character patterns '<! … >' which are not Comments, CData Sections, Document Templates, or SSI.
'<?XML Char* ?>'. By specification, XmlPI must terminate with '?>', but the '?' is sometimes missing. Specification-conformant parsing would then cause DIHtmlParser to unintentionally interpret lengthy stretches as XmlPI. This is now fixed by recognizing both variants as ending an XmlPI.
' ') in some cases.
Change TDIHtmlParser.StopParseAll procedure to a TDIHtmlParser.StopParse property. This must be set to True to stop the current parsing process. It applies to both TDIHtmlParser.ParseAll and TDIHtmlParser.ParseNextPiece, where it cancels an ongoing parsing process which has not yet returned to the caller.
TDIAbstractHtmlAttribsPlugin as ancestor class of TDIHtmlLinksPlugin, which now responds to a much wider range of link combinations, including multiple links contained within a single tag.
Applications can also add custom Tag / Attribute combinations to report by calling TDIAbstractHtmlAttribsPlugin.AddAttrib. The TDIHtmlLinksPluginEvent callback definition has changed slightly and requires an interface change to existing applications.
TDIHtmlWriterPlugin.PredefinedEntities option which allows specifying some known predefined entities which will always be encoded by default when writing HTML text, regardless of other entity registrations.
Rename TDITag.ForceAttribValue to TDITag.ForceAttrib.
TDITag and descendant classes benefit from changes to DIContainers ancestors. This includes speed optimizations as well as some interface simplifications.