TDIHtmlTablesPlugin.CurrentTable property.
TDIHtmlTable.TableNum property.
DIUri_3986 when cleaning up leading /../ and /.. dot segments on an empty base.
DIUtils.pas Unicode functions to Unicode 14.0.0.
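For readers unfamiliar with the dot-segment cleanup mentioned in the DIUri_3986 entry above, here is a minimal sketch of the remove_dot_segments procedure from RFC 3986, section 5.2.4. It is written in Python for illustration only and is not DIHtmlParser code; the function name is hypothetical, and how DIUri_3986 itself implements this step is an assumption not confirmed by these notes.

```python
def remove_dot_segments(path):
    """Remove "." and ".." segments from a URI path (RFC 3986, 5.2.4)."""
    output = []  # completed path segments, each including its leading "/"
    while path:
        if path.startswith("../"):        # rule A: drop a leading "../"
            path = path[3:]
        elif path.startswith("./"):       # rule A: drop a leading "./"
            path = path[2:]
        elif path.startswith("/./"):      # rule B: "/./" becomes "/"
            path = path[2:]
        elif path == "/.":                # rule B: trailing "/." becomes "/"
            path = "/"
        elif path.startswith("/../"):     # rule C: "/../" becomes "/" and
            path = path[3:]               # the previous segment is removed
            if output:
                output.pop()
        elif path == "/..":               # rule C: trailing "/.." likewise
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):         # rule D: a lone "." or ".." vanishes
            path = ""
        else:                             # rule E: move one segment to output
            cut = path.find("/", 1)
            if cut == -1:
                output.append(path)
                path = ""
            else:
                output.append(path[:cut])
                path = path[cut:]
    return "".join(output)


print(remove_dot_segments("/a/b/c/./../../g"))
print(remove_dot_segments("/../x"))
```

When the output list is empty, as with a leading /../ on an empty base, rule C has nothing to remove and the dot segment is simply dropped.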
Delphi compilers with support for the inline directive (starting with Delphi 2005) failed to compile DIHtmlParser *.bpl packages for the Demo and Commercial editions. They generated a “[dcc32 Fatal Error] DIUtils: F2051 Unit DIContainers was compiled with a different version of DIUtils.StrSameIW”. Regular *.exe applications compiled without problems. The DIHtmlParser Source Code also compiled to both *.bpl packages and *.exe applications with no problems.
Extend character support to the full range of Unicode Code Points from $000000 to $10FFFF.
Up to now, DIHtmlParser stored code points as WideChars. This limited Unicode support to the Basic Multilingual Plane (BMP) from $0000 to $FFFF. Code points from the Supplementary Planes were converted to the $FFFD replacement character. This worked well for a great number of languages, but less common scripts did not work, nor did the increasingly popular emojis from the Symbols and Pictographs Unicode blocks.
DIHtmlParser 8.0.0 overcomes these limitations and now covers the complete Unicode range. Changes are almost entirely internal and maintain backwards compatibility as much as possible. Existing applications should compile with no changes or only minor ones. WideChar routines are marked as deprecated and point to their new complementary UCP routines.
TDIHtmlParser.Data is still a WideChar buffer. However, its contents are now fully UTF-16 encoded. This means that it may contain code points > $FFFF, which take up two WideChars (surrogate pairs). As a result, a buffer index no longer necessarily corresponds to one code point. Methods related to TDIHtmlParser.Data, like TDIHtmlParser.DataAsStrTrimW, are adjusted accordingly.
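To illustrate why a position in a UTF-16 buffer no longer maps one-to-one to a code point, the following minimal sketch encodes code points into UTF-16 code units (the values a WideChar buffer would hold). It is written in Python purely for illustration and is not part of DIHtmlParser; the function name utf16_units is hypothetical.

```python
def utf16_units(code_point):
    """Return the UTF-16 code units for a single Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    if code_point <= 0xFFFF:
        return [code_point]                # BMP: one code unit
    value = code_point - 0x10000           # 20 bits, split into 10 + 10
    high = 0xD800 + (value >> 10)          # high (leading) surrogate
    low = 0xDC00 + (value & 0x3FF)         # low (trailing) surrogate
    return [high, low]


# Three code points: "A", U+1F600 (an emoji), "B"
buffer = []
for cp in (0x41, 0x1F600, 0x42):
    buffer.extend(utf16_units(cp))

print(["$%04X" % unit for unit in buffer])
print(len(buffer))
```

The three code points occupy four code units, so index 2 of the buffer holds the low surrogate of the emoji rather than the third character.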
UnicodeString utility routines are rewritten to handle full UTF-16, including surrogate pairs. Most of them are in DIUtils.pas. YuUtf.pas also contains new utility routines for UTF-16 testing, encoding, and decoding. Where possible, string handling routines now take NativeInt type parameters for the buffer length.
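As a rough idea of what UTF-16 testing and decoding involve, the sketch below walks a sequence of code units and yields code points. It is a Python illustration under stated assumptions, not the DIUtils.pas or YuUtf.pas API: all function names are hypothetical, and replacing unpaired surrogates with $FFFD is a common convention assumed here, not documented library behavior.

```python
def is_high_surrogate(unit):
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit):
    return 0xDC00 <= unit <= 0xDFFF


def decode_utf16(units):
    """Yield code points from a sequence of UTF-16 code units.

    Unpaired surrogates are replaced with U+FFFD (an assumption made
    for this sketch).
    """
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        if (is_high_surrogate(unit) and i + 1 < n
                and is_low_surrogate(units[i + 1])):
            # Combine a valid surrogate pair into one code point.
            yield 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        elif is_high_surrogate(unit) or is_low_surrogate(unit):
            yield 0xFFFD                   # lone surrogate
            i += 1
        else:
            yield unit                     # ordinary BMP code unit
            i += 1


code_points = list(decode_utf16([0x41, 0xD83D, 0xDE00, 0x42]))
print(["$%06X" % cp for cp in code_points])
```

Advancing by two units after a surrogate pair is the step that plain indexed access skips.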
Other noteworthy changes:
TDIHtmlParser.UCP complements TDIHtmlParser.Char.
TDIHtmlParser.CustomTagStartChar has a new UCS4Char complement CustomTagStartUcp. The same holds for TDIHtmlWriterPlugin.CustomTagStartChar and CustomTagStartUcp.
TDICustomTag.GetStartCode has a new UCS4Char overload. So do GetEmptyElementCode and GetEndCode.
TDIHtmlParser.StartCol, EndCol, StartLine, EndLine, StartPos, and EndPos from unsigned Cardinal to signed NativeInt.
DI_No_Classes and DI_No_Unicode_Component (source code only). TDIHtmlParser and TDIHtmlParserPlugin now always descend from TComponent and the Classes unit is always used. Source code only.
DIUtils.pas Unicode processing to support Unicode Code Points from $000000 to $10FFFF. Adjust remaining source code accordingly.
DIUtils.pas Unicode functions to Unicode 12.1.0.
DIUtils.pas. There is no error message, so it is not possible to work around the problem. Support for these compilers is therefore removed. At least Delphi 6 is now required to compile DIHtmlParser.
DI.inc include file. Directly link in DICompilers.inc instead.
TDIUnicodeWriter memory leak if TDIUnicodeWriteMethods.Init allocates its own memory.
TDIUnicodeWriter.Clear calls TDIUnicodeWriteMethods.Flush to reset encoder state.
DIUtils.pas Unicode functions to Unicode 12.
Read_iso_2022_jp_ms read methods and Write_iso_2022_jp_ms write methods. This is recognized by TDIHtmlCharSetPlugin.
TDIHtmlWriterPlugin.PredefinedEntities:
  peLtAttribValue to encode “<” as &lt; in attribute values. Required for XML conformance.
  peGtAttribValue to encode “>” as &gt; in attribute values.
  peQuotNum to encode quotation mark as numeric &#34; instead of &quot;.
peAposNum was not applied to attribute values.
TDIHtmlWriterPlugin properties to force the character used to quote attribute values:
  QuoteHtmlTagsChar
  QuoteCustomTagsChar
  QuoteSsiTagsChar
DIUri as deprecated.
TDIHtmlChangeLinksPlugin uses unit DIUri_3986 instead of the deprecated unit DIUri.
TDIHtmlCharSetPlugin: Fix that a second <meta http-equiv> tag which is not a content type does not reset the decoding to the default decoding.
TAG_SECTION, TAG_SECTION_ID and ATTRIB_PLACEHOLDER and ATTRIB_PLACEHOLDER_ID. The new HTML5 tags and attributes are automatically registered when calling RegisterHtmlTags and RegisterHtmlAttribs.
RegisterHtmlDecodingEntities, DIHtmlParser now recognizes all 2231 references listed in the current HTML5 draft.
';' is no longer required. For example, &amp is recognized as '&', just like &amp; and the other reference forms for this character.
';'. Change: If a terminating semicolon ';' is present, RegisterDecodingEntity now requires that it be present in the entity name.
TDIHtmlCharSetPlugin recognizes the new HTML5 <meta charset=“name”> character encoding declaration.
DIUri_3986.TDIUri.AssignPath and DIUri_3986.TDIUri.AssignHost methods, plus DIUri_3986.UritoFileName with DIUri_3986.TDIUri URI input and UnicodeString filename output.
TDIHtmlParser.SourceStream, the size of the internal source buffer was not correctly calculated. Depending on the decoding, this slowed down reading or even stopped it before the end of the stream was reached.
DIUri.UriToFileName removes 'localhost' from authority, if present. Despite this change, DIUri is now deprecated. Use DIUri_3986 instead.
ColorFromHtml: Improve parsing of #color values, in particular different lengths. Parse non-conforming #color values as legacy color values.
EmptyAttribValues parameter (default = false) to TDIHtmlTag.GetCode, TDIHtmlTag.GetStartCode, TDIHtmlTag.GetEmptyElementCode, TDICustomTag.GetCode, TDICustomTag.GetStartCode, TDICustomTag.GetEmptyElementCode, TDISsiTag.GetCode, TDISsiTag.GetStartCode, TDISsiTag.GetEmptyElementCode.
TDIHtmlParser.FillSourceBuffer (source code edition only).
</script> and </style> so they accept attribute content like the other end-tags. This does not strictly conform to the HTML specifications but is sometimes found in real-world HTML.
EndLine, EndCol, and EndPos functions determine the end of the current HTML piece.
TDIVector or descendants like TDITag and TDIHtmlTag.
</SCRIPT> end tag was missing.
<script> and </script> elements.
<![CDATA[ beginning of ptCDataSection case-sensitively, as per specification.
<![CDATA[ … ]]> sections separately inside JavaScript comments. This fixes a problem with pages that use a commented CDATA section inside a script element but do not properly close this comment before the closing </script>
end tag. Such end tags are now recognized by DIHtmlParser.
TDICustomHtmlWriterPlugin intermediate interface for greater flexibility in customizing TDIHtmlWriterPlugin.
TDIHtmlParser.DataAsStrTrim8 convenience method.
DIUtils.pas.
TDIHtmlParser parsing options:
  TDIHtmlParser.EnableComments.
  TDIHtmlParser.EnableEntities.
  TDIHtmlParser.EnableExclamationMarkups.
TDIHtmlParser: When parsing JavaScript, a forward slash “/” inside a regular expression character class was not recognized as such and could lead to an infinite loop.
TDIHtmlCharSetPlugin: Correct decoding function for “GBK” encoding which did not read the 1 to 127 character range.
DIUtils.pas which caused an error when compiled on a Windows OS set to a non-European (Asian, Cyrillic, etc.) codepage.
TDIHtmlTag, TDICustomTag, TDISsiTag: .ConCatValue must not escape a '&' character in an attribute value immediately followed by a '{' character (HTML 4.01 Section B.7.1).
TDIHtmlParser.TrimAttribValues behaved exactly opposite to what was intended.
TDIHtmlLinksPlugin2.
TDIHtmlCollectLinksPlugin.
TDIHtmlChangeLinksPlugin.
TDIHtmlParser.FindHtmlTag, TDIHtmlParser.FindSsiTag, and TDIHtmlParser.ParseNextHtmlTag.
TDIHtmlParser.EnableHtmlTags property which controls whether HTML tags are recognized as such or are simply treated as text. Ignoring HTML tags can be useful for HTML scripting.
TDIHtmlParser.TrimAttribValues property which controls whether whitespace is automatically trimmed when parsing the attribute values of tags.
TDIHtmlParser.DefaultContentScriptType property to determine the content script type from outside the HTML document.
Write_UTF_7 / Read_UTF_7)
Write_UTF_7_ODC / reads as Read_UTF_7)
'<k$R>') as HTML Tags instead of Text. There is also a new piece type ptExclamationMarkup covering inserts starting with an exclamation mark like '<!A>'. It is returned for the character patterns '<! … >' which are not Comments, CData Sections, Document Templates, or SSI.
'<?XML Char* ?>'. By specification, XmlPI must terminate with '?>', but the '?' is sometimes missing. Specification-conformant parsing would then cause DIHtmlParser to unintentionally interpret lengthy stretches as XmlPI. This is now fixed by recognizing both variants as ending an XmlPI.
' ') in some cases.
TDIHtmlParser.StopParseAll procedure to a TDIHtmlParser.StopParse property. This must be set to True to stop the current parsing process. It applies to both TDIHtmlParser.ParseAll and TDIHtmlParser.ParseNextPiece, where it cancels an ongoing parsing process which has not yet returned to the caller.
TDIAbstractHtmlAttribsPlugin as ancestor class of TDIHtmlLinksPlugin, which now responds to a much wider range of link combinations, including multiple links contained within a single tag. Applications can also add custom Tag / Attribute combinations to report by calling TDIAbstractHtmlAttribsPlugin.AddAttrib. The TDIHtmlLinksPluginEvent callback definition has changed slightly and requires an interface change to existing applications.
TDIHtmlWriterPlugin.PredefinedEntities option which allows specifying some known predefined entities which will always be encoded by default when writing HTML text, regardless of other entity registrations.
TDITag.ForceAttribValue to TDITag.ForceAttrib.
TDITag and descendant classes benefit from changes to DIContainers ancestors. This includes speed optimizations as well as some interface simplifications.