HTML::Defang(3) User Contributed Perl Documentation HTML::Defang(3)
NAME
HTML::Defang - Cleans HTML as well as CSS of scripting and other
executable contents, and neutralises XSS attacks.
SYNOPSIS
use HTML::Defang;

my $InputHtml = "<html><body></body></html>";

my $Defang = HTML::Defang->new(
    context             => $Self,
    fix_mismatched_tags => 1,
    tags_to_callback    => [ qw(br embed img) ],
    tags_callback       => \&DefangTagsCallback,
    url_callback        => \&DefangUrlCallback,
    css_callback        => \&DefangCssCallback,
    attribs_to_callback => [ qw(border src) ],
    attribs_callback    => \&DefangAttribsCallback
);

my $SanitizedHtml = $Defang->defang($InputHtml);

# Callback for custom handling of specific HTML tags
sub DefangTagsCallback {
    my ($Self, $Defang, $OpenAngle, $lcTag, $IsEndTag, $AttributeHash, $CloseAngle, $HtmlR, $OutR) = @_;

    # Explicitly defang this tag, even though safe
    return DEFANG_ALWAYS if $lcTag eq 'br';

    # Explicitly whitelist this tag, even though unsafe
    return DEFANG_NONE if $lcTag eq 'embed';

    # I am not sure what to do with this tag, so process as HTML::Defang normally would
    return DEFANG_DEFAULT if $lcTag eq 'img';
}

# Callback for custom handling of URLs in HTML attributes as well as style tag/attribute declarations
sub DefangUrlCallback {
    my ($Self, $Defang, $lcTag, $lcAttrKey, $AttrValR, $AttributeHash, $HtmlR) = @_;

    # Explicitly allow this URL in tag attributes or stylesheets
    return DEFANG_NONE if $$AttrValR =~ /safesite\.com/i;

    # Explicitly defang this URL in tag attributes or stylesheets
    return DEFANG_ALWAYS if $$AttrValR =~ /evilsite\.com/i;

    # Anything else is processed as HTML::Defang normally would
    return DEFANG_DEFAULT;
}

# Callback for custom handling of style tags/attributes
sub DefangCssCallback {
    my ($Self, $Defang, $Selectors, $SelectorRules, $Tag, $IsAttr) = @_;
    my $i = 0;
    foreach (@$Selectors) {
        my $SelectorRule = $$SelectorRules[$i];
        foreach my $KeyValueRules (@$SelectorRule) {
            foreach my $KeyValueRule (@$KeyValueRules) {
                my ($Key, $Value) = @$KeyValueRule;

                # Comment out any '!important' directive
                $$KeyValueRule[2] = DEFANG_ALWAYS if $Value =~ '!important';

                # Comment out any 'position:fixed;' declaration
                $$KeyValueRule[2] = DEFANG_ALWAYS if $Key =~ 'position' && $Value =~ 'fixed';
            }
        }
        $i++;
    }
}

# Callback for custom handling of HTML tag attributes
sub DefangAttribsCallback {
    my ($Self, $Defang, $lcTag, $lcAttrKey, $AttrValR, $HtmlR) = @_;

    # Change all 'border' attribute values to zero.
    $$AttrValR = '0' if $lcAttrKey eq 'border';

    # Defang all 'src' attributes
    return DEFANG_ALWAYS if $lcAttrKey eq 'src';

    return DEFANG_NONE;
}
DESCRIPTION
This module accepts an input HTML and/or CSS string and removes any
executable code including scripting, embedded objects, applets, etc.,
and neutralises any XSS attacks. A whitelist based approach is used
which means only HTML known to be safe is allowed through.
HTML::Defang uses a custom HTML tag parser. The parser has been
designed and tested to work with nasty real world HTML and to
emulate as closely as possible what browsers actually do with strange
looking constructs. The test suite has been built based on examples
from a range of sources such as http://ha.ckers.org/xss.html and
http://imfo.ru/csstest/css_hacks/import.php to ensure that as many
XSS attack scenarios as possible have been dealt with.
HTML::Defang can make callbacks to client code when it encounters the
following:
· When a specified tag is parsed
· When a specified attribute is parsed
· When a URL is parsed as part of an HTML attribute, or CSS property
value.
· When style data is parsed, as part of an HTML style attribute, or
as part of an HTML <style> tag.
The callbacks include details about the current tag/attribute that is
being parsed, and also give a scalar reference to the input HTML.
Querying pos() on the input HTML should indicate how far the module
has got with parsing. This gives the client code flexibility in
working with HTML::Defang.
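As a rough illustration of the pos() remark above, the following sketch shows how a callback could read the current offset from the scalar reference it receives and keep matching from that point with \G. It does not load HTML::Defang: the parse position is set by hand to imitate a parser that has just consumed a tag, and the helper name PeekText and the sample input are made up for this example.

```perl
use strict;
use warnings;

# Hypothetical input; a real callback receives a reference to the
# input HTML as $HtmlR.
my $Html = '<b>bold text</b> tail';

# Imitate a parser that has just consumed the opening <b> tag.
# With /gc, pos() stays where the last successful match ended.
$Html =~ m/\G<b>/gc or die "sample input did not start with <b>";

# Hypothetical helper a callback might use: report the offset, then
# read the text up to the next '<', starting exactly at that offset.
sub PeekText {
    my ($HtmlR) = @_;
    my $Offset = pos($$HtmlR);
    my $Text   = $$HtmlR =~ m/\G([^<]*)/gc ? $1 : '';
    return ($Offset, $Text, pos($$HtmlR));
}

my ($Before, $Text, $After) = PeekText(\$Html);
print "offset before: $Before, text: '$Text', offset after: $After\n";
```

Note that matching with /gc inside a callback moves the position on the shared input string, so the parser resumes after whatever the callback consumed.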
HTML::Defang can defang whole tags, any attribute in a tag, any URL
that appears as an attribute or style property, or any CSS declaration
in a declaration block in a style rule. This helps to precisely block
the most specific unwanted elements in the contents (for example, block
just an offending attribute instead of the whole tag), while retaining
any safe HTML/CSS.
CONSTRUCTOR
HTML::Defang->new(%Options)
Constructs a new HTML::Defang object. The following options are
supported:
Options
tags_to_callback
Array reference of tags for which a callback should be
made. If a tag in this array is parsed, the subroutine
tags_callback() is invoked.
attribs_to_callback
Array reference of tag attributes for which a callback
should be made. If an attribute in this array is parsed,
the subroutine attribs_callback() is invoked.
tags_callback
Subroutine reference to be invoked when a tag listed in
@$tags_to_callback is parsed.
attribs_callback
Subroutine reference to be invoked when an attribute listed
in @$attribs_to_callback is parsed.
url_callback
Subroutine reference to be invoked when a URL is detected
in an HTML tag attribute or a CSS property.
css_callback
Subroutine reference to be invoked when CSS data is found
either as the contents of a 'style' attribute in an HTML
tag, or as the contents of a <style> HTML tag.
fix_mismatched_tags
This property, if set, fixes mismatched tags in the HTML
input. By default, tags present in the default
%mismatched_tags_to_fix hash are fixed. This set of tags
can be overridden by passing an array reference as
mismatched_tags_to_fix to the constructor. Any opened tags
in the set are automatically closed if no corresponding
closing tag is found. If an unbalanced closing tag is
found, it is commented out.
mismatched_tags_to_fix
Array reference of tags for which the code would check for
matching opening and closing tags. See the property
$fix_mismatched_tags.
context
You can pass an arbitrary scalar as a 'context' value
that's then passed as the first parameter to all callback
functions. Most commonly this is something like '$Self'.
allow_double_defang
If this is true, then tag names and attribute names which
already begin with the defang string ("defang_" by default)
will have an additional copy of the defang string prepended
if they are flagged to be defanged by the return value of a
callback, or if the tag or attribute name is unknown.
The default is to assume that tag names and attribute names
beginning with the defang string are already made safe, and
need no further modification, even if they are flagged to
be defanged by the return value of a callback. Any tag or
attribute modifications made directly by a callback are
still performed.
Debug
If set, prints debugging output.
CALLBACK METHODS
COMMON PARAMETERS
A number of the callbacks share the same parameters. These common
parameters are documented here. Certain variables may have specific
meanings in certain callbacks, so be sure to check the
documentation for that method first before referring to this section.
$context
You can pass an arbitrary scalar as a 'context' value that's
then passed as the first parameter to all callback functions.
Most commonly this is something like '$Self'.
$Defang
Current HTML::Defang instance
$OpenAngle
Opening angle (<) sign of the current tag.
$lcTag
Lower case version of the HTML tag that is currently being
parsed.
$IsEndTag
Has the value '/' if the current tag is a closing tag.
$AttributeHash
A reference to a hash containing the attributes of the current
tag and their values. Each value is a scalar reference to the
value, rather than just a scalar value. You can add attributes
(remember to make it a scalar ref, eg
$AttributeHash->{"newattr"} = \"newval"), delete attributes, or
modify attribute values in this hash, and any changes you make
will be incorporated into the output HTML stream.
The attribute values will have any entity references decoded
before being passed to you, and any unsafe values will be
re-encoded back into the HTML stream.
So for instance, the tag:
<div title="&lt;&quot;Hi there &lt;">
Will have the attribute hash:
{ title => \q[<"Hi there <] }
And will be turned back into the HTML on output:
<div title="&lt;&quot;Hi there &lt;">
$CloseAngle
Anything after the end of the last attribute, including the
closing HTML angle (>).
$HtmlR
A scalar reference to the input HTML. The input HTML is parsed
using m/\G$SomeRegex/gc constructs, so to continue from where
HTML::Defang left off, clients can use m/\G$SomeRegex/gc for
further processing on the input. One can also use the pos()
function to determine where HTML::Defang left off. This, combined
with the add_to_output() method, should give reasonable
flexibility for the client to process the input.
$OutR
A scalar reference to the processed output HTML so far.
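To make the $AttributeHash conventions above concrete, here is a minimal sketch of a tags callback that reads, adds and deletes attributes. It runs stand-alone: the three constants are placeholders defined locally (real code should use the constants that come with HTML::Defang), the callback name and the chosen attributes are invented for the example, and the arguments are built by hand rather than produced by the parser.

```perl
use strict;
use warnings;

# Placeholder constants so the sketch runs without the module; their
# values are an assumption made for this example only.
use constant { DEFANG_NONE => 0, DEFANG_ALWAYS => 1, DEFANG_DEFAULT => 2 };

# Hypothetical tags callback: tidy up the attributes of <a> tags.
sub LinkTagsCallback {
    my ($Context, $Defang, $OpenAngle, $lcTag, $IsEndTag,
        $AttributeHash, $CloseAngle, $HtmlR, $OutR) = @_;

    # Closing tags carry '/' in $IsEndTag; leave them and other tags alone.
    return DEFANG_DEFAULT if $IsEndTag || $lcTag ne 'a';

    # Values are scalar references, so dereference to read or modify.
    my $TitleR = $AttributeHash->{title};
    $$TitleR =~ s/^\s+|\s+$//g if $TitleR && defined $$TitleR;

    # Store a scalar reference when adding a new attribute.
    $AttributeHash->{rel} = \"nofollow";

    # Deleting a key removes the attribute.
    delete $AttributeHash->{target};

    return DEFANG_DEFAULT;
}

# Hand-built arguments standing in for what the parser would pass.
my $TitleValue = "  Home  ";
my %Attrs = (
    href   => \"http://example.com/",
    title  => \$TitleValue,
    target => \"_blank",
);
my $Result = LinkTagsCallback(undef, undef, '<', 'a', '', \%Attrs, '>');

for my $Key (sort keys %Attrs) {
    print $Key, '=', ${ $Attrs{$Key} }, "\n";
}
```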
tags_callback($context, $Defang, $OpenAngle, $lcTag, $IsEndTag,
$AttributeHash, $CloseAngle, $HtmlR, $OutR)
If $Defang->{tags_callback} exists, and HTML::Defang has parsed a
tag present in $Defang->{tags_to_callback}, the above callback is
made to the client code. The return value of this method determines
whether the tag is defanged or not. More details below.
Return values
DEFANG_NONE
The current tag will not be defanged.
DEFANG_ALWAYS
The current tag will be defanged.
DEFANG_DEFAULT
The current tag will be processed normally by HTML::Defang
as if there was no callback method specified.
attribs_callback($context, $Defang, $lcTag, $lcAttrKey, $AttrVal,
$HtmlR, $OutR)
If $Defang->{attribs_callback} exists, and HTML::Defang has parsed
an attribute present in $Defang->{attribs_to_callback}, the above
callback is made to the client code. The return value of this
method determines whether the attribute is defanged or not. More
details below.
Method parameters
$lcAttrKey
Lower case version of the HTML attribute that is currently
being parsed.
$AttrVal
Reference to the HTML attribute value that is currently
being parsed.
See $AttributeHash for details of decoding.
Return values
DEFANG_NONE
The current attribute will not be defanged.
DEFANG_ALWAYS
The current attribute will be defanged.
DEFANG_DEFAULT
The current attribute will be processed normally by
HTML::Defang as if there was no callback method specified.
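The sketch below shows one way an attribs callback might combine the two mechanisms described above: rewriting a value in place through the reference, and choosing a return value per attribute. The constants are local placeholders so that the code runs without the module, and the callback name, the attributes handled and the width limit of 600 are all assumptions made for the example.

```perl
use strict;
use warnings;

# Placeholder constants so the sketch runs without the module; their
# values are an assumption made for this example only.
use constant { DEFANG_NONE => 0, DEFANG_ALWAYS => 1, DEFANG_DEFAULT => 2 };

# Hypothetical callback, intended for
# attribs_to_callback => [ qw(width target) ].
sub SizeAttribsCallback {
    my ($Context, $Defang, $lcTag, $lcAttrKey, $AttrValR, $HtmlR, $OutR) = @_;

    if ($lcAttrKey eq 'width') {
        # Rewrite the value in place: cap plain numeric widths at an
        # arbitrary limit chosen for this example.
        $$AttrValR = 600 if $$AttrValR =~ /^\d+$/ && $$AttrValR > 600;
        return DEFANG_DEFAULT;
    }

    if ($lcAttrKey eq 'target') {
        # Defang every 'target' attribute except target="_blank".
        return lc($$AttrValR) eq '_blank' ? DEFANG_DEFAULT : DEFANG_ALWAYS;
    }

    return DEFANG_DEFAULT;
}

# Hand-built calls standing in for what the parser would pass.
my $Width        = '1920';
my $WidthResult  = SizeAttribsCallback(undef, undef, 'img', 'width', \$Width);
my $Target       = '_top';
my $TargetResult = SizeAttribsCallback(undef, undef, 'a', 'target', \$Target);

print "width is now $Width\n";
print "target would be defanged\n" if $TargetResult == DEFANG_ALWAYS;
```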
url_callback($context, $Defang, $lcTag, $lcAttrKey, $AttrVal,
$AttributeHash, $HtmlR, $OutR)
If $Defang->{url_callback} exists, and HTML::Defang has parsed a
URL, the above callback is made to the client code. The return
value of this method determines whether the attribute containing
the URL is defanged or not. URL callbacks can be made from <style>
tags as well as style attributes, in which case the particular style
declaration will be commented out. More details below.
Method parameters
$lcAttrKey
Lower case version of the HTML attribute that is currently
being parsed. However, if this callback is made as a result
of parsing a URL in a style attribute, $lcAttrKey will be
set to the string 'style', and it will be set to undef if
this callback is made as a result of parsing a URL inside a
style tag.
$AttrVal
Reference to the URL value that is currently being parsed.
$AttributeHash
A reference to a hash containing the attributes of the
current tag and their values. Each value is a scalar
reference to the value, rather than just a scalar value.
You can add attributes (remember to make it a scalar ref,
eg $AttributeHash->{"newattr"} = \"newval"), delete
attributes, or modify attribute values in this hash, and
any changes you make will be incorporated into the output
HTML stream. Will be set to undef if the callback is made
due to a URL in a <style> tag or attribute.
Return values
DEFANG_NONE
The current URL will not be defanged.
DEFANG_ALWAYS
The current URL will be defanged.
DEFANG_DEFAULT
The current URL will be processed normally by HTML::Defang
as if there was no callback method specified.
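A URL callback has to cope with the three situations described above: a URL in an ordinary attribute, one in a style attribute, and one inside a <style> tag. The following sketch handles all three with a small host allowlist. The constants are local placeholders so the code runs without the module, and the callback name, the allowlist and the policy of blocking every URL found in CSS are assumptions made for the example.

```perl
use strict;
use warnings;

# Placeholder constants so the sketch runs without the module; their
# values are an assumption made for this example only.
use constant { DEFANG_NONE => 0, DEFANG_ALWAYS => 1, DEFANG_DEFAULT => 2 };

# Hypothetical allowlist of hosts.
my %AllowedHost = map { $_ => 1 } qw(example.com www.example.com);

sub HostUrlCallback {
    my ($Context, $Defang, $lcTag, $lcAttrKey, $AttrValR,
        $AttributeHash, $HtmlR, $OutR) = @_;

    # $lcAttrKey is undef for URLs inside a <style> tag and 'style'
    # for URLs inside a style attribute; block both in this sketch.
    return DEFANG_ALWAYS if !defined $lcAttrKey || $lcAttrKey eq 'style';

    # Extract the host of an http(s) URL; anything else is left to
    # the default handling.
    my ($Host) = $$AttrValR =~ m{^\s*https?://([^/:?#\s]+)}i;
    return DEFANG_DEFAULT unless defined $Host;

    return $AllowedHost{ lc($Host) } ? DEFANG_NONE : DEFANG_ALWAYS;
}

# Hand-built calls standing in for what the parser would pass.
my $Good = 'http://example.com/logo.png';
my $Bad  = 'https://tracker.invalid/pixel.gif';
my $Css  = 'http://example.com/bg.png';

my $GoodResult = HostUrlCallback(undef, undef, 'img',   'src', \$Good, {});
my $BadResult  = HostUrlCallback(undef, undef, 'img',   'src', \$Bad,  {});
my $CssResult  = HostUrlCallback(undef, undef, 'style', undef, \$Css,  undef);

print "allowed host passes\n"  if $GoodResult == DEFANG_NONE;
print "unknown host blocked\n" if $BadResult  == DEFANG_ALWAYS;
print "CSS URL blocked\n"      if $CssResult  == DEFANG_ALWAYS;
```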
css_callback($context, $Defang, $Selectors, $SelectorRules, $lcTag,
$IsAttr, $OutR)
If $Defang->{css_callback} exists, and HTML::Defang has parsed a
<style> tag or style attribute, the above callback is made to the
client code. The return value of this method determines whether a
particular declaration in the style rules is defanged or not. More
details below.
Method parameters
$Selectors
Reference to an array containing the selectors in a style
tag or attribute.
$SelectorRules
Reference to an array containing the style declaration
blocks of all selectors in a style tag or attribute.
Consider the below CSS:
a { b:c; d:e}
j { k:l; m:n}
The declaration blocks will get parsed into the following
data structure:
[
[
[ "b", "c", DEFANG_DEFAULT ],
[ "d", "e", DEFANG_DEFAULT ]
],
[
[ "k", "l", DEFANG_DEFAULT ],
[ "m", "n", DEFANG_DEFAULT ]
]
]
So, generally each property:value pair in a declaration is
parsed into an array of the form
["property", "value", X]
where X can be DEFANG_NONE, DEFANG_ALWAYS or
DEFANG_DEFAULT, with DEFANG_DEFAULT as the initial value. A
client can change this value to tell HTML::Defang how to
handle this property:value pair.
DEFANG_NONE - Do not defang
DEFANG_ALWAYS - Defang the property:value pair
DEFANG_DEFAULT - Process this as if there is no callback
specified
$IsAttr
True if the currently processed item is a style attribute.
False if the currently processed item is a style tag.
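The following sketch shows how a css callback could flag individual declarations by changing the third element of each triple. The SYNOPSIS iterates one level deeper than the structure shown above, so rather than assume a fixed depth this example descends through array references until it reaches a [ property, value, flag ] triple. The constants are local placeholders, and the helper EachDeclaration, the callback name and the blocking policy are all invented for the example.

```perl
use strict;
use warnings;

# Placeholder constants so the sketch runs without the module; their
# values are an assumption made for this example only.
use constant { DEFANG_NONE => 0, DEFANG_ALWAYS => 1, DEFANG_DEFAULT => 2 };

# Call $Code on every [ property, value, flag ] triple found under
# $Node, whatever the nesting depth of the surrounding arrays.
sub EachDeclaration {
    my ($Node, $Code) = @_;
    return unless ref($Node) eq 'ARRAY' && @$Node;
    if (ref($Node->[0])) {
        EachDeclaration($_, $Code) for @$Node;
    }
    else {
        $Code->($Node);
    }
    return;
}

# Hypothetical css callback: flag 'position' and any url(...) value.
sub StrictCssCallback {
    my ($Context, $Defang, $Selectors, $SelectorRules, $lcTag, $IsAttr, $OutR) = @_;
    EachDeclaration($SelectorRules, sub {
        my ($Decl) = @_;
        my ($Property, $Value) = @$Decl;
        $Decl->[2] = DEFANG_ALWAYS
            if lc($Property) eq 'position' || $Value =~ /url\s*\(/i;
    });
    return;
}

# Hand-built data in the shape documented above.
my @Selectors = ('a', 'j');
my @Rules = (
    [ [ 'color', 'red', DEFANG_DEFAULT ], [ 'position', 'fixed', DEFANG_DEFAULT ] ],
    [ [ 'background', 'url(http://example.com/x.png)', DEFANG_DEFAULT ] ],
);
StrictCssCallback(undef, undef, \@Selectors, \@Rules, 'style', 0);

for my $i (0 .. $#Selectors) {
    printf "%s { %s: %s } flag=%s\n", $Selectors[$i], @$_ for @{ $Rules[$i] };
}
```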
METHODS
PUBLIC METHODS
defang($InputHtml)
Cleans up $InputHtml of any executable code including
scripting, embedded objects, applets, etc., and defangs any XSS
attacks.
Method parameters
$InputHtml
The input HTML string that needs to be sanitized.
Returns the cleaned HTML. If fix_mismatched_tags is set, any
tags that appear in @$mismatched_tags_to_fix that are
unbalanced are automatically commented out or closed.
add_to_output($String)
Appends $String to the output after the current parsed tag
ends. Can be used by client code in callback methods to add
HTML text to the processed output. If the HTML text needs to be
defanged, client code can safely call HTML::Defang->defang()
recursively from within the callback.
Method parameters
$String
The string that is added after the current parsed tag
ends.
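As an illustration of calling add_to_output() from a callback, the sketch below defangs an <embed> tag and queues a visible note to follow it. So that it runs without the module installed, a small stand-in class (Sketch::FakeDefang, invented here) takes the place of the HTML::Defang instance and merely records what add_to_output() receives. The constants are local placeholders, and the callback name and note text are assumptions.

```perl
use strict;
use warnings;

# Placeholder constants so the sketch runs without the module; their
# values are an assumption made for this example only.
use constant { DEFANG_NONE => 0, DEFANG_ALWAYS => 1, DEFANG_DEFAULT => 2 };

# Stand-in for the HTML::Defang instance; it only records the strings
# passed to add_to_output().
{
    package Sketch::FakeDefang;

    sub new {
        my ($Class) = @_;
        return bless { added => [] }, $Class;
    }

    sub add_to_output {
        my ($Self, $String) = @_;
        push @{ $Self->{added} }, $String;
        return;
    }
}

# Hypothetical tags callback: defang <embed> and leave a note after it.
sub EmbedTagsCallback {
    my ($Context, $Defang, $OpenAngle, $lcTag, $IsEndTag,
        $AttributeHash, $CloseAngle, $HtmlR, $OutR) = @_;

    if ($lcTag eq 'embed' && !$IsEndTag) {
        $Defang->add_to_output('<em>[embedded content removed]</em>');
        return DEFANG_ALWAYS;
    }
    return DEFANG_DEFAULT;
}

# Hand-built call standing in for what the parser would pass.
my $Fake   = Sketch::FakeDefang->new;
my $Result = EmbedTagsCallback(undef, $Fake, '<', 'embed', '', {}, '>');

print "queued: $_\n" for @{ $Fake->{added} };
```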
INTERNAL METHODS
Generally these methods never need to be called by users of the
class, because they'll be called internally as the appropriate tags
are encountered, but they may be useful for some users in some
cases.
defang_script($OutR, $HtmlR, $TagOps, $OpenAngle, $IsEndTag, $Tag,
$TagTrail, $Attributes, $CloseAngle)
This method is invoked when a <script> tag is parsed. Defangs
the <script> opening tag, and any closing tag. Any scripting
content is also commented out, so browsers don't display it.
Returns 1 to indicate that the <script> tag must be defanged.
Method parameters
$OutR
A reference to the processed output HTML before the tag
that is currently being parsed.
$HtmlR
A scalar reference to the input HTML.
$TagOps
Indicates what operation should be done on a tag. Can
be undefined, an integer or a code reference. Undefined
indicates a tag unknown to HTML::Defang, 1 indicates a
known safe tag, 0 indicates a known unsafe tag, and a
code reference indicates a subroutine that should be
called to parse the current tag. For example, <style>
and <script> tags are parsed by dedicated subroutines.
$OpenAngle
Opening angle (<) sign of the current tag.
$IsEndTag
Has the value '/' if the current tag is a closing tag.
$Tag
The HTML tag that is currently being parsed.
$TagTrail
Any space after the tag, but before attributes.
$Attributes
A reference to an array of the attributes and their
values, including any surrounding spaces. Each element
of the array is added by 'push' calls like the one below.
push @$Attributes, [ $AttributeName, $SpaceBeforeEquals, $EqualsAndSubsequentSpace, $QuoteChar, $AttributeValue, $QuoteChar, $SpaceAfterAtributeValue ];
$CloseAngle
Anything after the end of the last attribute, including
the closing HTML angle (>).
defang_style($OutR, $HtmlR, $TagOps, $OpenAngle, $IsEndTag, $Tag,
$TagTrail, $Attributes, $CloseAngle, $IsAttr)
Builds a list of selectors and declarations from HTML style
tags as well as style attributes in HTML tags and calls
defang_stylerule() to do the actual defanging.
Returns 0 to indicate that style tags must not be defanged.
Method parameters
$IsAttr
Whether we are currently parsing a style attribute or
style tag. $IsAttr will be true if we are currently
parsing a style attribute.
For a description of other parameters, see documentation of
defang_script() method
cleanup_style($StyleString)
Helper function to clean up CSS data. This function directly
operates on the input string without taking a copy.
Method parameters
$StyleString
The input style string that is cleaned.
defang_stylerule($SelectorsIn, $StyleRules, $lcTag, $IsAttr,
$HtmlR, $OutR)
Defangs style data.
Method parameters
$SelectorsIn
An array reference to the selectors in the style
tag/attribute contents.
$StyleRules
An array reference to the declaration blocks in the
style tag/attribute contents.
$lcTag
Lower case version of the HTML tag that is currently
being parsed.
$IsAttr
Whether we are currently parsing a style attribute or
style tag. $IsAttr will be true if we are currently
parsing a style attribute.
$HtmlR
A scalar reference to the input HTML.
$OutR
A scalar reference to the processed output so far.
defang_attributes($OutR, $HtmlR, $TagOps, $OpenAngle, $IsEndTag,
$Tag, $TagTrail, $Attributes, $CloseAngle)
Defangs attributes, defangs tags, does tag, attrib, css and url
callbacks.
Method parameters
For a description of the method parameters, see
documentation of defang_script() method
cleanup_attribute($AttributeString)
Helper function to clean up attributes.
Method parameters
$AttributeString
The value of the attribute.
SEE ALSO
<http://mailtools.anomy.net/>, <http://htmlcleaner.sourceforge.net/>,
HTML::StripScripts, HTML::Detoxifier, HTML::Sanitizer, HTML::Scrubber
AUTHOR
Kurian Jose Aerthail <cpan@kurianja.fastmail.fm>. Thanks to Rob Mueller
<cpan@robm.fastmail.fm> for initial code, guidance and support and bug
fixes.
COPYRIGHT AND LICENSE
Copyright (C) 2003-2010 by Opera Software Australia Pty Ltd
This library is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.
perl v5.14.1 2011-01-03 HTML::Defang(3)