UTF8Utils::checkForIllegalCodepoints() - Code Metrics - Inspection of "Merge pull request #163 from tgalopin/charset-supp..." - Masterminds/html5-php - Measure and Improve Code Quality continuously with Scrutinizer

Completed

Push — master ( ca7c31...92fff5 )

by Asmir

created 2019-02-22 09:19 UTC

UTF8Utils::checkForIllegalCodepoints() A

↳ Parent: UTF8Utils

Complexity

Conditions	3
Paths	4

Size

Total Lines

Duplication

Lines	0
Ratio	0 %

Code Coverage

Tests	11
CRAP Score	3

Importance

Changes

Metric	Value
dl	0
loc	42
ccs	11
cts	11
cp	1
rs	9.248
c	0
b	0
f	0
cc	3
nc	4
nop	1
crap	3
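
The CRAP score of 3 reported above is consistent with the standard CRAP formula, cc² × (1 − coverage)³ + cc, given a cyclomatic complexity of 3 and full coverage. The sketch below reproduces that arithmetic. It assumes that `ccs` and `cts` are covered and total statements (so coverage is 11 / 11); `crapScore()` is a hypothetical helper, not part of Scrutinizer or html5-php.

```php
<?php
// Hypothetical helper: CRAP = cc^2 * (1 - coverage)^3 + cc,
// where coverage is a fraction between 0 and 1.
function crapScore($complexity, $coverage)
{
    return pow($complexity, 2) * pow(1.0 - $coverage, 3) + $complexity;
}

$cc = 3;              // "cc" row in the metric table
$coverage = 11 / 11;  // "ccs" / "cts", assumed to mean covered / total statements

echo crapScore($cc, $coverage), "\n"; // prints 3
```

With full coverage the cubic term vanishes and the score equals the complexity itself; with no coverage at all the same method would score 3² + 3 = 12.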

<?php

namespace Masterminds\HTML5\Parser;

/*
Portions based on code from html5lib files with the following copyright:

Copyright 2009 Geoffrey Sneddon <http://gsnedders.com/>

Permission is hereby granted, free of charge, to any person obtaining a
copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be included
in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/

use Masterminds\HTML5\Exception;

class UTF8Utils
{
    /**
     * The Unicode replacement character.
     */
    const FFFD = "\xEF\xBF\xBD";

    /**
     * Count the number of characters in a string.
     * UTF-8 aware. This will try (in order) mbstring, iconv, utf8_decode, and finally a custom counter.
     *
     * @param string $string
     *
     * @return int
     */
    public static function countChars($string)
    {
        // Get the length for the string we need.
        if (function_exists('mb_strlen')) {
            return mb_strlen($string, 'utf-8');
        }

        if (function_exists('iconv_strlen')) {
            return iconv_strlen($string, 'utf-8');
        }

        if (function_exists('utf8_decode')) {
            // MPB: Will this work? Won't certain decodes lead to two chars
            // extrapolated out of 2-byte chars?
            return strlen(utf8_decode($string));
        }

        $count = count_chars($string);

        // 0x80 = 0x7F - 0 + 1 (one added to get inclusive range)
        // 0x33 = 0xF4 - 0xC2 + 1 (one added to get inclusive range)
        return array_sum(array_slice($count, 0, 0x80)) + array_sum(array_slice($count, 0xC2, 0x33));
    }

    /**
     * Convert data from the given encoding to UTF-8.
     *
     * This has not yet been tested with character sets other than UTF-8.
     * It should work with ISO-8859-1/-13 and standard Latin Win charsets.
     *
     * @param string $data     The data to convert
     * @param string $encoding A valid encoding. Examples: http://www.php.net/manual/en/mbstring.supported-encodings.php
     *
     * @return string
     */
    public static function convertToUTF8($data, $encoding = 'UTF-8')
    {
        /*
         * From the HTML5 spec: Given an encoding, the bytes in the input stream must be converted
         * to Unicode characters for the tokeniser, as described by the rules for that encoding,
         * except that the leading U+FEFF BYTE ORDER MARK character, if any, must not be stripped
         * by the encoding layer (it is stripped by the rule below). Bytes or sequences of bytes
         * in the original byte stream that could not be converted to Unicode characters must be
         * converted to U+FFFD REPLACEMENT CHARACTER code points.
         */

        // mb_convert_encoding is chosen over iconv because of a bug. The best
        // details for the bug are on http://us1.php.net/manual/en/function.iconv.php#108643
        // which contains links to the actual bug reports as well as workaround
        // details.
        if (function_exists('mb_convert_encoding')) {
            // mb library has the following behaviors:
            // - UTF-16 surrogates result in false.
            // - Overlongs and outside Plane 16 result in empty strings.

            // Before we run mb_convert_encoding we need to tell it what to do with
            // characters it does not know. This could be different than the parent
            // application executing this library so we store the value, change it
            // to our needs, and then change it back when we are done. This feels
            // a little excessive and it would be great if there was a better way.
            $save = mb_substitute_character();
            mb_substitute_character('none');
            $data = mb_convert_encoding($data, 'UTF-8', $encoding);
            mb_substitute_character($save);
        }
        // @todo Get iconv running in at least some environments if that is possible.
        elseif (function_exists('iconv') && 'auto' !== $encoding) {
            // fprintf(STDOUT, "iconv found\n");
            // iconv has the following behaviors:
            // - Overlong representations are ignored.
            // - Beyond Plane 16 is replaced with a lower char.
            // - Incomplete sequences generate a warning.
            $data = @iconv($encoding, 'UTF-8//IGNORE', $data);
        } else {
            throw new Exception('Not implemented, please install mbstring or iconv');
        }

        /*
         * One leading U+FEFF BYTE ORDER MARK character must be ignored if any are present.
         */
        if ("\xEF\xBB\xBF" === substr($data, 0, 3)) {
            $data = substr($data, 3);
        }

        return $data;
    }

    /**
     * Checks for Unicode code points that are not valid in a document.
     *
     * @param string $data A string to analyze
     *
     * @return array An array of (string) error messages produced by the scanning
     */
    public static function checkForIllegalCodepoints($data)
    {
        // Vestigial error handling.
        $errors = array();

        /*
         * All U+0000 null characters in the input must be replaced by U+FFFD REPLACEMENT CHARACTERs.
         * Any occurrence of such a character is a parse error.
         */
        for ($i = 0, $count = substr_count($data, "\0"); $i < $count; ++$i) {
            $errors[] = 'null-character';
        }

        /*
         * Any occurrences of any characters in the ranges U+0001 to U+0008, U+000B, U+000E to U+001F, U+007F
         * to U+009F, U+D800 to U+DFFF, U+FDD0 to U+FDEF, and characters U+FFFE, U+FFFF, U+1FFFE, U+1FFFF,
         * U+2FFFE, U+2FFFF, U+3FFFE, U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE, U+5FFFF, U+6FFFE, U+6FFFF, U+7FFFE,
         * U+7FFFF, U+8FFFE, U+8FFFF, U+9FFFE, U+9FFFF, U+AFFFE, U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE, U+CFFFF,
         * U+DFFFE, U+DFFFF, U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF, U+10FFFE, and U+10FFFF are parse errors.
         * (These are all control characters or permanently undefined Unicode characters.)
         */
        // Count the matches with a byte-level PCRE pattern (this assumes PCRE is loaded).
        $count = preg_match_all(
            '/(?:
        [\x01-\x08\x0B\x0E-\x1F\x7F] # U+0001 to U+0008, U+000B, U+000E to U+001F and U+007F
      |
        \xC2[\x80-\x9F] # U+0080 to U+009F
      |
        \xED(?:\xA0[\x80-\xFF]|[\xA1-\xBE][\x00-\xFF]|\xBF[\x00-\xBF]) # U+D800 to U+DFFF
      |
        \xEF\xB7[\x90-\xAF] # U+FDD0 to U+FDEF
      |
        \xEF\xBF[\xBE\xBF] # U+FFFE and U+FFFF
      |
        [\xF0-\xF4][\x8F-\xBF]\xBF[\xBE\xBF] # U+nFFFE and U+nFFFF (1 <= n <= 10_{16})
      )/x', $data, $matches);
        for ($i = 0; $i < $count; ++$i) {
            $errors[] = 'invalid-codepoint';
        }

        return $errors;
    }
}
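
To make the shape of `checkForIllegalCodepoints()` easier to follow, here is a reduced, standalone sketch of the same two-pass idea: count NUL bytes first, then count matches of a byte-level pattern. It covers only the C0/DEL controls and U+FDD0 to U+FDEF, and the function name `sketchIllegalCodepoints()` is hypothetical. It is an illustration under those assumptions, not the library's full check.

```php
<?php
// Reduced sketch: handles only NUL, the C0/DEL controls, and U+FDD0..U+FDEF.
function sketchIllegalCodepoints($data)
{
    $errors = array();

    // Pass 1: one 'null-character' entry per NUL byte.
    $nulls = substr_count($data, "\0");
    for ($i = 0; $i < $nulls; ++$i) {
        $errors[] = 'null-character';
    }

    // Pass 2: byte-level pattern (no /u flag), so multi-byte code points
    // are spelled out as their UTF-8 byte sequences.
    $pattern = '/(?:
        [\x01-\x08\x0B\x0E-\x1F\x7F]   # U+0001 to U+0008, U+000B, U+000E to U+001F, U+007F
      |
        \xEF\xB7[\x90-\xAF]            # U+FDD0 to U+FDEF
      )/x';
    $count = preg_match_all($pattern, $data, $matches);
    for ($i = 0; $i < $count; ++$i) {
        $errors[] = 'invalid-codepoint';
    }

    return $errors;
}

// One NUL, one U+0001, and one U+FDD0 (bytes EF B7 90).
print_r(sketchIllegalCodepoints("a\0b\x01c\xEF\xB7\x90"));
```

The call above yields one `null-character` entry followed by two `invalid-codepoint` entries. The two sequential loops are also why the metrics report 3 conditions and 4 paths: each loop either runs or is skipped.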

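
The last fallback in `countChars()` relies on the fact that every UTF-8 character starts with exactly one byte in 0x00 to 0x7F (ASCII) or 0xC2 to 0xF4 (multi-byte lead bytes), while continuation bytes fall in 0x80 to 0xBF. Counting only the first two groups therefore counts characters in well-formed input. A minimal standalone sketch, with the hypothetical name `countLeadBytes()`:

```php
<?php
// Hypothetical standalone version of the count_chars() fallback.
function countLeadBytes($string)
{
    // Mode 0 (the default) returns a 256-entry array: byte value => frequency.
    $count = count_chars($string);

    // 0x80 entries starting at 0x00 cover ASCII;
    // 0x33 entries starting at 0xC2 cover the lead bytes 0xC2..0xF4.
    return array_sum(array_slice($count, 0, 0x80))
        + array_sum(array_slice($count, 0xC2, 0x33));
}

echo countLeadBytes("h\xC3\xA9llo"), "\n";      // "héllo": prints 5
echo countLeadBytes("\xF0\x9F\x98\x80"), "\n";  // one 4-byte character: prints 1
```

This only holds for well-formed UTF-8; a stray lead byte without its continuation bytes would still be counted as a character.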

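
The final step of `convertToUTF8()` drops a single leading byte order mark, which in UTF-8 is the three bytes EF BB BF. The sketch below isolates that step in a hypothetical helper, `stripLeadingBom()`, to show that only the first BOM is removed and any later one is kept as data.

```php
<?php
// Hypothetical helper isolating the BOM-stripping step.
function stripLeadingBom($data)
{
    // U+FEFF encoded as UTF-8 is EF BB BF; remove it only at the very start.
    if ("\xEF\xBB\xBF" === substr($data, 0, 3)) {
        return substr($data, 3);
    }

    return $data;
}

echo stripLeadingBom("\xEF\xBB\xBFabc"), "\n"; // prints abc
echo strlen(stripLeadingBom("\xEF\xBB\xBF\xEF\xBB\xBFx")), "\n"; // prints 4: second BOM kept
```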
