Splitter::createGenerator() - Code Metrics - Inspection of "[WIP] Sequencer to replace regular expressions" - TYPO3/Fluid - Scrutinizer

Completed

Pull Request — master (#457)

by Claus

created 2019-06-21 08:22 UTC

Splitter::createGenerator() B

↳ Parent: Splitter

Complexity

Conditions	9
Paths	9

Size

Total Lines	31

Duplication

Lines	0
Ratio	0 %

Metric	Value
cc	9
nc	9
nop	0
dl	0
loc	31
rs	8.0555
c	0
b	0
f	0

<?php
declare(strict_types=1);

namespace TYPO3Fluid\Fluid\Core\Parser;

/**
 * Splitter
 *
 * Byte-based calculations to perform splitting on Fluid template sources.
 * Uses (64bit) bit masking to detect characters that may split a template,
 * by grouping "interesting" bytes which have ordinal values within a value
 * range of maximum 64 and comparing the bit mask of this and the byte being
 * analysed.
 *
 * Contains the methods needed to iterate and match bytes based on (mutating)
 * bit-masks, and a couple of shorthand "peek" type methods to determine if
 * the current yield should be a certain type or another.
 *
 * The logic is essentially the equivalent of:
 *
 * - Using arrays of possible byte values
 * - Iterating characters and checking against the must-match bytes
 * - Using "substr" to extract relevant bits of template code
 *
 * The difference is that the method in this class is considerably faster than
 * any array-based counterpart and consumes orders of magnitude less memory.
 * It also means the opcode-optimised version of the loop and comparisons can
 * use bit-level CPU instructions, making them both smaller and more efficient
 * when compiled.
 *
 * Works by:
 *
 * - Iterating a byte value array while maintaining an internal pointer
 * - Yielding the matched byte together with the text captured since the last yield
 * - Reloading the bit masks used in the next iteration when yielding
 */
class Splitter
{
    public const MAX_NAMESPACE_LENGTH = 10;

    public const BYTE_NULL = 0; // Zero-byte for terminating documents
    public const BYTE_INLINE = 123; // The "{" character indicating an inline expression started
    public const BYTE_INLINE_END = 125; // The "}" character indicating an inline expression ended
    public const BYTE_PIPE = 124; // The "|" character indicating an inline expression pass operation
    public const BYTE_MINUS = 45; // The "-" character (for legacy pass operations)
    public const BYTE_TAG = 60; // The "<" character indicating a tag has started
    public const BYTE_TAG_END = 62; // The ">" character indicating a tag has ended
    public const BYTE_TAG_CLOSE = 47; // The "/" character indicating a tag is a closing tag
    public const BYTE_QUOTE_DOUBLE = 34; // The " (standard double-quote) character
    public const BYTE_QUOTE_SINGLE = 39; // The ' (standard single-quote) character
    public const BYTE_WHITESPACE_SPACE = 32; // A standard space character
    public const BYTE_WHITESPACE_TAB = 9; // A standard tab character
    public const BYTE_WHITESPACE_RETURN = 13; // A standard carriage-return character
    public const BYTE_WHITESPACE_EOL = 10; // A standard (UNIX) line-break character
    public const BYTE_SEPARATOR_EQUALS = 61; // The "=" character
    public const BYTE_SEPARATOR_COLON = 58; // The ":" character
    public const BYTE_SEPARATOR_COMMA = 44; // The "," character
    public const BYTE_SEPARATOR_PIPE = 124; // The "|" character
    public const BYTE_PARENTHESIS_START = 40; // The "(" character
    public const BYTE_PARENTHESIS_END = 41; // The ")" character
    public const BYTE_ARRAY_START = 91; // The "[" character
    public const BYTE_ARRAY_END = 93; // The "]" character
    public const BYTE_SLASH = 47; // The "/" character
    public const BYTE_BACKSLASH = 92; // The "\" character
    public const BYTE_BACKTICK = 96; // The "`" character
    public const MAP_SHIFT = 64;
    public const MASK_LINEBREAKS = 0 | (1 << self::BYTE_WHITESPACE_EOL) | (1 << self::BYTE_WHITESPACE_RETURN);
    public const MASK_WHITESPACE = 0 | self::MASK_LINEBREAKS | (1 << self::BYTE_WHITESPACE_SPACE) | (1 << self::BYTE_WHITESPACE_TAB);

    /** @var Source */
    public $source;

    /** @var Context */
    public $context;

    /** @var Contexts */
    public $contexts;

    /** @var \NoRewindIterator */
    public $sequence;

    public $index = 0;
    private $primaryMask = 0;
    private $secondaryMask = 0;

    public function __construct(Source $source, Contexts $contexts)
    {
        $this->source = $source;
        $this->contexts = $contexts;
        $this->switch($contexts->root);
        $this->sequence = $this->parse();
    }

    /**
     * Creates a dump, starting from the first line break before $position,
     * to the next line break from $position, counting the lines and characters
     * and inserting a marker pointing to the exact offending character.
     *
     * Is not very efficient - but adds bug tracing information. Should only
     * be called when exceptions are raised during sequencing.
     *
     * @param Position $position
     * @return string
     */
    public function extractSourceDumpOfLineAtPosition(Position $position): string
    {
        $lines = $this->countCharactersMatchingMask(Splitter::MASK_LINEBREAKS, 1, $position->index) + 1;
        $offset = $this->findBytePositionBeforeOffset(Splitter::MASK_LINEBREAKS, $position->index);
        $line = substr(
            $this->source->source,
            $offset,
            // substr() expects a length, not an end position, so subtract the start offset.
            $this->findBytePositionAfterOffset(Splitter::MASK_LINEBREAKS, $position->index) - $offset
        );
        $character = $position->index - $offset - 1;
        $string = 'Line ' . $lines . ' character ' . $character . PHP_EOL;
        $string .= PHP_EOL;
        $string .= str_repeat(' ', max($character, 0)) . 'v' . PHP_EOL;
        $string .= trim($line) . PHP_EOL;
        $string .= str_repeat(' ', max($character, 0)) . '^' . PHP_EOL;
        return $string;
    }

    public function createErrorAtPosition(string $message, int $code): SequencingException
    {
        $position = new Position($this->context, $this->index);
        $ascii = (string) $this->source->bytes[$this->index];
        $message .= ' ASCII: ' . $ascii . ': ' . $this->extractSourceDumpOfLineAtPosition($position);
        $error = new SequencingException($message, $code);
        return $error;
    }

    public function createUnsupportedArgumentError(string $argument, array $definitions): SequencingException
    {
        return $this->createErrorAtPosition(
            sprintf(
                'Unsupported argument "%s". Supported: ' . implode(', ', array_keys($definitions)),
                $argument
            ),
            1558298976
        );
    }

    /**
     * Split a string by searching for recognized characters using at least one,
     * optionally two bit masks consisting of OR'ed bit values of each detectable
     * character (byte). The secondary bit mask is costless as it is OR'ed into
     * the primary bit mask.
     *
     * @return \NoRewindIterator|string[]|null[]
     */
    public function parse(): \NoRewindIterator
    {
        return new \NoRewindIterator($this->createGenerator());
    }

    /**
     * Split a string by searching for recognized characters using at least one,
     * optionally two bit masks consisting of OR'ed bit values of each detectable
     * character (byte). The secondary bit mask is costless as it is OR'ed into
     * the primary bit mask.
     *
     * @return \Generator|string[]|null[]
     */
    public function createGenerator(): \Generator
    {
        $bytes = &$this->source->bytes;
        $source = &$this->source->source;

        if (empty($bytes)) {
            yield Splitter::BYTE_NULL => null;
            return;
        }

        $captured = null;

        foreach ($bytes as $this->index => $byte) {
            // Decide which byte we encountered. First check whether the byte is below 64 and set in the primary
            // mask (not-mapped match). Next check whether it is above 64 and below 128 and, shifted down by
            // MAP_SHIFT, set in the secondary mask (mapped match). Anything else, including every non-ASCII
            // byte (>= 128), is always captured.
            if ($byte < 64 && ($this->primaryMask & (1 << $byte))) {
                yield $byte => $captured;
                $captured = null;
            } elseif ($byte > 64 && $byte < 128 && ($this->secondaryMask & (1 << ($byte - static::MAP_SHIFT)))) {
                yield $byte => $captured;
                $captured = null;
            } else {
                // Append captured bytes from source, must happen after the conditions above so we avoid appending tokens.
                $captured .= $source[$this->index - 1];
            }
        }
        if ($captured !== null) {
            yield Splitter::BYTE_NULL => $captured;
        }
    }

    public function switch(Context $context): Context
    {
        $previous = $this->context;
        $this->context = $context;
        $this->primaryMask = $context->primaryMask;
        $this->secondaryMask = $context->secondaryMask;
        return $previous ?? $context;
    }

    public function countCharactersMatchingMask(int $primaryMask, int $offset, int $length): int
    {
        $bytes = &$this->source->bytes;
        $counted = 0;
        for ($index = $offset; $index < $this->source->length; $index++) {
            if (($primaryMask & (1 << $bytes[$index])) && $bytes[$index] < 64) {
                $counted++;
            }
        }
        return $counted;
    }

    public function findBytePositionBeforeOffset(int $primaryMask, int $offset): int
    {
        $bytes = &$this->source->bytes;
        for ($index = min($offset, $this->source->length); $index > 0; $index--) {
            if (($primaryMask & (1 << $bytes[$index])) && $bytes[$index] < 64) {
                return $index;
            }
        }
        return 0;
    }

    public function findBytePositionAfterOffset(int $primaryMask, int $offset): int
    {
        $bytes = &$this->source->bytes;
        for ($index = $offset; $index < $this->source->length; $index++) {
            if (($primaryMask & (1 << $bytes[$index])) && $bytes[$index] < 64) {
                return $index;
            }
        }
        return max($this->source->length, $offset);
    }
}
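
The two-mask matching used by createGenerator() can be tried in isolation. The following is a minimal sketch, not the code from the pull request: buildMask() and splitByMask() are hypothetical helper names, the byte array is assumed to come from unpack('C*', ...) (which is 1-indexed, hence the "- 1" when reading from the source string), and the sketch yields [byte, captured] pairs instead of byte => captured so that repeated bytes do not collide as keys when collected into an array. Unlike the inspected method, the sketch treats byte 64 as part of the mapped range.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of Fluid): builds a 64-bit mask
 * from bit positions in the range 0-63.
 */
function buildMask(int ...$bits): int
{
    $mask = 0;
    foreach ($bits as $bit) {
        if ($bit < 0 || $bit > 63) {
            throw new InvalidArgumentException('Bit ' . $bit . ' does not fit a 64-bit mask');
        }
        $mask |= 1 << $bit;
    }
    return $mask;
}

/**
 * Hypothetical stand-alone splitter: yields [matchedByte, capturedText] pairs.
 * Bytes 0-63 are tested against $primaryMask, bytes 64-127 (shifted down by 64)
 * against $secondaryMask; everything else is appended to the captured text.
 */
function splitByMask(string $source, int $primaryMask, int $secondaryMask): Generator
{
    $captured = null;
    // unpack('C*') returns a 1-indexed array of byte values.
    foreach (unpack('C*', $source) ?: [] as $index => $byte) {
        $matched = ($byte < 64 && ($primaryMask & (1 << $byte)))
            || ($byte >= 64 && $byte < 128 && ($secondaryMask & (1 << ($byte - 64))));
        if ($matched) {
            yield [$byte, $captured];
            $captured = null;
        } else {
            $captured .= $source[$index - 1];
        }
    }
    if ($captured !== null) {
        // Trailing text is reported with the null byte, as in the inspected method.
        yield [0, $captured];
    }
}

$primary = buildMask(60, 62);                // "<" and ">"
$secondary = buildMask(123 - 64, 125 - 64);  // "{" and "}" shifted into 0-63
$tokens = iterator_to_array(splitByMask('a<b>{c}', $primary, $secondary), false);
```

Each entry of $tokens pairs the splitting byte with the text that preceded it; a null capture means two splitting bytes were adjacent.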


Documentation Bug (introduced 2019-06-03 22:50 UTC), reported on `$this->sequence = $this->parse();` in Splitter::__construct()

It seems like `$this->parse()` can also be of type `array<integer,string>` or `array<integer,null>`. However, the property `$sequence` is declared as type `object<NoRewindIterator>`. Maybe add an additional type check?

Our type inference engine has found a suspicious assignment of a value to a property. This check raises an issue when a value that can be of a mixed type is assigned to a property that is type hinted more strictly.

For example, imagine you have a variable `$accountId` that can either hold an Id object or false (if there is no account id yet). Your code now assigns that value to the `id` property of an instance of the `Account` class. This class holds a proper account, so the id value must no longer be false.

Either this assignment is in error or a type check should be added for that assignment.

    class Id
    {
        public $id;

        public function __construct($id)
        {
            $this->id = $id;
        }
    }

    class Account
    {
        /** @var Id $id */
        public $id;
    }

    $account_id = false;

    if (starsAreRight()) {
        $account_id = new Id(42);
    }

    $account = new Account();
    if ($account_id instanceof Id) {
        $account->id = $account_id;
    }

Unused Code (introduced 2019-05-31 19:29 UTC), reported on `countCharactersMatchingMask(int $primaryMask, int $offset, int $length)`

The parameter `$length` is not used and could be removed.

This check looks for parameters that have been defined for a function or method, but which are not used in the method body.
