Issues in generateCollationData.php (master) - Issues in master - wikimedia/mediawiki - Measure and Improve Code Quality continuously with Scrutinizer

Issues (4122)

Security Analysis not enabled

Cross-Site Scripting

Cross-Site Scripting enables an attacker to inject code into the response of a web-request that is viewed by other users. It can for example be used to bypass access controls, or even to take over other users' accounts.

File Exposure

File Exposure allows an attacker to gain access to local files that he should not be able to access. These files can for example include database credentials, or other configuration files.

File Manipulation

File Manipulation enables an attacker to write custom data to files. This potentially leads to injection of arbitrary code on the server.

Object Injection

Object Injection enables an attacker to inject an object into PHP code, and can lead to arbitrary code execution, file exposure, or file manipulation attacks.

Code Injection

Code Injection enables an attacker to execute arbitrary code on the server.

Response Splitting

Response Splitting can be used to send arbitrary responses.

File Inclusion

File Inclusion enables an attacker to inject custom files into PHP's file loading mechanism, either explicitly passed to include, or for example via PHP's auto-loading mechanism.

Command Injection

Command Injection enables an attacker to inject a shell command that is execute with the privileges of the web-server. This can be used to expose sensitive data, or gain access of your server.

SQL Injection

SQL Injection enables an attacker to execute arbitrary SQL code on your database server gaining access to user data, or manipulating user data.

XPath Injection

XPath Injection enables an attacker to modify the parts of XML document that are read. If that XML document is for example used for authentication, this can lead to further vulnerabilities similar to SQL Injection.

LDAP Injection

LDAP Injection enables an attacker to inject LDAP statements potentially granting permission to run unauthorized queries, or modify content inside the LDAP tree.

Header Injection

Other Vulnerability

This category comprises other attack vectors such as manipulating the PHP runtime, loading custom extensions, freezing the runtime, or similar.

Regex Injection

Regex Injection enables an attacker to execute arbitrary code in your PHP process.

XML Injection

XML Injection enables an attacker to read files on your local filesystem including configuration files, or can be abused to freeze your web-server process.

Variable Injection

Variable Injection enables an attacker to overwrite program variables with custom data, and can lead to further vulnerabilities.

Unfortunately, the security analysis is currently not available for your project. If you are a non-commercial open-source project, please contact support to gain access.

maintenance/language/generateCollationData.php (7 issues)

Labels

Severity

Upgrade to new PHP Analysis Engine

These results are based on our legacy PHP analysis, consider migrating to our new PHP analysis engine instead. Learn more

<?php

/**
 * Maintenance script to generate first letter data files for Collation.php.
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License along
 * with this program; if not, write to the Free Software Foundation, Inc.,
 * 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
 * http://www.gnu.org/copyleft/gpl.html
 *
 * @file
 * @ingroup MaintenanceLanguage
 */

require_once __DIR__ . '/../Maintenance.php';

/**
 * Generate first letter data files for Collation.php
 *
 * @ingroup MaintenanceLanguage
 */
class GenerateCollationData extends Maintenance {
	/** The directory with source data files in it */
	public $dataDir;

	/** The primary weights, indexed by codepoint */
	public $weights;

	/**
	 * A hashtable keyed by codepoint, where presence indicates that a character
	 * has a decomposition mapping. This makes it non-preferred for group header
	 * selection.
	 */
	public $mappedChars;

	public $debugOutFile;

	/**
	 * Important tertiary weights from UTS #10 section 7.2
	 */
	const NORMAL_UPPERCASE = 0x08;
	const NORMAL_HIRAGANA = 0x0E;

	public function __construct() {
		parent::__construct();
		$this->addOption( 'data-dir', 'A directory on the local filesystem ' .
			'containing allkeys.txt and ucd.all.grouped.xml from unicode.org',
			false, true );
		$this->addOption( 'debug-output', 'Filename for sending debug output to',
			false, true );
	}

	public function execute() {
		$this->dataDir = $this->getOption( 'data-dir', '.' );

		$allkeysPresent = file_exists( "{$this->dataDir}/allkeys.txt" );
		$ucdallPresent = file_exists( "{$this->dataDir}/ucd.all.grouped.xml" );

		// As of January 2013, these links work for all versions of Unicode
		// between 5.1 and 6.2, inclusive.
		$allkeysURL = "http://www.unicode.org/Public/UCA/<Unicode version>/allkeys.txt";
		$ucdallURL = "http://www.unicode.org/Public/<Unicode version>/ucdxml/ucd.all.grouped.zip";

		if ( !$allkeysPresent || !$ucdallPresent ) {
			$icuVersion = IcuCollation::getICUVersion();
			$unicodeVersion = IcuCollation::getUnicodeVersionForICU();

			$error = "";

			if ( !$allkeysPresent ) {
				$error .= "Unable to find allkeys.txt. "
					. "Download it and specify its location with --data-dir=<DIR>. "
					. "\n\n";
			}
			if ( !$ucdallPresent ) {
				$error .= "Unable to find ucd.all.grouped.xml. "
					. "Download it, unzip, and specify its location with --data-dir=<DIR>. "
					. "\n\n";
			}

			$versionKnown = false;
			if ( !$icuVersion ) {
				// Unknown version - either very old intl,
				// or PHP < 5.3.7 which does not expose this information
				$error .= "As MediaWiki could not determine the version of ICU library used by your PHP's "
					. "intl extension it can't suggest which file version to download. "
					. "This can be caused by running a very old version of intl or PHP < 5.3.7. "
					. "If you are sure everything is all right, find out the ICU version "
					. "by running phpinfo(), check what is the Unicode version it is using "
					. "at http://site.icu-project.org/download, then try finding appropriate data file(s) at:";
			} elseif ( version_compare( $icuVersion, "4.0", "<" ) ) {
				// Extra old version
				$error .= "You are using outdated version of ICU ($icuVersion), intended for "
					. ( $unicodeVersion ? "Unicode $unicodeVersion" : "an unknown version of Unicode" )
					. "; this file might not be avalaible for it, and it's not supported by MediaWiki. "
					. " You are on your own; consider upgrading PHP's intl extension or try "
					. "one of the files available at:";
			} elseif ( version_compare( $icuVersion, "51.0", ">=" ) ) {
				// Extra recent version
				$error .= "You are using ICU $icuVersion, released after this script was last updated. "
					. "Check what is the Unicode version it is using at http://site.icu-project.org/download . "
					. "It can't be guaranteed everything will work, but appropriate file(s) should "
					. "be available at:";
			} else {
				// ICU 4.0 to 50.x
				$versionKnown = true;
				$error .= "You are using ICU $icuVersion, intended for "
					. ( $unicodeVersion ? "Unicode $unicodeVersion" : "an unknown version of Unicode" )
					. ". Appropriate file(s) should be available at:";
			}
			$error .= "\n";

			if ( $versionKnown && $unicodeVersion ) {
''   == false // true
''   == null  // true
'ab' == false // false
'ab' == null  // false

// It is often better to use strict comparison
'' === false // false
'' === null  // false
				$allkeysURL = str_replace( "<Unicode version>", "$unicodeVersion.0", $allkeysURL );
				$ucdallURL = str_replace( "<Unicode version>", "$unicodeVersion.0", $ucdallURL );
			}

			if ( !$allkeysPresent ) {
				$error .= "* $allkeysURL\n";
			}
			if ( !$ucdallPresent ) {
				$error .= "* $ucdallURL\n";
			}

			$this->error( $error );
			exit( 1 );

		}

		$debugOutFileName = $this->getOption( 'debug-output' );
		if ( $debugOutFileName ) {
			$this->debugOutFile = fopen( $debugOutFileName, 'w' );
			if ( !$this->debugOutFile ) {
				$this->error( "Unable to open debug output file for writing" );
				exit( 1 );

			}
		}
		$this->loadUcd();
		$this->generateFirstChars();
	}

	function loadUcd() {
		$uxr = new UcdXmlReader( "{$this->dataDir}/ucd.all.grouped.xml" );
		$uxr->readChars( [ $this, 'charCallback' ] );
	}

	function charCallback( $data ) {
		// Skip non-printable characters,
		// but do not skip a normal space (U+0020) since
		// people like to use that as a fake no header symbol.
		$category = substr( $data['gc'], 0, 1 );
		if ( strpos( 'LNPS', $category ) === false
			&& $data['cp'] !== '0020'
		) {
			return;
		}
		$cp = hexdec( $data['cp'] );

		// Skip the CJK ideograph blocks, as an optimisation measure.
		// UCA doesn't sort them properly anyway, without tailoring.
		if ( IcuCollation::isCjk( $cp ) ) {
			return;
		}

		// Skip the composed Hangul syllables, we will use the bare Jamo
		// as first letters
		if ( $data['block'] == 'Hangul Syllables' ) {
			return;
		}

		// Calculate implicit weight per UTS #10 v6.0.0, sec 7.1.3
		if ( $data['UIdeo'] === 'Y' ) {
			if ( $data['block'] == 'CJK Unified Ideographs'
				|| $data['block'] == 'CJK Compatibility Ideographs'
			) {
				$base = 0xFB40;
			} else {
				$base = 0xFB80;
			}
		} else {
			$base = 0xFBC0;
		}
		$a = $base + ( $cp >> 15 );
		$b = ( $cp & 0x7fff ) | 0x8000;

		$this->weights[$cp] = sprintf( ".%04X.%04X", $a, $b );

		if ( $data['dm'] !== '#' ) {
			$this->mappedChars[$cp] = true;
		}

		if ( $cp % 4096 == 0 ) {
			print "{$data['cp']}\n";
		}
	}

	function generateFirstChars() {
		$file = fopen( "{$this->dataDir}/allkeys.txt", 'r' );
		if ( !$file ) {
			$this->error( "Unable to open allkeys.txt" );
			exit( 1 );

		}
		global $IP;
		$outFile = fopen( "$IP/serialized/first-letters-root.ser", 'w' );
		if ( !$outFile ) {
			$this->error( "Unable to open output file first-letters-root.ser" );
			exit( 1 );

		}

		$goodTertiaryChars = [];

		// For each character with an entry in allkeys.txt, overwrite the implicit
		// entry in $this->weights that came from the UCD.
		// Also gather a list of tertiary weights, for use in selecting the group header
		while ( false !== ( $line = fgets( $file ) ) ) {
			// We're only interested in single-character weights, pick them out with a regex
			$line = trim( $line );
			if ( !preg_match( '/^([0-9A-F]+)\s*;\s*([^#]*)/', $line, $m ) ) {
				continue;
			}

			$cp = hexdec( $m[1] );
			$allWeights = trim( $m[2] );
			$primary = '';
			$tertiary = '';

			if ( !isset( $this->weights[$cp] ) ) {
				// Non-printable, ignore
				continue;
			}
			foreach ( StringUtils::explode( '[', $allWeights ) as $weightStr ) {
				preg_match_all( '/[*.]([0-9A-F]+)/', $weightStr, $m );
				if ( !empty( $m[1] ) ) {
					if ( $m[1][0] !== '0000' ) {
						$primary .= '.' . $m[1][0];
					}
					if ( $m[1][2] !== '0000' ) {
						$tertiary .= '.' . $m[1][2];
					}
				}
			}
			$this->weights[$cp] = $primary;
			if ( $tertiary === '.0008'
				|| $tertiary === '.000E'
			) {
				$goodTertiaryChars[$cp] = true;
			}
		}
		fclose( $file );

		// Identify groups of characters with the same primary weight
		$this->groups = [];
class MyClass { }

$x = new MyClass();
$x->foo = true;
		asort( $this->weights, SORT_STRING );
		$prevWeight = reset( $this->weights );
		$group = [];
		foreach ( $this->weights as $cp => $weight ) {
			if ( $weight !== $prevWeight ) {
				$this->groups[$prevWeight] = $group;
				$prevWeight = $weight;
				if ( isset( $this->groups[$weight] ) ) {
					$group = $this->groups[$weight];
				} else {
					$group = [];
				}
			}
			$group[] = $cp;
		}
		if ( $group ) {
			$this->groups[$prevWeight] = $group;
		}

		// If one character has a given primary weight sequence, and a second
		// character has a longer primary weight sequence with an initial
		// portion equal to the first character, then remove the second
		// character. This avoids having characters like U+A732 (double A)
		// polluting the basic latin sort area.

		foreach ( $this->groups as $weight => $group ) {
			if ( preg_match( '/(\.[0-9A-F]*)\./', $weight, $m ) ) {
				if ( isset( $this->groups[$m[1]] ) ) {
					unset( $this->groups[$weight] );
				}
			}
		}

		ksort( $this->groups, SORT_STRING );

		// Identify the header character in each group
		$headerChars = [];
		$prevChar = "\000";
		$tertiaryCollator = new Collator( 'root' );
		$primaryCollator = new Collator( 'root' );
		$primaryCollator->setStrength( Collator::PRIMARY );
		$numOutOfOrder = 0;
		foreach ( $this->groups as $weight => $group ) {
			$uncomposedChars = [];
			$goodChars = [];
			foreach ( $group as $cp ) {
				if ( isset( $goodTertiaryChars[$cp] ) ) {
					$goodChars[] = $cp;
				}
				if ( !isset( $this->mappedChars[$cp] ) ) {
					$uncomposedChars[] = $cp;
				}
			}
			$x = array_intersect( $goodChars, $uncomposedChars );
			if ( !$x ) {
				$x = $uncomposedChars;
				if ( !$x ) {
					$x = $group;
				}
			}

			// Use ICU to pick the lowest sorting character in the selection
			$tertiaryCollator->sort( $x );
			$cp = $x[0];

			$char = UtfNormal\Utils::codepointToUtf8( $cp );
			$headerChars[] = $char;
			if ( $primaryCollator->compare( $char, $prevChar ) <= 0 ) {
				$numOutOfOrder++;
				/*
				printf( "Out of order: U+%05X > U+%05X\n",
					utf8ToCodepoint( $prevChar ),
					utf8ToCodepoint( $char ) );
				 */
			}
			$prevChar = $char;

			if ( $this->debugOutFile ) {
				fwrite( $this->debugOutFile, sprintf( "%05X %s %s (%s)\n", $cp, $weight, $char,
					implode( ' ', array_map( 'UtfNormal\Utils::codepointToUtf8', $group ) ) ) );
			}
		}

		print "Out of order: $numOutOfOrder / " . count( $headerChars ) . "\n";

		fwrite( $outFile, serialize( $headerChars ) );
	}
}

class UcdXmlReader {
	public $fileName;
	public $callback;
	public $groupAttrs;
	public $xml;
	public $blocks = [];
	public $currentBlock;

	function __construct( $fileName ) {
		$this->fileName = $fileName;
	}

	public function readChars( $callback ) {
		$this->getBlocks();
		$this->currentBlock = reset( $this->blocks );
		$xml = $this->open();
		$this->callback = $callback;

		while ( $xml->name !== 'repertoire' && $xml->next() );

		while ( $xml->read() ) {
			if ( $xml->nodeType == XMLReader::ELEMENT ) {
				if ( $xml->name === 'group' ) {
					$this->groupAttrs = $this->readAttributes();
				} elseif ( $xml->name === 'char' ) {
					$this->handleChar();
				}
			} elseif ( $xml->nodeType === XMLReader::END_ELEMENT ) {
				if ( $xml->name === 'group' ) {
					$this->groupAttrs = [];
				}
			}
		}
		$xml->close();
	}

	protected function open() {
		$this->xml = new XMLReader;
		$this->xml->open( $this->fileName );
		if ( !$this->xml ) {
			throw new MWException( __METHOD__ . ": unable to open {$this->fileName}" );
		}
		while ( $this->xml->name !== 'ucd' && $this->xml->read() );
		$this->xml->read();

		return $this->xml;
	}

	/**
	 * Read the attributes of the current element node and return them
	 * as an array
	 * @return array
	 */
	protected function readAttributes() {
		$attrs = [];
		while ( $this->xml->moveToNextAttribute() ) {
			$attrs[$this->xml->name] = $this->xml->value;
		}

		return $attrs;
	}

	protected function handleChar() {
		$attrs = $this->readAttributes() + $this->groupAttrs;
		if ( isset( $attrs['cp'] ) ) {
			$first = $last = hexdec( $attrs['cp'] );
		} else {
			$first = hexdec( $attrs['first-cp'] );
			$last = hexdec( $attrs['last-cp'] );
			unset( $attrs['first-cp'] );
			unset( $attrs['last-cp'] );
		}

		for ( $cp = $first; $cp <= $last; $cp++ ) {
			$hexCp = sprintf( "%04X", $cp );
			foreach ( [ 'na', 'na1' ] as $nameProp ) {
				if ( isset( $attrs[$nameProp] ) ) {
					$attrs[$nameProp] = str_replace( '#', $hexCp, $attrs[$nameProp] );
				}
			}

			while ( $this->currentBlock ) {
				if ( $cp < $this->currentBlock[0] ) {
					break;
				} elseif ( $cp <= $this->currentBlock[1] ) {
					$attrs['block'] = key( $this->blocks );
					break;
				} else {
					$this->currentBlock = next( $this->blocks );
				}
			}

			$attrs['cp'] = $hexCp;
			call_user_func( $this->callback, $attrs );
		}
	}

	public function getBlocks() {
		if ( $this->blocks ) {
			return $this->blocks;
		}

		$xml = $this->open();
		while ( $xml->name !== 'blocks' && $xml->read() );

		while ( $xml->read() ) {
			if ( $xml->nodeType == XMLReader::ELEMENT ) {
				if ( $xml->name === 'block' ) {
					$attrs = $this->readAttributes();
					$first = hexdec( $attrs['first-cp'] );
					$last = hexdec( $attrs['last-cp'] );
					$this->blocks[$attrs['name']] = [ $first, $last ];
				}
			}
		}
		$xml->close();

		return $this->blocks;
	}
}

$maintClass = 'GenerateCollationData';
require_once RUN_MAINTENANCE_IF_MAIN;


1			<?php
			0 ignored issues – show Coding Style Compatibility introduced 2016-02-23 18:14 UTC by Report Bug Copy Issue Report Show Similar Issues like this For compatibility and reusability of your code, PSR1 recommends that a file should introduce either new symbols (like classes, functions, etc.) or have side-effects (like outputting something, or including other files), but not both at the same time. The first symbol is defined on line 31 and the first side effect is on line 24. The PSR-1: Basic Coding Standard recommends that a file should either introduce new symbols, that is classes, functions, constants or similar, or have side effects. Side effects are anything that executes logic, like for example printing output, changing ini settings or writing to a file. The idea behind this recommendation is that merely auto-loading a class should not change the state of an application. It also promotes a cleaner style of programming and makes your code less prone to errors, because the logic is not spread out all over the place. To learn more about the PSR-1, please see the PHP-FIG site on the PSR-1. Loading history...
2			/**
3			* Maintenance script to generate first letter data files for Collation.php.
4			*
5			* This program is free software; you can redistribute it and/or modify
6			* it under the terms of the GNU General Public License as published by
7			* the Free Software Foundation; either version 2 of the License, or
8			* (at your option) any later version.
9			*
10			* This program is distributed in the hope that it will be useful,
11			* but WITHOUT ANY WARRANTY; without even the implied warranty of
12			* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
13			* GNU General Public License for more details.
14			*
15			* You should have received a copy of the GNU General Public License along
16			* with this program; if not, write to the Free Software Foundation, Inc.,
17			* 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
18			* http://www.gnu.org/copyleft/gpl.html
19			*
20			* @file
21			* @ingroup MaintenanceLanguage
22			*/
23
24			require_once __DIR__ . '/../Maintenance.php';
25
26			/**
27			* Generate first letter data files for Collation.php
28			*
29			* @ingroup MaintenanceLanguage
30			*/
31			class GenerateCollationData extends Maintenance {
32			/** The directory with source data files in it */
33			public $dataDir;
34
35			/** The primary weights, indexed by codepoint */
36			public $weights;
37
38			/**
39			* A hashtable keyed by codepoint, where presence indicates that a character
40			* has a decomposition mapping. This makes it non-preferred for group header
41			* selection.
42			*/
43			public $mappedChars;
44
45			public $debugOutFile;
46
47			/**
48			* Important tertiary weights from UTS #10 section 7.2
49			*/
50			const NORMAL_UPPERCASE = 0x08;
51			const NORMAL_HIRAGANA = 0x0E;
52
53			public function __construct() {
54			parent::__construct();
55			$this->addOption( 'data-dir', 'A directory on the local filesystem ' .
56			'containing allkeys.txt and ucd.all.grouped.xml from unicode.org',
57			false, true );
58			$this->addOption( 'debug-output', 'Filename for sending debug output to',
59			false, true );
60			}
61
62			public function execute() {
63			$this->dataDir = $this->getOption( 'data-dir', '.' );
64
65			$allkeysPresent = file_exists( "{$this->dataDir}/allkeys.txt" );
66			$ucdallPresent = file_exists( "{$this->dataDir}/ucd.all.grouped.xml" );
67
68			// As of January 2013, these links work for all versions of Unicode
69			// between 5.1 and 6.2, inclusive.
70			$allkeysURL = "http://www.unicode.org/Public/UCA/<Unicode version>/allkeys.txt";
71			$ucdallURL = "http://www.unicode.org/Public/<Unicode version>/ucdxml/ucd.all.grouped.zip";
72
73			if ( !$allkeysPresent \|\| !$ucdallPresent ) {
74			$icuVersion = IcuCollation::getICUVersion();
75			$unicodeVersion = IcuCollation::getUnicodeVersionForICU();
76
77			$error = "";
78
79			if ( !$allkeysPresent ) {
80			$error .= "Unable to find allkeys.txt. "
81			. "Download it and specify its location with --data-dir=<DIR>. "
82			. "\n\n";
83			}
84			if ( !$ucdallPresent ) {
85			$error .= "Unable to find ucd.all.grouped.xml. "
86			. "Download it, unzip, and specify its location with --data-dir=<DIR>. "
87			. "\n\n";
88			}
89
90			$versionKnown = false;
91			if ( !$icuVersion ) {
92			// Unknown version - either very old intl,
93			// or PHP < 5.3.7 which does not expose this information
94			$error .= "As MediaWiki could not determine the version of ICU library used by your PHP's "
95			. "intl extension it can't suggest which file version to download. "
96			. "This can be caused by running a very old version of intl or PHP < 5.3.7. "
97			. "If you are sure everything is all right, find out the ICU version "
98			. "by running phpinfo(), check what is the Unicode version it is using "
99			. "at http://site.icu-project.org/download, then try finding appropriate data file(s) at:";
100			} elseif ( version_compare( $icuVersion, "4.0", "<" ) ) {
101			// Extra old version
102			$error .= "You are using outdated version of ICU ($icuVersion), intended for "
103			. ( $unicodeVersion ? "Unicode $unicodeVersion" : "an unknown version of Unicode" )
104			. "; this file might not be avalaible for it, and it's not supported by MediaWiki. "
105			. " You are on your own; consider upgrading PHP's intl extension or try "
106			. "one of the files available at:";
107			} elseif ( version_compare( $icuVersion, "51.0", ">=" ) ) {
108			// Extra recent version
109			$error .= "You are using ICU $icuVersion, released after this script was last updated. "
110			. "Check what is the Unicode version it is using at http://site.icu-project.org/download . "
111			. "It can't be guaranteed everything will work, but appropriate file(s) should "
112			. "be available at:";
113			} else {
114			// ICU 4.0 to 50.x
115			$versionKnown = true;
116			$error .= "You are using ICU $icuVersion, intended for "
117			. ( $unicodeVersion ? "Unicode $unicodeVersion" : "an unknown version of Unicode" )
118			. ". Appropriate file(s) should be available at:";
119			}
120			$error .= "\n";
121
122			if ( $versionKnown && $unicodeVersion ) {
			0 ignored issues – show Bug Best Practice introduced 2016-01-16 18:00 UTC by Report Bug Copy Issue Report Show Similar Issues like this The expression `$unicodeVersion` of type `false\|string` is loosely compared to `true`; this is ambiguous if the string can be empty. You might want to explicitly use `!== false` instead. In PHP, under loose comparison (like `==`, or `!=`, or `switch` conditions), values of different types might be equal. For `string` values, the empty string `''` is a special case, in particular the following results might be unexpected: '' == false // true '' == null // true 'ab' == false // false 'ab' == null // false // It is often better to use strict comparison '' === false // false '' === null // false Loading history...
123			$allkeysURL = str_replace( "<Unicode version>", "$unicodeVersion.0", $allkeysURL );
124			$ucdallURL = str_replace( "<Unicode version>", "$unicodeVersion.0", $ucdallURL );
125			}
126
127			if ( !$allkeysPresent ) {
128			$error .= "* $allkeysURL\n";
129			}
130			if ( !$ucdallPresent ) {
131			$error .= "* $ucdallURL\n";
132			}
133
134			$this->error( $error );
135			exit( 1 );
			0 ignored issues – show Coding Style Compatibility introduced 2016-01-16 18:00 UTC by Report Bug Copy Issue Report Show Similar Issues like this The method `execute()` contains an exit expression. An exit expression should only be used in rare cases. For example, if you write a short command line script. In most cases however, using an `exit` expression makes the code untestable and often causes incompatibilities with other libraries. Thus, unless you are absolutely sure it is required here, we recommend to refactor your code to avoid its usage. Loading history...
136			}
137
138			$debugOutFileName = $this->getOption( 'debug-output' );
139			if ( $debugOutFileName ) {
140			$this->debugOutFile = fopen( $debugOutFileName, 'w' );
141			if ( !$this->debugOutFile ) {
142			$this->error( "Unable to open debug output file for writing" );
143			exit( 1 );
			0 ignored issues – show Coding Style Compatibility introduced 2016-01-16 18:00 UTC by Report Bug Copy Issue Report Show Similar Issues like this The method `execute()` contains an exit expression. An exit expression should only be used in rare cases. For example, if you write a short command line script. In most cases however, using an `exit` expression makes the code untestable and often causes incompatibilities with other libraries. Thus, unless you are absolutely sure it is required here, we recommend to refactor your code to avoid its usage. Loading history...
144			}
145			}
146			$this->loadUcd();
147			$this->generateFirstChars();
148			}
149
150			function loadUcd() {
151			$uxr = new UcdXmlReader( "{$this->dataDir}/ucd.all.grouped.xml" );
152			$uxr->readChars( [ $this, 'charCallback' ] );
153			}
154
155			function charCallback( $data ) {
156			// Skip non-printable characters,
157			// but do not skip a normal space (U+0020) since
158			// people like to use that as a fake no header symbol.
159			$category = substr( $data['gc'], 0, 1 );
160			if ( strpos( 'LNPS', $category ) === false
161			&& $data['cp'] !== '0020'
162			) {
163			return;
164			}
165			$cp = hexdec( $data['cp'] );
166
167			// Skip the CJK ideograph blocks, as an optimisation measure.
168			// UCA doesn't sort them properly anyway, without tailoring.
169			if ( IcuCollation::isCjk( $cp ) ) {
170			return;
171			}
172
173			// Skip the composed Hangul syllables, we will use the bare Jamo
174			// as first letters
175			if ( $data['block'] == 'Hangul Syllables' ) {
176			return;
177			}
178
179			// Calculate implicit weight per UTS #10 v6.0.0, sec 7.1.3
180			if ( $data['UIdeo'] === 'Y' ) {
181			if ( $data['block'] == 'CJK Unified Ideographs'
182			\|\| $data['block'] == 'CJK Compatibility Ideographs'
183			) {
184			$base = 0xFB40;
185			} else {
186			$base = 0xFB80;
187			}
188			} else {
189			$base = 0xFBC0;
190			}
191			$a = $base + ( $cp >> 15 );
192			$b = ( $cp & 0x7fff ) \| 0x8000;
193
194			$this->weights[$cp] = sprintf( ".%04X.%04X", $a, $b );
195
196			if ( $data['dm'] !== '#' ) {
197			$this->mappedChars[$cp] = true;
198			}
199
200			if ( $cp % 4096 == 0 ) {
201			print "{$data['cp']}\n";
202			}
203			}
204
205			function generateFirstChars() {
206			$file = fopen( "{$this->dataDir}/allkeys.txt", 'r' );
207			if ( !$file ) {
208			$this->error( "Unable to open allkeys.txt" );
209			exit( 1 );
			0 ignored issues – show Coding Style Compatibility introduced 2016-01-16 18:00 UTC by Report Bug Copy Issue Report Show Similar Issues like this The method `generateFirstChars()` contains an exit expression. An exit expression should only be used in rare cases. For example, if you write a short command line script. In most cases however, using an `exit` expression makes the code untestable and often causes incompatibilities with other libraries. Thus, unless you are absolutely sure it is required here, we recommend to refactor your code to avoid its usage. Loading history...
210			}
211			global $IP;
212			$outFile = fopen( "$IP/serialized/first-letters-root.ser", 'w' );
213			if ( !$outFile ) {
214			$this->error( "Unable to open output file first-letters-root.ser" );
215			exit( 1 );
			0 ignored issues – show Coding Style Compatibility introduced 2016-01-16 18:00 UTC by Report Bug Copy Issue Report Show Similar Issues like this The method `generateFirstChars()` contains an exit expression. An exit expression should only be used in rare cases. For example, if you write a short command line script. In most cases however, using an `exit` expression makes the code untestable and often causes incompatibilities with other libraries. Thus, unless you are absolutely sure it is required here, we recommend to refactor your code to avoid its usage. Loading history...
216			}
217
218			$goodTertiaryChars = [];
219
220			// For each character with an entry in allkeys.txt, overwrite the implicit
221			// entry in $this->weights that came from the UCD.
222			// Also gather a list of tertiary weights, for use in selecting the group header
223			while ( false !== ( $line = fgets( $file ) ) ) {
224			// We're only interested in single-character weights, pick them out with a regex
225			$line = trim( $line );
226			if ( !preg_match( '/^([0-9A-F]+)\s;\s([^#]*)/', $line, $m ) ) {
227			continue;
228			}
229
230			$cp = hexdec( $m[1] );
231			$allWeights = trim( $m[2] );
232			$primary = '';
233			$tertiary = '';
234
235			if ( !isset( $this->weights[$cp] ) ) {
236			// Non-printable, ignore
237			continue;
238			}
239			foreach ( StringUtils::explode( '[', $allWeights ) as $weightStr ) {
240			preg_match_all( '/[*.]([0-9A-F]+)/', $weightStr, $m );
241			if ( !empty( $m[1] ) ) {
242			if ( $m[1][0] !== '0000' ) {
243			$primary .= '.' . $m[1][0];
244			}
245			if ( $m[1][2] !== '0000' ) {
246			$tertiary .= '.' . $m[1][2];
247			}
248			}
249			}
250			$this->weights[$cp] = $primary;
251			if ( $tertiary === '.0008'
252			\|\| $tertiary === '.000E'
253			) {
254			$goodTertiaryChars[$cp] = true;
255			}
256			}
257			fclose( $file );
258
259			// Identify groups of characters with the same primary weight
260			$this->groups = [];
			0 ignored issues – show Bug introduced 2016-01-16 18:00 UTC by Report Bug Copy Issue Report Show Similar Issues like this The property `groups` does not exist. Did you maybe forget to declare it? In PHP it is possible to write to properties without declaring them. For example, the following is perfectly valid PHP code: class MyClass { } $x = new MyClass(); $x->foo = true; Generally, it is a good practice to explictly declare properties to avoid accidental typos and provide IDE auto-completion: class MyClass { public $foo; } $x = new MyClass(); $x->foo = true; Loading history...
261			asort( $this->weights, SORT_STRING );
262			$prevWeight = reset( $this->weights );
263			$group = [];
264			foreach ( $this->weights as $cp => $weight ) {
265			if ( $weight !== $prevWeight ) {
266			$this->groups[$prevWeight] = $group;
267			$prevWeight = $weight;
268			if ( isset( $this->groups[$weight] ) ) {
269			$group = $this->groups[$weight];
270			} else {
271			$group = [];
272			}
273			}
274			$group[] = $cp;
275			}
276			if ( $group ) {
277			$this->groups[$prevWeight] = $group;
278			}
279
280			// If one character has a given primary weight sequence, and a second
281			// character has a longer primary weight sequence with an initial
282			// portion equal to the first character, then remove the second
283			// character. This avoids having characters like U+A732 (double A)
284			// polluting the basic latin sort area.
285
286			foreach ( $this->groups as $weight => $group ) {
287			if ( preg_match( '/(\.[0-9A-F]*)\./', $weight, $m ) ) {
288			if ( isset( $this->groups[$m[1]] ) ) {
289			unset( $this->groups[$weight] );
290			}
291			}
292			}
293
294			ksort( $this->groups, SORT_STRING );
295
296			// Identify the header character in each group
297			$headerChars = [];
298			$prevChar = "\000";
299			$tertiaryCollator = new Collator( 'root' );
300			$primaryCollator = new Collator( 'root' );
301			$primaryCollator->setStrength( Collator::PRIMARY );
302			$numOutOfOrder = 0;
303			foreach ( $this->groups as $weight => $group ) {
304			$uncomposedChars = [];
305			$goodChars = [];
306			foreach ( $group as $cp ) {
307			if ( isset( $goodTertiaryChars[$cp] ) ) {
308			$goodChars[] = $cp;
309			}
310			if ( !isset( $this->mappedChars[$cp] ) ) {
311			$uncomposedChars[] = $cp;
312			}
313			}
314			$x = array_intersect( $goodChars, $uncomposedChars );
315			if ( !$x ) {
316			$x = $uncomposedChars;
317			if ( !$x ) {
318			$x = $group;
319			}
320			}
321
322			// Use ICU to pick the lowest sorting character in the selection
323			$tertiaryCollator->sort( $x );
324			$cp = $x[0];
325
326			$char = UtfNormal\Utils::codepointToUtf8( $cp );
327			$headerChars[] = $char;
328			if ( $primaryCollator->compare( $char, $prevChar ) <= 0 ) {
329			$numOutOfOrder++;
330			/*
331			printf( "Out of order: U+%05X > U+%05X\n",
332			utf8ToCodepoint( $prevChar ),
333			utf8ToCodepoint( $char ) );
334			*/
335			}
336			$prevChar = $char;
337
338			if ( $this->debugOutFile ) {
339			fwrite( $this->debugOutFile, sprintf( "%05X %s %s (%s)\n", $cp, $weight, $char,
340			implode( ' ', array_map( 'UtfNormal\Utils::codepointToUtf8', $group ) ) ) );
341			}
342			}
343
344			print "Out of order: $numOutOfOrder / " . count( $headerChars ) . "\n";
345
346			fwrite( $outFile, serialize( $headerChars ) );
347			}
348			}
349
350			class UcdXmlReader {
351			public $fileName;
352			public $callback;
353			public $groupAttrs;
354			public $xml;
355			public $blocks = [];
356			public $currentBlock;
357
358			function __construct( $fileName ) {
359			$this->fileName = $fileName;
360			}
361
362			public function readChars( $callback ) {
363			$this->getBlocks();
364			$this->currentBlock = reset( $this->blocks );
365			$xml = $this->open();
366			$this->callback = $callback;
367
368			while ( $xml->name !== 'repertoire' && $xml->next() );
369
370			while ( $xml->read() ) {
371			if ( $xml->nodeType == XMLReader::ELEMENT ) {
372			if ( $xml->name === 'group' ) {
373			$this->groupAttrs = $this->readAttributes();
374			} elseif ( $xml->name === 'char' ) {
375			$this->handleChar();
376			}
377			} elseif ( $xml->nodeType === XMLReader::END_ELEMENT ) {
378			if ( $xml->name === 'group' ) {
379			$this->groupAttrs = [];
380			}
381			}
382			}
383			$xml->close();
384			}
385
386			protected function open() {
387			$this->xml = new XMLReader;
388			$this->xml->open( $this->fileName );
389			if ( !$this->xml ) {
390			throw new MWException( __METHOD__ . ": unable to open {$this->fileName}" );
391			}
392			while ( $this->xml->name !== 'ucd' && $this->xml->read() );
393			$this->xml->read();
394
395			return $this->xml;
396			}
397
398			/**
399			* Read the attributes of the current element node and return them
400			* as an array
401			* @return array
402			*/
403			protected function readAttributes() {
404			$attrs = [];
405			while ( $this->xml->moveToNextAttribute() ) {
406			$attrs[$this->xml->name] = $this->xml->value;
407			}
408
409			return $attrs;
410			}
411
412			protected function handleChar() {
413			$attrs = $this->readAttributes() + $this->groupAttrs;
414			if ( isset( $attrs['cp'] ) ) {
415			$first = $last = hexdec( $attrs['cp'] );
416			} else {
417			$first = hexdec( $attrs['first-cp'] );
418			$last = hexdec( $attrs['last-cp'] );
419			unset( $attrs['first-cp'] );
420			unset( $attrs['last-cp'] );
421			}
422
423			for ( $cp = $first; $cp <= $last; $cp++ ) {
424			$hexCp = sprintf( "%04X", $cp );
425			foreach ( [ 'na', 'na1' ] as $nameProp ) {
426			if ( isset( $attrs[$nameProp] ) ) {
427			$attrs[$nameProp] = str_replace( '#', $hexCp, $attrs[$nameProp] );
428			}
429			}
430
431			while ( $this->currentBlock ) {
432			if ( $cp < $this->currentBlock[0] ) {
433			break;
434			} elseif ( $cp <= $this->currentBlock[1] ) {
435			$attrs['block'] = key( $this->blocks );
436			break;
437			} else {
438			$this->currentBlock = next( $this->blocks );
439			}
440			}
441
442			$attrs['cp'] = $hexCp;
443			call_user_func( $this->callback, $attrs );
444			}
445			}
446
447			public function getBlocks() {
448			if ( $this->blocks ) {
449			return $this->blocks;
450			}
451
452			$xml = $this->open();
453			while ( $xml->name !== 'blocks' && $xml->read() );
454
455			while ( $xml->read() ) {
456			if ( $xml->nodeType == XMLReader::ELEMENT ) {
457			if ( $xml->name === 'block' ) {
458			$attrs = $this->readAttributes();
459			$first = hexdec( $attrs['first-cp'] );
460			$last = hexdec( $attrs['last-cp'] );
461			$this->blocks[$attrs['name']] = [ $first, $last ];
462			}
463			}
464			}
465			$xml->close();
466
467			return $this->blocks;
468			}
469			}
470
471			$maintClass = 'GenerateCollationData';
472			require_once RUN_MAINTENANCE_IF_MAIN;
473

wikimedia / mediawiki

Issues (4122)

Security Analysis not enabled

maintenance/language/generateCollationData.php (7 issues)

Labels

Severity

Introduced By

Upgrade to new PHP Analysis Engine

Duplication Side-by-Side

Filter issues like