Completed
Push — update/react-dash-wordpress-i1... ( 3ef406...9a5312 )
by Jeremy
45:47 queued 38:34
created

Jetpack_Media_Meta_Extractor   F

Complexity

Total Complexity 79

Size/Duplication

Total Lines 496
Duplicated Lines 5.24 %

Coupling/Cohesion

Components 1
Dependencies 1

Importance

Changes 0
Metric Value
dl 26
loc 496
rs 2.08
c 0
b 0
f 0
wmc 79
lcom 1
cbo 1

7 Methods

Rating   Name   Duplication   Size   Complexity  
B extract() 0 38 6
F extract_from_content() 18 261 54
B get_image_fields() 8 41 5
A extract_images_from_content() 0 4 1
A build_image_struct() 0 16 4
B get_images_from_html() 0 32 8
A get_stripped_content() 0 7 1

How to fix

Duplicated Code

Duplicate code is one of the most pungent code smells. A common rule is to restructure code once it is duplicated in three or more places.

Common duplication problems, and corresponding solutions are:
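
As one illustration from this class, the MENTIONS and HASHTAGS blocks in extract_from_content() repeat the same steps: match, de-duplicate, lowercase, and count. A minimal sketch of pulling those steps into one helper follows. The function name record_unique_matches() and its parameters are hypothetical and not part of Jetpack.

```php
<?php
/**
 * Hypothetical helper (not part of Jetpack): records the unique, lowercased
 * captures of one regex group under $key and stores their count in 'has'.
 *
 * @param array  $extracted Previously extracted data.
 * @param string $pattern   Regex to run against $content.
 * @param int    $group     Capture group holding the name.
 * @param string $content   Stripped post content.
 * @param string $key       Result key, e.g. 'mention' or 'hashtag'.
 *
 * @return array The updated $extracted structure.
 */
function record_unique_matches( array $extracted, $pattern, $group, $content, $key ) {
	if ( ! preg_match_all( $pattern, $content, $matches ) ) {
		return $extracted;
	}

	// array_unique() retains the keys, so re-index before lowercasing.
	$names = array_values( array_unique( $matches[ $group ] ) );
	$names = array_map( 'strtolower', $names );

	$extracted[ $key ]        = array( 'name' => $names );
	$extracted['has'][ $key ] = count( $names );

	return $extracted;
}

// Usage with the mention pattern taken from the class.
$extracted = record_unique_matches( array(), '/(^|\s)@(\w+)/u', 2, 'Hi @Alice and @bob, hi @Alice', 'mention' );
```

With a helper like this, each of the two blocks would shrink to a single call that differs only in pattern, capture group, and key.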

Complex Class

Tip: Before tackling complexity, eliminate any duplication first. This can often reduce the size of classes significantly.

Complex classes like Jetpack_Media_Meta_Extractor often do a lot of different things. To break such a class down, we need to identify a cohesive component within that class. A common way to find such a component is to look for fields/methods that share the same prefixes or suffixes. You can also look at the cohesion graph to spot any unconnected or weakly connected components.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.
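
A minimal sketch of Extract Class, assuming the link handling inside extract_from_content() is the component being moved out. The class Example_Link_Extractor and its methods are hypothetical, and PHP's parse_url() stands in for wp_parse_url() so the sketch runs outside WordPress.

```php
<?php
/**
 * Hypothetical class (not part of Jetpack) holding link logic that currently
 * lives inline in Jetpack_Media_Meta_Extractor::extract_from_content().
 */
class Example_Link_Extractor {

	/**
	 * Reverses the dot-separated parts of a hostname.
	 *
	 * @param string $host Hostname, e.g. www.example.com.
	 * @return string Reversed hostname, e.g. com.example.www.
	 */
	public static function reverse_host( $host ) {
		return implode( '.', array_reverse( explode( '.', $host ) ) );
	}

	/**
	 * Builds the per-link structure, or null when the link cannot be parsed.
	 *
	 * @param string $link_raw Raw URL including its scheme.
	 * @return array|null
	 */
	public static function describe( $link_raw ) {
		$url   = parse_url( $link_raw ); // Stand-in for wp_parse_url().
		$parts = explode( '://', $link_raw, 2 );

		if ( empty( $url['host'] ) || count( $parts ) < 2 ) {
			return null;
		}

		return array(
			'url'           => $parts[1],
			'host_reversed' => self::reverse_host( $url['host'] ),
			'host'          => $url['host'],
		);
	}
}

$link = Example_Link_Extractor::describe( 'https://www.example.com/path/page' );
```

The original method could then delegate to such a class for each matched link, which would move the host and path handling out of the 261-line extract_from_content().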

While breaking up the class, it is a good idea to analyze how other classes use Jetpack_Media_Meta_Extractor, and based on these observations, apply Extract Interface, too.
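
A minimal sketch of Extract Interface, assuming callers only need "give me the names found in this content". The interface Example_Content_Extractor, the class Example_Mention_Extractor, and the function count_names() are all hypothetical names chosen for illustration. Only the mention regex is taken from the class.

```php
<?php
// Hypothetical interface: the name and method are illustrative, not part of Jetpack.
interface Example_Content_Extractor {
	/**
	 * @param string $content Stripped post content.
	 * @return string[] Unique, lowercased names found in the content.
	 */
	public function extract_names( $content );
}

// One implementation, reusing the mention pattern from the class.
class Example_Mention_Extractor implements Example_Content_Extractor {
	public function extract_names( $content ) {
		if ( ! preg_match_all( '/(^|\s)@(\w+)/u', $content, $matches ) ) {
			return array();
		}
		return array_map( 'strtolower', array_values( array_unique( $matches[2] ) ) );
	}
}

// Callers depend on the interface only, so extractors can be swapped freely.
function count_names( Example_Content_Extractor $extractor, $content ) {
	return count( $extractor->extract_names( $content ) );
}

$count = count_names( new Example_Mention_Extractor(), 'cc @Jeremy @jetpack' );
```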

<?php // phpcs:ignore WordPress.Files.FileName.InvalidClassFileName
/**
 * Class with methods to extract metadata from a post/page about videos, images, links, mentions embedded
 * in or attached to the post/page.
 *
 * @package Jetpack
 */

/**
 * Class with methods to extract metadata from a post/page about videos, images, links, mentions embedded
 * in or attached to the post/page.
 *
 * @todo Additionally, have some filters on number of items in each field
 */
class Jetpack_Media_Meta_Extractor {

	// Some consts for what to extract.
	const ALL        = 255;
	const LINKS      = 1;
	const MENTIONS   = 2;
	const IMAGES     = 4;
	const SHORTCODES = 8; // Only the keeper shortcodes below.
	const EMBEDS     = 16;
	const HASHTAGS   = 32;

	/**
	 * Shortcodes to keep.
	 *
	 * For these, we try to extract some data from the shortcode, rather than just recording its presence (which we do for all)
	 * There should be a function get_{shortcode}_id( $atts ) or static method SomethingShortcode::get_{shortcode}_id( $atts ) for these.
	 *
	 * @var string[]
	 */
	private static $keeper_shortcodes = array(
		'youtube',
		'vimeo',
		'hulu',
		'ted',
		'wpvideo',
		'videopress',
	);

	/**
	 * Gets the specified media and meta info from the given post.
	 * NOTE: If you have the post's HTML content already and don't need image data, use extract_from_content() instead.
	 *
	 * @param int $blog_id The ID of the blog.
	 * @param int $post_id The ID of the post.
	 * @param int $what_to_extract A mask of things to extract, e.g. Jetpack_Media_Meta_Extractor::IMAGES | Jetpack_Media_Meta_Extractor::MENTIONS.
	 *
	 * @return array|WP_Error a structure containing metadata about the embedded things, or empty array if nothing found, or WP_Error on error.
	 */
	public static function extract( $blog_id, $post_id, $what_to_extract = self::ALL ) {

		// multisite?
		if ( function_exists( 'switch_to_blog' ) ) {
			switch_to_blog( $blog_id );
		}

		$post = get_post( $post_id );
		if ( ! $post instanceof WP_Post ) {
Bug introduced by
The class WP_Post does not exist. Did you forget a USE statement, or did you not list all dependencies?

This error could be the result of:

1. Missing dependencies

PHP Analyzer uses your composer.json file (if available) to determine the dependencies of your project and to determine all the available classes and functions. It expects the composer.json to be in the root folder of your repository.

Are you sure this class is defined by one of your dependencies, or did you maybe not list a dependency in either the require or require-dev section?

2. Missing use statement

PHP does not complain about undefined classes in instanceof checks. For example, the following PHP code will work perfectly fine:

if ($x instanceof DoesNotExist) {
    // Do something.
}

If you have not tested against this specific condition, such errors might go unnoticed.

			return array();
		}
		$content  = $post->post_title . "\n\n" . $post->post_content;
		$char_cnt = strlen( $content );

		// prevent running extraction on really huge amounts of content.
		if ( $char_cnt > 100000 ) { // about 20k English words.
			$content = substr( $content, 0, 100000 );
		}

		$extracted = array();

		// Get images first, we need the full post for that.
		if ( self::IMAGES & $what_to_extract ) {
			$extracted = self::get_image_fields( $post );

			// Turn off images so we can safely call extract_from_content() below.
			$what_to_extract = $what_to_extract - self::IMAGES;
		}

		if ( function_exists( 'switch_to_blog' ) ) {
			restore_current_blog();
		}

		// All of the other things besides images can be extracted from just the content.
		$extracted = self::extract_from_content( $content, $what_to_extract, $extracted );

		return $extracted;
	}

	/**
	 * Gets the specified meta info from the given post content.
	 * NOTE: If you want IMAGES, call extract( $blog_id, $post_id, ...) which will give you more/better image extraction
	 * This method will give you an error if you ask for IMAGES.
	 *
	 * @param string $content The HTML post_content of a post.
	 * @param int    $what_to_extract A mask of things to extract, e.g. Jetpack_Media_Meta_Extractor::IMAGES | Jetpack_Media_Meta_Extractor::MENTIONS.
	 * @param array  $already_extracted Previously extracted things, e.g. images from extract(), which can be used for x-referencing here.
	 *
	 * @return array a structure containing metadata about the embedded things, or empty array if nothing found, or WP_Error on error.
	 */
	public static function extract_from_content( $content, $what_to_extract = self::ALL, $already_extracted = array() ) {
		$stripped_content = self::get_stripped_content( $content );

		// Maybe start with some previously extracted things (e.g. images from extract()).
		$extracted = $already_extracted;

		// Embedded media objects will have already been converted to shortcodes by pre_kses hooks on save.

		if ( self::IMAGES & $what_to_extract ) {
			$images    = self::extract_images_from_content( $stripped_content, array() );
			$extracted = array_merge( $extracted, $images );
		}

		// ----------------------------------- MENTIONS ------------------------------

View Code Duplication
		if ( self::MENTIONS & $what_to_extract ) {
			if ( preg_match_all( '/(^|\s)@(\w+)/u', $stripped_content, $matches ) ) {
				$mentions             = array_values( array_unique( $matches[2] ) ); // array_unique() retains the keys!
				$mentions             = array_map( 'strtolower', $mentions );
				$extracted['mention'] = array( 'name' => $mentions );
				if ( ! isset( $extracted['has'] ) ) {
					$extracted['has'] = array();
				}
				$extracted['has']['mention'] = count( $mentions );
			}
		}

		// ----------------------------------- HASHTAGS ------------------------------
		/**
		 * Some hosts may not compile with --enable-unicode-properties and kick a warning:
		 * Warning: preg_match_all() [function.preg-match-all]: Compilation failed: support for \P, \p, and \X has not been compiled
		 * Therefore, we only run this code block on wpcom, not in Jetpack.
		 */
		if ( ( defined( 'IS_WPCOM' ) && IS_WPCOM ) && ( self::HASHTAGS & $what_to_extract ) ) {
			// This regex does not exactly match Twitter's
			// if there are problems/complaints we should implement this:
			// https://github.com/twitter/twitter-text/blob/master/java/src/com/twitter/Regex.java .
View Code Duplication
			if ( preg_match_all( '/(?:^|\s)#(\w*\p{L}+\w*)/u', $stripped_content, $matches ) ) {
				$hashtags             = array_values( array_unique( $matches[1] ) ); // array_unique() retains the keys!
				$hashtags             = array_map( 'strtolower', $hashtags );
				$extracted['hashtag'] = array( 'name' => $hashtags );
				if ( ! isset( $extracted['has'] ) ) {
					$extracted['has'] = array();
				}
				$extracted['has']['hashtag'] = count( $hashtags );
			}
		}

		// ----------------------------------- SHORTCODES ------------------------------

		// Always look for shortcodes.
		// If we don't want them, we'll just remove them, so we don't grab them as links below.
		$shortcode_pattern = '/' . get_shortcode_regex() . '/s';
		if ( preg_match_all( $shortcode_pattern, $content, $matches ) ) {

			$shortcode_total_count = 0;
			$shortcode_type_counts = array();
			$shortcode_types       = array();
			$shortcode_details     = array();

			if ( self::SHORTCODES & $what_to_extract ) {

				foreach ( $matches[2] as $key => $shortcode ) {
					// Elasticsearch (and probably other things) doesn't deal well with some chars as key names.
					$shortcode_name = preg_replace( '/[.,*"\'\/\\\\#+ ]/', '_', $shortcode );

					$attr = shortcode_parse_atts( $matches[3][ $key ] );

					$shortcode_total_count++;
					if ( ! isset( $shortcode_type_counts[ $shortcode_name ] ) ) {
						$shortcode_type_counts[ $shortcode_name ] = 0;
					}
					$shortcode_type_counts[ $shortcode_name ]++;

					// Store (uniquely) presence of all shortcodes regardless of whether it's a keeper (for those, get ID below)
					// @todo Store number of occurrences?
					if ( ! in_array( $shortcode_name, $shortcode_types, true ) ) {
						$shortcode_types[] = $shortcode_name;
					}

					// For keeper shortcodes, also store the id/url of the object (e.g. youtube video, TED talk, etc.).
					if ( in_array( $shortcode, self::$keeper_shortcodes, true ) ) {
						// Clear shortcode ID data left from the last shortcode.
						$id = null;
						// We'll try to get the salient ID from the function jetpack_shortcode_get_xyz_id().
						// If the shortcode is a class, we'll call XyzShortcode::get_xyz_id().
						$shortcode_get_id_func   = "jetpack_shortcode_get_{$shortcode}_id";
						$shortcode_class_name    = ucfirst( $shortcode ) . 'Shortcode';
						$shortcode_get_id_method = "get_{$shortcode}_id";
						if ( function_exists( $shortcode_get_id_func ) ) {
							$id = call_user_func( $shortcode_get_id_func, $attr );
						} elseif ( method_exists( $shortcode_class_name, $shortcode_get_id_method ) ) {
							$id = call_user_func( array( $shortcode_class_name, $shortcode_get_id_method ), $attr );
						}
						if ( ! empty( $id )
							&& ( ! isset( $shortcode_details[ $shortcode_name ] ) || ! in_array( $id, $shortcode_details[ $shortcode_name ], true ) ) ) {
							$shortcode_details[ $shortcode_name ][] = $id;
						}
					}
				}

				if ( $shortcode_total_count > 0 ) {
					// Add the shortcode info to the $extracted array.
					if ( ! isset( $extracted['has'] ) ) {
						$extracted['has'] = array();
					}
					$extracted['has']['shortcode'] = $shortcode_total_count;
					$extracted['shortcode']        = array();
					foreach ( $shortcode_type_counts as $type => $count ) {
						$extracted['shortcode'][ $type ] = array( 'count' => $count );
					}
					if ( ! empty( $shortcode_types ) ) {
						$extracted['shortcode_types'] = $shortcode_types;
					}
					foreach ( $shortcode_details as $type => $id ) {
						$extracted['shortcode'][ $type ]['id'] = $id;
					}
				}
			}

			// Remove the shortcodes from our copy of $content, so we don't count links in them as links below.
			$content = preg_replace( $shortcode_pattern, ' ', $content );
		}

		// ----------------------------------- LINKS ------------------------------

		if ( self::LINKS & $what_to_extract ) {

			// To hold the extracted stuff we find.
			$links = array();

			// @todo Get the text inside the links?

			// Grab any links, whether in <a href="..." or not, but subtract those from shortcodes and images.
			// (we treat embed links as just another link).
			if ( preg_match_all( '#(?:^|\s|"|\')(https?://([^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))))#', $content, $matches ) ) {

				foreach ( $matches[1] as $link_raw ) {
					$url = wp_parse_url( $link_raw );

					// Data URI links.
					if ( isset( $url['scheme'] ) && 'data' === $url['scheme'] ) {
						continue;
					}

					// Remove large (and likely invalid) links.
					if ( 4096 < strlen( $link_raw ) ) {
						continue;
					}

					// Build a simple form of the URL so we can compare it to ones we found in IMAGES or SHORTCODES and exclude those.
					$simple_url = $url['scheme'] . '://' . $url['host'] . ( ! empty( $url['path'] ) ? $url['path'] : '' );
					if ( isset( $extracted['image']['url'] ) ) {
						if ( in_array( $simple_url, (array) $extracted['image']['url'], true ) ) {
							continue;
						}
					}

					list( $proto, $link_all_but_proto ) = explode( '://', $link_raw ); // phpcs:ignore VariableAnalysis.CodeAnalysis.VariableAnalysis.UnusedVariable

					// Build a reversed hostname.
					$host_parts    = array_reverse( explode( '.', $url['host'] ) );
					$host_reversed = '';
					foreach ( $host_parts as $part ) {
						$host_reversed .= ( ! empty( $host_reversed ) ? '.' : '' ) . $part;
					}

					$link_analyzed = '';
					if ( ! empty( $url['path'] ) ) {
						// The whole path (no query args or fragments).
						$path           = substr( $url['path'], 1 ); // strip the leading '/'.
						$link_analyzed .= ( ! empty( $link_analyzed ) ? ' ' : '' ) . $path;

						// The path split by /.
						$path_split = explode( '/', $path );
						if ( count( $path_split ) > 1 ) {
							$link_analyzed .= ' ' . implode( ' ', $path_split );
						}

						// The fragment.
						if ( ! empty( $url['fragment'] ) ) {
							$link_analyzed .= ( ! empty( $link_analyzed ) ? ' ' : '' ) . $url['fragment'];
						}
					}

					// @todo Check unique before adding
					$links[] = array(
						'url'           => $link_all_but_proto,
						'host_reversed' => $host_reversed,
						'host'          => $url['host'],
					);
				}
			}

			$link_count = count( $links );
			if ( $link_count ) {
				$extracted['link'] = $links;
				if ( ! isset( $extracted['has'] ) ) {
					$extracted['has'] = array();
				}
				$extracted['has']['link'] = $link_count;
			}
		}

		// ----------------------------------- EMBEDS ------------------------------

		// Embeds are just individual links on their own line.
		if ( self::EMBEDS & $what_to_extract ) {

			if ( ! function_exists( '_wp_oembed_get_object' ) ) {
				include ABSPATH . WPINC . '/class-oembed.php';
			}

			// get an oembed object.
			$oembed = _wp_oembed_get_object();

			// Grab any links on their own lines that may be embeds.
			if ( preg_match_all( '|^\s*(https?://[^\s"]+)\s*$|im', $content, $matches ) ) {

				// To hold the extracted stuff we find.
				$embeds = array();

				foreach ( $matches[1] as $link_raw ) {
					$url = wp_parse_url( $link_raw );
Unused Code introduced by
$url is not used, you could remove the assignment.

This check looks for variable assignments that are either overwritten by other assignments or where the variable is not used subsequently.

$myVar = 'Value';
$higher = false;

if (rand(1, 6) > 3) {
    $higher = true;
} else {
    $higher = false;
}

Both the $myVar assignment in line 1 and the $higher assignment in line 2 are dead. The first because $myVar is never used and the second because $higher is always overwritten for every possible time line.


					list( $proto, $link_all_but_proto ) = explode( '://', $link_raw ); // phpcs:ignore VariableAnalysis.CodeAnalysis.VariableAnalysis.UnusedVariable
Unused Code introduced by
The assignment to $proto is unused. Consider omitting it like so list($first,,$third).

This check looks for assignments to variables using the list(...) function, where not all assigned variables are subsequently used.

Consider the following code example.

<?php

function returnThreeValues() {
    return array('a', 'b', 'c');
}

list($a, $b, $c) = returnThreeValues();

print $a . " - " . $c;

Only the variables $a and $c are used. There was no need to assign $b.

Instead, the list call could have been:

list($a,, $c) = returnThreeValues();

					// Check whether this "link" is really an embed.
					foreach ( $oembed->providers as $matchmask => $data ) {
						list( $providerurl, $regex ) = $data; // phpcs:ignore VariableAnalysis.CodeAnalysis.VariableAnalysis.UnusedVariable
Unused Code introduced by
The assignment to $providerurl is unused. Consider omitting it like so list($first,,$third).


						// Turn the asterisk-type provider URLs into regex.
						if ( ! $regex ) {
							$matchmask = '#' . str_replace( '___wildcard___', '(.+)', preg_quote( str_replace( '*', '___wildcard___', $matchmask ), '#' ) ) . '#i';
							$matchmask = preg_replace( '|^#http\\\://|', '#https?\://', $matchmask );
						}

						if ( preg_match( $matchmask, $link_raw ) ) {
							$embeds[] = $link_all_but_proto; // @todo Check unique before adding

							// @todo Try to get ID's for the ones we care about (shortcode_keepers)
							break;
						}
					}
				}

				if ( ! empty( $embeds ) ) {
					if ( ! isset( $extracted['has'] ) ) {
						$extracted['has'] = array();
					}
					$extracted['has']['embed'] = count( $embeds );
					$extracted['embed']        = array( 'url' => array() );
					foreach ( $embeds as $e ) {
						$extracted['embed']['url'][] = $e;
					}
				}
			}
		}

		return $extracted;
	}

	/**
	 * Get image fields for matching images.
	 *
	 * @uses Jetpack_PostImages
	 *
	 * @param WP_Post $post A post object.
	 * @param array   $args Optional args, see defaults list for details.
	 *
	 * @return array Returns an array of all images meeting the specified criteria in $args.
	 */
	private static function get_image_fields( $post, $args = array() ) {

		if ( ! $post instanceof WP_Post ) {
Bug introduced by
The class WP_Post does not exist. Did you forget a USE statement, or did you not list all dependencies?

			return array();
		}

		$defaults = array(
			'width'  => 200, // Required minimum width (if possible to determine).
			'height' => 200, // Required minimum height (if possible to determine).
		);

		$args = wp_parse_args( $args, $defaults );
Documentation introduced by
$defaults is of type array<string,integer,{"w...r","height":"integer"}>, but the function expects a string.

It seems like the type of the argument is not accepted by the function/method which you are calling.

In some cases, particularly if PHP’s automatic type-juggling kicks in, this might be fine. In other cases, however, this might be a bug.

We suggest adding an explicit type cast, as in the following example:

function acceptsInteger($int) { }

$x = '123'; // string "123"

// Instead of
acceptsInteger($x);

// we recommend to use
acceptsInteger((integer) $x);

		$image_list                = array();
		$image_booleans            = array();
		$image_booleans['gallery'] = 0;

		$from_featured_image = Jetpack_PostImages::from_thumbnail( $post->ID, $args['width'], $args['height'] );
View Code Duplication
		if ( ! empty( $from_featured_image ) ) {
			$srcs       = wp_list_pluck( $from_featured_image, 'src' );
			$image_list = array_merge( $image_list, $srcs );
		}

		$from_slideshow = Jetpack_PostImages::from_slideshow( $post->ID, $args['width'], $args['height'] );
View Code Duplication
		if ( ! empty( $from_slideshow ) ) {
			$srcs       = wp_list_pluck( $from_slideshow, 'src' );
			$image_list = array_merge( $image_list, $srcs );
		}

		$from_gallery = Jetpack_PostImages::from_gallery( $post->ID );
		if ( ! empty( $from_gallery ) ) {
			$srcs       = wp_list_pluck( $from_gallery, 'src' );
			$image_list = array_merge( $image_list, $srcs );
			$image_booleans['gallery']++; // @todo This count isn't correct, will only ever count 1
		}

		// @todo Can we check width/height of these efficiently?  Could maybe use query args at least, before we strip them out
		$image_list = self::get_images_from_html( $post->post_content, $image_list );

		return self::build_image_struct( $image_list, $image_booleans );
	}

	/**
	 * Helper function to get images from HTML and return it with the set structure.
	 *
	 * @param string $content HTML content.
	 * @param array  $image_list Array of already found images.
	 *
	 * @return array|array[] Array of images.
	 */
	public static function extract_images_from_content( $content, $image_list ) {
		$image_list = self::get_images_from_html( $content, $image_list );
		return self::build_image_struct( $image_list, array() );
	}

	/**
	 * Produces a set structure for extracted media items.
	 *
	 * @param array $image_list Array of images.
	 * @param array $image_booleans Image booleans.
	 *
	 * @return array|array[]
	 */
	public static function build_image_struct( $image_list, $image_booleans ) {
		if ( ! empty( $image_list ) ) {
			$retval     = array( 'image' => array() );
			$image_list = array_unique( $image_list );
			foreach ( $image_list as $img ) {
				$retval['image'][] = array( 'url' => $img );
			}
			$image_booleans['image'] = count( $retval['image'] );
			if ( ! empty( $image_booleans ) ) {
				$retval['has'] = $image_booleans;
			}
			return $retval;
		} else {
			return array();
		}
	}

	/**
	 * Extracts images from html.
	 *
	 * @param string $html Some markup, possibly containing image tags.
	 * @param array  $images_already_extracted (just an array of image URLs without query strings, no special structure), used for de-duplication.
	 *
	 * @return array Image URLs extracted from the HTML, stripped of query params and de-duped
	 */
	public static function get_images_from_html( $html, $images_already_extracted ) {
		$image_list = $images_already_extracted;
		$from_html  = Jetpack_PostImages::from_html( $html );
		if ( ! empty( $from_html ) ) {
			$srcs = wp_list_pluck( $from_html, 'src' );
			foreach ( $srcs as $image_url ) {
				$length = strpos( $image_url, '?' );
				$src    = wp_parse_url( $image_url );

				if ( $src && isset( $src['scheme'], $src['host'], $src['path'] ) ) {
					// Rebuild the URL without the query string.
					$queryless = $src['scheme'] . '://' . $src['host'] . $src['path'];
				} elseif ( $length ) {
					// If wp_parse_url() didn't work, strip off the query string the old fashioned way.
					$queryless = substr( $image_url, 0, $length );
				} else {
					// Failing that, there was no spoon! Err ... query string!
					$queryless = $image_url;
				}

				// Discard URLs that are longer than 4KB, these are likely data URIs or malformed HTML.
				if ( 4096 < strlen( $queryless ) ) {
					continue;
				}

				if ( ! in_array( $queryless, $image_list, true ) ) {
					$image_list[] = $queryless;
				}
			}
		}
		return $image_list;
	}

	/**
	 * Strips content of all tags, shortcodes, and decodes HTML entities.
	 *
	 * @param string $content Original content.
	 *
	 * @return string Cleaned content.
	 */
	private static function get_stripped_content( $content ) {
		$clean_content = wp_strip_all_tags( $content );
		$clean_content = html_entity_decode( $clean_content );
		// completely strip shortcodes and any content they enclose.
		$clean_content = strip_shortcodes( $clean_content );
		return $clean_content;
	}
}