Completed
Pull Request — master (#82)
by Luke
02:53
created

Taster::lickType()   C

Complexity

Conditions 11
Paths 21

Size

Total Lines 40
Code Lines 29

Duplication

Lines 0
Ratio 0 %

Code Coverage

Tests 24
CRAP Score 11.353

Importance

Changes 1
Bugs 0 Features 0
Metric Value
c 1
b 0
f 0
dl 0
loc 40
ccs 24
cts 28
cp 0.8571
rs 5.2653
cc 11
eloc 29
nc 21
nop 1
crap 11.353
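
The reported CRAP score can be reproduced from the other metrics. The following is a minimal sketch, assuming the commonly cited formula CRAP = cc² × (1 − coverage)³ + cc, and assuming that `ccs` and `cts` are the covered and total statement counts. The function name `crapScore` is hypothetical and not part of any tool's API.

```php
<?php
// Hypothetical helper: computes a CRAP score from cyclomatic complexity
// and statement coverage expressed as a ratio between 0 and 1.
function crapScore(int $complexity, float $coverage): float
{
    return $complexity ** 2 * (1 - $coverage) ** 3 + $complexity;
}

// Values taken from the metric table above: cc = 11, ccs = 24, cts = 28.
$coverage = 24 / 28;               // roughly 0.8571, the "cp" value
$score = crapScore(11, $coverage);

echo round($score, 3), PHP_EOL;    // 11.353
```

Under these assumptions, full coverage would bring the score down to the complexity itself (11), so lowering it further requires reducing the number of conditions in the method.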

How to fix   Complexity   

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.
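
As an illustration of that advice, here is a minimal sketch that is not taken from the CSVelte code base; all function names are hypothetical. A commented block inside a function becomes its own function, and the comment supplies the name.

```php
<?php
// Before: a comment marks a block that is really a separate step.
function describeBefore(string $value): string
{
    // check whether the value looks like a signed integer
    if (preg_match('/^[+-]?\d+$/', $value)) {
        return 'number';
    }

    return 'string';
}

// After: the commented block is extracted, and the comment becomes the name.
function looksLikeSignedInteger(string $value): bool
{
    return preg_match('/^[+-]?\d+$/', $value) === 1;
}

function describeAfter(string $value): string
{
    return looksLikeSignedInteger($value) ? 'number' : 'string';
}
```

Both versions behave the same; the second one simply moves one condition out of the caller, which is how a method with many conditions is usually broken down.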

A commonly applied refactoring for this is Extract Method.

<?php
/**
 * CSVelte: Slender, elegant CSV for PHP
 * Inspired by Python's CSV module and Frictionless Data and the W3C's CSV
 * standardization efforts, CSVelte was written in an effort to take all the
 * suck out of working with CSV.
 *
 * @version   v0.1
 *
 * @copyright Copyright (c) 2016 Luke Visinoni <[email protected]>
 * @author    Luke Visinoni <[email protected]>
 * @license   https://github.com/deni-zen/csvelte/blob/master/LICENSE The MIT License (MIT)
 */
namespace CSVelte;

use Carbon\Carbon;
use CSVelte\Contract\Readable;
use CSVelte\Exception\TasteDelimiterException;
use CSVelte\Exception\TasteQuoteAndDelimException;

/**
 * CSVelte\Taster
 * Given CSV data, Taster will "taste" the data and provide its best guess at
 * its "flavor". In other words, this class inspects CSV data and attempts to
 * auto-detect various CSV attributes such as line endings, quote characters, etc.
 *
 * @copyright (c) 2016, Luke Visinoni <[email protected]>
 * @author    Luke Visinoni <[email protected]>
 *
 * @todo      There are a ton of improvements that could be made to this class.
 *            I'll do a refactor on this fella once I get at least one test
 *            passing for each of its public methods.
 * @todo      Should I have a lickEscapeChar method? The python version doesn't
 *            have one. But then why does it even bother including one in its
 *            flavor class?
 * @todo      Examine each of the public methods in this class and determine
 *            whether it makes sense to ask for the data as a param rather than
 *            just pulling it from source. I don't think it makes sense... it
 *            was just easier to write the methods that way during testing.
 */
class Taster
{
    /**
     * End-of-line constants.
     */
    const EOL_UNIX = 'lf';
    const EOL_TRS80 = 'cr';
    const EOL_WINDOWS = 'crlf';

    /**
     * ASCII character codes for "invisibles".
     */
    const HORIZONTAL_TAB = 9;
    const LINE_FEED = 10;
    const CARRIAGE_RETURN = 13;
    const SPACE = 32;

    /**
     * Data types -- Used within the lickQuotingStyle method.
     */
    const DATA_NONNUMERIC = 'nonnumeric';
    const DATA_SPECIAL = 'special';
    const DATA_UNKNOWN = 'unknown';

    /**
     * Placeholder strings -- hold the place of newlines and delimiters contained
     * within quoted text so that the explode method doesn't split incorrectly.
     */
    const PLACEHOLDER_NEWLINE = '[__NEWLINE__]';
    const PLACEHOLDER_DELIM = '[__DELIM__]';

    /**
     * Recommended data sample size.
     */
    const SAMPLE_SIZE = 2500;

    /**
     * Column data types -- used within the lickHeader method to determine
     * whether the first row contains different types of data than the rest of
     * the rows (and thus, is likely a header row).
     */
    // +-987
    const TYPE_NUMBER = 'number';
    // +-12.387
    const TYPE_DOUBLE = 'double';
    // I am a string. I can contain all kinds of stuff.
    const TYPE_STRING = 'string';
    // 10-Jul-15, 9/1/2007, April 1st, 2006, etc.
    const TYPE_DATE = 'date';
    // 10:00pm, 5pm, 13:08, etc.
    const TYPE_TIME = 'time';
    // 10-Jul-15 10:00pm, etc. (referenced by lickType; the value is assumed)
    const TYPE_DATETIME = 'datetime';
    // $98.96, ¥12389, £6.08, €87.00
    const TYPE_CURRENCY = 'currency';
    // 12ab44m1n2_asdf
    const TYPE_ALNUM = 'alnum';
    // abababab
    const TYPE_ALPHA = 'alpha';

    /**
     * @var CSVelte\Contract\Readable The source of data to examine
     */
    protected $input;

    /**
     * Sample of CSV data to use for tasting (determining CSV flavor).
     *
     * @var string
     */
    protected $sample;

    /**
     * Class constructor--accepts a CSV input source.
     *
     * @param CSVelte\Contract\Readable The source of CSV data
     *
     * @return void
Comprehensibility, Best Practice
Adding a @return annotation to a constructor is not recommended, since a constructor does not have a meaningful return value.

Please refer to the PHP core documentation on constructors.
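
A minimal sketch of a constructor docblock that follows this advice. The class `SampleHolder` and its members are hypothetical and are not part of CSVelte; only the shape of the docblock matters here.

```php
<?php
// Hypothetical class used only to show a constructor docblock
// that documents its parameter and omits @return.
class SampleHolder
{
    /**
     * @var string
     */
    protected $sample;

    /**
     * Class constructor--accepts the sample text to hold.
     *
     * @param string $sample The text to store
     */
    public function __construct($sample)
    {
        $this->sample = $sample;
    }

    /**
     * @return string The stored sample
     */
    public function getSample()
    {
        return $this->sample;
    }
}
```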

     *
     * @todo It may be a good idea to skip the first line or two for the sample
     *     so that the header line(s) don't throw things off (with the exception
     *     of lickHeader() obviously)
     */
    public function __construct(Readable $input)
    {
        $this->input = $input;
Documentation Bug
It seems like $input of type object<CSVelte\Contract\Readable> is incompatible with the declared type object<CSVelte\CSVelte\Contract\Readable> of property $input.

Our type inference engine has found an assignment to a property that is incompatible with the declared type of that property.

Either this assignment is in error or the assigned type should be added to the documentation/type hint for that property.
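
The doubled namespace in the reported type suggests that the unqualified name in the @var tag was resolved relative to the current namespace. A minimal sketch of a docblock that avoids this, assuming the analyzer resolves docblock names the way PHP resolves class names; the `Demo` namespace, `Holder` class and `getInput` method are hypothetical stand-ins:

```php
<?php
namespace Demo\Contract {
    // Hypothetical stand-in for a Readable contract.
    interface Readable
    {
        public function read($length);
    }
}

namespace Demo {
    use Demo\Contract\Readable;

    class Holder
    {
        /**
         * Either the imported short name (as here) or a fully qualified
         * name with a leading backslash, \Demo\Contract\Readable, refers
         * to the same interface. "Demo\Contract\Readable" without the
         * leading backslash would be read as Demo\Demo\Contract\Readable.
         *
         * @var Readable
         */
        protected $input;

        public function __construct(Readable $input)
        {
            $this->input = $input;
        }

        /**
         * @return Readable
         */
        public function getInput()
        {
            return $this->input;
        }
    }
}
```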

        $this->sample = $input->read(self::SAMPLE_SIZE);
    }

    /**
     * I'm not sure what this is for...
     *
     * @param Readable $input The input source
     *
     * @return CSVelte\Taster
     *
     * @todo Get rid of this unless there is a good reason for having it...?
     * @ignore
     */
    public static function create(Readable $input)
    {
        return new self($input);
    }

    /**
     * Examine the input source and determine what "Flavor" of CSV it contains.
     * The CSV format, while having an RFC (https://tools.ietf.org/html/rfc4180),
     * doesn't necessarily always conform to it. And it doesn't provide metadata
     * such as the delimiting character, quote character, or what types of data
     * are quoted.
     *
     * @return CSVelte\Flavor The metadata that the CSV format doesn't provide
     *
     * @todo Implement a lickQuote method for when lickQuoteAndDelim method fails
     * @todo Should there be a lickEscapeChar method? The python module that inspired
     *     this library doesn't include one...
     * @todo This should cache the results and only regenerate if $this->sample
     *     changes (or $this->input)
     */
    public function lick()
    {
        $lineTerminator = $this->lickLineEndings();
        try {
            list($quoteChar, $delimiter) = $this->lickQuoteAndDelim();
        } catch (TasteQuoteAndDelimException $e) {
            $quoteChar = '"';
            $delimiter = $this->lickDelimiter($lineTerminator);
        }
        /*
         * @todo Should this be null? Because doubleQuote = true means this = null
         */
        $escapeChar = '\\';
        $quoteStyle = $this->lickQuotingStyle($this->sample, $quoteChar, $delimiter, $lineTerminator);
        $header = $this->lickHeader($this->sample, $quoteChar, $delimiter, $lineTerminator);

        return new Flavor(compact('quoteChar', 'escapeChar', 'delimiter', 'lineTerminator', 'quoteStyle', 'header'));
    }

    /**
     * Replaces all quoted columns with a blank string. I was using this method
     * to prevent explode() from incorrectly splitting at delimiters and newlines
     * within quotes when parsing a file. But this was before I wrote the
     * replaceQuotedSpecialChars method which (at least to me) makes more sense.
     *
     * @param string The string to replace quoted strings within
     *
     * @return string The input string with quoted strings removed
     *
     * @todo Replace code that uses this method with the replaceQuotedSpecialChars
     *     method instead. I think it's cleaner.
     */
    protected function removeQuotedStrings($data)
    {
        return preg_replace($pattern = '/(["\'])(?:(?=(\\\\?))\2.)*?\1/sm', $replace = '', $data);
    }

    /**
     * Examine the input source to determine which character(s) are being used
     * as the end-of-line character.
     *
     * @return string The end-of-line character(s) for the input data
     * @credit pulled from stackoverflow thread *tips hat to username "Harm"*
     *
     * @todo This should throw an exception if it cannot determine the line ending
     * @todo I probably will make this method protected when I'm done with testing...
     * @todo If there is any way for this method to fail (for instance if a file
     *       is totally empty or contains no line breaks), then it needs to throw
     *       a relevant TasterException
     * @todo Use replaceQuotedSpecialChars rather than removeQuotedStrings()
     */
    protected function lickLineEndings()
    {
        $str = $this->removeQuotedStrings($this->sample);
        $eols = [
            self::EOL_WINDOWS => "\r\n",  // 0x0D - 0x0A - Windows, DOS OS/2
            self::EOL_UNIX    => "\n",    // 0x0A -      - Unix, OSX
            self::EOL_TRS80   => "\r",    // 0x0D -      - Apple ][, TRS80
        ];

        $curCount = 0;
        $curEol = '';
        foreach ($eols as $k => $eol) {
            if (($count = substr_count($str, $eol)) > $curCount) {
                $curCount = $count;
                $curEol = $eol;
            }
        }

        return $curEol;
    }

    /**
     * The best way to determine the quote and delimiter characters is to look
     * at quoted columns: often you can seek out a pattern of delim, quote,
     * stuff, quote, delim. But this only works if you have quoted columns. If
     * you don't, you have to determine these characters some other way (see
     * lickDelimiter).
     *
     * @return array A two-element array containing quotechar, delimchar
     *
     * @todo make protected
     * @todo This should throw an exception if it cannot determine the delimiter
     *     this way.
     * @todo This should check for any line endings not just \n
     */
    protected function lickQuoteAndDelim()
    {
        $patterns = [];
        // delim can be anything but line breaks, quotes, alphanumeric, underscore, or spaces
        $antidelims = implode(["\r", "\n", "\w", preg_quote('"', '/'), preg_quote("'", '/')/*, preg_quote('\\', '/')*/, preg_quote(chr(self::SPACE), '/')]);
        $delim = '(?P<delim>[^'.$antidelims.'])';
        $quote = '(?P<quoteChar>"|\'|`)'; // @todo I think MS Excel uses some strange encoding for fancy open/close quotes
        $patterns[] = '/'.$delim.' ?'.$quote.'.*?\2\1/ms'; // ,"something", - delim, a possible space, a quote, anything, the same quote, then the same delim
        $patterns[] = '/(?:^|\n)'.$quote.'.*?\1'.$delim.' ?/ms'; // 'something', - beginning of line or line break, a quote, anything, the same quote, then a delim
        $patterns[] = '/'.$delim.' ?'.$quote.'.*?\2(?:$|\n)/ms'; // ,'something' - delim, a possible space, a quote, anything, the same quote, then end of line
        $patterns[] = '/(?:^|\n)'.$quote.'.*?\2(?:$|\n)/ms'; // 'something' - beginning of line, a quote, anything, the same quote, then end of line
        foreach ($patterns as $pattern) {
            // @todo I had to add the error suppression char here because it was
            //     causing undefined offset errors with certain data sets. strange...
            if (@preg_match_all($pattern, $this->sample, $matches) && $matches) {
                break;
            }
        }
        if ($matches) {
            $quotes = array_count_values($matches['quoteChar']);
Bug
The variable $matches does not seem to be defined for all execution paths leading up to this point.

If you define a variable conditionally, it can happen that it is not defined for all execution paths.

Let’s take a look at an example:

function myFunction($a) {
    switch ($a) {
        case 'foo':
            $x = 1;
            break;

        case 'bar':
            $x = 2;
            break;
    }

    // $x is potentially undefined here.
    echo $x;
}

In the above example, the variable $x is defined if you pass “foo” or “bar” as argument for $a. However, since the switch statement has no default case statement, if you pass any other value, the variable $x would be undefined.

Available Fixes

  1. Check for existence of the variable explicitly:

    function myFunction($a) {
        switch ($a) {
            case 'foo':
                $x = 1;
                break;
    
            case 'bar':
                $x = 2;
                break;
        }
    
        if (isset($x)) { // Make sure it's always set.
            echo $x;
        }
    }
    
  2. Define a default value for the variable:

    function myFunction($a) {
        $x = ''; // Set a default which gets overridden for certain paths.
        switch ($a) {
            case 'foo':
                $x = 1;
                break;
    
            case 'bar':
                $x = 2;
                break;
        }
    
        echo $x;
    }
    
  3. Add a value for the missing path:

    function myFunction($a) {
        switch ($a) {
            case 'foo':
                $x = 1;
                break;
    
            case 'bar':
                $x = 2;
                break;
    
            // We add support for the missing case.
            default:
                $x = '';
                break;
        }
    
        echo $x;
    }
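
Applied to the loop flagged here, fix 2 amounts to giving `$matches` a default value before the loop runs. A minimal sketch outside the class; the function name `firstMatches` and the example patterns are hypothetical:

```php
<?php
// Hypothetical helper: returns the matches of the first pattern that
// matches $subject, or an empty array if none match (including the
// case where $patterns itself is empty and the loop never runs).
function firstMatches(array $patterns, string $subject): array
{
    $matches = []; // default, so the variable is defined on every path
    foreach ($patterns as $pattern) {
        // preg_match_all() returns the number of full matches, or false on error.
        if (preg_match_all($pattern, $subject, $found) > 0) {
            $matches = $found;
            break;
        }
    }

    return $matches;
}

$result = firstMatches(['/x/', '/(?P<delim>[,;])/'], 'a,b;c');
```

With the default in place, a caller can test the result with `if ($matches)` without relying on the loop body having executed at least once.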
    
            arsort($quotes);
            $quotes = array_flip($quotes);
            if ($theQuote = array_shift($quotes)) {
                $delims = array_count_values($matches['delim']);
                arsort($delims);
                $delims = array_flip($delims);
                $theDelim = array_shift($delims);

                return [$theQuote, $theDelim];
            }
        }
        throw new TasteQuoteAndDelimException('quoteChar and delimiter cannot be determined');
    }

    /**
     * Take a list of likely delimiter characters and find the one that occurs
     * the most consistent number of times within the provided data.
     *
     * @param string The character(s) used for newlines
     *
     * @return string The delimiter character that occurs most consistently
     *
     * @throws TasteDelimiterException If no delimiter can be determined
     *
     * @todo Refactor this method--It needs more thorough testing against a wider
     *     variety of CSV data to be sure it works reliably. And I'm sure there
     *     are many performance and logic improvements that could be made. This
     *     is essentially a first draft.
     * @todo Use replaceQuotedSpecialChars rather than removeQuotedStrings
     */
    protected function lickDelimiter($eol = "\n")
    {
        $delimiters = [',', "\t", '|', ':', ';', '/', '\\'];
        $lines = explode($eol, $this->removeQuotedStrings($this->sample));
        $modes = [];
Unused Code
$modes is not used, you could remove the assignment.

This check looks for variable assignments that are either overwritten by other assignments or where the variable is not used subsequently.

$myVar = 'Value';
$higher = false;

if (rand(1, 6) > 3) {
    $higher = true;
} else {
    $higher = false;
}

Both the $myVar assignment in line 1 and the $higher assignment in line 2 are dead. The first because $myVar is never used and the second because $higher is always overwritten for every possible time line.

        $start = 0;
        $charFrequency = [];
        while ($start < count($lines)) {
            foreach ($lines as $key => $line) {
                if (!trim($line)) {
                    continue;
                }
                foreach ($delimiters as $char) {
                    $freq = substr_count($line, $char);
                    $charFrequency[$char][$key] = $freq;
                }
            }
            $start++;
        }
        $averages = Utils::array_average($charFrequency);
        $modes = Utils::array_mode($charFrequency);
        $consistencies = [];
        foreach ($averages as $achar => $avg) {
            foreach ($modes as $mchar => $mode) {
                if ($achar == $mchar) {
                    if ($mode) {
                        $consistencies[$achar] = $avg / $mode;
                    } else {
                        $consistencies[$achar] = 0;
                    }
                    break;
                }
            }
        }
        if (empty($consistencies)) {
            throw new TasteDelimiterException('Cannot determine delimiter character');
        }
        arsort($consistencies);

        return key($consistencies);
    }

    /**
     * Determine the "style" of data quoting. The CSV format, while having an RFC
     * (https://tools.ietf.org/html/rfc4180), doesn't necessarily always conform
     * to it. And it doesn't provide metadata such as the delimiting character,
     * quote character, or what types of data are quoted. So this method makes a
     * logical guess by finding which columns have been quoted (if any) and
     * examining their data type. Most often, CSV files will only use quotes
     * around columns that contain special characters such as the delimiter,
     * the quoting character, newlines, etc. (we refer to this style as
     * QUOTE_MINIMAL), but some quote all columns that contain nonnumeric data
     * (QUOTE_NONNUMERIC). Then there are CSV files that quote all columns
     * (QUOTE_ALL) and those that quote none (QUOTE_NONE).
     *
     * @param string The data to examine for "quoting style"
     * @param char The type of quote character being used (single or double)
     * @param char The character used as the column delimiter
     * @param char The character used for newlines
     *
     * @return string One of the four Flavor::QUOTE_* constants--see this
     *                method's description for more info.
     *
     * @todo Refactor this method--It needs more thorough testing against a wider
     *     variety of CSV data to be sure it works reliably. And I'm sure there
     *     are many performance and logic improvements that could be made. This
     *     is essentially a first draft.
     */
    protected function lickQuotingStyle($data, $quote, $delim, $eol)
Unused Code
The parameter $quote is not used and could be removed.

This check looks for parameters that have been defined for a function or method, but which are not used in the method body.

    {
        $data = $this->replaceQuotedSpecialChars($data, $delim);

        $quoting_styles = [
            Flavor::QUOTE_ALL        => 0,
            Flavor::QUOTE_NONE       => 0,
            Flavor::QUOTE_MINIMAL    => 0,
            Flavor::QUOTE_NONNUMERIC => 0,
        ];

        $lines = explode($eol, $data);
        $freq = [
            'quoted'   => [],
            'unquoted' => [],
        ];

        foreach ($lines as $key => $line) {
            // now we can sub back in the correct newlines
            $line = str_replace(self::PLACEHOLDER_NEWLINE, $eol, $line);
            $cols = explode($delim, $line);
            foreach ($cols as $colkey => $col) {
                // now we can sub back in the correct delim characters
                $col = str_replace(self::PLACEHOLDER_DELIM, $delim, $col);
                if ($isQuoted = $this->isQuoted($col)) {
Unused Code
$isQuoted is not used, you could remove the assignment.
                    $col = $this->unQuote($col);
                    $type = $this->lickDataType($col);
                    // we can remove this guy altogether since at least one column is quoted
                    unset($quoting_styles[Flavor::QUOTE_NONE]);
                    $freq['quoted'][] = $type;
                } else {
                    $type = $this->lickDataType($col);
                    // we can remove this guy altogether since at least one column is unquoted
                    unset($quoting_styles[Flavor::QUOTE_ALL]);
                    $freq['unquoted'][] = $type;
                }
            }
        }
        $types = array_unique($freq['quoted']);
        // if quoting_styles still has QUOTE_ALL or QUOTE_NONE, then that's the one to return
        if (array_key_exists(Flavor::QUOTE_ALL, $quoting_styles)) {
            return Flavor::QUOTE_ALL;
        }
        if (array_key_exists(Flavor::QUOTE_NONE, $quoting_styles)) {
            return Flavor::QUOTE_NONE;
        }
        if (count($types) == 1) {
            if (current($types) == self::DATA_SPECIAL) {
                return Flavor::QUOTE_MINIMAL;
            } elseif (current($types) == self::DATA_NONNUMERIC) {
                return Flavor::QUOTE_NONNUMERIC;
            }
        } else {
            if (array_key_exists(self::DATA_NONNUMERIC, array_flip($types))) {
                // allow for a SMALL amount of error here
                $counts = [self::DATA_SPECIAL => 0, self::DATA_NONNUMERIC => 0];
                array_walk($freq['quoted'], function ($val, $key) use (&$counts) {
Unused Code
The parameter $key is not used and could be removed.
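
One way to address this, sketched under the assumption that the closure only needs to tally values: a callback passed to array_walk() may declare just the value parameter. The variables below are hypothetical stand-ins for `$freq['quoted']` and `$counts`.

```php
<?php
// Hypothetical sample of detected types for quoted columns.
$quotedTypes = ['special', 'nonnumeric', 'special'];
$counts = ['special' => 0, 'nonnumeric' => 0];

// The callback declares only the value. array_walk() still passes the key
// as a second argument, and PHP ignores extra arguments given to a
// user-defined closure.
array_walk($quotedTypes, function ($val) use (&$counts) {
    $counts[$val]++;
});
```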

                    $counts[$val]++;
                });
                arsort($counts);
                $most = current($counts);
                $least = end($counts);
                $err_margin = $least / $most;
                if ($err_margin < 1) {
                    return Flavor::QUOTE_NONNUMERIC;
                }
            }
        }

        return Flavor::QUOTE_MINIMAL;
    }

    /**
     * Remove quotes around a piece of text (if there are any).
     *
     * @param string The data to "unquote"
     *
     * @return string The data passed in, only with quotes stripped (off the edges)
     */
    protected function unQuote($data)
    {
        return preg_replace('/^(["\'])(.*)\1$/', '\2', $data);
    }

    /**
     * Determine whether a particular string of data has quotes around it.
     *
     * @param string The data to check
     *
     * @return bool Whether the data is quoted or not
     */
    protected function isQuoted($data)
    {
        return preg_match('/^([\'"])[^\1]*\1$/', $data);
    }

    /**
     * Determine what type of data is contained within a variable.
     * Possible types:
     *     - special - contains characters that could potentially need to be quoted (possible delimiter characters)
     *     - nonnumeric - contains at least one character that is not a digit
     *     - unknown - everything else (digits only, or an empty string)
     * This method is really only used within the "lickQuotingStyle" method to
     * help determine whether a particular column has been quoted due to it being
     * nonnumeric or because it has some special character in it such as a delimiter
     * or newline or quote.
     *
     * @param string The data to determine the type of
     *
     * @return string The type of data (one of the "DATA_" constants above)
     *
     * @todo I could probably eliminate this method and use an anonymous function
     *     instead. It isn't used anywhere else and its name could be misleading.
     *     Especially since I also have a lickType method that is used within the
     *     lickHeader method.
     */
    protected function lickDataType($data)
    {
        // @todo make this check for only the quote and delim that are actually being used
        // that will make the guess more accurate
        if (preg_match('/[\'",\t\|:;-]/', $data)) {
            return self::DATA_SPECIAL;
        } elseif (preg_match('/[^0-9]/', $data)) {
            return self::DATA_NONNUMERIC;
        }

        return self::DATA_UNKNOWN;
    }

    /**
     * Replace all instances of newlines and whatever character you specify (as
     * the delimiter) that are contained within quoted text. The replacements are
     * simply a special placeholder string. This is done so that I can use the
     * very unsmart "explode" function and not have to worry about it exploding
     * on delimiters or newlines within quotes. Once I have exploded, I typically
     * sub back in the real characters before doing anything else. Although
     * currently there is no dedicated method for doing so I just use str_replace.
     *
     * @param string The string to do the replacements on
     * @param char The delimiter character to replace
     *
     * @return string The data with replacements performed
     *
     * @todo I could probably pass in (maybe optionally) the newline character I
     *     want to replace as well. I'll do that if I need to.
     */
    protected function replaceQuotedSpecialChars($data, $delim)
    {
        return preg_replace_callback('/([\'"])(.*)\1/imsU', function ($matches) use ($delim) {
            $ret = preg_replace("/([\r\n])/", self::PLACEHOLDER_NEWLINE, $matches[0]);
            $ret = str_replace($delim, self::PLACEHOLDER_DELIM, $ret);

            return $ret;
        }, $data);
    }

    /**
     * Determine the "type" of a particular string of data. Used for the lickHeader
     * method to assign a type to each column to try to determine whether the
     * first row is different than a consistent column type.
     *
     * @todo As I'm writing this method I'm beginning to realize how expensive
     * the lickHeader method is going to end up being since it has to apply all
     * these regexes (potentially) to every column. I may end up writing a much
     * simpler type-checking method than this if it proves to be too expensive
     * to be practical.
     *
     * @param string The string of data to check the type of
     *
     * @return string One of the TYPE_ string constants above
     *
     * @uses Carbon/Carbon date/time library/class
     */
    protected function lickType($data)
    {
        // whole numbers only, so that decimals can reach the TYPE_DOUBLE check
        if (preg_match('/^[+-]?\d+$/', $data)) {
            return self::TYPE_NUMBER;
        } elseif (preg_match('/^[+-]?[\d]+\.[\d]+$/', $data)) {
            return self::TYPE_DOUBLE;
        } elseif (preg_match('/^[+-]?[¥£€$]\d+(\.\d+)?$/', $data)) {
            return self::TYPE_CURRENCY;
        } elseif (preg_match('/^[a-zA-Z]+$/', $data)) {
            return self::TYPE_ALPHA;
        } else {
            try {
                $year = '([01][0-9])?[0-9]{2}';
                $month = '([01]?[0-9]|Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)';
                $day = '[0-3]?[0-9]';
                $sep = '[\/\.\-]?';
                $time = '([0-2]?[0-9](:[0-5][0-9]){1,2}(am|pm)?|[01]?[0-9](am|pm))';
                $date = '('.$month.$sep.$day.$sep.$year.'|'.$day.$sep.$month.$sep.$year.'|'.$year.$sep.$month.$sep.$day.')';
                $dt = Carbon::parse($data);
                if ($dt->isToday()) {
                    // then this is most likely a time string...
                    if (preg_match("/^{$time}$/i", $data)) {
                        return self::TYPE_TIME;
                    }
                }
                if (preg_match("/^{$date}$/i", $data)) {
                    return self::TYPE_DATE;
                } elseif (preg_match("/^{$date} {$time}$/i", $data)) {
                    return self::TYPE_DATETIME;
                }
            } catch (\Exception $e) {
                // now go on checking remaining types
                if (preg_match('/^\w+$/', $data)) {
                    return self::TYPE_ALNUM;
                }
            }
        }

        return self::TYPE_STRING;
    }

    /**
     * Examines the contents of the CSV data to determine whether or not it
     * contains a header row. To make this determination, it builds an array of
     * the data type and length of every column in every row and then compares
     * each row against the first one. If the remaining rows consistently differ
     * from the first row, it returns true. This is only a guess though. There is
     * no programmatic way to determine with 100% certainty whether a CSV file
     * has a header, because the format does not provide such metadata.
     *
     * @param string $data  The CSV data to examine (only 20 rows will be examined,
     *     so there is no need to provide any more data than that)
     * @param string $quote The CSV data's quoting char (either double or single quote)
     * @param string $delim The CSV data's delimiting char (can be a variety of chars
     *     but typically is either a comma or a tab, sometimes a pipe)
     * @param string $eol   The CSV data's end-of-line char(s) (\n, \r or \r\n)
     *
     * @return bool True if the data (most likely) contains a header row
     *
     * @todo This method needs a total refactor. It's not necessary to loop twice.
     *     You could get away with one loop, and that would allow me to do
     *     something like only examining enough rows to get to a particular
     *     "hasHeader" score (+-100 for instance) and then just return true|false.
     * @todo Also, break out of the first loop after a certain (perhaps even a
     *     configurable) number of lines. You only need to examine so much data
     *     to reliably make a determination, and this is an expensive method.
     * @todo Because the header isn't actually part of the "flavor",
     *     I could remove the need for quote, delim, and eol by "licking" the
     *     data sample provided in the first argument. Also, I could actually
     *     create a Reader object to read the data here.
     */
    public function lickHeader($data, $quote, $delim, $eol)
Unused Code
The parameter $quote is not used and could be removed.

This check looks for parameters that have been defined for a function or method, but which are not used in the method body.

    {
        $data = $this->replaceQuotedSpecialChars($data, $delim);
        $lines = explode($eol, $data);
        $types = [];
        foreach ($lines as $line_no => $line) {
            // now we can sub back in the correct newlines
            $line = str_replace(self::PLACEHOLDER_NEWLINE, $eol, $line);
            $cols = explode($delim, $line);
            $col_count = count($cols);
Unused Code
$col_count is not used; you could remove the assignment.

This check looks for variable assignments that are either overwritten by other assignments or where the variable is not used subsequently.

$myVar = 'Value';
$higher = false;

if (rand(1, 6) > 3) {
    $higher = true;
} else {
    $higher = false;
}

Both the $myVar assignment in line 1 and the $higher assignment in line 2 are dead. The first because $myVar is never used and the second because $higher is always overwritten on every possible execution path.

            foreach ($cols as $col_no => $col) {
                // now we can sub back in the correct delim characters
                $col = str_replace(self::PLACEHOLDER_DELIM, $delim, $col);
                $types[$line_no][$col_no] = [
                    'type'   => $this->lickType($this->unQuote($col)),
                    'length' => strlen($col),
                ];
            }
        }
        $hasHeader = 0;
        $potential_header = array_shift($types);
        foreach ($types as $line_no => $cols) {
            foreach ($cols as $col_no => $col_info) {
                extract($col_info);
                if (!array_key_exists($col_no, $potential_header)) {
                    continue;
                }
                extract($potential_header[$col_no], EXTR_PREFIX_ALL, 'header');
                if ($header_type == self::TYPE_STRING) {
                    // use length
                    if ($length != $header_length) {
                        $hasHeader++;
                    } else {
                        $hasHeader--;
                    }
                } else {
                    if ($type != $header_type) {
                        $hasHeader++;
                    } else {
                        $hasHeader--;
                    }
                }
            }
        }

        return $hasHeader > 0;
    }
}
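
As a rough way to see what the date and time patterns in lickType() accept on their own, the sketch below reuses the same pattern fragments without the Carbon pre-check. The function name sketchDateKind() and its string return labels are hypothetical and not part of CSVelte; unlike lickType(), it always tries the time pattern first.

```php
<?php
// Hypothetical helper, not part of CSVelte: classifies a string using the
// same regex fragments that lickType() assembles, with no Carbon involved.
function sketchDateKind($data)
{
    $year  = '([01][0-9])?[0-9]{2}';
    $month = '([01]?[0-9]|Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)';
    $day   = '[0-3]?[0-9]';
    $sep   = '[\/\.\-]?';
    $time  = '([0-2]?[0-9](:[0-5][0-9]){1,2}(am|pm)?|[01]?[0-9](am|pm))';
    $date  = '(' . $month . $sep . $day . $sep . $year
           . '|' . $day . $sep . $month . $sep . $year
           . '|' . $year . $sep . $month . $sep . $day . ')';

    // every preg_match() call needs the subject string as its second argument
    if (preg_match("/^{$time}$/i", $data)) {
        return 'time';
    }
    if (preg_match("/^{$date}$/i", $data)) {
        return 'date';
    }
    if (preg_match("/^{$date} {$time}$/i", $data)) {
        return 'datetime';
    }

    return 'other';
}
```

With these fragments, '12/25/1999' and '1999-09-12' classify as dates, while '2016-09-12' appears to fall through to 'other', since the optional century group [01][0-9] cannot match "20". That may be worth a dedicated test case.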
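
The first @todo on lickHeader() suggests a single loop that stops once the score is decisive. A minimal sketch of that idea follows, assuming the rows are already split into cells. The name sketchHasHeader(), the $typeOf callback, the 'string' type label and the default threshold of 100 are all illustrative assumptions rather than CSVelte's API.

```php
<?php
// Hypothetical single-pass variant of the header check, not part of CSVelte.
// $rows:      list of rows, each a list of cell strings
// $typeOf:    callback returning a type label for one cell (assumed helper)
// $threshold: stop early once the absolute score reaches this value
function sketchHasHeader(array $rows, callable $typeOf, $threshold = 100)
{
    $header = null;
    $score = 0;
    foreach ($rows as $cols) {
        // describe every cell of the current row by type and length
        $info = [];
        foreach ($cols as $colNo => $col) {
            $info[$colNo] = ['type' => $typeOf($col), 'length' => strlen($col)];
        }
        if ($header === null) {
            $header = $info; // the first row is the candidate header
            continue;
        }
        foreach ($info as $colNo => $cell) {
            if (!array_key_exists($colNo, $header)) {
                continue;
            }
            $head = $header[$colNo];
            // free-form strings are compared by length, everything else by type
            $differs = ($head['type'] === 'string')
                ? $cell['length'] !== $head['length']
                : $cell['type'] !== $head['type'];
            $score += $differs ? 1 : -1;
        }
        if (abs($score) >= $threshold) {
            break; // decisive enough, no need to read further rows
        }
    }

    return $score > 0;
}
```

Passing the type detector in as a callback keeps the sketch independent of lickType(), and the early break addresses the cost concern raised in the second @todo without fixing the number of rows in advance.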