Completed
Branch releases/v0.2 (6437e0)
by Luke
02:23
created

Taster::removeQuotedStrings()   A

Complexity

Conditions 1
Paths 1

Size

Total Lines 4
Code Lines 2

Duplication

Lines 0
Ratio 0 %

Code Coverage

Tests 2
CRAP Score 1

Importance

Changes 1
Bugs 0 Features 0
Metric Value
c 1
b 0
f 0
dl 0
loc 4
ccs 2
cts 2
cp 1
rs 10
cc 1
eloc 2
nc 1
nop 1
crap 1
<?php
/**
 * CSVelte: Slender, elegant CSV for PHP
 * Inspired by Python's CSV module and Frictionless Data and the W3C's CSV
 * standardization efforts, CSVelte was written in an effort to take all the
 * suck out of working with CSV.
 *
 * @version   v0.2
 * @copyright Copyright (c) 2016 Luke Visinoni <[email protected]>
 * @author    Luke Visinoni <[email protected]>
 * @license   https://github.com/deni-zen/csvelte/blob/master/LICENSE The MIT License (MIT)
 */
namespace CSVelte;

use Carbon\Carbon;
use CSVelte\Contract\Readable;
use CSVelte\Exception\TasterException;

/**
 * CSVelte\Taster
 * Given CSV data, Taster will "taste" the data and provide its best guess at
 * its "flavor". In other words, this class inspects CSV data and attempts to
 * auto-detect various CSV attributes such as line endings, quote characters, etc.
 *
 * @package   CSVelte
 * @copyright (c) 2016, Luke Visinoni <[email protected]>
 * @author    Luke Visinoni <[email protected]>
 * @todo      There are a ton of improvements that could be made to this class.
 *            I'll do a refactor on this fella once I get at least one test
 *            passing for each of its public methods.
 * @todo      Should I have a lickEscapeChar method? The Python version doesn't
 *            have one. But then why does it even bother including one in its
 *            flavor class?
 * @todo      Examine each of the public methods in this class and determine
 *            whether it makes sense to ask for the data as a param rather than
 *            just pulling it from source. I don't think it makes sense... it
 *            was just easier to write the methods that way during testing.
 * @todo      There are at least portions of this class that could use the
 *            Reader class rather than working directly with data.
 */
class Taster
{
    /**
     * End-of-line constants
     */
    const EOL_UNIX    = 'lf';
    const EOL_TRS80   = 'cr';
    const EOL_WINDOWS = 'crlf';

    /**
     * ASCII character codes for "invisibles"
     */
    const HORIZONTAL_TAB = 9;
    const LINE_FEED = 10;
    const CARRIAGE_RETURN = 13;
    const SPACE = 32;

    /**
     * Data types -- Used within the lickQuotingStyle method
     */
    const DATA_NONNUMERIC = 'nonnumeric';
    const DATA_SPECIAL = 'special';
    const DATA_UNKNOWN = 'unknown';

    /**
     * Placeholder strings -- hold the place of newlines and delimiters contained
     * within quoted text so that the explode method doesn't split incorrectly
     */
    const PLACEHOLDER_NEWLINE = '[__NEWLINE__]';
    const PLACEHOLDER_DELIM = '[__DELIM__]';

    /**
     * Recommended data sample size
     */
    const SAMPLE_SIZE = 2500;

    /**
     * Column data types -- used within the lickHeader method to determine
     * whether the first row contains different types of data than the rest of
     * the rows (and thus, is likely a header row)
     */
    // +-987
    const TYPE_NUMBER = 'number';
    // +-12.387
    const TYPE_DOUBLE = 'double';
    // I am a string. I can contain all kinds of stuff.
    const TYPE_STRING = 'string';
    // 10-Jul-15, 9/1/2007, April 1st, 2006, etc.
    const TYPE_DATE = 'date';
    // 10:00pm, 5pm, 13:08, etc.
    const TYPE_TIME = 'time';
    // a date followed by a time (referenced by lickType)
    const TYPE_DATETIME = 'datetime';
    // $98.96, ¥12389, £6.08, €87.00
    const TYPE_CURRENCY = 'currency';
    // 12ab44m1n2_asdf
    const TYPE_ALNUM = 'alnum';
    // abababab
    const TYPE_ALPHA = 'alpha';

    /**
     * @var CSVelte\Contract\Readable The source of data to examine
     * @access protected
     */
    protected $input;

    /**
     * Sample of CSV data to use for tasting (determining CSV flavor)
     * @var string
     */
    protected $sample;

    /**
     * Class constructor--accepts a CSV input source
     *
     * @param \CSVelte\Contract\Readable $input The source of CSV data
     * @todo It may be a good idea to skip the first line or two for the sample
     *     so that the header line(s) don't throw things off (with the exception
     *     of lickHeader() obviously)
     */
    public function __construct(Readable $input)
    {
        $this->input = $input;
Documentation Bug introduced by
It seems like $input of type object<CSVelte\Contract\Readable> is incompatible with the declared type object<CSVelte\CSVelte\Contract\Readable> of property $input.

Our type inference engine has found an assignment to a property that is incompatible with the declared type of that property.

Either this assignment is in error or the assigned type should be added to the documentation/type hint for that property.
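The mismatch reported here comes down to name resolution: inside `namespace CSVelte`, a class name written without a leading backslash is resolved relative to that namespace. The sketch below illustrates the rule with `::class`, which resolves names at compile time without loading the class. It assumes the analyzer resolves docblock types the same way PHP resolves names in code, and that the property was meant to name `\CSVelte\Contract\Readable`.

```php
<?php
namespace CSVelte;

// Relative name: resolved against the current namespace (CSVelte).
$relative = CSVelte\Contract\Readable::class;

// Fully qualified name: the leading backslash anchors it at the global root.
$qualified = \CSVelte\Contract\Readable::class;

echo $relative, PHP_EOL;  // CSVelte\CSVelte\Contract\Readable
echo $qualified, PHP_EOL; // CSVelte\Contract\Readable
```

Under that assumption, writing the annotation as `@var \CSVelte\Contract\Readable` should make the declared type agree with the constructor's type hint.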

        $this->sample = $input->read(self::SAMPLE_SIZE);
Documentation Bug introduced by
It seems like $input->read(self::SAMPLE_SIZE) can also be of type boolean. However, the property $sample is declared as type string. Maybe add an additional type check?

Our type inference engine has found a suspicious assignment of a value to a property. This check raises an issue when a value that can be of a mixed type is assigned to a property that is type hinted more strictly.

For example, imagine you have a variable $accountId that can either hold an Id object or false (if there is no account id yet). Your code now assigns that value to the id property of an instance of the Account class. This class holds a proper account, so the id value must no longer be false.

Either this assignment is in error or a type check should be added for that assignment.

class Id
{
    public $id;

    public function __construct($id)
    {
        $this->id = $id;
    }

}

class Account
{
    /** @var  Id $id */
    public $id;
}

$account_id = false;

if (starsAreRight()) {
    $account_id = new Id(42);
}

$account = new Account();
if ($account_id instanceof Id)
{
    $account->id = $account_id;
}
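Applied to the constructor above, the suggested type check could look roughly like the following. This is a minimal sketch assuming PHP 7 or later: `SampleSource`, `StringSource`, and `take_sample()` are hypothetical stand-ins written for this example, not part of CSVelte, and the choice of `RuntimeException` is likewise an assumption.

```php
<?php
// Hypothetical stand-in for a readable source; not CSVelte's actual contract.
interface SampleSource
{
    /** @return string|false The next chunk, or false when nothing can be read */
    public function read($length);
}

// Hypothetical in-memory source used only to exercise the check below.
final class StringSource implements SampleSource
{
    private $data;
    private $pos = 0;

    public function __construct($data)
    {
        $this->data = $data;
    }

    public function read($length)
    {
        if ($this->pos >= strlen($this->data)) {
            return false;
        }
        $chunk = substr($this->data, $this->pos, $length);
        $this->pos += strlen($chunk);
        return $chunk;
    }
}

// Read a sample and refuse to continue if the source returned a non-string.
function take_sample(SampleSource $input, $size = 2500)
{
    $sample = $input->read($size);
    if (!is_string($sample)) {
        throw new RuntimeException('Could not read a sample from the input source');
    }
    return $sample;
}
```

With a guard like this, the property that stores the sample only ever receives a string, which is what its `@var string` annotation promises.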
    }

    /**
     * Examine the input source and determine what "Flavor" of CSV it contains.
     * While CSV has an RFC (https://tools.ietf.org/html/rfc4180), files in the
     * wild don't necessarily conform to it. And the format doesn't provide
     * metadata such as the delimiting character, quote character, or what
     * types of data are quoted.
     *
     * @return \CSVelte\Flavor The metadata that the CSV format doesn't provide
     * @access public
     * @todo Implement a lickQuote method for when lickQuoteAndDelim method fails
     * @todo Should there be a lickEscapeChar method? The Python module that inspired
     *     this library doesn't include one...
     * @todo This should cache the results and only regenerate if $this->sample
     *     changes (or $this->input)
     */
    public function lick()
    {
        $lineTerminator = $this->lickLineEndings();
        try {
            list($quoteChar, $delimiter) = $this->lickQuoteAndDelim();
        } catch (TasterException $e) {
            if ($e->getCode() !== TasterException::ERR_QUOTE_AND_DELIM) throw $e;
            $quoteChar = '"';
            $delimiter = $this->lickDelimiter($lineTerminator);
        }
        /**
         * @todo Should this be null? Because doubleQuote = true means this = null
         */
        $escapeChar = '\\';
        $quoteStyle = $this->lickQuotingStyle($quoteChar, $delimiter, $lineTerminator);
        $header = $this->lickHeader($quoteChar, $delimiter, $lineTerminator);
        return new Flavor(compact('quoteChar', 'escapeChar', 'delimiter', 'lineTerminator', 'quoteStyle', 'header'));
    }

    /**
     * Replaces all quoted columns with a blank string. I was using this method
     * to prevent explode() from incorrectly splitting at delimiters and newlines
     * within quotes when parsing a file. But this was before I wrote the
     * replaceQuotedSpecialChars method which (at least to me) makes more sense.
     *
     * @param string $data The string to replace quoted strings within
     * @return string The input string with quoted strings removed
     * @access protected
     * @todo Replace code that uses this method with the replaceQuotedSpecialChars
     *     method instead. I think it's cleaner.
     */
    protected function removeQuotedStrings($data)
    {
        // matches a quote, then any characters (a backslash escapes the
        // character after it), up to the next unescaped matching quote
        return preg_replace('/(["\'])(?:(?=(\\\\?))\2.)*?\1/sm', '', $data);
    }
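To see what this pattern does in isolation, here is a standalone copy of it wrapped in a free function. `remove_quoted_strings()` is a name made up for this sketch (PHP 7 or later assumed); the regular expression itself is the one used above.

```php
<?php
// Standalone copy of the pattern above, for illustration only.
function remove_quoted_strings(string $data): string
{
    // (["'])              opening quote, captured as group 1
    // (?:(?=(\\?))\2.)*?  any characters, where a backslash escapes the next one
    // \1                  the same quote character that opened the string
    return preg_replace('/(["\'])(?:(?=(\\\\?))\2.)*?\1/sm', '', $data);
}

echo remove_quoted_strings('a,"b,c",d'), PHP_EOL; // a,,d
```

Because the quoted column is dropped entirely, the remaining text can be split on delimiters and newlines without hitting the ones that were inside quotes.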

    /**
     * Examine the input source to determine which character(s) are being used
     * as the end-of-line character
     *
     * @return string The end-of-line char for the input data
     * @access protected
     * @credit pulled from stackoverflow thread *tips hat to username "Harm"*
     * @todo This should throw an exception if it cannot determine the line ending
     * @todo I probably will make this method protected when I'm done with testing...
     * @todo If there is any way for this method to fail (for instance if a file
     *       is totally empty or contains no line breaks), then it needs to throw
     *       a relevant TasterException
     * @todo Use replaceQuotedSpecialChars rather than removeQuotedStrings()
     */
    protected function lickLineEndings()
    {
        $str = $this->removeQuotedStrings($this->sample);
        // CRLF is listed first: every "\r\n" also counts as one "\r" and one
        // "\n", and the strict ">" below keeps the earlier entry on a tie
        $eols = [
            self::EOL_WINDOWS => "\r\n",  // 0x0D - 0x0A - Windows, DOS OS/2
            self::EOL_UNIX    => "\n",    // 0x0A -      - Unix, OSX
            self::EOL_TRS80   => "\r",    // 0x0D -      - Apple ][, TRS80
        ];

        $curCount = 0;
        // @todo This should return a default maybe?
        $curEol = PHP_EOL;
        foreach ($eols as $eol) {
            if (($count = substr_count($str, $eol)) > $curCount) {
                $curCount = $count;
                $curEol = $eol;
            }
        }
        return $curEol;
    }
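The counting approach above can be tried on its own with a small free function. This is a sketch assuming PHP 7 or later; `detect_line_ending()` and its `$default` parameter are hypothetical names, and unlike the method above it does not strip quoted strings first.

```php
<?php
// Pick the line ending that occurs most often in the sample.
function detect_line_ending(string $sample, string $default = PHP_EOL): string
{
    // CRLF goes first: every CRLF also contains a CR and an LF, so the
    // strict ">" comparison below lets CRLF win ties against its parts.
    $candidates = ["\r\n", "\n", "\r"];
    $best = $default;
    $bestCount = 0;
    foreach ($candidates as $eol) {
        $count = substr_count($sample, $eol);
        if ($count > $bestCount) {
            $bestCount = $count;
            $best = $eol;
        }
    }
    return $best;
}
```

A sample with no line breaks at all falls back to `$default`, which mirrors the `PHP_EOL` fallback in the method above.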

    /**
     * When columns are quoted, the most reliable way to determine the quote and
     * delimiter characters is to seek out a pattern of delim, quote, stuff,
     * quote, delim. This only works if the data has quoted columns. If it
     * doesn't, these characters have to be determined some other way (see
     * lickDelimiter).
     *
     * @return array A two-element array containing quotechar, delimchar
     * @access protected
     * @todo make protected
     * @todo This should throw an exception if it cannot determine the delimiter
     *     this way.
     * @todo This should check for any line endings not just \n
     */
    protected function lickQuoteAndDelim()
    {
        /**
         * @var array An array of pattern matches
         */
        $matches = null;
        /**
         * @var array An array of patterns (regex)
         */
        $patterns = [];
        // delim can be anything but line breaks, quotes, alphanumeric, underscore, or the space character
        $antidelims = implode(array("\r", "\n", "\w", preg_quote('"', '/'), preg_quote("'", '/')/*, preg_quote('\\', '/')*/, preg_quote(chr(self::SPACE), '/')));
        $delim = '(?P<delim>[^' . $antidelims . '])';
        // @todo I think MS Excel uses some strange encoding for fancy open/close quotes
        $quote = '(?P<quoteChar>"|\'|`)';
        // ,"something", - delim, optional space, quote, anything, same quote, same delim
        $patterns[] = '/' . $delim . ' ?' . $quote . '.*?\2\1/ms';
        // 'something', - beginning of line or line break, quote, anything, same quote, delim, optional space
        $patterns[] = '/(?:^|\n)' . $quote . '.*?\1' . $delim . ' ?/ms';
        // ,'something' - delim, optional space, quote, anything, same quote, end of line
        $patterns[] = '/' . $delim . ' ?' . $quote . '.*?\2(?:$|\n)/ms';
        // 'something' - beginning of line, quote, anything, same quote, end of line
        // NOTE: this pattern has only one capture group (quoteChar), so the \2
        //     backreference does not resolve and the pattern fails to compile;
        //     the @ below hides the resulting warning
        $patterns[] = '/(?:^|\n)' . $quote . '.*?\2(?:$|\n)/ms';
        foreach ($patterns as $pattern) {
            // @todo I had to add the error suppression char here because it was
            //     causing undefined offset errors with certain data sets. strange...
            if (@preg_match_all($pattern, $this->sample, $matches) && $matches) break;
        }
        if ($matches) {
            $quotes = array_count_values($matches['quoteChar']);
            arsort($quotes);
            $quotes = array_flip($quotes);
            if ($theQuote = array_shift($quotes)) {
                $delims = array_count_values($matches['delim']);
                arsort($delims);
                $delims = array_flip($delims);
                $theDelim = array_shift($delims);
                return array($theQuote, $theDelim);
            }
        }
        throw new TasterException("quoteChar and delimiter cannot be determined", TasterException::ERR_QUOTE_AND_DELIM);
    }
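The "delim, quote, stuff, quote, delim" idea can be tried with just the first of the patterns above. The sketch below assumes PHP 7.1 or later; `guess_quote_and_delim()` is a hypothetical helper that returns `null` instead of throwing, and it skips the three fallback patterns.

```php
<?php
// Find the most common (quote, delimiter) pair around quoted columns.
function guess_quote_and_delim(string $sample): ?array
{
    // delim, optional space, quote, anything (lazy), same quote, same delim
    $pattern = '/(?P<delim>[^\r\n\w"\' ]) ?(?P<quoteChar>["\'`]).*?\2\1/s';
    if (!preg_match_all($pattern, $sample, $matches)) {
        return null; // no quoted column sits between two identical delimiters
    }
    // tally every captured quote and delimiter, most frequent first
    $quotes = array_count_values($matches['quoteChar']);
    $delims = array_count_values($matches['delim']);
    arsort($quotes);
    arsort($delims);
    reset($quotes);
    reset($delims);
    return [key($quotes), key($delims)];
}
```

When no column is quoted the function returns `null`, which corresponds to the case where the class falls back to `lickDelimiter()`.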

    /**
     * Take a list of likely delimiter characters and find the one that occurs
     * the most consistent number of times within the provided data.
     *
     * @param string $eol The character(s) used for newlines
     * @return string The delimiter character that occurs most consistently
     * @access protected
     * @todo Refactor this method--It needs more thorough testing against a wider
     *     variety of CSV data to be sure it works reliably. And I'm sure there
     *     are many performance and logic improvements that could be made. This
     *     is essentially a first draft.
     * @todo Use replaceQuotedSpecialChars rather than removeQuotedStrings
     */
    protected function lickDelimiter($eol = "\n")
    {
        $delimiters = array(",", "\t", "|", ":", ";", "/", '\\');
        $lines = explode($eol, $this->removeQuotedStrings($this->sample));
        $charFrequency = array();
        // count how often each candidate delimiter occurs on each non-blank line
        foreach ($lines as $key => $line) {
            if (!trim($line)) continue;
            foreach ($delimiters as $char) {
                $charFrequency[$char][$key] = substr_count($line, $char);
            }
        }
        $averages = Utils::array_average($charFrequency);
        $modes = Utils::array_mode($charFrequency);
        $consistencies = array();
        foreach ($averages as $char => $avg) {
            if (!array_key_exists($char, $modes)) continue;
            $mode = $modes[$char];
            $consistencies[$char] = $mode ? $avg / $mode : 0;
        }
        if (empty($consistencies)) {
            throw new TasterException('Cannot determine delimiter character', TasterException::ERR_DELIMITER);
        }
        arsort($consistencies);
        return key($consistencies);
    }
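One way to make "most consistent" concrete is sketched below. Note that this swaps in a different measure from the average/mode ratio used above: it scores each candidate by the share of lines whose count equals that candidate's most common non-zero count. `guess_delimiter()` and its default candidate list are assumptions made for this example (PHP 7.1 or later).

```php
<?php
// Score each candidate by how many lines share its most common count.
function guess_delimiter(string $sample, string $eol = "\n", array $candidates = [",", "\t", "|", ":", ";"]): ?string
{
    $lines = array_filter(explode($eol, $sample), function ($line) {
        return trim($line) !== '';
    });
    if (!$lines) {
        return null;
    }
    $best = null;
    $bestScore = 0.0;
    foreach ($candidates as $char) {
        // occurrences of this candidate on every non-blank line
        $counts = array_map(function ($line) use ($char) {
            return substr_count($line, $char);
        }, $lines);
        // most common per-line count (the mode)
        $tally = array_count_values($counts);
        arsort($tally);
        reset($tally);
        $mode = key($tally);
        if ($mode === 0) {
            continue; // usually absent, so unlikely to be the delimiter
        }
        $score = $tally[$mode] / count($counts);
        if ($score > $bestScore) {
            $best = $char;
            $bestScore = $score;
        }
    }
    return $best;
}
```

A candidate that appears the same number of times on every line scores 1.0, while one that appears only sporadically is skipped because its most common count is zero.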

    /**
     * Determine the "style" of data quoting. While CSV has an RFC
     * (https://tools.ietf.org/html/rfc4180), files in the wild don't
     * necessarily conform to it. And the format doesn't provide metadata such
     * as the delimiting character, quote character, or what types of data are
     * quoted. So this method makes a logical guess by finding which columns
     * have been quoted (if any) and examining their data type. Most often, CSV
     * files will only use quotes around columns that contain special characters
     * such as the delimiter, the quoting character, newlines, etc. (we refer to
     * this style as QUOTE_MINIMAL), but some quote all columns that contain
     * nonnumeric data (QUOTE_NONNUMERIC). Then there are CSV files that quote
     * all columns (QUOTE_ALL) and those that quote none (QUOTE_NONE).
     *
     * @param string $quote The type of quote character being used (single or double)
     * @param string $delim The character used as the column delimiter
     * @param string $eol The character used for newlines
     * @return string One of the four Flavor::QUOTE_* constants--see this
     *     method's description for more info.
     * @access protected
     * @todo Refactor this method--It needs more thorough testing against a wider
     *     variety of CSV data to be sure it works reliably. And I'm sure there
     *     are many performance and logic improvements that could be made. This
     *     is essentially a first draft.
     */
    protected function lickQuotingStyle($quote, $delim, $eol)
Unused Code introduced by
The parameter $quote is not used and could be removed.

This check looks for parameters that have been defined for a function or method, but which are not used in the method body.

    {
        $data = $this->replaceQuotedSpecialChars($this->sample, $delim);

        $quoting_styles = array(
            Flavor::QUOTE_ALL => 0,
            Flavor::QUOTE_NONE => 0,
            Flavor::QUOTE_MINIMAL => 0,
            Flavor::QUOTE_NONNUMERIC => 0,
        );

        $lines = explode($eol, $data);
        $freq = array(
            'quoted' => array(),
            'unquoted' => array()
        );

        foreach ($lines as $key => $line) {
            // now we can sub back in the correct newlines
            $line = str_replace(self::PLACEHOLDER_NEWLINE, $eol, $line);
            $cols = explode($delim, $line);
            foreach ($cols as $colkey => $col) {
                // now we can sub back in the correct delim characters
                $col = str_replace(self::PLACEHOLDER_DELIM, $delim, $col);
                if ($this->isQuoted($col)) {
                    $col = $this->unQuote($col);
                    $type = $this->lickDataType($col);
                    // we can remove this guy altogether since at least one column is quoted
                    unset($quoting_styles[Flavor::QUOTE_NONE]);
                    $freq['quoted'][] = $type;
                } else {
                    $type = $this->lickDataType($col);
                    // we can remove this guy altogether since at least one column is unquoted
                    unset($quoting_styles[Flavor::QUOTE_ALL]);
                    $freq['unquoted'][] = $type;
                }
            }
        }
        $types = array_unique($freq['quoted']);
        // if quoting_styles still has QUOTE_ALL or QUOTE_NONE, then that's the one to return
        if (array_key_exists(Flavor::QUOTE_ALL, $quoting_styles)) return Flavor::QUOTE_ALL;
        if (array_key_exists(Flavor::QUOTE_NONE, $quoting_styles)) return Flavor::QUOTE_NONE;
        if (count($types) == 1) {
            if (current($types) == self::DATA_SPECIAL) return Flavor::QUOTE_MINIMAL;
            elseif (current($types) == self::DATA_NONNUMERIC) return Flavor::QUOTE_NONNUMERIC;
        } else {
            if (array_key_exists(self::DATA_NONNUMERIC, array_flip($types))) {
                // allow for a SMALL amount of error here
                $counts = array(self::DATA_SPECIAL => 0, self::DATA_NONNUMERIC => 0);
                array_walk($freq['quoted'], function ($val) use (&$counts) {
                    $counts[$val]++;
                });
                arsort($counts);
                $most = current($counts);
                $least = end($counts);
                $err_margin = $least / $most;
                if ($err_margin < 1) return Flavor::QUOTE_NONNUMERIC;
            }
        }
        return Flavor::QUOTE_MINIMAL;
    }

    /**
     * Remove quotes around a piece of text (if there are any)
     *
     * @param string $data The data to "unquote"
     * @return string The data passed in, only with quotes stripped (off the edges)
     * @access protected
     */
    protected function unQuote($data)
    {
        return preg_replace('/^(["\'])(.*)\1$/', '\2', $data);
    }

    /**
     * Determine whether a particular string of data has quotes around it.
     *
     * @param string $data The data to check
     * @return boolean Whether the data is quoted or not
     * @access protected
     */
    protected function isQuoted($data)
    {
        // inside a character class \1 is not a backreference, so [^\1]* matches
        // nearly anything; in effect only the outer pair of quotes is checked
        return (bool) preg_match('/^([\'"])[^\1]*\1$/', $data);
    }
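A standalone sketch of the two helpers above, assuming PHP 7 or later. `is_quoted()` and `un_quote()` are hypothetical free-function versions; the quote check is written as `.*` with the `s` flag to state openly that only the outer pair of quotes is compared.

```php
<?php
// True when the string starts and ends with the same quote character.
function is_quoted(string $data): bool
{
    // the s flag lets "." span newlines inside a quoted column
    return preg_match('/^([\'"]).*\1$/s', $data) === 1;
}

// Strip one matching pair of quotes from the edges, if present.
function un_quote(string $data): string
{
    return preg_replace('/^(["\'])(.*)\1$/s', '$2', $data);
}
```

Unquoted input passes through `un_quote()` unchanged, so it can be applied to a column without checking `is_quoted()` first.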

    /**
     * Determine what type of data is contained within a variable
     * Possible types:
     *     - special - contains characters that could potentially need to be quoted (possible delimiter characters)
     *     - nonnumeric - contains at least one character that is not a digit
     *     - unknown - everything else (digits only, or empty)
     * This method is really only used within the "lickQuotingStyle" method to
     * help determine whether a particular column has been quoted due to it being
     * nonnumeric or because it has some special character in it such as a delimiter
     * or newline or quote.
     *
     * @param string $data The data to determine the type of
     * @return string The type of data (one of the "DATA_" constants above)
     * @access protected
     * @todo I could probably eliminate this method and use an anonymous function
     *     instead. It isn't used anywhere else and its name could be misleading.
     *     Especially since I also have a lickType method that is used within the
     *     lickHeader method.
     */
    protected function lickDataType($data)
    {
        // @todo make this check for only the quote and delim that are actually being used
        // that will make the guess more accurate
        if (preg_match('/[\'",\t\|:;-]/', $data)) {
            return self::DATA_SPECIAL;
        } elseif (preg_match('/[^0-9]/', $data)) {
            return self::DATA_NONNUMERIC;
        }
        return self::DATA_UNKNOWN;
    }

    /**
     * Replace all instances of newlines and whatever character you specify (as
     * the delimiter) that are contained within quoted text. The replacements are
     * simply a special placeholder string. This is done so that I can use the
     * very unsmart "explode" function and not have to worry about it exploding
     * on delimiters or newlines within quotes. Once I have exploded, I typically
     * sub back in the real characters before doing anything else. There is
     * currently no dedicated method for doing so; I just use str_replace.
     *
     * @param string $data The string to do the replacements on
     * @param string $delim The delimiter character to replace
     * @return string The data with replacements performed
     * @access protected
     * @todo I could probably pass in (maybe optionally) the newline character I
     *     want to replace as well. I'll do that if I need to.
     */
    protected function replaceQuotedSpecialChars($data, $delim)
    {
        return preg_replace_callback('/([\'"])(.*)\1/imsU', function($matches) use ($delim) {
            // treat \r\n as a single newline so that it maps to one placeholder
            $ret = preg_replace("/\r\n|\r|\n/", self::PLACEHOLDER_NEWLINE, $matches[0]);
            $ret = str_replace($delim, self::PLACEHOLDER_DELIM, $ret);
            return $ret;
        }, $data);
    }
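The placeholder round trip described above (mask, explode, restore) can be sketched end to end as follows, assuming PHP 7 or later. `mask_quoted()` and `split_rows()` are hypothetical names for this example; the placeholder strings are the ones defined at the top of the class.

```php
<?php
const PLACEHOLDER_NEWLINE = '[__NEWLINE__]';
const PLACEHOLDER_DELIM   = '[__DELIM__]';

// Hide newlines and delimiters that sit inside quoted text.
function mask_quoted(string $data, string $delim): string
{
    return preg_replace_callback('/([\'"])(.*)\1/sU', function (array $m) use ($delim) {
        $masked = preg_replace('/\r\n|\r|\n/', PLACEHOLDER_NEWLINE, $m[0]);
        return str_replace($delim, PLACEHOLDER_DELIM, $masked);
    }, $data);
}

// Split into rows and columns with plain explode(), then restore the text.
function split_rows(string $data, string $delim = ',', string $eol = "\n"): array
{
    $rows = [];
    foreach (explode($eol, mask_quoted($data, $delim)) as $line) {
        $cols = explode($delim, $line);
        // restore the real characters once the naive split is done
        $rows[] = str_replace([PLACEHOLDER_NEWLINE, PLACEHOLDER_DELIM], [$eol, $delim], $cols);
    }
    return $rows;
}
```

For example, `split_rows("1,\"a,b\",2\n3,\"c\nd\",4")` yields two rows of three columns each, with the comma and the newline inside the quotes left intact.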

    /**
     * Determine the "type" of a particular string of data. Used for the lickHeader
     * method to assign a type to each column to try to determine whether the
     * first row is different than a consistent column type.
     *
     * @todo As I'm writing this method I'm beginning to realize how expensive
     * the lickHeader method is going to end up being since it has to apply all
     * these regexes (potentially) to every column. I may end up writing a much
     * simpler type-checking method than this if it proves to be too expensive
     * to be practical.
     *
     * @param string $data The string of data to check the type of
     * @return string One of the TYPE_ string constants above
     * @access protected
     * @uses \Carbon\Carbon date/time library/class
     */
    protected function lickType($data)
    {
        // whole numbers only; a decimal point makes it a double (next branch)
        if (preg_match('/^[+-]?\d+$/', $data)) {
            return self::TYPE_NUMBER;
        } elseif (preg_match('/^[+-]?[\d]+\.[\d]+$/', $data)) {
            return self::TYPE_DOUBLE;
        // the u flag is needed to match the multibyte currency symbols
        } elseif (preg_match('/^[+-]?[¥£€$]\d+(\.\d+)?$/u', $data)) {
            return self::TYPE_CURRENCY;
        } elseif (preg_match('/^[a-zA-Z]+$/', $data)) {
            return self::TYPE_ALPHA;
        } else {
            try {
                $year = '([01][0-9])?[0-9]{2}';
                $month = '([01]?[0-9]|Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)';
                $day = '[0-3]?[0-9]';
                $sep = '[\/\.\-]?';
                $time = '([0-2]?[0-9](:[0-5][0-9]){1,2}(am|pm)?|[01]?[0-9](am|pm))';
                $date = '(' . $month . $sep . $day . $sep . $year . '|' . $day . $sep . $month . $sep . $year . '|' . $year . $sep . $month . $sep . $day . ')';
                $dt = Carbon::parse($data);
                if ($dt->isToday()) {
                    // then this is most likely a time string...
                    if (preg_match("/^{$time}$/i", $data)) {
                        return self::TYPE_TIME;
                    }
                }
                if (preg_match("/^{$date}$/i", $data)) {
                    return self::TYPE_DATE;
                } elseif (preg_match("/^{$date} {$time}$/i", $data)) {
                    return self::TYPE_DATETIME;
                }
            } catch (\Exception $e) {
                // now go on checking remaining types
                if (preg_match('/^\w+$/', $data)) {
                    return self::TYPE_ALNUM;
                }
            }
        }
        return self::TYPE_STRING;
    }
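The ordering of the checks matters here: each pattern has to be narrow enough that the later, more specific branches stay reachable. The sketch below shows the regex-only part of that cascade, assuming PHP 7 or later and a UTF-8 encoded source file. `sniff_type()` is a hypothetical helper that leaves out the Carbon-based date and time detection entirely.

```php
<?php
// Classify a column value with regexes only (no date/time handling).
function sniff_type(string $data): string
{
    if (preg_match('/^[+-]?\d+$/', $data)) {
        return 'number';   // +-987
    } elseif (preg_match('/^[+-]?\d+\.\d+$/', $data)) {
        return 'double';   // +-12.387
    } elseif (preg_match('/^[+-]?[¥£€$]\d+(\.\d+)?$/u', $data)) {
        return 'currency'; // $98.96, ¥12389 (u flag: symbols are multibyte)
    } elseif (preg_match('/^[a-zA-Z]+$/', $data)) {
        return 'alpha';    // abababab
    } elseif (preg_match('/^\w+$/', $data)) {
        return 'alnum';    // 12ab44m1n2_asdf
    }
    return 'string';
}
```

If the first pattern also accepted a decimal point, the `double` branch could never be reached, which is why the whole-number check excludes it.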

    /**
     * Examines the contents of the CSV data to make a determination of whether
     * or not it contains a header row. To make this determination, it creates
     * an array of the data type and length of each column in each row and then
     * compares them. If all of the rows except the header look similar, it will
     * return true. This is only a guess though. There is no programmatic way to
     * determine 100% whether a CSV file has a header. The format does not
     * provide metadata such as that.
     *
     * @param string $quote The CSV data's quoting char (either double or single quote)
     * @param string $delim The CSV data's delimiting char (can be a variety of chars but
     *     typically is either a comma or a tab, sometimes a pipe)
     * @param string $eol The CSV data's end-of-line char(s) (\n \r or \r\n)
     * @return boolean True if the data (most likely) contains a header row
     * @access public
     * @todo This method needs a total refactor. It's not necessary to loop twice
     *     You could get away with one loop and that would allow for me to do
     *     something like only examining enough rows to get to a particular
     *     "hasHeader" score (+-100 for instance) & then just return true|false
     * @todo Also, break out of the first loop after a certain (perhaps even a
     *     configurable) amount of lines (you only need to examine so much data
     *     to reliably make a determination and this is an expensive method)
     * @todo Because the header isn't actually part of the "flavor",
     *     I could remove the need for quote, delim, and eol by "licking" the
     *     data sample itself. Also, I could actually create a Reader object to
     *     read the data here.
     */
    public function lickHeader($quote, $delim, $eol)
Unused Code introduced by
The parameter $quote is not used and could be removed.

This check looks for parameters that have been defined for a function or method, but which are not used in the method body.

    {
        $data = $this->replaceQuotedSpecialChars($this->sample, $delim);
        $lines = explode($eol, $data);
        $types = array();
        foreach ($lines as $line_no => $line) {
            // now we can sub back in the correct newlines
            $line = str_replace(self::PLACEHOLDER_NEWLINE, $eol, $line);
            $cols = explode($delim, $line);
            foreach ($cols as $col_no => $col) {
                // now we can sub back in the correct delim characters
                $col = str_replace(self::PLACEHOLDER_DELIM, $delim, $col);
                $types[$line_no][$col_no] = array(
                    'type' => $this->lickType($this->unQuote($col)),
                    'length' => strlen($col)
                );
            }
        }
        $hasHeader = 0;
        $potential_header = array_shift($types);
        foreach ($types as $line_no => $cols) {
            foreach ($cols as $col_no => $col_info) {
                if (!array_key_exists($col_no, $potential_header)) continue;
                $header_info = $potential_header[$col_no];
                if ($header_info['type'] == self::TYPE_STRING) {
                    // use length
                    if ($col_info['length'] != $header_info['length']) $hasHeader++;
                    else $hasHeader--;
                } else {
                    if ($col_info['type'] != $header_info['type']) $hasHeader++;
                    else $hasHeader--;
                }
            }
        }
        return $hasHeader > 0;
    }
}
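The header vote used by `lickHeader()` can be illustrated on rows that have already been split into columns. This is a simplified sketch assuming PHP 7 or later: `cell_type()` and `has_header()` are hypothetical helpers, and `cell_type()` only tells numbers from strings rather than using the full set of TYPE_ constants.

```php
<?php
// Very coarse type check: numeric text versus everything else.
function cell_type(string $cell): string
{
    return preg_match('/^[+-]?\d+(\.\d+)?$/', $cell) ? 'number' : 'string';
}

// Vote on whether the first row looks different from the rows below it.
function has_header(array $rows): bool
{
    $header = array_shift($rows);
    if ($header === null) {
        return false;
    }
    $score = 0;
    foreach ($rows as $row) {
        foreach ($row as $i => $cell) {
            if (!array_key_exists($i, $header)) {
                continue;
            }
            if (cell_type($header[$i]) === 'string') {
                // string headers: compare lengths, as the method above does
                $score += (strlen($cell) !== strlen($header[$i])) ? 1 : -1;
            } else {
                // otherwise compare the detected types
                $score += (cell_type($cell) !== cell_type($header[$i])) ? 1 : -1;
            }
        }
    }
    return $score > 0;
}
```

Every cell that differs from the first row adds a vote for "header" and every cell that matches it removes one, so a first row of labels above purely numeric data ends up with a positive score.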