Complex classes like Taster often take on many different responsibilities. To break such a class down, first identify a cohesive component within it. A common approach is to look for fields and methods that share the same prefix or suffix. You can also inspect the cohesion graph to spot unconnected or weakly connected components.
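As a rough illustration of the prefix heuristic, the sketch below groups a class's constants and methods by their leading name segment using reflection. The groupByPrefix helper and the trimmed-down TasterLike stand-in class are hypothetical and exist only for this example; they are not part of CSVelte.

```php
<?php

// Hypothetical stand-in holding a few of Taster's constants and methods.
class TasterLike
{
    const EOL_UNIX = 'lf';
    const EOL_WINDOWS = 'crlf';
    const TYPE_NUMBER = 'number';
    const TYPE_STRING = 'string';
    const SAMPLE_SIZE = 2500;

    public function lickHeader() {}
    public function lickType() {}
    public function isQuoted() {}
}

/**
 * Group constant and method names by their first name segment.
 * Constants are split on "_"; methods on their leading lowercase run.
 *
 * @return array<string, string[]> prefix => member names
 */
function groupByPrefix(string $class): array
{
    $ref = new ReflectionClass($class);
    $groups = [];

    foreach (array_keys($ref->getConstants()) as $name) {
        $prefix = explode('_', $name)[0];
        $groups[$prefix][] = $name;
    }

    foreach ($ref->getMethods() as $method) {
        // "lickHeader" -> "lick", "isQuoted" -> "is"
        preg_match('/^[a-z]+/', $method->getName(), $m);
        $groups[$m[0] ?? $method->getName()][] = $method->getName();
    }

    // Keep only prefixes shared by two or more members.
    return array_filter($groups, fn (array $names) => count($names) > 1);
}

$groups = groupByPrefix(TasterLike::class);
print_r($groups);
```

Prefixes that survive the filter (here EOL, TYPE, and lick) are only candidates; whether they form a cohesive component still needs a look at how the members use each other.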
Once you have determined which fields belong together, you can apply the Extract Class refactoring. If the component makes sense as a subclass, Extract Subclass is also a candidate, and is often faster.
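A minimal sketch of Extract Class, assuming the TYPE_* constants and the type-guessing logic form one such component. ColumnTypeDetector and SlimTaster are hypothetical names, and the two regular expressions are simplified stand-ins rather than the logic of the original lickType method.

```php
<?php

// Hypothetical class extracted from Taster: it owns the TYPE_* constants
// and the type-guessing logic that only header detection relies on.
final class ColumnTypeDetector
{
    const TYPE_NUMBER = 'number';
    const TYPE_DOUBLE = 'double';
    const TYPE_STRING = 'string';

    // Simplified stand-in for lickType(); the real method covers more types.
    public function detect(string $data): string
    {
        if (preg_match('/^[+-]?\d+$/', $data)) {
            return self::TYPE_NUMBER;
        }
        if (preg_match('/^[+-]?\d+\.\d+$/', $data)) {
            return self::TYPE_DOUBLE;
        }
        return self::TYPE_STRING;
    }
}

// Trimmed-down Taster that delegates instead of doing the work itself.
class SlimTaster
{
    private ColumnTypeDetector $types;

    public function __construct(?ColumnTypeDetector $types = null)
    {
        $this->types = $types ?? new ColumnTypeDetector();
    }

    /** @return string[] one detected type per column */
    public function columnTypes(array $row): array
    {
        return array_map([$this->types, 'detect'], $row);
    }
}

$taster = new SlimTaster();
$types = $taster->columnTypes(['-987', '12.387', 'I am a string.']);
```

Passing the detector through the constructor keeps the extracted class replaceable in tests, which is one reason to prefer delegation over a subclass here.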
While breaking up the class, it is a good idea to analyze how other classes use Taster and, based on these observations, apply Extract Interface as well.
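Extract Interface could look roughly like the sketch below, assuming most callers only need the lick method. FlavorDetector, FixedFlavorDetector, and describeFlavor are hypothetical names, and a plain array stands in for the CSVelte\Flavor object so that the example runs on its own.

```php
<?php

// Hypothetical interface capturing the one method most callers need.
interface FlavorDetector
{
    /** @return array<string, string> stand-in for a CSVelte\Flavor object */
    public function lick(): array;
}

// Stand-in implementation; a real Taster would inspect its input sample.
final class FixedFlavorDetector implements FlavorDetector
{
    public function lick(): array
    {
        return ['delimiter' => ',', 'quoteChar' => '"', 'lineTerminator' => "\n"];
    }
}

// Client code depends on the interface, not on the concrete Taster class.
function describeFlavor(FlavorDetector $detector): string
{
    $flavor = $detector->lick();

    return sprintf('delimiter=%s quote=%s', $flavor['delimiter'], $flavor['quoteChar']);
}

echo describeFlavor(new FixedFlavorDetector()), PHP_EOL;
```

With callers typed against the interface, the concrete class can be split further without touching them.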
| 1 | <?php |
||
| 41 | class Taster |
||
| 42 | { |
||
| 43 | /** |
||
| 44 | * End-of-line constants. |
||
| 45 | */ |
||
| 46 | const EOL_UNIX = 'lf'; |
||
| 47 | const EOL_TRS80 = 'cr'; |
||
| 48 | const EOL_WINDOWS = 'crlf'; |
||
| 49 | |||
| 50 | /** |
||
| 51 | * ASCII character codes for "invisibles". |
||
| 52 | */ |
||
| 53 | const HORIZONTAL_TAB = 9; |
||
| 54 | const LINE_FEED = 10; |
||
| 55 | const CARRIAGE_RETURN = 13; |
||
| 56 | const SPACE = 32; |
||
| 57 | |||
| 58 | /** |
||
| 59 | * Data types -- Used within the lickQuotingStyle method. |
||
| 60 | */ |
||
| 61 | const DATA_NONNUMERIC = 'nonnumeric'; |
||
| 62 | const DATA_SPECIAL = 'special'; |
||
| 63 | const DATA_UNKNOWN = 'unknown'; |
||
| 64 | |||
| 65 | /** |
||
| 66 | * Placeholder strings -- hold the place of newlines and delimiters contained |
||
| 67 | * within quoted text so that the explode method doesn't split incorrectly. |
||
| 68 | */ |
||
| 69 | const PLACEHOLDER_NEWLINE = '[__NEWLINE__]'; |
||
| 70 | const PLACEHOLDER_DELIM = '[__DELIM__]'; |
||
| 71 | |||
| 72 | /** |
||
| 73 | * Recommended data sample size. |
||
| 74 | */ |
||
| 75 | const SAMPLE_SIZE = 2500; |
||
| 76 | |||
| 77 | /** |
||
| 78 | * Column data types -- used within the lickHeader method to determine |
||
| 79 | * whether the first row contains different types of data than the rest of |
||
| 80 | * the rows (and thus, is likely a header row). |
||
| 81 | */ |
||
| 82 | // +-987 |
||
| 83 | const TYPE_NUMBER = 'number'; |
||
| 84 | // +-12.387 |
||
| 85 | const TYPE_DOUBLE = 'double'; |
||
| 86 | // I am a string. I can contain all kinds of stuff. |
||
| 87 | const TYPE_STRING = 'string'; |
||
| 88 | // 10-Jul-15, 9/1/2007, April 1st, 2006, etc. |
||
| 89 | const TYPE_DATE = 'date'; |
||
| 90 | // 10:00pm, 5pm, 13:08, etc. |
||
| 91 | const TYPE_TIME = 'time'; |
||
| 92 | // $98.96, ¥12389, £6.08, €87.00 |
||
| 93 | const TYPE_CURRENCY = 'currency'; |
||
| 94 | // 12ab44m1n2_asdf |
||
| 95 | const TYPE_ALNUM = 'alnum'; |
||
| 96 | // abababab |
||
| 97 | const TYPE_ALPHA = 'alpha'; |
||
| 98 | |||
| 99 | /** |
||
| 100 | * @var CSVelte\Contract\Readable The source of data to examine |
||
| 101 | */ |
||
| 102 | protected $input; |
||
| 103 | |||
| 104 | /** |
||
| 105 | * Sample of CSV data to use for tasting (determining CSV flavor). |
||
| 106 | * |
||
| 107 | * @var string |
||
| 108 | */ |
||
| 109 | protected $sample; |
||
| 110 | |||
| 111 | /** |
||
| 112 | * Class constructor--accepts a CSV input source. |
||
| 113 | * |
||
| 114 | * @param CSVelte\Contract\Readable The source of CSV data |
||
| 115 | * |
||
| 116 | * @return void |
||
| 117 | * |
||
| 118 | * @todo It may be a good idea to skip the first line or two for the sample |
||
| 119 | * so that the header line(s) don't throw things off (with the exception |
||
| 120 | * of lickHeader() obviously) |
||
| 121 | */ |
||
| 122 | 37 | public function __construct(Readable $input) |
|
| 127 | |||
| 128 | /** |
||
| 129 | * I'm not sure what this is for... |
||
| 130 | * |
||
| 131 | * @param Readable $input The input source |
||
| 132 | * |
||
| 133 | * @return CSVelte\Taster |
||
| 134 | * |
||
| 135 | * @todo Get rid of this unless there is a good reason for having it...? |
||
| 136 | * @ignore |
||
| 137 | */ |
||
| 138 | public static function create(Readable $input) |
||
| 142 | |||
| 143 | /** |
||
| 144 | * Examine the input source and determine what "Flavor" of CSV it contains. |
||
| 145 | * The CSV format, while having an RFC (https://tools.ietf.org/html/rfc4180), |
||
| 146 | * doesn't necessarily always conform to it. And it doesn't provide metadata |
||
| 147 | * such as the delimiting character, quote character, or what types of data |
||
| 148 | * are quoted. |
||
| 149 | * |
||
| 150 | * @return CSVelte\Flavor The metadata that the CSV format doesn't provide |
||
| 151 | * |
||
| 152 | * @todo Implement a lickQuote method for when lickQuoteAndDelim method fails |
||
| 153 | * @todo Should there be a lickEscapeChar method? The Python module that inspired |
||
| 154 | * this library doesn't include one... |
||
| 155 | * @todo This should cache the results and only regenerate if $this->sample |
||
| 156 | * changes (or $this->input) |
||
| 157 | */ |
||
| 158 | 16 | public function lick() |
|
| 176 | |||
| 177 | /** |
||
| 178 | * Replaces all quoted columns with a blank string. I was using this method |
||
| 179 | * to prevent explode() from incorrectly splitting at delimiters and newlines |
||
| 180 | * within quotes when parsing a file. But this was before I wrote the |
||
| 181 | * replaceQuotedSpecialChars method which (at least to me) makes more sense. |
||
| 182 | * |
||
| 183 | * @param string The string to replace quoted strings within |
||
| 184 | * |
||
| 185 | * @return string The input string with quoted strings removed |
||
| 186 | * |
||
| 187 | * @todo Replace code that uses this method with the replaceQuotedSpecialChars |
||
| 188 | * method instead. I think it's cleaner. |
||
| 189 | */ |
||
| 190 | 16 | protected function removeQuotedStrings($data) |
|
| 194 | |||
| 195 | /** |
||
| 196 | * Examine the input source to determine which character(s) are being used |
||
| 197 | * as the end-of-line character. |
||
| 198 | * |
||
| 199 | * @return char The end-of-line char for the input data |
||
| 200 | * @credit pulled from stackoverflow thread *tips hat to username "Harm"* |
||
| 201 | * |
||
| 202 | * @todo This should throw an exception if it cannot determine the line ending |
||
| 203 | * @todo I probably will make this method protected when I'm done with testing... |
||
| 204 | * @todo If there is any way for this method to fail (for instance if a file |
||
| 205 | * is totally empty or contains no line breaks), then it needs to throw |
||
| 206 | * a relevant TasterException |
||
| 207 | * @todo Use replaceQuotedSpecialChars rather than removeQuotedStrings() |
||
| 208 | */ |
||
| 209 | 16 | protected function lickLineEndings() |
|
| 229 | |||
| 230 | /** |
||
| 231 | * The best way to determine quote and delimiter characters is when columns |
||
| 232 | * are quoted: often you can seek out a pattern of delim, quote, stuff, quote, delim, |
||
| 233 | * but this only works if you have quoted columns. If you don't, you have to |
||
| 234 | * determine these characters some other way... (see lickDelimiter). |
||
| 235 | * |
||
| 236 | * @return array A two-row array containing quotechar, delimchar |
||
| 237 | * |
||
| 238 | * @todo make protected |
||
| 239 | * @todo This should throw an exception if it cannot determine the delimiter |
||
| 240 | * this way. |
||
| 241 | * @todo This should check for any line endings not just \n |
||
| 242 | */ |
||
| 243 | 16 | protected function lickQuoteAndDelim() |
|
| 276 | |||
| 277 | /** |
||
| 278 | * Take a list of likely delimiter characters and find the one that occurs |
||
| 279 | * the most consistent number of times within the provided data. |
||
| 280 | * |
||
| 281 | * @param string The character(s) used for newlines |
||
| 282 | * |
||
| 283 | * @return string The delimiter character that occurs most consistently |
||
| 284 | * |
||
| 285 | * @see CSVelte\Flavor for possible quote style constants |
||
| 286 | * |
||
| 287 | * @todo Refactor this method--It needs more thorough testing against a wider |
||
| 288 | * variety of CSV data to be sure it works reliably. And I'm sure there |
||
| 289 | * are many performance and logic improvements that could be made. This |
||
| 290 | * is essentially a first draft. |
||
| 291 | * @todo Use replaceQuotedSpecialChars rather than removeQuotedStrings |
||
| 292 | */ |
||
| 293 | 2 | protected function lickDelimiter($eol = "\n") |
|
| 334 | |||
| 335 | /** |
||
| 336 | * Determine the "style" of data quoting. The CSV format, while having an RFC |
||
| 337 | * (https://tools.ietf.org/html/rfc4180), doesn't necessarily always conform |
||
| 338 | * to it. And it doesn't provide metadata such as the delimiting character, |
||
| 339 | * quote character, or what types of data are quoted. So this method makes a |
||
| 340 | * logical guess by finding which columns have been quoted (if any) and |
||
| 341 | * examining their data type. Most often, CSV files will only use quotes |
||
| 342 | * around columns that contain special characters such as the delimiter, |
||
| 343 | * the quoting character, newlines, etc. (we refer to this style as |
||
| 344 | * QUOTE_MINIMAL), but some quote all columns that contain nonnumeric data |
||
| 345 | * (QUOTE_NONNUMERIC). Then there are CSV files that quote all columns |
||
| 346 | * (QUOTE_ALL) and those that quote none (QUOTE_NONE). |
||
| 347 | * |
||
| 348 | * @param string The data to examine for "quoting style" |
||
| 349 | * @param char The type of quote character being used (single or double) |
||
| 350 | * @param char The character used as the column delimiter |
||
| 351 | * @param char The character used for newlines |
||
| 352 | * |
||
| 353 | * @return string One of four "QUOTING_" constants defined above--see this |
||
| 354 | * method's description for more info. |
||
| 355 | * |
||
| 356 | * @todo Refactor this method--It needs more thorough testing against a wider |
||
| 357 | * variety of CSV data to be sure it works reliably. And I'm sure there |
||
| 358 | * are many performance and logic improvements that could be made. This |
||
| 359 | * is essentially a first draft. |
||
| 360 | */ |
||
| 361 | 16 | protected function lickQuotingStyle($data, $quote, $delim, $eol) |
|
| 432 | |||
| 433 | /** |
||
| 434 | * Remove quotes around a piece of text (if there are any). |
||
| 435 | * |
||
| 436 | * @param string The data to "unquote" |
||
| 437 | * |
||
| 438 | * @return string The data passed in, only with quotes stripped (off the edges) |
||
| 439 | */ |
||
| 440 | 22 | protected function unQuote($data) |
|
| 444 | |||
| 445 | /** |
||
| 446 | * Determine whether a particular string of data has quotes around it. |
||
| 447 | * |
||
| 448 | * @param string The data to check |
||
| 449 | * |
||
| 450 | * @return bool Whether the data is quoted or not |
||
| 451 | */ |
||
| 452 | 16 | protected function isQuoted($data) |
|
| 456 | |||
| 457 | /** |
||
| 458 | * Determine what type of data is contained within a variable |
||
| 459 | * Possible types: |
||
| 460 | * - nonnumeric - contains characters other than numbers |
||
| 461 | * - special - contains characters that could potentially need to be quoted (possible delimiter characters) |
||
| 462 | * - unknown - everything else |
||
| 463 | * This method is really only used within the "lickQuotingStyle" method to |
||
| 464 | * help determine whether a particular column has been quoted due to it being |
||
| 465 | * nonnumeric or because it has some special character in it such as a delimiter |
||
| 466 | * or newline or quote. |
||
| 467 | * |
||
| 468 | * @param string The data to determine the type of |
||
| 469 | * |
||
| 470 | * @return string The type of data (one of the "DATA_" constants above) |
||
| 471 | * |
||
| 472 | * @todo I could probably eliminate this method and use an anonymous function |
||
| 473 | * instead. It isn't used anywhere else and its name could be misleading. |
||
| 474 | * Especially since I also have a lickType method that is used within the |
||
| 475 | * lickHeader method. |
||
| 476 | */ |
||
| 477 | 16 | protected function lickDataType($data) |
|
| 489 | |||
| 490 | /** |
||
| 491 | * Replace all instances of newlines and whatever character you specify (as |
||
| 492 | * the delimiter) that are contained within quoted text. The replacements are |
||
| 493 | * simply a special placeholder string. This is done so that I can use the |
||
| 494 | * very unsmart "explode" function and not have to worry about it exploding |
||
| 495 | * on delimiters or newlines within quotes. Once I have exploded, I typically |
||
| 496 | * sub back in the real characters before doing anything else, although |
||
| 497 | * currently there is no dedicated method for doing so; I just use str_replace. |
||
| 498 | * |
||
| 499 | * @param string The string to do the replacements on |
||
| 500 | * @param char The delimiter character to replace |
||
| 501 | * |
||
| 502 | * @return string The data with replacements performed |
||
| 503 | * |
||
| 504 | * @todo I could probably pass in (maybe optionally) the newline character I |
||
| 505 | * want to replace as well. I'll do that if I need to. |
||
| 506 | */ |
||
| 507 | protected function replaceQuotedSpecialChars($data, $delim) |
||
| 516 | |||
| 517 | /** |
||
| 518 | * Determine the "type" of a particular string of data. Used for the lickHeader |
||
| 519 | * method to assign a type to each column to try to determine whether the |
||
| 520 | * first row is different than a consistent column type. |
||
| 521 | * |
||
| 522 | * @todo As I'm writing this method I'm beginning to realize how expensive |
||
| 523 | * the lickHeader method is going to end up being since it has to apply all |
||
| 524 | * these regexes (potentially) to every column. I may end up writing a much |
||
| 525 | * simpler type-checking method than this if it proves to be too expensive |
||
| 526 | * to be practical. |
||
| 527 | * |
||
| 528 | * @param string The string of data to check the type of |
||
| 529 | * |
||
| 530 | * @return string One of the TYPE_ string constants above |
||
| 531 | * |
||
| 532 | * @uses Carbon/Carbon date/time library/class |
||
| 533 | */ |
||
| 534 | 22 | protected function lickType($data) |
|
| 574 | |||
| 575 | /** |
||
| 576 | * Examines the contents of the CSV data to make a determination of whether |
||
| 577 | * or not it contains a header row. To make this determination, it creates |
||
| 578 | * an array of each column's data type and length (for every row) and then |
||
| 579 | * compares them. If all of the rows except the header look similar, it will |
||
| 580 | * return true. This is only a guess though. There is no programmatic way to |
||
| 581 | * determine 100% whether a CSV file has a header. The format does not |
||
| 582 | * provide metadata such as that. |
||
| 583 | * |
||
| 584 | * @param string The CSV data to examine (only 20 rows will be examined so |
||
| 585 | * there is no need to provide any more data than that) |
||
| 586 | * @param char The CSV data's quoting char (either double or single quote) |
||
| 587 | * @param char The CSV data's delimiting char (can be a variety of chars but |
||
| 588 | * typically is either a comma or a tab, sometimes a pipe) |
||
| 589 | * @param char The CSV data's end-of-line char(s) (\n \r or \r\n) |
||
| 590 | * |
||
| 591 | * @return bool True if the data (most likely) contains a header row |
||
| 592 | * |
||
| 593 | * @todo This method needs a total refactor. It's not necessary to loop twice. |
||
| 594 | * You could get away with one loop and that would allow for me to do |
||
| 595 | * something like only examining enough rows to get to a particular |
||
| 596 | * "hasHeader" score (+-100 for instance) & then just return true|false |
||
| 597 | * @todo Also, break out of the first loop after a certain (perhaps even a |
||
| 598 | * configurable) amount of lines (you only need to examine so much data |
||
| 599 | * to reliably make a determination and this is an expensive method) |
||
| 600 | * @todo Because the header isn't actually part of the "flavor", |
||
| 601 | * I could remove the need for quote, delim, and eol by "licking" the |
||
| 602 | * data sample provided in the first argument. Also, I could actually |
||
| 603 | * create a Reader object to read the data here. |
||
| 604 | */ |
||
| 605 | 22 | public function lickHeader($data, $quote, $delim, $eol) |
|
| 652 | } |
||
| 653 |
Adding a @return annotation to a constructor is not recommended, since a constructor does not have a meaningful return value. Please refer to the PHP core documentation on constructors.
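A constructor docblock that follows this advice might look like the sketch below. The Readable interface is a minimal stand-in so the example runs on its own, and the constructor body is an assumption, since the original body is collapsed in the listing above. The only point being made is that the docblock carries no @return tag.

```php
<?php

// Hypothetical stand-in for the Readable contract that Taster accepts.
interface Readable
{
    public function read(int $length): string;
}

class Taster
{
    /** @var Readable The source of data to examine */
    protected $input;

    /**
     * Class constructor--accepts a CSV input source.
     *
     * @param Readable $input The source of CSV data
     */
    public function __construct(Readable $input)
    {
        // Assumed body: the original is not shown in the listing.
        $this->input = $input;
    }
}
```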