features_csv_to_dict.main()   F

Complexity

Conditions 42

Size

Total Lines 284
Code Lines 176

Duplication

Lines 0
Ratio 0 %

Importance

Changes 0

Metric  Value
eloc    176
dl      0
loc     284
rs      0
c       0
b       0
f       0
cc      42
nop     1

How to fix

Long Method

Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.

Commonly applied refactorings include Extract Method.
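As a minimal sketch of Extract Method applied to this script, the row-parsing statements in the body of main() could move into a helper of their own. The name parse_row, its default column arguments, and the module-level binarize are assumptions made for illustration; they are not part of the analyzed file. The sketch also reads the segmental column as '1' or '0', following the module docstring rather than the bool() call in the listing.

```python
def binarize(num):
    """Map '0', '-1', '1' to '00', '10', '01'; anything else maps to '11'."""
    return {'0': '00', '-1': '10', '1': '01'}.get(num, '11')


def parse_row(line, first_col=3, last_col=-1):
    """Split one CSV row into (symbol, variant, segmental, featint, name).

    Hypothetical helper: it gathers the row-parsing statements from the
    body of main() under one descriptive name.
    """
    fields = line.strip().rstrip(',').split(',')
    symbol = fields[0]
    variant = int(fields[1])
    # Assumption: the segmental column holds '1' (segmental) or '0' (featural)
    segmental = fields[2] == '1'
    bits = ''.join(binarize(val) for val in fields[first_col:last_col])
    featint = int(bits, 2)
    if not segmental:
        # Featural symbols are stored as negative values, as in the listing
        featint = -featint
    return symbol, variant, segmental, featint, fields[-1].strip()
```

With a helper like this, the loop in main() would read one row, call parse_row, and pass the result to the checks, which is usually easier to name and to test than the inline version.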

Complexity

Complex classes like features_csv_to_dict.main() often do a lot of different things. To break such a class down, we need to identify a cohesive component within that class. A common approach to finding such a component is to look for fields/methods that share the same prefixes or suffixes.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.
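In this script, the local names checkdict, checkset_s and checkset_f share a prefix and are only used together, so they are one candidate for extraction. The sketch below groups them into a small class. The class name DuplicateChecker, its attribute names, and the choice to return warnings instead of writing them to stdout are all assumptions made for illustration.

```python
class DuplicateChecker:
    """Track symbols and feature sets already seen (hypothetical class)."""

    def __init__(self):
        self.seen_symbols = set()
        # Maps a feature integer to the first primary symbol that used it
        self.symbol_by_features = {}

    def check(self, symbol, featint, variant):
        """Record one row and return a list of warning strings for it."""
        warnings = []
        if symbol in self.seen_symbols:
            warnings.append('Symbol ' + symbol + ' appears twice in CSV.')
        else:
            self.seen_symbols.add(symbol)

        # Only primary symbols (variant 0 or 1) must have unique features
        if variant < 2:
            if featint in self.symbol_by_features:
                warnings.append(
                    'Feature set {} appears in CSV for two primary IPA '
                    'symbols: {} and {}'.format(
                        featint, symbol, self.symbol_by_features[featint]
                    )
                )
            else:
                self.symbol_by_features[featint] = symbol
        return warnings
```

Returning the warnings keeps the class free of I/O, so main() can decide where to print them and a test can inspect them directly.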

#!/usr/bin/env python3
# Copyright 2014-2020 by Christopher C. Little.
# This file is part of Abydos.
#
# Abydos is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# Abydos is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with Abydos. If not, see <http://www.gnu.org/licenses/>.

"""features_csv_to_dict.py.

This script converts a CSV document of feature values to a Python dict.

The CSV document is of the format
<phonetic symbol(s)>,<variant number>,<segmental>,<feature>+

    Phonetic symbols are IPA (or other system) symbols.
    Variant number refers to one of the following codes:
         0 = IPA
         1 = Americanist
         2 = IPA variant
         3 = Americanist variant
         9 = other
    Segmental must be either 1 (segmental) or 0 (featural). Featural symbols
    replace features on the preceding segmental symbol.
    Features may be 1 (+), -1 (-), or 0 (0) to indicate feature values.

Lines beginning with # are interpreted as comments
"""

import codecs
import getopt
import sys
import unicodedata


def main(argv):
    """Read input file and write to output.

    Parameters
    ----------
    argv : list
        Arguments to the script

    """
    first_col = 3
    last_col = -1

    def print_usage():
        """Print usage statement."""
        sys.stdout.write(
            'features_csv_to_dict.py -i <inputfile> ' + '[-o <outputfile>]\n'
        )
        sys.exit(2)

    def binarize(num):
        """Replace 0, -1, 1, 2 with 00, 10, 01, 11.

        Parameters
        ----------
        num : str
            The number to binarize

        Returns
        -------
        str
            A binarized number

        """
        if num == '0':  # 0
            return '00'
        elif num == '-1':  # -
            return '10'
        elif num == '1':  # +
            return '01'
        # '2' -> ± (segmental) or copy from base (non-segmental)
        return '11'

    def init_termdicts():
        """Initialize the terms dict.

        Returns
        -------
        (dict, dict)
            Term & feature mask dictionaries

        """
        ifile = codecs.open('features_terms.csv', 'r', 'utf-8')

        feature_mask = {}
        keyline = ifile.readline().strip().split(',')[first_col:last_col]
        mag = len(keyline)
        for i in range(len(keyline)):
            features = '0b' + ('00' * i) + '11' + ('00' * (mag - i - 1))
            feature_mask[keyline[i]] = int(features, 2)

        termdict = {}
        for line in ifile:
            line = line.strip().rstrip(',')
            if '#' in line:
                line = line[: line.find('#')].strip()
            if line:
                line = line.split(',')
                term = line[last_col]
                features = '0b' + ''.join(
                    [binarize(val) for val in line[first_col:last_col]]
                )
                termdict[term] = int(features, 2)

        return termdict, feature_mask

    def check_terms(sym, features, name, termdict):
        """Check terms.

        Check each term of the phone name to confirm that it matches
        the expected features implied by that feature.

        Parameters
        ----------
        sym : str
            Symbol to check
        features : int
            Phone features
        name : str
            Phone name
        termdict : dict
            Dictionary of terms

        """
        if '#' in name:
            name = name[: name.find('#')].strip()
        for term in name.split():
            if term in termdict:
                if termdict[term] & features != termdict[term]:
                    sys.stdout.write(
                        'Feature mismatch for term "'
                        + term
                        + '" in   '
                        + sym
                        + '\n'
                    )
            else:
                sys.stdout.write(
                    'Unknown term "'
                    + term
                    + '" in '
                    + name
                    + ' : '
                    + sym
                    + '\n'
                )

    def check_entailments(sym, features, feature_mask):
        """Check entailments.

        Check for necessary feature assignments (entailments)
        For example, [+round] necessitates [+labial].

        Parameters
        ----------
        sym : str
            Symbol to check
        features : int
            Phone features
        feature_mask : dict
            The feature mask

        """
        entailments = {
            '+labial': ('±round', '±protruded', '±compressed', '±labiodental'),
            '-labial': ('0round', '0protruded', '0compressed', '0labiodental'),
            '+coronal': ('±anterior', '±distributed'),
            '-coronal': ('0anterior', '0distributed'),
            '+dorsal': ('±high', '±low', '±front', '±back', '±tense'),
            '-dorsal': ('0high', '0low', '0front', '0back', '0tense'),
            '+pharyngeal': ('±atr', '±rtr'),
            '-pharyngeal': ('0atr', '0rtr'),
            '+protruded': ('+labial', '+round', '-compressed'),
            '+compressed': ('+labial', '+round', '-protruded'),
            '+glottalic_suction': ('-velaric_suction',),
            '+velaric_suction': ('-glottalic_suction',),
        }

        for feature in entailments:
            fname = feature[1:]
            if feature[0] == '+':
                fm = (feature_mask[fname] >> 1) & feature_mask[fname]
            else:
                fm = (feature_mask[fname] << 1) & feature_mask[fname]
            if (features & fm) == fm:
                for ent in entailments[feature]:
                    ename = ent[1:]
                    if ent[0] == '+':
                        efm = (feature_mask[ename] >> 1) & feature_mask[ename]
                    elif ent[0] == '-':
                        efm = (feature_mask[ename] << 1) & feature_mask[ename]
                    elif ent[0] == '0':
                        efm = 0
                    elif ent[0] == '±':
                        efm = feature_mask[ename]

                    if ent[0] == '±':
                        # Reported issue: The variable efm does not seem to be
                        # defined for all execution paths.
                        if (features & efm) == 0:
                            sys.stdout.write(
                                'Incorrect entailment for '
                                + sym
                                + ' for feature '
                                + fname
                                + ' and entailment '
                                + ename
                            )
                    else:
                        if (features & efm) != efm:
                            sys.stdout.write(
                                'Incorrect entailment for '
                                + sym
                                + ' for feature '
                                + fname
                                + ' and entailment '
                                + ename
                            )

    checkdict = {}  # a mapping of symbol to feature
    checkset_s = set()  # a set of the symbols seen
    checkset_f = set()  # a set of the feature values seen

    termdict, feature_mask = init_termdicts()

    ifile = ''
    ofile = ''
    try:
        opts = getopt.getopt(argv, 'hi:o:', ['ifile=', 'ofile='])[0]
    except getopt.GetoptError:
        print_usage()
    for opt, arg in opts:
        if opt == '-h':
            print_usage()
        elif opt in ('-i', '--ifile'):
            ifile = codecs.open(arg, 'r', 'utf-8')
        elif opt in ('-o', '--ofile'):
            ofile = codecs.open(arg, 'w', 'utf-8')
    if not ifile:
        print_usage()

    oline = 'PHONETIC_FEATURES = {'
    if not ofile:
        ofile = sys.stdout

    ofile.write(oline + '\n')

    keyline = ifile.readline().strip().split(',')[first_col:last_col]
    for line in ifile:
        line = line.strip().rstrip(',')

        if line.startswith('####'):
            break

        line = unicodedata.normalize('NFC', line)

        if not line or line.startswith('#'):
            oline = '                     ' + line

        else:
            line = line.strip().split(',')
            if '#' in line:
                # line is a list here, so use index() rather than str.find()
                line = line[: line.index('#')]
            symbol = line[0]
            variant = int(line[1])
            # bool('0') is True; convert to int first so that 0 means featural
            segmental = bool(int(line[2]))
            features = '0b' + ''.join(
                [binarize(val) for val in line[first_col:last_col]]
            )
            name = line[-1].strip()
            if not segmental:
                features = '-' + features

            featint = int(features, 2)
            check_terms(symbol, featint, name, termdict)
            check_entailments(symbol, featint, feature_mask)
            if symbol in checkset_s:
                sys.stdout.write(
                    'Symbol ' + symbol + ' appears twice in CSV.\n'
                )
            else:
                checkset_s.add(symbol)

            if variant < 2:
                if featint in checkset_f:
                    sys.stdout.write(
                        'Feature set '
                        + str(featint)
                        + ' appears in CSV for two primary IPA '
                        + 'symbols: '
                        + symbol
                        + ' and '
                        + checkdict[featint]
                    )
                else:
                    checkdict[featint] = symbol
                    checkset_f.add(featint)

            if variant < 5:
                oline = "                     '{}': {},".format(
                    symbol, featint
                )
            else:
                oline = ''

        if oline:
            ofile.write(oline + '\n')

    ofile.write('                    }\n\nFEATURE_MASK = {')

    mag = len(keyline)
    for i in range(len(keyline)):
        features = int('0b' + ('00' * i) + '11' + ('00' * (mag - i - 1)), 2)
        oline = "                '{}': {},".format(keyline[i], features)
        ofile.write(oline + '\n')

    ofile.write('               }\n')


if __name__ == '__main__':
    main(sys.argv[1:])
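
The issue reported in check_entailments (efm may be undefined when an entailment starts with a character other than +, -, 0 or ±) could be addressed by moving the mask selection into a helper that either returns a value or fails loudly. This is only a sketch: the name entailment_mask and the decision to raise ValueError are assumptions, not the author's fix.

```python
def entailment_mask(ent, feature_mask):
    """Return the bit mask for an entailment such as '+round' or '0atr'.

    Hypothetical helper: every branch returns or raises, so the caller
    never sees an undefined mask.
    """
    sign, name = ent[0], ent[1:]
    mask = feature_mask[name]
    if sign == '+':
        return (mask >> 1) & mask  # low bit of the two-bit field
    if sign == '-':
        return (mask << 1) & mask  # high bit of the two-bit field
    if sign == '0':
        return 0
    if sign == '±':
        return mask  # both bits of the field
    raise ValueError('Unknown entailment prefix in ' + repr(ent))
```

Inside check_entailments, the if/elif chain that assigns efm would then reduce to a single call, efm = entailment_mask(ent, feature_mask), which also removes several of the conditions counted in the complexity score.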