Passed
Pull Request — master (#559)
by Konstantin
03:44
created

ocrd.processor.base   C

Complexity

Total Complexity 53

Size/Duplication

Total Lines 306
Duplicated Lines 7.84 %

Importance

Changes 0
Metric Value
wmc 53
eloc 184
dl 24
loc 306
rs 6.96
c 0
b 0
f 0

10 Methods

Rating  Name                            Duplication  Size  Complexity
A       Processor.show_version()        0            2     1
A       Processor.process()             0            5     1
B       Processor.__init__()            0            43    8
A       Processor.verify()              0            5     1
A       Processor.show_help()           0            2     1
F       Processor.zip_input_files()     24           109   25
A       Processor.list_all_resources()  0            5     1
C       Processor.resolve_resource()    0            37    11
A       Processor.add_metadata()        0            21    1
A       Processor.input_files()         0            21    3


Duplicated Code

Duplicate code is one of the most pungent code smells. A rule that is often used is to re-structure code once it is duplicated in three or more places.
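In the method table above, Processor.zip_input_files() is the only method that reports duplicated lines: it handles the same four `on_error` strategies in two separate branches. A minimal sketch of how such a repeated branch could be pulled into one function follows. The helper name `resolve_duplicate` and its signature are hypothetical and not part of ocrd.processor.base.

```python
def resolve_duplicate(current, candidate, on_error, message):
    """Decide which file to keep when a page already has a match.

    Hypothetical helper: `current` is the file selected so far,
    `candidate` is the additional match, `message` is the error text
    used when the strategy is 'abort'.
    """
    if on_error == 'skip':
        return None          # drop the page for this fileGrp
    if on_error == 'first':
        return current       # keep the first match
    if on_error == 'last':
        return candidate     # replace with the latest match
    if on_error == 'abort':
        raise ValueError(message)
    raise Exception("Unknown 'on_error' strategy '%s'" % on_error)
```

Each of the two duplicated branches would then shrink to a single assignment of the form `ift[i] = resolve_duplicate(ift[i], file_, on_error, "...")`, with only the error message differing between them.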

Common duplication problems, and corresponding solutions are:

Complexity

Tip: Before tackling complexity, make sure that you eliminate any duplication first. This can often reduce the size of classes significantly.

Complex classes like ocrd.processor.base often do a lot of different things. To break such a class down, we need to identify a cohesive component within that class. A common approach to find such a component is to look for fields/methods that share the same prefixes, or suffixes.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.
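As an illustration of Extract Class, the two methods sharing the `resource` suffix (resolve_resource() and list_all_resources()) could move into a component of their own. The sketch below is only an assumption about how that might look: `ResourceResolver`, `SlimProcessor`, and the `search_dirs` parameter are hypothetical names, and the lookup is reduced to a plain directory search rather than the OCR-D resolution algorithm.

```python
from os.path import exists, join


class ResourceResolver:
    """Hypothetical extracted component holding resource-related behaviour."""

    def __init__(self, executable, search_dirs):
        self.executable = executable
        self.search_dirs = list(search_dirs)

    def candidates(self, name):
        # Candidate paths in lookup order: <dir>/<executable>/<name>
        return [join(d, self.executable, name) for d in self.search_dirs]

    def resolve(self, name):
        # Return the first candidate that exists on disk
        for cand in self.candidates(name):
            if exists(cand):
                return cand
        raise FileNotFoundError("Could not resolve resource '%s'" % name)


class SlimProcessor:
    """Hypothetical processor that delegates instead of resolving itself."""

    def __init__(self, executable, search_dirs):
        self.resources = ResourceResolver(executable, search_dirs)

    def resolve_resource(self, name):
        return self.resources.resolve(name)
```

The processor keeps its public method, so callers are unaffected, while the lookup logic can be tested and changed in isolation.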

"""
Processor base class and helper functions
"""

__all__ = [
    'Processor',
    'generate_processor_help',
    'run_cli',
    'run_processor'
]

from os import makedirs
from os.path import exists, isdir, join
from shutil import copyfileobj
import json
import os
import re
from pkg_resources import resource_filename

import requests

from ocrd_utils import (
    VERSION as OCRD_VERSION,
    MIMETYPE_PAGE,
    getLogger,
    list_resource_candidates,
    list_all_resources,
    XDG_CACHE_HOME
)
from ocrd_validators import ParameterValidator
from ocrd_models.ocrd_page import MetadataItemType, LabelType, LabelsType

# XXX imports must remain for backwards-compatibility
from .helpers import run_cli, run_processor, generate_processor_help # pylint: disable=unused-import

class Processor():
    """
    A processor is an OCR-D compliant command-line-interface for executing
    a single workflow step on the workspace (represented by local METS). It
    reads input files for all or requested physical pages of the input fileGrp(s),
    and writes output files for them into the output fileGrp(s). It may take
    a number of optional or mandatory parameters.
    """

    def __init__(
            self,
            workspace,
            ocrd_tool=None,
            parameter=None,
            # TODO OCR-D/core#274
            # input_file_grp=None,
            # output_file_grp=None,
            input_file_grp="INPUT",
            output_file_grp="OUTPUT",
            page_id=None,
            show_help=False,
            show_version=False,
            dump_json=False,
            version=None
    ):
        if parameter is None:
            parameter = {}
        if dump_json:
            print(json.dumps(ocrd_tool, indent=True))
            return
        self.ocrd_tool = ocrd_tool
        if show_help:
            self.show_help()
            return
        self.version = version
        if show_version:
            self.show_version()
            return
        self.workspace = workspace
        # FIXME HACK would be better to use pushd_popd(self.workspace.directory)
        # but there is no way to do that in process here since it's an
        # overridden method. chdir is almost always an anti-pattern.
        if self.workspace:
            os.chdir(self.workspace.directory)
        self.input_file_grp = input_file_grp
        self.output_file_grp = output_file_grp
        self.page_id = None if page_id == [] or page_id is None else page_id
        parameterValidator = ParameterValidator(ocrd_tool)
        report = parameterValidator.validate(parameter)
        if not report.is_valid:
            raise Exception("Invalid parameters %s" % report.errors)
        self.parameter = parameter

    def show_help(self):
        print(generate_processor_help(self.ocrd_tool, processor_instance=self))

    def show_version(self):
        print("Version %s, ocrd/core %s" % (self.version, OCRD_VERSION))

    def verify(self):
        """
        Verify that the input fulfills the processor's requirements.
        """
        return True

    def process(self):
        """
        Process the workspace
        """
        raise Exception("Must be implemented")

    def add_metadata(self, pcgts):
        """
        Adds PAGE-XML MetadataItem describing the processing step
        """
        pcgts.get_Metadata().add_MetadataItem(
                MetadataItemType(type_="processingStep",
                    name=self.ocrd_tool['steps'][0],
                    value=self.ocrd_tool['executable'],
                    Labels=[LabelsType(
                        externalModel="ocrd-tool",
                        externalId="parameters",
                        Label=[LabelType(type_=name,
                                         value=self.parameter[name])
                               for name in self.parameter.keys()]),
                            LabelsType(
                        externalModel="ocrd-tool",
                        externalId="version",
                        Label=[LabelType(type_=self.ocrd_tool['executable'],
                                         value=self.version),
                               LabelType(type_='ocrd/core',
                                         value=OCRD_VERSION)])
                    ]))

    def resolve_resource(self, parameter_name, val):
        """
        Resolve a resource name with the algorithm in
        https://ocr-d.de/en/spec/ocrd_tool#file-parameters

        Args:
            parameter_name (string): name of parameter to resolve resource for
            val (string): resource value to resolve
        """
        executable = self.ocrd_tool['executable']
        try:
            param = self.ocrd_tool['parameter'][parameter_name]
        except KeyError:
            raise ValueError("Parameter '%s' not defined in ocrd-tool.json" % parameter_name)
        # use .get() so that a missing 'mimetype' key yields the ValueError below
        # instead of an unrelated KeyError
        if not param.get('mimetype'):
            raise ValueError("Parameter '%s' is not a file parameter (has no 'mimetype' field)" %
                             parameter_name)
        if val.startswith('http:') or val.startswith('https:'):
            cache_dir = join(XDG_CACHE_HOME, executable)
            cache_key = re.sub('[^A-Za-z0-9]', '', val)
            cache_fpath = join(cache_dir, cache_key)
            # TODO Proper caching (make head request for size, If-Modified etc)
            if not exists(cache_fpath):
                if not isdir(cache_dir):
                    makedirs(cache_dir)
                with requests.get(val, stream=True) as r:
                    with open(cache_fpath, 'wb') as f:
                        copyfileobj(r.raw, f)
            return cache_fpath
        # next() requires an iterator (a list raises TypeError) and a default
        # for the case that no candidate exists
        ret = next((cand for cand in list_resource_candidates(executable, val) if exists(cand)), None)
        if ret:
            return ret
        bundled_fpath = resource_filename(__name__, val)
        if exists(bundled_fpath):
            return bundled_fpath
        raise FileNotFoundError("Could not resolve '%s' file parameter value '%s'" %
                                (parameter_name, val))

    def list_all_resources(self):
        """
        List all resources found in the filesystem
        """
        return list_all_resources(self.ocrd_tool['executable'])

    @property
    def input_files(self):
        """
        List the input files (for single input file groups).

        For each physical page:
        - If there is a single PAGE-XML for the page, take it (and forget about all
          other files for that page)
        - Else if there is a single image file, take it (and forget about all other
          files for that page)
        - Otherwise raise an error (complaining that only PAGE-XML warrants
          having multiple images for a single page)
        (https://github.com/cisocrgroup/ocrd_cis/pull/57#issuecomment-656336593)
        """
        if not self.input_file_grp:
            raise ValueError("Processor is missing input fileGrp")
        ret = self.zip_input_files(mimetype=None, on_error='abort')
        if not ret:
            return []
        assert len(ret[0]) == 1, 'Use zip_input_files() instead of input_files when processing multiple input fileGrps'
        return [tuples[0] for tuples in ret]

    def zip_input_files(self, require_first=True, mimetype=None, on_error='skip'):
        """
        List tuples of input files (for multiple input file groups).

        Processors that expect/need multiple input file groups,
        cannot use ``input_files``. They must align (zip) input files
        across pages. This includes the case where not all pages
        are equally present in all file groups. It also requires
        making a consistent selection if there are multiple files
        per page.

        Following the OCR-D functional model, this function tries to
        find a single PAGE file per page, or fall back to a single
        image file per page. In either case, multiple matches per page
        are an error (see error handling below).
        This default behaviour can be changed by using a fixed MIME
        type filter via ``mimetype``. But still, multiple matching
        files per page are an error.

        Single-page multiple-file errors are handled according to
        ``on_error``:
        - if ``skip``, then the page for the respective fileGrp will be
          silently skipped (as if there was no match at all)
        - if ``first``, then the first matching file for the page will be
          silently selected (as if the first was the only match)
        - if ``last``, then the last matching file for the page will be
          silently selected (as if the last was the only match)
        - if ``abort``, then an exception will be raised.
        Multiple matches for PAGE-XML will always raise an exception.

        Args:
            require_first (bool): If true, then skip a page entirely
                whenever it is not available in the first input fileGrp.
            mimetype (str): If not None, filter by the specified MIME
                type (literal or regex prefixed by ``//``).
                Otherwise prefer PAGE or image.
            on_error (str): Strategy for multiple matches per page
                (``skip``, ``first``, ``last`` or ``abort``, see above).
        """
        if not self.input_file_grp:
            raise ValueError("Processor is missing input fileGrp")

        LOG = getLogger('ocrd.processor.base')
        ifgs = self.input_file_grp.split(",")
        # Iterating over all files repeatedly may seem inefficient at first sight,
        # but the unnecessary OcrdFile instantiations for posterior fileGrp filtering
        # can actually be much more costly than traversing the ltree.
        # This might depend on the number of pages vs number of fileGrps.

        pages = dict()
        for i, ifg in enumerate(ifgs):
            for file_ in sorted(self.workspace.mets.find_all_files(
                    pageId=self.page_id, fileGrp=ifg, mimetype=mimetype),
                                # sort by MIME type so PAGE comes before images
                                key=lambda file_: file_.mimetype):
                if not file_.pageId:
                    continue
                ift = pages.setdefault(file_.pageId, [None]*len(ifgs))
                if ift[i]:
                    LOG.debug("another file %s for page %s in input file group %s", file_.ID, file_.pageId, ifg)
                    # fileGrp has multiple files for this page ID
                    if mimetype:
                        # filter was active, this must not happen
                        if on_error == 'skip':
                            ift[i] = None
                        elif on_error == 'first':
                            pass # keep first match
                        elif on_error == 'last':
                            ift[i] = file_
                        elif on_error == 'abort':
                            raise ValueError(
                                "Multiple '%s' matches for page '%s' in fileGrp '%s'." % (
                                    mimetype, file_.pageId, ifg))
                        else:
                            raise Exception("Unknown 'on_error' strategy '%s'" % on_error)
                    elif (ift[i].mimetype == MIMETYPE_PAGE and
                          file_.mimetype != MIMETYPE_PAGE):
                        pass # keep PAGE match
                    elif (ift[i].mimetype == MIMETYPE_PAGE and
                          file_.mimetype == MIMETYPE_PAGE):
                        raise ValueError(
                            "Multiple PAGE-XML matches for page '%s' in fileGrp '%s'." % (
                                file_.pageId, ifg))
                    else:
                        # filter was inactive but no PAGE is in control, this must not happen
                        if on_error == 'skip':
                            ift[i] = None
                        elif on_error == 'first':
                            pass # keep first match
                        elif on_error == 'last':
                            ift[i] = file_
                        elif on_error == 'abort':
                            raise ValueError(
                                "No PAGE-XML for page '%s' in fileGrp '%s' but multiple matches." % (
                                    file_.pageId, ifg))
                        else:
                            raise Exception("Unknown 'on_error' strategy '%s'" % on_error)
                else:
                    LOG.debug("adding file %s for page %s to input file group %s", file_.ID, file_.pageId, ifg)
                    ift[i] = file_
        ifts = list()
        for page, ifiles in pages.items():
            for i, ifg in enumerate(ifgs):
                if not ifiles[i]:
                    # other fallback options?
                    LOG.error('found no page %s in file group %s',
                              page, ifg)
            if ifiles[0] or not require_first:
                ifts.append(tuple(ifiles))
        return ifts
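
The page alignment performed by zip_input_files() can be tried out without a workspace using the simplified sketch below. It is not the method above: files are plain `(page_id, mimetype, file_id)` tuples (an assumed layout), the function name `zip_pages` is hypothetical, the `MIMETYPE_PAGE` value is assumed to match the constant in ocrd_utils, and every ambiguous page raises an error (roughly the `abort` strategy), so the other `on_error` strategies are left out.

```python
# Assumed to equal ocrd_utils.MIMETYPE_PAGE
MIMETYPE_PAGE = 'application/vnd.prima.page+xml'


def zip_pages(groups, require_first=True):
    """Align per-fileGrp file lists by page ID (simplified sketch).

    `groups` holds one list per fileGrp; each entry is a
    (page_id, mimetype, file_id) tuple.
    """
    pages = {}
    for i, files in enumerate(groups):
        # Sorting by MIME type puts 'application/...' (PAGE-XML) before 'image/...'
        for page_id, mimetype, file_id in sorted(files, key=lambda f: f[1]):
            slots = pages.setdefault(page_id, [None] * len(groups))
            if slots[i] is None:
                slots[i] = (mimetype, file_id)
            elif slots[i][0] == MIMETYPE_PAGE and mimetype != MIMETYPE_PAGE:
                pass  # a PAGE-XML file was already chosen, ignore the image
            else:
                raise ValueError("Multiple matches for page '%s'" % page_id)
    # One tuple per page, with None where a fileGrp has no file for it
    return [tuple(slot[1] if slot else None for slot in slots)
            for slots in pages.values()
            if slots[0] or not require_first]
```

With a first group holding an image and a PAGE-XML file for page `p1` plus an image for `p2`, and a second group holding only a PAGE-XML file for `p1`, the function pairs the two PAGE-XML files for `p1` and yields the `p2` image together with `None`.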