Passed
Pull Request — master (#875)
by Konstantin
12:01
created

ocrd_models.ocrd_mets.OcrdMets.add_file()   D

Complexity

Conditions 12

Size

Total Lines 60
Code Lines 27

Duplication

Lines 0
Ratio 0 %

Importance

Changes 0
Metric Value
eloc 27
dl 0
loc 60
rs 4.8
c 0
b 0
f 0
cc 12
nop 10

How to fix: Long Method, Complexity, Many Parameters

Long Method

Small methods make your code easier to understand, particularly when combined with a good name. Besides, a good name is usually much easier to find for a small method.

For example, if you find yourself adding comments to a method's body, that is usually a good sign that the commented part should be extracted into a new method, with the comment as a starting point for the new method's name.
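As a minimal sketch of this kind of extraction, the argument checks at the top of add_file() could move into a helper of their own. The helper name validate_file_args and the XS_ID pattern are assumptions made for illustration; XS_ID only stands in for ocrd_utils.REGEX_FILE_ID, whose real definition may differ.

```python
import re

# Stand-in for ocrd_utils.REGEX_FILE_ID; the real pattern may differ (assumption).
XS_ID = re.compile(r'[A-Za-z_][\w.-]*')


def validate_file_args(ID, fileGrp):
    """Hypothetical helper holding the argument checks extracted from add_file()."""
    if not ID:
        raise ValueError("Must set ID of the mets:file")
    if not fileGrp:
        raise ValueError("Must set fileGrp of the mets:file")
    if not XS_ID.fullmatch(ID):
        raise ValueError("Invalid syntax for mets:file/@ID %s (not an xs:ID)" % ID)
    if not XS_ID.fullmatch(fileGrp):
        raise ValueError("Invalid syntax for mets:fileGrp/@USE %s (not an xs:ID)" % fileGrp)


def add_file(fileGrp, ID=None):
    """Simplified caller: the body now opens with a single descriptive call."""
    validate_file_args(ID, fileGrp)
    return (fileGrp, ID)  # placeholder for the real element creation
```

The extracted helper removes four of the conditions counted against add_file() and can be tested on its own.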

Commonly applied refactorings include:

Complexity

Complex methods like ocrd_models.ocrd_mets.OcrdMets.add_file() usually belong to classes that do a lot of different things. To break such a class down, we need to identify a cohesive component within it. A common approach to finding such a component is to look for fields or methods that share the same prefix or suffix.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.
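In the listing further down, the fields _file_cache, _page_cache and _fptr_cache share the _cache suffix, which makes them a plausible candidate. The following is a rough sketch of Extract Class for the file cache only; the class name FileCache and all of its methods are hypothetical and not part of ocrd_models.

```python
class FileCache:
    """Hypothetical component: maps fileGrp USE -> {file ID -> element}."""

    def __init__(self):
        self._groups = {}

    def add_group(self, use):
        # Leave an existing group untouched so repeated calls are harmless.
        self._groups.setdefault(use, {})

    def add_file(self, use, file_id, element):
        self.add_group(use)
        self._groups[use][file_id] = element

    def remove_file(self, use, file_id):
        del self._groups[use][file_id]

    def rename_group(self, old, new):
        self._groups[new] = self._groups.pop(old)

    def group_names(self):
        return list(self._groups)

    def files_in(self, use):
        # Return a copy so callers can remove files while iterating.
        return list(self._groups.get(use, {}).values())


cache = FileCache()
cache.add_file('OCR-D-IMG', 'FILE_0001', object())
cache.rename_group('OCR-D-IMG', 'OCR-D-IMG-BIN')
```

With such a component, the owning class would delegate to the cache object instead of repeating "if self._cache_flag:" bookkeeping in every method.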

Many Parameters

Methods with many parameters are not only hard to understand; their parameter lists also tend to become inconsistent once you need more or different data.
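One way to shorten such a list is to bundle related arguments into a parameter object. The sketch below is only an illustration: FileSpec and its as_attributes() method are hypothetical names, and add_file() is reduced to a stand-in that returns what it would store.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FileSpec:
    """Hypothetical parameter object bundling the per-file arguments of add_file()."""
    ID: str
    mimetype: Optional[str] = None
    url: Optional[str] = None
    pageId: Optional[str] = None
    local_filename: Optional[str] = None

    def as_attributes(self):
        # Drop unset fields, mirroring the "v is not None" filter in the listing.
        return {k: v for k, v in vars(self).items() if v is not None}


def add_file(fileGrp, spec, force=False, ignore=False):
    """Simplified stand-in with a shorter signature than the original."""
    return fileGrp, spec.as_attributes()
```

A call then reads add_file('OCR-D-IMG', FileSpec(ID='FILE_0001', mimetype='image/tiff')), and adding a new per-file field no longer changes the method signature.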

There are several approaches to avoid long parameter lists:

"""
API to METS
"""
from datetime import datetime
import re
import typing
from lxml import etree as ET
from copy import deepcopy

from ocrd_utils import (
    is_local_filename,
    getLogger,
    generate_range,
    VERSION,
    REGEX_PREFIX,
    REGEX_FILE_ID
)

from .constants import (
    NAMESPACES as NS,
    TAG_METS_AGENT,
    TAG_METS_DIV,
    TAG_METS_FILE,
    TAG_METS_FILEGRP,
    TAG_METS_FILESEC,
    TAG_METS_FPTR,
    TAG_METS_METSHDR,
    TAG_METS_STRUCTMAP,
    IDENTIFIER_PRIORITY,
    TAG_MODS_IDENTIFIER,
    METS_XML_EMPTY,
)

from .ocrd_xml_base import OcrdXmlDocument, ET
from .ocrd_file import OcrdFile
from .ocrd_agent import OcrdAgent

REGEX_PREFIX_LEN = len(REGEX_PREFIX)

class OcrdMets(OcrdXmlDocument):
    """
    API to a single METS file
    """

    @staticmethod
    def empty_mets(now=None, cache_flag=False):
        """
        Create an empty METS file from bundled template.
        """
        if not now:
            now = datetime.now().isoformat()
        tpl = METS_XML_EMPTY.decode('utf-8')
        tpl = tpl.replace('{{ VERSION }}', VERSION)
        tpl = tpl.replace('{{ NOW }}', '%s' % now)
        return OcrdMets(content=tpl.encode('utf-8'), cache_flag=cache_flag)

    def __init__(self, **kwargs):
        """
        Initialize the document and, if caching is enabled, set up and fill the caches.
        """
        super(OcrdMets, self).__init__(**kwargs)

        # If cache is enabled
        if self._cache_flag:

            # Cache for the files (mets:file) - two nested dictionaries
            # The outer dictionary's Key: 'fileGrp.USE'
            # The outer dictionary's Value: Inner dictionary
            # The inner dictionary's Key: 'file.ID'
            # The inner dictionary's Value: a 'file' object at some memory location
            self._file_cache = {}

            # Cache for the pages (mets:div)
            # The dictionary's Key: 'div.ID'
            # The dictionary's Value: a 'div' object at some memory location
            self._page_cache = {}

            # Cache for the file pointers (mets:fptr) - two nested dictionaries
            # The outer dictionary's Key: 'div.ID'
            # The outer dictionary's Value: Inner dictionary
            # The inner dictionary's Key: 'fptr.FILEID'
            # The inner dictionary's Value: a 'fptr' object at some memory location
            self._fptr_cache = {}

            # Note, if the empty_mets() function is used to instantiate OcrdMets
            # Then the cache is empty even after this operation
            self._fill_caches()

    # The context-manager protocol calls __exit__ with three extra arguments;
    # the defaults keep direct calls without arguments working.
    def __exit__(self, exc_type=None, exc_value=None, traceback=None):
        """
        Release the caches, if caching is enabled.
        """
        if self._cache_flag:
            self._clear_caches()

    def __str__(self):
        """
        String representation
        """
        return 'OcrdMets[cached=%s,fileGrps=%s,files=%s]' % (self._cache_flag, self.file_groups, list(self.find_files()))

    def _fill_caches(self):
        """
        Fills the caches with fileGrps and FileIDs
        """

        tree_root = self._tree.getroot()

        # Fill with files
        el_fileSec = tree_root.find("mets:fileSec", NS)
        if el_fileSec is None:
            return

        log = getLogger('ocrd_models.ocrd_mets._fill_caches-files')

        for el_fileGrp in el_fileSec.findall('mets:fileGrp', NS):
            fileGrp_use = el_fileGrp.get('USE')

            # Assign an empty dictionary that will hold the files of the added fileGrp
            self._file_cache[fileGrp_use] = {}

            for el_file in el_fileGrp:
                file_id = el_file.get('ID')
                self._file_cache[fileGrp_use][file_id] = el_file
                # log.info("File added to the cache: %s" % file_id)

        # Fill with pages
        el_div_list = tree_root.findall(".//mets:div[@TYPE='page']", NS)
        if len(el_div_list) == 0:
            return
        log = getLogger('ocrd_models.ocrd_mets._fill_caches-pages')

        for el_div in el_div_list:
            div_id = el_div.get('ID')
            log.debug("DIV_ID: %s" % el_div.get('ID'))

            self._page_cache[div_id] = el_div

            # Assign an empty dictionary that will hold the fptr of the added page (div)
            self._fptr_cache[div_id] = {}

            # log.info("Page_id added to the cache: %s" % div_id)

            for el_fptr in el_div:
                self._fptr_cache[div_id][el_fptr.get('FILEID')] = el_fptr
                # log.info("Fptr added to the cache: %s" % el_fptr.get('FILEID'))

        # log.info("Len of page_cache: %s" % len(self._page_cache))
        # log.info("Len of fptr_cache: %s" % len(self._fptr_cache))

    def _clear_caches(self):
        """
        Deallocates the caches
        """

        self._file_cache = None
        self._page_cache = None
        self._fptr_cache = None

    @property
    def unique_identifier(self):
        """
        Get the unique identifier by looking through ``mods:identifier``
        See `specs <https://ocr-d.de/en/spec/mets#unique-id-for-the-document-processed>`_ for details.
        """
        for t in IDENTIFIER_PRIORITY:
            found = self._tree.getroot().find('.//mods:identifier[@type="%s"]' % t, NS)
            if found is not None:
                return found.text

    @unique_identifier.setter
    def unique_identifier(self, purl):
        """
        Set the unique identifier by looking through ``mods:identifier``
        See `specs <https://ocr-d.de/en/spec/mets#unique-id-for-the-document-processed>`_ for details.
        """
        id_el = None
        for t in IDENTIFIER_PRIORITY:
            id_el = self._tree.getroot().find('.//mods:identifier[@type="%s"]' % t, NS)
            if id_el is not None:
                break
        if id_el is None:
            mods = self._tree.getroot().find('.//mods:mods', NS)
            id_el = ET.SubElement(mods, TAG_MODS_IDENTIFIER)
            id_el.set('type', 'purl')
        id_el.text = purl

    @property
    def agents(self):
        """
        List all :py:class:`ocrd_models.ocrd_agent.OcrdAgent`s
        """
        return [OcrdAgent(el_agent) for el_agent in self._tree.getroot().findall('mets:metsHdr/mets:agent', NS)]

    def add_agent(self, *args, **kwargs):
        """
        Add an :py:class:`ocrd_models.ocrd_agent.OcrdAgent` to the list of agents in the ``metsHdr``.
        """
        el_metsHdr = self._tree.getroot().find('.//mets:metsHdr', NS)
        if el_metsHdr is None:
            el_metsHdr = ET.Element(TAG_METS_METSHDR)
            self._tree.getroot().insert(0, el_metsHdr)
        #  assert(el_metsHdr is not None)
        el_agent = ET.SubElement(el_metsHdr, TAG_METS_AGENT)
        #  print(ET.tostring(el_metsHdr))
        return OcrdAgent(el_agent, *args, **kwargs)

    @property
    def file_groups(self):
        """
        List the `@USE` of all `mets:fileGrp` entries.
        """

        # WARNING: Actually we cannot return strings in place of elements!
        if self._cache_flag:
            return list(self._file_cache.keys())

        return [el.get('USE') for el in self._tree.getroot().findall('.//mets:fileGrp', NS)]

    def find_all_files(self, *args, **kwargs):
        """
        Like :py:meth:`find_files` but return a list of all results.
        Equivalent to ``list(self.find_files(...))``
        """
        return list(self.find_files(*args, **kwargs))

    # pylint: disable=multiple-statements
    def find_files(self, ID=None, fileGrp=None, pageId=None, mimetype=None, url=None, local_only=False):
        """
        Search ``mets:file`` entries in this METS document and yield results.
        The :py:attr:`ID`, :py:attr:`pageId`, :py:attr:`fileGrp`,
        :py:attr:`url` and :py:attr:`mimetype` parameters can each be either a
        literal string, or a regular expression if the string starts with
        ``//`` (double slash).
        If it is a regex, the leading ``//`` is removed and candidates are matched
        against the regex with `re.fullmatch`. If it is a literal string, comparison
        is done with string equality.
        The :py:attr:`pageId` parameter supports the numeric range operator ``..``. For
        example, to find all files in pages ``PHYS_0001`` to ``PHYS_0003``,
        ``PHYS_0001..PHYS_0003`` will be expanded to ``PHYS_0001,PHYS_0002,PHYS_0003``.
        Keyword Args:
            ID (string) : ``@ID`` of the ``mets:file``
            fileGrp (string) : ``@USE`` of the ``mets:fileGrp`` to list files of
            pageId (string) : ``@ID`` of the corresponding physical ``mets:structMap`` entry (physical page)
            url (string) : ``@xlink:href`` (URL or path) of ``mets:FLocat`` of ``mets:file``
            mimetype (string) : ``@MIMETYPE`` of ``mets:file``
            local_only (boolean) : Whether to restrict results to local files in the filesystem
        Yields:
            :py:class:`ocrd_models.ocrd_file.OcrdFile` instantiations
        """
        pageId_list = []
        if pageId:
            pageId_patterns = []
            for pageId_token in re.split(r',', pageId):
                if pageId_token.startswith(REGEX_PREFIX):
                    pageId_patterns.append(re.compile(pageId_token[REGEX_PREFIX_LEN:]))
                elif '..' in pageId_token:
                    pageId_patterns += generate_range(*pageId_token.split('..', 1))
                else:
                    pageId_patterns += [pageId_token]
            if self._cache_flag:
                for page_id in self._page_cache.keys():
                    if page_id in pageId_patterns or \
                        any(isinstance(p, typing.Pattern) and p.fullmatch(page_id) for p in pageId_patterns):
                        # Iterating the inner dictionary yields its keys, i.e. the file IDs
                        pageId_list += self._fptr_cache[page_id]
            else:
                for page in self._tree.getroot().xpath(
                    '//mets:div[@TYPE="page"]', namespaces=NS):
                    if page.get('ID') in pageId_patterns or \
                        any(isinstance(p, typing.Pattern) and p.fullmatch(page.get('ID')) for p in pageId_patterns):
                        pageId_list += [fptr.get('FILEID') for fptr in page.findall('mets:fptr', NS)]

        if ID and ID.startswith(REGEX_PREFIX):
            ID = re.compile(ID[REGEX_PREFIX_LEN:])
        if fileGrp and fileGrp.startswith(REGEX_PREFIX):
            fileGrp = re.compile(fileGrp[REGEX_PREFIX_LEN:])
        if mimetype and mimetype.startswith(REGEX_PREFIX):
            mimetype = re.compile(mimetype[REGEX_PREFIX_LEN:])
        if url and url.startswith(REGEX_PREFIX):
            url = re.compile(url[REGEX_PREFIX_LEN:])

        candidates = []
        if self._cache_flag:
            if fileGrp:
                if isinstance(fileGrp, str):
                    candidates += self._file_cache.get(fileGrp, {}).values()
                else:
                    # fullmatch, as documented above and as used in the uncached branch below
                    candidates = [x for fileGrp_needle, el_file_list in self._file_cache.items() if fileGrp.fullmatch(fileGrp_needle) for x in el_file_list.values()]
            else:
                candidates = [el_file for id_to_file in self._file_cache.values() for el_file in id_to_file.values()]
        else:
            candidates = self._tree.getroot().xpath('//mets:file', namespaces=NS)

        for cand in candidates:
            if ID:
                if isinstance(ID, str):
                    if not ID == cand.get('ID'): continue
                else:
                    if not ID.fullmatch(cand.get('ID')): continue

            if pageId is not None and cand.get('ID') not in pageId_list:
                continue

            if not self._cache_flag and fileGrp:
                if isinstance(fileGrp, str):
                    if cand.getparent().get('USE') != fileGrp: continue
                else:
                    if not fileGrp.fullmatch(cand.getparent().get('USE')): continue

            if mimetype:
                if isinstance(mimetype, str):
                    if cand.get('MIMETYPE') != mimetype: continue
                else:
                    if not mimetype.fullmatch(cand.get('MIMETYPE') or ''): continue

            if url:
                cand_locat = cand.find('mets:FLocat', namespaces=NS)
                if cand_locat is None:
                    continue
                cand_url = cand_locat.get('{%s}href' % NS['xlink'])
                if isinstance(url, str):
                    if cand_url != url: continue
                else:
                    # Guard against a missing href, as done for MIMETYPE above
                    if not url.fullmatch(cand_url or ''): continue

            # Note: why we instantiate a class only to find out that the local_only is set afterwards
            # Checking local_only and url before instantiation should be better?
            f = OcrdFile(cand, mets=self)

            # If only local resources should be returned and f is not a file path: skip the file
            if local_only and not is_local_filename(f.url):
                continue
            yield f

    def add_file_group(self, fileGrp):
        """
        Add a new ``mets:fileGrp``.
        Arguments:
            fileGrp (string): ``@USE`` of the new ``mets:fileGrp``.
        """
        if ',' in fileGrp:
            raise Exception('fileGrp must not contain commas')
        el_fileSec = self._tree.getroot().find('mets:fileSec', NS)
        if el_fileSec is None:
            el_fileSec = ET.SubElement(self._tree.getroot(), TAG_METS_FILESEC)
        el_fileGrp = el_fileSec.find('mets:fileGrp[@USE="%s"]' % fileGrp, NS)
        if el_fileGrp is None:
            el_fileGrp = ET.SubElement(el_fileSec, TAG_METS_FILEGRP)
            el_fileGrp.set('USE', fileGrp)

            if self._cache_flag:
                # Assign an empty dictionary that will hold the files of the added fileGrp
                self._file_cache[fileGrp] = {}

        return el_fileGrp

    def rename_file_group(self, old, new):
        """
        Rename a ``mets:fileGrp`` by changing the ``@USE`` from :py:attr:`old` to :py:attr:`new`.
        """
        el_fileGrp = self._tree.getroot().find('mets:fileSec/mets:fileGrp[@USE="%s"]' % old, NS)
        if el_fileGrp is None:
            raise FileNotFoundError("No such fileGrp '%s'" % old)
        el_fileGrp.set('USE', new)

        if self._cache_flag:
            self._file_cache[new] = self._file_cache.pop(old)

    def remove_file_group(self, USE, recursive=False, force=False):
        """
        Remove a ``mets:fileGrp`` (single fixed ``@USE`` or multiple regex ``@USE``)
        Arguments:
            USE (string): ``@USE`` of the ``mets:fileGrp`` to delete. Can be a regex if prefixed with ``//``
            recursive (boolean): Whether to recursively delete each ``mets:file`` in the group
            force (boolean): Do not raise an exception if ``mets:fileGrp`` does not exist
        """
        log = getLogger('ocrd_models.ocrd_mets.remove_file_group')
        el_fileSec = self._tree.getroot().find('mets:fileSec', NS)
        if el_fileSec is None:
            raise Exception("No fileSec!")
        if isinstance(USE, str):
            if USE.startswith(REGEX_PREFIX):
                use = re.compile(USE[REGEX_PREFIX_LEN:])
                for cand in el_fileSec.findall('mets:fileGrp', NS):
                    if use.fullmatch(cand.get('USE')):
                        self.remove_file_group(cand, recursive=recursive)
                return
            else:
                el_fileGrp = el_fileSec.find('mets:fileGrp[@USE="%s"]' % USE, NS)
        else:
            el_fileGrp = USE
        if el_fileGrp is None:   # pylint: disable=len-as-condition
            msg = "No such fileGrp: %s" % USE
            if force:
                log.warning(msg)
                return
            raise Exception(msg)

        fileGrp_use = el_fileGrp.get('USE')

        # The cache should also be used here
        if self._cache_flag:
            # Copy into a list: remove_one_file() deletes from this dictionary while we iterate
            files = list(self._file_cache.get(fileGrp_use, {}).values())
        else:
            files = el_fileGrp.findall('mets:file', NS)

        if files:
            if not recursive:
                raise Exception("fileGrp %s is not empty and recursive wasn't set" % USE)
            for f in files:
                # @USE lives on the mets:fileGrp, not on the mets:file, so pass the group's value
                self.remove_one_file(ID=f.get('ID'), fileGrp=fileGrp_use)
                # NOTE2: Since remove_one_file also takes OcrdFile, we could just pass the file
                # self.remove_one_file(f)

        if self._cache_flag:
            # Note: Since the files inside the group are removed
            # with the 'remove_one_file' method above,
            # we should not take care of that again.
            # We just remove the fileGrp.
            del self._file_cache[fileGrp_use]

        el_fileGrp.getparent().remove(el_fileGrp)

    def add_file(self, fileGrp, mimetype=None, url=None, ID=None, pageId=None, force=False, local_filename=None, ignore=False, **kwargs):
        """
        Instantiate and add a new :py:class:`ocrd_models.ocrd_file.OcrdFile`.
        Arguments:
            fileGrp (string): ``@USE`` of ``mets:fileGrp`` to add to
        Keyword Args:
            mimetype (string): ``@MIMETYPE`` of the ``mets:file`` to use
            url (string): ``@xlink:href`` (URL or path) of the ``mets:file`` to use
            ID (string): ``@ID`` of the ``mets:file`` to use
            pageId (string): ``@ID`` in the physical ``mets:structMap`` to link to
            force (boolean): Whether to add the file even if a ``mets:file`` with the same ``@ID`` already exists.
            ignore (boolean): Do not look for existing files at all. Shift responsibility for preventing errors from duplicate ID to the user.
            local_filename (string):
        """
        if not ID:
            raise ValueError("Must set ID of the mets:file")
        if not fileGrp:
            raise ValueError("Must set fileGrp of the mets:file")
        if not REGEX_FILE_ID.fullmatch(ID):
            raise ValueError("Invalid syntax for mets:file/@ID %s (not an xs:ID)" % ID)
        if not REGEX_FILE_ID.fullmatch(fileGrp):
            raise ValueError("Invalid syntax for mets:fileGrp/@USE %s (not an xs:ID)" % fileGrp)
        log = getLogger('ocrd_models.ocrd_mets.add_file')

        # Note: we do not benefit enough from having
        # a separate cache for fileGrp elements
        #
        # if self._cache_flag:
        #     if fileGrp in self._fileGrp_cache:
        #         el_fileGrp = self._fileGrp_cache[fileGrp]

        el_fileGrp = self.add_file_group(fileGrp)
        if not ignore:
            # Since we are sure that fileGrp parameter is set,
            # we could send that parameter to find_files for direct search
            mets_file = next(self.find_files(ID=ID, fileGrp=fileGrp), None)
            if mets_file:
                if mets_file.fileGrp == fileGrp and \
                   mets_file.pageId == pageId and \
                   mets_file.mimetype == mimetype:
                    if not force:
                        raise FileExistsError(f"A file with ID=={ID} already exists {mets_file} and neither force nor ignore are set")
                    self.remove_file(ID=ID, fileGrp=fileGrp)
                else:
                    raise FileExistsError(f"A file with ID=={ID} already exists {mets_file} but unrelated - cannot mitigate")

        # To get rid of Python's FutureWarning - checking if v is not None
        kwargs = {k: v for k, v in locals().items() if k in ['url', 'ID', 'mimetype', 'pageId', 'local_filename'] and v is not None}
        # This separation is needed to reuse the same el_mets_file element in the caching if block
        el_mets_file = ET.SubElement(el_fileGrp, TAG_METS_FILE)
        # The caching of the physical page is done in the OcrdFile constructor
        mets_file = OcrdFile(el_mets_file, mets=self, **kwargs)

        if self._cache_flag:
            # Add the file to the file cache
            self._file_cache[fileGrp].update({ID: el_mets_file})

        return mets_file

    def remove_file(self, *args, **kwargs):
        """
        Delete each ``mets:file`` matching the query. Same arguments as :py:meth:`find_files`
        """
        files = list(self.find_files(*args, **kwargs))
        if files:
            for f in files:
                self.remove_one_file(f)
            if len(files) > 1:
                return files
            else:
                return files[0] # for backwards-compatibility
        # Inspect the keyword values: iterating kwargs directly would only yield the keys
        if any(1 for kwarg in kwargs.values()
               if isinstance(kwarg, str) and kwarg.startswith(REGEX_PREFIX)):
            # allow empty results if filter criteria involve a regex
            return []
        raise FileNotFoundError("File not found: %s %s" % (args, kwargs))

    def remove_one_file(self, ID, fileGrp=None):
        """
        Delete an existing :py:class:`ocrd_models.ocrd_file.OcrdFile`.
        Arguments:
            ID (string): ``@ID`` of the ``mets:file`` to delete
            -> ID could also be an OcrdFile, potentially misleading?
            fileGrp (string):
        Returns:
            The old :py:class:`ocrd_models.ocrd_file.OcrdFile` reference.
        """
        log = getLogger('ocrd_models.ocrd_mets.remove_one_file')
        log.debug("remove_one_file(%s %s)" % (ID, fileGrp))
        if isinstance(ID, OcrdFile):
            ocrd_file = ID
            ID = ocrd_file.ID
            # fileGrp = ocrd_file.fileGrp
            # -> could this potentially help to improve the cached approach?
        else:
            # NOTE: We should pass the fileGrp, if known, as a parameter here as well
            # Leaving that out for now
            ocrd_file = next(self.find_files(ID=ID, fileGrp=fileGrp), None)

        if not ocrd_file:
            raise FileNotFoundError("File not found: %s" % ID)

        # Collect the physical page refs (mets:fptr) pointing to this file
        fptrs = []
        if self._cache_flag:
            for page in self._fptr_cache.keys():
                if ID in self._fptr_cache[page]:
                    fptrs.append(self._fptr_cache[page][ID])
        else:
            fptrs = self._tree.getroot().findall('.//mets:fptr[@FILEID="%s"]' % ID, namespaces=NS)

        # Delete the physical page ref
        for fptr in fptrs:
            log.debug("Delete fptr element %s for file '%s'", fptr, ID)
            page_div = fptr.getparent()
            page_div.remove(fptr)
            # Remove the fptr from the cache as well
            if self._cache_flag:
                del self._fptr_cache[page_div.get('ID')][ID]
            # delete empty pages
            if len(page_div) == 0:
                log.debug("Delete empty page %s", page_div)
                page_div.getparent().remove(page_div)
                # Delete the empty pages from caches as well
                if self._cache_flag:
                    del self._page_cache[page_div.get('ID')]
                    del self._fptr_cache[page_div.get('ID')]

        # Delete the file reference from the cache
        if self._cache_flag:
            # pylint: disable=protected-access
            parent_use = ocrd_file._el.getparent().get('USE')
            # Note: if the file is in the XML tree,
            # it must also be in the file cache.
            del self._file_cache[parent_use][ocrd_file.ID]

        # Delete the file reference
        # pylint: disable=protected-access
        ocrd_file._el.getparent().remove(ocrd_file._el)

        return ocrd_file

    @property
    def physical_pages(self):
        """
        List all page IDs (the ``@ID`` of each physical ``mets:structMap`` ``mets:div``)
        """
        if self._cache_flag:
            # Return the IDs (cache keys), matching the uncached branch, not the div elements
            return list(self._page_cache.keys())

        return self._tree.getroot().xpath(
            'mets:structMap[@TYPE="PHYSICAL"]/mets:div[@TYPE="physSequence"]/mets:div[@TYPE="page"]/@ID',
            namespaces=NS)

    def get_physical_pages(self, for_fileIds=None):
        """
        List all page IDs (the ``@ID`` of each physical ``mets:structMap`` ``mets:div``),
        optionally for a subset of ``mets:file`` ``@ID`` :py:attr:`for_fileIds`.
        """
        if for_fileIds is None:
            return self.physical_pages
        ret = [None] * len(for_fileIds)

        # Note: This entire function potentially could be further simplified
        # TODO: Simplify
        if self._cache_flag:
            for pageId in self._fptr_cache.keys():
                for fptr in self._fptr_cache[pageId].keys():
                    if fptr in for_fileIds:
                        ret[for_fileIds.index(fptr)] = pageId
        else:
            for page in self._tree.getroot().xpath(
                'mets:structMap[@TYPE="PHYSICAL"]/mets:div[@TYPE="physSequence"]/mets:div[@TYPE="page"]',
                    namespaces=NS):
                for fptr in page.findall('mets:fptr', NS):
                    if fptr.get('FILEID') in for_fileIds:
                        ret[for_fileIds.index(fptr.get('FILEID'))] = page.get('ID')
        return ret

    def set_physical_page_for_file(self, pageId, ocrd_file, order=None, orderlabel=None):
        """
        Set the physical page ID (``@ID`` of the physical ``mets:structMap`` ``mets:div`` entry)
        corresponding to the ``mets:file`` :py:attr:`ocrd_file`, creating all structures if necessary.
        Arguments:
            pageId (string): ``@ID`` of the physical ``mets:structMap`` entry to use
            ocrd_file (object): existing :py:class:`ocrd_models.ocrd_file.OcrdFile` object
        Keyword Args:
            order (string): ``@ORDER`` to use
            orderlabel (string): ``@ORDERLABEL`` to use
        """
        #  print(pageId, ocrd_file)
        # delete any page mapping for this file.ID

        # NOTE: The pageId coming from 'test_merge(sbb_sample_01)' is an Element not a string
        if not isinstance(pageId, str):
            pageId = pageId.get('ID')

        candidates = []
        if self._cache_flag:
            for page_id in self._fptr_cache.keys():
                if ocrd_file.ID in self._fptr_cache[page_id].keys():
                    if self._fptr_cache[page_id][ocrd_file.ID] is not None:
                        candidates.append(self._fptr_cache[page_id][ocrd_file.ID])
        else:
            candidates = self._tree.getroot().findall(
                'mets:structMap[@TYPE="PHYSICAL"]/mets:div[@TYPE="physSequence"]/mets:div[@TYPE="page"]/mets:fptr[@FILEID="%s"]' %
                ocrd_file.ID, namespaces=NS)

        for el_fptr in candidates:
            if self._cache_flag:
                del self._fptr_cache[el_fptr.getparent().get('ID')][ocrd_file.ID]
            el_fptr.getparent().remove(el_fptr)

        # find/construct as necessary
        el_structmap = self._tree.getroot().find('mets:structMap[@TYPE="PHYSICAL"]', NS)
        if el_structmap is None:
            el_structmap = ET.SubElement(self._tree.getroot(), TAG_METS_STRUCTMAP)
            el_structmap.set('TYPE', 'PHYSICAL')
        el_seqdiv = el_structmap.find('mets:div[@TYPE="physSequence"]', NS)
        if el_seqdiv is None:
            el_seqdiv = ET.SubElement(el_structmap, TAG_METS_DIV)
            el_seqdiv.set('TYPE', 'physSequence')

        el_pagediv = None
        if self._cache_flag:
            if pageId in self._page_cache.keys():
                el_pagediv = self._page_cache[pageId]
        else:
            el_pagediv = el_seqdiv.find('mets:div[@ID="%s"]' % pageId, NS)

        if el_pagediv is None:
            el_pagediv = ET.SubElement(el_seqdiv, TAG_METS_DIV)
            el_pagediv.set('TYPE', 'page')
            el_pagediv.set('ID', pageId)
            if order:
                el_pagediv.set('ORDER', order)
            if orderlabel:
                el_pagediv.set('ORDERLABEL', orderlabel)
            if self._cache_flag:
                # Create a new entry in the page cache
                self._page_cache[pageId] = el_pagediv
                # Create a new entry in the fptr cache and
                # assign an empty dictionary to hold the fileids
                self._fptr_cache[pageId] = {}

        el_fptr = ET.SubElement(el_pagediv, TAG_METS_FPTR)
        el_fptr.set('FILEID', ocrd_file.ID)

        if self._cache_flag:
            # Assign the ocrd fileID to the pageId in the cache
            self._fptr_cache[el_pagediv.get('ID')].update({ocrd_file.ID : el_fptr})
675
    def get_physical_page_for_file(self, ocrd_file):
        """
        Get the physical page ID (``@ID`` of the physical ``mets:structMap`` ``mets:div`` entry)
        corresponding to the ``mets:file`` :py:attr:`ocrd_file`.

        Returns ``None`` if no page references the file.
        """
        ret = []
        if self._cache_flag:
            for pageId in self._fptr_cache:
                if ocrd_file.ID in self._fptr_cache[pageId]:
                    ret.append(self._page_cache[pageId].get('ID'))
        else:
            ret = self._tree.getroot().xpath(
                '/mets:mets/mets:structMap[@TYPE="PHYSICAL"]/mets:div[@TYPE="physSequence"]/mets:div[@TYPE="page"][./mets:fptr[@FILEID="%s"]]/@ID' %
                ocrd_file.ID, namespaces=NS)

        # Check the length explicitly to avoid Python's FutureWarning
        if len(ret):
            return ret[0]

    def remove_physical_page(self, ID):
        """
        Delete page (physical ``mets:structMap`` ``mets:div`` entry ``@ID``) :py:attr:`ID`.
        """
        mets_div = None
        if self._cache_flag:
            if ID in self._page_cache:
                mets_div = [self._page_cache[ID]]
        else:
            mets_div = self._tree.getroot().xpath(
                'mets:structMap[@TYPE="PHYSICAL"]/mets:div[@TYPE="physSequence"]/mets:div[@TYPE="page"][@ID="%s"]' % ID,
                namespaces=NS)
        # xpath() returns an empty list when nothing matches, so test for
        # emptiness rather than for None to avoid an IndexError below
        if mets_div:
            mets_div[0].getparent().remove(mets_div[0])
            if self._cache_flag:
                del self._page_cache[ID]
                del self._fptr_cache[ID]

    def remove_physical_page_fptr(self, fileId):
        """
        Delete all ``mets:fptr[@FILEID=fileId]`` pointing to ``mets:file[@ID=fileId]`` for :py:attr:`fileId` from all ``mets:div`` entries in the physical ``mets:structMap``.
        Returns:
            List of page IDs from which ``mets:fptr`` elements were deleted
        """

        # Question: Why keep a list of mets_fptrs?
        # Can the same fileId be referenced from different pageIds?
        # In the examples I have seen inside 'assets' that is not the case,
        # and the mets_fptrs list always contains a single element.
        # If that always holds, one loop would suffice instead of two.
        mets_fptrs = []
        if self._cache_flag:
            for page_id in self._fptr_cache:
                if fileId in self._fptr_cache[page_id]:
                    mets_fptrs.append(self._fptr_cache[page_id][fileId])
        else:
            mets_fptrs = self._tree.getroot().xpath(
                'mets:structMap[@TYPE="PHYSICAL"]/mets:div[@TYPE="physSequence"]/mets:div[@TYPE="page"]/mets:fptr[@FILEID="%s"]' % fileId, namespaces=NS)
        ret = []
        for mets_fptr in mets_fptrs:
            mets_div = mets_fptr.getparent()
            ret.append(mets_div.get('ID'))
            if self._cache_flag:
                del self._fptr_cache[mets_div.get('ID')][mets_fptr.get('FILEID')]
            mets_div.remove(mets_fptr)
        return ret

    def merge(self, other_mets, force=False, fileGrp_mapping=None, fileId_mapping=None, pageId_mapping=None, after_add_cb=None, **kwargs):
        """
        Add all files from :py:attr:`other_mets`.
        Accepts the same kwargs as :py:func:`find_files`.
        Keyword Args:
            force (boolean): Whether to call :py:meth:`add_file` with ``force`` (overwriting existing ``mets:file`` entries)
            fileGrp_mapping (dict): Map :py:attr:`other_mets` fileGrp to fileGrp in this METS
            fileId_mapping (dict): Map :py:attr:`other_mets` file ID to file ID in this METS
            pageId_mapping (dict): Map :py:attr:`other_mets` page ID to page ID in this METS
            after_add_cb (function): Callback invoked with each file after it is added to this METS
        """
        if not fileGrp_mapping:
            fileGrp_mapping = {}
        if not fileId_mapping:
            fileId_mapping = {}
        if not pageId_mapping:
            pageId_mapping = {}
        for f_src in other_mets.find_files(**kwargs):
            # IDs without an entry in a mapping are carried over unchanged
            f_dest = self.add_file(
                    fileGrp_mapping.get(f_src.fileGrp, f_src.fileGrp),
                    mimetype=f_src.mimetype,
                    url=f_src.url,
                    ID=fileId_mapping.get(f_src.ID, f_src.ID),
                    pageId=pageId_mapping.get(f_src.pageId, f_src.pageId),
                    force=force)
            # FIXME: merge metsHdr, amdSec, dmdSec as well
            # FIXME: merge structMap logical and structLink as well
            if after_add_cb:
                after_add_cb(f_dest)
771