Passed
Push — master ( 1f4459...f4ca5c )
by Daniel
08:26
created

amd.io   F

Complexity

Total Complexity 193

Size/Duplication

Total Lines 1526
Duplicated Lines 3.34 %

Importance

Changes 0
Metric Value
wmc 193
eloc 783
dl 51
loc 1526
rs 1.817
c 0
b 0
f 0

16 Functions

Rating   Name   Duplication   Size   Complexity  
A _custom_warning() 0 2 1
B periodicset_from_pymatgen_structure() 0 70 7
A _atom_has_disorder() 0 4 1
F periodicset_from_ase_cifblock() 26 165 30
A _refcodes_from_families() 0 28 3
A _heaviest_component() 0 19 5
B periodicset_from_ase_atoms() 0 59 6
A periodicset_from_ccdc_entry() 0 67 4
A _gemmi_loop_to_dict() 0 9 2
C _expand_asym_unit() 0 64 9
F periodicset_from_gemmi_block() 0 138 18
F periodicset_from_pymatgen_cifblock() 25 160 29
F _parse_sitesym_pymatgen() 0 78 19
F periodicset_from_ccdc_crystal() 0 130 15
A _frac_molecular_centres_ccdc() 0 13 2
A _unique_sites() 0 14 1

9 Methods

Rating   Name   Duplication   Size   Complexity  
A _Reader.read() 0 9 2
B _Reader.__next__() 0 30 7
A CifReader._generate_from_dir() 0 6 3
A _Reader.__iter__() 0 4 3
A _Reader.__init__() 0 21 2
B CSDReader.__init__() 0 50 7
A CifReader._pymatgen_cifblock_generator() 0 5 1
A CSDReader._ccdc_generator() 0 15 5
D CifReader.__init__() 0 97 11

How to fix   Duplicated Code    Complexity   

Duplicated Code

Duplicate code is one of the most pungent code smells. A rule that is often used is to re-structure code once it is duplicated in three or more places.

Common duplication problems, and corresponding solutions are:

Complexity

 Tip:   Before tackling complexity, make sure that you eliminate any duplication first. This often can reduce the size of classes significantly.

Complex classes like amd.io often do a lot of different things. To break such a class down, we need to identify a cohesive component within that class. A common approach to find such a component is to look for fields/methods that share the same prefixes, or suffixes.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.

1
"""Tools for reading crystals from files, or from the CSD with
2
``csd-python-api``. The readers return
3
:class:`amd.PeriodicSet <.periodicset.PeriodicSet>` objects representing
4
the crystal which can be passed to :func:`amd.AMD() <.calculate.AMD>`
5
and :func:`amd.PDD() <.calculate.PDD>` to get their invariants.
6
"""
7
8
import os
9
import re
10
import functools
11
import warnings
12
from typing import Tuple
13
14
import numpy as np
15
import numba
16
from ase.data import atomic_numbers as ase_atomic_numbers
17
from ase.spacegroup import get_basis as ase_get_basis
18
from ase.spacegroup.spacegroup import parse_sitesym as ase_parse_sitesym
19
from ase.io.cif import (
20
    convert_value as ase_convert_value, 
21
    parse_cif as ase_parse_cif
22
)
23
24
from .utils import cellpar_to_cell
25
from .periodicset import PeriodicSet
26
27
28
def _custom_warning(message, category, filename, lineno, *args, **kwargs):
29
    return f'{category.__name__}: {message}\n'
30
31
warnings.formatwarning = _custom_warning
32
33
_EQUIV_SITE_TOL = 1e-3
34
35
CIF_TAGS = {
36
    'cellpar': [
37
        '_cell_length_a',
38
        '_cell_length_b',
39
        '_cell_length_c',
40
        '_cell_angle_alpha',
41
        '_cell_angle_beta',
42
        '_cell_angle_gamma',],
43
44
    'atom_site_fract': [
45
        '_atom_site_fract_x',
46
        '_atom_site_fract_y',
47
        '_atom_site_fract_z',],
48
49
    'atom_site_cartn': [
50
        '_atom_site_cartn_x',
51
        '_atom_site_cartn_y',
52
        '_atom_site_cartn_z',],
53
54
    'atom_symbol': [
55
        '_atom_site_type_symbol',
56
        '_atom_site_label',],
57
58
    'symop': [
59
        '_symmetry_equiv_pos_as_xyz',
60
        '_space_group_symop_operation_xyz',
61
        '_space_group_symop.operation_xyz',
62
        '_symmetry_equiv_pos_as_xyz_',
63
        '_space_group_symop_operation_xyz_',],
64
65
    'spacegroup_name': [
66
        '_space_group_name_Hall',
67
        '_symmetry_space_group_name_hall',
68
        '_space_group_name_H-M_alt',
69
        '_symmetry_space_group_name_H-M',
70
        '_symmetry_space_group_name_H_M',
71
        '_symmetry_space_group_name_h-m',],
72
73
    'spacegroup_number': [
74
        '_space_group_IT_number',
75
        '_symmetry_Int_Tables_number',
76
        '_space_group_IT_number_',
77
        '_symmetry_Int_Tables_number_',],
78
}
79
80
81
class _Reader:
82
83
    _DISORDER_OPTIONS = {'skip', 'ordered_sites', 'all_sites'}
84
85
    def __init__(
86
            self,
87
            remove_hydrogens=False,
88
            disorder='skip',
89
            heaviest_component=False,
90
            molecular_centres=False,
91
            show_warnings=True
92
    ):
93
94
        if disorder not in _Reader._DISORDER_OPTIONS:
95
            msg = 'disorder parameter must be one of ' \
96
                  f'{_Reader._DISORDER_OPTIONS} (passed {disorder})'
97
            raise ValueError(msg)
98
99
        self.remove_hydrogens = remove_hydrogens
100
        self.disorder = disorder
101
        self.heaviest_component = heaviest_component
102
        self.molecular_centres = molecular_centres
103
        self.show_warnings = show_warnings
104
        self._backend_iterator = None
105
        self._converter = None
106
107
    def __iter__(self):
108
        if self._backend_iterator is None or self._converter is None:
109
            raise RuntimeError(f'{self.__class__.__name__} not initialized.')
110
        return self
111
112
    def __next__(self):
113
        """Iterates over self._backend_iterator, passing items through
114
        self._converter. Catches :class:`ParseError <.io.ParseError>`
115
        and warnings raised in self._converter, optionally printing
116
        them.
117
        """
118
119
        if not self.show_warnings:
120
            warnings.simplefilter('ignore')
121
122
        while True:
123
124
            item = next(self._backend_iterator)
125
126
            with warnings.catch_warnings(record=True) as warning_msgs:
127
                msg = None
128
                try:
129
                    periodic_set = self._converter(item)
130
                except ParseError as err:
131
                    msg = str(err)
132
133
            if msg:
134
                warnings.warn(msg)
135
                continue
136
137
            for warning in warning_msgs:
138
                msg = f'{periodic_set.name} {warning.message}'
139
                warnings.warn(msg, category=warning.category)
140
141
            return periodic_set
142
143
    def read(self):
144
        """Reads the crystal(s), returns one
145
        :class:`amd.PeriodicSet <.periodicset.PeriodicSet>` if there is
146
        only one, otherwise returns a list. (Note the return type is not
147
        consistent!)"""
148
        l = list(self)
149
        if len(l) == 1:
150
            return l[0]
151
        return l
152
153
154
class CifReader(_Reader):
155
    """Read all structures in a .cif file or all files in a folder
156
    with ase or csd-python-api (if installed), yielding
157
    :class:`amd.PeriodicSet <.periodicset.PeriodicSet>` s.
158
159
    Parameters
160
    ----------
161
    path : str
162
        Path to a .cif file or directory. (Other files are accepted when
163
        using ``reader='ccdc'``, if csd-python-api is installed.)
164
    reader : str, optional
165
        The backend package used for parsing. Default is :code:`ase`,
166
        to use csd-python-api change to :code:`ccdc`. The ccdc reader
167
        should be able to read any format accepted by
168
        :class:`ccdc.io.EntryReader`, though only cifs have been tested.
169
    remove_hydrogens : bool, optional
170
        Remove Hydrogens from the crystal.
171
    disorder : str, optional
172
        Controls how disordered structures are handled. Default is
173
        ``skip`` which skips any crystal with disorder, since disorder
174
        conflicts with the periodic set model. To read disordered
175
        structures anyway, choose either :code:`ordered_sites` to remove
176
        atoms with disorder or :code:`all_sites` include all atoms
177
        regardless of disorder.
178
    heaviest_component : bool, optional
179
        csd-python-api only. Removes all but the heaviest molecule in
180
        the asymmeric unit, intended for removing solvents.
181
    molecular_centres : bool, default False
182
        csd-python-api only. Extract the centres of molecules in the
183
        unit cell and store in the attribute molecular_centres.
184
    show_warnings : bool, optional
185
        Controls whether warnings that arise during reading are printed.
186
187
    Yields
188
    ------
189
    :class:`amd.PeriodicSet <.periodicset.PeriodicSet>`
190
        Represents the crystal as a periodic set, consisting of a finite
191
        set of points (motif) and lattice (unit cell). Contains other
192
        data, e.g. the crystal's name and information about the
193
        asymmetric unit.
194
195
    Examples
196
    --------
197
198
        ::
199
200
            # Put all crystals in a .CIF in a list
201
            structures = list(amd.CifReader('mycif.cif'))
202
203
            # Can also accept path to a directory, reading all files inside
204
            structures = list(amd.CifReader('path/to/folder'))
205
206
            # Reads just one if the .CIF has just one crystal
207
            periodic_set = amd.CifReader('mycif.cif').read()
208
209
            # List of AMDs (k=100) of crystals in a .CIF
210
            amds = [amd.AMD(periodic_set, 100) for periodic_set in amd.CifReader('mycif.cif')]
211
    """
212
213
    def __init__(
214
            self,
215
            path,
216
            reader='ase',
217
            remove_hydrogens=False,
218
            disorder='skip',
219
            heaviest_component=False,
220
            molecular_centres=False,
221
            show_warnings=True,
222
    ):
223
224
        super().__init__(
225
            remove_hydrogens=remove_hydrogens,
226
            disorder=disorder,
227
            heaviest_component=heaviest_component,
228
            molecular_centres=molecular_centres,
229
            show_warnings=show_warnings
230
        )
231
232
        # file_parser: Callable(path) -> Iterable[External object]
233
        # self._converter: Callable(External object) -> PeriodicSet
234
        # set options in self._converter with functools.partial
235
236
        if reader != 'ccdc':
237
238
            if heaviest_component:
239
                msg = 'Parameter heaviest_component only implemented for ' \
240
                      'reader="ccdc".'
241
                raise NotImplementedError(msg)
242
243
            if molecular_centres:
244
                msg = 'Parameter molecular_centres only implemented for ' \
245
                      'reader="ccdc".'
246
                raise NotImplementedError(msg)
247
248
        if reader in ('ase', 'pycodcif'):
249
250
            extensions = {'cif'}
251
            file_parser = functools.partial(ase_parse_cif, reader=reader)
252
            self._converter = functools.partial(
253
                periodicset_from_ase_cifblock,
254
                remove_hydrogens=self.remove_hydrogens,
255
                disorder=self.disorder
256
            )
257
258
        elif reader == 'pymatgen':
259
260
            extensions = {'cif'}
261
            file_parser = CifReader._pymatgen_cifblock_generator
262
            self._converter = functools.partial(
263
                periodicset_from_pymatgen_cifblock,
264
                remove_hydrogens=self.remove_hydrogens,
265
                disorder=self.disorder
266
            )
267
268
        elif reader == 'gemmi':
269
270
            import gemmi
271
272
            extensions = {'cif'}
273
            file_parser = gemmi.cif.read_file
274
            self._converter = functools.partial(
275
                periodicset_from_gemmi_block,
276
                remove_hydrogens=self.remove_hydrogens,
277
                disorder=self.disorder
278
            )
279
280
        elif reader == 'ccdc':
281
282
            try:
283
                import ccdc.io
284
            except (ImportError, RuntimeError) as e:
285
                msg = 'Failed to import csd-python-api, please check it is' \
286
                      'installed and licensed.'
287
                raise ImportError(msg) from e
288
289
            extensions = ccdc.io.EntryReader.known_suffixes
290
            file_parser = ccdc.io.EntryReader
291
            self._converter = functools.partial(
292
                periodicset_from_ccdc_entry,
293
                remove_hydrogens=self.remove_hydrogens,
294
                disorder=self.disorder,
295
                molecular_centres=self.molecular_centres,
296
                heaviest_component=self.heaviest_component
297
            )
298
299
        else:
300
            raise ValueError(f'Unknown reader {reader}.')
301
302
        if os.path.isfile(path):
303
            it = file_parser(path)
304
        elif os.path.isdir(path):
305
            it = CifReader._generate_from_dir(path, file_parser, extensions)
306
        else:
307
            raise FileNotFoundError(f'No such file or directory: {path}')
308
309
        self._backend_iterator = iter(it)
310
        
311
    @staticmethod
312
    def _generate_from_dir(path, file_parser, extensions):
313
        for file in os.listdir(path):
314
            suff = os.path.splitext(file)[1][1:]
315
            if suff.lower() in extensions:
316
                yield from file_parser(os.path.join(path, file))
317
318
    @staticmethod
319
    def _pymatgen_cifblock_generator(path):
320
        """Path to .cif --> generator of pymatgen CifBlocks."""
321
        from pymatgen.io.cif import CifFile
322
        yield from CifFile.from_file(path).data.values()
323
324
325
class CSDReader(_Reader):
326
    """Read structures from the CSD with csd-python-api, yielding
327
    :class:`amd.PeriodicSet <.periodicset.PeriodicSet>` s.
328
329
    Parameters
330
    ----------
331
    refcodes : str or List[str], optional
332
        Single or list of CSD refcodes to read. If None or 'CSD',
333
        iterates over the whole CSD.
334
    families : bool, optional
335
        Read all entries whose refcode starts with the given strings, or
336
        'families' (e.g. giving 'DEBXIT' reads all entries starting with
337
        DEBXIT).
338
    remove_hydrogens : bool, optional
339
        Remove hydrogens from the crystal.
340
    disorder : str, optional
341
        Controls how disordered structures are handled. Default is
342
        ``skip`` which skips any crystal with disorder, since disorder
343
        conflicts with the periodic set model. To read disordered
344
        structures anyway, choose either :code:`ordered_sites` to remove
345
        atoms with disorder or :code:`all_sites` include all atoms
346
        regardless of disorder.
347
    heaviest_component : bool, optional
348
        Removes all but the heaviest molecule in the asymmeric unit,
349
        intended for removing solvents.
350
    molecular_centres : bool, default False
351
        Extract the centres of molecules in the unit cell and store in
352
        attribute molecular_centres.
353
    show_warnings : bool, optional
354
        Controls whether warnings that arise during reading are printed.
355
356
    Yields
357
    ------
358
    :class:`amd.PeriodicSet <.periodicset.PeriodicSet>`
359
        Represents the crystal as a periodic set, consisting of a finite
360
        set of points (motif) and lattice (unit cell). Contains other
361
        useful data, e.g. the crystal's name and information about the
362
        asymmetric unit for calculation.
363
364
    Examples
365
    --------
366
367
        ::
368
369
            # Put these entries in a list
370
            refcodes = ['DEBXIT01', 'DEBXIT05', 'HXACAN01']
371
            structures = list(amd.CSDReader(refcodes))
372
373
            # Read refcode families (any whose refcode starts with strings in the list)
374
            refcode_families = ['ACSALA', 'HXACAN']
375
            structures = list(amd.CSDReader(refcode_families, families=True))
376
377
            # Get AMDs (k=100) for crystals in these families
378
            refcodes = ['ACSALA', 'HXACAN']
379
            amds = []
380
            for periodic_set in amd.CSDReader(refcodes, families=True):
381
                amds.append(amd.AMD(periodic_set, 100))
382
383
            # Giving the reader nothing reads from the whole CSD.
384
            for periodic_set in amd.CSDReader():
385
                ...
386
    """
387
388
    def __init__(
389
            self,
390
            refcodes=None,
391
            families=False,
392
            remove_hydrogens=False,
393
            disorder='skip',
394
            heaviest_component=False,
395
            molecular_centres=False,
396
            show_warnings=True,
397
    ):
398
399
        super().__init__(
400
            remove_hydrogens=remove_hydrogens,
401
            disorder=disorder,
402
            heaviest_component=heaviest_component,
403
            molecular_centres=molecular_centres,
404
            show_warnings=show_warnings,
405
        )
406
407
        try:
408
            import ccdc.io
409
        except (ImportError, RuntimeError) as _:
410
            msg = 'Failed to import csd-python-api, please check it is ' \
411
                  'installed and licensed.'
412
            raise ImportError(msg)
413
414
        if isinstance(refcodes, str) and refcodes.lower() == 'csd':
415
            refcodes = None
416
417
        if refcodes is None:
418
            families = False
419
        else:
420
            if isinstance(refcodes, str):
421
                refcodes = [refcodes]
422
            else:
423
                refcodes = list(refcodes)
424
425
        if families:
426
            refcodes = _refcodes_from_families(refcodes)
427
428
        entry_reader = ccdc.io.EntryReader('CSD')
429
        self._converter = functools.partial(
430
            periodicset_from_ccdc_entry,
431
            remove_hydrogens=self.remove_hydrogens,
432
            disorder=self.disorder,
433
            molecular_centres=self.molecular_centres,
434
            heaviest_component=self.heaviest_component
435
        )
436
        it = self._ccdc_generator(refcodes, entry_reader)
437
        self._backend_iterator = iter(it)
438
439
    @staticmethod
440
    def _ccdc_generator(refcodes, entry_reader):
441
        """Generates ccdc Entries from CSD refcodes.
442
        """
443
444
        if refcodes is None:
445
            for entry in entry_reader:
446
                yield entry
447
        else:
448
            for refcode in refcodes:
449
                try:
450
                    entry = entry_reader.entry(refcode)
451
                    yield entry
452
                except RuntimeError:
453
                    warnings.warn(f'{refcode} not found in database')
454
455
456
class ParseError(ValueError):
457
    """Raised when an item cannot be parsed into a periodic set.
458
    """
459
    pass
460
461
462
def periodicset_from_ase_cifblock(
463
        block,
464
        remove_hydrogens=False,
465
        disorder='skip'
466
) -> PeriodicSet:
467
    """:class:`ase.io.cif.CIFBlock` --> 
468
    :class:`amd.PeriodicSet <.periodicset.PeriodicSet>`.
469
    :class:`ase.io.cif.CIFBlock` is the type returned by
470
    :class:`ase.io.cif.parse_cif`.
471
472
    Parameters
473
    ----------
474
    block : :class:`ase.io.cif.CIFBlock`
475
        An ase :class:`ase.io.cif.CIFBlock` object representing a
476
        crystal.
477
    remove_hydrogens : bool, optional
478
        Remove Hydrogens from the crystal.
479
    disorder : str, optional
480
        Controls how disordered structures are handled. Default is
481
        ``skip`` which skips any crystal with disorder, since disorder
482
        conflicts with the periodic set model. To read disordered
483
        structures anyway, choose either :code:`ordered_sites` to remove
484
        atoms with disorder or :code:`all_sites` include all atoms
485
        regardless of disorder.
486
487
    Returns
488
    -------
489
    :class:`amd.PeriodicSet <.periodicset.PeriodicSet>`
490
        Represents the crystal as a periodic set, consisting of a finite
491
        set of points (motif) and lattice (unit cell). Contains other
492
        useful data, e.g. the crystal's name and information about the
493
        asymmetric unit for calculation.
494
495
    Raises
496
    ------
497
    ParseError
498
        Raised if the structure fails to be parsed for any of the
499
        following: 1. Required data is missing (e.g. cell parameters),
500
        2. The motif is empty after removing H or disordered sites,
501
        3. :code:``disorder == 'skip'`` and disorder is found on any
502
        atom.
503
    """
504
505
    # unit cell
506
    cellpar = [block.get(tag) for tag in CIF_TAGS['cellpar']]
507
    if None in cellpar:
508
        raise ParseError(f'{block.name} has missing cell data')
509
    cell = cellpar_to_cell(*cellpar)
510
511
    # asymmetric unit coords. ase removes uncertainty brackets.
512
    cartesian = False # flag needed for later
513
    asym_unit = [block.get(name) for name in CIF_TAGS['atom_site_fract']]
514
    if None in asym_unit: # missing scaled coords, try cartesian
515
        asym_unit = [block.get(name) for name in CIF_TAGS['atom_site_cartn']]
516
        if None in asym_unit:
517
            raise ParseError(f'{block.name} has no coordinates')
518
        cartesian = True
519
    asym_unit = list(zip(*asym_unit)) # transpose [xs,ys,zs] -> [p1,p2,...]
520
521
    # atomic types
522
    asym_symbols = block._get_any(CIF_TAGS['atom_symbol'])
523 View Code Duplication
    if asym_symbols is None:
0 ignored issues
show
Duplication introduced by
This code seems to be duplicated in your project.
Loading history...
524
        warnings.warn('missing atomic types will be labelled 0')
525
        asym_types = [0] * len(asym_unit)
526
    else:
527
        asym_types = []
528
        for label in asym_symbols:
529
            if label in ('.', '?'):
530
                warnings.warn('missing atomic types will be labelled 0')
531
                num = 0
532
            else:
533
                sym = re.search(r'([A-Z][a-z]?)', label).group(0)
534
                if sym == 'D':
535
                    sym = 'H'
536
                num = ase_atomic_numbers[sym]
537
            asym_types.append(num)
538
539
    # find if sites have disorder if necassary
540
    has_disorder = []
541
    if disorder != 'all_sites':
542
        occupancies = block.get('_atom_site_occupancy')
543
        if occupancies is None:
544
            occupancies = np.ones((len(asym_unit), ))
545
        labels = block.get('_atom_site_label')
546
        if labels is None:
547
            labels = [''] * len(asym_unit)
548
        for lab, occ in zip(labels, occupancies):
549
            has_disorder.append(_atom_has_disorder(lab, occ))
550
551
    # remove sites with ?, . or other invalid string
552
    invalid_coords = []
553
    for i, xyz in enumerate(asym_unit):
554
        if not all(isinstance(coord, (int, float)) for coord in xyz):
555
            invalid_coords.append(i)
556
    if invalid_coords:
557
        warnings.warn('atoms without sites or missing data will be removed')
558
        asym_unit = [xyz for i, xyz in enumerate(asym_unit) if i not in invalid_coords]
559
        asym_types = [t for i, t in enumerate(asym_types) if i not in invalid_coords]
560
        if disorder != 'all_sites':
561
            has_disorder = [d for i, d in enumerate(has_disorder) if i not in invalid_coords]
562
563
    remove_sites = []
564
    if remove_hydrogens:
565
        remove_sites.extend(i for i, num in enumerate(asym_types) if num == 1)
566
567
    # remove atoms with fractional occupancy or raise ParseError
568 View Code Duplication
    if disorder != 'all_sites':
0 ignored issues
show
Duplication introduced by
This code seems to be duplicated in your project.
Loading history...
569
        for i, dis in enumerate(has_disorder):
570
            if i in remove_sites:
571
                continue
572
            if dis:
573
                if disorder == 'skip':
574
                    msg = f"{block.name} has disorder, pass " \
575
                           "disorder='ordered_sites'or 'all_sites' to " \
576
                           "remove/ignore disorder"
577
                    raise ParseError(msg)
578
                elif disorder == 'ordered_sites':
579
                    remove_sites.append(i)
580
581
    asym_unit = [xyz for i, xyz in enumerate(asym_unit) if i not in remove_sites]
582
    asym_types = [t for i, t in enumerate(asym_types) if i not in remove_sites]
583
    if len(asym_unit) == 0:
584
        raise ParseError(f'{block.name} has no valid sites')
585
    asym_unit = np.array(asym_unit)
586
587
    # if Cartesian coords were given, convert to scaled
588
    if cartesian:
589
        asym_unit = asym_unit @ np.linalg.inv(cell)
590
    asym_unit = np.mod(asym_unit, 1)
591
    # asym_unit = _snap_small_prec_coords(asym_unit) # recommended by pymatgen
592
593
    # remove overlapping sites unless disorder == 'all_sites'
594
    if disorder != 'all_sites':
595
        keep_sites = _unique_sites(asym_unit)
596
        if not np.all(keep_sites):
597
            msg = 'may have overlapping sites, duplicates will be removed'
598
            warnings.warn(msg)
599
            asym_unit = asym_unit[keep_sites]
600
            asym_types = [t for t, keep in zip(asym_types, keep_sites) if keep]
601
602
    # symmetry operations
603
    sitesym = block._get_any(CIF_TAGS['symop'])
604
    if sitesym is None: # no symops, use spacegroup
605
        try:
606
            spg = block.get_spacegroup(True)
607
            rot, trans = spg.rotations, spg.translations
608
        except: # no spacegroup, assume P1
609
            rot, trans = ase_parse_sitesym(['x,y,z'])
610
    else:
611
        if isinstance(sitesym, str):
612
            sitesym = [sitesym]
613
        rot, trans = ase_parse_sitesym(sitesym)
614
615
    # apply symmetries to asymmetric unit
616
    frac_motif, asym_inds, wyc_muls, inverses = _expand_asym_unit(asym_unit, rot, trans)
617
    types = np.array([asym_types[i] for i in inverses])
618
    motif = frac_motif @ cell
619
    
620
    return PeriodicSet(
621
        motif=motif,
622
        cell=cell,
623
        name=block.name,
624
        asymmetric_unit=asym_inds,
625
        wyckoff_multiplicities=wyc_muls,
626
        types=types
627
    )
628
629
630
def periodicset_from_ase_atoms(
631
        atoms,
632
        remove_hydrogens=False
633
) -> PeriodicSet:
634
    """:class:`ase.atoms.Atoms` -->
635
    :class:`amd.PeriodicSet <.periodicset.PeriodicSet>`. Does not have
636
    the option to remove disorder.
637
638
    Parameters
639
    ----------
640
    atoms : :class:`ase.atoms.Atoms`
641
        An ase :class:`ase.atoms.Atoms` object representing a crystal.
642
    remove_hydrogens : bool, optional
643
        Remove Hydrogens from the crystal.
644
645
    Returns
646
    -------
647
    :class:`amd.PeriodicSet <.periodicset.PeriodicSet>`
648
        Represents the crystal as a periodic set, consisting of a finite
649
        set of points (motif) and lattice (unit cell). Contains other
650
        useful data, e.g. the crystal's name and information about the
651
        asymmetric unit for calculation.
652
653
    Raises
654
    ------
655
    ParseError
656
        Raised if there are no valid sites in atoms.
657
    """
658
659
    cell = atoms.get_cell().array
660
661
    remove_inds = []
662
    if remove_hydrogens:
663
        for i in np.where(atoms.get_atomic_numbers() == 1)[0]:
664
            remove_inds.append(i)
665
666
    for i in sorted(remove_inds, reverse=True):
667
        atoms.pop(i)
668
669
    if len(atoms) == 0:
670
        raise ParseError(f'ase Atoms object has no valid sites')
671
672
    spg = None
673
    if 'spacegroup' in atoms.info:
674
        spg = atoms.info['spacegroup']
675
        rot, trans = spg.rotations, spg.translations
676
    # else assume no symmetries?
677
678
    # default tol=1e-5
679
    asym_unit = ase_get_basis(atoms, spacegroup=spg, tol=_EQUIV_SITE_TOL)
680
    frac_motif, asym_inds, wyc_muls, _ = _expand_asym_unit(asym_unit, rot, trans)
0 ignored issues
show
introduced by
The variable rot does not seem to be defined in case 'spacegroup' in atoms.info on line 673 is False. Are you sure this can never be the case?
Loading history...
introduced by
The variable trans does not seem to be defined in case 'spacegroup' in atoms.info on line 673 is False. Are you sure this can never be the case?
Loading history...
681
    motif = frac_motif @ cell
682
683
    return PeriodicSet(
684
        motif=motif,
685
        cell=cell,
686
        asymmetric_unit=asym_inds,
687
        wyckoff_multiplicities=wyc_muls,
688
        types=atoms.get_atomic_numbers()
689
    )
690
691
692
def periodicset_from_pymatgen_cifblock(
693
        block,
694
        remove_hydrogens=False,
695
        disorder='skip'
696
) -> PeriodicSet:
697
    """:class:`pymatgen.io.cif.CifBlock` -->
698
    :class:`amd.PeriodicSet <.periodicset.PeriodicSet>`.
699
700
    Parameters
701
    ----------
702
    block : :class:`pymatgen.io.cif.CifBlock`
703
        A pymatgen CifBlock object representing a crystal.
704
    remove_hydrogens : bool, optional
705
        Remove Hydrogens from the crystal.
706
    disorder : str, optional
707
        Controls how disordered structures are handled. Default is
708
        ``skip`` which skips any crystal with disorder, since disorder
709
        conflicts with the periodic set model. To read disordered
710
        structures anyway, choose either :code:`ordered_sites` to remove
711
        atoms with disorder or :code:`all_sites` include all atoms
712
        regardless of disorder.
713
714
    Returns
715
    -------
716
    :class:`amd.PeriodicSet <.periodicset.PeriodicSet>`
717
        Represents the crystal as a periodic set, consisting of a finite
718
        set of points (motif) and lattice (unit cell). Contains other
719
        useful data, e.g. the crystal's name and information about the
720
        asymmetric unit for calculation.
721
722
    Raises
723
    ------
724
    ParseError
725
        Raised if the structure can/should not be parsed for the
726
        following reasons: 1. No sites found or motif is empty after
727
        removing Hydrogens & disorder, 2. A site has missing
728
        coordinates, 3. :code:``disorder == 'skip'`` and disorder is
729
        found on any atom.
730
    """
731
732
    odict = block.data
733
734
    # unit cell
735
    cellpar = [odict.get(tag) for tag in CIF_TAGS['cellpar']]
736
    if None in cellpar:
737
        raise ParseError(f'{block.header} has missing cell data')
738
    cellpar = [ase_convert_value(v) for v in cellpar]
739
    cell = cellpar_to_cell(*cellpar)
740
741
    # asymmetric unit coords
742
    cartesian = False # flag needed for later
743
    asym_unit = [odict.get(tag) for tag in CIF_TAGS['atom_site_fract']]
744
745
    if None in asym_unit: # missing scaled coords, try cartesian
746
        asym_unit = [odict.get(tag) for tag in CIF_TAGS['atom_site_cartn']]
747
        if None in asym_unit:
748
            raise ParseError(f'{block.header} has no coordinates')
749
        cartesian = True
750
751
    asym_unit = list(zip(*asym_unit)) # transpose [xs,ys,zs] -> [p1,p2,...]
752
    # pymatgen does not remove uncertainty brackets
753
    asym_unit = [[ase_convert_value(coord) for coord in xyz] 
754
                 for xyz in asym_unit]
755
756
    # atomic types
757
    for tag in CIF_TAGS['atom_symbol']:
758
        asym_symbols = odict.get(tag)
759 View Code Duplication
        if asym_symbols is not None:
0 ignored issues
show
Duplication introduced by
This code seems to be duplicated in your project.
Loading history...
760
            asym_types = []
761
            for label in asym_symbols:
762
                if label in ('.', '?'):
763
                    warnings.warn('missing atomic types will be labelled 0')
764
                    num = 0
765
                else:
766
                    sym = re.search(r'([A-Z][a-z]?)', label).group(0)
767
                    if sym == 'D':
768
                        sym = 'H'
769
                    # could use pymatgen here
770
                    num = ase_atomic_numbers[sym]
771
                asym_types.append(num)
772
            break
773
    else:
774
        warnings.warn('missing atomic types will be labelled 0')
775
        asym_types = [0] * len(asym_unit)
776
777
    # find if sites have disorder or not (if necassary)
778
    has_disorder = []
779
    if disorder != 'all_sites':
780
        occupancies = odict.get('_atom_site_occupancy')
781
        if occupancies is None:
782
            occupancies = np.ones((len(asym_unit), ))
783
        labels = odict.get('_atom_site_label')
784
        if labels is None:
785
            labels = [''] * len(asym_unit)
786
        for lab, occ in zip(labels, occupancies):
787
            has_disorder.append(_atom_has_disorder(lab, occ))
788
789
    # remove sites with ?, . or other string
790
    invalid_coords = []
791
    for i, xyz in enumerate(asym_unit):
792
        if not all(isinstance(coord, (int, float)) for coord in xyz):
793
            invalid_coords.append(i)
794
795
    if invalid_coords:
796
        warnings.warn('atoms without sites or missing data will be removed')
797
        asym_unit = [xyz for i, xyz in enumerate(asym_unit) if i not in invalid_coords]
798
        asym_types = [xyz for i, xyz in enumerate(asym_types) if i not in invalid_coords]
799
        if disorder != 'all_sites':
800
            has_disorder = [d for i, d in enumerate(has_disorder) if i not in invalid_coords]
801
802
    remove_sites = []
803
    if remove_hydrogens:
804
        remove_sites.extend((i for i, num in enumerate(asym_types) if num == 1))
0 ignored issues
show
introduced by
The variable i does not seem to be defined in case the for loop on line 791 is not entered. Are you sure this can never be the case?
Loading history...
805
806 View Code Duplication
    if disorder != 'all_sites':
0 ignored issues
show
Duplication introduced by
This code seems to be duplicated in your project.
Loading history...
807
        for i, dis in enumerate(has_disorder):
808
            if i in remove_sites:
809
                continue
810
            if dis:
811
                if disorder == 'skip':
812
                    msg = f"{block.header} has disorder, pass " \
813
                            "disorder='ordered_sites' or 'all_sites' to " \
814
                            "remove/ignore disorder"
815
                    raise ParseError(msg)
816
                elif disorder == 'ordered_sites':
817
                    remove_sites.append(i)
818
819
    asym_unit = [xyz for i, xyz in enumerate(asym_unit) if i not in remove_sites]
820
    asym_types = [t for i, t in enumerate(asym_types) if i not in remove_sites]
821
    asym_unit = np.array(asym_unit)
822
823
    if asym_unit.shape[0] == 0:
824
        raise ParseError(f'{block.header} has no valid sites')
825
826
    if cartesian:
827
        asym_unit = asym_unit @ np.linalg.inv(cell)
828
    asym_unit = np.mod(asym_unit, 1)
829
    # asym_unit = _snap_small_prec_coords(asym_unit) # recommended by pymatgen
830
831
    # remove overlapping sites unless disorder == 'all_sites'
832
    if disorder != 'all_sites':      
833
        keep_sites = _unique_sites(asym_unit)
834
        if not np.all(keep_sites):
835
            msg = 'may have overlapping sites; duplicates will be removed'
836
            warnings.warn(msg)
837
        asym_unit = asym_unit[keep_sites]
838
        asym_types = [sym for sym, keep in zip(asym_types, keep_sites) if keep]
839
840
    rot, trans = _parse_sitesym_pymatgen(odict)
841
    frac_motif, asym_inds, wyc_muls, inverses = _expand_asym_unit(asym_unit, rot, trans)
842
    types = np.array([asym_types[i] for i in inverses])
843
    motif = frac_motif @ cell
844
845
    return PeriodicSet(
846
        motif=motif,
847
        cell=cell,
848
        name=block.header,
849
        asymmetric_unit=asym_inds,
850
        wyckoff_multiplicities=wyc_muls,
851
        types=types
852
    )
853
854
855
def periodicset_from_pymatgen_structure(
856
        structure,
857
        remove_hydrogens=False,
858
        disorder='skip'
859
) -> PeriodicSet:
860
    """:class:`pymatgen.core.structure.Structure` -->
861
    :class:`amd.PeriodicSet <.periodicset.PeriodicSet>`.
862
    Does not set the name of the periodic set, as there seems to be no
863
    name attribute in the pymatgen Structure object.
864
865
    Parameters
866
    ----------
867
    structure : :class:`pymatgen.core.structure.Structure`
868
        A pymatgen Structure object representing a crystal.
869
    remove_hydrogens : bool, optional
870
        Remove Hydrogens from the crystal.
871
    disorder : str, optional
872
        Controls how disordered structures are handled. Default is
873
        ``skip`` which skips any crystal with disorder, since disorder
874
        conflicts with the periodic set model. To read disordered
875
        structures anyway, choose either :code:`ordered_sites` to remove
876
        atoms with disorder or :code:`all_sites` include all atoms
877
        regardless of disorder.
878
879
    Returns
880
    -------
881
    :class:`amd.PeriodicSet <.periodicset.PeriodicSet>`
882
        Represents the crystal as a periodic set, consisting of a finite
883
        set of points (motif) and lattice (unit cell). Contains other
884
        useful data, e.g. the crystal's name and information about the
885
        asymmetric unit for calculation.
886
    
887
    Raises
888
    ------
889
    ParseError
890
        Raised if the :code:`disorder == 'skip'` and 
891
        :code:`not structure.is_ordered`
892
    """
893
894
    from pymatgen.symmetry.analyzer import SpacegroupAnalyzer
895
896
    if remove_hydrogens:
897
        structure.remove_species(['H', 'D'])
898
899
    if disorder == 'skip':
900
        if not structure.is_ordered:
901
            msg = f"pymatgen Structure has disorder, pass " \
902
                   "disorder='ordered_sites' or 'all_sites' to " \
903
                   "remove/ignore disorder"
904
            raise ParseError(msg)
905
    elif disorder == 'ordered_sites':
906
        remove_inds = []
907
        for i, comp in enumerate(structure.species_and_occu):
908
            if comp.num_atoms < 1:
909
                remove_inds.append(i)
910
        structure.remove_sites(remove_inds)
911
912
    motif = structure.cart_coords
913
    cell = structure.lattice.matrix
914
    sym_structure = SpacegroupAnalyzer(structure).get_symmetrized_structure()
915
    asym_unit = np.array([l[0] for l in sym_structure.equivalent_indices])
916
    wyc_muls = np.array([len(l) for l in sym_structure.equivalent_indices])
917
    types = np.array(sym_structure.atomic_numbers)
918
919
    return PeriodicSet(
920
        motif=motif,
921
        cell=cell,
922
        asymmetric_unit=asym_unit,
923
        wyckoff_multiplicities=wyc_muls,
924
        types=types
925
    )
926
927
928
def periodicset_from_ccdc_entry(
929
        entry,
930
        remove_hydrogens=False,
931
        disorder='skip',
932
        heaviest_component=False,
933
        molecular_centres=False
934
) -> PeriodicSet:
935
    """:class:`ccdc.entry.Entry` -->
936
    :class:`amd.PeriodicSet <.periodicset.PeriodicSet>`.
937
    Entry is the type returned by :class:`ccdc.io.EntryReader`.
938
939
    Parameters
940
    ----------
941
    entry : :class:`ccdc.entry.Entry`
942
        A ccdc Entry object representing a database entry.
943
    remove_hydrogens : bool, optional
944
        Remove Hydrogens from the crystal.
945
    disorder : str, optional
946
        Controls how disordered structures are handled. Default is
947
        ``skip`` which skips any crystal with disorder, since disorder
948
        conflicts with the periodic set model. To read disordered
949
        structures anyway, choose either :code:`ordered_sites` to remove
950
        atoms with disorder or :code:`all_sites` include all atoms
951
        regardless of disorder.
952
    heaviest_component : bool, optional
953
        Removes all but the heaviest molecule in the asymmeric unit,
954
        intended for removing solvents.
955
    molecular_centres : bool, default False
956
        Extract the centres of molecules in the unit cell and store in
957
        the attribute molecular_centres of the returned PeriodicSet.
958
959
    Returns
960
    -------
961
    :class:`amd.PeriodicSet <.periodicset.PeriodicSet>`
962
        Represents the crystal as a periodic set, consisting of a finite
963
        set of points (motif) and lattice (unit cell). Contains other
964
        useful data, e.g. the crystal's name and information about the
965
        asymmetric unit for calculation.
966
967
    Raises
968
    ------
969
    ParseError
970
        Raised if the structure fails parsing for any of the following:
971
        1. entry.has_3d_structure is False, 2.
972
        :code:``disorder == 'skip'`` and disorder is found on any atom,
973
        3. entry.crystal.molecule.all_atoms_have_sites is False,
974
        4. a.fractional_coordinates is None for any a in
975
        entry.crystal.disordered_molecule, 5. The motif is empty after
976
        removing Hydrogens and disordered sites.
977
    """
978
979
    # Entry specific flags
980
    if not entry.has_3d_structure:
981
        raise ParseError(f'{entry.identifier} has no 3D structure')
982
983
    if disorder == 'skip' and entry.has_disorder:
984
        msg = f"{entry.identifier} has disorder, pass " \
985
                "disorder='ordered_sites' or 'all_sites' to remove/ignore" \
986
                "disorder"
987
        raise ParseError(msg)
988
989
    return periodicset_from_ccdc_crystal(
990
        entry.crystal,
991
        remove_hydrogens=remove_hydrogens,
992
        disorder=disorder,
993
        heaviest_component=heaviest_component,
994
        molecular_centres=molecular_centres
995
    )
996
997
998
def periodicset_from_ccdc_crystal(
999
        crystal,
1000
        remove_hydrogens=False,
1001
        disorder='skip',
1002
        heaviest_component=False,
1003
        molecular_centres=False
1004
) -> PeriodicSet:
1005
    """:class:`ccdc.crystal.Crystal` -->
1006
    :class:`amd.PeriodicSet <.periodicset.PeriodicSet>`.
1007
    Crystal is the type returned by :class:`ccdc.io.CrystalReader`.
1008
1009
    Parameters
1010
    ----------
1011
    crystal : :class:`ccdc.crystal.Crystal`
1012
        A ccdc Crystal object representing a crystal structure.
1013
    remove_hydrogens : bool, optional
1014
        Remove Hydrogens from the crystal.
1015
    disorder : str, optional
1016
        Controls how disordered structures are handled. Default is
1017
        ``skip`` which skips any crystal with disorder, since disorder
1018
        conflicts with the periodic set model. To read disordered
1019
        structures anyway, choose either :code:`ordered_sites` to remove
1020
        atoms with disorder or :code:`all_sites` include all atoms
1021
        regardless of disorder.
1022
    heaviest_component : bool, optional
1023
        Removes all but the heaviest molecule in the asymmeric unit,
1024
        intended for removing solvents.
1025
    molecular_centres : bool, default False
1026
        Extract the centres of molecules in the unit cell and store in
1027
        the attribute molecular_centres of the returned PeriodicSet.
1028
1029
    Returns
1030
    -------
1031
    :class:`amd.PeriodicSet <.periodicset.PeriodicSet>`
1032
        Represents the crystal as a periodic set, consisting of a finite
1033
        set of points (motif) and lattice (unit cell). Contains other
1034
        useful data, e.g. the crystal's name and information about the
1035
        asymmetric unit for calculation.
1036
1037
    Raises
1038
    ------
1039
    ParseError
1040
        Raised if the structure fails parsing for any of the following:
1041
        1. :code:``disorder == 'skip'`` and disorder is found on any
1042
        atom, 2. crystal.molecule.all_atoms_have_sites is False,
1043
        3. a.fractional_coordinates is None for any a in
1044
        crystal.disordered_molecule, 4. The motif is empty after
1045
        removing H, disordered sites or solvents.
1046
    """
1047
1048
    molecule = crystal.disordered_molecule
1049
1050
    # if skipping disorder, check disorder flags and then all atoms
1051
    if disorder == 'skip':
1052
        if crystal.has_disorder or \
1053
         any(_atom_has_disorder(a.label, a.occupancy) for a in molecule.atoms):
1054
            msg = f"{crystal.identifier} has disorder, pass " \
1055
                   "disorder='ordered_sites' or 'all_sites' to remove/ignore" \
1056
                   "disorder"
1057
            raise ParseError(msg)
1058
1059
    elif disorder == 'ordered_sites':
1060
        molecule.remove_atoms(a for a in molecule.atoms
1061
                              if _atom_has_disorder(a.label, a.occupancy))
1062
1063
    if remove_hydrogens:
1064
        molecule.remove_atoms(
1065
            a for a in molecule.atoms if a.atomic_symbol in 'HD'
1066
        )
1067
1068
    if heaviest_component and len(molecule.components) > 1:
1069
        molecule = _heaviest_component(molecule)
1070
1071
    # remove atoms with missing coordinate data and warn
1072
    is_missing = (a.fractional_coordinates is None for a in molecule.atoms)
0 ignored issues
show
Comprehensibility Best Practice introduced by
The variable a does not seem to be defined.
Loading history...
1073
    if any(is_missing):
1074
        warnings.warn('atoms without sites or missing data will be removed')
1075
        molecule.remove_atoms(
1076
            a for a, missing in zip(molecule.atoms, is_missing) if missing
1077
        )
1078
1079
    if not molecule.all_atoms_have_sites:
1080
        raise ParseError(f'{crystal.identifier} has atoms without sites')
1081
1082
    crystal.molecule = molecule
1083
    cell = cellpar_to_cell(*crystal.cell_lengths, *crystal.cell_angles)
1084
1085
    if molecular_centres:
1086
        frac_centres = _frac_molecular_centres_ccdc(crystal)
1087
        mol_centres = frac_centres @ cell
1088
        return PeriodicSet(mol_centres, cell, name=crystal.identifier)
1089
1090
    asym_atoms = crystal.asymmetric_unit_molecule.atoms
1091
    # check for None?
1092
    asym_unit = np.array([tuple(a.fractional_coordinates) for a in asym_atoms])
1093
1094
    if asym_unit.shape[0] == 0:
1095
        raise ParseError(f'{crystal.identifier} has no valid sites')
1096
1097
    asym_unit = np.mod(asym_unit, 1)
1098
    # asym_unit = _snap_small_prec_coords(asym_unit)
1099
    asym_types = [a.atomic_number for a in asym_atoms]
1100
1101
    if disorder != 'all_sites':
1102
        keep_sites = _unique_sites(asym_unit)
1103
        if not np.all(keep_sites):
1104
            msg = 'may have overlapping sites; duplicates will be removed'
1105
            warnings.warn(msg)
1106
        asym_unit = asym_unit[keep_sites]
1107
        asym_types = [sym for sym, keep in zip(asym_types, keep_sites) if keep]
1108
1109
    sitesym = crystal.symmetry_operators
1110
    # try spacegroup numbers?
1111
    if not sitesym:
1112
        sitesym = ['x,y,z']
1113
1114
    rot = np.array([np.array(crystal.symmetry_rotation(op)).reshape((3, 3)) 
1115
                    for op in sitesym])
1116
    trans = np.array([crystal.symmetry_translation(op) for op in sitesym])
1117
    frac_motif, asym_inds, wyc_muls, inverses = _expand_asym_unit(asym_unit, rot, trans)
1118
    motif = frac_motif @ cell
1119
    types = np.array([asym_types[i] for i in inverses])
1120
1121
    return PeriodicSet(
1122
        motif=motif,
1123
        cell=cell,
1124
        name=crystal.identifier,
1125
        asymmetric_unit=asym_inds,
1126
        wyckoff_multiplicities=wyc_muls,
1127
        types=types
1128
    )
1129
1130
1131
# function is not quite finished.
1132
def periodicset_from_gemmi_block(
1133
        block,
1134
        remove_hydrogens=False,
1135
        disorder='skip'
1136
) -> PeriodicSet:
1137
    """:class:`gemmi.cif.Block` -->
1138
    :class:`amd.PeriodicSet <.periodicset.PeriodicSet>`.
1139
    Block is the type returned by :class:`gemmi.cif.read_file`.
1140
1141
    Parameters
1142
    ----------
1143
    block : :class:`ase.io.cif.CIFBlock`
1144
        An ase CIFBlock object representing a crystal.
1145
    remove_hydrogens : bool, optional
1146
        Remove Hydrogens from the crystal.
1147
    disorder : str, optional
1148
        Controls how disordered structures are handled. Default is
1149
        ``skip`` which skips any crystal with disorder, since disorder
1150
        conflicts with the periodic set model. To read disordered
1151
        structures anyway, choose either :code:`ordered_sites` to remove
1152
        atoms with disorder or :code:`all_sites` include all atoms
1153
        regardless of disorder.
1154
1155
    Returns
1156
    -------
1157
    :class:`amd.PeriodicSet <.periodicset.PeriodicSet>`
1158
        Represents the crystal as a periodic set, consisting of a finite
1159
        set of points (motif) and lattice (unit cell). Contains other
1160
        useful data, e.g. the crystal's name and information about the
1161
        asymmetric unit for calculation.
1162
1163
    Raises
1164
    ------
1165
    ParseError
1166
        Raised if the structure fails to be parsed for any of the
1167
        following: 1. Required data is missing (e.g. cell parameters),
1168
        2. :code:``disorder == 'skip'`` and disorder is found on any
1169
        atom, 3. The motif is empty after removing H or disordered
1170
        sites.
1171
    """
1172
1173
    cellpar = [ase_convert_value(block.find_value(tag))
1174
               for tag in CIF_TAGS['cellpar']]
1175
    if None in cellpar:
1176
        raise ParseError(f'{block.name} has missing cell information')
1177
1178
    cell = cellpar_to_cell(*cellpar)
1179
1180
    xyz_loop = block.find(CIF_TAGS['atom_site_fract']).loop
1181
    if xyz_loop is None:
1182
        # check for crtsn!
1183
        raise ParseError(f'{block.name} has missing coordinate data')
1184
1185
    # loop will contain all xyz cols, maybe labels, maybe types, maybe occs
1186
1187
    loop_dict = _gemmi_loop_to_dict(xyz_loop)
1188
    xyz_str = [loop_dict[t] for t in CIF_TAGS['atom_site_fract']]
1189
    asym_unit = [[ase_convert_value(c) for c in coords] for coords in xyz_str]
1190
1191
    asym_unit = np.mod(np.array(asym_unit).T, 1)
1192
    # asym_unit = _snap_small_prec_coords(asym_unit)
1193
1194
    if '_atom_site_type_symbol' in loop_dict:
1195
        asym_syms = loop_dict['_atom_site_type_symbol']
1196
        asym_types = [ase_atomic_numbers[s] for s in asym_syms]
1197
    else:
1198
        warnings.warn('missing atomic types')
1199
        asym_types = [0 for _ in range(len(asym_unit))]
1200
1201
    remove_sites = []
1202
1203
    # if labels exist, check them for disorder
1204
    if '_atom_site_label' in loop_dict:
1205
        labels = loop_dict['_atom_site_label']
1206
    else:
1207
        labels = [''] * xyz_loop.length()
1208
1209
    # if occupancies exist, check them for disorder
1210
    if '_atom_site_occupancy' in loop_dict:
1211
        occupancies = [ase_convert_value(occ)
1212
                       for occ in loop_dict['_atom_site_occupancy']]
1213
    else:
1214
        occupancies = [None for _ in range(xyz_loop.length())]
1215
1216
    if disorder == 'skip':
1217
        if any(_atom_has_disorder(lab, occ) for lab, occ in zip(labels, occupancies)):
1218
            msg = f"{block.name} has disorder, pass " \
1219
                   "disorder='ordered_sites' or 'all_sites' to " \
1220
                   "remove/ignore disorder"
1221
            raise ParseError(msg)
1222
    elif disorder == 'ordered_sites':
1223
        for i, (lab, occ) in enumerate(zip(labels, occupancies)):
1224
            if _atom_has_disorder(lab, occ):
1225
                remove_sites.append(i)
1226
1227
    if remove_hydrogens:
1228
        remove_sites.extend(
1229
            i for i, num in enumerate(asym_types) if num == 1
1230
        )
1231
1232
    asym_unit = np.delete(asym_unit, remove_sites, axis=0)
1233
    asym_types = [s for i, s in enumerate(asym_types) if i not in remove_sites]
1234
1235
    if disorder != 'all_sites':
1236
        keep_sites = _unique_sites(asym_unit)
1237
        if not np.all(keep_sites):
1238
            msg = 'may have overlapping sites; duplicates will be removed'
1239
            warnings.warn(msg)
1240
        asym_unit = asym_unit[keep_sites]
1241
        asym_types = [sym for sym, keep in zip(asym_types, keep_sites) if keep]
1242
1243
    if asym_unit.shape[0] == 0:
1244
        raise ParseError(f'{block.name} has no valid sites')
1245
1246
    # get symops
1247
    sitesym = []
1248
    for tag in CIF_TAGS['symop']:
1249
        symop_loop = block.find([tag]).loop
1250
        if symop_loop is not None:
1251
            symop_loop_dict = _gemmi_loop_to_dict(symop_loop)
1252
            sitesym = symop_loop_dict[tag]
1253
            break
1254
    # try spacegroup names/nums?
1255
    if not sitesym:
1256
        sitesym = ['x,y,z',]
1257
1258
    rot, trans = ase_parse_sitesym(sitesym)
1259
    frac_motif, asym_inds, wyc_muls, inverses = _expand_asym_unit(asym_unit, rot, trans)
1260
    types = np.array([asym_types[i] for i in inverses])
1261
    motif = frac_motif @ cell
1262
1263
    return PeriodicSet(
1264
        motif=motif,
1265
        cell=cell,
1266
        name=block.name,
1267
        asymmetric_unit=asym_inds,
1268
        wyckoff_multiplicities=wyc_muls,
1269
        types=types
1270
    )
1271
1272
1273
def _expand_asym_unit(
1274
        asym_unit: np.ndarray,
1275
        rotations: np.ndarray,
1276
        translations: np.ndarray
1277
) -> Tuple[np.ndarray, ...]:
1278
    """
1279
    Asymmetric unit frac coords, list of rotations, list of translations
1280
    -->
1281
    full fractional motif, asymmetric unit indices, multiplicities,
1282
    inverses.
1283
    """
1284
1285
    all_sites = []
1286
    asym_inds = [0]
1287
    multiplicities = []
1288
    inverses = []
1289
    m, dims = asym_unit.shape
1290
1291
    expanded_sites = np.zeros((m, len(rotations), dims))
1292
    for i in range(m):
1293
        expanded_sites[i] = np.dot(rotations, asym_unit[i]) + translations
1294
    expanded_sites = np.mod(expanded_sites, 1)
1295
1296
    for inv, sites in enumerate(expanded_sites):
1297
1298
        multiplicity = 0
1299
1300
        for site_ in sites:
1301
1302
            if not all_sites:
1303
                all_sites.append(site_)
1304
                inverses.append(inv)
1305
                multiplicity += 1
1306
                continue
1307
1308
            # check if site_ overlaps with existing sites
1309
            diffs1 = np.abs(site_ - all_sites)
1310
            diffs2 = np.abs(diffs1 - 1)
1311
            mask = np.all((diffs1 <= _EQUIV_SITE_TOL) | 
1312
                          (diffs2 <= _EQUIV_SITE_TOL), axis=-1)
1313
1314
            if np.any(mask):
1315
                where_equal = np.argwhere(mask).flatten()
1316
                for ind in where_equal:
1317
                    if inverses[ind] == inv: # invariant
1318
                        pass
1319
                    else: # equivalent to a different site
1320
                        msg = f'has equivalent sites at positions' \
1321
                              f'{inverses[ind]}, {inv}'
1322
                        warnings.warn(msg)
1323
            else:
1324
                all_sites.append(site_)
1325
                inverses.append(inv)
1326
                multiplicity += 1
1327
1328
        if multiplicity > 0:
1329
            multiplicities.append(multiplicity)
1330
            asym_inds.append(len(all_sites))
1331
1332
    frac_motif = np.array(all_sites)
1333
    asym_inds = np.array(asym_inds[:-1])
1334
    multiplicities = np.array(multiplicities)
1335
1336
    return frac_motif, asym_inds, multiplicities, inverses
1337
1338
1339
@numba.njit()
1340
def _unique_sites(asym_unit):
1341
    """Uniquify (within _EQUIV_SITE_TOL) a list of fractional
1342
    coordinates, considering all points modulo 1. Returns an array of
1343
    bools such that asym_unit[_unique_sites(asym_unit)] is a uniquified
1344
    list.
1345
    """
1346
1347
    site_diffs1 = np.abs(np.expand_dims(asym_unit, 1) - asym_unit)
1348
    site_diffs2 = np.abs(site_diffs1 - 1)
1349
    sites_neq_mask = np.logical_and((site_diffs1 > _EQUIV_SITE_TOL),
1350
                                    (site_diffs2 > _EQUIV_SITE_TOL))
1351
    overlapping = np.triu(sites_neq_mask.sum(axis=-1) == 0, 1)
1352
    return overlapping.sum(axis=0) == 0
1353
1354
1355
def _parse_sitesym_pymatgen(data):
1356
    """In order to generate symmetry equivalent positions, the symmetry
1357
    operations are parsed. If the symops are not present, the space
1358
    group symbol is parsed, and symops are generated.
1359
    """
1360
1361
    from pymatgen.symmetry.groups import SpaceGroup
1362
    from pymatgen.core.operations import SymmOp
1363
    import pymatgen.io.cif
1364
1365
    symops = []
1366
1367
    # try to parse xyz symops
1368
    for symmetry_label in CIF_TAGS['symop']:
1369
1370
        xyz = data.get(symmetry_label)
1371
        if not xyz:
1372
            continue
1373
        if isinstance(xyz, str):
1374
            xyz = [xyz]
1375
        try:
1376
            symops = [SymmOp.from_xyz_string(s) for s in xyz]
1377
            break
1378
        except ValueError:
1379
            continue
1380
1381
    # try to parse symbol
1382
    if not symops:
1383
1384
        for symmetry_label in CIF_TAGS['spacegroup_name']:
1385
1386
            sg = data.get(symmetry_label)
1387
            if not sg:
1388
                continue
1389
            sg = re.sub(r'[\s_]', '', sg)
1390
1391
            try:
1392
                spg = pymatgen.io.cif.space_groups.get(sg)
1393
                if not spg:
1394
                    continue
1395
                symops = SpaceGroup(spg).symmetry_ops
1396
                break
1397
            except ValueError:
1398
                pass
1399
1400
            try:
1401
                for d in pymatgen.io.cif._get_cod_data():
1402
                    if sg == re.sub(r'\s+', '', d['hermann_mauguin']):
1403
                        xyz = d['symops']
1404
                        symops = [SymmOp.from_xyz_string(s) for s in xyz]
1405
                        break
1406
            except Exception:
1407
                continue
1408
1409
            if symops:
1410
                break
1411
1412
    # try to parse international number
1413
    if not symops:
1414
        for symmetry_label in CIF_TAGS['spacegroup_number']:
1415
            num = data.get(symmetry_label)
1416
            if not num:
1417
                continue
1418
1419
            try:
1420
                i = int(ase_convert_value(num))
1421
                symops = SpaceGroup.from_int_number(i).symmetry_ops
1422
                break
1423
            except ValueError:
1424
                continue
1425
1426
    if not symops:
1427
        symops = [SymmOp.from_xyz_string(s) for s in ['x', 'y', 'z']]
1428
1429
    rotations = [op.rotation_matrix for op in symops]
1430
    translations = [op.translation_vector for op in symops]
1431
1432
    return rotations, translations
1433
1434
1435
def _frac_molecular_centres_ccdc(crystal):
1436
    """Returns any geometric centres of molecules in the unit cell.
1437
    Expects a ccdc Crystal object and returns fractional coordiantes.
1438
    """
1439
1440
    frac_centres = []
1441
    for comp in crystal.packing(inclusion='CentroidIncluded').components:
1442
        coords = [a.fractional_coordinates for a in comp.atoms]
1443
        x, y, z = zip(*coords)
1444
        m = len(coords)
1445
        frac_centres.append((sum(x) / m, sum(y) / m, sum(z) / m))
1446
    frac_centres = np.mod(np.array(frac_centres), 1)
1447
    return frac_centres[_unique_sites(frac_centres)]
1448
1449
1450
def _atom_has_disorder(label, occupancy):
1451
    """Return True if label ends with ? or occupancy is a number < 1.
1452
    """
1453
    return label.endswith('?') or (np.isscalar(occupancy) and occupancy < 1)
1454
1455
1456
# def _snap_small_prec_coords(frac_coords):
1457
#     """Find where frac_coords is within 1e-4 of 1/3 or 2/3, change to
1458
#     1/3 and 2/3. Recommended by pymatgen's CIF parser. May use later.
1459
#     """
1460
#
1461
#     frac_coords[np.abs(1 - 3 * frac_coords) < 1e-4] = 1 / 3.
1462
#     frac_coords[np.abs(1 - 3 * frac_coords / 2) < 1e-4] = 2 / 3.
1463
#     return frac_coords
1464
1465
1466
def _heaviest_component(molecule):
1467
    """Heaviest component (removes all but the heaviest component of the
1468
    asym unit). Intended for removing solvents. Probably doesn't play
1469
    well with disorder.
1470
    """
1471
1472
    component_weights = []
1473
    for component in molecule.components:
1474
        weight = 0
1475
        for a in component.atoms:
1476
            if isinstance(a.atomic_weight, (float, int)):
1477
                if isinstance(a.occupancy, (float, int)):
1478
                    weight += a.occupancy * a.atomic_weight
1479
                else:
1480
                    weight += a.atomic_weight
1481
        component_weights.append(weight)
1482
    largest_component_ind = np.argmax(np.array(component_weights))
1483
    molecule = molecule.components[largest_component_ind]
1484
    return molecule
1485
1486
1487
def _refcodes_from_families(refcode_families):
1488
    """List of strings --> all CSD refcodes starting with any of the
1489
    strings. Intended to be passed a list of families and return all
1490
    refcodes in them.
1491
    """
1492
1493
    try:
1494
        import ccdc.search
1495
    except (ImportError, RuntimeError) as _:
1496
        msg = 'Failed to import csd-python-api, please check it is ' \
1497
              'installed and licensed.'
1498
        raise ImportError(msg)
1499
1500
    all_refcodes = []
1501
    for refcode in refcode_families:
1502
        query = ccdc.search.TextNumericSearch()
1503
        query.add_identifier(refcode)
1504
        hits = [hit.identifier for hit in query.search()]
1505
        all_refcodes.extend(hits)
1506
1507
    # filter to unique refcodes
1508
    seen = set()
1509
    seen_add = seen.add
1510
    refcodes = [
1511
        refcode for refcode in all_refcodes
1512
        if not (refcode in seen or seen_add(refcode))]
1513
1514
    return refcodes
1515
1516
1517
def _gemmi_loop_to_dict(gemmi_loop):
1518
    """gemmi Loop object --> dict, tags: values
1519
    """
1520
1521
    tablified_loop = [[] for _ in range(len(gemmi_loop.tags))]
1522
    n_cols = gemmi_loop.width()
1523
    for i, item in enumerate(gemmi_loop.values):
1524
        tablified_loop[i % n_cols].append(item)
1525
    return {tag: l for tag, l in zip(gemmi_loop.tags, tablified_loop)}
1526