Conditions | 20 |
Total Lines | 250 |
Code Lines | 97 |
Lines | 0 |
Ratio | 0 % |
Changes | 0 |
Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.
For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.
Commonly applied refactorings include:
If many parameters/temporary variables are present:
Complex classes like amd.compare.compare() often do a lot of different things. To break such a class down, we need to identify a cohesive component within that class. A common approach to find such a component is to look for fields/methods that share the same prefixes, or suffixes.
Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.
Methods with many parameters are not only hard to understand, but their parameters also often become inconsistent when you need more, or different data.
There are several approaches to avoid long parameter lists:
1 | """Functions for comparing AMDs and PDDs of crystals. |
||
32 | def compare( |
||
33 | crystals, |
||
34 | crystals_=None, |
||
35 | by: str = 'AMD', |
||
36 | k: int = 100, |
||
37 | nearest: Optional[int] = None, |
||
38 | reader: str = 'gemmi', |
||
39 | remove_hydrogens: bool = False, |
||
40 | disorder: str = 'skip', |
||
41 | heaviest_component: bool = False, |
||
42 | molecular_centres: bool = False, |
||
43 | csd_refcodes: bool = False, |
||
44 | refcode_families: bool = False, |
||
45 | show_warnings: bool = True, |
||
46 | collapse_tol: float = 1e-4, |
||
47 | metric: str = 'chebyshev', |
||
48 | n_jobs: Optional[int] = None, |
||
49 | backend: str = 'multiprocessing', |
||
50 | verbose: bool = False, |
||
51 | low_memory: bool = False, |
||
52 | **kwargs |
||
53 | ) -> pd.DataFrame: |
||
54 | r"""Given one or two sets of crystals, compare by AMD or PDD and |
||
55 | return a pandas DataFrame of the distance matrix. |
||
56 | |||
57 | Given one or two paths to CIFs, periodic sets, CSD refcodes or lists |
||
58 | thereof, compare by AMD or PDD and return a pandas DataFrame of the |
||
59 | distance matrix. Default is to comapre by AMD with k = 100. Accepts |
||
60 | most keyword arguments accepted by |
||
61 | :class:`CifReader <.io.CifReader>`, |
||
62 | :class:`CSDReader <.io.CSDReader>` and functions from |
||
63 | :mod:`.compare`. |
||
64 | |||
65 | Parameters |
||
66 | ---------- |
||
67 | crystals : list of str or :class:`PeriodicSet <.periodicset.PeriodicSet>` |
||
68 | A path, :class:`PeriodicSet <.periodicset.PeriodicSet>`, tuple |
||
69 | or a list of those. |
||
70 | crystals\_ : list of str or :class:`PeriodicSet <.periodicset.PeriodicSet>`, optional |
||
71 | A path, :class:`PeriodicSet <.periodicset.PeriodicSet>`, tuple |
||
72 | or a list of those. |
||
73 | by : str, default 'AMD' |
||
74 | Use AMD or PDD to compare crystals. |
||
75 | k : int, default 100 |
||
76 | Parameter for AMD/PDD, the number of neighbour atoms to consider |
||
77 | for each atom in a unit cell. |
||
78 | nearest : int, deafult None |
||
79 | Find a number of nearest neighbours instead of a full distance |
||
80 | matrix between crystals. |
||
81 | reader : str, optional |
||
82 | The backend package used to parse the CIF. The default is |
||
83 | :code:`gemmi`, :code:`pymatgen` and :code:`ase` are also |
||
84 | accepted, as well as :code:`ccdc` if csd-python-api is |
||
85 | installed. The ccdc reader should be able to read any format |
||
86 | accepted by :class:`ccdc.io.EntryReader`, though only CIFs have |
||
87 | been tested. |
||
88 | remove_hydrogens : bool, optional |
||
89 | Remove hydrogens from the crystals. |
||
90 | disorder : str, optional |
||
91 | Controls how disordered structures are handled. Default is |
||
92 | ``skip`` which skips any crystal with disorder, since disorder |
||
93 | conflicts with the periodic set model. To read disordered |
||
94 | structures anyway, choose either :code:`ordered_sites` to remove |
||
95 | atoms with disorder or :code:`all_sites` include all atoms |
||
96 | regardless of disorder. |
||
97 | heaviest_component : bool, optional, csd-python-api only |
||
98 | Removes all but the heaviest molecule in |
||
99 | the asymmeric unit, intended for removing solvents. |
||
100 | molecular_centres : bool, default False, csd-python-api only |
||
101 | Use the centres of molecules for comparison |
||
102 | instead of centres of atoms. |
||
103 | csd_refcodes : bool, optional, csd-python-api only |
||
104 | Interpret ``crystals`` and ``crystals_`` as CSD refcodes or |
||
105 | lists thereof, rather than paths. |
||
106 | refcode_families : bool, optional, csd-python-api only |
||
107 | Read all entries whose refcode starts with |
||
108 | the given strings, or 'families' (e.g. giving 'DEBXIT' reads all |
||
109 | entries with refcodes starting with DEBXIT). |
||
110 | show_warnings : bool, optional |
||
111 | Controls whether warnings that arise during reading are printed. |
||
112 | collapse_tol: float, default 1e-4, ``by='PDD'`` only |
||
113 | If two PDD rows have all elements closer |
||
114 | than ``collapse_tol``, they are merged and weights are given to |
||
115 | rows in proportion to the number of times they appeared. |
||
116 | metric : str or callable, default 'chebyshev' |
||
117 | The metric to compare AMDs/PDDs with. AMDs are compared directly |
||
118 | with this metric. EMD is the metric used between PDDs, which |
||
119 | requires giving a metric to use between PDD rows. Chebyshev |
||
120 | (L-infinity) distance is the default. Accepts any metric |
||
121 | accepted by :func:`scipy.spatial.distance.cdist`. |
||
122 | n_jobs : int, default None, ``by='PDD'`` only |
||
123 | Maximum number of concurrent jobs for |
||
124 | parallel processing with :code:`joblib`. Set to -1 to use the |
||
125 | maximum. Using parallel processing may be slower for small |
||
126 | inputs. |
||
127 | backend : str, default 'multiprocessing', ``by='PDD'`` only |
||
128 | The parallelization backend implementation for PDD comparisons. |
||
129 | For a list of supported backends, see the backend argument of |
||
130 | :class:`joblib.Parallel`. |
||
131 | verbose : bool, default False |
||
132 | Prints a progress bar when reading crystals, calculating |
||
133 | AMDs/PDDs and comparing PDDs. If using parallel processing |
||
134 | (n_jobs > 1), the verbose argument of :class:`joblib.Parallel` |
||
135 | is used, otherwise uses ``tqdm``. |
||
136 | low_memory : bool, default False, ``by='AMD'`` only |
||
137 | Use a slower but more memory efficient |
||
138 | method for large collections of AMDs (metric 'chebyshev' only). |
||
139 | |||
140 | Returns |
||
141 | ------- |
||
142 | df : :class:`pandas.DataFrame` |
||
143 | DataFrame of the distance matrix for the given crystals compared |
||
144 | by the chosen invariant. |
||
145 | |||
146 | Raises |
||
147 | ------ |
||
148 | ValueError |
||
149 | If by is not 'AMD' or 'PDD', if either set given have no valid |
||
150 | crystals to compare, or if crystals or crystals\_ are an invalid |
||
151 | type. |
||
152 | |||
153 | Examples |
||
154 | -------- |
||
155 | Compare everything in a .cif (deafult, AMD with k=100):: |
||
156 | |||
157 | df = amd.compare('data.cif') |
||
158 | |||
159 | Compare everything in one cif with all crystals in all cifs in a |
||
160 | directory (PDD, k=50):: |
||
161 | |||
162 | df = amd.compare('data.cif', 'dir/to/cifs', by='PDD', k=50) |
||
163 | |||
164 | **Examples (csd-python-api only)** |
||
165 | |||
166 | Compare two crystals by CSD refcode (PDD, k=50):: |
||
167 | |||
168 | df = amd.compare('DEBXIT01', 'DEBXIT02', csd_refcodes=True, by='PDD', k=50) |
||
169 | |||
170 | Compare everything in a refcode family (AMD, k=100):: |
||
171 | |||
172 | df = amd.compare('DEBXIT', csd_refcodes=True, families=True) |
||
173 | """ |
||
174 | |||
175 | by = by.upper() |
||
176 | if by not in ('AMD', 'PDD'): |
||
177 | raise ValueError( |
||
178 | "'by' parameter of amd.compare() must be 'AMD' or 'PDD' (passed " |
||
179 | f"'{by}')" |
||
180 | ) |
||
181 | |||
182 | if heaviest_component or molecular_centres: |
||
183 | reader = 'ccdc' |
||
184 | |||
185 | if refcode_families: |
||
186 | csd_refcodes = True |
||
187 | |||
188 | reader_kwargs = { |
||
189 | 'reader': reader, |
||
190 | 'families': refcode_families, |
||
191 | 'remove_hydrogens': remove_hydrogens, |
||
192 | 'disorder': disorder, |
||
193 | 'heaviest_component': heaviest_component, |
||
194 | 'molecular_centres': molecular_centres, |
||
195 | 'show_warnings': show_warnings, |
||
196 | 'verbose': verbose, |
||
197 | } |
||
198 | |||
199 | compare_kwargs = { |
||
200 | 'metric': metric, |
||
201 | 'n_jobs': n_jobs, |
||
202 | 'backend': backend, |
||
203 | 'verbose': verbose, |
||
204 | 'low_memory': low_memory, |
||
205 | **kwargs |
||
206 | } |
||
207 | |||
208 | # Get list(s) of periodic sets from first input |
||
209 | if csd_refcodes: |
||
210 | crystals = _unwrap_refcode_list(crystals, **reader_kwargs) |
||
211 | else: |
||
212 | crystals = _unwrap_pset_list(crystals, **reader_kwargs) |
||
213 | |||
214 | if not crystals: |
||
215 | raise ValueError( |
||
216 | 'First argument passed to amd.compare() contains no valid ' |
||
217 | 'crystals/periodic sets' |
||
218 | ) |
||
219 | names = [s.name for s in crystals] |
||
220 | if verbose: |
||
221 | container = tqdm.tqdm(crystals, desc='Calculating', delay=1) |
||
222 | else: |
||
223 | container = crystals |
||
224 | |||
225 | # Get list(s) of periodic sets from second input if given |
||
226 | if crystals_ is None: |
||
227 | names_ = names |
||
228 | container_ = None |
||
229 | else: |
||
230 | if csd_refcodes: |
||
231 | crystals_ = _unwrap_refcode_list(crystals_, **reader_kwargs) |
||
232 | else: |
||
233 | crystals_ = _unwrap_pset_list(crystals_, **reader_kwargs) |
||
234 | if not crystals_: |
||
235 | raise ValueError( |
||
236 | 'Second argument passed to amd.compare() contains no ' |
||
237 | 'valid crystals/periodic sets' |
||
238 | ) |
||
239 | names_ = [s.name for s in crystals_] |
||
240 | if verbose: |
||
241 | container_ = tqdm.tqdm(crystals_, desc='Calculating', delay=1) |
||
242 | else: |
||
243 | container_ = crystals_ |
||
244 | |||
245 | if by == 'AMD': |
||
246 | invs = [AMD(s, k) for s in container] |
||
247 | if isinstance(container, tqdm.tqdm): |
||
248 | container.close() |
||
249 | compare_kwargs.pop('n_jobs', None) |
||
250 | compare_kwargs.pop('backend', None) |
||
251 | compare_kwargs.pop('verbose', None) |
||
252 | |||
253 | if crystals_ is None: |
||
254 | dm = AMD_pdist(invs, **compare_kwargs) |
||
255 | else: |
||
256 | invs_ = [AMD(s, k) for s in container_] |
||
257 | dm = AMD_cdist(invs, invs_, **compare_kwargs) |
||
258 | |||
259 | elif by == 'PDD': |
||
260 | invs = [PDD(s, k, collapse_tol=collapse_tol) for s in container] |
||
261 | compare_kwargs.pop('low_memory', None) |
||
262 | |||
263 | if crystals_ is None: |
||
264 | dm = PDD_pdist(invs, **compare_kwargs) |
||
265 | else: |
||
266 | invs_ = [PDD(s, k, collapse_tol=collapse_tol) for s in container_] |
||
267 | dm = PDD_cdist(invs, invs_, **compare_kwargs) |
||
268 | |||
269 | if nearest: |
||
270 | nn_dm, inds = neighbours_from_distance_matrix(nearest, dm) |
||
271 | data = {} |
||
272 | for i in range(nearest): |
||
273 | data['ID ' + str(i+1)] = [names_[j] for j in inds[:, i]] |
||
274 | data['DIST ' + str(i+1)] = nn_dm[:, i] |
||
275 | df = pd.DataFrame(data, index=names) |
||
276 | else: |
||
277 | if dm.ndim == 1: |
||
278 | dm = squareform(dm) |
||
279 | df = pd.DataFrame(dm, index=names, columns=names_) |
||
280 | |||
281 | return df |
||
282 | |||
644 |