Passed
Push — master (ff27bc...351d25) by Stefan, created 07:19

sciapy.regress.load_data.load_scia_dzm() (rated D)

Complexity

Conditions 12

Size

Total Lines 136
Code Lines 64

Duplication

Lines 136
Ratio 100 %

Code Coverage

Tests 37
CRAP Score 13.0792

Importance

Changes 0
Metric  Value
cc      12
eloc    64
nop     12
dl      136
loc     136
ccs     37
cts     46
cp      0.8043
crap    13.0792
rs      4.5381
c       0
b       0
f       0
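For reference, the CRAP ("Change Risk Anti-Patterns") score above follows directly from the complexity (cc) and coverage (cp) metrics. A minimal sketch of the standard formula, which reproduces the reported value:

```python
def crap_score(cc, cp):
    """CRAP(m) = cc(m)**2 * (1 - cp(m))**3 + cc(m).

    `cc` is the cyclomatic complexity, `cp` the coverage ratio (0..1).
    Full coverage reduces the score to the complexity itself.
    """
    return cc ** 2 * (1.0 - cp) ** 3 + cc

# The metrics reported above: cc = 12, cp = 0.8043
print(crap_score(12, 0.8043))  # ~13.0793, i.e. the reported 13.0792
```

Lowering the score thus means either reducing the complexity (cc) or covering more of the method with tests (cp).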

How to fix: Long Method, Complexity, Many Parameters

Long Method

Small methods make your code easier to understand, particularly when combined with a good name. And when a method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, that is usually a good sign that the commented part should be extracted into a new method; the comment then makes a good starting point for naming the new method.

Commonly applied refactorings include Extract Method.
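A minimal sketch of Extract Method (a hypothetical example, not taken from the sciapy code): the comment becomes the name of the new method.

```python
# Before: a comment inside a long method marks a hidden concept.
def report_mean_before(values):
    # compute the mean, ignoring missing (None) entries
    present = [v for v in values if v is not None]
    return "mean: %.2f" % (sum(present) / len(present))

# After Extract Method: the comment has become the method name.
def mean_ignoring_missing(values):
    present = [v for v in values if v is not None]
    return sum(present) / len(present)

def report_mean_after(values):
    return "mean: %.2f" % mean_ignoring_missing(values)

print(report_mean_after([1.0, None, 2.0]))  # -> mean: 1.50
```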

Complexity

Complex code like sciapy.regress.load_data.load_scia_dzm() often does a lot of different things. To break it down, we need to identify a cohesive component within it. A common approach to finding such a component is to look for fields and methods that share the same prefixes or suffixes.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.
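A minimal sketch of Extract Class under the same idea (all names hypothetical, not sciapy API): fields that share a prefix move into their own class.

```python
# Before: the "proxy_" prefix hints at a cohesive component.
class RegressionSetup:
    def __init__(self):
        self.proxy_name = "AE"
        self.proxy_file = "AE.dat"
        self.proxy_scale = 1.0

# After Extract Class: the prefixed fields form their own class.
class Proxy:
    def __init__(self, name, filename, scale=1.0):
        self.name = name
        self.filename = filename
        self.scale = scale

class RegressionSetupRefactored:
    def __init__(self):
        self.proxy = Proxy("AE", "AE.dat")
```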

Many Parameters

Methods with many parameters are not only hard to understand; their parameter lists also tend to become inconsistent as you need more, or different, data.

There are several approaches to avoiding long parameter lists, most notably Introduce Parameter Object and Preserve Whole Object.
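Introduce Parameter Object, sketched against load_scia_dzm()'s subsampling keywords (SubsampleOptions and subsampled_size are hypothetical names for illustration, not part of sciapy):

```python
from dataclasses import dataclass

@dataclass
class SubsampleOptions:
    """Parameter object bundling the subsampling-related keywords."""
    factor: int = 1
    method: str = "greedy"

def subsampled_size(n_points, opts):
    # Size of the data after subsampling, as computed in load_scia_dzm().
    return n_points // opts.factor

print(subsampled_size(100, SubsampleOptions(factor=4, method="equal")))  # -> 25
```

A caller now passes one object instead of two loosely coupled keywords, and adding a new subsampling option later does not change the function signature.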

# -*- coding: utf-8 -*-
# vim:fileencoding=utf-8
#
# Copyright (c) 2017-2018 Stefan Bender
#
# This module is part of sciapy.
# sciapy is free software: you can redistribute it or modify
# it under the terms of the GNU General Public License as published
# by the Free Software Foundation, version 2.
# See accompanying LICENSE file or http://www.gnu.org/licenses/gpl-2.0.html.
"""SCIAMACHY data interface

Data loading and selection routines for SCIAMACHY data regression fits.
Includes reading the solar and geomagnetic index files used as proxies.
"""
import logging

import numpy as np
import pandas as pd
import xarray as xr

from astropy.time import Time

__all__ = ["load_solar_gm_table", "load_scia_dzm"]

_NPV_MAJOR, _NPV_MINOR, _NPV_PATCH = np.__version__.split('.')

_seasons = {
	"summerNH": [
		slice("2002-03-20", "2002-09-25"),
		slice("2003-03-20", "2003-10-13"),
		slice("2004-03-20", "2004-10-20"),
		slice("2005-03-20", "2005-10-20"),
		slice("2006-03-20", "2006-10-20"),
		slice("2007-03-21", "2007-10-20"),
		slice("2008-03-20", "2008-10-20"),
		slice("2009-03-20", "2009-10-20"),
		slice("2010-03-20", "2010-10-20"),
		slice("2011-03-20", "2011-08-31"),
		slice("2012-03-20", "2012-04-07"),
	],
	"summerSH": [
		slice("2002-09-13", "2003-03-26"),
		slice("2003-09-13", "2004-04-01"),
		slice("2004-09-14", "2005-04-02"),
		slice("2005-09-13", "2006-04-02"),
		slice("2006-09-13", "2007-04-02"),
		slice("2007-09-13", "2008-04-01"),
		slice("2008-09-14", "2009-04-02"),
		slice("2009-09-03", "2010-04-02"),
		slice("2010-09-13", "2011-04-02"),
		slice("2011-09-13", "2012-04-01"),
	]
}

_SPEs = [pd.date_range("2002-04-20", "2002-05-01"),
		pd.date_range("2002-05-21", "2002-06-02"),
		pd.date_range("2002-07-15", "2002-07-27"),
		pd.date_range("2002-08-23", "2002-09-03"),
		pd.date_range("2002-09-06", "2002-09-17"),
		pd.date_range("2002-11-08", "2002-11-20"),
		pd.date_range("2003-05-27", "2003-06-07"),
		pd.date_range("2003-10-25", "2003-11-15"),
		pd.date_range("2004-07-24", "2004-08-05"),
		pd.date_range("2004-11-06", "2004-11-18"),
		pd.date_range("2005-01-15", "2005-01-27"),
		pd.date_range("2005-05-13", "2005-05-25"),
		pd.date_range("2005-07-13", "2005-08-05"),
		pd.date_range("2005-08-21", "2005-09-02"),
		pd.date_range("2005-09-07", "2005-09-21"),
		pd.date_range("2006-12-05", "2006-12-23"),
		pd.date_range("2012-01-22", "2012-02-07"),
		pd.date_range("2012-03-06", "2012-03-27")]
76
77 1
def load_dailymeanAE(
78
		filename="data/indices/AE_Kyoto_1980-2016_daily2_shift12h.dat",
79
		name="AE", tfmt="jyear"):
80 1
	from pkg_resources import resource_filename
81 1
	AEfilename = resource_filename("sciapy", filename)
82 1
	return load_solar_gm_table(AEfilename, cols=[0, 1],
83
			names=["time", name], tfmt=tfmt)
84
85
86 1
def load_dailymeanLya(filename="data/indices/lisird_lya3_1980-2017.dat",
87
		name="Lya", tfmt="jyear"):
88 1
	from pkg_resources import resource_filename
89 1
	Lyafilename = resource_filename("sciapy", filename)
90 1
	return load_solar_gm_table(Lyafilename, cols=[0, 1],
91
			names=["time", name], tfmt=tfmt)
92
93


def load_solar_gm_table(filename, cols, names, sep="\t", tfmt="jyear"):
	"""Load proxy tables from ascii files

	This function wraps :func:`numpy.genfromtxt()` [#]_ with
	pre-defined settings to match the file format.
	It explicitly returns the times as UTC decimal years or julian epochs.

	.. [#] https://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html

	Parameters
	----------
	filename: str
		The file to read
	cols: sequence
		The columns in the file to use, passed to `numpy.genfromtxt`'s
		`usecols` keyword. Should be at least of length 2, indicating
		time values in the first column, e.g., `cols=[0, 1]`.
	names: sequence
		The column names, passed as `names` to `numpy.genfromtxt()`.
		Should be at least of length 2, naming the proxy values in the
		second column, e.g., `names=["time", "proxy"]`.
	sep: str, optional
		The column separator character, passed as `delimiter`
		to `numpy.genfromtxt()`.
		Default: `\t`
	tfmt: str, optional
		The astropy.time "Time Format" for the time units,
		for example, "jyear", "decimalyear", "jd", "mjd", etc.
		See:
		http://docs.astropy.org/en/stable/time/index.html#time-format
		Default: "jyear"

	Returns
	-------
	times: array_like
		The proxy times according to the `tfmt` keyword (UTC) as
		:class:`numpy.ndarray` from the first column of "cols".
	values: tuple of array_like
		The proxy values combined into a structured array
		(:class:`numpy.recarray`) with fields set
		according to "names[1:]" and "cols[1:]" passed above.

	See Also
	--------
	:class:`astropy.time.Time`
	:class:`numpy.recarray`
	"""
	if len(cols) < 2 or len(names) < 2:
		raise ValueError(
				"Both `cols` and `names` should be at least of length 2.")
	encoding = {}
	# set the encoding for numpy >= 1.14.0
	if (int(_NPV_MAJOR), int(_NPV_MINOR)) >= (1, 14):
		encoding = dict(encoding=None)
	tab = np.genfromtxt(filename,
			delimiter=sep,
			dtype=None,
			names=names,
			usecols=cols,
			**encoding)
	times = Time(tab[names[0]], scale="utc")
	ts = getattr(times, tfmt)
	return ts, tab[names[1:]]
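The proxy files are plain two-column ASCII tables. A minimal sketch of the format that load_solar_gm_table() expects (the file contents and the "AE" column name are hypothetical), using the same numpy.genfromtxt settings as above:

```python
import io

import numpy as np

# Two tab-separated columns: time (here as decimal years) and the proxy value.
proxy_txt = "2002.123\t110.0\n2002.126\t98.5\n"

# Mirrors the genfromtxt call in load_solar_gm_table() (numpy >= 1.14 branch).
tab = np.genfromtxt(io.StringIO(proxy_txt), delimiter="\t",
		dtype=None, names=["time", "AE"], usecols=[0, 1], encoding=None)
print(tab["time"])  # the time column
print(tab["AE"])    # the proxy values
```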


def _greedy_select(ds, size, varname="NO_DENS_std", scale=1.):
	logging.info("Greedy subsampling to size: %s", size)
	var = np.ma.masked_array((ds[varname] * scale)**2, mask=[False])
	var_p = np.ma.masked_array((ds[varname] * scale)**2, mask=[False])
	sigma2_i = scale**2
	sigma2_ip = scale**2
	idxs = np.arange(len(var))
	for _ in range(size):
		max_entr_idx = np.ma.argmax(np.log(1. + var * sigma2_i))
		min_entr_idx = np.ma.argmin(np.log(1. + var_p * sigma2_ip))
		sigma2_i += 1. / var[max_entr_idx]
		sigma2_ip += 1. / var_p[min_entr_idx]
		var.mask[max_entr_idx] = True
		var_p.mask[max_entr_idx] = True
	return ds.isel(time=idxs[var.mask])

def _greedy_idxs_post(x, xerr, size):
	logging.info("Greedy subsampling to size: %s", size)
	var = np.ma.masked_array(xerr**2, mask=[False])
	var_p = np.ma.masked_array(xerr**2, mask=[False])
	sigma2_i = 1.
	sigma2_ip = 1.
	idxs = np.arange(len(var))
	for _ in range(size):
		max_entr_idx = np.ma.argmax(np.log(1. + var * sigma2_i))
		min_entr_idx = np.ma.argmin(np.log(1. + var_p * sigma2_ip))
		sigma2_i += 1. / var[max_entr_idx]
		sigma2_ip += 1. / var_p[min_entr_idx]
		var.mask[max_entr_idx] = True
		var_p.mask[max_entr_idx] = True
	return idxs[var.mask]

def load_scia_dzm(filename, alt, lat, tfmt="jyear",
		scale=1, subsample_factor=1, subsample_method="greedy",
		akd_threshold=0.002, cnt_threshold=0,
		center=False, season=None, SPEs=False):
	"""Load SCIAMACHY daily zonal mean data

	Interface function for SCIAMACHY daily zonal mean data files version 6.x.
	Uses :mod:`xarray` [#]_ to load and select the data. Possible selections are by
	hemispheric summer (NH summer ~ SH winter and SH summer ~ NH winter) and
	exclusion of strong solar proton events (SPE).

	.. [#] https://xarray.pydata.org

	Parameters
	----------
	filename: str
		The input filename
	alt: float
		The altitude
	lat: float
		The latitude
	tfmt: string, optional
		The astropy.time "Time Format" for the time units,
		for example, "jyear", "decimalyear", "jd", "mjd", etc.
		See:
		http://docs.astropy.org/en/stable/time/index.html#time-format
		Default: "jyear"
	scale: float, optional
		Scale factor of the data (default: 1)
	subsample_factor: int, optional
		Factor to subsample the data by (see `subsample_method`)
		(default: 1 (no subsampling))
	subsample_method: "equal", "greedy", or "random", optional
		Method for subsampling the data (see `subsample_factor`).
		"equal" for equally spaced subsampling,
		"greedy" for selecting the data based on their uncertainty,
		and "random" for uniform random subsampling.
		(default: "greedy")
	akd_threshold: float, optional
		Exclude data whose averaging kernel diagonal element is at or
		below this threshold (default: 0.002)
	cnt_threshold: int, optional
		Exclude data whose measurement count is at or below this
		threshold (default: 0)
	center: bool, optional
		Center the data by subtracting the global mean.
		(default: False)
	season: "summerNH", "summerSH", or `None`, optional
		Select the named season or `None` for all data (default: None)
	SPEs: bool, optional
		Set to `True` to exclude strong SPE events (default: False)

	Returns
	-------
	(times, dens, errs, szas): tuple of (N,) array_like
		The measurement times according to the `tfmt` keyword,
		the number densities, their uncertainties, and the
		mean solar zenith angles.
	"""
	logging.info("Opening dataset: '%s'", filename)
	NO_ds = xr.open_mfdataset(filename, decode_times=False,
				chunks={"time": 400, "latitude": 9, "altitude": 11})
	logging.info("done.")
	# Decode time coordinate for selection
	try:
		# xarray <= 0.9.6
		NO_ds["time"] = xr.conventions.decode_cf_variable(NO_ds.time)
	except TypeError:
		# xarray >= 0.10.0
		NO_ds["time"] = xr.conventions.decode_cf_variable("time", NO_ds.time)

	NO_mean = 0.
	if center:
		NO_mean = NO_ds.NO_DENS.mean().values
		logging.info("Centering with global mean: %s", NO_mean)
	NO_tds = NO_ds.sel(latitude=lat, altitude=alt)

	# Exclude SPEs first if requested
	if SPEs:
		logging.info("Removing SPEs.")
		for spe in _SPEs:
			NO_tds = NO_tds.drop(
					NO_tds.sel(time=slice(spe[0], spe[-1])).time.values,
					dim="time")

	# Filter by season
	if season in _seasons.keys():
		logging.info("Restricting to season: %s", season)
		NO_tds = xr.concat([NO_tds.sel(time=s) for s in _seasons[season]],
					dim="time")
		NO_tds.load()
	else:
		logging.info("No season selected or unknown season, "
					"using all available data.")

	try:
		NO_counts = NO_tds.NO_DENS_cnt
	except AttributeError:
		NO_counts = NO_tds.counts

	# Select only useful data
	NO_tds = NO_tds.where(
				np.isfinite(NO_tds.NO_DENS) &
				(NO_tds.NO_DENS_std != 0) &
				(NO_tds.NO_AKDIAG > akd_threshold) &
				(NO_counts > cnt_threshold) &
				(NO_tds.NO_MASK == 0),
				drop=True)

	no_dens = scale * NO_tds.NO_DENS
	if center:
		no_dens -= scale * NO_mean
	no_errs = scale * NO_tds.NO_DENS_std / np.sqrt(NO_counts)
	logging.debug("no_dens.shape (ntime,): %s", no_dens.shape)

	no_sza = NO_tds.mean_SZA

	# Convert to astropy.Time for Julian epoch or decimal year
	if NO_tds.time.size > 0:
		no_t = Time(pd.to_datetime(NO_tds.time.values, utc=True).to_pydatetime(),
					format="datetime", scale="utc")
		no_ys = getattr(no_t, tfmt)
	else:
		no_ys = np.empty_like(NO_tds.time.values, dtype=np.float64)

	if subsample_factor > 1:
		new_data_size = no_dens.shape[0] // subsample_factor
		if subsample_method == "random":
			# random subsampling
			_idxs = np.random.choice(no_dens.shape[0],
					new_data_size, replace=False)
		elif subsample_method == "equal":
			# equally spaced subsampling
			_idxs = slice(0, no_dens.shape[0], subsample_factor)
		else:
			# "greedy" subsampling (default fall-back)
			_idxs = _greedy_idxs_post(no_dens, no_errs, new_data_size)
		return (no_ys[_idxs],
				no_dens.values[_idxs],
				no_errs.values[_idxs],
				no_sza.values[_idxs])

	return no_ys, no_dens.values, no_errs.values, no_sza.values
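The subsampling branch at the end of load_scia_dzm() reduces the data by `subsample_factor` using one of three methods. A self-contained sketch of the index selection for the "random" and "equal" branches (subsample_indices is a hypothetical helper, not sciapy API; the "greedy" branch additionally needs the uncertainties and is omitted here):

```python
import numpy as np

def subsample_indices(n_points, factor, method="equal", seed=42):
    """Index selection mirroring load_scia_dzm()'s subsampling branches."""
    size = n_points // factor
    if method == "random":
        # uniform random subsampling without replacement
        rng = np.random.default_rng(seed)
        return np.sort(rng.choice(n_points, size, replace=False))
    # "equal": equally spaced subsampling, as slice(0, n_points, factor)
    return np.arange(0, n_points, factor)

print(subsample_indices(10, 2))  # -> [0 2 4 6 8]
```

The function then applies the selected indices consistently to the time, density, uncertainty, and SZA arrays, which keeps the four returned arrays aligned.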