tracking_policy_agendas.word2vec.w2v_emb - Code Metrics - MohammadForouhesh/tracking-policy-agendas - Measure and Improve Code Quality continuously with Scrutinizer

tracking_policy_agendas.word2vec.w2v_emb A
last analyzed 2023-01-07 20:54 UTC

Complexity

Total Complexity

Size/Duplication

Total Lines	105
Duplicated Lines	77.14 %

Test Coverage

Coverage

80.48%

Metric	Value
wmc	13
eloc	45
dl	81
loc	105
ccs	33
cts	41
cp	0.8048
rs	10
c	0
b	0
f	0
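
The ratio metrics in the table can be re-derived from its raw counts. The sketch below assumes that `dl` is duplicated lines, `loc` is total lines, `ccs` is covered statements and `cts` is total statements; these readings are inferred from the values and are not stated on the page.

```python
import math

# Values copied from the metric table above.
metrics = {"loc": 105, "dl": 81, "ccs": 33, "cts": 41}

# Assumed meanings: dl = duplicated lines, loc = total lines,
# ccs = covered statements, cts = total statements.
duplicated_pct = round(100 * metrics["dl"] / metrics["loc"], 2)

# 33/41 is 0.80487..., so the reported 0.8048 looks truncated rather than rounded.
coverage = math.floor(10_000 * metrics["ccs"] / metrics["cts"]) / 10_000

print(f"Duplicated lines: {duplicated_pct:.2f} %")
print(f"Coverage (cp):    {coverage:.4f}")
```

Under these assumptions the computed values match the reported 77.14 % duplication and the 0.8048 coverage figure.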

8 Methods

Rating	Name	Duplication	Size	Complexity
A	W2VEmb.__init()	10	10	1
A	W2VEmb.__init__()	5	5	2
A	W2VEmb.tf_idf_transformer()	11	11	1
A	W2VEmb.__getitem__()	8	8	2
A	W2VEmb.tf_idf_mean()	11	11	2
A	W2VEmb.load()	9	9	2
A	W2VEmb.save()	9	9	2
A	W2VEmb.encode()	10	10	1

"""
Word2Vec Embedding

....................................................................................................
MIT License
Copyright (c) 2021-2023 AUT Iran, Mohammad H Forouhesh
Copyright (c) 2021-2022 MetoData.ai, Mohammad H Forouhesh
....................................................................................................
This module encapsulates the Word2Vec embedding of a given corpus.
"""

import pickle
from typing import List, Generator

import gensim
import numpy as np
import pandas as pd
from gensim import utils
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

from .w2v_corpus import W2VCorpus


class W2VEmb:
    """Word2Vec embedding of a text corpus, with TF-IDF weights for pooling."""

    def __init__(self, text_document=None):
        self.wv2_corpus = None
        self.w2v_model = None
        self.tf_idf_transformation = None
        if text_document is not None:
            self.__init(text_document)

    def __init(self, text_document: pd.Series) -> None:
        """
        Fit the TF-IDF transformer and train the Word2Vec model on a corpus.
        :param text_document:  text corpus
        :return:               None
        """
        text_document = text_document.fillna('')
        self.tf_idf_transformation = self.tf_idf_transformer(text_document)
        self.wv2_corpus = W2VCorpus(text_document)
        self.w2v_model = gensim.models.Word2Vec(sentences=self.wv2_corpus, min_count=1,
                                                vector_size=900, epochs=50)

    def __getitem__(self, text: str) -> np.ndarray:
        """
        Look up the word embedding for a given text.
        :param text:    Input text.
        :return:        The embedding array; zeros if the key is out of vocabulary.
        """
        try:
            return self.w2v_model.wv[text]
        except KeyError:
            return np.zeros(self.w2v_model.vector_size)


    def tf_idf_transformer(self, text_series):
        """
        TF-IDF transformer for weighting words
        :param text_series:
        :return:
        """
        count = CountVectorizer(encoding='utf-8', min_df=3,  # max_df=0.9,
                                max_features=900, ngram_range=(1, 2))
        tfidf = Pipeline([('count', count),
                          ('tfid', TfidfTransformer(sublinear_tf=True, norm='l2'))])
        return tfidf.fit(text_series.ravel())

    def encode(self, text: str) -> np.ndarray:
        """
        Encoding function
        :param text:    Input text
        :return:        A numpy array of embedding array.
        """
        stream = utils.simple_preprocess(text)
        tf_idf_vec = self.tf_idf_transformation.transform(stream).toarray()
        w2v_encode = self[stream]
        return np.mean(list(self.tf_idf_mean(tf_idf_vec, w2v_encode)), axis=0)

    def save(self, path: str) -> None:
        """
        A tool to save model w2v to disk
        :param path:   Saving path.
        :return:       None.
        """

        with open(path, 'wb') as f:

            pickle.dump(self, f, protocol=pickle.HIGHEST_PROTOCOL)

    def load(self, path: str) -> None:
        """
        A tool to load w2v model from disk.
        :param path:   Model path.
        :return:       None
        """

        with open(path, 'rb') as f:

            self.__dict__.update(pickle.load(f).__dict__)

    @staticmethod
    def tf_idf_mean(tf_idf_vec: np.ndarray,
                    w2v_encode: np.ndarray) -> Generator[np.ndarray, None, None]:
        """
        Weight word vectors by their tf-idf weights, ready for mean pooling.
        :param tf_idf_vec:  The tf-idf vectors of the sentence.
        :param w2v_encode:  The word2vec vectors of the sentence.
        :return:            A generator yielding each word vector scaled by its tf-idf vector.
        """
        for ind, weights in enumerate(tf_idf_vec):
            yield weights * w2v_encode[ind]

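The pooling step in `encode()` and `tf_idf_mean()` multiplies each word vector elementwise by its tf-idf row and then averages over words. A minimal standalone sketch of that step on toy data follows; the function name `weighted_mean_pool` and the toy arrays are hypothetical and are not part of the analyzed module.

```python
import numpy as np


def weighted_mean_pool(weights: np.ndarray, vectors: np.ndarray) -> np.ndarray:
    """Multiply each word vector by its weight row, then average over words."""
    products = [w * v for w, v in zip(weights, vectors)]
    return np.mean(products, axis=0)


# Toy data (hypothetical): two words, two dimensions.
toy_weights = np.array([[1.0, 0.0], [0.0, 2.0]])
toy_vectors = np.array([[1.0, 2.0], [3.0, 4.0]])

pooled = weighted_mean_pool(toy_weights, toy_vectors)
print(pooled)
```

Because the product is elementwise, the weight rows and the word vectors must have the same width. In the module this presumably relies on `max_features=900` matching `vector_size=900`.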

Issues reported for the analyzed source (flagged code quoted as analyzed):

Category	Flagged code	Message	Introduced
-	from typing import List, Generator	standard import "from typing import List, Generator" should be placed before "import gensim"	2022-03-13 16:50 UTC
-	class W2VEmb:	Missing class docstring	2022-03-13 12:10 UTC
Duplication	class W2VEmb:	This code seems to be duplicated in your project.	2022-03-13 12:10 UTC
Coding Style	if text_document is not None: self.__init(text_document)	More than one statement on a single line	2022-03-13 12:10 UTC
Coding Style	self.w2v_model = gensim.models.Word2Vec(sentences=self.wv2_corpus, min_count=1, vector_size=900, epochs=50)	This line is too long as per the coding-style (115/100).	2022-03-13 12:10 UTC
Coding Style	try: return self.w2v_model.wv[text]	More than one statement on a single line	2022-03-13 12:10 UTC
Coding Style	except KeyError: return np.array([0 for _ in range(0, self.w2v_model.vector_size)])	More than one statement on a single line	2022-03-13 12:10 UTC
Coding Style	def tf_idf_transformer(self, text_series):	This method could be written as a function/class method. If a method does not access any attributes of the class, it could also be implemented as a function or static method.	2022-03-13 12:10 UTC
Coding Style	('tfid', TfidfTransformer(sublinear_tf=True, norm='l2'))]).fit(text_series.ravel())	This line is too long as per the coding-style (109/100).	2022-03-13 12:10 UTC
Coding Style Naming	with open(path, 'wb') as f:	Variable name "f" doesn't conform to snake_case naming style	2022-03-13 12:10 UTC
Coding Style Naming	with open(path, 'rb') as f:	Variable name "f" doesn't conform to snake_case naming style	2022-03-13 12:10 UTC
Coding Style	def tf_idf_mean(tf_idf_vec: np.ndarray, w2v_encode: np.ndarray) -> Generator[List[float], None, None]:	This line is too long as per the coding-style (106/100).	2022-03-13 16:50 UTC
Coding Style	:return: A generator that yield relative vector of a word with respect to its tf-idf vector.	This line is too long as per the coding-style (111/100).	2022-03-13 16:50 UTC
unused-code	for ind in range(len(tf_idf_vec)):	Consider using enumerate instead of iterating with range and len	2022-03-13 12:10 UTC

The coverage column of the analyzed listing showed no hit for the body of W2VEmb.__init(), the body of W2VEmb.tf_idf_transformer(), the with block in W2VEmb.save(), and the yield statement in W2VEmb.tf_idf_mean().
