tracking_policy_agendas.word2vec.w2v_emb   A

Complexity

Total Complexity 13

Size/Duplication

Total Lines 105
Duplicated Lines 77.14 %

Test Coverage

Coverage 80.48%

Importance

Changes 0
Metric Value
wmc 13
eloc 45
dl 81
loc 105
ccs 33
cts 41
cp 0.8048
rs 10
c 0
b 0
f 0

8 Methods

Rating   Name   Duplication   Size   Complexity  
A W2VEmb.__init() 10 10 1
A W2VEmb.__init__() 5 5 2
A W2VEmb.tf_idf_transformer() 11 11 1
A W2VEmb.__getitem__() 8 8 2
A W2VEmb.tf_idf_mean() 11 11 2
A W2VEmb.load() 9 9 2
A W2VEmb.save() 9 9 2
A W2VEmb.encode() 10 10 1

Duplicated Code

Duplicate code is one of the most pungent code smells. A common rule of thumb is to restructure code once it is duplicated in three or more places.

Common duplication problems and their corresponding solutions are:
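One frequent case is the same helper logic copied into several classes. The report flags W2VEmb as duplicated elsewhere in the project but does not say where, so the following is only a sketch under assumptions: it supposes the save/load pair is part of the duplication and moves it into a shared mixin. The names `PickleStateMixin`, `WordEmbedding`, and `DocumentEmbedding` are hypothetical and do not come from the project. Unlike the listing, the sketch pickles only the instance attributes rather than the whole object.

```python
import pickle


class PickleStateMixin:
    """Hypothetical mixin: one shared copy of the save/load logic."""

    def save(self, path: str) -> None:
        # Persist only the instance attributes (an assumption of this sketch).
        with open(path, "wb") as handle:
            pickle.dump(self.__dict__, handle, protocol=pickle.HIGHEST_PROTOCOL)

    def load(self, path: str) -> None:
        # Restore previously saved attributes onto this instance.
        with open(path, "rb") as handle:
            self.__dict__.update(pickle.load(handle))


class WordEmbedding(PickleStateMixin):
    """Hypothetical embedding class that inherits save/load."""

    def __init__(self, vectors=None):
        self.vectors = vectors


class DocumentEmbedding(PickleStateMixin):
    """Second hypothetical class reusing the same logic without copying it."""

    def __init__(self, vectors=None):
        self.vectors = vectors
```

With this layout a change to the serialization format is made once in the mixin instead of once per class.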

"""
Word2Vec Embedding

....................................................................................................
MIT License
Copyright (c) 2021-2023 AUT Iran, Mohammad H Forouhesh
Copyright (c) 2021-2022 MetoData.ai, Mohammad H Forouhesh
....................................................................................................
This module encapsulates the Word2Vec embedding of a given corpus.
"""

import pickle
import gensim
import numpy as np
import pandas as pd
from typing import List, Generator
Issue: standard import "from typing import List, Generator" should be placed before "import gensim"

from gensim import utils
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from .w2v_corpus import W2VCorpus


class W2VEmb:
Issue: Missing class docstring
Issue (Duplication): This code seems to be duplicated in your project.
    def __init__(self, text_document=None):
        self.wv2_corpus = None
        self.w2v_model = None
        self.tf_idf_transformation = None
        if text_document is not None: self.__init(text_document)
Issue (Coding Style): More than one statement on a single line

    def __init(self, text_document: pd.Series) -> None:
        """
        Constructor
        :param text_document:  text corpus
        :return:               None
        """
        text_document = text_document.fillna('')
        self.tf_idf_transformation = self.tf_idf_transformer(text_document)
        self.wv2_corpus = W2VCorpus(text_document)
        self.w2v_model = gensim.models.Word2Vec(sentences=self.wv2_corpus, min_count=1, vector_size=900, epochs=50)
Issue (Coding Style): This line is too long as per the coding-style (115/100).

This check looks for lines that are too long. You can specify the maximum line length.

    def __getitem__(self, text: str) -> np.ndarray:
        """
        getitem overwrite to get word embedding for a given text.
        :param text:    Input text.
        :return:        A numpy array of embedding array.
        """
        try:                return self.w2v_model.wv[text]
Issue (Coding Style): More than one statement on a single line
        except KeyError:    return np.array([0 for _ in range(0, self.w2v_model.vector_size)])
Issue (Coding Style): More than one statement on a single line
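The two style notes on `__getitem__` ask for one statement per line. As a rough sketch of that layout, the example below uses a plain dictionary in place of the trained model so it runs without gensim or NumPy. The function name `lookup` and its parameters are hypothetical and only illustrate the shape of the fix.

```python
from typing import Dict, List


def lookup(vectors: Dict[str, List[float]], word: str, size: int) -> List[float]:
    """Return the vector for `word`, or a zero vector if the word is unknown."""
    try:
        return vectors[word]
    except KeyError:
        # Unknown word: fall back to a zero vector of the expected size.
        return [0.0] * size
```

Putting each `return` on its own line also leaves room for the comment explaining the fallback.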

    def tf_idf_transformer(self, text_series):
Issue (Coding Style): This method could be written as a function/class method.

If a method does not access any attributes of the class, it could also be implemented as a function or static method. This can help improve readability. For example

class Foo:
    def some_method(self, x, y):
        return x + y;

could be written as

class Foo:
    @classmethod
    def some_method(cls, x, y):
        return x + y;

        """
        TF-IDF transformer for weighting words
        :param text_series:
        :return:
        """
        tfidf = Pipeline([('count', CountVectorizer(encoding='utf-8', min_df=3, #max_df=0.9,
                                                    max_features=900,
                                                    ngram_range=(1, 2))),
                          ('tfid', TfidfTransformer(sublinear_tf=True, norm='l2'))]).fit(text_series.ravel())
Issue (Coding Style): This line is too long as per the coding-style (109/100).
        return tfidf
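The note on `tf_idf_transformer` mentions a static method as one option for a method that never touches instance state. A minimal sketch of that form follows. The class `TextStats` and its `token_counts` method are hypothetical stand-ins chosen so the example runs without scikit-learn; they are not part of the project.

```python
from collections import Counter
from typing import Dict, Iterable


class TextStats:
    """Hypothetical class used only to illustrate @staticmethod."""

    @staticmethod
    def token_counts(texts: Iterable[str]) -> Dict[str, int]:
        # Uses only its argument, so it needs neither self nor cls.
        counts = Counter()
        for text in texts:
            counts.update(text.lower().split())
        return dict(counts)
```

A static method can be called on the class or on an instance, for example `TextStats.token_counts(["a b", "B c"])`, which makes it clear to readers that no object state is involved.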

    def encode(self, text: str) -> np.ndarray:
        """
        Encoding function
        :param text:    Input text
        :return:        A numpy array of embedding array.
        """
        stream = utils.simple_preprocess(text)
        tf_idf_vec = self.tf_idf_transformation.transform(stream).toarray()
        w2v_encode = self[stream]
        return np.mean(list(self.tf_idf_mean(tf_idf_vec, w2v_encode)), axis=0)

    def save(self, path: str) -> None:
        """
        A tool to save model w2v to disk
        :param path:   Saving path.
        :return:       None.
        """

        with open(path, 'wb') as f:
Issue (Coding Style Naming): Variable name "f" doesn't conform to snake_case naming style ('([^\\W\\dA-Z][^\\WA-Z]{2,}|_[^\\WA-Z]*|__[^\\WA-Z\\d_][^\\WA-Z]+__)$' pattern)

This check looks for invalid names for a range of different identifiers.

You can set regular expressions to which the identifiers must conform if the defaults do not match your requirements.

If your project includes a Pylint configuration file, the settings contained in that file take precedence.

To find out more about Pylint, please refer to their site.

            pickle.dump(self, f, protocol=pickle.HIGHEST_PROTOCOL)

    def load(self, path: str) -> None:
        """
        A tool to load w2v model from disk.
        :param path:   Model path.
        :return:       None
        """

        with open(path, 'rb') as f:
Issue (Coding Style Naming): Variable name "f" doesn't conform to snake_case naming style ('([^\\W\\dA-Z][^\\WA-Z]{2,}|_[^\\WA-Z]*|__[^\\WA-Z\\d_][^\\WA-Z]+__)$' pattern)
            self.__dict__.update(pickle.load(f).__dict__)

    @staticmethod
    def tf_idf_mean(tf_idf_vec: np.ndarray, w2v_encode: np.ndarray) -> Generator[List[float], None, None]:
Issue (Coding Style): This line is too long as per the coding-style (106/100).
        """
        Mean pooling to encode sentences using tf-idf weights of words.
        :param tf_idf_vec:  A tf-idf vector of the sentence
        :param w2v_encode:  A word2vec vector of the sentence
        :return:            A generator that yield relative vector of a word with respect to its tf-idf vector.
Issue (Coding Style): This line is too long as per the coding-style (111/100).
        """

        for ind in range(len(tf_idf_vec)):
Issue (unused-code): Consider using enumerate instead of iterating with range and len
            yield tf_idf_vec[ind]*w2v_encode[ind]
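The last note on `tf_idf_mean` suggests `enumerate` in place of `range(len(...))`. Because the loop walks two sequences in parallel, `zip` is another common way to drop the index entirely. The sketch below shows that variant on plain Python lists rather than NumPy arrays, which is an assumption made to keep it dependency-free. The name `weighted_rows` is hypothetical.

```python
from typing import Iterable, Iterator, List


def weighted_rows(
    weights: Iterable[List[float]],
    vectors: Iterable[List[float]],
) -> Iterator[List[float]]:
    """Yield each vector row scaled element by element by its weight row."""
    # zip pairs the rows directly, so no index variable is needed.
    for weight_row, vector_row in zip(weights, vectors):
        yield [w * v for w, v in zip(weight_row, vector_row)]
```

One behavioral difference is worth noting: `zip` stops at the shorter input, whereas indexing with `range(len(...))` raises an `IndexError` when the second sequence is shorter than the first.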