_PLSA()   D

Complexity
  Conditions: 8

Size
  Total Lines: 24

Duplication
  Lines: 0
  Ratio: 0 %

Metric  Value
dl      0
loc     24
rs      4.3478
cc      8

# -*- coding: utf-8 -*-
Coding Style: This module should have a docstring.

The coding style of this project requires that you add a docstring to this code element. Below is an example for methods:

class SomeClass:
    def some_method(self):
        """Do x and return foo."""

If you would like to know more about docstrings, we recommend reading PEP 257: Docstring Conventions.
import logging

import numpy as np

from .base_model_output import ModelOutput
from ._registry import register

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.WARNING)
Comprehensibility Best Practice: The variable logging does not seem to be defined.


def _rand_mat(rows, cols):
Coding Style: This function should have a docstring.
    out = np.random.random((rows, cols))
    for row in out:
        row /= row.sum()
    return out
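
The row-by-row normalization in `_rand_mat` can also be written as a single broadcast division. The following is a minimal sketch, not the project's code: the name `rand_row_stochastic` and the `seed` parameter are hypothetical, and it uses NumPy's `default_rng` generator instead of `np.random.random`.

```python
import numpy as np


def rand_row_stochastic(rows, cols, seed=None):
    """Return a (rows, cols) matrix of non-negative entries whose rows sum to 1."""
    rng = np.random.default_rng(seed)  # hypothetical: seeding is not in the original
    out = rng.random((rows, cols))
    # keepdims=True keeps the row sums as shape (rows, 1), so the division
    # broadcasts across each row.
    return out / out.sum(axis=1, keepdims=True)


example = rand_row_stochastic(3, 4, seed=0)
```

Each row of `example` can then serve as an initial probability distribution, which is how the listing uses `zw` and `dz`.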
def _cal_p_dw(words_in_docs, word_cts_in_docs, topic_array, zw, dz, beta, p_dw):
Coding Style Naming: The name zw does not conform to the argument naming conventions ([a-z_][a-z0-9_]{2,30}$).

This check looks for invalid names for a range of different identifiers.

You can set regular expressions to which the identifiers must conform if the defaults do not match your requirements.

If your project includes a Pylint configuration file, the settings contained in that file take precedence.

To find out more about Pylint, please refer to their site.

Coding Style Naming: The name dz does not conform to the argument naming conventions ([a-z_][a-z0-9_]{2,30}$).
Coding Style: This function should have a docstring.
Unused Code: The argument topic_array seems to be unused.
    for (d, doc_id, words) in words_in_docs:
Coding Style Naming: The name d does not conform to the variable naming conventions ([a-z_][a-z0-9_]{2,30}$).
        p_dw[d, words] = (word_cts_in_docs[doc_id] * (zw[:, words]*np.expand_dims(dz[d, :], 1))**beta).sum(axis=0)
    return p_dw
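
To make the shapes in `_cal_p_dw` easier to follow, here is a small worked sketch for a single document. All numeric values are toy assumptions chosen for illustration; only the expression itself mirrors the listing.

```python
import numpy as np

beta = 0.8                               # tempering exponent, as in the listing
zw = np.array([[0.5, 0.3, 0.2],          # topic-word matrix: 2 topics x 3 terms (toy values)
               [0.1, 0.1, 0.8]])
dz_row = np.array([0.25, 0.75])          # document-topic weights for one document (toy values)
words = [0, 2]                           # term ids present in the document
counts = np.array([2, 1])                # their counts

# Shape (topics, words): tempered weight of each (topic, word) pair.
joint = (zw[:, words] * np.expand_dims(dz_row, 1)) ** beta
# Scale by the word counts and sum over topics, leaving one value per word.
p_dw_row = (counts * joint).sum(axis=0)
```

Because `**` binds more tightly than `*`, the counts multiply the already tempered weights, both here and in the original expression.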
def _e_step(words_in_docs, dw_z, topic_array, zw, dz, beta, p_dw):
Coding Style Naming: The name zw does not conform to the argument naming conventions ([a-z_][a-z0-9_]{2,30}$).
Coding Style Naming: The name dz does not conform to the argument naming conventions ([a-z_][a-z0-9_]{2,30}$).
Coding Style: This function should have a docstring.
Unused Code: The argument topic_array seems to be unused.
    for (d, _, words) in words_in_docs:
Coding Style Naming: The name d does not conform to the variable naming conventions ([a-z_][a-z0-9_]{2,30}$).
        dw_z[d, words, :] = ((zw[:, words].T * dz[d, :]) ** beta) / np.expand_dims(p_dw[d, words], 1)
    return dw_z
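
For comparison, a tempered E-step is often written so that each word's topic posterior sums to 1. The sketch below does that for one toy document; it is an illustration under assumed values, not a replacement for `_e_step`. Note one difference: the listing divides by `p_dw`, which as computed in `_cal_p_dw` also carries the per-word count factor, whereas this sketch divides by the plain sum over topics.

```python
import numpy as np

beta = 0.8
zw = np.array([[0.5, 0.3, 0.2],          # topic-word matrix: 2 topics x 3 terms (toy values)
               [0.1, 0.1, 0.8]])
dz_row = np.array([0.25, 0.75])          # document-topic weights for one document (toy values)
words = [0, 2]                           # term ids present in the document

# Shape (words, topics), matching the layout of dw_z[d, words, :].
joint = (zw[:, words].T * dz_row) ** beta
# Normalize over topics so each word's posterior sums to 1.
posterior = joint / joint.sum(axis=1, keepdims=True)
```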
def _m_step(words_in_docs, word_cts_in_docs, topic_array, zw, dw_z, dz):
Coding Style Naming: The name zw does not conform to the argument naming conventions ([a-z_][a-z0-9_]{2,30}$).
Coding Style Naming: The name dz does not conform to the argument naming conventions ([a-z_][a-z0-9_]{2,30}$).
Coding Style: This function should have a docstring.
Unused Code: The argument topic_array seems to be unused.
    zw[:] = 0
    for (d, doc_id, words) in words_in_docs:
Coding Style Naming: The name d does not conform to the variable naming conventions ([a-z_][a-z0-9_]{2,30}$).
        zw[:, words] += word_cts_in_docs[doc_id]*dw_z[d, words].T
    # normalize by sum of topic word weights
    zw /= np.expand_dims(zw.sum(axis=1), 1)
    for (d, doc_id, words) in words_in_docs:
Coding Style Naming: The name d does not conform to the variable naming conventions ([a-z_][a-z0-9_]{2,30}$).
        dz[d] = (word_cts_in_docs[doc_id] * dw_z[d, words].T).sum(axis=1)
    dz /= np.expand_dims(dz.sum(axis=1), 1)
    return zw, dz
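
The two updates in `_m_step` can be traced on a single toy document. The sketch below assumes the posteriors are already given as a `(words, topics)` array; the values and the names `posterior` and `dz_row` are hypothetical.

```python
import numpy as np

ntopics, nterms = 2, 3
words = [0, 2]                           # term ids present in the document
counts = np.array([2.0, 1.0])            # their counts
posterior = np.array([[0.6, 0.4],        # toy posteriors, shape (words, topics)
                      [0.1, 0.9]])

zw = np.zeros((ntopics, nterms))
# Count-weighted expected assignments, transposed to (topics, words).
zw[:, words] += counts * posterior.T
# Normalize each topic's row into a distribution over terms.
zw /= zw.sum(axis=1, keepdims=True)

# Document-topic weights: sum the expected assignments over the document's words.
dz_row = (counts * posterior.T).sum(axis=1)
dz_row /= dz_row.sum()
```

The in-place `+=` on a fancy-indexed slice does not accumulate repeated indices, so this pattern relies on each term id appearing at most once per document. In the listing the ids appear to come from the keys of `doc.items()`, which would satisfy that.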
def _cal_likelihood(words_in_docs, word_cts_in_docs, p_dw):
Coding Style: This function should have a docstring.
    likelihood = 0
    for (d, doc_id, words) in words_in_docs:
Coding Style Naming: The name d does not conform to the variable naming conventions ([a-z_][a-z0-9_]{2,30}$).
        likelihood += sum(word_cts_in_docs[doc_id] * np.log(p_dw[d][words]))
    return likelihood
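
The per-document term in `_cal_likelihood` is a count-weighted sum of log-probabilities, which can be expressed as a dot product. A minimal sketch with toy values (the array contents are assumptions for illustration):

```python
import numpy as np

p_dw_row = np.array([0.2, 0.0, 0.5])     # toy p_dw values for one document
words = [0, 2]                           # term ids present in the document
counts = np.array([2.0, 1.0])            # their counts

# Only the entries for words that occur are read, so the zero at index 1
# never reaches np.log.
log_likelihood = float(np.dot(counts, np.log(p_dw_row[words])))
```

Using `np.dot` avoids the Python-level `sum` over a NumPy array, which iterates element by element.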
def _get_topic_term_matrix(zw, ntopics, id_term_map):
Coding Style Naming: The name zw does not conform to the argument naming conventions ([a-z_][a-z0-9_]{2,30}$).
Coding Style: This function should have a docstring.
Unused Code: The argument id_term_map seems to be unused.
    labeled_zw = {"topic"+str(topicno): zw[topicno].tolist() for topicno in range(ntopics)}
Comprehensibility Best Practice: The variable topicno does not seem to be defined.
    return labeled_zw


def _get_doc_topic_matrix(dz, ntopics, vectorized_corpus):
Coding Style Naming: The name dz does not conform to the argument naming conventions ([a-z_][a-z0-9_]{2,30}$).
Coding Style: This function should have a docstring.
Unused Code: The argument ntopics seems to be unused.
    labeled_dz = {doc_id: dz[i].tolist() for i, (doc_id, vector) in enumerate(vectorized_corpus.get_vectors())}
Comprehensibility Best Practice: The variable vector does not seem to be defined.
Comprehensibility Best Practice: The variable i does not seem to be defined.
Comprehensibility Best Practice: The variable doc_id does not seem to be defined.
    return labeled_dz


def _PLSA(vectorized_corpus, ntopics, max_iter):
Coding Style Naming: The name _PLSA does not conform to the function naming conventions ([a-z_][a-z0-9_]{2,30}$).
Coding Style: This function should have a docstring.
    cur = 0
    topic_array = np.arange(ntopics, dtype=np.int32)
Comprehensibility Best Practice: The variable np does not seem to be defined.
    # topic-word matrix
    zw = _rand_mat(ntopics, vectorized_corpus.global_term_count)
Coding Style Naming: The name zw does not conform to the variable naming conventions ([a-z_][a-z0-9_]{2,30}$).
    # document-topic matrix
    dz = _rand_mat(len(vectorized_corpus), ntopics)
Coding Style Naming: The name dz does not conform to the variable naming conventions ([a-z_][a-z0-9_]{2,30}$).
    dw_z = np.zeros((len(vectorized_corpus), vectorized_corpus.global_term_count, ntopics))
    p_dw = np.zeros((len(vectorized_corpus), vectorized_corpus.global_term_count))
    beta = 0.8
    words_in_docs = [(id, doc_id, [word_id for word_id, _ in doc.items()])
                     for id, (doc_id, doc) in enumerate(vectorized_corpus.get_vectors())]
Comprehensibility Best Practice: The variable word_id does not seem to be defined.
Comprehensibility Best Practice: The variable id does not seem to be defined.
Comprehensibility Best Practice: The variable _ does not seem to be defined.
Comprehensibility Best Practice: The variable doc_id does not seem to be defined.
Comprehensibility Best Practice: The variable doc does not seem to be defined.
    word_cts_in_docs = {doc_id: [ct for _, ct in doc.items()] for doc_id, doc in vectorized_corpus.get_vectors()}
Comprehensibility Best Practice: The variable ct does not seem to be defined.
    for i in range(max_iter):
Unused Code: The variable i seems to be unused.
        p_dw = _cal_p_dw(words_in_docs, word_cts_in_docs, topic_array, zw, dz, beta, p_dw)
        dw_z = _e_step(words_in_docs, dw_z, topic_array, zw, dz, beta, p_dw)
        zw, dz = _m_step(words_in_docs, word_cts_in_docs, topic_array, zw, dw_z, dz)
Coding Style Naming: The name zw does not conform to the variable naming conventions ([a-z_][a-z0-9_]{2,30}$).
Coding Style Naming: The name dz does not conform to the variable naming conventions ([a-z_][a-z0-9_]{2,30}$).
        likelihood = _cal_likelihood(words_in_docs, word_cts_in_docs, p_dw)
        if cur != 0 and abs((likelihood-cur)/cur) < 1e-8:
            break
        cur = likelihood
    topic_term_matrix = _get_topic_term_matrix(zw, ntopics, vectorized_corpus.id_term_map)
    doc_topic_matrix = _get_doc_topic_matrix(dz, ntopics, vectorized_corpus)
    return topic_term_matrix, doc_topic_matrix
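
The stopping rule inside the loop of `_PLSA` compares the relative change in likelihood between iterations against `1e-8`. It could be pulled out into a small helper along these lines; the name `has_converged` and the `tol` parameter are hypothetical and do not appear in the listing.

```python
def has_converged(previous, current, tol=1e-8):
    """Return True when the relative change from previous to current is below tol."""
    # previous == 0 marks "no earlier value yet", so the first pass never stops,
    # and the guard also avoids dividing by zero.
    return previous != 0 and abs((current - previous) / previous) < tol


first_pass = has_converged(0, -120.5)
tiny_change = has_converged(-100.0, -100.0 + 1e-9)
large_change = has_converged(-100.0, -99.0)
```

Factoring the test out would make the tolerance a named parameter and let the condition be checked in isolation.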
@register
Comprehensibility Best Practice: The variable register does not seem to be defined.
def plsa(vectorized_corpus, ntopics, max_iter=100, **kwargs):
Coding Style: This function should have a docstring.
    return ModelOutput(vectorized_corpus=vectorized_corpus, model_func=_PLSA, ntopics=ntopics, max_iter=max_iter, **kwargs)
Comprehensibility Best Practice: The variable kwargs does not seem to be defined.
92