Completed | Branch master (e214b7) | by Philippe | created in 36s

src.pipeline.commands.TXTDenoiser.execute()   (grade: F)

Complexity
    Conditions: 12

Size
    Total Lines: 44

Duplication
    Lines: 0
    Ratio: 0 %

Metric    Value
cc        12
dl        0
loc       44
rs        2.7855

How to fix: Complexity

Complex methods like src.pipeline.commands.TXTDenoiser.execute() often do a lot of different things. To break such a method (or its class) down, we need to identify a cohesive component within it. A common way to find such a component is to look for fields and methods that share the same prefixes or suffixes.

Once you have determined which fields belong together, you can apply the Extract Class refactoring. If the component makes sense as a subclass, Extract Subclass is also a candidate, and is often faster.
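As a hedged illustration of Extract Class (all names here are hypothetical, not from the project under review), fields sharing a prefix are moved into their own cohesive component, and the original class delegates to it:

```python
class ReportBefore(object):
    """Before: the "header_" prefix hints at a hidden component."""
    def __init__(self, header_title, header_date, body):
        self.header_title = header_title
        self.header_date = header_date
        self.body = body


class Header(object):
    """The extracted class: cohesive and independently testable."""
    def __init__(self, title, date):
        self.title = title
        self.date = date

    def render(self):
        return "%s (%s)" % (self.title, self.date)


class ReportAfter(object):
    """After: the original class delegates to the extracted component."""
    def __init__(self, header, body):
        self.header = header
        self.body = body

    def render(self):
        return self.header.render() + "\n" + self.body
```

Each extracted component can then be sized and tested on its own, which is what drives the complexity metrics back down.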

"""Package to clean TXT files

.. Authors:
    Philippe Dessauw
    [email protected]

.. Sponsor:
    Alden Dima
    [email protected]
    Information Systems Group
    Software and Systems Division
    Information Technology Laboratory
    National Institute of Standards and Technology
    http://www.nist.gov/itl/ssd/is
"""
import codecs
from os.path import join, isfile, splitext, basename
from os import listdir
from denoiser import Denoiser
from pipeline.command import Command


class TXTDenoiser(Command):
    """Command to clean TXT files
    """

    def __init__(self, filename, logger, config):
        super(TXTDenoiser, self).__init__(filename, logger, config)
        self.denoiser = Denoiser(config)

    def execute(self):
        """Execute the command
        """
        try:
            self.logger.debug("::: Text cleaning :::")
            super(TXTDenoiser, self).get_file()

            txt_dir = join(self.unzipped, "txt")
            txt_files = [join(txt_dir, f) for f in listdir(txt_dir) if isfile(join(txt_dir, f)) and f.endswith(".txt")]

            if len(txt_files) != 1:
                self.logger.error("Incorrect number of text files")
                self.finalize()
                return -1

            text_data = self.denoiser.cleanse(txt_files[0], False)

            # Writing classified lines
            base_filename = splitext(basename(txt_files[0]))[0]
            clean_filename = join(txt_dir, base_filename+".clean.txt")
            garbage_filename = join(txt_dir, base_filename+".grbge.txt")
            unclassified_filename = join(txt_dir, base_filename+".unclss.txt")

            with codecs.open(clean_filename, "wb", encoding="utf-8") as clean_file:
                for line in text_data.get_clean_lines():
                    clean_file.write(line+"\n")

            with codecs.open(garbage_filename, "wb", encoding="utf-8") as garbage_file:
                for line in text_data.get_garbage_lines():
                    garbage_file.write(line+"\n")

            if len(text_data.get_unclassified_lines()) > 0:
                with codecs.open(unclassified_filename, "wb", encoding="utf-8") as unclassified_file:
                    for line in text_data.get_unclassified_lines():
                        unclassified_file.write(line+"\n")
        except Exception, e:
            print e

            self.logger.error("Cleaner has stopped unexpectedly: "+e.message)
            self.finalize()
            return -2

        self.finalize()
        return 0

    def finalize(self):
        """Finalize the job
        """
        super(TXTDenoiser, self).store_file()
        self.logger.debug("::: Text cleaning (END) :::")
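The three near-identical with/for write loops in execute() are the easiest complexity to shave off. A minimal sketch of that Extract Method step (Python 3 syntax; write_lines is a hypothetical helper, not part of the project):

```python
import codecs
import os
import tempfile


def write_lines(path, lines):
    """Write one bucket of classified lines to disk, one line per row.

    Stands in for the three duplicated ``with codecs.open(...)`` loops
    in ``execute()``; each call site shrinks to a single statement.
    """
    with codecs.open(path, "w", encoding="utf-8") as handle:
        for line in lines:
            handle.write(line + "\n")


# Example call sites, mirroring the clean/garbage buckets in execute():
tmp_dir = tempfile.mkdtemp()
write_lines(os.path.join(tmp_dir, "doc.clean.txt"), ["kept line"])
write_lines(os.path.join(tmp_dir, "doc.grbge.txt"), ["n01se", "###"])
```

With the helper in place, the conditions count drops (the duplicated loops disappear, and the `len(...) > 0` guard becomes unnecessary since writing an empty list is a no-op), which addresses both the cc and duplication-prone structure flagged above.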