Passed
Push — combine-cli-commands-index-and... ( 0724e4 ), created by Osma at 03:51

annif.cli.run_index()   F

Complexity

Conditions 19

Size

Total Lines 136
Code Lines 99

Duplication

Lines 0
Ratio 0 %

Importance

Changes 0
Metric  Value
cc      19
eloc    99
nop     11
dl      0
loc     136
rs      0.4527
c       0
b       0
f       0

How to fix   Long Method    Complexity    Many Parameters   

Long Method

Small methods make your code easier to understand, especially when combined with a good name. And when a method is small, finding a good name is usually much easier.

For example, if you find yourself adding comments to a method's body, that is usually a good sign that the commented part should be extracted into a new method, with the comment serving as a starting point for the new method's name.

Commonly applied refactorings include Extract Method, Replace Temp with Query, and Decompose Conditional.
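As an illustration of Extract Method (a hypothetical sketch, not code from Annif — all names here are invented), the comment above a block becomes the name of the new method:

```python
# Before: a comment marks a block that wants to be its own method.
def format_report_before(rows):
    lines = []
    for row in rows:
        # format the row as tab-separated values
        lines.append("\t".join(str(cell) for cell in row))
    return "\n".join(lines)


# After: the commented block is extracted, named after the comment.
def format_row(row):
    """Format the row as tab-separated values."""
    return "\t".join(str(cell) for cell in row)


def format_report_after(rows):
    return "\n".join(format_row(row) for row in rows)
```

Both versions produce the same output; the second reads almost like the comments of the first.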

Complexity

Complex units like annif.cli.run_index() often do a lot of different things. To break such a unit down, we need to identify a cohesive component within it. A common approach to finding such a component is to look for fields or methods that share the same prefixes or suffixes.

Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.
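A minimal Extract Class sketch (hypothetical names, not Annif code): fields that share an output-related prefix are pulled into their own class and held as a single field:

```python
from dataclasses import dataclass


# Hypothetical sketch: output_suffix/output_gzip-style fields of a large class
# share a prefix, so they move into an extracted OutputOptions class.
@dataclass
class OutputOptions:
    suffix: str = ".annif"
    use_gzip: bool = False

    def filename_for(self, path: str) -> str:
        # Compute a result file name from the extracted settings.
        name = path + self.suffix
        return name + ".gz" if self.use_gzip else name


class Indexer:
    def __init__(self, output: OutputOptions) -> None:
        # The cohesive component is now one field instead of many.
        self.output = output
```

The extracted class also becomes a natural home for behavior (like `filename_for`) that previously lived in the large unit.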

Many Parameters

Methods with many parameters are not only hard to understand, but their parameters also often become inconsistent when you need more or different data.

There are several approaches to avoiding long parameter lists: group arguments that travel together into a parameter object, pass a whole object instead of several of its fields (Preserve Whole Object), or let the callee compute a value itself instead of receiving it as a parameter (Replace Parameter with Method Call).
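For instance, an Introduce Parameter Object sketch (hypothetical, not Annif's API): a limit and threshold that always travel together become one frozen dataclass instead of two separate parameters:

```python
from dataclasses import dataclass


# Hypothetical parameter object; FilterParams and filter_suggestions are invented.
@dataclass(frozen=True)
class FilterParams:
    limit: int = 10
    threshold: float = 0.0


def filter_suggestions(suggestions, params: FilterParams):
    """Keep at most params.limit suggestions scoring at least params.threshold."""
    ranked = sorted(suggestions, key=lambda s: s[1], reverse=True)
    kept = [s for s in ranked if s[1] >= params.threshold]
    return kept[: params.limit]
```

Adding a new filtering knob later means adding one field to `FilterParams` rather than widening every call site.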

"""Definitions for command-line (Click) commands for invoking Annif
operations and printing the results to console."""

import collections
import importlib
import json
import os.path
import re
import sys

import click
import click_log
from flask.cli import FlaskGroup

import annif
import annif.parallel
import annif.project
import annif.registry
from annif import cli_util, hfh_util
from annif.corpus import Document, DocumentDirectory
from annif.exception import (
    NotInitializedException,
    NotSupportedException,
    OperationFailedException,
)
from annif.project import Access
from annif.simplemma_util import detect_language
from annif.util import metric_code, suggestion_to_dict

logger = annif.logger
click_log.basic_config(logger)

create_app = annif.create_flask_app
cli = FlaskGroup(
    create_app=create_app, add_default_commands=False, add_version_option=False
)
cli = click.version_option(message="%(version)s")(cli)
cli.params = [opt for opt in cli.params if opt.name not in ("env_file", "app")]


@cli.command("list-projects")
@cli_util.common_options
def run_list_projects():
    """
    List available projects.
    \f
    Show a list of currently defined projects. Projects are defined in a
    configuration file, normally called ``projects.cfg``. See `Project
    configuration
    <https://github.com/NatLibFi/Annif/wiki/Project-configuration>`_
    for details.
    """

    column_headings = (
        "Project ID",
        "Project Name",
        "Vocabulary ID",
        "Language",
        "Trained",
        "Modification time",
    )
    table = [
        (
            proj.project_id,
            proj.name,
            proj.vocab.vocab_id if proj.vocab_spec else "-",
            proj.language,
            str(proj.is_trained),
            cli_util.format_datetime(proj.modification_time),
        )
        for proj in annif.registry.get_projects(min_access=Access.private).values()
    ]
    template = cli_util.make_list_template(column_headings, *table)
    header = template.format(*column_headings)
    click.echo(header)
    click.echo("-" * len(header))
    for row in table:
        click.echo(template.format(*row))


@cli.command("show-project")
@cli_util.project_id
@cli_util.common_options
def run_show_project(project_id):
    """
    Show information about a project.
    """

    proj = cli_util.get_project(project_id)
    click.echo(f"Project ID:        {proj.project_id}")
    click.echo(f"Project Name:      {proj.name}")
    click.echo(f"Language:          {proj.language}")
    click.echo(f"Vocabulary:        {proj.vocab.vocab_id}")
    click.echo(f"Vocab language:    {proj.vocab_lang}")
    click.echo(f"Access:            {proj.access.name}")
    click.echo(f"Backend:           {proj.backend.name}")
    click.echo(f"Trained:           {proj.is_trained}")
    click.echo(f"Modification time: {cli_util.format_datetime(proj.modification_time)}")


@cli.command("clear")
@cli_util.project_id
@cli_util.common_options
def run_clear_project(project_id):
    """
    Initialize the project to its original, untrained state.
    """
    proj = cli_util.get_project(project_id)
    proj.remove_model_data()


@cli.command("list-vocabs")
@cli_util.common_options
def run_list_vocabs():
    """
    List available vocabularies.
    """

    column_headings = ("Vocabulary ID", "Languages", "Size", "Loaded")
    table = []
    for vocab in annif.registry.get_vocabs(min_access=Access.private).values():
        try:
            languages = ",".join(sorted(vocab.languages))
            size = len(vocab)
            loaded = True
        except NotInitializedException:
            languages = "-"
            size = "-"
            loaded = False
        row = (vocab.vocab_id, languages, str(size), str(loaded))
        table.append(row)

    template = cli_util.make_list_template(column_headings, *table)
    header = template.format(*column_headings)
    click.echo(header)
    click.echo("-" * len(header))
    for row in table:
        click.echo(template.format(*row))


@cli.command("load-vocab")
@click.argument("vocab_id", shell_complete=cli_util.complete_param)
@click.argument("vocab_file", type=click.Path(exists=True, dir_okay=False))
@click.option("--language", "-L", help="Language of TSV vocabulary file")
@click.option(
    "--force",
    "-f",
    default=False,
    is_flag=True,
    help="Replace existing vocabulary completely instead of updating it",
)
@cli_util.common_options
def run_load_vocab(vocab_id, language, force, vocab_file):
    """
    Load a vocabulary from a subject file.
    """
    vocab = cli_util.get_vocab(vocab_id)
    if annif.vocab.VocabFileSKOS.is_rdf_file(vocab_file):
        # SKOS/RDF file supported by rdflib
        vocab_file = annif.vocab.VocabFileSKOS(vocab_file)
        click.echo(f"Loading vocabulary from SKOS file {vocab_file}...")
    elif annif.vocab.VocabFileCSV.is_csv_file(vocab_file):
        # CSV file
        vocab_file = annif.vocab.VocabFileCSV(vocab_file)
        click.echo(f"Loading vocabulary from CSV file {vocab_file}...")
    else:
        # probably a TSV file - we need to know its language
        if not language:
            click.echo(
                "Please use --language option to set the language of a TSV vocabulary.",
                err=True,
            )
            sys.exit(1)
        click.echo(f"Loading vocabulary from TSV file {vocab_file}...")
        vocab_file = annif.vocab.VocabFileTSV(vocab_file, language)
    vocab.load_vocabulary(vocab_file, force=force)


@cli.command("train")
@cli_util.project_id
@click.argument("paths", type=click.Path(exists=True), nargs=-1)
@click.option(
    "--cached/--no-cached",
    "-c/-C",
    default=False,
    help="Reuse preprocessed training data from previous run",
)
@click.option(
    "--jobs",
    "-j",
    default=0,
    help="Number of parallel jobs (0 means choose automatically)",
)
@cli_util.docs_limit_option
@cli_util.backend_param_option
@cli_util.common_options
def run_train(project_id, paths, cached, docs_limit, jobs, backend_param):
    """
    Train a project on a collection of documents.
    \f
    This will train the project using the documents from ``PATHS`` (directories
    or possibly gzipped TSV files) in a single batch operation. If ``--cached``
    is set, preprocessed training data from the previous run is reused instead
    of documents input; see `Reusing preprocessed training data
    <https://github.com/NatLibFi/Annif/wiki/
    Reusing-preprocessed-training-data>`_.
    """
    proj = cli_util.get_project(project_id)
    backend_params = cli_util.parse_backend_params(backend_param, proj)
    if cached:
        if len(paths) > 0:
            raise click.UsageError(
                "Corpus paths cannot be given when using --cached option."
            )
        documents = "cached"
    else:
        documents = cli_util.open_documents(
            paths, proj.subjects, proj.vocab_lang, docs_limit
        )
    proj.train(documents, backend_params, jobs)


@cli.command("learn")
@cli_util.project_id
@click.argument("paths", type=click.Path(exists=True), nargs=-1)
@cli_util.docs_limit_option
@cli_util.backend_param_option
@cli_util.common_options
def run_learn(project_id, paths, docs_limit, backend_param):
    """
    Further train an existing project on a collection of documents.
    \f
    Similar to the ``train`` command. This will continue training an already
    trained project using the documents given by ``PATHS`` in a single batch
    operation. Not supported by all backends.
    """
    proj = cli_util.get_project(project_id)
    backend_params = cli_util.parse_backend_params(backend_param, proj)
    documents = cli_util.open_documents(
        paths, proj.subjects, proj.vocab_lang, docs_limit
    )
    proj.learn(documents, backend_params)


@cli.command("suggest")
@cli_util.project_id
@click.argument(
    "paths", type=click.Path(dir_okay=False, exists=True, allow_dash=True), nargs=-1
)
@click.option("--limit", "-l", default=10, help="Maximum number of subjects")
@click.option("--threshold", "-t", default=0.0, help="Minimum score threshold")
@click.option("--language", "-L", help="Language of subject labels")
@cli_util.docs_limit_option
@cli_util.backend_param_option
@click.option(
    "--metadata",
    "-D",
    multiple=True,
    help="Additional metadata for a document read from standard input. "
    + "Syntax: `-D <field>=<value>`.",
)
@cli_util.common_options
def run_suggest(
    project_id, paths, limit, threshold, language, backend_param, metadata, docs_limit
):
    """
    Suggest subjects for a single document from standard input (optionally
    with metadata) or for one or more document file(s) given its/their
    path(s).
    \f
    This will read a text document from standard input and suggest subjects for
    it, or if given path(s) to file(s), suggest subjects for it/them.
    """
    project = cli_util.get_project(project_id)
    lang = language or project.vocab_lang
    if lang not in project.vocab.languages:
        raise click.BadParameter(f'language "{lang}" not supported by vocabulary')
    backend_params = cli_util.parse_backend_params(backend_param, project)

    if paths and not (len(paths) == 1 and paths[0] == "-"):
        docs = cli_util.open_text_documents(paths, docs_limit)
        results = project.suggest_corpus(docs, backend_params).filter(limit, threshold)
        for suggestions, path in zip(results, paths):
            click.echo(f"Suggestions for {path}")
            cli_util.show_hits(suggestions, project, lang)
    else:
        text = sys.stdin.read()
        doc_metadata = cli_util.parse_metadata(metadata)
        suggestions = project.suggest(
            [Document(text=text, metadata=doc_metadata)], backend_params
        ).filter(limit, threshold)[0]
        cli_util.show_hits(suggestions, project, lang)


@cli.command("index")
@cli_util.project_id
@click.argument("paths", type=click.Path(exists=True), nargs=-1)
@click.option(
    "--suffix", "-s", default=".annif", help="File name suffix for result files"
)
@click.option(
    "--force/--no-force",
    "-f/-F",
    default=False,
    help="Force overwriting of existing result files",
)
@click.option("--limit", "-l", default=10, help="Maximum number of subjects")
@click.option("--threshold", "-t", default=0.0, help="Minimum score threshold")
@click.option("--language", "-L", help="Language of subject labels")
@click.option(
    "--gzip/--no-gzip",
    "-z/-Z",
    "use_gzip",
    default=False,
    help="Gzip compress result files",
)
@click.option(
    "--output",
    "-O",
    type=click.Path(dir_okay=False, writable=True),
    default=None,
    help="Redirect all output to the given file (or '-' for stdout)",
)
@click.option(
    "--include-doc/--no-include-doc",
    "-i/-I",
    default=True,
    help="Include input documents in output",
)
@cli_util.backend_param_option
@cli_util.common_options
def run_index(
    project_id,
    paths,
    suffix,
    force,
    limit,
    threshold,
    language,
    backend_param,
    use_gzip=False,
    output=None,
    include_doc=True,
):
    """
    Index documents from directories or files, suggesting subjects for each document.
    Write the results in TSV files (for directories) or JSONL files (for files) with
    the given suffix (.jsonl suffix will be added to JSONL files).
    """
    project = cli_util.get_project(project_id)
    lang = language or project.vocab_lang
    if lang not in project.vocab.languages:
        raise click.BadParameter(f'language "{lang}" not supported by vocabulary')
    backend_params = cli_util.parse_backend_params(backend_param, project)

    # Helper function to process a directory
    def process_directory(directory):
        corpus = DocumentDirectory(directory, require_subjects=False)
        results = project.suggest_corpus(corpus, backend_params).filter(
            limit, threshold
        )

        for doc, suggestions in zip(corpus.documents, results):
            subjectfilename = re.sub(r"\.(txt|json)$", suffix, doc.file_path)
            if os.path.exists(subjectfilename) and not force:
                click.echo(
                    "Not overwriting {} (use --force to override)".format(
                        subjectfilename
                    )
                )
                continue
            with open(subjectfilename, "w", encoding="utf-8") as subjfile:
                cli_util.show_hits(suggestions, project, lang, file=subjfile)

    # Helper function to process a file
    def process_file(path):
        corpus = cli_util.open_doc_path(
            path, project.subjects, lang, require_subjects=False
        )
        results = project.suggest_corpus(corpus, backend_params).filter(
            limit, threshold
        )

        # Determine output stream
        if output == "-":
            stream_cm = cli_util.get_output_stream(path, suffix, "-", use_gzip, force)
        else:
            # For individual files, generate output filename based on input path
            if output:
                outfilename = output + (
                    ".gz" if use_gzip and not output.endswith(".gz") else ""
                )
                stream_cm = cli_util.get_output_stream(
                    path, suffix, outfilename, use_gzip, force
                )
            else:
                outfilename = (
                    re.sub(r"(\.[^.]+)?(\.gz)?$", "", path) + suffix + ".jsonl"
                )
                if use_gzip and not outfilename.endswith(".gz"):
                    outfilename += ".gz"
                stream_cm = cli_util.get_output_stream(
                    path, suffix, outfilename, use_gzip, force
                )

        if stream_cm is None:
            return

        with stream_cm as stream:
            for doc, suggestions in zip(corpus.documents, results):
                if include_doc:
                    output_data = doc.as_dict(project.subjects, lang)
                else:
                    output_data = {}
                    if doc.document_id:
                        output_data["document_id"] = doc.document_id
                output_data["results"] = [
                    suggestion_to_dict(suggestion, project.subjects, lang)
                    for suggestion in suggestions
                ]
                stream.write(json.dumps(output_data) + "\n")

    # Process paths in the order they were given
    for path in paths:
        if os.path.isdir(path):
            process_directory(path)
        elif os.path.isfile(path):
            process_file(path)
        else:
            raise click.BadParameter(f"Path '{path}' is neither a file nor a directory")


@cli.command("eval")
@cli_util.project_id
@click.argument("paths", type=click.Path(exists=True), nargs=-1)
@click.option("--limit", "-l", default=10, help="Maximum number of subjects")
@click.option("--threshold", "-t", default=0.0, help="Minimum score threshold")
@click.option(
    "--metric",
    "-m",
    default=[],
    multiple=True,
    help="Metric to calculate (default: all)",
)
@click.option(
    "--metrics-file",
    "-M",
    type=click.File("w", encoding="utf-8", errors="ignore", lazy=True),
    help="""Specify file in order to write evaluation metrics in JSON format.
    File directory must exist, existing file will be overwritten.""",
)
@click.option(
    "--results-file",
    "-r",
    type=click.File("w", encoding="utf-8", errors="ignore", lazy=True),
    help="""Specify file in order to write non-aggregated results per subject.
    File directory must exist, existing file will be overwritten.""",
)
@click.option(
    "--jobs", "-j", default=1, help="Number of parallel jobs (0 means all CPUs)"
)
@cli_util.docs_limit_option
@cli_util.backend_param_option
@cli_util.common_options
def run_eval(
    project_id,
    paths,
    limit,
    threshold,
    docs_limit,
    metric,
    metrics_file,
    results_file,
    jobs,
    backend_param,
):
    """
    Suggest subjects for documents and evaluate the results by comparing
    against a gold standard.
    \f
    With this command the documents from ``PATHS`` (directories or possibly
    gzipped TSV files) will be assigned subject suggestions and then
    statistical measures are calculated that quantify how well the suggested
    subjects match the gold-standard subjects in the documents.

    Normally the output is the list of the metrics calculated across documents.
    If ``--results-file <FILENAME>`` option is given, the metrics are
    calculated separately for each subject, and written to the given file.
    """

    project = cli_util.get_project(project_id)
    backend_params = cli_util.parse_backend_params(backend_param, project)

    import annif.eval

    eval_batch = annif.eval.EvaluationBatch(project.subjects)

    if results_file:
        try:
            print("", end="", file=results_file)
            click.echo(
                "Writing per subject evaluation results to {!s}".format(
                    results_file.name
                )
            )
        except Exception as e:
            raise NotSupportedException(
                "cannot open results-file for writing: " + str(e)
            )
    corpus = cli_util.open_documents(
        paths, project.subjects, project.vocab_lang, docs_limit
    )
    jobs, pool_class = annif.parallel.get_pool(jobs)

    project.initialize(parallel=True)
    psmap = annif.parallel.ProjectSuggestMap(
        project.registry, [project_id], backend_params, limit, threshold
    )

    with pool_class(jobs) as pool:
        for hit_sets, subject_sets in pool.imap_unordered(
            psmap.suggest_batch, corpus.doc_batches
        ):
            eval_batch.evaluate_many(hit_sets[project_id], subject_sets)

    template = "{0:<30}\t{1:{fmt_spec}}"
    metrics = eval_batch.results(
        metrics=metric, results_file=results_file, language=project.vocab_lang
    )
    for metric, score in metrics.items():
        if isinstance(score, int):
            fmt_spec = "d"
        elif isinstance(score, float):
            fmt_spec = ".04f"
        click.echo(template.format(metric + ":", score, fmt_spec=fmt_spec))
    if metrics_file:
        json.dump(
            {metric_code(mname): val for mname, val in metrics.items()},
            metrics_file,
            indent=2,
        )


@cli.command("run")
@click.option("--host", type=str, default="127.0.0.1")
@click.option("--port", type=int, default=5000)
@click_log.simple_verbosity_option(logger)
def run_app(**kwargs):
    """
    Run Annif in server mode for development.
    \f
    The server is for development purposes only.
    """
    kwargs = {k: v for k, v in kwargs.items() if v is not None}
    cxapp = annif.create_cx_app()
    cxapp.run(**kwargs)


FILTER_BATCH_MAX_LIMIT = 15
OPTIMIZE_METRICS = ["Precision (doc avg)", "Recall (doc avg)", "F1 score (doc avg)"]


@cli.command("optimize")
@cli_util.project_id
@click.argument("paths", type=click.Path(exists=True), nargs=-1)
@click.option(
    "--jobs", "-j", default=1, help="Number of parallel jobs (0 means all CPUs)"
)
@cli_util.docs_limit_option
@cli_util.backend_param_option
@cli_util.common_options
def run_optimize(project_id, paths, jobs, docs_limit, backend_param):
    """
    Suggest subjects for documents, testing multiple limits and thresholds.
    \f
    This command will use different limit (maximum number of subjects) and
    score threshold values when assigning subjects to each document given by
    ``PATHS`` and compare the results against the gold standard subjects in the
    documents. The output is a list of parameter combinations and their scores.
    From the output, you can determine the optimum limit and threshold
    parameters depending on which measure you want to target.
    """
    project = cli_util.get_project(project_id)
    backend_params = cli_util.parse_backend_params(backend_param, project)
    filter_params = cli_util.generate_filter_params(FILTER_BATCH_MAX_LIMIT)

    import annif.eval

    corpus = cli_util.open_documents(
        paths, project.subjects, project.vocab_lang, docs_limit
    )

    jobs, pool_class = annif.parallel.get_pool(jobs)

    project.initialize(parallel=True)
    psmap = annif.parallel.ProjectSuggestMap(
        project.registry,
        [project_id],
        backend_params,
        limit=FILTER_BATCH_MAX_LIMIT,
        threshold=0.0,
    )

    ndocs = 0
    suggestion_batches = []
    subject_set_batches = []
    with pool_class(jobs) as pool:
        for suggestion_batch, subject_sets in pool.imap_unordered(
            psmap.suggest_batch, corpus.doc_batches
        ):
            ndocs += len(suggestion_batch[project_id])
            suggestion_batches.append(suggestion_batch[project_id])
            subject_set_batches.append(subject_sets)

    from annif.suggestion import SuggestionResults

    orig_suggestion_results = SuggestionResults(suggestion_batches)

    click.echo("\t".join(("Limit", "Thresh.", "Prec.", "Rec.", "F1")))

    best_scores = collections.defaultdict(float)
    best_params = {}

    template = "{:d}\t{:.02f}\t{:.04f}\t{:.04f}\t{:.04f}"

    for limit, threshold in filter_params:
        eval_batch = annif.eval.EvaluationBatch(project.subjects)
        filtered_results = orig_suggestion_results.filter(limit, threshold)
        for batch, subject_sets in zip(filtered_results.batches, subject_set_batches):
            eval_batch.evaluate_many(batch, subject_sets)
        results = eval_batch.results(metrics=OPTIMIZE_METRICS)
        for metric, score in results.items():
            if score >= best_scores[metric]:
                best_scores[metric] = score
                best_params[metric] = (limit, threshold)
        click.echo(
            template.format(
                limit,
                threshold,
                results["Precision (doc avg)"],
                results["Recall (doc avg)"],
                results["F1 score (doc avg)"],
            )
        )

    click.echo()
    template2 = "Best {:>19}: {:.04f}\tLimit: {:d}\tThreshold: {:.02f}"
    for metric in OPTIMIZE_METRICS:
        click.echo(
            template2.format(
                metric,
                best_scores[metric],
                best_params[metric][0],
                best_params[metric][1],
            )
        )
    click.echo("Documents evaluated:\t{}".format(ndocs))


@cli.command("hyperopt")
@cli_util.project_id
@click.argument("paths", type=click.Path(exists=True), nargs=-1)
@click.option("--trials", "-T", default=10, help="Number of trials")
@click.option(
    "--jobs", "-j", default=1, help="Number of parallel runs (0 means all CPUs)"
)
@click.option(
    "--metric", "-m", default="NDCG", help="Metric to optimize (default: NDCG)"
)
@click.option(
    "--results-file",
    "-r",
    type=click.File("w", encoding="utf-8", errors="ignore", lazy=True),
    help="""Specify file path to write trial results as TSV.
    File directory must exist, existing file will be overwritten.""",
)
@cli_util.docs_limit_option
@cli_util.common_options
def run_hyperopt(project_id, paths, docs_limit, trials, jobs, metric, results_file):
    """
    Optimize the hyperparameters of a project using validation documents from
    ``PATHS``. Not supported by all backends. Output is a list of trial results
    and a report of the best performing parameters.
    """
    proj = cli_util.get_project(project_id)
    documents = cli_util.open_documents(
        paths, proj.subjects, proj.vocab_lang, docs_limit
    )
    click.echo(f"Looking for optimal hyperparameters using {trials} trials")
    rec = proj.hyperopt(documents, trials, jobs, metric, results_file)
    click.echo(f"Got best {metric} score {rec.score:.4f} with:")
    click.echo("---")
    for line in rec.lines:
        click.echo(line)
    click.echo("---")


@cli.command("upload")
@click.argument("project_ids_pattern", shell_complete=cli_util.complete_param)
@click.argument("repo_id")
@click.option(
    "--token",
    help="""Authentication token, obtained from the Hugging Face Hub.
    Will default to the stored token.""",
)
@click.option(
    "--revision",
    help="""An optional git revision to commit from. Defaults to the head of the "main"
    branch.""",
)
@click.option(
    "--commit-message",
    help="""The summary / title / first line of the generated commit.""",
)
@click.option(
    "--modelcard/--no-modelcard",
    default=True,
    help="Update or create a Model Card with upload.",
)
@cli_util.common_options
def run_upload(
    project_ids_pattern, repo_id, token, revision, commit_message, modelcard
):
    """
    Upload selected projects and their vocabularies to a Hugging Face Hub repository.
    \f
    This command zips the project directories and vocabularies of the projects
    that match the given `project_ids_pattern` to archive files, and uploads the
    archives along with the project configurations to the specified Hugging Face
    Hub repository. An authentication token and commit message can be given with
    options. If the README.md does not exist in the repository, it is created
    with default contents and metadata of the uploaded projects; if it exists,
    its metadata are updated as necessary.
    """
    from huggingface_hub import HfApi
    from huggingface_hub.utils import HfHubHTTPError, HFValidationError

    projects = hfh_util.get_matching_projects(project_ids_pattern)
    click.echo(f"Uploading project(s): {', '.join([p.project_id for p in projects])}")

    commit_message = (
        commit_message
        if commit_message is not None
        else f"Upload project(s) {project_ids_pattern} with Annif"
    )

    fobjs, operations = [], []
    try:
        fobjs, operations = hfh_util.prepare_commits(projects, repo_id, token)
        api = HfApi()
        api.create_commit(
            repo_id=repo_id,
            operations=operations,
            commit_message=commit_message,
            revision=revision,
            token=token,
        )
    except (HfHubHTTPError, HFValidationError) as err:
        raise OperationFailedException(str(err))
    else:
        if modelcard:
            hfh_util.upsert_modelcard(repo_id, projects, token, revision)
    finally:
        for fobj in fobjs:
            fobj.close()
772
@cli.command("download")
773
@click.argument("project_ids_pattern")
774
@click.argument("repo_id")
775
@click.option(
776
    "--token",
777
    help="""Authentication token, obtained from the Hugging Face Hub.
778
    Will default to the stored token.""",
779
)
780
@click.option(
781
    "--revision",
782
    help="""
783
    An optional Git revision id which can be a branch name, a tag, or a commit
784
    hash.
785
    """,
786
)
787
@click.option(
788
    "--force",
789
    "-f",
790
    default=False,
791
    is_flag=True,
792
    help="Replace an existing project/vocabulary/config with the downloaded one",
793
)
794
@click.option(
795
    "--trust-repo",
796
    default=False,
797
    is_flag=True,
798
    help="Allow download from the repository even when it has no entries in the cache",
799
)
800
@cli_util.common_options
801
def run_download(project_ids_pattern, repo_id, token, revision, force, trust_repo):
    """
    Download selected projects and their vocabularies from a Hugging Face Hub
    repository.
    \f
    This command downloads the project and vocabulary archives and the
    configuration files of the projects that match the given
    `project_ids_pattern` from the specified Hugging Face Hub repository,
    unzips the archives into the `data/` directory and places the
    configuration files in the `projects.d/` directory. An authentication
    token and revision can be given with options. If the repository hasn't
    been used for downloads previously (i.e., it doesn't appear in the Hugging
    Face Hub cache on the local system), the `--trust-repo` option needs to be
    used.
    """

    hfh_util.check_is_download_allowed(trust_repo, repo_id)

    project_ids = hfh_util.get_matching_project_ids_from_hf_hub(
        project_ids_pattern, repo_id, token, revision
    )
    click.echo(f"Downloading project(s): {', '.join(project_ids)}")

    vocab_ids = set()
    for project_id in project_ids:
        project_zip_cache_path = hfh_util.download_from_hf_hub(
            f"projects/{project_id}.zip", repo_id, token, revision
        )
        hfh_util.unzip_archive(project_zip_cache_path, force)
        config_file_cache_path = hfh_util.download_from_hf_hub(
            f"{project_id}.cfg", repo_id, token, revision
        )
        vocab_ids.add(hfh_util.get_vocab_id_from_config(config_file_cache_path))
        hfh_util.copy_project_config(config_file_cache_path, force)

    for vocab_id in vocab_ids:
        vocab_zip_cache_path = hfh_util.download_from_hf_hub(
            f"vocabs/{vocab_id}.zip", repo_id, token, revision
        )
        hfh_util.unzip_archive(vocab_zip_cache_path, force)


@cli.command("completion")
@click.option("--bash", "shell", flag_value="bash")
@click.option("--zsh", "shell", flag_value="zsh")
@click.option("--fish", "shell", flag_value="fish")
def run_completion(shell):
    """Generate the script for tab-key autocompletion for the given shell. To enable
    completion support in your current bash terminal session, run\n
        source <(annif completion --bash)

    To enable completion support in all new sessions, first save the completion script
    in your home directory:\n
        annif completion --bash > ~/.annif-complete.bash

    Then make the script be sourced automatically in new terminal sessions by adding
    the following to your ~/.bashrc file (or some alternative startup file)\n
        source ~/.annif-complete.bash
    """

    if shell is None:
        raise click.UsageError("Shell not given, try --bash, --zsh or --fish")

    script = os.popen(f"_ANNIF_COMPLETE={shell}_source annif").read()
    click.echo(f"# Generated by Annif {importlib.metadata.version('annif')}")
    click.echo(script)


@cli.command("detect-language")
@click.argument("languages")
@click.argument(
    "paths", type=click.Path(dir_okay=False, exists=True, allow_dash=True), nargs=-1
)
def run_detect_language(languages, paths):
    """
    Detect the language of a single text document read from standard input, or
    of one or more document files given their paths.
    """

    langs = tuple(languages.split(","))

    def detect_language_and_show(text, languages):
        try:
            proportions = detect_language(text, languages)
        except ValueError as e:
            raise click.UsageError(e)
        for lang, score in proportions.items():
            if lang == "unk":
                lang = "?"
            click.echo(f"{lang}\t{score:.04f}")

    if paths and not (len(paths) == 1 and paths[0] == "-"):
        doclist = cli_util.open_text_documents(paths, docs_limit=None)
        for doc, path in zip(doclist.documents, paths):
            click.echo(f"Detected languages for {path}")
            detect_language_and_show(doc.text, langs)
    else:
        text = sys.stdin.read()
        detect_language_and_show(text, langs)


if __name__ == "__main__":
    cli()