Conditions | 56 |
Total Lines | 368 |
Code Lines | 163 |
Lines | 0 |
Ratio | 0 % |
Changes | 0 |
Small methods make your code easier to understand, in particular if combined with a good name. Besides, if your method is small, finding a good name is usually much easier.
For example, if you find yourself adding comments to a method's body, this is usually a good sign to extract the commented part to a new method, and use the comment as a starting point when coming up with a good name for this new method.
Commonly applied refactorings include:
If many parameters/temporary variables are present:
Complex classes like build.rna_tools.Seq.RNASequence.predict_ss() often do a lot of different things. To break such a class down, we need to identify a cohesive component within that class. A common approach to find such a component is to look for fields/methods that share the same prefixes, or suffixes.
Once you have determined the fields that belong together, you can apply the Extract Class refactoring. If the component makes sense as a sub-class, Extract Subclass is also a candidate, and is often faster.
Methods with many parameters are not only hard to understand, but their parameters also often become inconsistent when you need more, or different data.
There are several approaches to avoid long parameter lists:
1 | #!/usr/bin/env python3 |
||
229 | def predict_ss(self, method="RNAfold", constraints='', enforce_constraint=False, shapefn='', explore='', verbose=0, path=''): |
||
230 | """Predict secondary structure of the seq. |
||
231 | |||
232 | Args: |
||
233 | method: {mcfold, RNAfold} |
||
234 | onstraints: |
||
235 | shapefn (str): path to a file with shape reactivites |
||
236 | verbose (boolean) |
||
237 | |||
238 | It creates a seq fasta file and runs various methods for secondary structure |
||
239 | prediction. You can provide also a constraints file for RNAfold and RNAsubopt. |
||
240 | |||
241 | Methods that can be used with contraints: RNAsubopt, RNAfold, mcfold. |
||
242 | |||
243 | Methods that can be used with SHAPE contraints: RNAfold. |
||
244 | |||
245 | **ContextFold** |
||
246 | |||
247 | Example:: |
||
248 | |||
249 | $ java -cp bin contextFold.app.Predict in:CCCCUUUGGGGG |
||
250 | CCCCUUUGGGGG |
||
251 | ((((....)))) |
||
252 | |||
253 | It seems that a seq has to be longer than 9. Otherwise:: |
||
254 | |||
255 | $ java -cp bin contextFold.app.Predict in:UUUUUUGGG |
||
256 | Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 10 |
||
257 | |||
258 | # this is OK |
||
259 | $ java -cp bin contextFold.app.Predict in:CCCCUUUGGG |
||
260 | CCCCUUUGGG |
||
261 | .(((...))) |
||
262 | |||
263 | |||
264 | **RNAstructure** |
||
265 | |||
266 | Example:: |
||
267 | |||
268 | >>> seq = RNASequence("GGGGUUUUCCC") |
||
269 | >>> print(seq.predict_ss("rnastructure")) |
||
270 | > ENERGY = -4.4 rna_seq |
||
271 | GGGGUUUUCCC |
||
272 | ((((...)))) |
||
273 | |||
274 | and with the shape data:: |
||
275 | |||
276 | >>> print(seq.predict_ss("rnastructure", shapefn="data/shape.txt")) |
||
277 | > ENERGY = -0.2 rna_seq |
||
278 | GGGGUUUUCCC |
||
279 | .(((....))) |
||
280 | |||
281 | the shape data:: |
||
282 | |||
283 | 1 10 |
||
284 | 2 1 |
||
285 | 3 1 |
||
286 | |||
287 | You can easily see that the first G is unpaired right now! The reactivity of this G was |
||
288 | set to 10. Worked! |
||
289 | |||
290 | |||
291 | **MC-Fold** |
||
292 | |||
293 | MC-Fold uses the online version of the tool, this is very powerful with constraints:: |
||
294 | |||
295 | rna_seq |
||
296 | acucggcuaggcgaguauaaauagccgucaggccuagcgcguccaagccuagccccuucuggggcugggcgaagggucggg |
||
297 | ((((........)))).......((((..............(((((((((((((((....)))))))))))))))..)))) |
||
298 | curl -Y 0 -y 300 -F "pass=lucy" -F mask="((((........)))).......((((..............(((((((((((((((....)))))))))))))))..))))" -F sequence="acucggcuaggcgaguauaaauagccgucaggccuagcgcguccaagccuagccccuucuggggcugggcgaagggucggg" https://www.major.iric.ca/cgi-bin/MC-Fold/mcfold.static.cgi |
||
299 | mcfold::energy best dynamics programming: -53.91 |
||
300 | (-53.91, '((((........)))).......((((..............(((((((((((((((....)))))))))))))))..))))') |
||
301 | |||
302 | curl -Y 0 -y 300 -F "pass=lucy" -F mask="((((........)))).......((((..............((((((((((..............))))))))))..))))" -F sequence="acucggcuaggcgaguauaaauagccgucaggccuagcgcguccaagccuagccccuucuggggcugggcgaagggucggg" https://www.major.iric.ca/cgi-bin/MC-Fold/mcfold.static.cgi |
||
303 | mcfold::energy best dynamics programming: -34.77 |
||
304 | (-34.77, '((((........)))).......((((..............((((((((((..............))))))))))..))))') |
||
305 | |||
306 | acucggcuaggcgaguauaaauagccgucaggccuagcgcguccaagccuagccccuucuggggcugggcgaagggucggg |
||
307 | ((((........)))).......((((..............(((((((((((((((....)))))))))))))))..)))) |
||
308 | curl -Y 0 -y 300 -F "pass=lucy" -F mask="((((xxxxxxxx))))xxxxxxx((((xxxxxxxxxxxxxx((((((((((xxxxxxxxxxxxxx))))))))))xx))))" -F sequence="acucggcuaggcgaguauaaauagccgucaggccuagcgcguccaagccuagccccuucuggggcugggcgaagggucggg" https://www.major.iric.ca/cgi-bin/MC-Fold/mcfold.static.cgi |
||
309 | mcfold::energy best dynamics programming: -34.77 |
||
310 | (-34.77, '((((........)))).......((((..............((((((((((..............))))))))))..))))') |
||
311 | |||
312 | |||
313 | acucggcuaggcgaguauaaauagccgucaggccuagcgcguccaagccuagccccuucuggggcugggcgaagggucggg |
||
314 | ((((........)))).......((((..............(((((((((((((((....)))))))))))))))..)))) |
||
315 | curl -Y 0 -y 300 -F "pass=lucy" -F mask="((((********))))*******((((**************((((((((((**************))))))))))**))))" -F sequence="acucggcuaggcgaguauaaauagccgucaggccuagcgcguccaagccuagccccuucuggggcugggcgaagggucggg" https://www.major.iric.ca/cgi-bin/MC-Fold/mcfold.static.cgi |
||
316 | mcfold::energy best dynamics programming: -77.30 |
||
317 | (-71.12, '(((((((..))))))).......((((((.(((...)))..(((((((((((((((....)))))))))))))))))))))') |
||
318 | |||
319 | acucggcuaggcgaguauaaauagccgucaggccuagcgcguccaagccuagccccuucuggggcugggcgaagggucggg |
||
320 | ((((........)))).......((((..............(((((((((((((((....)))))))))))))))..)))) |
||
321 | curl -Y 0 -y 300 -F "pass=lucy" -F mask="((((**[[[[[**))))*******((((****]]]]]****(((((((((((((((****)))))))))))))))**))))" -F sequence="acucggcuaggcgaguauaaauagccgucaggccuagcgcguccaagccuagccccuucuggggcugggcgaagggucggg" https://www.major.iric.ca/cgi-bin/MC-Fold/mcfold.static.cgi |
||
322 | mcfold::energy best dynamics programming: -77.30 |
||
323 | ('-77.30', '((((**[[[[[**))))*******((((****]]]]]****(((((((((((((((****)))))))))))))))**))))') |
||
324 | |||
325 | **explore** |
||
326 | |||
327 | The sub-optimal search space can be constrained within a percentage of the minimum free energy structure, as MC-fold makes use of the Waterman-Byers algorithm [18, 19]. Because the exploration has an exponential time complexity, increasing this value can have a dramatic effect on MC-Fold’s run time. |
||
328 | |||
329 | Parisien, M., & Major, F. (2009). RNA Modeling Using the MC-Fold and MC-Sym Pipeline [Manual] (pp. 1–84). |
||
330 | |||
331 | |||
332 | """ |
||
333 | |||
334 | tf = tempfile.NamedTemporaryFile(delete=False) |
||
335 | tf.name += '.fa' |
||
336 | with open(tf.name, 'w') as f: |
||
337 | f.write('>' + self.name + '\n') |
||
338 | f.write(self.seq + '\n') |
||
339 | if constraints: |
||
340 | f.write(constraints) |
||
341 | |||
342 | # check for seq and constraints |
||
343 | if constraints: |
||
344 | if len(self.seq) != len(constraints): |
||
345 | raise Exception('The seq and constraints should be of the same length: %i %s %i %s' % (len(self.seq), self.seq, len(constraints), constraints)) |
||
346 | |||
347 | # run prediction |
||
348 | # rnafold without contraints |
||
349 | if method == "RNAfold" and constraints: |
||
350 | cmd = 'RNAfold -C < ' + tf.name |
||
351 | if verbose: |
||
352 | print(cmd) |
||
353 | self.ss_log = subprocess.check_output(cmd, shell=True).decode() |
||
354 | return '\n'.join(self.ss_log.strip().split('\n')[:]) |
||
355 | |||
356 | if method == "RNAfoldX" and constraints: |
||
357 | if enforce_constraint: |
||
358 | cmd = 'RNAfold -p -d2 --noLP -C --enforceConstraint < ' + tf.name |
||
359 | else: |
||
360 | cmd = 'RNAfold -p -d2 --noLP -C < ' + tf.name |
||
361 | if verbose: |
||
362 | print(cmd) |
||
363 | |||
364 | try: |
||
365 | self.ss_log = subprocess.check_output(cmd, shell=True).decode() |
||
366 | except subprocess.CalledProcessError: |
||
367 | print('Error') |
||
368 | return 0, 'error', 0, '', 0, '', 0, 0 |
||
369 | |||
370 | if verbose: |
||
371 | print(self.ss_log) |
||
372 | # parse the results |
||
373 | lines = self.ss_log.split('\n') |
||
374 | if 'Supplied structure constraints create empty solution set for sequence' in self.ss_log: |
||
375 | return 0, 'Supplied structure constraints create empty solution set for sequence', 0, '', 0, '', 0, 0 |
||
376 | #if not 'frequency of mfe structure' in self.ss_log: |
||
377 | |||
378 | # RNAfold -p -d2 --noLP -C < /var/folders/yc/ssr9692s5fzf7k165grnhpk80000gp/T/tmpGiUoo7.fa |
||
379 | # >rna_seq |
||
380 | # AAAUUAAGGGGAAGCGUUGAGCCGCUACCCAUAUGUGGUUCACUCGGAUAGCGGGGAGCUAAUAGUGAAACCGGCCCUUUAGGGG |
||
381 | # ...((((((((.(((......((((((.((....(((...)))..)).))))))...)))..............))))))))... (-19.80) |
||
382 | # ...{(((((((.(((......((((((.((....(((...)))..)).))))))...)))..............)))))))}... [-21.05] |
||
383 | #...((((((((.(((......((((((.((....(((...)))..)).))))))...)))..............))))))))... {-19.80 d=2.34} |
||
384 | # frequency of mfe structure in ensemble 0.131644; ensemble diversity 3.68 |
||
385 | |||
386 | mfess = lines[2].split()[0] |
||
387 | mfe = ' '.join(lines[2].split()[-1:]) |
||
388 | mfe = float(mfe.replace('(', '').replace(')', '')) # (-19.80) ->-19.80 |
||
389 | |||
390 | efess = lines[3].split()[0] # ensamble free energy |
||
391 | efe = ' '.join(lines[3].split()[-1:]) |
||
392 | efe = float(efe.replace('[', '').replace(']', '')) # (-19.80) ->-19.80 |
||
393 | |||
394 | cfess = lines[4].split()[0] # ensamble free energy |
||
395 | cfe, d = ' '.join(lines[4].split()[1:]).split('d') |
||
396 | cfe = float(cfe.replace('{', '').replace('}', '')) # (-19.80) ->-19.80 |
||
397 | |||
398 | words = lines[5].split() # ensamble free energy |
||
399 | freq = round(float(words[6].replace(';', '')), 2) # frequency of mfe structure in ensemble |
||
400 | diversity = float(words[9]) # ensemble diversity |
||
401 | |||
402 | if verbose: |
||
403 | print(mfe, mfess, efe, efess, cfe, cfess, freq, diversity) |
||
404 | return mfe, mfess, efe, efess, cfe, cfess, freq, diversity |
||
405 | |||
406 | elif method == "RNAfold": |
||
407 | cmd = 'RNAfold < ' + tf.name |
||
408 | if verbose: |
||
409 | print(cmd) |
||
410 | self.ss_log = subprocess.check_output(cmd, shell=True).decode() |
||
411 | return '\n'.join(self.ss_log.strip().split('\n')[:]) |
||
412 | |||
413 | elif method == "RNAsubopt" and constraints: |
||
414 | cmd = 'RNAsubopt -C < ' + tf.name |
||
415 | if verbose: |
||
416 | print(cmd) |
||
417 | self.ss_log = subprocess.check_output(cmd, shell=True).decode() |
||
418 | return '\n'.join(self.ss_log.split('\n')[:]) |
||
419 | |||
420 | elif method == "RNAsubopt": |
||
421 | cmd = 'RNAsubopt < ' + tf.name |
||
422 | if verbose: |
||
423 | print(cmd) |
||
424 | self.ss_log = subprocess.check_output(cmd, shell=True).decode() |
||
425 | return '\n'.join(self.ss_log.split('\n')[:]) |
||
426 | |||
427 | elif method == "mcfold": |
||
428 | # -F tope=1 |
||
429 | if explore: |
||
430 | explore_str = " -F explore=" + str(explore) |
||
431 | else: |
||
432 | explore_str = '' |
||
433 | |||
434 | #if constraints: |
||
435 | #cmd = "curl -Y 0 -y 300 -F \"pass=lucy\" -F mask=\"" + constraints + "\" " + explore_str + \ |
||
436 | #" -F sequence=\"" + self.seq + "\" https://www.major.iric.ca/cgi-bin/MC-Fold/mcfold.static.cgi" |
||
437 | cmd = "curl https://www.major.iric.ca/cgi-bin/MC-Fold/mcfold.static.cgi\?pass\=lucy\&sequence\=" + self.seq + "\&top\=20\&explore\=15\&name\=\&mask\='" + constraints + "'\&singlehigh\=\&singlemed\=\&singlelow\=" |
||
438 | # cmd = "curl -Y 0 -y 300 -F \"pass=lucy\" -F sequence=\"" + self.seq + "\" https://www.major.iric.ca/cgi-bin/MC-Fold/mcfold.static.cgi" |
||
439 | if verbose: |
||
440 | print(cmd) |
||
441 | |||
442 | o = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE) |
||
443 | out = o.stdout.read().decode(errors='ignore').strip() |
||
444 | err = o.stderr.read().decode(errors='ignore').strip() |
||
445 | |||
446 | if verbose: |
||
447 | print(out) |
||
448 | |||
449 | # If the structure can't be find, detect this statement and finish this routine. |
||
450 | if 'Explored 0 structures' in out: |
||
451 | return 0.00, '', 'Explored 0 structures' |
||
452 | |||
453 | comment = '' |
||
454 | energy = '' |
||
455 | out = out.split('\n') |
||
456 | for l in out : |
||
457 | # first you will find the best dynamic energy, and in the next loop |
||
458 | # it will be used to search for lines with this energy and secondary |
||
459 | # structure |
||
460 | |||
461 | # (((..))) -5.43 |
||
462 | if energy: # if energy is set |
||
463 | if energy in l: |
||
464 | if verbose: print(l) |
||
465 | ss = l.split()[0] |
||
466 | |||
467 | # Performing Dynamic Programming... |
||
468 | # Best Dynamic Programming Solution has Energy: -5.43 |
||
469 | if l.startswith('Best Dynamic Programming Solution has Energy:'): |
||
470 | energy_bdp = l.split(':')[1].strip() |
||
471 | if verbose: |
||
472 | print ('mcfold::energy best dynamics programming: ' + energy_bdp) |
||
473 | comment = 'energy best dynamics programming' |
||
474 | ss = constraints |
||
475 | # return float(energy), constraints # I'm not sure if this is good |
||
476 | |||
477 | # Ok, for whatever reason Best DP energy might not be exactly the same as and |
||
478 | # the energy listed later for secondary structure. So this code finds this secondary |
||
479 | # structure and gets again the energy for this secondary structure, |
||
480 | # and overwrites the previous energy. |
||
481 | # In this case: |
||
482 | # Best Dynamic Programming Solution has Energy: -5.46 |
||
483 | # ... |
||
484 | # CUCUCGAAAGAUG |
||
485 | # (((.((..))))) -5.44 ( +0.00) |
||
486 | # (((.((..))))) BP >= 50% |
||
487 | |||
488 | # if evenn this will not find ss, then set ss to null not to crash |
||
489 | # and it's possible, like in here |
||
490 | # curl -Y 0 -y 300 -F "pass=lucy" -F mask="((******)))" -F sequence="CCUgcgcaAGG" \ |
||
491 | # http://www.major.iric.ca/cgi-bin/MC-Fold/mcfold.static.cgi |
||
492 | ss = '' |
||
493 | for l in out: |
||
494 | if 'target="_blank">MARNA</a>-formatted:<P><P><P></H2><pre>' in l: |
||
495 | index = out.index(l) |
||
496 | ss_line = out[index + 2] |
||
497 | ss, energy = ss_line.split()[0:2] # '(((.((..))))) -5.44 ( +0.00)' |
||
498 | |||
499 | # if there is |
||
500 | # UUGCCGUAAGACA |
||
501 | # ............. BP >= 50% |
||
502 | # then finish with energy 0.00, and empty ss |
||
503 | if energy == 'BP': |
||
504 | energy = energy_bdp |
||
|
|||
505 | comment = 'BP energy' |
||
506 | return energy_bdp, constraints, comment |
||
507 | # break |
||
508 | |||
509 | # prepare outputs, return and self-s |
||
510 | self.log = out |
||
511 | self.ss = ss |
||
512 | return float(energy), ss, comment |
||
513 | |||
514 | # if method == "RNAsubopt": |
||
515 | # from cogent.app.vienna_package import RNAfold, RNAsubopt |
||
516 | # r = RNAsubopt(WorkingDir="/tmp") |
||
517 | # res = r([self.seq]) |
||
518 | # return str(res['StdOut'].read()).strip() |
||
519 | |||
520 | # if method == 'RNAfold': |
||
521 | # from cogent.app.vienna_package import RNAfold, RNAsubopt |
||
522 | # r = RNAfold(WorkingDir="/tmp") |
||
523 | # res = r([self.seq]) |
||
524 | # self.ss_log = res['StdOut'].read() |
||
525 | # return self.ss_log.strip().split('\n')[-1].split()[0] |
||
526 | |||
527 | elif method == "ipknot": |
||
528 | self.ss_log = subprocess.check_output(IPKNOT_PATH + ' ' + tf.name, shell=True) |
||
529 | return '\n'.join(self.ss_log.decode().split('\n')[2:]) |
||
530 | |||
531 | elif method == "contextfold": |
||
532 | if path: |
||
533 | CONTEXTFOLD_PATH = path |
||
534 | if not CONTEXTFOLD_PATH: |
||
535 | print('Set up CONTEXTFOLD_PATH in configuration.') |
||
536 | sys.exit(0) |
||
537 | cmd = "cd " + CONTEXTFOLD_PATH + \ |
||
538 | " && java -cp bin contextFold.app.Predict in:" + self.seq |
||
539 | if verbose: |
||
540 | print(cmd) |
||
541 | self.ss_log = subprocess.check_output(cmd, shell=True).decode() |
||
542 | return '\n'.join(self.ss_log.split('\n')[1:]) |
||
543 | |||
544 | elif method == "centroid_fold": |
||
545 | self.ss_log = subprocess.check_output('centroid_fold ' + tf.name, shell=True) |
||
546 | return '\n'.join(self.ss_log.split('\n')[2:]) |
||
547 | |||
548 | elif method == "rnastructure": |
||
549 | cmd = RNASTRUCTURE_PATH + '/exe/Fold ' + tf.name + ' ' + tf.name + '.out ' |
||
550 | if shapefn: |
||
551 | cmd += ' -sh ' + shapefn |
||
552 | if verbose: |
||
553 | print(cmd) |
||
554 | o = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE) |
||
555 | stderr = o.stderr.read().strip() |
||
556 | if stderr: |
||
557 | print(stderr) |
||
558 | |||
559 | cmd = RNASTRUCTURE_PATH + '/exe/ct2dot ' + tf.name + '.out 1 ' + \ |
||
560 | tf.name + '.dot' |
||
561 | if verbose: |
||
562 | print(cmd) |
||
563 | o = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE) |
||
564 | stderr = o.stderr.read().strip() |
||
565 | if not stderr: |
||
566 | with open(tf.name + '.dot') as f: |
||
567 | return f.read().strip() |
||
568 | |||
569 | # (-51.15, '.(.(((((((((((((((..))))))))))))))))(..((((((((....)))).))))).') |
||
570 | elif method == "rnastructure_CycleFold": |
||
571 | cmd = RNASTRUCTURE_PATH + '/exe/CycleFold ' + tf.name + ' > ' + tf.name + '.ct ' |
||
572 | if verbose: |
||
573 | print(cmd) |
||
574 | o = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE) |
||
575 | stderr = o.stderr.read().strip() |
||
576 | if stderr: |
||
577 | print(stderr) |
||
578 | |||
579 | # get energy |
||
580 | energy = float(open(tf.name + '.ct').readline().split("energy:")[1].strip()) # >rna_seq energy: -51.1500 |
||
581 | |||
582 | # get ss in dot-bracket notation |
||
583 | cmd = RNASTRUCTURE_PATH + '/exe/ct2dot ' + tf.name + '.ct 1 ' + \ |
||
584 | tf.name + '.dot' |
||
585 | if verbose: |
||
586 | print(cmd) |
||
587 | o = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE) |
||
588 | stderr = o.stderr.read().strip() |
||
589 | if not stderr: |
||
590 | with open(tf.name + '.dot') as f: |
||
591 | # (-51.15, '.(.(((((((((((((((..))))))))))))))))(..((((((((....)))).))))).') |
||
592 | return energy, f.read().strip().split('\n')[2] |
||
593 | |||
594 | |||
595 | else: |
||
596 | raise MethodNotChosen('You have to define a correct method to use.') |
||
597 | |||
646 |