Commit d1a36317 authored by Franziska Niemeyer

Upload solutions for exercises_D

parent bef72f98
%% Cell type:markdown id: tags:
# Python course 2021 - Exercises D
%% Cell type:markdown id: tags:
## Part 1 - Writing files
%% Cell type:markdown id: tags:
---
1.1) Read the file AtCol0_Exons.fasta and write all headers (starting with '>') into a new file!
%% Cell type:code id: tags:
```
from google.colab import drive

drive.mount('/content/drive')

with open("/content/drive/MyDrive/ColabNotebooks/UniPythonCourse/Exercises/data/AtCol0_Exons.fasta", 'r') as exons:
    with open("/content/drive/MyDrive/ColabNotebooks/UniPythonCourse/Exercises/data/Header_AtCol0_Exons.txt", 'w') as new_file:
        line = exons.readline()
        while line:
            if line[0] == '>':  # Header lines start with '>'
                new_file.write(line)
            line = exons.readline()
```
%% Output
Mounted at /content/drive
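%% Cell type:markdown id: tags:
The same header filter can also be written with a `for` loop over the file object, which reads one line at a time just like `readline()`. The following is a minimal sketch, not part of the original solution: the function name `write_headers` and the small temporary demo file are assumptions made for illustration so that the cell runs without Google Drive.
%% Cell type:code id: tags:
```python
import os
import tempfile


def write_headers(fasta_path, out_path):
    """Copy every header line (starting with '>') to out_path and return how many were written."""
    n_headers = 0
    with open(fasta_path, 'r') as fasta, open(out_path, 'w') as out:
        for line in fasta:  # Iterating over a file object yields one line at a time
            if line.startswith('>'):
                out.write(line)
                n_headers += 1
    return n_headers


# Tiny made-up FASTA file (hypothetical data, only for this demonstration)
tmp_dir = tempfile.mkdtemp()
demo_fasta = os.path.join(tmp_dir, "demo.fasta")
demo_headers = os.path.join(tmp_dir, "demo_headers.txt")
with open(demo_fasta, 'w') as handle:
    handle.write(">seq1\nACGT\nAC\n>seq2\nGGG\n")

print("Headers written:", write_headers(demo_fasta, demo_headers))
```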
%% Cell type:markdown id: tags:
---
1.2) Read the file AtCol0_Exons.fasta and write the following into a new file:
* The line itself if it is a header
* The length of the line if it is a sequence line
%% Cell type:code id: tags:
```
with open("/content/drive/MyDrive/ColabNotebooks/UniPythonCourse/Exercises/data/AtCol0_Exons.fasta", 'r') as exons:
    with open("/content/drive/MyDrive/ColabNotebooks/UniPythonCourse/Exercises/data/Summary_AtCol0_Exons.txt", 'w') as new_file:
        line = exons.readline()
        while line:
            if line[0] == '>':  # Header: copy the line unchanged
                new_file.write(line)
            else:  # Sequence line: write its length without the trailing newline
                new_file.write(str(len(line.strip())) + "\n")
            line = exons.readline()
```
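%% Cell type:markdown id: tags:
The per-line decision in this exercise can be isolated in a small function, which makes it easy to check on a few lines before running it on the full file. This is a sketch under assumptions: `summarize_line` and the demo lines are hypothetical names and data, not part of the original solution.
%% Cell type:code id: tags:
```python
def summarize_line(line):
    """Return the text to write for one FASTA line: the header itself, or the sequence line's length."""
    if line.startswith('>'):
        # Make sure the header ends with exactly one newline
        return line if line.endswith('\n') else line + '\n'
    # Sequence line: count the characters without surrounding whitespace
    return str(len(line.strip())) + '\n'


# Made-up FASTA lines (hypothetical data)
demo_lines = [">seq1\n", "ACGT\n", "AC\n", ">seq2\n", "GGG\n"]
summary = [summarize_line(line) for line in demo_lines]
print("".join(summary), end="")
```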
%% Cell type:markdown id: tags:
---
1.3) Calculate the number of sequences, the cumulative length and the average length from the new file! Do they match the values of the original file?
%% Cell type:code id: tags:
```
def summarize_seq_info(summary_file):
    with open(summary_file, 'r') as summary:
        seq_count = 0
        cum_len = 0
        line = summary.readline()
        while line:
            if line[0] == '>':
                seq_count += 1
            else:
                cum_len += int(line.strip())
            line = summary.readline()
    print("Number of sequences:", seq_count)
    print("Cumulative length:", cum_len, "bases")
    print("Average sequence length:", cum_len / seq_count, "bases")

summarize_seq_info("/content/drive/MyDrive/ColabNotebooks/UniPythonCourse/Exercises/data/Summary_AtCol0_Exons.txt")
```
%% Output
Number of sequences: 217183
Cumulative length: 64867051 bases
Average sequence length: 298.67462462531597 bases
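%% Cell type:markdown id: tags:
To answer whether the values match the original file, the same two numbers can be computed directly from the FASTA lines and compared with the numbers from the summary format. The sketch below assumes both inputs are iterables of lines (an open file handle would also work); `fasta_stats`, `summary_stats` and the demo data are hypothetical and only illustrate the comparison.
%% Cell type:code id: tags:
```python
def fasta_stats(lines):
    """Return (number of sequences, cumulative length) computed from FASTA lines."""
    seq_count = 0
    cum_len = 0
    for line in lines:
        if line.startswith('>'):
            seq_count += 1
        else:
            cum_len += len(line.strip())
    return seq_count, cum_len


def summary_stats(lines):
    """Return the same pair from the summary format (header lines and line lengths)."""
    seq_count = 0
    cum_len = 0
    for line in lines:
        if line.startswith('>'):
            seq_count += 1
        elif line.strip():  # Skip empty lines, convert everything else to a number
            cum_len += int(line)
    return seq_count, cum_len


# Made-up data (hypothetical): a tiny FASTA file and its summary
demo_fasta = [">seq1\n", "ACGT\n", "AC\n", ">seq2\n", "GGG\n"]
demo_summary = [">seq1\n", "4\n", "2\n", ">seq2\n", "3\n"]

count, total = fasta_stats(demo_fasta)
print("Sequences:", count, "Cumulative length:", total, "Average:", total / count)
print("Summary matches original:", fasta_stats(demo_fasta) == summary_stats(demo_summary))
```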
%% Cell type:markdown id: tags:
---
1.4) Write sequences into a new file if their length is a multiple of 10!
%% Cell type:code id: tags:
```
def seq_lens_multiple_of_10(fasta_file):
    with open(fasta_file, 'r') as fasta_input:
        with open("/content/drive/MyDrive/ColabNotebooks/UniPythonCourse/Exercises/data/AtCol0_mult10.txt", 'w') as out:
            cum_len = 0
            sequence = ""
            line = fasta_input.readline()
            while line:
                if line[0] == '>':  # Reading the next header
                    # Write the previous sequence if it is non-empty and its length is a multiple of 10
                    if sequence and cum_len % 10 == 0:
                        out.write(sequence + '\n')
                    cum_len = 0  # Reset the sequence length and the sequence as we are in the next sequence now
                    sequence = ""
                else:
                    sequence += line.strip()  # Append the sequence to the current one as long as no other header is in between
                    cum_len += len(line.strip())  # Update the cumulative length for this sequence
                line = fasta_input.readline()
            # The last sequence has no following header, so check it after the loop
            if sequence and cum_len % 10 == 0:
                out.write(sequence + '\n')

seq_lens_multiple_of_10("/content/drive/MyDrive/ColabNotebooks/UniPythonCourse/Exercises/data/AtCol0_Exons.fasta")
```
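%% Cell type:markdown id: tags:
A sequence can span several lines, so its full length is only known once the next header (or the end of the file) is reached. One way to handle this is a generator that collects the lines of each record and yields complete `(header, sequence)` pairs; the filter then becomes a single condition. This is a sketch under assumptions: `read_fasta_records` and the demo lines are hypothetical and not part of the original solution.
%% Cell type:code id: tags:
```python
def read_fasta_records(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if line.startswith('>'):
            if header is not None:  # A new header closes the previous record
                yield header, "".join(chunks)
            header = line
            chunks = []
        elif line:
            chunks.append(line)
    if header is not None:  # The last record has no header after it
        yield header, "".join(chunks)


# Made-up FASTA lines (hypothetical data): lengths are 10, 3 and 20
demo_lines = [
    ">seq1\n", "ACGTACGT\n", "AC\n",
    ">seq2\n", "GGG\n",
    ">seq3\n", "ACGTACGTAC\n", "ACGTACGTAC\n",
]

selected = [seq for _, seq in read_fasta_records(demo_lines) if seq and len(seq) % 10 == 0]
print(selected)
```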
%% Cell type:markdown id: tags:
## Part 2 - Characters
%% Cell type:markdown id: tags:
---
2.1) Read the file AtCol0_Exons.fasta and write the following into a new file:
* Only entries with an Arabidopsis gene identifier (e.g. AT1G01010)
* Gene identifier, exon name and exon length (tab-delimited)
%% Cell type:code id: tags:
```
import re

def arabidopsis_only(fasta_file):
    with open(fasta_file, 'r') as summary:
        with open("/content/drive/MyDrive/ColabNotebooks/UniPythonCourse/Exercises/data/Arabidopsis_Exons.txt", 'w') as arabidopsis:
            line = summary.readline()
            while line:
                if line.startswith('>AT'):
                    columns = line.split('|')  # Header fields are separated by '|'
                    gene_identifier = columns[0].strip('>').split('.')[0]  # Drop '>' and the transcript suffix
                    # Raw string, so that \d reaches the regex engine unchanged
                    if re.fullmatch(r"AT\dG\d{5}", gene_identifier):
                        exon_name = columns[1].strip()
                        exon_length = columns[3].strip().split(' ')[2]
                        arabidopsis.write(gene_identifier + '\t' + exon_name + '\t' + exon_length + '\n')
                line = summary.readline()

arabidopsis_only("/content/drive/MyDrive/ColabNotebooks/UniPythonCourse/Exercises/data/AtCol0_Exons.fasta")
```
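%% Cell type:markdown id: tags:
The pattern `AT\dG\d{5}` requires a digit for the chromosome, so it accepts nuclear identifiers such as AT1G01010 and rejects chloroplast or mitochondrial identifiers such as ATCG00010. The sketch below shows this on a few header fragments. The helper `gene_id_from_header` and the example headers are hypothetical; only the first `|` column is used, so nothing is assumed about the remaining header fields.
%% Cell type:code id: tags:
```python
import re

# 'AT', one chromosome digit, 'G', five digits (the pattern used in the solution above)
GENE_ID_PATTERN = re.compile(r"AT\dG\d{5}")


def gene_id_from_header(header):
    """Return the gene identifier from the first '|' column of a header, or None if it does not match."""
    first_column = header.split('|')[0].lstrip('>').strip()
    candidate = first_column.split('.')[0]  # Drop a transcript suffix such as '.1'
    return candidate if GENE_ID_PATTERN.fullmatch(candidate) else None


# Hypothetical header fragments; everything after the first '|' is a placeholder
demo_headers = [">AT1G01010.1 | rest", ">ATCG00010.1 | rest", ">AT1G0101.1 | rest"]
for header in demo_headers:
    print(header, "->", gene_id_from_header(header))
```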