Gerald W. Höfner / Python programming
Commit d1a36317, authored 2 years ago by Franziska Niemeyer
Upload solutions for exercises_D
parent bef72f98
Showing 1 changed file with 301 additions and 0 deletions
Exercises/solutions/Python_course_2021_exercises_D.ipynb (0 → 100644), view file @ d1a36317
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "Python_course_2021_exercises_D.ipynb",
"provenance": [],
"collapsed_sections": []
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "xqfYLmi0LWEl",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"# Python course 2021 - Exercises D"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "LRZcpmP8LaR_",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## Part1 - writing files"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "NDIaKYRcLfz1",
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"\n",
"\n",
"---\n",
"1.1) Read the file AtCol0_Exons.fasta and write all headers (starting with '>') into a new file!\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Mounted at /content/drive\n"
]
}
],
"source": [
"from google.colab import drive\n",
"drive.mount('/content/drive')\n",
"\n",
"with open(\"/content/drive/MyDrive/ColabNotebooks/UniPythonCourse/Exercises/data/AtCol0_Exons.fasta\", 'r') as exons:\n",
"    with open(\"/content/drive/MyDrive/ColabNotebooks/UniPythonCourse/Exercises/data/Header_AtCol0_Exons.txt\", 'w') as new_file:\n",
"        line = exons.readline()\n",
"        while line:\n",
"            if line[0] == '>':\n",
"                new_file.write(line)\n",
"            line = exons.readline()"
],
"metadata": {
"pycharm": {
"name": "#%%\n"
},
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "9OMVwvZLzW9A",
"outputId": "36f854a7-f0f2-4472-f3e5-081fad69ebe9"
}
},
{
"cell_type": "markdown",
"source": [
"\n",
"\n",
"---\n",
"1.2) Read the file AtCol0_Exons.fasta and write the following:\n",
"* Line if it is a header\n",
"* Length of line if it is a sequence line\n"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
},
"id": "4S2i2BAjzW9B"
}
},
{
"cell_type": "code",
"execution_count": 2,
"outputs": [],
"source": [
"with open(\"/content/drive/MyDrive/ColabNotebooks/UniPythonCourse/Exercises/data/AtCol0_Exons.fasta\", 'r') as exons:\n",
"    with open(\"/content/drive/MyDrive/ColabNotebooks/UniPythonCourse/Exercises/data/Summary_AtCol0_Exons.txt\", 'w') as new_file:\n",
"        line = exons.readline()\n",
"        while line:\n",
"            if line[0] == '>':\n",
"                new_file.write(line)\n",
"            else:\n",
"                new_file.write(str(len(line.strip())) + \"\\n\")\n",
"            line = exons.readline()"
],
"metadata": {
"pycharm": {
"name": "#%%\n"
},
"id": "WRHfLJuJzW9C"
}
},
{
"cell_type": "markdown",
"source": [
"\n",
"\n",
"---\n",
"1.3) Calculate the number of sequences, the cumulative length and the average length in a new file! Do they match the values of the original file?\n"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
},
"id": "2Uy9-PtVzW9C"
}
},
{
"cell_type": "code",
"execution_count": 3,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Number of sequences: 217183\n",
"Cumulative length: 64867051 bases\n",
"Average sequence length: 298.67462462531597 bases\n"
]
}
],
"source": [
"def summarize_seq_info(summary_file):\n",
"    with open(summary_file, 'r') as summary:\n",
"        seq_count = 0\n",
"        cum_len = 0\n",
"        line = summary.readline()\n",
"        while line:\n",
"            if line[0] == '>':\n",
"                seq_count += 1\n",
"            else:\n",
"                cum_len += int(line.strip())\n",
"            line = summary.readline()\n",
"    print(\"Number of sequences:\", seq_count)\n",
"    print(\"Cumulative length:\", cum_len, \"bases\")\n",
"    print(\"Average sequence length:\", cum_len / seq_count, \"bases\")\n",
"\n",
"summarize_seq_info(\"/content/drive/MyDrive/ColabNotebooks/UniPythonCourse/Exercises/data/Summary_AtCol0_Exons.txt\")"
],
"metadata": {
"pycharm": {
"name": "#%%\n"
},
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "V5s-M0dSzW9D",
"outputId": "212fadb1-dbbc-4929-e53e-d41426830418"
}
},
{
"cell_type": "markdown",
"source": [
"\n",
"\n",
"---\n",
"1.4) Write sequences into a new file if their length is a multiple of 10!\n"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
},
"id": "KuS83FpyzW9E"
}
},
{
"cell_type": "code",
"execution_count": 4,
"outputs": [],
"source": [
"def seq_lens_multiple_of_10(fasta_file):\n",
"    with open(fasta_file, 'r') as fasta_input:\n",
"        with open(\"/content/drive/MyDrive/ColabNotebooks/UniPythonCourse/Exercises/data/AtCol0_mult10.txt\", 'w') as out:\n",
"            cum_len = 0\n",
"            sequence = \"\"\n",
"            line = fasta_input.readline()\n",
"            while line:\n",
"                if line[0] == '>': # Reading the next header\n",
"                    if sequence and cum_len % 10 == 0: # Check if the length is a multiple of 10 (nothing to write before the first header)\n",
"                        out.write(sequence + '\\n')\n",
"                    cum_len = 0 # Reset the sequence length and the sequence as we are in the next sequence now\n",
"                    sequence = \"\"\n",
"                else:\n",
"                    sequence += line.strip() # Append the sequence to the current one as long as no other header is in between\n",
"                    cum_len += len(line.strip()) # Update the cumulative length for this sequence\n",
"                line = fasta_input.readline()\n",
"            if sequence and cum_len % 10 == 0: # The last sequence has no header after it, so check it here\n",
"                out.write(sequence + '\\n')\n",
"\n",
"seq_lens_multiple_of_10(\"/content/drive/MyDrive/ColabNotebooks/UniPythonCourse/Exercises/data/AtCol0_Exons.fasta\")"
],
"metadata": {
"pycharm": {
"name": "#%%\n"
},
"id": "m2FZHcpmzW9G"
}
},
{
"cell_type": "markdown",
"source": [
"## Part2 - characters"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
},
"id": "HAzJAcOAzW9I"
}
},
{
"cell_type": "markdown",
"source": [
"\n",
"\n",
"---\n",
"2.1) Read the file AtCol0_Exons.fasta and write the following:\n",
"* Only Arabidopsis Gene Identifier (e.g. AT1G01010)\n",
"* Gene Identifier, exon name, exon length (tab-delimited)\n",
"\n",
"\n"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
},
"id": "hUOOeTCUzW9I"
}
},
{
"cell_type": "code",
"execution_count": 11,
"outputs": [],
"source": [
"import re\n",
"\n",
"def arabidopsis_only(fasta_file):\n",
"    with open(fasta_file, 'r') as summary:\n",
"        with open(\"/content/drive/MyDrive/ColabNotebooks/UniPythonCourse/Exercises/data/Arabidopsis_Exons.txt\", 'w') as arabidopsis:\n",
"            line = summary.readline()\n",
"            while line:\n",
"                if line.startswith('>AT'):\n",
"                    columns = line.split('|')\n",
"                    gene_identifier = columns[0].strip('>').split('.')[0]\n",
"                    if re.fullmatch(r\"AT\\dG\\d{5}\", gene_identifier): # Raw string; the whole identifier must match\n",
"                        exon_name = columns[1].strip()\n",
"                        exon_length = columns[3].strip().split(' ')[2]\n",
"                        arabidopsis.write(gene_identifier + '\\t' + exon_name + '\\t' + exon_length + '\\n')\n",
"                line = summary.readline()\n",
"\n",
"arabidopsis_only(\"/content/drive/MyDrive/ColabNotebooks/UniPythonCourse/Exercises/data/AtCol0_Exons.fasta\")"
],
"metadata": {
"pycharm": {
"name": "#%%\n"
},
"id": "YStsUutUzW9I"
}
}
]
}
\ No newline at end of file
%% Cell type:markdown id: tags:
# Python course 2021 - Exercises D
%% Cell type:markdown id: tags:
## Part1 - writing files
%% Cell type:markdown id: tags:
---
1.1) Read the file AtCol0_Exons.fasta and write all headers (starting with '>') into a new file!
%% Cell type:code id: tags:
```
from google.colab import drive
drive.mount('/content/drive')

with open("/content/drive/MyDrive/ColabNotebooks/UniPythonCourse/Exercises/data/AtCol0_Exons.fasta", 'r') as exons:
    with open("/content/drive/MyDrive/ColabNotebooks/UniPythonCourse/Exercises/data/Header_AtCol0_Exons.txt", 'w') as new_file:
        line = exons.readline()
        while line:
            if line[0] == '>':
                new_file.write(line)
            line = exons.readline()
```
%%
Output
Mounted at /content/drive
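The header extraction for exercise 1.1 can also be written by iterating over the file object directly instead of calling `readline()` in a `while` loop. The following is a minimal sketch, not the course solution; the names `write_headers`, `fasta_path` and `header_path` are hypothetical and chosen only for illustration.

```python
def write_headers(fasta_path, header_path):
    """Copy every FASTA header line (starting with '>') to header_path.

    Returns the number of headers written.
    """
    count = 0
    with open(fasta_path, 'r') as fasta, open(header_path, 'w') as out:
        for line in fasta:  # A file object yields one line per iteration
            if line.startswith('>'):
                out.write(line)
                count += 1
    return count
```

Returning the count makes it easy to compare the number of headers with the sequence count computed in exercise 1.3.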
%% Cell type:markdown id: tags:
---
1.2) Read the file AtCol0_Exons.fasta and write the following:
* Line if it is a header
* Length of line if it is a sequence line
%% Cell type:code id: tags:
```
with open("/content/drive/MyDrive/ColabNotebooks/UniPythonCourse/Exercises/data/AtCol0_Exons.fasta", 'r') as exons:
    with open("/content/drive/MyDrive/ColabNotebooks/UniPythonCourse/Exercises/data/Summary_AtCol0_Exons.txt", 'w') as new_file:
        line = exons.readline()
        while line:
            if line[0] == '>':
                new_file.write(line)
            else:
                new_file.write(str(len(line.strip())) + "\n")
            line = exons.readline()
```
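The mapping from input lines to summary lines can be separated from the file handling, which makes it testable without touching Google Drive. This is a sketch under assumptions: `summarize_lines` is a hypothetical helper, and unlike the cell above it strips trailing whitespace from headers and skips blank lines instead of recording them as length 0.

```python
def summarize_lines(lines):
    """Yield header lines unchanged and the length of every sequence line."""
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # Assumption: blank lines carry no sequence and are skipped
        if stripped.startswith('>'):
            yield stripped + '\n'
        else:
            yield str(len(stripped)) + '\n'
```

With the two files opened as in the cell above, `new_file.writelines(summarize_lines(exons))` would write the summary.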
%% Cell type:markdown id: tags:
---
1.3) Calculate the number of sequences, the cumulative length and the average length in a new file! Do they match the values of the original file?
%% Cell type:code id: tags:
```
def summarize_seq_info(summary_file):
    with open(summary_file, 'r') as summary:
        seq_count = 0
        cum_len = 0
        line = summary.readline()
        while line:
            if line[0] == '>':
                seq_count += 1
            else:
                cum_len += int(line.strip())
            line = summary.readline()
    print("Number of sequences:", seq_count)
    print("Cumulative length:", cum_len, "bases")
    print("Average sequence length:", cum_len / seq_count, "bases")

summarize_seq_info("/content/drive/MyDrive/ColabNotebooks/UniPythonCourse/Exercises/data/Summary_AtCol0_Exons.txt")
```
%%
Output
Number of sequences: 217183
Cumulative length: 64867051 bases
Average sequence length: 298.67462462531597 bases
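To answer the second half of exercise 1.3, the same three values can be computed directly from the original FASTA lines and compared with the numbers printed above. The sketch below assumes a hypothetical helper `fasta_stats`; it also guards against an input without any header, which would otherwise cause a division by zero.

```python
def fasta_stats(lines):
    """Return (number of sequences, cumulative length, average length)."""
    seq_count = 0
    cum_len = 0
    for line in lines:
        if line.startswith('>'):
            seq_count += 1
        else:
            cum_len += len(line.strip())  # Length without the trailing newline
    average = cum_len / seq_count if seq_count else 0.0
    return seq_count, cum_len, average
```

Passing the opened `AtCol0_Exons.fasta` file object to `fasta_stats` should give the same counts as the summary file if the summary was written correctly.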
%% Cell type:markdown id: tags:
---
1.4) Write sequences into a new file if their length is a multiple of 10!
%% Cell type:code id: tags:
```
def seq_lens_multiple_of_10(fasta_file):
    with open(fasta_file, 'r') as fasta_input:
        with open("/content/drive/MyDrive/ColabNotebooks/UniPythonCourse/Exercises/data/AtCol0_mult10.txt", 'w') as out:
            cum_len = 0
            sequence = ""
            line = fasta_input.readline()
            while line:
                if line[0] == '>':  # Reading the next header
                    if sequence and cum_len % 10 == 0:  # Check if the length is a multiple of 10 (nothing to write before the first header)
                        out.write(sequence + '\n')
                    cum_len = 0  # Reset the sequence length and the sequence as we are in the next sequence now
                    sequence = ""
                else:
                    sequence += line.strip()  # Append the sequence to the current one as long as no other header is in between
                    cum_len += len(line.strip())  # Update the cumulative length for this sequence
                line = fasta_input.readline()
            if sequence and cum_len % 10 == 0:  # The last sequence has no header after it, so check it here
                out.write(sequence + '\n')

seq_lens_multiple_of_10("/content/drive/MyDrive/ColabNotebooks/UniPythonCourse/Exercises/data/AtCol0_Exons.fasta")
```
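An alternative for exercise 1.4 is to collect each complete record first and test its length afterwards. This also covers the last sequence in the file, which has no header after it. The sketch below is one possible structure, not the course solution; `read_fasta` and `multiples_of_10` are hypothetical names.

```python
def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if line.startswith('>'):
            if header is not None:  # Emit the record that just ended
                yield header, ''.join(chunks)
            header = line
            chunks = []
        elif line:
            chunks.append(line)  # Sequences may span several lines
    if header is not None:  # The last record has no following header
        yield header, ''.join(chunks)


def multiples_of_10(records):
    """Keep only non-empty sequences whose length is a multiple of 10."""
    return [seq for _, seq in records if seq and len(seq) % 10 == 0]
```

Each returned sequence can then be written to the output file followed by a newline, as in the cell above.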
%% Cell type:markdown id: tags:
## Part2 - characters
%% Cell type:markdown id: tags:
---
2.1) Read the file AtCol0_Exons.fasta and write the following:
* Only Arabidopsis Gene Identifier (e.g. AT1G01010)
* Gene Identifier, exon name, exon length (tab-delimited)
%% Cell type:code id: tags:
```
import re

def arabidopsis_only(fasta_file):
    with open(fasta_file, 'r') as summary:
        with open("/content/drive/MyDrive/ColabNotebooks/UniPythonCourse/Exercises/data/Arabidopsis_Exons.txt", 'w') as arabidopsis:
            line = summary.readline()
            while line:
                if line.startswith('>AT'):
                    columns = line.split('|')
                    gene_identifier = columns[0].strip('>').split('.')[0]
                    if re.fullmatch(r"AT\dG\d{5}", gene_identifier):  # Raw string; the whole identifier must match
                        exon_name = columns[1].strip()
                        exon_length = columns[3].strip().split(' ')[2]
                        arabidopsis.write(gene_identifier + '\t' + exon_name + '\t' + exon_length + '\n')
                line = summary.readline()

arabidopsis_only("/content/drive/MyDrive/ColabNotebooks/UniPythonCourse/Exercises/data/AtCol0_Exons.fasta")
```
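The identifier check in exercise 2.1 relies on a regular expression. The short sketch below isolates that check using the example identifier from the exercise text. The helper name `is_gene_identifier` is hypothetical. A raw string (`r"..."`) is used so that `\d` reaches the `re` module unchanged, and `fullmatch` only succeeds if the entire string fits the pattern.

```python
import re

# Pattern for identifiers shaped like AT1G01010:
# "AT", one digit, "G", then exactly five digits.
GENE_ID = re.compile(r"AT\dG\d{5}")


def is_gene_identifier(text):
    """Return True only if the whole string is a gene identifier."""
    return GENE_ID.fullmatch(text) is not None
```

For example, `is_gene_identifier("AT1G01010")` is true, while `"AT1G01010.1"` is rejected because of the transcript suffix, which is why the solution splits on `'.'` first.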