Upload solutions for exercises_C

Commit bef72f98 authored 2 years ago by Franziska Niemeyer
--- a/Exercises/solutions/Python_course_2021_exercises_C.ipynb
+++ b/Exercises/solutions/Python_course_2021_exercises_C.ipynb
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "name": "Python_course_2021_exercises_C.ipynb",
+      "provenance": [],
+      "collapsed_sections": []
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "FyvebZ68I8BJ"
+      },
+      "source": [
+        "# Python course 2021 - Exercises C"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "8PgJ1ymVJCIO"
+      },
+      "source": [
+        "## Part 1 - File handling"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "Ws7tJiEXJG8f"
+      },
+      "source": [
+        "\n",
+        "\n",
+        "---\n",
+        "1.1) Count the number of sequences (number of headers) in AtCol0_Exons.fasta!\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "JzgmMxR0JVxL",
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "outputId": "52d24ee3-445c-4b78-9b91-52de52a9791d"
+      },
+      "source": [
+        "from google.colab import drive\n",
+        "drive.mount('/content/drive')"
+      ],
+      "execution_count": null,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "Mounted at /content/drive\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "oEJidIAEy8S5"
+      },
+      "source": [
+        "# read all lines at once; the with-block closes the file automatically\n",
+        "with open(\"/content/drive/MyDrive/PythonProgramming/AtCol0_Exons.fasta\", \"r\") as datei:\n",
+        "  lines = datei.readlines()"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "def get_num_headers(lines):\n",
+        "  num_headers = 0\n",
+        "  for line in lines:\n",
+        "    if line:\n",
+        "      if line[0] == \">\":\n",
+        "        num_headers += 1\n",
+        "  return num_headers\n",
+        "\n",
+        "print(get_num_headers(lines))"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "q8s7_9qRxa_b",
+        "outputId": "ad4a42f6-4f24-42bd-8508-0d5e92d59347"
+      },
+      "execution_count": null,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "217183\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "rFtHoz5UKujx"
+      },
+      "source": [
+        "\n",
+        "\n",
+        "---\n",
+        "1.2) Count the number of sequence lines!\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "AgMttuZlKyBg",
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "outputId": "88d9aa00-aa17-4778-e291-6d70cebfa67b"
+      },
+      "source": [
+        "def get_num_sequence_lines(lines):\n",
+        "  num_sequence_lines = 0\n",
+        "  for line in lines:\n",
+        "    if line:\n",
+        "      if line[0] != \">\":\n",
+        "        num_sequence_lines += 1\n",
+        "  return num_sequence_lines\n",
+        "\n",
+        "print(get_num_sequence_lines(lines))"
+      ],
+      "execution_count": null,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "916024\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "YTH3rkjJKyNm"
+      },
+      "source": [
+        "\n",
+        "\n",
+        "---\n",
+        "1.3) Count the number of characters in the document! (How many are there per line on average?)\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "6ECkHsa9K3-X",
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "outputId": "fe3ccf0a-8afa-4b7f-ba19-e3870430989f"
+      },
+      "source": [
+        "def get_num_characters(lines):\n",
+        "  num_characters = 0\n",
+        "  num_lines = 0\n",
+        "  for line in lines:\n",
+        "    num_characters += len(line)\n",
+        "    num_lines += 1\n",
+        "  return (num_characters, num_characters / num_lines)\n",
+        "\n",
+        "print(get_num_characters(lines))"
+      ],
+      "execution_count": null,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "(81803755, 72.18783064347467)\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "I9bkusUsK4HV"
+      },
+      "source": [
+        "\n",
+        "\n",
+        "---\n",
+        "1.4) How long are all contained sequences combined?\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "XC4que0hK81W",
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "outputId": "e77692ed-7a33-47ef-d8ee-5ac9550535fe"
+      },
+      "source": [
+        "def get_sequence_length(lines):\n",
+        "  total_sequence_length = 0\n",
+        "  for line in lines:\n",
+        "    if line:\n",
+        "      if line[0] != \">\":\n",
+        "        line = line.strip()\n",
+        "        total_sequence_length += len(line)\n",
+        "  return total_sequence_length\n",
+        "\n",
+        "print(get_sequence_length(lines))"
+      ],
+      "execution_count": null,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "64867051\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5NnxagAWK9AP"
+      },
+      "source": [
+        "\n",
+        "\n",
+        "---\n",
+        "1.5) Calculate the average sequence length in this file!\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {
+        "id": "MZNV3sNqLB62",
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "outputId": "01b885d6-dc65-4b5d-f893-288015678122"
+      },
+      "source": [
+        "def get_average_sequence_length(lines):\n",
+        "  return get_sequence_length(lines) / get_num_headers(lines)\n",
+        "\n",
+        "print(get_average_sequence_length(lines))"
+      ],
+      "execution_count": null,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "298.67462462531597\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "**Additional exercises**"
+      ],
+      "metadata": {
+        "id": "n9rZsJ5_4hTJ"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "1.6) Parse the fasta file entry-wise. An entry consists of a header and the corresponding sequence (which may comprise multiple lines). The result should be a list of tuples of the form (header, sequence)."
+      ],
+      "metadata": {
+        "id": "ItrnPkVE5fsv"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "from string import whitespace\n",
+        "\n",
+        "def read_fasta(file):\n",
+        "  \"\"\"\n",
+        "  Parse a fasta file entry-wise as a list of tuples of the form (header, sequence).\n",
+        "  \"\"\"\n",
+        "  result = []\n",
+        "\n",
+        "  header = None\n",
+        "  sequence = []\n",
+        "  for line in file:\n",
+        "    # remove all whitespace from the ends\n",
+        "    line = line.strip()\n",
+        "    if line.startswith('>'):\n",
+        "      # on a new header, store the previous FASTA block as a tuple after\n",
+        "      # concatenating its sequence lines (if there is a previous block)\n",
+        "      if header:\n",
+        "        result.append((header, ''.join(sequence)))\n",
+        "\n",
+        "      header = line\n",
+        "      sequence = []\n",
+        "    else:\n",
+        "      # current line is not a header: remove all whitespace from it and\n",
+        "      # add it to the list of sequence lines of the current FASTA block\n",
+        "      sequence.append(line.translate(str.maketrans('', '', whitespace)))\n",
+        "\n",
+        "  # store the last block, which is not followed by another header\n",
+        "  if header:\n",
+        "    result.append((header, ''.join(sequence)))\n",
+        "  return result"
+      ],
+      "metadata": {
+        "id": "RvF09FlO6YeT"
+      },
+      "execution_count": null,
+      "outputs": []
+    }
+  ]
+}
\ No newline at end of file
+%% Cell type:markdown id: tags:
+
+# Python course 2021 - Exercises C
+
+%% Cell type:markdown id: tags:
+
+## Part 1 - File handling
+
+%% Cell type:markdown id: tags:
+
+
+
+---
+1.1) Count the number of sequences (number of headers) in AtCol0_Exons.fasta!
+
+%% Cell type:code id: tags:
+
+``` 
+from google.colab import drive
+drive.mount('/content/drive')
+```
+
+%% Output
+
+    Mounted at /content/drive
+
+%% Cell type:code id: tags:
+
+``` 
+# read all lines at once; the with-block closes the file automatically
+with open("/content/drive/MyDrive/PythonProgramming/AtCol0_Exons.fasta", "r") as datei:
+  lines = datei.readlines()
+```
+
+%% Cell type:code id: tags:
+
+``` 
+def get_num_headers(lines):
+  num_headers = 0
+  for line in lines:
+    if line:
+      if line[0] == ">":
+        num_headers += 1
+  return num_headers
+
+print(get_num_headers(lines))
+```
+
+%% Output
+
+    217183
+
+%% Cell type:markdown id: tags:
+
+
+
+---
+1.2) Count the number of sequence lines!
+
+%% Cell type:code id: tags:
+
+``` 
+def get_num_sequence_lines(lines):
+  num_sequence_lines = 0
+  for line in lines:
+    if line:
+      if line[0] != ">":
+        num_sequence_lines += 1
+  return num_sequence_lines
+
+print(get_num_sequence_lines(lines))
+```
+
+%% Output
+
+    916024
+
+%% Cell type:markdown id: tags:
+
+
+
+---
+1.3) Count the number of characters in the document! (How many are there per line on average?)
+
+%% Cell type:code id: tags:
+
+``` 
+def get_num_characters(lines):
+  num_characters = 0
+  num_lines = 0
+  for line in lines:
+    num_characters += len(line)
+    num_lines += 1
+  return (num_characters, num_characters / num_lines)
+
+print(get_num_characters(lines))
+```
+
+%% Output
+
+    (81803755, 72.18783064347467)
+
+%% Cell type:markdown id: tags:
+
+
+
+---
+1.4) How long are all contained sequences combined?
+
+%% Cell type:code id: tags:
+
+``` 
+def get_sequence_length(lines):
+  total_sequence_length = 0
+  for line in lines:
+    if line:
+      if line[0] != ">":
+        line = line.strip()
+        total_sequence_length += len(line)
+  return total_sequence_length
+
+print(get_sequence_length(lines))
+```
+
+%% Output
+
+    64867051
+
+%% Cell type:markdown id: tags:
+
+
+
+---
+1.5) Calculate the average sequence length in this file!
+
+%% Cell type:code id: tags:
+
+``` 
+def get_average_sequence_length(lines):
+  return get_sequence_length(lines) / get_num_headers(lines)
+
+print(get_average_sequence_length(lines))
+```
+
+%% Output
+
+    298.67462462531597
+
+%% Cell type:markdown id: tags:
+
+**Additional exercises**
+
+%% Cell type:markdown id: tags:
+
+1.6) Parse the fasta file entry-wise. An entry consists of a header and the corresponding sequence (which may comprise multiple lines). The result should be a list of tuples of the form (header, sequence).
+
+%% Cell type:code id: tags:
+
+``` 
+from string import whitespace
+
+def read_fasta(file):
+  """
+  Parse a fasta file entry-wise as a list of tuples of the form (header, sequence).
+  """
+  result = []
+
+  header = None
+  sequence = []
+  for line in file:
+    # remove all whitespace from the ends
+    line = line.strip()
+    if line.startswith('>'):
+      # on a new header, store the previous FASTA block as a tuple after
+      # concatenating its sequence lines (if there is a previous block)
+      if header:
+        result.append((header, ''.join(sequence)))
+
+      header = line
+      sequence = []
+    else:
+      # current line is not a header: remove all whitespace from it and
+      # add it to the list of sequence lines of the current FASTA block
+      sequence.append(line.translate(str.maketrans('', '', whitespace)))
+
+  # store the last block, which is not followed by another header
+  if header:
+    result.append((header, ''.join(sequence)))
+  return result
+```
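
The notebook defines the entry-wise parser for 1.6 but never calls it. The following is a minimal, self-contained sketch of how such a parser could be tried out on a small in-memory FASTA text. It is not part of the commit: the function name `parse_fasta`, the two sample records, and the use of `io.StringIO` in place of the opened file are all assumptions made for illustration.

```python
import io
from string import whitespace

# Translation table that deletes every ASCII whitespace character.
_STRIP_WS = str.maketrans('', '', whitespace)

def parse_fasta(handle):
    """Return a list of (header, sequence) tuples from an iterable of lines."""
    entries = []
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if line.startswith('>'):
            # A new header closes the previous entry, if there is one.
            if header is not None:
                entries.append((header, ''.join(chunks)))
            header = line
            chunks = []
        elif header is not None:
            # Sequence line: drop inner whitespace and collect it.
            chunks.append(line.translate(_STRIP_WS))
    # The last entry is not followed by another header.
    if header is not None:
        entries.append((header, ''.join(chunks)))
    return entries

# Hypothetical two-record sample; not taken from AtCol0_Exons.fasta.
sample = io.StringIO(">exon1\nACGT\nAC GT\n\n>exon2\nTTGA\n")
entries = parse_fasta(sample)
print(entries)

# Total and average sequence length, in the spirit of 1.4 and 1.5.
total = sum(len(seq) for _, seq in entries)
print(len(entries), total, total / len(entries))
```

With a list of `(header, sequence)` tuples available, the counts asked for in 1.1, 1.4 and 1.5 reduce to `len(entries)` and a sum over the sequence lengths, which may be easier to reason about than the line-based loops.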