Example.ipynb 6.29 KiB
    Stefan Knauff committed
    {
     "cells": [
      {
       "cell_type": "code",
    
       "execution_count": null,
    
       "id": "46204859-67c8-4f68-bfe7-3d9d7c4b5adc",
    
       "metadata": {},
       "outputs": [],
    
       "source": [
        "from lib.xgta import Xgta\n",
        "import connection_psql as creds\n",
    
        "xgta = Xgta(\n",
        "    creds=creds, \n",
        "    streaming=True, # Streaming uses less RAM, but is slower. Container or notebook may fail/shut down if RAM limit is exceeded.\n",
        ")"
    
       ]
      },
      {
       "cell_type": "markdown",
       "id": "063036c2-cf05-4ec0-90ad-0c80e636a82f",
       "metadata": {},
       "source": [
        "------\n",
        "\n",
        "Calculate the frequency of tweets per day that contain certain (case-insensitive) keywords with:\n",
        "\n",
        "```python\n",
        "r1 = xgta.frequency_of_tweets_per_day_containing(keywords)\n",
        "```\n",
        "\n",
        "The `keywords` argument is used for a case-insensitive regex search. Multiple keywords can be combined with the `|` operator.\n",
        "\n",
        "For example, to find tweets containing the words \"Rassismus\" and/or \"Diskriminierung\", use the following:"
       ]
      },
      {
       "cell_type": "code",
    
       "execution_count": null,
    
       "id": "9e6416d0-8abb-4342-a7e3-b5b4e6103c80",
       "metadata": {
        "scrolled": true
       },
       "outputs": [],
       "source": [
        "r1 = xgta.frequency_of_tweets_per_day_containing(\"rassismus|diskriminierung\")"
       ]
      },
      {
       "cell_type": "markdown",
       "id": "dd320383-f444-4804-8cc3-68cbca5a0d2f",
       "metadata": {},
       "source": [
        "The returned object (`r1` in the example above) holds the `polars.DataFrame` as `.df` (e.g., `r1.df`) and provides a method `.plot_frequency_of()` to plot a timeseries of the DataFrame."
       ]
      },
      {
       "cell_type": "code",
    
       "execution_count": null,
    
       "id": "10ea4eba-de17-4d8b-9296-f95a60e41985",
    
       "metadata": {},
       "outputs": [],
    
       "source": [
        "r1.df # Returns the DataFrame itself"
       ]
      },
      {
       "cell_type": "code",
    
       "execution_count": null,
    
       "id": "67b21f77-d592-4fd7-927c-8f4d315e9593",
    
       "metadata": {},
       "outputs": [],
    
       "source": [
        "r1.plot_frequency_of(['search_term:all_tweets']) # Plots a timeseries of all tweets fitting the search criteria."
       ]
      },
      {
       "cell_type": "markdown",
       "id": "7798f830-5a90-4075-9028-cc29a2407a2e",
       "metadata": {},
       "source": [
        "It is also possible to plot multiple columns of the DataFrame in one figure. Comment or uncomment entries with `#` to choose which columns are plotted."
       ]
      },
      {
       "cell_type": "code",
    
       "execution_count": null,
    
       "id": "3a4a18b4-92ba-4c30-a95b-5fedeeb097b1",
    
       "metadata": {},
       "outputs": [],
    
       "source": [
        "r1.plot_frequency_of([\n",
        "    'search_term:all_tweets',\n",
        "    'search_term:is_retweet',\n",
        "    'search_term:is_original_tweet',\n",
        "    'percent:all_tweets',\n",
        "    'percent:is_retweet',\n",
        "    'percent:is_original_tweet',\n",
        "    # 'all_tweets',\n",
        "    # 'is_retweet',\n",
        "    # 'is_original_tweet',\n",
        "])"
       ]
      },
      {
       "cell_type": "markdown",
       "id": "72e4b2b7-0bab-4c56-9f4e-039ad4db952d",
       "metadata": {},
       "source": [
        "Plots are interactive: \n",
        "\n",
        "- Deselect columns by clicking on them in the legend.\n",
        "- Draw rectangles to zoom in.\n",
        "- Double-click to reset plots to their default view.\n",
        "\n",
        "---\n",
        "\n",
        "To keep previous results while searching for new terms, save the new results in a separate return object, e.g., `r2`:"
       ]
      },
      {
       "cell_type": "code",
    
       "execution_count": null,
    
       "id": "69c3eff2-5504-4ae2-ac5c-7da9e783250b",
    
       "metadata": {},
       "outputs": [],
    
       "source": [
        "r2 = xgta.frequency_of_tweets_per_day_containing(\"krise\")\n",
        "r2.plot_frequency_of(['search_term:is_retweet','search_term:is_original_tweet'])\n",
        "r2.df.describe()"
       ]
      },
      {
       "cell_type": "markdown",
       "id": "c645b864-d166-4881-b096-1d174e22ccee",
       "metadata": {},
       "source": [
        "Compare multiple results with each other:"
       ]
      },
      {
       "cell_type": "code",
    
       "execution_count": null,
    
       "id": "69054b41-b209-48e6-a3ae-9a31e535648b",
    
       "metadata": {},
       "outputs": [],
    
       "source": [
        "xgta.plot_frequency_of(\n",
        "    results=[r1, r2], # Two or more results in an array []\n",
        "    plot=[\n",
        "        'search_term:is_retweet',\n",
        "        'search_term:is_original_tweet',\n",
        "    ],\n",
        "    shared_xaxes=True, # Optional\n",
        "    shared_yaxes=True, # Optional\n",
        ")"
    
       ]
    
      },
      {
       "cell_type": "markdown",
       "id": "462cfbdb-18d9-461b-8287-f2020d371571",
       "metadata": {},
       "source": [
        "---\n",
        "\n",
        "## Access the dataset with Polars directly"
       ]
      },
      {
       "cell_type": "code",
       "execution_count": null,
       "id": "b861d8c1-59e4-45d5-8500-a442a459b120",
       "metadata": {},
       "outputs": [],
       "source": [
        "import polars as pl"
       ]
      },
      {
       "cell_type": "markdown",
       "id": "3f3842fe-cc49-4725-8c02-9d0d488730e2",
       "metadata": {},
       "source": [
        "Get the list of available columns."
       ]
      },
      {
       "cell_type": "code",
       "execution_count": null,
       "id": "7008c916-16da-4445-b236-ef232afdd201",
       "metadata": {},
       "outputs": [],
       "source": [
        "xgta.df_xgta.collect_schema().names()"
       ]
      },
      {
       "cell_type": "code",
       "execution_count": null,
       "id": "f130c9eb-1ee8-49e8-b60b-55a9a27a8e26",
       "metadata": {},
       "outputs": [],
       "source": [
        "pl.Config(fmt_str_lengths=350)\n",
        "\n",
        "q = (\n",
        "    xgta.df_xgta\n",
        "    .limit(1_000_000) # Use limit to develop your queries. It greatly speeds up the development time.\n",
        "    .filter(\n",
        "        pl.col('text').str.to_lowercase().str.contains(r\"\\bwir\\b\")\n",
        "        &\n",
        "        pl.col('isretweet').not_()\n",
        "    )\n",
        "    .select([\"postdate\", \"text\"])\n",
        ")\n",
        "\n",
        "q.collect()"
       ]
      },
      {
       "cell_type": "code",
       "execution_count": null,
       "id": "e8635cf5-e11f-4939-86de-1410f9c128ae",
       "metadata": {},
       "outputs": [],
       "source": []
    
      }
     ],
     "metadata": {
      "kernelspec": {
    
       "display_name": "Python 3 (ipykernel)",
    
       "language": "python",
    
       "name": "python3"
    
      },
      "language_info": {
       "codemirror_mode": {
        "name": "ipython",
        "version": 3
       },
       "file_extension": ".py",
       "mimetype": "text/x-python",
       "name": "python",
       "nbconvert_exporter": "python",
       "pygments_lexer": "ipython3",
    
       "version": "3.11.10"
    
      }
     },
     "nbformat": 4,
     "nbformat_minor": 5
    }