Example.ipynb

{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "46204859-67c8-4f68-bfe7-3d9d7c4b5adc",
   "metadata": {},
   "outputs": [],
   "source": [
    "from lib.xgta import Xgta\n",
    "import connection_psql as creds\n",
    "xgta = Xgta(\n",
    "    creds=creds, \n",
    "    streaming=True, # Streaming uses less RAM but is slower. The container or notebook may crash or shut down if the RAM limit is exceeded.\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "063036c2-cf05-4ec0-90ad-0c80e636a82f",
   "metadata": {},
   "source": [
    "------\n",
    "\n",
    "Calculate the frequency of tweets per day that contain certain keywords with:\n",
    "\n",
    "```python\n",
    "r1 = xgta.frequency_of_tweets_per_day_containing(keywords)\n",
    "```\n",
    "\n",
    "The `keywords` argument is treated as a case-insensitive regular expression. Multiple keywords can be combined with the `|` operator.\n",
    "\n",
    "For example, to find tweets containing the words \"Rassismus\" and/or \"Diskriminierung\", use the following:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9e6416d0-8abb-4342-a7e3-b5b4e6103c80",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "r1 = xgta.frequency_of_tweets_per_day_containing(\"rassismus|diskriminierung\")"
   ]
  },
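  {
   "cell_type": "markdown",
   "id": "b7e1f2a4-3c5d-4e6f-8a9b-0c1d2e3f4a5b",
   "metadata": {},
   "source": [
    "To illustrate how a `|`-joined pattern selects tweets, here is a minimal sketch. It assumes the keyword search behaves like a case-insensitive regex search, as described above. It uses Python's standard `re` module rather than `xgta`, and the sample texts are made up for illustration.\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "# Hypothetical sample texts, not taken from the dataset.\n",
    "tweets = [\n",
    "    \"Rassismus hat hier keinen Platz.\",\n",
    "    \"Gegen DISKRIMINIERUNG im Alltag!\",\n",
    "    \"Heute scheint die Sonne.\",\n",
    "]\n",
    "\n",
    "# `|` matches either alternative; IGNORECASE mimics the case-insensitive search.\n",
    "pattern = re.compile(r\"rassismus|diskriminierung\", re.IGNORECASE)\n",
    "\n",
    "# Keep every text in which the pattern occurs anywhere.\n",
    "matches = [t for t in tweets if pattern.search(t)]\n",
    "print(matches)\n",
    "```\n",
    "\n",
    "Only the first two sample texts match, because each contains one of the two alternatives regardless of case."
   ]
  },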
  {
   "cell_type": "markdown",
   "id": "dd320383-f444-4804-8cc3-68cbca5a0d2f",
   "metadata": {},
   "source": [
    "The returned object (`r1` in the example above) exposes the Polars `DataFrame` as `.df` (e.g., `r1.df`) and provides a method `.plot_frequency_of()` that plots the DataFrame as a time series."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "10ea4eba-de17-4d8b-9296-f95a60e41985",
   "metadata": {},
   "outputs": [],
   "source": [
    "r1.df # Returns the DataFrame itself"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "67b21f77-d592-4fd7-927c-8f4d315e9593",
   "metadata": {},
   "outputs": [],
   "source": [
    "r1.plot_frequency_of(['search_term:all_tweets']) # Plots a time series of all tweets matching the search criteria."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7798f830-5a90-4075-9028-cc29a2407a2e",
   "metadata": {},
   "source": [
    "It is also possible to plot multiple columns of the DataFrame in one figure. (Comment or uncomment entries with `#`.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3a4a18b4-92ba-4c30-a95b-5fedeeb097b1",
   "metadata": {},
   "outputs": [],
   "source": [
    "r1.plot_frequency_of([\n",
    "    'search_term:all_tweets',\n",
    "    'search_term:is_retweet',\n",
    "    'search_term:is_original_tweet',\n",
    "    'percent:all_tweets',\n",
    "    'percent:is_retweet',\n",
    "    'percent:is_original_tweet',\n",
    "    # 'all_tweets',\n",
    "    # 'is_retweet',\n",
    "    # 'is_original_tweet',\n",
    "])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "72e4b2b7-0bab-4c56-9f4e-039ad4db952d",
   "metadata": {},
   "source": [
    "Plots are interactive: \n",
    "\n",
    "- Deselect columns by clicking on the legend.\n",
    "- Draw rectangles to zoom in.\n",
    "- Double-click to reset plots to their default view.\n",
    "\n",
    "---\n",
    "\n",
    "To keep previous results while searching for new terms, save the new results in a separate return object, e.g., `r2`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "69c3eff2-5504-4ae2-ac5c-7da9e783250b",
   "metadata": {},
   "outputs": [],
   "source": [
    "r2 = xgta.frequency_of_tweets_per_day_containing(\"krise\")\n",
    "r2.plot_frequency_of(['search_term:is_retweet','search_term:is_original_tweet'])\n",
    "r2.df.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c645b864-d166-4881-b096-1d174e22ccee",
   "metadata": {},
   "source": [
    "Compare multiple results with each other:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "69054b41-b209-48e6-a3ae-9a31e535648b",
   "metadata": {},
   "outputs": [],
   "source": [
    "xgta.plot_frequency_of(\n",
    "    results=[r1, r2], # Two or more results in a list\n",
    "    plot=[\n",
    "        'search_term:is_retweet',\n",
    "        'search_term:is_original_tweet',\n",
    "    ],\n",
    "    shared_xaxes=True, # Optional\n",
    "    shared_yaxes=True, # Optional\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "462cfbdb-18d9-461b-8287-f2020d371571",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Access the dataset with Polars directly"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b861d8c1-59e4-45d5-8500-a442a459b120",
   "metadata": {},
   "outputs": [],
   "source": [
    "import polars as pl"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3f3842fe-cc49-4725-8c02-9d0d488730e2",
   "metadata": {},
   "source": [
    "Get the list of available columns:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7008c916-16da-4445-b236-ef232afdd201",
   "metadata": {},
   "outputs": [],
   "source": [
    "xgta.df_xgta.collect_schema().names()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f130c9eb-1ee8-49e8-b60b-55a9a27a8e26",
   "metadata": {},
   "outputs": [],
   "source": [
    "pl.Config.set_fmt_str_lengths(350) # Show up to 350 characters of each string value.\n",
    "\n",
    "q = (\n",
    "    xgta.df_xgta\n",
    "    .limit(1_000_000) # Limit the rows while developing queries; this greatly speeds up iteration. The limit is applied before the filter.\n",
    "    .filter(\n",
    "        pl.col('text').str.to_lowercase().str.contains(r\"\\bwir\\b\")\n",
    "        &\n",
    "        pl.col('isretweet').not_()\n",
    "    )\n",
    "    .select([\"postdate\", \"text\"])\n",
    ")\n",
    "\n",
    "q.collect()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e8635cf5-e11f-4939-86de-1410f9c128ae",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}