{ "cells": [ { "cell_type": "code", "execution_count": null, "id": "46204859-67c8-4f68-bfe7-3d9d7c4b5adc", "metadata": {}, "outputs": [], "source": [ "from lib.xgta import Xgta\n", "import connection_psql as creds\n", "xgta = Xgta(\n", " creds=creds, \n", " streaming=True, # Streaming uses less RAM but is slower. The container or notebook may fail or shut down if the RAM limit is exceeded.\n", ")" ] }, { "cell_type": "markdown", "id": "063036c2-cf05-4ec0-90ad-0c80e636a82f", "metadata": {}, "source": [ "------\n", "\n", "Calculate the frequency of tweets per day that contain certain keywords with:\n", "\n", "```python\n", "r1 = xgta.frequency_of_tweets_per_day_containing(keywords)\n", "```\n", "\n", "The `keywords` argument is matched as a case-insensitive regular expression. Combine multiple keywords with the `|` operator.\n", "\n", "For example, to find tweets containing the words \"Rassismus\" and/or \"Diskriminierung\", use the following:" ] }, { "cell_type": "code", "execution_count": null, "id": "9e6416d0-8abb-4342-a7e3-b5b4e6103c80", "metadata": { "scrolled": true }, "outputs": [], "source": [ "r1 = xgta.frequency_of_tweets_per_day_containing(\"rassismus|diskriminierung\")" ] }, { "cell_type": "markdown", "id": "dd320383-f444-4804-8cc3-68cbca5a0d2f", "metadata": {}, "source": [ "The returned object (`r1` in the example above) holds the results as a `polars.DataFrame` in `.df` (e.g., `r1.df`) and provides a method `.plot_frequency_of()` to plot a time series of the DataFrame." ] }, { "cell_type": "code", "execution_count": null, "id": "10ea4eba-de17-4d8b-9296-f95a60e41985", "metadata": {}, "outputs": [], "source": [ "r1.df # Returns the DataFrame itself" ] }, { "cell_type": "code", "execution_count": null, "id": "67b21f77-d592-4fd7-927c-8f4d315e9593", "metadata": {}, "outputs": [], "source": [ "r1.plot_frequency_of(['search_term:all_tweets']) # Plots a time series of all tweets matching the search criteria."
] }, { "cell_type": "markdown", "id": "7798f830-5a90-4075-9028-cc29a2407a2e", "metadata": {}, "source": [ "It is also possible to plot multiple columns of the DataFrame in one figure. (Comment or uncomment lines with `#`.)" ] }, { "cell_type": "code", "execution_count": null, "id": "3a4a18b4-92ba-4c30-a95b-5fedeeb097b1", "metadata": {}, "outputs": [], "source": [ "r1.plot_frequency_of([\n", " 'search_term:all_tweets',\n", " 'search_term:is_retweet',\n", " 'search_term:is_original_tweet',\n", " 'percent:all_tweets',\n", " 'percent:is_retweet',\n", " 'percent:is_original_tweet',\n", " # 'all_tweets',\n", " # 'is_retweet',\n", " # 'is_original_tweet',\n", "])" ] }, { "cell_type": "markdown", "id": "72e4b2b7-0bab-4c56-9f4e-039ad4db952d", "metadata": {}, "source": [ "Plots are interactive:\n", "\n", "- Deselect columns by clicking on the legend.\n", "- Draw rectangles to zoom in.\n", "- Double-click to reset plots to their default view.\n", "\n", "---\n", "\n", "To keep previous results and search for new terms, save the results in a new return object, e.g., `r2`:" ] }, { "cell_type": "code", "execution_count": null, "id": "69c3eff2-5504-4ae2-ac5c-7da9e783250b", "metadata": {}, "outputs": [], "source": [ "r2 = xgta.frequency_of_tweets_per_day_containing(\"krise\")\n", "r2.plot_frequency_of(['search_term:is_retweet','search_term:is_original_tweet'])\n", "r2.df.describe()" ] }, { "cell_type": "markdown", "id": "c645b864-d166-4881-b096-1d174e22ccee", "metadata": {}, "source": [ "Compare multiple results with each other:" ] }, { "cell_type": "code", "execution_count": null, "id": "69054b41-b209-48e6-a3ae-9a31e535648b", "metadata": {}, "outputs": [], "source": [ "xgta.plot_frequency_of(\n", " results=[r1, r2], # Two or more results in a list []\n", " plot=[\n", " 'search_term:is_retweet',\n", " 'search_term:is_original_tweet',\n", " ],\n", " shared_xaxes=True, # Optional\n", " shared_yaxes=True, # Optional\n", ")" ] }, { "cell_type": "markdown", "id":
"462cfbdb-18d9-461b-8287-f2020d371571", "metadata": {}, "source": [ "---\n", "\n", "## Access the dataset with Polars directly" ] }, { "cell_type": "code", "execution_count": null, "id": "b861d8c1-59e4-45d5-8500-a442a459b120", "metadata": {}, "outputs": [], "source": [ "import polars as pl" ] }, { "cell_type": "markdown", "id": "3f3842fe-cc49-4725-8c02-9d0d488730e2", "metadata": {}, "source": [ "Get the list of available columns." ] }, { "cell_type": "code", "execution_count": null, "id": "7008c916-16da-4445-b236-ef232afdd201", "metadata": {}, "outputs": [], "source": [ "xgta.df_xgta.collect_schema().names()" ] }, { "cell_type": "code", "execution_count": null, "id": "f130c9eb-1ee8-49e8-b60b-55a9a27a8e26", "metadata": {}, "outputs": [], "source": [ "pl.Config(fmt_str_lengths=350)\n", "\n", "q = (\n", " xgta.df_xgta\n", " .limit(1_000_000) # Use limit while developing your queries; it greatly speeds up development.\n", " .filter(\n", " pl.col('text').str.to_lowercase().str.contains(r\"\\bwir\\b\")\n", " &\n", " pl.col('isretweet').not_()\n", " )\n", " .select([\"postdate\", \"text\"])\n", ")\n", "\n", "q.collect()" ] }, { "cell_type": "code", "execution_count": null, "id": "e8635cf5-e11f-4939-86de-1410f9c128ae", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.10" } }, "nbformat": 4, "nbformat_minor": 5 }