spon_scraper

spon_scraper is a Python application designed to automate the scraping of articles and comments from DER SPIEGEL. Building on the spon_api package, spon_scraper facilitates large-scale data extraction based on specified date ranges and custom criteria by employing asynchronous processing.

Please note that the scraper no longer works: DER SPIEGEL restructured its website in December 2023 and removed the comment sections (see the official announcement), which rendered the underlying API unusable. The code has therefore not been actively maintained since 2024-01-01 and is hosted for documentation purposes only.

Features

Automated Date Range Scraping: Efficiently scrape articles from DER SPIEGEL for a specified range of dates.
Article and Comment Extraction: Fetch detailed metadata, content, and comments for individual articles.
Exclusion Criteria: Customize exclusion rules to filter out specific URLs or patterns.
Data Storage: Organize and save the extracted data into structured directories.
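
As a rough illustration of the date range and exclusion features, the following sketch enumerates the days in a range and filters URLs against regular-expression patterns. It is not taken from the spon_scraper code base: the function names, the patterns, and the example URLs are assumptions made for this example.

```python
import re
from datetime import date, timedelta


def iter_dates(start: date, end: date):
    """Yield every date from start to end, both inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def is_excluded(url: str, patterns: list[str]) -> bool:
    """Return True if any exclusion pattern matches somewhere in the URL."""
    return any(re.search(pattern, url) for pattern in patterns)


# Hypothetical exclusion rules and URLs, for illustration only.
exclusion_patterns = [r"/plus/", r"\.pdf$"]
candidate_urls = [
    "https://www.spiegel.de/politik/example-a-1.html",
    "https://www.spiegel.de/plus/example-b-2.html",
]

# Keep only the URLs that match none of the exclusion patterns.
kept = [url for url in candidate_urls if not is_excluded(url, exclusion_patterns)]
days = list(iter_dates(date(2023, 1, 1), date(2023, 1, 3)))
```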

Setup

To set up spon_scraper, follow these steps:

Prerequisites

Ensure you have Python 3.10 or later installed.

Installation

Clone the repository:

git clone https://gitea.iwm-tuebingen.de/ganders/spon_scraper.git
cd spon_scraper

Install the required Python packages:

pip install -r requirements.txt

Usage

Configuration

Create a JSON configuration file that specifies the scraping parameters. An example, which was used in a larger-scale data collection for studying uncongeniality bias, can be found in job_example.json.
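
For orientation, a job file could be read and checked along the following lines. This is a sketch under assumptions, not the actual logic of main.py: the helper names parse_job and load_job are hypothetical, and the only key assumed here is a top-level output_path. Consult job_example.json for the real set of parameters.

```python
import json
from pathlib import Path


def parse_job(text: str) -> dict:
    """Parse job configuration JSON and check the one key this sketch relies on."""
    job = json.loads(text)
    # Assumption: the job file is a JSON object with a top-level "output_path" key.
    if not isinstance(job, dict) or "output_path" not in job:
        raise ValueError("job configuration must be an object with 'output_path'")
    return job


def load_job(path: str) -> dict:
    """Read a job configuration file from disk and validate it."""
    return parse_job(Path(path).read_text(encoding="utf-8"))
```

With such a helper, load_job("job_example.json")["output_path"] would give the directory that the scraped data is written to, provided the file uses that key.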

Running the Scraper

To run the scraper, use the following command (use the interpreter command that matches your Python installation, for example python3):

python main.py job_example.json

This command will start the scraping process based on the parameters specified in your configuration file.

Output Data Structure

The scraped data is organized into a structured directory format under the specified output_path. The directory structure and file naming conventions are as follows:

Folder Structure

Articles: Stored in articles/YYYY/MM directories, where YYYY is the year and MM is the month of the article's publication date.
Comments: Stored in comments/YYYY/MM directories, following the same convention.

output_path/
├── articles/
│   ├── 2023/
│   │   ├── 01/
│   │   │   ├── 2023-01-01-article0.json
│   │   │   ├── 2023-01-01-article1.json
│   │   └── ...
│   └── ...
└── comments/
    ├── 2023/
    │   ├── 01/
    │   │   ├── 2023-01-01-comments0.json
    │   │   ├── 2023-01-01-comments1.json
    │   └── ...
    └── ...
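
The naming convention can be expressed as a small path-building helper. This is a minimal sketch based only on the directory listing above. The function name output_file and its parameters are assumptions and do not necessarily match the scraper's own code.

```python
from datetime import date
from pathlib import Path


def output_file(output_path: str, kind: str, published: date, index: int) -> Path:
    """Build a path such as output_path/articles/2023/01/2023-01-01-article0.json."""
    # File names use "article" for articles and "comments" for comments,
    # following the directory listing in this README.
    stem = {"articles": "article", "comments": "comments"}[kind]
    folder = Path(output_path) / kind / f"{published:%Y}" / f"{published:%m}"
    return folder / f"{published.isoformat()}-{stem}{index}.json"


# Example: the second comments file for 1 January 2023.
example = output_file("output_path", "comments", date(2023, 1, 1), 1)
```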

Articles

The articles are saved as JSON files within the articles/YYYY/MM directories and contain the following structure:

{
  "url": "article_url",
  "id": "article_id",
  "channel": "article_channel",
  "subchannel": "article_subchannel",
  "headline": {
    "main": "main_headline",
    "social": "social_headline"
  },
  "intro": "intro_text",
  "text": "article_text",
  "topics": "topics_array",
  "author": "article_author",
  "comments_enabled": true,
  "date_created": "creation_date",
  "date_modified": "modification_date",
  "date_published": "publication_date",
  "breadcrumbs": ["breadcrumb1", "breadcrumb2"]
}
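
Saved articles can be read back with the standard library. The sketch below assumes each file holds a single JSON object with the fields shown above. The helper names load_articles and summarize are made up for this example.

```python
import json
from pathlib import Path


def load_articles(output_path: str, year: int, month: int) -> list[dict]:
    """Load all article JSON files for one month, sorted by file name."""
    folder = Path(output_path) / "articles" / f"{year:04d}" / f"{month:02d}"
    articles = []
    for file in sorted(folder.glob("*.json")):
        with file.open(encoding="utf-8") as handle:
            articles.append(json.load(handle))
    return articles


def summarize(article: dict) -> dict:
    """Pick a few fields from the article structure for a quick overview."""
    return {
        "id": article["id"],
        "headline": article["headline"]["main"],
        "channel": article["channel"],
        "comments_enabled": article["comments_enabled"],
    }
```

For example, [summarize(a) for a in load_articles("output_path", 2023, 1)] would list the main headline and channel of every article saved for January 2023.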

Comments

The comments are saved as JSON files within the comments/YYYY/MM directories and contain the following structure, in which replies are nested (the example shows an original comment with one reply and no further comments):

[
  {
    "id": "comment_id",
    "body": "comment_text",
    "action_summaries": [
      {"__typename": "DefaultActionSummary", "count": 4},
      {"__typename": "DownvoteActionSummary", "count": 23},
      {"__typename": "UpvoteActionSummary", "count": 5}
    ],
    "tags": [
      {
        "tag": {
          "name": "UNCHECKED",
          "created_at": "timestamp"
        },
        "assigned_by": null
      }
    ],
    "user": {
      "id": "user_id",
      "username": "username",
      "role": null
    },
    "status": "ACCEPTED",
    "created_at": "timestamp",
    "updated_at": "timestamp",
    "editing": {"edited": false},
    "richTextBody": null,
    "highlights": [],
    "replies": [
      {
        "id": "reply_id",
        "body": "reply_text",
        "action_summaries": [
          {"__typename": "UpvoteActionSummary", "count": -1}
        ],
        "tags": [],
        "user": {
          "id": "user_id",
          "username": "username",
          "role": null
        },
        "status": "ACCEPTED",
        "created_at": "timestamp",
        "updated_at": "timestamp",
        "editing": {"edited": false},
        "richTextBody": null,
        "highlights": [],
        "replies": []
      }
    ]
  }
]
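
Because replies are nested, analyses usually need to walk the comment tree. The following sketch flattens a thread depth-first and looks up a single action count. It relies only on the replies and action_summaries fields shown above. The helper names and the sample thread are assumptions for illustration.

```python
from collections.abc import Iterator


def iter_comments(comments: list[dict], depth: int = 0) -> Iterator[tuple[int, dict]]:
    """Yield (depth, comment) pairs depth-first, each parent before its replies."""
    for comment in comments:
        yield depth, comment
        yield from iter_comments(comment.get("replies", []), depth + 1)


def action_count(comment: dict, typename: str) -> int:
    """Return the count for one action type, or 0 if that type is absent."""
    for summary in comment.get("action_summaries", []):
        if summary["__typename"] == typename:
            return summary["count"]
    return 0


# Hypothetical thread: one top-level comment with a single reply.
thread = [
    {
        "id": "c1",
        "action_summaries": [{"__typename": "UpvoteActionSummary", "count": 5}],
        "replies": [{"id": "c2", "action_summaries": [], "replies": []}],
    }
]

flat = [(depth, comment["id"]) for depth, comment in iter_comments(thread)]
upvotes = action_count(thread[0], "UpvoteActionSummary")
```

Here flat holds one (depth, id) pair per comment, with top-level comments at depth 0 and their direct replies at depth 1.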

License

This project is licensed under the GNU General Public License v3.0. See the LICENSE file for more details.

Contact

For questions, feedback, or to report issues, please contact Gerrit Anders at g.anders@iwm-tuebingen.de.
