# spon_scraper
`spon_scraper` is a Python application designed to automate the scraping of articles and comments from DER SPIEGEL. Building on the `spon_api` package, `spon_scraper` facilitates large-scale data extraction for specified date ranges and custom criteria by employing asynchronous processing.
Please note that due to a restructuring of the DER SPIEGEL website in December 2023, the scraper no longer works: comment sections were removed (see the official announcement), which rendered the underlying API unusable. The code is therefore not actively maintained as of 2024-01-01 and is hosted for documentation purposes only.
## Features
- Automated Date Range Scraping: Efficiently scrape articles from DER SPIEGEL for a specified range of dates.
- Article and Comment Extraction: Fetch detailed metadata, content, and comments for individual articles.
- Exclusion Criteria: Customize exclusion rules to filter out specific URLs or patterns.
- Data Storage: Organize and save the extracted data into structured directories.
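The exclusion rules can be thought of as a set of URL patterns checked before an article is fetched. The sketch below is illustrative only: the pattern names and the `is_excluded` helper are assumptions for this example, not the scraper's actual implementation (the real exclusion criteria are supplied via the job configuration file).

```python
import re

# Hypothetical exclusion patterns -- the real ones come from the job file.
EXCLUDE_PATTERNS = [
    r"/international/",   # e.g. skip the English-language channel
    r"/fotostrecke/",     # e.g. skip photo galleries
]

def is_excluded(url, patterns=EXCLUDE_PATTERNS):
    """Return True if the URL matches any exclusion pattern."""
    return any(re.search(p, url) for p in patterns)
```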
## Setup

To set up `spon_scraper`, follow these steps:
### Prerequisites
Ensure you have Python 3.10 or later installed.
### Installation
- Clone the repository:

  ```shell
  git clone https://gitea.iwm-tuebingen.de/ganders/spon_scraper.git
  cd spon_scraper
  ```

- Install the required Python packages:

  ```shell
  pip install -r requirements.txt
  ```
## Usage

### Configuration
Create a configuration JSON file that specifies the scraping parameters. An example that was employed in a larger-scale data collection for understanding uncongeniality bias can be found in `job_example.json`.
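For orientation, a job configuration might look roughly like the sketch below. The key names (`start_date`, `end_date`, `output_path`, `exclude_patterns`) are illustrative assumptions only; consult `job_example.json` for the actual schema the scraper expects.

```json
{
  "start_date": "2023-01-01",
  "end_date": "2023-01-31",
  "output_path": "data/",
  "exclude_patterns": ["/international/", "/fotostrecke/"]
}
```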
### Running the Scraper

To run the scraper, use the following command (make sure you use the correct command for your Python version):

```shell
python main.py job_example.json
```
This command will start the scraping process based on the parameters specified in your configuration file.
## Output Data Structure

The scraped data is organized into a structured directory format under the specified `output_path`. The directory structure and file naming conventions are as follows:
### Folder Structure

- Articles: Stored in `articles/YYYY/MM` directories, where `YYYY` is the year and `MM` is the month of the article's publication date.
- Comments: Stored in `comments/YYYY/MM` directories, following the same convention.
```
output_path/
├── articles/
│   ├── 2023/
│   │   ├── 01/
│   │   │   ├── 2023-01-01-article0.json
│   │   │   ├── 2023-01-01-article1.json
│   │   └── ...
│   └── ...
└── comments/
    ├── 2023/
    │   ├── 01/
    │   │   ├── 2023-01-01-comments0.json
    │   │   ├── 2023-01-01-comments1.json
    │   └── ...
    └── ...
```
### Articles

The articles are saved as JSON files within the `articles/YYYY/MM` directories and have the following structure:
```json
{
    "url": "article_url",
    "id": "article_id",
    "channel": "article_channel",
    "subchannel": "article_subchannel",
    "headline": {
        "main": "main_headline",
        "social": "social_headline"
    },
    "intro": "intro_text",
    "text": "article_text",
    "topics": "topics_array",
    "author": "article_author",
    "comments_enabled": true,
    "date_created": "creation_date",
    "date_modified": "modification_date",
    "date_published": "publication_date",
    "breadcrumbs": ["breadcrumb1", "breadcrumb2"]
}
```
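Since the articles are plain JSON files in a predictable `articles/YYYY/MM` layout, they can be loaded back with the standard library alone. The helper below is a minimal sketch (the function name `iter_articles` is ours, not part of the package):

```python
import json
from pathlib import Path

def iter_articles(output_path):
    """Yield parsed article dicts from output_path/articles/YYYY/MM/*.json."""
    for f in sorted(Path(output_path).glob("articles/*/*/*.json")):
        with open(f, encoding="utf-8") as fh:
            yield json.load(fh)
```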
### Comments

The comments are saved as JSON files within the `comments/YYYY/MM` directories and have the following structure, where replies are nested (the example shows an original comment with one reply and no further comments):
```json
[
    {
        "id": "comment_id",
        "body": "comment_text",
        "action_summaries": [
            {"__typename": "DefaultActionSummary", "count": 4},
            {"__typename": "DownvoteActionSummary", "count": 23},
            {"__typename": "UpvoteActionSummary", "count": 5}
        ],
        "tags": [
            {
                "tag": {
                    "name": "UNCHECKED",
                    "created_at": "timestamp"
                },
                "assigned_by": null
            }
        ],
        "user": {
            "id": "user_id",
            "username": "username",
            "role": null
        },
        "status": "ACCEPTED",
        "created_at": "timestamp",
        "updated_at": "timestamp",
        "editing": {"edited": false},
        "richTextBody": null,
        "highlights": [],
        "replies": [
            {
                "id": "reply_id",
                "body": "reply_text",
                "action_summaries": [
                    {"__typename": "UpvoteActionSummary", "count": -1}
                ],
                "tags": [],
                "user": {
                    "id": "user_id",
                    "username": "username",
                    "role": null
                },
                "status": "ACCEPTED",
                "created_at": "timestamp",
                "updated_at": "timestamp",
                "editing": {"edited": false},
                "richTextBody": null,
                "highlights": [],
                "replies": []
            }
        ]
    }
]
```
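Because replies are nested inside their parent's `replies` array, analyses that need one row per comment have to flatten the tree. A minimal recursive sketch (the function name `flatten_comments` and the chosen row fields are ours, not part of the package):

```python
def flatten_comments(comments, parent_id=None):
    """Depth-first flatten of the nested reply tree into flat rows.

    Each row records the parent comment's id (None for top-level comments),
    so the thread structure can be reconstructed later if needed.
    """
    rows = []
    for c in comments:
        rows.append({"parent_id": parent_id, "id": c["id"], "body": c["body"]})
        rows.extend(flatten_comments(c.get("replies", []), parent_id=c["id"]))
    return rows
```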
## License
This project is licensed under the GNU General Public License v3.0. See the LICENSE file for more details.
## Contact
For questions, feedback, or to report issues, please contact Gerrit Anders at g.anders@iwm-tuebingen.de.