# spon_scraper
`spon_scraper` is a Python application designed to automate the scraping of articles and comments from DER SPIEGEL.
Building on the `spon_api` package, `spon_scraper` facilitates large-scale data extraction
based on specified date ranges and custom criteria by employing asynchronous processing.
Please note that the scraper has not worked since DER SPIEGEL restructured its website in December 2023:
comment sections were removed (see the [official announcement](https://www.spiegel.de/backstage/community-wir-starten-spiegel-debatte-a-8df0d3e4-722a-4cd9-87cf-6f809bb767ce)), which rendered the underlying API unusable.
The code is therefore not actively maintained as of 2024-01-01 and is hosted for documentation purposes only.
## Features
- **Automated Date Range Scraping**: Efficiently scrape articles from DER SPIEGEL for a specified range of dates.
- **Article and Comment Extraction**: Fetch detailed metadata, content, and comments for individual articles.
- **Exclusion Criteria**: Customize exclusion rules to filter out specific URLs or patterns.
- **Data Storage**: Organize and save the extracted data into structured directories.
## Setup
To set up `spon_scraper`, follow these steps:
### Prerequisites
Ensure you have Python 3.10 or later installed.
### Installation
1. **Clone the repository**:
```bash
git clone https://gitea.iwm-tuebingen.de/ganders/spon_scraper.git
cd spon_scraper
```
2. **Install the required Python packages**:
```bash
pip install -r requirements.txt
```
## Usage
### Configuration
Create a configuration JSON file that specifies the scraping parameters.
An example that was used in a larger-scale data collection for studying uncongeniality bias can be found in
`job_example.json`.
### Running the Scraper
To run the scraper, use the following command (make sure you use the correct command for your Python version):
```bash
python main.py job_example.json
```
This command will start the scraping process based on the parameters specified in your configuration file.
## Output Data Structure
The scraped data is organized into a structured directory format under the specified `output_path`.
The directory structure and file naming conventions are as follows:
### Folder Structure
- **Articles**: Stored in `articles/YYYY/MM` directories, where `YYYY` is the year and `MM` is the month of the article's publication date.
- **Comments**: Stored in `comments/YYYY/MM` directories, following the same convention.
```text
output_path/
├── articles/
│   ├── 2023/
│   │   ├── 01/
│   │   │   ├── 2023-01-01-article0.json
│   │   │   ├── 2023-01-01-article1.json
│   │   └── ...
│   └── ...
└── comments/
    ├── 2023/
    │   ├── 01/
    │   │   ├── 2023-01-01-comments0.json
    │   │   ├── 2023-01-01-comments1.json
    │   └── ...
    └── ...
```
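The naming convention shown in the tree can be sketched in a few lines of Python. This is only an illustration of the layout described here, not code taken from `spon_scraper`: the helper name `output_file` and its parameters are hypothetical.

```python
from datetime import date
from pathlib import PurePosixPath


def output_file(output_path: str, kind: str, day: date, index: int) -> PurePosixPath:
    """Build a path such as output_path/articles/2023/01/2023-01-01-article0.json.

    Hypothetical helper: it mirrors the documented layout only.
    """
    if kind not in ("articles", "comments"):
        raise ValueError(f"unknown kind: {kind!r}")
    # File names use the singular "article" but the plural "comments".
    stem = "article" if kind == "articles" else "comments"
    filename = f"{day.isoformat()}-{stem}{index}.json"
    return PurePosixPath(output_path) / kind / f"{day.year:04d}" / f"{day.month:02d}" / filename
```

For example, `output_file("output_path", "comments", date(2023, 1, 1), 1)` points at the second comments file of 1 January 2023.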
### Articles
The articles are saved as JSON files within the `articles/YYYY/MM` directories and contain the following structure:
```json
{
  "url": "article_url",
  "id": "article_id",
  "channel": "article_channel",
  "subchannel": "article_subchannel",
  "headline": {
    "main": "main_headline",
    "social": "social_headline"
  },
  "intro": "intro_text",
  "text": "article_text",
  "topics": "topics_array",
  "author": "article_author",
  "comments_enabled": true,
  "date_created": "creation_date",
  "date_modified": "modification_date",
  "date_published": "publication_date",
  "breadcrumbs": ["breadcrumb1", "breadcrumb2"]
}
```
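Already collected article files can be read back with the standard library alone. The following is a minimal sketch that assumes the directory layout and field names documented in this README; `iter_articles` and `summarize` are hypothetical helpers, not part of `spon_scraper` or `spon_api`.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_articles(output_path: str, year: int, month: int) -> Iterator[dict]:
    """Yield each article saved for the given month, in file name order."""
    month_dir = Path(output_path) / "articles" / f"{year:04d}" / f"{month:02d}"
    for file in sorted(month_dir.glob("*.json")):
        with file.open(encoding="utf-8") as handle:
            yield json.load(handle)


def summarize(article: dict) -> str:
    """Return a one-line summary; missing fields fall back to empty strings."""
    headline = article.get("headline", {}).get("main", "")
    return f'{article.get("date_published", "")} [{article.get("channel", "")}] {headline}'
```

A typical use would be `for article in iter_articles("output_path", 2023, 1): print(summarize(article))`.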
### Comments
The comments are saved as JSON files within the `comments/YYYY/MM` directories and contain the following structure,
in which replies are nested (the example shows an original comment with one reply and no further comments):
```json
[
  {
    "id": "comment_id",
    "body": "comment_text",
    "action_summaries": [
      {"__typename": "DefaultActionSummary", "count": 4},
      {"__typename": "DownvoteActionSummary", "count": 23},
      {"__typename": "UpvoteActionSummary", "count": 5}
    ],
    "tags": [
      {
        "tag": {
          "name": "UNCHECKED",
          "created_at": "timestamp"
        },
        "assigned_by": null
      }
    ],
    "user": {
      "id": "user_id",
      "username": "username",
      "role": null
    },
    "status": "ACCEPTED",
    "created_at": "timestamp",
    "updated_at": "timestamp",
    "editing": {"edited": false},
    "richTextBody": null,
    "highlights": [],
    "replies": [
      {
        "id": "reply_id",
        "body": "reply_text",
        "action_summaries": [
          {"__typename": "UpvoteActionSummary", "count": -1}
        ],
        "tags": [],
        "user": {
          "id": "user_id",
          "username": "username",
          "role": null
        },
        "status": "ACCEPTED",
        "created_at": "timestamp",
        "updated_at": "timestamp",
        "editing": {"edited": false},
        "richTextBody": null,
        "highlights": [],
        "replies": []
      }
    ]
  }
]
```
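Because replies are nested inside their parent comment, analyses usually need to flatten the tree first. The sketch below shows one way to do that with a depth-first walk, assuming only the `replies`, `action_summaries`, `__typename` and `count` keys from the example. `walk_comments` and `action_counts` are hypothetical names, not functions provided by `spon_scraper`.

```python
from typing import Iterator


def walk_comments(comments: list[dict], depth: int = 0) -> Iterator[tuple[int, dict]]:
    """Yield (depth, comment) pairs depth-first; top-level comments have depth 0."""
    for comment in comments:
        yield depth, comment
        # "replies" holds comments of the same shape, so recurse one level deeper.
        yield from walk_comments(comment.get("replies") or [], depth + 1)


def action_counts(comment: dict) -> dict[str, int]:
    """Map each "__typename" in "action_summaries" to its count."""
    return {s["__typename"]: s["count"] for s in comment.get("action_summaries") or []}
```

Applied to a parsed comments file, `sum(1 for _ in walk_comments(data))` gives the total number of comments including all replies.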
## License
This project is licensed under the GNU General Public License v3.0. See the LICENSE file for more details.
## Contact
For questions, feedback, or to report issues, please contact Gerrit Anders at g.anders@iwm-tuebingen.de.