# spon_scraper

`spon_scraper` is a Python application designed to automate the scraping of articles and comments from DER SPIEGEL.
Building on the `spon_api` package, `spon_scraper` facilitates large-scale data extraction
based on specified date ranges and custom criteria by employing asynchronous processing.
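
The asynchronous flow can be pictured roughly as follows. This is a minimal sketch using `asyncio` only; `fetch_article` is a hypothetical stand-in for the underlying `spon_api` call, not the scraper's actual code:

```python
import asyncio

async def fetch_article(url: str) -> dict:
    """Stand-in for the actual spon_api call (assumed, not the real API)."""
    await asyncio.sleep(0.1)  # simulate network latency
    return {"url": url}

async def scrape_all(urls: list[str]) -> list[dict]:
    # Fetch all articles concurrently rather than one after another.
    return await asyncio.gather(*(fetch_article(u) for u in urls))

articles = asyncio.run(scrape_all(["https://www.spiegel.de/politik/"]))
```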

Please note that due to a restructuring of the DER SPIEGEL website in December 2023, the scraper no longer works:
the comment sections were removed (see the [official announcement](https://www.spiegel.de/backstage/community-wir-starten-spiegel-debatte-a-8df0d3e4-722a-4cd9-87cf-6f809bb767ce)), which rendered the underlying API unusable.
The code has therefore not been actively maintained since 2024-01-01 and is hosted for documentation purposes only.

## Features

- **Automated Date Range Scraping**: Efficiently scrape articles from DER SPIEGEL for a specified range of dates.
- **Article and Comment Extraction**: Fetch detailed metadata, content, and comments for individual articles.
- **Exclusion Criteria**: Customize exclusion rules to filter out specific URLs or patterns, as illustrated in the sketch after this list.
- **Data Storage**: Organize and save the extracted data into structured directories.
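
How such exclusion rules might be applied is sketched below; the pattern list and helper are illustrative assumptions, not the scraper's actual implementation:

```python
import re

# Hypothetical patterns -- the real rules are defined in the job
# configuration file, not hard-coded like this.
EXCLUDE_PATTERNS = [re.compile(p) for p in (r"/video/", r"/fotostrecke/")]

def is_excluded(url: str) -> bool:
    """Return True if any exclusion pattern matches the URL."""
    return any(p.search(url) for p in EXCLUDE_PATTERNS)

urls = [
    "https://www.spiegel.de/politik/some-article-a-1.html",
    "https://www.spiegel.de/video/some-clip-v-2.html",
]
kept = [u for u in urls if not is_excluded(u)]  # keeps only the first URL
```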

## Setup

To set up `spon_scraper`, follow these steps:

### Prerequisites

Ensure you have Python 3.10 or later installed.

### Installation

1. **Clone the repository**:

   ```bash
   git clone https://gitea.iwm-tuebingen.de/ganders/spon_scraper.git
   cd spon_scraper
   ```

2. **Install the required Python packages**:

   ```bash
   pip install -r requirements.txt
   ```

## Usage

### Configuration

Create a configuration JSON file that specifies the scraping parameters.
An example that was used in a larger-scale data collection on uncongeniality bias can be found in
`job_example.json`.
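
As a rough illustration only (the key names below are assumptions, not the verified schema; consult `job_example.json` for the format that `main.py` actually expects), a job file could be written programmatically like this:

```python
import json

# Hypothetical job configuration -- key names are illustrative
# assumptions, not the verified schema; see job_example.json.
job = {
    "start_date": "2023-01-01",
    "end_date": "2023-01-31",
    "output_path": "data/",
    "exclude": ["/video/", "/fotostrecke/"],
}

with open("my_job.json", "w", encoding="utf-8") as f:
    json.dump(job, f, indent=2)
```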

### Running the Scraper

To run the scraper, use the following command (make sure to use the command that matches your Python installation, e.g. `python3` on some systems):

```bash
python main.py job_example.json
```

This command will start the scraping process based on the parameters specified in your configuration file.

## Output Data Structure

The scraped data is organized into a structured directory format under the specified `output_path`.
The directory structure and file naming conventions are as follows:

### Folder Structure

- **Articles**: Stored in `articles/YYYY/MM` directories, where `YYYY` is the year and `MM` is the month of the article's publication date.
- **Comments**: Stored in `comments/YYYY/MM` directories, following the same convention.

```markdown
output_path/
├── articles/
│   ├── 2023/
│   │   ├── 01/
│   │   │   ├── 2023-01-01-article0.json
│   │   │   ├── 2023-01-01-article1.json
│   │   │   └── ...
│   │   └── ...
│   └── ...
└── comments/
    ├── 2023/
    │   ├── 01/
    │   │   ├── 2023-01-01-comments0.json
    │   │   ├── 2023-01-01-comments1.json
    │   │   └── ...
    │   └── ...
    └── ...
```
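
This convention makes the files easy to locate programmatically; for example, the directory for a given publication date can be derived as follows (a sketch with a hypothetical helper name):

```python
from datetime import date
from pathlib import Path

def article_dir(output_path: str, published: date) -> Path:
    """Build the articles/YYYY/MM directory for a publication date."""
    return Path(output_path) / "articles" / f"{published:%Y}" / f"{published:%m}"

print(article_dir("output_path", date(2023, 1, 1)))  # output_path/articles/2023/01
```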

### Articles

The articles are saved as JSON files within the `articles/YYYY/MM` directories and contain the following structure:

```json
{
  "url": "article_url",
  "id": "article_id",
  "channel": "article_channel",
  "subchannel": "article_subchannel",
  "headline": {
    "main": "main_headline",
    "social": "social_headline"
  },
  "intro": "intro_text",
  "text": "article_text",
  "topics": "topics_array",
  "author": "article_author",
  "comments_enabled": true,
  "date_created": "creation_date",
  "date_modified": "modification_date",
  "date_published": "publication_date",
  "breadcrumbs": ["breadcrumb1", "breadcrumb2"]
}
```
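
Given this structure, an article file can be read back with the standard library alone; a minimal sketch:

```python
import json
from pathlib import Path

# Read one scraped article back and access the documented fields.
path = Path("output_path/articles/2023/01/2023-01-01-article0.json")
with path.open(encoding="utf-8") as f:
    article = json.load(f)

print(article["headline"]["main"], article["date_published"])
```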

### Comments

The comments are saved as JSON files within the `comments/YYYY/MM` directories and contain the following structure,
where replies are nested (the example shows an original comment with one reply, which itself has no further replies):

```json
[
  {
    "id": "comment_id",
    "body": "comment_text",
    "action_summaries": [
      {"__typename": "DefaultActionSummary", "count": 4},
      {"__typename": "DownvoteActionSummary", "count": 23},
      {"__typename": "UpvoteActionSummary", "count": 5}
    ],
    "tags": [
      {
        "tag": {
          "name": "UNCHECKED",
          "created_at": "timestamp"
        },
        "assigned_by": null
      }
    ],
    "user": {
      "id": "user_id",
      "username": "username",
      "role": null
    },
    "status": "ACCEPTED",
    "created_at": "timestamp",
    "updated_at": "timestamp",
    "editing": {"edited": false},
    "richTextBody": null,
    "highlights": [],
    "replies": [
      {
        "id": "reply_id",
        "body": "reply_text",
        "action_summaries": [
          {"__typename": "UpvoteActionSummary", "count": -1}
        ],
        "tags": [],
        "user": {
          "id": "user_id",
          "username": "username",
          "role": null
        },
        "status": "ACCEPTED",
        "created_at": "timestamp",
        "updated_at": "timestamp",
        "editing": {"edited": false},
        "richTextBody": null,
        "highlights": [],
        "replies": []
      }
    ]
  }
]
```
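
Since replies are nested recursively under the `replies` key, traversing a comment file takes a short recursive helper; a minimal sketch:

```python
import json
from pathlib import Path

def count_comments(comments: list[dict]) -> int:
    """Count comments at this level plus all nested replies."""
    return sum(1 + count_comments(c.get("replies", [])) for c in comments)

path = Path("output_path/comments/2023/01/2023-01-01-comments0.json")
with path.open(encoding="utf-8") as f:
    thread = json.load(f)

print(count_comments(thread))  # the example above would print 2
```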

## License

This project is licensed under the GNU General Public License v3.0. See the LICENSE file for more details.

## Contact

For questions, feedback, or to report issues, please contact Gerrit Anders at g.anders@iwm-tuebingen.de.
|