Peraturan-Crawler

Peraturan-Crawler is an open-source utility that automates crawling and downloading legal PDF documents from peraturan.go.id. It crawls with retries, downloads in batches, validates each PDF, can resume after an interruption, and stores the downloaded PDF files and their metadata locally for further processing or offline research.

Features

Automated crawling from a root URL (default: peraturan.go.id)
Smart, multi-level link discovery (finds all PDF links recursively)
Batch PDF download with retry and validation
PDF file validation (each downloaded file is checked to confirm it is really a PDF)
Resume support (safe to stop/restart; won't redownload completed files)
Metadata and progress logging (JSON files for all results)
Colorized and timestamped logs

Requirements

Node.js 16+
NPM

Getting Started

Clone the repository

git clone https://github.com/NeaByteLab/Peraturan-Crawler.git
cd Peraturan-Crawler

Install dependencies

npm install

Run the crawler

node index.js

The script will crawl from the default root (https://peraturan.go.id) and start downloading PDFs in batches into the pdf_peraturan/ folder. All progress and metadata are saved in local JSON files (all_pdf_metadata.json, resume_crawl.json).
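As a rough illustration of how a JSON resume file can make a run safe to stop and restart, here is a minimal sketch. The state shape (`{ done: [...] }`) and the helper names `loadState`, `saveState`, and `pending` are assumptions made for this example; the actual layout of resume_crawl.json in index.js may differ.

```javascript
const fs = require('fs');

// Hypothetical state shape: { "done": ["https://.../a.pdf", ...] }.
// The real resume_crawl.json may store different fields.
function loadState(file) {
  try {
    const parsed = JSON.parse(fs.readFileSync(file, 'utf8'));
    return new Set(Array.isArray(parsed.done) ? parsed.done : []);
  } catch (err) {
    // Missing or unreadable file: start with an empty state.
    return new Set();
  }
}

function saveState(file, done) {
  // Write to a temporary file first, then rename it into place,
  // so an interrupted write does not leave a half-written state file.
  const tmp = `${file}.tmp`;
  fs.writeFileSync(tmp, JSON.stringify({ done: [...done] }, null, 2));
  fs.renameSync(tmp, file);
}

// Return only the URLs that have not been completed yet.
function pending(urls, done) {
  return urls.filter((url) => !done.has(url));
}
```

With helpers like these, a restart would call `loadState`, filter the discovered links through `pending`, and call `saveState` after each completed file.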

How It Works (Flow)

Crawling
- Starts from the given root URL.
- Recursively finds all links (pages and PDFs) on the domain.

PDF Discovery & Validation
- Every discovered PDF link is checked and scheduled for download.
- Downloads are retried up to 3 times per file (configurable).
- Each file is validated to ensure it's a real, readable PDF.

Batch Download & Logging
- Downloads in batches (default: 5 concurrent files).
- All progress (success/fail/skip) is logged in color to the console and written to local JSON files so the run can be resumed.

Resume & Metadata
- If interrupted, rerun the script to resume unfinished jobs.
- Metadata (source page, filename, local path, etc.) is stored in all_pdf_metadata.json.
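The validation and retry steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in index.js: `looksLikePdf` and `withRetry` are hypothetical names, the check only looks for the `%PDF-` header and an `%%EOF` marker near the end of the file, and "up to 3 times" is interpreted here as three attempts in total.

```javascript
// Heuristic PDF check: the file starts with "%PDF-" and has an "%%EOF"
// marker within its last 1024 bytes. This catches HTML error pages and
// truncated downloads, but it does not prove the PDF is fully parseable.
function looksLikePdf(buf) {
  if (!Buffer.isBuffer(buf) || buf.length < 8) return false;
  const header = buf.subarray(0, 5).toString('latin1');
  const tail = buf.subarray(Math.max(0, buf.length - 1024)).toString('latin1');
  return header === '%PDF-' && tail.includes('%%EOF');
}

// Run an async task up to maxRetry times, waiting a little longer
// after each failure. The last error is rethrown if every attempt fails.
async function withRetry(task, maxRetry = 3, delayMs = 500) {
  let lastError;
  for (let attempt = 1; attempt <= maxRetry; attempt++) {
    try {
      return await task(attempt);
    } catch (err) {
      lastError = err;
      if (attempt < maxRetry) {
        await new Promise((resolve) => setTimeout(resolve, delayMs * attempt));
      }
    }
  }
  throw lastError;
}
```

A download step could wrap its fetch in `withRetry` and throw when `looksLikePdf` returns false, so that an invalid response triggers another attempt instead of being saved.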

Project Structure

.
├── index.js                # Main script: crawler & downloader
├── pdf_peraturan/          # All downloaded PDF files
├── all_pdf_metadata.json   # JSON metadata for all PDFs
├── resume_crawl.json       # Resume state
└── package.json

Configuration

Default root: https://peraturan.go.id (edit in index.js if needed)
Batch size: Change the batchDownload parameter in the script
Retry count: Change the maxRetry parameter in the download/fetch functions
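To show what the batch size setting controls, here is a minimal sketch of batched execution. The names `chunk`, `runInBatches`, and `batchSize` are hypothetical and chosen for this example; the real script may structure its batching differently.

```javascript
// Split a list into consecutive groups of at most `size` items.
function chunk(items, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError('size must be a positive integer');
  }
  const out = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Process items one batch at a time. Items inside a batch run concurrently;
// the next batch starts only after the current one has fully settled.
// Promise.allSettled keeps one failed item from aborting the whole batch.
async function runInBatches(items, worker, batchSize = 5) {
  const results = [];
  for (const batch of chunk(items, batchSize)) {
    const settled = await Promise.allSettled(
      batch.map(async (item) => worker(item))
    );
    results.push(...settled);
  }
  return results;
}
```

Each entry in the returned array has a `status` of `fulfilled` or `rejected`, which maps naturally onto the success and fail outcomes that the script logs.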
