Peraturan-Crawler is an open-source utility that automates crawling and downloading legal PDF documents from peraturan.go.id. It retries failed requests, downloads in batches, validates each PDF, can resume an interrupted run, and stores every PDF it finds, along with its metadata, locally for further processing or offline research.
- Automated crawling from a root URL (default: peraturan.go.id)
- Smart, multi-level link discovery (finds all PDF links recursively)
- Batch PDF download with retry and validation
- PDF file validation (each download is checked to confirm the file is really a PDF)
- Resume support (safe to stop/restart; won't redownload completed files)
- Metadata and progress logging (JSON files for all results)
- Colorized and timestamped logs
- Node.js 16+
- NPM
- Clone the repository
git clone https://github.com/NeaByteLab/Peraturan-Crawler.git
cd Peraturan-Crawler
- Install dependencies
npm install
- Run the crawler
node index.js

The script will crawl from the default root (https://peraturan.go.id) and start downloading PDFs in batches into the pdf_peraturan/ folder. All progress and metadata are saved in local JSON files (all_pdf_metadata.json, resume_crawl.json).
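The crawl step can be pictured as a breadth-first traversal that stays on the root's hostname and sets aside any link ending in .pdf. The sketch below is only an illustration under that assumption and is not taken from index.js: the names extractLinks, crawlForPdfs, fetchHtml, and maxPages are hypothetical, and the regex-based link extraction is a simplification (a real HTML parser would be more robust).

```javascript
// Hypothetical helpers for illustration; not the project's actual code.

// Collect absolute URLs from href attributes in an HTML string.
function extractLinks(html, baseUrl) {
  const links = [];
  const hrefPattern = /href\s*=\s*["']([^"']+)["']/gi;
  let match;
  while ((match = hrefPattern.exec(html)) !== null) {
    try {
      const url = new URL(match[1], baseUrl); // resolves relative links
      url.hash = ''; // drop "#fragment" so the same page is not queued twice
      links.push(url.href);
    } catch (err) {
      // Ignore hrefs that cannot be parsed as URLs.
    }
  }
  return links;
}

// Breadth-first crawl limited to the root's hostname.
// fetchHtml is an injected function: (url) => Promise<string>.
async function crawlForPdfs(rootUrl, fetchHtml, maxPages = 100) {
  const host = new URL(rootUrl).hostname;
  const queue = [new URL(rootUrl).href];
  const seen = new Set(queue);
  const pdfs = new Set();
  let visited = 0;

  while (queue.length > 0 && visited < maxPages) {
    const pageUrl = queue.shift();
    visited++;
    let html;
    try {
      html = await fetchHtml(pageUrl);
    } catch (err) {
      continue; // skip pages that fail to load
    }
    for (const link of extractLinks(html, pageUrl)) {
      const parsed = new URL(link);
      if (parsed.hostname !== host || seen.has(link)) continue;
      seen.add(link);
      if (parsed.pathname.toLowerCase().endsWith('.pdf')) {
        pdfs.add(link); // schedule for download instead of crawling
      } else {
        queue.push(link);
      }
    }
  }
  return [...pdfs];
}
```

Passing fetchHtml in as a parameter keeps the traversal independent of any particular HTTP client and makes it easy to exercise against canned pages.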
- Crawling
  - Starts from the given root URL.
  - Recursively finds all links (pages and PDFs) on the domain.
- PDF Discovery & Validation
  - Every discovered PDF link is checked and scheduled for download.
  - Downloads are retried up to 3 times per file (configurable).
  - Each file is validated to ensure it's a real, readable PDF.
- Batch Download & Logging
  - Downloads in batches (default: 5 concurrent files).
  - All progress (success/fail/skip) is logged in color to console and to local JSON for resume.
- Resume & Metadata
  - If interrupted, rerun the script to resume unfinished jobs.
  - Metadata (source page, filename, local path, etc.) is stored in all_pdf_metadata.json.
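
The retry and validation steps above can be sketched as follows. This is a minimal illustration, not the code in index.js: looksLikePdf, withRetry, downloadPdf, fetchBytes, and baseDelayMs are hypothetical names, and the check assumes that "a real PDF" means the file starts with the `%PDF-` signature and carries an `%%EOF` marker near its end. The script itself may validate differently.

```javascript
// Hypothetical helpers for illustration; not the project's actual code.

// A PDF starts with the ASCII signature "%PDF-" and normally has an
// "%%EOF" marker close to the end of the file.
function looksLikePdf(buffer) {
  if (!Buffer.isBuffer(buffer) || buffer.length < 8) return false;
  const header = buffer.subarray(0, 5).toString('latin1');
  const tail = buffer
    .subarray(Math.max(0, buffer.length - 1024))
    .toString('latin1');
  return header === '%PDF-' && tail.includes('%%EOF');
}

// Run an async task up to maxRetry times, pausing longer after each failure.
async function withRetry(task, maxRetry = 3, baseDelayMs = 500) {
  let lastError;
  for (let attempt = 1; attempt <= maxRetry; attempt++) {
    try {
      return await task(attempt);
    } catch (err) {
      lastError = err;
      if (attempt < maxRetry) {
        await new Promise((resolve) => setTimeout(resolve, baseDelayMs * attempt));
      }
    }
  }
  throw lastError;
}

// Download one URL and treat a non-PDF response as a failure, so that it
// is retried like any other error.
// fetchBytes is an injected function: (url) => Promise<Buffer>.
async function downloadPdf(url, fetchBytes, { maxRetry = 3, baseDelayMs = 500 } = {}) {
  return withRetry(async () => {
    const bytes = await fetchBytes(url);
    if (!looksLikePdf(bytes)) {
      throw new Error(`Not a valid PDF: ${url}`);
    }
    return bytes;
  }, maxRetry, baseDelayMs);
}
```

Treating a failed validation as a thrown error means an HTML error page served with a .pdf URL goes through the same retry path as a network failure.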
.
├── index.js # Main script: crawler & downloader
├── pdf_peraturan/ # All downloaded PDF files
├── all_pdf_metadata.json # JSON metadata for all PDFs
├── resume_crawl.json # Resume state
├── package.json
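
To illustrate how a resume file such as resume_crawl.json can make a rerun skip finished work, here is a small sketch. The file's real schema is not documented here, so the shape `{ "completed": [...] }` and the function names loadResumeState, saveResumeState, and pendingUrls are assumptions made purely for the example.

```javascript
const fs = require('fs');

// Assumed shape, for illustration only: { "completed": ["https://.../a.pdf", ...] }

// Read the set of already-downloaded URLs; start fresh if the file is
// missing or cannot be parsed.
function loadResumeState(file) {
  try {
    const parsed = JSON.parse(fs.readFileSync(file, 'utf8'));
    return new Set(Array.isArray(parsed.completed) ? parsed.completed : []);
  } catch (err) {
    return new Set();
  }
}

// Write to a temporary file first, then rename it into place, so an
// interruption mid-write does not leave a half-written state file.
function saveResumeState(file, completed) {
  const tmp = `${file}.tmp`;
  fs.writeFileSync(tmp, JSON.stringify({ completed: [...completed] }, null, 2));
  fs.renameSync(tmp, file);
}

// Keep only the URLs that still need to be downloaded.
function pendingUrls(allUrls, completed) {
  return allUrls.filter((url) => !completed.has(url));
}
```

Saving the state after each completed file (or each batch) is what makes it safe to stop the process at any point: at worst, the files in flight are downloaded again.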
- Default root: https://peraturan.go.id (edit in index.js if needed)
- Batch size: Change the batchDownload parameter in the script
- Retry count: Change the maxRetry parameter in the download/fetch functions
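
As a rough picture of what the batch size controls, the sketch below processes a list in fixed-size groups and records a success or fail status for each item. It is not the script's implementation: runInBatches, worker, and batchSize are hypothetical names standing in for whatever the batchDownload setting governs in index.js.

```javascript
// Hypothetical helper for illustration; not the project's actual code.

// Process items in groups of batchSize. Each group runs concurrently, and
// the next group starts only after the current one has fully settled.
async function runInBatches(items, worker, batchSize = 5) {
  const results = [];
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    // allSettled: one failed item does not abort the rest of the batch.
    const settled = await Promise.allSettled(
      batch.map(async (item) => worker(item))
    );
    settled.forEach((outcome, j) => {
      const failed = outcome.status === 'rejected';
      results.push({
        item: batch[j],
        status: failed ? 'fail' : 'success',
        error: failed
          ? String((outcome.reason && outcome.reason.message) || outcome.reason)
          : null,
      });
    });
  }
  return results;
}
```

A larger batch size finishes sooner on a fast connection but puts more simultaneous load on the remote server; a smaller one is gentler and easier to follow in the logs.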
MIT © NeaByteLab