SG Jobs Aggregator – Data Integration Platform

n8n · Web Scraping · Data Pipeline · Python · GitHub Pages · Automation

The Problem

Singapore job seekers face job listings fragmented across hundreds of company career sites, with no centralized, up-to-date aggregation platform specific to the Singapore market.

My Approach

Built 20+ n8n workflows that scrape, normalize, and aggregate 4,000+ job listings in 20 minutes, and deployed the result as an automated daily data platform on GitHub Pages.

Overview

SG Jobs Aggregator is a comprehensive data integration platform that automatically scrapes, normalizes, and publishes Singapore job listings from 20+ company career sites. Built entirely on n8n workflows, it demonstrates end-to-end data engineering from heterogeneous sources to a clean, user-friendly web interface.

The Problem

Fragmented Job Market

Scattered Listings:

  • Each company maintains its own career site
  • No standard format or structure
  • Different application processes
  • Varying update frequencies

Time-Consuming Search:

  • Job seekers visit dozens of sites manually
  • Difficult to track new postings
  • Miss opportunities between visits
  • Repetitive checking across platforms

Existing Solutions Fall Short:

  • General job boards miss company-specific listings
  • Paid aggregators focus on global markets
  • Singapore-specific sites are incomplete
  • No automation for updates

Data Heterogeneity:

  • Different HTML structures per site
  • Various API formats
  • Inconsistent metadata
  • Diverse categorization schemes

The Solution

Comprehensive n8n Workflow System

20+ Custom Workflows:

Each workflow is tailored to a specific company:

  • Custom HTML parsers
  • API integrations
  • Data extraction logic
  • Error handling
  • Rate limiting

Supports Multiple Source Types:

  • Static HTML career pages
  • Dynamic JavaScript-rendered sites
  • REST APIs
  • GraphQL endpoints
  • RSS/Atom feeds

End-to-End Data Platform

Complete Pipeline Architecture:

Information Gathering
        ↓
Data Cleaning & Normalization
        ↓
Database Insert
        ↓
JSON Generation
        ↓
index.html Generation
        ↓
GitHub Pages Deployment

Fully Automated:

  • Scheduled daily executions
  • No manual intervention
  • Automatic error recovery
  • Self-healing workflows

Fast Aggregation

Performance Metrics:

  • 4,000+ listings aggregated
  • 20 minutes total execution time
  • 20+ workflows run in parallel
  • Daily updates automatically scheduled

Optimization Strategies:

  • Parallel workflow execution
  • Efficient scraping patterns
  • Smart rate limiting
  • Incremental updates

Technical Implementation

Phase 1: Information Gathering

Web Scraping Techniques:

For Static Sites:

// n8n HTTP Request + HTML Extract
GET company-careers-url
    ↓ Parse HTML
    ↓ Extract job listings
    ↓ Structure data
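
The production pipeline does this inside n8n nodes, but the same parse/extract logic can be prototyped standalone. A minimal sketch assuming cheerio for HTML parsing; the CSS selectors (.job-card and friends) are invented placeholders, not any real company's markup:

import * as cheerio from "cheerio";

async function scrapeStaticCareersPage(url: string) {
  const res = await fetch(url);
  const $ = cheerio.load(await res.text());

  // One record per listing card; the selectors are the part that
  // changes for every company.
  return $(".job-card")
    .map((_, el) => ({
      title: $(el).find(".job-title").text().trim(),
      location: $(el).find(".job-location").text().trim(),
      // Resolve relative hrefs against the page URL
      applyUrl: new URL($(el).find("a").attr("href") ?? "", url).href,
    }))
    .get();
}

In the workflows themselves, an n8n HTML Extract node plays this role; per-company selectors are the main thing each workflow customizes.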

For Dynamic Sites:

  • JavaScript rendering
  • API endpoint discovery
  • XHR request interception
  • JSON data extraction

For API-First Companies:

  • REST API integration
  • Authentication handling
  • Pagination logic (see the sketch after this list)
  • Response parsing
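
A hedged sketch of that pagination loop against a hypothetical /api/jobs endpoint; real endpoints differ in parameter names, auth scheme, and page-size limits:

async function fetchAllJobs(baseUrl: string, token?: string) {
  const jobs: unknown[] = [];
  for (let page = 1; ; page++) {
    const res = await fetch(`${baseUrl}/api/jobs?page=${page}`, {
      headers: token ? { Authorization: `Bearer ${token}` } : {},
    });
    if (!res.ok) throw new Error(`HTTP ${res.status} on page ${page}`);
    const batch = (await res.json()) as { jobs: unknown[] };
    if (batch.jobs.length === 0) break; // empty page means we are done
    jobs.push(...batch.jobs);
  }
  return jobs;
}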

Challenges Solved:

  • Anti-scraping measures
  • Rate limiting
  • Dynamic content loading
  • Authentication requirements
  • CAPTCHA avoidance

Phase 2: Data Cleaning & Normalization

Heterogeneous Data → Unified Schema:

Standardized Fields:

{
  "company": "string",
  "title": "string",
  "location": "string",
  "type": "Full-time | Part-time | Contract | Internship",
  "category": "Engineering | Sales | Marketing | ...",
  "applyUrl": "string",
  "postedDate": "ISO 8601",
  "description": "string",
  "metadata": {}
}
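
For reference, the same unified schema written as a TypeScript type; this is a direct transcription of the JSON above, and the field names are the ones the later sketches assume:

type Job = {
  company: string;
  title: string;
  location: string;
  type: "Full-time" | "Part-time" | "Contract" | "Internship";
  category: string;       // e.g. "Engineering", "Sales", ...
  applyUrl: string;
  postedDate: string;     // ISO 8601
  description: string;
  metadata: Record<string, unknown>;
};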

Normalization Steps:

  1. Title Cleaning: Remove company names, standardize formats
  2. Location Parsing: Extract city, country, remote status
  3. Type Classification: Map various terms to standard types
  4. Category Inference: Classify by keywords and patterns
  5. Date Standardization: Convert to consistent format
  6. URL Validation: Ensure valid application links
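
A minimal sketch of steps 1–3 and 5–6 for a single record (step 4, category inference, is sketched in the classification section later). The raw field names are placeholders for whatever a given company's workflow extracts:

// Raw shape is per-company; these field names are illustrative only.
type RawJob = { company: string; title: string; loc: string; kind: string; url: string; posted: string };

const TYPE_MAP: Record<string, string> = {
  "full time": "Full-time", "permanent": "Full-time",
  "part time": "Part-time", "contractor": "Contract", "intern": "Internship",
};

function normalize(raw: RawJob) {
  return {
    company: raw.company.trim(),
    // 1. Title cleaning: strip a leading company name
    //    (naive; assumes the name contains no regex metacharacters)
    title: raw.title.replace(new RegExp(`^${raw.company}\\s*[-:]\\s*`, "i"), "").trim(),
    // 2. Location parsing: crude remote detection from the location string
    location: raw.loc.trim(),
    remote: /remote/i.test(raw.loc),
    // 3. Type classification: map vendor-specific terms to standard types
    type: TYPE_MAP[raw.kind.toLowerCase()] ?? "Full-time",
    // 5. Date standardization: throws on unparseable dates, caught upstream
    postedDate: new Date(raw.posted).toISOString(),
    // 6. URL validation: new URL() throws on malformed apply links
    applyUrl: new URL(raw.url).href,
  };
}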

Data Quality:

  • Deduplication logic
  • Validation rules
  • Completeness checks
  • Consistency enforcement

Phase 3: Database Management

Supabase PostgreSQL:

Schema Design:

CREATE TABLE jobs (
  id SERIAL PRIMARY KEY,
  company VARCHAR(255),
  title TEXT,
  location VARCHAR(255),
  type VARCHAR(50),
  category VARCHAR(100),
  apply_url TEXT,
  posted_date TIMESTAMP,
  description TEXT,
  metadata JSONB,
  scraped_at TIMESTAMP,
  updated_at TIMESTAMP
);

Operations:

  • Upsert logic (insert or update; sketched after this list)
  • Historical tracking
  • Expired job removal
  • Performance indexing
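
A sketch of the upsert with supabase-js v2, assuming a unique index on apply_url (not shown in the schema above) so re-scraped jobs update in place rather than duplicate; the env var names are placeholders:

import { createClient } from "@supabase/supabase-js";

// SUPABASE_URL / SUPABASE_KEY are assumed environment variables
const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_KEY!);

async function upsertJobs(rows: Record<string, unknown>[]) {
  // Effectively ON CONFLICT (apply_url) DO UPDATE, via the client
  const { error } = await supabase.from("jobs").upsert(rows, { onConflict: "apply_url" });
  if (error) throw error;
}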

Phase 4: Output Generation

Multi-Format Export:

JSON API:

{
  "lastUpdated": "2025-12-01T08:00:00Z",
  "totalJobs": 4127,
  "companies": 23,
  "jobs": [...]
}

Static HTML:

  • Searchable table interface
  • Filter by company, type, category (see the sketch after this list)
  • Sort by date, company, title
  • Responsive design
  • Fast client-side operations
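
The client-side filtering is plain in-memory array work over the published JSON. A minimal sketch, with field names following the JSON API above:

type Listing = { company: string; title: string; category: string; type: string };

function filterJobs(jobs: Listing[], q: { text?: string; company?: string; category?: string }) {
  const needle = (q.text ?? "").toLowerCase();
  return jobs.filter(j =>
    (!q.company || j.company === q.company) &&
    (!q.category || j.category === q.category) &&
    (needle === "" || `${j.title} ${j.company}`.toLowerCase().includes(needle))
  );
}

At roughly 4,000 rows this linear scan stays fast without any index or server, which is what makes the static frontend viable.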

GitHub Pages Deployment:

  • Automatic git commits (see the sketch after this list)
  • Static site generation
  • CDN distribution
  • Zero hosting cost
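
One way to automate the commit from a workflow is GitHub's contents API (PUT /repos/{owner}/{repo}/contents/{path}); a sketch where OWNER/REPO and the commit message are placeholders:

async function commitFile(path: string, body: string, token: string) {
  const url = `https://api.github.com/repos/OWNER/REPO/contents/${path}`;
  const headers = { Authorization: `Bearer ${token}`, Accept: "application/vnd.github+json" };

  // The current blob SHA is required when overwriting an existing file
  const current = await fetch(url, { headers });
  const sha = current.ok ? ((await current.json()) as { sha: string }).sha : undefined;

  await fetch(url, {
    method: "PUT",
    headers,
    body: JSON.stringify({
      message: "chore: daily data refresh",
      content: Buffer.from(body).toString("base64"), // API expects base64
      sha, // omitted for a brand-new file
    }),
  });
}

Pushing to the publishing branch then triggers the Pages build, so deployment is a side effect of the commit.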

Phase 5: Orchestration & Scheduling

n8n Cron Triggers:

  • Daily execution at 8 AM SGT
  • Parallel workflow launches
  • Result aggregation
  • Error notifications

Workflow Dependencies:

Scraping Workflows (parallel)
        ↓
Aggregation Workflow
        ↓
Cleaning Workflow
        ↓
Database Workflow
        ↓
Generation Workflow
        ↓
Deployment Workflow

Error Handling:

  • Retry logic for failed scrapes (sketched after this list)
  • Fallback data sources
  • Alert notifications
  • Manual intervention triggers
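
The retry behaviour can be a generic wrapper around any scrape call; a sketch with exponential backoff, where the attempt count and delays are illustrative:

async function withRetry<T>(fn: () => Promise<T>, attempts = 3): Promise<T> {
  for (let i = 1; ; i++) {
    try {
      return await fn();
    } catch (err) {
      if (i >= attempts) throw err; // out of retries: surface for alerting
      await new Promise(r => setTimeout(r, 1000 * 2 ** i)); // exponential backoff
    }
  }
}

n8n nodes also offer built-in retry-on-fail settings; the wrapper form matters when the retry needs custom logic such as switching to a fallback source.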

n8n Workflow Architecture

Individual Company Workflows

Template Structure:

Nodes:

  1. Schedule Trigger: Cron-based execution
  2. HTTP Request: Fetch career page/API
  3. Data Extraction: Parse HTML or JSON
  4. Transformation: Normalize to schema
  5. Validation: Check data quality
  6. Database Insert: Upsert to Supabase
  7. Error Handler: Log and notify failures

Example: Tech Company API Workflow

Cron Trigger (daily 8am)
    ↓
HTTP Request → GET /api/jobs
    ↓
JSON Parse → Extract jobs array
    ↓
Loop → For each job
    ↓
Transform → Map to standard schema
    ↓
Validate → Check required fields
    ↓
Supabase → Upsert job record
    ↓
Success/Error notification

Aggregation Workflow

Combines All Sources:

Process:

  1. Wait for scraping workflows to complete
  2. Query Supabase for all recent jobs
  3. Apply global deduplication
  4. Generate aggregated JSON
  5. Build static HTML
  6. Commit to GitHub
  7. Trigger Pages deployment

Data Pipeline Features

1. Automatic Deduplication

Strategies:

  • URL-based matching
  • Fuzzy title comparison
  • Company + title + location combo (sketched after this list)
  • Temporal proximity checks
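
A sketch of the composite-key strategy: normalize company, title, and location into one key, then keep first-seen records while merging sources:

function dedupeKey(j: { company: string; title: string; location: string }): string {
  const norm = (s: string) => s.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
  return [norm(j.company), norm(j.title), norm(j.location)].join("|");
}

function dedupe<T extends { company: string; title: string; location: string }>(jobs: T[]): T[] {
  const seen = new Set<string>();
  return jobs.filter(j => {
    const key = dedupeKey(j);
    if (seen.has(key)) return false; // duplicate: drop
    seen.add(key);
    return true;
  });
}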

Prevents:

  • Duplicate listings
  • Data bloat
  • User confusion
  • Processing overhead

2. Category Classification

Intelligent Categorization:

  • Keyword-based classification (sketched after the category list below)
  • Job title analysis
  • Description NLP (planned)
  • Company industry mapping

Categories:

  • Engineering
  • Data & Analytics
  • Sales & Business Development
  • Marketing
  • Operations
  • Finance
  • Human Resources
  • Customer Success
  • Design
  • Product Management
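
A minimal sketch of the keyword-based classifier: the patterns are illustrative, first match wins, and unmatched titles fall into a catch-all bucket:

const CATEGORY_KEYWORDS: [string, RegExp][] = [
  ["Engineering", /\b(engineer|developer|devops|sre)\b/i],
  ["Data & Analytics", /\b(data|analytics?|scientist)\b/i],
  ["Design", /\b(designer|ux|ui)\b/i],
  ["Product Management", /\bproduct (manager|owner)\b/i],
];

function classify(title: string): string {
  for (const [category, pattern] of CATEGORY_KEYWORDS) {
    if (pattern.test(title)) return category;
  }
  return "Operations"; // catch-all for unmatched titles
}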

3. Apply URL Validation

Ensures Quality:

  • HTTP status checks
  • Redirect following
  • Dead link removal
  • Canonical URL resolution
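
A sketch of the link check: follow redirects, keep the canonical (final) URL, drop dead links. Some servers reject HEAD requests, so a GET fallback may be needed in practice:

async function validateApplyUrl(url: string): Promise<string | null> {
  try {
    const res = await fetch(url, { method: "HEAD", redirect: "follow" });
    return res.ok ? res.url : null; // res.url is the post-redirect, canonical URL
  } catch {
    return null; // network error or malformed URL: treat as a dead link
  }
}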

User Experience:

  • No broken application links
  • Direct to actual job page
  • Tracked click-throughs
  • Working apply buttons

4. Weekly Update Automation

Scheduled Maintenance:

  • Remove expired listings
  • Refresh active jobs
  • Update company data
  • Regenerate outputs
  • Performance optimization

Key Features

User-Facing

Job Seeker Benefits:

  • 4,000+ current listings
  • Daily updates
  • Single search location
  • Fast filtering
  • Direct application links
  • Mobile-responsive

Search Capabilities:

  • Full-text search
  • Company filter
  • Job type filter
  • Category filter
  • Location filter
  • Date sorting

Technical

Data Engineering:

  • ETL pipeline automation
  • Multi-source integration
  • Automated daily processing
  • Scalable architecture

Reliability:

  • Error recovery
  • Data validation
  • Monitoring alerts
  • Uptime tracking

Performance:

  • 20-minute full refresh
  • Parallel processing
  • Efficient storage
  • Fast page loads

Impact & Metrics

Coverage:

  • 20+ companies tracked
  • 4,000+ active listings
  • Daily scheduled updates
  • Broad coverage of the Singapore market

Performance:

  • 20-minute full aggregation
  • Daily automatic execution
  • Zero manual intervention
  • Self-healing workflows

Cost:

  • $0 hosting (GitHub Pages)
  • Minimal compute (n8n cloud free tier)
  • Low API costs
  • High value delivery

User Value:

  • Single source for SG jobs
  • Time saved vs. manual search
  • Never miss new postings
  • Clean, organized interface

Use Cases

1. Active Job Seekers

  • Daily check for new listings
  • Filter by preferences
  • Track application progress
  • Market research

2. Career Explorers

  • Survey available opportunities
  • Understand market demand
  • Identify growing companies
  • Salary research

3. Recruiters

  • Competitive intelligence
  • Market mapping
  • Company tracking
  • Candidate sourcing

4. Data Analysts

  • Job market trends
  • Hiring patterns
  • Skill demand analysis
  • Economic indicators

Technical Challenges & Solutions

Challenge 1: Heterogeneous Data Sources

Problem: Each company has unique structure

Solution:

  • Custom workflow per company
  • Flexible extraction patterns
  • Adaptive normalization
  • Template system for common patterns

Challenge 2: Rate Limiting

Problem: Avoid being blocked

Solution:

  • Respect robots.txt
  • Implement delays (sketched after this list)
  • Rotate user agents
  • Use API when available
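
A simple form of the delay strategy: serialize requests per host with a fixed gap. The interval is illustrative, and robots.txt plus site terms still govern what may be fetched:

const sleep = (ms: number) => new Promise<void>(r => setTimeout(r, ms));

async function politeFetchAll(urls: string[], gapMs = 2000): Promise<string[]> {
  const pages: string[] = [];
  for (const url of urls) {
    const res = await fetch(url);
    pages.push(await res.text());
    await sleep(gapMs); // fixed delay between requests to the same host
  }
  return pages;
}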

Challenge 3: Data Freshness

Problem: Jobs expire quickly

Solution:

  • Daily updates
  • Automatic cleanup
  • Posted date tracking
  • Staleness detection

Challenge 4: Deployment Automation

Problem: Manual publishing is tedious

Solution:

  • Git automation
  • GitHub Pages integration
  • Atomic deployments
  • Rollback capability

Future Enhancements

Planned Features:

Expanded Coverage

  • 50+ companies target
  • Regional expansion (APAC)
  • Remote job inclusion
  • Freelance opportunities

Advanced Features

  • Email alerts for new jobs
  • Saved searches
  • Application tracking
  • Company reviews integration

Data Intelligence

  • Salary predictions
  • Skill demand trends
  • Company growth indicators
  • Market analytics dashboard

API Access

  • Public JSON API
  • Webhooks for updates
  • Developer documentation
  • Rate-limited access

Technologies Used

Automation:

  • n8n workflow platform
  • Cron scheduling
  • Webhook triggers

Data Processing:

  • HTML parsing
  • JSON manipulation
  • Data normalization
  • Validation logic

Storage:

  • Supabase PostgreSQL
  • GitHub repository
  • JSON files
  • Static assets

Deployment:

  • GitHub Pages
  • Git automation
  • CDN distribution

Monitoring:

  • Error notifications
  • Performance tracking
  • Uptime monitoring

Conclusion

SG Jobs Aggregator demonstrates how modern workflow automation tools like n8n can build production-grade data platforms without traditional backend infrastructure. By orchestrating 20+ specialized workflows into a cohesive pipeline, it delivers daily-updated, comprehensive job listings to Singapore job seekers while showcasing data engineering best practices in web scraping, ETL, and automated deployment.

Impact Metrics

  • 4,000+ Total Listings
  • 20+ Workflows
  • 20 min Scraping Time
  • Daily Update Frequency