SG Jobs Aggregator – Data Integration Platform
The Problem
Singapore job seekers face fragmented job listings across hundreds of company career sites, with no centralized, up-to-date aggregation platform specific to the Singapore market.
My Approach
Built 20+ n8n workflows to scrape, normalize, and aggregate 4,000+ job listings in 20 minutes, deployed as an automated daily data platform on GitHub Pages.
Overview
SG Jobs Aggregator is a comprehensive data integration platform that automatically scrapes, normalizes, and publishes Singapore job listings from 20+ company career sites. Built entirely on n8n workflows, it demonstrates end-to-end data engineering from heterogeneous sources to a clean, user-friendly web interface.
The Problem
Fragmented Job Market
Scattered Listings:
- Each company maintains its own career site
- No standard format or structure
- Different application processes
- Varying update frequencies
Time-Consuming Search:
- Job seekers visit dozens of sites manually
- Difficult to track new postings
- Miss opportunities between visits
- Repetitive checking across platforms
Existing Solutions Fall Short:
- General job boards miss company-specific listings
- Paid aggregators focus on global markets
- Singapore-specific sites are incomplete
- No automation for updates
Data Heterogeneity:
- Different HTML structures per site
- Various API formats
- Inconsistent metadata
- Diverse categorization schemes
The Solution
Comprehensive n8n Workflow System
20+ Custom Workflows:
Each workflow is tailored to a specific company:
- Custom HTML parsers
- API integrations
- Data extraction logic
- Error handling
- Rate limiting
Supports Multiple Source Types:
- Static HTML career pages
- Dynamic JavaScript-rendered sites
- REST APIs
- GraphQL endpoints
- RSS/Atom feeds
End-to-End Data Platform
Complete Pipeline Architecture:
Information Gathering
↓
Data Cleaning & Normalization
↓
Database Insert
↓
JSON Generation
↓
index.html Generation
↓
GitHub Pages Deployment
Fully Automated:
- Scheduled daily executions
- No manual intervention
- Automatic error recovery
- Self-healing workflows
Fast Aggregation
Performance Metrics:
- 4,000+ listings aggregated
- 20 minutes total execution time
- 20+ workflows run in parallel
- Daily updates automatically scheduled
Optimization Strategies:
- Parallel workflow execution
- Efficient scraping patterns
- Smart rate limiting
- Incremental updates
Technical Implementation
Phase 1: Information Gathering
Web Scraping Techniques:
For Static Sites:
// n8n HTTP Request + HTML Extract
GET company-careers-url
→ Parse HTML
→ Extract job listings
→ Structure data
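The static-site flow above runs as n8n nodes in the project. The same fetch, parse, extract, structure idea can be sketched in Python as follows. This is a minimal illustration, not the project's code: the `job-link` class name, the `JobLinkParser` class, and the sample markup are all hypothetical, and a real career page would need its own selectors.

```python
from html.parser import HTMLParser


class JobLinkParser(HTMLParser):
    """Collects title/URL pairs from <a class="job-link"> elements."""

    def __init__(self):
        super().__init__()
        self.jobs = []
        self._href = None  # href of the job link currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        # "job-link" is a hypothetical class name; real sites differ.
        if tag == "a" and "job-link" in (attr.get("class") or "").split():
            self._href = attr.get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            self.jobs.append({"title": title, "applyUrl": self._href})
            self._href = None


def extract_jobs(html):
    """Parse an HTML string and return the structured job list."""
    parser = JobLinkParser()
    parser.feed(html)
    parser.close()
    return parser.jobs


sample = (
    '<ul><li><a class="job-link" href="/jobs/1">Data Engineer</a></li>'
    '<li><a href="/about">About</a></li></ul>'
)
print(extract_jobs(sample))
```

Only the link carrying the assumed class is collected; navigation links such as the "About" entry are ignored.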
For Dynamic Sites:
- JavaScript rendering
- API endpoint discovery
- XHR request interception
- JSON data extraction
For API-First Companies:
- REST API integration
- Authentication handling
- Pagination logic
- Response parsing
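The pagination logic for API-first sources can be sketched as a small generator. The response shape used here (`jobs` and `hasNext` keys) and the `fake_fetch` stand-in are assumptions for illustration only; each real API has its own paging fields and would be called over HTTP.

```python
from typing import Callable, Iterator


def fetch_all_jobs(fetch_page: Callable[[int], dict],
                   max_pages: int = 100) -> Iterator[dict]:
    """Yield jobs page by page until the API reports no further page."""
    page = 1
    while page <= max_pages:  # hard cap guards against endless paging
        payload = fetch_page(page)
        yield from payload.get("jobs", [])
        if not payload.get("hasNext"):
            break
        page += 1


# Stand-in for a real HTTP call (hypothetical data).
FAKE_PAGES = {
    1: {"jobs": [{"title": "Analyst"}, {"title": "Engineer"}], "hasNext": True},
    2: {"jobs": [{"title": "Designer"}], "hasNext": False},
}


def fake_fetch(page: int) -> dict:
    return FAKE_PAGES[page]


all_jobs = list(fetch_all_jobs(fake_fetch))
print([job["title"] for job in all_jobs])
```

Passing the page fetcher in as a function keeps the paging loop independent of any one company's endpoint or authentication scheme.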
Challenges Solved:
- Anti-scraping measures
- Rate limiting
- Dynamic content loading
- Authentication requirements
- CAPTCHA avoidance
Phase 2: Data Cleaning & Normalization
Heterogeneous Data → Unified Schema:
Standardized Fields:
{
  "company": "string",
  "title": "string",
  "location": "string",
  "type": "Full-time | Part-time | Contract | Internship",
  "category": "Engineering | Sales | Marketing | ...",
  "applyUrl": "string",
  "postedDate": "ISO 8601",
  "description": "string",
  "metadata": {}
}
Normalization Steps:
- Title Cleaning: Remove company names, standardize formats
- Location Parsing: Extract city, country, remote status
- Type Classification: Map various terms to standard types
- Category Inference: Classify by keywords and patterns
- Date Standardization: Convert to consistent format
- URL Validation: Ensure valid application links
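A few of these normalization steps (title cleaning, type classification, date standardization, URL validation) can be sketched in one function. The alias table, the accepted date formats, and the "Company - " title prefix rule are assumptions chosen for illustration; the project's actual rules live in its n8n workflows and are not shown in this write-up.

```python
from datetime import datetime
from urllib.parse import urlparse

# Hypothetical alias table; real sources use many more variants.
TYPE_ALIASES = {
    "full time": "Full-time", "full-time": "Full-time", "permanent": "Full-time",
    "part time": "Part-time", "part-time": "Part-time",
    "contract": "Contract", "temporary": "Contract",
    "intern": "Internship", "internship": "Internship",
}
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")  # assumed input formats


def parse_date(text):
    """Return an ISO 8601 date string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def is_valid_url(url):
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def normalize_job(raw, company):
    """Map one raw scraped record onto a subset of the unified schema."""
    title = raw["title"].strip()
    prefix = company.lower() + " - "
    if title.lower().startswith(prefix):  # title cleaning: drop "Company - "
        title = title[len(prefix):].strip()
    url = raw.get("applyUrl", "")
    return {
        "company": company,
        "title": title,
        "type": TYPE_ALIASES.get(raw.get("type", "").strip().lower()),
        "postedDate": parse_date(raw.get("postedDate", "")),
        "applyUrl": url if is_valid_url(url) else None,
    }


raw = {"title": "Acme - Data Engineer ", "type": "Permanent",
       "postedDate": "01/12/2025", "applyUrl": "https://example.com/jobs/1"}
print(normalize_job(raw, "Acme"))
```

Unknown job types, unparseable dates, and invalid URLs come back as `None` rather than a guessed value, so later completeness checks can flag them.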
Data Quality:
- Deduplication logic
- Validation rules
- Completeness checks
- Consistency enforcement
Phase 3: Database Management
Supabase PostgreSQL:
Schema Design:
CREATE TABLE jobs (
    id SERIAL PRIMARY KEY,
    company VARCHAR(255),
    title TEXT,
    location VARCHAR(255),
    type VARCHAR(50),
    category VARCHAR(100),
    apply_url TEXT,
    posted_date TIMESTAMP,
    description TEXT,
    metadata JSONB,
    scraped_at TIMESTAMP,
    updated_at TIMESTAMP
);
Operations:
- Upsert logic (insert or update)
- Historical tracking
- Expired job removal
- Performance indexing
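The upsert operation can be illustrated with a small runnable sketch. Two assumptions are made here: Python's built-in `sqlite3` stands in for Supabase PostgreSQL so the example runs without a server, and `apply_url` carries a UNIQUE constraint to serve as the conflict target (the schema above does not declare one). The `ON CONFLICT ... DO UPDATE` clause has the same shape in PostgreSQL.

```python
import sqlite3

# In-memory SQLite as a stand-in for the Supabase PostgreSQL database.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE jobs (
        id INTEGER PRIMARY KEY,
        company TEXT,
        title TEXT,
        apply_url TEXT UNIQUE,  -- assumed conflict target
        scraped_at TEXT
    )
""")

UPSERT = """
    INSERT INTO jobs (company, title, apply_url, scraped_at)
    VALUES (:company, :title, :apply_url, :scraped_at)
    ON CONFLICT (apply_url) DO UPDATE SET
        title = excluded.title,
        scraped_at = excluded.scraped_at
"""


def upsert_job(job):
    """Insert a new job, or refresh the existing row with the same URL."""
    with conn:  # commits on success, rolls back on error
        conn.execute(UPSERT, job)


upsert_job({"company": "Acme", "title": "Data Engineer",
            "apply_url": "https://example.com/jobs/1",
            "scraped_at": "2025-12-01T08:00:00Z"})
# Second scrape of the same posting with an edited title.
upsert_job({"company": "Acme", "title": "Senior Data Engineer",
            "apply_url": "https://example.com/jobs/1",
            "scraped_at": "2025-12-02T08:00:00Z"})

rows = conn.execute("SELECT title, scraped_at FROM jobs").fetchall()
print(rows)
```

Because the second call hits the same URL, the table still holds a single row, now with the newer title and scrape time.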
Phase 4: Output Generation
Multi-Format Export:
JSON API:
{
  "lastUpdated": "2025-12-01T08:00:00Z",
  "totalJobs": 4127,
  "companies": 23,
  "jobs": [...]
}
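A payload of this shape could be assembled as sketched below. The `build_export` helper is hypothetical; only the four top-level keys come from the example above.

```python
import json
from datetime import datetime, timezone


def build_export(jobs, now=None):
    """Wrap the job list in the summary envelope used by the JSON output."""
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return {
        "lastUpdated": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "totalJobs": len(jobs),
        "companies": len({job["company"] for job in jobs}),  # distinct names
        "jobs": jobs,
    }


jobs = [
    {"company": "Acme", "title": "Data Engineer"},
    {"company": "Acme", "title": "Designer"},
    {"company": "Globex", "title": "Analyst"},
]
export = build_export(jobs, now=datetime(2025, 12, 1, 8, 0, tzinfo=timezone.utc))
print(json.dumps(export, indent=2))
```

Deriving `totalJobs` and `companies` from the list itself keeps the summary counts consistent with the data that is actually published.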
Static HTML:
- Searchable table interface
- Filter by company, type, category
- Sort by date, company, title
- Responsive design
- Fast client-side operations
GitHub Pages Deployment:
- Automatic git commits
- Static site generation
- CDN distribution
- Zero hosting cost
Phase 5: Orchestration & Scheduling
n8n Cron Triggers:
- Daily execution at 8 AM SGT
- Parallel workflow launches
- Result aggregation
- Error notifications
Workflow Dependencies:
Scraping Workflows (parallel)
↓
Aggregation Workflow
↓
Cleaning Workflow
↓
Database Workflow
↓
Generation Workflow
↓
Deployment Workflow
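The first stage of this dependency chain, parallel scrapers feeding one aggregation step, can be sketched with a thread pool. In the project this is handled by n8n itself; the `run_scrapers` function and the fake scrapers below are hypothetical and exist only to show the fan-out and fan-in pattern.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_scrapers(scrapers, max_workers=8):
    """Run every scraper in parallel and return (jobs, errors)."""
    jobs, errors = [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fn): name for name, fn in scrapers.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                jobs.extend(future.result())
            except Exception as exc:  # one failing source must not stop the rest
                errors[name] = str(exc)
    return jobs, errors


def scrape_acme():
    return [{"company": "Acme", "title": "Data Engineer"}]


def scrape_globex():
    raise RuntimeError("HTTP 503")


jobs, errors = run_scrapers({"acme": scrape_acme, "globex": scrape_globex})
print(len(jobs), errors)
```

Collecting errors per source instead of raising lets the later cleaning, database, and generation stages proceed with whatever data did arrive.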
Error Handling:
- Retry logic for failed scrapes
- Fallback data sources
- Alert notifications
- Manual intervention triggers
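Retry logic for failed scrapes is commonly written with exponential backoff. The sketch below is one such approach, not necessarily the one used in the workflows; `with_retries`, its defaults, and the flaky task are hypothetical. The `sleep` parameter is injectable so the behaviour can be checked without waiting.

```python
import time


def with_retries(task, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call task(); on failure wait base_delay, 2x, 4x, ... then retry."""
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == attempts:
                raise  # out of retries: surface the error for alerting
            sleep(base_delay * 2 ** (attempt - 1))


calls = {"count": 0}
delays = []


def flaky_scrape():
    """Fails twice, then succeeds (simulated transient error)."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return ["job"]


result = with_retries(flaky_scrape, sleep=delays.append)
print(result, delays)
```

When the final attempt also fails, the exception propagates, which is the point where an alert or a manual-intervention trigger would take over.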
n8n Workflow Architecture
Individual Company Workflows
Template Structure:
Nodes:
- Schedule Trigger: Cron-based execution
- HTTP Request: Fetch career page/API
- Data Extraction: Parse HTML or JSON
- Transformation: Normalize to schema
- Validation: Check data quality
- Database Insert: Upsert to Supabase
- Error Handler: Log and notify failures
Example: Tech Company API Workflow
Cron Trigger (daily 8am)
↓
HTTP Request → GET /api/jobs
↓
JSON Parse → Extract jobs array
↓
Loop → For each job
↓
Transform → Map to standard schema
↓
Validate → Check required fields
↓
Supabase → Upsert job record
↓
Success/Error notification
Aggregation Workflow
Combines All Sources:
Process:
- Wait for scraping workflows to complete
- Query Supabase for all recent jobs
- Apply global deduplication
- Generate aggregated JSON
- Build static HTML
- Commit to GitHub
- Trigger Pages deployment
Data Pipeline Features
1. Automatic Deduplication
Strategies:
- URL-based matching
- Fuzzy title comparison
- Company + title + location combo
- Temporal proximity checks
Prevents:
- Duplicate listings
- Data bloat
- User confusion
- Processing overhead
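Two of the strategies listed above, URL-based matching and the company + title + location combination, can be sketched together. The field names follow the unified schema; the helper names and the exact normalization (lowercasing, whitespace collapsing, trailing-slash removal) are assumptions. Fuzzy title comparison and temporal checks are not covered here.

```python
def dedup_key(job):
    """Composite key: company, title, location, case and spacing ignored."""
    return tuple(
        " ".join(job[field].lower().split())
        for field in ("company", "title", "location")
    )


def deduplicate(jobs):
    """Keep the first occurrence of each job, matching on URL or composite key."""
    seen_urls, seen_keys, unique = set(), set(), []
    for job in jobs:
        url = job["applyUrl"].rstrip("/")
        key = dedup_key(job)
        if url in seen_urls or key in seen_keys:
            continue
        seen_urls.add(url)
        seen_keys.add(key)
        unique.append(job)
    return unique


jobs = [
    {"company": "Acme", "title": "Data Engineer", "location": "Singapore",
     "applyUrl": "https://example.com/jobs/1"},
    {"company": "Acme", "title": "Data Engineer", "location": "Singapore",
     "applyUrl": "https://example.com/jobs/1/"},      # same URL, trailing slash
    {"company": "ACME", "title": "data  engineer", "location": "singapore",
     "applyUrl": "https://boards.example.org/acme/1"},  # same job, other board
    {"company": "Acme", "title": "Designer", "location": "Singapore",
     "applyUrl": "https://example.com/jobs/2"},
]
print([job["title"] for job in deduplicate(jobs)])
```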
2. Category Classification
Intelligent Categorization:
- Keyword-based classification
- Job title analysis
- Description NLP (planned)
- Company industry mapping
Categories:
- Engineering
- Data & Analytics
- Sales & Business Development
- Marketing
- Operations
- Finance
- Human Resources
- Customer Success
- Design
- Product Management
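Keyword-based classification of job titles can be sketched as an ordered lookup. The keyword lists below are hypothetical and cover only some of the categories; the first matching category wins, so the order of the table is itself a design decision.

```python
import re

# Hypothetical keyword table; earlier entries take priority.
CATEGORY_KEYWORDS = {
    "Data & Analytics": ("data", "analytics", "analyst"),
    "Engineering": ("engineer", "developer", "devops"),
    "Sales & Business Development": ("sales", "business"),
    "Marketing": ("marketing", "brand"),
    "Design": ("design", "ux"),
}


def classify(title):
    """Return the first category whose keyword prefixes a word in the title."""
    words = re.findall(r"[a-z]+", title.lower())
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(word.startswith(kw) for word in words for kw in keywords):
            return category
    return None  # left unclassified rather than guessed


for title in ("Senior Software Engineer", "Data Analyst",
              "Account Executive, Sales", "Barista"):
    print(title, "->", classify(title))
```

Prefix matching lets "Engineers" or "Designer" hit the "engineer" and "design" keywords. With this ordering a title such as "Data Engineer" lands in Data & Analytics, which may or may not be the desired outcome.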
3. Link Validation
Ensures Quality:
- HTTP status checks
- Redirect following
- Dead link removal
- Canonical URL resolution
User Experience:
- No broken application links
- Direct to actual job page
- Tracked click-throughs
- Working apply buttons
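Canonical URL resolution can be approximated offline by normalizing the URL string, as sketched below. This is an assumption about what canonicalization involves here: the list of tracking parameters is hypothetical, and the HTTP status checks and redirect following mentioned above require network requests and are left out.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical set of query parameters that do not identify the job page.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "ref"}


def canonical_url(url):
    """Lowercase the host, drop tracking params, fragment, and trailing slash."""
    parts = urlsplit(url.strip())
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    ]
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), path, urlencode(query), "")
    )


print(canonical_url("https://Example.com/jobs/42/?utm_source=feed&id=7#apply"))
```

Normalizing URLs this way before storage also makes URL-based deduplication more reliable, since two links to the same posting compare equal.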
4. Weekly Update Automation
Scheduled Maintenance:
- Remove expired listings
- Refresh active jobs
- Update company data
- Regenerate outputs
- Performance optimization
Key Features
User-Facing
Job Seeker Benefits:
- 4,000+ current listings
- Daily updates
- Single search location
- Fast filtering
- Direct application links
- Mobile-responsive
Search Capabilities:
- Full-text search
- Company filter
- Job type filter
- Category filter
- Location filter
- Date sorting
Technical
Data Engineering:
- ETL pipeline automation
- Multi-source integration
- Scheduled daily batch processing
- Scalable architecture
Reliability:
- Error recovery
- Data validation
- Monitoring alerts
- Uptime tracking
Performance:
- 20-minute full refresh
- Parallel processing
- Efficient storage
- Fast page loads
Impact & Metrics
Coverage:
- 20+ companies tracked
- 4,000+ active listings
- Daily scheduled updates
- Focused on the Singapore market
Performance:
- 20 minutes full aggregation
- Daily automatic execution
- Zero manual intervention
- Self-healing workflows
Cost:
- $0 hosting (GitHub Pages)
- Minimal compute (n8n cloud free tier)
- Low API costs
- High value delivery
User Value:
- Single source for SG jobs
- Time saved vs. manual search
- Never miss new postings
- Clean, organized interface
Use Cases
1. Active Job Seekers
- Daily check for new listings
- Filter by preferences
- Track application progress
- Market research
2. Career Explorers
- Survey available opportunities
- Understand market demand
- Identify growing companies
- Salary research
3. Recruiters
- Competitive intelligence
- Market mapping
- Company tracking
- Candidate sourcing
4. Data Analysts
- Job market trends
- Hiring patterns
- Skill demand analysis
- Economic indicators
Technical Challenges & Solutions
Challenge 1: Heterogeneous Data Sources
Problem: Each company's career site has a unique structure
Solution:
- Custom workflow per company
- Flexible extraction patterns
- Adaptive normalization
- Template system for common patterns
Challenge 2: Rate Limiting
Problem: Frequent automated requests risk being blocked
Solution:
- Respect robots.txt
- Implement delays
- Rotate user agents
- Use API when available
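Respecting robots.txt and spacing out requests can be sketched with Python's standard `urllib.robotparser`. The robots.txt content, the `sg-jobs-bot` user agent, and the one-second fallback delay are hypothetical; a live run would fetch each site's real robots.txt instead of parsing an inline string.

```python
from urllib.robotparser import RobotFileParser

# Inline example rules (hypothetical); normally fetched from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 2
"""

USER_AGENT = "sg-jobs-bot"  # hypothetical user agent string


def build_robots(robots_txt):
    """Parse robots.txt text into a queryable rule set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_delay(robots, default=1.0):
    """Use the site's Crawl-delay when given, else a default pause."""
    return robots.crawl_delay(USER_AGENT) or default


robots = build_robots(ROBOTS_TXT)
print(robots.can_fetch(USER_AGENT, "https://example.com/careers"))
print(robots.can_fetch(USER_AGENT, "https://example.com/admin/users"))
print(polite_delay(robots))
```

A scraper would check `can_fetch` before each request and sleep for `polite_delay` seconds between requests to the same host.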
Challenge 3: Data Freshness
Problem: Jobs expire quickly
Solution:
- Daily updates
- Automatic cleanup
- Posted date tracking
- Staleness detection
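Staleness detection can be sketched as a cutoff on the last time a listing was seen by a scrape. The two-day threshold and the `split_by_freshness` helper are assumptions; the idea is that a job whose `scraped_at` has not been refreshed by recent daily runs has probably been taken down.

```python
from datetime import datetime, timedelta, timezone


def split_by_freshness(jobs, now, max_age=timedelta(days=2)):
    """Separate jobs seen within max_age from those that have gone stale."""
    cutoff = now - max_age
    fresh, stale = [], []
    for job in jobs:
        (fresh if job["scraped_at"] >= cutoff else stale).append(job)
    return fresh, stale


now = datetime(2025, 12, 5, 8, 0, tzinfo=timezone.utc)
jobs = [
    {"title": "Data Engineer",
     "scraped_at": datetime(2025, 12, 5, 8, 0, tzinfo=timezone.utc)},
    {"title": "Designer",
     "scraped_at": datetime(2025, 12, 1, 8, 0, tzinfo=timezone.utc)},
]
fresh, stale = split_by_freshness(jobs, now)
print([j["title"] for j in fresh], [j["title"] for j in stale])
```

Allowing more than one day before a listing counts as stale gives a source that failed a single daily run a chance to recover before its jobs are cleaned up.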
Challenge 4: Deployment Automation
Problem: Manual publishing is tedious
Solution:
- Git automation
- GitHub Pages integration
- Atomic deployments
- Rollback capability
Future Enhancements
Planned Features:
Expanded Coverage
- Target of 50+ companies
- Regional expansion (APAC)
- Remote job inclusion
- Freelance opportunities
Advanced Features
- Email alerts for new jobs
- Saved searches
- Application tracking
- Company reviews integration
Data Intelligence
- Salary predictions
- Skill demand trends
- Company growth indicators
- Market analytics dashboard
API Access
- Public JSON API
- Webhooks for updates
- Developer documentation
- Rate-limited access
Technologies Used
Automation:
- n8n workflow platform
- Cron scheduling
- Webhook triggers
Data Processing:
- HTML parsing
- JSON manipulation
- Data normalization
- Validation logic
Storage:
- Supabase PostgreSQL
- GitHub repository
- JSON files
- Static assets
Deployment:
- GitHub Pages
- Git automation
- CDN distribution
Monitoring:
- Error notifications
- Performance tracking
- Uptime monitoring
Links
- Live Site: eaziym.github.io/sg-jobs
- GitHub: github.com/eaziym/sg-jobs
- API Endpoint: Coming soon
Conclusion
SG Jobs Aggregator demonstrates how modern workflow automation tools like n8n can build production-grade data platforms without traditional backend infrastructure. By orchestrating 20+ specialized workflows into a cohesive pipeline, it delivers daily-updated, comprehensive job listings to Singapore job seekers while showcasing data engineering best practices in web scraping, ETL, and automated deployment.