PySpark Data Pipeline for Serasa Experian's Credit Analysis

Challenge

At Dextra Digital, I faced the challenge of processing 10M+ daily records for Serasa Experian's credit analysis system. Their existing solution couldn't handle the growing data volume, causing significant processing delays and affecting business decisions.

Technical Solution

I designed and implemented ETL data pipelines using PySpark and Hadoop that efficiently processed transaction data across distributed clusters. I focused on optimizing join operations and implementing custom partitioning strategies to handle skewed data distributions.
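One common way to handle skewed joins of this kind is key salting: hot keys on the large side get a random salt suffix, and the small side is replicated once per salt value so the join still matches. The sketch below shows the idea in plain Python for readability (the actual pipeline ran on PySpark; all names and the salt count are illustrative, not the production values):

```python
import random

NUM_SALTS = 4  # illustrative; in practice tuned to the observed skew

def salt_large_side(records, hot_keys, num_salts=NUM_SALTS):
    """Append a random salt to hot keys so one key spreads across many partitions."""
    salted = []
    for key, value in records:
        if key in hot_keys:
            salted.append(((key, random.randrange(num_salts)), value))
        else:
            salted.append(((key, 0), value))  # cold keys keep a fixed salt
    return salted

def replicate_small_side(records, hot_keys, num_salts=NUM_SALTS):
    """Replicate small-side rows once per salt so every salted key finds a match."""
    replicated = []
    for key, value in records:
        salts = range(num_salts) if key in hot_keys else [0]
        for s in salts:
            replicated.append(((key, s), value))
    return replicated

def join(left, right):
    """Plain hash join on the (key, salt) composite key."""
    index = {}
    for k, v in right:
        index.setdefault(k, []).append(v)
    return [(k[0], lv, rv) for k, lv in left for rv in index.get(k, [])]
```

With a hot customer key spread over four salts, each transaction still joins to exactly one customer row, but the work for that key is no longer concentrated in a single partition:

```python
transactions = [("cust_a", 10), ("cust_a", 20), ("cust_b", 5)]
customers = [("cust_a", "SP"), ("cust_b", "RJ")]
result = join(salt_large_side(transactions, {"cust_a"}),
              replicate_small_side(customers, {"cust_a"}))
```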

Technologies: PySpark, Hadoop, PostgreSQL, Python

Business Impact

The new data pipeline reduced processing time by 60%, allowing Serasa to analyze credit transactions in near real-time. This improved their ability to detect fraudulent patterns and provide more accurate credit assessments to financial institutions across Brazil.

Wildfire Detection Data Processing at Sintecsys

Challenge

At Sintecsys, I needed to build a reliable system to process and analyze 100K+ images daily from remote cameras to detect early-stage wildfires in real-time—a critical environmental and safety application.

Technical Solution

I developed an image processing pipeline that extracted key visual features and fed them into a machine learning detection system. The solution included automated data validation to handle corrupted images and varying lighting conditions that could trigger false positives.
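The frame-validation step can be sketched as a cheap pre-filter that rejects unreadable or badly exposed frames before they reach the detector. This is a simplified stand-in using plain Python lists of grayscale values (the real pipeline used OpenCV image arrays; the thresholds here are illustrative):

```python
def validate_frame(pixels, low=10, high=245):
    """Reject frames that are empty or over/under-exposed.

    `pixels` is a 2-D list of 8-bit grayscale values; `low` and `high`
    are illustrative thresholds, not the production tuning.
    """
    if not pixels or not pixels[0]:
        return False, "empty or corrupted frame"
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    if mean < low:
        return False, "underexposed (night or obstructed lens)"
    if mean > high:
        return False, "overexposed (sun glare)"
    return True, "ok"
```

Filtering on simple global statistics like this is what keeps glare and nighttime frames from ever reaching the ML model, which is where many false positives would otherwise originate.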

Technologies: Python, OpenCV, TensorFlow

Business Impact

The system successfully detected 500+ early-stage fires, significantly reducing response time and environmental damage. The pipeline's 98% accuracy rate provided reliable alerts that forestry agencies could trust, and false positives were reduced by 75%.

Vehicle Recognition System at Multiway

Challenge

At Multiway, I needed to build a backend system capable of processing and analyzing 1M+ vehicles daily with high detection accuracy for smart city applications. The existing Java-based system was struggling with performance issues and high maintenance costs.

Technical Solution

I led the migration from the legacy Java stack to a Python-based solution using TensorFlow for vehicle recognition algorithms. I designed a highly efficient data processing pipeline that handled multiple camera streams simultaneously, implemented license plate recognition algorithms, and created a real-time database indexing system to enable fast querying of vehicle history.
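The indexing idea behind the fast vehicle-history queries can be sketched with a capped per-plate sighting list: the latest sightings for each plate stay in a fast store so lookups never scan the full table. This in-memory version stands in for the production setup (which used Redis and PostgreSQL); class and field names are illustrative:

```python
from collections import defaultdict, deque

class PlateIndex:
    """In-memory sketch of a per-plate sighting index.

    Keeps only the most recent sightings per plate, so a history query
    is a direct lookup rather than a scan over all recorded vehicles.
    """

    def __init__(self, max_sightings=100):
        self.max_sightings = max_sightings
        self._by_plate = defaultdict(deque)

    def record(self, plate, camera_id, timestamp):
        sightings = self._by_plate[plate]
        sightings.append((timestamp, camera_id))
        if len(sightings) > self.max_sightings:
            sightings.popleft()  # evict the oldest sighting

    def history(self, plate):
        """Return sightings, most recent first."""
        return list(reversed(self._by_plate.get(plate, deque())))
```

The same capped-list pattern maps naturally onto Redis lists (push the new sighting, trim to the cap), which is what makes recent-history queries cheap even at a million vehicles per day.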

Technologies: Python, TensorFlow, Flask APIs, PostgreSQL, Redis

Business Impact

The redesigned system achieved 98% detection accuracy while processing over 1 million vehicles daily. We realized a 70% performance improvement over the previous solution and achieved full SOC 2 compliance for data handling. This enabled the expansion of the smart city platform to additional municipalities.

Real-time Network Monitoring at GPr Sistemas

Challenge

GPr Sistemas needed a scalable system to monitor 10,000+ ATM devices across various banks in real-time, with strict requirements for alert response times and uptime monitoring.

Technical Solution

I developed an SNMP-based monitoring backend that continuously collected performance metrics and operational status from banking network devices. The solution included a specialized time-series database for storing historical data, intelligent anomaly detection for preemptive alerts, and automated failover mechanisms.
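The anomaly-detection idea can be sketched as a rolling mean/standard-deviation check over each device's recent metric samples, flagging values that deviate sharply from the window. This is a simplified illustration of the approach, not the production detector; the window size, threshold, and warm-up count are illustrative:

```python
from collections import deque
from math import sqrt

class AnomalyDetector:
    """Rolling-statistics detector over a sliding window of metric samples.

    Flags a sample whose deviation from the window mean exceeds
    `threshold` standard deviations. Defaults are illustrative.
    """

    def __init__(self, window=60, threshold=3.0):
        self.window = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, value):
        is_anomaly = False
        if len(self.window) >= 10:  # require a minimum history before judging
            mean = sum(self.window) / len(self.window)
            var = sum((x - mean) ** 2 for x in self.window) / len(self.window)
            std = sqrt(var)
            if std > 0 and abs(value - mean) > self.threshold * std:
                is_anomaly = True
        self.window.append(value)
        return is_anomaly
```

Running one detector per device and per metric (response latency, error counts, and so on) is what turns raw SNMP polling data into preemptive alerts before a device fails outright.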

Technologies: Python, Django REST Framework, SNMP libraries, Bootstrap + jQuery

Business Impact

The system achieved sub-1-second alert response time for critical device failures, enabling intervention before customers were affected. We maintained 99.99% network monitoring uptime and provided banking clients with comprehensive real-time dashboards, significantly improving their operational visibility and reducing mean-time-to-repair for ATM issues.

Machine Learning Data Infrastructure for Security Ratings

Challenge

At SecurityScorecard, I needed to create resilient data ingestion pipelines to handle diverse security data sources with inconsistent formats and reliability issues, all while maintaining data accuracy for cybersecurity rating calculations.

Technical Solution

I built scalable ETL pipelines that standardized heterogeneous security data into consistent formats for analysis. The system included automated data validation, anomaly detection, and reconciliation processes to ensure data quality. I implemented backfill mechanisms to handle source system outages and recovery.
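The standardization step can be sketched as a per-source mapping onto a common schema that fails closed on bad rows. The field names and schema below are hypothetical stand-ins for illustration only (the real sources and schema differ):

```python
from datetime import datetime, timezone

def normalize(record, source):
    """Map a raw source record onto a common schema, or return None if invalid.

    Field names are illustrative; each real source had its own mapping.
    """
    ip = record.get("ip") or record.get("ip_address")
    ts = record.get("observed_at") or record.get("timestamp")
    if not ip or ts is None:
        return None  # fail closed: bad rows are quarantined, never rated
    if isinstance(ts, (int, float)):  # some feeds send epoch seconds
        ts = datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()
    return {"source": source, "ip": ip, "observed_at": ts}
```

Returning `None` for invalid rows, rather than guessing at missing values, is the design choice that protects downstream rating accuracy: quarantined rows can be reprocessed by the backfill mechanism once the source recovers.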

Technologies: Python, Snowflake, AWS data services, Data validation frameworks

Business Impact

The reliable data infrastructure enabled SecurityScorecard to rate 3x more third-party vendors per customer while maintaining data accuracy. This directly supported business growth and improved customer satisfaction as clients could evaluate more of their supply chain partners for security risks.

Need similar solutions for your business?

I help companies build high-performance, scalable systems that solve real business problems. Let's discuss how I can bring my expertise to your project.