ML Portfolio

Karla Altamirano
Data & ML Engineer

Software Engineer and Digital Transformation Specialist building machine learning solutions applied to real business problems. Based in Quito, Ecuador.

10 ML Projects
0.87 Best F1-score
1.3M+ Records analyzed
01

Projects

Project 01 — Classification

📉 Customer Churn Prediction

ML model to predict customer churn in the telecom sector. Compares Logistic Regression, Random Forest and XGBoost with SMOTE balancing. Identifies that month-to-month contracts have a 42.7% churn rate vs 2.8% for two-year contracts.

scikit-learn XGBoost SMOTE Classification Telco
0.87 AUC-ROC
7,043 Records
3 Models compared
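
The SMOTE balancing mentioned above works by interpolating between minority-class samples. The sketch below shows that core idea in plain Python, under simplifying assumptions: it uses only the single nearest neighbour (real SMOTE picks among k neighbours), and the `churners` data and the `smote_like` helper are hypothetical, not taken from the project.

```python
import random

def smote_like(minority, n_new, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and its nearest minority neighbour (simplified: k = 1)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest other minority point by squared Euclidean distance.
        neighbour = min(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )
        t = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

# Hypothetical churners described by (tenure in months, monthly charge).
churners = [(2.0, 70.0), (3.0, 85.0), (5.0, 90.0), (1.0, 75.0)]
new_points = smote_like(churners, n_new=4)
```

Every synthetic point lies on a segment between two real churners, so the minority class grows without exact duplicates.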

Project 02 — NLP

💬 Sentiment Analysis — Google Play Reviews

NLP pipeline to classify 27,948 app reviews as Positive, Negative or Neutral. TF-IDF vectorization with bigrams captures negations like "not good". Logistic Regression achieved an F1-score of 0.87 on the imbalanced 3-class dataset.

NLTK TF-IDF SMOTE NLP Google Play
0.87 F1-score
27,948 Reviews
3 Sentiment classes
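
To illustrate why bigrams help with negation, here is a minimal plain-Python sketch of TF-IDF over unigrams and bigrams. It assumes whitespace tokenization and a smoothed IDF without vector normalization; the `reviews` and the helper names are hypothetical and this is not the project's actual pipeline.

```python
import math
from collections import Counter

def ngram_tokens(text, n_max=2):
    """Return all unigrams and bigrams of a lower-cased, whitespace-split text."""
    words = text.lower().split()
    tokens = []
    for n in range(1, n_max + 1):
        tokens += [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return tokens

def tfidf(docs):
    """Weight each token by term count times a smoothed inverse document frequency."""
    tokenized = [ngram_tokens(d) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    weights = []
    for toks in tokenized:
        counts = Counter(toks)
        weights.append({
            tok: count * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1)
            for tok, count in counts.items()
        })
    return weights

# Hypothetical reviews.
reviews = ["good app", "not good app", "not bad"]
weights = tfidf(reviews)
```

In the second review, the bigram "not good" becomes its own feature and, being rarer than "good", receives a higher weight, which gives a linear classifier something to separate the negated case on.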

Project 03 — Unsupervised ML

🏭 Federal Violations Clustering

K-Means clustering on 76,310 U.S. companies with multi-agency federal enforcement records. Identifies 4 risk profiles from systemic violators to wage theft patterns. Applied log transformation + RobustScaler to handle extreme financial outliers.

K-Means PCA RobustScaler Clustering ESG
76,310 Companies
4 Risk profiles
7 Federal agencies
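
The log transformation plus robust scaling described above can be sketched as follows. This is a stdlib-only approximation of the idea (centre on the median, divide by the interquartile range), not the scikit-learn RobustScaler itself, and the `penalties` values are hypothetical.

```python
import math
import statistics

def log_robust_scale(values):
    """Apply log1p, then centre on the median and scale by the IQR."""
    logged = [math.log1p(v) for v in values]
    median = statistics.median(logged)
    q1, _, q3 = statistics.quantiles(logged, n=4, method="inclusive")
    iqr = q3 - q1
    return [(x - median) / iqr for x in logged]

# Hypothetical penalty amounts in dollars, with one extreme outlier.
penalties = [1_000, 2_500, 4_000, 6_000, 9_000, 250_000_000]
scaled = log_robust_scale(penalties)
```

Because the median and IQR are barely affected by the single huge value, the outlier stays distinguishable without dominating the distance computations that K-Means relies on.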

Project 04 — Imbalanced Classification

💳 Credit Card Fraud Detection

Fraud detection on 284,807 real transactions with extreme class imbalance (0.17% fraud). Dual strategy: SMOTE applied inside the pipeline, so that only training data is resampled, plus class_weight balancing. Random Forest detects 80 of 98 real frauds with only 11 false alarms.

Random Forest XGBoost SMOTE Fraud Detection Finance
0.87 Avg Precision
284K Transactions
0.17% Fraud rate
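
As a worked check, the counts quoted above (80 of 98 frauds detected, 11 false alarms) translate into precision, recall and F1 as follows. These values are derived here from the stated counts at a single decision threshold; they are not additional reported results, and they differ from Average Precision, which summarizes performance across thresholds.

```python
true_frauds = 98
detected = 80                      # true positives
false_alarms = 11                  # false positives
missed = true_frauds - detected    # false negatives

recall = detected / (detected + missed)
precision = detected / (detected + false_alarms)
f1 = 2 * precision * recall / (precision + recall)

print(f"recall={recall:.3f} precision={precision:.3f} f1={f1:.3f}")
# prints: recall=0.816 precision=0.879 f1=0.847
```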

Project 05 — Medical Classification

🦴 Vertebral Column — Orthopaedic Diagnosis

ML model to classify patients into 3 orthopaedic diagnoses (Disc Hernia, Normal, Spondylolisthesis) from 6 biomechanical measurements obtained from spinal X-rays. Spondylolisthesis detected with F1 of 0.98. Includes interactive HTML report.

Random Forest SVM Cross-Validation Healthcare UCI Dataset
0.87 F1-score
310 Patients
4 Models compared

Project 06 — E-Commerce Analytics

🛒 E-Commerce Customer Analytics

End-to-end analysis of 805,549 real transactions from a UK online retailer. Covers business EDA, RFM segmentation (K-Means k=4), sales forecasting with Gradient Boosting, and an item-based recommendation system using cosine similarity.

RFM K-Means Gradient Boosting Recommender E-Commerce
805K Transactions
4 RFM segments
2,387 Products
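
A minimal sketch of item-based recommendation with cosine similarity, as named above. It assumes a binary purchase matrix with one row per product and one column per customer; the products, the `purchases` data and the `similar_items` helper are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical purchase matrix: 1 means the customer in that column bought the product.
purchases = {
    "mug":     [1, 1, 0, 1],
    "teapot":  [1, 1, 0, 0],
    "lantern": [0, 0, 1, 1],
}

def similar_items(item, top_n=2):
    """Rank the other products by similarity of their customer vectors."""
    scores = {other: cosine(purchases[item], vec)
              for other, vec in purchases.items() if other != item}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

recommendations = similar_items("mug")
```

Products bought by the same customers end up with similar vectors, so "teapot" ranks above "lantern" for someone viewing "mug".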

Project 07 — D2C Analytics

✨ Skincare E-Commerce — Profitability & Customer Analytics

Full analysis of a D2C skincare brand across 6 relational tables. Part A: custom Profitability Score combining margin, discounts and return rate. Part B: CLV by acquisition channel, cohort retention analysis and return breakdown. Two interactive HTML reports.

CLV Cohort Analysis Profitability Score Skincare D2C
6 Relational tables
₹2,242 Avg CLV
78.6% Best retention
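
Cohort retention, as mentioned above, groups customers by the month of their first order and tracks what share of each group orders again in later months. The sketch below assumes orders reduced to (customer, month index) pairs; the data and the `cohort_retention` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical orders: (customer_id, month index).
orders = [("a", 0), ("a", 1), ("b", 0), ("c", 0), ("c", 2), ("d", 1), ("d", 2)]

def cohort_retention(orders):
    """Map (cohort month, months since first order) to the share of the cohort active."""
    first_month = {}
    for customer, month in sorted(orders, key=lambda o: o[1]):
        first_month.setdefault(customer, month)
    active = defaultdict(set)
    for customer, month in orders:
        cohort = first_month[customer]
        active[(cohort, month - cohort)].add(customer)
    # Offset 0 always exists for every cohort, so it serves as the cohort size.
    return {
        key: len(customers) / len(active[(key[0], 0)])
        for key, customers in active.items()
    }

retention = cohort_retention(orders)
```

Here the month-0 cohort has three customers, one of whom returns in the following month, so its month-1 retention is one third.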

Project 08 — Sports Prediction

⚽ FIFA World Cup 2026 — Match Predictor

ML model trained on 184 historical World Cup matches (1930–2022) to predict all 72 group stage fixtures for 2026. Logistic Regression with Bayesian smoothing and FIFA ranking features. Includes interactive HTML report with all group predictions and probability bars.

Logistic Regression Bayesian Smoothing FIFA Rankings Sports Analytics 2026
75.6% CV Accuracy
184 Historical matches
72 Matches predicted
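
With only 184 matches, some teams have very few World Cup games, which is where Bayesian smoothing helps. The sketch below shows one common form, shrinking a team's raw win rate toward a prior. The project's exact formula is not stated, so the `prior_rate` and `prior_weight` values and the function name are assumptions for illustration.

```python
def smoothed_win_rate(wins, matches, prior_rate=1 / 3, prior_weight=5):
    """Blend the observed win rate with a prior worth `prior_weight` pseudo-matches."""
    return (wins + prior_rate * prior_weight) / (matches + prior_weight)

# A team with one win in one match versus a team with a long record.
debutant = smoothed_win_rate(wins=1, matches=1)
veteran = smoothed_win_rate(wins=60, matches=100)
```

The debutant's raw 100% win rate is pulled down to about 0.44, while the veteran's 60% barely moves, so a single lucky result does not dominate the feature.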

Project 09 — Geospatial Analytics

🌋 Global Volcano Analysis

Geospatial clustering of 1,432 volcanoes worldwide using K-Means. The algorithm rediscovered the 4 major volcanic belts (Ring of Fire, Andes, East African Rift, Asian arc) without any geographic labels. Includes interactive Plotly world map with hover details for every volcano.

K-Means Plotly Geospatial NCEI Database Ring of Fire
1,432 Volcanoes
4 Clusters
81% Active

Project 10 — Predictive Health

🧬 Genetic Inheritance Analysis

Two models, two stories: a height predictor that couldn't beat an 1886 statistical formula (Galton's midparent rule), and a health risk classifier that achieved 78% accuracy using family disease history and parental age as the strongest predictors. Includes blood group inheritance patterns following real Mendelian ABO rules.

Gradient Boosting Regression Genetics Health Risk Mendelian Inheritance
78% Risk accuracy
3.2cm Height MAE
7,000 Families
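
For context on the baseline the height model was compared against, here is a sketch of Galton's midparent rule as it is usually stated: the mother's height is scaled by 1.08, the two parental heights are averaged, and the child is predicted to deviate from the population mean by about two thirds of the midparent deviation. The population mean and the input heights below are hypothetical, and the project's exact baseline may differ.

```python
def midparent_height(father_cm, mother_cm):
    """Average of the father's height and the mother's height scaled by 1.08."""
    return (father_cm + 1.08 * mother_cm) / 2

def predict_child_height(father_cm, mother_cm, population_mean_cm, slope=2 / 3):
    """Regress the midparent deviation toward the population mean."""
    mid = midparent_height(father_cm, mother_cm)
    return population_mean_cm + slope * (mid - population_mean_cm)

# Hypothetical parents and population mean; the result is on the male-equivalent scale.
prediction = predict_child_height(185.0, 170.0, population_mean_cm=175.0)
```

The midparent here is 184.3 cm, and the prediction regresses toward the mean to roughly 181.2 cm.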
02

Stack

Machine Learning

  • scikit-learn
  • XGBoost
  • imbalanced-learn
  • SMOTE

NLP

  • NLTK
  • TF-IDF
  • WordCloud
  • Text preprocessing

Data

  • pandas
  • numpy
  • RobustScaler
  • PCA

Visualization

  • matplotlib
  • seaborn
  • plotly

Engineering

  • Python 3.11
  • FastAPI
  • GeneXus
  • GitHub