Data Quality Impact on AI Models
Financial Risk Assessment Case Study
Exploring how data quality issues directly impact AI model performance in financial forecasting and risk assessment
Project Overview
Investment Portfolio Risk Assessment with Machine Learning
Business Objective
Develop an AI model to predict investment portfolio risk levels and optimize allocation strategies for high-net-worth clients. The model aims to identify potential market downturns 1-3 months in advance to protect client assets.
Data Sources
- Historical market data from three exchanges (10 years)
- Client transaction records and portfolio compositions
- Economic indicators and central bank policy data
- Alternative data sources (news sentiment, social media)
Technical Approach
A gradient-boosting model tuned for time-series financial data, with feature engineering to capture market volatility patterns. The model combines structured financial data with unstructured sentiment data.
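To make the feature-engineering step concrete, here is a minimal sketch of volatility features derived from a daily price series with pandas. The column name (close) and window lengths are illustrative assumptions, not the production feature set.

import numpy as np
import pandas as pd

def add_volatility_features(prices: pd.DataFrame) -> pd.DataFrame:
    # Assumes a 'close' column of daily prices indexed by trading date
    out = prices.copy()
    out['log_return'] = np.log(out['close']).diff()        # daily log returns
    out['vol_21d'] = out['log_return'].rolling(21).std()   # ~1 trading month
    out['vol_63d'] = out['log_return'].rolling(63).std()   # ~1 quarter
    # Short- vs long-horizon volatility ratio as a simple regime-shift signal
    out['vol_ratio'] = out['vol_21d'] / out['vol_63d']
    return out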
Key Challenge
Data quality inconsistencies across multiple sources lead to misleading model results, creating a false sense of security about risk predictions that could expose clients to unexpected losses.
Data Exploration & Quality Issues
Identifying problematic patterns in financial datasets
- 42% of transaction records contain inconsistent timestamps
- 17% of portfolio data has missing asset classifications
- 28% of market data shows unexplained outliers
- 65% of sentiment data lacks proper source attribution
Dataset Completeness Analysis
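A sketch of how figures like those above can be derived with pandas, assuming hypothetical column names (timestamp, asset_class, source); the case study does not show its actual schema.

import pandas as pd

def completeness_report(df: pd.DataFrame) -> dict:
    # Share of timestamps that fail strict ISO-8601 UTC parsing
    bad_timestamps = pd.to_datetime(
        df['timestamp'], format='%Y-%m-%dT%H:%M:%SZ', errors='coerce'
    ).isna().mean()
    return {
        'inconsistent_timestamps': bad_timestamps,
        'missing_asset_class': df['asset_class'].isna().mean(),  # unclassified rows
        'missing_source': df['source'].isna().mean(),            # unattributed rows
    }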
Critical Data Quality Issues
Temporal Inconsistency
High impact: Transaction timestamps varied across systems, with some recording execution time and others settlement time, creating misleading patterns in time-sensitive models (format samples and a normalization sketch follow).
// Inconsistent timestamp formats in data
2023-04-15T09:30:00Z // UTC standard from exchange API
04/15/2023 9:30 AM // US format from internal systems
15-04-2023 // European format from partner bank
1681551000000 // Unix epoch milliseconds from trading platform
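One way to reconcile these formats is to try each known source format explicitly and convert everything to UTC, rather than letting a permissive parser guess. A minimal sketch, assuming the four formats above are exhaustive and that naive timestamps are UTC (in practice the source timezone must be confirmed per system):

import pandas as pd

KNOWN_FORMATS = ['%Y-%m-%dT%H:%M:%SZ', '%m/%d/%Y %I:%M %p', '%d-%m-%Y']

def normalize_timestamp(raw) -> pd.Timestamp:
    raw = str(raw).strip()
    if raw.isdigit():
        # Pure digits: treat as Unix epoch milliseconds
        return pd.Timestamp(int(raw), unit='ms', tz='UTC')
    for fmt in KNOWN_FORMATS:
        try:
            ts = pd.to_datetime(raw, format=fmt)
            # Naive results are assumed UTC here; verify per source system
            return ts.tz_localize('UTC') if ts.tzinfo is None else ts
        except ValueError:
            continue
    raise ValueError(f"Unrecognized timestamp format: {raw!r}")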
Classification Misalignment
Medium impact: Asset classifications varied between systems, with the same instruments categorized differently across data sources, causing incorrect risk exposure calculations. The table below shows three examples; a remapping sketch follows it.
| Asset ID | System A | System B | System C |
|---|---|---|---|
| ETF-3892 | Equity | Mixed Allocation | Equity-International |
| BOND-47290 | Fixed Income | Corporate Debt | High Yield |
| STR-982 | Alternative | Structured Product | Derivative |
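A minimal sketch of such a remapping, keyed by source system and source label; the canonical taxonomy shown is illustrative, not the firm's actual one:

# Canonical asset class keyed by (source system, source label) -- illustrative
CANONICAL_CLASS = {
    ('system_a', 'Equity'): 'equity',
    ('system_b', 'Mixed Allocation'): 'equity',
    ('system_c', 'Equity-International'): 'equity',
    ('system_a', 'Fixed Income'): 'fixed_income',
    ('system_b', 'Corporate Debt'): 'fixed_income',
    ('system_c', 'High Yield'): 'fixed_income',
    ('system_a', 'Alternative'): 'alternative',
    ('system_b', 'Structured Product'): 'alternative',
    ('system_c', 'Derivative'): 'alternative',
}

def canonical_class(source: str, label: str) -> str:
    try:
        return CANONICAL_CLASS[(source, label)]
    except KeyError:
        # Unmapped labels are surfaced for adjudication, never silently guessed
        raise KeyError(f"No canonical mapping for {label!r} from {source!r}") from None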
Survivorship Bias
Critical impact: Historical market data excluded delisted securities, creating overly optimistic performance metrics and blinding the model to market risk patterns.
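Detecting and correcting the bias requires a record of delisted names. A sketch, assuming a separate delistings file is available; the file names and columns (asset_id, delist_date) are hypothetical:

import pandas as pd

prices = pd.read_csv('historical_prices.csv', parse_dates=['date'])
delistings = pd.read_csv('delistings.csv', parse_dates=['delist_date'])

# Tag price rows belonging to assets that were later delisted, so they
# are retained when the training universe is built
prices = prices.merge(delistings, on='asset_id', how='left')
prices['later_delisted'] = prices['delist_date'].notna()

# Survivorship check: share of the historical universe that vanishes
# if only currently listed names are kept
share_lost = (prices.loc[prices['later_delisted'], 'asset_id'].nunique()
              / prices['asset_id'].nunique())
print(f"{share_lost:.1%} of historical assets are absent from a survivor-only universe")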
AI Model Results
Comparing predictions based on raw data vs. quality-corrected data
Model with Raw Data
Key Insights
- Model detects actual market downturns only 31% of the time (a high false-negative rate)
- Consistent underestimation of volatility in mixed-asset portfolios
- Over-confident in predicting stable periods: 92% precision but only 45% recall for high-risk events (illustrated in the sketch after this list)
- Feature importance heavily skewed toward recent market indicators
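To make the precision/recall gap above concrete, the sketch below computes both metrics on a toy set of labels; the numbers are illustrative, not the case-study data.

from sklearn.metrics import precision_score, recall_score

# 1 = high-risk period, 0 = stable -- toy labels and predictions
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# Precision: of the periods flagged high-risk, how many really were?
print(f"precision = {precision_score(y_true, y_pred):.2f}")  # 1.00
# Recall: of the truly high-risk periods, how many were flagged?
print(f"recall    = {recall_score(y_true, y_pred):.2f}")     # 0.50

A model can look trustworthy on precision while missing half the downturns it exists to catch, which is exactly the raw-data failure mode.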
Model with Quality-Corrected Data
Key Insights
- Lower headline accuracy, but a far higher detection rate for market downturns (76% vs. 31%)
- More balanced risk assessment across different asset classes
- Less overconfident but more reliable predictions (better-calibrated confidence intervals)
- Feature importance more evenly distributed across leading indicators
- Better performance in extreme market conditions
Feature Importance Comparison
Data Quality Impact Analysis
Translating technical issues into business outcomes
Financial Impact
Potential Client Portfolio Losses
Using the uncorrected model would have resulted in a 12.7% average portfolio underperformance during market corrections, representing approximately €94M in preventable client losses across the test period.
Regulatory Risk
The uncorrected model fails to meet regulatory standards for model risk management:
- Non-compliance with BCBS 239 principles for effective risk data aggregation
- Inadequate documentation of data lineage to meet GDPR accountability requirements
- Inability to explain model decisions to regulators due to data inconsistencies
- Potential fines of up to €10M or 2% of annual turnover, whichever is higher
Client Trust
Client Retention Simulation
Simulation shows that deploying the uncorrected model would lead to approximately 17% client attrition within 24 months due to unexpected portfolio performance during market downturns.
Key Lessons & Remediation Strategy
Data Governance
Implemented a central data dictionary with standardized definitions across all systems, with mandatory metadata requirements and lineage tracking.
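A sketch of what a single dictionary entry might contain; the fields are illustrative, since the case study does not show its actual schema:

# One illustrative data-dictionary entry -- all field names are assumptions
TRANSACTION_DATE_ENTRY = {
    'field': 'transaction_date',
    'definition': 'UTC execution time of the trade (never settlement time)',
    'type': 'datetime64[ns, UTC]',
    'format': '%Y-%m-%dT%H:%M:%SZ',
    'required': True,
    'owner': 'market-data-team',
    # Lineage: upstream source and the transform that standardized it
    'lineage': ['exchange_api.fills', 'etl.normalize_timestamps'],
}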
Quality Monitoring
Deployed automated data quality checks with data validation rules running in real-time, creating alerts for data drift or anomalies.
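As a sketch, here is a rule evaluated on each incoming batch that raises an alert when the share of bad records crosses a threshold; the rule, threshold, and logging hook are assumptions:

import logging
import pandas as pd

logger = logging.getLogger('data_quality')

def check_batch(batch: pd.DataFrame, max_bad_share: float = 0.02) -> bool:
    # Rule: timestamps must parse as strict ISO-8601 UTC
    parsed = pd.to_datetime(batch['transaction_date'],
                            format='%Y-%m-%dT%H:%M:%SZ', errors='coerce')
    bad_share = parsed.isna().mean()
    if bad_share > max_bad_share:
        # In production this would page on-call staff, not only log
        logger.warning("Timestamp rule failed: %.1f%% bad records", bad_share * 100)
        return False
    return True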
Model Transparency
Enhanced model documentation to include data quality metrics as part of model performance reporting, with clear disclosure of limitations.
Testing Strategy
Implemented adversarial testing with deliberately corrupted data to evaluate model robustness and establish performance boundaries.
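A sketch of the idea: corrupt a copy of the test features at increasing rates and measure how AUC degrades. Randomly nulled cells are one simple corruption scheme among many.

import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def auc_under_corruption(model, X_test, y_test, corruption_rate, seed=0):
    rng = np.random.default_rng(seed)
    # Null out a random subset of cells to simulate missing or garbled feeds
    # (XGBoost handles NaN inputs natively)
    mask = pd.DataFrame(rng.random(X_test.shape) < corruption_rate,
                        index=X_test.index, columns=X_test.columns)
    X_corrupt = X_test.mask(mask)
    return roc_auc_score(y_test, model.predict_proba(X_corrupt)[:, 1])

# Sweep corruption rates to establish a performance boundary, e.g.:
# for rate in (0.0, 0.05, 0.1, 0.2):
#     print(rate, auc_under_corruption(model, X_test, y_test, rate))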
Implementation & Technical Details
Resolving data quality issues and rebuilding the model
Data Quality Pipeline
Data Validation Example
import pandas as pd
import great_expectations as ge

# Load transaction data; keep timestamps as raw strings so their
# format can be validated before parsing
transactions_df = pd.read_csv('transactions.csv')

# Convert to a Great Expectations dataset (legacy pandas API)
ge_df = ge.from_pandas(transactions_df)

# Check for timestamp consistency: every value must match the
# ISO-8601 UTC format the exchange API emits
timestamp_format_validation = ge_df.expect_column_values_to_match_strftime_format(
    column="transaction_date",
    strftime_format="%Y-%m-%dT%H:%M:%SZ",
)

# Validate value ranges for transaction amounts
amount_validation = ge_df.expect_column_values_to_be_between(
    column="amount",
    min_value=0,
    max_value=10_000_000,
)

# Parse timestamps, then confirm the column now carries a datetime dtype
transactions_df["transaction_date"] = pd.to_datetime(
    transactions_df["transaction_date"], errors="coerce"
)
type_validation = ge.from_pandas(transactions_df).expect_column_values_to_be_of_type(
    column="transaction_date",
    type_="datetime64",
)

# Generate validation report
validation_report = {
    "timestamp_validation": timestamp_format_validation.success,
    "amount_validation": amount_validation.success,
    "type_validation": type_validation.success,
    "validation_failures": timestamp_format_validation.result["unexpected_count"],
}
quality_score = (
    validation_report["timestamp_validation"]
    + validation_report["amount_validation"]
    + validation_report["type_validation"]
) / 3 * 100
print(f"Data quality score: {quality_score:.0f}%")
Model Architecture
Model Training with Data Quality Weighting
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Load preprocessed data with per-row quality scores
X = pd.read_csv('features_with_quality.csv')
y = pd.read_csv('targets.csv').squeeze('columns')  # single target column as a Series

# Extract data quality scores to use as sample weights
data_quality_scores = X.pop('data_quality_score')

# Split features, targets, and weights together so they stay aligned
# (note: for time-series data a chronological split avoids look-ahead bias)
X_train, X_test, y_train, y_test, w_train, w_test = train_test_split(
    X, y, data_quality_scores, test_size=0.2, random_state=42
)

# Gradient-boosted classifier for the binary high-risk / low-risk target
model = xgb.XGBClassifier(
    objective='binary:logistic',
    n_estimators=100,
    max_depth=4,
    learning_rate=0.1,
)

# Weight each training example by its data quality score so that
# noisy records contribute less to the fit
model.fit(
    X_train,
    y_train,
    sample_weight=w_train,
    eval_set=[(X_test, y_test)],
    verbose=True,
)

# Evaluate on the held-out set
y_pred_proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_pred_proba)
print(f"Model AUC: {auc:.4f}")

# Rank features by importance
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': model.feature_importances_,
}).sort_values('importance', ascending=False)
print("Top 10 features:")
print(feature_importance.head(10))