Data Quality Impact on AI Models

Financial Risk Assessment Case Study

Exploring how data quality issues directly impact AI model performance in financial forecasting and risk assessment

Project Overview

Investment Portfolio Risk Assessment with Machine Learning

Business Objective

Develop an AI model to predict investment portfolio risk levels and optimize allocation strategies for high-net-worth clients. The model aims to identify potential market downturns 1-3 months in advance to protect client assets.

Data Sources

  • Historical market data from three exchanges (10 years)
  • Client transaction records and portfolio compositions
  • Economic indicators and central bank policy data
  • Alternative data sources (news sentiment, social media)

Technical Approach

Gradient Boosting model optimized for time-series financial data with feature engineering to capture market volatility patterns. The model uses both structured financial data and unstructured sentiment data.
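
As an illustration of the volatility-oriented feature engineering involved, the sketch below derives rolling-volatility features from a daily price series; the file name and column names (market_data.csv, close) are assumptions, not the project's actual schema.

import pandas as pd
import numpy as np

# Hypothetical daily market data with a 'close' price column (illustrative schema)
prices = pd.read_csv('market_data.csv', parse_dates=['date'], index_col='date')

# Log returns put day-over-day moves on a comparable scale
prices['log_return'] = np.log(prices['close']).diff()

# Rolling volatility over several horizons, annualized (252 trading days)
for window in (5, 21, 63):
    prices[f'volatility_{window}d'] = (
        prices['log_return'].rolling(window).std() * np.sqrt(252)
    )

# Short- vs. long-horizon volatility ratio as a simple regime-shift signal
prices['vol_ratio'] = prices['volatility_5d'] / prices['volatility_63d']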

Key Challenge

Data quality inconsistencies across multiple sources lead to misleading model results, creating a false sense of security about risk predictions that could expose clients to unexpected losses.

Data Exploration & Quality Issues

Identifying problematic patterns in financial datasets

  • 42% of transaction records contain inconsistent timestamps
  • 17% of portfolio data has missing asset classifications
  • 28% of market data shows unexplained outliers
  • 65% of sentiment data lacks proper source attribution

Dataset Completeness Analysis

Critical Data Quality Issues

Temporal Inconsistency

High Impact

Transaction timestamps varied across systems, with some recording execution time and others settlement time, creating misleading patterns in time-sensitive models.

// Inconsistent timestamp formats in data
2023-04-15T09:30:00Z   // UTC standard from exchange API
04/15/2023 9:30 AM     // US format from internal systems
15-04-2023             // European format from partner bank
1681551000000          // Unix epoch milliseconds from trading platform
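
One way to remediate this, sketched below, is to parse each source's format explicitly and normalize everything to timezone-aware UTC; the source names and format map are illustrative.

import pandas as pd

# Map each source system to the timestamp format it emits (illustrative)
SOURCE_FORMATS = {
    'exchange_api': '%Y-%m-%dT%H:%M:%SZ',   # ISO 8601, UTC
    'internal': '%m/%d/%Y %I:%M %p',        # US format
    'partner_bank': '%d-%m-%Y',             # European format, date only
}

def normalize_timestamp(value: str, source: str) -> pd.Timestamp:
    """Parse a raw timestamp according to its source system and return
    a timezone-aware UTC timestamp. (Simplified: assumes source times
    are already UTC; real systems would localize first.)"""
    if source == 'trading_platform':
        return pd.to_datetime(int(value), unit='ms', utc=True)  # epoch ms
    return pd.to_datetime(value, format=SOURCE_FORMATS[source], utc=True)

print(normalize_timestamp('04/15/2023 9:30 AM', 'internal'))
# 2023-04-15 09:30:00+00:00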

Classification Misalignment

Medium Impact

Asset classifications varied between systems, with the same instruments categorized differently across data sources, causing incorrect risk exposure calculations.

Asset ID     System A       System B            System C
ETF-3892     Equity         Mixed Allocation    Equity-International
BOND-47290   Fixed Income   Corporate Debt      High Yield
STR-982      Alternative    Structured Product  Derivative
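
A straightforward remediation is a crosswalk that maps every (system, label) pair onto one canonical asset class and fails loudly on unmapped pairs; the taxonomy below is illustrative, not the firm's actual one.

# Crosswalk from (source system, local label) to a canonical asset class
# (entries are illustrative)
CANONICAL_CLASS = {
    ('system_a', 'Equity'): 'equity',
    ('system_a', 'Fixed Income'): 'fixed_income',
    ('system_a', 'Alternative'): 'alternative',
    ('system_b', 'Mixed Allocation'): 'multi_asset',
    ('system_b', 'Corporate Debt'): 'fixed_income',
    ('system_b', 'Structured Product'): 'alternative',
    ('system_c', 'Equity-International'): 'equity',
    ('system_c', 'High Yield'): 'fixed_income',
    ('system_c', 'Derivative'): 'alternative',
}

def canonical_class(system: str, label: str) -> str:
    """Resolve a source-specific label to the canonical taxonomy,
    flagging unknown pairs for manual review instead of guessing."""
    try:
        return CANONICAL_CLASS[(system, label)]
    except KeyError:
        raise ValueError(f"Unmapped classification: {system!r} / {label!r}")

Where systems still disagree about the same instrument (as with ETF-3892 above), a precedence rule designating a golden source settles the conflict.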

Survivorship Bias

Critical Impact

Historical market data excluded delisted securities, creating overly optimistic performance metrics and blinding the model to market risk patterns.
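
A sketch of the fix, with hypothetical file names: merge delisted securities back into the historical universe and filter point-in-time, so that training samples include the names that later failed.

import pandas as pd

# Hypothetical inputs: active universe plus a delisted-securities file
active = pd.read_csv('securities_active.csv')
delisted = pd.read_csv('securities_delisted.csv', parse_dates=['delisting_date'])

# Full universe: survivors and non-survivors together
universe = pd.concat([active, delisted], ignore_index=True)

def universe_as_of(date: str) -> pd.DataFrame:
    """Return every security still listed on the given date, so that
    historical samples include names that were later delisted."""
    as_of = pd.Timestamp(date)
    still_listed = (
        universe['delisting_date'].isna() | (universe['delisting_date'] > as_of)
    )
    return universe[still_listed]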

AI Model Results

Comparing predictions based on raw data vs. quality-corrected data

Model with Raw Data

  • Accuracy: 78%
  • AUROC: 0.82
  • F1 score: 0.71

Key Insights

  • Model detects only 31% of actual market downturns (high false-negative rate)
  • Consistent underestimation of volatility in mixed-asset portfolios
  • Over-confident in predicting stable periods (92% precision but only 45% recall for high-risk events)
  • Feature importance heavily skewed toward recent market indicators

Model with Quality-Corrected Data

  • Accuracy: 73%
  • AUROC: 0.88
  • F1 score: 0.79

Key Insights

  • Lower overall accuracy but a much higher detection rate for market downturns (76% vs. 31%)
  • More balanced risk assessment across different asset classes
  • Less extreme probability estimates but more reliable predictions (better-calibrated confidence intervals)
  • Feature importance more evenly distributed across leading indicators
  • Better performance in extreme market conditions

Feature Importance Comparison

Data Quality Impact Analysis

Translating technical issues into business outcomes

Financial Impact

Potential Client Portfolio Losses

Using the uncorrected model would have resulted in a 12.7% average portfolio underperformance during market corrections, representing approximately €94M in preventable client losses across the test period.

Regulatory Risk

The uncorrected model fails to meet regulatory standards for model risk management:

  • Non-compliance with BCBS 239 principles for effective risk data aggregation
  • Inadequate model documentation for GDPR data lineage requirements
  • Inability to explain model decisions to regulators due to data inconsistencies
  • Potential fines of up to €10M or 2% of annual turnover

Client Trust

Client Retention Simulation

Simulation shows that deploying the uncorrected model would lead to approximately 17% client attrition within 24 months due to unexpected portfolio performance during market downturns.
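
To make the shape of such a simulation concrete, here is a minimal Monte Carlo sketch; the attrition-probability link and its parameters are assumptions for illustration, not the study's calibrated retention model.

import numpy as np

rng = np.random.default_rng(42)
n_clients, n_trials = 1_000, 5_000
attrition_rates = []
for _ in range(n_trials):
    # Underperformance each client experiences in a correction (~12.7% mean)
    underperformance = rng.normal(loc=0.127, scale=0.04, size=n_clients)
    # Simple linear link from underperformance to probability of leaving
    p_leave = np.clip(1.3 * underperformance, 0.0, 1.0)
    attrition_rates.append((rng.random(n_clients) < p_leave).mean())

print(f"Simulated 24-month attrition: {np.mean(attrition_rates):.1%}")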

Key Lessons & Remediation Strategy

Data Governance

Implemented a central data dictionary with standardized definitions across all systems, with mandatory metadata requirements and lineage tracking.
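
A minimal sketch of what one dictionary entry might look like; the field names, owner, and lineage hops are illustrative.

# One data dictionary entry with mandatory metadata (illustrative)
TRANSACTION_DATE_ENTRY = {
    "field": "transaction_date",
    "definition": "Trade execution time, never settlement time",
    "type": "timestamp",
    "format": "%Y-%m-%dT%H:%M:%SZ",   # ISO 8601, UTC
    "owner": "market-data-team",
    "lineage": ["exchange_api", "trade_capture", "risk_feature_store"],
    "mandatory": True,
}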

Quality Monitoring

Deployed automated data quality checks with data validation rules running in real-time, creating alerts for data drift or anomalies.
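
One common drift check is the Population Stability Index; the sketch below uses the standard PSI formulation with an illustrative 0.2 alert threshold and synthetic data standing in for a monitored feature.

import numpy as np

def population_stability_index(reference, current, bins=10):
    """PSI between a reference window and the live feed; values above
    roughly 0.2 are conventionally treated as significant drift."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    # Clip to avoid log(0) in sparse bins
    ref_pct = np.clip(ref_pct, 1e-6, None)
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, 5_000)
live = rng.normal(0.3, 1.2, 5_000)   # shifted feed simulating drift
psi = population_stability_index(reference, live)
print(f"PSI = {psi:.3f} -> {'ALERT: drift detected' if psi > 0.2 else 'ok'}")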

Model Transparency

Enhanced model documentation to include data quality metrics as part of model performance reporting, with clear disclosure of limitations.

Testing Strategy

Implemented adversarial testing with deliberately corrupted data to evaluate model robustness and establish performance boundaries.
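
As a sketch of that idea, the snippet below corrupts a copy of the test features in a controlled way and measures the AUC degradation; it assumes a trained model and held-out X_test, y_test as in the training example in the implementation section below, and the corruption scheme itself is illustrative.

import numpy as np
from sklearn.metrics import roc_auc_score

def corrupt(X, frac=0.1, seed=7):
    """Deliberately degrade a random fraction of rows by scaling their
    numeric values by up to +/-50%, mimicking the unexplained outliers
    found in the raw market data."""
    rng = np.random.default_rng(seed)
    X_bad = X.copy()
    mask = rng.random(len(X_bad)) < frac
    num_cols = X_bad.select_dtypes('number').columns
    noise = rng.uniform(0.5, 1.5, size=(mask.sum(), len(num_cols)))
    X_bad.loc[mask, num_cols] = X_bad.loc[mask, num_cols].values * noise
    return X_bad

# Robustness check: how far does AUC fall on corrupted inputs?
auc_clean = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
auc_bad = roc_auc_score(y_test, model.predict_proba(corrupt(X_test))[:, 1])
print(f"AUC clean: {auc_clean:.3f}, corrupted: {auc_bad:.3f}")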

Implementation & Technical Details

Resolving data quality issues and rebuilding the model

Data Quality Pipeline

Data Validation Example

import pandas as pd
import great_expectations as ge

# Load transaction data, keeping timestamps as raw strings so the
# format check runs against what the source systems actually emitted
transactions_df = pd.read_csv('transactions.csv', dtype={'transaction_date': str})

# Convert to a Great Expectations DataFrame
ge_df = ge.from_pandas(transactions_df)

# Check for timestamp consistency: every value must match the
# ISO 8601 UTC format standardized in the data dictionary
timestamp_format_validation = ge_df.expect_column_values_to_match_strftime_format(
    column="transaction_date",
    strftime_format="%Y-%m-%dT%H:%M:%SZ"
)

# Validate value ranges for transaction amounts
amount_validation = ge_df.expect_column_values_to_be_between(
    column="amount",
    min_value=0,
    max_value=10_000_000
)

# Once the format check passes, parse the column to a proper datetime
transactions_df['transaction_date'] = pd.to_datetime(
    transactions_df['transaction_date'], format="%Y-%m-%dT%H:%M:%SZ", utc=True
)

# Generate validation report
validation_report = {
    "timestamp_validation": timestamp_format_validation.success,
    "amount_validation": amount_validation.success,
    "validation_failures": timestamp_format_validation.result["unexpected_count"],
}

quality_score = (
    int(validation_report["timestamp_validation"])
    + int(validation_report["amount_validation"])
) / 2 * 100
print(f"Data quality score: {quality_score:.0f}%")

Model Architecture

Model Training with Data Quality Weighting

import xgboost as xgb
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Load preprocessed data with quality scores
X = pd.read_csv('features_with_quality.csv')
y = pd.read_csv('targets.csv').squeeze("columns")  # single target column as a Series

# Extract data quality scores to use as sample weights
data_quality_scores = X['data_quality_score']
X = X.drop('data_quality_score', axis=1)

# Split chronologically (shuffle=False) to avoid look-ahead leakage
# in time-series financial data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)

# Train model with data quality weighting
model = xgb.XGBClassifier(
    objective='binary:logistic',
    n_estimators=100,
    max_depth=4,
    learning_rate=0.1
)

# Down-weight low-quality records via sample weights
model.fit(
    X_train,
    y_train,
    sample_weight=data_quality_scores.loc[X_train.index],
    eval_set=[(X_test, y_test)],
    verbose=True
)

# Evaluate on the held-out period
y_pred_proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_pred_proba)
print(f"Model AUC: {auc:.4f}")

# Save feature importance
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

print("Top 10 features:")
print(feature_importance.head(10))