ericrottman.com

Applied AI on public CMS data

A system for comparing healthcare provider similarity using public CMS claims data across 1.57M providers (hospitals, clinics, physician offices, etc.). The app compares Medicare Part B (medical services/procedures) and Part D (prescription drugs) utilization patterns.

Provider Part B/D Utilization Comparison

Claims data is processed on AWS into vector representations, modeled as a graph in Neo4j, and explained through retrieval-augmented LLM reasoning.

CMS Provider Dashboard (Grafana)

Amazon Managed Grafana snapshot.

Open snapshot →
Grafana CMS dashboard screenshot

Architecture

AWS architecture
AWS architecture diagram

Request & Data Flow

Numbers correspond to components in the diagram.

  1. Route53
     DNS routes requests to CloudFront, which serves both the static site and the app entry path.
  2. CloudFront + ACM + WAF
     TLS terminates at the CloudFront edge using an ACM certificate in us-east-1; WAF filters requests before origin access.
  3. S3 static site (OAC)
     Tailwind site stored in a private bucket; CloudFront reads it via Origin Access Control (no public access).
  4. ALB (HTTP 80)
     Internet-facing ALB in public subnets receives CloudFront origin traffic and forwards it to the Streamlit target group on HTTP 8501.
  5. ECS/Fargate: Streamlit
     Streamlit tasks run in public subnets across AZs behind the ALB; they connect to Neo4j over Bolt and to Bedrock for explanations.
  6. Neo4j + EFS
     Neo4j runs in private subnets with EFS-backed persistence. Streamlit reaches it via security-group-restricted Bolt on port 7687.
  7. Public route table + IGW
     Public subnet route tables send 0.0.0.0/0 to the Internet Gateway for the internet-facing ALB and public-subnet task egress.
  8. VPC endpoints for ECS dependencies
     Interface endpoints (ECR API/DKR, STS, CloudWatch Logs, Secrets Manager) plus an S3 gateway endpoint support ECS operations without requiring NAT for private-subnet workloads.

Data Pipeline

Pipeline for transforming CMS datasets into provider vectors and a Neo4j similarity graph.

CMS provider similarity pipeline diagram

Pipeline steps

Numbers correspond to the diagram.

  1. Raw CMS data
     Public CMS datasets (Part B, Part D, NPPES) serve as the source for provider utilization patterns.
  2. Build provider features
     Produces derived provider tables (drugs, HCPCS, unified), including top utilization features and provider summary metrics (allowed_per_service, cost_per_claim, combined_outlier_score).
  3. Build provider vectors
     Converts features into dense numerical vectors representing each provider’s practice pattern.
  4. Normalize vectors (L2)
     L2-normalizes vectors so similarity comparisons consistently use cosine distance.
  5. Similarity & modeling (ECS task)
     Deep dive →
     A Python job runs FAISS nearest-neighbor search with thresholds, builds a similarity graph, applies Louvain community detection, and computes drift / peer-context signals.
  6. Neo4j graph storage
     Provider nodes, similarity edges, and Louvain community assignments are written to Neo4j for graph analytics and RAG retrieval.
  7. Analytics & visualization (Glue → Athena → Grafana)
     Glue catalogs the derived datasets, Athena queries them, and Grafana visualizes metrics and derived outputs.
  8. CI testing & quality dashboard
     Git commits trigger CodeBuild; test results are published to a CloudWatch dashboard for visibility.
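The core of steps 4–5 can be sketched in a few lines. This is an illustrative NumPy stand-in, not the production job: the actual pipeline uses FAISS for nearest-neighbor search, and the vector dimensions, `k`, and threshold here are made-up toy values.

```python
# Sketch of pipeline steps 4-5: L2-normalize provider vectors, then
# keep each provider's top-k cosine neighbors above a threshold as
# similarity edges. NumPy stands in for FAISS for clarity.
import numpy as np

def l2_normalize(X: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot product == cosine similarity."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.clip(norms, 1e-12, None)

def similarity_edges(X: np.ndarray, k: int = 5, threshold: float = 0.8):
    """Return (i, j, score) edges for each provider's top-k cosine
    neighbors whose similarity clears the threshold."""
    Xn = l2_normalize(X)
    S = Xn @ Xn.T                    # cosine similarity matrix
    np.fill_diagonal(S, -1.0)       # exclude self-matches
    edges = []
    for i in range(S.shape[0]):
        for j in np.argsort(S[i])[::-1][:k]:
            if S[i, j] >= threshold and i < j:  # dedupe undirected pairs
                edges.append((i, int(j), float(S[i, j])))
    return edges

# Toy provider vectors: two near-duplicate practice patterns, one outlier.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
edges = similarity_edges(X, k=2, threshold=0.8)
# Only the two near-duplicates are linked; the outlier gets no edge.
```

Edges produced this way map directly onto step 6: each `(i, j, score)` tuple becomes a similarity relationship between two Provider nodes in Neo4j.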

CI Unit Test Dashboard (CloudWatch)

Live CloudWatch dashboard showing CodeBuild unit tests.

Open dashboard →
CloudWatch CI dashboard screenshot