ericrottman.com

Applied AI on public CMS data

A system for comparing healthcare provider similarity using public CMS claims data across 1.57M providers (hospitals, clinics, physician offices, etc.). The app compares Medicare Part B (medical services/procedures) and Part D (prescription drugs) utilization patterns.

Provider Part B/D Utilization Comparison

Claims data is processed on AWS into vector representations, modeled as a graph in Neo4j, and explained through retrieval-augmented LLM reasoning.

CMS Provider Dashboard (Grafana)

Amazon Managed Grafana snapshot.

Open snapshot →
Grafana CMS dashboard screenshot

Architecture

AWS architecture
AWS architecture diagram

Request & Data Flow

Numbers correspond to components in the diagram.

  1. Route53
     DNS routes requests to CloudFront, which serves both the static site and the app entry path.
  2. CloudFront + ACM + WAF
     TLS terminates at the CloudFront edge using an ACM certificate in us-east-1; WAF filters requests before origin access.
  3. S3 static site (OAC)
     Tailwind site stored in a private bucket; CloudFront reads it via Origin Access Control (no public access).
  4. ALB (HTTP 80)
     Internet-facing ALB in public subnets receives CloudFront origin traffic and forwards it to the Streamlit target group on HTTP 8501.
  5. ECS/Fargate: Streamlit
     Streamlit tasks run in public subnets across AZs behind the ALB; they connect to Neo4j over Bolt and to Bedrock for explanations.
  6. Neo4j + EFS
     Neo4j runs in private subnets with EFS-backed persistence. Streamlit reaches it via security-group-restricted Bolt on port 7687.
  7. Public route table + IGW
     Public subnet route tables send 0.0.0.0/0 to the Internet Gateway for the internet-facing ALB and public-subnet task egress.
  8. VPC endpoints for ECS dependencies
     Interface endpoints (ECR API/DKR, STS, CloudWatch Logs, Secrets Manager) plus an S3 gateway endpoint support ECS operations without requiring NAT for private-subnet workloads.

Data Pipeline

Pipeline for transforming CMS datasets into provider vectors and a Neo4j similarity graph.

CMS provider similarity pipeline diagram

Pipeline steps

Numbers correspond to the diagram.

  1. Raw CMS data
     Public CMS datasets (Part B, Part D, NPPES) serve as the source for provider utilization patterns.
  2. Build provider features
     Produces derived provider tables (drugs, HCPCS, unified), including top utilization features and provider summary metrics (allowed_per_service, cost_per_claim, combined_outlier_score).
  3. Build provider vectors
     Converts features into dense numerical vectors representing each provider’s practice pattern.
  4. Normalize vectors (L2)
     L2-normalizes vectors so similarity comparisons consistently use cosine distance.
  5. Similarity & modeling (ECS task)
     Deep dive →
     A Python job runs FAISS nearest-neighbor search with thresholds, builds a similarity graph, applies Louvain community detection, and computes drift / peer-context signals.
  6. Neo4j graph storage
     Provider nodes, similarity edges, and Louvain community assignments are written to Neo4j for graph analytics and RAG retrieval.
  7. Analytics & visualization (Glue → Athena → Grafana)
     Glue catalogs the derived datasets, Athena queries them, and Grafana visualizes metrics and derived outputs.
  8. CI testing & quality dashboard
     Git commits trigger CodeBuild; test results are published to a CloudWatch dashboard for visibility.
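The core of steps 4–5 can be sketched in a few lines. This is an illustrative NumPy stand-in, not the production job: the actual pipeline uses FAISS for nearest-neighbor search, and the vector dimensions, `k`, and threshold here are made-up toy values.

```python
# Sketch of pipeline steps 4-5: L2-normalize provider vectors, then
# keep each provider's top-k cosine neighbors above a threshold as
# similarity edges. NumPy stands in for FAISS for clarity.
import numpy as np

def l2_normalize(X: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot product == cosine similarity."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.clip(norms, 1e-12, None)

def similarity_edges(X: np.ndarray, k: int = 5, threshold: float = 0.8):
    """Return (i, j, score) edges for each provider's top-k cosine
    neighbors whose similarity clears the threshold."""
    Xn = l2_normalize(X)
    S = Xn @ Xn.T                    # cosine similarity matrix
    np.fill_diagonal(S, -1.0)       # exclude self-matches
    edges = []
    for i in range(S.shape[0]):
        for j in np.argsort(S[i])[::-1][:k]:
            if S[i, j] >= threshold and i < j:  # dedupe undirected pairs
                edges.append((i, int(j), float(S[i, j])))
    return edges

# Toy provider vectors: two near-duplicate practice patterns, one outlier.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
edges = similarity_edges(X, k=2, threshold=0.8)
# Only the two near-duplicates are linked; the outlier gets no edge.
```

Edges produced this way map directly onto step 6: each `(i, j, score)` tuple becomes a similarity relationship between two Provider nodes in Neo4j.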

CI Unit Test Dashboard (CloudWatch)

Live CloudWatch dashboard showing CodeBuild unit tests.

Open dashboard →
CloudWatch CI dashboard screenshot