AI is transforming healthcare, but cost, privacy concerns, and infrastructure requirements still lock most patients and clinics out entirely.
Open-source models make local, private inference real. But access alone is not enough. Clinical AI must also be fair, auditable, and designed with the communities it serves in mind.
We are building and testing applied AI research toward better health outcomes for every patient, everywhere.
The first centralized repository of New York State public health flyers, organized by language. Built to support research on the language availability of public health information and on whether AI translations can capture the dialect differences that matter to communities.
Question
Do LLMs proposed for emergency triage change their decisions based on patient education level, even though education has no bearing on clinical severity?
Why It Matters
Standard of care must not vary by education, income, or background. As clinical AI scales, unchecked demographic bias will disproportionately harm already-underserved communities, embedding health inequity into the infrastructure of medicine itself.
What We Did
We tested Qwen-2.5-72B and GPT-4o-mini on 87 clinical vignettes with education-level cues added, holding all medical information constant and counting how often the triage decision flipped.
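At its core this is a paired comparison: the same vignette with and without an education cue, where any change in the triage label counts as a flip. A minimal sketch of that loop, assuming an OpenAI-compatible chat API; the vignette, cue phrasing, and label set below are illustrative placeholders, not our actual prompts.

```python
# Sketch of the decision-flip measurement (illustrative, not our exact prompts).
from openai import OpenAI

client = OpenAI()
TRIAGE_LEVELS = ["emergent", "urgent", "non-urgent"]  # hypothetical label set

def triage(vignette: str, model: str = "gpt-4o-mini") -> str:
    """Ask the model for a single triage label for one vignette."""
    resp = client.chat.completions.create(
        model=model,
        temperature=0,  # keep decoding as deterministic as possible
        messages=[{
            "role": "user",
            "content": f"{vignette}\n\nRespond with exactly one of: "
                       f"{', '.join(TRIAGE_LEVELS)}.",
        }],
    )
    return resp.choices[0].message.content.strip().lower()

def decision_flip(vignette: str, cue: str) -> bool:
    """True if adding a clinically irrelevant education cue changes the
    triage decision while all medical information stays constant."""
    return triage(vignette) != triage(f"{cue} {vignette}")

cue = "The patient did not finish high school."
vignette = "58-year-old with crushing chest pain radiating to the left arm."
print(decision_flip(vignette, cue))
```

Running the pair over all 87 vignettes then gives a per-model flip rate.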
Question
Does enforcing structured output in clinical LLM pipelines silently change model accuracy on medical reasoning tasks?
Why It Matters
Developers building clinical AI applications rely on structured outputs as standard engineering practice. If that choice silently degrades accuracy, then a default engineering decision is quietly compromising every pipeline built on top of it.
What We Did
We compared Pydantic-enforced vs. unstructured output across GPT-4o-mini, Gemini, and Claude on MedQA benchmark questions.
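A minimal sketch of the two arms, assuming the OpenAI Python SDK's Pydantic-based structured-output helper (`beta.chat.completions.parse`); the MedQA item, schema fields, and answer-extraction heuristic are illustrative.

```python
# Two arms of the comparison: schema-enforced vs. free-text output.
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class MedQAAnswer(BaseModel):
    reasoning: str
    answer: str  # expected to be one of "A"-"E"

question = "A 54-year-old presents with ... Which is the best next step? (A) ... (E) ..."

# Arm 1: Pydantic-enforced output. The model must emit JSON matching MedQAAnswer.
structured = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": question}],
    response_format=MedQAAnswer,
)
structured_choice = structured.choices[0].message.parsed.answer

# Arm 2: unstructured output. The model answers in free text; we parse it ourselves.
free = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": question + "\nEnd with 'Answer: <letter>'."}],
)
free_choice = free.choices[0].message.content.rsplit("Answer:", 1)[-1].strip()[:1]

# Accuracy is then compared across the two arms on the same MedQA question set.
```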
Question
Can open-source LLMs reliably simplify clinical patient education materials to a reading level that underserved patients can actually use?
Why It Matters
Better health literacy drives better adherence and better outcomes across all clinical settings. Open-source models that can reliably simplify these materials give every clinic a zero-cost path to closing that gap.
What We Did
We evaluated open- and closed-source LLMs on rewriting OrthoInfo content to an 8th-grade reading level, scoring the rewrites with BERTScore and Flesch-Kincaid grade level.
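Both metrics have standard Python implementations, so the scoring step is small. A minimal sketch using the `textstat` and `bert-score` packages; the text pair is an illustrative placeholder.

```python
# Score one rewrite for readability (Flesch-Kincaid) and meaning preservation (BERTScore).
import textstat
from bert_score import score

original = "The ulnar collateral ligament stabilizes the elbow against valgus stress."
rewrite = "A ligament on the inner side of your elbow keeps the joint steady."

# Flesch-Kincaid grade level: 0.39*(words/sentence) + 11.8*(syllables/word) - 15.59.
grade = textstat.flesch_kincaid_grade(rewrite)

# BERTScore F1 of the rewrite against the source, as a meaning-preservation check.
_, _, f1 = score([rewrite], [original], lang="en")

print(f"grade level: {grade:.1f} (target: <= 8)")
print(f"BERTScore F1: {f1.item():.3f}")
```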
Question
How can researchers and companies run medical AI evaluations consistently at scale, and how can anyone actually trust the benchmark results and model claims that follow?
Why It Matters
Reproducible, auditable evaluations are the foundation of trustworthy clinical AI. Without full provenance, results cannot be verified, replicated, or responsibly used to guide real deployment decisions.
What We Did
We built an open-source framework that runs any set of medical benchmarks in a single pass, assigns each question a unique hash for provenance, and logs every model input and output to a JSONL file, with accuracy and per-benchmark metrics computed out of the box.
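A minimal sketch of the provenance record, assuming a SHA-256 content hash and one JSONL line per question; the field names are illustrative, not the framework's actual schema.

```python
# Hash each question for provenance and append the full exchange to a JSONL log.
import hashlib
import json

def question_hash(benchmark: str, question: str) -> str:
    """Stable content hash so any result can be traced back to its exact question."""
    return hashlib.sha256(f"{benchmark}::{question}".encode("utf-8")).hexdigest()[:16]

def log_result(path: str, benchmark: str, question: str,
               model: str, prompt: str, output: str, correct: bool) -> None:
    record = {
        "qid": question_hash(benchmark, question),
        "benchmark": benchmark,
        "model": model,
        "input": prompt,    # full model input, verbatim
        "output": output,   # full model output, verbatim
        "correct": correct,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")  # one JSONL line per question
```

Because the hash is derived from the question content itself, two runs of the same benchmark produce records that can be joined and audited question by question.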
Question
Does the Bengali used in NYC public health flyers reflect the dialect its Bengali-speaking residents actually understand, or does a systematic translation gap leave them effectively unserved?
Why It Matters
Dialect mismatch in public health materials is a hidden form of health inequity. When official communications are written in a dialect residents cannot follow, the city's public health reach stops at the door of the very communities it is meant to protect.
What We Did
We built the first centralized repository of NYC public health flyers cataloged by language, examining whether official health communications are reaching the city's large immigrant population. Using Bengali as a case study, we applied AI to assess the dialect accuracy of existing translations and evaluated whether AI-generated translations serve the communities they are meant to reach better than the official ones do.