AI is transforming healthcare, but cost, privacy concerns, and infrastructure requirements still lock most patients and clinics out entirely.
Open-source models make local, private inference real. But access alone is not enough. Clinical AI must also be fair, auditable, and designed with the communities it serves in mind.
We are building and testing applied AI research toward better health outcomes for every patient, everywhere.
The first centralized repository of New York State public health flyers, organized by language. Built to support research on the language availability of public health information and on whether AI translations can capture the dialect differences that matter to communities.
Question
Do LLMs proposed for emergency triage change their decisions based on patient education level, even though education has no bearing on clinical severity?
Why It Matters
Standard of care must not vary by education, income, or background. As clinical AI scales, unchecked demographic bias will disproportionately harm already-underserved communities, embedding health inequity into the infrastructure of medicine itself.
What We Did
We tested Qwen-2.5-72B and GPT-4o-mini on 87 clinical vignettes with education-level cues added, holding all medical information constant and counting how often the triage decision flipped.
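At its core this is a paired comparison: the same vignette with and without an education cue, where any change in the triage label counts as a flip. A minimal sketch of that loop, assuming an OpenAI-compatible chat API; the vignette, cue phrasing, and label set below are illustrative placeholders, not our actual prompts.

```python
# Sketch of the decision-flip measurement (illustrative, not our exact prompts).
from openai import OpenAI

client = OpenAI()
TRIAGE_LEVELS = ["emergent", "urgent", "non-urgent"]  # hypothetical label set

def triage(vignette: str, model: str = "gpt-4o-mini") -> str:
    """Ask the model for a single triage label for one vignette."""
    resp = client.chat.completions.create(
        model=model,
        temperature=0,  # keep decoding as deterministic as possible
        messages=[{
            "role": "user",
            "content": f"{vignette}\n\nRespond with exactly one of: "
                       f"{', '.join(TRIAGE_LEVELS)}.",
        }],
    )
    return resp.choices[0].message.content.strip().lower()

def decision_flip(vignette: str, cue: str) -> bool:
    """True if adding a clinically irrelevant education cue changes the
    triage decision while all medical information stays constant."""
    return triage(vignette) != triage(f"{cue} {vignette}")

cue = "The patient did not finish high school."
vignette = "58-year-old with crushing chest pain radiating to the left arm."
print(decision_flip(vignette, cue))
```

Running the pair over all 87 vignettes then gives a per-model flip rate.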
Question
Does enforcing structured output in clinical LLM pipelines silently change model accuracy on medical reasoning tasks?
Why It Matters
Developers building clinical AI applications rely on structured outputs as standard engineering practice. If that choice silently degrades accuracy, then a default engineering decision is quietly compromising every pipeline built on top of it.
What We Did
We compared Pydantic-enforced vs. unstructured output across GPT-4o-mini, Gemini, and Claude on MedQA benchmark questions.
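A minimal sketch of the two arms, assuming the OpenAI Python SDK's Pydantic-based structured-output helper (`beta.chat.completions.parse`); the MedQA item, schema fields, and answer-extraction heuristic are illustrative.

```python
# Two arms of the comparison: schema-enforced vs. free-text output.
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class MedQAAnswer(BaseModel):
    reasoning: str
    answer: str  # expected to be one of "A"-"E"

question = "A 54-year-old presents with ... Which is the best next step? (A) ... (E) ..."

# Arm 1: Pydantic-enforced output. The model must emit JSON matching MedQAAnswer.
structured = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": question}],
    response_format=MedQAAnswer,
)
structured_choice = structured.choices[0].message.parsed.answer

# Arm 2: unstructured output. The model answers in free text; we parse it ourselves.
free = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": question + "\nEnd with 'Answer: <letter>'."}],
)
free_choice = free.choices[0].message.content.rsplit("Answer:", 1)[-1].strip()[:1]

# Accuracy is then compared across the two arms on the same MedQA question set.
```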
Question
Can open-source LLMs reliably simplify clinical patient education materials to a reading level that underserved patients can actually use?
Why It Matters
Better health literacy drives better adherence and better outcomes across all clinical settings. Open-source models that can reliably simplify these materials give every clinic a zero-cost path to closing that gap.
What We Did
We evaluated open- and closed-source LLMs on rewriting OrthoInfo content to an 8th-grade reading level, scoring the rewrites with BERTScore and Flesch-Kincaid grade level.
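Both metrics have standard Python implementations, so the scoring step is small. A minimal sketch using the `textstat` and `bert-score` packages; the text pair is an illustrative placeholder.

```python
# Score one rewrite for readability (Flesch-Kincaid) and meaning preservation (BERTScore).
import textstat
from bert_score import score

original = "The ulnar collateral ligament stabilizes the elbow against valgus stress."
rewrite = "A ligament on the inner side of your elbow keeps the joint steady."

# Flesch-Kincaid grade level: 0.39*(words/sentence) + 11.8*(syllables/word) - 15.59.
grade = textstat.flesch_kincaid_grade(rewrite)

# BERTScore F1 of the rewrite against the source, as a meaning-preservation check.
_, _, f1 = score([rewrite], [original], lang="en")

print(f"grade level: {grade:.1f} (target: <= 8)")
print(f"BERTScore F1: {f1.item():.3f}")
```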
Question
How can researchers and companies run medical AI evaluations consistently at scale, and how can anyone actually trust the benchmark results and model claims that follow?
Why It Matters
Reproducible, auditable evaluations are the foundation of trustworthy clinical AI. Without full provenance, results cannot be verified, replicated, or responsibly used to guide real deployment decisions.
What We Did
We built an open-source framework that runs any set of medical benchmarks in a single pass, assigns each question a unique hash for provenance, and logs every model input and output to a JSONL file, with accuracy and per-benchmark metrics computed out of the box.
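A minimal sketch of the provenance record, assuming a SHA-256 content hash and one JSONL line per question; the field names are illustrative, not the framework's actual schema.

```python
# Hash each question for provenance and append the full exchange to a JSONL log.
import hashlib
import json

def question_hash(benchmark: str, question: str) -> str:
    """Stable content hash so any result can be traced back to its exact question."""
    return hashlib.sha256(f"{benchmark}::{question}".encode("utf-8")).hexdigest()[:16]

def log_result(path: str, benchmark: str, question: str,
               model: str, prompt: str, output: str, correct: bool) -> None:
    record = {
        "qid": question_hash(benchmark, question),
        "benchmark": benchmark,
        "model": model,
        "input": prompt,    # full model input, verbatim
        "output": output,   # full model output, verbatim
        "correct": correct,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")  # one JSONL line per question
```

Because the hash is derived from the question content itself, two runs of the same benchmark produce records that can be joined and audited question by question.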
Question
Does the Bengali used in NYC public health flyers reflect the dialect its Bengali-speaking residents actually understand, or does a systematic translation gap leave them effectively unserved?
Why It Matters
Dialect mismatch in public health materials is a hidden form of health inequity. When official communications are written in a dialect residents cannot follow, the city's public health reach stops at the door of the very communities it is meant to protect.
What We Did
We built the first centralized repository of NYC public health flyers cataloged by language, examining whether official health communications are reaching the city's large immigrant population. Using Bengali as a case study, we applied AI to assess the dialect accuracy of existing translations and evaluated whether AI-generated translations serve the communities they are meant to reach better than the official ones do.