The objective of this project is to develop an open-source NLP platform that ingests structured and unstructured patient-level data and notes and extracts clinically relevant features for determining clinical trial eligibility, constructing reproducible computational phenotypes, and building patient-level predictive models.

I am currently focused on integrating supervised and unsupervised machine learning techniques, including non-negative matrix factorization for topic modeling, and on exploring the extent to which real-valued, vector-based representations of unstructured text (e.g., patient notes, clinical trial inclusion criteria, and user-generated queries) can be used for dimensionality reduction and/or standardized evaluation of semantically similar documents.

During the Fall 2018 semester, as a student in CS 8803: Data Analytics Using Deep Learning, I had the opportunity to work with my colleague and classmate (and lead ClarityNLP dev!) Charity Hilton on optimizing portions of this platform. Our final paper outlines our approach and presents our empirical results. We plan to integrate a subset of our improvements soon.

This project is under active development, and we host bi-weekly Cooking with ClarityNLP tutorial sessions, which you can join in real time via Webex. Jupyter notebooks and recordings from past sessions are also posted. We welcome questions and suggestions for future sessions via Twitter (use the hashtag #cookingWithClarityNLP), GitHub, or Slack.