During my internship at Synapxe, I worked on several AI/ML and data engineering projects in the healthcare domain, covering unauthorised data access detection, Retrieval-Augmented Generation (RAG), data loss prevention, and feature engineering optimisation. Most of my work involved building Python-based tools, comparing machine learning and NLP approaches, and making existing systems faster, easier to use, and more reliable.
One of my main projects was an unauthorised data access detection tool, which classifies each access to a patient's record as authorised or unauthorised. Records were flagged based on access time, the role of the logged-in user, and the free-text explanation given by the user. Since the explanation field was unstructured text, I explored several NLP approaches: libraries such as spaCy and sentence-transformers for processing and embedding the text, cosine similarity for comparing explanations, and classifiers such as K-Nearest Neighbours, nearest centroid, Support Vector Machines, Logistic Regression, Random Forest, and Naive Bayes to decide whether an explanation was acceptable or unacceptable.
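To make the centroid approach concrete, here is a minimal sketch of nearest-centroid classification with cosine similarity. It is not the project's code: the bag-of-words `embed` function is a toy stand-in for a real sentence embedding (for example from sentence-transformers), and the training explanations and helper names are made up for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real sentence embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Average a list of sparse vectors into one centroid vector."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return Counter({t: c / len(vectors) for t, c in total.items()})


def fit(labelled):
    """Build one centroid per label from (text, label) pairs."""
    by_label = {}
    for text, label in labelled:
        by_label.setdefault(label, []).append(embed(text))
    return {label: centroid(vs) for label, vs in by_label.items()}


def predict(centroids, text):
    """Return the label whose centroid is most similar to the text."""
    vec = embed(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Hypothetical training explanations, invented for this example.
TRAIN = [
    ("reviewing patient chart before scheduled surgery", "acceptable"),
    ("updating medication list for patient under my care", "acceptable"),
    ("curious about a friend", "unacceptable"),
    ("checking on a neighbour out of curiosity", "unacceptable"),
]
centroids = fit(TRAIN)
print(predict(centroids, "reviewing chart for patient under my care"))
```

Swapping `embed` for a dense sentence embedding would leave the centroid and similarity logic unchanged, which is what makes this a convenient baseline to compare against the other classifiers.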
I also worked on deploying LightRAG on Azure App Service. LightRAG is a graph-based Retrieval-Augmented Generation framework that extracts entities and relations across document chunks, allowing it to answer relation-centric queries more effectively. During the deployment, I debugged Azure App Service issues, resolved wheelhouse dependency problems, learned how startup commands work, and used tools such as Bash and SSH to inspect the deployed files directly.
For the AI2D project, I helped optimise the feature building pipeline by migrating parts of the codebase from Pandas to Polars. I first studied the existing code flow and created flowcharts to understand the dependencies between files, then incrementally converted functions from the leaf nodes of the dependency graph upwards. This reduced feature loading time from around 282 seconds to 22 seconds on a desktop (roughly 13 times faster), and from around 443 seconds to 78 seconds on a laptop (roughly 5.7 times faster).
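The leaf-first conversion order can be sketched as a topological sort of the dependency graph. This is only an illustration using Python's standard `graphlib` module; the function names in `DEPENDS_ON` are hypothetical and do not come from the AI2D codebase.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each function maps to the functions it calls.
DEPENDS_ON = {
    "build_features": {"merge_visits", "encode_roles"},
    "merge_visits": {"load_visits"},
    "encode_roles": {"load_roles"},
    "load_visits": set(),
    "load_roles": set(),
}


def migration_order(depends_on):
    """Return functions ordered so that every callee precedes its callers."""
    return list(TopologicalSorter(depends_on).static_order())


order = migration_order(DEPENDS_ON)
print(order)  # leaf functions come first, build_features comes last
```

Converting in this order means each function is migrated only after everything it calls already returns Polars data, so each step can be tested on its own before moving up.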
I also explored data loss prevention use cases using local LLMs through Ollama. I built a basic prototype that extracts personally identifiable information from strings, then improved the reliability of extraction through prompt engineering, such as tailoring instructions to Singapore-specific formats for phone numbers and other sensitive information.
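As an illustration of what tailoring to Singapore-specific formats can look like, the sketch below pairs a prompt template with a regular-expression cross-check for phone numbers. Both are assumptions for this example rather than the prototype's actual code: the prompt wording and the names `build_prompt` and `find_sg_phones` are hypothetical, no LLM is called, and the pattern is simplified to 8-digit numbers starting with 6, 8, or 9, optionally prefixed by the +65 country code.

```python
import re

# Simplified pattern: optional +65 or 65 prefix, then 8 digits starting
# with 6, 8 or 9, optionally split 4-4 by a space or hyphen.
SG_PHONE = re.compile(
    r"(?<!\d)(?:\+?65[\s-]?)?([689]\d{3})[\s-]?(\d{4})(?!\d)"
)


def build_prompt(text):
    """Build a hypothetical extraction prompt with format hints for the LLM."""
    return (
        "Extract all personally identifiable information from the text.\n"
        "Singapore phone numbers have 8 digits, start with 6, 8 or 9, "
        "and may be prefixed with +65.\n"
        "Return one item per line.\n\n"
        f"Text: {text}"
    )


def find_sg_phones(text):
    """Return phone numbers found in text, normalised to 8 digits."""
    return [first + second for first, second in SG_PHONE.findall(text)]


sample = "Call +65 9123 4567 or 61234567."
print(build_prompt(sample))
print(find_sg_phones(sample))
```

A deterministic check like this could be used to compare against the model's output when evaluating prompt changes, since phone numbers are one of the few PII types with a fixed format.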