Datacast
Episode 42: Privacy-Preserving Natural Language Processing with Patricia Thaine
Episode Summary
Patricia Thaine is a Computer Science Ph.D. Candidate at the University of Toronto and a Postgraduate Affiliate at the Vector Institute researching privacy-preserving natural language processing, with a focus on applied cryptography. Her research interests also include computational methods for lost language decipherment. She is the Co-Founder and CEO of Private AI, a Toronto- and Berlin-based startup creating a suite of privacy tools that make it easy to comply with data protection regulations, mitigate cybersecurity threats, and maintain customer trust.
Episode Notes
Show Notes
- (2:55) Patricia talked about her interest in learning languages and living in different cultures.
- (4:05) Patricia talked about her experience volunteering as a translator at the International Network of Street Papers.
- (5:00) Patricia studied Liberal Arts at John Abbott College, English Literature at Concordia University, and Computer Science and Linguistics at McGill University during her undergraduate years.
- (8:06) Patricia worked at the McGill Language Development Lab as a Research Assistant, studying how children learn different types of words and sentences.
- (9:15) Patricia described her graduate school experience at the University of Toronto, where she researched lost language decipherment and writing systems.
- (11:19) Patricia talked about MedStory, which is a text-oriented visual prototype built to support the complexity of medical narratives (spearheaded by Nicole Sultanum).
- (12:35) Patricia explained her research paper, “Vowel and Consonant Classification through Spectral Decomposition.”
- (15:29) Patricia unpacked her blog post, “Why is Privacy-Preserving NLP Important?”
- (19:02) Patricia dissected her paper “Privacy-Preserving Character Language Modelling” that proposes a method for calculating character bigram and trigram probabilities over sensitive data using homomorphic encryption.
- (21:13) Patricia wrote a two-part series called “Homomorphic Encryption for Beginners.”
- (22:21) Patricia unwrapped her paper “Efficient Evaluation of Activation Functions over Encrypted Data” that shows how to represent the value of any function over a defined and bounded interval, given encrypted input data, without needing to decrypt any intermediate values before obtaining the function’s output.
- (25:33) Patricia elaborated on her paper “Extracting Bark-Frequency Cepstral Coefficients from Encrypted Signals,” which claims that extracting spectral features from encrypted signals is the first step towards achieving secure end-to-end automatic speech recognition over encrypted data.
- (27:38) Patricia explained why privacy is an essential attribute for speech recognition applications.
- (29:53) Patricia discussed her comprehensive guide on “Perfectly Privacy-Preserving AI” which dives into the four pillars of perfectly privacy-preserving AI and outlines potential combinatorial solutions to satisfy all four pillars.
- (37:53) Patricia shared her take on the differences between working in academic and commercial settings (she is the co-founder and CEO of Private AI).
- (40:50) Patricia talked about Private AI’s GALATEA Anonymization Suite, which anonymizes data at the source and encrypts it using quantum-safe cryptography.
- (45:05) Patricia emphasized the importance of talking to customers when building a commercial product.
- (46:58) Patricia shared her experience as a Postgraduate Affiliate at Vector Institute, which works with institutions, industry, startups, incubators, and accelerators to advance AI research and drive its application, adoption, and commercialization across Canada.
- (49:09) Patricia shared her advice for young researchers: go deep into at least two domains and combine the knowledge.
- (50:30) Patricia shared her excitement for privacy and NLP research in the upcoming years.
- (52:36) Closing segment.
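The homomorphic-encryption papers discussed above share one core idea: arithmetic can be performed directly on ciphertexts, so intermediate values never need to be decrypted. As a rough illustration only (this is not Patricia's actual construction; the scheme choice, the tiny primes, and the bigram counts are all toy assumptions), a textbook Paillier cryptosystem lets a server sum character-bigram counts from two parties without ever seeing the plaintext counts:

```python
# Toy Paillier cryptosystem (additively homomorphic). Textbook-insecure:
# tiny hard-coded primes, purely to illustrate computing on encrypted data.
import random
from math import gcd

p, q = 293, 433                     # demo primes; real keys use >1024-bit primes
n, n2 = p * q, (p * q) ** 2
g = n + 1
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
mu = pow(lam, -1, n)                # lambda^-1 mod n (Python 3.8+)

def encrypt(m: int) -> int:
    r = random.randrange(2, n)
    while gcd(r, n) != 1:           # r must be invertible mod n
        r = random.randrange(2, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c: int) -> int:
    x = pow(c, lam, n2)
    return ((x - 1) // n) * mu % n

def add_encrypted(c1: int, c2: int) -> int:
    # Multiplying ciphertexts adds the underlying plaintexts.
    return (c1 * c2) % n2

# Two parties each hold a private count for the bigram "th"; the server
# aggregates the ciphertexts without ever seeing the individual counts.
total = add_encrypted(encrypt(17), encrypt(25))
print(decrypt(total))  # 42
```

The server only ever multiplies large integers modulo n², yet the key holder who decrypts the result obtains the exact sum of the counts.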
Her Contact Info
Her Recommended Resources
Episode Transcription
Key Takeaways
Below are highlights from my conversation with Patricia:
ON HER UNDERGRADUATE EDUCATION
- The liberal arts program at John Abbott College helped me get a better understanding of how the world works and hone my writing and critical thinking skills.
- I then studied English Literature at Concordia University, thinking that was my great love. But after one year, I felt there were other things I could learn. I tried out International Relations and Philosophy, but neither captured the essence of how the world works.
- I tried out Linguistics and really appreciated its pattern-matching structure. I then saw something called Computational Linguistics that caught my interest, took my first programming class at McGill University, decided to switch majors, and eventually pursued a Master’s degree in Computational Linguistics at the University of Toronto.
ON BEING A GRADUATE STUDENT AT THE UNIVERSITY OF TORONTO
- It has been a dream for me. During the initial interview with a professor who later became my advisor, he talked about the different projects that I could be involved with, including lost language decipherment and the analysis of ancient languages. That’s something I have always wanted to do, so I immediately accepted an offer to become his Master’s student.
- Later on, my research switched to writing systems, which are somewhat under-appreciated. I studied how to match sounds with certain characters in lost ancient languages and how to determine the syntaxes of languages, among other things.
ON PRIVACY-PRESERVING NATURAL LANGUAGE PROCESSING AND SPEECH PROCESSING
- The world is leaning towards laws that enforce strict privacy requirements (GDPR and CCPA). These laws set strict parameters on what you can do with data, and much of the technology has not yet caught up, so companies cannot always do what they want to do.
- Research in privacy-preserving NLP is especially exciting and important because natural language contains some of the most sensitive data we produce. I include speech processing in this category as well, considering that speech carries even more personal data than pure text (socio-economic background, education, gender, etc.).
- Privacy goes hand-in-hand with security. Privacy ensures user access controls and confidentiality, making it easier to keep data secure and avoid data leaks.
ON PERFECTLY PRIVACY-PRESERVING AI APPLICATION
- I wrote a guide on “Perfectly Privacy-Preserving AI” to showcase the different parts of the ML pipeline that people need to be careful about (concerning privacy) and how they fit together.
- Federated learning brings computation to the devices where the data is collected. It needs to be combined with differential privacy, which allows us to make generalizations about the data rather than extract specific information about individuals. One example is adding differentially-private noise to the data to make the system more robust.
- On top of these, we can add secure multiparty computation, where the resulting weights of the model can be combined with the weights of other models without revealing any individual model.
- There has been less research on model-privacy than on data-privacy.
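As a concrete, simplified illustration of the differential-privacy piece described above, here is a minimal Laplace-mechanism sketch (the dataset, predicate, and epsilon value are invented for this example; a counting query is used because its sensitivity is exactly 1):

```python
# Minimal Laplace-mechanism sketch: answer a counting query with
# differentially-private noise instead of releasing the exact count.
import random

def laplace_noise(scale: float) -> float:
    # The difference of two i.i.d. exponentials is Laplace(0, scale).
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def private_count(records, predicate, epsilon: float) -> float:
    # A counting query changes by at most 1 when one record changes
    # (sensitivity 1), so Laplace noise with scale 1/epsilon gives
    # epsilon-differential privacy.
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

ages = [23, 35, 41, 29, 52, 60, 18]  # toy dataset
noisy = private_count(ages, lambda a: a >= 40, epsilon=0.5)
print(noisy)  # the exact answer is 3, plus random Laplace noise
```

Averaged over many runs the noisy answers concentrate around the true count, which is the generalization-without-specifics behavior the mechanism is designed for; smaller epsilon means more noise and stronger privacy.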
ON FOUNDING AND RUNNING PRIVATE AI
- In academic research, I can afford to work on impractical and theoretical things that can potentially lead to future knowledge.
- On the commercial side, the hardest thing for me is finding the problems and building hand-crafted tools around them, given the understanding obtained from the academic realm.
- There was a massive gap in the market: no privacy-preserving tools existed for developers without a privacy-preserving machine learning background. Private AI builds tools that are generalizable and easy to integrate to address that need.
- Our primary use cases are (1) transferring sensitive datasets between different teams within organizations and (2) filtering dataset queries to limit the amount of sensitive data being passed around.
- Another interesting use case is direct integration into the app and browser extension.
- The most important thing about starting a business is to talk to a lot of people. Those conversations will tell you whether or not your hypothesis is sensible. It is also best to build a prototype to ensure that people’s words match their actions.
ON BEING AN EFFECTIVE RESEARCHER
- I am a huge fan of combining topics, for example, privacy with NLP, or background in biology/healthcare with computer science.
- Machine learning on its own is excellent for conducting in-depth theoretical research. But pair it with deep knowledge of a second domain, and you’ll get an explosive combination that can lead to incredible outcomes.