Datacast

Episode 64: Improving Access to High-Quality Data with Fabiana Clemente

Episode Summary

Fabiana Clemente is a Data Scientist with a background that ranges from Business Intelligence to Big Data Development and IoT architecture. Throughout her professional career, she has led state-of-the-art projects at global companies and startups. She has an academic background in Applied Maths and an MSc in Data Management, combined with nanodegrees in Deep Learning and Secure and Private AI. As YData’s Co-Founder, she combines Data Privacy with Deep Learning as her main field of work and research, with the mission to unlock data with privacy by design. She also aims to inspire more women to follow in her footsteps and join the tech community.

Episode Notes

Show Notes

Fabiana’s Contact Info

YData’s Resources

Mentioned Content

Blog Posts

Podcast

People

Recent Announcements/Articles

Episode Transcription

Key Takeaways

Here are highlights from my conversation with Fabiana:

On Studying Applied Mathematics

College was an exciting experience, as I had the opportunity to touch on many different areas of mathematics. Some areas were more related to physics, for example. Others were more related to statistics, which is applicable in a wider range of situations. I took classes in optimization, multivariate data analysis, applied statistics, and data science (even though no one called it “data science” at the time). It was interesting to get that sense of mixing the theoretical and the applied sides of math. Overall, the experience gave me a broader scope of what I could do as a professional.

On Building IoT Solutions at Vodafone

This opportunity came about during my attendance at the first Web Summit in Lisbon. An architect from Vodafone reached out to me. They had a position for a data solutions architect, someone with experience building databases and extracting analytics from them. This was quite a jump for me, as I had always been a developer before that. Being an architect designing scalable solutions for IoT was definitely a big challenge.

In IoT, there are two different perspectives on how data can be used: in batch or in real time. So, in addition to the high data volume I was dealing with, I had to set up an infrastructure that could handle different requirements for how quickly insights needed to be delivered. At the time, I was defining, from scratch, an architecture that (1) could cope with these requirements and (2) was scalable and flexible enough for new requirements in the future.

At the time, the big data architectures influencing my decision came from Spotify and Uber. However, the main concern was not designing an architecture that could cope with the requirements but convincing people to accept something new and different from the status quo (RDBMS and traditional storage systems). This meant telling them that there were new open-source solutions to process data at the enterprise level. It also meant restructuring teams to accept and work with the new architecture.

On Getting Into Data Science

As a solution architect at Vodafone, I was building conceptually and making sure that any architecture could cope with the other existing systems within the company. However, I was no longer developing myself, and I missed my time as a developer. I enjoyed learning new things and exploring them on my own. That led to my decision to work as a data scientist.

On Data Science vs. Business Intelligence

Data science can be considered a part of the BI scope, which is about using data to give business-specific insights. From a technical perspective:

On Founding YData

During my time as a data scientist, I felt the pain of accessing high-quality data. At times, I had to wait six months or more for access to data just to build a proof of concept. That was not feasible at all. How can I get faster access to data without concerns about its sensitivity and privacy, the real blockers to data access? On the other side of the same spectrum, if you are doing data science at a small company, you will have access to many things. That made me a bit uncomfortable because I could easily end up extracting insights for the wrong person.

While talking with my co-founder, we realized that this was not just my reality but also his. As we talked with other people, we saw that it was their reality as well. Data science teams were struggling with this. With these questions in mind, we started exploring the data privacy space to better understand what types of solutions could provide the privacy needed and enable data scientists to do their work in a timely manner. I dug into privacy-enhancing technologies such as federated learning, differential privacy, and, eventually, synthetic data. Synthetic data is what I would call the data-science-friendly option.

On Techniques to Generate Synthetic Data

Understanding that synthetic data was the way to go, we started exploring the techniques and algorithms that could help us make this possible. In the world of generative models, we have Bayesian networks, GANs, and LSTMs that are feasible for the job. But we definitely had another concern: how can we make synthetic data with the same utility, fidelity, and quality as the original data, while still guaranteeing variability and privacy? Out of the possible solutions, GANs appeared to be the best fit for the job.

Of course, GANs are well-known for unstructured data such as images, audio, and text. However, there was a lack of research on using GANs for tabular and time-dependent data. The GAN concept of having two networks working against each other is an interesting way to ensure that the generated data has the quality, utility, and privacy needed.
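To make that two-network idea concrete, here is a minimal, illustrative sketch of a GAN for tabular data in PyTorch. This is not YData’s implementation; the layer sizes, the toy real_data table, and the training settings are assumptions chosen only to show how a generator and a discriminator are trained against each other.

```python
# Minimal GAN sketch for tabular data (illustrative only, not YData's code).
# Assumes numeric features already scaled to [0, 1].
import torch
import torch.nn as nn

latent_dim, n_features = 16, 8  # assumed sizes for illustration

generator = nn.Sequential(
    nn.Linear(latent_dim, 64), nn.ReLU(),
    nn.Linear(64, n_features), nn.Sigmoid(),   # outputs synthetic rows in [0, 1]
)
discriminator = nn.Sequential(
    nn.Linear(n_features, 64), nn.ReLU(),
    nn.Linear(64, 1),                          # real-vs-fake logit
)

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real_data = torch.rand(1024, n_features)       # stand-in for a real table

for step in range(1000):
    real = real_data[torch.randint(0, len(real_data), (64,))]
    fake = generator(torch.randn(64, latent_dim))

    # Discriminator: learn to tell real rows from generated rows.
    d_loss = bce(discriminator(real), torch.ones(64, 1)) + \
             bce(discriminator(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: learn to fool the discriminator into labelling fakes as real.
    g_loss = bce(discriminator(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

# Sample new synthetic rows once training is done.
synthetic_rows = generator(torch.randn(100, latent_dim)).detach()
```

After training, the synthetic rows would still need to be checked against the original table for utility, fidelity, and privacy, which is exactly the concern raised above.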

On GANs for Synthetic Data Generation

Synthetic data for tabular and time-series data is a new subject with its own peculiarities. Data science teams, the target users of this kind of solution, need to be able to trust data generated by neural networks.

We are looking to keep updating this learning journey about generating synthetic data: how to achieve it, how to trust it, and how to educate the broader data science community about the benefits it can bring.

On Differential Privacy

With differential privacy, you introduce noise into the data you have to make it harder to re-identify someone. You can define how hard it should be to re-identify someone, which is controlled by what is known as the privacy budget. The amount of noise you apply determines how much privacy you gain versus how much of the data’s utility you lose.

I think differential privacy and synthetic data are highly complementary. You have different data privacy needs at different stages or within different parts of your organization. Synthetic data allows data science teams to explore the information that they have. But let’s say your domain is healthcare. Then you want an extra step to ensure that the new synthetic data is even more private. The combination of generating synthetic data with differential privacy makes a lot of sense: you are generating synthetic data with differential privacy to ensure more perturbation.

I don’t know if this is a con or can be seen as a feature of differential privacy: if you want more privacy, you will lose some of the data’s utility. If you apply differential privacy in the wrong manner, you will extract the wrong insights from your data.
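As a concrete illustration of that privacy-versus-utility tradeoff, here is a minimal sketch of the Laplace mechanism, a classic way to answer a single query with differential privacy. The dataset, bounds, and epsilon values below are assumptions for illustration, not part of any specific product.

```python
# Laplace mechanism sketch: a differentially private mean (illustrative only).
import numpy as np

def private_mean(values, epsilon, lower=0.0, upper=100.0):
    """Return a noisy mean of `values`, assumed bounded to [lower, upper]."""
    clipped = np.clip(values, lower, upper)
    true_mean = clipped.mean()
    # Sensitivity of the mean of n bounded values is (upper - lower) / n.
    sensitivity = (upper - lower) / len(clipped)
    # Smaller epsilon (a tighter privacy budget) means larger noise.
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_mean + noise

ages = np.random.default_rng(0).integers(18, 90, size=1_000)  # toy dataset
for eps in (0.1, 1.0, 10.0):
    print(f"epsilon={eps}: private mean = {private_mean(ages, eps):.2f}")
```

A small epsilon gives a heavily perturbed answer (more private, less useful), while a large epsilon gives an answer close to the true mean, mirroring the tradeoff described above.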

On “The Cost of Poor Data Quality”

While extracting insights from data, you have to ensure that the data is correct; otherwise, you get invalid insights. If you apply these invalid insights to your business, you might make decisions that lead to big financial losses. Even if you build different models, the data that you choose for those models has a bigger impact at the end of the day. Thus, it’s important to take these factors into consideration in a data platform:

On Model Explainability

Model explainability lets us know the impact that certain variables have on a decision. You want to know why the model made a decision so that you are aware of any flaws or biases in the data. In that sense, model explainability can help you understand potential problems with your data. With good data quality, you are not afraid to justify why you trust a machine’s decision.
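As one small example of what “knowing the impact of certain variables” can look like in practice, here is a permutation-importance sketch with scikit-learn on a toy dataset. The dataset and model are assumptions for illustration; this is not tied to any YData tool.

```python
# Permutation importance sketch: measure how much each feature drives decisions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Toy dataset and model (assumed purely for illustration).
X, y = make_classification(n_samples=500, n_features=6, n_informative=3, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Shuffle each feature and measure how much the score drops; a large drop means
# the model's decisions rely heavily on that variable.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for i, importance in enumerate(result.importances_mean):
    print(f"feature_{i}: {importance:.3f}")
```

A variable with surprisingly high importance can point to exactly the kind of flaw or bias in the data mentioned above.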

On Open-Source-as-a-Strategy

My co-founder and I enjoy well-developed open-source projects. They have enabled us, throughout our careers, to learn new things and experiment with new technologies. In addition, they give us the opportunity to experiment with these projects before buying into them.

Open-source is also about the community. If you give back without the expectation of receiving, you create a sense of community where everyone is willing to contribute. It’s proven that you develop better solutions when you have more people thinking about them rather than just one person. Because synthetic data is something so new in the market, we understood that we needed to go open-source and show its value to the community. This is the educational path that we foresee.

Open-source also ties into our product. If the data science community trusts the data that we are generating, it’s far easier to convince people that they need a solution like the YData platform.

On Hiring

The use of synthetic data with deep learning is rarely found in the development ecosystem. Therefore, candidates should be excited about the opportunity to work on such cutting-edge technologies.

As an early-stage startup, we have to be sure that the values we set for the company right at the beginning can attract good developers out there. YData is a place of collaboration, where feedback is valuable and goes both ways between founders and employees.

On Women in Tech

I am quite active in the Portuguese women-in-tech community. I am very present in discussions, podcasts, and public speaking about YData and entrepreneurship. Doing that as a woman is a way to inspire others to follow a similar path.