Datacast

Episode 85: Ad Exchange, Stream Processing, and Data Discovery Platform with Shinji Kim

Episode Summary

Shinji Kim is the Founder & CEO of Select Star, an intelligent data discovery platform that helps you understand your data. Previously, she was the CEO of Concord Systems, an NYC-based data infrastructure startup acquired by Akamai Technologies in 2016. She led the development of Akamai’s Internet-of-Things data platform for real-time messaging, log processing, and edge computing. Prior to Concord, Shinji was the first Product Manager hired at Yieldmo, where she led the Ad Format Lab, A/B testing, and yield optimization. Before Yieldmo, she analyzed data and built enterprise applications at Deloitte Consulting, Facebook, Sun Microsystems, and Barclays Capital. Shinji studied Software Engineering at the University of Waterloo and General Management at Stanford GSB. She also advises early-stage startups on product strategy, customer development, and company building.

Episode Notes

Show Notes

(02:00) Shinji reflected on her academic experience studying Software Engineering at the University of Waterloo in the late 2000s.
(04:19) Shinji shared valuable lessons learned from her undergraduate co-op experience with statistical analysis at Sun Microsystems, software engineering at Barclays Capital, and growth marketing at Facebook.
(08:52) Shinji shared lessons learned from being a Management Consultant at Deloitte.
(14:01) Shinji revisited her decision to quit the job at Deloitte and create a social puzzle game called Shufflepix.
(17:42) Shinji went over her time working as a Product Manager at the mobile ad exchange network YieldMo.
(22:25) Shinji discussed the problem of stream processing at YieldMo, which sparked the creation of Concord.
(26:17) Shinji unpacked the pain points with existing stream processing frameworks and the competitive advantage of using Concord.
(33:19) Shinji recalled her time at Akamai — initially as a data engineer in the Platform Engineering unit and later as a product manager for the IoT Edge Connect platform.
(37:26) Shinji explained why sharing context knowledge around data remains a largely unsolved problem.
(42:07) Shinji unpacked the three capabilities of an ideal data discovery platform: (1) exposing up-to-date operational metadata along with the documentation, (2) tracking the provenance of data back to its source, and (3) guiding data usage.
(46:59) Shinji unpacked the benefits of plugging BI tools into data discovery platforms and collecting metadata, which facilitates better visibility and understanding.
(52:36) Shinji discussed the role of a data discovery platform within the modern data stack.
(53:59) Shinji shared the hurdles that her team has to go through while finding early adopters of Select Star.
(55:48) Shinji shared valuable hiring lessons learned at Select Star.
(01:00:00) Shinji shared fundraising advice for founders currently seeking the right investors for their startups.
(01:04:41) Closing segment.

Shinji’s Contact Info

Select Star’s Resources

Mentioned Content

Articles

“The Next Evolution of Data Catalogs: Data Discovery Platforms” (Feb 2021)
“Data Discovery for Business Intelligence” (May 2021)

People

Martin Kleppmann (Author of Designing Data-Intensive Applications)
Emily Riederer (Senior Analytics Manager at Capital One)
Anya Prosvetova (Tableau DataDev Ambassador)

Book

“Managing Oneself” (by Peter Drucker)

Notes

My conversation with Shinji was recorded back in July 2021. Since then, many things have happened at Select Star:

General Availability launch on Product Hunt: https://www.producthunt.com/posts/selectstar
Snowflake partnership on data governance: https://blog.selectstar.com/selectstar-and-snowflake-partner-to-take-data-governance-to-a-new-level-a9d274e1d4c6
Case studies with Pitney Bowes and Handshake

About the show

Datacast features long-form, in-depth conversations with practitioners and researchers in the data community to walk through their professional journeys and unpack the lessons learned along the way. I invite guests coming from a wide range of career paths — from scientists and analysts to founders and investors — to analyze the case for using data in the real world and extract their mental models (“the WHY and the HOW”) behind their pursuits. Hopefully, these conversations can serve as valuable tools for early-stage data professionals as they navigate their own careers in the exciting data universe.

Datacast is produced and edited by James Le. Get in touch with feedback or guest suggestions by emailing khanhle.1013@gmail.com.

Subscribe by searching for Datacast wherever you get podcasts or click one of the links below:

If you’re new, see the podcast homepage for the most recent episodes to listen to, or browse the full guest list.

Episode Transcription

Key Takeaways

Here are the highlights from my conversation with Shinji:

On Her Undergraduate Education at Waterloo

I would say the University of Waterloo is an amazing place, as I was surrounded by other brilliant people. I grew up in Calgary near Alberta (which is not the biggest Canadian city), so getting to a place where everyone else was smart and excited about computer science was super nice. I studied software engineering, an intense program requiring many all-nighters in the computer lab. However, the best part about Waterloo is the Co-Op program, as I had the chance to complete six internships for different experiences and found out what I liked/what I did not like before graduation.

Before going to college, I built websites and took programming classes in high school. Learning about concepts such as linked lists, queues, stacks, or B-trees in algorithms and data structures classes exposed me to new ways of solving problems.

On Her Co-Op Experiences

At Sun Microsystems, I initially contributed to the sales forecasting platform that displayed the front-end part of our forecasts, compared to the actuals. It was data visualization designed for sales, marketing, and operations people to use. In the following internship at Sun, I was invited back to work on the actual sales forecasting model — doing more statistical programming and data modeling in an application environment.
At Barclays Capital, I worked on deprecating old/unused database instances and managing how databases were used at the bank.
At Facebook, even though my role was called Growth Marketing, I actually wrote ETL jobs to pipe out ad campaigns data from Google Adwords, put the data into analysis, and calculated the ROI and keywords.

Overall, these experiences were all related to working with data, so I learned a lot about how to utilize data and metadata better.

There are also many personal lessons that I have learned because they were three very different environments that I worked in.

Sun Microsystems was a research lab, so it was amazing to be a part of a company that was putting a lot of money into research. I sat in rooms with people who wrote my textbooks, which was a great learning environment and experience.
Barclays was more of a business-oriented environment, where a lot of the day-to-day involved budgeting and forecasting for the bank.
Facebook at the time was a 1000-person company. Everyone was so invested in one common goal.

For anyone still in college, try many things until you find the ones you like and dig deeper. Even later in my career, a lot of what I learned back then still contributes to what I do daily.

On Management Consulting at Deloitte

Because I worked as a developer, data scientist, and data engineer in my co-ops, I did not necessarily want to be a normal software engineer by the time I graduated college. However, it was not super clear what role I should go into. My former manager and other colleagues at Facebook came from management consulting, so I was encouraged to try it out.

Additionally, I got into computer science because I was interested in learning how things were built. But throughout my internships, my interest has shifted towards what we are building and how it is defined. Once I started working more with data, those questions converted into: Why are we building this? How did we decide to invest in this campaign or build this product? These questions come from business strategy — how to set business goals to get there. I felt it was worthwhile to go to New York and learn more about the business side.

At Deloitte, I learned everything outside of what I learned from Silicon Valley: time management, project management, corporate communication. A lot of the work that I have done as a management consultant was gathering a ton of data/research and distilling them into a 20-page executive presentation — whether it be a corporate strategy for dealing with a financial regulation or going after a new market or a quarterly operating plan review for CEOs/COOs/CIOs. These projects were very short, and I had to deliver value to executives quickly. I recalled doing ten different projects within a short time period at Deloitte.

It is important that you want to find your allies and sponsors inside any large corporation. I was active in meeting new people and learning about different service lines (industries that the company was operating in, projects that were ongoing) to understand where I should go next and what I should eventually specialize in. A lot of senior partners at the firm took the time and energy to coach me and provide guidance. If you take the initiative to reach out and seek what you are looking for, you will find the right people who can give you guidance and advice.

On Building Shufflepix

I met smart, hardworking, and amazing people at Deloitte, but I was more interested in building products project-wise. A consultant builds a plan, suggests a recommendation, and then leaves the project for others to implement. I missed building products and decided that it was not the best fit to stay in consulting. I first wanted to pay off my student loan and save runway to take some time off. Overall, it was a risky move to build Shufflepix, but I was still young at the time and could always go back to working at a big tech company. Starting Shufflepix was more of a challenge to me to see how it goes.

That experience got me to appreciate the different parts of building a product. Shufflepix is a puzzle game that turns pictures into puzzles that you can send to your friends to solve. I amassed about 50,000 users, not a massive community but a good enough success to maybe try the next game. However, gaming is such a hit-driven market and was not necessarily a business I wanted to build. That is why I decided to find another company to work with for a while.

On Working at YieldMo

We grew really fast at YieldMo, from less than 20 people when I joined to about 80 people after a year or so. There were so many different things that I got involved in, broken down into three primary responsibilities:

General product management: being a liaison between engineering and business sides, defining priorities and projects to deliver, implementing new product features, and doing quality assurance. YieldMo positioned itself as a mobile ad exchange network and provided different types of ad formats that were performing much better than the industry standard. Traditionally, mobile ads have been either a small banner ad or a full page that abrupt the user experience. YieldMo came up with new ad formats that look like something in the App Store. We would inject our HTML5 code into the client-side, which ended up performing much better for our advertisers. We ended up hiring a head of design and built a series of tests to test these new ad formats (A/B testing, multivariate testing) and optimize them for our publishers. I ended up owning many of those processes and frameworks.
Operations: I got involved with onboarding people since new employees came to me with product-related questions. Many of them did not have a background in AdTech, which is full of acronyms. I put together a lot of documentation on these terms. There are many other things like building board decks or talking to customers as well.
Hiring and building teams: I helped hire the head of design, two or three PMs, four or five engineers, and data scientists. That has been a fun experience starting projects from scratch and handing them to others.

On Developing Concord

With the fast growth that YieldMo had, we ran into many challenges around scaling our data infrastructure. As an ad network, we got events data from all the publishers we worked with, like CNN, Fox News, Reuters. These include activity data, be it impression, click-through, or scrolling. All these activities flowing into the system were hitting about 10 billion per day. Our main streaming pipeline (put together with Kafka, Storm, and HDFS) started to break, especially on the Storm side. This is back in 2012–2013: pre-Flink, pre-Spark, and Storm was the best tool at the time for distributed stream processing.

Alex, the lead engineer at the company, was often called in the middle of the night to fix system failures. He was very frustrated and started working on his own stream processor on the side written in C++. He and I were close friends, and he needed help productizing it. We ended up spending nights and weekends talking about how amazing stream processing is and how it could change so many other companies’ trajectories. I also started talking with people from the gaming industry that had to deal with events data. We realized that this was not a problem restricted to the ad industry but also other sectors like gaming and finance. We also talked with folks who worked on Apache Samza or Storm, and they were receptive to our architecture and thoughts about stream processing.

The advantage of using Concord (compared to other stream processing frameworks at the time) lies in its flexibility and performance:

Flexibility: Concord uses a pub-sub-like operator model, where you can operate each part of the model separately. Each runs its own Mesos container that is state-full. Concord has a key-value data store that you can manage its state and run Cron jobs by metadata. Thus, they can do runtime deployment of the models. Traditionally, while working Storm or Spark, you have to restart the whole server and redeploy if you change a spot instance of any model. With Concord, you can change different parts of your DAG as your event processor runs. As a result, you can run many scheduling and management jobs concurrently in production. Furthermore, Concord supports API in Python, Ruby, Go, C++, Java, Scala, and the like, while most other distributed processing frameworks were JVM (mostly supporting Java and Scala).
Performance: We did many bake-offs against Spark, Flink, and Storm streaming systems and always performed 10 to 20 times better regarding throughput. Our basic latency was always in tens of milliseconds.

Many of our initial pilot customers were interested in Concord because it had an amazing performance gain. But they actually ended up utilizing Concord thanks to its flexibility: doing runtime deployment and swapping components while jobs were still running. That was very relevant to Akamai because Akamai ensured that its customers could have available services at any time.

On Leading The IoT Edge Connect Platform at Akamai

I was running my own company with less than ten people. Then I joined a 6000-person public company that was operating in different countries. While working in New York, I had a team of 20 people distributed between three locations: Cambridge, Santa Clara, and Krakow. So I had to work with them remotely and get the IoT platform our for Beta while partnering with other platform teams. At Akamai, if you launch any product, you need to get approval from other VPs and directors to have your product on Akamai’s CDN network. It is just how a big company runs.

Additionally, IoT Connect is a hosted, distributed message queuing telemetry transport (MQTT) broker with a stream processor on top. Whether it is distributed on the edge network or not is already a challenging project. I learned a lot about cutback planning: Usually, when you build software, you plan for how many users will use it. I had to plan which servers we would deploy to in which regions. Many of the CDN servers were not designed for heavy compute, and we were pushing for sensor data processing which requires heavy compute. Thus, I helped design server spec and the business model for them, which was interesting.

On Data Discovery Platforms

The problem that we are solving at Select Star is a problem that I have dealt with as a data consumer and data producer in the past. I called it the data discovery problem, where I defined data discovery as finding and understanding the data. Finding means being able to find where the dataset or the dashboard is. Understanding means being able to know what the data truly represents. Sharing context knowledge about data is closely tied to the current state of the environment that many data practitioners are working with today:

Data documentation is not done well in many companies: There is no table or column-level comments. They may have some documentation on Notion, Confluence, or Google Docs, but it is not up-to-date. It is not something that many people refer to and update on. The context about the data (what it is about, where it comes from, and how it can be used) becomes tribal knowledge. You have to find the people who have worked with that data in the past to understand such knowledge.
As more companies start to move to the Modern Data Stack and utilize cloud data warehouses/data lakes as their primary source of a federated database, not one team knows the answer to the data context anymore. The data platform team maybe manage the infrastructure and the data warehouse, but the actual data transformations are created partly by the marketing, finance, or product teams.
With the increasingly decentralized ownership of the data, more business stakeholders will have direct access to the data. Only 5 to 10 years ago, they used to get emails of how their businesses were doing as a report. Now, they have data access in Looker, Tableau, or Mode and can check out the dashboard themselves. They can drag and drop the fields to filter it. They start asking questions like which filter to use, how to slice the data, how to slice it by this dimension, etc.

Operational metadata refers to everything you can find from your information schema and your data warehouse: When was the last time the table was refreshed? How big is the table? How many rows are there? What is the query execution time like? Select Star brings out all of this metadata. The ideal data discovery platform should bring upfront such operational metadata along with the documentation. Documentation should allow the data analysts to add annotations on top of the data so that they can explain what this dataset is about.

The ideal data discovery platform should not just have the schema, search engine, and documentation capabilities. It should also give you other data contexts like the relationships, the usage, or different tables/columns/dashboards. The provenance of data is around data lineage: Where does the data come from? What dashboards are created out of this data? How is the data used in other places?

Guiding data usage is around the popularity of the data. That is to inform others about best practices of utilizing the data without necessarily always manually documenting the data. An ideal data discovery platform guides the users based on what all the analysts have already done. For example, you should not probably use that column or table because nobody has queried it in the last 90 days.

On Business Intelligence Meeting Data Discovery

The benefit of integrating BI tools into a data discovery platform is to see the impact of data going into business domains. First and foremost, the understanding of upstream dependencies and downstream impact is two-fold:

For any engineers who are about to change or update a table or a column, they will see which dashboards will be impacted and who may be impacted. Usually, when the data team has to do IT support, it is primarily because a business person says that there is something off on the reports. There can be many different issues around why the data is off, but it can often be because an engineer who has changed the upstream table did not realize that other dashboards were using the data. A BI integration provides a quick and effective impact analysis for the engineers.
For any analysts who debug dashboards (especially dashboards they did not write), they can quickly figure out when a table gets updated last time or whether a column is calculated correctly. It is easy to do that when there is a data lineage between BI tools and their data discovery platform.

The second benefit is utilizing popularity in your data. With Select Star, your database tables are ordered by popularity. You can see the dashboards or tables that nobody is using anymore. There tend to be so many dashboards created as ad-hoc answers to ad-hoc business questions, but nobody wants to delete them because, who knows, maybe somebody is using them. Decluttering old dashboards is helpful because using the wrong dashboard that nobody is updating anymore can be a disaster if business decisions are made based on it.

The third benefit is empowering self-serve analytics. As more business stakeholders look at reports, workbooks, and dashboards (not tables and columns), they need a tool that enables them to get closer to the data. Having the BI integration lets them self-serve themselves and reduce the time required for the data team to perform ad-hoc support.

On Finding Early Adopters

Our current challenge is marketing. When you are a startup (regardless of the industry or business model), how do you find customers interested in your product when most people do not know about you? In the early days, I have been relying heavily on my network and scouting the right type of companies/people who might be interested in what we are building at Select Star. So far, it has been mostly warm intros. The other big part is the posts that I have written. Fortunately, many people enjoy those articles and have requested product demos.

I personally have not spent as much time and energy on marketing and distribution, which I now plan to do more.

On Hiring

It has been amazing to work with the current team, but it has been hard to find these people. Eventually, it becomes a number game: you do have to meet and talk to a lot of people to find the right ones. It is not just about the match of skill set and level. It is also about what they are interested in, whether a specific area or their fit for an early-stage startup.

Hiring is hard, but it is also worth keeping the bar high. Once someone joins your team, you will spend a lot of time and effort to ensure they are onboarded and ramped up. There is a big impact that each individual makes inside the team. If things do not work out, you have to make hard decisions quickly. Initially, there were a couple of people who were not necessarily the right fit for us, but you are going to make many mistakes as a founder (especially hiring). That is fine since you always need to rinse and repeat to figure things out in the long run.

We look for people who are growth-driven and interested in new problems and challenges. Because they have such interest and passion, they are more proactive than passive. I also value people with high integrity: doing the right thing and being collaborative/open. A lot of the startup life is hard but fun, so you can grow quickly while having fun. People who value that are more fit to startups, rather than those who care about compensation, title, or the exact role they do every day. Even as individual contributors, people end up wearing multiple hats, so there are those who enjoy that type of challenge and those who necessarily do not like that.

On Fundraising

Specifically for any data founders (and any other enterprise tech founders), it is crucial to find investors who know the data space. In the beginning, many founders think that: this investor invested in a company adjacent to what we do, so he/she would probably be interested in what we are doing. For example, I might say that this investor invested in Looker or Google Analytics, so he/she must be interested in Select Star as well. Usually, it is not because product analytics is different from data catalogs.

Thus, founders must do their homework to understand which funds and partners are excited about the data space. Many of them publish blog posts and stay active on Twitter, talking about their interests. So doing the research upfront will save founders a lot of time throughout the fundraising process.