Shinji Kim is the Founder & CEO of Select Star, an intelligent data discovery platform that helps you understand your data. Previously, she was the CEO of Concord Systems, an NYC-based data infrastructure startup acquired by Akamai Technologies in 2016. She led the development of Akamai’s Internet-of-Things data platform for real-time messaging, log processing, and edge computing. Prior to Concord, Shinji was the first Product Manager hired at Yieldmo, where she led the Ad Format Lab, A/B testing, and yield optimization. Before Yieldmo, she analyzed data and built enterprise applications at Deloitte Consulting, Facebook, Sun Microsystems, and Barclays Capital. Shinji studied Software Engineering at the University of Waterloo and General Management at Stanford GSB. She also advises early-stage startups on product strategy, customer development, and company building.
My conversation with Shinji was recorded back in July 2021. Since then, many things have happened at Select Star:
Datacast features long-form, in-depth conversations with practitioners and researchers in the data community to walk through their professional journeys and unpack the lessons learned along the way. I invite guests coming from a wide range of career paths — from scientists and analysts to founders and investors — to analyze the case for using data in the real world and extract their mental models (“the WHY and the HOW”) behind their pursuits. Hopefully, these conversations can serve as valuable tools for early-stage data professionals as they navigate their own careers in the exciting data universe.
Datacast is produced and edited by James Le. Get in touch with feedback or guest suggestions by emailing khanhle.1013@gmail.com.
Subscribe by searching for Datacast wherever you get podcasts or click one of the links below:
If you’re new, see the podcast homepage for the most recent episodes to listen to, or browse the full guest list.
Here are the highlights from my conversation with Shinji:
I would say the University of Waterloo is an amazing place, as I was surrounded by other brilliant people. I grew up in Calgary, Alberta (which is not the biggest Canadian city), so getting to a place where everyone else was smart and excited about computer science was super nice. I studied software engineering, an intense program requiring many all-nighters in the computer lab. However, the best part about Waterloo is the Co-op program: I had the chance to complete six internships across different companies and figured out what I liked and did not like before graduation.
Before going to college, I built websites and took programming classes in high school. Learning about concepts such as linked lists, queues, stacks, or B-trees in algorithms and data structures classes exposed me to new ways of solving problems.
Overall, these experiences were all related to working with data, so I learned a lot about how to utilize data and metadata better.
There are also many personal lessons that I have learned because they were three very different environments that I worked in.
For anyone still in college, try many things until you find the ones you like and dig deeper. Even later in my career, a lot of what I learned back then still contributes to what I do daily.
Because I worked as a developer, data scientist, and data engineer in my co-ops, I did not necessarily want to be a normal software engineer by the time I graduated college. However, it was not super clear what role I should go into. My former manager and other colleagues at Facebook came from management consulting, so I was encouraged to try it out.
Additionally, I got into computer science because I was interested in learning how things were built. But throughout my internships, my interest shifted toward what we are building and how it is defined. Once I started working more with data, those questions turned into: Why are we building this? How did we decide to invest in this campaign or build this product? These questions come from business strategy: how to set business goals and how to get there. I felt it was worthwhile to go to New York and learn more about the business side.
At Deloitte, I learned everything outside of what I learned from Silicon Valley: time management, project management, and corporate communication. A lot of the work that I did as a management consultant was gathering a ton of data/research and distilling it into a 20-page executive presentation — whether it be a corporate strategy for dealing with a financial regulation, a plan for going after a new market, or a quarterly operating plan review for CEOs/COOs/CIOs. These projects were very short, and I had to deliver value to executives quickly. I recall doing ten different projects within a short time period at Deloitte.
It is important to find your allies and sponsors inside any large corporation. I was active in meeting new people and learning about different service lines (industries that the company was operating in, projects that were ongoing) to understand where I should go next and what I should eventually specialize in. A lot of senior partners at the firm took the time and energy to coach me and provide guidance. If you take the initiative to reach out and seek what you are looking for, you will find the right people who can give you guidance and advice.
I met smart, hardworking, and amazing people at Deloitte, but project-wise, I was more interested in building products. A consultant builds a plan, suggests a recommendation, and then leaves the project for others to implement. I missed building products and decided that staying in consulting was not the best fit. I first wanted to pay off my student loan and save enough runway to take some time off. Overall, it was a risky move to build Shufflepix, but I was still young at the time and could always go back to working at a big tech company. Starting Shufflepix was more of a challenge to myself to see how it would go.
That experience got me to appreciate the different parts of building a product. Shufflepix is a puzzle game that turns pictures into puzzles that you can send to your friends to solve. I amassed about 50,000 users, not a massive community but a good enough success to maybe try the next game. However, gaming is such a hit-driven market and was not necessarily a business I wanted to build. That is why I decided to find another company to work with for a while.
We grew really fast at Yieldmo, from fewer than 20 people when I joined to about 80 people after a year or so. There were so many different things that I got involved in, broken down into three primary responsibilities:
With the fast growth that Yieldmo had, we ran into many challenges around scaling our data infrastructure. As an ad network, we got event data from all the publishers we worked with, such as CNN, Fox News, and Reuters. These included activity events, be it impressions, click-throughs, or scrolls. All these activities flowing into the system added up to about 10 billion events per day. Our main streaming pipeline (put together with Kafka, Storm, and HDFS) started to break, especially on the Storm side. This was back in 2012–2013: pre-Flink, pre-Spark, and Storm was the best tool at the time for distributed stream processing.
Alex, the lead engineer at the company, was often called in the middle of the night to fix system failures. He was very frustrated and started working on his own stream processor, written in C++, on the side. He and I were close friends, and he needed help productizing it. We ended up spending nights and weekends talking about how amazing stream processing is and how it could change so many other companies' trajectories. I also started talking with people from the gaming industry who had to deal with event data. We realized that this was not a problem restricted to the ad industry; it also applied to other sectors like gaming and finance. We also talked with folks who worked on Apache Samza or Storm, and they were receptive to our architecture and thoughts about stream processing.
The advantage of using Concord (compared to other stream processing frameworks at the time) lies in its flexibility and performance:
Many of our initial pilot customers were interested in Concord because it had an amazing performance gain. But they actually ended up utilizing Concord thanks to its flexibility: doing runtime deployment and swapping components while jobs were still running. That was very relevant to Akamai because Akamai ensured that its customers could have available services at any time.
I was running my own company with fewer than ten people. Then I joined a 6,000-person public company operating in different countries. While working in New York, I had a team of 20 people distributed across three locations: Cambridge, Santa Clara, and Krakow. So I had to work with them remotely and get the IoT platform out for beta while partnering with other platform teams. At Akamai, if you launch any product, you need approval from other VPs and directors to put your product on Akamai's CDN network. It is just how a big company runs.
Additionally, IoT Connect is a hosted, distributed MQTT (Message Queuing Telemetry Transport) broker with a stream processor on top. Distributing it on the edge network was already a challenging project. I also learned a lot about capacity planning: usually, when you build software, you plan for how many users will use it. Here, I had to plan which servers we would deploy to in which regions. Many of the CDN servers were not designed for heavy compute, and we were pushing sensor data processing, which requires heavy compute. Thus, I helped design the server spec and the business model for them, which was interesting.
The problem that we are solving at Select Star is one that I have dealt with as both a data consumer and a data producer in the past. I call it the data discovery problem, where I define data discovery as finding and understanding the data. Finding means being able to locate where a dataset or a dashboard is. Understanding means being able to know what the data truly represents. Sharing contextual knowledge about data is closely tied to the current state of the environment that many data practitioners work in today:
Operational metadata refers to everything you can find from the information schema of your data warehouse: When was the table last refreshed? How big is the table? How many rows are there? What is the query execution time like? Select Star brings out all of this metadata. The ideal data discovery platform should surface such operational metadata upfront, along with documentation. Documentation should allow data analysts to add annotations on top of the data so that they can explain what a dataset is about.
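To make the idea concrete, here is a minimal Python sketch of an operational-metadata lookup over an in-memory store. The `table_metadata` schema, table names, and figures are all hypothetical (warehouses expose similar fields through their information schema); this is an illustration of the concept, not Select Star's implementation.

```python
import sqlite3
from datetime import datetime, timedelta

# Hypothetical operational-metadata store, mimicking the kind of fields a
# warehouse's information schema exposes: row count, size, last refresh.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE table_metadata (
        table_name     TEXT,
        row_count      INTEGER,
        size_mb        REAL,
        last_refreshed TEXT
    )
""")
now = datetime(2021, 7, 1)
conn.executemany(
    "INSERT INTO table_metadata VALUES (?, ?, ?, ?)",
    [
        ("orders", 1_200_000, 540.2, (now - timedelta(hours=2)).isoformat()),
        ("users", 85_000, 12.7, (now - timedelta(days=40)).isoformat()),
    ],
)

def describe(table_name):
    """Answer the basic discovery questions: how big is it, how fresh is it?"""
    row = conn.execute(
        "SELECT row_count, size_mb, last_refreshed "
        "FROM table_metadata WHERE table_name = ?",
        (table_name,),
    ).fetchone()
    return {"rows": row[0], "size_mb": row[1], "last_refreshed": row[2]}

print(describe("orders"))
```

A discovery tool would surface exactly this kind of summary next to each table, so an analyst does not have to query the information schema by hand.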
The ideal data discovery platform should not just have schema, search-engine, and documentation capabilities. It should also give you other data context, such as relationships and usage across tables, columns, and dashboards. The provenance of data is about data lineage: Where does the data come from? What dashboards are created out of this data? How is the data used in other places?
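As an illustration, data lineage can be modeled as a directed graph of upstream-to-downstream edges, and "what is built on top of this data?" becomes a graph traversal. The table and dashboard names below are made up for the sketch.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: (upstream, downstream).
edges = [
    ("raw_events", "stg_events"),
    ("stg_events", "fct_sessions"),
    ("fct_sessions", "dashboard_engagement"),
    ("stg_events", "dashboard_raw_qa"),
]

downstream = defaultdict(list)
for up, down in edges:
    downstream[up].append(down)

def downstream_impact(node):
    """BFS over lineage edges: everything affected if `node` changes."""
    seen, queue = set(), deque([node])
    while queue:
        for child in downstream[queue.popleft()]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)

print(downstream_impact("stg_events"))
# → ['dashboard_engagement', 'dashboard_raw_qa', 'fct_sessions']
```

Running the same traversal in the upstream direction answers the provenance question ("where does this dashboard's data come from?").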
Guiding data usage is about the popularity of the data. That is to inform others about best practices for utilizing the data without always having to document it manually. An ideal data discovery platform guides users based on what all the analysts have already done. For example, you should probably not use a column or table that nobody has queried in the last 90 days.
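A popularity-based guardrail like the 90-day rule above can be sketched in a few lines. The query-log summary and column names here are hypothetical; a real system would derive them from the warehouse's query history.

```python
from datetime import date

# Hypothetical query-log summary: column -> date it was last queried.
last_queried = {
    "orders.total_amount": date(2021, 6, 28),
    "orders.legacy_discount_code": date(2021, 2, 1),
}

def stale_columns(as_of, window_days=90):
    """Columns nobody has queried in the last `window_days` days."""
    return sorted(
        col for col, last in last_queried.items()
        if (as_of - last).days > window_days
    )

print(stale_columns(date(2021, 7, 1)))
# → ['orders.legacy_discount_code']
```

The same usage signal, inverted, yields the "most popular" ranking: sort tables and columns by how recently and how often they are queried.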
The benefit of integrating BI tools into a data discovery platform is to see the impact of data going into business domains. First and foremost, the understanding of upstream dependencies and downstream impact is two-fold:
The second benefit is utilizing popularity in your data. With Select Star, your database tables are ordered by popularity. You can see the dashboards or tables that nobody is using anymore. There tend to be so many dashboards created as ad-hoc answers to ad-hoc business questions, but nobody wants to delete them because, who knows, maybe somebody is using them. Decluttering old dashboards is helpful because using the wrong dashboard that nobody is updating anymore can be a disaster if business decisions are made based on it.
The third benefit is empowering self-serve analytics. As more business stakeholders look at reports, workbooks, and dashboards (not tables and columns), they need a tool that gets them closer to the data. Having the BI integration lets them serve themselves and reduces the time the data team spends on ad-hoc support.
Our current challenge is marketing. When you are a startup (regardless of the industry or business model), how do you find customers interested in your product when most people do not know about you? In the early days, I relied heavily on my network and scouted the right type of companies/people who might be interested in what we are building at Select Star. So far, it has been mostly warm intros. The other big part is the posts that I have written. Fortunately, many people enjoy those articles and have requested product demos.
I personally have not spent as much time and energy on marketing and distribution, which I now plan to do more.
It has been amazing to work with the current team, but it has been hard to find these people. Eventually, it becomes a numbers game: you do have to meet and talk to a lot of people to find the right ones. It is not just about matching skill set and level. It is also about what they are interested in, whether a specific problem area or their fit for an early-stage startup.
Hiring is hard, but it is also worth keeping the bar high. Once someone joins your team, you will spend a lot of time and effort to ensure they are onboarded and ramped up. Each individual makes a big impact inside the team. If things do not work out, you have to make hard decisions quickly. Initially, there were a couple of people who were not necessarily the right fit for us, but you are going to make many mistakes as a founder (especially in hiring). That is fine since you always need to rinse and repeat to figure things out in the long run.
We look for people who are growth-driven and interested in new problems and challenges. Because they have such interest and passion, they are proactive rather than passive. I also value people with high integrity: doing the right thing and being collaborative and open. A lot of startup life is hard but fun, so you can grow quickly while having fun. People who value that are a better fit for startups than those who care about compensation, title, or the exact role they play every day. Even as individual contributors, people end up wearing multiple hats; some enjoy that type of challenge, and others do not.
Specifically for data founders (and other enterprise tech founders), it is crucial to find investors who know the data space. In the beginning, many founders think: this investor invested in a company adjacent to what we do, so he/she would probably be interested in what we are doing. For example, I might say that this investor invested in Looker or Google Analytics, so he/she must be interested in Select Star as well. Usually, that is not the case, because product analytics is a different space from data catalogs.
Thus, founders must do their homework to understand which funds and partners are excited about the data space. Many of them publish blog posts and stay active on Twitter, talking about their interests. So doing the research upfront will save founders a lot of time throughout the fundraising process.