Episode 54: Information Retrieval Research, Data Science For Space Missions, and Open-Source Software with Chris Mattmann

Episode Summary

Chris Mattmann is the Chief Technology and Innovation Officer at NASA JPL, where he was also the lab's first Principal Scientist in the area of Data Science. He has over 19 years of experience at JPL and has conceived, realized, and delivered the architecture for the next generation of reusable science data processing systems for NASA's Orbiting Carbon Observatory, NPP Sounder PEATE, and Soil Moisture Active Passive (SMAP) Earth science missions. His work has been funded by NASA, DARPA, DHS, NSF, NIH, NLM, and private industry and commercial partnerships.

He was the first Vice President (VP) of Apache OODT (Object Oriented Data Technology), the first NASA project at the Apache Software Foundation (ASF), and he led the project's transition from JPL to the ASF. A longtime open-source contributor, he served as a Director of the ASF from 2013 to 2018. He was one of the initial contributors to Apache Nutch, the predecessor to Apache Hadoop, as a member of its project management committee. He is the progenitor of the Apache Tika framework — the digital "babel fish" and the de facto content analysis and detection framework in use today. Today he contributes to TensorFlow and all things machine learning.

Finally, he is the Director of the Information Retrieval & Data Science (IRDS) group at USC and an Adjunct Associate Professor there, teaching graduate courses in Content Detection & Analysis and in Search Engines & Information Retrieval. He has materially contributed to the understanding of the Deep Web and Dark Web through the DARPA MEMEX project, and his work helped uncover the Panama Papers scandal.

Episode Notes


His Contact Info

His Recommended Resources

Episode Transcription

Key Takeaways

Here are the highlights from my interview with Chris Mattmann.

On Studying Computer Science at USC

USC was a lot of hard work — sitting in the computer lab learning Linux, which I knew nothing about at the time.

Data structures really kicked it off for me — I learned how to store and manipulate state information using algorithms. The key for me was learning linear algebra, which is about transforming states using computation. That was a nice complement to computer science.

In my second year at USC, I got a JPL job and became interested in databases and data modeling.

On Working at JPL as an Undergraduate

I sat in the computer lab one late night and applied for jobs on bulletin boards. One posting was looking for a database programmer to work on earthquake databases. I was a decent enough programmer, so I got the job.

Within my first three weeks at JPL, my project got canceled. They put me on a few other earthquake projects with the scientists at Caltech — writing SQL queries for atmospheric science, putting files on disk for the scientists to search through, etc.

On His Ph.D. Thesis about Software Architecture

I graduated with my bachelor's and was working at JPL. I was motivated to get a Master's degree to increase my income down the road.

The second class I took for my Master's was "Software Engineering for Embedded Systems," taught by Dr. Nenad Medvidović, a young professor who had just come out of a top-notch software engineering lab at UCI. He inspired me to pursue research afterward.

At JPL, I was working a lot on designing data systems. I realized that software architects often could not explain the design decisions behind how their systems handled data dissemination. That gap motivated the contribution of my Ph.D. dissertation:

I believe my work changed how we capture engineering knowledge about data systems at JPL and defined processes for evaluating data system design.

On Developing Apache Tika

Towards the end of my Ph.D., I took a class on search engines — one of the very few search engine classes in the US. For the final project, I used Nutch, an open-source web crawler framework by Doug Cutting (who also created Hadoop and Lucene), to build an RSS-parsing toolkit. I then became interested in the vibrant Apache open-source community and became a Nutch committer.

Fast forward to 2007: Nutch had been re-architected around MapReduce. Nutch was this big, grandiose web crawler, but inside it lay a distributed file system, a distributed computation platform, a user interface, ranking and scoring methods, and even a content detection framework. Jérôme Charron and I pitched Tika as a content detection framework separated out from Nutch. Later on, Jukka Zitting came on board to push Tika to the finish line. The first generation of Tika made its way into financial institutions like Equifax and FICO.

For the second generation of Tika, I worked on DARPA's MEMEX project to stop human trafficking, building search engines that could mine the dark web for information. As we improved Tika to support multimedia formats, a journalist-programmer used the framework to analyze the Panama Papers data leak.
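
Tika itself is a Java framework, but the core detection idea described here — identifying a file's type from its leading "magic" bytes rather than trusting its extension — can be sketched in a few lines of Python. The signature table below is a small, illustrative subset, not Tika's actual registry:

```python
# Illustrative sketch of magic-byte content detection (the idea behind
# frameworks like Apache Tika) — not Tika's real implementation.
MAGIC_SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"PK\x03\x04", "application/zip"),  # also the container for DOCX/XLSX
]

def detect_mime(data: bytes) -> str:
    """Guess a MIME type from a buffer's leading bytes."""
    for signature, mime in MAGIC_SIGNATURES:
        if data.startswith(signature):
            return mime
    return "application/octet-stream"  # fallback for unknown content
```

Real detectors layer many such signals (signatures, filename hints, character encodings) and fall back gracefully, which is what makes a "babel fish" for file formats hard.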

On Teaching at USC

“Software Architectures”: My advisor Neno created that class, and most of its content came from his book “Software Architecture: Foundations, Theory, and Practice.” I helped teach it after graduating from USC.

“Information Retrieval and Web Search Engines”: My former teacher Ellis Horowitz created that class, and I helped teach it over several semesters. It taught students the foundations of search engines. Given my systems background, I focused on technologies like Lucene, Solr, Nutch, and Elasticsearch.
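
As a rough illustration of the foundations such a course covers — and that Lucene implements at production scale — here is a minimal inverted index with TF-IDF ranking. The documents and names are invented for the example:

```python
import math
from collections import Counter, defaultdict

# Toy corpus; in Lucene/Solr these would be indexed documents.
docs = {
    1: "soil moisture data from the SMAP mission",
    2: "search engines index web data",
    3: "the mission of a search engine is relevance",
}

# Inverted index: term -> set of doc ids (the postings list).
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(query: str) -> list[int]:
    """Rank documents matching any query term by a simple TF-IDF score."""
    n = len(docs)
    scores = Counter()
    for term in query.lower().split():
        postings = index.get(term, set())
        if not postings:
            continue
        idf = math.log(n / len(postings))  # rarer terms weigh more
        for doc_id in postings:
            tf = docs[doc_id].lower().split().count(term)
            scores[doc_id] += tf * idf
    return [doc_id for doc_id, _ in scores.most_common()]
```

Production engines replace the linear term counting with precomputed term frequencies and more refined scoring (e.g. BM25), but the postings-list structure is the same.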

“Content Detection and Analysis for Big Data”: We spent a lot of time talking about big data — the five Vs, data mash-ups, etc. The first assignment was data enrichment: adding more features to the dataset. The second was content extraction at scale: generating structured data from unstructured data. The third was visualization and communication of your data science results. I have run this class with UFO data, polar data, job data in South America, etc.
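
A toy version of that second assignment — pulling structured records out of unstructured text — might look like this in Python. The text, field names, and patterns are purely illustrative, not the actual coursework:

```python
import re

# Hypothetical unstructured input; the real assignment works at scale
# over large, messy datasets.
raw = "Contact Dr. Smith at smith@usc.edu about the 2014-06-02 polar dataset."

# Extract a structured record with simple regular expressions.
record = {
    "emails": re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", raw),
    "dates": re.findall(r"\d{4}-\d{2}-\d{2}", raw),
}
```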

On Leading the Information Retrieval and Data Science Group at USC

At the IRDS group, we have trained 40 Master's students and 36 undergraduates as research assistants, along with 3 post-docs. We have received a few big NSF grants on polar cyberinfrastructure and open-source software. Today, we spend our time working on Sparkler (a web crawler) and Dockerized ML capabilities for Tika. These projects serve as the research arm for what we eventually operationalize at NASA.

The group is also a good pipeline for giving USC students a chance to partner with NASA projects. I helped fill half of the search engine team via this group.

On His NASA JPL Career

In the first 10 years, I was on the engineering and science side — initially on data systems and then on missions. I worked on projects such as the Orbiting Carbon Observatory space mission and the Soil Moisture Active Passive Earth science mission. I was also the computing lead for the Airborne Snow Observatory, an airborne mission.

After that, I went into technology development. From year 10 to year 15, I built a portfolio of $60-to-70 million technology programs with DARPA, NSF, and commercial industry.

From year 15 until now, I have been in the IT department, where my goal has been to mature the people and the data science discipline at JPL. As the Deputy CTO and then a division manager reporting to the CIO, I make sure that AI can be used for NASA's missions (like creating robots that operate on the surface of Mars).

On The Apache Software Foundation

The Apache Software Foundation is a 501(c)(3) non-profit organization with a billion-dollar valuation. Tools like Hadoop and Spark would not exist without it.

In the last decade, being on the Apache board was like being plugged into everything important in software — projects starting up, how companies are using them, big decisions on open-source licensing, etc.

What excites me about open-source software nowadays:

  1. MLOps and open-source ML frameworks developed by big organizations (like TensorFlow)
  2. The future of learning with fewer labels (zero-shot and one-shot learning, for instance)
  3. AutoML that automates data science workflows such as model development, selection, and evaluation

On “A Vision For Data Science”

To get the best out of big data, I believe that four advancements are necessary:

On Writing “Machine Learning with TensorFlow”

While writing the book, the biggest challenge for me was getting access to the systems needed to go from training models on my laptop to distributed training.

Compared to the first edition, here are my core contributions:

  1. Updating code to the latest TensorFlow version 2.3
  2. Recognizing the importance of data preparation

On Differences Between Academia and Industry

On The Tech Community in Los Angeles

The tech scene here is dispersed.

The whole Silicon Beach thing consists of entertainment and aerospace engineering companies, coupled with software technology. We have many engineering and business innovators, ranging from big institutions like NASA to universities like USC and Caltech. Furthermore, we put back into the city what we get out of it.