Spark Summit 2017

Speakers

Alex Dadgar (Project Lead HashiCorp)

Alex is the project lead for Nomad, a distributed, highly available cluster scheduler by HashiCorp. Prior to joining HashiCorp, Alex worked at Google, where he architected a stream-processing system to handle terabytes of YouTube data a day. Having seen the dream of infrastructure at Google, he joined HashiCorp to build it for the rest of the world!

Sessions

Wed, 7th Jun Room 2003

11:40 - 12:10 • Homologous Apache Spark Clusters using Nomad

Alex Dadgar Project Lead HashiCorp

Ali Zaidi (Data Scientist Microsoft)

Ali is a data scientist in the Algorithms and Data Science team at Microsoft. He spends his day trying to make distributed computing in the cloud easier, more efficient, and more enjoyable for data scientists and developers alike. He focuses on R, Spark, and Bayesian learning.

Sessions

Tue, 6th Jun Room 2003

14:40 - 15:10 • Extending the R API for Spark with SparkR and Microsoft R Server

Wed, 7th Jun Room 2002

14:00 - 14:30 • Natural Language Processing with CNTK and Apache Spark

Ali Zaidi Data Scientist Microsoft

David Ojika (Doctoral Student University of Florida)

David Ojika is an Intel-fellowship recipient and a 4th-year doctoral student of computer engineering at the University of Florida. He completed several internships at Intel, working on near-memory accelerators and on heterogeneous platforms (Xeon+FPGA). Working with Dr. Darin Acosta and Dr. Ann Gordon-Ross, his research focuses on the intersection of computing and physics by investigating machine learning systems that enhance the study of high-energy particles (such as muons) at CERN. In the summer of 2017, David will join Microsoft’s AI & Research group to embark on an internship with the group’s Project Catapult.

Sessions

Tue, 6th Jun Room 2002

17:00 - 17:30 • Speeding up Spark with Data Compression on Xeon+FPGA

David Ojika Doctoral Student University of Florida

Derek Bennett (Platform Infrastructure Team Lead Stitch Fix)

Derek Bennett is the lead for the Platform Infrastructure team in the Algorithms group at Stitch Fix. He and his team develop and support our Spark capabilities and event logging infrastructure using Amazon Kinesis and Apache Kafka, along with associated tools and applications to help make data available and usable. Derek holds a Ph.D. in Operations Research from UC Berkeley.

Sessions

Tue, 6th Jun Room 2016

14:40 - 15:10 • Scaling Data Science Capabilities with Apache Spark at Stitch Fix

Derek Bennett Platform Infrastructure Team Lead Stitch Fix

Felix Cheung (PMC/Committer Microsoft)

Felix Cheung is a Committer of Apache Spark and a PMC/Committer of Apache Zeppelin. He has been active in the Big Data space for 3+ years, is a co-organizer of the Seattle Spark Meetup, and has presented several times. In the summer of 2015 he was a teaching assistant for the very popular edX MOOCs Introduction to Big Data with Apache Spark and Scalable Machine Learning.

Sessions

Tue, 6th Jun Room 2020

11:00 - 11:30 • SSR: Structured Streaming on R for Machine Learning

Felix Cheung PMC/Committer Microsoft

Gwen Shapira (Product Manager Confluent)

Gwen is a product manager at Confluent. She has 15 years of experience working with code and customers to build scalable data architectures, integrating relational and big data technologies. Gwen is the author of “Kafka: The Definitive Guide” and “Hadoop Application Architectures”, and a frequent presenter at industry conferences. Gwen is a PMC member on the Apache Kafka project and a committer on Apache Sqoop. When Gwen isn’t building data pipelines or thinking up new features, you can find her pedaling on her bike exploring the roads and trails of California, and beyond.

Sessions

Wed, 7th Jun Room 2016

17:00 - 17:30 • Stream All Things: Patterns of Modern Data Integration

Gwen Shapira Product Manager Confluent

Hossein Falaki (Software Engineer Databricks)

Hossein Falaki is a software engineer and data scientist at Databricks, working on the next big thing. Prior to that, he was a data scientist working on Apple’s personal assistant, Siri. He graduated with a Ph.D. in Computer Science from UCLA, where he was a member of the Center for Embedded Networked Sensing (CENS).

Sessions

Wed, 7th Jun Room 2002

16:20 - 16:50 • Apache SparkR under the hood: Your SparkR Applications

Hossein Falaki Software Engineer Databricks

J White Bear (IBM)

University of Michigan: Computer Science (Databases, Machine Learning/Computational Biology, Cryptography). University of California San Francisco: Computational Biology/Bioinformatics (Machine Learning/Multi-Objective Optimization/Statistical Mechanics for Protein-Protein Interactions). McGill University: Machine Learning/Multi-Objective Optimization for Path Planning/Cryptography.

Sessions

Tue, 6th Jun Room 2020

15:20 - 15:50 • An Online Spark Pipeline: Semi Supervised Learning and Automatic Retraining with Spark Streaming

J White Bear IBM

Jennifer Shin (Founder 8 Path Solutions)

Sessions

Tue, 6th Jun Room 2022

14:40 - 15:10 • Fuzzy Matching on Apache Spark

Jennifer Shin Founder 8 Path Solutions

Jim Dowling (Associate Professor KTH Royal Institute of Technology)

Jim Dowling is an Associate Professor at the School of Information and Communications Technology in the Department of Software and Computer Systems at KTH Royal Institute of Technology as well as a Senior Researcher at SICS – Swedish ICT. He received his Ph.D. in Distributed Systems from Trinity College Dublin (2005) and worked at MySQL AB (2005-2007). He is a distributed systems researcher and his research interests are in the area of large-scale distributed computer systems. He is lead architect of Hadoop Open Platform-as-a-Service (www.hops.io), a next generation distribution of Hadoop for Humans.

Sessions

Tue, 6th Jun Room 2020

11:40 - 12:10 • Structured-Streaming-as-a-Service with Kafka, YARN and Tooling

Jim Dowling Associate Professor KTH Royal Institute of Technology

Jonathan Bloom (Co-Founder, Hail Team Broad Institute of MIT and Harvard)

Jonathan Bloom is a mathematician, engineer, and co-founder of the Hail team at the Broad Institute of MIT and Harvard. Prior to joining the Broad, he did research in geometry and algebraic topology as a Moore Instructor and NSF Fellow in Mathematics at the Massachusetts Institute of Technology. While there, he re-architected the department’s introductory course on probability and statistics, now available on MIT OpenCourseWare. He received his B.A. from Harvard University and Ph.D. from Columbia University in Mathematics.

Sessions

Tue, 6th Jun Room 2002

11:00 - 11:30 • Scaling Genetic Data Analysis with Apache Spark

Jonathan Bloom Co-Founder, Hail Team Broad Institute of MIT and Harvard

Jordan Volz (Systems Engineer Cloudera)

Jordan Volz is a Systems Engineer at Cloudera. He helps clients design and implement big data solutions using Cloudera’s Distribution of Hadoop across a variety of industry verticals. Previously, he worked as a consultant for HP Autonomy, delivering compliance archiving, e-Discovery, and electronic surveillance solutions to regulated financial services companies, and as a developer at Epic Systems, building HIPAA-compliant EMR software.

Sessions

Wed, 7th Jun Room 2016

11:00 - 11:30 • Archiving, E-Discovery, and Supervision with Spark and Hadoop

Jordan Volz Systems Engineer Cloudera

Joseph Bradley (Software Engineer Databricks)

Joseph Bradley is a Spark Committer working on MLlib at Databricks. Previously, he was a postdoc at UC Berkeley after receiving his Ph.D. in Machine Learning from Carnegie Mellon U. in 2013. His research included probabilistic graphical models, parallel sparse regression, and aggregation mechanisms for peer grading in MOOCs.

Sessions

Tue, 6th Jun Room 2006

14:00 - 14:30 • Apache Spark MLlib's Past Trajectory and New Directions

Joseph Bradley Software Engineer Databricks

Kimoon Kim (Pepperdata)

Kimoon joined Pepperdata in 2013. Previously, he worked for the Google Search and Yahoo Search teams for many years. Kimoon has hands-on experience with large distributed systems processing massive data sets.

Sessions

Wed, 7th Jun Room 2003

11:00 - 11:30 • HDFS on Kubernetes: Lessons Learned

Kimoon Kim Pepperdata

Leah McGuire (Technical Staff Salesforce.com)

Leah McGuire is a Lead Member of Technical Staff at Salesforce, building platforms to enable the integration of machine learning into Salesforce products. Before joining Salesforce, Leah was a Senior Data Scientist on the data products team at LinkedIn working on personalization, entity resolution, and relevance for a variety of LinkedIn data products. She completed a PhD and a Postdoctoral Fellowship in Computational Neuroscience at the University of California, San Francisco, and at University of California, Berkeley, where she studied the neural encoding and integration of sensory signals.

Sessions

Wed, 7th Jun Room 2022

11:00 - 11:30 • Embracing a Taxonomy of Types to Simplify Machine Learning

Leah McGuire Technical Staff Salesforce.com

Matteo Interlandi (Scientist Microsoft CISL)

Matteo Interlandi recently joined Microsoft CISL as a Research Scientist. Prior to joining Microsoft, Matteo was a Postdoctoral Scholar at the University of California, Los Angeles. His research lies at the intersection of databases, distributed systems and declarative languages. In particular, he loves to build systems and tools that make it easier to design and implement data-driven distributed applications.

Sessions

Tue, 6th Jun Room 2002

11:40 - 12:10 • Lazy Join Optimizations without upfront statistics

Matteo Interlandi Scientist Microsoft CISL

Michael Malak (Oracle)

Michael Malak is the lead author of Spark GraphX In Action and has been developing Spark solutions at two Fortune 200 companies since early 2013. He has been programming computers since before they could be bought pre-assembled in stores.

Sessions

Wed, 7th Jun Room 2002

12:20 - 12:50 • Neuro-Symbolic AI for Sentiment Analysis

Michael Malak Oracle

Min Shen (Engineer LinkedIn)

Min Shen is an engineer on LinkedIn’s Hadoop infrastructure development team, where he builds services and tools to tackle scaling challenges in operating large-scale multi-tenancy Hadoop deployments. Recently, he has been helping create tools to support operating Spark at scale as well as developing and running Spark jobs easily at LinkedIn.

Sessions

Tue, 6th Jun Room 2022

12:20 - 12:50 • Random Walks on Large Scale Graphs with Apache Spark

Min Shen Engineer LinkedIn

Nan Zhu (Software Engineer Microsoft)

Nan Zhu is a Software Engineer at Microsoft, where he works on serving Spark Streaming/Structured Streaming on Azure HDInsight. He is a contributor to Apache Spark (known as CodingCat) and also serves as a committee member of the Distributed Machine Learning Community (DMLC) and Apache MXNet (incubating).

Sessions

Tue, 6th Jun Room 2003

12:20 - 12:50 • Building a Unified Data Pipeline with Apache Spark and XGBoost

Nan Zhu Software Engineer Microsoft

Nikita Shamgunov (CTO MemSQL)

Nikita Shamgunov co-founded MemSQL and has served as CTO since inception. Prior to co-founding the company, Nikita worked on core infrastructure systems at Facebook. He served as a senior database engineer at Microsoft SQL Server for more than half a decade. Nikita holds a bachelor’s, master’s and doctorate in computer science, has been awarded several patents and was a world medalist in ACM programming contests.

Sessions

Wed, 7th Jun Room 2022

14:40 - 15:10 • Real-Time Image Recognition with Apache Spark

Nikita Shamgunov CTO MemSQL

Patrick Stuedi (Research Staff Member IBM)

Patrick Stuedi is a member of the research staff at IBM Research Zurich. His research interests are in distributed systems, networking and operating systems. He graduated with a PhD from ETH Zurich in 2008 and spent two years (2008-2010) as a Postdoc at Microsoft Research Silicon Valley. His current work is about exploiting fast network and storage hardware in data processing systems.

Sessions

Tue, 6th Jun Room 2002

12:20 - 12:50 • Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash

Patrick Stuedi Research Staff Member IBM

Prabhu Kasinathan (Chief Data Engineer PayPal)

Prabhu Kasinathan is the chief data engineer for the Big Data Platform at PayPal, with 5+ years of big data experience. He is creating APIs, tools and services for the Spark platform to support multi-tenancy and large-scale, computation-intensive applications. He is an expert in building data warehousing solutions on the Hadoop and Teradata platforms, with 11+ years of data experience.

Sessions

Tue, 6th Jun Room 2016

11:00 - 11:30 • Spark Compute as a Service at PayPal

Prabhu Kasinathan Chief Data Engineer PayPal

Ross Gardler (VP Apache Software Foundation)

Ross Gardler has been involved with open source in one form or another since the mid ‘90s. He is a member of the Apache Software Foundation where he currently serves as the foundation’s President. He works at Microsoft on the Linux Compute team in Azure where he is responsible for the Azure Container Service.

Sessions

Wed, 7th Jun Room 2002

17:00 - 17:30 • Creating Personalized Container Solutions with Container Services

Ross Gardler VP Apache Software Foundation

Ryan Blue (Netflix)

Ryan Blue works on open source projects, including Spark, Avro, and Parquet, at Netflix.

Sessions

Wed, 7th Jun Room 2006

14:00 - 14:30 • Improving Apache Spark with S3

Ryan Blue Netflix

Ryan Williams (Software Developer Mount Sinai School of Medicine)

Ryan writes tools for analyzing genomic data using Spark at Hammer Lab.

Sessions

Tue, 6th Jun Room 2003

16:20 - 16:50 • More Algorithms and Tools for Genomic Analysis on Apache Spark

Ryan Williams Software Developer Mount Sinai School of Medicine

Sam Penrose (Mozilla)

Sam Penrose loves how working with data at scale for Mozilla brings out the power and beauty of mathematics. Previously he helped Industrial Light and Magic bring the power and beauty of giant robots out to movie screens everywhere.

Sessions

Wed, 7th Jun Room 2006

11:40 - 12:10 • Productive use of the Apache Spark Prompt

Sam Penrose Mozilla

Shay Nativ (Software Developer Redis Labs)

Shay is an experienced software developer, architect, and entrepreneur. He was the founder and VP R&D of Peak-Dynamics, an energy-saving solution for water utilities, and CTO at Utab, a web platform for musicians. Shay loves solving complex problems and writing performant code.

Sessions

Tue, 6th Jun Room 2003

17:40 - 18:10 • Building a Large Scale Recommendation Engine with Spark and Redis-ML

Shay Nativ Software Developer Redis Labs

Songtao Guo (Principal Data Scientist LinkedIn)

Songtao Guo is a Principal Data Scientist and tech lead of the Data Mining team at LinkedIn, where he leads many data-driven products and analytics systems. His work involves building a large-scale knowledge base, inventing data mining platforms to scale business analytics, and partnering with product, sales, and marketing to deliver impactful solutions. Before joining LinkedIn, Songtao was a senior researcher at AT&T Interactive, focusing on improving data quality and search relevancy for local business search. He holds a PhD in computer science from the University of North Carolina at Charlotte.

Sessions

Tue, 6th Jun Room 2016

15:20 - 15:50 • Transforming B2B Sales with Spark-Powered Sales Intelligence

Songtao Guo Principal Data Scientist LinkedIn

Ted Malaska (Technical Group Architect Blizzard Inc.)

Ted is working on the Battle.net team at Blizzard, helping support great titles like World of Warcraft, Overwatch, Hearthstone, and much more. Previously, he was a Principal Solutions Architect at Cloudera, helping clients be successful with Hadoop and the Hadoop ecosystem. Before that, he was a Lead Architect at the Financial Industry Regulatory Authority (FINRA). He has also contributed code to Apache Flume, Apache Avro, Apache YARN, Apache HDFS, Apache Spark, Apache Sqoop, and many more. Ted is also a co-author of O’Reilly’s “Hadoop Application Architectures”, a frequent speaker at conferences, and a frequent blogger on data architectures.

Sessions

Tue, 6th Jun Room 2006

15:20 - 15:50 • Tricks of the Trade to be an Apache Spark Rock Star

Ted Malaska Technical Group Architect Blizzard Inc.

Tejas Patil (Software Engineer Facebook)

Tejas is a software engineer at Facebook. For the past 3 years, he has been part of the Data Infrastructure group at Facebook, where he primarily works on building large-scale distributed data processing systems responsible for handling batch workloads. He is currently a PMC member and committer of Apache Nutch and has contributed to several open source projects. Tejas obtained a Master’s Degree in Computer Science from the University of California, Irvine.

Sessions

Tue, 6th Jun Room 2006

12:20 - 12:50 • Hive Bucketing in Apache Spark

Tejas Patil Software Engineer Facebook

Timothy Poterba (Engineer and Computational Biologist Broad Institute of MIT and Harvard)

Tim Poterba is an engineer and computational biologist on the Hail team at the Broad Institute of MIT and Harvard. Prior to joining the Broad, he studied protein folding dynamics at the Max Planck Institute for Biochemistry on a Fulbright Scholarship. He received his B.A. in Biophysics from Amherst College in 2013.

Sessions

Tue, 6th Jun Room 2002

11:00 - 11:30 • Scaling Genetic Data Analysis with Apache Spark

Timothy Poterba Engineer and Computational Biologist Broad Institute of MIT and Harvard

Wei Di (Business Analytic Data mining team LinkedIn)

Wei Di is currently a staff member on the Business Analytic Data Mining team. She is passionate about creating smart and scalable solutions that can impact millions of individuals and empower successful businesses. She has wide interests covering artificial intelligence, machine learning and computer vision. She was previously associated with eBay Human Language Technology and eBay Research Labs, with a focus on large-scale image understanding and joint learning from visual and text information. Prior to that, she was with Ancestry.com, working in the areas of record linkage and search relevance. She received her PhD from Purdue University in 2011.

Sessions

Tue, 6th Jun Room 2016

15:20 - 15:50 • Transforming B2B Sales with Spark-Powered Sales Intelligence

Wei Di Business Analytic Data mining team LinkedIn

Xiangrui Meng (Software Engineer Databricks)

Xiangrui Meng is an Apache Spark PMC member and a software engineer at Databricks. His main interests center around developing and implementing scalable algorithms for scientific applications. He has been actively involved in the development and maintenance of Spark MLlib since he joined Databricks. Before Databricks, he worked as an applied research engineer at LinkedIn, where he was the main developer of an offline machine learning framework in Hadoop MapReduce. His Ph.D. work at Stanford is on randomized algorithms for large-scale linear regression problems.

Sessions

Tue, 6th Jun Room 2022

11:00 - 11:30 • Challenging Web-Scale Graph Analytics with Apache Spark

Xiangrui Meng Software Engineer Databricks

Yin Huai (Software Engineer Databricks)

Yin Huai is a Software Engineer at Databricks and mainly works on Spark SQL. Before joining Databricks, he was a PhD student at The Ohio State University and was advised by Xiaodong Zhang. His interests include storage systems, database systems, and query optimization. He is also an Apache Hive committer.

Sessions

Tue, 6th Jun Room 2006

11:00 - 11:30 • A Deep Dive into Spark SQL Catalyst Optimizer

Yin Huai Software Engineer Databricks