Speakers
Alex is the project lead for Nomad, a distributed, highly-available cluster scheduler by HashiCorp. Prior to joining HashiCorp, Alex worked at Google where he architected a streaming-processing system to handle terabytes of YouTube data a day. Having seen the dream of infrastructure at Google, he joined HashiCorp to build it for the rest of the world!
Sessions
Ali is a data scientist in the Algorithms and Data Science team at Microsoft. He spends his day trying to make distributed computing in the cloud easier, more efficient, and more enjoyable for data scientists and developers alike. He focuses on R, Spark, and Bayesian learning.
Sessions
Tue, 6th Jun Room 2003
14:40 - 15:10 • Extending the R API for Spark with SparkR and Microsoft R Server
Wed, 7th Jun Room 2002
14:00 - 14:30 • Natural Language Processing with CNTK and Apache Spark
David Ojika is an Intel-fellowship recipient and a 4th-year doctoral student of computer engineering at the University of Florida. He completed several internships at Intel, working on near-memory accelerators and on heterogeneous platforms (Xeon+FPGA). Working with Dr. Darin Acosta and Dr. Ann Gordon-Ross, his research focuses on the intersection of computing and physics by investigating machine learning systems that enhance the study of high-energy particles (such as muons) at CERN. In the summer of 2017, David will join Microsoft’s AI & Research group to embark on an internship with the group’s Project Catapult.
Sessions
Derek Bennett is the lead for the Platform Infrastructure team in the Algorithms group at Stitch Fix. He and his team develop and support our Spark capabilities, event logging infrastructure using Amazon Kinesis and Apache Kafka, along with associated tools and applications to help make data available and useable. Derek holds a Ph.D. in Operations Research from UC Berkeley.
Sessions
Tue, 6th Jun Room 2016
14:40 - 15:10 • Scaling Data Science Capabilities with Apache Spark at Stitch Fix
Felix Cheung is a Committer of Apache Spark and a PMC member and Committer of Apache Zeppelin. He has been active in the Big Data space for 3+ years, is a co-organizer of the Seattle Spark Meetup, and has presented several times. In the summer of 2015 he was a teaching assistant for the very popular edX MOOCs Introduction to Big Data with Apache Spark and Scalable Machine Learning.
Sessions
Gwen is a product manager at Confluent. She has 15 years of experience working with code and customers to build scalable data architectures, integrating relational and big data technologies. Gwen is the author of “Kafka: The Definitive Guide” and “Hadoop Application Architectures”, and a frequent presenter at industry conferences. Gwen is a PMC member on the Apache Kafka project and a committer on Apache Sqoop. When Gwen isn’t building data pipelines or thinking up new features, you can find her pedaling on her bike, exploring the roads and trails of California and beyond.
Sessions
Hossein Falaki is a software engineer and data scientist at Databricks, working on the next big thing. Prior to that he was a data scientist at Apple’s personal assistant, Siri. He graduated with a Ph.D. in Computer Science from UCLA, where he was a member of the Center for Embedded Networked Sensing (CENS).
Sessions
University of Michigan: Computer Science, Databases, Machine Learning/Computational Biology, Cryptography
University of California San Francisco: Computational Biology/Bioinformatics, Machine Learning/Multi-Objective Optimization/Statistical Mechanics for Protein-Protein Interactions
McGill University: Machine Learning/Multi-Objective Optimization for Path Planning/Cryptography
Sessions
Min Shen is an engineer on LinkedIn’s Hadoop infrastructure development team, where he builds services and tools to tackle the scaling challenges of operating large-scale, multi-tenant Hadoop deployments. Recently, he has been helping create tools that support operating Spark at scale and make it easy to develop and run Spark jobs at LinkedIn.
Sessions
Jim Dowling is an Associate Professor at the School of Information and Communications Technology in the Department of Software and Computer Systems at KTH Royal Institute of Technology as well as a Senior Researcher at SICS – Swedish ICT. He received his Ph.D. in Distributed Systems from Trinity College Dublin (2005) and worked at MySQL AB (2005-2007). He is a distributed systems researcher and his research interests are in the area of large-scale distributed computer systems. He is lead architect of Hadoop Open Platform-as-a-Service (www.hops.io), a next generation distribution of Hadoop for Humans.
Sessions
Tue, 6th Jun Room 2020
11:40 - 12:10 • Structured-Streaming-as-a-Service with Kafka, Yarn and Tooling
Jonathan Bloom is a mathematician, engineer, and co-founder of the Hail team at the Broad Institute of MIT and Harvard. Prior to joining the Broad, he did research in geometry and algebraic topology as a Moore Instructor and NSF Fellow in Mathematics at the Massachusetts Institute of Technology. While there, he re-architected the department’s introductory course on probability and statistics, now available on MIT OpenCourseWare. He received his B.A. from Harvard University and Ph.D. from Columbia University in Mathematics.
Sessions
Jordan Volz is a Systems Engineer at Cloudera. He helps clients design and implement big data solutions using Cloudera’s Distribution of Hadoop, across a variety of industry verticals. Previously, he worked as a consultant for HP Autonomy delivering compliance archiving, e-Discovery, and electronic surveillance solutions to regulated financial services companies, and as a developer at Epic Systems building HIPAA-compliant EMR software.
Sessions
Joseph Bradley is a Spark Committer working on MLlib at Databricks. Previously, he was a postdoc at UC Berkeley after receiving his Ph.D. in Machine Learning from Carnegie Mellon U. in 2013. His research included probabilistic graphical models, parallel sparse regression, and aggregation mechanisms for peer grading in MOOCs.
Sessions
Kimoon joined Pepperdata in 2013. Previously, he worked for the Google Search and Yahoo Search teams for many years. Kimoon has hands-on experience with large distributed systems processing massive data sets.
Sessions
Leah McGuire is a Lead Member of Technical Staff at Salesforce, building platforms to enable the integration of machine learning into Salesforce products. Before joining Salesforce, Leah was a Senior Data Scientist on the data products team at LinkedIn working on personalization, entity resolution, and relevance for a variety of LinkedIn data products. She completed a PhD and a Postdoctoral Fellowship in Computational Neuroscience at the University of California, San Francisco, and at University of California, Berkeley, where she studied the neural encoding and integration of sensory signals.
Sessions
Matteo Interlandi recently joined Microsoft CISL as a Research Scientist. Prior to joining Microsoft, Matteo was a Postdoctoral Scholar at the University of California, Los Angeles. His research lies at the intersection of databases, distributed systems, and declarative languages. In particular, he loves to build systems and tools that make it easier to design and implement data-driven distributed applications.
Sessions
Michael Malak is the lead author of Spark GraphX In Action and has been developing Spark solutions at two Fortune 200 companies since early 2013. He has been programming computers since before they could be bought pre-assembled in stores.
Sessions
Nan Zhu is a Software Engineer at Microsoft, where he works on serving Spark Streaming/Structured Streaming on Azure HDInsight. He is a contributor to Apache Spark (known as CodingCat) and also serves as a committee member of the Distributed Machine Learning Community (DMLC) and Apache MXNet (incubating).
Sessions
Tue, 6th Jun Room 2003
12:20 - 12:50 • Building a Unified Data Pipeline with Apache Spark and XGBoost
Nikita Shamgunov co-founded MemSQL and has served as CTO since inception. Prior to co-founding the company, Nikita worked on core infrastructure systems at Facebook. He served as a senior database engineer at Microsoft SQL Server for more than half a decade. Nikita holds a bachelor’s, master’s and doctorate in computer science, has been awarded several patents and was a world medalist in ACM programming contests.
Sessions
I’m a member of the research staff at IBM Research Zurich. My research interests are in distributed systems, networking, and operating systems. I graduated with a PhD from ETH Zurich in 2008 and spent two years (2008-2010) as a Postdoc at Microsoft Research Silicon Valley. My current work is about exploiting fast network and storage hardware in data processing systems.
Sessions
Prabhu Kasinathan is the chief data engineer in Big Data Platform at PayPal with 5+ years of big data experience. He is creating APIs, tools, and services for the Spark platform to support multi-tenancy and large-scale, computation-intensive applications. He is an expert in building data warehousing solutions on the Hadoop and Teradata platforms, with 11+ years of data experience.
Sessions
Ross Gardler has been involved with open source in one form or another since the mid-’90s. He is a member of the Apache Software Foundation, where he currently serves as the foundation’s President. He works at Microsoft on the Linux Compute team in Azure, where he is responsible for the Azure Container Service.
Sessions
Wed, 7th Jun Room 2002
17:00 - 17:30 • Creating Personalized Container Solutions with Container Services
Sessions
Tue, 6th Jun Room 2003
16:20 - 16:50 • More Algorithms and Tools for Genomic Analysis on Apache Spark
Sam Penrose loves how working with data at scale for Mozilla brings out the power and beauty of mathematics. Previously he helped Industrial Light and Magic bring the power and beauty of giant robots out to movie screens everywhere.
Sessions
Shay is an experienced software developer, architect, and entrepreneur. He was the founder and VP R&D of Peak-Dynamics, an energy-saving solution for water utilities, and CTO at Utab, a web platform for musicians. Shay loves solving complex problems and writing performant code.
Sessions
Tue, 6th Jun Room 2003
17:40 - 18:10 • Building a Large Scale Recommendation Engine with Spark and Redis-ML
Songtao Guo is a Principal Data Scientist and tech lead of the Data Mining team at LinkedIn, where he leads many data-driven products and analytics systems. His work involves building a large-scale knowledge base, inventing data mining platforms to scale business analytics, and partnering with product, sales, and marketing to deliver impactful solutions. Before joining LinkedIn, Songtao was a senior researcher at AT&T Interactive, focusing on improving data quality and search relevancy for local business search. He holds a PhD in computer science from the University of North Carolina at Charlotte.
Sessions
Ted is working on the Battle.net team at Blizzard, helping support great titles like World of Warcraft, Overwatch, Hearthstone, and many more. Previously, he was a Principal Solutions Architect at Cloudera, helping clients be successful with Hadoop and the Hadoop ecosystem. Before that, he was a Lead Architect at the Financial Industry Regulatory Authority (FINRA). He has also contributed code to Apache Flume, Apache Avro, Apache YARN, Apache HDFS, Apache Spark, Apache Sqoop, and many more. Ted is also a co-author of O’Reilly’s “Hadoop Application Architectures”, a frequent speaker at conferences, and a frequent blogger on data architectures.
Sessions
Tejas is a software engineer at Facebook. For the past 3 years, he has been part of the Data Infrastructure group at Facebook, where he primarily works on building large-scale distributed data processing systems responsible for handling batch workloads. He is currently a PMC member and committer of Apache Nutch and has contributed to several open source projects. Tejas obtained a Master’s Degree in Computer Science from the University of California, Irvine.
Sessions
Tim Poterba is an engineer and computational biologist on the Hail team at the Broad Institute of MIT and Harvard. Prior to joining the Broad, he studied protein folding dynamics at the Max Planck Institute for Biochemistry on a Fulbright Scholarship. He received his B.A. in Biophysics from Amherst College in 2013.
Sessions
Wei Di is currently a staff member on the Business Analytics Data Mining team. She is passionate about creating smart and scalable solutions that can impact millions of individuals and empower successful businesses. She has wide interests covering artificial intelligence, machine learning, and computer vision. She was previously associated with eBay Human Language Technology and eBay Research Labs, with a focus on large-scale image understanding and joint learning from visual and text information. Prior to that, she was with Ancestry.com, working in the areas of record linkage and search relevance. She received her PhD from Purdue University in 2011.
Sessions
Xiangrui Meng is an Apache Spark PMC member and a software engineer at Databricks. His main interests center around developing and implementing scalable algorithms for scientific applications. He has been actively involved in the development and maintenance of Spark MLlib since he joined Databricks. Before Databricks, he worked as an applied research engineer at LinkedIn, where he was the main developer of an offline machine learning framework in Hadoop MapReduce. His Ph.D. work at Stanford is on randomized algorithms for large-scale linear regression problems.
Sessions
Yin Huai is a Software Engineer at Databricks and mainly works on Spark SQL. Before joining Databricks, he was a PhD student at The Ohio State University and was advised by Xiaodong Zhang. His interests include storage systems, database systems, and query optimization. He is also an Apache Hive committer.