With recent advances in machine learning algorithms and statistical techniques, and the increasing ease of implementing them in Python, it is tempting to skip exploratory data analysis (EDA), the crucial step before diving into machine learning or statistical modeling. Applying machine learning algorithms without first getting oriented to the dataset can lead to wasted time and spurious conclusions. EDA allows practitioners to gain intuition for the patterns in the data, identify anomalies, narrow down a set of alternative modeling approaches, devise strategies to handle missing data, and ensure correct interpretation of the results. Further, EDA can rapidly generate insights and answer many questions without requiring complex modeling.

Python is a fantastic language not only for machine learning, but also for EDA. In this tutorial, we will walk through two hands-on examples of how to perform EDA using Python and discuss various EDA techniques for cross-sectional data, time-series data, and panel data. One example will demonstrate how to use EDA to answer questions, test business assumptions, and generate hypotheses for further analysis. The other example will focus on performing EDA to prepare for modeling. Between these two examples, we will cover:

- Data profiling and quality assessment
- Basic description of the data
- Visualizing the data, including interactive visualizations
- Identifying patterns in the data (including patterns of correlated missing data)
- Dealing with many attributes (columns)
- Dealing with large datasets using sampling techniques
- Informing the engineering of features for future modeling
- Identifying challenges of using the data (e.g. skewness, outliers)
- Developing an intuition for interpreting the results of future modeling

The intended audience for this tutorial is aspiring and practicing data scientists and analysts, or anyone who wants to be able to get insights out of data.
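A few of the topics listed above (profiling missing values, spotting correlated missing data, checking skewness, and sampling a large dataset) can be illustrated with a minimal sketch. This is not the tutorial's own material: the toy records, column names, and helper functions below are hypothetical, and only the Python standard library is used so the example stays self-contained. In practice, a dedicated data analysis library would likely be used instead.

```python
import random
import statistics

# Hypothetical toy dataset; None marks a missing value.
records = [
    {"price": 10.0, "qty": 1, "region": "west"},
    {"price": 12.0, "qty": None, "region": None},
    {"price": 11.0, "qty": 2, "region": "east"},
    {"price": None, "qty": 3, "region": "east"},
    {"price": 13.0, "qty": None, "region": None},
    {"price": 95.0, "qty": 2, "region": "west"},
]


def missing_counts(rows):
    """Count missing (None) values per column."""
    return {col: sum(1 for r in rows if r[col] is None) for col in rows[0]}


def co_missing(rows, col_a, col_b):
    """Count rows where both columns are missing at the same time."""
    return sum(1 for r in rows if r[col_a] is None and r[col_b] is None)


def skewness(values):
    """Population skewness; a positive value suggests a long right tail."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


# Data profiling: how many values are missing in each column?
print(missing_counts(records))  # {'price': 1, 'qty': 2, 'region': 2}

# Correlated missingness: qty and region are always missing together here.
print(co_missing(records, "qty", "region"))  # 2

# Skewness: the single large price pulls the distribution to the right.
prices = [r["price"] for r in records if r["price"] is not None]
print(skewness(prices) > 0)  # True

# Sampling: explore a reproducible random subset of a (notionally large) dataset.
rng = random.Random(0)
subset = rng.sample(records, k=3)
print(len(subset))  # 3
```

Here the two columns that are missing together hint at a shared cause, which is the kind of pattern worth investigating before choosing a strategy for handling missing data.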
Students must have at least an intermediate-level knowledge of Python; some familiarity with analyzing data would be beneficial. Installation of Jupyter Notebook will be required (we may also demonstrate analysis in JupyterLab, if its development in the next few months allows). Instructions on which packages to install will be sent beforehand.
Chloe Mawer
Chloe Mawer is a Senior Data Scientist at Silicon Valley Data Science, a small consulting company located in Mountain View, California, that focuses on transforming businesses through data strategy, science, and engineering. At SVDS, Chloe has worked on problems for pharmaceutical and retail companies, work that relies heavily on Python for data analysis and modeling. Prior to SVDS, she obtained her PhD in Environmental Engineering at Stanford, where she focused on developing methods for monitoring water’s movement in the subsurface using electrical measurements.
Jonathan Whitmore
Jonathan Whitmore, PhD, is a Senior Data Scientist at Silicon Valley Data Science. He is the author of the O'Reilly screencast: Jupyter Notebook for Data Science Teams. Before moving into the tech industry, Dr. Whitmore worked as an astrophysicist in Melbourne, Australia, researching whether the fundamental physical constants have changed over the age of the universe. Dr. Whitmore received his PhD in physics from the University of California, San Diego.
Room 7
Wednesday, 17th May, 09:00 - 12:20