Introduction
Apache Spark is a framework for processing large-scale datasets in parallel, making it especially valuable when dealing with "big data". This workshop introduces Spark from scratch, so that participants can gain confidence with it before applying it in their own projects and research.
Spark can be used from several programming languages, including Python, Scala, Java, and R. The same framework works at many scales, from prototyping on a laptop to running on large high-performance computing systems. Spark also includes modules such as MLlib (for machine learning) and Spark SQL (for operations on structured data), so the same interface supports many different tasks in the data science workflow. This event provides an introduction to Spark for researchers and data scientists who are curious and would like to know more.
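To give a flavour of what this interface looks like, here is a minimal PySpark sketch (illustrative only; the application name, column names, and data are made up, and it assumes the pyspark package is installed):

```python
# Minimal PySpark sketch: one SparkSession drives both DataFrame
# operations and SQL queries. Names and data are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("intro-sketch").getOrCreate()

# Structured data with Spark SQL / DataFrames
people = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()

spark.stop()
```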
About the event
This hands-on course will cover the following topics:
- Introduction to Spark
- Map, filter and reduce (a short sketch follows this list)
- Running on a Spark cluster
- Key-value pairs
- Correlations and logistic regression
- Decision trees and k-means
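As a taste of the map/filter/reduce and key-value material, here is a minimal PySpark sketch (illustrative only, not course material; the data and the word-length cut-off are made up):

```python
# Minimal sketch of map/filter/reduce and key-value pairs in PySpark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mfr-sketch").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["to be or not to be", "to spark or not"])
counts = (
    lines.flatMap(lambda line: line.split())  # map each line to its words
         .filter(lambda w: len(w) > 2)        # keep words longer than 2 letters
         .map(lambda w: (w, 1))               # build (key, value) pairs
         .reduceByKey(lambda a, b: a + b)     # reduce: sum counts per key
)
print(counts.collect())  # [('not', 2), ('spark', 1)] (order may vary)

spark.stop()
```

The same code runs unchanged whether Spark is launched locally or on a cluster.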
Fifteen places are reserved for delegates from the Alan Turing Institute. To reserve your place, mention your affiliation with the Turing in the "reasons for participation" section of the registration form.
Attendees will be given access to EPCC's Tier-2 system, Cirrus, for all practical exercises.
The practicals will be carried out in Jupyter notebooks, so a basic knowledge of Python will be very useful.
This workshop is funded by EPCC, PRACE and the Alan Turing Institute.

