PACE Big Data Workshop

About this workshop:

This workshop is sponsored by the NSF's XSEDE (Extreme Science and Engineering Discovery Environment, https://www.xsede.org/) program. Staff members from the Texas Advanced Computing Center (TACC, https://www.tacc.utexas.edu/) will teach the workshop. The workshop is organized as four separate sessions covering various topics in big data analysis. Although participants are strongly encouraged to attend all sessions, the workshop is designed so that participants may attend only selected sessions based on their background, schedule, and needs.

 

About Instructors:

Ruizhu Huang is a research associate in the data intensive computing group at TACC. He has years of experience in big data analytics, machine learning, and data visualization. He has been involved in various projects developing technologies that bridge the gap between traditional machine learning approaches and next-generation, data-intensive computing methods involving High-Performance Computing (HPC) resources.

Amit Gupta is a Research Engineering/Scientist Associate III in the Data Mining and Statistics group at TACC. His research interests are in distributed systems and tools that enable scaling of big data applications on HPC infrastructure, parallel programming, and information retrieval systems for text. He has extensive experience with applications ranging from scaling transportation simulations to text mining of biological literature. He earned an MS in Computer Science from the University of Colorado at Boulder, with thesis research in the area of operating systems.

Dr. Weijia Xu is a research scientist and manager of the Data Mining and Statistics group at TACC. He received his Ph.D. in Computer Science from The University of Texas at Austin. Dr. Xu has over 50 peer-reviewed conference and journal publications in similarity-based data retrieval, data analysis, and information visualization with data from various scientific domains. He has served on program committees for several workshops and conferences in the areas of big data and high-performance computing.

Part One: Introduction to Hadoop and Spark [register here]

Time: Sept 28 08:30am-12:30pm
Location: Marcus Nano Rm 1116
Capacity: 30 people
 
This session introduces Hadoop and Spark clusters to beginners. Topics include:
  • basic concepts used in the MapReduce programming model
  • major components of a Hadoop cluster
  • how to get started with Hadoop on your own computer and with computing resources at TACC
  • an introduction to Spark programming models and how Spark can work with a Hadoop cluster
  • different ways to use Hadoop and Spark for analysis
 
Participants do not need any particular programming background, but working knowledge of the Linux operating system is preferred. The class includes 3 hours of lecture and 1 hour of hands-on practice.
 
A no-show fee of $25.00 applies if you miss the session without cancelling 5 days before the class.
 
Part Two: Developing a scalable application with Spark [register here]
 
Time: Sept 28 1:30pm-5:30pm
Location: Marcus Nano Rm 1116
Capacity: 30 people
 
This session focuses on how to develop a scalable application with the Spark programming model. Topics include:
 
  • a review of the Spark programming model
  • basic introduction to the Scala programming language
  • how to run a Spark application
  • key features for building a scalable application
  • how to get started with Spark development after the class
 
Participants are expected to have prior knowledge of Hadoop and Spark cluster concepts. Knowledge of any programming language is preferred but not required. The class includes 3 hours of lecture and 1 hour of hands-on practice.
 
A no-show fee of $25.00 applies if you miss the session without cancelling 5 days before the class.
 
Part Three: Common Practices on Hadoop and Spark Ecosystem [register here]
 
Time: Sept 29 08:30am-12:30pm
Location: Marcus Nano Rm 1116
Capacity: 30 people
 
This session focuses on general practices for practical analysis problems. Topics include:
  • running batch jobs with different cluster deployment modes
  • running interactive jobs
  • exploring existing libraries and applications, including Hadoop Streaming, MLlib, Spark SQL, and GraphX
  • using Hadoop/Spark with R and Python
 
Participants should have basic coding experience, be comfortable writing code, and be familiar with the Hadoop system and the concepts of parallelism. The class includes 3 hours of lecture and 1 hour of hands-on practice.
 
A no-show fee of $25.00 applies if you miss the session without cancelling 5 days before the class.
 
Part Four: Advanced Topic on Big Data Analysis [register here]
 
Time: Sept 29 01:30pm-03:30pm
Location: Marcus Nano Rm 1116
Capacity: 30 people

 

This session will cover algorithms in more detail and provide hands-on consultation for GT researchers' applications. We will collect use cases before the session and walk through selected ones in detail to demonstrate how to solve real-world problems more efficiently.

Event Details

Date/Time:

  • Thursday, September 28, 2017 - Friday, September 29, 2017
Location: Atlanta, GA
Fee(s): None

For More Information Contact

Fang (Cherry) Liu (Ph.D.)

fang.liu at gatech.edu