
PySpark SQL Recipes: With HiveQL, Dataframe and Graphframes
Raju Kumar / Raman Mishra
Summary
PySpark SQL Recipes starts with recipes on creating dataframes from different types of data source, data aggregation and summarization, and exploratory data analysis using PySpark SQL. You'll also discover how to solve problems in graph analysis using graphframes.
On completing this book, you'll have ready-made code for all your PySpark SQL tasks, including creating dataframes using data from different file formats as well as from SQL or NoSQL databases.
What You Will Learn
- Understand PySpark SQL and its advanced features
- Use SQL and HiveQL with PySpark SQL
- Work with structured streaming
- Optimize PySpark SQL
- Master graphframes and graph processing
Who This Book Is For
Data scientists, Python programmers, and SQL programmers.
Chapter Goal: The reader will gain an understanding of PySpark, PySparkSQL, the Catalyst Optimizer, Project Tungsten, and Hive.
No of pages: 20-30
Sub-Topics:
1. PySpark
2. PySparkSQL
3. Hive
4. Catalyst
5. Project Tungsten
Chapter 2: Some time with Installation
Chapter Goal: The reader will learn how to install Spark, Hive, PostgreSQL, MySQL, MongoDB, Cassandra, etc.
No of pages: 30-40
Sub-Topics:
1. Installing Spark
2. Installing Hive
3. Installing MySQL
4. Installing MongoDB
Chapter 3: IO in PySparkSQL
Chapter Goal: This chapter provides recipes that enable the reader to create PySparkSQL DataFrames from different sources.
No of pages: 40-50
Sub-Topics:
1. Creating a DataFrame from data
2. Reading a CSV file to create a DataFrame
3. Reading a JSON file to create a DataFrame
4. Saving DataFrames to different formats
Chapter 4: Operations on PySparkSQL DataFrames
Chapter Goal: The reader will learn about data filtering, data manipulation, descriptive data analysis, dealing with missing values, etc.
No of pages: 40-50
1. Data filtering
2. Data manipulation
3. Row and column manipulation
Chapter 5: Data Merging and Data Aggregation using PySparkSQL
Chapter Goal: The reader will learn about data merging and aggregation using PySparkSQL.
1. Data merging
2. Data aggregation
Chapter 6: SQL, NoSQL and PySparkSQL
Chapter Goal: The reader will learn to run SQL and HiveQL queries on DataFrames.
No of pages: 30-40
Sub-Topics:
1. Running SQL on DataFrame
2. Running HiveQL
Chapter 7: Structured Streaming
Chapter Goal: The reader will gain an understanding of structured streaming.
No of pages: 30-40
1. Different types of modes
2. Data aggregation in structured streaming
3. Different types of sources
Chapter 8: Optimizing PySparkSQL
Chapter Goal: The reader will learn about optimizing PySparkSQL.
No of pages: 20-30
1. Optimizing PySparkSQL
Chapter 9: GraphFrames
Chapter Goal: The reader will learn about graph data analysis with GraphFrames.
No of pages: 30-40
1. GraphFrame creation
2. PageRank
3. Breadth-first search
Sundar Rajan Raman is an artificial intelligence practitioner currently working at Bank of America. He holds a Bachelor of Technology degree from the National Institute of Technology, India. A seasoned Java and J2EE programmer, he has worked on critical applications for companies such as AT&T, Singtel, and Deutsche Bank. He is also a seasoned big data architect. His current focus is on the artificial intelligence space, including machine learning and deep learning.
Technical details
PAPER | |
Publisher(s) | Apress |
Author(s) | Raju Kumar / Raman Mishra |
Publication date | 18/03/2019 |
No. of pages | 323 |
EAN13 | 9781484243343 |