The Big R-Book: From Data Science to Learning Machines for the Professional

Philippe J.S. De Brouwer

864 pages, published 9 June 2020

Summary

Introduces professionals and scientists to statistics and machine learning using the programming language R.

Written by and for practitioners, this book provides an overall introduction to R, focusing on tools and methods commonly used in data science, and placing emphasis on practice and business use. It covers a wide range of topics in a single volume, including big data, databases, statistical machine learning, data wrangling, data visualization, and the reporting of results. The topics covered are all important for someone with a science or math background who wants to quickly learn several practical technologies in order to enter or transition to the growing field of data science.

The Big R-Book for Professionals: From Data Science to Learning Machines and Reporting with R includes nine parts, starting with an introduction to the subject and followed by an overview of R and elements of statistics. The third part revolves around data, while the fourth focuses on data wrangling. Part 5 teaches readers about exploring data. Part 6 covers building models, Part 7 introduces the reader to the reality in companies, Part 8 covers reports and interactive applications, and Part 9 introduces the reader to big data and performance computing. The book also includes some helpful appendices.
Provides a practical guide for non-experts with a focus on business users
Contains a unique combination of topics including an introduction to R, machine learning, mathematical models, data wrangling, and reporting
Uses a practical tone and integrates multiple topics in a coherent framework
Demystifies the hype around machine learning and AI by enabling readers to understand the provided models and program them in R
Shows readers how to visualize results in static and interactive reports
Supplementary material includes PDF slides based on the book's content, as well as all the extracted R code, and is available to everyone on a Wiley Book Companion Site

The Big R-Book is an excellent guide for science, technology, engineering, or mathematics students who wish to make a successful transition from the academic world to the professional. It will also appeal to all young data scientists, quantitative analysts, and analytics professionals, as well as those who make mathematical models.

Table of contents

Foreword
About the Author
Acknowledgements
Preface / Why this book?
Contents

I Introduction
  1 The Big Picture with Kondratiev and Kardashev
  2 The Scientific Method and Data
  3 Conventions

II Starting with R and Elements of Statistics
  4 The Basics of R
    4.1 Variables
    4.2 Data Types
      4.2.1 Elementary Data Types
      4.2.2 Vectors
      4.2.3 Lists
      4.2.4 Matrices
      4.2.5 Arrays
      4.2.6 Factors
      4.2.7 Data Frames
    4.3 Operators
      4.3.1 Arithmetic Operators
      4.3.2 Relational Operators
      4.3.3 Logical Operators
      4.3.4 Assignment Operators
      4.3.5 Other Operators
      4.3.6 Loops
      4.3.7 Functions
      4.3.8 Packages
      4.3.9 Strings
    4.4 Selected Data Interfaces
      4.4.1 CSV Files
      4.4.2 Excel Files
      4.4.3 Databases
    4.5 Distributions
      4.5.1 Normal Distribution
      4.5.2 Binomial Distribution
  5 Lexical Scoping and environments
    5.1 Environments in R
    5.2 Lexical Scoping in R
  6 The Implementation of OO
    6.1 Base Types
    6.2 S3 Objects
      6.2.1 Creating S3 objects
      6.2.2 Creating generic methods
      6.2.3 Method dispatch
      6.2.4 Group generic functions
    6.3 S4 Objects
      6.3.1 Creating S4 Objects
      6.3.2 Recognising objects, generic functions, and methods
      6.3.3 Creating S4 Generics
      6.3.4 Method dispatch
    6.4 The reference class, refclass, RC or R5 model
      6.4.1 Creating R5 objects
    6.5 OO Conclusion
  7 Tidy R with the Tidyverse
    7.1 The Philosophy of the Tidyverse
    7.2 Packages in the tidyverse
    7.3 Working with the tidyverse
      7.3.1 tibbles
      7.3.2 Piping with R
      7.3.3 Attention points when using the pipe command
        7.3.3.1 Advanced piping
        7.3.3.2 Conclusion
  8 Elements of Descriptive Statistics
    8.1 Measures of Central Tendency
      8.1.1 Mean
      8.1.2 The Median
      8.1.3 The Mode
    8.2 Measures of Variation or Spread
    8.3 Measures of Covariation
    8.4 Chi Square Tests
  9 Further Reading

III Data Import
  10 A short history of modern database systems
  11 RDBMS
  12 SQL
    12.1 Designing the database
    12.2 Building the database
    12.3 Adding data to the database
    12.4 Querying the database
    12.5 Modifying an existing database
    12.6 Advanced features of SQL
  13 Connecting R to an SQL database

IV Data Wrangling
  14 Anonymising Data
  15 Data Wrangling in the tidyverse
    15.1 Tidy data
    15.2 Importing the data
      15.2.1 Importing from an SQL RDBMS
      15.2.2 Importing flat files in the tidyverse
        15.2.2.1 CSV Files
        15.2.2.2 Making sense of fixed width files
    15.3 Tidying up data with tidyr
      15.3.1 Splitting tables
      15.3.2 headers to data
      15.3.3 Spreading one column over many
      15.3.4 separate
      15.3.5 Unite
      15.3.6 Wrong Data
    15.4 Playing with tipples: SQL-like functionality
      15.4.1 Selecting
      15.4.2 Filtering
      15.4.3 Joining
      15.4.4 Mutating
      15.4.5 Set Operations
    15.5 String Manipulation in the tidyverse
      15.5.1 Basic string manipulation
      15.5.2 Pattern matching with regular expressions
        15.5.2.1 Regular Expressions
        15.5.2.2 Functions using Regex
    15.6 Dates with lubridate
        15.6.0.1 ISO 8601 Format
        15.6.0.2 Timezones
        15.6.0.3 Extract and set date and time components
        15.6.0.4 Calculating with date-times
    15.7 Factors with forcats
  16 Dealing with missing data
  17 Data Binning
    17.1 Tuning the binning procedure
    17.2 More complex cases: matrix binning
    17.3 Weight of evidence and information value
  18 Factoring analysis and principle components
    18.1 Principle components analysis
    18.2 Factor Analysis

V Explore Data
  19 Using Descriptive Statistics
  20 Standard Charts & Graphs
    20.1 Pie Charts
    20.2 Bar Charts
    20.3 Boxplots
    20.4 Violin plots
    20.5 Histograms
    20.6 Scatterplots
    20.7 Line Graphs
    20.8 Plotting Functions
    20.9 Maps and contour plots
  21 Elected Visualization Methods
    21.1 Heat-maps
    21.2 Text Mining
      21.2.1 Word Clouds
      21.2.2 Word Associations
    21.3 Colours in R
  22 Time Series Analysis
    22.1 Time Series in R
    22.2 Forecasting
      22.2.1 Moving Average
      22.2.2 Seasonal Decomposition

VI Modelling
  23 Regression Models
    23.1 Linear Regression
    23.2 Multiple Linear Regression
      23.2.1 Poisson Regression
      23.2.2 Non-Linear Regression
    23.3 Performance of regression models
      23.3.1 Mean Square Error (MSE)
      23.3.2 R-Squared
      23.3.3 Mean Average Deviation (MAD)
  24 Classification Models
    24.1 Logistic Regression
    24.2 The performance of binary classification models
      24.2.1 The Confusion Matrix and related measures
      24.2.2 ROC
      24.2.3 AUC
      24.2.4 AUC Gini for logistic regression
      24.2.5 Kolmogorov-Smirnov (KS) for logistic regression
      24.2.6 Finding an Optimal Cut-off
  25 Learning Machines
    25.1 Decision Tree
      25.1.1 Essential Background
      25.1.2 Important considerations
      25.1.3 Growing trees with R
      25.1.4 Evaluating the performance of a decision tree
        25.1.4.1 The performance of the regression tree
        25.1.4.2 The performance of the classification tree
    25.2 Random Forest
    25.3 Artificial Neural Networks (ANN)
      25.3.1 The basics of ANNs in R
      25.3.2 An example of a work-flow to develop an ANN
    25.4 Support Vector Machine
    25.5 Unsupervised learning and clustering
      25.5.1 k-means clustering
      25.5.2 Fuzzy clustering
      25.5.3 Hierarchical clustering
      25.5.4 Other clustering methods
  26 Towards a tidy modelling cycle with modelr
  27 Model Validation
    27.1 Model quality measures
    27.2 Predictions and residuals
    27.3 Bootstrapping
    27.4 Cross-Validation
      27.4.1 training and validating
    27.5 Monte-Carlo Cross Validation
    27.6 k-Fold Cross Validation
    27.7 Comparison
    27.8 Validation in a broader perspective
  28 Labs
    28.1 Financial Analysis with QuantMod
      28.1.1 The quantmod data structure
      28.1.2 Support functions supplied by quantmod
      28.1.3 Financial modelling in quantmod
  29 Multi Criteria Decision Analysis (MCDA)
    29.1 What and Why
    29.2 General Work-flow
    29.3 Identify the issue at hand: step 1 and 2
    29.4 STEP 3: the decision matrix
      29.4.1 Construct a decision matrix
      29.4.2 Normalize the decision matrix
    29.5 STEP 4: leave out inefficient and unacceptable alternatives
      29.5.1 Unacceptable Alternatives
      29.5.2 Dominance- inefficient alternatives
    29.6 Printing preference relationships
    29.7 STEP 6: MCDA Methods
      29.7.1 Examples of non-compensatory methods
      29.7.2 The weighted sum method (WSM)
      29.7.3 WPM
      29.7.4 ELECTRE
        29.7.4.1 ELECTRE I
        29.7.4.2 ELECTRE II
      29.7.5 PROMethEE
        29.7.5.1 PROMethEE I
        29.7.5.2 PROMethEE II
      29.7.6 PCA (Gaia)
      29.7.7 Outranking methods
      29.7.8 Goal Programming
    29.8 Summary MCDA

VII Introduction to Companies
  30 Financial Accounting
    30.1 The Statements of Accounts
      30.1.1 Income Statement
      30.1.2 Net Income: The P&L statement
      30.1.3 Balance Sheet
    30.2 The Value Chain
    30.3 Further Terminology
    30.4 Selected Financial Ratios
  31 Management Accounting
    31.1 Introduction
    31.2 Selected Methods in MA
      31.2.1 Cost Accounting
      31.2.2 Selected Cost Types
    31.3 Selected Use Cases of MA
      31.3.1 Balanced Scorecard
      31.3.2 Key Performance Indicators
        31.3.2.1 Selection of KPIs
  32 Asset Valuation Basics
    32.1 Time Value of Money
    32.2 Cash
    32.3 Bonds
      32.3.1 Valuation of Bonds
      32.3.2 Duration
        32.3.2.1 Macaulay Duration
        32.3.2.2 Modified Duration
    32.4 Equities
      32.4.1 Valuation of Equities
        32.4.1.1 CAPM
      32.4.2 Absolute Value Models
        32.4.2.1 Dividend Discount Model
        32.4.2.2 Free Cash Flow (FCF)
        32.4.2.3 Discounted Cash Flow Model
        32.4.2.4 Discounted Abnormal Operating Earnings valuation model
        32.4.2.5 Net Asset Value Method or Cost Method
        32.4.2.6 Excess Earnings Method
      32.4.3 Relative Value Models
        32.4.3.1 The Idea behind Relative Value Models
        32.4.3.2 Some Ratios that can be used in relative value models
        32.4.3.3 Measures Related to Company Value for External Stakeholders
        32.4.3.4 Relative Value Models in Practice
        32.4.3.5 Conclusions and Use
      32.4.4 Selection of Valuation Methods
      32.4.5 Pitfalls and Matters Requiring Attention for all Methods
        32.4.5.1 Results and Sensitivity
    32.5 Forwards and Futures
    32.6 Options
      32.6.1 Definitions
      32.6.2 Commercial Aspects
      32.6.3 Historic observations
      32.6.4 Valuation of Options at Maturity
      32.6.5 The Put-Call Parity
      32.6.6 The Black & Scholes Model
        32.6.6.1 Apply the Black and Scholes formula
      32.6.7 Dependencies
      32.6.8 Sensitivities: "the Greeks"
      32.6.9 Delta Hedging
      32.6.10 Linear Option Strategies
        32.6.10.1 The Limits of the Black and Scholes Model
      32.6.11 The Binomial Model
        32.6.11.1 Risk Neutral Method
        32.6.11.2 The Equivalent Portfolio Binomial Model
        32.6.11.3 Summary Binomial Model
      32.6.12 Exotic Options
      32.6.13 Integrated Option Strategies
      32.6.14 Capital Protected Structures

VIII Report
  33 ggplot2
  34 R-markdown
  35 knitr and LaTeX
  36 An automated development cycle
  37 Writing and communication skills
  38 Interactive apps
    38.1 Shiny
    38.2 Browser born data visualization
      38.2.1 HTML-widgets
      38.2.2 ggvis
      38.2.3 googleVis
    38.3 Dashboards
      38.3.1 The business case: a diversity dashboard
      38.3.2 A dashboard with flexdashboard
        38.3.2.1 Interactive dashboards with flexdashboard
      38.3.3 A dashboard with shinydashboard

IX Appendices
  39 Other Resources
  40 Levels of Measurement
    40.1 Nominal Scale
    40.2 Ordinal Scale
    40.3 Interval Scale
    40.4 Ratio Scale
  41 Trademark Notices
  42 Code snippets not shown in the body of the book
  43 Answers to questions

Bibliography
Index
Nomenclature

Philippe J.S. De Brouwer, PhD, is director at HSBC, guest professor at four universities (University of Warsaw, Jagiellonian University, Krakow School of Business and AGH University of Science and Technology) and honorary consul for Belgium in Krakow. As a professor, he builds bridges not only between universities and the industry, but also across disciplines. He teaches mathematicians leadership skills and non-mathematicians coding. As a scientist, he tries to combine research on financial markets, psychology, and investments to the benefit of the investor. As an honorary consul he is passionate about serving the community and helping initiatives grow.

Technical details

	PAPER
Publisher(s)	Wiley
Author(s)	Philippe J.S. De Brouwer
Publication date	9 June 2020
Number of pages	864
EAN13	9781119632726
