Machine Learning

A series of articles dedicated to machine learning and statistics. All the code and exercises for this section are hosted on GitHub in a dedicated repository.

DataCast Interview : I recently gave an interview to DataCast, an excellent Data Science podcast. With James Le, we talked about Actuarial Science, being a young graduate, the Data industry, the Paris tech ecosystem…

Key Resources : Some important resources to understand the basics of statistics.

A Full Guide to Linear Regression (Part 1) : We’ll explore the simple framework of OLS and multi-dimensional regression.

A Full Guide to Linear Regression (Part 2) : Random design matrix, Normal regression, Pseudo Least Squares and other extensions…

Basics of Statistical Hypothesis testing : How do we assess if a parameter is significant or not ? We’ll cover statistical tests, hypotheses and joint tests.

The Logistic Regression : The Logistic Regression offers a way to perform binary classification using an underlying linear model. We’ll cover the basics of LR, the parameters to use and examples in Python.

Introduction to Time Series : A first approach to exploring a time series in Python with open data.

Key Concepts in Time Series : Stationarity, ergodicity… We’ll cover the key concepts of time series.

Basics of Time Series Forecasting : How do we make a series stationary ? How do we forecast ?

Time Series Forecasting with Facebook Prophet : Explore time series forecasting using the Prophet open-source package.

Handle missing values in Time Series : A quick illustration of backward filling and forward filling.
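
As a rough illustration of the two filling strategies named above, here is a minimal pure-Python sketch. The helper names `forward_fill` and `backward_fill` and the toy series are assumptions made for this example, not code from the article; in practice pandas offers equivalent `ffill` and `bfill` methods.

```python
def forward_fill(values):
    """Replace each None with the most recent non-missing value before it."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        # Leading gaps stay None: there is nothing earlier to carry forward.
        filled.append(last)
    return filled


def backward_fill(values):
    """Replace each None with the next non-missing value after it."""
    # Backward filling is forward filling applied to the reversed series.
    return forward_fill(values[::-1])[::-1]


# Hypothetical series with gaps at the start, in the middle and at the end.
series = [None, 1.0, None, None, 4.0, None]
ffilled = forward_fill(series)
bfilled = backward_fill(series)
print(ffilled)  # [None, 1.0, 1.0, 1.0, 4.0, 4.0]
print(bfilled)  # [1.0, 1.0, 4.0, 4.0, 4.0, None]
```

Note that forward filling cannot fix a gap at the start of the series and backward filling cannot fix one at the end, so the two are sometimes chained.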

Key Concepts of Data Visualization : What is data viz ? Why use it ? What are the most famous tools ?

Automated Graphs with Visual Recommendation Systems : Visualization is a great way to enrich data exploration and analysis. Visual Recommendation systems aim to automatically suggest visualizations of a dataset.

Tableau-like in Python with Altair : Altair is a great Python library to create dashboards and interactive graphs like in Tableau.

Introduction to D3.js : D3 is a powerful JavaScript library that allows you to create graphs for web apps. In this introduction, we’ll cover the main concepts of D3.

A D3 tool to visualize French population : This tool demonstrates basic features of D3.js, including a nice color map, tooltips and zoom.

An Altair Web Tool : A tool to visualize a t-SNE embedding of road characteristics in France and their related accident profiles. This could be used to improve road rehabilitation works and reduce the number of accidents occurring yearly in France.

Basics and Motivation : A first approach to machine learning. We’ll go over the main motivations, the main kinds of algorithms, what they can be used for…

Bayes Classifier : At the core of many algorithms, the Bayes Classifier is considered one of the first algorithms to master.

Linear Discriminant Analysis (LDA) and QDA : In this article, we’ll cover the intuition behind LDA, when it should be used, and the maths behind it. We’ll also quickly cover the Quadratic version of LDA.

Adaptive Boosting (AdaBoost) : A clear introduction to boosting algorithms and adaptive boosting, with illustrations. When should we use boosting ? What are the foundations of the algorithm ?

Gradient Boosting (Regression) : In this article, we’ll cover the basics of gradient boosting regression, and implement a high level version in Python.

Gradient Boosting (Classification) : In this article, we’ll cover the basics of gradient boosting classification as an extension of the Regression.

Large Scale Kernel Methods : Kernel methods offer a great way to solve complex problems. However, they become computationally expensive at scale. Large Scale Kernel methods address this problem.

Unsupervised Learning Cheat Sheet : A cheat sheet that recaps the main unsupervised learning algorithms. It includes an illustration, and the minimization problem for each of them.

Anomaly Detection : An overview of both supervised and unsupervised anomaly detection algorithms such as Isolation Forest.

EM for Gaussian Mixture Models and Hidden Markov Models : 140 detailed and visual slides on GMMs, HMMs and EM. The slides introduce GMMs as a clustering technique, compare them with K-Means, detail how to train GMMs with EM, and give an overview of HMM training.

Markov Chains and Applications in Python : Markov Chains are the basic building block for Hidden Markov Models, widely used in image processing or in NLP.
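
To make the idea of a Markov chain concrete, here is a minimal sketch using only the standard library. The two-state weather chain, its probabilities, and the function names are hypothetical choices for this illustration and are not taken from the article.

```python
import random

# Hypothetical two-state chain; each row of probabilities sums to 1.
TRANSITIONS = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}


def step_distribution(dist):
    """Propagate a probability distribution over states through one transition."""
    new = {state: 0.0 for state in TRANSITIONS}
    for state, p in dist.items():
        for nxt, q in TRANSITIONS[state].items():
            new[nxt] += p * q
    return new


def simulate(start, n_steps, rng):
    """Sample a path: the next state depends only on the current state."""
    path = [start]
    for _ in range(n_steps):
        probs = TRANSITIONS[path[-1]]
        nxt = rng.choices(list(probs), weights=list(probs.values()))[0]
        path.append(nxt)
    return path


# Start from a sunny day and iterate the distribution until it settles.
dist = {"sunny": 1.0, "rainy": 0.0}
for _ in range(50):
    dist = step_distribution(dist)

path = simulate("sunny", 10, random.Random(0))
```

For this particular chain the distribution converges to 2/3 sunny and 1/3 rainy, whatever the starting state. A Hidden Markov Model adds an observation emitted from each hidden state on top of such a chain.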

Hidden Markov Models : In this article, we’ll go through the theory in a visual way and explore HMMs for a simple NLP task.

Build a language recognition app from scratch : HMMs and the Viterbi decoding algorithm can be used to recognize the language of a text. Let’s implement this from scratch !

Introduction to Graphs : What is a graph ? Where are graphs being used ? What are the components of a graph ?

Graph Analysis, Erdos-Rényi, Barabasi-Albert : In this article, we cover the two main types of graphs, and describe a first approach to graph analysis.

Graph Algorithms : We’ll now explore the main graph algorithms and several use cases in a visual way with direct examples in Python.

Graph Learning : How can we handle missing links or missing nodes in graphs ?

Graph Embedding : A practical introduction to Graph Embedding with Node2Vec and Graph2Vec.

“Disrupting Resilient criminal networks through data analysis” paper summary: A summary and data exploration of an interesting paper on criminal networks in the Sicilian Mafia.

“Structural Analysis of Criminal Network and Predicting Hidden Links using Machine Learning” paper summary: Summary and discussion of a paper tackling hidden link prediction as a supervised learning problem.

“Social network analysis as a tool for criminal intelligence: understanding its potential from the perspective of intelligence analysts” paper summary: A qualitative review of how Law Enforcement Agencies use Criminal Network Analysis tools, and my personal view on the topic.

A supervised learning approach to predicting nodes betweenness-centrality in time-varying networks: Can we predict which nodes will be central in the future? An exploratory approach applied to the Enron dataset, with encouraging results.

GridSearch vs. RandomizedSearch : When it comes to parameter selection, you usually encounter two main solutions : GridSearch and RandomizedSearch. What is the main difference between these two techniques ? What are the pros and cons of each technique ?
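
The difference between the two strategies can be sketched in a few lines of plain Python. Everything below is an assumption made for illustration: `validation_loss` is a toy stand-in for a cross-validated score, and the parameter names `lr` and `depth` are hypothetical. In scikit-learn, the corresponding tools are `GridSearchCV` and `RandomizedSearchCV`.

```python
import itertools
import random


def validation_loss(lr, depth):
    """Hypothetical stand-in for a cross-validated loss, lowest at lr=0.1, depth=5."""
    return (lr - 0.1) ** 2 + 0.01 * (depth - 5) ** 2


def grid_search(lrs, depths):
    """Evaluate every combination of the listed values and keep the best one."""
    return min(itertools.product(lrs, depths), key=lambda p: validation_loss(*p))


def random_search(n_iter, rng):
    """Evaluate n_iter configurations sampled from ranges and keep the best one."""
    candidates = [
        (rng.uniform(0.001, 1.0), rng.randint(1, 10)) for _ in range(n_iter)
    ]
    return min(candidates, key=lambda p: validation_loss(*p))


# Grid search: 4 x 3 = 12 evaluations, restricted to the listed values.
best_grid = grid_search([0.001, 0.01, 0.1, 1.0], [1, 5, 10])

# Random search: the same budget of 12 evaluations, spread over the full ranges.
best_random = random_search(12, random.Random(0))
```

The cost of the grid grows multiplicatively with each added parameter, while random search keeps a fixed budget that the user chooses. The trade-off is that random search gives no guarantee of visiting any specific value.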

Bayesian Hyperparameter Optimisation (HyperOpt) : Bayesian Hyperparameter Optimization is a great alternative to GridSearch and RandomizedSearch. How does it work ? How do you implement it in Python ?

AutoML with h2o : The interest in AutoML is rising over time. AutoML algorithms are reaching really good rankings in data science competitions. But what is AutoML ? How does it work ? When to use it ? And how can you implement an AutoML pipeline in Python ?

Machine Learning Explainability : In this series, I will summarize the course “Machine Learning Explainability” from Kaggle Learn. The full course is available here. We’ll cover permutation importance, partial dependence plots and SHAP Values.

Statistics in Matlab : Matlab remains a widely used language for statistics. In this article, we cover the main statistical topics in Matlab.

Introduction to Online learning : Online learning is a way to process data by streaming rather than by batch. This allows much faster and simpler implementations.

Linear Classification : The perceptron was initially developed as an Online algorithm rather than a batch algorithm. Let’s dive into this first simple algorithm.
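
As a minimal sketch of what "online" means for the perceptron, the code below processes one example at a time and updates the weights only when it makes a mistake. The function names, the learning rate default and the toy data stream are assumptions for this illustration, not the article's implementation.

```python
def perceptron_update(w, b, x, y, lr=1.0):
    """Process a single example (x, y) with y in {-1, +1}; update only on a mistake."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    if y * score <= 0:  # misclassified, or exactly on the decision boundary
        w = [wi + lr * y * xi for wi, xi in zip(w, x)]
        b = b + lr * y
    return w, b


def predict(w, b, x):
    """Return the predicted label, +1 or -1, for a feature vector x."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


# Hypothetical, linearly separable stream of (features, label) pairs.
stream = [
    ([2.0, 1.0], 1),
    ([-1.0, -2.0], -1),
    ([1.5, 2.0], 1),
    ([-2.0, -0.5], -1),
]

w, b = [0.0, 0.0], 0.0
for _ in range(5):  # a few passes over the stream
    for x, y in stream:
        w, b = perceptron_update(w, b, x, y)
```

Each update needs only the current example, so the full dataset never has to be held in memory, which is the property that distinguishes the online setting from batch training.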