The easy availability of online data, together with major advances in theory and algorithms, has greatly expanded the computational power we can bring to real-world problems. A wide range of machine learning techniques is now available to make our job easier. K-Nearest Neighbor (KNN), Decision Trees, Gradient Boosting Machines (GBM), Random Forests, and Support Vector Machines (SVM) are some of the popular techniques that have emerged in the recent past. It often happens that multiple methods (machine learning as well as traditional) are applicable to the problem at hand, and the researcher or analyst may struggle to find a good reason for choosing one method over another. This paper aims to help you make an intelligent decision about where KNN is most suitable for your particular problem and how you can apply it.
KNN is a non-parametric method: it makes no assumptions about the underlying data distribution. KNN models can be quickly refreshed or redeveloped using the most recent data, incorporating the latest trends. As a result, KNN has become quite popular in recent times compared to conventional methods (e.g., linear/logistic regression). KNN is also an example of a hybrid approach in recommender systems, where it combines user-based and item-based methods to make predictions.
The basic methodology of KNN is to find the k most similar labeled points (closest neighbors) among the available sample points within a cell of volume V, and to assign the most frequent class among those neighbors to the query (unlabeled) point x.
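The majority-vote idea above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation; the function and variable names are our own.

```python
import math
from collections import Counter

def euclidean(a, b):
    # Straight-line distance between two feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train_X, train_y, query, k=3):
    """Assign the query point the most frequent class among its k nearest labeled neighbors."""
    # Rank every labeled training point by its distance to the query
    ranked = sorted(zip(train_X, train_y), key=lambda p: euclidean(p[0], query))
    # Majority vote among the k closest labels
    top_k_labels = [label for _, label in ranked[:k]]
    return Counter(top_k_labels).most_common(1)[0][0]
```

For example, with two well-separated groups labeled 'A' and 'B', a query point near the 'A' cluster is voted into class 'A' by its three nearest neighbors.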
KNN modeling works around four parameters:
- Features: The variables on which similarity between two points is calculated
- Distance function: The metric used to compute similarity between points
- Neighborhood (k): The number of neighbors to search for
- Scoring function: The function that combines the labels of the neighbors into a single score for the query point
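To make the last two parameters concrete, here is one possible pairing of a distance function (Manhattan distance) and a scoring function (inverse-distance weighted voting, so that closer neighbors count more). Both choices are illustrative assumptions, not the only options.

```python
from collections import defaultdict

def manhattan(a, b):
    # City-block distance: sum of absolute coordinate differences
    return sum(abs(x - y) for x, y in zip(a, b))

def weighted_vote(neighbors):
    """Scoring function: neighbors is a list of (distance, label) pairs.
    Each neighbor's vote is weighted by the inverse of its distance,
    so a single very close neighbor can outweigh several distant ones."""
    votes = defaultdict(float)
    for dist, label in neighbors:
        votes[label] += 1.0 / (dist + 1e-9)  # epsilon guards against zero distance
    return max(votes, key=votes.get)
```

With this scoring function, one neighbor at distance 0.5 (weight 2.0) outvotes two neighbors at distances 2.0 and 3.0 (combined weight about 0.83).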
An important element of deploying the KNN algorithm is how a point from the test dataset is mapped onto the training dataset. There are several ways to do this; prevalent approaches are exhaustive search, exhaustive search over large datasets, and Voronoi partitioning.
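Exhaustive (brute-force) search is the simplest of these: compute the distance from the query to every training point and keep the k smallest. A minimal sketch (the function name is ours):

```python
import heapq
import math

def exhaustive_knn(train_X, query, k):
    """Brute-force neighbor search: O(n) distance computations per query.
    Returns the indices of the k training points nearest to the query."""
    dists = [(math.dist(pt, query), i) for i, pt in enumerate(train_X)]
    # heapq.nsmallest avoids fully sorting all n distances
    return [i for _, i in heapq.nsmallest(k, dists)]
```

Exhaustive search scales linearly with the size of the training data, which is why space-partitioning schemes such as Voronoi partitioning become attractive for large datasets.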
KNN methodologies have been applied to both binary targets (e.g., attrition modeling, response modeling) and continuous targets (e.g., monthly spend, daily balance).
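For a continuous target, the scoring function changes from a vote to an average: the prediction for the query point is the mean of its neighbors' target values. A minimal sketch under that assumption:

```python
import math

def knn_regress(train_X, train_y, query, k=3):
    """Predict a continuous target (e.g., monthly spend) as the mean
    of the target values of the k nearest training points."""
    ranked = sorted(zip(train_X, train_y), key=lambda p: math.dist(p[0], query))
    top_k_values = [y for _, y in ranked[:k]]
    return sum(top_k_values) / len(top_k_values)
```

So if the three customers most similar to a query customer spent 10, 20, and 30 per month, the predicted monthly spend is their mean, 20.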
At Axtria, we believe the KNN methodology will become more and more popular in the times to come, aided by the growing computational power of the technology infrastructure in organizations. Further, over time, it will stop being the pure black-box exercise it often is for modelers today. Modelers will develop the "art" of determining appropriate thresholds for k, gaining greater proficiency in modeling with the KNN technique. We can't wait to see the power of techniques like KNN unleashed on the real world!