Data clustering is one of the most fundamental components of the machine learning toolkit. Many of these algorithms are easily implemented in Python, making it the go to language for data scientists when implementing canned routines.
At the PyData conference last fall, I attended this talk by Leland McInnes & John Healy which introduced the HDBSCAN algorithm as a better alternative to clustering (vs. traditional methods ala K means clustering, affinity propagation, mean shift, spectral clustering, etc.) . Its main advantages:
- You don’t have to pick number of clusters
- No spherical “cows” to use their terminology. There’s not some centroid assumption in the shape of clusters.
- Not every point has to belong to a cluster (more natural for clustering as opposed to partitioning)
For an intuitive explanation of the algorithm, see this.
I had been looking for a place to apply HDBSCAN in my own research for some time, and I think geographic clustering is the perfect one. In one of my current projects, we are estimating demand of hotels across many markets. In classical definitions of markets, we ignore the fact that there are varying locations of desirability within a city. Not all locations in NYC are the same. Moreover, the value of the location evolves over time as restaurants and attractions are built and/or become more prominent.
HDBSCAN potentially solves our intra-city markets problem by clustering venues together into local “neighborhoods.” There’s some intuitive appeal to the idea that areas of relatively uniform venue density that’s higher than its surroundings form “neighborhoods.” Moreover, these areas of high venue density are often non-spherical; e.g. the Las Vegas strip is all on one street and not all equidistant points from some central point. HDBSCAN, not dependent on spherical balls to form clusters, perfectly adapts to this characteristic of geographic data. Another characteristic is that each city may have a different number of defining neighborhoods. As HDBSCAN only requires a minimum cluster size as a hyperparameter, we do not need to pre-determine the number of neighborhoods to implement HDBSCAN as the clustering algorithm. Finally, in a city, there are often businesses that are not in some dense cluster, these “stand-alone” businesses should not be forced into a neighborhood – another feature accommodated by HDBSCAN.
The folder for this small project w/ data and the executable file is linked here, the resulting map looks like this! There’s much more that we can do (shade neighborhoods based on some data – popularity by review volume perhaps?).