Merge pull request #654 from JelmerBot/dev/flasc-fixes
Fix typo's and avoid internal numpy API in branch detection code.
lmcinnes authored Aug 15, 2024
2 parents 2e7112d + 7037c60 commit 5559983
Showing 3 changed files with 67 additions and 48 deletions.
32 changes: 20 additions & 12 deletions docs/how_to_detect_branches.rst
@@ -1,4 +1,4 @@
-How to detect banches in clusters
+How to detect branches in clusters
=================================

HDBSCAN\* is often used to find subpopulations in exploratory data
@@ -14,20 +14,21 @@ does not inform us of the branching structure:
.. image:: images/how_to_detect_branches_3_0.png

Alternatively, HDBSCAN\*’s leaf clusters provide more detail. They
-segment the points of different branches into distint clusters. However,
+segment the points of different branches into distinct clusters. However,
the partitioning and cluster hierarchy does not (necessarily) tell us how
those clusters combine into a larger shape.

.. image:: images/how_to_detect_branches_5_0.png

This is where the branch detection post-processing step comes into play.
The functionality is described in detail by `Bot et
-al <https://arxiv.org/abs/2311.15887>`__. It operates on the detected
-clusters and extracts a branch-hierarchy analogous to HDBSCAN\*’s
-condensed cluster hierarchy. The process is very similar to HDBSCAN\*
-clustering, except that it operates on an in-cluster eccentricity rather
-than a density measure. Where peaks in a density profile correspond to
-clusters, the peaks in an eccentricity profile correspond to branches:
+al <https://arxiv.org/abs/2311.15887>`__ (please reference this paper when using
+this functionality). It operates on the detected clusters and extracts a
+branch-hierarchy analogous to HDBSCAN\*'s condensed cluster hierarchy. The
+process is very similar to HDBSCAN\* clustering, except that it operates on an
+in-cluster eccentricity rather than a density measure. Where peaks in a density
+profile correspond to clusters, the peaks in an eccentricity profile correspond
+to branches:

.. image:: images/how_to_detect_branches_7_0.png

@@ -41,11 +42,18 @@ The resulting partitioning reflects subgroups for clusters and their
branches:

.. code:: python
from hdbscan import HDBSCAN, BranchDetector
clusterer = HDBSCAN(min_cluster_size=15, branch_detection_data=True).fit(data)
branch_detector = BranchDetector(min_branch_size=15).fit(clusterer)
plot(branch_detector.labels_)
# Plot labels
plt.scatter(data[:, 0], data[:, 1], 1, color=[
"silver" if l < 0 else f"C{l % 10}" for l in branch_detector.labels_
])
plt.axis("off")
plt.show()
.. image:: images/how_to_detect_branches_9_0.png
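
The plotting lines added in this hunk turn each branch label into a colour name: noise points (negative labels) are drawn in silver, and every other label cycles through matplotlib's ten ``C0`` to ``C9`` colour aliases. A minimal sketch of just that mapping follows. The helper name ``label_colors`` and the sample labels are hypothetical and do not appear in the commit.

```python
def label_colors(labels, n_colors=10):
    """Map noise labels (< 0) to "silver" and all others to "C0".."C9".

    `label_colors` is a hypothetical helper; the commit inlines the same
    list comprehension directly in the `plt.scatter` call.
    """
    return ["silver" if label < 0 else f"C{label % n_colors}" for label in labels]


# Made-up labels: one noise point, two small labels, and one label >= 10
# that wraps around to reuse an earlier colour.
colors = label_colors([-1, 0, 1, 11])
print(colors)
```

Because the colour index is taken modulo ten, labels 1 and 11 share a colour, so plots with more than ten branches reuse colours.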

@@ -75,7 +83,7 @@ Most guidelines for tuning HDBSCAN\* also apply for the branch detector:
``allow_single_cluster`` and mostly affects the EOM selection
strategy. When enabled, clusters with bifurcations will be given a
single label if the root segment contains most eccentricity mass
-(i.e., branches already merge far from the center and most poinst are
+(i.e., branches already merge far from the center and most points are
central).
- ``max_branch_size`` behaves like HDBSCAN\*’s ``max_cluster_size`` and
mostly affects the EOM selection strategy. Branches with more than
@@ -99,7 +107,7 @@ Two parameters are unique to the ``BranchDetector`` class:
all ``min_samples``-nearest neighbours.
- The ``"full"`` method connects all points with a mutual
reachability lower than the maximum distance in the cluster’s MST.
-It represents all connectity at the moment the last point joins
+It represents all connectivity at the moment the last point joins
the cluster.
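
To make the ``"full"`` description above more concrete, here is a rough NumPy sketch under stated assumptions. It is not the library's implementation. The function names are hypothetical, the core distance is assumed to be the distance to the ``min_samples``-th nearest neighbour (not counting the point itself), and the threshold comparison is inclusive so that the MST's own longest edge is kept.

```python
import numpy as np


def mutual_reachability(points, min_samples):
    """Dense mutual reachability matrix (assumed definition, see lead-in)."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Column 0 of each sorted row is the point itself (distance 0).
    core = np.sort(dist, axis=1)[:, min_samples]
    mr = np.maximum(dist, np.maximum(core[:, None], core[None, :]))
    np.fill_diagonal(mr, 0.0)
    return mr


def max_mst_edge(weights):
    """Longest edge of the minimum spanning tree, via a simple Prim's algorithm."""
    n = len(weights)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = weights[0].copy()  # cheapest known connection to the tree
    longest = 0.0
    for _ in range(n - 1):
        best[in_tree] = np.inf  # never re-select points already in the tree
        nxt = int(np.argmin(best))
        longest = max(longest, float(best[nxt]))
        in_tree[nxt] = True
        best = np.minimum(best, weights[nxt])
    return longest


def full_graph_edges(points, min_samples):
    """All point pairs whose mutual reachability is at most the longest MST edge."""
    mr = mutual_reachability(points, min_samples)
    threshold = max_mst_edge(mr)
    rows, cols = np.triu_indices(len(points), k=1)
    keep = mr[rows, cols] <= threshold
    return [(int(i), int(j)) for i, j in zip(rows[keep], cols[keep])]


# Four made-up points on a line; the last one is far from the rest.
points = np.array([[0.0], [1.0], [2.0], [10.0]])
edges = full_graph_edges(points, min_samples=1)
print(edges)
```

In this toy cluster the outlying point joins last at distance 8, so every pair within that distance is connected, including the pair (0, 2) that is not part of the MST.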

These methods differ in their sensitivity, noise robustness, and
@@ -143,7 +151,7 @@ cluster.

The length of the branches also says something about the compactness /
elongatedness of clusters. For example, the branch hierarchy for the
-orange ~-shaped cluster is quite different from the same hierarcy for
+orange ~-shaped cluster is quite different from the same hierarchy for
the central o-shaped cluster.

.. code:: python
60 changes: 35 additions & 25 deletions hdbscan/plots.py
@@ -948,36 +948,46 @@ def __init__(
         branch_probabilities,
         raw_data=None,
     ):
-        self._edges = np.core.records.fromarrays(
-            np.hstack(
-                (
-                    np.concatenate(approximation_graphs),
-                    np.repeat(
-                        np.arange(len(approximation_graphs)),
-                        [g.shape[0] for g in approximation_graphs],
-                    )[None].T,
-                )
-            ).transpose(),
-            names="parent, child, centrality, mutual_reachability, cluster",
-            formats="intp, intp, double, double, intp",
+        self._edges = np.array(
+            [
+                (edge[0], edge[1], edge[2], edge[3], cluster)
+                for cluster, edges in enumerate(approximation_graphs)
+                for edge in edges
+            ],
+            dtype=[
+                ("parent", np.intp),
+                ("child", np.intp),
+                ("centrality", np.float64),
+                ("mutual_reachability", np.float64),
+                ("cluster", np.intp),
+            ],
         )
         self.point_mask = cluster_labels >= 0
         self._raw_data = raw_data[self.point_mask, :] if raw_data is not None else None
-        self._points = np.core.records.fromarrays(
-            np.vstack(
+        self._points = np.array(
+            [
                 (
-                    np.where(self.point_mask)[0],
-                    labels[self.point_mask],
-                    probabilities[self.point_mask],
-                    cluster_labels[self.point_mask],
-                    cluster_probabilities[self.point_mask],
-                    cluster_centralities[self.point_mask],
-                    branch_labels[self.point_mask],
-                    branch_probabilities[self.point_mask],
+                    i,
+                    labels[i],
+                    probabilities[i],
+                    cluster_labels[i],
+                    cluster_probabilities[i],
+                    cluster_centralities[i],
+                    branch_labels[i],
+                    branch_probabilities[i],
                 )
-            ),
-            names="id, label, probability, cluster_label, cluster_probability, cluster_centrality, branch_label, branch_probability",
-            formats="intp, intp, double, intp, double, double, intp, double",
+                for i in np.where(self.point_mask)[0]
+            ],
+            dtype=[
+                ("id", np.intp),
+                ("label", np.intp),
+                ("probability", np.float64),
+                ("cluster_label", np.intp),
+                ("cluster_probability", np.float64),
+                ("cluster_centrality", np.float64),
+                ("branch_label", np.intp),
+                ("branch_probability", np.float64),
+            ],
         )
         self._pos = None
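
The pattern this hunk switches to can be tried in isolation: build a structured array from a list of tuples and an explicit field dtype, using only public NumPy API. The sketch below is illustrative rather than a copy of the library code. The sample `approximation_graphs` values are made up, and the explicit `int(...)` / `float(...)` casts are an addition for clarity that the commit does not include.

```python
import numpy as np

# Hypothetical stand-in for `approximation_graphs`: one (n_edges, 4) array per
# cluster with columns parent, child, centrality, mutual_reachability.
approximation_graphs = [
    np.array([[0, 1, 0.5, 1.2], [1, 2, 0.7, 0.9]]),
    np.array([[3, 4, 0.1, 2.0]]),
]

edge_dtype = [
    ("parent", np.intp),
    ("child", np.intp),
    ("centrality", np.float64),
    ("mutual_reachability", np.float64),
    ("cluster", np.intp),
]

# One tuple per edge; the cluster index comes from the position of the graph
# in the list, as in the commit.
edges = np.array(
    [
        (int(edge[0]), int(edge[1]), float(edge[2]), float(edge[3]), cluster)
        for cluster, graph in enumerate(approximation_graphs)
        for edge in graph
    ],
    dtype=edge_dtype,
)

# Fields are addressed by name, so boolean masks work per column.
cluster_one = edges[edges["cluster"] == 1]
print(edges.dtype.names)
print(cluster_one)
```

Compared with the removed `np.core.records.fromarrays` call, this avoids NumPy's internal `np.core` namespace and keeps each field's type next to its name, at the cost of a Python-level loop over edges.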

23 changes: 12 additions & 11 deletions notebooks/How to detect branches.ipynb
@@ -4,7 +4,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"# How to detect banches in clusters\n",
+"# How to detect branches in clusters\n",
"\n",
"HDBSCAN\\* is often used to find subpopulations in exploratory data analysis\n",
"workflows. Not only clusters themselves, but also their shape can represent\n",
@@ -107,7 +107,7 @@
"metadata": {},
"source": [
"Alternatively, HDBSCAN\\*'s leaf clusters provide more detail. They segment the\n",
-"points of different branches into distint clusters. However, the partitioning\n",
+"points of different branches into distinct clusters. However, the partitioning\n",
"and cluster hierarchy does not (necessarily) tell us how those clusters combine\n",
"into a larger shape."
]
@@ -143,12 +143,13 @@
"source": [
"This is where the branch detection post-processing step comes into play. The\n",
"functionality is described in detail by [Bot et\n",
-"al](https://arxiv.org/abs/2311.15887). It operates on the detected clusters and\n",
-"extracts a branch-hierarchy analogous to HDBSCAN*'s condensed cluster hierarchy.\n",
-"The process is very similar to HDBSCAN* clustering, except that it operates on\n",
-"an in-cluster eccentricity rather than a density measure. Where peaks in a\n",
-"density profile correspond to clusters, the peaks in an eccentricity profile\n",
-"correspond to branches:"
+"al](https://arxiv.org/abs/2311.15887) (please reference this paper when using\n",
+"this functionality). It operates on the detected clusters and extracts a\n",
+"branch-hierarchy analogous to HDBSCAN\\*'s condensed cluster hierarchy. The\n",
+"process is very similar to HDBSCAN\\* clustering, except that it operates on an\n",
+"in-cluster eccentricity rather than a density measure. Where peaks in a density\n",
+"profile correspond to clusters, the peaks in an eccentricity profile correspond\n",
+"to branches:"
]
},
{
@@ -269,7 +270,7 @@
" mostly affects the EOM selection strategy. When enabled, clusters with\n",
" bifurcations will be given a single label if the root segment contains most\n",
" eccentricity mass (i.e., branches already merge far from the center and most\n",
-" poinst are central).\n",
+" points are central).\n",
"- `max_branch_size` behaves like HDBSCAN\\*'s `max_cluster_size` and mostly\n",
" affects the EOM selection strategy. Branches with more than the specified\n",
" number of points are skipped, selecting their descendants in the hierarchy\n",
@@ -288,7 +289,7 @@
" minimum spanning tree under HDBSCAN\\*'s mutual reachability distance. This\n",
" graph contains the detected MST and all `min_samples`-nearest neighbours. \n",
" - The `\"full\"` method connects all points with a mutual reachability lower\n",
-" than the maximum distance in the cluster's MST. It represents all connectity\n",
+" than the maximum distance in the cluster's MST. It represents all connectivity\n",
" at the moment the last point joins the cluster. These methods differ in\n",
" their sensitivity, noise robustness, and computational cost. The `\"core\"`\n",
" method usually needs slightly higher `min_branch_size` values to suppress\n",
@@ -348,7 +349,7 @@
"source": [
"The length of the branches also says something about the compactness /\n",
"elongatedness of clusters. For example, the branch hierarchy for the orange\n",
-"~-shaped cluster is quite different from the same hierarcy for the central\n",
+"~-shaped cluster is quite different from the same hierarchy for the central\n",
"o-shaped cluster."
]
},