An introduction to biclustering

Introduction

This is the first in my series of biclustering posts. It serves as an introduction to the concept. The later posts will cover the individual algorithms that I am implementing for scikit-learn.

Before talking about biclustering, it is necessary to cover the basics of clustering. Clustering is a fundamental problem in data mining: given a set of objects, group them according to some measure of similarity. This deceptively simple concept has given rise to a wide variety of problems and algorithms to solve them. scikit-learn provides a number of clustering methods, but these represent only a small fraction of the diversity of the field. For a more detailed overview of clustering, there are several surveys available [1], [2], [3].

In many clustering problems, each of the \(n\) samples to be clustered is represented by a \(p\) -dimensional feature vector. The entire dataset may then be represented by a matrix of shape \(n\) x \(p\) . After applying some clustering algorithm, each sample belongs to one cluster.

To demonstrate, consider a clustering problem with 20 samples and 20 dimensions. Here is a scatterplot of the first two dimensions, with cluster membership designated by color:

By rearranging the rows of the data matrix, the samples belonging to each cluster can be made contiguous. In the original data (left) the clusters are not visible, but in the rearranged data (right), the correction partition is more obvious:

This view of clustering — partitioning the rows of the data matrix — leads to the definition of biclustering.

Biclustering: a data mining method that simultaneously clusters both rows and columns of a data matrix.

Clustering rows and columns together may seem a strange and unintuitive thing to do. To see why it might be useful, let us consider a simple example.

Motivating example: throwing a party

Bob is planning a housewarming party for his new three-room house. Each room has a separate sound system, so he wants to play different music in each room. As a conscientious host, Bob wants everyone to enjoy the music. Therefore, he needs to distribute albums and guests to each room in order to ensure that each guest hears their favorite songs.

Our host has invited 50 guests, and he owns 30 albums. He sends out a survey to each guest asking if they like or dislike each album. After receiving their responses, he collects the data into a 50 x 30 binary matrix \(\boldsymbol M\) , where \(M_{ij}=1\) if guest \(i\) likes album \(j\) .

Bob decides to bicluster this matrix to maximize the following objective function:

\begin{equation*} s(\boldsymbol M, \boldsymbol r, \boldsymbol c) = b(\boldsymbol r, \boldsymbol c) \cdot \sum_{i, j, k} M_{ij} r_{ki} c_{kj} \end{equation*}

where \(r_{ki}\) is an indicator variable for membership of guest \(i\) in cluster \(k\) , \(c_{kj}\) is an indicator variable for album membership, and \(b \in [0, 1]\) penalizes unbalanced solutions, i.e., those with biclusters of different sizes. As the difference in sizes between the largest and the smallest bicluster grows, \(b\) decays as:

\begin{equation*} b(\boldsymbol r, \boldsymbol c) = exp(\frac{max(\mathcal S) - min(\mathcal S)}{-\epsilon}) \end{equation*}

where

\begin{equation*} \mathcal S = \{ (\sum_{i} r_{ki}) \cdot (\sum_{j} c_{kj}) | k = 1..3 \} \end{equation*}

is the set of bicluster sizes, and \(\epsilon\) is a parameter that sets the aggressiveness of the penalty.

Bob uses the following algorithm to find his solution: starting with a random assignment of rows and columns to clusters, he iteratively reassigns row and columns to improve the objective function until convergence. In a simulated annealing fashion, he allows suboptimal reassignments to avoid local minima. However, this algorithm is not guaranteed to be optimal. If Bob had wanted an optimal solution, the naive approach would require trying every possible clustering, resulting in \(k^{n+p} = 3^{80}\) candidate solutions. This suggests that Bob’s problem is in a nonpolynomial complexity class. In fact, most formulations of biclustering problems are in NP-complete [5].

After biclustering, the rows and columns of the data matrix may be reordered to show the assignment of guests and albums to rooms. Here are the original data set (left) and the clusters that Bob found (right):

Although not everyone will enjoy every album, clearly this solution ensures that most people will enjoy most albums.

Conclusion

This article introduced biclustering through a contrived example, but these methods is not limited to throwing great parties. Any data that can be represented as a matrix is amenable to biclustering. It is a popular technique for analyzing gene expression data from microarray experiments. It has also been applied to recommendation systems, market research, databases, financial data, and agricultural data, as well as many other problems.

These diverse problems require more sophisticated algorithms than Bob’s party planning algorithm. His algorithm only works for binary data and assigns each row and each column to exactly one cluster. However, many other types of bicluster structures have been proposed. Those interested in a more detailed overview of the field may be interested in the surveys [4], [5], [6].

In the following weeks I will introduce some popular biclustering algorithms as I implement them. The next post will cover spectral biclustering.

References

[1]	Jain, A. K., Murty, M. N., & Flynn, P. J.(1999). Data clustering: a review. ACM computing surveys (CSUR), 31(3), 264-323.

[2]	Berkhin, P. (2006). A survey of clustering data mining techniques. In Grouping multidimensional data (pp. 25-71). Springer Berlin Heidelberg.

[3]	Jain, A. K.(2010). Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31(8), 651-666.

[4]	Tanay, A., Sharan, R., & Shamir, R. (2005). Biclustering algorithms: A survey. Handbook of computational molecular biology, 9, 26-1.

[5]	(1, 2) Madeira, S. C., & Oliveira, A. L.(2004). Biclustering algorithms for biological data analysis: a survey. Computational Biology and Bioinformatics, IEEE/ACM Transactions on, 1(1), 24-45.