In this blog post we will learn how to do dimensionality reduction on datasets.
This is useful for visualising word embeddings or any other data with more than two or three dimensions.
Two algorithms in particular, t-SNE and PCA, are easy to use because they are already implemented in scikit-learn.
First, we need to load a dataset; let's use a simple one that ships with scikit-learn:
In [1]:
%matplotlib notebook
from sklearn.datasets import load_digits
digits = load_digits()
print("Data:")
print(digits.data)
print("Maximum Value:")
print(digits.data.max())
print("Normalized Data:")
print(digits.data/digits.data.max())
data = (digits.data/digits.data.max())[:500]
labels = digits.target[:500]
print(labels)
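As a quick sanity check (a sketch added here, not part of the original post), the digits dataset consists of 1797 samples with 64 features each (8x8 pixel grids, values 0 to 16), so dividing by the global maximum maps everything into [0, 1]:

```python
from sklearn.datasets import load_digits

digits = load_digits()
data = digits.data / digits.data.max()  # global max is 16.0 for this dataset

print(digits.data.shape)  # 1797 digits, 8x8 = 64 pixel features
assert data.min() == 0.0 and data.max() == 1.0  # values now span [0, 1]
```

Note this divides by one global maximum rather than scaling each feature separately, which is fine here because all 64 features share the same 0-16 pixel range.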
Now we need to do dimensionality reduction on the data:
In [2]:
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
# Note: despite the variable names, both embeddings come from t-SNE, not PCA.
# perplexity=100 is well above the default of 30; with only 500 samples this
# smooths the local structure quite aggressively.
twod_pca_data = TSNE(n_components=2, perplexity=100.0).fit_transform(data)
threed_pca_data = TSNE(n_components=3, perplexity=100.0).fit_transform(data)
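PCA is imported above but never actually used. As a sketch (variable names here are illustrative), it produces embeddings of the same shape as the t-SNE call, runs much faster, and its `explained_variance_ratio_` attribute reports how much of the data's variance each component retains:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()
data = (digits.data / digits.data.max())[:500]

pca = PCA(n_components=2)
twod = pca.fit_transform(data)        # shape (500, 2), drop-in for the t-SNE output
print(twod.shape)
print(pca.explained_variance_ratio_)  # fraction of total variance kept per component
```

Because PCA is a deterministic linear projection, it is a useful fast baseline before reaching for t-SNE, whose output varies between runs.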
In [4]:
import matplotlib.pyplot as plt
# Plot each digit class in its own colour.
for label in set(digits.target_names):
    data_for_label = twod_pca_data[labels == label]
    plt.scatter(data_for_label[:, 0], data_for_label[:, 1], label=str(label))
plt.legend()
plt.tight_layout()
plt.show()
In [5]:
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(15, 10))
ax = fig.add_subplot(111, projection='3d')
# Same per-class scatter as above, now with three t-SNE components.
for label in set(digits.target_names):
    data_for_label = threed_pca_data[labels == label]
    ax.scatter(data_for_label[:, 0], data_for_label[:, 1], data_for_label[:, 2], label=str(label), s=300)
plt.legend()
plt.tight_layout()
plt.show()
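The perplexity value used above (100) is a judgment call. As a hypothetical sweep (not from the original post), you can refit t-SNE at several perplexities on a small subset and inspect the resulting embeddings; scikit-learn's fitted `TSNE` also exposes the final `kl_divergence_`, though those values are not directly comparable across different perplexities, so the plots remain the better guide:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()
subset = (digits.data / digits.data.max())[:100]  # small subset keeps the runs fast

results = {}
for perplexity in (5, 30, 50):  # perplexity must stay below the sample count
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    embedding = tsne.fit_transform(subset)
    results[perplexity] = tsne.kl_divergence_  # final KL divergence of this fit
    print(perplexity, tsne.kl_divergence_)
```

Lower perplexities emphasise fine-grained local clusters, higher ones the global layout; re-plotting each `embedding` with the scatter loop above shows the difference directly.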