scikit-learn¶
Scikit-learn (sklearn) integrates machine learning algorithms in the tightly-knit scientific Python world, building upon numpy, scipy, and matplotlib. As a machine-learning module, it provides versatile tools for data mining and analysis in any field of science and engineering. It strives to be simple and efficient, accessible to everybody, and reusable in various contexts.
Policy¶
scikit-learn is freely available to users at HPC2N under an open-source, commercially usable BSD license.
Citations
If you use scikit-learn in a scientific publication, the developers would appreciate a citation of the following paper:
Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
For more information, including BibTeX entries, see Citing scikit-learn.
Overview¶
Scikit-learn (sklearn) is a powerful and easy-to-use open-source machine learning library for Python. It provides simple and efficient tools for data mining and data analysis, and it is built on NumPy, SciPy, and matplotlib. Scikit-learn is designed to interoperate with the Python numerical and scientific libraries.
More often than not, scikit-learn is used alongside other popular libraries such as TensorFlow and PyTorch for exploratory data analysis, data preprocessing, model selection, and evaluation. For our examples, we will use a Jupyter notebook on a CPU node to visualize the data and the results.
- Simple and efficient tools for predictive data analysis
- Accessible to everybody, and reusable in various contexts
- Built on NumPy, SciPy, and matplotlib
- Open source, commercially usable - BSD license
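The preprocessing, model selection, and evaluation steps mentioned above can be sketched in a few lines. This is a minimal illustration only: the bundled iris dataset, the logistic regression model, and the 75/25 split are assumptions chosen for the example, not something prescribed by HPC2N.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load a small dataset that ships with scikit-learn (no download needed)
X, y = load_iris(return_X_y=True)

# Hold out 25% of the data for the final evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Preprocessing and model combined into one estimator
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Model selection: 5-fold cross-validation on the training data only
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)

# Evaluation: fit on the full training set, score on the held-out set
pipeline.fit(X_train, y_train)
test_accuracy = pipeline.score(X_test, y_test)

print(f"Mean CV accuracy: {cv_scores.mean():.3f}")
print(f"Test accuracy:    {test_accuracy:.3f}")
```

Wrapping the scaler and the model in one pipeline means the scaler is refitted inside each cross-validation fold, so no information from the validation folds leaks into the preprocessing.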
Usage at HPC2N¶
At HPC2N, scikit-learn is available as a module.
Loading¶
To use the scikit-learn module, add it to your environment. You can find the available versions with
module spider scikit-learn
and you can then find out how to load a specific version (including prerequisites) with
module spider scikit-learn/<version>
Example
Loading the module for scikit-learn 1.4.2 for Python 3.11.3
module load GCC/12.3.0
module load scikit-learn/1.4.2
This will draw in a number of other modules. You can see which ones with ml after loading.
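Once the module is loaded, a quick way to confirm that Python picks up the expected installation is to print the versions from within Python. This is a small sketch; the version numbers you see depend on which module you loaded.

```python
import sys

import sklearn

# Report the interpreter and scikit-learn versions that are actually in use
python_version = sys.version.split()[0]
sklearn_version = sklearn.__version__

print("Python:", python_version)
print("scikit-learn:", sklearn_version)
```

For a more detailed report, including the versions of NumPy and SciPy, `sklearn.show_versions()` can be called instead.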
Running¶
To run a Python script that uses scikit-learn, first load the module and its prerequisites.
Example
Running the short example “linreg.py” which uses scikit-learn (and also matplotlib):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# Generate some data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1, 3, 2, 3, 5])
# Create and fit the model
model = LinearRegression()
model.fit(X, y)
# Make predictions
y_pred = model.predict(X)
# Plot the results
plt.scatter(X, y, color='black')
plt.plot(X, y_pred, color='blue', linewidth=3)
plt.xlabel('X')
plt.ylabel('y')
plt.title('Linear Regression Example')
plt.show()
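Besides plotting, it can be useful to inspect the fitted model numerically. The sketch below reuses the same data as "linreg.py" and prints the slope, the intercept, and the coefficient of determination; the use of `r2_score` is an addition for illustration and is not part of the original script.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Same data as in linreg.py
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1, 3, 2, 3, 5])

# Fit an ordinary least-squares line
model = LinearRegression().fit(X, y)

slope = model.coef_[0]          # one coefficient, since there is one feature
intercept = model.intercept_
r2 = r2_score(y, model.predict(X))

print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, R^2 = {r2:.3f}")
# prints: slope = 0.800, intercept = 0.400, R^2 = 0.727
```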
Reminder
Only run very short jobs directly on the login node. Anything longer should be run as a batch job!
Batch script¶
An example batch script to run the above (serial) “linreg.py” example would look like this:
#!/bin/bash
#SBATCH -A hpc2nXXXX-YYY # Change to your own
#SBATCH --time=00:10:00 # Asking for 10 minutes - change as needed
#SBATCH -n 1 # Asking for 1 core
# Load any modules you need, here for scikit-learn/1.4.2 and a compatible matplotlib version
module load GCC/12.3.0
module load scikit-learn/1.4.2
module load matplotlib/3.7.2
# Now run the Python code
python linreg.py
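A batch job normally has no graphical display attached, so `plt.show()` will typically not open a window there. One common workaround, sketched below, is to select matplotlib's non-interactive Agg backend and write the figure to a file. The output name "linreg.png" is a hypothetical choice for this example.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, must be set before importing pyplot

import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# Same data and model as in linreg.py
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1, 3, 2, 3, 5])
model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

# Draw the same plot, but save it instead of showing it
fig, ax = plt.subplots()
ax.scatter(X.ravel(), y, color="black")
ax.plot(X.ravel(), y_pred, color="blue", linewidth=3)
ax.set_xlabel("X")
ax.set_ylabel("y")
ax.set_title("Linear Regression Example")

output_file = "linreg.png"  # hypothetical file name, written to the working directory
fig.savefig(output_file, dpi=150)
plt.close(fig)
```

The saved image ends up in the directory the job was run from and can be copied or viewed after the job has finished.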
Additional info¶
More information can be found on