[NumPy Tips] Using NumPy to save and manage datasets in machine learning.

The article was first published on 若绾

Summary#

Data is central to every machine learning project, so managing it well matters. Data management covers collection, cleaning, storage, and processing. In this article, we will discuss how to use NumPy to save datasets to disk so they can be managed in one place.

Saving Datasets with NumPy#

NumPy is a Python library for scientific computing. It provides a powerful multidimensional array object and a range of functions for manipulating these arrays. NumPy arrays can hold numbers, strings, or boolean values, although every element of a given array shares a single data type (dtype). This makes them a convenient choice for storing datasets.

Saving a Single Array#

If your dataset consists of a single array, you can use NumPy's save function, which writes the array to a binary file in .npy format.

numpy.save#

Function parameters:

file: file, str, or pathlib.Path

The file or filename where the data is saved. If file is a file object, the filename will not be changed. If file is a string or Path, a .npy extension will be appended to the filename if it does not already have one.

arr: array_like

The array data to be saved.

allow_pickle: bool, optional

Allow saving object arrays using Python pickles. Reasons for disallowing pickles include security (loading pickled data can execute arbitrary code) and portability (pickled objects may not be loadable on different Python installations, for example if the stored objects require libraries that are not available, and not all pickled data is compatible between Python 2 and Python 3). Default: True

fix_imports: bool, optional

Only useful when object arrays pickled on Python 3 need to be written in a Python 2 compatible way. If fix_imports is True, pickle will try to map the new Python 3 names to the old module names used in Python 2, so that the pickle data stream is readable with Python 2.

For example, suppose we have a NumPy array named data. We can save it to a file named data.npy using the following code:

import numpy as np

data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
np.save('data.npy', data)
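
The extension rule described for the file parameter can be checked with a small sketch. The temporary directory and the extension-less name "data" are assumptions made for illustration only; they are not part of the original example.

```python
import tempfile
from pathlib import Path

import numpy as np

data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Work in a throwaway directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "data"  # hypothetical name without an extension
    np.save(target, data)            # NumPy appends ".npy" to the filename

    saved_names = sorted(p.name for p in Path(tmp_dir).iterdir())

print(saved_names)  # ['data.npy']
```

Passing a pathlib.Path works the same way as passing a string, so either style can be used consistently across a project.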

Loading Arrays from Files#

To load a dataset, we can use the load function of NumPy. For example, the following code loads the array saved in the file data.npy:

import numpy as np

data = np.load('data.npy')
print(data)

The output will be:

[[1 2 3]
 [4 5 6]
 [7 8 9]]
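
The allow_pickle parameter described earlier also matters on the loading side: numpy.load defaults to allow_pickle=False, so an object array saved with pickling has to be loaded with an explicit opt-in. The sketch below illustrates this under assumed names (records, records.npy) that do not appear in the original article. Only enable allow_pickle for files you trust.

```python
import tempfile
from pathlib import Path

import numpy as np

# An object array holds arbitrary Python objects, so NumPy must pickle it.
records = np.array([{"id": 1}, {"id": 2}], dtype=object)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "records.npy"  # hypothetical file name
    np.save(path, records)                # allow_pickle defaults to True when saving

    try:
        np.load(path)                     # allow_pickle defaults to False when loading
        load_refused = False
    except ValueError:
        load_refused = True

    # Explicit opt-in; safe here because we wrote the file ourselves.
    restored = np.load(path, allow_pickle=True)

print(load_refused)       # True
print(restored[0]["id"])  # 1
```

Plain numeric arrays, such as the data array above, never need pickling and load without this flag.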

Saving Multiple Arrays Simultaneously#

If your dataset consists of multiple arrays, such as train_set, train_label, test_set, test_label, you can use the savez or savez_compressed function of NumPy to save the dataset. numpy.savez saves multiple arrays as an uncompressed .npz file, while numpy.savez_compressed saves the arrays as a compressed .npz file, which can save storage space.

numpy.savez#

Function parameters:

file: file, str, or pathlib.Path

File or filename to which the data is saved. If file is a string or a Path, a .npz extension will be appended to the filename if it is not already there.

args: Arguments, optional

Arrays to save to the file. Because the names of arrays passed positionally are not known to savez, they are stored under the default names arr_0, arr_1, and so on. It is possible to save arrays that are not contiguous in memory.

kwds: Keyword arguments, optional

Arrays to save to the file. Arrays will be saved in the file with names corresponding to the keywords.

In the following example, we create two NumPy arrays, array1 and array2, and save them to a file named arrays.npz using numpy.savez. Each keyword argument becomes the name under which the array is stored in the file. Keyword arguments are not strictly required, but arrays passed positionally receive the default names arr_0, arr_1, and so on, which are harder to keep track of.

import numpy as np

array1 = np.array([1, 2, 3])
array2 = np.array([4, 5, 6])

# Save arrays to file
np.savez('arrays.npz', arr1=array1, arr2=array2)
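
For comparison, here is a minimal sketch of what happens when the arrays are passed positionally instead of by keyword. It writes to an in-memory io.BytesIO buffer, which is an assumption made to keep the example free of files on disk.

```python
import io

import numpy as np

array1 = np.array([1, 2, 3])
array2 = np.array([4, 5, 6])

buffer = io.BytesIO()             # in-memory stand-in for a file on disk
np.savez(buffer, array1, array2)  # positional arrays, no keyword names
buffer.seek(0)                    # rewind before reading the archive back

with np.load(buffer) as archive:
    names = sorted(archive.files)
    first = archive["arr_0"]

print(names)  # ['arr_0', 'arr_1']
print(first)  # [1 2 3]
```

Descriptive keyword names are usually the better choice for datasets, since arr_0 and arr_1 say nothing about what each array contains.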

numpy.savez_compressed#

Function parameters:

file: file, str, or pathlib.Path

File or filename to which the data is saved. If file is a string or a Path, a .npz extension will be appended to the filename if it is not already there.

args: Arguments, optional

Arrays to save to the file. Because the names of arrays passed positionally are not known to savez_compressed, they are stored under the default names arr_0, arr_1, and so on. It is possible to save arrays that are not contiguous in memory.

kwds: Keyword arguments, optional

Arrays to save to the file. Arrays will be saved in the file with names corresponding to the keywords.

This example is similar to the previous one, but uses numpy.savez_compressed to save the arrays as a compressed .npz file.

import numpy as np

array1 = np.array([1, 2, 3])
array2 = np.array([4, 5, 6])

# Save arrays to compressed file
np.savez_compressed('compressed_arrays.npz', arr1=array1, arr2=array2)
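
To get a feel for the space savings, the sketch below saves the same array with both functions and compares file sizes. The all-zeros array, the key name features, and the file names are assumptions chosen because zeros compress extremely well. Real datasets, especially noisy floating-point data, will typically shrink far less.

```python
import tempfile
from pathlib import Path

import numpy as np

# 1000 x 1000 float64 zeros: about 8 MB of highly compressible data.
zeros = np.zeros((1000, 1000), dtype=np.float64)

with tempfile.TemporaryDirectory() as tmp_dir:
    plain_path = Path(tmp_dir) / "plain.npz"    # hypothetical file names
    packed_path = Path(tmp_dir) / "packed.npz"

    np.savez(plain_path, features=zeros)
    np.savez_compressed(packed_path, features=zeros)

    plain_size = plain_path.stat().st_size
    packed_size = packed_path.stat().st_size

print(f"uncompressed: {plain_size} bytes")
print(f"compressed:   {packed_size} bytes")
```

Compression costs extra CPU time when saving and loading, so the uncompressed format can still be the better fit for data that is read often and compresses poorly.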

Loading Arrays from Files#

To load arrays from an .npz file, you can use the numpy.load function:

import numpy as np

# Load saved arrays
loaded_arrays = np.load('arrays.npz')

# Access arrays by the names specified in the file
loaded_array1 = loaded_arrays['arr1']
loaded_array2 = loaded_arrays['arr2']

In this example, we use numpy.load to load the file named arrays.npz and access the arrays within it by the names specified earlier. The same approach applies to loading compressed .npz files.
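
Putting the pieces together, here is a sketch of a full round trip for a dataset made of train_set, train_label, test_set, and test_label. The random data, array shapes, and the file name dataset.npz are illustrative assumptions. The object returned by numpy.load for an .npz file reads each array only when it is accessed and keeps the file open, so using it as a context manager ensures the file is closed afterwards.

```python
import tempfile
from pathlib import Path

import numpy as np

# Made-up data standing in for a real dataset.
rng = np.random.default_rng(0)
train_set = rng.random((8, 4))
train_label = rng.integers(0, 2, size=8)
test_set = rng.random((2, 4))
test_label = rng.integers(0, 2, size=2)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "dataset.npz"  # hypothetical file name
    np.savez_compressed(
        path,
        train_set=train_set,
        train_label=train_label,
        test_set=test_set,
        test_label=test_label,
    )

    # The context manager closes the underlying file when the block ends.
    with np.load(path) as dataset:
        names = sorted(dataset.files)  # names stored in the archive
        loaded = {name: dataset[name] for name in dataset.files}

print(names)
print(np.array_equal(loaded["train_set"], train_set))
```

Copying the arrays into a plain dictionary, as done here, keeps them usable after the archive has been closed.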

Conclusion#

In this article, we discussed how to use NumPy to save datasets for centralized management. This is an important aspect of data management that should be given due attention in any machine learning project.