The article was first published on 若绾
Summary#
In machine learning, data is crucial, so managing and processing it well is essential to any project. Data management covers collection, cleaning, storage, and processing. This article discusses how to use NumPy to save datasets for centralized management.
Saving Datasets with NumPy#
NumPy is a Python library for scientific computing. It provides a powerful multidimensional array object and a range of functions for manipulating these arrays. NumPy arrays can store different types of data, including numbers, strings, and boolean values. Hence, they are an ideal choice for storing datasets.
Saving a Single Array#
If your dataset consists of a single array, you can use NumPy's save function.
numpy.save#
Function parameters:
file: file, str, or pathlib.Path
The file or filename where the data is saved. If file is a file object, the filename will not be changed. If file is a string or Path, a .npy extension will be appended to the filename if it does not already have one.
arr: array_like
The array data to be saved.
allow_pickle: bool, optional
Allow saving object arrays using Python pickles. Reasons for disallowing pickles include security (loading pickled data can execute arbitrary code) and portability (pickled objects may not be loadable on different Python installations, for example if the stored objects require libraries that are not available, and not all pickled data is compatible between Python 2 and Python 3). Default: True
fix_imports: bool, optional
Only useful on Python 3 for objects saved with Python 2 pickle. If fix_imports is True, pickle will try to map the new Python 3 names to the old module names used in Python 2, so that the pickle data stream is readable with Python 2.
For example, suppose we have a NumPy array named data. We can save it to a file named data.npy using the following code:
import numpy as np
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
np.save('data.npy', data)
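The file and allow_pickle parameters described above can be checked with a small experiment. The following is only a sketch: it writes into a temporary directory, and the names saved_names, objects, and pickle_refused are made up for illustration.

```python
import tempfile
from pathlib import Path

import numpy as np

data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

with tempfile.TemporaryDirectory() as tmp:
    # The target name has no extension, so np.save appends ".npy".
    np.save(Path(tmp) / "data", data)
    saved_names = sorted(p.name for p in Path(tmp).iterdir())
    print(saved_names)

    # An object array can only be stored through pickle,
    # so saving it with allow_pickle=False raises ValueError.
    objects = np.array([{"a": 1}, None], dtype=object)
    try:
        np.save(Path(tmp) / "objects.npy", objects, allow_pickle=False)
        pickle_refused = False
    except ValueError:
        pickle_refused = True
    print(pickle_refused)
```

Passing allow_pickle=False is a reasonable default for purely numeric datasets, since it guarantees the file never contains pickled Python objects.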
Loading Arrays from Files#
To load a dataset, we can use NumPy's load function. For example, the following code loads the array saved in the file data.npy:
import numpy as np
data = np.load('data.npy')
print(data)
The output will be:
[[1 2 3]
 [4 5 6]
 [7 8 9]]
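A .npy file records the array's shape and dtype alongside the values, so a loaded array should match the original exactly. The sketch below checks this with a float32 array chosen for illustration; the temporary path and the variable names are assumptions, not part of the original example.

```python
import os
import tempfile

import numpy as np

# A small float32 array, so the dtype differs from NumPy's default float64.
data = np.array([[1.5, 2.5], [3.5, 4.5]], dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.npy")
    np.save(path, data)
    restored = np.load(path)

# Compare values, dtype, and shape of the round-tripped array.
same_values = np.array_equal(data, restored)
same_dtype = restored.dtype == data.dtype
same_shape = restored.shape == data.shape
print(same_values, same_dtype, same_shape)
```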
Saving Multiple Arrays Simultaneously#
If your dataset consists of multiple arrays, such as train_set, train_label, test_set, test_label, you can use NumPy's savez or savez_compressed function to save the dataset. numpy.savez saves multiple arrays as an uncompressed .npz file, while numpy.savez_compressed saves the arrays as a compressed .npz file, which can save storage space.
numpy.savez#
Function parameters:
file: str or file
File or filename to which the data is saved. If file is a string or a Path, a .npz extension will be appended to the filename if it is not already there.
args: Arguments, optional
Arrays to save to the file. Arrays passed positionally cannot be given names, so they are stored as arr_0, arr_1, and so on. It is possible to save arrays that are not contiguous in memory.
kwds: Keyword arguments, optional
Arrays to save to the file. Arrays will be saved in the file with names corresponding to the keywords.
With kwds, the arrays will be saved with the names specified by the keywords. In this example, we create two NumPy arrays, array1 and array2, and then save them to a file named arrays.npz using numpy.savez. We pass each array as a keyword argument so that the keyword becomes the name of the array in the file, which is how we look it up after loading.
import numpy as np
array1 = np.array([1, 2, 3])
array2 = np.array([4, 5, 6])
# Save arrays to file
np.savez('arrays.npz', arr1=array1, arr2=array2)
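To see how names are assigned inside the archive, the sketch below mixes a positional array with a keyword array. It reuses the train_set and train_label names mentioned earlier with made-up values, and writes to a temporary directory; dataset.npz is a hypothetical filename.

```python
import os
import tempfile

import numpy as np

# Illustrative stand-ins for a real dataset.
train_set = np.zeros((4, 2))
train_label = np.array([0, 1, 0, 1])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dataset.npz")
    # train_set is positional, train_label is passed by keyword.
    np.savez(path, train_set, train_label=train_label)

    with np.load(path) as archive:
        # .files lists the names stored in the archive.
        names = sorted(archive.files)
        labels = archive["train_label"]

print(names)
print(labels)
```

The positional array ends up under an automatically generated name, which is why keyword arguments are usually the better choice for datasets with several parts.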
numpy.savez_compressed#
Function parameters:
file: str or file
File or filename to which the data is saved. If file is a string or a Path, a .npz extension will be appended to the filename if it is not already there.
args: Arguments, optional
Arrays to save to the file. Arrays passed positionally cannot be given names, so they are stored as arr_0, arr_1, and so on. It is possible to save arrays that are not contiguous in memory.
kwds: Keyword arguments, optional
Arrays to save to the file. Arrays will be saved in the file with names corresponding to the keywords.
This example is similar to the previous one, but uses numpy.savez_compressed to save the arrays as a compressed .npz file.
import numpy as np
array1 = np.array([1, 2, 3])
array2 = np.array([4, 5, 6])
# Save arrays to compressed file
np.savez_compressed('compressed_arrays.npz', arr1=array1, arr2=array2)
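How much space compression saves depends heavily on the data. The sketch below compares the two functions on arrays of constant values, which is close to a best case and chosen only for illustration; real datasets may shrink far less. The array sizes and temporary paths are assumptions.

```python
import os
import tempfile

import numpy as np

# Constant-valued arrays compress extremely well.
array1 = np.zeros((500, 500))
array2 = np.ones((500, 500))

with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, "arrays.npz")
    packed = os.path.join(tmp, "compressed_arrays.npz")

    np.savez(plain, arr1=array1, arr2=array2)
    np.savez_compressed(packed, arr1=array1, arr2=array2)

    plain_size = os.path.getsize(plain)
    packed_size = os.path.getsize(packed)

    # Compression is lossless: the loaded array equals the original.
    with np.load(packed) as archive:
        round_trip_ok = np.array_equal(archive["arr1"], array1)

print(plain_size, packed_size, round_trip_ok)
```

The trade-off is time: savez_compressed spends extra CPU when writing and reading, so the uncompressed form may be preferable for data that is loaded often.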
Loading Arrays from Files#
To load arrays from an .npz file, you can use the numpy.load function:
import numpy as np
# Load saved arrays
loaded_arrays = np.load('arrays.npz')
# Access arrays by the names specified in the file
loaded_array1 = loaded_arrays['arr1']
loaded_array2 = loaded_arrays['arr2']
In this example, we use numpy.load to load the file named arrays.npz and access the arrays within it by the names specified earlier. The same approach applies to loading compressed .npz files.
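For an .npz file, numpy.load returns an archive object that reads each array only when it is accessed and keeps the file open until it is closed. One way to handle this, sketched below, is to open the archive in a with block and copy the arrays into a plain dictionary. The dataset name and the temporary path are made up for illustration.

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "arrays.npz")
    np.savez(path, arr1=np.array([1, 2, 3]), arr2=np.array([4, 5, 6]))

    # The with block closes the underlying file when it exits.
    with np.load(path) as archive:
        # Indexing reads each array into memory, so the dictionary
        # remains usable after the archive is closed.
        dataset = {name: archive[name] for name in archive.files}

print(sorted(dataset))
print(dataset["arr2"])
```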
Conclusion#
In this article, we discussed how to use NumPy to save datasets for centralized management. This is an important aspect of data management that should be given due attention in any machine learning project.