earthaccess API
earthaccess is a Python library that simplifies discovery of and access to NASA Earth science data by providing a higher-level abstraction over NASA's Search API (CMR), so that searching for data can be done with a simpler notation instead of low-level HTTP queries.
The library handles authentication with NASA's OAuth2 API (EDL) and provides HTTP and AWS S3 sessions that can be used with xarray and other PyData libraries to access NASA EOSDIS datasets directly, allowing scientists to get to their science in a simpler and faster way and reducing barriers to cloud-based data analysis.
collection_query()
Returns a query builder instance for NASA collections (datasets).
Returns:
| Type | Description |
|---|---|
| `CollectionQuery` | a query builder instance for data collections. |
download(granules, local_path=None, provider=None, threads=8, *, show_progress=None, credentials_endpoint=None, pqdm_kwargs=None)
Retrieves data granules from a remote storage system. Provide the optional local_path argument to prevent repeated downloads.
- If we run this in the cloud, we will be using S3 to move data to local_path.
- If we run it outside AWS (us-west-2 region) and the dataset is cloud hosted, we'll use HTTP links.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `granules` | `Union[DataGranule, List[DataGranule], str, List[str]]` | a granule, a list of granules, a granule link (HTTP), or a list of granule links (HTTP) | required |
| `local_path` | `Optional[Union[Path, str]]` | Local directory to store the remote data granules. If not supplied, defaults to a generated subdirectory of the current working directory. | `None` |
| `provider` | `Optional[str]` | if we download a list of URLs, we need to specify the provider. | `None` |
| `credentials_endpoint` | `Optional[str]` | S3 credentials endpoint used for obtaining temporary S3 credentials. Only required if the metadata doesn't include it, or if we pass URLs to the method instead of `DataGranule` instances. | `None` |
| `threads` | `int` | number of parallel threads used to download the files; adjust as necessary. | `8` |
| `show_progress` | `Optional[bool]` | whether or not to display a progress bar. If not specified, a sensible default is chosen automatically. | `None` |
| `pqdm_kwargs` | `Optional[Mapping[str, Any]]` | Additional keyword arguments to pass to pqdm, a parallel processing library. See the pqdm documentation for available options. Defaults to immediate exception behavior and the number of jobs given by `threads`. | `None` |
Returns:
| Type | Description |
|---|---|
| `List[Path]` | List of downloaded files. |
Raises:
| Type | Description |
|---|---|
| `Exception` | A file download failed. |
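A minimal end-to-end download sketch (assumes Earthdata Login credentials are configured and network access is available; the `short_name` used here is illustrative):

```python
import earthaccess

earthaccess.login()  # authenticate against Earthdata Login

# Find a couple of granules to fetch (dataset short name is illustrative).
granules = earthaccess.search_data(short_name="ATL06", count=2)

# Download them into ./data using 4 parallel threads;
# returns a list of pathlib.Path objects for the local files.
files = earthaccess.download(granules, local_path="./data", threads=4)
```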
get_edl_token()
Returns the current token used for EDL.
Returns:
| Type | Description |
|---|---|
| `str` | EDL token |
get_fsspec_https_session()
Returns an fsspec session that can be used to access data files across many different DAACs.
Returns:
| Type | Description |
|---|---|
| `AbstractFileSystem` | An fsspec instance able to access data across DAACs. |
Examples:
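A sketch of using the returned filesystem with xarray (the URL below is a hypothetical placeholder; requires a prior `earthaccess.login()` and network access):

```python
import earthaccess
import xarray as xr

earthaccess.login()
fs = earthaccess.get_fsspec_https_session()

# Open a granule over HTTPS; replace the placeholder URL with a real granule link.
url = "https://archive.example.nasa.gov/path/to/granule.nc"
with fs.open(url) as f:
    ds = xr.open_dataset(f)
```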
get_requests_https_session()
Returns a requests Session instance with an authorized bearer token. This is useful for making requests to restricted URLs, such as data granules or services that require authentication with NASA EDL.
Returns:
| Type | Description |
|---|---|
| `Session` | An authenticated requests Session instance. |
Examples:
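A sketch of fetching a restricted resource (the URL is a hypothetical placeholder; requires a prior `earthaccess.login()` and network access):

```python
import earthaccess

earthaccess.login()
session = earthaccess.get_requests_https_session()

# The session carries the EDL bearer token, so restricted URLs work
# without adding headers by hand (URL is a placeholder).
resp = session.get("https://archive.example.nasa.gov/path/to/granule.nc")
resp.raise_for_status()
payload = resp.content
```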
get_s3_credentials(daac=None, provider=None, results=None)
Returns temporary (1 hour) credentials for direct access to NASA S3 buckets. We can use the daac name, the provider, or a list of results from earthaccess.search_data(). If we use results, earthaccess will use the metadata on the response to get the credentials, which is useful for missions that do not use the same endpoint as their DAACs, e.g. SWOT.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `daac` | `Optional[str]` | a DAAC short_name like NSIDC or PODAAC, etc. | `None` |
| `provider` | `Optional[str]` | if we know the provider for the DAAC, e.g. POCLOUD, LPCLOUD, etc. | `None` |
| `results` | `Optional[List[DataGranule]]` | List of results from `search_data()` | `None` |
Returns:
| Type | Description |
|---|---|
| `Dict[str, Any]` | a dictionary with S3 credentials for the DAAC or provider |
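A sketch of handing the credentials to s3fs manually (requires login and network access; the dictionary key names follow the AWS-style fields these endpoints return, which is an assumption worth verifying against your endpoint's response):

```python
import earthaccess
import s3fs

earthaccess.login()
creds = earthaccess.get_s3_credentials(daac="PODAAC")

# Build an S3 filesystem from the temporary credentials (valid for ~1 hour).
fs = s3fs.S3FileSystem(
    key=creds["accessKeyId"],
    secret=creds["secretAccessKey"],
    token=creds["sessionToken"],
)
```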
get_s3_filesystem(daac=None, provider=None, results=None, endpoint=None)
Return an s3fs.S3FileSystem for direct access when running within the AWS us-west-2 region.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `daac` | `Optional[str]` | Any DAAC short name, e.g. NSIDC, GES_DISC | `None` |
| `provider` | `Optional[str]` | Each DAAC can have a cloud provider. If the DAAC is specified, there is no need to use provider. | `None` |
| `results` | `Optional[List[DataGranule]]` | A list of results from `search_data()`. | `None` |
| `endpoint` | `Optional[str]` | URL of a cloud provider credentials endpoint to be used for obtaining AWS S3 access credentials. | `None` |
Returns:
| Type | Description |
|---|---|
| `S3FileSystem` | An authenticated s3fs session valid for 1 hour. |
get_s3fs_session(daac=None, provider=None, results=None)
Returns a fsspec s3fs file session for direct access when we are in us-west-2.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `daac` | `Optional[str]` | Any DAAC short name, e.g. NSIDC, GES_DISC | `None` |
| `provider` | `Optional[str]` | Each DAAC can have a cloud provider. If the DAAC is specified, there is no need to use provider. | `None` |
| `results` | `Optional[List[DataGranule]]` | A list of results from `search_data()`. | `None` |
|
Returns:
| Type | Description |
|---|---|
| `S3FileSystem` | An authenticated s3fs session valid for 1 hour. |
granule_query()
Returns a query builder instance for data granules.
Returns:
| Type | Description |
|---|---|
| `GranuleQuery` | a query builder instance for data granules. |
login(strategy='all', persist=False, system=PROD)
Authenticate with Earthdata login (https://urs.earthdata.nasa.gov/).
Attempt to login via only the specified strategy, unless the "all"
strategy is used, in which case each of the individual strategies is
attempted in the following order until one succeeds: "environment",
"netrc", "interactive". In this case, login fails only when all
strategies fail.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `strategy` | `str` | An authentication method: `"all"`, `"environment"`, `"netrc"`, or `"interactive"`. | `'all'` |
| `persist` | `bool` | if `True`, saves the credentials to a `.netrc` file so they can be reused in later sessions. | `False` |
| `system` | `System` | the Earthdata system to access | `PROD` |
Returns:
| Type | Description |
|---|---|
| `Auth` | An instance of `Auth`. |
Raises:
| Type | Description |
|---|---|
| `LoginAttemptFailure` | If the NASA Earthdata Login service rejects credentials. |
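A short sketch of the two common login patterns (requires valid Earthdata Login credentials):

```python
import earthaccess

# Try "environment", then "netrc", then "interactive", stopping at the
# first strategy that succeeds.
auth = earthaccess.login(strategy="all")

# Or force one strategy and cache the credentials for later sessions:
# auth = earthaccess.login(strategy="interactive", persist=True)
```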
open(granules, provider=None, *, credentials_endpoint=None, show_progress=None, pqdm_kwargs=None, open_kwargs=None)
Returns a list of file-like objects that can be used to access files hosted on S3 or HTTPS by third party libraries like xarray.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `granules` | `Union[List[str], List[DataGranule]]` | a list of granule instances or a list of granule URLs | required |
| `provider` | `Optional[str]` | e.g. POCLOUD, NSIDC_CPRD, etc. | `None` |
| `credentials_endpoint` | `Optional[str]` | S3 credentials endpoint used for obtaining temporary S3 credentials. Only required if the metadata doesn't include it, or if we pass URLs instead of `DataGranule` instances. | `None` |
| `show_progress` | `Optional[bool]` | whether or not to display a progress bar. If not specified, a sensible default is chosen automatically. | `None` |
| `pqdm_kwargs` | `Optional[Mapping[str, Any]]` | Additional keyword arguments to pass to pqdm, a parallel processing library. See the pqdm documentation for available options. Defaults to immediate exception behavior and a default number of parallel jobs. | `None` |
| `open_kwargs` | `Optional[Dict[str, Any]]` | Additional keyword arguments to pass to the underlying file-opening call. | `None` |
Returns:
| Type | Description |
|---|---|
| `List[AbstractFileSystem]` | A list of "file pointers" to remote (i.e. S3 or HTTPS) files. |
search_data(count=-1, **kwargs)
Search for dataset files (granules) using NASA's CMR.
https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html
In order to provide fast search responses, the CMR does not permit queries across all granules in all collections. Granule queries must target a subset of the collections in the CMR using a condition such as provider, provider_id, concept_id, collection_concept_id, short_name, version, or entry_title.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `count` | `int` | Number of records to get; `-1` = all. | `-1` |
| `kwargs` | `Dict` | keyword arguments passed to the CMR, e.g. `short_name`, `version`, `temporal`, `provider`, `concept_id`. | `{}` |
Returns:
| Type | Description |
|---|---|
| `List[DataGranule]` | a list of `DataGranule` instances that can be used to access the granule files, e.g. with `download()` or `open()`. |
Raises:
| Type | Description |
|---|---|
| `RuntimeError` | The CMR query failed. |
Examples:
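A sketch of a spatially and temporally bounded granule search (requires login and network access; the dataset and region are illustrative):

```python
import earthaccess

earthaccess.login()

granules = earthaccess.search_data(
    short_name="ATL06",
    temporal=("2020-03-01", "2020-03-30"),          # ISO date range
    bounding_box=(-134.7, 58.9, -133.9, 59.2),      # west, south, east, north
    count=100,
)
```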
search_datasets(count=-1, **kwargs)
Search datasets (collections) using NASA's CMR.
https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `count` | `int` | Number of records to get; `-1` = all. | `-1` |
| `kwargs` | `Dict` | keyword arguments passed to the CMR, e.g. `keyword`, `short_name`, `provider`. | `{}` |
Returns:
| Type | Description |
|---|---|
| `List[DataCollection]` | A list of DataCollection results that can be used to get information about a dataset, e.g. concept_id, doi, etc. |
Raises:
| Type | Description |
|---|---|
| `RuntimeError` | The CMR query failed. |
Examples:
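A sketch of a keyword-based collection search (no login required for public metadata, but network access is; the keyword is illustrative):

```python
import earthaccess

datasets = earthaccess.search_datasets(
    keyword="sea surface temperature",
    cloud_hosted=True,
    count=10,
)
for dataset in datasets:
    print(dataset.concept_id())
```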
search_services(count=-1, **kwargs)
Search the NASA CMR for Services matching criteria.
See https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html#service.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `count` | `int` | maximum number of services to fetch (if less than 1, all services matching the specified criteria are fetched [default]) | `-1` |
| `kwargs` | `Any` | keyword arguments accepted by the CMR for searching services | `{}` |
Returns:
| Type | Description |
|---|---|
| `List[Any]` | list of services (possibly empty) matching the specified criteria, in UMM JSON format |
Examples:
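A minimal sketch, assuming `provider` is among the CMR-accepted service search parameters (network access required):

```python
import earthaccess

# Fetch up to 10 service records for a given provider
# (the provider name is illustrative).
services = earthaccess.search_services(provider="POCLOUD", count=10)
```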
status(system=PROD, raise_on_outage=False)
Get the statuses of NASA's Earthdata services.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `system` | `System` | The Earthdata system to access, defaults to PROD. | `PROD` |
| `raise_on_outage` | `bool` | If `True`, raises an exception on errors or outages. | `False` |
Returns:
| Type | Description |
|---|---|
| `dict[str, str]` | A dictionary containing the statuses of Earthdata services. |
Examples:
>>> earthaccess.status()
{'Earthdata Login': 'OK', 'Common Metadata Repository': 'OK'}
>>> earthaccess.status(earthaccess.UAT)
{'Earthdata Login': 'OK', 'Common Metadata Repository': 'OK'}
Raises:
| Type | Description |
|---|---|
| `ServiceOutage` | if at least one service status is not `"OK"`. |
get_granule_credentials_endpoint_and_region(granule)
Retrieve the credentials endpoint and region for a direct-access granule link.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `granule` | `DataGranule` | The first granule being included in the virtual dataset. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| `credentials_endpoint` | `str` | The S3 credentials endpoint. If this information is in the UMM-G record, it is used from there; otherwise, a query for the collection is performed and the information is taken from the UMM-C record. |
| `region` | `str` | Region for the data. Defaults to us-west-2. If the credentials endpoint is retrieved from the collection's UMM-C record, the region information is also taken from UMM-C. |
open_virtual_dataset(granule, group=None, access='indirect')
Open a granule as a single virtual xarray Dataset.
Uses NASA DMR++ metadata files to create a virtual xarray dataset with ManifestArrays. This virtual dataset can be used to create zarr reference files. See https://virtualizarr.readthedocs.io for more information on virtual xarray datasets.
Warning
This feature is currently experimental and may change in the future. It relies on DMR++ metadata files, which may not always be present for your dataset, in which case you may get a FileNotFoundError.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `granule` | `DataGranule` | The granule to open | required |
| `group` | `str \| None` | Path to the netCDF4 group in the given file to open. If `None`, the root group will be opened. If the DMR++ file does not have groups, this parameter is ignored. | `None` |
| `access` | `str` | The access method to use, one of `"direct"` or `"indirect"`. Use direct when running on AWS; use indirect when running on a local machine. | `'indirect'` |
Returns:
| Type | Description |
|---|---|
| `Dataset` | `xarray.Dataset` |
Examples:
>>> results = earthaccess.search_data(count=2, temporal=("2023"), short_name="SWOT_L2_LR_SSH_Expert_2.0")
>>> vds = earthaccess.open_virtual_dataset(results[0], access="indirect")
>>> vds
<xarray.Dataset> Size: 149MB
Dimensions: (num_lines: 9866, num_pixels: 69,
num_sides: 2)
Coordinates:
longitude (num_lines, num_pixels) int32 3MB ...
latitude (num_lines, num_pixels) int32 3MB ...
latitude_nadir (num_lines) int32 39kB ManifestArr...
longitude_nadir (num_lines) int32 39kB ManifestArr...
Dimensions without coordinates: num_lines, num_pixels, num_sides
Data variables: (12/98)
height_cor_xover_qual (num_lines, num_pixels) uint8 681kB ManifestArray<shape=(9866, 69), dtype=uint8, chunks=(9866, 69...
>>> vds.virtualize.to_kerchunk("swot_2023_ref.json", format="json")
open_virtual_mfdataset(granules, group=None, access='indirect', preprocess=None, parallel='dask', load=True, reference_dir=None, reference_format='json', **xr_combine_nested_kwargs)
Open multiple granules as a single virtual xarray Dataset.
Uses NASA DMR++ metadata files to create a virtual xarray dataset with ManifestArrays. This virtual dataset can be used to create zarr reference files. See https://virtualizarr.readthedocs.io for more information on virtual xarray datasets.
Warning
This feature is currently experimental and may change in the future. It relies on DMR++ metadata files, which may not always be present for your dataset, in which case you may get a FileNotFoundError.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `granules` | `list[DataGranule]` | The granules to open | required |
| `group` | `str \| None` | Path to the netCDF4 group in the given file to open. If `None`, the root group will be opened. If the DMR++ file does not have groups, this parameter is ignored. | `None` |
| `access` | `str` | The access method to use, one of `"direct"` or `"indirect"`. Use direct when running on AWS; use indirect when running on a local machine. | `'indirect'` |
| `preprocess` | `callable \| None` | A function to apply to each virtual dataset before combining | `None` |
| `parallel` | `Literal['dask', 'lithops', False]` | Open the virtual datasets in parallel (using dask.delayed or lithops) | `'dask'` |
| `load` | `bool` | If `True`, earthaccess will serialize the virtual references in order to use lazy indexing on the resulting virtual xarray dataset. | `True` |
| `reference_dir` | `str \| None` | Directory to store kerchunk references. If `None`, a temporary directory will be created and deleted after use. | `None` |
| `reference_format` | `Literal['json', 'parquet']` | When `load` is `True`, earthaccess will serialize the references using this format: json (default) or parquet. | `'json'` |
| `xr_combine_nested_kwargs` | `Any` | Keyword arguments for `xarray.combine_nested`, describing how to concatenate the datasets. See https://docs.xarray.dev/en/stable/generated/xarray.combine_nested.html | `{}` |
Returns:
| Type | Description |
|---|---|
| `Dataset` | Concatenated `xarray.Dataset` |
Examples:
>>> results = earthaccess.search_data(count=5, temporal=("2024"), short_name="MUR-JPL-L4-GLOB-v4.1")
>>> vds = earthaccess.open_virtual_mfdataset(results, access="indirect", load=False, concat_dim="time", coords="minimal", compat="override", combine_attrs="drop_conflicts")
>>> vds
<xarray.Dataset> Size: 29GB
Dimensions: (time: 5, lat: 17999, lon: 36000)
Coordinates:
time (time) int32 20B ManifestArray<shape=(5,), dtype=int32,...
lat (lat) float32 72kB ManifestArray<shape=(17999,), dtype=...
lon (lon) float32 144kB ManifestArray<shape=(36000,), dtype...
Data variables:
mask (time, lat, lon) int8 3GB ManifestArray<shape=(5, 17999...
sea_ice_fraction (time, lat, lon) int8 3GB ManifestArray<shape=(5, 17999...
dt_1km_data (time, lat, lon) int8 3GB ManifestArray<shape=(5, 17999...
analysed_sst (time, lat, lon) int16 6GB ManifestArray<shape=(5, 1799...
analysis_error (time, lat, lon) int16 6GB ManifestArray<shape=(5, 1799...
sst_anomaly (time, lat, lon) int16 6GB ManifestArray<shape=(5, 1799...
Attributes: (12/42)
Conventions: CF-1.7
title: Daily MUR SST, Final product
>>> vds.virtualize.to_kerchunk("mur_combined.json", format="json")
>>> vds = open_virtual_mfdataset(results, access="indirect", concat_dim="time", coords='minimal', compat='override', combine_attrs="drop_conflicts")
>>> vds
<xarray.Dataset> Size: 143GB
Dimensions: (time: 5, lat: 17999, lon: 36000)
Coordinates:
* lat (lat) float32 72kB -89.99 -89.98 -89.97 ... 89.98 89.99
* lon (lon) float32 144kB -180.0 -180.0 -180.0 ... 180.0 180.0
* time (time) datetime64[ns] 40B 2024-01-01T09:00:00 ... 2024-...
Data variables:
analysed_sst (time, lat, lon) float64 26GB dask.array<chunksize=(1, 3600, 7200), meta=np.ndarray>
analysis_error (time, lat, lon) float64 26GB dask.array<chunksize=(1, 3600, 7200), meta=np.ndarray>
dt_1km_data (time, lat, lon) timedelta64[ns] 26GB dask.array<chunksize=(1, 4500, 9000), meta=np.ndarray>
mask (time, lat, lon) float32 13GB dask.array<chunksize=(1, 4500, 9000), meta=np.ndarray>
sea_ice_fraction (time, lat, lon) float64 26GB dask.array<chunksize=(1, 4500, 9000), meta=np.ndarray>
sst_anomaly (time, lat, lon) float64 26GB dask.array<chunksize=(1, 3600, 7200), meta=np.ndarray>
Attributes: (12/42)
Conventions: CF-1.7
title: Daily MUR SST, Final product