Coast Train--Labeled imagery for training and evaluation of data-driven models for image segmentation

Frequently anticipated questions:


What does this data set describe?

Title:
Coast Train--Labeled imagery for training and evaluation of data-driven models for image segmentation
Abstract:
Coast Train is a library of images of coastal environments, annotations, and corresponding thematic label masks (or ‘label images’) collated for the purposes of training and evaluating machine learning (ML), deep learning, and other models for image segmentation. It includes image sets from geospatial satellite, aerial, and UAV imagery and orthomosaics, as well as non-geospatial oblique and nadir imagery. Images include a diverse range of coastal environments from the U.S. Pacific, Gulf of Mexico, Atlantic, and Great Lakes coastlines, consisting of time-series of high-resolution (≤1m) orthomosaics and satellite image tiles (10–30m). Each image, image annotation, and labelled image is available as a single NPZ zipped file. NPZ files use the naming convention {datasource}_{numberofclasses}_{threedigitdatasetversion}.zip, where {datasource} is the source of the original images (for example, NAIP, Landsat 8, Sentinel 2), {numberofclasses} is the number of classes used to annotate the images, and {threedigitdatasetversion} is the three-digit code corresponding to the dataset version (in other words, 001 is version 1). Each zipped folder contains a collection of NPZ format files, each of which corresponds to an individual image. An individual NPZ file is named after the image that it represents and contains (1) a CSV file with detailed information for every image in the zip folder and (2) a collection of the following NPY files: orig_image.npy (original input image unedited), image.npy (original input image after color balancing and normalization), classes.npy (list of classes annotated and present in the labelled image), doodles.npy (integer image of all image annotations), color_doodles.npy (color image of doodles.npy), label.npy (labelled image created from the classes present in the annotations), and settings.npy (annotation and machine learning settings used to generate the labelled image from annotations).
All NPZ files can be extracted using the utilities available in Doodler (Buscombe, 2022). A merged CSV file containing detail information on the complete imagery collection is available at the top level of this data release, details of which are available in the Entity and Attribute section of this metadata file.
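As an illustration of the naming convention described in the abstract, the following Python sketch splits an archive name into its three parts. The function name, the example file name "S2_11classes_001.zip", and the assumption that the class field may carry a text suffix (such as "11classes") are hypothetical and are not taken from the data release itself.

```python
import re
from typing import NamedTuple


class ArchiveName(NamedTuple):
    datasource: str
    num_classes: int
    version: int


def parse_archive_name(filename: str) -> ArchiveName:
    """Split '{datasource}_{numberofclasses}_{threedigitdatasetversion}.zip'."""
    stem = filename[:-4] if filename.lower().endswith(".zip") else filename
    # Split from the right so a datasource containing underscores stays intact.
    datasource, classes_part, version_part = stem.rsplit("_", 2)
    # Assumption: the class field starts with digits and may have a text suffix.
    match = re.match(r"\d+", classes_part)
    if match is None or not re.fullmatch(r"\d{3}", version_part):
        raise ValueError(f"unexpected archive name: {filename!r}")
    return ArchiveName(datasource, int(match.group()), int(version_part))


# Hypothetical file name used only to exercise the parser.
example = parse_archive_name("S2_11classes_001.zip")
print(example)
```

Splitting from the right is a design choice that tolerates underscores in the data source name; actual archive names in the release should be checked before relying on this pattern.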
Supplemental_Information:
Any use of trade, product, or firm names is for descriptive purposes only and does not imply endorsement by the U.S. Government.
  1. How might this data set be cited?
    Wernette, Phillipe A., Buscombe, Daniel D., Fitzpatrick, Sharon, Favela, Jayce, Enwright, Nicholas, Goldstein, Evan, and Dunand, Erin, 20220319, Coast Train--Labeled imagery for training and evaluation of data-driven models for image segmentation: data release DOI:10.5066/P91NP87I, U.S. Geological Survey, Pacific Coastal and Marine Science Center, Santa Cruz, California.

    Online Links:

  2. What geographic area does the data set cover?
    West_Bounding_Coordinate: -180.0
    East_Bounding_Coordinate: 180.0
    North_Bounding_Coordinate: 90.0
    South_Bounding_Coordinate: -90.0
  3. What does it look like?
    coast_train_thumbnail.png (PNG)
    Split graphic with the original image on the left and the segmented right half of the image on the right.
  4. Does the data set describe conditions during a particular time period?
    Beginning_Date: 01-Jan-2008
    Ending_Date: 31-Dec-2020
    Currentness_Reference:
    date range of imagery in library
  5. What is the general form of this data set?
    Geospatial_Data_Presentation_Form:
    list of images and details in csv format; imagery in NumPy binary file format
  6. How does the data set represent geographic features?
    1. How are geographic features stored in the data set?
      Indirect_Spatial_Reference:
      Original images were downloaded at sites along the conterminous U.S. coastline, including sites along the U.S. Atlantic, Gulf of Mexico, Pacific, and Great Lakes coasts. Sites were selected to provide a representative sample of an array of coastal types (for example, sandy, cliff, marsh, wetland, developed). Refer to the self-contained NPZ files for more information on locations of original images.
    2. What coordinate system is used to represent geographic features?
  7. How does the data set describe geographic features?
    CoastTrain_imagery_details.csv
    Table containing detailed information about the imagery in this dataset. (Source: Producer defined)
    name
    Name of image source (Source: Producer defined) Unique identifier of image name.
    publisher
    Original publisher of the image source (Source: Producer defined) Unique identifier of image publisher.
    labels
    The image label file. One-hot-encoded label image (2D raster) in 8-bit unsigned integer. Each integer encodes a class label, incrementing through 'classes' starting at zero. (Source: Producer defined) Unique identifier of label image file.
    images
    The original image file used in classification. (Source: Producer defined) Unique identifier of image filename.
    annotation_image_filename
    Image filename with annotations. (Source: Producer defined) Unique identifier of annotation image filename.
    classes_array
    An array of classification classes in the image. (Source: Producer defined)
    Value                        Definition
    water                        Classified as water
    whitewater                   Classified as whitewater
    surf                         Classified as surf
    mud_silt                     Classified as mud or fine sediment
    sand                         Classified as bare sand
    gravel                       Classified as gravel
    gravel_shell                 Classified as mixture of gravel and shells
    cobble_gravel                Classified as gravel with cobbles
    bedrock                      Classified as exposed bedrock
    ice_snow                     Classified as ice or snow
    bare_ground                  Classified as bare ground
    sediment                     Classified as bare sediment
    other_natural_terrain        Classified as other natural terrain
    other_bare_natural_terrain   Classified as other bare natural terrain
    vegetated                    Classified as vegetation
    vegetated_ground             Classified as ground with vegetation
    vegetated_surface            Classified as vegetation
    marsh_vegetation             Classified as marsh vegetation
    terrestrial_vegetation       Classified as terrestrial vegetation
    agricultural                 Classified as agricultural
    cloud                        Classified as cloud
    development                  Classified as consisting of human development
    dev                          Classified as consisting of human development
    coastal_defense              Classified as coastal defense
    buildings                    Classified as building
    pavement_road                Classified as pavement
    vehicles                     Classified as vehicles
    people                       Classified as person
    other_anthro                 Classified as other anthropogenic object
    unusual                      Classified as unusual object or land cover
    unknown                      Classified as unknown land cover
    no_data                      No data contained in pixels
    nodata                       No data contained in pixels
    num_classes
    Number of classification classes in the image. (Source: Producer defined)
    Range of values
    Minimum:1
    Maximum:12
    classes_integer
    One integer per class in num_classes. (Source: Producer defined)
    Range of values
    Minimum:0
    Maximum:12
    classes_present_integer
    An array of integer classes present in the image. (Source: Producer defined)
    Range of values
    Minimum:0
    Maximum:12
    classes_present_array
    An array of classes present in the image. (Source: Producer defined) Values present in image from classes
    pen_width
    Final width in pixels of pen used to annotate in the Doodler program. (Source: Producer defined)
    Range of values
    Minimum:1
    Maximum:10
    CRF_theta
    Internal classifier hyperparameter used by the Doodler program. (Source: Producer defined)
    Range of values
    Minimum:1
    Maximum:1
    CRF_mu
    Internal classifier hyperparameter used by the Doodler program. (Source: Producer defined)
    Range of values
    Minimum:1
    Maximum:99
    CRF_downsample_factor
    Internal classifier hyperparameter used by the Doodler program. (Source: Producer defined)
    Range of values
    Minimum:1
    Maximum:5
    Classifier_downsample_factor
    Internal classifier hyperparameter used by the Doodler program. (Source: Producer defined)
    Range of values
    Minimum:1
    Maximum:8
    prob_of_unary_potential
    Internal classifier hyperparameter used by the Doodler program. (Source: Producer defined)
    Range of values
    Minimum:0.1
    Maximum:3.0
    doodle_spatial_density
    Proportion of the image annotated. (Source: Producer defined)
    Range of values
    Minimum:0.000526323
    Maximum:0.999627422
    num_of_scales
    Internal classifier hyperparameter used by the Doodler program. (Source: Producer defined)
    Range of values
    Minimum:3
    Maximum:3
    acc_georef
    Accuracy, in meters, of the specification of 'XMin', 'XMax' and 'YMin', 'YMax'. (Source: Producer defined)
    Range of values
    Minimum:1
    Maximum:11.248
    epsg
    EPSG code for the projected coordinate system. See 'CRS' attribute for a complete description of codes used. (Source: Producer defined)
    Range of values
    Minimum:26910
    Maximum:32618
    year
    Acquisition year of the image source. (Source: Producer defined)
    Range of values
    Minimum:2008
    Maximum:2021
    month
    Acquisition month of the image source. (Source: Producer defined)
    Range of values
    Minimum:1
    Maximum:12
    day
    Acquisition day of the image source. (Source: Producer defined)
    Range of values
    Minimum:1
    Maximum:31
    hour
    Acquisition hour of the image source. (Source: Producer defined)
    Range of values
    Minimum:0
    Maximum:23
    minute
    Acquisition minute of the image source. (Source: Producer defined)
    Range of values
    Minimum:0
    Maximum:59
    second
    Acquisition second of the image source. (Source: Producer defined)
    Range of values
    Minimum:0
    Maximum:59
    XMin
    Minimum easting of the image footprint. (Source: Producer defined)
    Range of values
    Minimum:233870.0
    Maximum:787860.0
    XMax
    Maximum easting of the image footprint. (Source: Producer defined)
    Range of values
    Minimum:235750.0
    Maximum:790530.0
    YMin
    Minimum northing of the image footprint. (Source: Producer defined)
    Range of values
    Minimum:2875253.0
    Maximum:5332914.0
    YMax
    Maximum northing of the image footprint. (Source: Producer defined)
    Range of values
    Minimum:2884030.0
    Maximum:5333378.0
    LonMin
    Minimum longitude (WGS84) of image footprint. (Source: Producer defined)
    Range of values
    Minimum:-124.0922272
    Maximum:-69.95201111
    LonMax
    Maximum longitude (WGS84) of image footprint. (Source: Producer defined)
    Range of values
    Minimum:-124.0478924
    Maximum:-69.9405098
    LatMin
    Minimum latitude (WGS84) of image footprint. (Source: Producer defined)
    Range of values
    Minimum:25.98761753
    Maximum:48.14810677
    LatMax
    Maximum latitude (WGS84) of image footprint. (Source: Producer defined)
    Range of values
    Minimum:26.06287667
    Maximum:48.15232107
    CRS
    The projected coordinate system description relating to XMin, XMax, YMin, YMax. (Source: Producer defined) Projected coordinate system definition
    px_m
    Horizontal size of pixel in meters. (Source: Producer defined)
    Range of values
    Minimum:0.15
    Maximum:15
    ImageHeightPx
    Number of pixels in the vertical dimension (image height). (Source: Producer defined)
    Range of values
    Minimum:31
    Maximum:2481
    ImageWidthPx
    Number of pixels in the horizontal dimension (image width). (Source: Producer defined)
    Range of values
    Minimum:32
    Maximum:2209
    ImageBands
    Number of bands in the image. (Source: Producer defined)
    Range of values
    Minimum:3
    Maximum:3
    Entity_and_Attribute_Overview:
    Each image, image annotation, and labelled image is available as a single NPZ zipped file. NPZ files use the naming convention {datasource}_{numberofclasses}_{threedigitdatasetversion}.zip, where {datasource} is the source of the original images (for example, NAIP, Landsat 8, Sentinel 2), {numberofclasses} is the number of classes used to annotate the images, and {threedigitdatasetversion} is the three-digit code corresponding to the dataset version (in other words, 001 is version 1). Each zipped folder contains a collection of NPZ format files, each of which corresponds to an individual image. An individual NPZ file is named after the image that it represents and contains (1) a CSV file with metadata information for every image and (2) a collection of the following NPY files: orig_image.npy (original input image unedited), image.npy (original input image after color balancing and normalization), classes.npy (list of classes annotated and present in the labelled image), doodles.npy (integer image of all image annotations), color_doodles.npy (color image of doodles.npy), label.npy (labelled image created from the classes present in the annotations), and settings.npy (annotation and machine learning settings used to generate the labelled image from annotations). All NPZ files can be extracted using the utilities available in Doodler (Buscombe, 2022; https://doi.org/10.5066/P9YVHL23).
    Entity_and_Attribute_Detail_Citation:
    The entity and attribute information was generated by the individual and/or agency identified as the originator of the data set. Please review the rest of the metadata record for additional details and information.
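The 'labels' attribute above describes an 8-bit integer raster in which each integer indexes into the class list, starting at zero. The Python sketch below shows one way such a raster might be decoded into per-class pixel counts and, if needed, expanded into a one-hot array. The class list and the small label array are made-up stand-ins for the contents of classes.npy and label.npy, and the helper name is hypothetical.

```python
import numpy as np

# Hypothetical stand-ins for classes.npy and label.npy from one NPZ file.
classes = ["water", "whitewater", "sand", "vegetated"]
label = np.array([[0, 0, 1],
                  [2, 2, 3],
                  [2, 0, 0]], dtype=np.uint8)


def class_pixel_counts(label, classes):
    """Count pixels per class; integer i in the raster maps to classes[i]."""
    counts = np.bincount(label.ravel(), minlength=len(classes))
    return {name: int(n) for name, n in zip(classes, counts)}


counts = class_pixel_counts(label, classes)
print(counts)

# Expand the integer raster into a one-hot array of shape (rows, cols, classes).
one_hot = np.eye(len(classes), dtype=np.uint8)[label]
print(one_hot.shape)
```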

Who produced the data set?

  1. Who are the originators of the data set? (may include formal authors, digital compilers, and editors)
    • Phillipe A. Wernette
    • Daniel D. Buscombe
    • Sharon Fitzpatrick
    • Jayce Favela
    • Nicholas Enwright
    • Evan Goldstein
    • Erin Dunand
  2. Who also contributed to the data set?
  3. To whom should users address questions about the data?
    U.S. Geological Survey, Pacific Coastal and Marine Science Center
    Attn: PCMSC Science Data Coordinator
    2885 Mission Street
    Santa Cruz, CA

    831-427-4747 (voice)
    pcmsc_data@usgs.gov

Why was the data set created?

Training machine learning (ML) and other models for segmentation will greatly facilitate the creation of land cover maps from geospatial imagery with greater specificity, as well as mapping coastal sediments, transient waterbodies, landforms, and other features of interest, in both geospatial and non-geospatial imagery. Coast Train adheres to the principle of ‘Map Once, Use Many Times’ and is well positioned to transfer learning across a wide range of coastal environments.

How was the data set created?

  1. From what previous works were the data drawn?
  2. How were the data generated, processed, and modified?
    Date: 31-Dec-2021 (process 1 of 2)
    Image Annotation/Doodling--Each image was opened using Doodler with the class list provided in the metadata sheet. The user then selected a single class from the options on the right and clicked and held on the image to draw a line (an annotation, or doodle) where the selected class exists in the image. Annotations can be as simple as a point or single line or as complex as a meandering or looping series of lines. This annotation process was repeated one or more times for every class present in the image.
    Date: 31-Dec-2021 (process 2 of 2)
    Image Classification/Segmentation--Once annotations were complete for all classes present in the image, the user checked the “Compute/Show segmentation” box on the right, and the program segmented the image and classified every pixel in it. If the final image did not accurately represent the classes present and their distribution, the user could uncheck the “Compute/Show segmentation” box and repeat the annotation and classification/segmentation steps until satisfied with the final segmented image.
  3. What similar or related data should the user be aware of?
    Buscombe, Daniel D., 2022, Doodler--A web application built with plotly/dash for image segmentation with minimal supervision: software release DOI:10.5066/P9YVHL23, U.S. Geological Survey, Pacific Coastal and Marine Science Center, Santa Cruz, California.

    Online Links:

    Other_Citation_Details:
    Buscombe, D.D., 2022, Doodler--A web application built with plotly/dash for image segmentation with minimal supervision: U.S. Geological Survey software release, https://doi.org/10.5066/P9YVHL23

How reliable are the data; what problems remain in the data set?

  1. How well have the observations been checked?
    Mean Intersection over Union (IoU) scores for quantifying inter-labeler agreement were computed using 120 images across two datasets, namely NAIP (70 image pairs) and Sentinel-2 (50 image pairs), that have been labeled independently by experienced labelers. Mean IoU is the standard way to report agreement between two realizations of the same label image. Further, because IoU quantifies spatial overlap and is prone to class imbalance, Kullback-Leibler divergence scores were also computed to quantify agreement between class-frequency distributions. When comparing IoU and Dice scores, it is preferable to examine agreement using multiple independent metrics. The mean of mean IoU scores was 0.88, which we recommend using as an expected irreducible error. Previous research suggests that mean IoU scores tend to be inversely correlated with number of classes; therefore, this error is a conservative estimate.
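For readers who want to run a similar agreement check on their own label pairs, the Python sketch below computes a mean IoU and a Kullback-Leibler divergence between class-frequency distributions for two integer label images. It is not the authors' code. Averaging IoU only over classes present in at least one image, and adding a small constant before taking logarithms, are assumptions made for this illustration.

```python
import numpy as np


def mean_iou(label_a, label_b, num_classes):
    """Mean of per-class IoU, skipping classes absent from both images."""
    ious = []
    for c in range(num_classes):
        a, b = label_a == c, label_b == c
        union = np.logical_or(a, b).sum()
        if union == 0:
            continue  # class not present in either label image
        ious.append(np.logical_and(a, b).sum() / union)
    return float(np.mean(ious))


def kl_divergence(label_a, label_b, num_classes, eps=1e-12):
    """KL divergence between the class-frequency distributions of two images."""
    p = np.bincount(label_a.ravel(), minlength=num_classes) / label_a.size
    q = np.bincount(label_b.ravel(), minlength=num_classes) / label_b.size
    # Assumption: smooth with eps so classes with zero frequency do not
    # produce log(0) or division by zero, then renormalize.
    p, q = p + eps, q + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))


# Two small made-up label images standing in for independent labelers.
labeler_1 = np.array([[0, 0, 1], [1, 1, 2]], dtype=np.uint8)
labeler_2 = np.array([[0, 1, 1], [1, 1, 2]], dtype=np.uint8)

print(mean_iou(labeler_1, labeler_2, num_classes=3))
print(kl_divergence(labeler_1, labeler_2, num_classes=3))
```

IoU responds to where the labels overlap, whereas the divergence compares only how often each class occurs, so the two values can disagree for the same image pair.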
  2. How accurate are the geographic locations?
    A formal accuracy assessment of the horizontal positional information in the dataset has not been conducted. Each of the input data sources has its own horizontal accuracy available in their source metadata.
  3. How accurate are the heights or depths?
  4. Where are the gaps in the data? What is missing?
    Dataset is considered complete for the information presented, as described in the abstract. Users are advised to read the rest of the metadata record carefully for additional details.
  5. How consistent are the relationships among the observations, including topology?
    All annotation values are integer based, with each integer corresponding to a unique class. The program used to generate the final classified/labelled images ensured that every pixel in the original image is classified into one of the annotated classes. There is no possibility that the actual values are outside of the reported ranges of values.

How can someone get a copy of the data set?

Are there legal restrictions on access or use of the data?
Access_Constraints None
Use_Constraints USGS-authored or produced data and information are in the public domain from the U.S. Government and are freely redistributable with proper metadata and source attribution. Please recognize and acknowledge the U.S. Geological Survey as the originator(s) of the dataset and in products derived from these data.
  1. Who distributes the data set? (Distributor 1 of 1)
    U.S. Geological Survey - CMGDS
    2885 Mission Street
    Santa Cruz, CA

    831-427-4747 (voice)
    pcmsc_data@usgs.gov
  2. What's the catalog number I need to order this data set?
    Images are provided in NPZ format. Each NPZ file corresponds to a single image that has been annotated/labelled and classified/segmented. The NPZ file names consist of {image_source}_{number_of_classes}_{data_release_version_number}, delimited by underscores. The first element, {image_source}, represents the original source of the image (for example, Landsat 8 would be “L8”, Sentinel 2 would be “S2”). The second element, {number_of_classes}, represents the number of classes used during labelling (for example, “6 classes”, “11classes”). The third element, {data_release_version_number}, represents the data release version that the image is part of (for example, all datasets for version 1 will have “001” as the third part of the NPZ filename). Each NPZ file contains at least seven (7) different NPY files: (1) orig_image.npy (original input image unedited), (2) image.npy (original input image after color balancing and normalization), (3) classes.npy (list of classes annotated and present in the labelled image), (4) doodles.npy (integer image of all image annotations), (5) color_doodles.npy (color image of doodles.npy), (6) label.npy (labelled image created from the classes present in the annotations), and (7) settings.npy (annotation and machine learning settings used to generate the labelled image from annotations). Some NPZ files may contain one or more additional sets of seven files with one or more zeros prepended to the NPY file names. These additional files are grouped by the number of leading zeros and represent previous attempts at annotation and classification/segmentation for that image. For example, all NPY files with one leading zero represent the first attempt, all NPY files with two leading zeros represent the second attempt, and so on.
    A merged CSV file (CoastTrain_imagery_details.csv) contains detailed information on the complete imagery collection.
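The NPY members of an NPZ file can also be read with NumPy, which exposes each member under its name without the .npy extension. The Python sketch below groups member names by their number of leading zeros, so that the final result (no leading zeros) is separated from earlier attempts. It builds a small synthetic NPZ in memory rather than reading a file from the data release, and the helper name group_by_attempt is hypothetical.

```python
import io
import re
from collections import defaultdict

import numpy as np


def group_by_attempt(keys):
    """Map leading-zero count -> {base name: member name}."""
    groups = defaultdict(dict)
    for key in keys:
        zeros, base = re.fullmatch(r"(0*)(.*)", key).groups()
        groups[len(zeros)][base] = key
    return dict(groups)


# Synthetic NPZ standing in for one image with a final result and one
# earlier attempt (member names prefixed with a single zero).
buffer = io.BytesIO()
np.savez(buffer, **{
    "image": np.zeros((2, 2, 3), dtype=np.uint8),
    "label": np.zeros((2, 2), dtype=np.uint8),
    "0image": np.zeros((2, 2, 3), dtype=np.uint8),
    "0label": np.ones((2, 2), dtype=np.uint8),
})
buffer.seek(0)

with np.load(buffer) as npz:
    groups = group_by_attempt(npz.files)
    final_label = npz[groups[0]["label"]]  # group 0 holds the final result

print(groups)
```

With a real file, np.load would take the path to the NPZ file in place of the in-memory buffer. Whether object-typed members such as settings.npy need allow_pickle=True has not been checked here.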
  3. What legal disclaimers am I supposed to read?
    Unless otherwise stated, all data, metadata and related materials are considered to satisfy the quality standards relative to the purpose for which the data were collected. Although these data and associated metadata have been reviewed for accuracy and completeness and approved for release by the U.S. Geological Survey (USGS), no warranty expressed or implied is made regarding the display or utility of the data on any other system or for general or scientific purposes, nor shall the act of distribution constitute any such warranty.
  4. How can I download or order the data?
  5. What hardware or software do I need in order to use the data set?
    These data can be viewed with Doodler software (Buscombe, 2022; https://doi.org/10.5066/P9YVHL23).

Who wrote the metadata?

Dates:
Last modified: 19-Mar-2022
Metadata author:
U.S. Geological Survey, Pacific Coastal and Marine Science Center
Attn: PCMSC Science Data Coordinator
2885 Mission Street
Santa Cruz, CA

831-427-4747 (voice)
pcmsc_data@usgs.gov
Metadata standard:
Content Standard for Digital Geospatial Metadata (FGDC-STD-001-1998)

This page is <https://cmgds.marine.usgs.gov/catalog/pcmsc/DataReleases/CMGDS_DR_tool/DR_P91NP87I/CoastTrain_imagery_details_metadata.faq.html>
Generated by mp version 2.9.51 on Tue Mar 22 13:38:41 2022