The Linux Foundation Projects

Definition

A dataset profile SBOM is a document or representation that captures the relevant information about the datasets used in an AI system or other application. It can include dataset names, versions, sources, licensing information, associated metadata, and any other relevant attributes. It also describes or summarizes each dataset, covering its characteristics and statistical information, and gives insight into the dataset's structure, format, content, and properties, which helps users understand and analyze the data more effectively.
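To make the definition concrete, here is a minimal Python sketch of what one dataset profile record might look like. The class name `DatasetProfile`, its field names, and all sample values are assumptions made for illustration; they are not taken from a published schema.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class DatasetProfile:
    """Hypothetical profile record; field names are illustrative only."""
    name: str
    version: str
    source: str
    license: str
    data_format: str
    record_count: int
    metadata: dict = field(default_factory=dict)


def to_json(profile: DatasetProfile) -> str:
    """Serialize a profile to JSON with a stable key order."""
    return json.dumps(asdict(profile), indent=2, sort_keys=True)


# Made-up sample values, used only to show the shape of a record.
example = DatasetProfile(
    name="street-scenes",
    version="1.2.0",
    source="https://example.org/datasets/street-scenes",
    license="CC-BY-4.0",
    data_format="image/jpeg",
    record_count=12000,
    metadata={"collected": "2023", "contains_pii": False},
)

print(to_json(example))
```

Keeping the record as plain, serializable data means the same profile can be stored next to the dataset, compared between versions, or handed to reviewers who do not run the training code.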

Personas

The personas for the Dataset Profile SBOM are as follows:

  • Data Scientists work with datasets and develop AI models. They may use dataset profile SBOMs to understand the characteristics, quality, and availability of different datasets. This information helps them select appropriate datasets and assess their suitability for specific AI applications.
  • Data Engineers are involved in data pipeline development, integration, and transformation. They may refer to dataset profile SBOMs to understand the structure, format, and dependencies of datasets, facilitating the design and implementation of efficient data pipelines.
  • Data Owners, Creators, and Curators oversee the governance and management of data within an organization. They may use dataset profile SBOMs to ensure compliance with data privacy regulations, assess data quality and lineage, and enforce data governance policies.
  • Compliance Officers are responsible for ensuring adherence to legal and regulatory requirements. They may review dataset profile SBOMs to verify compliance with data protection laws, licensing obligations, or other regulatory frameworks governing data usage.
  • IT Operations may refer to dataset profile SBOMs to understand the datasets used in AI systems and assess their impact on infrastructure, storage, and performance requirements. This helps ensure proper resource allocation and system scalability.
  • Legal and Intellectual Property Teams may leverage dataset profile SBOMs to evaluate the licensing and usage rights associated with datasets. They can use this information to assess potential risks, verify compliance, and mitigate any legal issues related to dataset usage.
  • Auditors conducting audits or evaluations of AI systems may request dataset profile SBOMs to review the datasets used, evaluate data privacy and security measures, and verify compliance with relevant standards or regulations.
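As one illustration of the compliance and legal review described for the personas above, the following Python sketch flags dataset profiles whose license is missing or not on an approved list. The allowlist, the dictionary keys, and the sample records are all hypothetical; a real review would follow the organization's own policy and profile format.

```python
# Hypothetical allowlist; a real one would come from the organization's legal policy.
APPROVED_LICENSES = {"CC0-1.0", "CC-BY-4.0", "MIT"}


def find_license_issues(profiles):
    """Return (dataset name, reason) pairs for profiles that need review."""
    issues = []
    for profile in profiles:
        license_id = profile.get("license")
        if not license_id:
            issues.append((profile["name"], "missing license"))
        elif license_id not in APPROVED_LICENSES:
            issues.append((profile["name"], f"license not approved: {license_id}"))
    return issues


# Made-up profile records for demonstration.
profiles = [
    {"name": "traffic-counts", "license": "CC0-1.0"},
    {"name": "scraped-reviews", "license": ""},
    {"name": "vendor-images", "license": "LicenseRef-Proprietary"},
]

for name, reason in find_license_issues(profiles):
    print(f"{name}: {reason}")
```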

The following table gives a high-level classification, or taxonomy, of datasets and maps it to the relevant personas. It covers different types of data, such as structured, unstructured, time-series, image, text, audio, video, and geospatial data. It also lists personas such as data scientists, data engineers, data analysts, data privacy officers, data governance managers, and IT administrators, indicating their respective roles and responsibilities in dataset management.

Persona: Classification/Taxonomy

Data Scientist
  • Structured Data
  • Unstructured Data
  • Time-Series Data
  • Image Data
  • Text Data
  • Audio Data
  • Video Data
  • Geospatial Data

Data Engineer
  • Data Extraction and Transformation
  • Data Cleaning and Preprocessing
  • Data Integration and Fusion
  • Data Aggregation and Summarization
  • Data Wrangling and Feature Engineering

Data Analyst
  • Exploratory Data Analysis
  • Descriptive Analytics
  • Statistical Analysis
  • Data Visualization

Data Privacy Officer
  • Personally Identifiable Information (PII) Data
  • Sensitive Data
  • Anonymized Data
  • Consent Management

Data Governance Manager
  • Data Quality Assessment and Monitoring
  • Data Access and Security Policies
  • Data Compliance and Regulations
  • Data Retention and Deletion Policies
  • Data Sharing and Consent Management

IT Administrator
  • Data Storage and Backup
  • Data Access Control
  • Data Encryption and Security
  • Data Archiving and Retrieval
  • Data Transfer and Integration

Use Cases

National Public Datasets

Data.gov offers detailed descriptions, metadata, and documentation for each dataset, including information about its source, format, licensing, and update frequency. This helps users understand and evaluate the datasets before using them.

Open Data Portals

Many cities, governments, and organizations worldwide have open data portals that offer access to various datasets. These portals often provide descriptive information about each dataset, including its source, format, licensing, and data dictionary. While not explicitly employing SBOM terminology, these portals facilitate transparency and understanding of the datasets.

Organization Dataset Management

An organization collects and uses a dataset to train an AI model that detects objects in images. The dataset contains images sourced from various contributors, a mix of open-source and proprietary images. While creating the dataset profile SBOM, the organization discovers that a subset of the images was obtained from an open-source image repository with a known security vulnerability. This vulnerability allowed for potential injection of malicious code or embedded metadata into the images.
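The scenario above depends on the profile recording where each image came from. A minimal Python sketch of that lookup follows, assuming a hypothetical per-file record with `path` and `source` keys; the source names and file paths are invented for the example.

```python
def affected_files(records, flagged_sources):
    """Return the sorted paths of files whose recorded source is flagged."""
    return sorted(r["path"] for r in records if r["source"] in flagged_sources)


# Made-up per-file provenance records, as a profile might list them.
records = [
    {"path": "img/0001.jpg", "source": "internal-camera"},
    {"path": "img/0002.jpg", "source": "open-image-repo"},
    {"path": "img/0003.jpg", "source": "open-image-repo"},
    {"path": "img/0004.jpg", "source": "licensed-vendor"},
]

# Hypothetical name of the repository with the known vulnerability.
flagged = {"open-image-repo"}

print(affected_files(records, flagged))
```

With provenance captured per file, the affected subset can be quarantined or re-scanned without discarding the whole dataset.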

Security and Forensic Analysis

In the case of data poisoning incidents, creating a dataset profile SBOM helps with forensic analysis, understanding the attack vectors, and identifying the responsible parties.
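One way a profile could support this kind of analysis is by recording a checksum for each file when the profile is created, so that later changes stand out. The Python sketch below assumes the profile stores such checksums, which the text above does not state. The function names and sample data are hypothetical, and in-memory bytes stand in for files on disk.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def compare_to_profile(recorded: dict, current: dict) -> dict:
    """Compare recorded checksums (path -> hex digest) with current
    content (path -> bytes) and report differences."""
    changed, missing = [], []
    for path, digest in recorded.items():
        if path not in current:
            missing.append(path)
        elif sha256_hex(current[path]) != digest:
            changed.append(path)
    unexpected = set(current) - set(recorded)
    return {
        "changed": sorted(changed),
        "missing": sorted(missing),
        "unexpected": sorted(unexpected),
    }


# Made-up baseline content captured when the profile was created.
baseline = {"a.csv": b"id,label\n1,cat\n", "b.csv": b"id,label\n2,dog\n"}
recorded = {path: sha256_hex(data) for path, data in baseline.items()}

# Made-up later state: one label altered, one file added.
current = {
    "a.csv": b"id,label\n1,cat\n",
    "b.csv": b"id,label\n2,cat\n",
    "c.csv": b"id,label\n3,cat\n",
}

report = compare_to_profile(recorded, current)
print(report)
```

A checksum comparison only shows that content changed after the profile was recorded. It does not show who changed it or whether the change was malicious, so it is a starting point for investigation rather than a conclusion.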

Benefits

Creating a dataset profile SBOM helps organizations improve their data governance, promote responsible AI development, and address emerging challenges related to data transparency, bias, and compliance.

Related Content

  • OpenML – open platform of multiple datasets