Cognite Data Quality

cognite-data-quality is a Python package for SHACL-based data quality validation in Cognite Data Fusion (CDF). It validates DMS instances (structured data models) and time series data against declarative rules, stores results in the CDF Records API, and provides tools to deploy and run validation in CDF Functions and Workflows.

Purpose

Industrial data in CDF must meet quality standards before it can be trusted for analytics, dashboards, or automation. Raw data often has gaps, outliers, missing required fields, or rule violations. Manual checks do not scale.

This package provides:

Declarative validation using SHACL (W3C standard). You define what good data looks like; the package checks whether your data conforms.
Automated validation in CDF (Functions, Workflows, Triggers) and locally in notebooks and scripts.
Traceable results in the Records API for monitoring, dashboards, and drilling into failures.

How It Works

Define rules as TTL (Turtle), a JSON rule set, or a YAML view config. Rules target views or time series.
Run validation, either locally with run_validation() (no Function required) or in CDF via deployed Functions triggered by Workflows.
Post results to the Records API (pass/fail, violation details, severity).
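As a rough illustration of the first step, the sketch below writes a small SHACL rule file in Turtle. The shape uses only standard SHACL vocabulary; the ex: namespace, the Pump class, the name property, and the file name are hypothetical, and how this package maps a CDF view to a SHACL target is not described here, so treat sh:targetClass as a placeholder.

```python
from pathlib import Path

# A generic SHACL shape: every ex:Pump must have at least one string-valued name.
# The ex: namespace and all names under it are hypothetical examples.
RULES_TTL = """\
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.org/> .

ex:PumpShape
    a sh:NodeShape ;
    sh:targetClass ex:Pump ;
    sh:property [
        sh:path ex:name ;
        sh:datatype xsd:string ;
        sh:minCount 1 ;
        sh:severity sh:Violation ;
        sh:message "Pump must have a name." ;
    ] .
"""


def write_rules(directory: Path, filename: str = "pump_rules.ttl") -> Path:
    """Write the rule set to a .ttl file and return its path."""
    path = directory / filename
    path.write_text(RULES_TTL, encoding="utf-8")
    return path
```

Calling write_rules(Path(".")) produces a file that could then serve as the TTL rule source for a validation run.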

The package includes inlined SHACL validation with CDF SPARQL functions and integrates with CDF Data Modeling (DMS). For time series quality checks, it supports functions from both the CDF SDK and INDSL.

Workflows

Local validation and testing

Create a TOML file (e.g. config.toml) with CDF credentials.
Write SHACL rules (TTL) and run them against live data with run_validation().
Iterate on rules; optionally set post_to_records=True with a RecordsConfig to test the full pipeline.
Use DataModelConfig and RecordsConfig to set the data model, instance space, stream, and rule set.
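The local steps above might be wired together roughly as follows. This is a sketch under stated assumptions, not the package's documented signature: it assumes load_cognite_client_from_toml() accepts the TOML path, that run_validation() takes the client and the rules file as its leading positional arguments, and that both are importable from the package root. The keyword arguments instance_space, limit, and print_output are the ones named in this document; datamodel, records_config, and post_to_records are left out because the fields of DataModelConfig and RecordsConfig are not described here.

```python
def validate_locally(rules_path, instance_space, toml_path="config.toml", limit=100):
    """Run one local validation pass and return whatever run_validation() returns."""
    # Imported lazily so this helper can be defined without the package installed.
    from cognite_data_quality import load_cognite_client_from_toml, run_validation

    # Assumption: the loader takes the path to the TOML credentials file.
    client = load_cognite_client_from_toml(toml_path)

    # Assumption: client and rules file are positional; keywords are from this document.
    return run_validation(
        client,
        rules_path,
        instance_space=instance_space,
        limit=limit,
        print_output=True,
    )
```

A call such as validate_locally("pump_rules.ttl", "my_instance_space", limit=50) keeps the iteration loop short while rules are still changing; the space name here is a placeholder.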

Production deployment

Deploy infrastructure with deploy_validation_infrastructure(client, settings_path=..., views_dir=..., function_secrets=...). This ensures that the containers (Records, OrchestrationState, FunctionValidationState) exist and deploys the unified validation function, workflows, and triggers.
Optionally run the validation pipeline for a view with deploy_validation_pipeline(client, settings_path=..., view_external_id=..., wait=True) to process historic data and set up sync and monitor schedules.
Results are written to the Records API automatically.
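The two deployment calls above could be combined in a small helper like the one below. It is a minimal sketch: the keyword arguments are the ones shown in this section, while the top-level import location, the type of function_secrets, and the return values are assumptions.

```python
def deploy_for_view(client, settings_path, views_dir, view_external_id, function_secrets):
    """Deploy shared infrastructure, then start the pipeline for one view."""
    # Imported lazily so this helper can be defined without the package installed.
    # Assumption: both functions are importable from the package root.
    from cognite_data_quality import (
        deploy_validation_infrastructure,
        deploy_validation_pipeline,
    )

    # Step 1: containers, unified validation function, workflows, and triggers.
    deploy_validation_infrastructure(
        client,
        settings_path=settings_path,
        views_dir=views_dir,
        function_secrets=function_secrets,  # assumed to be a mapping of secret names to values
    )

    # Step 2: process historic data and set up sync and monitor schedules,
    # waiting for the run as in the call shown in this section.
    return deploy_validation_pipeline(
        client,
        settings_path=settings_path,
        view_external_id=view_external_id,
        wait=True,
    )
```

Keeping the infrastructure step first reflects the order described above: the pipeline for a view relies on the function, workflows, and triggers already being in place.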

Invoking deployed CDF Functions from Python

Use the invoke helpers to call already-deployed Cognite Functions from notebooks or scripts. These send the payload to the function running in CDF; they do not run validation locally.

from cognite_data_quality import call_validate_instances_shacl

# client: an authenticated Cognite client; data: the payload sent to the deployed Function.
result = call_validate_instances_shacl(client, data)

To run validation locally (no Function required), use run_validation() instead.

Main Capabilities

Capability	Description
run_validation	Run validation locally against DMS data (no workflows). Supports TTL, JSON, or YAML rules. Accepts `datamodel`, `instance_space`, `records_config`, `limit`, and `print_output`.
deploy_validation_infrastructure	Deploy all validation infrastructure (containers, function, workflows, triggers) from `settings_path` and `views_dir`.
deploy_validation_pipeline	Deploy and run the full validation pipeline for a view (historic partitions, sync trigger, monitor schedule).
call_validation	Invoke the deployed CDF Function with type-based dispatch (instance, instance_sync_cursor, timeseries, orchestrator, partitioned, test). Runs in CDF, not locally.
invoke helpers	`call_validate_instances_shacl`, `call_validate_timeseries_shacl`, etc.: thin wrappers around `call_validation()` for specific validation types. All invoke deployed CDF Functions.

Other Features

Credentials: Load from TOML (e.g. config.toml) with load_cognite_client_from_toml().
Rule sources: TTL, JSON rule set, or YAML view config.
Records output: Default container dataQuality:DataQualityValidationRecord; configurable stream and rule set in RecordsConfig.
Auto-loading: References between instances are loaded automatically when validating DMS data (configurable depth).

Next Steps

Installation: Set up the package and credentials.
Where to Start: Run your first validation and deploy.
Usage: Run validation, deploy, invoke, and configure rules.