Cognite Data Quality

cognite-data-quality is a Python package for SHACL-based data quality validation in Cognite Data Fusion (CDF). It validates DMS instances (structured data models) and time series data against declarative rules, stores results in the CDF Records API, and provides tools to deploy and run validation in CDF Functions and Workflows.

Purpose

Industrial data in CDF must meet quality standards before it can be trusted for analytics, dashboards, or automation. Raw data often has gaps, outliers, missing required fields, or rule violations. Manual checks do not scale.

This package provides:

Single instance data quality validation: validate required fields, datatypes, ranges, and patterns on individual industrial entities.
Graph consistency: validate cross-entity consistency (for example operation-tag vs equipment-tag alignment and relationship integrity).
Uniqueness: run SHACL-native global uniqueness checks (dqs:uniquenessConstraint / dqs:unique) with aggregate-first execution and overflow-safe output.
Time series: validate datapoint freshness, completeness, gaps, outliers, and value bounds with cdf_sdk and optional cdf_indsl functions.
Conditional logic: define "if this, then create this data" logic via SHACL-AF (sh:rule / sh:SPARQLRule) and emit RuleEngineResult.
Chained conditional logic: chain conditional outputs so downstream derived states depend on upstream inferred results (dqs:dependsOn, dqs:causedBy).
Automated validation in CDF (Functions, Workflows, Triggers) and locally in notebooks and scripts.
Traceable results in the Records API for monitoring, dashboards, and drilling into failures.
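To make the time series checks listed above concrete, here is a minimal, standard-library sketch of gap, value-bound, and freshness checks. It does not use cdf_sdk or cdf_indsl functions and is not the package's implementation. The function names, thresholds, and sample datapoints are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def find_gaps(timestamps, max_gap):
    """Return (start, end) pairs where consecutive datapoints are further apart than max_gap."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_gap]


def out_of_bounds(values, lower, upper):
    """Return the indices of values outside the closed interval [lower, upper]."""
    return [i for i, v in enumerate(values) if not (lower <= v <= upper)]


def is_fresh(timestamps, now, max_age):
    """Return True if the newest datapoint is no older than max_age relative to now."""
    return bool(timestamps) and now - max(timestamps) <= max_age


start = datetime(2024, 1, 1, tzinfo=timezone.utc)
# Hypothetical datapoints: roughly one per minute, with nothing between minute 2 and minute 10.
timestamps = [start + timedelta(minutes=m) for m in (0, 1, 2, 10, 11)]
values = [20.5, 21.0, 250.0, 20.8, 19.9]

gaps = find_gaps(timestamps, max_gap=timedelta(minutes=5))
bad_indices = out_of_bounds(values, lower=0.0, upper=100.0)
fresh = is_fresh(timestamps, now=start + timedelta(minutes=12), max_age=timedelta(minutes=5))
```

In the package these checks are expressed declaratively as rules rather than as hand-written functions; the sketch only shows what each check evaluates.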

How It Works

Manage and persist rules as a Data Product + RuleSet (recommended). YAML is the CDF Toolkit representation of those bindings, together with deployment and runtime settings; use TTL only as a legacy transition path.
Run validation, either locally with run_validation() (no Function required) or in CDF through deployed Functions triggered by Workflows.
Post results to the Records API (pass/fail, violation details, severity).
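As an illustration of the kind of information the last step carries (pass/fail, violation details, severity), the sketch below models one validation outcome. The property comments refer to the standard SHACL validation-result terms. The class names and the dictionary keys are hypothetical and are not the schema the package writes to the Records API.

```python
from dataclasses import asdict, dataclass, field

# The three severities defined by the SHACL specification, least to most severe.
SEVERITY_ORDER = ("Info", "Warning", "Violation")


@dataclass
class Violation:
    focus_node: str  # instance that failed (sh:focusNode)
    path: str        # property that failed (sh:resultPath)
    message: str     # human-readable explanation (sh:resultMessage)
    severity: str    # one of SEVERITY_ORDER (sh:resultSeverity)


@dataclass
class ValidationOutcome:
    instance_id: str
    violations: list = field(default_factory=list)

    @property
    def passed(self) -> bool:
        # Mirrors sh:conforms: a report conforms only when it has no results.
        return not self.violations

    def to_record(self) -> dict:
        """Flatten the outcome into a dict; the key names are illustrative only."""
        worst = max(
            (v.severity for v in self.violations),
            key=SEVERITY_ORDER.index,
            default=None,
        )
        return {
            "instanceId": self.instance_id,
            "passed": self.passed,
            "worstSeverity": worst,
            "violations": [asdict(v) for v in self.violations],
        }


outcome = ValidationOutcome(
    "pump-001",
    [
        Violation("pump-001", "name", "name is required", "Violation"),
        Violation("pump-001", "description", "description is very short", "Info"),
    ],
)
record = outcome.to_record()
```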

The package includes inlined SHACL validation with CDF SPARQL functions and integrates with CDF Data Modeling (DMS). It supports both CDF SDK and INDSL functions for time series quality checks.

Workflows

Local validation and testing

Create a TOML file (e.g. config.toml) with CDF credentials.
Write SHACL rules (TTL), then run them against live data with run_validation().
Iterate on rules; optionally set post_to_records=True with a RecordsConfig to test the full pipeline.
Use DataModelConfig and RecordsConfig to specify the data model, instance space, stream, and rule set.

Production deployment

Deploy infrastructure with deploy_validation_infrastructure(client, settings_path=..., views_dir=..., function_secrets=...). This ensures that the containers (Records, OrchestrationState, FunctionValidationState) exist and deploys the unified validation function, workflows, and triggers.
Optionally run the validation pipeline for a view with deploy_validation_pipeline(client, settings_path=..., view_external_id=..., wait=True) to process historic data and set up sync and monitor schedules.
Results are written to the Records API automatically.
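The two deployment steps above could be combined in a small helper like the one below. This is a sketch under assumptions: the keyword arguments are the ones shown in the steps, but the helper name deploy_and_backfill is hypothetical, importing both functions from the package root is assumed, and nothing is assumed about what either call returns.

```python
def deploy_and_backfill(client, settings_path, views_dir, function_secrets, view_external_id):
    """Deploy the shared infrastructure, then run the validation pipeline for one view."""
    # Imported lazily so the helper can be defined without the package installed.
    # Assumption: both functions are importable from the package root.
    from cognite_data_quality import (
        deploy_validation_infrastructure,
        deploy_validation_pipeline,
    )

    # Step 1: containers, unified validation function, workflows, and triggers.
    infrastructure = deploy_validation_infrastructure(
        client,
        settings_path=settings_path,
        views_dir=views_dir,
        function_secrets=function_secrets,
    )
    # Step 2 (optional in the text above): process historic data for one view
    # and set up its sync and monitor schedules, waiting for completion.
    pipeline = deploy_validation_pipeline(
        client,
        settings_path=settings_path,
        view_external_id=view_external_id,
        wait=True,
    )
    # Return whatever the two calls produced, without interpreting it.
    return infrastructure, pipeline
```

Running the infrastructure step first matters because the pipeline for a view relies on the function, workflows, and containers already being in place.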

Invoking deployed CDF Functions from Python

Use the invoke helpers to call already-deployed Cognite Functions from notebooks or scripts. These helpers send the payload to the function running in CDF; they do not run validation locally.

from cognite_data_quality import call_validate_instances_shacl

# `client` is an authenticated CogniteClient; `data` is the payload sent to the deployed Function.
result = call_validate_instances_shacl(client, data)

To run validation locally (no Function required), use run_validation() instead.

Main Capabilities

| Capability | Description |
| --- | --- |
| run_validation | Run validation from DMS (no workflows). Recommended with YAML + `ruleset_references`; TTL remains supported for legacy transition. |
| deploy_validation_infrastructure | Deploy all validation infrastructure (containers, function, workflows, triggers) from `settings_path` and `views_dir`. Automatically provisions the `RuleEngineResult` container alongside `DataQualityValidationRecord`. |
| deploy_validation_pipeline | Deploy and run the full validation pipeline for a view (historic partitions, sync trigger, monitor schedule). |
| call_validation | Invoke the deployed CDF Function with type-based dispatch (instance, instance_sync_cursor, timeseries, orchestrator, partitioned, shacl, test). Runs in CDF, not locally. |
| invoke helpers | `call_validate_instances_shacl`, `call_validate_timeseries_shacl`, `call_validate_shacl`, etc.: thin wrappers around `call_validation()` for specific validation types. All invoke deployed CDF Functions. |
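The relationship between a type-dispatching entry point and its thin wrappers can be illustrated with a generic pattern. The sketch below runs locally and is not how call_validation() or the invoke helpers are implemented; the handler names, the `type` key, and the payload shape are hypothetical.

```python
# Hypothetical handlers standing in for the work done per validation type.
def _validate_instance(payload):
    return {"type": "instance", "checked": len(payload.get("instances", []))}


def _validate_timeseries(payload):
    return {"type": "timeseries", "checked": len(payload.get("timeseries", []))}


# Registry mapping a type string to the handler responsible for it.
HANDLERS = {
    "instance": _validate_instance,
    "timeseries": _validate_timeseries,
}


def dispatch(payload):
    """Route a payload to the handler registered for its 'type' field."""
    validation_type = payload.get("type")
    try:
        handler = HANDLERS[validation_type]
    except KeyError:
        raise ValueError(f"unsupported validation type: {validation_type!r}") from None
    return handler(payload)


def dispatch_instances(instances):
    """Thin wrapper: fixes the type so callers only supply the data."""
    return dispatch({"type": "instance", "instances": instances})


summary = dispatch_instances(["pump-001", "pump-002"])
```

A registry keeps the entry point unchanged when a new validation type is added, and each wrapper stays a one-line call that pins the type.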

Other Features

Credentials: Load from TOML (e.g. config.toml) with load_cognite_client_from_toml().
Rule sources: Data Product + RuleSet (primary), YAML as CDF Toolkit representation, TTL as legacy transition.
Records output: Quality violations → dataQuality:DataQualityValidationRecord; inference results → dataQuality:RuleEngineResult. Both configurable via RecordsConfig.
Auto-loading: References between instances are loaded automatically when validating DMS data (configurable depth).
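One way to picture auto-loading with a configurable depth is a breadth-first walk that stops following references after a fixed number of hops. The sketch below is a generic illustration, not the package's loader; the function name, the fetch_references callback, and the sample identifiers are hypothetical.

```python
from collections import deque


def load_with_references(root_id, fetch_references, max_depth):
    """Collect the ids reachable from root_id within max_depth reference hops."""
    seen = {root_id}
    queue = deque([(root_id, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow references beyond the configured depth
        for ref in fetch_references(node):
            if ref not in seen:  # also guards against reference cycles
                seen.add(ref)
                queue.append((ref, depth + 1))
    return seen


# Hypothetical reference graph: operation -> equipment -> site -> region.
REFERENCES = {"op-1": ["eq-1"], "eq-1": ["site-1"], "site-1": ["region-1"]}

loaded = load_with_references("op-1", lambda n: REFERENCES.get(n, []), max_depth=2)
```

With a depth of 2 the walk loads the operation, its equipment, and the equipment's site, but not the region. A larger depth gives rules more context at the cost of more instances to fetch.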

Next Steps

Installation: Set up the package and credentials.
Where to Start: Run your first validation and deploy.

Usage Journey

The Usage documentation follows one end-to-end path from baseline validation to advanced conditional flows.

Start here:

Usage Guide

Follow this order: