
Cognite Data Quality


cognite-data-quality is a Python package for SHACL-based data quality validation in Cognite Data Fusion (CDF). It validates DMS instances (structured data models) and time series data against declarative rules, stores results in the CDF Records API, and provides tools to deploy and run validation in CDF Functions and Workflows.


Purpose

Industrial data in CDF must meet quality standards before it can be trusted for analytics, dashboards, or automation. Raw data often has gaps, outliers, missing required fields, or rule violations. Manual checks do not scale.

This package provides:

  • Single instance data quality validation: validate required fields, datatypes, ranges, and patterns on individual industrial entities.
  • Graph consistency: validate cross-entity consistency (for example operation-tag vs equipment-tag alignment and relationship integrity).
  • Uniqueness: run SHACL-native global uniqueness checks (dqs:uniquenessConstraint / dqs:unique) with aggregate-first execution and overflow-safe output.
  • Time series: validate datapoint freshness, completeness, gaps, outliers, and value bounds with cdf_sdk and optional cdf_indsl functions.
  • Conditional logic: define "if this, then create this data" logic via SHACL-AF (sh:rule / sh:SPARQLRule) and emit RuleEngineResult.
  • Chained conditional logic: chain conditional outputs so downstream derived states depend on upstream inferred results (dqs:dependsOn, dqs:causedBy).
  • Automated validation in CDF (Functions, Workflows, Triggers) and locally in notebooks and scripts.
  • Traceable results in the Records API for monitoring, dashboards, and drilling into failures.
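To make the time series checks above concrete, the following is a minimal pure-Python sketch of gap and freshness detection. It is a conceptual illustration only and does not show how the package, cdf_sdk, or cdf_indsl implement these checks; the function names `find_gaps` and `is_fresh` are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def find_gaps(timestamps, max_gap):
    """Return (start, end) pairs of consecutive datapoints further apart than max_gap."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_gap]


def is_fresh(timestamps, now, max_age):
    """Return True if the newest datapoint is no older than max_age."""
    return bool(timestamps) and now - max(timestamps) <= max_age


start = datetime(2024, 1, 1, tzinfo=timezone.utc)
# Hypothetical series: one datapoint per minute, with minutes 3 and 4 missing.
points = [start + timedelta(minutes=m) for m in (0, 1, 2, 5, 6)]

gaps = find_gaps(points, max_gap=timedelta(minutes=1))
fresh = is_fresh(points, now=start + timedelta(minutes=10), max_age=timedelta(minutes=5))
```

Here `gaps` holds the single interval between minute 2 and minute 5, and `fresh` is true because the newest datapoint is four minutes old.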

How It Works

  1. Manage and persist rules in a Data Product + RuleSet (recommended). YAML is the CDF Toolkit representation of those bindings, together with deployment and runtime settings; TTL is supported only as a legacy transition path.
  2. Run validation: locally with run_validation() (no Function required), or in CDF via deployed Functions triggered by Workflows.
  3. Post results to the Records API (pass/fail, violation details, severity).

The package includes inlined SHACL validation with CDF SPARQL functions and integrates with CDF Data Modeling (DMS). It supports both CDF SDK and INDSL functions for time series quality checks.


Workflows

Local validation and testing

  1. Create a TOML file (e.g. config.toml) with CDF credentials.
  2. Write SHACL rules (TTL). Use run_validation() against live data.
  3. Iterate on rules; optionally set post_to_records=True with a RecordsConfig to test the full pipeline.
  4. Configure the data model, instance space, stream, and rule set through DataModelConfig and RecordsConfig.
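As a rough illustration of step 2, the sketch below writes a small SHACL shape to a TTL file from Python. The `ex:` namespace, the `ex:Pump` class, and the property names are hypothetical, and how the package maps DMS views and properties to RDF classes and paths is not described in this section, so treat the shape as a generic SHACL example, not a ready-to-use rule. The call to run_validation() is deliberately left out because its parameters are not listed here.

```python
import tempfile
from pathlib import Path

# Generic SHACL shape: a required string name matching a pattern,
# and a non-negative rated pressure. All ex: terms are hypothetical.
RULES_TTL = """\
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.org/> .

ex:PumpShape a sh:NodeShape ;
    sh:targetClass ex:Pump ;
    sh:property [
        sh:path ex:name ;
        sh:minCount 1 ;
        sh:datatype xsd:string ;
        sh:pattern "^P-[0-9]+$" ;
    ] ;
    sh:property [
        sh:path ex:ratedPressure ;
        sh:minInclusive 0.0 ;
    ] .
"""


def write_rules(directory):
    """Write the shape to <directory>/pump_rules.ttl and return the path."""
    path = Path(directory) / "pump_rules.ttl"
    path.write_text(RULES_TTL, encoding="utf-8")
    return path


with tempfile.TemporaryDirectory() as tmp:
    rules_path = write_rules(tmp)
    written = rules_path.read_text(encoding="utf-8")
```

Keeping rules in a file makes it easy to iterate on them (step 3) and to keep them under version control alongside the TOML configuration.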

Production deployment

  1. Deploy infrastructure with deploy_validation_infrastructure(client, settings_path=..., views_dir=..., function_secrets=...). This ensures the required containers exist (Records, OrchestrationState, FunctionValidationState) and deploys the unified validation function, workflows, and triggers.
  2. Optionally run the validation pipeline for a view with deploy_validation_pipeline(client, settings_path=..., view_external_id=..., wait=True) to process historic data and set up sync and monitor schedules.
  3. Results are written to the Records API automatically.
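The two deployment steps above could be combined in a small helper like the sketch below. The keyword arguments are the ones named in the steps; the top-level import path, the expected type of each argument, and the return values are assumptions, and the wrapper name `deploy_view_validation` is hypothetical.

```python
def deploy_view_validation(client, settings_path, views_dir, view_external_id, function_secrets):
    """Deploy shared infrastructure, then run the validation pipeline for one view."""
    # Imported lazily so this module loads even where the package is not installed.
    # Assumption: both functions are importable from the package's top level.
    from cognite_data_quality import (
        deploy_validation_infrastructure,
        deploy_validation_pipeline,
    )

    # Step 1: containers, the unified validation function, workflows, and triggers.
    deploy_validation_infrastructure(
        client,
        settings_path=settings_path,
        views_dir=views_dir,
        function_secrets=function_secrets,
    )

    # Step 2 (optional in the text above): process historic data for the view
    # and set up the sync and monitor schedules, waiting for completion.
    return deploy_validation_pipeline(
        client,
        settings_path=settings_path,
        view_external_id=view_external_id,
        wait=True,
    )
```

Running the infrastructure step before the pipeline step mirrors the order of the list above; whether the infrastructure call is safe to repeat on every deployment is not stated in this section.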

Invoking deployed CDF Functions from Python

Use the invoke helpers to call already-deployed Cognite Functions from notebooks or scripts. These helpers send the payload to the function running in CDF; they do not run validation locally.

from cognite_data_quality import call_validate_instances_shacl

# `client` is an authenticated CogniteClient; `data` is the payload sent to the deployed Function.
result = call_validate_instances_shacl(client, data)

To run validation locally (no Function required), use run_validation() instead.


Main Capabilities

| Capability | Description |
| --- | --- |
| run_validation | Run validation from DMS (no workflows). Recommended with YAML + ruleset_references; TTL remains supported for legacy transition. |
| deploy_validation_infrastructure | Deploy all validation infrastructure (containers, function, workflows, triggers) from settings_path and views_dir. Automatically provisions the RuleEngineResult container alongside DataQualityValidationRecord. |
| deploy_validation_pipeline | Deploy and run the full validation pipeline for a view (historic partitions, sync trigger, monitor schedule). |
| call_validation | Invoke the deployed CDF Function with type-based dispatch (instance, instance_sync_cursor, timeseries, orchestrator, partitioned, shacl, test). Runs in CDF, not locally. |
| invoke helpers | call_validate_instances_shacl, call_validate_timeseries_shacl, call_validate_shacl, etc.: thin wrappers around call_validation() for specific validation types. All invoke deployed CDF Functions. |

Other Features

  • Credentials: Load from TOML (e.g. config.toml) with load_cognite_client_from_toml().
  • Rule sources: Data Product + RuleSet (primary), YAML as CDF Toolkit representation, TTL as legacy transition.
  • Records output: Quality violations → dataQuality:DataQualityValidationRecord; inference results → dataQuality:RuleEngineResult. Both configurable via RecordsConfig.
  • Auto-loading: References between instances are loaded automatically when validating DMS data (configurable depth).
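The idea behind depth-limited auto-loading can be sketched as a breadth-first walk over instance references. This is a conceptual illustration, not the package's loader: the function name `collect_references`, the `max_depth` parameter, and the dictionary-based reference map are all hypothetical.

```python
from collections import deque


def collect_references(start_id, references, max_depth):
    """Return the ids reachable from start_id within max_depth reference hops."""
    seen = {start_id}
    queue = deque([(start_id, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow references beyond the configured depth
        for target in references.get(node, ()):
            if target not in seen:  # guards against cycles and duplicates
                seen.add(target)
                queue.append((target, depth + 1))
    return seen


# Hypothetical reference map: instance id -> ids it points to.
references = {
    "pump-1": ["motor-1", "site-A"],
    "motor-1": ["vendor-X"],
    "vendor-X": ["pump-1"],  # cycle back to the start
}

depth_1 = collect_references("pump-1", references, max_depth=1)
depth_2 = collect_references("pump-1", references, max_depth=2)
```

A smaller depth keeps the loaded graph small and validation fast, while a larger depth lets cross-entity rules see more of the neighborhood around each instance.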

Next Steps

Usage Journey

The Usage documentation follows one end-to-end path from baseline validation to advanced conditional flows.

Follow this order:

  1. Single instance data quality validation
  2. Graph consistency
  3. Uniqueness
  4. Time Series
  5. Conditional logic
  6. Chained conditional logic