Cognite Data Quality
cognite-data-quality is a Python package for SHACL-based data quality validation in Cognite Data Fusion (CDF). It validates DMS instances (structured data models) and time series data against declarative rules, stores results in the CDF Records API, and provides tools to deploy and run validation in CDF Functions and Workflows.
Purpose
Industrial data in CDF must meet quality standards before it can be trusted for analytics, dashboards, or automation. Raw data often has gaps, outliers, missing required fields, or rule violations. Manual checks do not scale.
This package provides:
- Single instance data quality validation: validate required fields, datatypes, ranges, and patterns on individual industrial entities.
- Graph consistency: validate cross-entity consistency (for example operation-tag vs equipment-tag alignment and relationship integrity).
- Uniqueness: run SHACL-native global uniqueness checks (dqs:uniquenessConstraint/dqs:unique) with aggregate-first execution and overflow-safe output.
- Time series: validate datapoint freshness, completeness, gaps, outliers, and value bounds with cdf_sdk and optional cdf_indsl functions.
- Conditional logic: define "if this, then create this data" logic via SHACL-AF (sh:rule/sh:SPARQLRule) and emit RuleEngineResult.
- Chained conditional logic: chain conditional outputs so downstream derived states depend on upstream inferred results (dqs:dependsOn, dqs:causedBy).
- Automated validation in CDF (Functions, Workflows, Triggers) and locally in notebooks and scripts.
- Traceable results in the Records API for monitoring, dashboards, and drilling into failures.
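The conditional and chained conditional items above describe rules whose derived results feed later rules. As a rough illustration of that idea only, the sketch below evaluates rules in dependency order in plain Python. It is not SHACL-AF and not this package's engine: the rule names, the `depends_on` field, and the `run_chain` helper are all hypothetical and exist only for this example.

```python
from graphlib import TopologicalSorter

# Hypothetical rules. Each has a condition over (instance, derived facts) and a
# list of upstream rules it depends on. None of these names come from the package.
RULES = {
    "overheating": {
        "depends_on": [],
        "condition": lambda inst, facts: inst["temperature_c"] > 90,
    },
    "needs_inspection": {
        "depends_on": ["overheating"],
        "condition": lambda inst, facts: (
            "overheating" in facts and inst["running_hours"] > 1000
        ),
    },
}


def run_chain(instance, rules=RULES):
    """Evaluate rules in dependency order; return derived facts with their causes."""
    graph = {name: rule["depends_on"] for name, rule in rules.items()}
    facts = {}
    # static_order() yields upstream rules before the rules that depend on them.
    for name in TopologicalSorter(graph).static_order():
        rule = rules[name]
        if rule["condition"](instance, facts):
            # Record which upstream facts this result was derived from.
            facts[name] = [dep for dep in rule["depends_on"] if dep in facts]
    return facts


hot_pump = {"temperature_c": 95, "running_hours": 1500}
cool_pump = {"temperature_c": 60, "running_hours": 1500}
print(run_chain(hot_pump))   # {'overheating': [], 'needs_inspection': ['overheating']}
print(run_chain(cool_pump))  # {}
```

The downstream fact is produced only when its upstream fact was inferred first, which is the dependency that dqs:dependsOn and dqs:causedBy are described as expressing.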
How It Works
- Manage and persist rules in Data Product + RuleSet (recommended). Use YAML as the CDF Toolkit representation of those bindings plus deployment/runtime settings, and use TTL only as a legacy transition path.
- Run validation locally with run_validation() (no Function required), or in CDF via deployed Functions triggered by Workflows.
- Post results to the Records API (pass/fail, violation details, severity).
The package includes inlined SHACL validation with CDF SPARQL functions and integrates with CDF Data Modeling (DMS). It supports both CDF SDK and INDSL functions for time series quality checks.
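To make the time series checks more concrete, here is a minimal plain-Python sketch of what gap, freshness, and value-bound checks compute. It does not use the CDF SDK or INDSL, and it is not how the package implements these checks. The function names and thresholds are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def find_gaps(timestamps, max_gap):
    """Return (start, end) pairs where consecutive datapoints are more than max_gap apart."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_gap]


def is_fresh(timestamps, now, max_age):
    """True if the newest datapoint is no older than max_age at time `now`."""
    return bool(timestamps) and now - max(timestamps) <= max_age


def out_of_bounds(values, lower, upper):
    """Return the values that fall outside the closed interval [lower, upper]."""
    return [v for v in values if not lower <= v <= upper]


t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
stamps = [t0, t0 + timedelta(minutes=1), t0 + timedelta(minutes=10)]

# One 9-minute gap exceeds the 5-minute limit.
gaps = find_gaps(stamps, max_gap=timedelta(minutes=5))
# The newest point is 2 minutes old, within the 5-minute freshness window.
fresh = is_fresh(stamps, now=t0 + timedelta(minutes=12), max_age=timedelta(minutes=5))
# Two readings fall outside [0, 50].
bad = out_of_bounds([10.0, 55.5, -3.0], lower=0.0, upper=50.0)
```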
Workflows
Local validation and testing
- Create a TOML file (e.g. config.toml) with CDF credentials.
- Write SHACL rules (TTL). Use run_validation() against live data.
- Iterate on rules; optionally set post_to_records=True with a RecordsConfig to test the full pipeline.
- Use DataModelConfig and RecordsConfig for datamodel, instance space, stream, and rule set.
Production deployment
- Deploy infrastructure with deploy_validation_infrastructure(client, settings_path=..., views_dir=..., function_secrets=...). This ensures the required containers exist (Records, OrchestrationState, FunctionValidationState) and deploys the unified validation function, workflows, and triggers.
- Optionally run the validation pipeline for a view with deploy_validation_pipeline(client, settings_path=..., view_external_id=..., wait=True) to process historic data and set up sync and monitor schedules.
- Results are written to the Records API automatically.
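The two deployment steps above can be wrapped in one helper, sketched below. The keyword arguments are the ones named in this section. Everything else is an assumption: the `deploy_for_view` helper is hypothetical, importing both functions from the package root is assumed, and nothing is assumed about the return values beyond passing the pipeline result through.

```python
def deploy_for_view(client, settings_path, views_dir, function_secrets, view_external_id):
    """Deploy shared infrastructure, then run the validation pipeline for one view."""
    # Imported lazily so this sketch can be defined without the package installed.
    # Assumption: both functions are importable from the package root.
    from cognite_data_quality import (
        deploy_validation_infrastructure,
        deploy_validation_pipeline,
    )

    # Step 1: containers, unified validation function, workflows, and triggers.
    deploy_validation_infrastructure(
        client,
        settings_path=settings_path,
        views_dir=views_dir,
        function_secrets=function_secrets,
    )

    # Step 2 (optional): process historic data and set up sync and monitor schedules.
    return deploy_validation_pipeline(
        client,
        settings_path=settings_path,
        view_external_id=view_external_id,
        wait=True,
    )
```

Taking the paths, secrets, and view ID as parameters keeps environment-specific values out of the helper, so the same code can serve several views or projects.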
Invoking deployed CDF Functions from Python
Use the invoke helpers to call already-deployed Cognite Functions from notebooks or scripts. These send the payload to the function running in CDF — they do not run validation locally.
from cognite_data_quality import call_validate_instances_shacl
result = call_validate_instances_shacl(client, data)
To run validation locally (no Function required), use run_validation() instead.
Main Capabilities
| Capability | Description |
|---|---|
| run_validation | Run validation from DMS (no workflows). Recommended with YAML + ruleset_references; TTL remains supported for legacy transition. |
| deploy_validation_infrastructure | Deploy all validation infrastructure (containers, function, workflows, triggers) from settings_path and views_dir. Automatically provisions the RuleEngineResult container alongside DataQualityValidationRecord. |
| deploy_validation_pipeline | Deploy and run the full validation pipeline for a view (historic partitions, sync trigger, monitor schedule). |
| call_validation | Invoke the deployed CDF Function with type-based dispatch (instance, instance_sync_cursor, timeseries, orchestrator, partitioned, shacl, test). Runs in CDF, not locally. |
| invoke helpers | call_validate_instances_shacl, call_validate_timeseries_shacl, call_validate_shacl, etc. — thin wrappers around call_validation() for specific validation types. All invoke deployed CDF Functions. |
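The table describes call_validation as dispatching on a type, with the invoke helpers as thin wrappers around it. The sketch below shows that general pattern in plain Python, not the package's actual code. The registry, the handler functions, the payload keys (`instances`, `timeseries`), and the `dispatch` and `dispatch_timeseries` names are all hypothetical. Only the type names `instance` and `timeseries` come from the table.

```python
# Hypothetical handler registry. The deployed function's internals are not
# documented here, so this only illustrates type-based dispatch in general.
HANDLERS = {}


def handler(validation_type):
    """Register a function as the handler for one validation type."""
    def register(func):
        HANDLERS[validation_type] = func
        return func
    return register


@handler("instance")
def _validate_instance(data):
    return {"type": "instance", "checked": len(data.get("instances", []))}


@handler("timeseries")
def _validate_timeseries(data):
    return {"type": "timeseries", "checked": len(data.get("timeseries", []))}


def dispatch(validation_type, data):
    """Route a payload to the handler registered for its type."""
    try:
        func = HANDLERS[validation_type]
    except KeyError:
        raise ValueError(f"unknown validation type: {validation_type!r}") from None
    return func(data)


def dispatch_timeseries(data):
    """Thin wrapper: fixes the type so callers only pass the payload."""
    return dispatch("timeseries", data)


result = dispatch_timeseries({"timeseries": ["ts-1", "ts-2"]})
print(result)  # {'type': 'timeseries', 'checked': 2}
```

With this pattern, adding a new validation type means registering one more handler, and an unknown type fails with a clear error.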
Other Features
- Credentials: Load from TOML (e.g. config.toml) with load_cognite_client_from_toml().
- Rule sources: Data Product + RuleSet (primary), YAML as CDF Toolkit representation, TTL as legacy transition.
- Records output: Quality violations → dataQuality:DataQualityValidationRecord; inference results → dataQuality:RuleEngineResult. Both configurable via RecordsConfig.
- Auto-loading: References between instances are loaded automatically when validating DMS data (configurable depth).
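One way to picture auto-loading with a configurable depth is a breadth-first traversal that stops following references after a fixed number of hops. The sketch below is an assumption about the general technique, not the package's implementation. The in-memory `REFERENCES` map stands in for DMS, and `load_with_references` is a hypothetical name.

```python
from collections import deque

# Hypothetical in-memory stand-in for DMS: instance ID -> IDs it references.
REFERENCES = {
    "pump-1": ["motor-1", "site-A"],
    "motor-1": ["vendor-X"],
    "site-A": [],
    "vendor-X": ["country-NO"],
    "country-NO": [],
}


def load_with_references(start_ids, max_depth, references=REFERENCES):
    """Collect the instances reachable within max_depth reference hops (breadth-first)."""
    seen = set(start_ids)
    queue = deque((instance_id, 0) for instance_id in start_ids)
    while queue:
        current, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow references beyond the configured depth
        for target in references.get(current, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return seen


print(sorted(load_with_references(["pump-1"], max_depth=1)))
# ['motor-1', 'pump-1', 'site-A']
print(sorted(load_with_references(["pump-1"], max_depth=2)))
# ['motor-1', 'pump-1', 'site-A', 'vendor-X']
```

A larger depth makes more referenced instances available to cross-entity rules, at the cost of loading more data per validation run.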
Next Steps
- Installation: Set up the package and credentials.
- Where to Start: Run your first validation and deploy.
Usage Journey
The Usage documentation follows one end-to-end path from baseline validation to advanced conditional flows.
Start here:
Follow this order: