November 8, 2025

Industrial DataOps: Building Scalable Data Infrastructure for Manufacturing

If you're leading data and analytics for a manufacturing enterprise, you face a challenge that IT-focused data leaders don't encounter: industrial data comes from thousands of devices that were never designed to share information, flows through networks that evolved organically over decades, and needs to support decisions by people who don't understand the nuances of every sensor and machine.

John Harrington, co-founder of HighByte and former VP at Kepware Technologies (acquired by PTC), has spent his career solving industrial data integration challenges. His perspective on industrial DataOps addresses what actually breaks when organizations try to scale IoT and Industry 4.0 initiatives beyond proof-of-concept projects.

Here's what most organizations get wrong: they wire up point-to-point connections between systems, writing custom code for each integration. The first two connections work fine. By the tenth connection, nobody knows where data is flowing or why. You have a tangled web of custom integrations that's impossible to maintain, secure, or scale.

The successful approach treats data as a separate dimension from the ISA-95 application layers. Rather than forcing data to flow hierarchically through PLC to SCADA to MES to ERP, you route it directly from source to target through a central hub. This provides visibility, enables reuse, and makes security manageable.

Based on implementations across manufacturing operations worldwide, this guide provides a framework for building industrial data infrastructure that scales.

Industrial DataOps: Beyond IT DataOps

DataOps emerged in the IT world through companies like IBM and Hitachi to address data curation and delivery for analytics. Industrial DataOps adapts these concepts for manufacturing's unique challenges.

The core definition: Industrial DataOps is the practice of curating and delivering industrial data for analytics and visualization across diverse systems. It addresses how you prepare data from factory floors and deliver it to cloud systems, business intelligence tools, machine learning platforms, and other consumers.

Why industrial is different: IT DataOps assumes relatively clean data from transactional systems with defined schemas. Industrial DataOps deals with thousands of sensors and devices using hundreds of protocols, generating high-frequency data with quality issues, operated by people who understand their specific equipment but not necessarily the broader data ecosystem.

The integration explosion: Industry 4.0 and cloud adoption mean far more systems need access to industrial data. You're no longer just connecting sensors to SCADA systems. You're feeding cloud analytics, machine learning platforms, business intelligence tools, CMMS systems, ERP systems, and specialized applications—each with different data requirements.

The streamlining imperative: Without DataOps, each new system requires custom integration work. Implementation takes months. Scaling becomes impossible because every connection is a custom project. DataOps provides systematic approaches to data preparation and delivery that enable rapid adoption of new systems.

The key insight: you cannot scale industrial digital transformation through custom integrations. You need systematic data infrastructure.

Rethinking the ISA-95 Model for Data Flow

The Purdue model and ISA-95 standard remain valid for understanding application layers in manufacturing. But the assumption that data must flow hierarchically through these layers breaks in modern architectures.

The traditional hierarchy: ISA-95 defined layers from sensors to PLCs to SCADA/HMI to MES to ERP. Historically, data flowed upward through each layer. If you wanted sensor data in your ERP system, it traversed every intermediate layer.

Why this no longer works: Consider deploying vibration sensors for predictive maintenance. That data doesn't need to flow through PLCs—it's not used for machine control. It doesn't need SCADA visualization. The MES system doesn't consume it. Forcing it through these layers adds cost, complexity, and latency while burdening systems that already work well.

The new paradigm: Keep the ISA-95 layers for applications. They accurately describe the functional architecture of manufacturing systems. But treat data flow as a separate dimension. Connect data sources directly to their consumers, or route them through a single data hub.

If predictive maintenance analytics in Azure need vibration data, connect sensors to Azure directly. If ambient temperature monitoring feeds ERP resource planning, route it directly rather than through intermediate control systems.

Benefits of direct routing: You avoid adding load to validated control systems. You reduce latency for analytics. You cut implementation costs by eliminating unnecessary intermediate integrations. You can deploy new sensors for analytics without touching production control infrastructure.

The practical implementation: This requires a data hub that can consume from diverse sources (OPC servers, MQTT sensors, databases, REST APIs) and route to diverse targets (cloud platforms, analytics systems, enterprise applications). The hub handles standardization, normalization, and contextualization once rather than reimplementing transformations for each connection.

Think of it as networking evolution. Early computers connected point-to-point with cables. As scale increased, routers became essential. DataOps provides routing for industrial data flows.
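To make the hub-and-routes idea concrete, here is a minimal Python sketch. The DataHub class, source identifiers, and target callables are hypothetical illustrations rather than any product's API: sources publish into the hub, shared transformations run once, and the result fans out to every registered consumer.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class DataHub:
    """Central routing point: every data flow is declared here, not hard-wired."""
    transforms: List[Callable[[dict], dict]] = field(default_factory=list)
    routes: Dict[str, List[Callable[[dict], None]]] = field(default_factory=dict)

    def add_route(self, source_id: str, target: Callable[[dict], None]) -> None:
        # Declaring routes centrally keeps flows visible and easy to disable.
        self.routes.setdefault(source_id, []).append(target)

    def publish(self, source_id: str, payload: dict) -> None:
        for transform in self.transforms:
            payload = transform(payload)      # standardize/normalize once
        for deliver in self.routes.get(source_id, []):
            deliver(payload)                  # fan out to each consumer

# One source feeding two consumers without any point-to-point custom code.
hub = DataHub(transforms=[lambda p: {**p, "units": "psi"}])
hub.add_route("opc:pump-12b", lambda p: print("to Azure:", p))
hub.add_route("opc:pump-12b", lambda p: print("to CMMS:", p))
hub.publish("opc:pump-12b", {"tag": "pressure", "value": 256})
```

Real hubs add protocol connectors, buffering, and security controls, but the routing pattern stays the same.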

Data Quality: Why It Matters Beyond the OT Domain

Data quality concerns in industrial environments differ fundamentally from IT data quality. Understanding why helps you build appropriate quality controls.

The domain shift: Previously, OT data was consumed by operators, SCADA administrators, and process engineers who understood equipment nuances. They knew which sensors occasionally glitched, when machines were offline, and what constituted valid data for specific equipment.

Now you're pushing that data to business analysts who compare across facilities, data scientists building models, and executives making strategic decisions. These consumers don't understand equipment quirks. They assume data validity.

Quality dimensions in industrial data:

Usability: Is data structured so target systems can consume it? If machine state is numeric but consumers need textual descriptions (Running, Stopped, Maintenance), the numeric value is unusable regardless of accuracy.

Accuracy: Sensors can be miscalibrated, experience drift, or provide readings outside valid ranges. Unlike IT systems where data entry is validated, sensors report physical reality—including when they malfunction.

Availability: Industrial environments lose connectivity regularly. Networks fail, devices go offline, and firewalls block communication. Quality isn't just "is this value correct" but "is this value available when needed."

Consistency: When comparing across equipment, inconsistent units of measure or data formats make comparisons meaningless. One facility measuring pressure in PSI and another in kilopascals creates analytics problems.

The decision impact: Business decisions based on poor-quality industrial data affect operations across enterprises. Resource allocation, capital investment, and strategic planning all depend on accurate visibility into operations. Quality issues that operators could work around locally become critical when data informs enterprise decisions.

The quality control approach: Implement quality checks at the data hub where you can standardize quality metrics across sources. Flag data with quality indicators when sensors report errors, connections are intermittent, or values fall outside expected ranges. Let consumers decide how to handle low-quality data rather than silently passing it through.
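As a rough illustration of that flagging approach, the sketch below wraps each reading in a quality indicator. The range limits and label values are assumptions for the example; the point is that missing or implausible readings travel onward marked, not silently dropped or trusted.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class QualifiedValue:
    value: Optional[float]
    quality: str                      # "good", "uncertain", or "bad"
    reason: str = ""
    timestamp: Optional[datetime] = None

def check_pressure(raw: Optional[float], lo: float = 0.0, hi: float = 150.0) -> QualifiedValue:
    """Attach a quality flag instead of silently passing questionable data through."""
    now = datetime.now(timezone.utc)
    if raw is None:
        # Device offline or connection dropped: report the gap explicitly.
        return QualifiedValue(None, "bad", "no reading available", now)
    if not (lo <= raw <= hi):
        # Physically implausible value: flag it and let the consumer decide.
        return QualifiedValue(raw, "uncertain", f"outside expected range {lo}-{hi}", now)
    return QualifiedValue(raw, "good", "", now)
```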

Standardization, Normalization, and Contextualization

These three concepts form the foundation of making industrial data usable at scale. Understanding what each means practically helps you implement effective data transformation.

Standardization (creating consistent data models): You have multiple pumps from different manufacturers. Each reports data differently—different tag names, different measurement approaches, different information available. Standardization defines a common data model for all pumps.

For predictive maintenance, you standardize: manufacturer, location, current pressure, operational status, flow rate. Every pump provides this standard data set regardless of how the underlying device reports it.

Normalization (making values comparable): Even standardized data models need normalized values. One pump reports pressure in PSI, another in kilopascals. PLCs often scale values for easier processing—representing 0-100 PSI as 0-1000 for integer arithmetic.

Normalization converts everything to consistent units and formats. All pressure readings become PSI (or all become kilopascals). Scaled values are converted to actual units. This enables meaningful comparisons and analytics across equipment.
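A small sketch of what this looks like in practice, assuming a PLC that represents 0-100 PSI as 0-1000 and a second pump reporting in kilopascals; the function names and spans are illustrative.

```python
PSI_PER_KPA = 0.1450377  # standard conversion factor

def descale(raw: int, raw_span: float = 1000.0, eng_span: float = 100.0) -> float:
    """Convert a PLC's scaled integer (0-1000) back to engineering units (0-100 PSI)."""
    return raw * eng_span / raw_span

def to_psi(value: float, unit: str) -> float:
    """Normalize every pressure reading to PSI so readings compare directly."""
    unit = unit.lower()
    if unit == "psi":
        return value
    if unit in ("kpa", "kilopascal", "kilopascals"):
        return value * PSI_PER_KPA
    raise ValueError(f"unsupported unit: {unit}")

print(to_psi(700, "kPa"))           # ~101.5 PSI
print(to_psi(descale(640), "psi"))  # 64.0 PSI from a scaled register value of 640
```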

Contextualization (adding meaning): A pressure reading of 256 PSI means nothing without context. Which pump? Which facility? Which production line? What's being produced?

Contextualization adds metadata that provides meaning: Factory 2, Area 3, Line 5, Pump 12B, inflow to boiler section, running product SKU 45678. Now that 256 PSI reading enables analysis, troubleshooting, and comparison.

PLCs and control systems don't need this context—they operate locally on specific equipment. But once data leaves the OT domain for analytics, contextualization becomes essential.
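Mechanically, contextualization can be as simple as merging asset metadata into each reading, as in this sketch. The enterprise name is invented for the example; the rest of the hierarchy follows the Factory 2, Area 3, Line 5, Pump 12B description above.

```python
def contextualize(reading: dict, context: dict) -> dict:
    """Attach location and production context so a bare value becomes analyzable."""
    return {**reading, **context}

raw = {"measurement": "pressure", "value": 256, "units": "psi"}
asset_context = {
    "enterprise": "ExampleCo",            # invented name, for illustration only
    "site": "Factory 2",
    "area": "Area 3",
    "line": "Line 5",
    "equipment": "Pump 12B",
    "role": "inflow to boiler section",
    "product_sku": "45678",
}
print(contextualize(raw, asset_context))
```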

Why centralize these transformations: You want to standardize, normalize, and contextualize once at a central point rather than reimplementing for each consumer. When Azure analytics need pump data, they get standardized, normalized, contextualized information. When your CMMS system needs it, it receives the same transformed data. When business intelligence tools consume it, no additional transformation is needed.

This centralized approach enables reuse and maintains consistency. Every consumer gets the same high-quality data rather than each reimplementing transformations differently.

Data Contracts, Semantics, and Governance

Successful data integration at scale requires systematic approaches beyond point-to-point connections. Three concepts provide this structure: data contracts, data semantics, and data governance.

Data contracts (defining what's needed): Before integrating systems, define the contract: What use case are you solving? What data does the target system need? How should it be structured? What standards apply? How is data transferred?

For predictive maintenance in Azure, the contract specifies: which equipment types, what measurements, at what frequency, in what format, using which Azure services. This contract drives everything else.

Similarly, define source system contracts: What data is available? How is it accessed? What are the protocols and security requirements? Contracts at both ends enable systematic integration.
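One lightweight way to capture such a contract is as a structured record that both sides can review and version, sketched below; every field value is an assumed example, not a prescription.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class DataContract:
    """A reviewable, versionable statement of what a target system needs."""
    use_case: str
    equipment_types: List[str]
    measurements: List[str]
    frequency_seconds: int
    payload_format: str
    transport: str

predictive_maintenance = DataContract(
    use_case="predictive maintenance in Azure",
    equipment_types=["pump"],
    measurements=["vibration", "pressure", "operational_status", "flow_rate"],
    frequency_seconds=10,                      # assumed cadence for the example
    payload_format="JSON per the standard pump model",
    transport="MQTT over TLS to the cloud ingestion endpoint",
)
```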

Data semantics (transformation logic): With contracts defined, semantics describe the transformation: How do you align data from source format to target format? What calculations are needed? What context must be added? How do you handle missing or invalid data?

Systematizing these transformations rather than writing custom code for each integration enables reuse. When you add a second facility with similar equipment, you reuse transformation logic rather than starting over.
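One way to systematize that reuse is a registry of transformation functions keyed by equipment type, so a second facility with similar pumps picks up the same logic; the tag names, conversion factor, and state codes below are assumptions for illustration.

```python
from typing import Callable, Dict

SEMANTICS: Dict[str, Callable[[dict], dict]] = {}

def register(equipment_type: str):
    """Register one transformation per equipment type so it can be reused everywhere."""
    def wrap(fn: Callable[[dict], dict]) -> Callable[[dict], dict]:
        SEMANTICS[equipment_type] = fn
        return fn
    return wrap

@register("pump")
def pump_semantics(source: dict) -> dict:
    # Align source tags to the contract: convert units, decode the numeric state,
    # and emit only what the contract asks for.
    state_names = {0: "Stopped", 1: "Running", 2: "Maintenance"}
    return {
        "pressure_psi": source["pressure_kpa"] * 0.1450377,
        "status": state_names.get(source["state"], "Unknown"),
    }

# The same registered function serves every pump at every facility.
print(SEMANTICS["pump"]({"pressure_kpa": 700, "state": 1}))
```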

Data governance (maintaining over time): The hardest challenge isn't initial setup—it's maintenance as systems evolve. Equipment changes, PLCs get reprogrammed, products shift, networks are reconfigured, and target systems add new capabilities.

Governance defines who controls what: Who approves changes to data models? How are source system changes communicated? What's the process for adding new data to contracts? How do you ensure changes don't break existing consumers?

The abstraction principle: These three concepts enable abstraction. Rather than managing millions of individual data tags, you manage contracts, semantic transformations, and governance processes. This makes industrial data infrastructure maintainable at scale.

Without systematic approaches, you end up with tangled custom integrations that nobody fully understands. With contracts, semantics, and governance, you have sustainable data infrastructure.

Data Modeling: The Key to Scaling

Managing millions of data tags individually is impossible. Data modeling creates abstractions that make large-scale deployments manageable.

The scale challenge: Modern factories have tens of thousands to millions of data points. You're not increasing OT or IT staff proportionally. You cannot manage this through manual one-off configuration of individual tags.

The modeling solution: Create abstract models representing equipment types. Rather than managing 1,000 tags from a specific pump, you manage a pump model. The pump model defines what data pumps provide and how to access it. You instantiate this model for each physical pump.

Now instead of managing millions of tags, you manage dozens of models and their instances. Do you want to send pump data to Azure? Configure it once at the model level, and all pump instances flow to Azure. No individual tag management needed.
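In code form, the split between an abstract model and its instances might look like the sketch below; the class names, attribute list, and the OPC UA node identifier are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class EquipmentModel:
    """Abstract model: defines once what data this equipment type provides and where it goes."""
    name: str
    attributes: List[str]
    targets: List[str] = field(default_factory=list)   # configured once, at the model level

@dataclass
class EquipmentInstance:
    model: EquipmentModel
    asset_id: str
    tag_map: Dict[str, str]   # model attribute -> device-specific tag address

pump_model = EquipmentModel(
    name="pump",
    attributes=["manufacturer", "location", "pressure", "operational_status", "flow_rate"],
    targets=["azure-predictive-maintenance"],   # every instance inherits this routing
)

pump_12b = EquipmentInstance(
    model=pump_model,
    asset_id="factory2/area3/line5/pump12b",
    tag_map={"pressure": "ns=2;s=PLC1.Pump12B.PT01"},   # illustrative OPC UA node id
)
```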

ISA-95 as foundation: Most implementations start with ISA-95 hierarchy: enterprise, site, area, line, work cell, equipment. This structure naturally organizes industrial data and maps to how manufacturing operations are organized.

You create models at each level. The enterprise model contains sites. Site models contain areas. Area models contain lines. This hierarchy provides structure for organizing potentially millions of data points.

Device-level models: Beyond the facility hierarchy, you model devices and equipment. Various industry groups define standards: MTConnect for machine tools, VDMA specifications in Europe, BACnet for building automation. These provide starting points so you aren't building from scratch.

The practical approach: Start with established standards closest to your industry and use case. Customize based on specific needs. Don't try to model everything perfectly upfront—model what you need for current use cases, then expand incrementally.

Design reuse: Models enable design reuse. When you add a new facility or production line, you reuse existing models rather than starting over. When you integrate a new analytics system, it consumes existing models without requiring custom development.

Avoiding data swamps: Without models, data lakes become data swamps—filled with information but unusable because nobody understands structure or relationships. Models provide the semantic layer that makes data immediately usable by target systems.

Data modeling is not optional for scale. It's the difference between managing industrial data infrastructure and drowning in it.

Security Through Visibility and Control

Industrial data security requires more than encrypting communications. It requires knowing where data flows and controlling those flows centrally.

The visibility problem: Many organizations have implemented Industry 4.0 through proof-of-concept projects that grew organically. They wired up one system with custom code, then another, then another. Now they have mysterious computers in closets that nobody understands but everyone is afraid to turn off.

This creates security vulnerabilities. You can't secure what you don't understand. If you don't know where data flows or why, you cannot assess risk or respond to breaches.

The routing approach: DataOps provides a routing point for industrial data. Rather than point-to-point connections that nobody fully maps, data flows through a central hub. The hub ingests from sources, transforms, and routes to targets.

This provides visibility: You know every data flow. You can see what's connected to what. When you detect a security issue, you can immediately shut down affected routes. When you stop using a vendor's cloud service, you disable that route rather than hunting for wherever the connection might exist.

Protocol selection for security: Choose protocols based on security characteristics. MQTT clients connect outbound to a broker and publish over that connection, so devices never have to accept incoming connection requests that would require firewall openings. This is more secure than protocols where external systems query internal devices.

REST APIs and other request-response patterns require firewall configurations that expose internal systems. Pub-sub patterns limit exposure by keeping connections outbound-only from secure zones.
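A minimal publishing sketch using the open-source paho-mqtt client illustrates the outbound-only pattern; the broker address, credentials, and topic are placeholders, and the Client constructor arguments differ between paho-mqtt 1.x and 2.x.

```python
import json
import paho.mqtt.client as mqtt

# paho-mqtt 2.x requires the callback API version; on 1.x use mqtt.Client() with no argument.
client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2)
client.tls_set()                                    # encrypt the outbound session
client.username_pw_set("edge-node-07", "example-credential")

# The edge node initiates the connection outward to the broker, so no inbound
# firewall rule toward the OT zone is required.
client.connect("broker.example.com", 8883)
client.loop_start()

payload = {"equipment": "pump12b", "pressure_psi": 101.5, "quality": "good"}
client.publish("factory2/area3/line5/pump12b/telemetry", json.dumps(payload), qos=1)

client.loop_stop()
client.disconnect()
```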

Segmentation and federation: Use network segmentation and DMZs appropriately. Deploy data hubs at network boundaries. One hub in the OT zone collects from devices. It federates to a hub in the DMZ. That hub routes to cloud targets. Each boundary provides security checkpoints.

Cloud as secure target: Security thinking has evolved. Early concerns about cloud security made organizations reluctant to use public clouds. Now many recognize that Azure, AWS, and Google have stronger security teams than most enterprises can deploy internally.

For remote workers needing operational visibility, cloud systems provide more secure access than VPNs into internal networks. Data in the cloud, accessed through cloud security controls, is often more secure than exposing internal systems.

The mapping imperative: Security fundamentally requires mapping your entire data infrastructure. Know what data exists, where it flows, which systems consume it, and why. Eliminate mysterious connections and unknown systems.

With complete visibility through a DataOps hub, you can secure industrial data effectively. Without it, you're hoping nothing breaks.

Conclusion

Industrial digital transformation fails when organizations treat data integration as a series of point-to-point connections. The first few connections work. Scaling breaks. Security becomes impossible to manage. Maintenance consumes increasing resources. Eventually you have a tangled web nobody understands.

Success requires treating industrial data as infrastructure that needs systematic architecture, not ad-hoc integration projects. That means implementing DataOps practices adapted for industrial environments.

The foundation is routing data through central hubs rather than creating point-to-point connections. This provides visibility—you know where data flows and why. It enables reuse—transform once rather than reimplementing for each consumer. It makes security manageable—you can see and control all data flows from a central point.

The next layer is data quality, standardization, normalization, and contextualization. Industrial data from OT systems needs transformation before business consumers can use it effectively. Implement these transformations centrally and systematically.

The advanced layer is data modeling that enables scale. You cannot manage millions of tags individually. Models create abstractions representing equipment types that you instantiate across facilities and production lines.