5.2 Data Standardization & Data Harmonization

Across humanitarian and development operations, data is collected by multiple actors using diverse tools, definitions, and formats. While this reflects the diversity of programs and contexts, it also creates significant challenges for individual organizations' data and for the sector at large, for instance:

  • Inconsistent service definitions or participant categories (e.g., one team records β€œinfant” while another uses β€œ0–4 years” for targeting nutrition support or early childhood services).

  • Different geographic references (e.g., informal names vs. P-codes)

  • Conflicting formats (e.g., "male, female" vs. "1, 2" for sex)

These inconsistencies complicate comparability of data, compromise the reliability of analysis, reduce the usefulness of shared data, and undermine coordination.

To address this, two complementary approaches can be applied. In the context of this handbook, these are defined as follows:

Data standardization refers to the development and application of common formats, definitions, structures, and classifications β€” before data is collected.

Example: A consortium agrees in advance to use the same age group categories (e.g., 0–4, 5–17, 18–59, 60+) and gender codes (#sex = male/female) across all household surveys, ensuring that collected data can be immediately aggregated and compared.

Data harmonization refers to the process of aligning non-standardized data from different sources to enable comparison, integration, and use β€” after data is collected.

Example: Three organizations conduct separate assessments using different terms for shelter types (e.g., "tent", "emergency shelter", "temporary structure"). Harmonization involves mapping these values to a common classification system.
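A mapping like the one in this example can be expressed directly in code. The sketch below, using illustrative labels and a hypothetical target classification, shows how raw shelter values might be mapped to a common scheme, with unknown values flagged for review rather than silently dropped:

```python
# Hypothetical value map harmonizing shelter-type labels from three
# organizations into one common classification (labels are illustrative).
SHELTER_MAP = {
    "tent": "emergency_shelter",
    "emergency shelter": "emergency_shelter",
    "temporary structure": "transitional_shelter",
}

def harmonize_shelter(value: str) -> str:
    """Map a raw shelter label to the agreed classification.

    Unknown labels are flagged for manual review instead of being
    silently discarded, so the mapping table can be extended over time.
    """
    return SHELTER_MAP.get(value.strip().lower(), "UNMAPPED_REVIEW")
```

Keeping the mapping as an explicit table (rather than scattered if-statements) makes it easy to document, share, and update alongside the harmonized dataset.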

By standardizing data (prospectively) and harmonizing data (retrospectively), humanitarian actors can:

  • Deliver services more equitably across locations and teams

  • Support smoother handovers and continuity in case management

  • Improve targeting and reduce exclusion errors in participant selection

  • Aggregate and compare data across programs, partners and geographies

  • Reduce duplication and improve efficiency

  • Ensure consistency in donor reporting and inter-agency coordination

A common misconception is that adopting global data standards reduces flexibility and responsiveness to local needs. While this risk exists, it reflects a false dichotomy: that one must choose between global alignment and local relevance. In reality, well-designed standardization frameworks can accommodate both.

This chapter presents best practices, practical tools, and a step-by-step approach to achieving standardization and harmonization in program data workflows.


5.2.1 Three Dimensions of Data

To enable comparability and interoperability across sources, data harmonization and standardization must address different types of heterogeneity, as outlined by Cheng et al. (2024). These include:

  • Semantics (i.e. intended meaning): Harmonizing semantics requires understanding the definitions and scope of each variable. For example, if no common standard exists, a variable labeled "youth" may refer to ages 15–24 in one dataset and 18–30 in another. Harmonization requires reviewing whether datasets measure the same concepts. It may involve grouping or reclassifying values to ensure conceptual alignment across datasets. Similarly, different terms (e.g. "adolescents" and "teenagers") may mean the same thing in practice, but require validation to confirm equivalence.

  • Syntax (i.e. data format): This refers to the technical encoding of the data. Datasets may come in multiple formats such as CSV, Excel, JSON, XML, or HTML. Even when the data content is conceptually similar, these formats require conversion and processing before they can be used in combination. For example, survey results exported from KoBoToolbox in XLS format may need to be transformed into CSV to align with other data in statistical software.

  • Structure (i.e. conceptual schema): This involves how variables and observations are organized. Structured data (e.g. tables where rows represent individual beneficiaries) differ significantly from unstructured data (e.g. free text in qualitative interviews). Even among structured datasets, variations exist in design. For instance, some assessments may capture an event in a single row with start and end dates as separate variables, while others record daily entries across multiple rows. Harmonizing such differences may require reshaping data or creating unique identifiers to align events across rows and formats.

By addressing these three dimensions—semantics, syntax, and structure—organizations can ensure more accurate aggregation of data from diverse sources. Addressing them also improves the transparency and reproducibility of analysis.
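The structural dimension in particular often requires reshaping data. A minimal sketch, using an invented daily distribution log, of converting a dataset with one row per site per day into one row per event with explicit start and end dates, so it can be aligned with a dataset that already stores events in single rows:

```python
from collections import defaultdict

# Hypothetical daily log: one row per site per day (dates are ISO 8601).
daily_rows = [
    {"site": "Camp A", "date": "2024-03-01"},
    {"site": "Camp A", "date": "2024-03-02"},
    {"site": "Camp B", "date": "2024-03-05"},
]

def to_event_rows(rows):
    """Reshape daily rows into one row per event with start/end dates."""
    dates_by_site = defaultdict(list)
    for row in rows:
        dates_by_site[row["site"]].append(row["date"])
    # ISO 8601 date strings sort chronologically, so min/max suffice.
    return [
        {"site": site, "start_date": min(dates), "end_date": max(dates)}
        for site, dates in sorted(dates_by_site.items())
    ]
```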


5.2.2 Standardization - yes or no?

Not all data are equally suitable for standardization. This section therefore briefly assesses the types of data commonly captured in humanitarian operations according to their suitability for standardization.

What should be standardized?

These data types are structural and operationally essential, and their standardization allows for streamlined coordination, reporting, and integration.

  • Participant status classifications: Terms like registered, eligible, enrolled, served, and referred should be standardized to ensure consistent tracking across services and partners.

  • Service types and delivery modalities: Define standardized labels for service modalities (e.g. in-kind, cash-based, remote, mobile, static site) to unify implementation and improve coordination.

  • Activity codes and program event types: Agree on consistent naming and coding for core activities such as distributions, case follow-ups, training sessions, or site visits to streamline data analysis and archiving.

  • Referral pathways and outcomes: Use standard codes or tags to describe referral reasons, destination services, and outcomes (e.g., referred to health, refused, completed, in progress).

  • Programme document metadata: Standardize naming conventions and metadata fields (e.g. for intake forms, service logs, or progress checklists) to improve file traceability and integration.

  • Administrative divisions and locations: Use standardized geographic coding systems (e.g. P-Codes, ISO-Codes). This supports data aggregation, mapping, and alignment with national and inter-agency systems.

  • Population groups and demographics: Standard classification systems like the Washington Group Questions improve consistency and inclusivity.

  • Indicators and metrics: Standardizing indicators (e.g. via Sphere or cluster frameworks) is essential for evaluating impact, ensuring comparability, and meeting donor reporting requirements.

  • Sectors and activities: Referring to sectors and activities in a consistent way avoids confusion and misunderstandings, and facilitates the aggregation of programmatic data.

  • Time formats: Using a common format for time and date (e.g. ISO 8601 - YYYY-MM-DD) is critical for temporal alignment across tools and databases, and supports timeline analysis.

These data types are required across nearly all humanitarian operations. Standardization reduces duplication, errors, and processing time, and enables collaboration across actors and systems.
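The time-format bullet above can be put into practice with a small normalizer. The list of input formats here is an assumption about what partner exports might contain, not a fixed standard; the output follows the ISO 8601 recommendation (YYYY-MM-DD):

```python
from datetime import datetime

# Illustrative set of date formats observed across partner exports.
KNOWN_FORMATS = ["%d/%m/%Y", "%Y-%m-%d", "%d.%m.%Y", "%d %b %Y"]

def to_iso8601(raw: str) -> str:
    """Parse a date string in any known format and return YYYY-MM-DD."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    # Fail loudly rather than guessing: ambiguous dates need manual review.
    raise ValueError(f"Unrecognized date format: {raw!r}")
```

Failing on unrecognized formats, instead of guessing, avoids silently swapping day and month in ambiguous values.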

Where to be cautious?

These data types vary in meaning depending on context. Standardization is possible, but should involve consultation, documentation, and often adaptation to local nuances.

  • Vulnerability classifications: Definitions of vulnerability (e.g. β€œfemale-headed household”, β€œat-risk youth”) differ across contexts. Over-simplification can obscure local realities.

  • Population groups and ethnicity: Labels like "IDP", "returnee", or "migrant" are often used inconsistently and carry different social, political and legal meaning. Similarly, the use of ethnic categories requires strong ethical justification and contextual sensitivity, respecting how communities self-identify.

  • Local languages and terminologies: Translation and interpretation must be done carefully to preserve meaning and avoid misrepresentation.

These categories are often politically sensitive or locally contested. Rigid standardization may reinforce exclusion or lead to incorrect conclusions if applied without contextual adaptation.

What not to standardize?

Some data types pose protection risks or are too fluid or contextual to benefit from standardization.

  • Protection-sensitive personal data: Legal status, religious affiliation, and exact geolocations of individuals are highly sensitive information. Standardization could increase exposure risks if such data are not handled securely.

  • Local expressions and perceptions: These are deeply embedded in culture and context. Standardizing them risks erasing local knowledge or imposing inappropriate categories.

  • Context-specific phenomena: Events, traditional coping mechanisms, or local practices that don't map well onto global categories.

Protection, dignity, and context-sensitive programming take precedence over data comparability. In such cases, data should be treated with flexibility, anonymization, or used qualitatively.

As emphasized in the IASC Operational Guidance on Data Responsibility and chapter 1.2 of this Handbook, data management practices, including standardization, should always balance technical value with ethical and contextual relevance. When in doubt, prioritize the rights, safety, and perspectives of affected communities.


5.2.3 A Step-by-Step Guide

This section provides a step-by-step guide for data standardization (before data collection) and data harmonization (after data collection, when standardization has not occurred). These steps ensure alignment with consistent definitions, formats, structures, and classifications across datasets, while accounting for key dimensions of data: syntax, structure, and semantics.

Data Standardization (if data has not been collected yet)

Step 1: Identify Standardization Needs

  • Compare data needs for internal and external reporting, using a mapping table. Review the key programme service delivery workflows and assess whether data formats, categories, and structures are consistent across teams and partners. Identify where standardization can improve alignment across forms, streamline coordination, and reduce duplication in day-to-day programme operations.

Step 2: Define Standards

  • Define naming conventions for variables as well as the data formats and disaggregation categories with relevant colleagues, using a taxonomy.

Step 3: Communicate Standards

  • Raise awareness about data standards (both organization-specific and inter-agency) and make them easily available to ensure they are broadly utilized.

Step 4: Utilize Standards

  • Choose and centralize data collection tools (e.g., KoBoToolbox, SurveyCTO) and design assessments and/or data collection forms that adhere to data standards in terms of semantics and schema.

Step 5: Validate & Revisit Standards

  • Monitor whether collected data adheres to agreed standards and collectively expand and revisit the data dictionary, contributing to data literacy within the organization.

Data Harmonization (if data has already been collected)

Step 1: Identify Harmonization Needs

  • Compare variables across datasets, using a mapping table to identify differences and overlaps.
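A simple mapping table can be produced programmatically by comparing column names across datasets. The dataset names and columns below are invented for illustration:

```python
def variable_mapping(datasets):
    """For each variable, list which datasets contain it.

    The result doubles as a mapping table: variables present in only
    one dataset are candidates for renaming or semantic review.
    """
    all_vars = sorted({v for cols in datasets.values() for v in cols})
    return {
        var: sorted(name for name, cols in datasets.items() if var in cols)
        for var in all_vars
    }

# Hypothetical column inventories from two partner exports.
datasets = {
    "org_a": ["sex", "age", "shelter_type"],
    "org_b": ["gender", "age", "shelter"],
}
```

Here, "sex"/"gender" and "shelter_type"/"shelter" would show up as non-overlapping columns, prompting the semantic review described above.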

Step 2: Syntax Harmonization

  • Convert all datasets to a harmonized file format (e.g., CSV, XLSX) and align the data formats of columns (e.g., dates formatted consistently as YYYY-MM-DD, in line with ISO 8601).

Step 3: Schema Harmonization

  • Reshape datasets if needed (e.g., converting wide to long format) to ensure a consistent row-per-unit structure (e.g., one row per person, household, or event).

Step 4: Semantic Harmonization

  • Harmonize values where needed (e.g., replacing "Male, Female" with "1, 2") and group or adjust categories to match common definitions (e.g., harmonize age groups), using the mapping done in Step 1.
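Step 4 can be sketched in a few lines. The sex codes "1"/"2" and the age bands follow the examples used in this chapter; applying them per record keeps the transformation transparent and reviewable:

```python
# Agreed codes from the standard (illustrative, per this chapter's examples).
SEX_MAP = {"male": "1", "female": "2"}

def age_to_group(age: int) -> str:
    """Regroup exact ages into the agreed bands (0-4, 5-17, 18-59, 60+)."""
    if age < 5:
        return "0-4"
    if age < 18:
        return "5-17"
    if age < 60:
        return "18-59"
    return "60+"

def harmonize_record(record: dict) -> dict:
    """Apply the semantic mappings to one raw record."""
    return {
        "sex": SEX_MAP[record["sex"].strip().lower()],
        "age_group": age_to_group(record["age"]),
    }
```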

Step 5: Merge, Validate, and Share Harmonized Dataset

  • Merge datasets where relevant, apply HXL tags, and share the harmonized dataset with accompanying mapping tables and change logs.
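A minimal sketch of Step 5, concatenating two harmonized CSV extracts, recording provenance in a source column, and inserting an HXL tag row under the header. The column set and tag usage follow the examples in this chapter (e.g. #sex); #meta+source is the standard HXL pattern for metadata columns:

```python
import csv
import io

def merge_with_hxl(csv_texts, hxl_tags):
    """Concatenate harmonized CSV extracts and add an HXL tag row.

    csv_texts: list of (csv_string, source_name) pairs with identical columns.
    hxl_tags:  one HXL tag per original column, in order.
    """
    rows, fieldnames = [], None
    for text, source in csv_texts:
        reader = csv.DictReader(io.StringIO(text))
        fieldnames = fieldnames or reader.fieldnames
        for row in reader:
            row["source"] = source  # keep provenance for the change log
            rows.append(row)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames + ["source"])
    writer.writeheader()
    # HXL convention: tag row sits directly beneath the header row.
    out.write(",".join(hxl_tags + ["#meta+source"]) + "\r\n")
    for row in rows:
        writer.writerow(row)
    return out.getvalue()
```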

Even with standardized data collection, minor harmonization is often required before merging datasets (e.g. when different tools are used). Therefore, it is helpful to start early to define standards, minimizing time-consuming harmonization later on, and to use what is already in place (e.g. organization-specific or inter-agency standards).

While data harmonization can be done with low technical requirements, for example using Excel, tools like Python and R might be more suitable, especially for recurring tasks that can be automated. For more information on automation, see chapter 5.3.


5.2.4 Taxonomies: A practical tool

A taxonomy is a structured, often hierarchical list of categories or conceptsβ€”such as sectors, intervention types, or vulnerability classificationsβ€”used to consistently label and group information. Working with taxonomies helps to clearly define the meaning (semantics) of specific terms and their relationship to each other.

NRC, for example, has recently developed a taxonomy for their "Core Competencies", which defines activities (e.g. "Water Trucking"), activity types (e.g. "Recurring Distribution"), and other related elements, organized in a hierarchical order (see table below). The taxonomy then serves as a global reference for terminologies and relationships between concepts and can be used, for example, when building surveys or harmonizing data. This approach:

  • Prevents inconsistent use of terminologies and thereby reduces ambiguity in reporting

  • Enables automated data transformation and aggregation (e.g. consolidating activities by type)

Exemplary taxonomy:

| Sector | Activity | Activity Type | Indicators |
| --- | --- | --- | --- |
| WASH | Water Trucking | Recurring Distribution | # of individuals reached (estimate); # of liters provided per person |
| WASH | Borehole Drilling | Construction | # of individuals with access to water points (estimate); # of liters provided per person per day; # of water sources constructed or rehabilitated |
| ... | ... | ... | ... |

The above table is exemplary. Columns for definitions, modalities, or other relevant organization- or application-specific elements could be added as needed.
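A taxonomy like this becomes directly usable once expressed in machine-readable form. The sketch below mirrors the exemplary table; the dictionary structure is an illustrative assumption, not NRC's actual schema, and shows the automated aggregation mentioned above (consolidating activities by type):

```python
# Machine-readable version of the exemplary taxonomy (structure is
# illustrative; names mirror the table above).
TAXONOMY = {
    "Water Trucking": {"sector": "WASH", "activity_type": "Recurring Distribution"},
    "Borehole Drilling": {"sector": "WASH", "activity_type": "Construction"},
}

def consolidate_by_type(activity_counts):
    """Roll per-activity counts up to activity type via the taxonomy."""
    totals = {}
    for activity, count in activity_counts.items():
        a_type = TAXONOMY[activity]["activity_type"]
        totals[a_type] = totals.get(a_type, 0) + count
    return totals
```

Because every activity resolves to exactly one activity type, aggregation is deterministic, which is precisely what inconsistent free-text labels prevent.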

Taxonomies are also a useful tool to map global standards against context-specific categories, practically resolving the false dichotomy of having to choose between global alignment and local relevance.


5.2.5 Governing Data Standards

Once data standards have been defined (in the form of taxonomies or otherwise), they need to be applied, revised, and adapted. Hence, the success of standardization efforts depends not only on technical design, but on clear governance structures. Without defined responsibilities, even well-developed standards may be inconsistently applied or quickly eroded. A RACI Matrix can help to clearly identify and assign roles and responsibilities:

| RACI Role | Application in Standardization |
| --- | --- |
| Responsible | Who defines and maintains the standard (e.g. global IM unit) |
| Accountable | Who ensures adherence (e.g. local IM officer) |
| Consulted | Who should contribute to standard design (e.g. local partners, sector specialists) |
| Informed | Who should be notified of updates (e.g. field teams, data entry staff) |

Integrating data standards in a broader IM governance framework matters because it:

  • Ensures accountability for creating, reviewing, and updating standards.

  • Clarifies who must use the standards and how compliance is monitored.

  • Builds ownership among data producers, users, and decision-makers.

For a broader take on IM governance, see chapter 1.5.


5.2.6 Ongoing Efforts

The field of data standardization and harmonization is evolving rapidly, driven by increasing demand for interoperability, automation, and coordination across humanitarian actors. Notably, the Information Management Working Group (IMWG) Data Standards Subgroup is developing guidance and tools to:

  • Define common humanitarian data elements and schemas

  • Align indicator definitions across sectors and agencies

  • Promote metadata standardization and data documentation best practices

  • Support the interoperability of reporting tools and systems

Aside from that, there is a trend toward adopting platforms and information systems that support:

  • Modular data schemas that can be flexibly adapted and reused

  • Open APIs for real-time data exchange between systems

  • Federated models that allow decentralised data sharing while maintaining common standards

The functionalities of these platforms (e.g. ActivityInfo, Oort, or organization-specific systems like NRC's CORE) can reduce the manual burden of harmonization and help standardize incoming data streams more efficiently.


REFERENCES & FURTHER READINGS

Cheng et al. (2024). A general primer for data harmonization. Sci Data 11, 152.

OCHA (2025). IM Toolbox - Common Operational Datasets (CODs).

OCHA (2025). IM Toolbox - Global IMWG Data Standards SubGroup.

OCHA (2024). IM Toolbox - P-Codes.

OCHA (n.d.). The Humanitarian Exchange Language (HXL).

OCHA (n.d.). The Humanitarian Data Exchange (HDX).

CartONG (2021). Checklist of key considerations to keep in mind in order to select a new digital data collection tool in a responsible way.

IATI (n.d.). International Aid Transparency Initiative.

IFRC (2023). Data Playbook Toolkit - Should We Apply Standards to Our Data?.

ODI (n.d.). Open Standards for Data.

Sphere (2018). The Sphere Handbook.
