NIST Big Data Interoperability Framework: Volume 6, Reference Architecture

NIST Special Publication 1500-6r2

NIST Big Data Interoperability

Framework:

Volume 6, Reference Architecture

Version 3

NIST Big Data Public Working Group

Definitions and Taxonomies Subgroup

This publication is available free of charge from:

https://doi.org/10.6028/NIST.SP.1500-6r2

Information Technology Laboratory

National Institute of Standards and Technology

Gaithersburg, MD 20899

October 2019

U.S. Department of Commerce

Wilbur L. Ross, Jr., Secretary

National Institute of Standards and Technology

Walter Copan, NIST Director and Undersecretary of Commerce for Standards and Technology

NIST BIG DATA INTEROPERABILITY FRAMEWORK: VOLUME 6, REFERENCE ARCHITECTURE

National Institute of Standards and Technology (NIST) Special Publication 1500-6r2

75 pages (October 2019)

NIST Special Publication series 1500 is intended to capture external perspectives related to NIST standards, measurement, and testing-related efforts. These external perspectives can come from industry, academia, government, and others. These reports are intended to document external perspectives and do not represent official NIST positions.

Certain commercial entities, equipment, or materials may be identified in this document to describe an experimental procedure or concept adequately. Such identification is not intended to imply recommendation or endorsement by NIST, nor is it intended to imply that the entities, materials, or equipment are necessarily the best available for the purpose.

There may be references in this publication to other publications currently under development by NIST in accordance with its assigned statutory responsibilities. The information in this publication, including concepts and methodologies, may be used by federal agencies even before the completion of such companion publications. Thus, until each publication is completed, current requirements, guidelines, and procedures, where they exist, remain operative. For planning and transition purposes, federal agencies may wish to closely follow the development of these new publications by NIST.

Organizations are encouraged to review all publications during public comment periods and provide feedback to NIST. All NIST publications are available at http://www.nist.gov/publication-portal.cfm.

Copyrights and Permissions

Official publications of the National Institute of Standards and Technology are not subject to copyright in the United States. Foreign rights are reserved. Questions concerning the possibility of copyrights in foreign countries should be referred to the Office of Chief Counsel at NIST via email to nistcounsel@nist.gov.

Comments on this publication may be submitted to Wo Chang

National Institute of Standards and Technology

Attn: Wo Chang, Information Technology Laboratory

100 Bureau Drive (Mail Stop 8900) Gaithersburg, MD 20899-8930

Email: SP1500comm[email protected]

Reports on Computer Systems Technology

The Information Technology Laboratory (ITL) at NIST promotes the U.S. economy and public welfare by providing technical leadership for the Nation’s measurement and standards infrastructure. ITL develops tests, test methods, reference data, proof of concept implementations, and technical analyses to advance the development and productive use of information technology (IT). ITL’s responsibilities include the development of management, administrative, technical, and physical standards and guidelines for the cost-effective security and privacy of other than national security-related information in federal information systems. This document reports on ITL’s research, guidance, and outreach efforts in IT and its collaborative activities with industry, government, and academic organizations.

Abstract

Big Data is a term used to describe the large amount of data in the networked, digitized, sensor-laden, information-driven world. While opportunities exist with Big Data, the data can overwhelm traditional technical approaches, and the growth of data is outpacing scientific and technological advances in data analytics. To advance progress in Big Data, the NIST Big Data Public Working Group (NBD-PWG) is working to develop consensus on important fundamental concepts related to Big Data. The results are reported in the NIST Big Data Interoperability Framework (NBDIF) series of volumes. This volume, Volume 6, summarizes the work performed by the NBD-PWG to characterize Big Data from an architecture perspective, presents the NIST Big Data Reference Architecture (NBDRA) conceptual model, discusses the roles and fabrics of the NBDRA, presents an activities view of the NBDRA to describe the activities performed by the roles, and presents a functional component view of the NBDRA containing the classes of functional components that carry out the activities.

Keywords

Activities view; Big Data Application Provider; Big Data; Big Data characteristics; Data Consumer; Data Provider; Framework Provider; functional component view; Management Fabric; reference architecture; Security and Privacy Fabric; System Orchestrator; use cases.

Acknowledgements

This document reflects the contributions and discussions by the membership of the NBD-PWG, co-chaired by Wo Chang (NIST ITL), Bob Marcus (ET-Strategies), and Chaitan Baru (San Diego Supercomputer Center; National Science Foundation). For all versions, the Subgroups were led by the following people: Nancy Grady (SAIC), Natasha Balac (SDSC), and Eugene Luster (R2AD) for the Definitions and Taxonomies Subgroup; Geoffrey Fox (Indiana University) and Tsegereda Beyene (Cisco Systems) for the Use Cases and Requirements Subgroup; Arnab Roy (Fujitsu), Mark Underwood (Krypton Brothers; Synchrony Financial), and Akhil Manchanda (GE) for the Security and Privacy Subgroup; David Boyd (InCadence Strategic Solutions), Orit Levin (Microsoft), Don Krapohl (Augmented Intelligence), and James Ketner (AT&T) for the Reference Architecture Subgroup; and Russell Reinsch (Center for Government Interoperability), David Boyd (InCadence Strategic Solutions), Carl Buffington (Vistronix), and Dan McClary (Oracle), for the Standards Roadmap Subgroup.

The editors for this document were the following:

• Version 1: Orit Levin (Microsoft), David Boyd (InCadence Strategic Solutions), and Wo Chang (NIST)
• Version 2: David Boyd (InCadence Strategic Solutions) and Wo Chang (NIST)
• Version 3: David Boyd (InCadence Strategic Solutions) and Wo Chang (NIST)

Laurie Aldape (Energetics Incorporated) and Elizabeth Lennon (NIST) provided editorial assistance across all NBDIF volumes.

NIST SP 1500-6, Version 3, has been collaboratively authored by the NBD-PWG. As of the date of this publication, there are over six hundred NBD-PWG participants from industry, academia, and government. Federal agency participants include the National Archives and Records Administration (NARA), National Aeronautics and Space Administration (NASA), National Science Foundation (NSF), and the U.S. Departments of Agriculture, Commerce, Defense, Energy, Health and Human Services, Homeland Security, Transportation, Treasury, and Veterans Affairs.

NIST would like to acknowledge the specific contributions to this volume, during Version 1, Version 2, and/or Version 3 activities, by the following NBD-PWG members:

Chaitan Baru (University of California, San Diego, Supercomputer Center)
Janis Beach (Information Management Services, Inc.)
David Boyd (InCadence Strategic Solutions)
Scott Brim (Internet2)
Gregg Brown (Microsoft)
Carl Buffington (Vistronix)
Pavithra Kenjige (PK Technologies)
James Kobielus (IBM)
Donald Krapohl (Augmented Intelligence)
Orit Levin (Microsoft)
Eugene Luster (DISA/R2AD)
Serge Manning (Huawei USA)
Robert Marcus (ET-Strategies)
Felix Njeh (U.S. Department of the Army)
Gururaj Pandurangi (Avyan Consulting Corp.)
Linda Pelekoudas (Strategy and Design Solutions)
Dave Raddatz (Silicon Graphics International Corp.)
Russell Reinsch (Center for Government Interoperability)
John Rogers

“Contributors” are members of the NIST Big Data Public Working Group who dedicated great effort to prepare and gave substantial time on a regular basis to research and development in support of this document.

Yuri Demchenko (University of Amsterdam)
Jill Gemmill (Clemson University)
Nancy Grady (SAIC)
Ronald Hale (ISACA)
Keith Hare (JCC Consulting, Inc.)
Richard Jones (The Joseki Group LLC)
Gary Mazzaferro (AlloyCloud, Inc.)
Shawn Miller (U.S. Department of Veterans Affairs)
Sanjay Mishra (Verizon)
Vivek Navale (NARA)
Quyen Nguyen (U.S. Census Bureau)
Arnab Roy (Fujitsu)
Michael Seablom (NASA)
Rupinder Singh (McAfee, Inc.)
Anil Srivastava (Open Health Systems Laboratory)
Glenn Wasson (SAIC)
Timothy Zimmerlin (Consultant)
Alicia Zuniga-Alvarado (Consultant)

TABLE OF CONTENTS

EXECUTIVE SUMMARY
1 INTRODUCTION
  1.1 BACKGROUND
  1.2 SCOPE AND OBJECTIVES OF THE REFERENCE ARCHITECTURES SUBGROUP
  1.3 REPORT PRODUCTION
  1.4 REPORT STRUCTURE
2 HIGH-LEVEL REFERENCE ARCHITECTURE REQUIREMENTS
  2.1 USE CASES AND REQUIREMENTS
  2.2 REFERENCE ARCHITECTURE SURVEY
  2.3 TAXONOMY
3 NBDRA CONCEPTUAL MODEL
  3.1 SYSTEM ORCHESTRATOR
  3.2 DATA PROVIDER
  3.3 BIG DATA APPLICATION PROVIDER
  3.4 BIG DATA FRAMEWORK PROVIDER
  3.5 DATA CONSUMER
  3.6 MANAGEMENT FABRIC OF THE NBDRA
  3.7 SECURITY AND PRIVACY FABRIC OF THE NBDRA
4 NBDRA ARCHITECTURE VIEWS
  4.1 ACTIVITIES VIEW
    4.1.1 System Orchestrator
    4.1.2 Big Data Application Provider
      4.1.2.1 Collection
      4.1.2.2 Preparation
      4.1.2.3 Analytics
      4.1.2.4 Visualization
      4.1.2.5 Access
    4.1.3 Big Data Framework Provider
      4.1.3.1 Infrastructure Activities
      4.1.3.2 Platform Activities
      4.1.3.3 Processing Activities
    4.1.4 Management Fabric Activities
      4.1.4.1 System Management
      4.1.4.2 Big Data Life Cycle Management
    4.1.5 Security and Privacy Fabric Activities
  4.2 FUNCTIONAL COMPONENT VIEW
    4.2.1 System Orchestrator
    4.2.2 Big Data Application Provider
      4.2.2.1 MapReduce
      4.2.2.2 Bulk Synchronous Parallel
    4.2.3 Big Data Framework Provider
      4.2.3.1 Infrastructure Frameworks
      4.2.3.2 Data Platform Frameworks
      4.2.3.3 Processing Frameworks
      4.2.3.4 Crosscutting Components
    4.2.4 Management Fabric
      4.2.4.1 Monitoring Frameworks
      4.2.4.2 Provisioning/Configuration Frameworks
      4.2.4.3 Package Managers
      4.2.4.4 Resource Managers
      4.2.4.5 Data Life Cycle Managers
    4.2.5 Security and Privacy Fabric
      4.2.5.1 Authentication and Authorization Frameworks
      4.2.5.2 Audit Frameworks
5 SUMMARY
APPENDIX A: DEPLOYMENT CONSIDERATIONS
APPENDIX B: TERMS AND DEFINITIONS
APPENDIX C: ACRONYMS
APPENDIX D: RESOURCES AND BIBLIOGRAPHY

FIGURES

FIGURE 1: NBDIF DOCUMENTS NAVIGATION DIAGRAM PROVIDES CONTENT FLOW BETWEEN VOLUMES
FIGURE 2: NBDRA TAXONOMY
FIGURE 3: NIST BIG DATA REFERENCE ARCHITECTURE (NBDRA)
FIGURE 4: MULTIPLE INSTANCES OF NBDRA COMPONENTS INTERACT AS PART OF A LARGER SYSTEM
FIGURE 5: BIG DATA SYSTEM WITHIN A SYSTEM OF SYSTEMS VIEW
FIGURE 6: NBDRA VIEW CONVENTIONS
FIGURE 7: TOP LEVEL ROLES AND FABRICS
FIGURE 8: TOP-LEVEL CLASSES OF ACTIVITIES WITHIN THE ACTIVITIES VIEW
FIGURE 9: COMMON CLASSES OF FUNCTIONAL COMPONENTS
FIGURE 10: DATA ORGANIZATION APPROACHES
FIGURE 11: DATA STORAGE TECHNOLOGIES
FIGURE 12: DIFFERENCES BETWEEN ROW-ORIENTED AND COLUMN-ORIENTED STORES
FIGURE 13: COLUMN FAMILY SEGMENTATION OF THE COLUMNAR STORES MODEL
FIGURE 14: OBJECT NODES AND RELATIONSHIPS OF GRAPH DATABASES
FIGURE 15: INFORMATION FLOW
FIGURE A-1: BIG DATA FRAMEWORK DEPLOYMENT OPTIONS

TABLES

TABLE 1: MAPPING USE CASE CHARACTERIZATION CATEGORIES TO REFERENCE ARCHITECTURE COMPONENTS AND FABRICS
TABLE 2: 13 DWARFS - ALGORITHMS FOR SIMULATION IN THE PHYSICAL SCIENCES

Executive Summary

The NIST Big Data Public Working Group (NBD-PWG) Reference Architecture Subgroup prepared this NIST Big Data Interoperability Framework (NBDIF): Volume 6, Reference Architecture document to provide a vendor-neutral, technology- and infrastructure-agnostic conceptual model and examine related issues. The NIST Big Data Reference Architecture (NBDRA), which consists of a conceptual model and two architectural views, was a collaborative effort within the Reference Architecture Subgroup and with the other NBD-PWG subgroups. The goal of the NBD-PWG Reference Architecture Subgroup is to develop an open reference architecture for Big Data that achieves the following objectives:

• Provides a common language for the various stakeholders;
• Encourages adherence to common standards, specifications, and patterns;
• Provides consistent methods for implementation of technology to solve similar problem sets;
• Illustrates and improves understanding of the various Big Data components, processes, and systems, in the context of a vendor- and technology-agnostic Big Data conceptual model;
• Provides a technical reference for U.S. government departments, agencies, and other consumers to understand, discuss, categorize, and compare Big Data solutions; and
• Facilitates analysis of candidate standards for interoperability, portability, reusability, and extendibility.

The NIST Big Data Interoperability Framework (NBDIF) was released in three versions, which correspond to the three stages of the NBD-PWG work. Version 3 (current version) of the NBDIF volumes resulted from Stage 3 work with major emphasis on the validation of the NBDRA Interfaces and content enhancement. Stage 3 work built upon the foundation created during Stage 2 and Stage 1. The current effort documented in this volume reflects concepts developed within the rapidly evolving field of Big Data. The three stages (in reverse order) aim to achieve the following with respect to the NIST Big Data Reference Architecture (NBDRA).

Stage 3: Validate the NBDRA by building Big Data general applications through the general interfaces;
Stage 2: Define general interfaces between the NBDRA components; and
Stage 1: Identify the high-level Big Data reference architecture key components, which are technology-, infrastructure-, and vendor-agnostic.

The NBDIF consists of nine volumes, each of which addresses a specific key topic, resulting from the work of the NBD-PWG. The nine volumes are as follows:

• Volume 1, Definitions [1]
• Volume 2, Taxonomies [2]
• Volume 3, Use Cases and General Requirements [3]
• Volume 4, Security and Privacy [4]
• Volume 5, Architectures White Paper Survey [5]
• Volume 6, Reference Architecture (this volume)
• Volume 7, Standards Roadmap [6]
• Volume 8, Reference Architecture Interfaces [7]
• Volume 9, Adoption and Modernization [8]

During Stage 1, Volumes 1 through 7 were conceptualized, organized, and written. The finalized Version 1 documents can be downloaded from the V1.0 Final Version page of the NBD-PWG website (https://bigdatawg.nist.gov/V1_output_docs.php).

During Stage 2, the NBD-PWG developed Version 2 of the NBDIF Version 1 volumes, with the exception of Volume 5, which contained the completed architecture survey work that was used to inform Stage 1 work of the NBD-PWG. The goals of Stage 2 were to enhance the Version 1 content, define general interfaces between the NBDRA components by aggregating low-level interactions into high-level general interfaces, and demonstrate how the NBDRA can be used. As a result of the Stage 2 work, the need for NBDIF Volume 8 and NBDIF Volume 9 was identified and the two new volumes were created. Version 2 of the NBDIF volumes, resulting from Stage 2 work, can be downloaded from the V2.0 Final Version page of the NBD-PWG website (https://bigdatawg.nist.gov/V2_output_docs.php).

1 INTRODUCTION

1.1 BACKGROUND

There is broad agreement among commercial, academic, and government leaders about the potential of Big Data to spark innovation, fuel commerce, and drive progress. Big Data is the common term used to describe the deluge of data in today’s networked, digitized, sensor-laden, and information-driven world. The availability of vast data resources carries the potential to answer questions previously out of reach, including the following:

• How can a potential pandemic reliably be detected early enough to intervene?
• Can new materials with advanced properties be predicted before these materials have ever been synthesized?
• How can the current advantage of the attacker over the defender in guarding against cyber-security threats be reversed?

There is also broad agreement on the ability of Big Data to overwhelm traditional approaches. The growth rates for data volumes, speeds, and complexity are outpacing scientific and technological advances in data analytics, management, transport, and data user spheres.

Despite widespread agreement on the inherent opportunities and current limitations of Big Data, a lack of consensus on some important fundamental questions continues to confuse potential users and stymie progress. These questions include the following:

• How is Big Data defined?
• What attributes define Big Data solutions?
• What is new in Big Data?
• What is the difference between Big Data and bigger data that has been collected for years?
• How is Big Data different from traditional data environments and related applications?
• What are the essential characteristics of Big Data environments?
• How do these environments integrate with currently deployed architectures?
• What are the central scientific, technological, and standardization challenges that need to be addressed to accelerate the deployment of robust, secure Big Data solutions?

Within this context, on March 29, 2012, the White House announced the Big Data Research and Development Initiative [9]. The initiative’s goals include helping to accelerate the pace of discovery in science and engineering, strengthening national security, and transforming teaching and learning by improving analysts’ ability to extract knowledge and insights from large and complex collections of digital data.

Six federal departments and their agencies announced more than $200 million in commitments spread across more than 80 projects, which aim to significantly improve the tools and techniques needed to access, organize, and draw conclusions from huge volumes of digital data. The initiative also challenged industry, research universities, and nonprofits to join with the federal government to make the most of the opportunities created by Big Data.

Motivated by the White House initiative and public suggestions, the National Institute of Standards and Technology (NIST) has accepted the challenge to stimulate collaboration among industry professionals to further the secure and effective adoption of Big Data. As one result of NIST’s Cloud and Big Data Forum held on January 15–17, 2013, there was strong encouragement for NIST to create a public working group for the development of a Big Data Standards Roadmap. Forum participants noted that this roadmap should define and prioritize Big Data requirements, including interoperability, portability, reusability, extensibility, data usage, analytics, and technology infrastructure. In doing so, the roadmap would accelerate the adoption of the most secure and effective Big Data techniques and technology.

On June 19, 2013, the NIST Big Data Public Working Group (NBD-PWG) was launched with extensive participation by industry, academia, and government from across the nation. The scope of the NBD-PWG involves forming a community of interests from all sectors, including industry, academia, and government, with the goal of developing consensus on definitions, taxonomies, secure reference architectures, security and privacy, and, from these, a standards roadmap. Such a consensus would create a vendor-neutral, technology- and infrastructure-independent framework that would enable Big Data stakeholders to identify and use the best analytics tools for their processing and visualization requirements on the most suitable computing platform and cluster, while also allowing added value from Big Data service providers.

The NIST Big Data Interoperability Framework (NBDIF) was released in three versions, which correspond to the three stages of the NBD-PWG work. Version 3 (current version) of the NBDIF volumes resulted from Stage 3 work with major emphasis on the validation of the NBDRA Interfaces and content enhancement. Stage 3 work built upon the foundation created during Stage 2 and Stage 1. The current effort documented in this volume reflects concepts developed within the rapidly evolving field of Big Data. The three stages (in reverse order) aim to achieve the following with respect to the NIST Big Data Reference Architecture (NBDRA).

Stage 3: Validate the NBDRA by building Big Data general applications through the general 114

interfaces; 115

Stage 2: Define general interfaces between the NBDRA components; and 116

Stage 1: Identify the high-level Big Data reference architecture key components, which are 117

technology-, infrastructure-, and vendor-agnostic. 118

The NBDIF consists of nine volumes, each of which addresses a specific key topic, resulting from the work of the NBD-PWG. The nine volumes are as follows:

• Volume 1, Definitions [1]
• Volume 2, Taxonomies [2]
• Volume 3, Use Cases and General Requirements [3]
• Volume 4, Security and Privacy [4]
• Volume 5, Architectures White Paper Survey [5]
• Volume 6, Reference Architecture (this volume)
• Volume 7, Standards Roadmap [6]
• Volume 8, Reference Architecture Interfaces [7]
• Volume 9, Adoption and Modernization [8]

During Stage 1, Volumes 1 through 7 were conceptualized, organized, and written. The finalized Version 1 documents can be downloaded from the V1.0 Final Version page of the NBD-PWG website (https://bigdatawg.nist.gov/V1_output_docs.php).

During Stage 2, the NBD-PWG developed Version 2 of the NBDIF Version 1 volumes, with the exception of Volume 5, which contained the completed architecture survey work that was used to inform Stage 1 work of the NBD-PWG. The goals of Stage 2 were to enhance the Version 1 content, define general interfaces between the NBDRA components by aggregating low-level interactions into high-level general interfaces, and demonstrate how the NBDRA can be used. As a result of the Stage 2 work, the need for NBDIF Volume 8 and NBDIF Volume 9 was identified and the two new volumes were created. Version 2 of the NBDIF volumes, resulting from Stage 2 work, can be downloaded from the V2.0 Final Version page of the NBD-PWG website (https://bigdatawg.nist.gov/V2_output_docs.php).

1.2 SCOPE AND OBJECTIVES OF THE REFERENCE ARCHITECTURES SUBGROUP

Reference architectures provide “an authoritative source of information about a specific subject area that guides and constrains the instantiations of multiple architectures and solutions [10].” Reference architectures generally serve as a foundation for solution architectures and may also be used for comparison and alignment of instantiations of architectures and solutions.

The goal of the NBD-PWG Reference Architecture Subgroup is to develop an open reference architecture for Big Data that achieves the following objectives:

• Provides a common language for the various stakeholders;
• Encourages adherence to common standards, specifications, and patterns;
• Provides consistent methods for implementation of technology to solve similar problem sets;
• Illustrates and improves understanding of the various Big Data components, processes, and systems, in the context of a vendor- and technology-agnostic Big Data conceptual model;
• Provides a technical reference for U.S. government departments, agencies, and other consumers to understand, discuss, categorize, and compare Big Data solutions; and
• Facilitates analysis of candidate standards for interoperability, portability, reusability, and extensibility.

The NBDRA is a high-level conceptual model crafted to serve as a tool to facilitate open discussion of the requirements, design structures, and operations inherent in Big Data. The NBDRA is intended to facilitate the understanding of the operational intricacies in Big Data. It does not represent the system architecture of a specific Big Data system, but rather is a tool for describing, discussing, and developing system-specific architectures using a common framework of reference. The model is not tied to any specific vendor products, services, or reference implementation, nor does it define prescriptive solutions that inhibit innovation.

The NBDRA does not address the following:

• Detailed specifications for any organization’s operational systems;
• Detailed specifications of information exchanges or services; and
• Recommendations or standards for integration of infrastructure products.

1.3 REPORT PRODUCTION

A wide spectrum of Big Data architectures has been explored and developed as part of various industry, academic, and government initiatives. The development of the NBDRA and material contained in this volume involved the following steps:

1. Announce that the NBD-PWG Reference Architecture Subgroup is open to the public to attract and solicit a wide array of subject matter experts and stakeholders in government, industry, and academia;
2. Gather publicly available Big Data architectures and materials representing various stakeholders, different data types, and diverse use cases;
3. Examine and analyze the Big Data material to better understand existing concepts, usage, goals, objectives, characteristics, and key elements of Big Data, and then document the findings using NIST’s Big Data taxonomies model (presented in NBDIF: Volume 2, Taxonomies);
4. Develop a technology-independent, open reference architecture based on the analysis of Big Data material and inputs received from other NBD-PWG subgroups;
5. Identify workflow and interactions from the System Operator to the rest of the NBDRA components; and
6. Develop an Activities View and a Functional Component View of the NBDRA to describe the activities performed by the roles and fabrics along with the functional components that carry out the activities.

Many of the architecture use cases were originally collected by the NBD-PWG Use Case and Requirements Subgroup and can be accessed at http://bigdatawg.nist.gov/usecases.php.

To achieve technical and high-quality document content, this document will go through a public comment period along with NIST internal review.

1.4 REPORT STRUCTURE

The organization of this document roughly corresponds to the process used by the NBD-PWG to develop the NBDRA. Following the introductory material presented in Section 1, the remainder of this document is organized as follows:

• Section 2 summarizes the work of other NBD-PWG Subgroups that informed the formation of the NBDRA.
• Section 3 presents the NBDRA conceptual model, which is a vendor- and technology-agnostic Big Data conceptual model.
• Section 4 explores two different views of the NBDRA: the activities view, which examines the activities carried out by the NBDRA roles, and the functional component view, which examines the functional components that carry out the activities.
• Section 5 summarizes conclusions of this volume.

While each NBDIF volume was created with a specific focus within Big Data, all volumes are interconnected. During the creation of the volumes, information from some volumes was used as input for other volumes. Broad topics (e.g., definition, architecture) may be discussed in several volumes with each discussion circumscribed by the volume’s particular focus. Arrows shown in Figure 1 indicate the main flow of information input and/or output from the volumes. Volumes 2, 3, and 5 (blue circles) are essentially standalone documents that provide output to other volumes (e.g., to Volume 6). These volumes contain the initial situational awareness research. During the creation of Volumes 4, 7, 8, and 9 (green circles), input from other volumes was used. The development of these volumes took into account work on the other volumes. Volumes 1 and 6 (red circles) were developed using the initial situational awareness research and continued to be modified based on work in other volumes. The information from these volumes was also used as input to the volumes in the green circles.

Figure 1: NBDIF Documents Navigation Diagram Provides Content Flow Between Volumes

2 HIGH-LEVEL REFERENCE ARCHITECTURE REQUIREMENTS

The development of a Big Data reference architecture requires a thorough understanding of current techniques, issues, and concerns. To this end, the NBD-PWG collected use cases to gain an understanding of current applications of Big Data, conducted a survey of reference architectures to understand commonalities within Big Data architectures in use, developed a taxonomy to understand and organize the information collected, and reviewed existing technologies and trends relevant to Big Data. The results of these NBD-PWG activities were used in the development of the NBDRA and are briefly summarized in this section, drawing on the corresponding volumes of the NBDIF.

2.1 USE CASES AND REQUIREMENTS

To develop the use cases, publicly available information was collected for various Big Data architectures in nine broad areas, or application domains. Participants in the NBD-PWG Use Case and Requirements Subgroup and other interested parties provided the use case details via a template, which helped to standardize the responses and facilitate subsequent analysis and comparison of the use cases. However, submissions still varied in levels of detail, quantitative data, or qualitative information. The NBDIF: Volume 3, Use Cases and General Requirements document presents the original use cases, an analysis of the compiled information, and the requirements extracted from the use cases.

The extracted requirements represent challenges faced in seven characterization categories (Table 1) developed by the Subgroup. Requirements specific to the use cases were aggregated into high-level generalized requirements, which are vendor- and technology-neutral.

The use case characterization categories were used as input in the development of the NBDRA and map directly to NBDRA components and fabrics as shown in Table 1.

Table 1: Mapping Use Case Characterization Categories to Reference Architecture Components and Fabrics

USE CASE CHARACTERIZATION CATEGORIES → REFERENCE ARCHITECTURE COMPONENTS AND FABRICS
Data sources → Data Provider
Data transformation → Big Data Application Provider
Capabilities → Big Data Framework Provider
Data consumer → Data Consumer
Security and privacy → Security and Privacy Fabric
Life cycle management → System Orchestrator; Management Fabric
Other requirements → To all components and fabrics
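The mapping in Table 1 can be captured as a simple lookup structure. The following Python sketch is illustrative only: the names `CATEGORY_TO_COMPONENTS`, `ALL_COMPONENTS`, and `components_for` are assumptions introduced here and are not defined by the NBDIF. It shows how a use case category resolves to one or more NBDRA components or fabrics, with "Other requirements" expanding to all of them.

```python
from typing import Dict, List

# Hypothetical encoding of Table 1; keys are use case characterization
# categories, values are the NBDRA components and fabrics they map to.
CATEGORY_TO_COMPONENTS: Dict[str, List[str]] = {
    "Data sources": ["Data Provider"],
    "Data transformation": ["Big Data Application Provider"],
    "Capabilities": ["Big Data Framework Provider"],
    "Data consumer": ["Data Consumer"],
    "Security and privacy": ["Security and Privacy Fabric"],
    "Life cycle management": ["System Orchestrator", "Management Fabric"],
}

# Every distinct component or fabric named in the table, in sorted order.
ALL_COMPONENTS: List[str] = sorted(
    {name for targets in CATEGORY_TO_COMPONENTS.values() for name in targets}
)

# "Other requirements" applies to all components and fabrics.
CATEGORY_TO_COMPONENTS["Other requirements"] = list(ALL_COMPONENTS)


def components_for(category: str) -> List[str]:
    """Return the components and fabrics that a category maps to."""
    return CATEGORY_TO_COMPONENTS[category]
```

A lookup such as `components_for("Life cycle management")` returns both the System Orchestrator and the Management Fabric, reflecting the one-to-many row in the table.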

The high-level generalized requirements are presented below. The development of these generalized requirements is presented in the NBDIF: Volume 3, Use Cases and General Requirements document.

DATA SOURCE REQUIREMENTS (DSR)

• DSR-1: Reliable, real-time, asynchronous, streaming, and batch processing to collect data from centralized, distributed, and cloud data sources, sensors, or instruments
• DSR-2: Slow, bursty, and high throughput data transmission between data sources and computing clusters
• DSR-3: Diversified data content, including structured and unstructured text, documents, graphs, websites, and geospatial, compressed, timed, spatial, multimedia, simulation, and instrumental (i.e., system management and monitoring) data

TRANSFORMATION PROVIDER REQUIREMENTS (TPR)

• TPR-1: Diversified, compute-intensive, statistical and graph analytic processing and machine-learning techniques
• TPR-2: Batch and real-time analytic processing
• TPR-3: Processing large diversified data content and modeling
• TPR-4: Processing data in motion (e.g., streaming, fetching new content, data tracking, traceability, data change management, and data boundaries)

CAPABILITY PROVIDER REQUIREMENTS (CPR)

• CPR-1: Legacy software and advanced software packages
• CPR-2: Legacy and advanced computing platforms
• CPR-3: Legacy and advanced distributed computing clusters, co-processors, input/output (I/O) processing
• CPR-4: Advanced networks (e.g., software-defined network [SDN]) and elastic data transmission, including fiber, cable, and wireless networks (e.g., local area network, wide area network, metropolitan area network, Wi-Fi)
• CPR-5: Legacy, large, virtual, and advanced distributed data storage
• CPR-6: Legacy and advanced programming executables, applications, tools, utilities, and libraries

DATA CONSUMER REQUIREMENTS (DCR)

• DCR-1: Fast searches from processed data with high relevancy, accuracy, and recall
• DCR-2: Diversified output file formats for visualization, rendering, and reporting
• DCR-3: Visual layout for results presentation
• DCR-4: Rich user interface for access using browsers and visualization tools
• DCR-5: High-resolution, multidimensional layer of data visualization
• DCR-6: Streaming results to clients

SECURITY AND PRIVACY REQUIREMENTS (SPR)

• SPR-1: Protect and preserve security and privacy of sensitive data.
• SPR-2: Support sandbox, access control, and multi-tenant, multilevel, policy-driven authentication on protected data and ensure that these are in line with accepted governance, risk, and compliance (GRC) and confidentiality, integrity, and availability (CIA) best practices.

LIFE CYCLE MANAGEMENT REQUIREMENTS (LMR)

• LMR-1: Data quality curation, including preprocessing, data clustering, classification, reduction, and format transformation
• LMR-2: Dynamic updates on data, user profiles, and links
• LMR-3: Data life cycle and long-term preservation policy, including data provenance
• LMR-4: Data validation
• LMR-5: Human annotation for data validation
• LMR-6: Prevention of data loss or corruption
• LMR-7: Multisite (including cross-border, geographically dispersed) archives
• LMR-8: Persistent identifier and data traceability
• LMR-9: Standardization, aggregation, and normalization of data from disparate sources

OTHER REQUIREMENTS (OR)

• OR-1: Rich user interface from mobile platforms to access processed results
• OR-2: Performance monitoring on analytic processing from mobile platforms
• OR-3: Rich visual content search and rendering from mobile platforms
• OR-4: Mobile device data acquisition and management
• OR-5: Security across mobile devices and other smart devices such as sensors

2.2 REFERENCE ARCHITECTURE SURVEY

The NBD-PWG Reference Architecture Subgroup conducted a survey of current reference architectures to advance the understanding of the operational intricacies in Big Data and to serve as a tool for developing system-specific architectures using a common reference framework. The Subgroup surveyed currently published Big Data platforms by leading companies or individuals supporting the Big Data framework and analyzed the collected material.

This effort revealed a consistency among Big Data architectures that informed the development of the NBDRA. Survey details, methodology, and conclusions are reported in NBDIF: Volume 5, Architectures White Paper Survey.

2.3 TAXONOMY

The NBD-PWG Definitions and Taxonomies Subgroup focused on identifying Big Data concepts, defining terms needed to describe the new Big Data paradigm, and defining reference architecture terms. The reference architecture taxonomy presented below provides a hierarchy of the components of the reference architecture. Additional taxonomy details are presented in the NBDIF: Volume 2, Taxonomies document.

Figure 2 outlines potential actors for the seven roles developed by the NBD-PWG Definitions and Taxonomies Subgroup. The blue boxes contain the name of the role at the top with potential actors listed directly below.

Figure 2: NBDRA Taxonomy

SYSTEM ORCHESTRATOR

The System Orchestrator provides the overarching requirements that the system must fulfill, including policy, governance, architecture, resources, and business requirements, as well as monitoring or auditing activities to ensure that the system complies with those requirements. The System Orchestrator role provides system requirements, high-level design, and monitoring for the data system. While the role predates Big Data systems, some related design activities have changed within the Big Data paradigm.

DATA PROVIDER

A Data Provider makes data available to itself or to others. In fulfilling its role, the Data Provider creates an abstraction of various types of data sources (such as raw data or data previously transformed by another system) and makes them available through different functional interfaces. The actor fulfilling this role can be part of the Big Data system, internal to the organization in another system, or external to the organization orchestrating the system. While the concept of a Data Provider is not new, the greater data collection and analytics capabilities have opened up new possibilities for providing valuable data.

BIG DATA APPLICATION PROVIDER

The Big Data Application Provider executes the manipulations of the data life cycle to meet requirements established by the System Orchestrator. This is where the general capabilities within the Big Data framework are combined to produce the specific data system. While the activities of an application provider are the same whether the solution being built concerns Big Data or not, the methods and techniques have changed because the data and data processing are parallelized across resources.

BIG DATA FRAMEWORK PROVIDER

The Big Data Framework Provider has general resources or services to be used by the Big Data Application Provider in the creation of the specific application. There are many new components from which the Big Data Application Provider can choose in using these resources and the network to build the specific system. This is the role that has seen the most significant changes because of Big Data.

The Big Data Framework Provider consists of one or more instances of the three subcomponents: infrastructure frameworks, data platforms, and processing frameworks. There is no requirement that all instances at a given level in the hierarchy be of the same technology and, in fact, most Big Data implementations are hybrids combining multiple technology approaches. These hybrids provide flexibility and can meet the complete range of requirements that are driven by the Big Data Application Provider. Due to the rapid emergence of new techniques, this is an area that will continue to need discussion.

DATA CONSUMER

The Data Consumer receives the value output of the Big Data system. In many respects, it is the recipient of the same type of functional interfaces that the Data Provider exposes to the Big Data Application Provider. After the system adds value to the original data sources, the Big Data Application Provider then exposes that same type of functional interfaces to the Data Consumer.

SECURITY AND PRIVACY FABRIC

Security and privacy issues affect all other components of the NBDRA. The Security and Privacy Fabric interacts with the System Orchestrator for policy, requirements, and auditing and also with both the Big Data Application Provider and the Big Data Framework Provider for development, deployment, and operation. The NBDIF: Volume 4, Security and Privacy document discusses security and privacy topics.

MANAGEMENT FABRIC

The Big Data characteristics of volume, velocity, variety, and variability demand a versatile system and software management platform for provisioning, software and package configuration and management, along with resource and performance monitoring and management. Big Data management involves system, data, security, and privacy considerations at scale, while maintaining a high level of data quality and secure accessibility.

3 NBDRA CONCEPTUAL MODEL

As discussed in Section 2, the NBD-PWG Reference Architecture Subgroup used a variety of inputs from other NBD-PWG subgroups in developing a vendor-neutral, technology- and infrastructure-agnostic conceptual model of Big Data architecture. This conceptual model, the NBDRA, is shown in Figure 3 and represents a Big Data system composed of five logical functional components connected by interoperability interfaces (i.e., services). Two fabrics envelop the components, representing the interwoven nature of management and security and privacy with all five of the components.

The NBDRA is intended to enable system engineers, data scientists, software developers, data architects, and senior decision makers to develop solutions to issues that require diverse approaches due to convergence of Big Data characteristics within an interoperable Big Data ecosystem. It provides a framework to support a variety of business environments, including tightly integrated enterprise systems and loosely coupled vertical industries, by enhancing understanding of how Big Data complements and differs from existing analytics, business intelligence, databases, and systems.

Figure 3: NIST Big Data Reference Architecture (NBDRA)

Note: None of the terminology or diagrams in these documents is intended to imply any business or deployment model. The terms provider and consumer as used are descriptive of general roles and are meant to be informative in nature.

The NBDRA is organized around five major roles and multiple sub-roles aligned along two axes representing the two Big Data value chains: Information Value (horizontal axis) and Information Technology (IT; vertical axis). Along the Information Value axis, the value is created by data collection, integration, analysis, and applying the results following the value chain. Along the IT axis, the value is created by providing networking, infrastructure, platforms, application tools, and other IT services for hosting and operating Big Data in support of required data applications. At the intersection of both axes is the Big Data Application Provider role, indicating that data analytics and its implementation provide the value to Big Data stakeholders in both value chains. The term provider, as part of the Big Data Application Provider and Big Data Framework Provider, indicates that those roles provide or implement specific activities and functions within the system. It does not designate a service model or business entity.

The five main NBDRA roles, shown in Figure 3 and discussed in detail in Section 3, represent different technical roles that exist in every Big Data system. These roles are the following:

• System Orchestrator,
• Data Provider,
• Big Data Application Provider,
• Big Data Framework Provider, and
• Data Consumer.

The two fabric roles shown in Figure 3 encompassing the five main roles are:

• Management, and
• Security and Privacy.

These two fabrics provide services and functionality to the five main roles in the areas specific to Big Data and are crucial to any Big Data solution.

The DATA arrows in Figure 3 show the flow of data between the system’s main roles. Data flows between the roles either physically (i.e., by value) or by providing its location and the means to access it (i.e., by reference). The SW arrows show transfer of software tools for processing of Big Data in situ. The Service Use arrows represent software programmable interfaces. While the main focus of the NBDRA is to represent the run-time environment, all three types of communications or transactions can happen in the configuration phase as well. Manual agreements (e.g., service-level agreements) and human interactions that may exist throughout the system are not shown in the NBDRA.
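The distinction between data flowing by value and by reference can be illustrated with a short Python sketch. Everything named here (`DataByValue`, `DataByReference`, `resolve`, the `store://` location, and the in-memory `_STORE`) is a hypothetical stand-in chosen for illustration; the NBDRA does not prescribe any particular message format or access mechanism.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union


@dataclass
class DataByValue:
    """The data itself travels between roles."""
    payload: bytes


@dataclass
class DataByReference:
    """Only the location and the means to access the data travel."""
    location: str                   # hypothetical location identifier
    fetch: Callable[[str], bytes]   # the means to access that location


def resolve(message: Union[DataByValue, DataByReference]) -> bytes:
    """Return the actual data regardless of how it was transferred."""
    if isinstance(message, DataByValue):
        return message.payload
    return message.fetch(message.location)


# In-memory stand-in for a repository held by a Data Provider.
_STORE: Dict[str, bytes] = {"store://sensors/batch-001": b"21.5,21.7,22.0"}

by_value = DataByValue(payload=b"21.5,21.7,22.0")
by_reference = DataByReference(
    location="store://sensors/batch-001",
    fetch=_STORE.__getitem__,
)
```

In this sketch the receiving role calls `resolve` and obtains the same bytes in both cases; by reference, the bytes stay in the provider's repository until they are fetched.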

Within a given Big Data Architecture implementation, there may be multiple instances of elements performing the Data Provider, Data Consumer, Big Data Framework Provider, and Big Data Application Provider roles. Thus, in a given Big Data implementation, there may be multiple Big Data applications that use different frameworks to meet requirements. For example, one application may focus on ingestion and analytics of streaming data and would use a framework based on components suitable for that purpose, while another application may perform data warehouse style batch analytics, which would leverage a different framework. Figure 4 below shows how such multiple instances may interact as part of a larger integrated system. As illustrated in the conceptual model, there should be common Security and Privacy and Management roles across the architecture. The crosscutting roles are sometimes referred to as fabrics because they must touch all the other roles and sub-roles within the Architecture.

Figure 4: Multiple Instances of NBDRA Components Interact as Part of a Larger System

The roles in the Big Data ecosystem perform activities and are implemented via functional components. In system development, actors and roles have the same relationship as in the movies, but system development actors can represent individuals, organizations, software, or hardware. According to the Big Data taxonomy, a single actor can play multiple roles, and multiple actors can play the same role. The NBDRA does not specify the business boundaries between the participating actors or stakeholders, so the roles can either reside within the same business entity or can be implemented by different business entities. Therefore, the NBDRA is applicable to a variety of business environments, from tightly integrated enterprise systems to loosely coupled vertical industries that rely on the cooperation of independent stakeholders. As a result, the notion of internal versus external functional components or roles does not apply to the NBDRA. However, for a specific use case, once the roles are associated with specific business stakeholders, the functional components and the activities they perform would be considered as internal or external, subject to the use case’s point of view.
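The many-to-many relationship between actors and roles described above can be sketched as a small registry. This is a minimal illustration, not part of the NBDRA: the `RoleAssignments` class, its method names, and the actor names ("analytics-team", "weather-sensor", "partner-agency") are all hypothetical.

```python
from collections import defaultdict
from enum import Enum
from typing import DefaultDict, Set


class Role(Enum):
    SYSTEM_ORCHESTRATOR = "System Orchestrator"
    DATA_PROVIDER = "Data Provider"
    BIG_DATA_APPLICATION_PROVIDER = "Big Data Application Provider"
    BIG_DATA_FRAMEWORK_PROVIDER = "Big Data Framework Provider"
    DATA_CONSUMER = "Data Consumer"


class RoleAssignments:
    """Many-to-many registry of actors and the roles they play."""

    def __init__(self) -> None:
        self._roles_by_actor: DefaultDict[str, Set[Role]] = defaultdict(set)
        self._actors_by_role: DefaultDict[Role, Set[str]] = defaultdict(set)

    def assign(self, actor: str, role: Role) -> None:
        # Record the assignment in both directions for fast lookup.
        self._roles_by_actor[actor].add(role)
        self._actors_by_role[role].add(actor)

    def roles_of(self, actor: str) -> Set[Role]:
        return set(self._roles_by_actor.get(actor, set()))

    def actors_in(self, role: Role) -> Set[str]:
        return set(self._actors_by_role.get(role, set()))


def example_assignments() -> RoleAssignments:
    """Build a registry with hypothetical actors for illustration."""
    registry = RoleAssignments()
    # One actor playing two roles.
    registry.assign("analytics-team", Role.BIG_DATA_APPLICATION_PROVIDER)
    registry.assign("analytics-team", Role.DATA_CONSUMER)
    # Two actors playing the same role.
    registry.assign("weather-sensor", Role.DATA_PROVIDER)
    registry.assign("partner-agency", Role.DATA_PROVIDER)
    return registry
```

The registry deliberately records no business boundary: whether "partner-agency" is internal or external is a property of a specific use case, not of the role assignment.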

The NBDRA does support the representation of stacking or chaining of Big Data systems. For example, a Data Consumer of one system could serve as a Data Provider to the next system down the stack or chain. Figure 5 below shows how a given Big Data Architecture implementation would operate in context with other systems, users, or Big Data implementations.

Figure 5: Big Data System within a System of Systems View
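The chaining of systems, where the output consumed from one system becomes the input provided to the next, can be sketched as follows. The `BigDataSystem` class, its `ingest` and `provide` methods, and the two example transformations are hypothetical simplifications introduced for illustration; real systems would exchange data through the interfaces discussed elsewhere in the NBDIF.

```python
from typing import Callable, Iterable, List


class BigDataSystem:
    """Toy system that ingests records, transforms them, and offers output."""

    def __init__(self, name: str,
                 transform: Callable[[List[int]], List[int]]) -> None:
        self.name = name
        self._transform = transform
        self._output: List[int] = []

    def ingest(self, records: Iterable[int]) -> None:
        # Input side: what a Data Provider feeds into this system.
        self._output = self._transform(list(records))

    def provide(self) -> List[int]:
        # Output side: what this system offers to a Data Consumer.
        return list(self._output)


def chain(systems: List[BigDataSystem], source: Iterable[int]) -> List[int]:
    """Feed each system's output to the next system as its input."""
    data = list(source)
    for system in systems:
        system.ingest(data)
        data = system.provide()
    return data


# Hypothetical two-stage chain: clean the readings, then aggregate them.
cleaner = BigDataSystem("cleaner", lambda rows: [r for r in rows if r >= 0])
aggregator = BigDataSystem("aggregator", lambda rows: [sum(rows)])
```

Calling `chain([cleaner, aggregator], [3, -1, 4])` passes the cleaned readings from the first system to the second, so the consumer of the first stage acts as the provider for the next.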

The following paragraphs provide high-level descriptions of the primary roles within the NBDRA. Section 4 contains more detailed descriptions of the sub-roles, activities, and functional components.

3.1 SYSTEM ORCHESTRATOR

The System Orchestrator role includes defining and integrating the required data application activities into an operational vertical system. Typically, the System Orchestrator involves a collection of more specific roles, performed by one or more actors, which manage and orchestrate the operation of the Big Data system. These actors may be human components, software components, or some combination of the two.

The function of the System Orchestrator is to configure and manage the other components of the Big Data architecture to implement one or more workloads that the architecture is designed to execute. The workloads managed by the System Orchestrator may range from assigning or provisioning framework components to individual physical or virtual nodes at the lower level, to providing a graphical user interface that supports the specification of workflows linking together multiple applications and components at the higher level. The System Orchestrator may also, through the Management Fabric, monitor the workloads and system to confirm that specific quality of service requirements are met for each workload, and may elastically assign and provision additional physical or virtual resources to meet workload requirements resulting from changes or surges in the data or in the number of users or transactions.
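The elastic provisioning behavior described above can be sketched as a simple sizing rule. This is an illustrative assumption rather than a method prescribed by the NBDRA: the `Workload` fields, the `headroom` parameter, and the functions `nodes_required` and `rebalance` are hypothetical, and a real orchestrator would obtain its measurements through the Management Fabric.

```python
import math
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    requests_per_second: float   # observed load on the workload
    capacity_per_node: float     # load that a single node can serve
    nodes: int                   # nodes currently provisioned


def nodes_required(workload: Workload, headroom: float = 0.2) -> int:
    """Smallest node count that serves the observed load plus headroom."""
    target = workload.requests_per_second * (1.0 + headroom)
    return max(1, math.ceil(target / workload.capacity_per_node))


def rebalance(workload: Workload, headroom: float = 0.2) -> int:
    """Resize the workload and return the change in node count.

    A positive result means nodes were added; a negative result means
    nodes were released.
    """
    required = nodes_required(workload, headroom)
    delta = required - workload.nodes
    workload.nodes = required
    return delta
```

Under these assumptions, a surge in `requests_per_second` yields a positive delta on the next `rebalance` call, and a drop in load yields a negative one, which mirrors the assign-and-provision loop described in the text.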

The NBDRA represents a broad range of Big Data systems, from tightly coupled enterprise solutions (integrated by standard or proprietary interfaces) to loosely coupled vertical systems maintained by a variety of stakeholders bounded by agreements and standard or de facto standard interfaces.

In an enterprise environment, the System Orchestrator role is typically centralized and can be mapped to the traditional role of system governor that provides the overarching requirements and constraints that the system must fulfill, including policy, architecture, resources, or business requirements. A system governor works with a collection of other roles (e.g., data manager, data security, and system manager) to implement the requirements and the system’s functionality.

In a loosely coupled vertical system, the System Orchestrator role is typically decentralized. Each independent stakeholder is responsible for its own system management, security, and integration, as well as integration within the Big Data distributed system using the interfaces provided by other stakeholders.

3.2 DATA PROVIDER

The Data Provider role introduces new data or information feeds into the Big Data system for discovery, access, and transformation by the Big Data system. New data feeds are distinct from the data already in use by the system and residing in the various system repositories. Similar technologies can be used to access both new data feeds and existing data. The Data Provider actors can be anything from a sensor, to a human inputting data manually, to another Big Data system.

One of the important characteristics of a Big Data system is the ability to import and use data from a variety of data sources. Data sources can be internal or public records, tapes, images, audio, videos, sensor data, web logs, system and audit logs, HyperText Transfer Protocol (HTTP) cookies, and other sources. Humans, machines, sensors, online and offline applications, Internet technologies, and other actors can also produce data sources.

The roles of Data Provider and Big Data Application Provider often belong to different organizations, unless the organization implementing the Big Data Application Provider owns the data sources. Consequently, data from different sources may have different security and privacy considerations. In fulfilling its role, the Data Provider creates an abstraction of the data sources. In the case of raw data sources, the Data Provider can potentially clean, correct, and store the data in an internal format that is accessible to the Big Data system that will ingest it.

The Data Provider can also provide an abstraction of data previously transformed by another system (e.g., a legacy system or another Big Data system). In this case, the Data Provider would represent a Data Consumer of the other system. For example, Data Provider 1 could generate a streaming data source from the operations performed by Data Provider 2 on a dataset at rest.

Data Provider activities include the following, which are common to most systems that handle data:
• Collecting the data;
• Persisting the data;
• Providing transformation functions for data scrubbing of sensitive information such as personally identifiable information (PII);
• Creating the metadata describing the data source(s), usage policies/access rights, and other relevant attributes;
• Enforcing access rights on data access;
• Establishing formal or informal contracts for data access authorizations;
• Making the data accessible through suitable programmable push or pull interfaces;
• Providing push or pull access mechanisms; and
• Publishing the availability of the information and the means to access it.
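
As an illustration of the data scrubbing activity listed above, the following minimal sketch masks two kinds of sensitive values in free text. The patterns (email addresses and U.S. Social Security number formats) and the placeholder tokens are assumptions chosen for the example; a real Data Provider would define its own PII categories and rules.

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def scrub(text: str) -> str:
    """Replace email addresses and SSN-like strings with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    return SSN.sub("[SSN]", text)
```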

The Data Provider exposes a collection of interfaces (or services) for discovering and accessing the data. These interfaces would typically include a registry so that applications can locate a Data Provider, identify the data of interest it contains, understand the types of access allowed, understand the types of analysis supported, locate the data source, determine data access methods, identify the data security requirements, identify the data privacy requirements, and obtain other pertinent information. Therefore, the interface would provide the means to register the data source, query the registry, and identify a standard set of data contained by the registry.
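
The register and query operations described above can be sketched as a small in-memory registry. The class names, fields, and example locations are hypothetical and serve only to illustrate the idea; the NBDRA does not prescribe a registry schema.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional


@dataclass(frozen=True)
class DataSourceEntry:
    """Metadata a Data Provider might publish about one data source."""
    name: str
    location: str                    # where the data source can be reached
    access_methods: FrozenSet[str]   # e.g., {"pull", "push"}
    privacy_level: str = "public"    # illustrative label only


class Registry:
    """In-memory registry supporting register and query operations."""

    def __init__(self) -> None:
        self._entries: Dict[str, DataSourceEntry] = {}

    def register(self, entry: DataSourceEntry) -> None:
        if entry.name in self._entries:
            raise ValueError(f"data source already registered: {entry.name}")
        self._entries[entry.name] = entry

    def query(self, access_method: Optional[str] = None,
              privacy_level: Optional[str] = None) -> List[DataSourceEntry]:
        """Return entries matching every filter that is not None."""
        results = []
        for entry in self._entries.values():
            if access_method is not None and access_method not in entry.access_methods:
                continue
            if privacy_level is not None and entry.privacy_level != privacy_level:
                continue
            results.append(entry)
        return results


registry = Registry()
registry.register(DataSourceEntry(
    "web-logs", "sftp://example.invalid/logs", frozenset({"pull"})))
registry.register(DataSourceEntry(
    "sensor-feed", "tcp://example.invalid:9000",
    frozenset({"push", "pull"}), "restricted"))
```

An application could then call `registry.query(access_method="push")` to find the sources that support push delivery before negotiating access.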

Subject to Big Data characteristics (i.e., volume, variety, velocity, and variability) and system design considerations, interfaces for exposing and accessing data would vary in their complexity and can include both push and pull software mechanisms. These mechanisms can include subscription to events, listening to data feeds, querying for specific data properties or content, and the ability to submit code for execution to process the data in situ. Because the data can be too large to economically move across the network, the interface could also allow the submission of analysis requests (e.g., software code implementing a certain algorithm for execution), with the results returned to the requestor. Data access may not always be automated, but might involve a human role logging into the system and providing directions where new data should be transferred (e.g., establishing a subscription to an email-based data feed).

The interface between the Data Provider and Big Data Application Provider typically will go through three phases: initiation, data transfer, and termination. The initiation phase is started by either party and often includes some level of authentication/authorization. The phase may also include queries for metadata about the source or consumer, such as the list of available topics in a publish/subscribe (pub/sub) model and the transfer of any parameters (e.g., object count/size limits or target storage locations). Alternatively, the phase may be as simple as one side opening a socket connection to a known port on the other side.

The data transfer phase may be a push from the Data Provider or a pull by the Big Data Application Provider. It may also be a singular transfer or involve multiple repeating transfers. In a repeating transfer situation, the data may be a continuous stream of transactions/records/bytes. In a push scenario, the Big Data Application Provider must be prepared to accept the data asynchronously but may also be required to acknowledge (or negatively acknowledge) the receipt of each unit of data. In a pull scenario, the Big Data Application Provider would specifically generate a request that defines, through parameters, the data to be returned. The returned data could itself be a stream or multiple records/units of data, and the data transfer phase may consist of multiple request/send transactions.

The termination phase could be as simple as one side dropping the connection or could include checksums, counts, hashes, or other information about the completed transfer.
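
The three phases can be sketched for a pull scenario as follows. This is an in-process illustration under stated assumptions: the class `DataProviderStub`, the shared-token check, the batch parameters, and the SHA-256 summary are all hypothetical choices, not interfaces defined by the NBDRA.

```python
import hashlib
from typing import Dict, List


class DataProviderStub:
    """Hypothetical provider exposing initiation, pull, and termination."""

    def __init__(self, records: List[bytes], token: str) -> None:
        self._records = records
        self._token = token
        self._open = False

    def initiate(self, token: str) -> Dict[str, int]:
        # Initiation: authenticate, then return metadata about the source.
        if token != self._token:
            raise PermissionError("authentication failed")
        self._open = True
        return {"record_count": len(self._records)}

    def pull(self, offset: int, limit: int) -> List[bytes]:
        # Data transfer: the request parameters define the data returned.
        if not self._open:
            raise RuntimeError("transfer not initiated")
        return self._records[offset:offset + limit]

    def terminate(self) -> Dict[str, object]:
        # Termination: report a count and hash of the completed transfer.
        self._open = False
        digest = hashlib.sha256(b"".join(self._records)).hexdigest()
        return {"record_count": len(self._records), "sha256": digest}


def pull_all(provider: DataProviderStub, token: str,
             batch_size: int = 2) -> List[bytes]:
    """Run all three phases and verify the transfer summary."""
    meta = provider.initiate(token)
    received: List[bytes] = []
    while len(received) < meta["record_count"]:
        batch = provider.pull(len(received), batch_size)
        if not batch:
            break
        received.extend(batch)
    summary = provider.terminate()
    digest = hashlib.sha256(b"".join(received)).hexdigest()
    if len(received) != summary["record_count"] or digest != summary["sha256"]:
        raise IOError("transfer verification failed")
    return received
```

The data transfer phase here consists of multiple request/send transactions, and the termination summary lets the receiving side detect a truncated or corrupted transfer.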

3.3 BIG DATA APPLICATION PROVIDER

The Big Data Application Provider role executes a specific set of operations along the data life cycle to meet the requirements established by the System Orchestrator, as well as security and privacy requirements. The Big Data Application Provider is the architecture component that encapsulates the business logic and functionality to be executed by the architecture. The Big Data Application Provider activities include the following:
• Collection,
• Preparation,
• Analytics,
• Visualization, and
• Access.

These activities are represented by the subcomponents of the Big Data Application Provider as shown in Figure 3. The execution of these activities would typically be specific to the application and is therefore not a candidate for standardization. However, the metadata and the policies defined and exchanged between the application’s subcomponents could be standardized when the application is specific to a vertical industry.

While many of these activities exist in traditional data processing systems, the data volume, velocity, variety, and variability present in Big Data systems radically change their implementation. The algorithms and mechanisms of traditional data processing implementations need to be adjusted and optimized to create applications that are responsive and can grow to handle ever-growing data collections.

As data propagates through the ecosystem, it is processed and transformed in different ways to extract value from the information. Each activity of the Big Data Application Provider can be implemented by independent stakeholders and deployed as stand-alone services.

The Big Data Application Provider can be a single instance or a collection of more granular Big Data Application Providers, each implementing different steps in the data life cycle. Each of the activities of the Big Data Application Provider may be a general service invoked by the System Orchestrator, Data Provider, or Data Consumer, such as a web server, a file server, a collection of one or more application programs, or a combination. There may be multiple and differing instances of each activity or a single program may perform multiple activities. Each of the activities is able to interact with the underlying Big Data Framework Providers as well as with the Data Providers and Data Consumers. In addition, these activities may execute in parallel or in any number of sequences and will frequently communicate with each other through the messaging/communications element of the Big Data Framework Provider. Also, the functions of the Big Data Application Provider, specifically the collection and access activities, will interact with the Security and Privacy Fabric to perform authentication/authorization and record/maintain data provenance.

Each of the functions can run on a separate Big Data Framework Provider or all can use a common Big Data Framework Provider. The considerations behind these different system approaches would depend on potentially different technological needs, business and/or deployment constraints (including privacy), and other policy considerations. The baseline NBDRA does not show the underlying technologies, business considerations, and topological constraints, thus making it applicable to any kind of system approach and deployment.

For example, the infrastructure of the Big Data Application Provider would be represented as one of the Big Data Framework Providers. If the Big Data Application Provider also uses external/outsourced infrastructures, these will be represented as one or more additional Big Data Framework Providers in the NBDRA. The multiple blocks behind the Big Data Framework Providers in Figure 3 indicate that multiple Big Data Framework Providers can support a single Big Data Application Provider.

3.4 BIG DATA FRAMEWORK PROVIDER

The Big Data Framework Provider typically consists of one or more hierarchically organized instances of the components in the NBDRA IT value chain (Figure 3). There is no requirement that all instances at a given level in the hierarchy be of the same technology. In fact, most Big Data implementations are hybrids that combine multiple technology approaches in order to provide flexibility or meet the complete range of requirements, which are driven by the Big Data Application Provider.

Many of the recent advances related to Big Data have been in the area of frameworks designed to scale to Big Data needs (e.g., addressing volume, variety, velocity, and variability) while maintaining linear or near-linear performance. These advances have generated much of the technology excitement in the Big Data space. Accordingly, there is a great deal more information available in the frameworks area compared to the other components, and the additional detail provided for the Big Data Framework Provider in this document reflects this imbalance.

The Big Data Framework Provider comprises the following three sub-roles (from the bottom to the top):
• Infrastructure Frameworks,
• Data Platform Frameworks, and
• Processing Frameworks.

3.5 DATA CONSUMER

Similar to the Data Provider, the role of Data Consumer within the NBDRA can be an actual end user or another system. In many ways, this role is the mirror image of the Data Provider, with the entire Big Data framework appearing like a Data Provider to the Data Consumer. The activities associated with the Data Consumer role include the following:
• Search and Retrieve,
• Download,
• Analyze Locally,
• Reporting,
• Visualization, and
• Data to Use for Their Own Processes.

The Data Consumer uses the interfaces or services provided by the Big Data Application Provider to get access to the information of interest. These interfaces can include data reporting, data retrieval, and data rendering.

This role will generally interact with the Big Data Application Provider through its access function to execute the analytics and visualizations implemented by the Big Data Application Provider. This interaction may be demand-based, where the Data Consumer initiates the command/transaction and the Big Data Application Provider replies with the answer. The interaction could include interactive visualizations, creating reports, or drilling down through data using business intelligence functions provided by the Big Data Application Provider. Alternatively, the interaction may be stream- or push-based, where the Data Consumer simply subscribes or listens for one or more automated outputs from the application. In almost all cases, the Security and Privacy Fabric around the Big Data architecture would support the authentication and authorization between the Data Consumer and the architecture, with either side able to perform the role of authenticator/authorizer and the other side providing the credentials. Like the interface between the Big Data architecture and the Data Provider, the interface between the Data Consumer and Big Data Application Provider would also pass through the three distinct phases of initiation, data transfer, and termination.
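
The push-based interaction, in which a Data Consumer subscribes to automated outputs, can be sketched with a minimal in-process broker. The class `OutputBroker` and the topic name are hypothetical; a deployed system would typically rely on the messaging facilities of its Big Data Framework Provider instead.

```python
from typing import Callable, Dict, List


class OutputBroker:
    """Hypothetical broker that pushes application outputs to subscribers."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[dict], None]]] = {}

    def subscribe(self, topic: str, callback: Callable[[dict], None]) -> None:
        """Register a Data Consumer callback for one output topic."""
        self._subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic: str, message: dict) -> int:
        """Push a message to every subscriber; return how many received it."""
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(message)
        return len(callbacks)


# Usage: the consumer only listens; the application decides when to push.
received: List[dict] = []
broker = OutputBroker()
broker.subscribe("daily-report", received.append)
delivered = broker.publish("daily-report", {"rows": 3})
```

A demand-based interaction would invert the control flow: the Data Consumer would issue a request and wait for the reply rather than registering a callback.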

3.6 MANAGEMENT FABRIC OF THE NBDRA

The Big Data characteristics of volume, velocity, variety, and variability demand a versatile management platform for storing, processing, and managing complex data. Management of Big Data systems should handle both system- and data-related aspects of the Big Data environment. The Management Fabric of the NBDRA encompasses two general groups of activities: system management and Big Data life cycle management (BDLM). System management includes activities such as provisioning, configuration, package management, software management, backup management, capability management, resources management, and performance management. BDLM involves activities surrounding the data life cycle of collection, preparation/curation, analytics, visualization, and access.

As discussed above, the NBDRA represents a broad range of Big Data systems, from tightly coupled enterprise solutions integrated by standard or proprietary interfaces to loosely coupled vertical systems maintained by a variety of stakeholders or authorities bound by agreements, standard interfaces, or de facto standard interfaces. Therefore, different considerations and technical solutions would be applicable for different cases.

3.7 SECURITY AND PRIVACY FABRIC OF THE NBDRA

Security and privacy considerations form a fundamental aspect of the NBDRA. This is geometrically depicted in Figure 3 by the Security and Privacy Fabric surrounding the five main components, indicating that all components are affected by security and privacy considerations. Thus, the role of security and privacy is correctly depicted in relation to the components but does not expand into finer details, which may be more accurate but are best relegated to a more detailed security and privacy reference architecture. The Data Provider and Data Consumer are included in the Security and Privacy Fabric since, at the least, they may often nominally agree on security protocols and mechanisms. The Security and Privacy Fabric is an approximate representation that alludes to the intricate interconnected nature and ubiquity of security and privacy throughout the NBDRA. Additional details about the Security and Privacy Fabric are included in the NIST Big Data Interoperability Framework: Volume 4, Security and Privacy document.

4 NBDRA ARCHITECTURE VIEWS

As outlined in Section 3, the five main roles and two fabrics of the NBDRA represent the different categories of technical activities and functional components within a Big Data system. In order to apply the NBDRA to a particular system, it is necessary to construct architecture views of these activities and the functional components that implement them. In constructing these views, the following definitions apply:

Role: A related set of functions performed by one or more actors.

Sub-Role: A closely related subset of functions within a larger role.

Activity: A class of functions performed to fulfill the needs of one or more roles. Example: Data Collection is a class of activities through which a Big Data Application Provider obtains data. Instances of such activities would be web crawling, File Transfer Protocol (FTP) site access, web services, database queries, etc.

Functional Component: A class of physical items that support one or more activities within a role. Example: Stream Processing Frameworks are a class of computing frameworks that implement processing of streaming data. Instances of such frameworks would include Spark and Storm.

In order to promote consistency and the ability to easily compare and contrast the views of different architecture implementations, the NBDRA is proposing the conventions shown in Figure 6 for the activities and functional component views.

Figure 6: NBDRA View Conventions

The process of applying the NBDRA to a specific architecture implementation involves creating two views of the architecture. The first view is the Activities View where one would enumerate the activities to be accomplished by each role and sub-role within the system. Since there could be multiple instances of different roles within a given system architecture, it would be appropriate to construct separate architecture views for each instance since the role would likely be performing different activities through different functional components.

Figure 7 below provides a broad skeleton for construction of the activity views in terms of the roles and fabrics which anchor each view into a common framework. Depending on the specifics of a particular architecture, it may be helpful to visually rearrange these components, show multiple instances where appropriate, and even construct separate sub-view diagrams for each role. These choices are entirely dependent on the specific architecture requirements.

Figure 7: Top Level Roles and Fabrics

Sections 4.1 and 4.2 provide high-level examples of the types and classes of activities and functional components, respectively, that may be required to support a given architecture implementation. General classes and descriptions are provided in both cases because across the range of potential Big Data applications and architectures, the potential specific activities would be too numerous to enumerate and the rapid evolution of software/hardware functional components makes a complete list impractical.

It should also be noted that as one goes lower down the IT value chain of the architecture, the activities and functional components become less varied in both type and detail.

Finally, the sections below do not attempt to provide activity or functional component details for the Data Provider or Data Consumer roles. There are two reasons for this. First, a Data Provider could be anything from a simple sensor to a full-blown Big Data system itself. Providing a comprehensive list would be impractical as shown in the System of Systems View in Figure 5 above. Second, often the Data Provider and Data Consumer roles are supported by elements external to the architecture being developed and thus are outside the control of the architect. The user of this report should enumerate and document those activities and functions to the extent it makes sense for their specific architecture. In cases where the Data Provider and Data Consumer roles are within the architecture boundary, the user is advised to create views based on similar roles, activities, and functional components found in the sections below. In cases where those roles are external to the architecture, the user should document any activities or components on which the architecture is dependent. For example, activities and components related to authentication or service-level agreements should be captured.

4.1 ACTIVITIES VIEW

As described above, the activities view is meant to describe what is performed or accomplished by various roles in the Big Data system. As per the definitions, an activity can be something performed by a person, organization, software, or hardware. Figure 8 below provides some top-level classes of activities by roles and sub-roles which may be applicable to a Big Data architecture implementation. The following paragraphs describe the roles and the classes of activities associated with those roles. The user is advised to use these examples primarily as guides and to create more specific classes of activities and associated descriptions as required to document their architecture.

Figure 8: Top-Level Classes of Activities Within the Activities View

Because the Data Provider and Data Consumer roles can represent anything such as another computer system, a Big Data system, a person sitting at a keyboard, or remote sensors, the sub-roles and classes of activities associated with these roles can encompass any of the activity classes defined below or others. Users of the NBDRA should define the classes of activities and particular activities that address specific concerns related to their architecture implementation.

The following paragraphs describe the general classes of activities implemented within the roles, sub-roles, and fabrics of the NBDRA.

4.1.1 SYSTEM ORCHESTRATOR

The activities within the System Orchestrator role set the overall ownership, governance, and policy functions for the Big Data system by defining the appropriate requirements. These activities take place primarily during the system definition phase but must be revisited periodically throughout the life cycle of the system. The other primary aspect of activities under this role is the monitoring of compliance with the associated requirements.

Some classes of activities that could be defined for this role in the architecture include requirements definition and compliance monitoring for:
• Business Ownership: This activity class defines which stakeholders own and have responsibility for the various parts of the Big Data system. This activity would define the ownership and responsibility for the activities and functional components of the rest of the system and how that ownership will be monitored.
• Governance: This activity class would define the policies and process for governance of the overall system. These governance requirements would in turn be executed and monitored by the stakeholders defined as owners for the respective parts of the system.
• System Architecture: This class of activities involves defining the overall requirements that must be met by the system architecture. In general, activities in this class establish the technical guidelines that the overall system must meet and then provide the policies for monitoring the overall architecture to verify that it remains in compliance with the requirements.
• Data Science: Activities in this class would define many of the requirements that must be met by individual algorithms or applications within the system. These could include accuracy of calculations or the precision/recall of data mining algorithms.
• Security/Privacy: While no classes of activities are considered mandatory, this class is certainly the most critical and any architecture without well-defined security and privacy requirements and associated monitoring is bound to be at extreme risk. Security deals with the control of access to the system and its data and is required to ensure the privacy of personal or corporate information. Privacy relates both to securing personal information and to defining the policies and controls by which that information or derived information may or may not be shared.
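
As an illustration of how a Data Science requirement from the list above might be monitored, the following minimal sketch computes precision and recall for a set of retrieved items and checks them against thresholds. The function names and the threshold parameters are assumptions made for the example.

```python
from typing import Set, Tuple


def precision_recall(predicted: Set, relevant: Set) -> Tuple[float, float]:
    """Return (precision, recall) for predicted items versus relevant items."""
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


def meets_requirement(predicted: Set, relevant: Set,
                      min_precision: float, min_recall: float) -> bool:
    """Check an algorithm's output against hypothetical minimum thresholds."""
    precision, recall = precision_recall(predicted, relevant)
    return precision >= min_precision and recall >= min_recall
```

For example, with predicted items {1, 2, 3, 4} and relevant items {2, 3, 5}, two of the four predictions are correct (precision 0.5) and two of the three relevant items are found (recall of about 0.67).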

Other classes of activities that may be addressed include the following:
• Quality Management,
• Service Management, and
• Audit Requirements.

4.1.2 BIG DATA APPLICATION PROVIDER

4.1.2.1 Collection

In general, the collection activity of the Big Data Application Provider handles the interface with the Data Provider. This may be a general service, such as a file server or web server configured by the System Orchestrator to accept or perform specific collections of data, or it may be an application-specific service designed to pull data or receive pushes of data from the Data Provider. Since this activity is, at a minimum, receiving data, it must store/buffer the received data until it is persisted through the Big Data Framework Provider. This persistence need not be to physical media but may simply be to an in-memory queue or other service provided by the processing frameworks of the Big Data Framework Provider. The collection activity is likely where the extraction portion of the Extract, Transform, Load (ETL)/Extract, Load, Transform (ELT) cycle is performed. At the initial collection stage, sets of data (e.g., data records) of similar structure are collected (and combined), resulting in uniform security, policy, and other considerations. Initial metadata is created (e.g., subjects with keys are identified) to facilitate subsequent aggregation or look-up methods.

4.1.2.2 Preparation

The preparation activity is where the transformation portion of the ETL/ELT cycle is likely performed, although the analytics activity will also likely perform advanced parts of the transformation. Tasks performed by this activity could include data validation (e.g., checksums/hashes, format checks), cleaning (e.g., eliminating bad records/fields), outlier removal, standardization, reformatting, or encapsulating. This activity is also where source data will frequently be persisted to archive storage in the Big Data Framework Provider and provenance data will be verified or attached/associated. Verification or attachment may include optimization of data through manipulations (e.g., deduplication) and indexing to optimize the analytics process. This activity may also aggregate data from different Data Providers, leveraging metadata keys to create an expanded and enhanced dataset.
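
Several of the preparation tasks named above (checksum validation, cleaning, standardization, and deduplication) can be sketched in a few lines. The record schema with `id` and `value` fields, and the choice of SHA-256, are assumptions made purely for illustration.

```python
import hashlib
from typing import Dict, Iterable, List

REQUIRED_FIELDS = ("id", "value")  # hypothetical schema for this example


def checksum_ok(payload: bytes, expected_sha256: str) -> bool:
    """Validate a received payload against the hash supplied by the sender."""
    return hashlib.sha256(payload).hexdigest() == expected_sha256


def prepare(records: Iterable[Dict]) -> List[Dict]:
    """Clean, standardize, and deduplicate raw records."""
    seen = set()
    prepared = []
    for record in records:
        # Cleaning: drop records with missing or empty required fields.
        if any(record.get(name) in (None, "") for name in REQUIRED_FIELDS):
            continue
        # Standardization: normalize the key and coerce the value to float.
        key = str(record["id"]).strip().lower()
        # Deduplication: keep only the first record seen for each key.
        if key in seen:
            continue
        seen.add(key)
        prepared.append({"id": key, "value": float(record["value"])})
    return prepared
```

The normalized `id` produced here is the kind of metadata key that could later be used to aggregate data from different Data Providers.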

4.1.2.3 Analytics

The analytics activity of the Big Data Application Provider includes the encoding of the low-level business logic of the Big Data system (with higher-level business process logic being encoded by the System Orchestrator). The activity implements the techniques to extract knowledge from the data based on the requirements of the vertical application. The requirements specify the data processing algorithms for processing the data to produce new insights that will address the technical goal. The analytics activity will leverage the processing frameworks to implement the associated logic. This typically involves the activity providing software that implements the analytic logic to the batch and/or streaming elements of the processing framework for execution. The messaging/communication framework of the Big Data Framework Provider may be used to pass data or control functions to the application logic running in the processing frameworks. The analytic logic may be broken up into multiple modules to be executed by the processing frameworks which communicate, through the messaging/communication framework, with each other and other functions instantiated by the Big Data Application Provider.

4.1.2.4 Visualization

The visualization activity of the Big Data Application Provider prepares elements of the processed data and the output of the analytic activity for presentation to the Data Consumer. The objective of this activity is to format and present data in such a way as to optimally communicate meaning and knowledge. The visualization preparation may involve producing a text-based report or rendering the analytic results as some form of graphic. The resulting output may be a static visualization and may simply be stored through the Big Data Framework Provider for later access. However, the visualization activity frequently interacts with the access activity, the analytics activity, and the Big Data Framework Provider (processing and platform) to provide interactive visualization of the data to the Data Consumer based on parameters provided to the access activity by the Data Consumer. The visualization activity may be implemented entirely by the application, may leverage one or more application libraries, or may use specialized visualization processing frameworks within the Big Data Framework Provider.

4.1.2.5 Access

The access activity within the Big Data Application Provider is focused on the communication/interaction with the Data Consumer. Similar to the collection activity, the access activity may be a generic service such as a web server or application server that is configured by the System Orchestrator to handle specific requests from the Data Consumer. This activity would interface with the visualization and analytic activities to respond to requests from the Data Consumer (who may be a person) and would use the processing and platform frameworks to retrieve data to respond to Data Consumer requests. In addition, the access activity confirms that descriptive and administrative metadata and metadata schemes are captured and maintained for access by the Data Consumer and as data is transferred to the Data Consumer. The interface with the Data Consumer may be synchronous or asynchronous in nature and may use a pull or push paradigm for data transfer.


4.1.3 BIG DATA FRAMEWORK PROVIDER

The Big Data Framework Provider role supports classes of activities associated with providing management and communications between the subordinate sub-roles (i.e., Processing, Platforms, and Infrastructures) and their classes of activities. Two common classes of activities associated with this role are the following:

• Messaging: This activity class provides the necessary message queues and other communication mechanisms that support communications between the activities within the Big Data Framework Provider sub-roles and the Big Data Application Provider activities.

• Resource Management: Resources available to a given Big Data system are finite, so activities that manage the allocation of resources to other sub-roles and activities are necessary. Such activities would ensure that resources are allocated with an appropriate priority relative to other activities and that resources, such as memory and central processing unit (CPU), are not oversubscribed.
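As a minimal sketch of the resource management activity described above, the following Python fragment grants requests in priority order and rejects any request that would oversubscribe CPU or memory. The `Request` and `ResourceManager` names, the capacity figures, and the convention that a lower number means a higher priority are illustrative assumptions, not part of the NBDRA.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    name: str        # hypothetical activity name
    priority: int    # assumption: lower number = higher priority
    cpu: int         # cores requested
    mem_gb: int      # memory requested, in GB


@dataclass
class ResourceManager:
    cpu_total: int
    mem_total_gb: int
    granted: list = field(default_factory=list)

    def free(self):
        """Return (free cores, free GB) after all granted requests."""
        used_cpu = sum(r.cpu for r in self.granted)
        used_mem = sum(r.mem_gb for r in self.granted)
        return self.cpu_total - used_cpu, self.mem_total_gb - used_mem

    def allocate(self, requests):
        """Grant requests in priority order; never oversubscribe."""
        rejected = []
        for req in sorted(requests, key=lambda r: r.priority):
            free_cpu, free_mem = self.free()
            if req.cpu <= free_cpu and req.mem_gb <= free_mem:
                self.granted.append(req)
            else:
                rejected.append(req)
        return rejected


manager = ResourceManager(cpu_total=16, mem_total_gb=64)
rejected = manager.allocate([
    Request("batch-analytics", priority=2, cpu=12, mem_gb=48),
    Request("stream-ingest", priority=1, cpu=8, mem_gb=16),
    Request("report-render", priority=3, cpu=4, mem_gb=8),
])
```

In this example the batch request is rejected because the higher-priority streaming request has already taken half of the cores. A real resource manager would more likely queue or preempt work than reject it outright.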

4.1.3.1 Infrastructure Activities

Classes of activities within the Infrastructure sub-role support the underlying computing, storage, and networking functions required to implement the overall system. These activity classes reflect the underlying operations performed on data within the system, including transmission, reception, storage, manipulation, and retrieval. These activities may be associated with physical or virtual infrastructure resources. In defining the specific activities for a given system, the focus should be on specific types of activities. For example, a system which requires highly parallel processing of large matrices or other data may specify an activity which supports Single Instruction Multiple Data (SIMD) computing, such as that provided by Graphics Processing Units (GPUs). Transmission activities may include descriptions of data transmission requirements which define the required throughput and latency. Storage and retrieval activities might describe the performance of volatile or non-volatile storage.

4.1.3.2 Platform Activities

The Big Data Platform Provider sub-role is associated with activities which manage the organization and distribution of data within the Big Data system. Since many Big Data systems are horizontally distributed across multiple infrastructure resources, specific activities related to creating data elements can specify that data will be replicated across a number of nodes and will be eventually consistent when accessed from any node in the cluster. Other activities should describe how data will be accessed and what type of indexing is required to support that access. For example, geospatial data requires specialized indexing for efficient retrieval, so a related activity might describe maintaining a z-curve (Z-order curve) type of index.
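To illustrate the kind of z-curve index mentioned above, the following Python sketch computes a Z-order (Morton) key by interleaving the bits of two grid coordinates. The 16-bit grid resolution and the function names are assumptions chosen for illustration. Nearby cells often receive nearby keys, which is what makes the key useful in a sorted index, although the locality is not perfect.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Return the Z-order (Morton) key for non-negative grid cell (x, y)."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)       # x bits go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)   # y bits go to odd positions
    return key


def to_cell(lon: float, lat: float, bits: int = 16) -> tuple:
    """Quantize longitude/latitude onto a 2^bits x 2^bits grid."""
    scale = (1 << bits) - 1
    x = int((lon + 180.0) / 360.0 * scale)
    y = int((lat + 90.0) / 180.0 * scale)
    return x, y


def z_key(lon: float, lat: float, bits: int = 16) -> int:
    """Single sortable key for a geographic point."""
    x, y = to_cell(lon, lat, bits)
    return interleave_bits(x, y, bits)
```

Sorting records by `z_key` turns a two-dimensional lookup into range scans over a one-dimensional key, which suits many key-value and column-oriented stores.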

4.1.3.3 Processing Activities

Processing activities describe how data will be processed in support of Big Data applications. This processing generally falls along a continuum, from long-running batch jobs, through responsive processing that supports interactive applications, to continuous stream processing. The types of processing activities described for a given architecture depend on the characteristics (primarily volume and velocity) of the data processed by the Big Data Application Providers and on their requirements. Depending on the type of processing required, an activity might describe MapReduce or Bulk Synchronous Parallel (BSP) processing for batch-oriented requirements. Streaming activities might specify the performance requirements necessary to handle the volume or velocity of data.


4.1.4 MANAGEMENT FABRIC ACTIVITIES

4.1.4.1 System Management

To address the daily demands of operating multiple Big Data applications, a Big Data Management Fabric may be needed by planners, operators, and data center owners. Stated broadly, Big Data creates a need for larger or novel forms of operational intelligence. These include the following:

• Configuration activities that manage accountability and traceability for data access by individual subjects/consumers, as well as by their associated organizations.

• Resource management activities to support burst and peak demand tied to both planned and unplanned usage changes. Specific activities would be defined to support the automated allocation of resources to meet demand. The impact of load fluctuations can be smoothed by predicting them through simulation, predictive load analytics, more intelligent monitoring, and practical experience. Modeling and simulation for operational intelligence may become essential in some settings [11], [12].

• Monitoring activities to support operational mitigation and resilience for both centralized and decentralized services. These activities may also support load balancing in conjunction with resource management activities to avoid outages during unexpected peak loads and reduce costs during off-peak times. Real-time monitoring, gating, filtering, and throttling of streaming data requires new approaches due to the “variety of tasks, such as performance analysis, workload management, capacity planning, and fault detection. Applications producing Big Data make the monitoring task very difficult at high-sampling frequencies because of high computational and communication overheads [13].”

• Provisioning and package management activities to support automated deployment and configuration of software and services. This class of activities is frequently associated with the emerging DevOps movement, which is designed to automate the frequent deployment of capabilities into production. It also reflects a movement toward automated methods for ensuring information assurance, since training and governance alone may not scale. See references [14] and [15].

• Big Data life cycle management (BDLM) activities support the overall life cycle of data throughout its existence within the Big Data system. Of all the classes of management fabric activities, the BDLM activities are the most affected by the Big Data characteristics and merit the additional discussion below.

4.1.4.2 Big Data Life Cycle Management

BDLM faces more challenges than traditional data life cycle management (DLM), which may require less data transfer, processing, and storage. However, BDLM still inherits the DLM phases (data acquisition, distribution, use, migration, maintenance, and disposition), but at a much larger processing scale. The Big Data Application Providers may require much more computational processing for collection, preparation/curation, analytics, visualization, and access to be able to use the analytic results. In other words, the BDLM activity includes verification that the data are handled correctly by other NBDRA components in each process within the data life cycle, from the moment they are ingested into the system by the Data Provider until the data are processed or removed from the system.

The importance of BDLM to Big Data is demonstrated through the following considerations:

• Data volume can be extremely large, which may overwhelm the storage capacity or make storing incoming data prohibitively expensive.

• Data velocity, the rate at which data can be captured and ingested into the system, can overwhelm available storage space at a given time. Even with the elastic storage service provided by cloud computing for handling dynamic storage needs, unconstrained data storage may also be unnecessarily costly for certain application requirements.

• Different Big Data applications will likely have different requirements for the lifetime of a piece of data. The differing requirements have implications for how often data must be refreshed so that processing results are valid and useful. In data refreshment, old data are dispositioned and not fed into analytics or discovery programs. At the same time, new data are ingested and taken into account by the computations. For example, real-time applications may need a very short data lifetime, whereas a market study of consumers' interest in a product line may need to mine data collected over a longer period of time.

Because the task of BDLM can be distributed among different organizations and/or individuals within the Big Data computing environment, it is more difficult to coordinate data processing between NBDRA components in a way that complies with policies, regulations, and security requirements. Within this context, BDLM may need to include the following sub-activities:

• Policy Management: Captures the requirements for the data life cycle that allow old data to be dispositioned and new data to be considered by Big Data applications. Maintains the migration and disposition strategies that specify the mechanism for data transformation and dispositioning, including transcoding data, transferring old data to lower-tier storage for archival purposes, removing data, or marking data as in situ.

• Metadata Management: Enables BDLM, since metadata are used to store information that governs the management of the data within the system. Essential metadata information includes persistent identification of the data, fixity/quality, and access rights. The challenge is to find the minimum set of elements needed to execute the required BDLM strategy in an efficient manner.

• Accessibility Management: This involves the change of data accessibility over time. For example, census data can be made available to the public after 72 years. BDLM is responsible for triggering the accessibility update of the data or sets of data according to policy and legal requirements. Normally, data accessibility information is stored in the metadata.

• Data Recovery: BDLM can include the recovery of data that were lost due to disaster or system/storage fault. Traditionally, data recovery can be achieved using regular backup and restore mechanisms. However, given the large volume of Big Data, traditional backup may not be feasible. Instead, replication may have to be designed within the Big Data ecosystem. Replication strategies have to be designed according to the tolerance of data loss, and each application has its own tolerance level. The replication strategy includes the replication window time, the selected data to be replicated, and the requirements for geographic disparity. Additionally, in order to cope with the large volume of Big Data, data backup and recovery should consider the use of modern technologies within the Big Data Framework Provider.

• Preservation Management: The system maintains data integrity so that the veracity and velocity requirements of the analytics process are met. Due to the extremely large volume of Big Data, preservation management is responsible for the disposition of aged data contained in the system. Depending on the retention policy, these aged data can be deleted or migrated to archival storage. In the case where data must be retained for years, decades, and even centuries, a preservation strategy will be needed so the data can be accessed by the provider components if required. This will invoke long-term digital preservation that can be performed by Big Data Application Providers using the resources of the Big Data Framework Provider.
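The policy, accessibility, and preservation sub-activities above can be sketched as a small rule evaluation over an item's metadata. This is a minimal illustration under stated assumptions: the `LifecyclePolicy` fields, the thresholds, and the three disposition labels are hypothetical, and only the 72-year census example comes from the text.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class LifecyclePolicy:
    # All thresholds are illustrative assumptions.
    archive_after_days: int             # migrate to lower-tier storage at this age
    delete_after_days: Optional[int]    # None means retain indefinitely
    public_after_years: Optional[int]   # None means never released


def disposition(created: date, today: date, policy: LifecyclePolicy) -> str:
    """Return 'delete', 'archive', or 'retain' for a data item."""
    age = (today - created).days
    if policy.delete_after_days is not None and age >= policy.delete_after_days:
        return "delete"
    if age >= policy.archive_after_days:
        return "archive"
    return "retain"


def is_public(created: date, today: date, policy: LifecyclePolicy) -> bool:
    """Accessibility can change over time (e.g., census data after 72 years)."""
    if policy.public_after_years is None:
        return False
    year = created.year + policy.public_after_years
    try:
        release = created.replace(year=year)
    except ValueError:
        # 29 February in a non-leap release year: fall back to 28 February
        release = created.replace(year=year, day=28)
    return today >= release


sensor_policy = LifecyclePolicy(archive_after_days=7, delete_after_days=30,
                                public_after_years=None)
census_policy = LifecyclePolicy(archive_after_days=3650, delete_after_days=None,
                                public_after_years=72)
```

In a deployed system these decisions would be driven by the metadata that BDLM maintains, and the resulting actions would be carried out by the Big Data Framework Provider.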

In the context of Big Data, BDLM contends with the Big Data characteristics of volume, velocity, variety, and variability. As such, BDLM and its sub-activities interact with other components of the NBDRA as shown in the following examples:


• System Orchestrator: BDLM enables data scientists to initiate any combination of processing, including accessibility management, data backup/recovery, and preservation management. The process may involve other components of the NBDRA, such as the Big Data Application Provider and the Big Data Framework Provider. For example, data scientists may want to interact with the Big Data Application Provider for data collection and curation, invoke the Big Data Framework Provider to perform certain analysis, and grant certain users access to the analytic results from the Data Consumer.

• Data Provider: BDLM manages ingestion of data and metadata from the data source(s) into the Big Data system, which may include logging the entry event in the metadata by the Data Provider.

• Big Data Application Provider: BDLM executes data masking and format transformations for data preparation or curation purposes.

• Big Data Framework Provider: BDLM executes basic bit-level preservation and data backup and recovery according to the recovery strategy.

• Data Consumer: BDLM ensures that relevant data and analytic results are available with proper access control for consumers and software agents to consume within the BDLM policy strategy.

• Security and Privacy Fabric: Keeps the BDLM up to date according to new security policy and regulations.

The Security and Privacy Fabric also uses information coming from BDLM with respect to data accessibility. The Security and Privacy Fabric controls access to the functions and data usage produced by the Big Data system. This data access control can be informed by the metadata, which is managed and updated by BDLM.

4.1.5 SECURITY AND PRIVACY FABRIC ACTIVITIES

The Security and Privacy Fabric provides the activities necessary to manage the access to system data and services. The primary classes of activities associated with this fabric are:

• Authentication: This class of activities includes validation that the user or process is who they claim to be. The specific authentication activities may specify the type of authentication, such as two-factor or private key.

• Authorization: This class of activities ensures that the user or process has the rights to access resources or services. Access controls may define the specific access privileges (e.g., create, update, delete) for the data or services. The authorization activities may specify broad role-based access controls or more granular attribute-based access controls.

• Auditing: These activities record events that happen within the system, both to support forensic analysis in the event of a breach or corruption of data and to maintain provenance and pedigree for data.
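The difference between role-based and attribute-based authorization, together with auditing of each decision, can be sketched as follows. This is a minimal illustration that assumes authentication has already taken place. The role table, the `sensitivity` and `clearance` attributes, and the audit record layout are hypothetical.

```python
from datetime import datetime, timezone

# Hypothetical role-to-privilege table (role-based access control).
ROLE_PRIVILEGES = {
    "analyst": {"read"},
    "curator": {"read", "create", "update"},
    "admin": {"read", "create", "update", "delete"},
}

audit_log = []  # auditing: every authorization decision is recorded here


def rbac_allows(user: dict, action: str) -> bool:
    """Broad check: does the user's role carry this privilege?"""
    return action in ROLE_PRIVILEGES.get(user["role"], set())


def abac_allows(user: dict, resource: dict) -> bool:
    """Granular check on attributes; the attributes used are assumptions."""
    if resource.get("sensitivity") == "restricted":
        return user.get("clearance") == "restricted"
    return True


def authorize(user: dict, action: str, resource: dict) -> bool:
    """Combine both checks and record the outcome for later forensics."""
    allowed = rbac_allows(user, action) and abac_allows(user, resource)
    audit_log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "user": user["name"],
        "action": action,
        "resource": resource["id"],
        "allowed": allowed,
    })
    return allowed
```

Denied requests are logged as well as granted ones, since failed attempts are often the most useful entries during forensic analysis.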

Depending on the allocation of responsibilities, the Security and Privacy Fabric may also support certain provisioning and configuration activities. For example, activities for regular monitoring of system or application configuration files to ensure that there have been no unauthorized changes may be allocated to this fabric. In practice, the activities in the Security and Privacy Fabric and the Management Fabric must, at a minimum, interact and will frequently involve shared responsibilities.

4.2 FUNCTIONAL COMPONENT VIEW

The functional component view of the reference architecture should define and describe the functional components (e.g., software, hardware, people, organizations) that perform the various activities outlined in the activities view. Activities and functional components need not map one-to-one. In fact, many functional components may be required to execute a single activity, and multiple activities may be performed by a single functional component. Users of this model are encouraged to maintain a mapping of activities to functional components to support verification that all activities can be performed by some component and that only components that are necessary are included within the architecture.
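The recommended mapping check can be sketched in a few lines of Python. The activity and component names below are invented for illustration, and the function name `check_mapping` is an assumption. The check reports activities that no component performs and components that no activity needs.

```python
def check_mapping(activities, components, mapping):
    """Verify an activity-to-component mapping.

    mapping: dict of activity -> set of components that perform it.
    Returns (uncovered activities, unused components, unknown components).
    """
    uncovered = {a for a in activities if not mapping.get(a)}
    used = set().union(*mapping.values())
    unused = set(components) - used
    unknown = used - set(components)
    return uncovered, unused, unknown


# Hypothetical architecture description
activities = {"collection", "analytics", "visualization", "access"}
components = {"web server", "batch engine", "charting framework",
              "report writer", "legacy gateway"}
mapping = {
    "collection": {"web server"},
    "analytics": {"batch engine"},
    # one activity performed by several components
    "visualization": {"charting framework", "report writer"},
    "access": set(),   # not yet assigned to any component
}

uncovered, unused, unknown = check_mapping(activities, components, mapping)
```

Here `uncovered` flags the access activity and `unused` flags the legacy gateway. Assigning access to the existing web server would close the first gap and show one component serving two activities.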

Figure 9 below shows classes of functional components common to the various roles, sub-roles, and fabrics of the NBDRA. These classes are described in the following paragraphs.

Figure 9: Common Classes of Functional Components

4.2.1 SYSTEM ORCHESTRATOR

The classes of functional components for the System Orchestrator revolve around the policies and processes that govern the operation of the Big Data system. These policies and processes define the requirements for how other functional components must behave and interact. Often the policies and processes are derived from community best practices or standards such as International Organization for Standardization (ISO) 20000 for IT Services Management or ISO 27000 for Information Technology Security. Other classes of processes and policies may include ones for data sharing, external system access, and how privacy-sensitive data is to be handled.

4.2.2 BIG DATA APPLICATION PROVIDER

The functional components within the Big Data Application Provider implement the specific functionality of the Big Data system. The classes for components within a Big Data application include:

• Work Flows: These components would control how data and/or users go through the functions of the system. These are often implemented within frameworks or enterprise service bus components that would also be included here.

• Transformations: These components are responsible for reformatting data to meet the needs of the algorithms or visualizations. The transformations may also invoke algorithms to support the transformation. These may be embedded in other components, such as ETL tools.


• Visualizations: The visualization components are responsible for formatting data to present to an end user. These visualizations may be textual or graphic and are frequently implemented with other framework or tool functional components. For example, textual visualizations may be implemented using report writer components while a graphic visualization of the output of a clustering algorithm may be implemented by a charting framework component.

• Access Services: These components provide access to the Big Data system to the Data Consumers and may be designed for use by humans or other systems. Frequently, these specific components are implemented within other frameworks or components such as web services containers.

• Algorithms: This class of components is the heart of the application functionality. They can range from simple summarization and aggregation algorithms to more complex statistical analysis such as clustering, or graph traversal/analysis algorithms.

Algorithms themselves can be classified into general classes which may be defined as functional components. In 2004, a list of algorithms for simulation in the physical sciences was developed that became known as the Seven Dwarfs [16]. The original list of seven dwarfs was modified in 2006 and extended to 13 algorithms (Table 2) based on the following definition: “A dwarf is an algorithmic method that captures a pattern of computation and communication.”

Table 2: 13 Dwarfs (Algorithms for Simulation in the Physical Sciences)

Dense Linear Algebra*       Combinational Logic
Sparse Linear Algebra*      Graph Traversal
Spectral Methods            Dynamic Programming
N-Body Methods              Backtrack and Branch-and-Bound
Structured Grids*           Graphical Models
Unstructured Grids*         Finite State Machines
MapReduce

Notes:

* Indicates one of the original seven dwarfs. The following modifications to the original list of seven algorithms were made in 2006: Fast Fourier Transform, Particles, and Monte Carlo were removed. MapReduce was added.

Many other algorithms or processing models have been defined over the years. MapReduce and Bulk Synchronous Parallel (BSP) are perhaps the two best-known models in the Big Data space today. These are described in the following subsections.

4.2.2.1 MapReduce

Several major Internet search providers popularized the MapReduce model as they worked to implement their search capabilities. In general, MapReduce programs follow five basic stages:

1. Input preparation and assignment to mappers;

2. Map a set of keys and values to new keys and values: Map(k1,v1) → list(k2,v2);

3. Shuffle data to the reducers, each of which is assigned a set of keys (k2) and sorts its input;

4. Run the reduce on the list(v2) associated with each key and produce an output: Reduce(k2, list(v2)) → list(v3); and

5. Final output: the list(v3) outputs from each reducer are combined and sorted by k2.
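The five stages can be traced in a single-process Python sketch that counts words. This is an illustration of the data flow only, not a distributed implementation. The function names, the round-robin input split, and the hash-based assignment of keys to reducers are assumptions.

```python
from collections import defaultdict


def map_fn(_key, line):
    # Map(k1, v1) -> list(k2, v2): emit (word, 1) for each word
    return [(word.lower(), 1) for word in line.split()]


def reduce_fn(_key, values):
    # Reduce(k2, list(v2)) -> list(v3): sum the counts for one word
    return [sum(values)]


def map_reduce(records, map_fn, reduce_fn, num_mappers=2, num_reducers=2):
    # Stage 1: prepare input and assign records to mappers (round-robin)
    splits = [records[i::num_mappers] for i in range(num_mappers)]

    # Stage 2: each mapper turns (k1, v1) pairs into (k2, v2) pairs
    mapped = []
    for split in splits:
        for k1, v1 in split:
            mapped.extend(map_fn(k1, v1))

    # Stage 3: shuffle; each reducer owns a set of keys (k2)
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for k2, v2 in mapped:
        partitions[hash(k2) % num_reducers][k2].append(v2)

    # Stage 4: each reducer sorts its keys and reduces each list(v2)
    outputs = []
    for part in partitions:
        for k2 in sorted(part):
            outputs.append((k2, reduce_fn(k2, part[k2])))

    # Stage 5: combine the reducer outputs and sort by k2
    return sorted(outputs)


records = [(0, "the quick brown fox"), (1, "the lazy dog"), (2, "The fox")]
counts = map_reduce(records, map_fn, reduce_fn)
```

Because the mapper and reducer functions keep no state between calls, each stage could be spread across machines without changing the result, which is the property the model relies on.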

Patterson, David; Yelick, Katherine. Dwarf Mind. A View from Berkeley.

https://www2.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.pdf


While there is a single output, nothing in the model prohibits multiple input datasets. It is extremely common for complex analytics to be built as workflows of multiple MapReduce jobs. While the MapReduce programming model is best suited to aggregation-type analytics (e.g., sum, average, group-by), a wide variety of analytic algorithms have been implemented within processing frameworks. MapReduce does not generally perform well with applications or algorithms that need to directly update the underlying data. For example, updating the values for a single key would require that the entire dataset be read, output, and then moved or copied over the original dataset. Because the mappers and reducers are stateless in nature, applications that require iterative computation on parts of the data or repeated access to parts of the dataset do not tend to scale or perform well under MapReduce.

Its shared-nothing approach makes MapReduce well suited to Big Data applications, and it has become popular enough that a number of large data storage solutions (mostly those of the NoSQL variety) provide implementations within their architecture. One major criticism of MapReduce early on was that the interfaces to most implementations were at too low a level (written in Java or JavaScript). However, many of the more prevalent implementations now support high-level procedural and declarative language interfaces, and even visual programming environments are beginning to appear.

4.2.2.2 Bulk Synchronous Parallel

The BSP programming model, originally developed by Leslie Valiant [17], combines parallel processing with the ability of processing modules to send messages to other processing modules and explicit synchronization of the steps. A BSP algorithm is composed of what are termed supersteps, which comprise the following three distinct elements:

• Bulk Parallel Computation: Each processor performs the calculation/analysis on its local chunk of data.

• Message Passing: As each processor performs its calculations, it may generate messages to other processors. These messages are frequently updates to values associated with the local data of other processors but may also result in the creation of additional data.

• Synchronization: Once a processor has completed processing its local data, it pauses until all other processors have also completed their processing.

This cycle can be terminated by all the processors voting to stop. A processor will generally vote to stop when it has generated no messages to other processors (e.g., no updates). All processors voting to stop, in turn, indicates that there are no new updates to any of the processors’ data and the computation is complete. Alternatively, the cycle may be terminated after a fixed number of supersteps have been completed (e.g., after a certain number of iterations of a Monte Carlo simulation).
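The superstep cycle and both termination rules can be sketched with a small single-process simulation. The example task, spreading the largest value through a graph, is a common teaching illustration and is not taken from this document. Each vertex stands in for a processor, and the names `bsp_max`, `inbox`, and `outbox` are assumptions.

```python
def bsp_max(graph, values, max_supersteps=50):
    """Spread the maximum value to every vertex of a connected graph.

    graph: vertex -> list of neighbors; values: vertex -> initial number.
    Each vertex plays the role of a processor holding one local value.
    """
    values = dict(values)
    # Before the first superstep, every vertex announces its value.
    inbox = {v: [] for v in graph}
    for v, neighbors in graph.items():
        for n in neighbors:
            inbox[n].append(values[v])

    supersteps = 0
    while supersteps < max_supersteps:       # fixed-count termination
        supersteps += 1
        outbox = {v: [] for v in graph}
        for v in graph:
            # Bulk parallel computation on local data only
            best = max(inbox[v], default=values[v])
            if best > values[v]:
                values[v] = best
                # Message passing: tell neighbors about the update
                for n in graph[v]:
                    outbox[n].append(best)
            # A vertex that sends nothing is voting to stop.
        # Synchronization barrier: messages become visible only now
        inbox = outbox
        if not any(inbox.values()):          # every processor voted to stop
            break
    return values, supersteps


graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
initial = {"a": 3, "b": 6, "c": 2, "d": 1}
final_values, steps = bsp_max(graph, initial)
```

Within a superstep each vertex reads only its own value and the messages delivered at the previous barrier, so the loop over vertices could run in parallel without changing the outcome.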

The advantage of BSP over MapReduce is that processing can actually create updates to the data being processed. It is this distinction that has made BSP popular for graph processing and simulations where computations on one node/element of data directly affect values or connections with other nodes/elements. The disadvantage of BSP is the high cost of the synchronization barrier between supersteps. Should the distribution of data or processing between processors become highly unbalanced, then some processors may become overloaded while others remain idle.

High-performance interconnect technologies help to reduce the cost of this synchronization through faster data exchange between nodes, and they can allow data to be redistributed during a superstep to correct skew in the processing requirements. Even so, the time to complete any given superstep is bounded below by the slowest processing unit. Essentially, if the data is skewed such that the processing of a given data element (say, traversal of the graph from that element) is especially long-running, the next superstep cannot begin until that node's processing completes.


Numerous extensions and enhancements to the basic BSP model have been developed and implemented over the years, many of which are designed to address the balancing and cost of synchronization problems.

4.2.3 BIG DATA FRAMEWORK PROVIDER

The Big Data Framework Provider provides the infrastructure required to support the Big Data Application Provider. Components within the Big Data Framework Provider fall within three overall sub-roles (i.e., processing, platforms, infrastructures) along with some specific crosscutting roles, which support the communication and integration of components within the overall provider.

4.2.3.1 Infrastructure Frameworks

The Infrastructure Frameworks sub-role of the Big Data Framework Provider provides all of the resources necessary to host/run the activities of the other roles of the Big Data system. Typically, these resources consist of some combination of physical resources, which may host/support similar virtual resources. These resources are generally classified as follows:

• Networking: These are the resources that transfer data from one infrastructure framework component to another.

• Computing: These are the physical processors and memory that execute and hold the software of the other Big Data system components.

• Storage: These are resources which provide persistence of the data in a Big Data system.

• Physical Plant: These are the environmental resources (e.g., power, cooling, security) that must be accounted for when establishing an instance of a Big Data system.

While the Big Data Framework Provider component may be deployed directly on physical resources or on virtual resources, at some level all resources have a physical representation. Physical resources are frequently used to deploy multiple components that will be duplicated across a large number of physical nodes to provide what is known as horizontal scalability.

The following subsections describe the types of physical and virtual resources that compose Big Data infrastructure.

4.2.3.1.1 Hypervisors

Virtualization is frequently used to achieve elasticity and flexibility in the allocation of physical resources and is often referred to as infrastructure as a service (IaaS) within the cloud computing community. Virtualization is implemented via hypervisors that are typically found in one of three basic forms within a Big Data architecture:

• Native: In this form, a hypervisor runs natively on the bare metal and manages multiple virtual machines consisting of operating systems (OS) and applications.

• Hosted: In this form, an OS runs natively on the bare metal and a hypervisor runs on top of that to host a client OS and applications. This model is not often seen in Big Data architectures due to the increased overhead of the extra OS layer.

• Containerized: In this form, hypervisor functions are embedded in the OS, which runs on bare metal. Applications are run inside containers, which control or limit access to the OS and physical machine resources. This approach has gained popularity for Big Data architectures because it further reduces overhead, since most OS functions are a single shared resource. However, it may not be considered as secure or stable, because if the container controls/limits fail, one application may take down every application sharing those physical resources.


4.2.3.1.2 Physical and Virtual Networks

The connectivity of the architecture infrastructure should be addressed, as it affects the velocity characteristic of Big Data. While some Big Data implementations may solely deal with data that is already resident in the data center and does not need to leave the confines of the local network, others may need to plan and account for the movement of Big Data either into or out of the data center. The location of Big Data systems with transfer requirements may depend on the availability of external network connectivity (i.e., bandwidth) and on the limitations of Transmission Control Protocol (TCP), which favor locations with low latency (as measured by packet round-trip time) to the primary senders or receivers of Big Data. To address the limitations of TCP, architects for Big Data systems may need to consider some of the advanced non-TCP-based communications protocols that are specifically designed to transfer large files such as video and imagery.
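The effect of round-trip time on TCP can be made concrete with a short calculation. A single TCP connection can have at most one window of data in flight per round trip, so its throughput is bounded by the window size divided by the round-trip time. The sketch below is a simplification that ignores packet loss and congestion control, and the function names and example figures are assumptions.

```python
def tcp_max_throughput_mbps(window_bytes: int, rtt_ms: float) -> float:
    """Upper bound for one TCP connection: one window per round trip."""
    return (window_bytes * 8) / (rtt_ms / 1000.0) / 1e6


def window_needed_bytes(link_mbps: float, rtt_ms: float) -> float:
    """Bandwidth-delay product: window required to keep the link full."""
    return link_mbps * 1e6 * (rtt_ms / 1000.0) / 8


# 65,535 bytes is the largest window without the TCP window scale option.
local = tcp_max_throughput_mbps(65535, rtt_ms=1)      # same data center
remote = tcp_max_throughput_mbps(65535, rtt_ms=100)   # distant sender
needed = window_needed_bytes(10_000, rtt_ms=100)      # 10 Gbit/s link
```

With these figures the bound falls from roughly 524 Mbit/s at 1 ms to roughly 5 Mbit/s at 100 ms, and filling a 10 Gbit/s link at 100 ms would need a window of about 125 MB. This is why placement near the primary senders or receivers, or a transfer protocol designed for long paths, can matter.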

Overall availability of the external links is another infrastructure aspect relating to the velocity characteristic of Big Data that should be considered in architecting external connectivity. A given connectivity link may be able to easily handle the velocity of data while operating correctly. However, should the quality of service on the link degrade or the link fail completely, data may be lost or may back up to the point that the system can never catch up. Use cases exist where the contingency planning for network outages involves transferring data to physical media and physically transporting it to the desired destination. However, even this approach is limited by the time required to transfer the data to external media for transport.
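The backlog and physical-transport trade-offs described above can be estimated with simple arithmetic. The following Python sketch is a rough planning aid under stated assumptions: constant arrival and link rates, and a fixed transport time for physical media. All names and parameters are hypothetical.

```python
import math


def backlog_drain_seconds(backlog_bytes: float, arrival_bps: float, link_bps: float) -> float:
    """Time to clear a backlog while new data keeps arriving.

    Only the link capacity left over after carrying new arrivals can be
    used to drain the backlog. If nothing is left over, it never drains.
    """
    spare_bps = link_bps - arrival_bps
    if spare_bps <= 0:
        return math.inf
    return backlog_bytes * 8 / spare_bps


def media_transfer_seconds(data_bytes: float, media_write_bytes_per_s: float,
                           transport_seconds: float) -> float:
    """Time to copy data onto physical media plus the time to ship it."""
    return data_bytes / media_write_bytes_per_s + transport_seconds
```

For example, `backlog_drain_seconds(1e9, 8e8, 1e9)` gives 40 seconds, while a link whose capacity only matches the arrival rate returns infinity, which corresponds to the case where the backlog can never be recovered.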

The volume and velocity characteristics of Big Data are often driving factors in the implementation of the internal network infrastructure as well. For example, if the implementation requires frequent transfers of large multi-gigabyte files between cluster nodes, then high-speed, low-latency links are required to maintain connectivity to all nodes in the network. Provisions for dynamic quality of service (QoS) and service priority may be necessary to allow failed or disconnected nodes to re-synchronize once connectivity is restored. Depending on the availability requirements, redundant and fault-tolerant links may be required. Other aspects of the network infrastructure include name resolution (e.g., Domain Name System [DNS]) and encryption, along with firewalls and other perimeter access control capabilities. Finally, the network infrastructure may also include automated deployment and provisioning capabilities or agents, as well as infrastructure-wide monitoring agents, that are leveraged by the management/communication elements to implement a specific model.

Security of the networks is another aspect that must be addressed, depending on the sensitivity of the data being processed. Encryption may be needed between the network and external systems to avoid man-in-the-middle interception and compromise of the data. In cases where the network infrastructure within the data center is shared, encryption of the local network should also be considered. Finally, in conjunction with the security and privacy fabric, auditing and intrusion detection capabilities need to be addressed.

Two concepts, software defined networks (SDNs) and Network Function Virtualization (NFV), have recently been developed in support of scalable networks and the scalable systems that use them.

4.2.3.1.2.1 Software Defined Networks

Frequently ignored, but critical to the performance of distributed systems and frameworks, and especially critical to Big Data implementations, is the efficient and effective management of networking resources. Significant advances in network resource management have been realized through what is known as SDN. Much like virtualization frameworks manage shared pools of CPU/memory/disk, SDNs (or virtual networks) manage pools of physical network resources. In contrast to the traditional approaches of dedicated physical network links for data, management, I/O, and control, SDNs contain multiple physical resources (including links and actual switching fabric) that are pooled and allocated as required to specific functions and sometimes to specific applications. This allocation can consist of raw bandwidth, quality of service priority, and even actual data routes.


4.2.3.1.2.2 Network Function Virtualization

With the advent of virtualization, virtual appliances can now reasonably support a large number of network functions that were traditionally performed by dedicated devices. Network functions that can be implemented in this manner include routing/routers, perimeter defense (e.g., firewalls), remote access authorization, and network traffic/load monitoring. Some key advantages of NFV include elasticity, fault tolerance, and resource management. For example, the ability to automatically deploy/provision additional firewalls in response to a surge in user or data connections and then un-deploy them when the surge is over can be critical in handling the volumes associated with Big Data.
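The elasticity example above can be sketched as a simple scaling decision. The Python snippet below is only an illustration: it assumes each virtual firewall handles a fixed number of connections, and the function names and thresholds are hypothetical rather than part of any NFV product or standard.

```python
import math


def firewalls_needed(active_connections: int, connections_per_firewall: int,
                     min_instances: int = 1) -> int:
    """Number of virtual firewall instances required for the current load."""
    required = math.ceil(active_connections / connections_per_firewall)
    return max(min_instances, required)


def scaling_action(current_instances: int, active_connections: int,
                   connections_per_firewall: int) -> int:
    """Positive result: instances to deploy. Negative: instances to un-deploy."""
    target = firewalls_needed(active_connections, connections_per_firewall)
    return target - current_instances
```

With an assumed capacity of 10,000 connections per instance, a surge to 25,000 connections calls for three instances, and `scaling_action(3, 4000, 10000)` returns -2 once the surge has passed.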

4.2.3.1.3 Physical and Virtual Computing

The logical distribution of cluster/computing infrastructure may vary from a tightly coupled high performance computing (HPC) cluster, to a dense grid of physical commodity machines in a rack, to a set of virtual machines running on a cloud service provider (CSP), or to a loosely coupled set of machines distributed around the globe providing access to unused computing resources. Computing infrastructure also frequently includes the underlying OSs and associated services used to interconnect the cluster resources via the networking elements. Computing resources may also include computation accelerators, such as graphics processing units (GPUs) and field-programmable gate arrays (FPGAs), which can provide dynamically programmed, massively parallel computing capabilities to individual nodes in the infrastructure.

4.2.3.1.4 Storage

The storage infrastructure may include any resource from isolated local disks to storage area networks (SANs) or network-attached storage (NAS).

Two aspects of storage infrastructure technology that directly influence its suitability for Big Data solutions are capacity and transfer bandwidth. Capacity refers to the ability to handle the data volume. Local disks/file systems are specifically limited by the size of the available media. Hardware or software redundant array of independent disks (RAID) solutions (in this case local to a processing node) help with scaling by allowing multiple pieces of media to be treated as a single device. However, this approach is limited by the physical dimension of the media and the number of devices the node can accept. SAN and NAS implementations, often known as shared disk solutions, remove that limit by consolidating storage into a storage-specific device. When storage is consolidated, the second aspect, transfer bandwidth, may become an issue. While both network and I/O interfaces are getting faster and many implementations support multiple transfer channels, I/O bandwidth can still be a limiting factor. In addition, despite the redundancies provided by RAID, hot spares, multiple power supplies, and multiple controllers, these devices can often become I/O bottlenecks or single points of failure in an enterprise. Many Big Data implementations address these issues by using distributed file systems within the platform framework.

4.2.3.1.5 Physical Plant

Environmental resources, such as power and heating, ventilation, and air conditioning provided by physical plant components, are critical to the Big Data Framework Provider. While environmental resources are critical to the operation of the Big Data system, they are not within the technical boundaries and are, therefore, not depicted in Figure 3, the NBDRA conceptual model.

Adequately sized infrastructure to support application requirements is critical to the success of Big Data implementations. The infrastructure architecture operational requirements range from basic power and cooling to external bandwidth connectivity (as discussed above). A key evolution that has been driven by Big Data is the increase in server density (i.e., more CPU/memory/disk per rack unit). However, with this increased density, infrastructure (specifically power and cooling) may not be distributed within the data center in a way that provides sufficient power to each rack or adequate air flow to remove excess heat. In addition, with the high cost of managing energy consumption within data centers, technologies have been developed that actually power down or idle resources not in use to save energy or to reduce consumption during peak periods.

Also important within this element is the physical security of the facilities and auxiliary infrastructure (e.g., power substations). Specifically, perimeter security, including credential verification (e.g., badge/biometrics), surveillance, and perimeter alarms, is necessary to maintain control of the data being processed.

4.2.3.2 Data Platform Frameworks

Data Platform Frameworks provide for the logical data organization and distribution combined with the associated access application programming interfaces (APIs) or methods. The frameworks may also include data registry and metadata services along with semantic data descriptions such as formal ontologies or taxonomies. The logical data organization may range from simple delimited flat files to fully distributed relational or columnar data stores. The storage media range from high-latency robotic tape drives, to spinning magnetic media, to flash/solid state disks, or to random access memory. Accordingly, the access methods may range from file access APIs to query languages such as Structured Query Language (SQL). Typical Big Data framework implementations would support either basic file system style storage or in-memory storage and one or more indexed storage approaches. Based on the specific Big Data system considerations, this logical organization may or may not be distributed across a cluster of computing resources.

In most aspects, the logical data organization and distribution in Big Data storage frameworks mirror the common approach for most legacy systems. Figure 10 presents a brief overview of data organization approaches for Big Data.

Figure 10: Data Organization Approaches

Many Big Data logical storage organizations use the common file system concept, in which chunks of data are organized into a hierarchical namespace of directories, as their base and then implement various indexing methods within the individual files. This allows many of these approaches to be run either on simple local storage file systems for testing purposes or on fully distributed file systems for scale.

4.2.3.2.1 In-memory

The infrastructure illustrated in the NBDRA (Figure 3) indicates that physical resources are required to support analytics. However, such infrastructure will vary (i.e., will be optimized) for the Big Data characteristics of the problem under study. Large, but static, historical datasets with no urgent analysis time constraints would optimize the infrastructure for the volume characteristic of Big Data, while time-critical analyses such as intrusion detection or social media trend analysis would optimize the infrastructure for the velocity characteristic of Big Data. Velocity implies the necessity for extremely fast analysis and the infrastructure to support it, namely very low latency, in-memory analytics.

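[Figure 10 diagram labels: Logical Data Organization comprises In-memory, File Systems, and Indexed. File Systems divides into File System Organization (Centralized, Distributed) and Data Organization (Delimited, Fixed Length, Binary). Indexed divides into Relational, Key-Value, Columnar, Document, and Graph.]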


In-memory storage technologies, many of which were developed to support the scientific HPC domain, are increasingly used due to the significant reduction in memory prices and the increased scalability of modern servers and OSs. Yet an in-memory element of a velocity-oriented infrastructure will require more than simply massive random-access memory (RAM). It will also require optimized data structures and memory access algorithms to fully exploit RAM performance. Current in-memory database offerings are beginning to address this issue. Shared memory solutions common to HPC environments are often applied to address inter-nodal communications and synchronization requirements.

Traditional database management architectures are designed to use spinning disks as the primary storage mechanism, with the main memory of the computing environment relegated to providing caching of data and indexes. Many in-memory storage mechanisms, by contrast, have their roots in the massively parallel processing and supercomputer environments popular in the scientific community.

These approaches should not be confused with solid state (e.g., flash) disks or tiered storage systems that implement memory-based storage, which simply replicate the disk-style interfaces and data structures but with a faster storage medium. Actual in-memory storage systems typically eschew the overhead of file system semantics and optimize the data storage structure to minimize memory footprint and maximize data access rates. These in-memory systems may implement general-purpose relational and NoSQL (not only SQL, or no SQL) style organization and interfaces, or they may be completely optimized to a specific problem and data structure.

Like traditional disk-based systems for Big Data, these implementations frequently support horizontal distribution of data and processing across multiple independent nodes, although shared memory technologies are still prevalent in specialized implementations. Unlike traditional disk-based approaches, in-memory solutions and the supported applications must account for the lack of persistence of the data across system failures. Some implementations leverage a hybrid approach involving write-through to more persistent storage to help alleviate the issue.
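The write-through idea mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not a description of any particular product: it assumes JSON-serializable values and a single append-only log file, and the class name `WriteThroughStore` is hypothetical.

```python
import json
import os
import tempfile


class WriteThroughStore:
    """In-memory key-value store that writes each update to a log file first."""

    def __init__(self, log_path: str):
        self._log_path = log_path
        self._data = {}
        # Recover state after a restart by replaying the log in order.
        if os.path.exists(log_path):
            with open(log_path, "r", encoding="utf-8") as log:
                for line in log:
                    record = json.loads(line)
                    self._data[record["key"]] = record["value"]

    def put(self, key, value):
        # Write through to persistent storage before updating memory.
        with open(self._log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self._data[key] = value

    def get(self, key, default=None):
        # Reads are served from memory only.
        return self._data.get(key, default)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "store.log")
    store = WriteThroughStore(path)
    store.put("sensor-1", 42)
    print(WriteThroughStore(path).get("sensor-1"))  # state rebuilt from the log
```

The trade-off in this sketch is that every write pays the cost of a synchronous disk flush, while reads keep in-memory speed.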

The advantages of in-memory approaches include faster processing of intensive analysis and reporting workloads. In-memory systems are especially good for analysis of real-time data, such as that needed for some complex event processing (CEP) of streams. For reporting workloads, performance can often improve by a factor of several hundred, especially for sparse matrix and simulation type analytics.

4.2.3.2.2 File Systems

Many Big Data processing frameworks and applications access their data directly from underlying file systems. In almost all cases, the file systems implement some level of the Portable Operating System Interface (POSIX) standards for permissions and the associated file operations. This allows other higher-level frameworks for indexing or processing to operate with relative transparency as to whether the underlying file system is local or fully distributed. File-based approaches consist of two layers: the file system organization and the data organization within the files.

4.2.3.2.2.1 File System Organization

File systems tend to be either centralized or distributed. Centralized file systems are basically implementations of local file systems that are placed on a single large storage platform (e.g., SAN or NAS) and accessed via some network capability. In a virtual environment, multiple physical centralized file systems may be combined, split, or allocated to create multiple logical file systems.

Distributed file systems (also known as cluster file systems) seek to overcome the throughput issues presented by the volume and velocity characteristics of Big Data by combining I/O throughput across multiple devices (spindles) on each node, with redundancy and failover provided by mirroring or replicating data at the block level across multiple nodes. Many of these implementations were developed in support of HPC solutions requiring high throughput and scalability. Performance in many HPC implementations is often achieved through dedicated storage nodes using proprietary storage formats and layouts. The data replication is specifically designed to allow the use of heterogeneous commodity hardware across the Big Data cluster. Thus, if a single drive or an entire node should fail, no data is lost because it is replicated on other nodes, and throughput is only minimally affected because that processing can be moved to the other nodes. In addition, replication allows for high levels of concurrency for reading data and for initial writes. Updates and transaction-style changes tend to be an issue for many distributed file systems because latency in creating replicated blocks will create consistency issues (e.g., a block is changed but another node reads the old data before it is replicated). Several file system implementations also support data compression and encryption at various levels. One major caveat is that, for distributed block-based file systems, the compression/encryption must be splittable, so that any given block can be decompressed/decrypted out of sequence and without access to the other blocks.
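The requirement that blocks be independently decompressible can be illustrated with a short Python sketch. It uses the standard `zlib` module and compresses each fixed-size block on its own. The block size and function names are assumptions chosen for illustration, not the format of any specific file system.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # assumed block size for this illustration


def compress_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Compress each block independently so any block can be read alone."""
    return [
        zlib.compress(data[offset:offset + block_size])
        for offset in range(0, len(data), block_size)
    ]


def read_block(blocks: list, index: int) -> bytes:
    """Decompress one block without touching any of the others."""
    return zlib.decompress(blocks[index])


if __name__ == "__main__":
    payload = bytes(range(256)) * 1000
    stored = compress_blocks(payload)
    # Blocks can be read in any order, for example by different nodes.
    third = read_block(stored, 2)
    print(len(stored), len(third))
```

Compressing the whole file as a single stream would typically give a somewhat better ratio, but a reader would then need the preceding data to decode a block in the middle, which defeats parallel access.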

Distributed object stores (also known as global object stores) are a unique example of distributed file system organization. Unlike the approaches described above, which implement a traditional hierarchical file system namespace, distributed object stores present a flat namespace with a globally unique identifier (GUID) for any given chunk of data. Generally, data in the store is located through a query against a metadata catalog that returns the associated GUIDs. The GUID generally provides the underlying software implementation with the storage location of the data of interest. These object stores are developed and marketed for storage of very large data objects, from complete datasets to large individual objects (e.g., high-resolution images in the tens of gigabytes [GBs] size range). The biggest limitation of these stores for Big Data tends to be network throughput (i.e., speed) because many require the object to be accessed in its entirety. However, future trends point to the concept of being able to send the computation/application to the data versus needing to bring the data to the application.
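The flat namespace and metadata catalog described above can be sketched as a toy in-process model. The Python class below is an illustration only: the `ObjectStore` name, its methods, and the metadata fields are hypothetical, and a real store would distribute both the objects and the catalog across many nodes.

```python
import uuid


class ObjectStore:
    """Toy object store: flat GUID namespace plus a queryable metadata catalog."""

    def __init__(self):
        self._objects = {}  # GUID -> object bytes (no directory hierarchy)
        self._catalog = {}  # GUID -> metadata dictionary

    def put(self, data: bytes, **metadata) -> str:
        guid = str(uuid.uuid4())
        self._objects[guid] = data
        self._catalog[guid] = dict(metadata)
        return guid

    def find(self, **criteria) -> list:
        """Query the catalog and return the GUIDs whose metadata matches."""
        return [
            guid
            for guid, meta in self._catalog.items()
            if all(meta.get(key) == value for key, value in criteria.items())
        ]

    def get(self, guid: str) -> bytes:
        """Retrieve a whole object by GUID."""
        return self._objects[guid]
```

A client would typically call `find(kind="image")` to obtain GUIDs and then `get(guid)` for each one, which mirrors the two-step lookup in the text: catalog query first, whole-object retrieval second.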

From a maturity perspective, two key areas where distributed file systems are likely to improve are (1) random write I/O performance and consistency, and (2) the generation of de facto standards at a level similar to or greater than that of the Internet Engineering Task Force Requests for Comments document series, such as those currently available for the network file system (NFS) protocol. Distributed object stores, while currently available and operational from several commercial providers and part of the roadmap for large organizations such as the National Geospatial-Intelligence Agency (NGA), are currently essentially proprietary implementations. For distributed object stores to become prevalent within Big Data ecosystems, there should be some level of interoperability available (i.e., through standardized APIs); standards-based approaches for data discovery; and, most importantly, standards-based approaches that allow the application to be transferred over the grid and run locally to the data versus transferring the data to the application.

4.2.3.2.2.2 In File Data Organization

Very little is different for in-file data organization in Big Data. File-based data can be text, binary data, fixed-length records, or some sort of delimited structure (e.g., comma-separated values [CSV], Extensible Markup Language [XML]). Record-oriented storage (either delimited or fixed length) generally is not an issue for Big Data unless individual records can exceed a block size. Some distributed file system implementations provide compression at the volume or directory level and implement it below the logical block level (e.g., when a block is read from the file system, it is decompressed/decrypted before being returned). Because of their simplicity, familiarity, and portability, delimited files are frequently the default storage format in many Big Data implementations. The trade-off is I/O efficiency (i.e., speed). While individual blocks in a distributed file system might be accessed in parallel, each block still needs to be read in sequence. In the case of a delimited file with perhaps hundreds of fields, if only the last field of certain records is of interest, a lot of I/O and processing bandwidth is wasted.

Binary formats tend to be application or implementation specific. While they can offer much more efficient access due to smaller data sizes (i.e., integers are two to four bytes in binary while they are one byte per digit in ASCII [American Standard Code for Information Interchange]), they offer limited portability between different implementations. At least one popular distributed file system provides its own standard binary format, which allows data to be portable between multiple applications without additional software. However, the bulk of the indexed data organization approaches discussed below leverage binary formats for efficiency.
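The size and access differences between delimited text and binary records can be seen in a small Python sketch using the standard `struct` module. The record layout (little-endian 32-bit signed integers) and the function names are assumptions made for illustration.

```python
import struct


def encode_text(values) -> bytes:
    """Comma-delimited ASCII: one byte per digit plus delimiters."""
    return ",".join(str(v) for v in values).encode("ascii")


def encode_binary(values) -> bytes:
    """Fixed-width binary: four bytes per integer, regardless of magnitude."""
    return struct.pack(f"<{len(values)}i", *values)


def read_text_field(buffer: bytes, index: int) -> int:
    """Delimited text must be scanned and split to find a field."""
    return int(buffer.split(b",")[index])


def read_binary_field(buffer: bytes, index: int) -> int:
    """Fixed-width binary allows jumping straight to a field's offset."""
    return struct.unpack_from("<i", buffer, index * 4)[0]


if __name__ == "__main__":
    record = [7, 1234, 2_000_000_000]
    print(len(encode_text(record)), len(encode_binary(record)))
```

Small values such as 7 are actually shorter in text, so the space advantage of binary depends on the data. The more consistent benefit in this sketch is that a field can be located by offset without parsing the fields before it.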

4.2.3.2.3 Indexed Storage Organization

The very nature of Big Data (primarily the volume and velocity characteristics) practically drives requirements to some form of indexing structure. Big Data volume requires that specific data elements be located quickly without scanning across the entire dataset. Big Data velocity also requires that data can be located quickly either for matching (e.g., incoming data matches something in an existing dataset) or to know where to write/update new data.
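As a minimal illustration of why an index avoids full scans, the Python sketch below builds a key-to-offset index over a delimited file and then seeks directly to a record. It assumes newline-terminated, comma-separated records with unique keys in the first field; the function names are hypothetical.

```python
import io


def build_offset_index(stream, key_field: int = 0, delimiter: bytes = b",") -> dict:
    """Scan the file once and remember the byte offset of each record's key."""
    index = {}
    stream.seek(0)
    while True:
        offset = stream.tell()
        line = stream.readline()
        if not line:
            break
        key = line.rstrip(b"\n").split(delimiter)[key_field]
        index[key] = offset
    return index


def lookup(stream, index: dict, key: bytes):
    """Jump straight to a record instead of scanning the whole file."""
    offset = index.get(key)
    if offset is None:
        return None
    stream.seek(offset)
    return stream.readline().rstrip(b"\n")


if __name__ == "__main__":
    data = io.BytesIO(b"k1,alpha\nk2,beta\nk3,gamma\n")
    offsets = build_offset_index(data)
    print(lookup(data, offsets, b"k3"))
```

The one-time scan that builds the index is the price paid for fast lookups afterward, and the index itself must be kept current as data is written or updated.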

The choice of a particular indexing method or methods depends mostly on the data and the nature of the application to be implemented. For example, graph data (i.e., vertices, edges, and properties) can easily be represented in flat text files as vertex-edge pairs, edge-vertex-vertex triples, or vertex-edge list records. However, processing this data efficiently would potentially require loading the entire dataset into memory or being able to distribute the application and dataset across multiple nodes so that a portion of the graph is in memory on each node. Splitting the graph across nodes requires the nodes to communicate when graph sections have vertices that connect with vertices on other processing nodes. This is perfectly acceptable for some graph applications, such as shortest path, especially when the graph is static. Some graph processing frameworks operate using this exact model. However, this approach is infeasible for large-scale graphs where the graph is dynamic or where searching or matching to a portion of the graph is needed quickly; such cases require a specialized graph storage framework.
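The flat-file graph representation and the load-into-memory approach described above can be sketched as follows. The Python example assumes edge-vertex-vertex triples separated by whitespace and treats edges as undirected. The sample data and function names are illustrative only, and the search is a plain single-node breadth-first search rather than a distributed algorithm.

```python
from collections import defaultdict, deque

# Assumed flat-text format: "<edge label> <vertex> <vertex>" per line.
EDGE_LINES = """\
knows a b
knows b c
knows a d
knows d c
knows c e
"""


def load_graph(text: str) -> dict:
    """Load the whole edge list into an in-memory adjacency map."""
    adjacency = defaultdict(set)
    for line in text.splitlines():
        if not line.strip():
            continue
        _label, source, target = line.split()
        adjacency[source].add(target)
        adjacency[target].add(source)  # undirected, by assumption
    return adjacency


def shortest_path(adjacency: dict, start: str, goal: str):
    """Breadth-first search for a fewest-hops path, or None if unreachable."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in sorted(adjacency[node]):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None
```

This works for a static graph that fits in memory. Every change to the edge file would require reloading the adjacency map, which hints at why dynamic or very large graphs call for purpose-built storage.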

Indexing approaches tend to be classified by the features provided in the implementation, specifically: the complexity of the data structures that can be stored; how well they can process links between data; and how easily they support multiple access patterns, as shown in Figure 11. Since any of these features can be implemented in custom application code, the values portrayed represent approximate norms. For example, key-value stores work well for data that is only accessed through a single key, whose values can be expressed in a single flat structure, and where multiple records do not need to be related. While document stores can support very complex structures of arbitrary width and tend to be indexed for access via multiple document properties, they do not tend to support inter-record relationships well.

The specific implementations for each storage approach vary significantly enough that all of the values for the features represented here are really ranges. For example, relational data storage implementations are supporting increasingly complex data structures, and ongoing work aims to add more flexible access patterns natively in BigTable columnar implementations. Within Big Data, the performance of each of these features tends to drive the scalability of that approach, depending on the problem being solved. For example, if the problem is to locate a single piece of data for a unique key, then key-value stores will scale very well. However, if a problem requires general navigation of the relationships between multiple data records, a graph storage model will likely provide the best performance.


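Figure 11: Data Storage Technologies

[Figure 11 chart labels: Data Storage Technologies by Data Complexity, Linkage, and Access. Axes: Data Linkage Complexity and Data Structure Complexity (scale 0 to 6). Plotted technologies: Key-Value Stores, Relational, Columnar, Document, Graph. Legend: Limited Access Flexibility to Greater Access Flexibility.]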

This section provides an overview of several common Big Data organization approaches, as follows:

• Relational storage platforms,

• Key-value storage platforms,

• Wide columnar storage platforms,

• Document storage platforms, and

• Graph storage platforms.

The reader should keep in mind that new and innovative approaches are emerging regularly, and that some of these approaches are hybrid models that combine features of several indexing techniques (e.g., relational and columnar, or relational and graph).

4.2.3.2.3.1 Relational Storage Platforms

This model is perhaps the most familiar, as the basic concept has existed since the 1970s and SQL is a mature standard for manipulating (search, insert, update, delete) relational data. In the relational model, data is stored as rows, with each field representing a column, organized into tables based on the logical data organization. The problem with relational storage models and Big Data is the join between two or more tables. While the size of two or more tables of data individually might be small, the join (or relational matches) between those tables can generate far more records, up to the product of the table sizes. The appeal of this model for organizations just adopting Big Data is its familiarity. The pitfalls are some of the limitations and, more importantly, the tendency to adopt standard relational database management system (RDBMS) practices (high normalization, detailed and specific indexes) and performance expectations.
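The growth in record count from a join can be demonstrated with Python's built-in `sqlite3` module. The sketch below is a deliberately worst-case illustration: both hypothetical tables (`orders` and `visits`) share a single join key value, so every row in one table matches every row in the other.

```python
import sqlite3


def join_row_count(rows_per_table: int) -> int:
    """Count rows produced by joining two tables that share one key value."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE orders (customer_id INTEGER, item TEXT)")
        conn.execute("CREATE TABLE visits (customer_id INTEGER, page TEXT)")
        conn.executemany(
            "INSERT INTO orders VALUES (?, ?)",
            [(1, f"item{i}") for i in range(rows_per_table)],
        )
        conn.executemany(
            "INSERT INTO visits VALUES (?, ?)",
            [(1, f"page{i}") for i in range(rows_per_table)],
        )
        (count,) = conn.execute(
            "SELECT COUNT(*) FROM orders JOIN visits USING (customer_id)"
        ).fetchone()
        return count
    finally:
        conn.close()


if __name__ == "__main__":
    print(join_row_count(100))  # two 100-row tables
```

Two tables of 100 rows each produce 10,000 joined rows here. Real joins on selective keys produce far fewer, but skewed keys can push intermediate results toward this upper bound.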

Big Data implementations of relational storage models are relatively mature and have been adopted by a number of organizations. They are also maturing very rapidly, with new implementations focusing on improved response time. Many Big Data implementations take a brute-force approach to scaling relational queries. Essentially, queries are broken into stages but, more importantly, processing of the input tables is distributed across multiple nodes (often as a MapReduce job). The actual storage of the data can be flat files (delimited or fixed length) where each record/line in the file represents a row in a table. Increasingly, however, these implementations are adopting binary storage formats optimized for distributed file systems. These formats will often use block-level indexes and column-oriented organization of the data to allow individual fields to be accessed in records without needing to read the entire record. Despite this, most Big Data relational storage models are still batch-oriented systems designed for very complex queries that generate very large intermediate cross-product matrices from joins, so even the simplest query can require tens of seconds to complete. Significant work is under way, and implementations are emerging, that seek to provide a more interactive response and interface.
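The benefit of column-oriented organization mentioned above can be illustrated with plain Python data structures. This is a conceptual sketch only: the sample rows and the `to_columns` helper are hypothetical, and real columnar formats add encoding, compression, and block-level indexes on top of this idea.

```python
# Row-oriented layout: each record keeps all of its fields together.
ROWS = [
    {"id": 1, "region": "east", "amount": 10},
    {"id": 2, "region": "west", "amount": 25},
    {"id": 3, "region": "east", "amount": 5},
]


def to_columns(rows: list) -> dict:
    """Pivot records into one list per field (column-oriented layout)."""
    columns = {name: [] for name in rows[0]}
    for row in rows:
        for name in columns:
            columns[name].append(row[name])
    return columns


if __name__ == "__main__":
    columns = to_columns(ROWS)
    # Summing one field touches a single column, not every whole record.
    print(sum(columns["amount"]))
```

In the row layout, totaling `amount` means visiting every record in full. In the column layout, only the `amount` list is read, which is the access pattern the binary formats described above are designed to support.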

Early implementations provided only limited data types and little or no support for indexes. Most current implementations, however, support complex data structures and basic indexes. Still, while the query planners/optimizers for most modern RDBMSs are very mature and implement cost-based optimization through statistics on the data, the query planners/optimizers in many Big Data implementations remain fairly simple and rule-based in nature. For batch-oriented systems, this is generally acceptable (since the scale of processing the Big Data in general can have an impact that is orders of magnitude greater), but any attempt to provide interactive response will need very advanced optimizations so that (at least for queries) only the data most likely to be returned is actually searched.

This leads to the single most serious drawback with many of these implementations. Since distributed processing and storage are essential for achieving scalability, these implementations are directly limited by the CAP (Consistency, Availability, and Partition Tolerance) theorem. Many in fact provide what is generally referred to as eventual consistency, which means that, barring any further updates to a piece of data, all nodes in the distributed system will eventually return the most recent value. This level of consistency is typically fine for data warehousing applications, where data is infrequently updated and updates are generally done in bulk. However, transaction-oriented databases typically require some level of ACID (atomicity, consistency, isolation, durability) compliance to ensure that all transactions are handled reliably and conflicts are resolved in a consistent manner. There are a number of both industry and open source initiatives looking to bring this type of capability to Big Data relational storage frameworks. One approach is to essentially layer a traditional RDBMS on top of an existing distributed file system implementation. While vendors claim that this approach means that the overall technology is mature, a great deal of research and implementation experience is needed before the complete performance characteristics of these implementations are known.

4.2.3.2.3.2 Key-Value Storage Platforms

Key-value stores are one of the oldest and most mature data indexing models. In fact, the principles of key-value stores underpin all the other storage and indexing models. From a Big Data perspective, these stores effectively represent random access memory models. While the data stored in the values can be arbitrarily complex in structure, all the handling of that complexity must be provided by the application, with the storage implementation often providing back just a pointer to a block of data. Key-value stores also tend to work best for 1-1 relationships (i.e., each key relates to a single value) but can also be effective for keys mapping to lists of homogeneous values. When keys map to multiple values of heterogeneous types/structures, or when values from one key need to be joined against values for a different or the same key, custom application logic is required. It is the requirement for this custom logic that often prevents key-value stores from scaling effectively for certain problems. However, depending on the problem, certain processing architectures can make effective use of distributed key-value stores. Key-value stores generally deal well with updates when the mapping is one-to-one and the size/length of the value data does not change. The ability of key-value stores to handle inserts is generally dependent on the underlying implementation. Key-value stores also generally require significant effort (either manual or computational) to deal with changes to the underlying data structure of the values.

Distributed key-value stores are the most frequent implementation utilized in Big Data applications. One problem that must always be addressed (but is not unique to key-value implementations) is the distribution of keys over the space of possible key values. Specifically, keys must be chosen carefully to avoid skew in the distribution of the data across the cluster. When data is heavily skewed to a small range, it can result in computation hot spots across the cluster if the implementation is attempting to optimize data locality. If the data is dynamic (new keys being added) for such an implementation, then it is likely that at some point the data will require rebalancing across the cluster. Non-locality optimizing implementations employ various sorts of hashing, random, or round-robin approaches to data distribution and do not tend to suffer from skew and hot spots. However, they perform especially poorly on problems requiring aggregation across the dataset.
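The effect of key skew can be illustrated with a minimal Python sketch. It is not taken from any particular key-value store: the four-node cluster, the `sensor-` key prefix, and both placement functions are hypothetical and chosen only to contrast a locality-preserving range placement with a hash placement.

```python
import hashlib
from collections import Counter

NODES = 4  # hypothetical cluster size


def range_partition(key: str, nodes: int = NODES) -> int:
    """Locality-preserving placement: split the a-z range of the first letter."""
    offset = min(max(ord(key[0].lower()) - ord("a"), 0), 25)
    return offset * nodes // 26


def hash_partition(key: str, nodes: int = NODES) -> int:
    """Non-locality placement: hash the whole key, then take a modulus."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % nodes


# Hypothetical keys that all fall into one small range (same prefix).
keys = [f"sensor-{i:05d}" for i in range(10_000)]

range_load = Counter(range_partition(k) for k in keys)
hash_load = Counter(hash_partition(k) for k in keys)

print("range placement:", dict(range_load))  # every key lands on a single node
print("hash placement: ", dict(hash_load))   # keys are spread over the nodes
```

With range placement, neighboring keys stay together (useful for scans and aggregation) but one node becomes a hot spot. With hash placement the load is even, at the cost of scattering related keys across the cluster.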

4.2.3.2.3.3 Wide Columnar Storage Platforms

Much of the hype associated with Big Data came with the publication of the BigTable paper in 2006 [18], but column-oriented storage models like BigTable are not new, even to Big Data, and have been stalwarts of the data warehousing domain for many years. Unlike traditional relational stores, which store data by rows of related values, columnar stores organize data in groups of like values. The difference here is subtle, but in relational databases, an entire group of columns is tied to some primary key (frequently one or more of the columns) to create a record. In columnar stores, the value of every column is a key, and like column values point to the associated rows. The simplest instance of a columnar store is little more than a key-value store with the key and value roles reversed. In many ways, columnar data stores look very similar to indexes in relational databases. Figure 12 below shows the basic differences between row-oriented and column-oriented stores.

Figure 12: Differences Between Row-Oriented and Column-Oriented Stores

In addition, implementations of columnar stores that follow the BigTable model introduce an additional level of segmentation, called the column family, beyond the table, row, and column model of the relational model. In those implementations, rows have a fixed set of column families, but within a column family, each row can have a variable set of columns. This is illustrated in Figure 13 below.

Figure 13: Column Family Segmentation of the Columnar Stores Model

The key distinction in the implementation of columnar stores over relational stores is that data is highly de-normalized for column stores and that, while for relational stores every record contains some value (perhaps NULL) for each column, in a columnar store the column is only present if there is data for one or more rows. This is why many column-oriented stores are referred to as sparse storage models. Data for each column family is physically stored together on disk, sorted by row key, column name, and timestamp. The last (timestamp) is there because the BigTable model also includes the concept of versioning. Every RowKey, Column Family, Column triple is stored with either a system-generated or user-provided Timestamp. This allows users to quickly retrieve the most recent value for a column (the default), the specific value for a column by timestamp, or all values for a column. The last is most useful because it permits very rapid temporal analysis on data in a column.
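The versioned, sorted cell layout described above can be sketched in a few lines of Python. This is an illustrative in-memory model only, not the API of BigTable or any derived product. The class name `VersionedColumnFamily` and its methods are hypothetical, and the by-timestamp lookup is assumed to mean "latest value at or before the given timestamp".

```python
from bisect import insort


class VersionedColumnFamily:
    """Cells of one column family, kept sorted by row key, column, newest first."""

    def __init__(self):
        # Each cell is (row_key, column, -timestamp, value); negating the
        # timestamp makes the most recent version sort first.
        self._cells = []

    def put(self, row_key, column, timestamp, value):
        insort(self._cells, (row_key, column, -timestamp, value))

    def versions(self, row_key, column):
        """All (timestamp, value) pairs for a cell, newest first."""
        return [(-neg_ts, value)
                for r, c, neg_ts, value in self._cells
                if r == row_key and c == column]

    def get(self, row_key, column, timestamp=None):
        """Most recent value by default, or the value as of `timestamp`."""
        for ts, value in self.versions(row_key, column):
            if timestamp is None or ts <= timestamp:
                return value
        return None  # sparse model: an absent cell simply has no entry


family = VersionedColumnFamily()
family.put("row-1", "temperature", 1, 20)
family.put("row-1", "temperature", 3, 22)
family.put("row-1", "temperature", 2, 21)

print(family.get("row-1", "temperature"))       # most recent version
print(family.get("row-1", "temperature", 2))    # version as of timestamp 2
print(family.versions("row-1", "temperature"))  # full history of the cell
```

A real implementation would use a range lookup on the sorted data rather than the linear filter in `versions`; the filter is kept here for brevity.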

Because data for a given column is stored together, two key benefits are achieved. First, aggregation of the data in that column requires only the values for that column to be read. By contrast, in a relational system, the entire row (at least up to the column) needs to be read, which can be a large amount of data if the row is long and the column is near the end. Second, updates to a single column do not require the data for the rest of the row to be read or written. Also, because all the data in a column is uniform, data can be compressed much more efficiently. Often only a single copy of the value for a column is stored, followed by the row keys where that value exists. And while deletes of an entire column are very efficient, deletes of an entire record are extremely expensive. This is why historically column-oriented stores have been applied to online analytical processing (OLAP)-style applications while relational stores were applied to online transaction processing (OLTP) requirements.
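As a rough illustration of the first benefit, the following Python sketch stores the same three hypothetical records in a row-oriented and a column-oriented layout and counts how many field values each layout has to touch to sum one column. The record contents and function names are assumptions made for the example, and counting touched values is only a stand-in for real I/O cost.

```python
# Row-oriented layout: one complete record per entry (hypothetical data).
rows = [
    {"id": 1, "name": "alpha", "region": "east", "amount": 10.0},
    {"id": 2, "name": "beta", "region": "west", "amount": 20.5},
    {"id": 3, "name": "gamma", "region": "east", "amount": 5.5},
]

# Column-oriented layout: one list of like values per column.
columns = {field: [record[field] for record in rows] for field in rows[0]}


def sum_row_store(records, field):
    """Sum one field; every record is read in full to reach it."""
    total, values_touched = 0.0, 0
    for record in records:
        values_touched += len(record)
        total += record[field]
    return total, values_touched


def sum_column_store(cols, field):
    """Sum one field; only that column's values are read."""
    values = cols[field]
    return sum(values), len(values)


print(sum_row_store(rows, "amount"))        # same total, 12 values touched
print(sum_column_store(columns, "amount"))  # same total, 3 values touched
```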

Recently, security has been a major focus of existing column implementations, primarily due to the release by the National Security Agency (NSA) of its BigTable implementation to the open source community. A key advantage of the NSA implementation and other recently announced implementations is the availability of security controls at the individual cell level. With these implementations, a given user might have access to only certain cells in a group based potentially on the value of those or other cells.

There are several very mature distributed column-oriented implementations available today from both open source groups and commercial foundations. These have been implemented and are operational across a wide range of businesses and government organizations. Hybrid capabilities are emerging that implement relational access methods (e.g., SQL) on top of BigTable/Columnar storage models. In addition, relational implementations are adopting columnar-oriented physical storage models to provide more efficient access for Big Data OLAP-like aggregations and analytics.

4.2.3.2.3.4 Document Storage Platforms

Document storage approaches have been around for some time and were popularized by the need to quickly search large amounts of unstructured data. Modern document stores have evolved to include extensive search and indexing capabilities for structured data and metadata, which is why they are often referred to as semi-structured data stores. Within a document-oriented data store, each document encapsulates and encodes the metadata, fields, and any other representations of that record. While a document is somewhat analogous to a row in a relational table, one reason document stores evolved and have gained in popularity is that most implementations do not enforce a fixed or constant schema. While best practices hold that groups of documents should be logically related and contain similar data, there is no requirement that they be alike or that any two documents even contain the same fields. This is one reason that document stores are frequently popular for datasets that have sparsely populated fields, since there is normally far less overhead than in traditional RDBMS systems, where null value columns in records are actually stored. Groups of documents within these types of stores are generally referred to as collections, and, as in key-value stores, some sort of unique key references each document.

In modern implementations, documents can be built of arbitrarily nested structures and can include variable length arrays and, in some cases, executable scripts/code (which has significant security and privacy implications). Most document-store implementations also support additional indexes on other fields or properties within each document, with many implementing specialized index types for sparse data, geospatial data, and text.

When modeling data into document stores, the preferred approach is to de-normalize the data as much as possible and embed all one-to-one and most one-to-many relationships within a single document. This allows updates to documents to be atomic operations, which preserves referential integrity between the documents. The most common case where references between documents should be used is when there are data elements that occur frequently across sets of documents and whose relationship to those documents is static. For example, the publisher of a given book edition does not change, and there are far fewer publishers than there are books. It would not make sense to embed all the publisher information into each book document. Rather, the book document would contain a reference to the unique key for the publisher. Since the reference will never change for that edition of the book, there is no danger of loss of referential integrity. Thus, information about the publisher (address, for example) can be updated in a single atomic operation, the same as the book. Were this information embedded, it would need to be updated in every book document with that publisher.
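The book and publisher example can be sketched with plain Python dictionaries standing in for two document collections. All keys, field names, and values below are hypothetical, and the helper `book_with_publisher` is an illustrative application-side lookup rather than a feature of any specific document store.

```python
# Two hypothetical collections, each mapping a unique key to a document.
publishers = {
    "pub-1": {"name": "Example Press", "address": "1 Old Street"},
}

books = {
    "book-1": {
        "title": "First Title",
        "edition": 2,
        "authors": ["A. Writer", "B. Author"],  # one-to-many data is embedded
        "publisher_id": "pub-1",                # static relationship is referenced
    },
    "book-2": {
        "title": "Second Title",
        "edition": 1,
        "authors": ["C. Scribe"],
        "publisher_id": "pub-1",
    },
}


def book_with_publisher(book_id):
    """Resolve the publisher reference when the book is read."""
    book = dict(books[book_id])
    book["publisher"] = publishers[book.pop("publisher_id")]
    return book


# A single update to one publisher document is seen by every referencing book.
publishers["pub-1"]["address"] = "2 New Avenue"

for book_id in books:
    print(book_id, book_with_publisher(book_id)["publisher"]["address"])
```

Had the publisher been embedded in each book, the address change would require one write per book document instead of one write in total.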

In the Big Data realm, document stores scale horizontally through the use of partitioning or sharding to distribute portions of the collection across multiple nodes. This partitioning can be round-robin-based, ensuring an even distribution of data, or content/key-based, so that data locality is maintained for similar data. Depending on the application, the choice of partitioning key, as with any database, can have significant impacts on performance, especially where aggregation functions are concerned.

There are no standard query languages for document store implementations, with most using a language derived from their internal document representation (e.g., JavaScript Object Notation [JSON], XML).

4.2.3.2.3.5 Graph Storage Platforms

While social networking sites like Facebook and LinkedIn have certainly driven the visibility and evolution of graph stores (and processing as discussed below), graph stores have for years been a critical part of many problem domains, from military intelligence and counterterrorism to route planning/navigation and the semantic web. Graph stores represent data as a series of nodes, edges, and properties on those. Analytics against graph stores range from very basic shortest path and page ranking to entity disambiguation and graph matching.

Graph databases typically store two types of objects, nodes and relationships, as shown in Figure 14 below. Nodes represent objects in the problem domain that are being analyzed, be they people, places, organizations, accounts, or other objects. Relationships describe how those objects in the domain relate to each other. Relationships can be non-directional/bidirectional but are typically expressed as unidirectional in order to provide more richness and expressiveness to the relationships. Hence, between two people nodes where they are father and son, there would be two relationships. One is "father of", going from the father node to the son node, and the other is "son of", going from the son node to the father node. In addition, nodes and relationships can have properties or attributes. This is typically descriptive data about the element. For people, it might be name, birthdate, or other descriptive quality. For locations, it might be an address or geospatial coordinate. For a relationship like a phone call, it could be the date, time of the call, and the duration of the call. Within graphs, relationships are not always equal, nor do they have the same strength. Thus, a relationship often has one or more weight, cost, or confidence attributes. A strong relationship between people might have a high weight because they have known each other for years and communicate every day. A relationship where two people just met would have a low weight. The distance between nodes (be it a physical distance or a difficulty) is often expressed as a cost attribute on a relation in order to allow computation of true shortest paths across a graph. In military intelligence applications, relationships between nodes in a terrorist or command and control network might only be suspected or have not been completely verified, so those relationships would have confidence attributes. Properties on nodes may also have confidence factors associated with them, although in those cases the property can be decomposed into its own node and tied with a relationship. Graph storage approaches can actually be viewed as a specialized implementation of a document storage scheme with two types of documents (nodes and relationships). In addition, one of the most critical elements in analyzing graph data is locating the node or edge in the graph where the analysis is to begin. To accomplish this, most graph databases implement indexes on the node or edge properties. Unlike relational and other data storage approaches, most graph databases tend to use artificial/pseudo keys or globally unique identifiers (GUIDs) to uniquely identify nodes and edges. This allows attributes/properties to be easily changed, due either to actual changes in the data (e.g., someone changed their name) or to more information being found (e.g., a better location for some item or event), without needing to change the pointers to/from relationships.
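The node, relationship, and property model described above, including artificial keys and a cost attribute, can be sketched in Python as follows. The graph contents and identifiers such as `n1` are hypothetical, and the shortest-path routine is a standard Dijkstra search used purely for illustration, not the method of any particular graph database.

```python
import heapq

# Nodes keyed by artificial identifiers; descriptive data lives in properties.
nodes = {
    "n1": {"name": "Depot"},
    "n2": {"name": "Warehouse"},
    "n3": {"name": "Junction"},
    "n4": {"name": "Customer"},
}

# Unidirectional relationships: (source, target, properties).
edges = [
    ("n1", "n2", {"type": "road", "cost": 4.0}),
    ("n1", "n3", {"type": "road", "cost": 1.0}),
    ("n3", "n2", {"type": "road", "cost": 2.0}),
    ("n2", "n4", {"type": "road", "cost": 5.0}),
]


def shortest_path_cost(start, goal):
    """Dijkstra search over the `cost` attribute of each relationship."""
    adjacency = {node_id: [] for node_id in nodes}
    for source, target, props in edges:
        adjacency[source].append((target, props["cost"]))

    best = {start: 0.0}
    frontier = [(0.0, start)]
    while frontier:
        cost, node_id = heapq.heappop(frontier)
        if node_id == goal:
            return cost
        if cost > best.get(node_id, float("inf")):
            continue  # stale queue entry; a cheaper route was already found
        for neighbor, step in adjacency[node_id]:
            new_cost = cost + step
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                heapq.heappush(frontier, (new_cost, neighbor))
    return None  # goal is not reachable from start


print(shortest_path_cost("n1", "n4"))  # cheapest route goes via n3 and n2

# Renaming a node changes a property only; relationships still point at "n2".
nodes["n2"]["name"] = "Regional Warehouse"
print(shortest_path_cost("n1", "n4"))
```

Because relationships reference the artificial keys rather than names, the property update at the end does not require touching any edge.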

Figure 14: Object Nodes and Relationships of Graph Databases

The problem with graphs in the Big Data realm is that they grow to be too big to fit into memory on a single node, and their typically chaotic nature (few real-world graphs follow well-defined patterns) makes their partitioning for a distributed implementation problematic. While distance between or closeness of nodes would seem like a straightforward partitioning approach, there are multiple issues which must be addressed. The first is balancing of data. Graphs often tend to have large clusters of data that are very dense in a given area, leading to imbalances and hot spots in processing. Second, no matter how the graph is distributed, there are connections (edges) that will cross the boundaries. That typically requires that cluster nodes know about or how to access the data on other nodes and requires inter-node data transfer or communication. This makes the choice of processing architectures for graph data especially critical. Architectures that do not have inter-node communication/messaging tend not to work well for most graph problems. Typically, distributed architectures for processing graphs assign chunks of the graph to cluster nodes, then the nodes use messaging approaches to communicate changes in the graph or the value of certain calculations along a path.

Even small graphs quickly elevate into the realm of Big Data when one is looking for patterns or distances across more than one or two degrees of separation between nodes. Depending on the density of the graph, this can quickly cause a combinatorial explosion in the number of conditions/patterns that need to be tested.

A specialized implementation of a graph store known as the Resource Description Framework (RDF) is part of a family of specifications from the World Wide Web Consortium (W3C) that is often directly associated with the Semantic Web and associated concepts. RDF triples, as they are known, consist of a subject (Mr. X), a predicate (lives at), and an object (Mockingbird Lane). Thus, a collection of RDF triples represents a directed labeled graph. The contents of RDF stores are frequently described using formal ontology languages like the W3C Web Ontology Language (OWL) or the RDF Schema (RDFS) language, which establish the semantic meanings and models of the underlying data. To support better horizontal integration of heterogeneous datasets, extensions to the RDF concept such as the Data Description Framework (DDF) have been proposed, which add additional types to better support semantic interoperability and analysis [19], [20].
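The subject, predicate, object structure can be illustrated with a small Python sketch that holds triples as tuples and matches them against a pattern. This uses no RDF library and is not SPARQL. The second and third triples and the `match` helper are hypothetical additions to the Mr. X example, and real RDF terms would normally be IRIs or typed literals rather than bare strings.

```python
# Each triple is (subject, predicate, object); the set forms a directed
# labeled graph whose edge labels are the predicates.
triples = {
    ("Mr. X", "lives at", "Mockingbird Lane"),
    ("Mr. Y", "lives at", "Mockingbird Lane"),  # hypothetical extra triple
    ("Mr. X", "knows", "Mr. Y"),                # hypothetical extra triple
}


def match(subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


# Who lives at Mockingbird Lane?
residents = [s for s, _, _ in match(predicate="lives at", obj="Mockingbird Lane")]
print(residents)

# Everything asserted about Mr. X.
print(match(subject="Mr. X"))
```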

Graph data stores currently lack any form of standardized APIs or query languages. However, the W3C has developed the SPARQL query language for RDF, which currently has recommendation status, and there are several frameworks, such as Sesame, that are gaining popularity for working with RDF and other graph-oriented data stores.

4.2.3.3 Processing Frameworks

The processing frameworks for Big Data provide the necessary infrastructure software to support implementation of applications that can deal with the volume, velocity, variety, and variability of data. Processing frameworks define how the computation and processing of the data is organized. Big Data applications rely on various platforms and technologies to meet the challenges of scalable data analytics and operation.

Processing frameworks generally focus on data manipulation, which falls along a continuum between batch and streaming oriented processing. However, depending on the specific data organization platform and the actual processing requested, any given framework may support a range of data manipulation from high latency to near real time (NRT) processing. Overall, many Big Data architectures will include multiple frameworks to support a wide range of requirements.

Typically, processing frameworks are categorized based on whether they support batch or streaming processing. This categorization is generally stated from the user perspective (e.g., how fast does a user get a response to a request). However, Big Data processing frameworks actually have three processing phases: data ingestion, data analysis, and data dissemination, which closely follow the flow of data through the architecture. The Big Data Application Provider activities control the application of specific framework capabilities to these processing phases. The batch-streaming continuum, illustrated in the processing subcomponent in the NBDRA (Figure 3), can be applied to the three distinct processing phases. For example, data may enter a Big Data system at high velocity and the end user must quickly retrieve a summary of the prior day’s data. In this case, the ingestion of the data into the system needs to be NRT and keep up with the data stream. The analysis portion could be incremental (e.g., performed as the data is ingested) or could be a batch process performed at a specified time, while retrieval (i.e., read or visualization) of the data could be interactive. Specific to the use case, data transformation may take place at any point during its transit through the system. For example, the ingestion phase may only write the data as quickly as possible, or it may run some foundational analysis to track incrementally computed information such as minimum, maximum, and average. The core processing job may only perform the analytic elements required by the Big Data Application Provider and compute a matrix of data, or it may actually generate some rendering like a heat map to support the visualization component. To permit rapid display, the data dissemination phase almost certainly does some rendering, but the extent depends on the nature of the data and the visualization.
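The incremental analysis mentioned for the ingestion phase (minimum, maximum, and average tracked as data arrives) can be sketched as a small Python accumulator. The class name `RunningStats` and the sample readings are hypothetical. The sketch shows only that these summaries can be maintained per record in constant space, without a later batch pass over the stored data.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Summary statistics updated once per ingested value."""
    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def update(self, value: float) -> None:
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    @property
    def average(self):
        return self.total / self.count if self.count else None


stats = RunningStats()
for reading in [3.0, 1.0, 8.0]:  # stand-in for an incoming data stream
    stats.update(reading)        # would run alongside the write to storage

print(stats.minimum, stats.maximum, stats.average)
```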

For the purposes of this discussion, most processing frameworks can be described with respect to their primary location within the information flow illustrated in Figure 15.

Figure 15: Information Flow

The green coloring in Figure 15 illustrates the general sensitivity of that processing style to latency, which is defined as the time from when a request or piece of data arrives at a system until its processing/delivery is complete. The darker the shade, the more sensitive to latency. For Big Data, the ingestion may or may not require NRT performance to keep up with the data flow. Some types of analytics (specifically those categorized as Complex Event Processing [CEP]) may or may not require NRT processing. The Data Consumer generally is located at the far right of Figure 15. Depending upon the use case and application, batch responses (e.g., a nightly report is emailed) may be sufficient. In other cases, the user may be willing to wait minutes for the results of a query to be returned, or they may need immediate alerting when critical information arrives at the system. In general, batch analytics tend to better support long-term strategic decision making, where the overall view or direction is not affected by the latest small changes in the underlying data. Streaming analytics are better suited for tactical decision making, where new data needs to be acted upon immediately. A primary use case for streaming analytics would be electronic trading on stock exchanges, where the window to act on a given piece of data can be measured in microseconds.

Messaging and communication provide the transfer of data between processing elements and the buffering necessary to deal with the deltas in data rate, processing times, and data requests.

Typically, Big Data discussions focus on the categories of batch and streaming frameworks for analytics. However, frameworks for retrieval of data that provide interactive access to Big Data are becoming more prevalent. It is noted that the lines between these categories are not solid or distinct, with some frameworks providing aspects of each category.

4.2.3.3.1 Batch Frameworks

Batch frameworks, whose roots stem from the mainframe processing era, are some of the most prevalent and mature components of a Big Data architecture because of the historically long processing times for large data volumes. Batch frameworks ideally are not tied to a particular algorithm or even algorithm type, but rather provide a programming model where multiple classes of algorithms can be implemented. Also, when discussed in terms of Big Data, these processing models are frequently distributed across multiple nodes of a cluster. They are routinely differentiated by the amount of data sharing between processes/activities within the model.

4.2.3.3.2 Streaming Frameworks

Streaming frameworks are built to deal with data that requires processing as fast as or faster than the velocity at which it arrives into the Big Data system. The primary goal of streaming frameworks is to reduce the latency between the arrival of data into the system and the creation, storage, or presentation of the results. CEP is one of the problem domains frequently addressed by streaming frameworks. CEP uses data from one or more streams/sources to infer or identify events or patterns in NRT.

Almost all streaming frameworks for Big Data available today implement some form of basic workflow processing for the streams. These workflows use messaging/communications frameworks to pass data objects (often referred to as events) between steps in the workflow. This frequently takes the form of a directed execution graph. Streaming frameworks are typically distinguished by the following three characteristics: event ordering and processing guarantees, state management, and partitioning/parallelism. These three characteristics are described below.

4.2.3.3.2.1 Event Ordering and Processing Guarantees

This characteristic refers to whether stream processing elements are guaranteed to see messages or events in the order they are received by the Big Data System, as well as how often a message or event may or may not be processed. In a non-distributed and single-stream mode, this type of guarantee is relatively trivial. Once distribution and/or multiple streams are added to the system, the guarantee becomes more complicated. With distributed processing, the guarantees must be enforced for each partition of the data (partitioning and parallelism as further described below). Complications arise when the process/task/job dealing with a partition dies. Processing guarantees are typically divided into the following three classes:

• At-most-once delivery: This is the simplest form of guarantee and allows for messages or events to be dropped if there is a failure in processing or communications or if they arrive out of order. This class of guarantee is applicable for data where there is no dependence of new events on the state of the data created by prior events.

• At-least-once delivery: Within this class, the frameworks will track each message or event (and any downstream messages or events generated) to verify that it is processed within a configured time frame. Messages or events that are not processed in the time allowed are re-introduced into the stream. This mode requires extensive state management by the framework (and sometimes the associated application) to track which events have been processed by which stages of the workflow. However, under this class, messages or events may be processed more than once and also may arrive out of order. This class of guarantee is appropriate for systems where every message or event must be processed regardless of the order (e.g., no dependence on prior events), and the application either is not affected by duplicate processing of events or has the ability to de-duplicate events itself.

• Exactly-once delivery: This class of framework processing requires the same top-level state tracking as at-least-once delivery but embeds mechanisms within the framework to detect and ignore duplicates. This class often guarantees ordering of event arrivals and is required for applications where the processing of any given event is dependent on the processing of prior events. It is noted that these guarantees only apply to data handling within the framework. If data is passed outside the framework processing topology by an application, then the application must ensure the processing state is maintained by the topology, or duplicate data may be forwarded to non-framework elements of the application.

In the latter two classes, some form of unique key must be associated with each message or event to support de-duplication and event ordering. Often, this key will contain some form of timestamp plus the stream identification (ID) to uniquely identify each message in the stream.
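The role of the unique key under at-least-once delivery can be sketched in Python. The example simulates a redelivered message and compares a handler that ignores duplicates with one that does not. The message fields, the class names, and the in-memory `seen` set are hypothetical. A real framework would persist this state and bound its size.

```python
class NaiveCounter:
    """Processes every delivery, so redelivered events are counted twice."""

    def __init__(self):
        self.total = 0

    def handle(self, message):
        self.total += message["value"]


class DeduplicatingCounter:
    """Uses (stream ID, timestamp) as the unique key to skip duplicates."""

    def __init__(self):
        self.total = 0
        self.seen = set()

    def handle(self, message):
        key = (message["stream_id"], message["timestamp"])
        if key in self.seen:
            return False  # duplicate delivery: already processed
        self.seen.add(key)
        self.total += message["value"]
        return True


# At-least-once delivery: the first message is redelivered after a timeout.
deliveries = [
    {"stream_id": "s1", "timestamp": 100, "value": 5},
    {"stream_id": "s1", "timestamp": 101, "value": 7},
    {"stream_id": "s1", "timestamp": 100, "value": 5},  # redelivery
]

naive, deduplicating = NaiveCounter(), DeduplicatingCounter()
for message in deliveries:
    naive.handle(message)
    deduplicating.handle(message)

print(naive.total, deduplicating.total)  # duplicate inflates only the naive total
```

Whether this filtering lives in the application (at-least-once) or inside the framework (exactly-once) is the distinction drawn between the latter two classes.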

4.2.3.3.2.2 State Management

A critical characteristic of stream processing frameworks is their ability to recover and not lose critical data in the event of a process or node failure within the framework. Frameworks typically provide this state management through persistence of the data to some form of storage. This persistence can be local, allowing the failed process to be restarted on the same node; a remote or distributed data store, allowing the process to be restarted on any node; or local storage that is replicated to other nodes. The trade-off between these storage methods is the latency introduced by the persistence. Both the amount of state data persisted and the time required to assure that the data is persisted contribute to the latency. In the case of a remote or distributed data store, the latency required is generally dependent on the extent to which the data store implements ACID (Atomicity, Consistency, Isolation, Durability) or BASE (Basically Available, Soft state, Eventual consistency) style consistency. With replication of local storage, the reliability of the state management is entirely tied to the ability of the replication to recover in the event of a process or node failure. Sometimes this state replication is actually implemented using the same messaging/communication framework that is used to communicate with and between stream processors. Some frameworks actually support full transaction semantics, including multi-stage commits and transaction rollbacks. The trade-off is the same one that exists for any transaction system: any type of ACID-like guarantee will introduce latency. Too much latency at any point in the stream flow can create bottlenecks and, depending on the ordering or processing guarantees, can result in deadlock or loop states, especially when some level of failure is present.

4.2.3.3.2.3 Partitioning and Parallelism

This streaming framework characteristic relates to the distribution of data across nodes and worker tasks to provide the horizontal scalability needed to address the volume and velocity of Big Data streams. This partitioning scheme must interact with the resource management framework to allocate resources. The even distribution of data across partitions is essential so that the associated work is evenly distributed. Even data distribution depends directly on the selection of a key (e.g., user ID, host name) whose values are evenly distributed. The simplest form might be a number that increments by one and is then reduced modulo the number of tasks/workers available. If data dependencies require that all records with a common key be processed by the same worker, then assuring an even data distribution over the life of the stream can be difficult. Some streaming frameworks address this issue by supporting dynamic partitioning, where the partition of an overloaded worker is split and allocated to existing workers or newly created workers. To achieve success, especially with a data/state dependency related to the key, it is critical that the framework have state management that allows the associated state data to be moved/transitioned to the new/different worker.
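
The two partitioning schemes described above can be illustrated with a minimal sketch. It is not drawn from any particular streaming framework; the function names, the use of a SHA-256 digest as the stable hash, and the sample stream of host names are all assumptions made for illustration. The sketch contrasts counter-modulo assignment, which balances load but ignores keys, with key-based assignment, which keeps each key on one worker but can become skewed.

```python
import hashlib
from collections import Counter


def round_robin_partition(sequence_number: int, num_workers: int) -> int:
    """Simplest scheme: an incrementing counter reduced modulo the worker count."""
    return sequence_number % num_workers


def key_partition(key: str, num_workers: int) -> int:
    """Route every record with the same key to the same worker.

    A stable digest is used rather than Python's built-in hash(), which is
    randomized per process for strings (an illustrative choice, not a standard).
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


def load_per_worker(keys, num_workers):
    """Count how many records each worker would receive for a stream of keys."""
    counts = Counter(key_partition(k, num_workers) for k in keys)
    return [counts.get(w, 0) for w in range(num_workers)]


# Hypothetical skewed stream: one host produces 90 of 100 records.
stream = ["host-a"] * 90 + [f"host-{i}" for i in range(10)]
rr_load = Counter(round_robin_partition(i, 4) for i in range(len(stream)))
key_load = load_per_worker(stream, 4)
```

With this sample stream, the counter-based scheme gives each of the four workers the same number of records, while the key-based scheme sends at least 90 records to a single worker. That imbalance is the situation dynamic partitioning is intended to relieve.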

4.2.3.4 Crosscutting Components

Because the components in the three sub-roles of the Big Data Framework Provider must share resources and communicate, two major classes of crosscutting components are needed: Messaging/Communications Frameworks and Resource Management Frameworks.

4.2.3.4.1 Messaging/Communications Frameworks

Messaging and communications frameworks have their roots in the HPC environments long popular in the scientific and research communities. Messaging/Communications Frameworks were developed to provide APIs for the reliable queuing, transmission, and receipt of data between nodes in a horizontally scaled cluster. These frameworks typically implement either a point-to-point transfer model or a store-and-forward model in their architecture. Under a point-to-point model, data is transferred directly from the sender to the receivers. The majority of point-to-point implementations do not provide for any form of message recovery should there be a program crash or interruption in the communications link between sender and receiver. These frameworks typically implement all logic within the sender and receiver program space, including any delivery guarantees or message retransmission capabilities. One common variation of this model is the implementation of multicast (i.e., one-to-many or many-to-many distribution), which allows the sender to broadcast the messages over a channel, and receivers in turn listen to those channels of interest. Typically, multicast messaging does not implement any form of guaranteed receipt. With the store-and-forward model, the sender addresses the message to one or more receivers and sends it to an intermediate broker, which stores the message and then forwards it on to the receivers. Many of these implementations support some form of persistence for messages not yet delivered, providing for recovery in the event of process or system failure. Multicast messaging can also be implemented in this model and is frequently referred to as a pub/sub model.
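
The store-and-forward and pub/sub behavior described above can be sketched as a toy in-memory broker. The class and method names here are hypothetical and do not correspond to any real messaging product. The in-memory queues only stand in for the persistence a real broker would provide, so this sketch would not survive a process failure.

```python
from collections import defaultdict, deque


class StoreAndForwardBroker:
    """Toy broker: holds each message until the receiver's handler succeeds."""

    def __init__(self):
        self._subscribers = defaultdict(set)  # topic -> receiver names
        self._pending = defaultdict(deque)    # receiver -> undelivered messages

    def subscribe(self, receiver, topic):
        self._subscribers[topic].add(receiver)

    def publish(self, topic, payload):
        # Store one copy per current subscriber (pub/sub style multicast).
        for receiver in self._subscribers[topic]:
            self._pending[receiver].append((topic, payload))

    def deliver(self, receiver, handler):
        """Forward pending messages in order; stop and retain on handler failure."""
        delivered = 0
        queue = self._pending[receiver]
        while queue:
            topic, payload = queue[0]
            try:
                handler(topic, payload)
            except Exception:
                break  # message stays queued for a later retry
            queue.popleft()
            delivered += 1
        return delivered

    def pending_count(self, receiver):
        return len(self._pending[receiver])
```

In this sketch a message is removed from a receiver's queue only after its handler returns normally, which approximates the recovery property of the store-and-forward model. A point-to-point implementation would have no such intermediate queue.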


4.2.3.4.2 Resource Management Frameworks

As Big Data systems have evolved and become more complex, and as businesses work to leverage limited computation and storage resources to address a broader range of applications and business challenges, the requirement to effectively manage those resources has grown significantly. While tools for resource management and elastic computing have expanded and matured in response to the needs of cloud providers and virtualization technologies, Big Data introduces unique requirements for these tools. In particular, Big Data frameworks tend to fall more into a distributed computing paradigm, which presents additional challenges.

The Big Data characteristics of volume and velocity drive the requirements with respect to Big Data resource management. Elastic computing (i.e., spawning another instance of some service) is the most common approach to address expansion in volume or velocity of data entering the system. CPU and memory are the two resources that tend to be most essential to managing Big Data situations. While shortages or over-allocation of either will have significant impacts on system performance, improper or inefficient memory management is frequently catastrophic. Big Data resource management differs, and becomes more complex, in the allocation of computing resources to different storage or processing frameworks that are optimized for specific applications and data structures. As such, resource management frameworks will often use data locality as one of the input variables in determining where new processing framework elements (e.g., master nodes, processing nodes, job slots) are instantiated. Importantly, because the data is big (i.e., large volume), it generally is not feasible to move data to the processing frameworks. In addition, while nearly all Big Data processing frameworks can be run in virtualized environments, most are designed to run on bare metal commodity hardware to provide efficient I/O for the volume of the data.
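
A minimal sketch shows how data locality might enter a placement decision alongside CPU and memory. The Node fields, the place_task function, and the tie-breaking rule (prefer more free memory) are illustrative assumptions, not the behavior of any specific resource manager.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cpus: int
    free_memory_gb: int
    local_blocks: frozenset  # IDs of data blocks stored on this node


def place_task(nodes, needed_blocks, cpus, memory_gb):
    """Choose a node for a new processing element.

    Among nodes with enough free CPU and memory, prefer the one holding the
    most needed blocks locally; break ties by free memory. Returns the node
    name, or None when no node can host the task.
    """
    candidates = [
        n for n in nodes
        if n.free_cpus >= cpus and n.free_memory_gb >= memory_gb
    ]
    if not candidates:
        return None
    best = max(
        candidates,
        key=lambda n: (len(n.local_blocks & needed_blocks), n.free_memory_gb),
    )
    # Reserve the resources so later placements see the reduced capacity.
    best.free_cpus -= cpus
    best.free_memory_gb -= memory_gb
    return best.name
```

The sketch moves the computation to the data rather than the reverse. A node that holds every needed block is still skipped when it lacks CPU or memory, reflecting that locality is one input variable among several.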

Two distinct approaches to resource management in Big Data frameworks are evolving. The first is intra-framework resource management, where the framework itself manages allocation of resources between its various components. This allocation is typically driven by the framework’s workload and often seeks to turn off unneeded resources, either to minimize the overall demands of the framework on the system or to minimize the operating cost of the system by reducing energy use. With this approach, applications can seek to schedule and request resources that, much like mainframe OSs of the past, are managed through scheduling queues and job classes.

The second approach is inter-framework resource management, which is designed to address the needs of many Big Data systems to support multiple storage and processing frameworks that can address and be optimized for a wide range of applications. With this approach, the resource management framework runs as a service that supports and manages resource requests from frameworks, monitors framework resource usage, and in some cases manages application queues. In many ways, this approach is like the resource management layers common in cloud/virtualization environments, and there are efforts underway to create hybrid resource management frameworks that handle both physical and virtual resources.

Taking these concepts further and combining them has led to emerging technologies built around what are termed software-defined data centers (SDDCs). This expansion on elastic and cloud computing goes beyond the management of fixed pools of physical resources as virtual resources to include the automated deployment and provisioning of features and capabilities onto physical resources. For example, automated deployment tools that interface with virtualization or other framework APIs can be used to automatically stand up entire clusters or to add physical resources to physical or virtual clusters.

4.2.4 MANAGEMENT FABRIC

The management fabric encompasses components responsible for establishing the system and continuing its operation.


The characteristics of Big Data pose system management challenges for traditional management platforms. To efficiently capture, store, process, analyze, and distribute complex and large datasets arriving or leaving with high velocity, resilient system management is needed.

As in traditional systems, system management for Big Data architecture involves provisioning, configuration, package management, software management, backup management, capability management, resource management, and performance management of the Big Data infrastructure, including compute nodes, storage nodes, and network devices. Due to the distributed and complex nature of the Big Data infrastructure, system management for Big Data is challenging, especially with respect to the capability for controlling, scheduling, and managing the processing frameworks to perform the scalable, robust, and secure analytics processing required by the Big Data Application Provider. The Big Data infrastructure may contain SAN or NAS storage devices, cloud storage spaces, NoSQL databases, MapReduce clusters, data analytics functions, search and indexing engines, and messaging platforms. The supporting enterprise computing infrastructure can range from traditional data centers to cloud services to dispersed computing nodes of a grid.

In an enterprise environment, the management platform would typically provide enterprise-wide monitoring and administration of the Big Data distributed components. This includes network management, fault management, configuration management, system accounting, performance management, and security management.

4.2.4.1 Monitoring Frameworks

To monitor the distributed and complex nature of the Big Data infrastructure, system management relies on the following:

• Standard protocols such as Simple Network Management Protocol (SNMP), which are used to transmit status about resources and fault information to the management fabric components; and

• Deployable agents or management connectors, which allow the management fabric to both monitor and control elements of the framework.

These two items aid in monitoring the health of various types of computing resources and coping with performance and failure incidents while maintaining the quality of service levels required by the Big Data Application Provider. Management connectors are necessary for scenarios where the cloud service providers expose management capabilities via APIs. It is conceivable that the infrastructure elements contain autonomic, self-tuning, and self-healing capabilities, thereby reducing reliance on the centralized model of system monitoring.

4.2.4.2 Provisioning/Configuration Frameworks

In large infrastructures with many thousands of computing and storage nodes, the provisioning of tools and applications should be as automated as possible. Software installation, application configuration, and regular patch maintenance should be pushed out and replicated across the nodes in an automated fashion, which could be done based on the topology knowledge of the infrastructure. With the advent of virtualization, the utilization of virtual images may speed up the recovery process and provide efficient patching that can minimize downtime for scheduled maintenance. Such frameworks also interact with the Security and Privacy Fabric to ensure that the system configuration continually meets the security requirements outlined in the policies specified by the System Orchestrator.

4.2.4.3 Package Managers

Package management components support the installation and updates of other components within the Big Data system. This class of components is often provided by the underlying operating system component and is invoked by the provisioning/configuration frameworks to install and update components within the system. Components within this class generally leverage a central network repository to ensure that the correct component version is deployed consistently across the cluster. In many Big Data systems, this same repository is leveraged to support the deployment of application components and, in some cases, even data components.

4.2.4.4 Resource Managers

Resource management components within the Management Framework provide the system with the overall resources necessary to support the system. These components will work with external resource providers such as Cloud Service Providers to acquire the resources necessary to provision the other components of the system. They will handle requests for additional resources from resource managers within the Big Data Framework Provider when required and coordinate with the Provisioning/Configuration Frameworks to properly configure other components across those resources.

4.2.4.5 Data Life Cycle Managers

Data Life Cycle Management components are necessary to manage the life cycle of the data ingested into the system, stored and preserved in the system, and accessed for processing or dissemination purposes. These components include the Metadata Catalog, the Data Tracker, and Data Preservation.

The Metadata Catalog is the inventory of all datasets in the system. It should contain the model for the foundational concept of a “unit” of data, whether it is a database record (e.g., key-value pair or relational table row) or a dataset (e.g., database export file). Each data unit has characteristics maintained in the associated metadata, which should include at least a unique identifier and a timestamp indicating when the data was created and/or ingested. These timestamps will help the Data Life Cycle Manager to monitor the “age” of the data within the system. Moreover, the Metadata Catalog will have to support the data discovery that is necessary for data access and data governance. There are numerous international and national standards that govern the content, model, and interfaces for metadata catalogs.

The Data Tracker tracks the movement of data throughout the system, from the ingestion point to the dissemination or destruction point. The Data Tracker component handles the Volume and Variety characteristics inherent to Big Data. The two kinds of movements are as follows:

• Ingress and egress movement: tracks data entering and exiting the system. Data exiting means that the data are dispositioned to satisfy the retention policy, which can originate from either the needs of the Big Data application or the preservation policy. Indeed, some applications may require “fresh” data for analytical purposes. The degree of freshness depends on the specific requirements of the business applications and can be influenced by policy and regulations. For instance, a visual analytics application monitoring approval or disapproval feedback during a presidential election debate requires real-time data, such as the most recent tweet and blog data, whereas a study of the trend of household income over the past 50 years needs both recent and archived Census data. On the other hand, records management laws and policies may dictate the retention time for the data, and hence impact the Data Preservation.

• Intra-system movement: Due to the large volume of Big Data, the Big Data Framework Provider will likely have multitiered storage for cost-efficiency and scalability. Within that storage environment, data is made available to the analytics processes managed by the Big Data Application Provider. Commercial infrastructure vendors offer different storage categories with different pricing models. The action of making data available to processes and applications may be realized by physically moving the data to storage where the processing software can operate. However, a recent paradigm is to move computation and processing capabilities to where data are located to circumvent the large data transfer between storage tiers.

The Data Tracker may interface with the Data Preservation component to implement preservation and long-term storage policies.

The Data Preservation component is applied to both permanent and temporary data. Its responsibility is to continuously inspect the “age” of data in the system and operate on the data based on the retention policy. For permanent data, Data Preservation will perform the Preservation Plan, which can consist of migrating data to a long-term preservation format, periodically refreshing the storage hardware, or maintaining emulation environments used to read the archived data. Data Preservation will leverage the multitiered storage, which satisfies the data durability requirement and achieves cost-efficiency. If data are deemed to have a limited lifetime, then Data Preservation will apply appropriate disposition methods to purge them from the system. The purge methods will depend on the security policy to ensure data confidentiality.
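
A brief sketch illustrates how a Data Preservation component might use the Metadata Catalog's identifier and timestamp to act on the "age" of a data unit. The CatalogEntry fields, the action names, and the one-year archive threshold are assumptions for illustration only. An actual policy would come from the System Orchestrator and from applicable records management rules.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class CatalogEntry:
    unit_id: str                           # unique identifier of the data unit
    ingested_at: datetime                  # creation/ingestion timestamp
    permanent: bool                        # permanent vs. temporary data
    retention: Optional[timedelta] = None  # lifetime for temporary data


def preservation_action(entry, now, archive_after=timedelta(days=365)):
    """Return a hypothetical action name based on the unit's age and policy."""
    age = now - entry.ingested_at
    if entry.permanent:
        # Permanent data is never purged; older units move to a cheaper tier.
        return "migrate_to_archive_tier" if age >= archive_after else "keep"
    if entry.retention is not None and age >= entry.retention:
        return "purge"
    return "keep"
```

A periodic job could apply preservation_action to every catalog entry and hand the resulting actions to the storage tiers. The secure purge method itself is outside the scope of this sketch.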

4.2.5 SECURITY AND PRIVACY FABRIC

The components within the Security and Privacy Fabric implement the core activities supporting the overall security and privacy requirements outlined by the policies and processes of the System Orchestrator.

4.2.5.1 Authentication and Authorization Frameworks

Components within this class must interface and interact with all other components within the Big Data system to support access control to the data and services of the system. This support includes authenticating the user or service attempting to access the system resource to validate its identity. This class of components provides APIs to other services and components for collecting the identity information and validating that information against a trusted store of identities. Frequently these components will provide an identification token back to the invoking component that defines allowed access for the life of a session. This token can also be used to retrieve authorizations for the users/components detailing what data and service resources they may access. These authorizations can be used by the components to limit access to data or even filter data provided in response to requests by components. Typically, a component will pass the identification token as part of the request, and the receiving component will use it to look up authorizations from a trusted store to manage access to the underlying resources (data or services).
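
The token flow described above can be sketched as follows. AuthService, read_resource, and the in-memory dictionaries standing in for the trusted identity and authorization stores are hypothetical names, not part of any named framework. Passwords are held in plain text purely to keep the illustration short, and a real deployment would not do this.

```python
import secrets
import time


class AuthService:
    """Toy authentication/authorization service backed by in-memory stores."""

    def __init__(self, identities, authorizations, session_seconds=3600):
        self._identities = identities          # user -> password (illustration only)
        self._authorizations = authorizations  # user -> set of resource names
        self._sessions = {}                    # token -> (user, expiry time)
        self._session_seconds = session_seconds

    def authenticate(self, user, password, now=None):
        """Validate identity; return a session token, or None on failure."""
        now = time.time() if now is None else now
        stored = self._identities.get(user)
        if stored is None or not secrets.compare_digest(
            stored.encode("utf-8"), password.encode("utf-8")
        ):
            return None
        token = secrets.token_urlsafe(32)
        self._sessions[token] = (user, now + self._session_seconds)
        return token

    def authorizations_for(self, token, now=None):
        """Look up what the token's owner may access; empty if invalid or expired."""
        now = time.time() if now is None else now
        session = self._sessions.get(token)
        if session is None or session[1] <= now:
            return frozenset()
        return frozenset(self._authorizations.get(session[0], ()))


def read_resource(auth, token, resource, data, now=None):
    """Receiving component: check the passed token before serving the resource."""
    if resource not in auth.authorizations_for(token, now):
        raise PermissionError(f"access to {resource!r} denied")
    return data[resource]
```

Here the invoking component obtains the token once and passes it with each request. The receiving component never sees the credentials, only the token and the authorizations it resolves to.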

4.2.5.2 Audit Frameworks

Audit Framework components are responsible for collecting, managing, consolidating, and in some cases monitoring events from across the system that reflect access to and changes to data and services across the system. The scope and nature of the events collected are based on the requirements specified by the policies within the System Orchestrator. Typically, these components will collect and store this data within a secure centralized repository within the system and manage the retention of this data based on the policies. The data maintained by these components can be leveraged during system operation to provide provenance and pedigree for data to users or application components, as well as for forensic analysis in response to security or data breaches. Because of the number and frequency of operations and events that may be generated by a large Big Data system, the framework itself must deal with the Big Data characteristics of volume and velocity. To handle this, many Big Data system architectures implement a Big Data system instance specifically for management and storage of this data. Monitoring frameworks within the Management Fabric may execute algorithms within this Big Data system instance to provide alerts to potential security or data issues.


5 SUMMARY

This document (Version 3) presents the overall NBDRA conceptual model along with architecture views for the activities performed by the architecture and the functional components that would implement the architecture.

The purpose of these views is to provide the system architect a framework to efficiently categorize the activities that the Big Data system will perform and the functional components that must be integrated to perform those activities. During the architecture process, the architect is encouraged to collaborate closely with the system stakeholders to ensure that all required activities for the system are captured in the activities view. Those activities should then be mapped to functional components within that view using a traceability matrix. This matrix will serve to validate that components will be integrated into the architecture to accomplish all required activities and that all integrated functional components have a purpose within the architecture.
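
The two validations that the traceability matrix supports can be expressed as a short check. The dictionary representation of the matrix and the function name validate_traceability are assumptions for illustration, since the document does not prescribe a format for the matrix.

```python
def validate_traceability(activities, components, matrix):
    """Check an activity-to-component traceability matrix.

    matrix maps each activity name to the set of component names that
    implement it. Returns three sorted lists: activities with no component,
    components that serve no activity, and components named in the matrix
    that are not in the component inventory.
    """
    used = set().union(*matrix.values())
    return {
        "unmapped_activities": sorted(a for a in activities if not matrix.get(a)),
        "unused_components": sorted(c for c in components if c not in used),
        "unknown_components": sorted(used - set(components)),
    }
```

An architecture passes the check when all three lists are empty: every required activity is accomplished by at least one component, and every integrated component has a purpose.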


Appendix A: Deployment Considerations

The NIST Big Data Reference Architecture is applicable to a variety of business environments and technologies. As a result, possible deployment models are not part of the core concepts discussed in the main body of this document. However, the loosely coupled and distributed nature of the Big Data Framework Provider functional components allows them to be deployed using multiple infrastructure elements, as described in Section 4.2.3. The two most common deployment configurations are directly on physical resources or on top of an IaaS cloud computing framework. The choice between these two configurations is driven by needs of efficiency/performance and elasticity. Physical infrastructures are typically used to obtain predictable performance and efficient utilization of CPU and I/O bandwidth, since they eliminate the overhead and additional abstraction layers typical of the virtualized environments in most IaaS implementations. IaaS cloud-based deployments are typically used when elasticity is needed to support changes in workload requirements. The ability to rapidly instantiate additional processing nodes or framework components allows the deployment to adapt to either increased or decreased workloads. By allowing the deployment footprint to grow or shrink based on workload demands, this deployment model can provide cost savings when public or shared cloud services are used, and more efficient use of resources and energy when a private cloud deployment is used. Recently, a hybrid deployment model known as Cloud Bursting has become popular. In this model, a physical deployment is augmented by either public or private IaaS cloud services. When additional processing is needed to support the workload, the additional framework component instances are established on the IaaS infrastructure and then deleted when no longer required.

Figure A-1: Big Data Framework Deployment Options

[Figure not reproduced. Labels recoverable from the diagram: Big Data Application Provider (Collection, Preparation/Curation, Analytics, Visualization, Access); Big Data Framework Provider (Processing: Computing and Analytic, with Batch, Interactive, and Streaming; Platforms: Data Organization and Distribution, with Indexed Storage and File Systems; Messaging/Communications; Resource Management); Cloud Provider (Cloud Services: IaaS, PaaS, SaaS; Resource Abstraction & Control); Physical Resources; Virtual Resources; Security and Privacy Fabric; Management Fabric.]

In addition to providing IaaS support, cloud providers are now offering Big Data Frameworks under a platform as a service (PaaS) model. Under this model, the system implementer is freed from the need to establish and manage the complex configuration and deployment typical of many Big Data Framework components. The implementer simply needs to specify the size of the cluster required, and the cloud provider manages the provisioning, configuration, and deployment of all the framework components. There are even some nascent offerings for specialized software as a service (SaaS) Big Data applications appearing in the market that implement the Big Data Application Provider functionality within the cloud environment. Figure A-1 illustrates how the components of the NBDRA might align with the NIST Cloud Reference Architecture [21]. The following sections describe some of the high-level interactions required between the Big Data Architecture elements and the CSP elements.

CLOUD SERVICE PROVIDERS

Recent data analytics solutions use algorithms that can utilize and benefit from the frameworks of the cloud computing systems. Cloud computing has essential characteristics such as rapid elasticity and scalability, multi-tenancy, on-demand self-service, and resource pooling, which together can significantly lower the barriers to the realization of Big Data implementations.

The CSP implements and delivers cloud services. Processing of a service invocation is done by means of an instance of the service implementation, which may involve the composition and invocation of other services as determined by the design and configuration of the service implementation.

Cloud Service Component

The cloud service component contains the implementation of the cloud services provided by a CSP. It contains and controls the software components that implement the services (but not the underlying hypervisors, host OSs, device drivers, etc.).

Cloud services are grouped into service categories, where each service category is characterized by qualities that are common among the services within the category. The NIST Cloud Computing Reference Model defines the following cloud service categories:

• Infrastructure as a service (IaaS)

• Platform as a service (PaaS)

• Software as a service (SaaS)

Resource Abstraction and Control Component

The Resource Abstraction and Control component is used by CSPs to provide access to the physical computing resources through software abstraction. Resource abstraction needs to assure efficient, secure, and reliable usage of the underlying physical resources. The control feature of the component enables the management of the resource abstraction features.

The Resource Abstraction and Control component enables a CSP to offer qualities such as rapid elasticity, resource pooling, on-demand self-service, and scale-out. The Resource Abstraction and Control component can include software elements such as hypervisors, virtual machines, virtual data storage, and time-sharing.

The Resource Abstraction and Control component enables control functionality. For example, there may be a centralized algorithm to control, correlate, and connect various processing, storage, and networking units in the physical resources so that together they deliver an environment where IaaS, PaaS, or SaaS cloud service categories can be offered. The controller might decide which CPUs/racks contain which virtual machines executing which parts of a given cloud workload, how such processing units are connected to each other, and when to dynamically and transparently reassign parts of the workload to new units as conditions change.


Security and Privacy and Management Functions

In almost all cases, the Cloud Provider will provide elements of the Security, Privacy, and Management functions. Typically, the provider will support high-level security/privacy functions that control access to the Big Data applications and frameworks, while the frameworks themselves must control access to their underlying data and application services. Many times, the Big Data specific functions for security and privacy will depend on and must interface with functions provided by the CSP. Similarly, management functions are often split between the Big Data implementation and the Cloud Provider implementations. Here the cloud provider would handle the deployment and provisioning of Big Data architecture elements within its IaaS infrastructure. The cloud provider may provide high-level monitoring functions to allow the Big Data implementation to track performance and resource usage of its components. In many cases, the Resource Management element of the Big Data Framework will need to interface to the CSP’s management framework to request additional resources.

PHYSICAL RESOURCE DEPLOYMENTS

As stated above, deployment on physical resources is frequently used when performance characteristics are paramount. The nature of the underlying physical resource implementations to support Big Data requirements has evolved significantly over the years. Specialized, high-performance supercomputers with custom approaches for sharing resources (e.g., memory, CPU, storage) between nodes have given way to shared-nothing computing clusters built from commodity servers. The custom supercomputing architectures almost always required custom development and components to take advantage of the shared resources. The commodity server approach both reduced the hardware investment and allowed the Big Data frameworks to provide higher-level abstractions for the sharing and management of resources in the cluster. Recent trends involve density-, power-, and cooling-optimized server form factors that seek to maximize the available computing resources while minimizing size, power, and/or cooling requirements. This approach retains the abstraction and portability advantages of the shared-nothing approaches while providing improved efficiency.


Appendix B: Terms and Definitions

NBDRA COMPONENTS

• Big Data Engineering: Advanced techniques that harness independent resources for building scalable data systems when the characteristics of the datasets require new architectures for efficient storage, manipulation, and analysis.

• Data Provider: Organization or entity that introduces information feeds into the Big Data system for discovery, access, and transformation by the Big Data system.

• Big Data Application Provider: Organization or entity that executes a generic vertical system data life cycle, including: (a) data collection from various sources, (b) multiple data transformations being implemented using both traditional and new technologies, (c) diverse data usage, and (d) data archiving.

• Big Data Framework Provider: Organization or entity that provides a computing fabric (such as system hardware, network, storage, virtualization, and computing platform) to execute certain Big Data applications, while maintaining security and privacy requirements.

• Data Consumer: End users or other systems that use the results of data applications.

• System Orchestrator: Organization or entity that defines and integrates the required data transformation components into an operational vertical system.

OPERATIONAL CHARACTERISTICS

• Interoperability: The capability to communicate, to execute programs, or to transfer data among various functional units under specified conditions.

• Portability: The ability to transfer data from one system to another without being required to recreate or reenter data descriptions or to modify significantly the application being transported.

• Privacy: The assured, proper, and consistent collection, processing, communication, use, and disposition of data associated with personal information and PII throughout its life cycle.

• Security: Protecting data, information, and systems from unauthorized access, use, disclosure, disruption, modification, or destruction in order to provide:

  o Integrity: guarding against improper data modification or destruction, and includes ensuring data nonrepudiation and authenticity;

  o Confidentiality: preserving authorized restrictions on access and disclosure, including means for protecting personal privacy and proprietary data; and

  o Availability: ensuring timely and reliable access to and use of data.

• Elasticity: The ability to dynamically scale up and down as a real-time response to the workload demand. Elasticity will depend on the Big Data system, but adding or removing software threads and virtual or physical servers are two widely used scaling techniques. Many types of workload demands drive elastic responses, including web-based users, software agents, and periodic batch jobs.

• Persistence: The placement/storage of data in a medium designed to allow its future access.
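
The Elasticity definition above names adding or removing threads and servers as common scaling techniques. As a minimal illustrative sketch only (the function names, the backlog-based sizing rule, and the default bounds are all assumptions made for this example and are not part of the NBDRA), the scaling decision could be expressed as follows:

```python
import math


def desired_workers(pending_tasks: int, tasks_per_worker: int,
                    min_workers: int = 1, max_workers: int = 16) -> int:
    """Return the worker count for the current backlog, clamped to bounds.

    All parameters are hypothetical: a real system would derive them from
    its own workload metrics and capacity limits.
    """
    if tasks_per_worker <= 0:
        raise ValueError("tasks_per_worker must be positive")
    needed = math.ceil(pending_tasks / tasks_per_worker)
    return max(min_workers, min(max_workers, needed))


def scaling_action(current_workers: int, target_workers: int) -> str:
    """Describe whether to scale up, scale down, or hold steady."""
    if target_workers > current_workers:
        return f"add {target_workers - current_workers}"
    if target_workers < current_workers:
        return f"remove {current_workers - target_workers}"
    return "hold"
```

In this sketch a worker may stand for a software thread or a virtual or physical server. The clamping step reflects the assumption that a deployment has a fixed minimum and maximum capacity.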


PROVISIONING MODELS

• IaaS: “The capability provided to the consumer to provision processing, storage, networks, and other fundamental computing resources where the consumer is able to deploy and run arbitrary software, which can include OS and applications. The consumer does not manage or control the underlying cloud infrastructure but has control over OSs, storage, deployed applications, and possibly limited control of select networking components (e.g., host firewalls) [22].”

• PaaS: “The capability provided to the consumer to deploy onto the cloud infrastructure consumer-created or acquired applications created using programming languages and tools supported by the provider. The consumer does not manage or control the underlying cloud infrastructure including network, servers, operating systems, or storage, but has control over the deployed applications and possibly application-hosting environment configurations [22].”

• SaaS: “The capability provided to the consumer is to use the provider’s applications running on a cloud infrastructure. … The consumer does not manage or control the underlying cloud infrastructure including network, servers, operating systems, storage, or even individual application capabilities, with the possible exception of limited user-specific application configuration settings [22].”


Appendix C: Acronyms

ACID atomicity, consistency, isolation, durability
API application programming interface
ASCII American Standard Code for Information Interchange
BASE basically available, soft state, eventual consistency
BDLM Big Data life cycle management
BSP bulk synchronous parallel
CAP consistency, availability, and partition tolerance
CEP complex event processing
CIA confidentiality, integrity, and availability
CPR Capability Provider Requirements
CPU central processing unit
CRUD create/read/update/delete
CSP Cloud Service Provider
CSV comma-separated values
DCR Data Consumer Requirements
DDF Data Description Framework
DLM data life cycle management
DNS Domain Name Server
DSR Data Source Requirements
ELT extract, load, transform
ETL extract, transform, load
FPGA Field Programmable Gate Arrays
FTP file transfer protocol
GB gigabyte
GPU graphics processing unit
GRC governance, risk management, and compliance
GUID globally unique identifier
HPC high performance computing
HTTP HyperText Transfer Protocol
I/O input/output
IaaS Infrastructure as a Service
ID identification
ISO International Organization for Standardization
IT information technology
ITL Information Technology Laboratory
JSON JavaScript Object Notation
LMR Life Cycle Management Requirements
NARA National Archives and Records Administration
NAS network-attached storage
NASA National Aeronautics and Space Administration
NBDIF NIST Big Data Interoperability Framework
NBD-PWG NIST Big Data Public Working Group
NBDRA NIST Big Data Reference Architecture
NFS network file system
NFV network function virtualization
NGA National Geospatial-Intelligence Agency
NIST National Institute of Standards and Technology
NoSQL not only (or no) Structured Query Language
NRT near real time
NSA National Security Agency
NSF National Science Foundation
OLAP online analytical processing
OLTP online transaction processing
OR Other Requirements
OS operating system
OWL W3C Web Ontology Language
PaaS Platform as a Service
PII personally identifiable information
POSIX portable operating system interface
RAID redundant array of independent disks
RAM random-access memory
RDBMS relational database management system
RDF Resource Description Framework
RDFS RDF Schema
SaaS Software as a Service
SAN storage area network
SDDC software-defined data center
SDN software-defined network
SNMP Simple Network Management Protocol
SPR Security and Privacy Requirements
SQL Structured Query Language
TCP Transmission Control Protocol
TPR Transformation Provider Requirements
W3C World Wide Web Consortium
XML Extensible Markup Language


Appendix D: Resources and Bibliography

GENERAL RESOURCES

The following resources provide additional information related to Big Data architecture.

Big Data Public Working Group, “NIST Big Data Program,” National Institute of Standards and Technology, June 26, 2013, http://bigdatawg.nist.gov.

Doug Laney, “3D Data Management: Controlling Data Volume, Velocity, and Variety,” Gartner, February 6, 2001, http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf.

Eberhardt Rechtin, “The Art of Systems Architecting,” CRC Press, January 6, 2009.

International Organization for Standardization (ISO), “ISO/IEC/IEEE 42010 Systems and software engineering — Architecture description,” ISO, November 24, 2011, http://www.iso.org/iso/catalogue_detail.htm?csnumber=50508.

Mark Beyer and Doug Laney, “The Importance of 'Big Data': A Definition,” Gartner, June 21, 2012, http://www.gartner.com/DisplayDocument?id=2057415&ref=clientFriendlyUrl.

Martin Hilbert and Priscilla Lopez, “The World’s Technological Capacity to Store, Communicate, and Compute Information,” Science, April 1, 2011.

National Institute of Standards and Technology [NIST], “Big Data Workshop,” NIST, June 13, 2012, http://www.nist.gov/itl/ssd/is/big-data.cfm.

National Science Foundation, “Big Data R&D Initiative,” National Institute of Standards and Technology, June 2012, http://www.nist.gov/itl/ssd/is/upload/NIST-BD-Platforms-05-Big-Data-Wactlar-slides.pdf.

Office of the Assistant Secretary of Defense, “Reference Architecture Description,” U.S. Department of Defense, June 2010, http://dodcio.defense.gov/Portals/0/Documents/DIEA/Ref_Archi_Description_Final_v1_18Jun10.pdf.

Office of the White House Press Secretary, “Obama Administration Unveils “Big Data” Initiative,” White House Press Release, March 29, 2012, http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_press_release_final_2.pdf.

White House, “Big Data Across the Federal Government,” Executive Office of the President, March 29, 2012, http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_fact_sheet_final_1.pdf.


BIBLIOGRAPHY

[1] W. L. Chang (Co-Chair), N. Grady (Subgroup Co-chair), and NIST Big Data Public Working Group, “NIST Big Data Interoperability Framework: Volume 1, Big Data Definitions (NIST SP 1500-1 VERSION 3),” Gaithersburg, MD, Sep. 2019 [Online]. Available: https://doi.org/10.6028/NIST.SP.1500-1r2

[2] W. L. Chang (Co-Chair), N. Grady (Subgroup Co-chair), and NIST Big Data Public Working Group, “NIST Big Data Interoperability Framework: Volume 2, Big Data Taxonomies (NIST SP 1500-2 VERSION 3),” Gaithersburg, MD, Sep. 2019 [Online]. Available: https://doi.org/10.6028/NIST.SP.1500-2r2

[3] W. L. Chang (Co-Chair), G. Fox (Subgroup Co-chair), and NIST Big Data Public Working Group, “NIST Big Data Interoperability Framework: Volume 3, Big Data Use Cases and General Requirements (NIST SP 1500-3 VERSION 3),” Gaithersburg, MD, Sep. 2019 [Online]. Available: https://doi.org/10.6028/NIST.SP.1500-3r2

[4] W. L. Chang (Co-Chair), A. Roy (Subgroup Co-chair), M. Underwood (Subgroup Co-chair), and NIST Big Data Public Working Group, “NIST Big Data Interoperability Framework: Volume 4, Big Data Security and Privacy (NIST SP 1500-4 VERSION 3),” Gaithersburg, MD, Sep. 2019 [Online]. Available: https://doi.org/10.6028/NIST.SP.1500-4r2

[5] W. L. Chang (Co-Chair), S. Mishra (Editor), and NIST Big Data Public Working Group, “NIST Big Data Interoperability Framework: Volume 5, Big Data Architectures White Paper Survey (NIST SP 1500-5 VERSION 1),” Sep. 2015.

[6] W. L. Chang (Co-Chair), R. Reinsch (Subgroup Co-chair), D. Boyd (Version 1 Subgroup Co-chair), C. Buffington (Version 1 Subgroup Co-chair), and NIST Big Data Public Working Group, “NIST Big Data Interoperability Framework: Volume 7, Big Data Standards Roadmap (NIST SP 1500-7 VERSION 3),” Gaithersburg, MD, Sep. 2019 [Online]. Available: https://doi.org/10.6028/NIST.SP.1500-7r2

[7] W. L. Chang (Co-Chair), G. von Laszewski (Editor), and NIST Big Data Public Working Group, “NIST Big Data Interoperability Framework: Volume 8, Big Data Reference Architecture Interfaces (NIST SP 1500-9 VERSION 2),” Gaithersburg, MD, Sep. 2019 [Online]. Available: https://doi.org/10.6028/NIST.SP.1500-9r1

[8] W. L. Chang (Co-Chair), R. Reinsch (Subgroup Co-chair), C. Austin (Editor), and NIST Big Data Public Working Group, “NIST Big Data Interoperability Framework: Volume 9, Adoption and Modernization (NIST SP 1500-10 VERSION 2),” Gaithersburg, MD, Sep. 2019 [Online]. Available: https://doi.org/10.6028/NIST.SP.1500-10r1

[9] The White House Office of Science and Technology Policy, “Big Data is a Big Deal,” OSTP Blog, 2012. [Online]. Available: http://www.whitehouse.gov/blog/2012/03/29/big-data-big-deal. [Accessed: 21-Feb-2014]

[10] Office of the Assistant Secretary of Defense, Networks and Information Integration (OASD/NII), “Reference Architecture Description,” 2010 [Online]. Available: http://dodcio.defense.gov/Portals/0/Documents/DIEA/Ref_Archi_Description_Final_v1_18Jun10.pdf

[11] A. D. N. Sarma, “Architectural Framework for Operational Business Intelligence System,” Int. J. Innov. Manag. Technol., vol. 5, no. 4, p. 7, 2014 [Online]. Available: http://www.ijimt.org/papers/529-E318.pdf

[12] S. C. L. Koh and K. H. Tan, “Operational intelligence discovery and knowledge-mapping approach in a supply network with uncertainty,” J. Manuf. Technol. Manag., vol. 17, no. 6, pp. 687–699, 2006.

[13] M. Andreolini, M. Colajanni, M. Pietri, and S. Tosi, “Adaptive, scalable and reliable monitoring of big data on clouds,” J. Parallel Distrib. Comput., vol. 79–80, pp. 67–79, 2015.

[14] V. Lemieux, B. Endicott-Popovsky, K. Eckler, T. Dang, and A. Jansen, “Visualizing an information assurance risk taxonomy,” in VAST 2011 - IEEE Conference on Visual Analytics Science and Technology 2011, Proceedings, 2011, pp. 287–288.

[15] L. Duboc, E. Letier, D. S. Rosenblum, and T. Wicks, “A case study in eliciting scalability requirements,” in Proceedings of the 16th IEEE International Requirements Engineering Conference, RE’08, 2008, pp. 247–252.

[16] P. Colella, “Defining software requirements for scientific computing (Slide in ‘Can Computer Architecture Affect Scientific Productivity?’),” in Salishan Conference on High-speed Computing, 2005, 2004 [Online]. Available: http://www.lanl.gov/orgs/hpc/salishan/salishan2005/davidpatterson.pdf

[17] L. G. Valiant, “A bridging model for parallel computation,” Commun. ACM, vol. 33, no. 8, pp. 103–111, 1990 [Online]. Available: http://portal.acm.org/citation.cfm?doid=79173.79181

[18] F. Chang et al., “Bigtable: A distributed storage system for structured data,” 7th Symp. Oper. Syst. Des. Implement. (OSDI ’06), Nov. 6-8, Seattle, WA, USA, pp. 205–218, 2006 [Online]. Available: http://research.google.com/archive/bigtable-osdi06.pdf

[19] B. Smith, T. Malyuta, W. S. Mandirck, C. Fu, K. Parent, and M. Patel, “Horizontal Integration of Warfighter Intelligence Data,” in Semantic Technology in Intelligence, Defense and Security (STIDS), 2012, p. 8 [Online]. Available: http://ontology.buffalo.edu/smith/articles/Horizontal-integration.pdf

[20] S. Yoakum-Stover and T. Malyuta, “Unified data integration for situation management,” in Proceedings - IEEE Military Communications Conference MILCOM, 2008.

[21] F. Liu et al., “NIST Cloud Computing Reference Architecture, SP 500-292,” Spec. Publ. 500-292, p. 35, 2011 [Online]. Available: http://ws680.nist.gov/publication/get_pdf.cfm?pub_id=909505

[22] P. Mell and T. Grance, “NIST SP 800-145: The NIST Definition of Cloud Computing,” 2011 [Online]. Available: http://www.mendeley.com/research/the-nist-definition-about-cloud-computing/
