Big-data in cloud computing: a taxonomy of risks
Holmes E. Miller
Department of Accounting, Business and Economics, Muhlenberg College, Allentown, Pennsylvania USA 18104
Cloud computing and big-data are complementary approaches to storing, processing, analysing, accessing and reporting information. Cloud computing provides services (e.g., infrastructure, platforms, applications and data) that run on servers accessible through the Internet rather than residing on the desktop. Cloud computing allows data to be uploaded and accessed from the desktop, from mobile devices such as tablet computers and cell phones and from appliances such as vending machines and automobiles. Although perhaps conceptually nothing new vis-á-vis timesharing (Ellison 2008), cloud computing attributes including scale, accessibility, cost and timeliness have created new capabilities that previously did not exist. These capabilities include company-centric uses of information where customer data can be aggregated to support company needs (Prabhaker 2000) and in the near future, customer-centric uses where smart products and data aggregation may be used to empower individuals to tailor purchases to their specific needs without resorting to intermediaries (Searls 2012). Whereas prior marketing approaches involved pushing advertisements onto customers, big-data and cloud computing approaches facilitate customers pulling information from vendors to better satisfy the customer's individually defined and articulated needs.
The term big-data refers to analytical technologies that have existed for years but can now be applied faster, on a greater scale and are accessible to more users. Kusnetsky (2010, para. 3) defines big-data as, 'the tools, processes and procedures allowing an organization to create, manipulate and manage very large data sets and storage facilities.' While big-data operations can be processed locally, as organizations migrate to the cloud, so will their corporate data. Moreover cloud-based architectures will become more important as individual entities (i.e., both appliances and people) generate continuous data streams that can be collected, stored, processed, analysed and reported.
In this paper we will present and discuss risks and opportunities of big-data applications in cloud computing environments (big-data in cloud computing) and present some ideas for managing these risks.
Big-data and cloud computing overview
To inform the discussion below, Table 1 presents a grid (using a health care example) that indicates how data can be classified, aggregated and used:
|Uses of data
|Types of data
|PatientA's blood pressure at 9 a.m. on Wednesday
|Time series of Patient A's blood pressure
|Prescribe medication and dosage level to treat hypertension
|Histogram of current blood pressure readings for 45-54 year old females
|Correlation between 45-54 year old female blood pressure readings and daily calories consumed
|Change budget for proper eating habits information campaign
|Plot of average blood pressure readings by age group for males and females in the country
|Regression analysis of group health status vs. caloric consumption with sex and age as 'dummy' variables
|Run simulation model to predict change in group health status using various budgets as health intervention scenarios
The three columns present three processes associated with information: describing the information; analysing the information to draw meaning from it; and using the analysis to change the underlying environment. For example, when a patient has back pain descriptive attributes are collected. Some of these attributes might include the patient's identity, age, weight and blood pressure; a description of the back pain; the patient's description of events that might have caused the back pain; and medications currently taken.
Analysis questions use this descriptive and other historical information, coupled with the physician's knowledge of back pain and the patient, to arrive at a diagnosis and a proscribed treatment. Further methods of analysis might include using correlation and extrapolation methods to explore the relationship between the medicine and exercise prescribed and changes in health status. Furthermore, these methods can be used to support cost benefit analysis or to forecast improvements in the state of a patient's health using different combinations of medicine and exercise regimes.
Data generated from actions might include a record of minutes spent each day performing exercises to combat back pain, a record of taking the prescribed medications and perhaps in conjunction with future descriptive data, changes in health status. The above elements concern one patient. Similar examples would apply to hierarchical groups of patients that then can form aggregates of individuals and clusters of those aggregates, as defined by a subset of the attributes.
The data environment for this example existed well before the era of cloud computing and big-data and arguably, even before the era of electronic computing. Patient X's data was stored in the physician's office, most likely in paper-based files. Although access to the data was immediate, the data was meaningful only after the files were properly updated. Not immediately available, or most likely ever available, was information about patients in other physician's offices at other geographic locations that could be analysed in concert with patient X. These analytical operations were limited by the available computing capabilities of the hardware and software and how data was processed to meet volume, timeliness, cost and data requirements. The inability to access, analyse, report and act on this information constituted lost opportunity in terms of data management and decision-making.
Cloud computing environments allow users to employ many disparate data sets to ask and answer questions regarding data generated in many different geographic locations and temporal states. This allows users to ask and answer many questions that can lead to insights heretofore unknown. Before cloud computing only questions about patient X's back pain might be addressed. Cloud computing technologies and big-data can use information about back pain for all patients throughout a country or even between countries to examine the effectiveness of policies and treatment regimes across wider patient cohorts. Data storage in the cloud not only creates the ability to answer the same questions faster and with more veracity, but also opens up the possibility to review and analyse information that was previously inaccessible. The results from the data analysis are constrained only by the ingenuity in applying analytical methods and cloud computing technologies.
Discussion of big-data in cloud computing environments
In discussing big-data pathologies Jacobs (2009) states that 'it is easier to get data in than out' (para. 19) and that the 'pathologies of big-data are primarily those of analysis'. (para. 10) In the subsequent discussion he indicated that what makes big-data big is the repeated observation over time and/or space, an attribute that favours cloud computing. Data may be generated in a network of millions or even billions of nodes. Data stored in the cloud can be analysed and reported in the aggregate, or can be distributed back to the individual nodes themselves in a granular format. Regarding the latter option, McKendrick (2010) indicates that the centre of gravity in data management will continue to shift to end users, facilitated by downloadable apps, more intuitive search techniques and video game-like interfaces. This creates a scenario of end users both generating data to the cloud and processing data retrieved from the cloud. This creates many analytical possibilities and as we will see, risks.
Some factors facilitating the uses of big-data in cloud environments include:
- No upper limit on the amount of information stored. Big-data storage amounts now are in the petabytes (1015 bytes) rather than terabytes or gigabytes. Indeed, given historical growth, one can surmise in the future terms such as exabytes (1018 bytes), zettabytes (1021 bytes) and yottabytes (1024 bytes) eventually will enter the conversation. Rather than being restricted by the size of a single storage media, cloud users have the ability to increase scale by adding and connecting servers and other storage media through networking and telecommunications technologies.
- Ability to access, integrate and analyse data from multiple sources and databases whose owners are geographically and even temporally disbursed. Scalability potential creates the capability to collect and analyse more granulated data. Examples of this include: individual Google searches; scientific data generated in experiments in particle physics and genomics; changes in location using connected smart phones and other GPS enabled devices; ongoing status of equipment and other devices such as cars, home appliances and medical equipment; and aggregating in real time opinions such as those enunciated using social media like Twitter and Facebook. This capability creates new worlds of potential applications and also creates new risks.
- The elimination of intermediaries, where owners and users of the data can be individuals, not only large organizations. Until recently, intermediaries in data use were required on two fronts. First, large data centres stored data and staff served as gatekeepers regarding how data was collected, analysed and disseminated. This led to an organization-centric view of information. Second, models and software used to analyse and report data often required a level of sophistication and cost that inhibited democratization of information. big-data in cloud computing environments, while not intermediary free, provide more freedom for an individual user to generate, store, access, analyse and use information.
- Immediate access to personalized results using distributed input and output mechanisms. Eliminating intermediaries through standard evolutionary processes leads to more creative applications and new insights. Because of the granular capabilities inherent in big-data, many of these applications and insights are personalized, which allows individuals to receive specific information tailored to specific needs. Tailored information in turn, leads to users asking more questions and demanding even faster and more granular answers. And so the cycle continues.
- The capability to change the paradigm from 'model and analysis' to 'correlation and extrapolation'. Chris Anderson, editor of Wired magazine, hypothesizes the end of hypotheses. He argues that (Anderson 2008; para 10) 'faced with massive data, this approach to science (hypothesize, model, test) is becoming obsolete'. Instead, Anderson posits that correlation is enough. 'We can stop looking for models. We can analyse the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot'. (para. 13)
While Anderson's views are not without detractors (Conway 2008, Morton 2008) large datasets when coupled with more powerful and accessible analytical methods can lead to democratization of information and also to new ways of creating and discovering information. One key new capability is that using correlation methods on large datasets, while recognizing that correlation does not imply causality, creates the capability to discover new, previously hidden insights.
Opportunities for using big-data in cloud computing
We will discuss opportunities for big-data in cloud computing environments in three categories that correspond to types of data given in Table 1, data elements; data aggregates; and clusters of data aggregates.
Being able to capture and link existing data elements on larger, timelier and more accessible scales creates new opportunities. On example of this is data captured by remote sensors. Bollier (2010: 3) points out regarding remote sensors, that 'remote sensors are generating new streams of digital data from telescopes, video cameras, traffic monitors, magnetic resonance imaging machines and biological and chemical sensors monitoring the environment'. Personal data streams also may be generated by mobile phones, laptops, Websites and mobile phone enabled GPS devices. A good example of aggregating data is discussed by Bryant et al. (2008) with regard to how Wal-Mart contracted with Hewlett Packard to store four petabytes of data in a data warehouse that encompass 267,000,000 daily transactions made at 6,000 stores, worldwide. Machine learning was then applied to this data to help Wal-Mart exercise various pricing and advertising strategies to better manage its inventories and supply chain. This example illustrates how one may not only describe the underlying entities of data sets but may also apply various analytical tools to analyse, correlate and extrapolate information to instigate future actions.
Big-data in cloud environments can help organizations describe, analyse and act on data aggregates. Whereas an individual product (e.g., shampoo) might be analysed by evaluating the effectiveness of an advertising campaign, so might aggregates of all beauty products be analysed regarding how pricing differences and store placement affects sales. The promise of big-data and cloud computing is to create new aggregates that provide new information that may yield newer and richer insights.
Bollier (2010) reports several interesting examples of how aggregation, combined with analysis and visualization techniques, yields surprising new insights. Google Earth maps sampled 8,510 cows in 308 herds around the world and discovered two thirds of them aligned their bodies with magnetic north in the Earth's magnetic field. A second big-data Google example is the Google Flu Trends service, where by tracking flu-related search terms, the service identified possible flu outbreaks in Atlanta a week or two before actual flu cases are reported by the Center for Disease Control.
These examples involve data elements that existed but either were unreported or if reported, neither aggregated nor analysed. As Bollier points out, one of the benefits of big-data in cloud environments is right now data. Hal Varian of Google has used the term now casting, using right now data coupled with analytical methods to create various innovative uses (Bollier 2010: 20). Google has many real time information streams but real time information also is available through monitoring of appliances, cell phones, cameras and any other device that can report its attributes on an ongoing basis.
Clusters of aggregated information can be used to analyse and predict conditions for the clusters or their components. Several examples of how big-data in cloud environments might be used for clusters appear with Chris Anderson's Wired article (Anderson 2008) where clusters of worldwide agricultural information are analysed and used to forecast cluster conditions once this is disaggregated back to the individual field level.
Predicting airfare prices is a second example. Anderson (2008) discusses how Farecast (now incorporated in Bing) uses a prediction model to track information on 175 billion fares originating at seventy-nine US airports. The database is used to forecast when airline prices are going to change and can be used to develop ticket-buying strategies such as the day of the week to purchase, the length of stay, etc.
Risks of using big-data in cloud computing
As we come to depend more on data, lack of access to that data becomes more costly; as we offload more decisions to analytical methods, improper use of these methods or inherent weaknesses in the methods may yield decisions that are wrong and costly. As more data can be integrated to help society, data can also be integrated in more harmful ways. Three obvious domains for big-data in cloud computing are data sources, the cloud and big-data. These will be used to frame the discussion of big-data in cloud computing risks.
Data source risks include data quality risks and risks relating to how the data is defined, stored and generated prior to transmission to the cloud. These will be discussed in terms of data quality, data security, data privacy and data integration.
Data quality may be defined in many dimensions and is tied to often evolving user requirements. Acceptable quality for one user might be unacceptable for another; acceptable quality in 2010 might be unacceptable in 2013. A corollary to poor data quality is the risk that arises from using these same poor data to draw conclusions and make decisions. Information used for stated purposes might also create unintended consequences that result in real losses.
Dimensions of data quality include: (Miller 1996)
- Relevance: Do the data address their user's needs?
- Accuracy: Do the data reflect the underlying reality? Is the level of precision consistent with the user's demands?
- Timeliness: Are the data current, relative to user demands?
- Completeness: Does the level of completeness correspond with user demands? Data can be incomplete or even too complete!
- Coherence: How well do the data hang together? Do irrelevant details, confusing measures or ambiguous format make them incoherent?
- Format: How are the data presented to the user? Is the context appropriate?
- Accessibility: Can the data be obtained when needed?
- Compatibility: Are the data compatible in format and definition with other data with which they are being used?
- Security: Are the data physically and logically secure?
- Validity: Do the data satisfy appropriate standards related to other dimensions such as accuracy, timeliness, completeness and security?
Errors along one or more quality dimensions create problems for downstream users leading to risks similar to those encountered in current information technology environments. In big-data in cloud computing environments several new data quality related risks appear. Data that are transferred to the cloud to be analysed later may violate accuracy, timeliness, completeness, format and accessibility quality dimensions. Some examples include real time status information about assets such as appliances and vehicles (Bughin (2010) refers to this as the 'internet of things'). Poor data quality casts a shadow on conclusions drawn from analysis, precludes analysis in the first place and creates an environment with unintended consequences for users.
The Economist (The big-data dump 2008) discusses such an environment concerning the U. S. legal system regarding the Completeness quality dimension. In a lawsuit concerning Horizon Blue Cross a client sued the insurance company over a claim for their anorexic daughter. The company asked to see all online postings, messages, etc. from the daughter's Facebook and MySpace pages. Beyond privacy issues (the daughter's lawyer objected on those grounds and lost), this example of e-discovery illustrates a new normal, i.e., the ability to review the equivalent of millions of paper pages. This increases the costs of the legal system and places certain populations (the less-wealthy) at a disadvantage. In the Economist article cited above, Justice Stephen Breyer of the U.S. Supreme Court is quoted as saying, 'you're going to drive out of the litigation system a lot of people who ought to be there,' indicating that 'justice is determined by wealth, not be the merits of the case'. (para. 9)
Data security risks become magnified when data sources become ubiquitous. When data origination is distributed across personal devices such mobile phones, tablets and appliances, data can be modified or stolen by direct access or apps containing computer viruses. Mobile device security often is less robust than in more controlled environments, which themselves have weaknesses.
Even in cloud computing environments, some data (including passwords) still reside on remote devices before being uploaded. Systems relying on this information can be compromised by denial of service attacks on the devices themselves. Such acts could become particularly costly in health care or transportation, where actions could physically affect users and cause financial losses or even deaths. Cell (mobile) phone companies are aware of the threat and are taking action to protect information currently stored on phones. Ante (2010) discusses an application for Android-based devices that facilitates physical security and blocks malicious apps. One area of concern is mobile banking operations where phones are used to access personal information located in the cloud for transactions. The immediate risk of intercepting data, e.g., unencrypted or poorly encrypted passwords or content, is obvious, as are risks from subsequent analysis of changed data resulting in compromised decisions.
Data privacy issues involve interlopers intercepting and disclosing data either stored on mobile devices or appliances, or while being uploaded to or from the cloud. Although privacy breaches are more severe when an entire cloud-based database is compromised, aggregates of individual and seemingly insignificant data elements also may lead to significant privacy disclosures. Weitzner et al. (2006) recognized this problem and proposed a technology infrastructure to increase accountability and transparency for information use on the Web. As an example of how data elements might be integrated, consider mobile phones equipped with GPS location tracking devices. Currently user location information from mobile phones can be used to facilitate location-based marketing efforts (Hopkins and Turner 2012). This same information also could be used to facilitate criminal activity. Using GPS-enabled location-based technologies, intercepted transmissions or a rogue app (e.g., an application that is placed on a device that purports to do one thing while surreptitiously doing another) could collect and transmit information to a third party which, beyond violating privacy standards, could alert criminals that individuals are away from home (Europol 2011). One can conjure many scenarios where location information can be used for nefarious ends, for example, disclosing the location of investment banker working on a merger to support insider trading; aiding a criminal bent on kidnapping a business leader; or a enabling a terrorist plotting an assassination (see Brandon 2012 for a Website that can pinpoint a mobile phone user's location by entering a mobile phone number). Sensing devices are another example of how mobile phones and appliances can generate data to be used by big-data in cloud computing applications. Shilton (2009) discusses the potential for abuse of privacy in this setting. One suggestion is that software developers implement data-protection standards for collecting personal information from sensing technology. Such standards may be necessary because it will be harder for users to voluntarily adhere to good data security practices as computing power and data generation becomes distributed among a mass audience.
Data integration is the final risk involving data sources. Compatibility is a dimension of the quality of individual data elements and incompatibility among the data elements may preclude otherwise promising analysis opportunities, or lead to spurious results. Cafarella et al. (2011) discusses integrating data across multiple databases on the Web and the ensuing challenges. When data that does not conform to any standardized data design this poses significant problems for traditional data management techniques. As with the above discussion regarding privacy, data elements generated from user-controlled devices adds more complexity to this problem. Table 2 summarizes data source risks.
|Fail to meet user requirements along specific data quality dimensions
Cast shadow re: conclusions from subsequent data analysis
Foster unintended consequences
Data overload responding to information requests
|Allow direct access to devices containing data
Facilitate access via apps or downloaded viruses
Denial of service on devices (e.g., health care applications)
Accessing online payment systems
|Releasing copied or stolen data on mobile devices or appliances
Creating data aggregates that compromise privacy
Using confidential information for financial gain
Eavesdropping on sensing or mobile devices
|Lost opportunities caused by incompatible data elements
Spurious results caused by incompatible data elements
Data management problems
Many data quality risks in the cloud are similar to those discussed above. The difference is now the data reside in the cloud rather than on user equipment. Changes in data may occur through transmission from user to cloud, from cloud to user, or while in storage. Examples might affect quality dimensions such as accuracy, completeness, format, validity, timeliness and accessibility. Intercepted data may be changed, erased, or copied. Modifying data in the cloud database could facilitate monetary crimes (e.g., using a mobile payment system); erased data could impact infrastructure control systems; and data that have been disclosed could compromise user privacy. Modified data would be particularly important for big-data applications where large amounts of data from appliances are transferred up to the cloud, analysed and decisions for action are then sent back to the user appliances, such as traffic controls, or individuals administering medications.
Data obsolescence risk occurs when data in the cloud become stale. Applications such as mobile phone location systems that seek to pitch advertising messages to groups of customers based on their location and past purchase histories would be affected. Data redefinition occurs when data are redefined in a way that unintentionally affects data quality. Mullich (2011) quotes a Verizon executive as saying that enterprise level platforms that manage and synchronize data transmitted from the cloud to mobile devices will enable workers to access data when and where they want. This creates challenges, especially when data once synchroniaed becomes unsynchronized and creates a false picture of reality that distorts future actions.
Data security risks in the cloud appear in many media stories. In some cases the cloud arguably increases security. Cloud computing has enabled more robust solutions to the challenges of disaster recovery planning, with cost and accessibility advantages (Linthicum 2010). Backing up cloud data is more economical than replicating individual systems; when data from many firms are aggregated cloud architectures clearly provide security advantages. Centralization of information, however, creates two risks. First, either through a failure of the cloud servers or through the telecommunications systems linking cloud to customer, data may become unavailable bringing down a user organization. An example of this occurred when services provided by Amazon, a leading cloud vendor, were interrupted for two days affecting thousands of corporate customers (Lohr 2011). Events like this should motivate companies to rethink what applications are put on the cloud, especially those that are mission-critical. As big-data in cloud computing applications become more prevalent and as companies have few alternatives other than migrating to the cloud, putting mission-critical applications on the cloud may no longer be a choice but a necessity, thus resulting in increased risk exposure.
A second risk stems from the ability to access cloud data, either by hacking into databases, intercepting telecommunications messages (as discussed above), or by sloppy vendor controls enabling one company to access a competitor's data in a cloud environment. Dodge (2011) maintains that the hacking in the cloud risk is overblown. He is more concerned with denial of service risks, either intentional or unintentional. These risks are increasingly important as mission critical applications migrate to the cloud. Clayton (2011) discusses solutions that involve mixed applications, e.g., storing some locally and others on the cloud. Quoting one executive, Clayton says, 'Where the cloud becomes really interesting is with new things where you have the option of doing it on your own server in-house or going with the cloud. Then you're much more likely to move those to the cloud than traditional IT'. (para. 13) New things describe evolving big-data applications and the challenges to control risk. Another risk is failure to properly coordinate cloud and locally stored applications.
Data privacy issues in the cloud result from information that is transmitted to the cloud from mobile devices or from other input channels, being hacked, stolen, or intercepted after being uploaded to the cloud. One concern for big-data applications is privacy breaches for medical information. Big-data applications can take many disparate data elements stored in the cloud and analyse them in ways that could present previously hidden individual profiles. These could affect employment, insurance and psychological wellbeing. Osterhaus (2010) discusses these risks and concludes that the need for privacy trumps the need for access and that cloud vendors will need to ensure that the U.S. Health Insurance Portability and Accountability Act regulations are met. This conclusion could apply to all cloud privacy risks; there is no reason why privacy in the cloud need be less secure than privacy in traditional environments. Indeed, the cloud environment as compared to storage in paper files in physicians' offices arguably is inherently more secure. The new risk to be controlled is the big-data related risk stemming from integration and processing of large amounts of medical data and personal data. These are challenges, not barriers, to cloud security.
Data integration in cloud environments includes customer information regarding health profiles, financial profiles, location information, information from social networking databases and integrating data from infrastructure appliances. With regard to big-data and integration leading to personalization Bollier (2010: 23), quoting Kim Taipale, points out that, 'the benefits of personalization tend to accrue to businesses but the harms are inflicted on disbursed and unorganized individuals'. A second risk is integrating data in ways that seek to achieve an objective but because of poor data quality or incompatible databases, miss opportunities. Examples of this occur when the data elements one is seeking to integrate exist in different formats, adhere to different standards, or have different timeliness or accuracy attributes.
Vendor risk is the final cloud risk. Since vendors control the cloud environment, all of the risks discussed above relate to vendor operations. Two other specific vendor risks involve how the vendor conducts business and if the vendor is able to conduct business. How the vendor conducts business includes the vendor's policies for data security, disaster recovery planning, transaction authentication and many other details. Often vendor policies exceed those and are more secure than policies in non-cloud environments. There are cases, however, where vendor policies and procedures are less robust than a client's. Currently, if the client is aware of vendor deficiencies, the client can process the application locally. This may be impossible for big-data in cloud computing applications. If a client uses a private cloud for critical applications and a public cloud for others, another risk surfaces: blending public and private clouds. One solution is developing security infrastructures spanning both environments (Scheier 2009). Clayton (2011) discusses these along with the importance of vendor's understanding corporate and industry policies and regulatory requirements. The cloud environment is inherently harder to customize to each organization and regulatory policy violations may be more likely, especially when processing big-data.
A final and more immediate vendor risk is the possibility that the vendor goes out of business. While this is less likely for vendors such as Amazon, Apple and Google, smaller cloud vendors are now in the marketplace and may be more likely to exit the market. Organizations are vulnerable when they entrust their business operations to such a vendor. Managing this risk is not unlike vendor risk management procedures for entrusting mission-critical software development and operations to smaller vendors, or outsourcing mission critical operations. If customers view this form of vendor risk as unacceptable, larger vendors may crowd smaller cloud vendors out of the marketplace.
Table 3 summarizes cloud computing risks.
|Transmission and storage breaches reducing data quality
Modifying data used for actions (financial, traffic signals, medication doses)
Lost marketing opportunities
Challenges in synchronizing data
|More robust disaster recovery plans reduce risk
Inaccessibility to the cloud
Increased access risk for mission critical systems
Hackers gaining cloud access
Denial or service attacks
Failure to coordinate cloud and local applications
|Releasing copied/stolen data (financial, medical)
Violating government or industry privacy regulations
Increased legal exposure
|Lost opportunities caused by incompatible databases in cloud
Spurious results caused by incompatible databases in cloud
Data management problems
|Poor vendor management of information-based risk
Lax vendor policies and procedures re data
Problems blending vendor and client applications
Vendors who are unaware of regulatory requirements or industry policies
Vendors going out of business
How information is collected, processed, reported and acted on in a way that it adds value to its customer and does not adversely affect other interested parties is a big-data in cloud computing risk (for more on who owns customer information, see Hagel et al. 1994; Harison 2010 and Prabhaker 2000). For well-defined problems risks of big-data involve dimensions that can be categorized under data quality, security, privacy and integration, as discussed above. For example, a power company using big-data to collect data from sensors on electrical devices to turn some of the devices off to optimize its electrical grid, is using big-data to do something new but not something different in kind from using mathematical methods to optimize operational networks, something done for decades. In these situations the unique big-data risk is being able to work faster and work on a larger scale. Using a sports analogy, the risk is similar to that faced by an amateur who, when promoted to play against professionals, realizes that the game now is faster, more complex and more challenging.
When analytical methods are used in new ways the challenges and risks increase. One example is using big-data to make inferences in situations where in the past, making data-based inferences was impractical (such as using correlations to extrapolate). While correlation analysis is a standard analytical technique, using it for new applications on a large scale may become problematic. Bollier (2010) quotes a classic Wall Street Journal article, 'My TIVO thinks I'm gay' where, based on the author's viewing habits, the system kept recommending gay-themed films. To remedy this, the customer began viewing war movies and the system responded by recommending documentaries about the Third Reich! While these examples may be comical, one sees the risk of using big-data in cloud computing for categorizing tastes or other personal attributes. There is certainly a risk for privacy and if actions are taken, even in non-human environments, may create liabilities and risks of doing the wrong thing.
This leads to perhaps the most fundamental big-data risk, which is exacerbated in cloud environments. The risk is not 'Are the results accurate?' but is 'Has big-data addressed the right problem and if so, addressed it in the right way?' Big-data in cloud computing relies heavily on integrating and analysing massive amounts of data and human intervention often is absent. The methodology reports what the data analysis indicates rather than is a methodology based on an underlying theory that informs data collection, analysis and reporting. This risk is reflected in the Anderson (2008) article discussed above.
Big-data applications often analyse information in a near domain, i.e., one in which the data are readily available. For example, when Google uses searches for flu symptoms to predict flu outbreaks (Guth 2008) the results are timelier than health reports from the U.S. Center for Disease Control (Bollier 2010). However, using Internet searches rather than a scientific model focuses on a related attribute (victim searches for information) rather than a root cause. A risk of rampant use of big-data to develop inferential relationships in cloud environments, while appealing because the nexus of the availability of large amounts of data and processing power to generate seemingly useful results, is that users may think something is true based on past history even though future data may indicate otherwise. Over-reliance on history creates the potential for mischaracterization of data, violations of privacy, poor decisions and solving the wrong problems in a way failed generals fight the last war. Over-reliance on history also risks repeating the classic error of treating correlation as causation.
When human links, characterized by grounding analysis in theories are severed, technologies may run amok (Postman 1993). Moreover, modelling assumptions that disenfranchise certain groups, e.g., using credit-scoring models (Anonymous 2011), may be embedded in those technologies. W. Edwards Deming, a pioneer in the quality movement, held that (Deming 1994: 103), 'Without theory, experience has no meaning and without theory, one has no questions to ask. Hence, without theory, there is no learning. Theory is a window into the world. Theory leads to prediction and without theory, experience and examples teach nothing'. Rampant use of correlations and other inferences obtained only from processing massive amounts of data without human intervention and removed from any underlying theory, may lead to a data cycle of collect-analyse-act which may appear valid, but which in fact is seriously flawed.
Table 4 summarizes big-data risks:
|Big data risks
|Fail to meet user requirements due to inability to process data
Inappropriate use of analytical methods
Applications that adversely affect interested parties
Uncertainty regarding who owns customer information
Addressing the wrong problem
Focusing on the near domain and ignoring the real problem
Mischaracterizing data resulting in privacy violations
Mischaracterizing data resulting poor decisions
Disenfranchising groups by ignoring theory and over-relying on data
Big-data in cloud computing represents the confluence of two technologies that promise to create new capabilities for creating more value for customers and businesses at reduced costs. Its promise is hard to dismiss. Indeed, as big-data in cloud computing environments become commonplace, more and more success stories will lead to more and more applications. These applications will create a plethora of new products and marketing opportunities, which may fundamentally change the relationship between organizations and their customers.
Five steps, informed by Reinhold et al. (2011) may be used to manage big-data risks in cloud computing environments. They are:
- Implement a process to continually review, assess and understand big-data in cloud computing risks in an organization's local and global environments;
- View the big-data in cloud computing risk management process and technology improvements strategically rather than tactically and act accordingly;
- Ensure all underlying data is consistent, well-defined and understood by managers and other stakeholders;
- Ensure data is of high quality along all relevant quality dimensions and that data governance responsibilities are well-defined;
- Continually invest in improvements in both cloud computing infrastructure and analytical big-data capabilities and ensure these capabilities are congruent with the strategic needs of point 1.
A key tenet of risk management is that risk can never be eliminated but once understood, can be managed. What is required is ongoing understanding of the evolving customer and technical environments, understanding of the evolving risk environment and the commitment to categorize quantify and ensure that the risk exposure is consistent with the organization's underlying objectives. A final challenge for big-data is while it is possible for organizations to manage their risk, who will manage the risk exposure of the individuals generating and using the data? Incentives should be created to make investing in risk management mutually beneficial both for those generating the data, those using the data and all others profiting from it.
I wish to thank the editor and the anonymous referees for their useful suggestions on the first version of this paper. Thanks also to copy-editors of the journal.
About the author
Holmes E. Miller is Professor of Business in the Department of Accounting, Business and Economics at Muhlenberg College in Allentown, Pennsylvania (USA). He has a Ph.D. in Management Science from Northwestern University in Evanston, Illinois. Prior to coming to Muhlenberg he taught at Rensselaer Polytechnic Institute in Troy, New York and worked in the chemical and financial services industries. He can be contacted at: firstname.lastname@example.org.