Dr Dobbs | Cloud Computing on Rich Data

Advances in sensing technologies are yielding vast quantities of data that must be processed on the fly, archived for later consumption, or both.

Michael Kozuch is a Principal Engineer and Manager of the Systems Research and Engineering team for Intel Labs Pittsburgh. Jason Campbell is a Senior Researcher at Intel Labs Pittsburgh. Padmanabhan (Babu) Pillai joined Intel Labs Pittsburgh in September 2003, after finishing his doctoral degree in Computer Science and Engineering at the University of Michigan, where he worked with Professor Kang Shin at the Real-Time Computing Lab (RTCL). Madeleine Glick is a Principal Engineer at Intel Labs Pittsburgh and an Adjunct Professor in the Department of Electrical and Computer Engineering, Carnegie Mellon University.


 

In recent years, advances in semiconductor electronics have pushed the instrumentation of our world to unprecedented levels. Sensors are now all around us: many cell phones contain GPS receivers as well as cameras, doorways have motion detectors, stop lights sense vehicles at intersections, and satellites orbiting overhead are constantly imaging the Earth. Additionally, we have data sourced electronically: feeds from social networking sites, crawls of Web pages, repositories of medical images, results from computer simulations, etc. Many of the data objects from these sources are collected for analysis, archived, subjected to re-analysis, cross-correlated with other data objects, and processed to create additional, derived data sets.  

The result is that we live in a world that is data rich. In this article, we consider two types of data sources: stored and streaming. A stored data object is just that: information that has been archived in some way. A corpus of digital images stored on a collection of magnetic disks is an example of stored data. Streaming data objects have a real-time component; a live video feed is the canonical example. The two types present different processing challenges: applications operating on stored data are often throughput-sensitive, while those operating on streaming data are often latency-sensitive. Despite these subtly different performance constraints, both require significant, scalable computing resources. For example, an image search application operating on stored images may need to scale out depending on the number of images or the complexity of the search. Similarly, an application executing a face-detection algorithm on live video may need to scale out if faces are detected and more compute-intensive face recognition algorithms are invoked.

Cloud computing technologies enable many users to share modern computing clusters while providing mechanisms for scaling applications as needed. As a result, researchers in Intel Labs are investigating what challenges arise when leveraging cloud computing technologies for rich data applications operating on either stored or streaming data, and what solutions may address those challenges. This research program includes support of the Open Cirrus research testbed, development of an open source software stack for operating on stored data, development of a runtime system for operating on streaming data, and exploration of the benefits of integrating optical networks into compute clusters.

Supporting Open Research in Cloud Computing: Open Cirrus

 

When considering the topic of cloud computing on large data sets, many questions suggest themselves at an early stage. How should the data be organized and stored? What are the important system software components that enable access to the data? How should those components be organized? What are the appropriate user interfaces? How can the data be processed in the most efficient manner possible?  

Answering such a broad array of questions is difficult for a single research organization. Tackling a broad research agenda naturally requires a vibrant research community. To help provide cloud computing resources to this community, Intel, HP, and Yahoo!, in collaboration with the National Science Foundation, sponsored the Open Cirrus cloud computing testbed [1]. (Open Cirrus is a trademark of Yahoo!)  

The goals of the Open Cirrus project are to:

  • Foster systems-level research in cloud computing.
  • Encourage new cloud computing applications and applications-level research.
  • Collect and share experimental datasets.
  • Develop open-source stacks and APIs for the cloud.
 

To achieve those goals, Open Cirrus provides a world-wide, federated collection of cloud computing sites and a software architecture designed to unify those sites into a coherent platform. The sites (shown in Figure 1), each of which includes a cluster of at least 1,000 cores, are provided by the Open Cirrus collaborating institutions: HP, Intel, Yahoo!, the University of Illinois (UIUC), Karlsruhe Institute of Technology (KIT), the Infocomm Development Authority (IDA) of Singapore, the Russian Academy of Sciences (RAS), the Electronics and Telecommunications Research Institute (ETRI) of South Korea, the Malaysian Institute Of Microelectronic Systems (MIMOS), and Carnegie Mellon University (CMU).

Figure 1: Open Cirrus Consists of Ten Sites World-Wide (Source: Intel Corporation, 2010)

 

The Open Cirrus testbed is intended to support research at various levels of the cloud computing stack, from the lowest layers that interact directly with hardware to the highest application layers. However, research in the upper layers requires that lower-level software be available and stable. Consequently, the Open Cirrus community has adopted a common software service architecture; the core services of this architecture are shown in Figure 2. All core software components are open-source projects managed by the Apache Software Foundation. Intel Labs has made significant contributions to the development of Zoni and Tashi, and Yahoo! has made significant contributions to the development of Hadoop and HDFS.

Figure 2: Open Cirrus Software Architecture Core Services (Source: Intel Corporation, 2010)

The primary responsibility of the lowest software layer (Zoni) is to partition the cluster into domains. A domain is a set of compute servers that are network-isolated from the rest of the servers in the cluster (by the use of VLANs). When users experiment with system software that interacts with key networking components, such as DHCP services, they, or the cluster system administrators, will first use Zoni to create an isolated domain for the experiment; in this way, if the experiment goes awry, it cannot affect the normal operation of the cluster (and other activity in the cluster cannot interfere with the experiment).  
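The partitioning idea can be illustrated with a toy model. The sketch below is not Zoni's interface; the `Cluster` class, its method names, and the VLAN numbering are hypothetical and exist only to show how moving servers into a separate domain cuts them off from the primary domain.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy model of a cluster partitioned into network-isolated domains."""
    servers: set
    # Maps a VLAN id to the servers in that domain (hypothetical numbering:
    # VLAN 1 stands for the primary domain).
    domains: dict = field(default_factory=dict)
    next_vlan: int = 2

    def __post_init__(self):
        # Initially every server belongs to the primary domain.
        self.domains[1] = set(self.servers)

    def create_domain(self, wanted):
        """Move servers out of the primary domain into a new isolated domain."""
        wanted = set(wanted)
        if not wanted <= self.domains[1]:
            raise ValueError("servers must currently be in the primary domain")
        vlan = self.next_vlan
        self.next_vlan += 1
        self.domains[1] -= wanted
        self.domains[vlan] = wanted
        return vlan

    def can_communicate(self, a, b):
        """Two servers can exchange traffic only if they share a domain."""
        return any(a in members and b in members
                   for members in self.domains.values())


cluster = Cluster(servers={"node1", "node2", "node3", "node4"})
experiment_vlan = cluster.create_domain({"node3", "node4"})
print(cluster.can_communicate("node3", "node4"))  # both inside the experiment
print(cluster.can_communicate("node1", "node3"))  # primary vs. experiment
```

In this model a misbehaving DHCP experiment on node3 could reach node4 but never node1 or node2, which is the property the isolated domain is meant to provide.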

Most of the research that does not interact with core networking services, however, will take place in the primary domain of the cluster. This domain is considered to be for production use. Experiments in the primary domain are isolated from other activity by operating in virtual machine environments, and the virtual machines are managed by a cluster management layer such as Tashi. Tashi enables users to rapidly deploy virtual machine instances in the cluster by specifying attributes of the virtual machine (such as number of processors and the amount of memory) as well as the software that should run within that virtual machine.  
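As a rough illustration of attribute-based deployment, the sketch below models a request that names processor count, memory, and a disk image, then places it on the first host with enough spare capacity. The `VmRequest` and `Host` types, the first-fit policy, and all values are assumptions made for this example; they do not reflect Tashi's actual API or scheduler.

```python
from dataclasses import dataclass


@dataclass
class VmRequest:
    name: str
    cores: int
    memory_mb: int
    disk_image: str  # the software that should run inside the VM


@dataclass
class Host:
    name: str
    free_cores: int
    free_memory_mb: int


def place(request, hosts):
    """First-fit placement: pick the first host with enough spare capacity."""
    for host in hosts:
        if (host.free_cores >= request.cores
                and host.free_memory_mb >= request.memory_mb):
            # Reserve the resources so later requests see the reduced capacity.
            host.free_cores -= request.cores
            host.free_memory_mb -= request.memory_mb
            return host.name
    return None  # no host can satisfy the request


hosts = [Host("host-a", free_cores=2, free_memory_mb=4096),
         Host("host-b", free_cores=8, free_memory_mb=16384)]
request = VmRequest("image-search", cores=4, memory_mb=8192,
                    disk_image="search.img")
print(place(request, hosts))  # host-a is too small, so this prints host-b
```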

However, the data sets in the cluster are potentially valuable to many users and, consequently, are ideally not stored in virtual machine images. Instead, the Open Cirrus core services include a cluster file system that resides beneath the virtual machine layer. In this way, data stored in the cluster file system are accessible from any of the virtual machines operating in the primary domain. Of the many cluster storage options evaluated, the Hadoop Distributed File System (HDFS) best fit the needs of Open Cirrus.

By leveraging the virtual machine layer, the cluster administrators can provide any number of application-level services. The Open Cirrus software service architecture explicitly suggests one such application runtime: the Hadoop map/reduce framework. This framework is particularly suitable for enabling cluster users to process data stored in the cluster file system.  
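For readers unfamiliar with the map/reduce programming model, the following is a minimal single-process sketch of its three phases, using word counting as the workload. It is not the Hadoop API: the function names are invented for illustration, and a real framework would distribute the map and reduce work across the cluster and read its input from the cluster file system.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    # Emit (word, 1) for every word in the line.
    for word in line.lower().split():
        yield word, 1


def count_reducer(key, values):
    # Sum the ones emitted for this word.
    return sum(values)


lines = ["rich data in the cloud", "the cloud stores rich data"]
counts = reduce_phase(shuffle(map_phase(lines, word_mapper)), count_reducer)
print(counts["cloud"], counts["stores"])  # prints: 2 1
```

The appeal of the model is that the mapper and reducer contain no coordination logic, so the same two functions can in principle be run over far larger inputs by a distributed runtime.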

Naturally, the utility of these clusters would be quite limited if they only hosted the development of these core services. Fortunately, many of the cluster users are not involved directly in research on cloud computing; instead, they simply use the Open Cirrus clusters as computing resources in the course of conducting research in some other field. This use of the cluster is welcome and encouraged, because these users provide a realistic context for evaluating the system software by providing authentic data-rich workloads.    

Revolution Analytics - Commercializing R for Statistics

http://www.infoq.com/news/2011/02/revolution_analytics

InfoQ interviewed David Smith, VP of Community for Revolution Analytics, at the Strata big data conference. Revolution provides commercial extensions for the open source R statistics package and announced the R Enterprise v4.2 Suite, along with tools to help SAS users migrate to R. By Ron Bodkin

NIST Cloud Computing Twiki Launched

http://www.infoq.com/news/2010/12/NIST-Twiki

Today NIST began sending users their credentials for its Cloud Computing Twiki; Kevin Jackson was among the first to be granted access. The intent of the NIST working group is to promote cloud computing adoption and overcome the current perceived barriers of security, interoperability, and portability. By James Vastbinder

4 Tools for Assessing Cloud Performance

http://feedproxy.google.com/~r/readwriteweb/~3/KfuRSz2b7-8/tools-for-assessing-cloud-perf.php

As more and more companies begin offering cloud-based services and, in turn, as more and more companies begin to migrate to the cloud, there's an increasing demand for tools to monitor and assess cloud performance. Although we hear a lot about security in the cloud, a study released late last year by the market research firm IDC listed "performance" as one of IT's major concerns, ahead of cost and vendor lock-in.

Last week, we looked at CloudFail.net, a blog that tracks the RSS feeds of major cloud providers in order to monitor service updates and outages. But a number of other services exist that can help customers assess the dependability of cloud providers.

CloudSleuth

Developed by Compuware Gomez, CloudSleuth is a cloud performance visualization tool initially created as an internal resource to help the company gauge the reliability and consistency of the most popular public IaaS and PaaS providers. CloudSleuth uses the Gomez Performance Network to measure the performance of an identical sample application running on popular cloud service providers, assessing two basic user experience metrics: response time and availability. The tests are currently run from locations in all 50 states and from 75 international locations, and there are plans to add the ability to benchmark a user's own application.
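To make the two metrics concrete, here is a small sketch of how they might be computed from a batch of probe results. This is not CloudSleuth's method; the tuple format, the `summarize` function, and the sample numbers are assumptions for illustration. Availability is taken as the fraction of probes that succeeded, and response time as the mean over successful probes only.

```python
from statistics import mean


def summarize(probes):
    """Summarize probe results given as (succeeded, response_seconds) tuples.

    Returns (availability, mean_response_seconds); the mean is None when
    no probe succeeded.
    """
    if not probes:
        raise ValueError("at least one probe is required")
    successes = [seconds for ok, seconds in probes if ok]
    availability = len(successes) / len(probes)
    mean_response = mean(successes) if successes else None
    return availability, mean_response


# Hypothetical results from four probes against one provider.
probes = [(True, 0.5), (True, 1.5), (False, 0.0), (True, 1.0)]
availability, mean_response = summarize(probes)
print(f"availability={availability:.0%}, mean response={mean_response:.2f}s")
```

Excluding failed probes from the response-time mean is one possible design choice; counting them as timeouts would penalize unreliable providers more heavily.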

CloudHarmony

While much of CloudHarmony is still in beta, it looks set to become an important resource for evaluating performance. Currently, you can use its Cloud SpeedTest to test upload and download speeds, page loads, and latency on several major services. The CloudHarmony blog also contains a number of analyses of various services, including encoding, CPU performance, and memory I/O.

Cloudstone

Cloudstone is a multi-platform, multi-language performance measurement tool for Web 2.0 and Cloud Computing. This UC Berkeley project is described as "a toolkit consisting of an open-source Web 2.0 social application (Olio), a set of automation tools for generating load and measuring its performance in different deployment environments, and a recommended set of constraints for computing a metric we believe makes more sense, dollars per user per month." While the results from the project aren't published, the developers have created an application that gives users the ability to do the research themselves.
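A metric of the form "dollars per user per month" can be worked through with simple arithmetic. The sketch below is only an illustration: the function, the 720-hour month, and the prices are assumptions, and it leaves out the constraints that Cloudstone recommends for deciding how many users a deployment actually supports.

```python
def dollars_per_user_per_month(hourly_cost, instances, supported_users,
                               hours_per_month=720):
    """Monthly infrastructure cost divided by the users it can support."""
    if supported_users <= 0:
        raise ValueError("supported_users must be positive")
    monthly_cost = hourly_cost * instances * hours_per_month
    return monthly_cost / supported_users


# Hypothetical deployment: five instances at $0.10 per hour serving 1,000 users.
# 0.10 * 5 * 720 = $360 per month, or $0.36 per user per month.
cost = dollars_per_user_per_month(hourly_cost=0.10, instances=5,
                                  supported_users=1000)
print(f"${cost:.2f} per user per month")
```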

Cloud CMP

Developed by Duke University and Microsoft Research, Cloud CMP "pits cloud against cloud," assessing the computation, storage, and network services offered by different cloud providers, then estimating the performance and cost of an application if it is deployed on a particular provider.

Do you know of other resources out there to help gauge cloud performance? And what other assessments should tools like these be making?

Video: IBM on Mapping the Human Brain and the Future of Cognitive Computing

http://feedproxy.google.com/~r/OmMalik/~3/tGMi8XJaYNw/

A few weeks ago, in June, I wandered over to IBM’s research facility in Almaden, Calif., to see what Big Blue was doing in the fields of materials research and semiconductor manufacturing. At that time I sat down with Dharmendra Modha, manager of cognitive computing at IBM Research Almaden, to discuss his project, which is trying to simulate the way brains work in hopes of advancing the way our computers can process information in real time by changing the basic architecture of the chip. Or as Modha says, offer up “a sense of how the brain gives rise to the mind.”

The research done today may never yield tangible changes in semiconductor architecture, and even if it does, those changes are decades away. But the problem of meeting the ever-increasing demand for compute without creating a similarly overwhelming demand for electrical power is at the heart of what Modha is trying to do. The work of visualizing how brains think could one day show IBM how to build better computers.

To eventually build those computers, IBM is building out a new lab for Modha, which will contain 16 monitors capable of representing 2.64 million neurons, with each pixel representing a neuron. Then researchers will use those neural maps to see how the brain reacts to stimuli. It’s no small task. For example, a cat brain, which Modha has simulated, contains 700,000 neurons with trillions of connections between them. Writing algorithms that show all of that is daunting.

But tomorrow Modha will publish a paper detailing the lab’s latest achievement – mapping a monkey’s brain, which is far more complicated, and gets the lab closer to mapping out a human’s mind. The goal of such visualizations is to help advance computing by changing the way computers solve problems. It’s not a means to build artificial intelligence, so much as it’s a way of discovering how to architect new types of chips that can keep up with a barrage of real-time information.

The video (see below) of our conversation gets fairly deep, but as Modha explains, the effort is an attempt to combine supercomputing, nanotechnology and neuroscience. He’s trying to apply the advances made in understanding the anatomy of the human brain by filtering it through a supercomputer, with the end goal of creating some type of computer built using new technologies that allow the future machine to be smarter and more power efficient. IBM isn’t the only entity interested in such work – the U.S. government has given IBM and Modha DARPA grants worth more than $20 million to work with – and companies from Intel to HP are also pondering ways to push computing to the limits (GigaOM Pro sub req’d).


Measuring and Comparing the Performance of 5 Cloud Platforms

http://www.infoq.com/news/2010/07/Benchmarking-5-Cloud-Platforms

Bitcurrent and Webmetrics ran a series of tests over a month on five cloud platforms (Amazon, Google, Rackspace, Salesforce.com, and Terremark), attempting to measure the performance of each. One of their conclusions is that each platform works better for different application types. By Abel Avram