Workshop on Architectures and Compilers for Multithreading

December 13-15, 2007
Indian Institute of Technology, Kanpur



Abstracts

Sarita Adve
Title: Memory Consistency Models
Abstract: The memory consistency model for a shared-memory multiprocessor system defines the values a read may return, and typically involves a tradeoff between programmability, performance, and portability. It has arguably been one of the most challenging and contentious areas in shared memory system specification for several years. Over the last few years, researchers and developers from the languages community have made a concerted effort to achieve consensus on the language level memory consistency model. A new model for the Java programming language was approved in 2005 and a model for C++ is almost finalized. Partly in response to this work, most hardware vendors have now published memory model specifications that are compatible with the language level models. These models reflect a convergence of about 20 years of research in the area. I will summarize this research and its recent impact on hardware and language-level consistency models, the remaining open problems in the area, and the implications for hardware and compiler writers moving ahead.
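The question of "what values a read may return" is easiest to see in a small litmus test. The sketch below is ours, not the speaker's; it is written with the atomics proposed for the C++ model discussed in the talk. Under relaxed ordering both loads may return 0, an outcome that sequential consistency forbids.

    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<int> x{0}, y{0};
    int r1, r2;

    int main() {
        std::thread t1([] {
            x.store(1, std::memory_order_relaxed);    // write x
            r1 = y.load(std::memory_order_relaxed);   // read y
        });
        std::thread t2([] {
            y.store(1, std::memory_order_relaxed);    // write y
            r2 = x.load(std::memory_order_relaxed);   // read x
        });
        t1.join(); t2.join();
        // The model decides: relaxed ordering permits r1 == 0 && r2 == 0;
        // memory_order_seq_cst (sequential consistency) forbids it.
        std::printf("r1=%d r2=%d\n", r1, r2);
    }
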
Bio: Sarita V. Adve is Professor in the Department of Computer Science at the University of Illinois at Urbana-Champaign. She received a Ph.D. from the University of Wisconsin-Madison in 1993. She recently co-developed the memory consistency model for Java and the soon-to-be-finalized model for C++, both of which are strongly influenced by her extensive work in the area over almost 20 years. Her other major current research focus is hardware reliability, which also has deep implications for multicore architectures. Professor Adve was named a UIUC University Scholar in 2004, received an Alfred P. Sloan Research Fellowship in 1998, IBM Faculty/Partnership Awards in 2005, 1998, and 1997, and a National Science Foundation CAREER award in 1995. She served on the advisory committee of the National Science Foundation's CISE directorate from 2003 to 2005.

Frances Allen
Title: Languages and Compilers for Multicore Computing Systems
Abstract: Multi-core computers are ushering in a new era of parallelism everywhere. As more cores (and parallelism) are added, the potential performance of the hardware will increase at the traditional rate. But how will users and applications take advantage of all the parallelism? This talk will review some of the history of languages and compilers for high performance systems, and then consider their ability to deliver the performance potential of multi-core systems. The talk is intended to encourage the exploration of new approaches.
Bio: Fran Allen is an IBM Fellow Emerita at the T. J. Watson Research Laboratory with a specialty in compilers and program optimization for high performance computers. This work led to Fran being named the recipient of ACM's 2006 Turing Award "For pioneering contributions to the theory and practice of optimizing compiler techniques that laid the foundation for modern optimizing compilers and automatic parallel execution."
She is a member of the American Philosophical Society and the National Academy of Engineering, and is a Fellow of the American Academy of Arts and Sciences, ACM, IEEE, and the Computer History Museum. She has served on numerous national technology boards, including CISE (the Computer and Information Science and Engineering board) at the National Science Foundation and CSTB (the Computer Science and Telecommunications Board) for the National Research Council. Her many awards and honors include honorary doctorates from the University of Alberta (1991), Pace University (1999), and the University of Illinois at Urbana (2004).
Fran is an active mentor, advocate for technical women in computing, environmentalist, and explorer.

Saman Amarasinghe
Title: StreamIt - A Programming Language for the Era of Multicores
Abstract: One promising approach to parallel programming is the use of novel programming language techniques -- ones that reduce the burden on the programmers, while simultaneously increasing the compiler's ability to get good parallel performance. In this talk, I will introduce StreamIt: a language and compiler specifically designed to expose and exploit inherent parallelism in "streaming applications" such as audio, video, and network processing. StreamIt provides novel high-level representations to improve programmer productivity within the streaming domain. By exposing the communication patterns of the program, StreamIt allows the compiler to perform aggressive transformations and effectively utilize parallel resources. StreamIt is ideally suited for multicore architectures; recent experiments on the 16-core Raw machine demonstrate an 11x speedup over a single core.
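StreamIt programs are written in StreamIt's own syntax, so purely as an illustration of the underlying idea, here is a minimal C++ sketch in which each filter's input/output rates and the pipeline structure are explicit: the kind of communication pattern a StreamIt compiler can see and exploit.

    #include <cstddef>
    #include <cstdio>
    #include <functional>
    #include <vector>

    using Stream = std::vector<float>;
    using Filter = std::function<Stream(const Stream&)>;   // one steady-state pass

    int main() {
        Filter scale = [](const Stream& in) -> Stream {    // pops 1, pushes 1
            Stream out;
            for (float v : in) out.push_back(v * 0.5f);
            return out;
        };
        Filter pairSum = [](const Stream& in) -> Stream {  // pops 2, pushes 1
            Stream out;
            for (std::size_t i = 0; i + 1 < in.size(); i += 2)
                out.push_back(in[i] + in[i + 1]);
            return out;
        };
        std::vector<Filter> pipeline{scale, pairSum};      // explicit structure:
        Stream data{1, 2, 3, 4};                           // each stage could run
        for (const Filter& f : pipeline) data = f(data);   // on its own core
        for (float v : data) std::printf("%g ", v);        // prints: 1.5 3.5
        std::printf("\n");
    }
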
Bio: Saman P. Amarasinghe is an Associate Professor in the Department of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL). His research interests are in discovering novel approaches to improve the performance of modern computer systems and make them more secure, without unduly increasing the complexity faced by end users, application developers, compiler writers, or computer architects. Saman received his BS in Electrical Engineering and Computer Science from Cornell University in 1988, and his MSEE and Ph.D. from Stanford University in 1990 and 1997, respectively.

Nancy Amato
Title: STAPL: A High Productivity Programming Infrastructure for Parallel and Distributed Computing
Abstract: The Standard Template Adaptive Parallel Library (STAPL) is a parallel programming framework that extends C++ and STL with support for parallelism. STAPL provides parallel data structures (pContainers) and generic parallel algorithms (pAlgorithms), and a methodology for extending them to provide customized functionality. By abstracting away much of the complexity of parallelism from the end user, STAPL provides a platform for high productivity, enabling the user to focus on algorithmic design instead of lower-level parallel implementation issues. In this talk, we provide an overview of the major STAPL components, with a particular focus on the STAPL pContainers (parallel and distributed data structures) and, as time allows, discuss STAPL's support for adaptive algorithm selection and describe how some important scientific applications (particle transport and protein folding) have been developed in STAPL. This is joint work with Lawrence Rauchwerger.
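As a hedged sketch of the programming style (the STAPL spellings in the comments are illustrative, not necessarily the library's exact API), the parallel version is meant to read just like the STL baseline below:

    #include <algorithm>
    #include <cstdio>
    #include <numeric>
    #include <vector>

    int main() {
        std::vector<double> v(1000);          // cf. stapl::p_vector<double> (hypothetical spelling)
        std::fill(v.begin(), v.end(), 1.0);   // cf. a fill pAlgorithm
        double s = std::accumulate(v.begin(), v.end(), 0.0);   // cf. a reduction pAlgorithm
        // In STAPL the container is distributed and the algorithms run in
        // parallel, but user code stays at this level of abstraction.
        std::printf("%g\n", s);               // prints: 1000
    }
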
Bio: Nancy M. Amato is a professor of computer science at Texas A&M University. She received B.S. and A.B. degrees in Mathematical Sciences and Economics, respectively, from Stanford University, and M.S. and Ph.D. degrees in Computer Science from UC Berkeley and the University of Illinois at Urbana-Champaign, respectively. She was an AT&T Bell Laboratories PhD Scholar, she is a recipient of a CAREER Award from the National Science Foundation, and she is a Distinguished Lecturer for the IEEE Robotics and Automation Society. She served as an Associate Editor of the IEEE Transactions on Robotics and Automation and of the IEEE Transactions on Parallel and Distributed Systems, she serves on review panels for NIH and NSF, and she regularly serves on conference organizing and program committees. She is a member of the Computing Research Association's Committee on the Status of Women in Computing Research (CRA-W) and she co-directs the CRA-W's Distributed Mentor Program (http://www.cra.org/Activities/craw/dmp/). Her main areas of research focus are motion planning, computational biology and geometry, and high-performance computing. Current projects include the development of a new technique for approximating protein folding pathways and energy landscapes, and STAPL, a parallel C++ library enabling the development of efficient, portable parallel programs. More information regarding our work can be found at http://parasol.tamu.edu/.

Arvind
Title: A Hardware Design Inspired Methodology for Parallel Programming
Abstract: One source of weakness in parallel programming has been the lack of compositionality; independently written parallel libraries and packages don't compose very well. We will argue that perhaps traditional procedural abstraction and abstract data types don't capture the essential differences between parallel and sequential programming. We will present a different notion of module, based on guarded atomic actions, and view a module as a resource to be shared concurrently by other modules. As opposed to implicitly or explicitly specifying parallelism in a program, we think of parallel programming as a process of synthesis from a set of modules with proper interfaces and composition rules. We will draw connections between this hardware-design-inspired methodology and traditional approaches to multithreaded parallelism, including programming based on transactions.
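To make the notion concrete, here is a small illustrative sketch of ours (in C++ rather than a hardware description language) of a module as a set of guarded atomic actions: each rule fires only when its guard holds, and each firing is atomic with respect to the others.

    #include <cstdio>
    #include <functional>
    #include <utility>
    #include <vector>

    // A rule fires only when its guard holds; each firing is atomic
    // with respect to all other rules.
    struct Rule {
        std::function<bool()> guard;
        std::function<void()> action;
    };

    int main() {
        int x = 36, y = 24;   // the module's state; here: Euclid's GCD
        std::vector<Rule> rules = {
            { [&] { return x > y && y != 0; },            [&] { std::swap(x, y); } },
            { [&] { return x <= y && x != 0 && y != 0; }, [&] { y -= x; } }
        };
        // A trivial scheduler firing one enabled rule at a time; real systems
        // fire maximal sets of non-conflicting rules in parallel.
        for (bool fired = true; fired; ) {
            fired = false;
            for (Rule& r : rules)
                if (r.guard()) { r.action(); fired = true; break; }
        }
        std::printf("gcd = %d\n", x == 0 ? y : x);   // prints: gcd = 12
    }
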
Bio: Arvind is the Johnson Professor of Computer Science and Engineering at MIT, where in the late eighties his group, in collaboration with Motorola, built the Monsoon dataflow machines and their associated software. In 2000, Arvind started Sandburst, which was sold to Broadcom in 2006. In 2003, Arvind co-founded Bluespec Inc., an EDA company producing a set of tools for high-level synthesis. In 2001, Dr. R. S. Nikhil and Arvind published the book "Implicit Parallel Programming in pH". Arvind's current research interests are the synthesis and verification of large digital systems described using guarded atomic actions, and memory models for parallel architectures and languages.

Chen Ding
Title: BOP: Software Behavior Oriented Parallelization
Abstract: Many sequential applications are difficult to parallelize because of complex code, input-dependent parallelism, and the use of third-party modules. These difficulties led us to build a software system for behavior oriented parallelization (BOP), which allows a program to be parallelized based on partial information about program behavior, for example, a user reading just part of the source code, or a profiling tool examining merely one or a few executions.
The basis of BOP is programmable software speculation, where a user or an analysis tool marks possibly parallel regions in the code, and the run-time system executes these regions speculatively. It is imperative to protect the entire address space during speculation. In this talk I will describe the basic features of the prototype system including the programming interface, parallelism analyzer, and the run-time support based on strong isolation, value-based checking, and non-speculative re-execution. On a recently acquired multi-core, multi-processor PC, the BOP system reduced the end-to-end execution time by integer factors for a set of open-source and commercial applications, with no change to the underlying hardware or operating system.
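As a rough sketch of the programming interface (marker names are illustrative, and the markers are no-ops here, so the sketch simply runs sequentially), a user brackets a possibly parallel region and leaves correctness to the speculative runtime:

    #include <cstdio>

    // Illustrative no-op markers: a real BOP runtime would fork a speculative
    // task at BeginPPR and check and commit (or re-execute) at EndPPR.
    #define BeginPPR(id)
    #define EndPPR(id)

    int work(int i) { return i * i; }   // stand-in for a costly, probably
                                        // independent piece of the program
    int main() {
        int results[8];
        for (int i = 0; i < 8; ++i) {
            BeginPPR(1);
            results[i] = work(i);       // iterations are likely independent,
            EndPPR(1);                  // but nothing here proves it statically
        }
        int total = 0;
        for (int r : results) total += r;
        std::printf("total = %d\n", total);   // prints: total = 140
    }

Marking is safe to get wrong: a region that turns out to conflict is simply re-executed non-speculatively.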
Bio: Chen Ding is an Associate Professor in the Computer Science Department at the University of Rochester and presently a Visiting Associate Professor in the EECS Department at MIT. He received the Early Career Principal Investigator award from DoE, the CAREER award from NSF, the CAS Faculty Fellowship from IBM, and a best-paper award from the IEEE IPDPS. He co-founded the ACM SIGPLAN Workshop on Memory System Performance and Correctness (MSPC) in 2002 and organized the Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS) in 2007. Between February and August 2007, he was a visiting researcher at Microsoft Research. More information about his work can be found at http://www.cs.rochester.edu/~cding/.

Rudolf Eigenmann
Title: Automatic Performance Tuning for Multicore Architectures
Abstract: One of the fundamental limitations of optimizing compilers is the lack of runtime knowledge about the actual input data and execution platform. Automatic performance tuning has the potential to overcome this limitation. I will present the architecture and performance of an automatic performance tuning system that pursues this goal. The system partitions a program into a number of tuning sections and finds the best combination of compiler optimizations for each section.
The performance tuning process includes several pre-tuning steps that partition a program into suitable tuning sections and instrument them, followed by the actual tuning and the post-tuning assembly of the individually optimized parts. The system, called PEAK, achieves fast tuning speed by measuring a small number of invocations of each code section, instead of the whole-program execution time, as in previous solutions. Compared to these solutions, PEAK reduces tuning time from 2.19 hours to 5.85 minutes on average, while achieving similar program performance. PEAK improves the performance of the SPEC CPU2000 FP benchmarks by an average of 12% over GCC -O3, the highest optimization level, on a Pentium 4 machine.
The current system implementation tunes programs during a training run, then freezes the optimization combination for the production runs. I will discuss opportunities to perform the optimizations fully dynamically. Also, the only optimizations currently being tuned are those exposed in the form of compiler options. In ongoing work, we are exposing additional compiler-internal optimization parameters, as well as optimization parameters of library routines, to the automatic tuning capability.
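A toy version of the central idea, not PEAK itself: rate each candidate optimization combination by timing a few invocations of a single tuning section rather than a whole-program run.

    #include <chrono>
    #include <cstdio>
    #include <functional>
    #include <string>
    #include <utility>
    #include <vector>

    using Section = std::function<void()>;

    // Time n invocations of one tuning section.
    static double timeInvocations(const Section& s, int n) {
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < n; ++i) s();
        return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
    }

    int main() {
        // Stand-ins for the same section compiled under different option sets.
        std::vector<std::pair<std::string, Section>> variants = {
            { "-O1", [] { volatile long x = 0; for (long i = 0; i < 500000; ++i) x = x + i; } },
            { "-O3", [] { volatile long x = 0; for (long i = 0; i < 100000; ++i) x = x + i; } },
        };
        std::string best;
        double bestTime = 1e30;
        for (const auto& v : variants) {
            double t = timeInvocations(v.second, 10);            // a few invocations,
            if (t < bestTime) { bestTime = t; best = v.first; }  // not a whole run
        }
        std::printf("freeze %s for production runs\n", best.c_str());
    }
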
Bio: Rudolf Eigenmann is a professor at the School of Electrical and Computer Engineering at Purdue University. He is also the Interim Director of the Computing Research Institute and Associate Director of Purdue's Cyber Center. His research interests include optimizing compilers, programming methodologies and tools, performance evaluation for high-performance computers and applications, and Internet sharing technology. Dr. Eigenmann received his Ph.D. in Electrical Engineering/Computer Science in 1988 from ETH Zurich, Switzerland.

Manish Gupta
Title: A Data-Driven Co-operative Approach to Scaling of Commercial Java Codes
Abstract: The rising power dissipation in microprocessor chips is leading to a fundamental shift in the computing paradigm, one that requires software to exploit ever-increasing levels of parallelism. We describe a data-driven approach, requiring co-operation across multiple layers of software, to scaling J2EE codes to large-scale parallelism. We present a study of relevant characteristics of some J2EE programs, which provides evidence of the applicability of our ideas. For the given benchmarks, we show that a large percentage of objects, usually greater than 90%, are thread-private. Furthermore, a large percentage of locking operations on shared objects are on infrequently written objects. We provide a detailed analysis of lock contention among user threads, and demonstrate that threads can be naturally grouped based on lock contention. Overall, we argue that there is a need to tackle the scalability problem by applying optimizations across the stack (comprising the application, application server, JVM, operating system, and the hardware layers).
Bio: Manish Gupta is the Chief Technology Officer at the IBM India Systems and Technology Laboratory, and leads efforts to take on challenging new missions at the lab. He is on assignment from the IBM T. J. Watson Research Center in Yorktown Heights, NY, where he was a Senior Manager and led research on system software for the IBM Blue Gene supercomputer and high end servers. Manish received a B. Tech. in Computer Science from IIT Delhi in 1987, a Ph.D. from the University of Illinois at Urbana-Champaign in 1992, and has worked with IBM since then. He has received two Outstanding Technical Achievement Awards at IBM, filed over a dozen patents, and has co-authored over 70 papers in the areas of high performance compilers, parallel computing, and Java Virtual Machine optimizations.

Maurice Herlihy
Title: Taking Concurrency Seriously: The Multicore Challenge
Abstract: Computer architecture is undergoing, if not another revolution, then a vigorous shaking-up. The major chip manufacturers have, for the time being, simply given up trying to make processors run faster. Instead, they have recently started shipping "multicore" architectures, in which multiple processors (cores) communicate directly through shared hardware caches, providing increased concurrency instead of increased clock speed.
As a result, system designers and software engineers can no longer rely on increasing clock speed to hide software bloat. Instead, they must somehow learn to make effective use of increasing parallelism. This adaptation will not be easy. Conventional synchronization techniques based on locks and conditions are unlikely to be effective in such a demanding environment. Coarse-grained locks, which protect relatively large amounts of data, do not scale, and fine-grained locks introduce substantial software engineering problems.
Transactional memory is a computational model in which threads synchronize by optimistic, lock-free transactions. This synchronization model promises to alleviate many (perhaps not all) of the problems associated with locking, and there is a growing community of researchers working on both software and hardware support for this approach. This talk will survey the area, with a focus on open research problems.
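A minimal sketch of the programming model only, not of any real TM system: the programmer marks a block atomic and leaves synchronization to the runtime. A real TM executes such blocks optimistically and in parallel when they do not conflict; the stand-in below uses one global lock merely to give the same all-or-nothing semantics.

    #include <cstdio>
    #include <mutex>
    #include <thread>

    std::mutex tmGlobal;   // stand-in for the TM runtime

    // The programmer's view: the body appears all-or-nothing to other threads.
    template <class F>
    void atomically(F body) {
        std::lock_guard<std::mutex> g(tmGlobal);   // a real TM would instead run
        body();                                    // the body optimistically and
    }                                              // retry on conflict, lock-free

    int a = 100, b = 0;

    int main() {
        std::thread t1([] { atomically([] { a -= 10; b += 10; }); });
        std::thread t2([] { atomically([] { a -= 20; b += 20; }); });
        t1.join(); t2.join();
        std::printf("a=%d b=%d\n", a, b);   // always a=70 b=30
    }
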
Bio: Maurice Herlihy received an A.B. degree in Mathematics from Harvard University and a Ph.D. degree in Computer Science from MIT. He has been an Assistant Professor in the Computer Science Department at Carnegie Mellon University, a member of the research staff at Digital Equipment Corporation's Cambridge (MA) Research Lab, and a consultant for Sun Microsystems. He is now a Professor of Computer Science at Brown University. Prof. Herlihy's research centers on practical and theoretical aspects of multiprocessor synchronization, with a focus on wait-free and lock-free synchronization. His 1991 paper "Wait-Free Synchronization" won the 2003 Dijkstra Prize in Distributed Computing, and he shared the 2004 Goedel Prize for his 1999 paper "The Topological Structure of Asynchronous Computation." He is a Fellow of the ACM.

Laxmikant Kale
Title: Simplifying Parallel Programming with Non-Complete Deterministic Languages
Abstract: With multicore machines on the desktops, parallel programming needs to get down to the "masses" of programmers. Yet it remains a complex skill that is difficult to master: performance and correctness issues substantially beyond those encountered in sequential programming make it far more complicated. Among existing paradigms, MPI is considered low-level and suffers from modularity issues. Shared address space programming is no easier, despite claims to the contrary, because of the large number of interleavings and the concomitant potential for race conditions. However, in the past several years, I have started converging towards a model of parallel programming that may lead us to a solution. The basic ideas in such a model include: (a) automated resource management via over-decomposition into migratable objects, (b) languages that sacrifice completeness for simplicity and determinacy, and (c) an interoperability framework that allows modules written in many such languages to synergistically coexist in a single application. I will elaborate on these ideas, with two example languages called Charisma and MSA, and the interoperability framework defined by the Charm++ runtime system. I will illustrate them with examples drawn from a number of scientific and engineering applications I have worked on over the past 15 years.
Bio: Professor Laxmikant Kale has been working on various aspects of parallel computing, with a focus on enhancing performance and productivity via adaptive runtime systems, and with the belief that only interdisciplinary research involving multiple computational science and engineering applications can bring well-honed abstractions back into Computer Science that will have a long-term impact on the state of the art. His collaborations include the widely used, Gordon Bell award-winning (SC 2002) biomolecular simulation program NAMD, as well as work on computational cosmology, quantum chemistry, rocket simulation, space-time meshes, and other unstructured-mesh applications. He takes pride in his group's success in distributing and supporting software embodying his research ideas, including Charm++, Adaptive MPI and the ParFUM framework.
L. V. Kale received a B.Tech. degree in Electronics Engineering from Banaras Hindu University, Varanasi, India, in 1977, and an M.E. degree in Computer Science from the Indian Institute of Science, Bangalore, India, in 1979. He received a Ph.D. in computer science from the State University of New York at Stony Brook in 1985.
He worked as a scientist at the Tata Institute of Fundamental Research from 1979 to 1981. He joined the faculty of the University of Illinois at Urbana-Champaign as an Assistant Professor in 1985, where he is currently a Professor.

Uday Khedker
Title: Efficiency, Precision, Simplicity, and Generality in Interprocedural Data Flow Analysis: Resurrecting the Classical Call Strings Method
Abstract: Context sensitive interprocedural data flow analysis requires incorporating the effect of all possible calling contexts on the data flow information at a program point. The call strings approach, which represents context information in the form of a call string, bounds the contexts by terminating the call string construction using precomputed length bounds. These bounds are large enough to guarantee a safe and precise solution, but usually result in a large number of call strings, thereby rendering the method impractical.
We propose a simple change in the classical call strings method. Unlike the traditional approach, in which call string construction is orthogonal to the computation of data flow values, our variant uses the equivalence of data flow values to terminate call string construction. This allows us to discard call strings when they are redundant and regenerate them when required. For cyclic call strings, regeneration facilitates iterative computation of data flow values without explicitly constructing most of the call strings. This reduces the number of call strings, and hence the analysis time, by orders of magnitude, as corroborated by our empirical measurements.
On the theoretical side, our method reduces the worst-case call string length from quadratic in the size of the lattice to linear. Further, unlike the classical method, this worst-case length need not be reached, since termination does not depend on constructing all call strings up to this length. Our approach retains the precision, generality, and simplicity of the call strings method while significantly reducing complexity and increasing efficiency, without imposing any additional constraints.
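A toy illustration of the key step (the names below are ours, not the paper's): at a call node, call strings carrying the same data flow value are interchangeable, so one representative per value is propagated through the callee and the rest are remembered for regeneration at the matching return.

    #include <cstdio>
    #include <map>
    #include <string>
    #include <vector>

    using Value = int;   // a tiny stand-in for a data flow lattice

    int main() {
        // Incoming (call string -> data flow value) pairs at some call site.
        std::map<std::string, Value> incoming = { {"c1", 7}, {"c2", 7}, {"c3", 9} };
        // Classical method: propagate all three strings through the callee.
        // Variant: group by value and propagate one representative per group.
        std::map<Value, std::vector<std::string>> groups;
        for (const auto& kv : incoming) groups[kv.second].push_back(kv.first);
        for (const auto& g : groups)
            std::printf("propagate %s with value %d; regenerate %zu more at the return\n",
                        g.second.front().c_str(), g.first, g.second.size() - 1);
    }
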
Bio: Uday Khedker holds a Ph.D. in Computer Science & Engineering from IIT Bombay. He taught Computer Science at Pune University from 1994 to 2001, and has since been with IIT Bombay, where he is currently an Associate Professor of Computer Science & Engineering. His areas of interest are programming languages and compilers, and he specializes in data flow analysis and its applications to code optimization. His current research topics include Interprocedural Data Flow Analysis, Static Analysis of Heap Allocated Data, Static Inferencing of Flow Sensitive Polymorphic Types, and Compiler Verification. A recent research thrust involves cleaning up the GNU Compiler Collection (GCC) to simplify its deployment, retargeting, and enhancement. Other goals include increasing its trustworthiness as well as the quality of generated code.

José Martínez
Title: Core Fusion: Accommodating Software Diversity in Chip Multiprocessors
Abstract: Chip multiprocessors (CMPs) hold the prospect of delivering long-term performance growth by integrating more cores on the die with each new technology generation. In the short term, on-chip integration of a few relatively large cores may yield sufficient throughput when running multiprogrammed workloads. However, harnessing the full potential of CMPs in the long term makes a broad adoption of parallel programming inevitable.
We envision a CMP-dominated future where a diverse landscape of software in different stages of parallelization exists at all times. Unfortunately, in this future, the inherent rigidity in current proposals for CMP designs makes it hard to come up with a "universal" CMP that can accommodate this software diversity.
In this talk I will discuss Core Fusion, a CMP architecture where cores can "fuse" into larger cores on demand to execute sequential code very fast, while still retaining the ability to operate independently to run highly parallel code efficiently. Core Fusion builds upon a substrate of fundamentally independent cores and conventional memory coherence/consistency support, and enables the CMP to dynamically morph into different configurations to adapt to the changing needs of software at run-time. Core Fusion does not require specialized software support, it leverages mature micro-architecture technology, and it can interface with the application through small extensions encapsulated in ordinary parallelization libraries, macros, or directives.
Bio: José Martínez (Ph.D. '02, Computer Science, UIUC) is an assistant professor of electrical and computer engineering and a graduate field member of computer science at Cornell University. He leads the M3 Architecture Research Group at Cornell, whose interests include multicore architectures, reconfigurable and self-optimizing hardware, and hardware-software interaction. Martínez's work has been selected for IEEE Micro Top Picks twice (2003 and 2007). In 2005, he and his students received the Best Paper Award at HPCA-11 for their work on checkpointed early load retirement. Martínez is also the recipient of an NSF CAREER Award and, more recently, an IBM Faculty Award. His teaching responsibilities at Cornell include computer architecture at both undergraduate and graduate levels. He also organizes the AMD Computer Engineering Lecture Series.

Mayur Naik
Title: Effective Static Race Detection for Java
Abstract: Concurrent programs are notoriously difficult to write and debug, a problem poised to become acute with the recent shift in hardware from uniprocessors to multicore processors. A fundamental and particularly insidious concurrency bug is a race: a condition in a shared-memory multithreaded program in which a pair of threads may access the same memory location without any ordering enforced between the accesses, and at least one of the accesses is a write. Despite thirty years of research on race detection, today's concurrent programs are still riddled with harmful races.
We present an effective approach to static race detection for Java. We dissect the specification of a race to identify four natural conditions, each of which is sufficient for proving a given pair of statements race-free, but all of which are necessary in practice, as different pairs of statements in a given Java program may be race-free because of different conditions. We present four novel static analyses, each of which conservatively approximates a separate condition while together enabling the overall algorithm to report a high-quality set of potential races. We describe the implementation of our approach in a tool called Chord, and report on our experience applying it to a suite of multithreaded Java programs.
Our approach is sound in that it is guaranteed to report all races, it is precise in that it misidentifies few non-races as races, and it is scalable in that it is fully automatic and checks programs comprising hundreds of thousands of Java bytecodes in a few minutes. Finally, our approach is effective, finding tens to hundreds of previously unknown concurrency bugs in mature and widely used Java programs, many of which were fixed within a week of reporting.
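The definition is easiest to see in miniature (sketched here in C++, although the talk's setting is Java): the unordered accesses to shared form a race, while the lock-guarded pair does not, and a static detector must prove some such condition for every pair it rules out.

    #include <cstdio>
    #include <mutex>
    #include <thread>

    int shared = 0;    // accessed with no ordering: a race (deliberately buggy)
    int guarded = 0;   // every access ordered by m: race-free
    std::mutex m;

    int main() {
        std::thread t1([] {
            shared = 1;                  // write, unordered w.r.t. t2's read
            std::lock_guard<std::mutex> g(m);
            guarded = 1;
        });
        std::thread t2([] {
            int r = shared; (void)r;     // read, unordered w.r.t. t1's write
            std::lock_guard<std::mutex> g(m);
            guarded = 2;
        });
        t1.join(); t2.join();
        std::printf("guarded=%d\n", guarded);
    }
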
Bio: Mayur Naik is a Ph.D. student in the Computer Science Department at Stanford University, where he is advised by Professor Alex Aiken. Mayur's research interests lie at the boundary of programming languages and software engineering, with a current focus on concurrency. He obtained a B.E. in Computer Science from BITS, Pilani, India in 1999 and an M.S. in Computer Science from Purdue University in 2003. He was awarded a Microsoft Fellowship in 2004-05.

Ramesh Peri
Title: Software Development Tools for Multi-Core/Parallel Programming
Abstract: The new era of multi-core processors is bringing unprecedented computing power to mainstream desktop applications. In order to fully exploit this compute power, one has to delve into the world of parallel programming, which until today has been the exclusive domain of the high-performance computing community. This talk will focus on the current state of the art in parallel programming tools applicable to developers of mainstream parallel applications, with emphasis on software development tools like compilers, debuggers, performance analysis tools, and correctness checking tools for parallel programs. I will share some of the challenges that developers face today in developing applications for multi-core systems containing a small number of homogeneous cores (2 to 8), and discuss the situation we will face with the advent of systems containing many more heterogeneous cores in the next few years.
Bio: Ramesh Peri is a Principal Engineer at Intel Corporation in the Performance and Threading Tools Lab. He manages a multi-geo group located in Russia and the United States, and is responsible for the development of data collectors for performance analysis and correctness tools like Intel® VTune™, Intel® ThreadChecker, and Intel® ThreadProfiler. Prior to joining Intel, Ramesh worked in the area of software development tools at Panasonic AVC Labs, Lucent Technologies, and Hewlett-Packard. Ramesh received his Ph.D. in computer science from the University of Virginia in 1995.

Keshav Pingali
Title: Exploiting Data Parallelism in Irregular Programs
Abstract: The parallel programming community has a lot of experience in exploiting data parallelism in regular programs that deal with structured data such as arrays and matrices. However, most client-side applications deal with unstructured data represented using pointer-based data structures such as trees and graphs. In her Turing Award lecture, Fran Allen raised an important question about such programs: do irregular programs have data parallelism, and if so, how do we exploit it on multicore processors?
In this talk, we argue using concrete examples that irregular programs have a generalized kind of data parallelism that arises from the use of iterative algorithms that manipulate worklists of various sorts. We then describe the approach taken in the Galois project to exploit this data parallelism. There are three main aspects to the Galois system: (1) a small number of syntactic constructs for packaging optimistic parallelism as iteration over ordered and unordered sets, (2) assertions about methods in class libraries, and (3) a runtime scheme for detecting and recovering from potentially unsafe accesses to shared memory made by an optimistic computation. We present experimental results that demonstrate that the Galois approach is practical, and discuss ongoing work on this system.
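A sketch of the iteration construct alone (our spelling, not the Galois syntax): work is drawn from an unordered collection and each iteration may add new work; a real runtime would execute iterations speculatively in parallel and roll back on conflicting accesses, whereas this stand-in runs them one at a time.

    #include <cstdio>
    #include <deque>

    // Draw work from an unordered collection; the body may add new work.
    template <class T, class Body>
    void for_each_unordered(std::deque<T> worklist, Body body) {
        while (!worklist.empty()) {
            T item = worklist.front();
            worklist.pop_front();
            body(item, worklist);
        }
    }

    int main() {
        // Toy refinement: split every item greater than 1 into two halves.
        for_each_unordered<int>({5, 8}, [](int n, std::deque<int>& wl) {
            if (n > 1) { wl.push_back(n / 2); wl.push_back(n - n / 2); }
            else       std::printf("leaf %d\n", n);
        });
    }
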
Bio: Keshav Pingali is a professor in the Computer Science department at the University of Texas at Austin, where he holds the W. A. "Tex" Moncrief Chair of Grid and Distributed Computing. He received the B.Tech. degree in Electrical Engineering from IIT Kanpur, India in 1978, and the S.M., E.E., and Sc.D. degrees from MIT in 1986. He was on the faculty of the Department of Computer Science at Cornell University from 1986 to 2006, where he held the India Chair of Computer Science.
Pingali's research has focused on programming languages and compiler technology for program understanding, restructuring, and optimization. His group is known for its contributions to memory-hierarchy optimization; some of these have been patented. Algorithms and tools developed by his projects are used in many commercial products, such as Intel's IA-64 compiler, SGI's MIPSPro compiler, and HP's PA-RISC compiler. In his current research, he is investigating optimistic parallelization techniques for multicore processors, and language-based fault tolerance. Among other awards, Pingali has won the President's Gold Medal at I.I.T. Kanpur (1978), an IBM Faculty Development Award (1986-88), an NSF Presidential Young Investigator Award (1989-94), the Ip-Lee Teaching Award of the College of Engineering at Cornell (1997), and the Russell Teaching Award of the College of Arts and Sciences at Cornell (1998). In 2000, he was a visiting professor at I.I.T. Kanpur, where he held the Rama Rao Chaired Professorship.

Lawrence Rauchwerger
Title: Automatic Parallelization with Hybrid Analysis
Abstract: Hybrid Analysis (HA) is a compiler technology that can seamlessly integrate all static and run-time analysis of memory references into a single framework capable of generating sufficient information for most memory-related optimizations.
In this talk, we will present Hybrid Analysis as a framework to perform automatic parallelization of loops. For the cases when static analysis does not give conclusive results, we extract sufficient conditions, which are then evaluated dynamically and can (in)validate the parallel execution of loops. The HA framework has been fully implemented in the Polaris compiler and has parallelized 22 benchmark codes with 99% coverage and speedups superior to those of the Intel Ifort compiler.
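A toy version of the idea: when static analysis of accesses like A[B[i]] is inconclusive, the compiler emits a cheap run-time test and branches to a parallel or a sequential version of the loop. The OpenMP pragma below merely stands in for whatever parallel code the compiler would generate.

    #include <cstdio>
    #include <set>
    #include <vector>

    int main() {
        std::vector<int> A(8, 0), B = {3, 1, 4, 0, 5, 2, 7, 6};
        // Run-time "sufficient condition": if all indices in B are distinct,
        // no two iterations of the loop below touch the same element of A.
        bool independent = std::set<int>(B.begin(), B.end()).size() == B.size();
        if (independent) {
            #pragma omp parallel for   // condition validated: run in parallel
            for (int i = 0; i < (int)B.size(); ++i) A[B[i]] += i;
        } else {
            for (int i = 0; i < (int)B.size(); ++i) A[B[i]] += i;   // fall back
        }
        std::printf("A[0]=%d\n", A[0]);   // prints: A[0]=3
    }
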
Bio: Lawrence Rauchwerger is a Professor of Computer Science and of Computer Engineering in the Department of Computer Science at Texas A&M University. He is also the co-Director of the Parasol Laboratory.
Lawrence Rauchwerger received an Engineer degree from the Polytechnic Institute of Bucharest, an M.S. in Electrical Engineering from Stanford University, and a Ph.D. in Computer Science from the University of Illinois at Urbana-Champaign. Since 1996 he has been on the faculty of the Department of Computer Science at Texas A&M, where he co-founded and co-directs the Parasol Lab. He has held Visiting Faculty positions at the University of Illinois at Urbana-Champaign, Bell Labs, the IBM T.J. Watson Research Center, and INRIA FUTURS, Paris.
Rauchwerger's research has targeted the area of high-performance compilers, libraries for parallel and distributed computing, and adaptive optimizations and their architectural support. He is known for introducing software thread-level speculative parallelization (TLS). Subsequently he introduced architectural innovations to support speculative parallelization (together with Josep Torrellas). He is also known for SmartApps (application-centric optimization), a novel approach to application optimization. His current focus is STAPL, a parallel superset of the ISO C++ STL library, driven by his goal to improve the productivity of parallel software development. His approach to parallel code development and optimization (STAPL and SmartApps) has influenced industrial products at major corporations. He has also been very active in the development of parallel applications in the domains of nuclear engineering and physics.

Vivek Sarkar
Title: Compiler Challenges for Multicore Parallel Systems
Abstract: This decade marks a resurgence for parallel computing with mainstream and high-end systems moving to multicore processors. Unlike previous generations of hardware evolution, this shift will have a major impact on existing software. At the high end, it is widely recognized by application experts that past approaches based on domain decomposition will not scale to exploit the parallelism needed by multicore nodes. In the mainstream, it is acknowledged by hardware vendors that enablement of software for execution on multiple cores is the major open problem that needs to be solved in support of this hardware trend. These software challenges are further compounded by an increased adoption of high performance computing in new application domains that may not fit the patterns of parallelism that have been studied by the community thus far.
In this talk, we outline the software stacks that are being developed for multicore parallel systems, and summarize the challenges and opportunities that they pose to compilers. We discuss new opportunities for compiler research created by recent work on high-productivity parallel languages and on lightweight concurrency in virtual machines (managed runtimes). Examples will be given from research projects under way in these areas, including PGAS languages, Java Concurrency Utilities, and the X10 language. Finally, we outline the new Habanero research project initiated at Rice University with the goal of producing portable parallel software that can run efficiently on a wide range of homogeneous and heterogeneous multicore systems.
Bio: Professor Vivek Sarkar conducts research in programming languages, program analysis, compiler optimizations and virtual machines for parallel and high performance computer systems. His past projects include the X10 programming language, the Jikes Research Virtual Machine for the Java language, the ASTI optimizer used in IBM's XL Fortran product compilers, the PTRAN automatic parallelization system, and profile-directed partitioning and scheduling of Sisal programs. He is in the process of starting up the Habanero Multicore Software project at Rice University, which spans the areas of programming languages, optimizing and parallelizing compilers, virtual machines, and concurrency libraries for homogeneous and heterogeneous multicore processors.
Vivek became a member of the IBM Academy of Technology in 1995, an ACM Distinguished Scientist in 2006, and the E.D. Butcher Professor of Computer Science at Rice University in 2007. Prior to joining Rice University in July 2007, Professor Sarkar was Senior Manager of Programming Technologies at IBM Research. His responsibilities at IBM included leading IBM's research efforts in Programming Model, Tools, and Productivity in the PERCS project during 2002-2007, as part of the DARPA High Productivity Computing Systems program. Vivek holds a B.Tech. degree from the Indian Institute of Technology, Kanpur, an M.S. degree from the University of Wisconsin-Madison, and a Ph.D. from Stanford University. In 1997, he was on sabbatical as a visiting associate professor at MIT, where he was a founding member of the MIT RAW multicore project.

Y. N. Srikant
Title: Energy-aware Compiler Optimizations
Abstract: The importance of saving energy in modern times need hardly be stressed. With the prolific increase in the usage of processors in embedded systems of all types, there is a strong requirement to make batteries last even longer than before, so that the devices which use them can operate longer without changing batteries. The role of compilers in producing energy-efficient code is a rather important one. Compilers can aid the hardware techniques already available for reducing energy consumption.
In this talk, I will describe some of the techniques available today to the compiler writer for minimizing energy consumption without much performance penalty. These include instruction scheduling to reduce leakage energy consumption and to reduce energy consumption in the interconnects, as well as dynamic voltage scaling.
Bio: Y.N. Srikant received his B.E. in Electronics from Bangalore University, and his M.E. and Ph.D. in Computer Science from the Computer Science and Automation department at the Indian Institute of Science. His area of interest is compiler design. He is the co-editor of a handbook on advanced compiler design published by CRC Press in 2002 (currently under revision).

Josep Torrellas
Title: Lessons Learned in Designing Speculative Multithreaded Hardware
Abstract: Perhaps the biggest challenge facing computer architects today is how to design parallel architectures that make it easy for programmers to write parallel codes. In this talk, I will summarize the lessons learned in the past 10 years as we examined the design of multiprocessors with speculative multithreading. I will discuss the uses of this technology for performance (Thread-Level Speculation, Speculative Synchronization, Cherry, Bulk, and BulkSC), hardware reliability (Paceline), and software dependability (ReEnact, ReSlice, and iWatcher).
Bio: Josep Torrellas (http://iacoma.cs.uiuc.edu) is a Professor and Willett Faculty Scholar at the University of Illinois. Prior to being at Illinois, Torrellas received a PhD from Stanford University. He also spent a sabbatical year as Research Staff Member at IBM's T.J. Watson Research Center. Torrellas's research area is multiprocessor computer architecture, focusing on speculative multithreading, multiprocessor organization, integration of processors and memory, and architectural support for software dependability and hardware reliability. He has been involved in the Stanford DASH and the Illinois Cedar multiprocessor projects, and led the Illinois Aggressive COMA and FlexRAM Intelligent Memory projects. He has published over 150 papers in computer architecture. Torrellas is an IEEE Fellow and the Chairman of the IEEE Technical Committee on Computer Architecture. He received an NSF Young Investigator Award.

David Wood
Title: Performance Pathologies in Hardware Transactional Memory Systems
Abstract: Hardware Transactional Memory (HTM) systems reflect choices from three key design dimensions: conflict detection, version management, and conflict resolution. Previously proposed HTMs represent three points in this design space: lazy conflict detection, lazy version management, committer wins (LL); eager conflict detection, lazy version management, requester wins (EL); and eager conflict detection, eager version management, and requester stalls with conservative deadlock avoidance (EE).
To isolate the effects of these high-level design decisions, we develop a common framework that abstracts away differences in cache write policies, interconnects, and ISA to compare these three design points. Not surprisingly, the relative performance of these systems depends on the workload. Under light transactional loads they perform similarly, but under heavy loads they differ by up to 80%. None of the systems performs best on all of our benchmarks.
We identify seven performance pathologies--interactions between workload and system that degrade performance--as the root cause of many performance differences: FriendlyFire, StarvingWriter, SerializedCommit, FutileStall, StarvingElder, RestartConvoy, and DuelingUpgrades. We discuss when and on which systems these pathologies can occur and show that they actually manifest within TM workloads. The insight provided by these pathologies motivated four enhanced systems that often significantly reduce transactional memory overhead. Importantly, by avoiding transaction pathologies, each enhanced system performs well across our suite of benchmarks.
Bio: Prof. David A. Wood is a Professor and Romnes Fellow in the Computer Sciences Department at the University of Wisconsin, Madison. Dr. Wood also holds a courtesy appointment in the Department of Electrical and Computer Engineering. Dr. Wood received a B.S. in Electrical Engineering and Computer Science (1981) and a Ph.D. in Computer Science (1990), both at the University of California, Berkeley. He joined the faculty at the University of Wisconsin in 1990.
Dr. Wood was named an ACM Fellow (2005) and IEEE Fellow (2004), received the University of Wisconsin's H.I. Romnes Faculty Fellowship (1999), and received the National Science Foundation's Presidential Young Investigator award (1991). Dr. Wood is Area Editor (Computer Systems) of ACM Transactions on Modeling and Computer Simulation, is Associate Editor of ACM Transactions on Architecture and Code Optimization, served as Program Committee Chairman of ASPLOS-X (2002), and has served on numerous program committees. He is also a member of the IEEE Computer Society. Dr. Wood has published over 70 technical papers and is an inventor on eleven U.S. and international patents.
Dr. Wood co-leads the Wisconsin Multifacet project (http://www.cs.wisc.edu/multifacet) with Prof. Mark Hill, which explores techniques for improving the availability, designability, programmability, and performance of commercial multiprocessor and chip multiprocessor servers.