ScholarQuill logoScholarQuillUniversity Notes
  • Notes
  • Past Papers
  • Blogs
  • Todo
Login
ScholarQuill logoScholarQuillUniversity Notes
Login
NotesPast PapersBlogsTodo
More
SubjectsDiscussionCGPA CalculatorGPA CalculatorStudent PortalCourse Outline
About
About usPrivacy PolicyReportContact
Notes
Past Papers
Blogs
Todo
Analytics
    Current Subject
    🧩
    Parallel & Distributed Computing
    COMP3139
    Progress0 / 33 topics
    Topics
    1. Introduction to Parallel and Distributed Systems2. Why Use Parallel and Distributed Systems?3. Speedup and Amdahl's Law4. Hardware Architectures: Multi Processors (Shared Memory)5. Hardware Architectures: Networks of Workstations (Distributed Memory)6. Hardware Architectures: Clusters (Latest Variation)7. Software Architectures: Threads and Shared Memory8. Software Architectures: Processes and Message Passing9. Software Architectures: Distributed Shared Memory (DSM)10. Software Architectures: Distributed Shared Data (DSD)11. Parallel Algorithms12. Concurrency and Synchronization13. Data and Work Partitioning14. Common Parallelization Strategies15. Granularity16. Load Balancing17. Examples of Parallel Algorithms: Parallel Search18. Examples of Parallel Algorithms: Parallel Sorting19. Shared-Memory Programming20. Threads in Shared-Memory Programming21. P Threads22. Locks and Semaphores23. Distributed-Memory Programming24. Message Passing25. Map Reduce26. Distributed-Memory Programming with PI27. Google's Map Reduce28. Hadoop29. Other Parallel Programming Systems30. Tread Marks31. Distributed Shared Memory32. Aurora: Scoped Behavior and Abstract Data Types33. S Enterprise: Process Templates
    COMP3139›Google's Map Reduce
    Parallel & Distributed ComputingTopic 27 of 33

    Google's Map Reduce

    8 minread
    1,435words
    Intermediatelevel

    Google's MapReduce: A Powerful Parallel Programming Model

    MapReduce is a distributed computing model introduced by Google for processing large-scale data across a distributed system. It allows for efficient processing of massive datasets by splitting the work across many machines (nodes), and it abstracts away the complexities of parallelism and fault tolerance, making it easier for developers to write scalable, parallel applications.

    In MapReduce, the computation is broken into two phases:

    1. Map: The input data is divided into smaller chunks, and a function is applied to each chunk independently and in parallel.
    2. Reduce: The output of the map phase is aggregated or processed further by a reduction function.

    The MapReduce Programming Model

    The MapReduce model is based on two core operations: Map and Reduce. Let's break these down:

    1. Map Phase:

    • The input data is split into key-value pairs, and each key-value pair is processed by the map function.
    • The map function is applied to each chunk of data independently, in parallel across many machines.
    • Each map function processes a piece of the input and outputs intermediate key-value pairs. These intermediate pairs are not necessarily ordered and can be processed in parallel by the next stage.

    Example:

    Consider processing a large set of text files to count the number of occurrences of each word. In the Map phase, the input data is divided into smaller pieces, and each piece is processed by the map function. The map function generates pairs like this:

    ("apple", 1)
    ("banana", 1)
    ("apple", 1)
    

    Each word is paired with the value 1 indicating that the word occurred once.

    2. Shuffle and Sort Phase:

    • After the map phase, the shuffle and sort phase groups all the intermediate key-value pairs by their key.
    • For example, all pairs with the key "apple" will be grouped together, and all pairs with the key "banana" will be grouped together. This is done by the system and is not part of the user’s code.

    3. Reduce Phase:

    • In the reduce phase, the reducer processes the sorted intermediate key-value pairs. The reducer function aggregates, summarizes, or otherwise transforms the data.
    • The input to the reduce function is a key and a list of values for that key.
    • In our word count example, the reducer would sum up the values for each word to get the total count.

    For example, the reducer might receive:

    ("apple", [1, 1])
    ("banana", [1])
    

    The output from the reducer would be:

    ("apple", 2)
    ("banana", 1)
    

    This means "apple" appeared 2 times and "banana" appeared 1 time in the input dataset.


    Example of a Word Count Program Using MapReduce

    Here is a high-level overview of how a simple word count program would work using MapReduce:

    1. Input: Large collection of text files.
    2. Map Phase:
      • The text files are split into smaller chunks.
      • For each chunk, the map function takes the text and produces key-value pairs where the key is a word and the value is 1.
    3. Shuffle and Sort:
      • The key-value pairs are grouped by key (i.e., word), so all the occurrences of the same word are grouped together.
    4. Reduce Phase:
      • The reducer takes each word and the list of all its occurrences (values), sums them up, and produces a total count for that word.

    Code Example (Pseudo-Code):

    def map_function(key, value):
        # key: document identifier, value: text content
        words = value.split()  # Split the text into words
        for word in words:
            yield word, 1  # Emit each word with a count of 1
    
    def reduce_function(key, values):
        # key: word, values: list of 1s
        return key, sum(values)  # Sum the counts for each word
    

    The map function takes each chunk of data (typically a text file or line) and emits the word as the key with a value of 1. The reduce function sums the values for each word.


    Why MapReduce?

    MapReduce was designed to handle massive amounts of data efficiently. Its primary advantages include:

    1. Scalability:

      • MapReduce allows you to scale the computation across many machines in a cluster. As the size of the data grows, you can simply add more machines to handle the additional workload.
    2. Fault Tolerance:

      • MapReduce handles failures automatically. If a machine fails during processing, the system can restart the task on another machine. The data is typically stored in a distributed file system (like HDFS), which is replicated across multiple machines for fault tolerance.
    3. Simplicity:

      • Developers do not need to worry about low-level parallelism, synchronization, or fault tolerance. The MapReduce framework handles these aspects. Developers only need to define the map and reduce functions.
    4. Data Locality:

      • In systems like Hadoop, the framework attempts to run tasks on machines where the data is already located (data locality), reducing network overhead.
    5. Abstraction:

      • MapReduce abstracts away the complexity of managing parallel execution, task distribution, and load balancing, allowing developers to focus on the problem at hand.

    Google's MapReduce in Action: Google Search Indexing

    One of the most famous applications of MapReduce is in Google Search. To build an index for the web, Google needs to process massive amounts of data (webpages). Using MapReduce, Google can parallelize the process of crawling, parsing, and indexing web pages:

    1. Map Phase:

      • Google crawls millions of web pages. Each page is processed by a map function, which extracts key pieces of information like the words and their positions on the page.
    2. Shuffle and Sort:

      • The intermediate data (word, position) pairs are shuffled and sorted to group words that occur across the entire web.
    3. Reduce Phase:

      • The reduce phase consolidates the results to create an index where each word points to the documents and positions where it appears.

    This process can be distributed across many machines, allowing Google to build an index of the entire web in parallel.


    Key Concepts in Google's MapReduce

    1. Input Data:

      • Typically, data is stored in a distributed file system (like HDFS in Hadoop). Each file is split into smaller blocks, which are processed by different nodes.
    2. Key-Value Pairs:

      • Both the map and reduce functions operate on key-value pairs. In the map phase, the input data is split into chunks of key-value pairs, and in the reduce phase, the results are aggregated by key.
    3. Fault Tolerance:

      • If a node in the cluster fails during computation, the task is automatically reassigned to another node. Additionally, the input data is replicated to avoid data loss.
    4. Parallelism:

      • Each map function operates independently on its portion of the data, and each reduce function operates on a different set of keys. This parallel execution model enables massive scalability.

    Google’s MapReduce Implementation: Hadoop

    Hadoop is an open-source implementation of the MapReduce model, and it was inspired by Google’s MapReduce framework. Hadoop includes:

    • HDFS (Hadoop Distributed File System): A distributed file system that stores data across many machines, ensuring fault tolerance and scalability.
    • YARN (Yet Another Resource Negotiator): A resource management layer that schedules and allocates resources to various tasks.
    • MapReduce Framework: A programming model and implementation for processing large datasets in parallel.

    Hadoop enables processing of petabytes of data across thousands of machines, making it suitable for big data tasks like data analytics, machine learning, and log processing.


    Limitations of MapReduce

    1. Iterative Algorithms:

      • MapReduce is not well-suited for tasks that require multiple iterations, such as certain machine learning algorithms. Each iteration requires a full Map-Reduce cycle, which can be inefficient.
    2. Latency:

      • The overhead of disk I/O (reading from and writing to HDFS) and the need to shuffle data between mappers and reducers can introduce significant latency, especially for smaller datasets or tasks that require many intermediate steps.
    3. Complexity for Non-MapReduce Tasks:

      • MapReduce is ideal for tasks that fit the model of parallel data processing (e.g., word count, sorting), but it can be difficult to implement algorithms that don’t naturally map to this model.

    Alternatives to MapReduce: Apache Spark

    While MapReduce was revolutionary, newer frameworks like Apache Spark have emerged, addressing some of its limitations:

    1. In-Memory Processing:

      • Unlike Hadoop MapReduce, Spark performs in-memory processing, which can be much faster since it avoids the need to write intermediate results to disk.
    2. Iterative Processing:

      • Spark is much better suited for iterative algorithms (e.g., machine learning), as it retains data in memory across iterations.
    3. Ease of Use:

      • Spark provides high-level APIs in languages like Python, Scala, and Java, making it easier to write complex applications compared to Hadoop’s lower-level API.

    Conclusion

    Google's MapReduce revolutionized how we process large-scale data in a distributed system. It introduced an elegant parallel programming model that abstracts away the

    Previous topic 26
    Distributed-Memory Programming with PI
    Next topic 28
    Hadoop

    Past Papers

    Open this section to load past papers

    Click on Show Past Papers to see past papers.
    On This Page
      Reading Stats
      Est. reading time8 min
      Word count1,435
      Code examples0
      DifficultyIntermediate