Apache Beam is an open-source, unified programming model for defining and executing parallel data processing pipelines. Its power lies in its ability to run both batch and streaming pipelines, with execution carried out by one of Beam's supported distributed processing back-ends: Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow. Apache Beam is useful for ETL (Extract, Transform, and Load) tasks such as moving data between different storage media and data sources, transforming data into a more desirable format, and loading data onto a new system.
In this instructor-led, live training (onsite or remote), participants will learn how to implement the Apache Beam SDKs in a Java or Python application that defines a data processing pipeline for decomposing a big data set into smaller chunks for independent, parallel processing.
By the end of this training, participants will be able to:
- Install and configure Apache Beam.
- Use a single programming model to carry out both batch and stream processing from within their Java or Python application.
- Execute pipelines across multiple environments.
Format of the Course
- Part lecture, part discussion, exercises and heavy hands-on practice
Note
- This course will be available in Scala in the future. Please contact us to arrange.
Introduction
- Apache Beam vs MapReduce, Spark Streaming, Kafka Streams, Storm and Flink
Installing and Configuring Apache Beam
Overview of Apache Beam Features and Architecture
- Beam Model, SDKs, Beam Pipeline Runners
- Distributed processing back-ends
Understanding the Apache Beam Programming Model
- How a pipeline is executed
Running a sample pipeline
- Preparing a WordCount pipeline
- Executing the Pipeline locally
Designing a Pipeline
- Planning the structure, choosing the transforms, and determining the input and output methods
Creating the Pipeline
- Writing the driver program and defining the pipeline
- Using Apache Beam classes
- Data sets, transforms, I/O, data encoding, etc.
Executing the Pipeline
- Executing the pipeline locally, on remote machines, and on a public cloud
- Choosing a runner
- Runner-specific configurations
Testing and Debugging Apache Beam
- Using type hints to emulate static typing
- Managing Python Pipeline Dependencies
Processing Bounded and Unbounded Datasets
Making Your Pipelines Reusable and Maintainable
Creating New Data Sources and Sinks
- Apache Beam Source and Sink API
Integrating Apache Beam with other Big Data Systems
- Apache Hadoop, Apache Spark, Apache Kafka
Troubleshooting
Summary and Conclusion