Tag: technews

  • The best mirrorless cameras for 2024

    Steve Dent

    After years of decline due to smartphones, it looks like the camera market is on the upswing — with Canon, for one, seeing solid growth in 2023. And as with 2022, we saw numerous new models arrive last year from Sony, Canon, Fujifilm, Nikon and Panasonic, featuring faster speeds, better focus, improved video and more. Most of those cost more than $1,000 and many are over $2,000 — it’s a profitable market coveted by manufacturers.

    If you’re looking to buy one for YouTube creation, travel videos, sports, family photos and more, but aren’t sure which one to get, we’re here to help. In this guide, you’ll learn about all the latest models and most importantly, which camera can do what you want it to do.

    Why a mirrorless camera over my smartphone?

    To learn more about mirrorless tech and why it’s taken over the camera world, check out our previous camera guide for an explanation, or watch our Upscaled video on the subject for an even deeper dive.

    Why get a camera when my smartphone takes great photos, you may ask? In a word, physics. The larger sensors in mirrorless cameras let in more light, and you have a wide choice of lenses with far superior optics. Where smartphone lenses have a single fixed aperture, camera lenses offer a range of f-stops, which gives you more exposure control. You also get natural rather than AI-generated bokeh, quicker shooting, a physical shutter, more professional video results, and so on. Smartphones do have impressive AI skills that help make photography easier, but that’s about it.

    Today, mirrorless is the best way to go if you’re shopping for a new camera. Both Canon and Nikon recently announced they’re discontinuing development of new DSLRs, simply because most of the advantages of that category are gone, as I detailed in a recent video. With all their R&D now in mirrorless, that’s where you’ll find the most up-to-date tech.

    Compact cameras still exist as a category, but barely. Panasonic has built a number of good models in the past, but recently said it would focus only on video-centric mirrorless models going forward. And we haven’t seen any new ones from Canon or Nikon lately, either. Only Sony and Fujifilm are still carrying the compact torch, the latter with its $1,400 X100V model, which has become famously hard to find. Most of Sony’s recent compact models, like the ZV-1F, are designed for vloggers.

    What to look for in a mirrorless camera

    Sensor size

    Now, let’s talk about features you need in a mirrorless camera. The one that affects your photography (and budget) the most is sensor size. The largest is medium format, but that’s only used on niche and expensive cameras from Hasselblad, Fujifilm and Leica, so we’ll skip over those for this guide. (See my Fujifilm GFX 100S and Hasselblad X2D reviews for more.)

    The most expensive category we’ll be discussing here is full-frame, largely used by pros and serious amateurs. Models are available from all the major brands except Fujifilm, including Sony, Canon, Nikon and Panasonic. That format offers the best image quality, low light capability and depth of field, with prices starting around $1,000. With the right lenses, you can get beautifully blurred backgrounds, but autofocus is more critical. Lenses are also more expensive.

    Down one size are APS-C cameras, offered on Fujifilm, Sony, Nikon and Canon models. Cameras and lenses are cheaper than full-frame, but you still get nice blurred “bokeh,” decent low-light shooting capability and relatively high resolution. With a sensor size equivalent to 35mm movie film, it’s ideal for shooting video.

    Micro Four Thirds, used by Panasonic and Olympus, is the smallest mainstream sensor size for mirrorless cameras. It offers less dramatic bokeh and light-gathering capability than APS-C, but allows for smaller and lighter cameras and lenses. For video, it’s harder to blur the background to isolate your subject, but focus is easier to control.

    Fujifilm X-T4 APS-C sensor
    Steve Dent/Engadget

    Sensor resolution

    The next thing to consider is sensor resolution. High-res cameras like Sony’s 61-megapixel full-frame A7R V or Fujifilm’s 40-megapixel APS-C X-H2 deliver detailed images – but the small pixels mean they’re not ideal for video or low-light shooting. Lower-resolution models like Panasonic’s 10.3-megapixel GH5s or Sony’s 12.1-megapixel A7S III excel at video and high-ISO shooting, but lack detail for photos.

    Image quality

    Image quality is subjective, but different cameras do produce slightly different results. Some photographers prefer the skin tones from Canon while others like Fujifilm’s colors. It’s best to check sample photos to see which model best suits your style.

    Handling

    What about handling? The Fujifilm X-T5 has lots of manual dials to access shooting controls, while Sony’s A6600 relies more on menus. The choice often depends on personal preferences, but manual dials and buttons can help you find settings more easily and shoot quicker. For heavy lenses, you need a camera with a big grip.

    Nikon Z7 II Engadget camera guide
    Steve Dent/Engadget

    Video quality

    Video is more important than ever. Most cameras deliver at least 4K at 30 frames per second, but some models now offer 4K at up to 120p, with 6K and even 8K resolution. If you need professional-looking results, choose a camera with 10-bit or even RAW capability, along with log profiles to maximize dynamic range.

    Stabilization

    In-body stabilization, which keeps the camera steady even if you move, is another important option for video and low-light photography. You’ll also want to consider the electronic viewfinder (EVF) specs. High resolutions and refresh rates make judging shots easier, particularly in sunny environments.

    General design and extra features

    Other important features include displays that flip up or around for vlogging or selfie shots, along with things like battery life, the number and type of memory card slots, the ports and wireless connectivity. Lens selection is also key, as some brands like Sony have more choice than others. For most of our picks, keep in mind that you’ll need to buy at least one lens.

    Engadget picks

    Now, let’s take a look at our top camera picks for 2024. We’ve divided the selection into four budget categories: under $800, under $1,500, under $2,500 and over $2,500. We chose those price categories because many recent cameras slot neatly into them. Manufacturers have largely abandoned the low end of the market, so there are very few mirrorless models under $500.

    Best mirrorless cameras under $800

    My top pick in the budget category remains Canon’s $680 24.2-megapixel R50, an impressive model considering the price. It can shoot bursts at up to 15 fps in electronic shutter mode and it offers 4K 10-bit at up to 30p with supersampling and no crop. It has a fully articulating display, and unlike other cameras in this category, an electronic viewfinder. It uses Canon’s Dual Pixel AF with subject recognition mode, and even has a popup flash. The only drawback is the lack of a decent-quality lens that’s as affordable as the camera itself.

    Your next best option is an older model, the 20.7-megapixel Olympus OM-D E-M10 Mark IV, as it offers the best mix of photography and video features. You get up to 15 fps shooting speeds, 4K 30p or HD 120p video, and it’s one of the few cameras in this price category with built-in five-axis stabilization. It’s portable and lightweight for travel, and the lenses are compact and affordable. The drawbacks are an autofocus system that’s not as fast or accurate as the competition, and a small sensor size.

    If you’re a creator, Sony’s 24.2-megapixel ZV-E10 is a strong budget option. It can shoot sharp, downsampled 4K video at up to 30 fps with a 1.23x crop (or 1080p at 120 fps) and uses Sony’s fantastic AI-powered autofocus system with face and eye detection. It also has a few creator-specific features like Product Showcase and a bokeh switch that makes the background as blurry as possible so your subject stands out. Another nice feature is the high-quality microphone that lets you vlog without needing an external mic. The main drawbacks are the lack of an EVF and rolling shutter.

    Another good creator option that’s better for photography is Panasonic’s Lumix G100. As with the ZV-E10, it can shoot 4K video at 30 fps (cropped 1.47x), though 1080p is limited to 60 fps. Unlike its Sony rival, though, the G100 has a 3.68-million dot EVF and 10 fps shooting speeds. Other features include a fully-articulating display, and 5-axis hybrid image stabilization.

    Honorable mentions go to two models, starting with Nikon’s 20.9-megapixel APS-C Z30, another mirrorless camera designed for vloggers and creators. It offers 4K using the full width of the sensor, 120fps slow mo at 1080p, a flip-out display and AI powered hybrid phase-detect AF. The drawbacks are the lack of an EVF and autofocus that’s not up to Sony’s standards. And finally, another good budget option is the Canon EOS M50 Mark II, a mildly refreshed version of the M50 with features like a flip-out screen, tap-to-record and focus, plus 4K video with a 1.5x crop.

    Best mirrorless cameras under $1,500

    My new top pick here is Sony’s 26-megapixel APS-C A6700, thanks to the excellent autofocus, high speeds, great image quality and video capabilities. You can shoot bursts at up to 12 fps with continuous autofocus, and it offers solid low-light performance and excellent image quality, with 14 stops of dynamic range. It also offers incredible video performance, with many features borrowed from Sony’s FX30, like 10-bit S-Log3 quality and in-body stabilization.

    Full-frame cameras generally used to start at $2,000 and up, but now there are two new models at $1,500. The best by far is Canon’s brand new EOS R8 – basically an R6-II lite. It has Canon’s excellent Dual Pixel AF with subject recognition AI, and can shoot bursts at up to 40 fps. It’s equally strong with video, supporting oversampled 10-bit 4K at up to 60 fps. The R8 also offers a flip-out display, making it great for vloggers. The main drawback is a lack of in-body stabilization.

    Another solid choice is Canon’s 32.5-megapixel APS-C EOS R7. It offers very fast shooting speeds up to 30 fps using the electronic shutter, high-resolution images that complement skin tones, and excellent autofocus. It also delivers sharp 4K video with 10 bits of color depth, marred only by excessive rolling shutter. Other features include 5-axis in-body stabilization, dual high-speed card slots, good battery life and more.

    A better choice for video is Panasonic’s full-frame S5. It’s one of the least expensive full-frame cameras available, but still offers 10-bit 4K 60p log video. It also offers effective image stabilization, dual high-speed card slots and a flip-out screen. The main negative is its contrast-detect autofocus system.

    Several cameras are worthy of honorable mention in this category, including Canon’s 30.3-megapixel EOS R, still a great budget option for 4K video and particularly photography despite being released over four years ago. Other good choices include the fast and pretty Olympus OM-D E-M5 III and Sony’s A6600, which offers very fast shooting speeds and the best autofocus in its class. Finally, Nikon’s 24.3-megapixel Z5 is another good choice for a full-frame camera in this price category, particularly for photography, as it delivers outstanding image quality.

    Best mirrorless cameras under $2,500

    You’ll find the most options in this price range, with the Sony A7 IV leading the charge once again. Resolution is up considerably from the 24-megapixel A7 III to 33 megapixels, with image quality much improved overall. Video is now up to par with rivals, offering 4K at up to 60p in 10-bit 4:2:2 quality. Autofocus is incredible for both video and stills, and the in-body stabilization does a good job. The biggest drawbacks are rolling shutter that limits the use of the electronic shutter, plus the relatively high price.

    If you want to spend a bit less, Sony’s new A7C II is a compact version of the A7 IV, with the same 33-megapixel sensor. Autofocus is slightly better though, because the A7C II has a recent AI processing unit that’s missing on the A7 IV. Otherwise, features are much the same.

    The next best option is the EOS R6 II, Canon’s new mainstream hybrid mirrorless camera that offers a great mix of photography and video features. The 24.2-megapixel sensor delivers more detail than the previous model, and you can now shoot RAW stills at up to 40 fps in electronic shutter mode. Video specs are equally solid, with full sensor 4K supersampled from 6K at up to 60 fps. Autofocus is quick and more versatile than ever thanks to expanded subject detection. It’s still not quite up to Sony’s standards, though, and the micro HDMI port and lack of a CFexpress slot aren’t ideal.

    If you’re OK with a smaller APS-C sensor, check out the Fujifilm X-H2S. It has an incredibly fast stacked, backside-illuminated 26.1-megapixel sensor that allows for rapid burst shooting speeds of 40 fps, along with 4K 120p video with minimal rolling shutter. It can capture ProRes 10-bit video internally, has 7 stops of in-body stabilization and a class-leading EVF. Yes, it’s expensive for an APS-C camera at $2,500, but on the other hand, it’s the cheapest stacked sensor camera out there. The other downside is AF that’s not quite up to Canon and Sony’s level.

    Video shooters should look at Panasonic’s full-frame S5 II and S5 IIX. They’re the company’s first cameras with hybrid phase-detect AF, designed to make focus “wobble” and other issues a thing of the past. You can shoot sharp 4K 30p video downsampled from the full sensor width, or 4K 60p from an APS-C cropped size, all in 10-bit color. It even offers 5.9K 30p capture, along with RAW 5.9K external output to an Atomos recorder. You also get a flip-out screen for vlogging and updated five-axis in-body stabilization that’s the best in the industry. Photo quality is also good thanks to the dual-gain 24-megapixel sensor. The main drawback is the relatively slow burst speed.

    The best value in a recent camera is the Fujifilm X-T5. It offers a 40-megapixel APS-C sensor, 6.2K 30p and 10-bit 4K 60p video, 7-stop image stabilization, and shooting speeds up to 20 fps. It’s full of mechanical dials and buttons with Fujifilm’s traditional layout. The downsides are a tilt-only display and an autofocus system that can’t keep up with Sony’s and Canon’s. If you want better video specs for a bit more money, Fuji’s X-H2 has the same sensor as the X-T5 but offers 8K 30p video and a flip-out display.

    Honorable mentions in this category go to the brand-new $2,000 Nikon Zf, Nikon’s most powerful throwback camera yet. It offers excellent image quality with some of the highest dynamic range, along with solid video specs and great handling.

    Best mirrorless cameras over $2,500

    Finally, here are the best cameras if the sky’s the limit in terms of pricing. My new top pick in this department is the new Nikon Z8, as it offers most of the features of the incredible Z9 for a much lower price. As with the latter, it has a 45.7MP stacked sensor that’s so fast it doesn’t require a mechanical shutter, Nikon’s best autofocus by far and outstanding image quality. Video is top notch as well, with 8K 30p internally and 8K 60p RAW via the HDMI port. The main drawbacks are the lack of an articulating display and high price, but it’s a great option if you need speed, resolution and high-end video capabilities.

    Sony’s 50-megapixel stacked sensor A1 is perhaps a bit better camera, but it costs far more. It rules in performance, though, with 30 fps shooting speeds and equally quick autofocus that rarely misses a shot. It backs that up with 8K and 4K 120p video shooting, built-in stabilization and the fastest, highest-resolution EVF on the market. The only real drawbacks are the lack of a flip-out screen and, of course, that price.

    If speed and the latest technology are paramount, the Sony A9 III is the first full-frame camera with a global shutter and it just went on sale. It offers incredible speed with 120 fps bursts, along with pro-level video specs and, best of all, no rolling shutter whatsoever.

    Tied for the next positions are Sony’s A7S III and A7R V. With a 61-megapixel sensor, the A7R V shoots sharp and beautiful images at a very respectable speed for such a high-resolution model (10 fps). It has equally fast and reliable autofocus, the sharpest viewfinder on the market and in-body stabilization that’s much improved over the A7R IV. Video has even improved, with 8K and 10-bit options now on tap, albeit with significant rolling shutter. If you don’t need the video, however, Sony’s A7R IVa does mostly the same job, photo-wise, and costs a few hundred dollars less.

    The 12-megapixel A7S III, meanwhile, is the best dedicated video camera, with outstanding 4K video quality at up to 120 fps, a flip-out display and category leading autofocus. It also offers 5-axis in-body stabilization, a relatively compact size and great handling. While the 12-megapixel sensor doesn’t deliver a lot of photo detail, it’s the best camera for low-light shooting, period.

    And if you want a mirrorless sports camera, check out Canon’s 24-megapixel EOS R3. It can shoot bursts at up to 30 fps with autofocus enabled, making it great for any fast-moving action. It’s a very solid option for video too, offering 6K at up to 60 fps in Canon’s RAW LT mode, or 4K at 120 fps. Canon’s Dual Pixel autofocus is excellent, and it offers 8 stops of shake reduction, a flip-out display and even eye detection autofocus. The biggest drawback for the average buyer is the $6,000 price, so it’s really aimed at professionals as a replacement for the 1DX Mark III DSLR.

    Honorable mention goes to Canon’s 45 megapixel EOS R5. For a lot less money, it nearly keeps pace with the A1, thanks to the 20 fps shooting speeds and lightning fast autofocus. It also offers 8K and 4K 120p video, while besting Sony with internal RAW recording. The big drawback is overheating, as you can’t shoot 8K longer than 20 minutes and it takes a while before it cools down enough so that you can start shooting again. Another solid option is Panasonic’s S1H, a Netflix-approved mirrorless camera that can handle 6K video and RAW shooting. If you need more resolution and want the best build quality possible, Nikon’s 45.7-megapixel Z9 has similar features to the Z8 but is built more solidly and is a better studio camera.

    You’re now caught up. New models have been arriving thick and fast, including Canon’s long-rumored flagship R1. We’ll have full coverage of those when they arrive, so stay glued to Engadget.com for the latest updates.

    This article originally appeared on Engadget at https://www.engadget.com/best-mirrorless-cameras-133026494.html?src=rss

  • Get two years of NordPass Premium for only $35

    Kris Holt

    It should go without saying that you really need to have a unique, complex password for every account and service you use. Keeping track of all those credentials manually would be an onerous task, which is why everyone could benefit from having a password manager. NordPass is one of our favorite password managers and the Premium plan is currently on sale. In particular, the two-year plan is 56 percent off at $35, plus you’ll get an extra three months of access at no additional cost.

    The free version of the service allows you to autosave and autofill passwords, keys and credit card details. Opt for Premium and you’ll get a bunch more features for a reasonable price. For one thing, you’ll be able to remain logged into NordPass when you switch devices and attach files to items you have stored.

    You’ll be able to mask your email address every time a website asks you to submit one. Given that the app uses a unique mask each time, you’ll reduce the risk of having your email exposed if there’s a breach. On that note, NordPass Premium can scour the web for data breaches to check whether your personal information was exposed. The app can also pick up on weak or reused passwords and prompt you to change them.

    In addition, NordVPN is running a sale on its products, with up to 67 percent off two-year plans. One big benefit of plumping for an Ultimate plan is that it includes NordPass. Two years of access will cost you $153. Our main reservations about NordVPN are that the prices of its plans are too high and it doesn’t have as many features as competing VPNs that Engadget has tested. Still, the discount might be enough to make it worthwhile for you to start using the service.

    Follow @EngadgetDeals on Twitter and subscribe to the Engadget Deals newsletter for the latest tech deals and buying advice.

    This article originally appeared on Engadget at https://www.engadget.com/get-two-years-of-nordpass-premium-for-only-35-154552026.html?src=rss

  • End-to-End Data Engineering System on Real Data with Kafka, Spark, Airflow, Postgres, and Docker

    Hamza Gharbi

    This article is part of a project that’s split into two main phases. The first phase focuses on building a data pipeline. This involves getting data from an API and storing it in a PostgreSQL database. In the second phase, we’ll develop an application that uses a language model to interact with this database.

    Ideal for those new to data systems or language model applications, this project is structured into two segments:

    • This initial article guides you through constructing a data pipeline utilizing Kafka for streaming, Airflow for orchestration, Spark for data transformation, and PostgreSQL for storage. To set up and run these tools, we will use Docker.
    • The second article, which will come later, will delve into creating agents using tools like LangChain to communicate with external databases.

    This first part of the project is ideal for beginners in data engineering, as well as for data scientists and machine learning engineers looking to deepen their knowledge of the entire data handling process. Using these data engineering tools firsthand is beneficial. It helps in refining the creation and expansion of machine learning models, ensuring they perform effectively in practical settings.

    This article focuses more on practical application rather than theoretical aspects of the tools discussed. For detailed understanding of how these tools work internally, there are many excellent resources available online.

    Overview

    Let’s break down the data pipeline process step-by-step:

    1. Data Streaming: Initially, data is streamed from the API into a Kafka topic.
    2. Data Processing: A Spark job then takes over, consuming the data from the Kafka topic and transferring it to a PostgreSQL database.
    3. Scheduling with Airflow: Both the streaming task and the Spark job are orchestrated using Airflow. While in a real-world scenario, the Kafka producer would constantly listen to the API, for demonstration purposes, we’ll schedule the Kafka streaming task to run daily. Once the streaming is complete, the Spark job processes the data, making it ready for use by the LLM application.

    All of these tools will be built and run using Docker and, more specifically, Docker Compose.

    Overview of the data pipeline. Image by the author.

    Now that we have a blueprint of our pipeline, let’s dive into the technical details!

    Local setup

    First, clone the GitHub repo on your local machine using the following command:

    git clone https://github.com/HamzaG737/data-engineering-project.git

    Here is the overall structure of the project:

    ├── LICENSE
    ├── README.md
    ├── airflow
    │   ├── Dockerfile
    │   ├── __init__.py
    │   └── dags
    │       ├── __init__.py
    │       └── dag_kafka_spark.py
    ├── data
    │   └── last_processed.json
    ├── docker-compose-airflow.yaml
    ├── docker-compose.yml
    ├── kafka
    ├── requirements.txt
    ├── spark
    │   └── Dockerfile
    └── src
        ├── __init__.py
        ├── constants.py
        ├── kafka_client
        │   ├── __init__.py
        │   └── kafka_stream_data.py
        └── spark_pgsql
            └── spark_streaming.py
    • The airflow directory contains a custom Dockerfile for setting up airflow and a dags directory to create and schedule the tasks.
    • The data directory contains the last_processed.json file which is crucial for the Kafka streaming task. Further details on its role will be provided in the Kafka section.
    • The docker-compose-airflow.yaml file defines all the services required to run airflow.
    • The docker-compose.yaml file specifies the Kafka services and includes a docker-proxy. This proxy is essential for executing Spark jobs through a docker-operator in Airflow, a concept that will be elaborated on later.
    • The spark directory contains a custom Dockerfile for spark setup.
    • src contains the python modules needed to run the application.

    To set up your local development environment, start by installing the required Python packages. The only essential package is psycopg2-binary. You have the option to install just this package or all the packages listed in the requirements.txt file. To install all packages, use the following command:

    pip install -r requirements.txt

    Next let’s dive step by step into the project details.

    About the API

    The API is RappelConso, from the French public services. It gives access to data relating to recalls of products declared by professionals in France. The data is in French and initially contains 31 columns (or fields). Some of the most important are:

    • reference_fiche (reference sheet): Unique identifier of the recalled product. It will act as the primary key of our Postgres database later.
    • categorie_de_produit (Product category): For instance food, electrical appliances, tools, means of transport, etc.
    • sous_categorie_de_produit (Product sub-category): For instance, we can have meat, dairy products or cereals as sub-categories of the food category.
    • motif_de_rappel (Reason for recall): Self-explanatory and one of the most important fields.
    • date_de_publication which translates to the publication date.
    • risques_encourus_par_le_consommateur which contains the risks that the consumer may encounter when using the product.
    • There are also several fields that correspond to different links, such as the link to the product image, the link to the distributors list, etc.

    You can see some examples and query the dataset records manually using this link.
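
    To make this concrete, here is a minimal sketch (not the project’s exact code) of how one page of records could be pulled from the API with the requests library. The endpoint URL, dataset identifier, parameter names and response structure below are assumptions for illustration; the real query logic lives in src/kafka_client/kafka_stream_data.py.

    import requests

    # Assumed endpoint and dataset id for the RappelConso data; adjust to the actual API you use.
    API_URL = "https://data.economie.gouv.fr/api/records/1.0/search/"


    def fetch_records(start_date: str, rows: int = 100, offset: int = 0) -> list:
        """Fetch one page of recall records published on or after start_date."""
        params = {
            "dataset": "rappelconso0",  # assumed dataset identifier
            "q": f"date_de_publication >= '{start_date}'",
            "rows": rows,
            "start": offset,
        }
        response = requests.get(API_URL, params=params, timeout=30)
        response.raise_for_status()
        # Assumes an Opendatasoft-style payload where each record's fields sit under "fields".
        return [rec["fields"] for rec in response.json().get("records", [])]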

    We refined the data columns in a few key ways:

    1. Columns like ndeg_de_version and rappelguid, which were part of a versioning system, have been removed as they aren’t needed for our project.
    2. We combined columns that deal with consumer risks — risques_encourus_par_le_consommateur and description_complementaire_du_risque — for a clearer overview of product risks.
    3. The date_debut_fin_de_commercialisation column, which indicates the marketing period, has been divided into two separate columns. This split allows for easier queries about the start or end of a product’s marketing.
    4. We’ve removed accents from all columns except for links, reference numbers, and dates. This is important because some text processing tools struggle with accented characters.

    For a detailed look at these changes, check out our transformation script at src/kafka_client/transformations.py. The updated list of columns is available in src/constants.py under DB_FIELDS.
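
    As a rough illustration of what these transformations can look like, here is a hedged sketch of two of them: removing accents with Python’s standard unicodedata module and splitting the marketing-period column. The helper names and the “du … au …” format are assumptions; the project’s actual logic is in src/kafka_client/transformations.py.

    import unicodedata


    def remove_accents(text: str) -> str:
        """Strip accents from a string, e.g. 'catégorie' -> 'categorie'."""
        normalized = unicodedata.normalize("NFKD", text)
        return "".join(ch for ch in normalized if not unicodedata.combining(ch))


    def split_marketing_period(value: str):
        """Split date_debut_fin_de_commercialisation into start and end dates.
        The 'du ... au ...' format is an assumption about the raw field."""
        start, _, end = value.removeprefix("du ").partition(" au ")
        return start.strip(), end.strip()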

    Kafka streaming

    To avoid sending all the data from the API each time we run the streaming task, we define a local json file that contains the last publication date of the latest streaming. Then we will use this date as the starting date for our new streaming task.

    To give an example, suppose that the latest recalled product has a publication date of November 22, 2023. If we assume that all of the recalled product records published before this date are already persisted in our Postgres database, we can stream the data starting from November 22. Note that there is an overlap, because we may have a scenario where we didn’t handle all of the data from the 22nd of November.

    The file is saved in ./data/last_processed.json and has this format:

    {"last_processed": "2023-11-22"}

    By default, the file contains an empty JSON object, which means that our first streaming task will process all of the API records, approximately 10,000 of them.

    Note that in a production setting this approach of storing the last processed date in a local file is not viable and other approaches involving an external database or an object storage service may be more suitable.
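
    For illustration, here is a minimal sketch of how that checkpoint file could be read and updated. The helper names are hypothetical; the real implementation in the repo may differ.

    import json
    from pathlib import Path

    LAST_PROCESSED_PATH = Path("./data/last_processed.json")


    def get_last_processed_date():
        """Return the last processed publication date, or None on the very first run."""
        data = json.loads(LAST_PROCESSED_PATH.read_text())
        return data.get("last_processed")


    def update_last_processed_date(date_str: str) -> None:
        """Persist the publication date of the latest streamed record."""
        LAST_PROCESSED_PATH.write_text(json.dumps({"last_processed": date_str}))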

    The code for the Kafka streaming can be found in ./src/kafka_client/kafka_stream_data.py. It primarily involves querying the data from the API, applying the transformations, removing potential duplicates, updating the last publication date and serving the data using the Kafka producer.
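
    As a hedged sketch of that flow (again, not the repo’s exact code), the streaming task might look like the snippet below, using the kafka-python producer. fetch_records and the checkpoint helpers are the hypothetical functions sketched earlier, and apply_transformations is an assumed wrapper around the transformation step.

    import json

    from kafka import KafkaProducer


    def stream() -> None:
        """Query the API since the last checkpoint and push new records to Kafka."""
        producer = KafkaProducer(
            bootstrap_servers="kafka:9092",  # internal listener; use localhost:9094 from the host
            value_serializer=lambda v: json.dumps(v).encode("utf-8"),
        )
        last_date = get_last_processed_date() or "1900-01-01"
        latest_date = last_date
        seen_ids = set()

        for record in fetch_records(start_date=last_date):
            record = apply_transformations(record)  # hypothetical wrapper around the transformations
            ref = record["reference_fiche"]
            if ref in seen_ids:  # drop duplicates within the batch
                continue
            seen_ids.add(ref)
            producer.send("rappel_conso", value=record)
            latest_date = max(latest_date, record["date_de_publication"])

        producer.flush()
        update_last_processed_date(latest_date)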

    The next step is to run the Kafka service defined in the docker-compose file below:

    version: '3'

    services:
      kafka:
        image: 'bitnami/kafka:latest'
        ports:
          - '9094:9094'
        networks:
          - airflow-kafka
        environment:
          - KAFKA_CFG_NODE_ID=0
          - KAFKA_CFG_PROCESS_ROLES=controller,broker
          - KAFKA_CFG_LISTENERS=PLAINTEXT://:9092,CONTROLLER://:9093,EXTERNAL://:9094
          - KAFKA_CFG_ADVERTISED_LISTENERS=PLAINTEXT://kafka:9092,EXTERNAL://localhost:9094
          - KAFKA_CFG_LISTENER_SECURITY_PROTOCOL_MAP=CONTROLLER:PLAINTEXT,EXTERNAL:PLAINTEXT,PLAINTEXT:PLAINTEXT
          - KAFKA_CFG_CONTROLLER_QUORUM_VOTERS=0@kafka:9093
          - KAFKA_CFG_CONTROLLER_LISTENER_NAMES=CONTROLLER
        volumes:
          - ./kafka:/bitnami/kafka

      kafka-ui:
        container_name: kafka-ui-1
        image: provectuslabs/kafka-ui:latest
        ports:
          - 8800:8080
        depends_on:
          - kafka
        environment:
          KAFKA_CLUSTERS_0_NAME: local
          KAFKA_CLUSTERS_0_BOOTSTRAPSERVERS: PLAINTEXT://kafka:9092
          DYNAMIC_CONFIG_ENABLED: 'true'
        networks:
          - airflow-kafka

    networks:
      airflow-kafka:
        external: true

    The key highlights from this file are:

    • The kafka service uses a base image bitnami/kafka.
    • We configure the service with only one broker which is enough for our small project. A Kafka broker is responsible for receiving messages from producers (which are the sources of data), storing these messages, and delivering them to consumers (which are the sinks or end-users of the data). The broker listens to port 9092 for internal communication within the cluster and port 9094 for external communication, allowing clients outside the Docker network to connect to the Kafka broker.
    • In the volumes part, we map the local directory kafka to the docker container directory /bitnami/kafka to ensure data persistence and a possible inspection of Kafka’s data from the host system.
    • We set up the service kafka-ui that uses the Docker image provectuslabs/kafka-ui:latest. This provides a user interface to interact with the Kafka cluster. This is especially useful for monitoring and managing Kafka topics and messages.
    • To ensure communication between Kafka and Airflow, which will be run as an external service, we will use an external network, airflow-kafka.

    Before running the kafka service, let’s create the airflow-kafka network using the following command:

    docker network create airflow-kafka

    Now everything is set to finally start our Kafka service:

    docker-compose up 

    After the services start, visit the kafka-ui at http://localhost:8800/. Normally you should get something like this:

    Overview of the Kafka UI. Image by the author.

    Next we will create our topic that will contain the API messages. Click on Topics on the left and then Add a topic at the top left. Our topic will be called rappel_conso and since we have only one broker we set the replication factor to 1. We will also set the partitions number to 1 since we will have only one consumer thread at a time so we won’t need any parallelism. Finally, we can set the time to retain data to a small number like one hour since we will run the spark job right after the kafka streaming task, so we won’t need to retain the data for a long time in the kafka topic.
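
    If you prefer to create the topic from code rather than through the UI, a hedged alternative is kafka-python’s admin client, run from the host against the external listener. The retention value mirrors the one-hour setting mentioned above; the topic name and listener address are taken from this setup.

    from kafka.admin import KafkaAdminClient, NewTopic

    admin = KafkaAdminClient(bootstrap_servers="localhost:9094")
    admin.create_topics(
        [
            NewTopic(
                name="rappel_conso",
                num_partitions=1,
                replication_factor=1,
                topic_configs={"retention.ms": str(60 * 60 * 1000)},  # keep messages for one hour
            )
        ]
    )
    admin.close()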

    Postgres set-up

    Before setting up our Spark and Airflow configurations, let’s create the Postgres database that will persist our API data. I used the pgAdmin 4 tool for this task; however, any other Postgres development platform can do the job.

    To install Postgres and pgAdmin, visit this link https://www.postgresql.org/download/ and get the packages for your operating system. When installing Postgres, you need to set up a password, which we will need later to connect to the database from the Spark environment. You can also leave the port at 5432.

    If your installation has succeeded, you can start pgadmin and you should observe something like this window:

    Overview of pgAdmin interface. Image by the author.

    Since we have a lot of columns for the table we want to create, we chose to create the table and add its columns with a script using psycopg2, a PostgreSQL database adapter for Python.

    You can run the script with the command:

    python scripts/create_table.py

    Note that in the script I saved the Postgres password as an environment variable named POSTGRES_PASSWORD. So if you use another method to access the password, you need to modify the script accordingly.
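
    The full script is in the repo, but a minimal psycopg2 sketch of the idea looks like this. The database name and user below are assumptions about a default local install, and the real script builds the full column list from DB_FIELDS rather than the small subset shown here.

    import os

    import psycopg2

    # Assumed connection settings for a default local Postgres install.
    conn = psycopg2.connect(
        host="localhost",
        port=5432,
        dbname="postgres",
        user="postgres",
        password=os.environ["POSTGRES_PASSWORD"],
    )

    # Small subset of columns for illustration; the real script uses DB_FIELDS.
    columns = ["categorie_de_produit", "sous_categorie_de_produit", "motif_de_rappel"]
    columns_sql = ", ".join(f"{col} TEXT" for col in columns)

    with conn, conn.cursor() as cur:
        cur.execute(
            "CREATE TABLE IF NOT EXISTS rappel_conso_table "
            f"(reference_fiche TEXT PRIMARY KEY, {columns_sql})"
        )
    conn.close()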

    Spark Set-up

    Having set-up our Postgres database, let’s delve into the details of the spark job. The goal is to stream the data from the Kafka topic rappel_conso to the Postgres table rappel_conso_table.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import (
        StructType,
        StructField,
        StringType,
    )
    from pyspark.sql.functions import from_json, col
    from src.constants import POSTGRES_URL, POSTGRES_PROPERTIES, DB_FIELDS
    import logging


    logging.basicConfig(
        level=logging.INFO, format="%(asctime)s:%(funcName)s:%(levelname)s:%(message)s"
    )


    def create_spark_session() -> SparkSession:
        spark = (
            SparkSession.builder.appName("PostgreSQL Connection with PySpark")
            .config(
                "spark.jars.packages",
                "org.postgresql:postgresql:42.5.4,org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0",
            )
            .getOrCreate()
        )

        logging.info("Spark session created successfully")
        return spark


    def create_initial_dataframe(spark_session):
        """
        Reads the streaming data and creates the initial dataframe accordingly.
        """
        try:
            # Read the streaming data from the rappel_conso Kafka topic
            df = (
                spark_session.readStream.format("kafka")
                .option("kafka.bootstrap.servers", "kafka:9092")
                .option("subscribe", "rappel_conso")
                .option("startingOffsets", "earliest")
                .load()
            )
            logging.info("Initial dataframe created successfully")
        except Exception as e:
            logging.warning(f"Initial dataframe couldn't be created due to exception: {e}")
            raise

        return df


    def create_final_dataframe(df):
        """
        Modifies the initial dataframe, and creates the final dataframe.
        """
        schema = StructType(
            [StructField(field_name, StringType(), True) for field_name in DB_FIELDS]
        )
        df_out = (
            df.selectExpr("CAST(value AS STRING)")
            .select(from_json(col("value"), schema).alias("data"))
            .select("data.*")
        )
        return df_out


    def start_streaming(df_parsed, spark):
        """
        Starts the streaming to table spark_streaming.rappel_conso in postgres
        """
        # Read existing data from PostgreSQL
        existing_data_df = spark.read.jdbc(
            POSTGRES_URL, "rappel_conso", properties=POSTGRES_PROPERTIES
        )

        unique_column = "reference_fiche"

        logging.info("Start streaming ...")
        query = (
            df_parsed.writeStream.foreachBatch(
                lambda batch_df, _: (
                    batch_df.join(
                        existing_data_df,
                        batch_df[unique_column] == existing_data_df[unique_column],
                        "leftanti",
                    ).write.jdbc(
                        POSTGRES_URL, "rappel_conso", "append", properties=POSTGRES_PROPERTIES
                    )
                )
            )
            .trigger(once=True)
            .start()
        )

        return query.awaitTermination()


    def write_to_postgres():
        spark = create_spark_session()
        df = create_initial_dataframe(spark)
        df_final = create_final_dataframe(df)
        start_streaming(df_final, spark=spark)


    if __name__ == "__main__":
        write_to_postgres()

    Let’s break down the key highlights and functionalities of the spark job:

    1. First we create the Spark session:

    def create_spark_session() -> SparkSession:
        spark = (
            SparkSession.builder.appName("PostgreSQL Connection with PySpark")
            .config(
                "spark.jars.packages",
                "org.postgresql:postgresql:42.5.4,org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0",
            )
            .getOrCreate()
        )

        logging.info("Spark session created successfully")
        return spark

    2. The create_initial_dataframe function ingests streaming data from the Kafka topic using Spark’s structured streaming.

    def create_initial_dataframe(spark_session):
        """
        Reads the streaming data and creates the initial dataframe accordingly.
        """
        try:
            # Read the streaming data from the rappel_conso Kafka topic
            df = (
                spark_session.readStream.format("kafka")
                .option("kafka.bootstrap.servers", "kafka:9092")
                .option("subscribe", "rappel_conso")
                .option("startingOffsets", "earliest")
                .load()
            )
            logging.info("Initial dataframe created successfully")
        except Exception as e:
            logging.warning(f"Initial dataframe couldn't be created due to exception: {e}")
            raise

        return df

    3. Once the data is ingested, create_final_dataframe transforms it. It applies a schema (defined by the columns DB_FIELDS) to the incoming JSON data, ensuring that the data is structured and ready for further processing.

    def create_final_dataframe(df):
        """
        Modifies the initial dataframe, and creates the final dataframe.
        """
        schema = StructType(
            [StructField(field_name, StringType(), True) for field_name in DB_FIELDS]
        )
        df_out = (
            df.selectExpr("CAST(value AS STRING)")
            .select(from_json(col("value"), schema).alias("data"))
            .select("data.*")
        )
        return df_out

    4. The start_streaming function reads existing data from the database, compares it with the incoming stream, and appends new records.

    def start_streaming(df_parsed, spark):
        """
        Starts the streaming to table spark_streaming.rappel_conso in postgres
        """
        # Read existing data from PostgreSQL
        existing_data_df = spark.read.jdbc(
            POSTGRES_URL, "rappel_conso", properties=POSTGRES_PROPERTIES
        )

        unique_column = "reference_fiche"

        logging.info("Start streaming ...")
        query = (
            df_parsed.writeStream.foreachBatch(
                lambda batch_df, _: (
                    batch_df.join(
                        existing_data_df,
                        batch_df[unique_column] == existing_data_df[unique_column],
                        "leftanti",
                    ).write.jdbc(
                        POSTGRES_URL, "rappel_conso", "append", properties=POSTGRES_PROPERTIES
                    )
                )
            )
            .trigger(once=True)
            .start()
        )

        return query.awaitTermination()

    The complete code for the Spark job is in the file src/spark_pgsql/spark_streaming.py. We will use the Airflow DockerOperator to run this job, as explained in the upcoming section.

    Let’s go through the process of creating the Docker image we need to run our Spark job. Here’s the Dockerfile for reference:

    FROM bitnami/spark:latest


    WORKDIR /opt/bitnami/spark

    RUN pip install py4j


    COPY ./src/spark_pgsql/spark_streaming.py ./spark_streaming.py
    COPY ./src/constants.py ./src/constants.py

    ENV POSTGRES_DOCKER_USER=host.docker.internal
    ARG POSTGRES_PASSWORD
    ENV POSTGRES_PASSWORD=$POSTGRES_PASSWORD

    In this Dockerfile, we start with the bitnami/spark image as our base. It’s a ready-to-use Spark image. We then install py4j, a tool needed for Spark to work with Python.

    The environment variables POSTGRES_DOCKER_USER and POSTGRES_PASSWORD are set up for connecting to a PostgreSQL database. Since our database is on the host machine, we set POSTGRES_DOCKER_USER to host.docker.internal, Docker’s special hostname that lets the container reach services running on the host, in this case the PostgreSQL database. The password for PostgreSQL is passed as a build argument, so it’s not hard-coded into the image.
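
    For reference, here is a hedged sketch of how src/constants.py might turn those environment variables into the POSTGRES_URL and POSTGRES_PROPERTIES used by the Spark job. The database name, user and the truncated field list are assumptions; check the repo for the actual values.

    import os

    # Hostname of the machine running Postgres, as seen from inside the container.
    POSTGRES_HOST = os.getenv("POSTGRES_DOCKER_USER", "localhost")

    # Assumed database name and user; adjust to your local Postgres setup.
    POSTGRES_URL = f"jdbc:postgresql://{POSTGRES_HOST}:5432/postgres"
    POSTGRES_PROPERTIES = {
        "user": "postgres",
        "password": os.getenv("POSTGRES_PASSWORD", ""),
        "driver": "org.postgresql.Driver",
    }

    # Truncated for illustration; the real list contains all the cleaned columns.
    DB_FIELDS = [
        "reference_fiche",
        "categorie_de_produit",
        "sous_categorie_de_produit",
        "motif_de_rappel",
    ]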

    It’s important to note that this approach, especially passing the database password at build time, might not be secure for production environments. It could potentially expose sensitive information. In such cases, more secure methods like Docker BuildKit should be considered.

    Now, let’s build the Docker image for Spark:

    docker build -f spark/Dockerfile -t rappel-conso/spark:latest --build-arg POSTGRES_PASSWORD=$POSTGRES_PASSWORD  .

    This command will build the image rappel-conso/spark:latest. This image includes everything needed to run our Spark job and will be used by Airflow’s DockerOperator to execute the job. Remember to replace $POSTGRES_PASSWORD with your actual PostgreSQL password when running this command.

    Airflow

    As said earlier, Apache Airflow serves as the orchestration tool in the data pipeline. It is responsible for scheduling and managing the workflow of the tasks, ensuring they are executed in a specified order and under defined conditions. In our system, Airflow is used to automate the data flow from streaming with Kafka to processing with Spark.

    Airflow DAG

    Let’s take a look at the Directed Acyclic Graph (DAG) that will outline the sequence and dependencies of tasks, enabling Airflow to manage their execution.

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from airflow.providers.docker.operators.docker import DockerOperator

    # Import path assumed from the project layout; the streaming function lives in the Kafka client module.
    from src.kafka_client.kafka_stream_data import stream

    start_date = datetime.today() - timedelta(days=1)

    default_args = {
        "owner": "airflow",
        "start_date": start_date,
        "retries": 1,  # number of retries before failing the task
        "retry_delay": timedelta(seconds=5),
    }


    with DAG(
        dag_id="kafka_spark_dag",
        default_args=default_args,
        schedule_interval=timedelta(days=1),
        catchup=False,
    ) as dag:

        kafka_stream_task = PythonOperator(
            task_id="kafka_data_stream",
            python_callable=stream,
            dag=dag,
        )

        spark_stream_task = DockerOperator(
            task_id="pyspark_consumer",
            image="rappel-conso/spark:latest",
            api_version="auto",
            auto_remove=True,
            command="./bin/spark-submit --master local[*] --packages org.postgresql:postgresql:42.5.4,org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0 ./spark_streaming.py",
            docker_url="tcp://docker-proxy:2375",
            environment={"SPARK_LOCAL_HOSTNAME": "localhost"},
            network_mode="airflow-kafka",
            dag=dag,
        )

        kafka_stream_task >> spark_stream_task

    Here are the key elements of this configuration:

    • The tasks are set to execute daily.
    • The first task is the Kafka Stream Task. It is implemented using the PythonOperator to run the Kafka streaming function. This task streams data from the RappelConso API into a Kafka topic, initiating the data processing workflow.
    • The downstream task is the Spark Stream Task. It uses the DockerOperator for execution. It runs a Docker container with our custom Spark image, tasked with processing the data received from Kafka.
    • The tasks are arranged sequentially, where the Kafka streaming task precedes the Spark processing task. This order is crucial to ensure that data is first streamed and loaded into Kafka before being processed by Spark.

    About the DockerOperator

    Using the DockerOperator allows us to run Docker containers that correspond to our tasks. The main advantages of this approach are easier package management, better isolation and enhanced testability. We will demonstrate the use of this operator with the Spark streaming task.

    Here are some key details about the docker operator for the spark streaming task:

    • We will use the image rappel-conso/spark:latest specified in the Spark Set-up section.
    • The command will run the Spark submit command inside the container, specifying the master as local, including necessary packages for PostgreSQL and Kafka integration, and pointing to the spark_streaming.py script that contains the logic for the Spark job.
    • docker_url represents the URL of the host running the Docker daemon. The natural solution is to set it to unix://var/run/docker.sock and to mount /var/run/docker.sock in the Airflow Docker container. One problem we had with this approach is a permission error when using the socket file inside the Airflow container. A common workaround, changing permissions with chmod 777 /var/run/docker.sock, poses significant security risks. To circumvent this, we implemented a more secure solution using bobrik/socat as a docker-proxy. This proxy, defined in a Docker Compose service, listens on TCP port 2375 and forwards requests to the Docker socket:

      docker-proxy:
        image: bobrik/socat
        command: "TCP4-LISTEN:2375,fork,reuseaddr UNIX-CONNECT:/var/run/docker.sock"
        ports:
          - "2376:2375"
        volumes:
          - /var/run/docker.sock:/var/run/docker.sock
        networks:
          - airflow-kafka

    In the DockerOperator, we can access the host’s /var/run/docker.sock via the tcp://docker-proxy:2375 URL, as described here and here.

    • Finally, we set the network mode to airflow-kafka. This allows us to use the same network as the proxy and the container running Kafka. This is crucial, since the Spark job will consume the data from the Kafka topic, so we must ensure that both containers are able to communicate.

    After defining the logic of our DAG, let’s understand now the airflow services configuration in the docker-compose-airflow.yaml file.

    Airflow Configuration

    The compose file for airflow was adapted from the official apache airflow docker-compose file. You can have a look at the original file by visiting this link.

    As pointed out by this article, this proposed version of Airflow is highly resource-intensive, mainly because the core executor is set to CeleryExecutor, which is better suited to distributed and large-scale data processing tasks. Since we have a small workload, a single-node LocalExecutor is enough.

    Here is an overview of the changes we made on the docker-compose configuration of airflow:

    • We set the environment variable AIRFLOW__CORE__EXECUTOR to LocalExecutor.
    • We removed the services airflow-worker and flower because they only work for the Celery executor. We also removed the redis caching service since it works as a backend for celery. We also won’t use the airflow-triggerer so we remove it too.
    • We replaced the base image ${AIRFLOW_IMAGE_NAME:-apache/airflow:2.7.3} for the remaining services, mainly the scheduler and the webserver, with a custom image that we will build when running the docker-compose:

      version: '3.8'
      x-airflow-common:
        &airflow-common
        build:
          context: .
          dockerfile: ./airflow_resources/Dockerfile
        image: de-project/airflow:latest
    • We mounted the necessary volumes that are needed by Airflow. AIRFLOW_PROJ_DIR designates the Airflow project directory that we will define later. We also set the network to airflow-kafka to be able to communicate with the Kafka bootstrap servers:

      volumes:
        - ${AIRFLOW_PROJ_DIR:-.}/dags:/opt/airflow/dags
        - ${AIRFLOW_PROJ_DIR:-.}/logs:/opt/airflow/logs
        - ${AIRFLOW_PROJ_DIR:-.}/config:/opt/airflow/config
        - ./src:/opt/airflow/dags/src
        - ./data/last_processed.json:/opt/airflow/data/last_processed.json
      user: "${AIRFLOW_UID:-50000}:0"
      networks:
        - airflow-kafka

    Next, we need to create some environment variables that will be used by docker-compose:

    echo -e "AIRFLOW_UID=$(id -u)nAIRFLOW_PROJ_DIR="./airflow_resources"" > .env

    Where AIRFLOW_UID represents the User ID in Airflow containers and AIRFLOW_PROJ_DIR represents the airflow project directory.

    Now everything is set up to run your Airflow service. You can start it with this command:

     docker compose -f docker-compose-airflow.yaml up

    Then, to access the Airflow user interface, you can visit http://localhost:8080.

    Sign-in window on Airflow. Image by the author.

    By default, the username and password are both airflow. After signing in, you will see a list of DAGs that come with Airflow. Look for our project’s DAG, kafka_spark_dag, and click on it.

    Overview of the task window in airflow. Image by the author.

    You can start the task by clicking on the button next to DAG: kafka_spark_dag.

    Next, you can check the status of your tasks in the Graph tab. A task is done when it turns green. So, when everything is finished, it should look something like this:

    Image by the author.

    To verify that the rappel_conso_table is filled with data, use the following SQL query in the pgAdmin Query Tool:

    SELECT count(*) FROM rappel_conso_table

    When I ran this in January 2024, the query returned a total of 10022 rows. Your results should be around this number as well.

    Conclusion

    This article has successfully demonstrated the steps to build a basic yet functional data engineering pipeline using Kafka, Airflow, Spark, PostgreSQL, and Docker. Aimed primarily at beginners and those new to the field of data engineering, it provides a hands-on approach to understanding and implementing key concepts in data streaming, processing, and storage.

    Throughout this guide, we’ve covered each component of the pipeline in detail, from setting up Kafka for data streaming to using Airflow for task orchestration, and from processing data with Spark to storing it in PostgreSQL. The use of Docker throughout the project simplifies the setup and ensures consistency across different environments.

    It’s important to note that while this setup is ideal for learning and small-scale projects, scaling it for production use would require additional considerations, especially in terms of security and performance optimization. Future enhancements could include integrating more advanced data processing techniques, exploring real-time analytics, or even expanding the pipeline to incorporate more complex data sources.

    In essence, this project serves as a practical starting point for those looking to get their hands dirty with data engineering. It lays the groundwork for understanding the basics, providing a solid foundation for further exploration in the field.

    In the second part, we’ll explore how to effectively use the data stored in our PostgreSQL database. We’ll introduce agents powered by Large Language Models (LLMs) and a variety of tools that enable us to interact with the database using natural language queries. So, stay tuned!

    End-to-End Data Engineering System on Real Data with Kafka, Spark, Airflow, Postgres, and Docker was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

  • Cowboy riders across Europe can now call an ebike mechanic to their doorstep

    Siôn Geschwindt


    Riders in numerous locations throughout Europe can now have their Cowboy ebikes repaired without ever leaving their couch. Through the Cowboy app, you can request a mechanic to come to your home and carry out services like repairs, general maintenance, and the installation of a child seat. The call-out fee starts at €69. “It’s imperative that customers are confident that when they invest in a Cowboy ebike, they can service it at a time and place that suits them best,” said Cowboy founder and CTO Tanguy Goretti. “We realise that not everyone has the time, tools or technical knowledge to…

    This story continues at The Next Web

  • What’s the value of Apple’s Vision Pro spatial computing?

    The Apple Vision Pro has garnered lots of initial attention. The company’s brief in-store demos and its selection of immersive clips of content on Apple TV+ are arresting and spectacular. But can this new device launch a really useful new platform for augmented reality apps, and does the world even need Apple’s new “spatial computing?”

    Apple Watch Ultra with a khaki band next to Apple Vision Pro on a desk.
    Apple Vision Pro iPad mini — both products that had challenging software launches

    Apple has a pretty solid track record of hitting it out of the park with its bold new product introductions. Both iPhone and Apple Watch appeared with price tags significantly higher than many comparable products that already existed.

    Despite much hand-wringing by pundits who thought Apple had priced things far too high for mainstream users to consider, both dramatically outsold their peers. Premium priced editions have expanded peak prices upwards since their introductions.

    Continue Reading on AppleInsider | Discuss on our Forums

  • The best cheap headphones of 2024: Expert tested

    ZDNET went hands-on to test the best headphones you can find for under $200, so you can stay in the groove and keep some extra money in your pocket.
